Filtering & Processing Flags¶
Complete documentation for filtering and result processing flags
Overview¶
Filtering and processing flags allow you to refine, organize, and enhance search results after retrieval.
Total Filter Flags: 7
Categories¶
- Date Filtering - Filter by publication date range
- Deduplication - Remove duplicate articles (5 strategies)
- Sorting - Order results (5 methods)
- Enrichment - Enhance metadata from external APIs
Date Filtering¶
-d, --date¶
Description: Filter results by publication date range.
Syntax:
lixplore [SOURCE] -q "QUERY" -d FROM_DATE TO_DATE [OPTIONS]
lixplore [SOURCE] -q "QUERY" --date FROM_DATE TO_DATE [OPTIONS]
Type: Two date values (FROM TO)
Format: YYYY-MM-DD
Default: No date filter
Examples¶
Example 1: Recent Publications (Last 5 Years)
Example 2: Historical Research
Example 3: Current Year Only
Example 4: Decade Analysis
Example 5: Pre-COVID Research
Tips¶
- Always use YYYY-MM-DD format
- Combine with --sort newest for latest-first order
- Use the statistics flag (--stat) to analyze publication trends
- Works best with PubMed and Crossref
- Date filtering happens after API retrieval (client-side)
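Because the filter runs client-side, it can be pictured as a pass over records that have already been retrieved. The Python sketch below is illustrative only: the record layout (a dict with a date string), the function names, the inclusive range, and the choice to drop undated records are all assumptions, not Lixplore's actual internals.

```python
from datetime import date, datetime


def parse_ymd(text: str) -> date:
    """Parse a YYYY-MM-DD string; raises ValueError for anything else."""
    return datetime.strptime(text, "%Y-%m-%d").date()


def filter_by_date(articles, from_date: str, to_date: str):
    """Keep articles whose 'date' lies in [from_date, to_date] (inclusive).

    Records with a missing or unparseable date are dropped here; a real
    tool might keep them instead (assumption).
    """
    start, end = parse_ymd(from_date), parse_ymd(to_date)
    if start > end:
        raise ValueError("FROM_DATE must not be later than TO_DATE")
    kept = []
    for article in articles:
        try:
            published = parse_ymd(article.get("date") or "")
        except ValueError:
            continue  # no usable date metadata
        if start <= published <= end:
            kept.append(article)
    return kept


# Hypothetical records for illustration.
articles = [
    {"title": "A", "date": "2019-06-30"},
    {"title": "B", "date": "2021-03-15"},
    {"title": "C", "date": None},
]
recent = filter_by_date(articles, "2020-01-01", "2024-12-31")
print([a["title"] for a in recent])
```

A year-only value such as 2020 fails in parse_ymd, which mirrors the advice to always pass full YYYY-MM-DD dates.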
Warnings¶
- Not all sources support date filtering equally
- Some older articles may lack proper date metadata
- Preprints (arXiv) may report the submission date rather than the publication date
- DOAJ may have indexing delays
Related Flags¶
- --sort - Sort by date
- --stat - Publication trend analysis
- -m, --max_results - Limit results
Deduplication¶
-D, --deduplicate¶
Description: Remove duplicate articles from multi-source searches using advanced matching algorithms.
Syntax:
lixplore [SOURCES] -q "QUERY" -D [STRATEGY] [OPTIONS]
lixplore [SOURCES] -q "QUERY" --deduplicate [STRATEGY] [OPTIONS]
Type: Optional string value
Strategies:
- auto (default) - Multi-level matching (DOI + title + author)
- doi_only - Match only by DOI
- title_only - Match only by title similarity
- strict - High similarity threshold (0.95)
- loose - Low similarity threshold (0.75)
Default: auto (when used without value)
Examples¶
Example 1: Auto Deduplication (Recommended)
Search all sources and remove duplicates automatically.
Example 2: Strict Deduplication
Very conservative matching (high threshold).
Example 3: Loose Deduplication
Aggressive duplicate removal (catches more variants).
Example 4: DOI-Only Matching
Match only articles that share the same DOI.
Example 5: Title-Only Matching
Match by title similarity only.
Advanced Deduplication Options¶
--dedup-threshold¶
Description: Set title similarity threshold for deduplication.
Syntax:
Type: Float (0.0-1.0)
Default: 0.85
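To get a feel for what the threshold does, the sketch below scores two titles with SequenceMatcher from Python's difflib and compares the score against the three thresholds mentioned in this page (0.75, 0.85, 0.95). That Lixplore uses difflib specifically, and the exact normalization shown, are assumptions; the function names are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase and collapse runs of whitespace."""
    return " ".join(title.lower().split())


def title_similarity(a: str, b: str) -> float:
    """Ratio in [0.0, 1.0]; 1.0 means identical after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_title_duplicate(a: str, b: str, threshold: float = 0.85) -> bool:
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return title_similarity(a, b) >= threshold


# Made-up titles: a and b differ only in case and spacing, c is unrelated.
a = "Deep Learning for Protein Folding"
b = "deep  learning for protein folding"
c = "A Survey of Graph Databases"
for thr in (0.75, 0.85, 0.95):
    print(thr, is_title_duplicate(a, b, thr), is_title_duplicate(a, c, thr))
```

Raising the threshold makes fewer pairs qualify as duplicates, so more near-identical records survive; lowering it does the opposite.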
Example:
Higher threshold = stricter matching (fewer false duplicates).
--dedup-keep¶
Description: Choose which duplicate to keep when matches found.
Syntax:
Type: String choice
Options:
- first - Keep first occurrence (chronological order)
- most_complete - Keep entry with most metadata (default)
- prefer_doi - Prefer entries with DOI
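The three options can be read as three ways of picking one record out of a group of duplicates. The following Python sketch is a guess at that behavior, not the tool's code: the completeness measure (count of non-empty fields), the tie-breaking, and the fallback when no record has a DOI are assumptions, and the sample DOI is a placeholder.

```python
def completeness(article: dict) -> int:
    """Count fields that carry a non-empty value."""
    return sum(1 for value in article.values() if value not in (None, "", [], {}))


def pick_keeper(duplicates: list, keep: str = "most_complete") -> dict:
    """Choose one record from a group of duplicates (given in encounter order)."""
    if not duplicates:
        raise ValueError("duplicates must not be empty")
    if keep == "first":
        return duplicates[0]
    if keep == "most_complete":
        # max() returns the earliest record when several tie.
        return max(duplicates, key=completeness)
    if keep == "prefer_doi":
        with_doi = [a for a in duplicates if a.get("doi")]
        # Assumption: fall back to the first record if none has a DOI.
        return with_doi[0] if with_doi else duplicates[0]
    raise ValueError(f"unknown keep option: {keep!r}")


group = [
    {"title": "T", "doi": None, "abstract": "text", "journal": "J", "year": 2020},
    {"title": "T", "doi": "10.1000/x", "abstract": None, "journal": None, "year": 2020},
]
print(pick_keeper(group, "most_complete")["abstract"])
print(pick_keeper(group, "prefer_doi")["doi"])
```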
Example:
Keep DOI versions when duplicates found.
--dedup-merge¶
Description: Merge metadata from duplicates instead of discarding.
Syntax:
Type: Boolean flag
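Merging can be pictured as folding a group of duplicates into a single record, filling each empty field from whichever duplicate has a value. This Python sketch assumes a simple "first non-empty value wins" rule; how Lixplore actually resolves conflicting values is not stated here, and the function name and sample data are hypothetical.

```python
EMPTY = (None, "", [], {})


def merge_duplicates(duplicates: list) -> dict:
    """Fold duplicate records into one; the first non-empty value per field wins."""
    merged = {}
    for article in duplicates:
        for field, value in article.items():
            # Fill the field if it is new or still empty.
            if field not in merged or merged[field] in EMPTY:
                merged[field] = value
    return merged


group = [
    {"title": "T", "doi": None, "abstract": "text", "journal": "J", "year": 2020},
    {"title": "T", "doi": "10.1000/x", "abstract": None, "journal": None, "year": 2020},
]
print(merge_duplicates(group))
```

Neither input record is complete on its own, but the merged record carries both the abstract and the DOI.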
Example:
Combine best metadata from all duplicate entries.
Deduplication Algorithms¶
1. DOI Matching (Most Reliable):
- Exact string match of DOI
- Normalized (lowercase, whitespace removed)
- If DOIs match → Duplicate confirmed
2. Title Similarity:
- SequenceMatcher algorithm
- Normalized strings (lowercase, extra spaces removed)
- Threshold comparison (default 0.85)
- Score ≥ threshold → Likely duplicate
3. Author Matching:
- Normalize author names (handle various formats)
- Count common authors
- Min 2 common authors → Confirmed
- Used as secondary validation for title matches
4. Combined (Auto Strategy):
IF both_have_doi:
    return doi_match()
ELSE IF titles_similar(threshold):
    IF have_author_data:
        return authors_match()
    ELSE:
        return True
ELSE IF many_authors_match AND titles_somewhat_similar(0.7):
    return True
ELSE:
    return False
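The decision tree above can be sketched in Python as follows. This is one reading of the pseudocode, not the tool's implementation: the record fields (doi, title, authors), the helper names, and the use of the same two-author minimum for "many_authors_match" are assumptions. Author normalization is reduced to lowercasing, so name variants such as initials are not handled, and single-author papers can never pass the author check under this literal rule.

```python
from difflib import SequenceMatcher


def norm_doi(doi):
    """Lowercase a DOI and strip all whitespace; None if absent."""
    return "".join(doi.lower().split()) if doi else None


def norm_text(text):
    return " ".join((text or "").lower().split())


def title_ratio(a, b):
    return SequenceMatcher(
        None, norm_text(a.get("title")), norm_text(b.get("title"))
    ).ratio()


def common_authors(a, b):
    """Number of author names shared by both records."""
    names_a = {norm_text(n) for n in a.get("authors") or []}
    names_b = {norm_text(n) for n in b.get("authors") or []}
    return len(names_a & names_b)


def is_duplicate(a, b, threshold=0.85, min_authors=2):
    doi_a, doi_b = norm_doi(a.get("doi")), norm_doi(b.get("doi"))
    if doi_a and doi_b:
        return doi_a == doi_b  # both have a DOI: it decides alone
    ratio = title_ratio(a, b)
    have_authors = bool(a.get("authors")) and bool(b.get("authors"))
    if ratio >= threshold:
        if have_authors:
            return common_authors(a, b) >= min_authors
        return True
    # Fallback: enough shared authors plus a somewhat similar title.
    return have_authors and common_authors(a, b) >= min_authors and ratio >= 0.7


# Made-up records; the DOIs are placeholders.
x = {"title": "CRISPR Gene Editing in Plants", "doi": "10.1000/ABC ",
     "authors": ["Jane Doe", "Li Wei"]}
y = {"title": "CRISPR gene editing in plants", "doi": " 10.1000/abc",
     "authors": ["Jane Doe", "Li Wei"]}
z = {"title": "CRISPR gene editing in plants.", "doi": None,
     "authors": ["Jane Doe", "Li Wei"]}
w = {"title": "Quantum error correction codes", "doi": "10.1000/xyz",
     "authors": ["A. N. Other"]}
print(is_duplicate(x, y), is_duplicate(x, z), is_duplicate(x, w))
```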
Tips¶
- Always use -D when searching multiple sources
- auto strategy works well for most cases
- Use strict for important bibliographies
- Use loose when expecting many variants
- --dedup-merge provides most complete metadata
- Check deduplication stats in output
Warnings¶
- Deduplication is not perfect (may miss variants)
- Very different titles for same article may not match
- Author name variations can cause misses
- Processing time increases with result count
- Some false positives/negatives possible
Related Flags¶
- -A, --all - Search all sources (needs dedup)
- -s, --sources - Multi-source search
- --enrich - Fill missing metadata
- --sort - Organize results
Sorting¶
--sort¶
Description: Sort results by various criteria.
Syntax:
Type: String choice
Options:
- relevant - Default API order (most relevant first)
- newest - Latest publications first (desc by year)
- oldest - Earliest publications first (asc by year)
- journal - Alphabetical by journal name
- author - Alphabetical by first author last name
Default: relevant (original API order)
Examples¶
Example 1: Latest Research First
Get most recent COVID-19 research.
Example 2: Historical Perspective
Chronological order from earliest publications.
Example 3: Journal Alphabetical
Organize by journal name (A-Z).
Example 4: Author Alphabetical
Organize by first author's last name.
Example 5: Sort + Select + Export
Get latest 30 ML papers across all sources.
Example 6: Sort for Statistics
Analyze recent publication trends.
Example 7: Historical Analysis
Get earliest 20 quantum physics papers.
Example 8: Journal-Based Export
Export cardiology papers organized by journal.
Example 9: Author Bibliography
Einstein's papers in chronological order.
Example 10: Multi-Sort Workflow
# Get latest papers
lixplore -P -q "CRISPR" -m 100 --sort newest -S first:20 -X xlsx -o latest_crispr.xlsx
# Get earliest papers
lixplore -P -q "CRISPR" -m 100 --sort oldest -S first:20 -X xlsx -o early_crispr.xlsx
Sorting Behavior¶
By Year (newest/oldest):
- Uses publication year field
- Missing years sorted to end
- Same year → maintains original order
By Journal:
- Alphabetical, case-insensitive
- Missing journal → sorted to end
- Names are compared as written (abbreviations and spelling variants are not normalized)
By Author:
- Uses first author's last name
- Extracts last word as surname
- Missing authors → sorted to end
- Case-insensitive
Relevant (Default):
- Preserves API ranking
- Usually by relevance score
- Source-dependent algorithm
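The rules above map naturally onto a stable sort with a two-part key, where the first part pushes records with missing data to the end. The Python sketch below is a minimal illustration under assumed field names (year as an integer, journal, authors as a list of name strings); it is not taken from Lixplore's source.

```python
def surname(article):
    """Last word of the first author's name, lowercased; None if no authors."""
    authors = article.get("authors") or []
    words = authors[0].split() if authors else []
    return words[-1].lower() if words else None


def sort_articles(articles, method="relevant"):
    """Return a new list; sorted() is stable, so ties keep the API order."""
    if method == "relevant":
        return list(articles)
    if method in ("newest", "oldest"):
        sign = -1 if method == "newest" else 1
        # Records without a year get (True, 0) and therefore sort last.
        return sorted(
            articles,
            key=lambda a: (a.get("year") is None, sign * (a.get("year") or 0)),
        )
    if method == "journal":
        return sorted(
            articles,
            key=lambda a: (not a.get("journal"), (a.get("journal") or "").lower()),
        )
    if method == "author":
        return sorted(articles, key=lambda a: (surname(a) is None, surname(a) or ""))
    raise ValueError(f"unknown sort method: {method!r}")


# Hypothetical records for illustration.
papers = [
    {"title": "A", "year": 2018, "journal": "Nature", "authors": ["Maria Lopez"]},
    {"title": "B", "year": None, "journal": None, "authors": []},
    {"title": "C", "year": 2023, "journal": "cell", "authors": ["Ken Adams"]},
    {"title": "D", "year": 2018, "journal": "BMJ", "authors": ["Ann Baker"]},
]
for method in ("newest", "oldest", "journal", "author"):
    print(method, "".join(p["title"] for p in sort_articles(papers, method)))
```

Taking the last word as the surname is exactly why names written as "Lopez, Maria" or with particles can sort unexpectedly.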
Tips¶
- Use newest for current research trends
- Use oldest for historical studies
- Combine with -S first:N to get top N after sorting
- Journal sort useful for journal-specific analysis
- Author sort good for alphabetical bibliographies
- Sorting happens client-side (after retrieval)
Warnings¶
- Some articles may lack year metadata
- Journal names not normalized (variants exist)
- Author extraction may fail for unusual formats
- Sorting large result sets can be slow
- Original relevance lost with custom sort
Related Flags¶
- -S, --select - Select subset after sorting
- --stat - Analyze sorted trends
- -d, --date - Filter before sorting
- -D, --deduplicate - Dedupe before sorting
Metadata Enrichment¶
--enrich¶
Description: Enrich article metadata by querying external APIs for missing or additional information.
Syntax:
Type: Optional list of API names
APIs:
- crossref - Crossref API enrichment
- pubmed - PubMed API enrichment
- arxiv - arXiv API enrichment
- all - All available APIs (default if no args)
Default: all (when used without value)
Examples¶
Example 1: Enrich from All APIs
Enrich metadata from all available sources.
Example 2: Crossref Only
Add DOI and citation data from Crossref.
Example 3: PubMed Only
Add PubMed IDs and abstracts.
Example 4: Multiple APIs
Enrich from Crossref and PubMed.
Example 5: Enrich Before Export
Complete metadata before exporting citations.
Example 6: Enrich + Statistics
Enrich data for better statistics accuracy.
Example 7: arXiv Enrichment
Add arXiv links and preprint data.
What Gets Enriched¶
Crossref Enrichment:
- DOI validation and links
- Journal information (name, volume, issue, pages)
- Publication dates
- Publisher information
- Citation counts
- ISSN/ISBN
- License information
PubMed Enrichment:
- PubMed ID (PMID)
- PubMed Central ID (PMCID)
- Abstracts (if missing)
- MeSH terms
- Article types
- Grant information
- Author affiliations
arXiv Enrichment:
- arXiv ID
- Preprint links
- Categories
- PDF links
- Submission dates
- Updated versions
Enrichment Process¶
1. DOI Discovery:
2. Metadata Completion:
For each article:
    For each enrichment API:
        IF has_identifier(doi/pmid/arxiv_id):
            Query API
            Merge results (keep best data)
3. Field Priority:
IF multiple_sources_have_field:
    Priority: PubMed > Crossref > arXiv > Original
    (Most reliable source wins)
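The field-priority rule can be sketched as a merge that walks the sources from most to least trusted and keeps the first non-empty value found for each field. The data layout (one metadata dict per source name) and the function name are assumptions made for illustration, and the sample values are invented.

```python
PRIORITY = ["pubmed", "crossref", "arxiv", "original"]  # most trusted first


def merge_by_priority(records: dict) -> dict:
    """Merge per-source metadata field by field.

    `records` maps a source name to the metadata dict it returned. For each
    field, the highest-priority source with a non-empty value wins.
    """
    merged = {}
    for source in PRIORITY:
        for field, value in (records.get(source) or {}).items():
            if value not in (None, "", [], {}) and field not in merged:
                merged[field] = value
    return merged


records = {
    "original": {"title": "Sample title", "journal": "J. Ex.", "abstract": None},
    "crossref": {"journal": "Journal of Examples", "doi": "10.1000/x"},
    "pubmed": {"abstract": "Text.", "pmid": "12345"},
}
print(merge_by_priority(records))
```

Here the Crossref journal name replaces the abbreviated original, while the title survives because no higher-priority source supplied one.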
Tips¶
- Use enrichment for incomplete search results
- Crossref best for DOI and citation data
- PubMed best for biomedical metadata
- Increases export quality (especially BibTeX)
- Adds 2-5 seconds per 10 articles
- Progress shown during enrichment
Warnings¶
- Slower search (additional API calls)
- API rate limits may apply
- Not all articles can be enriched
- Enrichment requires an internet connection to reach the external APIs
- May not find matches for very obscure articles
- Crossref rate limit: 50 requests/second
Related Flags¶
- -X, --export - Export enriched data
- -D, --deduplicate - Use before enrichment
- --stat - Statistics on enriched data
Best Practices¶
1. Always Deduplicate Multi-Source Searches¶
# CORRECT
lixplore -A -q "query" -m 50 -D
# INCORRECT (will have many duplicates)
lixplore -A -q "query" -m 50
2. Combine Date + Sort + Select¶
# Get latest 20 papers from last 5 years
lixplore -P -q "CRISPR" -d 2020-01-01 2024-12-31 -m 100 --sort newest -S first:20 -X xlsx
3. Enrich Before Export¶
4. Use Appropriate Dedup Strategy¶
# For final bibliography (strict)
lixplore -A -q "query" -m 100 -D strict --dedup-keep most_complete
# For discovery (loose)
lixplore -A -q "query" -m 200 -D loose
5. Filter → Deduplicate → Sort → Select → Export¶
# Stages: 1. filter by date (-d)  2. get results (-m)  3. deduplicate (-D)
#         4. sort (--sort)  5. select top 50 (-S)  6. export (-X)
lixplore -A -q "machine learning" \
  -d 2020-01-01 2024-12-31 \
  -m 200 \
  -D \
  --sort newest \
  -S first:50 \
  -X xlsx
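The order of the stages matters, most visibly for sort and select. The short Python sketch below uses hypothetical stand-ins for the two stages (it does not call Lixplore) to show that selecting before sorting returns a different set of papers than sorting before selecting.

```python
def select_first(articles, n):
    """Stand-in for a first:N selection: keep the first n records."""
    return articles[:n]


def sort_newest(articles):
    """Stand-in for sorting newest-first by year."""
    return sorted(articles, key=lambda a: -a["year"])


# Made-up records in the order a source might return them.
papers = [{"id": 1, "year": 2019}, {"id": 2, "year": 2024}, {"id": 3, "year": 2022}]

sorted_then_selected = select_first(sort_newest(papers), 2)
selected_then_sorted = sort_newest(select_first(papers, 2))

print([p["id"] for p in sorted_then_selected])  # the two newest overall
print([p["id"] for p in selected_then_sorted])  # the first two, reordered
```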
Troubleshooting¶
Problem: Too many duplicates remain¶
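Solution: Match more aggressively with -D loose or a lower --dedup-threshold, so that more title variants count as duplicates
Problem: Missing some article variants¶
Solution: If distinct versions of an article are being removed as duplicates, match more conservatively with -D strict or a higher --dedup-threshold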
Problem: Date filter not working¶
Solution: Check date format and source compatibility
# Correct format
lixplore -P -q "query" -d 2020-01-01 2024-12-31 -m 50
# Wrong format
lixplore -P -q "query" -d 2020 2024 -m 50 # Won't work
Problem: Enrichment too slow¶
Solution: Limit APIs or skip enrichment
# Just Crossref (faster)
lixplore -P -q "query" -m 50 --enrich crossref
# Skip enrichment for speed
lixplore -P -q "query" -m 50
Related Documentation¶
- Source Flags - Multi-source searching
- Search Flags - Query construction
- Display Flags - View results
- Export Flags - Export processed results
Last Updated: 2024-12-28
Lixplore Version: 2.0+