Filtering and Deduplication Guide
Master guide for filtering, deduplicating, and processing search results
Date Filtering
Basic Usage
Common Date Ranges
Last Year:
Last 5 Years:
Historical Period:
Current Month:
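The start and end dates for ranges like these can be computed rather than typed by hand. The sketch below is illustrative only: the helper names are hypothetical, and it assumes "last N years" means a rolling window ending today rather than whole calendar years. It prints the dates in the `-d START END` form used in the workflows later in this guide.

```python
from datetime import date
import calendar

def last_n_years(n, today):
    """Rolling window: the same day n years ago through today (assumed meaning)."""
    try:
        start = today.replace(year=today.year - n)
    except ValueError:
        # Feb 29 has no counterpart in a non-leap year; fall back to Feb 28.
        start = today.replace(year=today.year - n, day=28)
    return start.isoformat(), today.isoformat()

def current_month(today):
    """First through last day of the month containing `today`."""
    last_day = calendar.monthrange(today.year, today.month)[1]
    return today.replace(day=1).isoformat(), today.replace(day=last_day).isoformat()

today = date(2024, 12, 28)
start, end = last_n_years(5, today)
print(f"-d {start} {end}")  # -d 2019-12-28 2024-12-28
```

`date.isoformat()` always yields YYYY-MM-DD, which matches the format recommended under Best Practices.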
Best Practices
- Always use YYYY-MM-DD format
- Combine with sorting for better organization
- Use statistics to analyze trends
- Remember: date filtering is client-side. Results are fetched first and filtered afterwards, so the filtered set can be smaller than the number of results fetched
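To make the "fetch, then filter" point concrete, here is a minimal sketch of client-side date filtering. It is not lixplore's implementation: the record layout (a dict with a `date` string) and the choice to drop undated records are assumptions made for illustration.

```python
from datetime import date

def filter_by_date(records, start, end):
    """Keep records whose date falls inside [start, end], both inclusive."""
    lo, hi = date.fromisoformat(start), date.fromisoformat(end)
    kept = []
    for rec in records:
        raw = rec.get("date")
        if not raw:
            continue  # assumption: records without a date are dropped
        if lo <= date.fromisoformat(raw) <= hi:
            kept.append(rec)
    return kept

# Four records were fetched, but only two survive the filter.
fetched = [
    {"title": "A", "date": "2019-06-01"},
    {"title": "B", "date": "2021-03-15"},
    {"title": "C", "date": None},
    {"title": "D", "date": "2024-12-31"},
]
recent = filter_by_date(fetched, "2020-01-01", "2024-12-31")
print([r["title"] for r in recent])  # ['B', 'D']
```

Because filtering happens after fetching, a narrow date range applied to a small fetch can leave very few results.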
Deduplication Strategies
Why Deduplicate?
Multi-source searches often return the same article multiple times:
- Same article indexed in PubMed and Crossref
- Preprint in arXiv, published version in journal
- Different metadata completeness
Strategy Overview
| Strategy | Threshold | Use Case |
|---|---|---|
| auto | 0.85 | Balanced (recommended) |
| strict | 0.95 | Final bibliography |
| loose | 0.75 | Discovery, variant detection |
| doi_only | Exact | DOI-based only |
| title_only | 0.85 | When DOI unavailable |
Auto Strategy (Recommended)
How it works:
1. Match by DOI (if both have DOI)
2. Match by title similarity (≥0.85)
3. Verify with author overlap
4. Keep most complete metadata
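The four steps above can be sketched as follows. This is a sketch under stated assumptions, not lixplore's actual code: `difflib.SequenceMatcher` stands in for whatever similarity measure the tool uses, records are plain dicts with hypothetical `title`, `authors`, and `doi` keys, and two records with different DOIs are assumed to fall through to the title check (so a preprint and its published version can still match).

```python
from difflib import SequenceMatcher

def title_similarity(a, b):
    """Similarity ratio in [0, 1] between two case-normalized titles."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def author_overlap(a, b):
    """True if the records share at least one author name (case-insensitive)."""
    names_a = {n.lower() for n in a.get("authors") or []}
    names_b = {n.lower() for n in b.get("authors") or []}
    return bool(names_a & names_b)

def is_duplicate(a, b, threshold=0.85):
    doi_a, doi_b = a.get("doi"), b.get("doi")
    # Step 1: an exact DOI match settles it.
    if doi_a and doi_b and doi_a.lower() == doi_b.lower():
        return True
    # Steps 2 and 3: similar titles count only when the authors overlap.
    if title_similarity(a["title"], b["title"]) >= threshold:
        return author_overlap(a, b)
    return False

def completeness(rec):
    """Number of non-empty fields, used as a rough measure of metadata quality."""
    return sum(1 for v in rec.values() if v)

def deduplicate(records, threshold=0.85):
    unique = []
    for rec in records:
        for i, kept in enumerate(unique):
            if is_duplicate(rec, kept, threshold):
                # Step 4: keep whichever copy carries more metadata.
                if completeness(rec) > completeness(kept):
                    unique[i] = rec
                break
        else:
            unique.append(rec)
    return unique

full = {"title": "Deep learning for cancer detection", "authors": ["Smith", "Lee"],
        "doi": "10.1000/xyz", "abstract": "..."}
sparse = {"title": "Deep Learning for Cancer Detection.", "authors": ["smith"], "doi": None}
other = {"title": "Graph methods in ecology", "authors": ["Ng"], "doi": "10.1000/abc"}
result = deduplicate([sparse, full, other])
print(len(result))  # 2
```

Raising the threshold toward 0.95 or lowering it toward 0.75 in this sketch mirrors the precision/recall trade-off between the strict and loose strategies.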
Strict Strategy (High Precision)
Best for:
- Final bibliographies
- Publication-ready references
- When false duplicates are costly
Loose Strategy (High Recall)
Best for:
- Initial discovery
- Catching all variants
- Preliminary screening
DOI-Only Strategy
Best for:
- When metadata is unreliable
- High-quality journal articles
- Crossref + PubMed combinations
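Exact DOI matching is only reliable if DOIs are normalized first, since the same DOI can appear with different casing or with a resolver prefix. The following is a minimal sketch of that idea; the function names and the decision to keep records without a DOI are assumptions, not documented lixplore behavior.

```python
def normalize_doi(doi):
    """Lowercase a DOI and strip common resolver or scheme prefixes."""
    d = doi.strip().lower()
    prefixes = ("https://doi.org/", "http://doi.org/",
                "https://dx.doi.org/", "http://dx.doi.org/", "doi:")
    for prefix in prefixes:
        if d.startswith(prefix):
            d = d[len(prefix):].strip()
            break
    return d

def dedup_by_doi(records):
    """Drop records whose normalized DOI was already seen."""
    seen = set()
    unique = []
    for rec in records:
        doi = rec.get("doi")
        if not doi:
            unique.append(rec)  # assumption: records without a DOI are never merged
            continue
        key = normalize_doi(doi)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

records = [
    {"title": "Paper", "doi": "10.1000/ABC"},
    {"title": "Paper (Crossref)", "doi": "https://doi.org/10.1000/abc"},
    {"title": "No DOI 1", "doi": None},
    {"title": "No DOI 2", "doi": None},
]
print(len(dedup_by_doi(records)))  # 3
```

Under these assumptions, records lacking a DOI pass through untouched, which is why this strategy suits sources with consistently good DOI coverage.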
Advanced Options
Custom Threshold:
Keep Preference:
# Keep entry with most metadata
lixplore -A -q "query" -m 100 -D --dedup-keep most_complete
# Prefer entries with DOI
lixplore -A -q "query" -m 100 -D --dedup-keep prefer_doi
# Keep first found
lixplore -A -q "query" -m 100 -D --dedup-keep first
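One way to read the three keep preferences above is as different rules for picking a single record out of a group of duplicates. The sketch below is an interpretation, not the tool's code: `choose_keeper` is a hypothetical helper, and "most complete" is assumed to mean "most non-empty fields".

```python
def choose_keeper(group, keep="most_complete"):
    """Pick one record from a list of duplicates, given in the order found."""
    if keep == "first":
        return group[0]
    if keep == "prefer_doi":
        # First record that has a DOI; fall back to the first record.
        return next((r for r in group if r.get("doi")), group[0])
    if keep == "most_complete":
        # Record with the most non-empty fields; ties go to the earliest.
        return max(group, key=lambda r: sum(1 for v in r.values() if v))
    raise ValueError(f"unknown keep preference: {keep}")

group = [
    {"title": "T", "doi": None, "abstract": None},
    {"title": "T", "doi": "10.1/x", "abstract": None},
    {"title": "T", "doi": None, "abstract": "text", "journal": "J"},
]
print(choose_keeper(group, "prefer_doi")["doi"])  # 10.1/x
```

Note that in this example the record with the DOI is not the most complete one, so `prefer_doi` and `most_complete` pick different records.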
Merge Metadata:
Combines the best data from all duplicates into a single entry.
Deduplication Examples
Example 1: Comprehensive Lit Review
lixplore -A -q "cancer immunotherapy" \
-m 200 \
-D strict \
--dedup-keep most_complete \
--dedup-merge \
--enrich \
-X bibtex
Example 2: Quick Discovery
Example 3: DOI-Based (Journal Articles Only)
Result Sorting
Sort Methods
| Method | Description | Use Case |
|---|---|---|
| relevant | API default | Trust source ranking |
| newest | Latest first | Current research |
| oldest | Earliest first | Historical perspective |
| journal | Alphabetical | Journal-organized bibliography |
| author | By first author | Author-organized bibliography |
Sort Examples
Latest Research:
Historical Timeline:
Journal Organization:
Author Bibliography:
Sort + Select Combinations
Top 10 Latest:
Last 5 Oldest:
Even Newest:
Selection Patterns
Pattern Types
Numbers:
Ranges:
Keywords:
# Odd articles
lixplore -P -q "query" -m 100 -S odd
# Even articles
lixplore -P -q "query" -m 100 -S even
# First N
lixplore -P -q "query" -m 100 -S first:20
# Last N
lixplore -P -q "query" -m 100 -S last:10
# Top N (alias for first:N)
lixplore -P -q "query" -m 100 -S top:15
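The keyword patterns above can be modeled as simple list slices. The sketch below covers only the keywords shown (`odd`, `even`, `first:N`, `last:N`, `top:N`) and assumes results are numbered from 1, so that "odd" means the 1st, 3rd, 5th result and so on; the `select` helper is hypothetical.

```python
def select(items, pattern):
    """Apply a keyword selection pattern to a list of results."""
    if pattern == "odd":
        return items[0::2]  # positions 1, 3, 5, ... (1-based, assumed)
    if pattern == "even":
        return items[1::2]  # positions 2, 4, 6, ...
    keyword, _, count = pattern.partition(":")
    if keyword in ("first", "top"):  # top:N is an alias for first:N
        return items[:int(count)]
    if keyword == "last":
        n = int(count)
        return items[-n:] if n > 0 else []
    raise ValueError(f"unsupported pattern: {pattern}")

results = list(range(1, 11))  # stand-in for ten numbered results
print(select(results, "odd"))     # [1, 3, 5, 7, 9]
print(select(results, "last:2"))  # [9, 10]
```

Applying `first:200` and then `odd` in this model leaves 100 items, which is how halving a selection for deeper review works out numerically.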
Mixed Patterns:
Selection Strategies
Sample Every Other:
Top Results:
Quality Filter:
Stratified Sample:
Complete Workflows
Workflow 1: Systematic Literature Review
# Step 1: Comprehensive search across all sources
lixplore -A -q "(cancer OR tumor) AND (immunotherapy OR checkpoint inhibitor)" \
-m 500 \
-D strict \
--dedup-keep most_complete \
--dedup-merge
# Step 2: Filter to recent research
lixplore -A -q "..." \
-m 500 \
-d 2020-01-01 2024-12-31 \
-D strict \
--dedup-merge
# Step 3: Sort by newest and select top 100
lixplore -A -q "..." \
-m 500 \
-d 2020-01-01 2024-12-31 \
-D strict \
--sort newest \
-S first:100
# Step 4: Enrich and export
lixplore -A -q "..." \
-m 500 \
-d 2020-01-01 2024-12-31 \
-D strict \
--dedup-merge \
--sort newest \
-S first:100 \
--enrich \
-X xlsx,bibtex \
-o systematic_review
Workflow 2: Current Awareness
# Weekly update: Latest 20 papers in field
lixplore -s PC -q "machine learning healthcare" \
-d 2024-12-01 2024-12-31 \
-m 100 \
-D \
--sort newest \
-S first:20 \
-X xlsx \
-o weekly_update.xlsx
Workflow 3: Historical Analysis
# Publication trends over decades
lixplore -P -q "diabetes treatment" \
-d 1970-01-01 2024-12-31 \
-m 1000 \
--sort oldest \
--stat \
--stat-top 50
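For a trend analysis like this, the underlying computation is a count of publications per time bucket. The sketch below shows one such count, grouped by decade. It is not a description of what `--stat` outputs: the `year` field and the decade grouping are assumptions chosen for illustration.

```python
from collections import Counter

def publications_per_decade(records):
    """Count records per decade, skipping those without a year."""
    counts = Counter()
    for rec in records:
        year = rec.get("year")
        if year is not None:
            counts[(year // 10) * 10] += 1
    return dict(sorted(counts.items()))

records = [{"year": 1975}, {"year": 1979}, {"year": 1984},
           {"year": 2021}, {"year": None}]
print(publications_per_decade(records))  # {1970: 2, 1980: 1, 2020: 1}
```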
Workflow 4: Quality Screening
# Step 1: Get large dataset
lixplore -A -q "research topic" -m 500 -D
# Step 2: Sort by newest (recency as a rough stand-in for quality)
lixplore -A -q "research topic" -m 500 -D --sort newest
# Step 3: Manual review top 100 with abstracts
lixplore -A -q "research topic" -m 500 -D --sort newest -S first:100 -a
# Step 4: Annotate during review
lixplore --annotate 5 --rating 5 --tags "excellent,must-cite"
lixplore --annotate 8 --rating 4 --tags "relevant"
# Step 5: Export high-rated only
lixplore --filter-annotations "min_rating=4"
lixplore --export-annotations markdown
Workflow 5: Multi-Stage Filtering
# Stage 1: Broad search
lixplore -A -q "broad topic" -m 1000 -D
# Stage 2: Date filter
lixplore -A -q "broad topic" \
-m 1000 \
-d 2020-01-01 2024-12-31 \
-D
# Stage 3: Sort and select top tier
lixplore -A -q "broad topic" \
-m 1000 \
-d 2020-01-01 2024-12-31 \
-D \
--sort newest \
-S first:200
# Stage 4: Manual screening with abstracts shown (-a), annotating as you go
lixplore -A -q "broad topic" \
-m 1000 \
-d 2020-01-01 2024-12-31 \
-D \
--sort newest \
-S first:200 \
-a
# Stage 5: Second-round selection
lixplore -A -q "broad topic" \
-m 1000 \
-d 2020-01-01 2024-12-31 \
-D \
--sort newest \
-S first:200 \
-S odd # 100 articles for deep review
# Stage 6: Final export after annotation
lixplore --filter-annotations "min_rating=4,priority=high"
lixplore --export-annotations markdown
Best Practices
1. Always Deduplicate Multi-Source
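2. Filter → Deduplicate → Sort → Select → Export
# Order of operations:
# 1. Filter (-d)  2. Deduplicate (-D)  3. Sort (--sort)  4. Select (-S)  5. Export (-X)
# Comments are kept off the continued lines: text after a trailing backslash breaks the command.
lixplore -A -q "query" \
-d 2020-01-01 2024-12-31 \
-m 200 \
-D \
--sort newest \
-S first:50 \
-X xlsx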
3. Use Appropriate Dedup Strategy
- Final bibliography → strict
- Exploration → loose
- Balanced → auto
4. Combine Date + Sort for Latest
5. Statistics for Large Sets
Troubleshooting
Too many duplicates remain:
Missing valid variants:
Date filter not working:
Sort not as expected:
# Some articles may lack year/journal metadata
# Consider enrichment first
lixplore -A -q "query" -m 100 -D --enrich --sort newest
Last Updated: 2024-12-28