This is page 10 of 15. Use http://codebase.md/genomoncology/biomcp?lines=false&page={x} to view the full context. # Directory Structure ``` ├── .github │ ├── actions │ │ └── setup-python-env │ │ └── action.yml │ ├── dependabot.yml │ └── workflows │ ├── ci.yml │ ├── deploy-docs.yml │ ├── main.yml.disabled │ ├── on-release-main.yml │ └── validate-codecov-config.yml ├── .gitignore ├── .pre-commit-config.yaml ├── BIOMCP_DATA_FLOW.md ├── CHANGELOG.md ├── CNAME ├── codecov.yaml ├── docker-compose.yml ├── Dockerfile ├── docs │ ├── apis │ │ ├── error-codes.md │ │ ├── overview.md │ │ └── python-sdk.md │ ├── assets │ │ ├── biomcp-cursor-locations.png │ │ ├── favicon.ico │ │ ├── icon.png │ │ ├── logo.png │ │ ├── mcp_architecture.txt │ │ └── remote-connection │ │ ├── 00_connectors.png │ │ ├── 01_add_custom_connector.png │ │ ├── 02_connector_enabled.png │ │ ├── 03_connect_to_biomcp.png │ │ ├── 04_select_google_oauth.png │ │ └── 05_success_connect.png │ ├── backend-services-reference │ │ ├── 01-overview.md │ │ ├── 02-biothings-suite.md │ │ ├── 03-cbioportal.md │ │ ├── 04-clinicaltrials-gov.md │ │ ├── 05-nci-cts-api.md │ │ ├── 06-pubtator3.md │ │ └── 07-alphagenome.md │ ├── blog │ │ ├── ai-assisted-clinical-trial-search-analysis.md │ │ ├── images │ │ │ ├── deep-researcher-video.png │ │ │ ├── researcher-announce.png │ │ │ ├── researcher-drop-down.png │ │ │ ├── researcher-prompt.png │ │ │ ├── trial-search-assistant.png │ │ │ └── what_is_biomcp_thumbnail.png │ │ └── researcher-persona-resource.md │ ├── changelog.md │ ├── CNAME │ ├── concepts │ │ ├── 01-what-is-biomcp.md │ │ ├── 02-the-deep-researcher-persona.md │ │ └── 03-sequential-thinking-with-the-think-tool.md │ ├── developer-guides │ │ ├── 01-server-deployment.md │ │ ├── 02-contributing-and-testing.md │ │ ├── 03-third-party-endpoints.md │ │ ├── 04-transport-protocol.md │ │ ├── 05-error-handling.md │ │ ├── 06-http-client-and-caching.md │ │ ├── 07-performance-optimizations.md │ │ └── generate_endpoints.py │ ├── faq-condensed.md │ ├── FDA_SECURITY.md │ ├── genomoncology.md │ ├── getting-started │ │ ├── 01-quickstart-cli.md │ │ ├── 02-claude-desktop-integration.md │ │ └── 03-authentication-and-api-keys.md │ ├── how-to-guides │ │ ├── 01-find-articles-and-cbioportal-data.md │ │ ├── 02-find-trials-with-nci-and-biothings.md │ │ ├── 03-get-comprehensive-variant-annotations.md │ │ ├── 04-predict-variant-effects-with-alphagenome.md │ │ ├── 05-logging-and-monitoring-with-bigquery.md │ │ └── 06-search-nci-organizations-and-interventions.md │ ├── index.md │ ├── policies.md │ ├── reference │ │ ├── architecture-diagrams.md │ │ ├── quick-architecture.md │ │ ├── quick-reference.md │ │ └── visual-architecture.md │ ├── robots.txt │ ├── stylesheets │ │ ├── announcement.css │ │ └── extra.css │ ├── troubleshooting.md │ ├── tutorials │ │ ├── biothings-prompts.md │ │ ├── claude-code-biomcp-alphagenome.md │ │ ├── nci-prompts.md │ │ ├── openfda-integration.md │ │ ├── openfda-prompts.md │ │ ├── pydantic-ai-integration.md │ │ └── remote-connection.md │ ├── user-guides │ │ ├── 01-command-line-interface.md │ │ ├── 02-mcp-tools-reference.md │ │ └── 03-integrating-with-ides-and-clients.md │ └── workflows │ └── all-workflows.md ├── example_scripts │ ├── mcp_integration.py │ └── python_sdk.py ├── glama.json ├── LICENSE ├── lzyank.toml ├── Makefile ├── mkdocs.yml ├── package-lock.json ├── package.json ├── pyproject.toml ├── README.md ├── scripts │ ├── check_docs_in_mkdocs.py │ ├── check_http_imports.py │ └── generate_endpoints_doc.py ├── smithery.yaml ├── src │ └── biomcp │ ├── 
__init__.py │ ├── __main__.py │ ├── articles │ │ ├── __init__.py │ │ ├── autocomplete.py │ │ ├── fetch.py │ │ ├── preprints.py │ │ ├── search_optimized.py │ │ ├── search.py │ │ └── unified.py │ ├── biomarkers │ │ ├── __init__.py │ │ └── search.py │ ├── cbioportal_helper.py │ ├── circuit_breaker.py │ ├── cli │ │ ├── __init__.py │ │ ├── articles.py │ │ ├── biomarkers.py │ │ ├── diseases.py │ │ ├── health.py │ │ ├── interventions.py │ │ ├── main.py │ │ ├── openfda.py │ │ ├── organizations.py │ │ ├── server.py │ │ ├── trials.py │ │ └── variants.py │ ├── connection_pool.py │ ├── constants.py │ ├── core.py │ ├── diseases │ │ ├── __init__.py │ │ ├── getter.py │ │ └── search.py │ ├── domain_handlers.py │ ├── drugs │ │ ├── __init__.py │ │ └── getter.py │ ├── exceptions.py │ ├── genes │ │ ├── __init__.py │ │ └── getter.py │ ├── http_client_simple.py │ ├── http_client.py │ ├── individual_tools.py │ ├── integrations │ │ ├── __init__.py │ │ ├── biothings_client.py │ │ └── cts_api.py │ ├── interventions │ │ ├── __init__.py │ │ ├── getter.py │ │ └── search.py │ ├── logging_filter.py │ ├── metrics_handler.py │ ├── metrics.py │ ├── openfda │ │ ├── __init__.py │ │ ├── adverse_events_helpers.py │ │ ├── adverse_events.py │ │ ├── cache.py │ │ ├── constants.py │ │ ├── device_events_helpers.py │ │ ├── device_events.py │ │ ├── drug_approvals.py │ │ ├── drug_labels_helpers.py │ │ ├── drug_labels.py │ │ ├── drug_recalls_helpers.py │ │ ├── drug_recalls.py │ │ ├── drug_shortages_detail_helpers.py │ │ ├── drug_shortages_helpers.py │ │ ├── drug_shortages.py │ │ ├── exceptions.py │ │ ├── input_validation.py │ │ ├── rate_limiter.py │ │ ├── utils.py │ │ └── validation.py │ ├── organizations │ │ ├── __init__.py │ │ ├── getter.py │ │ └── search.py │ ├── parameter_parser.py │ ├── prefetch.py │ ├── query_parser.py │ ├── query_router.py │ ├── rate_limiter.py │ ├── render.py │ ├── request_batcher.py │ ├── resources │ │ ├── __init__.py │ │ ├── getter.py │ │ ├── instructions.md │ │ └── researcher.md │ ├── retry.py │ ├── router_handlers.py │ ├── router.py │ ├── shared_context.py │ ├── thinking │ │ ├── __init__.py │ │ ├── sequential.py │ │ └── session.py │ ├── thinking_tool.py │ ├── thinking_tracker.py │ ├── trials │ │ ├── __init__.py │ │ ├── getter.py │ │ ├── nci_getter.py │ │ ├── nci_search.py │ │ └── search.py │ ├── utils │ │ ├── __init__.py │ │ ├── cancer_types_api.py │ │ ├── cbio_http_adapter.py │ │ ├── endpoint_registry.py │ │ ├── gene_validator.py │ │ ├── metrics.py │ │ ├── mutation_filter.py │ │ ├── query_utils.py │ │ ├── rate_limiter.py │ │ └── request_cache.py │ ├── variants │ │ ├── __init__.py │ │ ├── alphagenome.py │ │ ├── cancer_types.py │ │ ├── cbio_external_client.py │ │ ├── cbioportal_mutations.py │ │ ├── cbioportal_search_helpers.py │ │ ├── cbioportal_search.py │ │ ├── constants.py │ │ ├── external.py │ │ ├── filters.py │ │ ├── getter.py │ │ ├── links.py │ │ └── search.py │ └── workers │ ├── __init__.py │ ├── worker_entry_stytch.js │ ├── worker_entry.js │ └── worker.py ├── tests │ ├── bdd │ │ ├── cli_help │ │ │ ├── help.feature │ │ │ └── test_help.py │ │ ├── conftest.py │ │ ├── features │ │ │ └── alphagenome_integration.feature │ │ ├── fetch_articles │ │ │ ├── fetch.feature │ │ │ └── test_fetch.py │ │ ├── get_trials │ │ │ ├── get.feature │ │ │ └── test_get.py │ │ ├── get_variants │ │ │ ├── get.feature │ │ │ └── test_get.py │ │ ├── search_articles │ │ │ ├── autocomplete.feature │ │ │ ├── search.feature │ │ │ ├── test_autocomplete.py │ │ │ └── test_search.py │ │ ├── search_trials │ │ │ ├── search.feature │ │ │ └── 
test_search.py │ │ ├── search_variants │ │ │ ├── search.feature │ │ │ └── test_search.py │ │ └── steps │ │ └── test_alphagenome_steps.py │ ├── config │ │ └── test_smithery_config.py │ ├── conftest.py │ ├── data │ │ ├── ct_gov │ │ │ ├── clinical_trials_api_v2.yaml │ │ │ ├── trials_NCT04280705.json │ │ │ └── trials_NCT04280705.txt │ │ ├── myvariant │ │ │ ├── myvariant_api.yaml │ │ │ ├── myvariant_field_descriptions.csv │ │ │ ├── variants_full_braf_v600e.json │ │ │ ├── variants_full_braf_v600e.txt │ │ │ └── variants_part_braf_v600_multiple.json │ │ ├── openfda │ │ │ ├── drugsfda_detail.json │ │ │ ├── drugsfda_search.json │ │ │ ├── enforcement_detail.json │ │ │ └── enforcement_search.json │ │ └── pubtator │ │ ├── pubtator_autocomplete.json │ │ └── pubtator3_paper.txt │ ├── integration │ │ ├── test_openfda_integration.py │ │ ├── test_preprints_integration.py │ │ ├── test_simple.py │ │ └── test_variants_integration.py │ ├── tdd │ │ ├── articles │ │ │ ├── test_autocomplete.py │ │ │ ├── test_cbioportal_integration.py │ │ │ ├── test_fetch.py │ │ │ ├── test_preprints.py │ │ │ ├── test_search.py │ │ │ └── test_unified.py │ │ ├── conftest.py │ │ ├── drugs │ │ │ ├── __init__.py │ │ │ └── test_drug_getter.py │ │ ├── openfda │ │ │ ├── __init__.py │ │ │ ├── test_adverse_events.py │ │ │ ├── test_device_events.py │ │ │ ├── test_drug_approvals.py │ │ │ ├── test_drug_labels.py │ │ │ ├── test_drug_recalls.py │ │ │ ├── test_drug_shortages.py │ │ │ └── test_security.py │ │ ├── test_biothings_integration_real.py │ │ ├── test_biothings_integration.py │ │ ├── test_circuit_breaker.py │ │ ├── test_concurrent_requests.py │ │ ├── test_connection_pool.py │ │ ├── test_domain_handlers.py │ │ ├── test_drug_approvals.py │ │ ├── test_drug_recalls.py │ │ ├── test_drug_shortages.py │ │ ├── test_endpoint_documentation.py │ │ ├── test_error_scenarios.py │ │ ├── test_europe_pmc_fetch.py │ │ ├── test_mcp_integration.py │ │ ├── test_mcp_tools.py │ │ ├── test_metrics.py │ │ ├── test_nci_integration.py │ │ ├── test_nci_mcp_tools.py │ │ ├── test_network_policies.py │ │ ├── test_offline_mode.py │ │ ├── test_openfda_unified.py │ │ ├── test_pten_r173_search.py │ │ ├── test_render.py │ │ ├── test_request_batcher.py.disabled │ │ ├── test_retry.py │ │ ├── test_router.py │ │ ├── test_shared_context.py.disabled │ │ ├── test_unified_biothings.py │ │ ├── thinking │ │ │ ├── __init__.py │ │ │ └── test_sequential.py │ │ ├── trials │ │ │ ├── test_backward_compatibility.py │ │ │ ├── test_getter.py │ │ │ └── test_search.py │ │ ├── utils │ │ │ ├── test_gene_validator.py │ │ │ ├── test_mutation_filter.py │ │ │ ├── test_rate_limiter.py │ │ │ └── test_request_cache.py │ │ ├── variants │ │ │ ├── constants.py │ │ │ ├── test_alphagenome_api_key.py │ │ │ ├── test_alphagenome_comprehensive.py │ │ │ ├── test_alphagenome.py │ │ │ ├── test_cbioportal_mutations.py │ │ │ ├── test_cbioportal_search.py │ │ │ ├── test_external_integration.py │ │ │ ├── test_external.py │ │ │ ├── test_extract_gene_aa_change.py │ │ │ ├── test_filters.py │ │ │ ├── test_getter.py │ │ │ ├── test_links.py │ │ │ └── test_search.py │ │ └── workers │ │ └── test_worker_sanitization.js │ └── test_pydantic_ai_integration.py ├── THIRD_PARTY_ENDPOINTS.md ├── tox.ini ├── uv.lock └── wrangler.toml ``` # Files -------------------------------------------------------------------------------- /docs/how-to-guides/04-predict-variant-effects-with-alphagenome.md: -------------------------------------------------------------------------------- ```markdown # How to Predict Variant Effects with AlphaGenome This 
guide demonstrates how to use Google DeepMind's AlphaGenome to predict regulatory effects of genetic variants on gene expression, chromatin accessibility, and splicing. ## Overview AlphaGenome predicts how DNA variants affect: - **Gene Expression**: Log-fold changes in nearby genes - **Chromatin Accessibility**: ATAC-seq/DNase-seq signal changes - **Splicing**: Effects on splice sites and exon inclusion - **Regulatory Elements**: Impact on enhancers, promoters, and TFBS - **3D Chromatin**: Changes in chromatin interactions For technical details on the AlphaGenome integration, see the [AlphaGenome API Reference](../backend-services-reference/07-alphagenome.md). ## Setup and API Key ### Get Your API Key 1. Visit [AlphaGenome Portal](https://deepmind.google.com/science/alphagenome) 2. Register for non-commercial use 3. Receive API key via email For detailed setup instructions, see [Authentication and API Keys](../getting-started/03-authentication-and-api-keys.md#alphagenome). ### Configure API Key **Option 1: Environment Variable (Personal Use)** ```bash export ALPHAGENOME_API_KEY="your-key-here" ``` **Option 2: Per-Request (AI Assistants)** ``` "Predict effects of BRAF V600E. My AlphaGenome API key is YOUR_KEY_HERE" ``` **Option 3: Configuration File** ```python # .env file ALPHAGENOME_API_KEY=your-key-here ``` ### Install AlphaGenome (Optional) For local predictions: ```bash git clone https://github.com/google-deepmind/alphagenome.git cd alphagenome && pip install . ``` ## Basic Variant Prediction ### Simple Prediction Predict effects of BRAF V600E mutation: ```bash # CLI biomcp variant predict chr7 140753336 A T # Python result = await client.variants.predict( chromosome="chr7", position=140753336, reference="A", alternate="T" ) # MCP Tool result = await alphagenome_predictor( chromosome="chr7", position=140753336, reference="A", alternate="T" ) ``` ### Understanding Results ```python # Gene expression changes for gene in result.gene_expression: print(f"{gene.name}: {gene.log2_fold_change}") # Positive = increased expression # Negative = decreased expression # |value| > 1.0 = strong effect # Chromatin accessibility for region in result.chromatin: print(f"{region.type}: {region.change}") # Positive = more open chromatin # Negative = more closed chromatin # Splicing effects for splice in result.splicing: print(f"{splice.event}: {splice.delta_psi}") # PSI = Percent Spliced In # Positive = increased inclusion ``` ## Tissue-Specific Predictions ### Single Tissue Analysis Predict effects in specific tissues using UBERON terms: ```python # Breast tissue analysis result = await alphagenome_predictor( chromosome="chr17", position=41246481, reference="G", alternate="A", tissue_types=["UBERON:0000310"] # breast ) # Common tissue codes: # UBERON:0000310 - breast # UBERON:0002107 - liver # UBERON:0002367 - prostate # UBERON:0000955 - brain # UBERON:0002048 - lung # UBERON:0001155 - colon ``` ### Multi-Tissue Comparison Compare effects across tissues: ```python tissues = [ "UBERON:0000310", # breast "UBERON:0002107", # liver "UBERON:0002048" # lung ] results = {} for tissue in tissues: results[tissue] = await alphagenome_predictor( chromosome="chr17", position=41246481, reference="G", alternate="A", tissue_types=[tissue] ) # Compare gene expression across tissues for tissue, result in results.items(): print(f"\n{tissue}:") for gene in result.gene_expression[:3]: print(f" {gene.name}: {gene.log2_fold_change}") ``` ## Analysis Window Sizes ### Choosing the Right Interval Different interval sizes capture 
different regulatory effects:

```python
# Short-range (promoter effects)
result_2kb = await alphagenome_predictor(
    chromosome="chr7",
    position=140753336,
    reference="A",
    alternate="T",
    interval_size=2048  # 2kb
)

# Medium-range (enhancer-promoter)
result_128kb = await alphagenome_predictor(
    chromosome="chr7",
    position=140753336,
    reference="A",
    alternate="T",
    interval_size=131072  # 128kb (default)
)

# Long-range (TAD-level effects)
result_1mb = await alphagenome_predictor(
    chromosome="chr7",
    position=140753336,
    reference="A",
    alternate="T",
    interval_size=1048576  # 1Mb
)
```

**Interval Size Guide:**

- **2kb**: Promoter variants, TSS mutations
- **16kb**: Local regulatory elements
- **128kb**: Enhancer-promoter interactions (default)
- **512kb**: Long-range regulatory elements
- **1Mb**: TAD boundaries, super-enhancers

## Clinical Workflows

### Workflow 1: VUS (Variant of Unknown Significance) Analysis

```python
async def analyze_vus(chromosome: str, position: int, ref: str, alt: str):
    # Step 1: Think about the analysis
    await think(
        thought=f"Analyzing VUS at {chromosome}:{position} {ref}>{alt}",
        thoughtNumber=1
    )

    # Step 2: Get variant annotations
    variant_id = f"{chromosome}:g.{position}{ref}>{alt}"
    try:
        known_variant = await variant_getter(variant_id)
        if known_variant.clinical_significance:
            return f"Already classified: {known_variant.clinical_significance}"
    except Exception:
        pass  # Variant not in databases

    # Step 3: Predict regulatory effects
    prediction = await alphagenome_predictor(
        chromosome=chromosome,
        position=position,
        reference=ref,
        alternate=alt,
        interval_size=131072
    )

    # Step 4: Analyze impact
    high_impact_genes = [
        g for g in prediction.gene_expression
        if abs(g.log2_fold_change) > 1.0
    ]

    # Step 5: Search literature
    if high_impact_genes:
        gene_symbols = [g.name for g in high_impact_genes[:3]]
        articles = await article_searcher(
            genes=gene_symbols,
            keywords=["pathogenic", "disease", "mutation"]
        )

    return {
        "variant": f"{chromosome}:{position} {ref}>{alt}",
        "high_impact_genes": high_impact_genes,
        "regulatory_assessment": assess_regulatory_impact(prediction),
        "literature_support": len(articles) if high_impact_genes else 0
    }

def assess_regulatory_impact(prediction):
    """Classify regulatory impact severity"""
    max_expression_change = max(
        abs(g.log2_fold_change)
        for g in prediction.gene_expression
    ) if prediction.gene_expression else 0

    if max_expression_change > 2.0:
        return "HIGH - Strong regulatory effect"
    elif max_expression_change > 1.0:
        return "MODERATE - Significant regulatory effect"
    elif max_expression_change > 0.5:
        return "LOW - Mild regulatory effect"
    else:
        return "MINIMAL - No significant regulatory effect"
```

### Workflow 2: Non-coding Variant Prioritization

```python
async def prioritize_noncoding_variants(variants: list[dict], disease_genes: list[str]):
    """Rank non-coding variants by predicted impact on disease genes"""

    results = []
    for variant in variants:
        # Predict effects
        prediction = await alphagenome_predictor(
            chromosome=variant["chr"],
            position=variant["pos"],
            reference=variant["ref"],
            alternate=variant["alt"]
        )

        # Check impact on disease genes
        disease_impact = {}
        for gene in prediction.gene_expression:
            if gene.name in disease_genes:
                disease_impact[gene.name] = gene.log2_fold_change

        # Calculate priority score
        if disease_impact:
            max_impact = max(abs(v) for v in disease_impact.values())
            results.append({
                "variant": variant,
                "disease_genes_affected": disease_impact,
                "priority_score": max_impact,
                "chromatin_changes": len([c for c in prediction.chromatin if c.change > 0.5])
            })

    # Sort by priority
    results.sort(key=lambda x: x["priority_score"], reverse=True)
    return results

# Example usage
variants_to_test = [
    {"chr": "chr17", "pos": 41246000, "ref": "A", "alt": "G"},
    {"chr": "chr17", "pos": 41246500, "ref": "C", "alt": "T"},
    {"chr": "chr17", "pos": 41247000, "ref": "G", "alt": "A"}
]

breast_cancer_genes = ["BRCA1", "BRCA2", "TP53", "PTEN"]
prioritized = await prioritize_noncoding_variants(variants_to_test, breast_cancer_genes)
```

### Workflow 3: Splicing Analysis

```python
async def analyze_splicing_variant(gene: str, exon: int, variant_pos: int, ref: str, alt: str):
    """Analyze potential splicing effects of a variant"""

    # Get gene information
    gene_info = await gene_getter(gene)
    chromosome = f"chr{gene_info.genomic_location.chr}"

    # Predict splicing effects
    prediction = await alphagenome_predictor(
        chromosome=chromosome,
        position=variant_pos,
        reference=ref,
        alternate=alt,
        interval_size=16384  # Smaller window for splicing
    )

    # Analyze splicing predictions
    splicing_effects = []
    for event in prediction.splicing:
        if abs(event.delta_psi) > 0.1:  # 10% change in splicing
            splicing_effects.append({
                "type": event.event_type,
                "change": event.delta_psi,
                "affected_exon": event.exon,
                "interpretation": interpret_splicing(event)
            })

    # Search for similar splicing variants
    articles = await article_searcher(
        genes=[gene],
        keywords=[f"exon {exon}", "splicing", "splice site"]
    )

    return {
        "variant": f"{gene} exon {exon} {ref}>{alt}",
        "splicing_effects": splicing_effects,
        "likely_consequence": predict_consequence(splicing_effects),
        "literature_precedent": len(articles)
    }

def interpret_splicing(event):
    """Interpret splicing changes"""
    if event.delta_psi > 0.5:
        return "Strong increase in exon inclusion"
    elif event.delta_psi > 0.1:
        return "Moderate increase in exon inclusion"
    elif event.delta_psi < -0.5:
        return "Strong exon skipping"
    elif event.delta_psi < -0.1:
        return "Moderate exon skipping"
    else:
        return "Minimal splicing change"

def predict_consequence(splicing_effects):
    """Summarize the most likely consequence (simple illustrative heuristic)"""
    if not splicing_effects:
        return "No significant splicing change predicted"
    strongest = max(splicing_effects, key=lambda e: abs(e["change"]))
    return strongest["interpretation"]
```

## Research Applications

### Enhancer Variant Analysis

```python
async def analyze_enhancer_variant(chr: str, pos: int, ref: str, alt: str, target_gene: str):
    """Analyze variant in potential enhancer region"""

    # Use larger window to capture enhancer-promoter interactions
    prediction = await alphagenome_predictor(
        chromosome=chr,
        position=pos,
        reference=ref,
        alternate=alt,
        interval_size=524288  # 512kb
    )

    # Find target gene effect
    target_effect = None
    for gene in prediction.gene_expression:
        if gene.name == target_gene:
            target_effect = gene.log2_fold_change
            break

    # Analyze chromatin changes
    chromatin_opening = sum(
        1 for c in prediction.chromatin
        if c.change > 0 and c.type == "enhancer"
    )

    return {
        "variant_location": f"{chr}:{pos}",
        "target_gene": target_gene,
        "expression_change": target_effect,
        "enhancer_activity": "increased" if chromatin_opening > 0 else "decreased",
        "likely_enhancer": abs(target_effect or 0) > 0.5 and chromatin_opening > 0
    }
```

### Pharmacogenomic Predictions

```python
async def predict_drug_response_variant(drug_target: str, variant: dict):
    """Predict how variant affects drug target expression"""

    # Get drug information
    drug_info = await drug_getter(drug_target)
    target_genes = drug_info.targets

    # Predict variant effects
    prediction = await alphagenome_predictor(
        chromosome=variant["chr"],
        position=variant["pos"],
        reference=variant["ref"],
        alternate=variant["alt"],
        tissue_types=["UBERON:0002107"]  # liver for drug metabolism
    )

    # Check effects on drug targets
    target_effects = {}
    for gene in prediction.gene_expression:
        if gene.name in target_genes:
target_effects[gene.name] = gene.log2_fold_change # Interpret results if any(abs(effect) > 1.0 for effect in target_effects.values()): response = "Likely altered drug response" elif any(abs(effect) > 0.5 for effect in target_effects.values()): response = "Possible altered drug response" else: response = "Unlikely to affect drug response" return { "drug": drug_target, "variant": variant, "target_effects": target_effects, "prediction": response, "recommendation": "Consider dose adjustment" if "altered" in response else "Standard dosing" } ``` ## Best Practices ### 1. Validate Input Coordinates ```python # Always use "chr" prefix chromosome = "chr7" # ✅ Correct # chromosome = "7" # ❌ Wrong # Use 1-based positions (not 0-based) position = 140753336 # ✅ 1-based ``` ### 2. Handle API Errors Gracefully ```python try: result = await alphagenome_predictor(...) except Exception as e: if "API key" in str(e): print("Please provide AlphaGenome API key") elif "Invalid sequence" in str(e): print("Check chromosome and position") else: print(f"Prediction failed: {e}") ``` ### 3. Combine with Other Tools ```python # Complete variant analysis pipeline async def comprehensive_variant_analysis(variant_id: str): # 1. Get known annotations known = await variant_getter(variant_id) # 2. Predict regulatory effects prediction = await alphagenome_predictor( chromosome=f"chr{known.chr}", position=known.pos, reference=known.ref, alternate=known.alt ) # 3. Search literature articles = await article_searcher( variants=[variant_id], genes=[known.gene.symbol] ) # 4. Find relevant trials trials = await trial_searcher( other_terms=[known.gene.symbol, "mutation"] ) return { "annotations": known, "predictions": prediction, "literature": articles, "trials": trials } ``` ### 4. Interpret Results Appropriately ```python def interpret_expression_change(log2_fc): """Convert log2 fold change to interpretation""" if log2_fc > 2.0: return "Very strong increase (>4x)" elif log2_fc > 1.0: return "Strong increase (2-4x)" elif log2_fc > 0.5: return "Moderate increase (1.4-2x)" elif log2_fc < -2.0: return "Very strong decrease (<0.25x)" elif log2_fc < -1.0: return "Strong decrease (0.25-0.5x)" elif log2_fc < -0.5: return "Moderate decrease (0.5-0.7x)" else: return "Minimal change" ``` ## Limitations and Considerations ### Technical Limitations - **Human only**: GRCh38 reference genome - **SNVs only**: No indels or structural variants - **Exact coordinates**: Must have precise genomic position - **Sequence context**: Requires reference sequence match ### Interpretation Caveats - **Predictions not certainties**: Validate with functional studies - **Context matters**: Cell type, developmental stage affect outcomes - **Indirect effects**: May miss complex regulatory cascades - **Population variation**: Individual genetic background influences ## Troubleshooting ### Common Issues **"API key required"** - Set environment variable or provide per-request - Check key validity at AlphaGenome portal **"Invalid sequence length"** - Verify chromosome format (use "chr" prefix) - Check position is within chromosome bounds - Ensure ref/alt are single nucleotides **"No results returned"** - May be no genes in analysis window - Try larger interval size - Check if variant is in gene desert **Installation issues** - Ensure Python 3.10+ - Try `pip install --upgrade pip` first - Check for conflicting protobuf versions ## Next Steps - Explore [comprehensive variant annotations](03-get-comprehensive-variant-annotations.md) - Learn about [article 
searches](01-find-articles-and-cbioportal-data.md) for variants - Set up [logging and monitoring](05-logging-and-monitoring-with-bigquery.md) ``` -------------------------------------------------------------------------------- /docs/how-to-guides/06-search-nci-organizations-and-interventions.md: -------------------------------------------------------------------------------- ```markdown # How to Search NCI Organizations and Interventions This guide demonstrates how to use BioMCP's NCI-specific tools to search for cancer research organizations, interventions (drugs, devices, procedures), and biomarkers. ## Prerequisites All NCI tools require an API key from [api.cancer.gov](https://api.cancer.gov): ```bash # Set as environment variable export NCI_API_KEY="your-key-here" # Or provide per-request in your prompts "Find cancer centers in Boston, my NCI API key is YOUR_KEY" ``` ## Organization Search and Lookup ### Understanding Organization Search The NCI Organization database contains: - Cancer research centers and hospitals - Clinical trial sponsors - Academic institutions - Pharmaceutical companies - Government facilities ### Basic Organization Search Find organizations by name: ```bash # CLI biomcp organization search --name "MD Anderson" --api-key YOUR_KEY # Python orgs = await nci_organization_searcher( name="MD Anderson", api_key="your-key" ) # MCP/AI Assistant "Search for MD Anderson Cancer Center, my NCI API key is YOUR_KEY" ``` ### Location-Based Search **CRITICAL**: Always use city AND state together to avoid Elasticsearch errors! ```python # ✅ CORRECT - City and state together orgs = await nci_organization_searcher( city="Houston", state="TX", api_key="your-key" ) # ❌ WRONG - Will cause API error orgs = await nci_organization_searcher( city="Houston", # Missing state! api_key="your-key" ) # ❌ WRONG - Will cause API error orgs = await nci_organization_searcher( state="TX", # Missing city! 
api_key="your-key" ) ``` ### Organization Types Search by organization type: ```python # Find academic cancer centers academic_centers = await nci_organization_searcher( organization_type="Academic", api_key="your-key" ) # Find pharmaceutical companies pharma_companies = await nci_organization_searcher( organization_type="Industry", api_key="your-key" ) # Find government research facilities gov_facilities = await nci_organization_searcher( organization_type="Government", api_key="your-key" ) ``` Valid organization types: - `Academic` - Universities and medical schools - `Industry` - Pharmaceutical and biotech companies - `Government` - NIH, FDA, VA hospitals - `Community` - Community hospitals and clinics - `Network` - Research networks and consortiums - `Other` - Other organization types ### Getting Organization Details Retrieve complete information about a specific organization: ```python # Get organization by ID org_details = await nci_organization_getter( organization_id="NCI-2011-03337", api_key="your-key" ) # Returns: # - Full name and aliases # - Contact information # - Address and location # - Associated clinical trials # - Organization type and status ``` ### Practical Organization Workflows #### Find Regional Cancer Centers ```python async def find_cancer_centers_by_region(state: str, cities: list[str]): """Find all cancer centers in specific cities within a state""" all_centers = [] for city in cities: # ALWAYS use city + state together centers = await nci_organization_searcher( city=city, state=state, organization_type="Academic", api_key=os.getenv("NCI_API_KEY") ) all_centers.extend(centers) # Remove duplicates unique_centers = {org['id']: org for org in all_centers} return list(unique_centers.values()) # Example: Find cancer centers in major Texas cities texas_centers = await find_cancer_centers_by_region( state="TX", cities=["Houston", "Dallas", "San Antonio", "Austin"] ) ``` #### Find Trial Sponsors ```python async def find_trial_sponsors_by_type(org_type: str, name_filter: str = None): """Find organizations sponsoring trials""" # Search organizations orgs = await nci_organization_searcher( name=name_filter, organization_type=org_type, api_key=os.getenv("NCI_API_KEY") ) # For each org, get details including trial count sponsors = [] for org in orgs[:10]: # Limit to avoid rate limits details = await nci_organization_getter( organization_id=org['id'], api_key=os.getenv("NCI_API_KEY") ) if details.get('trial_count', 0) > 0: sponsors.append(details) return sorted(sponsors, key=lambda x: x.get('trial_count', 0), reverse=True) # Find pharmaceutical companies with active trials pharma_sponsors = await find_trial_sponsors_by_type("Industry") ``` ## Intervention Search and Lookup ### Understanding Interventions Interventions in clinical trials include: - **Drugs**: Chemotherapy, targeted therapy, immunotherapy - **Devices**: Medical devices, diagnostic tools - **Procedures**: Surgical techniques, radiation protocols - **Biologicals**: Cell therapies, vaccines, antibodies - **Behavioral**: Lifestyle interventions, counseling - **Other**: Dietary supplements, alternative therapies ### Drug Search Find specific drugs or drug classes: ```bash # CLI - Find a specific drug biomcp intervention search --name pembrolizumab --type Drug --api-key YOUR_KEY # CLI - Find drug class biomcp intervention search --name "PD-1 inhibitor" --type Drug --api-key YOUR_KEY ``` ```python # Python - Search with synonyms drugs = await nci_intervention_searcher( name="pembrolizumab", intervention_type="Drug", 
synonyms=True, # Include Keytruda, MK-3475, etc. api_key="your-key" ) # Search for drug combinations combos = await nci_intervention_searcher( name="nivolumab AND ipilimumab", intervention_type="Drug", api_key="your-key" ) ``` ### Device and Procedure Search ```python # Find medical devices devices = await nci_intervention_searcher( intervention_type="Device", name="robot", # Surgical robots api_key="your-key" ) # Find procedures procedures = await nci_intervention_searcher( intervention_type="Procedure", name="minimally invasive", api_key="your-key" ) # Find radiation protocols radiation = await nci_intervention_searcher( intervention_type="Radiation", name="proton beam", api_key="your-key" ) ``` ### Getting Intervention Details ```python # Get complete intervention information intervention = await nci_intervention_getter( intervention_id="INT123456", api_key="your-key" ) # Returns: # - Official name and synonyms # - Intervention type and subtype # - Mechanism of action (for drugs) # - FDA approval status # - Associated clinical trials # - Manufacturer information ``` ### Practical Intervention Workflows #### Drug Development Pipeline ```python async def analyze_drug_pipeline(drug_target: str): """Analyze drugs in development for a specific target""" # Search for drugs targeting specific pathway drugs = await nci_intervention_searcher( name=drug_target, intervention_type="Drug", api_key=os.getenv("NCI_API_KEY") ) pipeline = { "preclinical": [], "phase1": [], "phase2": [], "phase3": [], "approved": [] } for drug in drugs: # Get detailed information details = await nci_intervention_getter( intervention_id=drug['id'], api_key=os.getenv("NCI_API_KEY") ) # Categorize by development stage if details.get('fda_approved'): pipeline['approved'].append(details) else: # Check associated trials for phase trial_phases = details.get('trial_phases', []) if 'PHASE3' in trial_phases: pipeline['phase3'].append(details) elif 'PHASE2' in trial_phases: pipeline['phase2'].append(details) elif 'PHASE1' in trial_phases: pipeline['phase1'].append(details) else: pipeline['preclinical'].append(details) return pipeline # Analyze PD-1/PD-L1 inhibitor pipeline pd1_pipeline = await analyze_drug_pipeline("PD-1 inhibitor") ``` #### Compare Similar Interventions ```python async def compare_interventions(intervention_names: list[str]): """Compare multiple interventions side by side""" comparisons = [] for name in intervention_names: # Search for intervention results = await nci_intervention_searcher( name=name, synonyms=True, api_key=os.getenv("NCI_API_KEY") ) if results: # Get detailed info for first match details = await nci_intervention_getter( intervention_id=results[0]['id'], api_key=os.getenv("NCI_API_KEY") ) comparisons.append({ "name": details['name'], "type": details['type'], "synonyms": details.get('synonyms', []), "fda_approved": details.get('fda_approved', False), "trial_count": len(details.get('trials', [])), "mechanism": details.get('mechanism_of_action', 'Not specified') }) return comparisons # Compare checkpoint inhibitors comparison = await compare_interventions([ "pembrolizumab", "nivolumab", "atezolizumab", "durvalumab" ]) ``` ## Biomarker Search ### Understanding Biomarker Types The NCI API supports two biomarker types: - `reference_gene` - Gene-based biomarkers (e.g., EGFR, BRAF) - `branch` - Pathway/branch biomarkers **Note**: You cannot search by gene symbol directly; use the name parameter. 
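For example, a gene-based biomarker such as EGFR is located by passing the symbol as the `name` value. The sketch below follows the call signatures used elsewhere in this guide; the commented-out `gene` parameter is hypothetical and does not exist in the API:

```python
# ❌ Hypothetical - the API has no dedicated gene parameter
# await nci_biomarker_searcher(gene="EGFR", api_key="your-key")

# ✅ Pass the gene symbol via the name parameter instead
egfr_biomarkers = await nci_biomarker_searcher(
    name="EGFR",
    biomarker_type="reference_gene",  # gene-based biomarkers
    api_key="your-key"
)
```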
### Basic Biomarker Search ```python # Search for PD-L1 biomarkers pdl1_biomarkers = await nci_biomarker_searcher( name="PD-L1", api_key="your-key" ) # Search for specific biomarker type gene_biomarkers = await nci_biomarker_searcher( biomarker_type="reference_gene", api_key="your-key" ) ``` ### Biomarker Analysis Workflow ```python async def analyze_trial_biomarkers(disease: str): """Find biomarkers used in trials for a disease""" # Get all biomarkers all_biomarkers = await nci_biomarker_searcher( biomarker_type="reference_gene", api_key=os.getenv("NCI_API_KEY") ) # Filter by disease association disease_biomarkers = [] for biomarker in all_biomarkers: if disease.lower() in str(biomarker).lower(): disease_biomarkers.append(biomarker) # Group by frequency biomarker_counts = {} for bio in disease_biomarkers: name = bio.get('name', 'Unknown') biomarker_counts[name] = biomarker_counts.get(name, 0) + 1 # Sort by frequency return sorted( biomarker_counts.items(), key=lambda x: x[1], reverse=True ) # Find most common biomarkers in lung cancer trials lung_biomarkers = await analyze_trial_biomarkers("lung cancer") ``` ## Combined Workflows ### Regional Drug Development Analysis ```python async def analyze_regional_drug_development( state: str, cities: list[str], drug_class: str ): """Analyze drug development in a specific region""" # Step 1: Find organizations in the region organizations = [] for city in cities: orgs = await nci_organization_searcher( city=city, state=state, organization_type="Industry", api_key=os.getenv("NCI_API_KEY") ) organizations.extend(orgs) # Step 2: Find drugs of interest drugs = await nci_intervention_searcher( name=drug_class, intervention_type="Drug", api_key=os.getenv("NCI_API_KEY") ) # Step 3: Cross-reference trials regional_development = [] for drug in drugs[:10]: # Limit for performance drug_details = await nci_intervention_getter( intervention_id=drug['id'], api_key=os.getenv("NCI_API_KEY") ) # Check if any trials are sponsored by regional orgs for trial in drug_details.get('trials', []): for org in organizations: if org['id'] in str(trial): regional_development.append({ 'drug': drug_details['name'], 'organization': org['name'], 'location': f"{org.get('city', '')}, {org.get('state', '')}", 'trial': trial }) return regional_development # Analyze immunotherapy development in California ca_immuno = await analyze_regional_drug_development( state="CA", cities=["San Francisco", "San Diego", "Los Angeles"], drug_class="immunotherapy" ) ``` ### Organization to Intervention Pipeline ```python async def org_to_intervention_pipeline(org_name: str): """Trace from organization to their interventions""" # Find organization orgs = await nci_organization_searcher( name=org_name, api_key=os.getenv("NCI_API_KEY") ) if not orgs: return None # Get organization details org_details = await nci_organization_getter( organization_id=orgs[0]['id'], api_key=os.getenv("NCI_API_KEY") ) # Get their trials org_trials = org_details.get('trials', []) # Extract unique interventions interventions = set() for trial_id in org_trials[:20]: # Sample trials trial = await trial_getter( nct_id=trial_id, source="nci", api_key=os.getenv("NCI_API_KEY") ) if trial.get('interventions'): interventions.update(trial['interventions']) # Get details for each intervention intervention_details = [] for intervention_name in interventions: results = await nci_intervention_searcher( name=intervention_name, api_key=os.getenv("NCI_API_KEY") ) if results: intervention_details.append(results[0]) return { 'organization': 
org_details,
        'trial_count': len(org_trials),
        'interventions': intervention_details
    }

# Analyze Genentech's intervention portfolio
genentech_portfolio = await org_to_intervention_pipeline("Genentech")
```

## Best Practices

### 1. Always Use City + State Together

```python
# ✅ GOOD - Prevents API errors
await nci_organization_searcher(city="Boston", state="MA")

# ❌ BAD - Will cause Elasticsearch error
await nci_organization_searcher(city="Boston")
```

### 2. Handle Rate Limits

```python
import asyncio

async def search_with_rate_limit(searches: list):
    """Execute searches with rate limiting"""
    results = []
    for search in searches:
        result = await search()
        results.append(result)
        # Add delay to respect rate limits
        await asyncio.sleep(0.1)  # At most 10 requests per second
    return results
```

### 3. Use Pagination for Large Results

```python
async def get_all_organizations(org_type: str):
    """Get all organizations of a type using pagination"""
    all_orgs = []
    page = 1

    while True:
        orgs = await nci_organization_searcher(
            organization_type=org_type,
            page=page,
            page_size=100,  # Maximum allowed
            api_key=os.getenv("NCI_API_KEY")
        )

        if not orgs:
            break

        all_orgs.extend(orgs)
        page += 1

        # Note: Total count may not be available
        if len(orgs) < 100:
            break

    return all_orgs
```

### 4. Cache Results

```python
# functools.lru_cache does not work with async functions (it would cache
# the coroutine object, not the result), so use a simple manual cache.
_org_cache: dict = {}

async def cached_org_search(city: str, state: str, org_type: str):
    """Cache organization searches to reduce API calls"""
    key = (city, state, org_type)
    if key not in _org_cache:
        _org_cache[key] = await nci_organization_searcher(
            city=city,
            state=state,
            organization_type=org_type,
            api_key=os.getenv("NCI_API_KEY")
        )
    return _org_cache[key]
```

## Troubleshooting

### Common Errors and Solutions

1. **"Search Too Broad" Error**

   - Always use city + state together for location searches
   - Add more specific filters (name, type)
   - Reduce page_size parameter

2. **"NCI API key required"**

   - Set NCI_API_KEY environment variable
   - Or provide api_key parameter in function calls
   - Or include in prompt: "my NCI API key is YOUR_KEY"

3. **No Results Found**

   - Check spelling of organization/drug names
   - Try partial name matches
   - Remove filters and broaden search
   - Enable synonyms for intervention searches

4. 
**Rate Limit Exceeded**

   - Add delays between requests
   - Reduce concurrent requests
   - Cache frequently accessed data
   - Consider upgrading API key tier

### Debugging Tips

```python
# Enable debug logging
import logging
logging.basicConfig(level=logging.DEBUG)

# Test API key
async def test_nci_connection():
    try:
        result = await nci_organization_searcher(
            name="Mayo",
            api_key=os.getenv("NCI_API_KEY")
        )
        print(f"✅ API key valid, found {len(result)} results")
    except Exception as e:
        print(f"❌ API key error: {e}")

# Check specific organization exists
async def verify_org_id(org_id: str):
    try:
        org = await nci_organization_getter(
            organization_id=org_id,
            api_key=os.getenv("NCI_API_KEY")
        )
        print(f"✅ Organization found: {org['name']}")
    except Exception:
        print(f"❌ Organization ID not found: {org_id}")
```

## Next Steps

- Review [NCI prompts examples](../tutorials/nci-prompts.md) for AI assistant usage
- Explore [trial search with biomarkers](02-find-trials-with-nci-and-biothings.md)
- Learn about [variant effect prediction](04-predict-variant-effects-with-alphagenome.md)
- Set up [API authentication](../getting-started/03-authentication-and-api-keys.md)
```

--------------------------------------------------------------------------------
/tests/tdd/test_router.py:
--------------------------------------------------------------------------------

```python
"""Comprehensive tests for the unified router module."""

import json
from unittest.mock import patch

import pytest

from biomcp.exceptions import (
    InvalidDomainError,
    InvalidParameterError,
    QueryParsingError,
    SearchExecutionError,
)
from biomcp.router import fetch, format_results, search


class TestFormatResults:
    """Test the format_results function."""

    def test_format_article_results(self):
        """Test formatting article results."""
        results = [
            {
                "pmid": "12345",
                "title": "Test Article",
                "abstract": "This is a test abstract",
                # Note: url in input is ignored, always generates PubMed URL
            }
        ]

        # Mock thinking tracker to prevent reminder
        with patch("biomcp.router.get_thinking_reminder", return_value=""):
            formatted = format_results(results, "article", 1, 10, 1)

        assert "results" in formatted
        assert len(formatted["results"]) == 1
        result = formatted["results"][0]
        assert result["id"] == "12345"
        assert result["title"] == "Test Article"
        assert "test abstract" in result["text"]
        assert result["url"] == "https://pubmed.ncbi.nlm.nih.gov/12345/"

    def test_format_trial_results_api_v2(self):
        """Test formatting trial results with API v2 structure."""
        results = [
            {
                "protocolSection": {
                    "identificationModule": {
                        "nctId": "NCT12345",
                        "briefTitle": "Test Trial",
                    },
                    "descriptionModule": {
                        "briefSummary": "This is a test trial summary"
                    },
                    "statusModule": {"overallStatus": "RECRUITING"},
                    "designModule": {"phases": ["PHASE3"]},
                }
            }
        ]

        # Mock thinking tracker to prevent reminder
        with patch("biomcp.router.get_thinking_reminder", return_value=""):
            formatted = format_results(results, "trial", 1, 10, 1)

        assert "results" in formatted
        assert len(formatted["results"]) == 1
        result = formatted["results"][0]
        assert result["id"] == "NCT12345"
        assert result["title"] == "Test Trial"
        assert "test trial summary" in result["text"]
        assert "NCT12345" in result["url"]

    def test_format_trial_results_legacy(self):
        """Test formatting trial results with legacy structure."""
        results = [
            {
                "NCT Number": "NCT67890",
                "Study Title": "Legacy Trial",
                "Brief Summary": "Legacy trial summary",
                "Study Status": "COMPLETED",
                "Phases": "Phase 2",
            }
        ]

        # Mock thinking tracker to prevent reminder
        with 
patch("biomcp.router.get_thinking_reminder", return_value=""): formatted = format_results(results, "trial", 1, 10, 1) assert "results" in formatted assert len(formatted["results"]) == 1 result = formatted["results"][0] assert result["id"] == "NCT67890" assert result["title"] == "Legacy Trial" assert "Legacy trial summary" in result["text"] def test_format_variant_results(self): """Test formatting variant results.""" results = [ { "_id": "chr7:g.140453136A>T", "dbsnp": {"rsid": "rs121913529"}, "dbnsfp": {"genename": "BRAF"}, "clinvar": {"rcv": {"clinical_significance": "Pathogenic"}}, } ] # Mock thinking tracker to prevent reminder with patch("biomcp.router.get_thinking_reminder", return_value=""): formatted = format_results(results, "variant", 1, 10, 1) assert "results" in formatted assert len(formatted["results"]) == 1 result = formatted["results"][0] assert result["id"] == "chr7:g.140453136A>T" assert "BRAF" in result["title"] assert "Pathogenic" in result["text"] assert "rs121913529" in result["url"] def test_format_results_invalid_domain(self): """Test format_results with invalid domain.""" with pytest.raises(InvalidDomainError) as exc_info: format_results([], "invalid_domain", 1, 10, 0) assert "Unknown domain: invalid_domain" in str(exc_info.value) def test_format_results_malformed_data(self): """Test format_results handles malformed data gracefully.""" results = [ {"title": "Good Article", "pmid": "123"}, None, # Malformed - will be skipped { "invalid": "data" }, # Missing required fields but won't fail (treated as preprint) ] # Mock thinking tracker to prevent reminder with patch("biomcp.router.get_thinking_reminder", return_value=""): formatted = format_results(results, "article", 1, 10, 3) # Should skip None but include the third (treated as preprint with empty fields) assert len(formatted["results"]) == 2 assert formatted["results"][0]["id"] == "123" assert formatted["results"][1]["id"] == "" # Empty ID for invalid data @pytest.mark.asyncio class TestSearchFunction: """Test the unified search function.""" async def test_search_article_domain(self): """Test search with article domain.""" mock_result = json.dumps([ {"pmid": "123", "title": "Test", "abstract": "Abstract"} ]) with patch( "biomcp.articles.unified.search_articles_unified" ) as mock_search: mock_search.return_value = mock_result # Mock thinking tracker to prevent reminder with patch("biomcp.router.get_thinking_reminder", return_value=""): result = await search( query="", domain="article", genes="BRAF", diseases=["cancer"], page_size=10, ) assert "results" in result assert len(result["results"]) == 1 assert result["results"][0]["id"] == "123" async def test_search_trial_domain(self): """Test search with trial domain.""" mock_result = json.dumps({ "studies": [ { "protocolSection": { "identificationModule": {"nctId": "NCT123"}, } } ] }) with patch("biomcp.trials.search.search_trials") as mock_search: mock_search.return_value = mock_result # Mock thinking tracker to prevent reminder with patch("biomcp.router.get_thinking_reminder", return_value=""): result = await search( query="", domain="trial", conditions=["cancer"], phase="Phase 3", page_size=20, ) assert "results" in result mock_search.assert_called_once() async def test_search_variant_domain(self): """Test search with variant domain.""" mock_result = json.dumps([ {"_id": "rs123", "gene": {"symbol": "BRAF"}} ]) with patch("biomcp.variants.search.search_variants") as mock_search: mock_search.return_value = mock_result # Mock thinking tracker to prevent reminder with 
patch("biomcp.router.get_thinking_reminder", return_value=""): result = await search( query="", domain="variant", genes="BRAF", significance="pathogenic", page_size=10, ) assert "results" in result assert len(result["results"]) == 1 async def test_search_unified_query(self): """Test search with unified query language.""" with patch("biomcp.router._unified_search") as mock_unified: mock_unified.return_value = { "results": [{"id": "1", "title": "Test"}] } result = await search( query="gene:BRAF AND disease:cancer", max_results_per_domain=20, ) assert "results" in result mock_unified.assert_called_once_with( query="gene:BRAF AND disease:cancer", max_results_per_domain=20, domains=None, explain_query=False, ) async def test_search_no_domain_or_query(self): """Test search without domain or query raises error.""" with pytest.raises(InvalidParameterError) as exc_info: await search(query="") assert "query or domain" in str(exc_info.value) async def test_search_invalid_domain(self): """Test search with invalid domain.""" with pytest.raises(InvalidDomainError): await search(query="", domain="invalid_domain") async def test_search_get_schema(self): """Test search with get_schema flag.""" result = await search(query="", get_schema=True) assert "domains" in result assert "cross_domain_fields" in result assert "domain_fields" in result assert isinstance(result["cross_domain_fields"], dict) async def test_search_pagination_validation(self): """Test search with invalid pagination parameters.""" with pytest.raises(InvalidParameterError) as exc_info: await search( query="", domain="article", page=0, # Invalid - must be >= 1 page_size=10, ) assert "page" in str(exc_info.value) async def test_search_parameter_parsing(self): """Test parameter parsing for list inputs.""" mock_result = json.dumps([]) with patch( "biomcp.articles.unified.search_articles_unified" ) as mock_search: mock_search.return_value = mock_result # Test with JSON array string await search( query="", domain="article", genes='["BRAF", "KRAS"]', diseases="cancer,melanoma", # Comma-separated ) # Check the request was parsed correctly call_args = mock_search.call_args[0][0] assert call_args.genes == ["BRAF", "KRAS"] assert call_args.diseases == ["cancer", "melanoma"] @pytest.mark.asyncio class TestFetchFunction: """Test the unified fetch function.""" async def test_fetch_article(self): """Test fetching article details.""" mock_result = json.dumps([ { "pmid": 12345, "title": "Test Article", "abstract": "Full abstract", "full_text": "Full text content", } ]) with patch("biomcp.articles.fetch.fetch_articles") as mock_fetch: mock_fetch.return_value = mock_result result = await fetch( domain="article", id="12345", ) assert result["id"] == "12345" assert result["title"] == "Test Article" assert result["text"] == "Full text content" assert "metadata" in result async def test_fetch_article_invalid_pmid(self): """Test fetching article with invalid identifier.""" result = await fetch(domain="article", id="not_a_number") # Should return an error since "not_a_number" is neither a valid PMID nor DOI assert "error" in result assert "Invalid identifier format" in result["error"] assert "not_a_number" in result["error"] async def test_fetch_trial_all_sections(self): """Test fetching trial with all sections.""" mock_protocol = json.dumps({ "title": "Test Trial", "nct_id": "NCT123", "brief_summary": "Summary", }) mock_locations = json.dumps({"locations": [{"city": "Boston"}]}) mock_outcomes = json.dumps({ "outcomes": {"primary_outcomes": ["Outcome1"]} }) 
mock_references = json.dumps({"references": [{"pmid": "456"}]}) with ( patch("biomcp.trials.getter._trial_protocol") as mock_p, patch("biomcp.trials.getter._trial_locations") as mock_l, patch("biomcp.trials.getter._trial_outcomes") as mock_o, patch("biomcp.trials.getter._trial_references") as mock_r, ): mock_p.return_value = mock_protocol mock_l.return_value = mock_locations mock_o.return_value = mock_outcomes mock_r.return_value = mock_references result = await fetch(domain="trial", id="NCT123", detail="all") assert result["id"] == "NCT123" assert "metadata" in result assert "locations" in result["metadata"] assert "outcomes" in result["metadata"] assert "references" in result["metadata"] async def test_fetch_trial_invalid_detail(self): """Test fetching trial with invalid detail parameter.""" with pytest.raises(InvalidParameterError) as exc_info: await fetch( domain="trial", id="NCT123", detail="invalid_section", ) assert "one of:" in str(exc_info.value) async def test_fetch_variant(self): """Test fetching variant details.""" mock_result = json.dumps([ { "_id": "rs123", "gene": {"symbol": "BRAF"}, "clinvar": {"clinical_significance": "Pathogenic"}, "tcga": {"cancer_types": {}}, "external_links": {"dbSNP": "https://example.com"}, } ]) with patch("biomcp.variants.getter.get_variant") as mock_get: mock_get.return_value = mock_result result = await fetch(domain="variant", id="rs123") assert result["id"] == "rs123" assert "TCGA Data: Available" in result["text"] assert "external_links" in result["metadata"] async def test_fetch_variant_list_response(self): """Test fetching variant when API returns list.""" mock_result = json.dumps([ {"_id": "rs123", "gene": {"symbol": "BRAF"}} ]) with patch("biomcp.variants.getter.get_variant") as mock_get: mock_get.return_value = mock_result result = await fetch(domain="variant", id="rs123") assert result["id"] == "rs123" async def test_fetch_invalid_domain(self): """Test fetch with invalid domain.""" with pytest.raises(InvalidDomainError): await fetch(domain="invalid", id="123") async def test_fetch_error_handling(self): """Test fetch error handling.""" with patch("biomcp.articles.fetch.fetch_articles") as mock_fetch: mock_fetch.side_effect = Exception("API Error") with pytest.raises(SearchExecutionError) as exc_info: await fetch(domain="article", id="123") assert "Failed to execute search" in str(exc_info.value) async def test_fetch_domain_auto_detection_pmid(self): """Test domain auto-detection for PMID.""" with patch("biomcp.articles.fetch._article_details") as mock_fetch: mock_fetch.return_value = json.dumps([ {"pmid": "12345", "title": "Test"} ]) # Numeric ID should auto-detect as article result = await fetch(id="12345") assert result["id"] == "12345" mock_fetch.assert_called_once() async def test_fetch_domain_auto_detection_nct(self): """Test domain auto-detection for NCT ID.""" with patch("biomcp.trials.getter.get_trial") as mock_get: mock_get.return_value = json.dumps({ "protocolSection": { "identificationModule": {"briefTitle": "Test Trial"} } }) # NCT ID should auto-detect as trial result = await fetch(id="NCT12345") assert "NCT12345" in result["url"] mock_get.assert_called() async def test_fetch_domain_auto_detection_doi(self): """Test domain auto-detection for DOI.""" with patch("biomcp.articles.fetch._article_details") as mock_fetch: mock_fetch.return_value = json.dumps([ {"doi": "10.1038/nature12345", "title": "Test"} ]) # DOI should auto-detect as article await fetch(id="10.1038/nature12345") mock_fetch.assert_called_once() async def 
test_fetch_domain_auto_detection_variant(self): """Test domain auto-detection for variant IDs.""" with patch("biomcp.variants.getter.get_variant") as mock_get: mock_get.return_value = json.dumps([{"_id": "rs12345"}]) # rsID should auto-detect as variant await fetch(id="rs12345") mock_get.assert_called_once() # Test HGVS notation with patch("biomcp.variants.getter.get_variant") as mock_get: mock_get.return_value = json.dumps([ {"_id": "chr7:g.140453136A>T"} ]) await fetch(id="chr7:g.140453136A>T") mock_get.assert_called_once() @pytest.mark.asyncio class TestUnifiedSearch: """Test the _unified_search internal function.""" async def test_unified_search_explain_query(self): """Test unified search with explain_query flag.""" from biomcp.router import _unified_search result = await _unified_search( query="gene:BRAF AND disease:cancer", explain_query=True ) assert "original_query" in result assert "parsed_structure" in result assert "routing_plan" in result assert "schema" in result async def test_unified_search_execution(self): """Test unified search normal execution.""" from biomcp.router import _unified_search with patch("biomcp.query_router.execute_routing_plan") as mock_execute: mock_execute.return_value = { "articles": json.dumps([{"pmid": "123", "title": "Article 1"}]) } result = await _unified_search( query="gene:BRAF", max_results_per_domain=10 ) assert "results" in result assert isinstance(result["results"], list) async def test_unified_search_parse_error(self): """Test unified search with invalid query.""" from biomcp.router import _unified_search with patch("biomcp.query_parser.QueryParser.parse") as mock_parse: mock_parse.side_effect = Exception("Parse error") with pytest.raises(QueryParsingError): await _unified_search( query="invalid::query", max_results_per_domain=10 ) ``` -------------------------------------------------------------------------------- /src/biomcp/integrations/biothings_client.py: -------------------------------------------------------------------------------- ```python """BioThings API client for unified access to the BioThings suite. The BioThings suite (https://biothings.io) provides high-performance biomedical data APIs including: - MyGene.info - Gene annotations and information - MyVariant.info - Genetic variant annotations (existing integration enhanced) - MyDisease.info - Disease ontology and synonyms - MyChem.info - Drug/chemical annotations and information This module provides a centralized client for interacting with all BioThings APIs, handling common concerns like error handling, rate limiting, and response parsing. While MyVariant.info has specialized modules for complex variant operations, this client provides the base layer for all BioThings API interactions. """ import logging from typing import Any from urllib.parse import quote from pydantic import BaseModel, Field from .. 
import http_client from ..constants import ( MYVARIANT_GET_URL, ) logger = logging.getLogger(__name__) # BioThings API endpoints MYGENE_BASE_URL = "https://mygene.info/v3" MYGENE_QUERY_URL = f"{MYGENE_BASE_URL}/query" MYGENE_GET_URL = f"{MYGENE_BASE_URL}/gene" MYDISEASE_BASE_URL = "https://mydisease.info/v1" MYDISEASE_QUERY_URL = f"{MYDISEASE_BASE_URL}/query" MYDISEASE_GET_URL = f"{MYDISEASE_BASE_URL}/disease" MYCHEM_BASE_URL = "https://mychem.info/v1" MYCHEM_QUERY_URL = f"{MYCHEM_BASE_URL}/query" MYCHEM_GET_URL = f"{MYCHEM_BASE_URL}/chem" class GeneInfo(BaseModel): """Gene information from MyGene.info.""" gene_id: str = Field(alias="_id") symbol: str | None = None name: str | None = None summary: str | None = None alias: list[str] | None = Field(default_factory=list) entrezgene: int | str | None = None ensembl: dict[str, Any] | None = None refseq: dict[str, Any] | None = None type_of_gene: str | None = None taxid: int | None = None class DiseaseInfo(BaseModel): """Disease information from MyDisease.info.""" disease_id: str = Field(alias="_id") name: str | None = None mondo: dict[str, Any] | None = None definition: str | None = None synonyms: list[str] | None = Field(default_factory=list) xrefs: dict[str, Any] | None = None phenotypes: list[dict[str, Any]] | None = None class DrugInfo(BaseModel): """Drug/chemical information from MyChem.info.""" drug_id: str = Field(alias="_id") name: str | None = None tradename: list[str] | None = Field(default_factory=list) drugbank_id: str | None = None chebi_id: str | None = None chembl_id: str | None = None pubchem_cid: str | None = None unii: str | dict[str, Any] | None = None inchikey: str | None = None formula: str | None = None description: str | None = None indication: str | None = None pharmacology: dict[str, Any] | None = None mechanism_of_action: str | None = None class BioThingsClient: """Unified client for BioThings APIs (MyGene, MyVariant, MyDisease, MyChem).""" def __init__(self): """Initialize the BioThings client.""" self.logger = logger async def get_gene_info( self, gene_id_or_symbol: str, fields: list[str] | None = None ) -> GeneInfo | None: """Get gene information from MyGene.info. 
Args: gene_id_or_symbol: Gene ID (Entrez, Ensembl) or symbol (e.g., "TP53") fields: Optional list of fields to return Returns: GeneInfo object or None if not found """ try: # First, try direct GET (works for Entrez IDs) if gene_id_or_symbol.isdigit(): return await self._get_gene_by_id(gene_id_or_symbol, fields) # For symbols, we need to query first query_result = await self._query_gene(gene_id_or_symbol) if not query_result: return None # Get the best match gene_id = query_result[0].get("_id") if not gene_id: return None # Now get full details return await self._get_gene_by_id(gene_id, fields) except Exception as e: self.logger.warning( f"Failed to get gene info for {gene_id_or_symbol}: {e}" ) return None async def _query_gene(self, symbol: str) -> list[dict[str, Any]] | None: """Query MyGene.info for a gene symbol.""" params = { "q": f"symbol:{quote(symbol)}", "species": "human", "fields": "_id,symbol,name,taxid", "size": 5, } response, error = await http_client.request_api( url=MYGENE_QUERY_URL, request=params, method="GET", domain="mygene", ) if error or not response: return None hits = response.get("hits", []) # Filter for human genes (taxid 9606) human_hits = [h for h in hits if h.get("taxid") == 9606] return human_hits if human_hits else hits async def _get_gene_by_id( self, gene_id: str, fields: list[str] | None = None ) -> GeneInfo | None: """Get gene details by ID from MyGene.info.""" if fields is None: fields = [ "symbol", "name", "summary", "alias", "type_of_gene", "ensembl", "refseq", "entrezgene", ] params = {"fields": ",".join(fields)} response, error = await http_client.request_api( url=f"{MYGENE_GET_URL}/{gene_id}", request=params, method="GET", domain="mygene", ) if error or not response: return None try: return GeneInfo(**response) except Exception as e: self.logger.warning(f"Failed to parse gene response: {e}") return None async def batch_get_genes( self, gene_ids: list[str], fields: list[str] | None = None ) -> list[GeneInfo]: """Get multiple genes in a single request. Args: gene_ids: List of gene IDs or symbols fields: Optional list of fields to return Returns: List of GeneInfo objects """ if not gene_ids: return [] if fields is None: fields = ["symbol", "name", "summary", "alias", "type_of_gene"] # MyGene supports POST for batch queries data = { "ids": ",".join(gene_ids), "fields": ",".join(fields), "species": "human", } response, error = await http_client.request_api( url=MYGENE_GET_URL, request=data, method="POST", domain="mygene", ) if error or not response: return [] results = [] for item in response: try: if "notfound" not in item: results.append(GeneInfo(**item)) except Exception as e: self.logger.warning(f"Failed to parse gene in batch: {e}") continue return results async def get_disease_info( self, disease_id_or_name: str, fields: list[str] | None = None ) -> DiseaseInfo | None: """Get disease information from MyDisease.info. 
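        Illustrative usage (an added documentation sketch, not from the
        original source; the MONDO ID is shown only as an example of the
        recognized prefixes):

            client = BioThingsClient()
            disease = await client.get_disease_info("melanoma")       # name: query first
            disease = await client.get_disease_info("MONDO:0005105")  # known prefix: direct GET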
Args: disease_id_or_name: Disease ID (MONDO, DOID) or name fields: Optional list of fields to return Returns: DiseaseInfo object or None if not found """ try: # Check if it's an ID (starts with known prefixes) if any( disease_id_or_name.upper().startswith(prefix) for prefix in ["MONDO:", "DOID:", "OMIM:", "MESH:"] ): return await self._get_disease_by_id( disease_id_or_name, fields ) # Otherwise, query by name query_result = await self._query_disease(disease_id_or_name) if not query_result: return None # Get the best match disease_id = query_result[0].get("_id") if not disease_id: return None # Now get full details return await self._get_disease_by_id(disease_id, fields) except Exception as e: self.logger.warning( f"Failed to get disease info for {disease_id_or_name}: {e}" ) return None async def _query_disease(self, name: str) -> list[dict[str, Any]] | None: """Query MyDisease.info for a disease name.""" params = { "q": quote(name), "fields": "_id,name,mondo", "size": 10, } response, error = await http_client.request_api( url=MYDISEASE_QUERY_URL, request=params, method="GET", domain="mydisease", ) if error or not response: return None return response.get("hits", []) async def _get_disease_by_id( self, disease_id: str, fields: list[str] | None = None ) -> DiseaseInfo | None: """Get disease details by ID from MyDisease.info.""" if fields is None: fields = [ "name", "mondo", "definition", "synonyms", "xrefs", "phenotypes", ] params = {"fields": ",".join(fields)} response, error = await http_client.request_api( url=f"{MYDISEASE_GET_URL}/{quote(disease_id, safe='')}", request=params, method="GET", domain="mydisease", ) if error or not response: return None try: # Extract definition from mondo if available if "mondo" in response and isinstance(response["mondo"], dict): if ( "definition" in response["mondo"] and "definition" not in response ): response["definition"] = response["mondo"]["definition"] # Extract synonyms from mondo if "synonym" in response["mondo"]: mondo_synonyms = response["mondo"]["synonym"] if isinstance(mondo_synonyms, dict): # Handle exact synonyms exact = mondo_synonyms.get("exact", []) if isinstance(exact, list): response["synonyms"] = exact elif isinstance(mondo_synonyms, list): response["synonyms"] = mondo_synonyms return DiseaseInfo(**response) except Exception as e: self.logger.warning(f"Failed to parse disease response: {e}") return None async def get_disease_synonyms(self, disease_id_or_name: str) -> list[str]: """Get disease synonyms for query expansion. Args: disease_id_or_name: Disease ID or name Returns: List of synonyms including the original term """ disease_info = await self.get_disease_info(disease_id_or_name) if not disease_info: return [disease_id_or_name] synonyms = [disease_id_or_name] if disease_info.name and disease_info.name != disease_id_or_name: synonyms.append(disease_info.name) if disease_info.synonyms: synonyms.extend(disease_info.synonyms) # Remove duplicates while preserving order seen = set() unique_synonyms = [] for syn in synonyms: if syn.lower() not in seen: seen.add(syn.lower()) unique_synonyms.append(syn) return unique_synonyms[ :5 ] # Limit to top 5 to avoid overly broad searches async def get_drug_info( self, drug_id_or_name: str, fields: list[str] | None = None ) -> DrugInfo | None: """Get drug/chemical information from MyChem.info. Args: drug_id_or_name: Drug ID (DrugBank, ChEMBL, etc.) 
or name fields: Optional list of fields to return Returns: DrugInfo object or None if not found """ try: # Check if it's an ID (starts with known prefixes) if any( drug_id_or_name.upper().startswith(prefix) for prefix in ["DRUGBANK:", "DB", "CHEMBL", "CHEBI:", "CID"] ): return await self._get_drug_by_id(drug_id_or_name, fields) # Otherwise, query by name query_result = await self._query_drug(drug_id_or_name) if not query_result: return None # Get the best match drug_id = query_result[0].get("_id") if not drug_id: return None # Now get full details return await self._get_drug_by_id(drug_id, fields) except Exception as e: self.logger.warning( f"Failed to get drug info for {drug_id_or_name}: {e}" ) return None async def _query_drug(self, name: str) -> list[dict[str, Any]] | None: """Query MyChem.info for a drug name.""" params = { "q": quote(name), "fields": "_id,name,drugbank.name,chebi.name,chembl.pref_name,unii.display_name", "size": 10, } response, error = await http_client.request_api( url=MYCHEM_QUERY_URL, request=params, method="GET", domain="mychem", ) if error or not response: return None hits = response.get("hits", []) # Sort hits to prioritize those with actual drug names def score_hit(hit): score = hit.get("_score", 0) # Boost score if hit has drug name fields if hit.get("drugbank", {}).get("name"): score += 10 if hit.get("chembl", {}).get("pref_name"): score += 5 if hit.get("unii", {}).get("display_name"): score += 3 return score hits.sort(key=score_hit, reverse=True) return hits async def _get_drug_by_id( self, drug_id: str, fields: list[str] | None = None ) -> DrugInfo | None: """Get drug details by ID from MyChem.info.""" if fields is None: fields = [ "name", "drugbank", "chebi", "chembl", "pubchem", "unii", "inchikey", "formula", "description", "indication", "pharmacology", "mechanism_of_action", ] params = {"fields": ",".join(fields)} response, error = await http_client.request_api( url=f"{MYCHEM_GET_URL}/{quote(drug_id, safe='')}", request=params, method="GET", domain="mychem", ) if error or not response: return None try: # Handle array response (multiple results) if isinstance(response, list): if not response: return None # Take the first result response = response[0] # Extract fields from nested structures self._extract_drugbank_fields(response) self._extract_chebi_fields(response) self._extract_chembl_fields(response) self._extract_pubchem_fields(response) self._extract_unii_fields(response) return DrugInfo(**response) except Exception as e: self.logger.warning(f"Failed to parse drug response: {e}") return None def _extract_drugbank_fields(self, response: dict[str, Any]) -> None: """Extract DrugBank fields from response.""" if "drugbank" in response and isinstance(response["drugbank"], dict): db = response["drugbank"] response["drugbank_id"] = db.get("id") response["name"] = response.get("name") or db.get("name") response["tradename"] = db.get("products", {}).get("name", []) if isinstance(response["tradename"], str): response["tradename"] = [response["tradename"]] response["indication"] = db.get("indication") response["mechanism_of_action"] = db.get("mechanism_of_action") response["description"] = db.get("description") def _extract_chebi_fields(self, response: dict[str, Any]) -> None: """Extract ChEBI fields from response.""" if "chebi" in response and isinstance(response["chebi"], dict): response["chebi_id"] = response["chebi"].get("id") if not response.get("name"): response["name"] = response["chebi"].get("name") def _extract_chembl_fields(self, response: dict[str, 
Any]) -> None: """Extract ChEMBL fields from response.""" if "chembl" in response and isinstance(response["chembl"], dict): response["chembl_id"] = response["chembl"].get( "molecule_chembl_id" ) if not response.get("name"): response["name"] = response["chembl"].get("pref_name") def _extract_pubchem_fields(self, response: dict[str, Any]) -> None: """Extract PubChem fields from response.""" if "pubchem" in response and isinstance(response["pubchem"], dict): response["pubchem_cid"] = str(response["pubchem"].get("cid", "")) def _extract_unii_fields(self, response: dict[str, Any]) -> None: """Extract UNII fields from response.""" if "unii" in response and isinstance(response["unii"], dict): unii_data = response["unii"] # Set UNII code response["unii"] = unii_data.get("unii", "") # Use display name as drug name if not already set if not response.get("name") and unii_data.get("display_name"): response["name"] = unii_data["display_name"] # Use NCIT description if no description if not response.get("description") and unii_data.get( "ncit_description" ): response["description"] = unii_data["ncit_description"] async def get_variant_info( self, variant_id: str, fields: list[str] | None = None ) -> dict[str, Any] | None: """Get variant information from MyVariant.info. This is a wrapper around the existing MyVariant integration. Args: variant_id: Variant ID (rsID, HGVS) fields: Optional list of fields to return Returns: Variant data dictionary or None if not found """ params = {"fields": "all" if fields is None else ",".join(fields)} response, error = await http_client.request_api( url=f"{MYVARIANT_GET_URL}/{variant_id}", request=params, method="GET", domain="myvariant", ) if error or not response: return None return response ``` -------------------------------------------------------------------------------- /docs/user-guides/02-mcp-tools-reference.md: -------------------------------------------------------------------------------- ```markdown # MCP Tools Reference BioMCP provides 35 specialized tools for biomedical research through the Model Context Protocol (MCP). This reference covers all available tools, their parameters, and usage patterns. ## Related Guides - **Conceptual Overview**: [Sequential Thinking with the Think Tool](../concepts/03-sequential-thinking-with-the-think-tool.md) - **Practical Examples**: See the [How-to Guides](../how-to-guides/01-find-articles-and-cbioportal-data.md) for real-world usage patterns - **Integration Setup**: [Claude Desktop Integration](../getting-started/02-claude-desktop-integration.md) ## Tool Categories | Category | Count | Tools | | ------------------- | ----- | -------------------------------------------------------------- | | **Core Tools** | 3 | `search`, `fetch`, `think` | | **Article Tools** | 2 | `article_searcher`, `article_getter` | | **Trial Tools** | 6 | `trial_searcher`, `trial_getter`, + 4 detail getters | | **Variant Tools** | 3 | `variant_searcher`, `variant_getter`, `alphagenome_predictor` | | **BioThings Tools** | 3 | `gene_getter`, `disease_getter`, `drug_getter` | | **NCI Tools** | 6 | Organization, intervention, biomarker, and disease tools | | **OpenFDA Tools** | 12 | Adverse events, labels, devices, approvals, recalls, shortages | ## Core Unified Tools ### 1. 
search **Universal search across all biomedical domains with unified query language.** ```python search( query: str = None, # Unified query syntax domain: str = None, # Target domain genes: list[str] = None, # Gene symbols diseases: list[str] = None, # Disease/condition terms variants: list[str] = None, # Variant notations chemicals: list[str] = None, # Drug/chemical names keywords: list[str] = None, # Additional keywords conditions: list[str] = None, # Trial conditions interventions: list[str] = None,# Trial interventions lat: float = None, # Latitude for trials long: float = None, # Longitude for trials page: int = 1, # Page number page_size: int = 10, # Results per page api_key: str = None # For NCI domains ) -> dict ``` **Domains:** `article`, `trial`, `variant`, `gene`, `drug`, `disease`, `nci_organization`, `nci_intervention`, `nci_biomarker`, `nci_disease`, `fda_adverse`, `fda_label`, `fda_device`, `fda_approval`, `fda_recall`, `fda_shortage` **Query Language Examples:** - `"gene:BRAF AND disease:melanoma"` - `"drugs.tradename:gleevec"` - `"gene:TP53 AND (mutation OR variant)"` **Usage Examples:** ```python # Domain-specific search search(domain="article", genes=["BRAF"], diseases=["melanoma"]) # Unified query language search(query="gene:EGFR AND mutation:T790M") # Clinical trials by location search(domain="trial", conditions=["lung cancer"], lat=40.7128, long=-74.0060) # FDA adverse events search(domain="fda_adverse", chemicals=["aspirin"]) # FDA drug approvals search(domain="fda_approval", chemicals=["keytruda"]) ``` ### 2. fetch **Retrieve detailed information for any biomedical record.** ```python fetch( id: str, # Record identifier domain: str = None, # Domain (auto-detected if not provided) detail: str = None, # Specific section for trials api_key: str = None # For NCI records ) -> dict ``` **Supported IDs:** - Articles: PMID (e.g., "38768446"), DOI (e.g., "10.1101/2024.01.20") - Trials: NCT ID (e.g., "NCT03006926") - Variants: HGVS, rsID, genomic coordinates - Genes/Drugs/Diseases: Names or database IDs - FDA Records: Report IDs, Application Numbers (e.g., "BLA125514"), Recall Numbers, etc. **Detail Options for Trials:** `protocol`, `locations`, `outcomes`, `references`, `all` **Usage Examples:** ```python # Fetch article by PMID fetch(id="38768446", domain="article") # Fetch trial with specific details fetch(id="NCT03006926", domain="trial", detail="locations") # Auto-detect domain fetch(id="rs121913529") # Variant fetch(id="BRAF") # Gene # Fetch FDA records fetch(id="BLA125514", domain="fda_approval") # Drug approval fetch(id="D-0001-2023", domain="fda_recall") # Drug recall ``` ### 3. think **Sequential thinking tool for structured problem-solving.** ```python think( thought: str, # Current reasoning step thoughtNumber: int, # Sequential number (1, 2, 3...) totalThoughts: int = None, # Estimated total thoughts nextThoughtNeeded: bool = True # Continue thinking? ) -> str ``` **CRITICAL:** Always use `think` BEFORE any other BioMCP operation! **Usage Pattern:** ```python # Step 1: Problem decomposition think( thought="Breaking down query: need to find BRAF inhibitor trials...", thoughtNumber=1, nextThoughtNeeded=True ) # Step 2: Search strategy think( thought="Will search trials for BRAF V600E melanoma, then articles...", thoughtNumber=2, nextThoughtNeeded=True ) # Final step: Synthesis think( thought="Ready to synthesize findings from 5 trials and 12 articles...", thoughtNumber=3, nextThoughtNeeded=False # Analysis complete ) ``` ## Article Tools ### 4. 
article_searcher **Search PubMed/PubTator3 for biomedical literature.** ```python article_searcher( chemicals: list[str] = None, diseases: list[str] = None, genes: list[str] = None, keywords: list[str] = None, # Supports OR with "|" variants: list[str] = None, include_preprints: bool = True, include_cbioportal: bool = True, page: int = 1, page_size: int = 10 ) -> str ``` **Features:** - Automatic cBioPortal integration for gene searches - Preprint inclusion from bioRxiv/medRxiv - OR logic in keywords: `"V600E|p.V600E|c.1799T>A"` **Example:** ```python # Search with multiple filters article_searcher( genes=["BRAF"], diseases=["melanoma"], keywords=["resistance|resistant"], include_cbioportal=True ) ``` ### 5. article_getter **Fetch detailed article information.** ```python article_getter( pmid: str # PubMed ID, PMC ID, or DOI ) -> str ``` **Supports:** - PubMed IDs: "38768446" - PMC IDs: "PMC7498215" - DOIs: "10.1101/2024.01.20.23288905" ## Trial Tools ### 6. trial_searcher **Search ClinicalTrials.gov with comprehensive filters.** ```python trial_searcher( conditions: list[str] = None, interventions: list[str] = None, other_terms: list[str] = None, recruiting_status: str = "ANY", # "OPEN", "CLOSED", "ANY" phase: str = None, # "PHASE1", "PHASE2", etc. lat: float = None, # Location-based search long: float = None, distance: int = None, # Miles from coordinates age_group: str = None, # "CHILD", "ADULT", "OLDER_ADULT" sex: str = None, # "MALE", "FEMALE", "ALL" study_type: str = None, # "INTERVENTIONAL", "OBSERVATIONAL" funder_type: str = None, # "NIH", "INDUSTRY", etc. page: int = 1, page_size: int = 10 ) -> str ``` **Location Search Example:** ```python # Trials near Boston trial_searcher( conditions=["breast cancer"], lat=42.3601, long=-71.0589, distance=50, recruiting_status="OPEN" ) ``` ### 7-11. Trial Detail Getters ```python # Get complete trial information trial_getter(nct_id: str) -> str # Get specific sections trial_protocol_getter(nct_id: str) -> str # Core protocol info trial_locations_getter(nct_id: str) -> str # Sites and contacts trial_outcomes_getter(nct_id: str) -> str # Outcome measures trial_references_getter(nct_id: str) -> str # Publications ``` ## Variant Tools ### 12. variant_searcher **Search MyVariant.info for genetic variants.** ```python variant_searcher( gene: str = None, hgvs: str = None, hgvsp: str = None, # Protein HGVS hgvsc: str = None, # Coding DNA HGVS rsid: str = None, region: str = None, # "chr7:140753336-140753337" significance: str = None, # Clinical significance frequency_min: float = None, frequency_max: float = None, cadd_score_min: float = None, sift_prediction: str = None, polyphen_prediction: str = None, sources: list[str] = None, include_cbioportal: bool = True, page: int = 1, page_size: int = 10 ) -> str ``` **Significance Options:** `pathogenic`, `likely_pathogenic`, `uncertain_significance`, `likely_benign`, `benign` **Example:** ```python # Find rare pathogenic BRCA1 variants variant_searcher( gene="BRCA1", significance="pathogenic", frequency_max=0.001, cadd_score_min=20 ) ``` ### 13. variant_getter **Fetch comprehensive variant details.** ```python variant_getter( variant_id: str, # HGVS, rsID, or MyVariant ID include_external: bool = True # Include TCGA, 1000 Genomes ) -> str ``` ### 14. 
alphagenome_predictor **Predict variant effects using Google DeepMind's AlphaGenome.** ```python alphagenome_predictor( chromosome: str, # e.g., "chr7" position: int, # 1-based position reference: str, # Reference allele alternate: str, # Alternate allele interval_size: int = 131072, # Analysis window tissue_types: list[str] = None, # UBERON terms significance_threshold: float = 0.5, api_key: str = None # AlphaGenome API key ) -> str ``` **Requires:** AlphaGenome API key (environment variable or per-request) **Tissue Examples:** - `UBERON:0002367` - prostate gland - `UBERON:0001155` - colon - `UBERON:0002048` - lung **Example:** ```python # Predict BRAF V600E effects alphagenome_predictor( chromosome="chr7", position=140753336, reference="A", alternate="T", tissue_types=["UBERON:0002367"], # prostate api_key="your-key" ) ``` ## BioThings Tools ### 15. gene_getter **Get gene information from MyGene.info.** ```python gene_getter( gene_id_or_symbol: str # Gene symbol or Entrez ID ) -> str ``` **Returns:** Official name, aliases, summary, genomic location, database links ### 16. disease_getter **Get disease information from MyDisease.info.** ```python disease_getter( disease_id_or_name: str # Disease name or ontology ID ) -> str ``` **Returns:** Definition, synonyms, MONDO/DOID IDs, associated phenotypes ### 17. drug_getter **Get drug/chemical information from MyChem.info.** ```python drug_getter( drug_id_or_name: str # Drug name or database ID ) -> str ``` **Returns:** Chemical structure, mechanism, indications, trade names, identifiers ## NCI-Specific Tools All NCI tools require an API key from [api.cancer.gov](https://api.cancer.gov). ### 18-19. Organization Tools ```python # Search organizations nci_organization_searcher( name: str = None, organization_type: str = None, city: str = None, # Must use with state state: str = None, # Must use with city api_key: str = None ) -> str # Get organization details nci_organization_getter( organization_id: str, api_key: str = None ) -> str ``` ### 20-21. Intervention Tools ```python # Search interventions nci_intervention_searcher( name: str = None, intervention_type: str = None, # "Drug", "Device", etc. synonyms: bool = True, api_key: str = None ) -> str # Get intervention details nci_intervention_getter( intervention_id: str, api_key: str = None ) -> str ``` ### 22. Biomarker Search ```python nci_biomarker_searcher( name: str = None, biomarker_type: str = None, api_key: str = None ) -> str ``` ### 23. Disease Search (NCI) ```python nci_disease_searcher( name: str = None, include_synonyms: bool = True, category: str = None, api_key: str = None ) -> str ``` ## OpenFDA Tools All OpenFDA tools support optional API keys for higher rate limits (240/min vs 40/min). Get a free key at [open.fda.gov/apis/authentication](https://open.fda.gov/apis/authentication/). ### 24. openfda_adverse_searcher **Search FDA Adverse Event Reporting System (FAERS).** ```python openfda_adverse_searcher( drug: str = None, reaction: str = None, serious: bool = None, # Filter serious events only limit: int = 25, skip: int = 0, api_key: str = None # Optional OpenFDA API key ) -> str ``` **Example:** ```python # Find serious bleeding events for warfarin openfda_adverse_searcher( drug="warfarin", reaction="bleeding", serious=True, api_key="your-key" # Optional ) ``` ### 25. openfda_adverse_getter **Get detailed adverse event report.** ```python openfda_adverse_getter( report_id: str, # Safety report ID api_key: str = None ) -> str ``` ### 26. 
openfda_label_searcher **Search FDA drug product labels.** ```python openfda_label_searcher( name: str = None, indication: str = None, # Search by indication boxed_warning: bool = False, # Filter for boxed warnings section: str = None, # Specific label section limit: int = 25, skip: int = 0, api_key: str = None ) -> str ``` ### 27. openfda_label_getter **Get complete drug label information.** ```python openfda_label_getter( set_id: str, # Label set ID sections: list[str] = None, # Specific sections to retrieve api_key: str = None ) -> str ``` **Label Sections:** `indications_and_usage`, `contraindications`, `warnings_and_precautions`, `dosage_and_administration`, `adverse_reactions`, `drug_interactions`, `pregnancy`, `pediatric_use`, `geriatric_use` ### 28. openfda_device_searcher **Search FDA device adverse event reports (MAUDE).** ```python openfda_device_searcher( device: str = None, manufacturer: str = None, problem: str = None, product_code: str = None, # FDA product code genomics_only: bool = True, # Filter genomic/diagnostic devices limit: int = 25, skip: int = 0, api_key: str = None ) -> str ``` **Note:** FDA uses abbreviated device names (e.g., "F1CDX" for "FoundationOne CDx"). ### 29. openfda_device_getter **Get detailed device event report.** ```python openfda_device_getter( mdr_report_key: str, # MDR report key api_key: str = None ) -> str ``` ### 30. openfda_approval_searcher **Search FDA drug approval records (Drugs@FDA).** ```python openfda_approval_searcher( drug: str = None, application_number: str = None, # NDA/BLA number approval_year: str = None, # YYYY format limit: int = 25, skip: int = 0, api_key: str = None ) -> str ``` ### 31. openfda_approval_getter **Get drug approval details.** ```python openfda_approval_getter( application_number: str, # NDA/BLA number api_key: str = None ) -> str ``` ### 32. openfda_recall_searcher **Search FDA drug recall records.** ```python openfda_recall_searcher( drug: str = None, recall_class: str = None, # "1", "2", or "3" status: str = None, # "ongoing" or "completed" reason: str = None, since_date: str = None, # YYYYMMDD format limit: int = 25, skip: int = 0, api_key: str = None ) -> str ``` **Recall Classes:** - Class 1: Dangerous or defective products that could cause serious health problems or death - Class 2: Products that might cause temporary health problems or pose slight threat - Class 3: Products unlikely to cause adverse health consequences ### 33. openfda_recall_getter **Get drug recall details.** ```python openfda_recall_getter( recall_number: str, # FDA recall number api_key: str = None ) -> str ``` ### 34. openfda_shortage_searcher **Search FDA drug shortage database.** ```python openfda_shortage_searcher( drug: str = None, status: str = None, # "current" or "resolved" therapeutic_category: str = None, limit: int = 25, skip: int = 0, api_key: str = None ) -> str ``` ### 35. openfda_shortage_getter **Get drug shortage details.** ```python openfda_shortage_getter( drug_name: str, api_key: str = None ) -> str ``` ## Best Practices ### 1. Always Think First ```python # ✅ CORRECT - Think before searching think(thought="Planning BRAF melanoma research...", thoughtNumber=1) results = article_searcher(genes=["BRAF"], diseases=["melanoma"]) # ❌ INCORRECT - Skipping think tool results = article_searcher(genes=["BRAF"]) # Poor results! ``` ### 2. 
Use Unified Tools for Flexibility ```python # Unified search supports complex queries results = search(query="gene:EGFR AND (mutation:T790M OR mutation:C797S)") # Unified fetch auto-detects domain details = fetch(id="NCT03006926") # Knows it's a trial ``` ### 3. Leverage Domain-Specific Features ```python # Article search with cBioPortal articles = article_searcher( genes=["KRAS"], include_cbioportal=True # Adds cancer genomics context ) # Variant search with multiple filters variants = variant_searcher( gene="TP53", significance="pathogenic", frequency_max=0.01, cadd_score_min=25 ) ``` ### 4. Handle API Keys Properly ```python # For personal use - environment variable # export NCI_API_KEY="your-key" nci_results = search(domain="nci_organization", name="Mayo Clinic") # For shared environments - per-request nci_results = search( domain="nci_organization", name="Mayo Clinic", api_key="user-provided-key" ) ``` ### 5. Use Appropriate Page Sizes ```python # Large result sets - increase page_size results = article_searcher( genes=["TP53"], page_size=50 # Get more results at once ) # Iterative exploration - use pagination page1 = trial_searcher(conditions=["cancer"], page=1, page_size=10) page2 = trial_searcher(conditions=["cancer"], page=2, page_size=10) ``` ## Error Handling All tools include comprehensive error handling: - **Invalid parameters**: Clear error messages with valid options - **API failures**: Graceful degradation with informative messages - **Rate limits**: Automatic retry with exponential backoff - **Missing API keys**: Helpful instructions for obtaining keys ## Tool Selection Guide | If you need to... | Use this tool | | ------------------------------ | ------------------------------------------------- | | Search across multiple domains | `search` with query language | | Get any record by ID | `fetch` with auto-detection | | Plan your research approach | `think` (always first!) | | Find recent papers | `article_searcher` | | Locate clinical trials | `trial_searcher` | | Analyze genetic variants | `variant_searcher` + `variant_getter` | | Predict variant effects | `alphagenome_predictor` | | Get gene/drug/disease info | `gene_getter`, `drug_getter`, `disease_getter` | | Access NCI databases | `nci_*` tools with API key | | Check drug adverse events | `openfda_adverse_searcher` | | Review FDA drug labels | `openfda_label_searcher` + `openfda_label_getter` | | Investigate device issues | `openfda_device_searcher` | | Find drug approvals | `openfda_approval_searcher` | | Check drug recalls | `openfda_recall_searcher` | | Monitor drug shortages | `openfda_shortage_searcher` | ## Next Steps - Review [Sequential Thinking](../concepts/03-sequential-thinking-with-the-think-tool.md) methodology - Explore [How-to Guides](../how-to-guides/01-find-articles-and-cbioportal-data.md) for complex workflows - Set up [API Keys](../getting-started/03-authentication-and-api-keys.md) for enhanced features ``` -------------------------------------------------------------------------------- /src/biomcp/domain_handlers.py: -------------------------------------------------------------------------------- ```python """Domain-specific result handlers for BioMCP. This module contains formatting functions for converting raw API responses from different biomedical data sources into a standardized format. 
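Illustrative usage (an added documentation sketch, not part of the original
module docstring):

    from biomcp.domain_handlers import get_domain_handler

    handler = get_domain_handler("article")
    row = handler.format_result(
        {"pmid": "12345", "title": "Example title", "abstract": "Example abstract."}
    )
    # row now holds the standardized keys: id, title, snippet, url, metadata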
""" import logging from typing import Any from biomcp.constants import ( DEFAULT_SIGNIFICANCE, DEFAULT_TITLE, METADATA_AUTHORS, METADATA_COMPLETION_DATE, METADATA_CONSEQUENCE, METADATA_GENE, METADATA_JOURNAL, METADATA_PHASE, METADATA_RSID, METADATA_SIGNIFICANCE, METADATA_SOURCE, METADATA_START_DATE, METADATA_STATUS, METADATA_YEAR, RESULT_ID, RESULT_METADATA, RESULT_SNIPPET, RESULT_TITLE, RESULT_URL, SNIPPET_LENGTH, ) logger = logging.getLogger(__name__) class ArticleHandler: """Handles formatting for article/publication results.""" @staticmethod def format_result(result: dict[str, Any]) -> dict[str, Any]: """Format a single article result. Args: result: Raw article data from PubTator3 or preprint APIs Returns: Standardized article result with id, title, snippet, url, and metadata """ if "pmid" in result: # PubMed article # Clean up title - remove extra spaces title = result.get("title", "").strip() title = " ".join(title.split()) # Normalize whitespace # Use default if empty if not title: title = DEFAULT_TITLE return { RESULT_ID: result["pmid"], RESULT_TITLE: title, RESULT_SNIPPET: result.get("abstract", "")[:SNIPPET_LENGTH] + "..." if result.get("abstract") else "", RESULT_URL: f"https://pubmed.ncbi.nlm.nih.gov/{result['pmid']}/", RESULT_METADATA: { METADATA_YEAR: result.get("pub_year") or ( result.get("date", "")[:4] if result.get("date") else None ), METADATA_JOURNAL: result.get("journal", ""), METADATA_AUTHORS: result.get("authors", [])[:3], }, } else: # Preprint result return { RESULT_ID: result.get("doi", result.get("id", "")), RESULT_TITLE: result.get("title", ""), RESULT_SNIPPET: result.get("abstract", "")[:SNIPPET_LENGTH] + "..." if result.get("abstract") else "", RESULT_URL: result.get("url", ""), RESULT_METADATA: { METADATA_YEAR: result.get("pub_year"), METADATA_SOURCE: result.get("source", ""), METADATA_AUTHORS: result.get("authors", [])[:3], }, } class TrialHandler: """Handles formatting for clinical trial results.""" @staticmethod def format_result(result: dict[str, Any]) -> dict[str, Any]: """Format a single trial result. Handles both ClinicalTrials.gov API v2 nested structure and legacy formats. 
Args: result: Raw trial data from ClinicalTrials.gov API Returns: Standardized trial result with id, title, snippet, url, and metadata """ # Handle ClinicalTrials.gov API v2 nested structure if "protocolSection" in result: # API v2 format - extract from nested modules protocol = result.get("protocolSection", {}) identification = protocol.get("identificationModule", {}) status = protocol.get("statusModule", {}) description = protocol.get("descriptionModule", {}) nct_id = identification.get("nctId", "") brief_title = identification.get("briefTitle", "") official_title = identification.get("officialTitle", "") brief_summary = description.get("briefSummary", "") overall_status = status.get("overallStatus", "") start_date = status.get("startDateStruct", {}).get("date", "") completion_date = status.get( "primaryCompletionDateStruct", {} ).get("date", "") # Extract phase from designModule design = protocol.get("designModule", {}) phases = design.get("phases", []) phase = phases[0] if phases else "" elif "NCT Number" in result: # Legacy flat format from search results nct_id = result.get("NCT Number", "") brief_title = result.get("Study Title", "") official_title = "" # Not available in this format brief_summary = result.get("Brief Summary", "") overall_status = result.get("Study Status", "") phase = result.get("Phases", "") start_date = result.get("Start Date", "") completion_date = result.get("Completion Date", "") else: # Original legacy format or simplified structure nct_id = result.get("nct_id", "") brief_title = result.get("brief_title", "") official_title = result.get("official_title", "") brief_summary = result.get("brief_summary", "") overall_status = result.get("overall_status", "") phase = result.get("phase", "") start_date = result.get("start_date", "") completion_date = result.get("primary_completion_date", "") return { RESULT_ID: nct_id, RESULT_TITLE: brief_title or official_title or DEFAULT_TITLE, RESULT_SNIPPET: brief_summary[:SNIPPET_LENGTH] + "..." if brief_summary else "", RESULT_URL: f"https://clinicaltrials.gov/study/{nct_id}", RESULT_METADATA: { METADATA_STATUS: overall_status, METADATA_PHASE: phase, METADATA_START_DATE: start_date, METADATA_COMPLETION_DATE: completion_date, }, } class VariantHandler: """Handles formatting for genetic variant results.""" @staticmethod def format_result(result: dict[str, Any]) -> dict[str, Any]: """Format a single variant result. 
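        Illustrative MyVariant.info input shape (abridged, added here for
        documentation only; the BRAF V600E values are real but incomplete):

            {
                "_id": "chr7:g.140453136A>T",
                "dbnsfp": {"genename": "BRAF", "hgvsp": ["p.V600E"]},
                "dbsnp": {"rsid": "rs113488022"},
                "clinvar": {"rcv": {"clinical_significance": "Pathogenic"}},
            }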
Args: result: Raw variant data from MyVariant.info API Returns: Standardized variant result with id, title, snippet, url, and metadata """ # Extract gene symbol - MyVariant.info stores this in multiple locations gene = ( result.get("dbnsfp", {}).get("genename", "") or result.get("dbsnp", {}).get("gene", {}).get("symbol", "") or "" ) # Handle case where gene is a list if isinstance(gene, list): gene = gene[0] if gene else "" # Extract rsid rsid = result.get("dbsnp", {}).get("rsid", "") or "" # Extract clinical significance clinvar = result.get("clinvar", {}) significance = "" if isinstance(clinvar.get("rcv"), dict): significance = clinvar["rcv"].get("clinical_significance", "") elif isinstance(clinvar.get("rcv"), list) and clinvar["rcv"]: significance = clinvar["rcv"][0].get("clinical_significance", "") # Build a meaningful title hgvs = "" if "dbnsfp" in result and "hgvsp" in result["dbnsfp"]: hgvs = result["dbnsfp"]["hgvsp"] if isinstance(hgvs, list): hgvs = hgvs[0] if hgvs else "" title = f"{gene} {hgvs}".strip() or result.get("_id", DEFAULT_TITLE) return { RESULT_ID: result.get("_id", ""), RESULT_TITLE: title, RESULT_SNIPPET: f"Clinical significance: {significance or DEFAULT_SIGNIFICANCE}", RESULT_URL: f"https://www.ncbi.nlm.nih.gov/snp/{rsid}" if rsid else "", RESULT_METADATA: { METADATA_GENE: gene, METADATA_RSID: rsid, METADATA_SIGNIFICANCE: significance, METADATA_CONSEQUENCE: result.get("cadd", {}).get( "consequence", "" ), }, } class GeneHandler: """Handles formatting for gene information results from MyGene.info.""" @staticmethod def format_result(result: dict[str, Any]) -> dict[str, Any]: """Format a single gene result. Args: result: Raw gene data from MyGene.info API Returns: Standardized gene result with id, title, snippet, url, and metadata """ # Extract gene information gene_id = result.get("_id", result.get("entrezgene", "")) symbol = result.get("symbol", "") name = result.get("name", "") summary = result.get("summary", "") # Build title title = ( f"{symbol}: {name}" if symbol and name else symbol or name or DEFAULT_TITLE ) # Create snippet from summary snippet = ( summary[:SNIPPET_LENGTH] + "..." if summary and len(summary) > SNIPPET_LENGTH else summary ) return { RESULT_ID: str(gene_id), RESULT_TITLE: title, RESULT_SNIPPET: snippet or "No summary available", RESULT_URL: f"https://www.genenames.org/data/gene-symbol-report/#!/symbol/{symbol}" if symbol else "", RESULT_METADATA: { "entrezgene": result.get("entrezgene"), "symbol": symbol, "name": name, "type_of_gene": result.get("type_of_gene", ""), "ensembl": result.get("ensembl", {}).get("gene") if isinstance(result.get("ensembl"), dict) else None, "refseq": result.get("refseq", {}), }, } class DrugHandler: """Handles formatting for drug/chemical information results from MyChem.info.""" @staticmethod def format_result(result: dict[str, Any]) -> dict[str, Any]: """Format a single drug result. Args: result: Raw drug data from MyChem.info API Returns: Standardized drug result with id, title, snippet, url, and metadata """ # Extract drug information drug_id = result.get("_id", "") name = result.get("name", "") drugbank_id = result.get("drugbank_id", "") description = result.get("description", "") indication = result.get("indication", "") # Build title title = name or drug_id or DEFAULT_TITLE # Create snippet from description or indication snippet_text = indication or description snippet = ( snippet_text[:SNIPPET_LENGTH] + "..." 
if snippet_text and len(snippet_text) > SNIPPET_LENGTH else snippet_text ) # Determine URL based on available IDs url = "" if drugbank_id: url = f"https://www.drugbank.ca/drugs/{drugbank_id}" elif result.get("pubchem_cid"): url = f"https://pubchem.ncbi.nlm.nih.gov/compound/{result['pubchem_cid']}" return { RESULT_ID: drug_id, RESULT_TITLE: title, RESULT_SNIPPET: snippet or "No description available", RESULT_URL: url, RESULT_METADATA: { "drugbank_id": drugbank_id, "chembl_id": result.get("chembl_id", ""), "pubchem_cid": result.get("pubchem_cid", ""), "chebi_id": result.get("chebi_id", ""), "formula": result.get("formula", ""), "tradename": result.get("tradename", []), }, } class DiseaseHandler: """Handles formatting for disease information results from MyDisease.info.""" @staticmethod def format_result(result: dict[str, Any]) -> dict[str, Any]: """Format a single disease result. Args: result: Raw disease data from MyDisease.info API Returns: Standardized disease result with id, title, snippet, url, and metadata """ # Extract disease information disease_id = result.get("_id", "") name = result.get("name", "") definition = result.get("definition", "") mondo_info = result.get("mondo", {}) # Build title title = name or disease_id or DEFAULT_TITLE # Create snippet from definition snippet = ( definition[:SNIPPET_LENGTH] + "..." if definition and len(definition) > SNIPPET_LENGTH else definition ) # Extract MONDO ID for URL mondo_id = mondo_info.get("id") if isinstance(mondo_info, dict) else "" url = ( f"https://monarchinitiative.org/disease/{mondo_id}" if mondo_id else "" ) return { RESULT_ID: disease_id, RESULT_TITLE: title, RESULT_SNIPPET: snippet or "No definition available", RESULT_URL: url, RESULT_METADATA: { "mondo_id": mondo_id, "definition": definition, "synonyms": result.get("synonyms", []), "xrefs": result.get("xrefs", {}), "phenotypes": len(result.get("phenotypes", [])), }, } class NCIOrganizationHandler: """Handles formatting for NCI organization results.""" @staticmethod def format_result(result: dict[str, Any]) -> dict[str, Any]: """Format a single NCI organization result. Args: result: Raw organization data from NCI CTS API Returns: Standardized organization result with id, title, snippet, url, and metadata """ org_id = result.get("id", result.get("org_id", "")) name = result.get("name", "Unknown Organization") org_type = result.get("type", result.get("category", "")) city = result.get("city", "") state = result.get("state", "") # Build location string location_parts = [p for p in [city, state] if p] location = ", ".join(location_parts) if location_parts else "" # Create snippet snippet_parts = [] if org_type: snippet_parts.append(f"Type: {org_type}") if location: snippet_parts.append(f"Location: {location}") snippet = " | ".join(snippet_parts) or "No details available" return { RESULT_ID: org_id, RESULT_TITLE: name, RESULT_SNIPPET: snippet, RESULT_URL: "", # NCI doesn't provide direct URLs to organizations RESULT_METADATA: { "type": org_type, "city": city, "state": state, "country": result.get("country", ""), }, } class NCIInterventionHandler: """Handles formatting for NCI intervention results.""" @staticmethod def format_result(result: dict[str, Any]) -> dict[str, Any]: """Format a single NCI intervention result. 
Args: result: Raw intervention data from NCI CTS API Returns: Standardized intervention result with id, title, snippet, url, and metadata """ int_id = result.get("id", result.get("intervention_id", "")) name = result.get("name", "Unknown Intervention") int_type = result.get("type", result.get("category", "")) synonyms = result.get("synonyms", []) # Create snippet snippet_parts = [] if int_type: snippet_parts.append(f"Type: {int_type}") if synonyms: if isinstance(synonyms, list) and synonyms: snippet_parts.append( f"Also known as: {', '.join(synonyms[:3])}" ) elif isinstance(synonyms, str): snippet_parts.append(f"Also known as: {synonyms}") snippet = " | ".join(snippet_parts) or "No details available" return { RESULT_ID: int_id, RESULT_TITLE: name, RESULT_SNIPPET: snippet, RESULT_URL: "", # NCI doesn't provide direct URLs to interventions RESULT_METADATA: { "type": int_type, "synonyms": synonyms, "description": result.get("description", ""), }, } class NCIBiomarkerHandler: """Handles formatting for NCI biomarker results.""" @staticmethod def format_result(result: dict[str, Any]) -> dict[str, Any]: """Format a single NCI biomarker result. Args: result: Raw biomarker data from NCI CTS API Returns: Standardized biomarker result with id, title, snippet, url, and metadata """ bio_id = result.get("id", result.get("biomarker_id", "")) name = result.get("name", "Unknown Biomarker") gene = result.get("gene", result.get("gene_symbol", "")) bio_type = result.get("type", result.get("category", "")) assay_type = result.get("assay_type", "") # Build title title = name if gene and gene not in name: title = f"{gene} - {name}" # Create snippet snippet_parts = [] if bio_type: snippet_parts.append(f"Type: {bio_type}") if assay_type: snippet_parts.append(f"Assay: {assay_type}") snippet = ( " | ".join(snippet_parts) or "Biomarker for trial eligibility" ) return { RESULT_ID: bio_id, RESULT_TITLE: title, RESULT_SNIPPET: snippet, RESULT_URL: "", # NCI doesn't provide direct URLs to biomarkers RESULT_METADATA: { "gene": gene, "type": bio_type, "assay_type": assay_type, "trial_count": result.get("trial_count", 0), }, } class NCIDiseaseHandler: """Handles formatting for NCI disease vocabulary results.""" @staticmethod def format_result(result: dict[str, Any]) -> dict[str, Any]: """Format a single NCI disease result. 
Args: result: Raw disease data from NCI CTS API Returns: Standardized disease result with id, title, snippet, url, and metadata """ disease_id = result.get("id", result.get("disease_id", "")) name = result.get( "name", result.get("preferred_name", "Unknown Disease") ) category = result.get("category", result.get("type", "")) synonyms = result.get("synonyms", []) # Create snippet snippet_parts = [] if category: snippet_parts.append(f"Category: {category}") if synonyms: if isinstance(synonyms, list) and synonyms: snippet_parts.append( f"Also known as: {', '.join(synonyms[:3])}" ) if len(synonyms) > 3: snippet_parts.append(f"and {len(synonyms) - 3} more") elif isinstance(synonyms, str): snippet_parts.append(f"Also known as: {synonyms}") snippet = " | ".join(snippet_parts) or "NCI cancer vocabulary term" return { RESULT_ID: disease_id, RESULT_TITLE: name, RESULT_SNIPPET: snippet, RESULT_URL: "", # NCI doesn't provide direct URLs to disease terms RESULT_METADATA: { "category": category, "synonyms": synonyms, "codes": result.get("codes", {}), }, } def get_domain_handler( domain: str, ) -> ( type[ArticleHandler] | type[TrialHandler] | type[VariantHandler] | type[GeneHandler] | type[DrugHandler] | type[DiseaseHandler] | type[NCIOrganizationHandler] | type[NCIInterventionHandler] | type[NCIBiomarkerHandler] | type[NCIDiseaseHandler] ): """Get the appropriate handler class for a domain. Args: domain: The domain name ('article', 'trial', 'variant', 'gene', 'drug', 'disease', 'nci_organization', 'nci_intervention', 'nci_biomarker', 'nci_disease') Returns: The handler class for the domain Raises: ValueError: If domain is not recognized """ handlers: dict[ str, type[ArticleHandler] | type[TrialHandler] | type[VariantHandler] | type[GeneHandler] | type[DrugHandler] | type[DiseaseHandler] | type[NCIOrganizationHandler] | type[NCIInterventionHandler] | type[NCIBiomarkerHandler] | type[NCIDiseaseHandler], ] = { "article": ArticleHandler, "trial": TrialHandler, "variant": VariantHandler, "gene": GeneHandler, "drug": DrugHandler, "disease": DiseaseHandler, "nci_organization": NCIOrganizationHandler, "nci_intervention": NCIInterventionHandler, "nci_biomarker": NCIBiomarkerHandler, "nci_disease": NCIDiseaseHandler, } handler = handlers.get(domain) if handler is None: raise ValueError(f"Unknown domain: {domain}") return handler ``` -------------------------------------------------------------------------------- /src/biomcp/variants/external.py: -------------------------------------------------------------------------------- ```python """External data sources for enhanced variant annotations.""" import asyncio import json import logging import re from typing import Any from urllib.parse import quote from pydantic import BaseModel, Field from .. 
import http_client # Import CBioPortalVariantData from the new module from .cbio_external_client import CBioPortalVariantData logger = logging.getLogger(__name__) # TCGA/GDC API endpoints GDC_BASE = "https://api.gdc.cancer.gov" GDC_SSMS_ENDPOINT = f"{GDC_BASE}/ssms" # Simple Somatic Mutations # 1000 Genomes API endpoints ENSEMBL_REST_BASE = "https://rest.ensembl.org" ENSEMBL_VARIATION_ENDPOINT = f"{ENSEMBL_REST_BASE}/variation/human" # Import constants class TCGAVariantData(BaseModel): """TCGA/GDC variant annotation data.""" cosmic_id: str | None = None tumor_types: list[str] = Field(default_factory=list) mutation_frequency: float | None = None mutation_count: int | None = None affected_cases: int | None = None consequence_type: str | None = None clinical_significance: str | None = None class ThousandGenomesData(BaseModel): """1000 Genomes variant annotation data.""" global_maf: float | None = Field( None, description="Global minor allele frequency" ) afr_maf: float | None = Field(None, description="African population MAF") amr_maf: float | None = Field(None, description="American population MAF") eas_maf: float | None = Field( None, description="East Asian population MAF" ) eur_maf: float | None = Field(None, description="European population MAF") sas_maf: float | None = Field( None, description="South Asian population MAF" ) ancestral_allele: str | None = None most_severe_consequence: str | None = None # CBioPortalVariantData is now imported from cbio_external_client.py class EnhancedVariantAnnotation(BaseModel): """Enhanced variant annotation combining multiple sources.""" variant_id: str tcga: TCGAVariantData | None = None thousand_genomes: ThousandGenomesData | None = None cbioportal: CBioPortalVariantData | None = None error_sources: list[str] = Field(default_factory=list) class TCGAClient: """Client for TCGA/GDC API.""" async def get_variant_data( self, variant_id: str ) -> TCGAVariantData | None: """Fetch variant data from TCGA/GDC. 
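        Illustrative usage (an added documentation sketch; assumes a live
        connection to api.gdc.cancer.gov):

            client = TCGAClient()
            data = await client.get_variant_data("BRAF V600E")  # routed to gene_aa_change
            data = await client.get_variant_data("chr7:g.140753336A>T")  # routed to genomic_dna_change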
Args: variant_id: Can be gene AA change (e.g., "BRAF V600E") or genomic coordinates """ try: # Determine the search field based on variant_id format # If it looks like "GENE AA_CHANGE" format, use gene_aa_change field if " " in variant_id and not variant_id.startswith("chr"): search_field = "gene_aa_change" search_value = variant_id else: # Otherwise try genomic_dna_change search_field = "genomic_dna_change" search_value = variant_id # First, search for the variant params = { "filters": json.dumps({ "op": "in", "content": { "field": search_field, "value": [search_value], }, }), "fields": "cosmic_id,genomic_dna_change,gene_aa_change,ssm_id", "format": "json", "size": "5", # Get a few in case of multiple matches } response, error = await http_client.request_api( url=GDC_SSMS_ENDPOINT, method="GET", request=params, domain="gdc", ) if error or not response: return None data = response.get("data", {}) hits = data.get("hits", []) if not hits: return None # Get the first hit hit = hits[0] ssm_id = hit.get("ssm_id") cosmic_id = hit.get("cosmic_id") # For gene_aa_change searches, verify we have the right variant if search_field == "gene_aa_change": gene_aa_changes = hit.get("gene_aa_change", []) if ( isinstance(gene_aa_changes, list) and search_value not in gene_aa_changes ): # This SSM has multiple AA changes, but not the one we're looking for return None if not ssm_id: return None # Now query SSM occurrences to get project information occ_params = { "filters": json.dumps({ "op": "in", "content": {"field": "ssm.ssm_id", "value": [ssm_id]}, }), "fields": "case.project.project_id", "format": "json", "size": "2000", # Get more occurrences } occ_response, occ_error = await http_client.request_api( url="https://api.gdc.cancer.gov/ssm_occurrences", method="GET", request=occ_params, domain="gdc", ) if occ_error or not occ_response: # Return basic info without occurrence data cosmic_id_str = ( cosmic_id[0] if isinstance(cosmic_id, list) and cosmic_id else cosmic_id ) return TCGAVariantData( cosmic_id=cosmic_id_str, tumor_types=[], affected_cases=0, consequence_type="missense_variant", # Most COSMIC variants are missense ) # Process occurrence data occ_data = occ_response.get("data", {}) occ_hits = occ_data.get("hits", []) # Count by project project_counts = {} for occ in occ_hits: case = occ.get("case", {}) project = case.get("project", {}) if project_id := project.get("project_id"): project_counts[project_id] = ( project_counts.get(project_id, 0) + 1 ) # Extract tumor types tumor_types = [] total_cases = 0 for project_id, count in project_counts.items(): # Extract tumor type from project ID # TCGA format: "TCGA-LUAD" -> "LUAD" # Other formats: "MMRF-COMMPASS" -> "MMRF-COMMPASS", "CPTAC-3" -> "CPTAC-3" if project_id.startswith("TCGA-") and "-" in project_id: tumor_type = project_id.split("-")[-1] tumor_types.append(tumor_type) else: # For non-TCGA projects, use the full project ID tumor_types.append(project_id) total_cases += count # Handle cosmic_id as list cosmic_id_str = ( cosmic_id[0] if isinstance(cosmic_id, list) and cosmic_id else cosmic_id ) return TCGAVariantData( cosmic_id=cosmic_id_str, tumor_types=tumor_types, affected_cases=total_cases, consequence_type="missense_variant", # Default for now ) except (KeyError, ValueError, TypeError, IndexError) as e: # Log the error for debugging while gracefully handling API response issues # KeyError: Missing expected fields in API response # ValueError: Invalid data format or conversion issues # TypeError: Unexpected data types in response # IndexError: 
Array access issues with response data logger.warning( f"Failed to fetch TCGA variant data for {variant_id}: {type(e).__name__}: {e}" ) return None class ThousandGenomesClient: """Client for 1000 Genomes data via Ensembl REST API.""" def _extract_population_frequencies( self, populations: list[dict] ) -> dict[str, Any]: """Extract population frequencies from Ensembl response.""" # Note: Multiple entries per population (one per allele), we want the alternate allele frequency # The reference allele will have higher frequency for rare variants pop_data: dict[str, float] = {} for pop in populations: pop_name = pop.get("population", "") frequency = pop.get("frequency", 0) # Map 1000 Genomes population codes - taking the minor allele frequency if pop_name == "1000GENOMES:phase_3:ALL": if "global_maf" not in pop_data or frequency < pop_data.get( "global_maf", 1 ): pop_data["global_maf"] = frequency elif pop_name == "1000GENOMES:phase_3:AFR": if "afr_maf" not in pop_data or frequency < pop_data.get( "afr_maf", 1 ): pop_data["afr_maf"] = frequency elif pop_name == "1000GENOMES:phase_3:AMR": if "amr_maf" not in pop_data or frequency < pop_data.get( "amr_maf", 1 ): pop_data["amr_maf"] = frequency elif pop_name == "1000GENOMES:phase_3:EAS": if "eas_maf" not in pop_data or frequency < pop_data.get( "eas_maf", 1 ): pop_data["eas_maf"] = frequency elif pop_name == "1000GENOMES:phase_3:EUR": if "eur_maf" not in pop_data or frequency < pop_data.get( "eur_maf", 1 ): pop_data["eur_maf"] = frequency elif pop_name == "1000GENOMES:phase_3:SAS" and ( "sas_maf" not in pop_data or frequency < pop_data.get("sas_maf", 1) ): pop_data["sas_maf"] = frequency return pop_data async def get_variant_data( self, variant_id: str ) -> ThousandGenomesData | None: """Fetch variant data from 1000 Genomes via Ensembl.""" try: # Try to get rsID or use the variant ID directly encoded_id = quote(variant_id, safe="") url = f"{ENSEMBL_VARIATION_ENDPOINT}/{encoded_id}" # Request with pops=1 to get population data params = {"content-type": "application/json", "pops": "1"} response, error = await http_client.request_api( url=url, method="GET", request=params, domain="ensembl", ) if error or not response: return None # Extract population frequencies populations = response.get("populations", []) pop_data = self._extract_population_frequencies(populations) # Get most severe consequence consequence = None if mappings := response.get("mappings", []): # Extract consequences from transcript consequences all_consequences = [] for mapping in mappings: if transcript_consequences := mapping.get( "transcript_consequences", [] ): for tc in transcript_consequences: if consequence_terms := tc.get( "consequence_terms", [] ): all_consequences.extend(consequence_terms) if all_consequences: # Take the first unique consequence seen = set() unique_consequences = [] for c in all_consequences: if c not in seen: seen.add(c) unique_consequences.append(c) consequence = ( unique_consequences[0] if unique_consequences else None ) # Only return data if we found population frequencies if pop_data: return ThousandGenomesData( **pop_data, ancestral_allele=response.get("ancestral_allele"), most_severe_consequence=consequence, ) else: # No population data found return None except (KeyError, ValueError, TypeError, AttributeError) as e: # Log the error for debugging while gracefully handling API response issues # KeyError: Missing expected fields in API response # ValueError: Invalid data format or conversion issues # TypeError: Unexpected data types in response # 
AttributeError: Missing attributes on response objects logger.warning( f"Failed to fetch 1000 Genomes data for {variant_id}: {type(e).__name__}: {e}" ) return None class ExternalVariantAggregator: """Aggregates variant data from multiple external sources.""" def __init__(self): self.tcga_client = TCGAClient() self.thousand_genomes_client = ThousandGenomesClient() # Import here to avoid circular imports from .cbio_external_client import CBioPortalExternalClient self.cbioportal_client = CBioPortalExternalClient() def _extract_gene_aa_change( self, variant_data: dict[str, Any] ) -> str | None: """Extract gene and AA change in format like 'BRAF V600A' from variant data.""" logger.info("_extract_gene_aa_change called") try: # First try to get gene name from CADD data gene_name = None if (cadd := variant_data.get("cadd")) and ( gene := cadd.get("gene") ): gene_name = gene.get("genename") # If not found in CADD, try other sources if not gene_name: # Try docm if docm := variant_data.get("docm"): gene_name = docm.get("gene") or docm.get("genename") # Try dbnsfp if not gene_name and (dbnsfp := variant_data.get("dbnsfp")): gene_name = dbnsfp.get("genename") if not gene_name: return None # Now try to get the protein change aa_change = None # Try to get from docm first (it has clean p.V600A format) if (docm := variant_data.get("docm")) and ( aa := docm.get("aa_change") ): # Convert p.V600A to V600A aa_change = aa.replace("p.", "") # Try hgvsp if not found if ( not aa_change and (hgvsp_list := variant_data.get("hgvsp")) and isinstance(hgvsp_list, list) and hgvsp_list ): # Take the first one and clean it hgvsp = hgvsp_list[0] # Remove p. prefix aa_change = hgvsp.replace("p.", "") # Handle formats like Val600Ala -> V600A if "Val" in aa_change or "Ala" in aa_change: # Try to extract the short form match = re.search(r"[A-Z]\d+[A-Z]", aa_change) if match: aa_change = match.group() # Try CADD data if ( not aa_change and (cadd := variant_data.get("cadd")) and (gene_info := cadd.get("gene")) and (prot := gene_info.get("prot")) ): protpos = prot.get("protpos") if protpos and cadd.get("oaa") and cadd.get("naa"): aa_change = f"{cadd['oaa']}{protpos}{cadd['naa']}" if gene_name and aa_change: result = f"{gene_name} {aa_change}" logger.info(f"Extracted gene/AA change: {result}") return result logger.warning( f"Failed to extract gene/AA change: gene_name={gene_name}, aa_change={aa_change}" ) return None except ( KeyError, ValueError, TypeError, AttributeError, re.error, ) as e: # Log the error for debugging while gracefully handling data extraction issues # KeyError: Missing expected fields in variant data # ValueError: Invalid data format or conversion issues # TypeError: Unexpected data types in variant data # AttributeError: Missing attributes on data objects # re.error: Regular expression matching errors logger.warning( f"Failed to extract gene/AA change from variant data: {type(e).__name__}: {e}" ) return None async def get_enhanced_annotations( self, variant_id: str, include_tcga: bool = True, include_1000g: bool = True, include_cbioportal: bool = True, variant_data: dict[str, Any] | None = None, ) -> EnhancedVariantAnnotation: """Fetch and aggregate variant annotations from external sources. 
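        Illustrative usage (an added documentation sketch; the selected
        sources are queried concurrently via asyncio.gather below):

            aggregator = ExternalVariantAggregator()
            annotation = await aggregator.get_enhanced_annotations(
                "rs113488022",
                include_tcga=True,
                include_1000g=True,
                include_cbioportal=False,  # needs variant_data to derive the gene/AA change
            )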

    async def get_enhanced_annotations(
        self,
        variant_id: str,
        include_tcga: bool = True,
        include_1000g: bool = True,
        include_cbioportal: bool = True,
        variant_data: dict[str, Any] | None = None,
    ) -> EnhancedVariantAnnotation:
        """Fetch and aggregate variant annotations from external sources.

        Args:
            variant_id: The variant identifier (rsID or HGVS)
            include_tcga: Whether to include TCGA data
            include_1000g: Whether to include 1000 Genomes data
            include_cbioportal: Whether to include cBioPortal data
            variant_data: Optional variant data from MyVariant.info to extract gene/protein info
        """
        logger.info(
            f"get_enhanced_annotations called for {variant_id}, include_cbioportal={include_cbioportal}"
        )
        tasks: list[Any] = []
        task_names = []

        # Extract gene/AA change once for sources that need it
        gene_aa_change = None
        if variant_data:
            logger.info(
                f"Extracting gene/AA from variant_data keys: {list(variant_data.keys())}"
            )
            gene_aa_change = self._extract_gene_aa_change(variant_data)
        else:
            logger.warning("No variant_data provided for gene/AA extraction")

        if include_tcga:
            # Try to extract gene and protein change from variant data for TCGA
            tcga_id = gene_aa_change if gene_aa_change else variant_id
            tasks.append(self.tcga_client.get_variant_data(tcga_id))
            task_names.append("tcga")

        if include_1000g:
            tasks.append(
                self.thousand_genomes_client.get_variant_data(variant_id)
            )
            task_names.append("thousand_genomes")

        if include_cbioportal and gene_aa_change:
            # cBioPortal requires gene/AA format
            logger.info(
                f"Adding cBioPortal task with gene_aa_change: {gene_aa_change}"
            )
            tasks.append(
                self.cbioportal_client.get_variant_data(gene_aa_change)
            )
            task_names.append("cbioportal")
        elif include_cbioportal and not gene_aa_change:
            logger.warning(
                "Skipping cBioPortal: no gene/AA change could be extracted"
            )

        # Run all queries in parallel
        results = await asyncio.gather(*tasks, return_exceptions=True)

        # Build the enhanced annotation
        annotation = EnhancedVariantAnnotation(variant_id=variant_id)

        for _i, (result, name) in enumerate(
            zip(results, task_names, strict=False)
        ):
            if isinstance(result, Exception):
                annotation.error_sources.append(name)
            elif result is not None:
                setattr(annotation, name, result)
            else:
                # No data found for this source
                pass

        return annotation


def format_enhanced_annotations(
    annotation: EnhancedVariantAnnotation,
) -> dict[str, Any]:
    """Format enhanced annotations for display."""
    formatted: dict[str, Any] = {
        "variant_id": annotation.variant_id,
        "external_annotations": {},
    }
    external_annot = formatted["external_annotations"]

    if annotation.tcga:
        external_annot["tcga"] = {
            "tumor_types": annotation.tcga.tumor_types,
            "affected_cases": annotation.tcga.affected_cases,
            "cosmic_id": annotation.tcga.cosmic_id,
            "consequence": annotation.tcga.consequence_type,
        }

    if annotation.thousand_genomes:
        external_annot["1000_genomes"] = {
            "global_maf": annotation.thousand_genomes.global_maf,
            "population_frequencies": {
                "african": annotation.thousand_genomes.afr_maf,
                "american": annotation.thousand_genomes.amr_maf,
                "east_asian": annotation.thousand_genomes.eas_maf,
                "european": annotation.thousand_genomes.eur_maf,
                "south_asian": annotation.thousand_genomes.sas_maf,
            },
            "ancestral_allele": annotation.thousand_genomes.ancestral_allele,
            "consequence": annotation.thousand_genomes.most_severe_consequence,
        }

    if annotation.cbioportal:
        cbio_data: dict[str, Any] = {
            "studies": annotation.cbioportal.studies,
            "total_cases": annotation.cbioportal.total_cases,
        }
        # Add cancer type distribution if available
        if annotation.cbioportal.cancer_type_distribution:
            cbio_data["cancer_types"] = (
                annotation.cbioportal.cancer_type_distribution
            )
        # Add mutation type distribution if available
        if annotation.cbioportal.mutation_types:
            cbio_data["mutation_types"] = annotation.cbioportal.mutation_types
        # Add hotspot count if > 0
        if annotation.cbioportal.hotspot_count > 0:
            cbio_data["hotspot_samples"] = annotation.cbioportal.hotspot_count
        # Add mean VAF if available
        if annotation.cbioportal.mean_vaf is not None:
            cbio_data["mean_vaf"] = annotation.cbioportal.mean_vaf
        # Add sample type distribution if available
        if annotation.cbioportal.sample_types:
            cbio_data["sample_types"] = annotation.cbioportal.sample_types
        external_annot["cbioportal"] = cbio_data

    if annotation.error_sources:
        external_annot["errors"] = annotation.error_sources

    return formatted
```

--------------------------------------------------------------------------------
/tests/tdd/trials/test_search.py:
--------------------------------------------------------------------------------

```python
import pytest

from biomcp.trials.search import (
    CLOSED_STATUSES,
    AgeGroup,
    DateField,
    InterventionType,
    LineOfTherapy,
    PrimaryPurpose,
    RecruitingStatus,
    SortOrder,
    SponsorType,
    StudyDesign,
    StudyType,
    TrialPhase,
    TrialQuery,
    _build_biomarker_expression_essie,
    _build_brain_mets_essie,
    _build_excluded_mutations_essie,
    _build_line_of_therapy_essie,
    _build_prior_therapy_essie,
    _build_progression_essie,
    _build_required_mutations_essie,
    _inject_ids,
    convert_query,
)


@pytest.mark.asyncio
async def test_convert_query_basic_parameters():
    """Test basic parameter conversion from TrialQuery to API format."""
    query = TrialQuery(conditions=["lung cancer"])
    params = await convert_query(query)

    assert "markupFormat" in params
    assert params["markupFormat"] == ["markdown"]
    assert "query.cond" in params
    assert params["query.cond"] == ["lung cancer"]
    assert "filter.overallStatus" in params
    assert "RECRUITING" in params["filter.overallStatus"][0]


@pytest.mark.asyncio
async def test_convert_query_multiple_conditions():
    """Test conversion of multiple conditions to API format."""
    query = TrialQuery(conditions=["lung cancer", "metastatic"])
    params = await convert_query(query)

    assert "query.cond" in params
    # The query should contain the original terms, but may have expanded synonyms
    cond_value = params["query.cond"][0]
    assert "lung cancer" in cond_value
    assert "metastatic" in cond_value
    assert cond_value.startswith("(") and cond_value.endswith(")")


@pytest.mark.asyncio
async def test_convert_query_terms_parameter():
    """Test conversion of terms parameter to API format."""
    query = TrialQuery(terms=["immunotherapy"])
    params = await convert_query(query)

    assert "query.term" in params
    assert params["query.term"] == ["immunotherapy"]


@pytest.mark.asyncio
async def test_convert_query_interventions_parameter():
    """Test conversion of interventions parameter to API format."""
    query = TrialQuery(interventions=["pembrolizumab"])
    params = await convert_query(query)

    assert "query.intr" in params
    assert params["query.intr"] == ["pembrolizumab"]


@pytest.mark.asyncio
async def test_convert_query_nct_ids():
    """Test conversion of NCT IDs to API format."""
    query = TrialQuery(nct_ids=["NCT04179552"])
    params = await convert_query(query)

    assert "query.id" in params
    assert params["query.id"] == ["NCT04179552"]
    # Note: The implementation keeps filter.overallStatus when using nct_ids
    # So we don't assert its absence
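
# Editorial note (assumption, not in the original file): convert_query emits
# ClinicalTrials.gov API v2 style parameters, where every value is a list of
# strings (e.g. {"query.cond": ["lung cancer"]}), suitable for encoding as
# repeated query-string parameters.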

@pytest.mark.asyncio
async def test_convert_query_recruiting_status():
    """Test conversion of recruiting status to API format."""
    # Test open status
    query = TrialQuery(recruiting_status=RecruitingStatus.OPEN)
    params = await convert_query(query)

    assert "filter.overallStatus" in params
    assert "RECRUITING" in params["filter.overallStatus"][0]

    # Test closed status
    query = TrialQuery(recruiting_status=RecruitingStatus.CLOSED)
    params = await convert_query(query)

    assert "filter.overallStatus" in params
    assert all(
        status in params["filter.overallStatus"][0]
        for status in CLOSED_STATUSES
    )

    # Test any status
    query = TrialQuery(recruiting_status=RecruitingStatus.ANY)
    params = await convert_query(query)

    assert "filter.overallStatus" not in params


@pytest.mark.asyncio
async def test_convert_query_location_parameters():
    """Test conversion of location parameters to API format."""
    query = TrialQuery(lat=40.7128, long=-74.0060, distance=10)
    params = await convert_query(query)

    assert "filter.geo" in params
    assert params["filter.geo"] == ["distance(40.7128,-74.006,10mi)"]


@pytest.mark.asyncio
async def test_convert_query_study_type():
    """Test conversion of study type to API format."""
    query = TrialQuery(study_type=StudyType.INTERVENTIONAL)
    params = await convert_query(query)

    assert "filter.advanced" in params
    assert "AREA[StudyType]Interventional" in params["filter.advanced"][0]


@pytest.mark.asyncio
async def test_convert_query_phase():
    """Test conversion of phase to API format."""
    query = TrialQuery(phase=TrialPhase.PHASE3)
    params = await convert_query(query)

    assert "filter.advanced" in params
    assert "AREA[Phase]PHASE3" in params["filter.advanced"][0]


@pytest.mark.asyncio
async def test_convert_query_date_range():
    """Test conversion of date range to API format."""
    query = TrialQuery(
        min_date="2020-01-01",
        max_date="2020-12-31",
        date_field=DateField.LAST_UPDATE,
    )
    params = await convert_query(query)

    assert "filter.advanced" in params
    assert (
        "AREA[LastUpdatePostDate]RANGE[2020-01-01,2020-12-31]"
        in params["filter.advanced"][0]
    )

    # Test min date only
    query = TrialQuery(
        min_date="2021-01-01",
        date_field=DateField.STUDY_START,
    )
    params = await convert_query(query)

    assert "filter.advanced" in params
    assert (
        "AREA[StartDate]RANGE[2021-01-01,MAX]" in params["filter.advanced"][0]
    )


@pytest.mark.asyncio
async def test_convert_query_sort_order():
    """Test conversion of sort order to API format."""
    query = TrialQuery(sort=SortOrder.RELEVANCE)
    params = await convert_query(query)

    assert "sort" in params
    assert params["sort"] == ["@relevance"]

    query = TrialQuery(sort=SortOrder.LAST_UPDATE)
    params = await convert_query(query)

    assert "sort" in params
    assert params["sort"] == ["LastUpdatePostDate:desc"]


@pytest.mark.asyncio
async def test_convert_query_intervention_type():
    """Test conversion of intervention type to API format."""
    query = TrialQuery(intervention_type=InterventionType.DRUG)
    params = await convert_query(query)

    assert "filter.advanced" in params
    assert "AREA[InterventionType]Drug" in params["filter.advanced"][0]


@pytest.mark.asyncio
async def test_convert_query_sponsor_type():
    """Test conversion of sponsor type to API format."""
    query = TrialQuery(sponsor_type=SponsorType.ACADEMIC)
    params = await convert_query(query)

    assert "filter.advanced" in params
    assert "AREA[SponsorType]Academic" in params["filter.advanced"][0]


@pytest.mark.asyncio
async def test_convert_query_study_design():
    """Test conversion of study design to API format."""
    query = TrialQuery(study_design=StudyDesign.RANDOMIZED)
    params = await convert_query(query)

    assert "filter.advanced" in params
    assert "AREA[StudyDesign]Randomized" in params["filter.advanced"][0]
"AREA[StdAge]Adult" in params["filter.advanced"][0] @pytest.mark.asyncio async def test_convert_query_primary_purpose(): """Test conversion of primary purpose to API format.""" query = TrialQuery(primary_purpose=PrimaryPurpose.TREATMENT) params = await convert_query(query) assert "filter.advanced" in params assert ( "AREA[DesignPrimaryPurpose]Treatment" in params["filter.advanced"][0] ) @pytest.mark.asyncio async def test_convert_query_next_page_hash(): """Test conversion of next_page_hash to API format.""" query = TrialQuery(next_page_hash="abc123") params = await convert_query(query) assert "pageToken" in params assert params["pageToken"] == ["abc123"] @pytest.mark.asyncio async def test_convert_query_complex_parameters(): """Test conversion of multiple parameters to API format.""" query = TrialQuery( conditions=["diabetes"], terms=["obesity"], interventions=["metformin"], primary_purpose=PrimaryPurpose.TREATMENT, study_type=StudyType.INTERVENTIONAL, intervention_type=InterventionType.DRUG, recruiting_status=RecruitingStatus.OPEN, phase=TrialPhase.PHASE3, age_group=AgeGroup.ADULT, sort=SortOrder.RELEVANCE, ) params = await convert_query(query) assert "query.cond" in params # Disease synonym expansion may add synonyms to diabetes assert "diabetes" in params["query.cond"][0] assert "query.term" in params assert params["query.term"] == ["obesity"] assert "query.intr" in params assert params["query.intr"] == ["metformin"] assert "filter.advanced" in params assert ( "AREA[DesignPrimaryPurpose]Treatment" in params["filter.advanced"][0] ) assert "AREA[StudyType]Interventional" in params["filter.advanced"][0] assert "AREA[InterventionType]Drug" in params["filter.advanced"][0] assert "AREA[Phase]PHASE3" in params["filter.advanced"][0] assert "AREA[StdAge]Adult" in params["filter.advanced"][0] assert "filter.overallStatus" in params assert "RECRUITING" in params["filter.overallStatus"][0] assert "sort" in params assert params["sort"] == ["@relevance"] # Test TrialQuery field validation for CLI input processing # noinspection PyTypeChecker def test_trial_query_field_validation_basic(): """Test basic field validation for TrialQuery.""" # Test list fields conversion query = TrialQuery(conditions="diabetes") assert query.conditions == ["diabetes"] query = TrialQuery(interventions="metformin") assert query.interventions == ["metformin"] query = TrialQuery(terms="blood glucose") assert query.terms == ["blood glucose"] query = TrialQuery(nct_ids="NCT01234567") assert query.nct_ids == ["NCT01234567"] # noinspection PyTypeChecker def test_trial_query_field_validation_recruiting_status(): """Test recruiting status field validation.""" # Exact match uppercase query = TrialQuery(recruiting_status="OPEN") assert query.recruiting_status == RecruitingStatus.OPEN # Exact match lowercase query = TrialQuery(recruiting_status="closed") assert query.recruiting_status == RecruitingStatus.CLOSED # Invalid value with pytest.raises(ValueError) as excinfo: TrialQuery(recruiting_status="invalid") assert "validation error for TrialQuery" in str(excinfo.value) # noinspection PyTypeChecker @pytest.mark.asyncio async def test_trial_query_field_validation_combined(): """Test combined parameters validation.""" query = TrialQuery( conditions=["diabetes", "obesity"], interventions="metformin", recruiting_status="open", study_type="interventional", lat=40.7128, long=-74.0060, distance=10, ) assert query.conditions == ["diabetes", "obesity"] assert query.interventions == ["metformin"] assert query.recruiting_status == 
RecruitingStatus.OPEN assert query.study_type == StudyType.INTERVENTIONAL assert query.lat == 40.7128 assert query.long == -74.0060 assert query.distance == 10 # Check that the query can be converted to parameters properly params = await convert_query(query) assert "query.cond" in params # The query should contain the original terms, but may have expanded synonyms cond_value = params["query.cond"][0] assert "diabetes" in cond_value assert "obesity" in cond_value assert cond_value.startswith("(") and cond_value.endswith(")") assert "query.intr" in params assert "metformin" in params["query.intr"][0] assert "filter.geo" in params assert "distance(40.7128,-74.006,10mi)" in params["filter.geo"][0] # noinspection PyTypeChecker @pytest.mark.asyncio async def test_trial_query_field_validation_terms(): """Test terms parameter validation.""" # Single term as string query = TrialQuery(terms="cancer") assert query.terms == ["cancer"] # Multiple terms as list query = TrialQuery(terms=["cancer", "therapy"]) assert query.terms == ["cancer", "therapy"] # Check parameter generation params = await convert_query(query) assert "query.term" in params assert "(cancer OR therapy)" in params["query.term"][0] # noinspection PyTypeChecker @pytest.mark.asyncio async def test_trial_query_field_validation_nct_ids(): """Test NCT IDs parameter validation.""" # Single NCT ID query = TrialQuery(nct_ids="NCT01234567") assert query.nct_ids == ["NCT01234567"] # Multiple NCT IDs query = TrialQuery(nct_ids=["NCT01234567", "NCT89012345"]) assert query.nct_ids == ["NCT01234567", "NCT89012345"] # Check parameter generation params = await convert_query(query) assert "query.id" in params assert "NCT01234567,NCT89012345" in params["query.id"][0] # noinspection PyTypeChecker @pytest.mark.asyncio async def test_trial_query_field_validation_date_range(): """Test date range parameters validation.""" # Min date only with date field query = TrialQuery(min_date="2020-01-01", date_field=DateField.STUDY_START) assert query.min_date == "2020-01-01" assert query.date_field == DateField.STUDY_START # Min and max date with date field using lazy mapping query = TrialQuery( min_date="2020-01-01", max_date="2021-12-31", date_field="last update", # space not underscore. 
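
# Editorial note (assumption, not in the original file): TrialQuery field
# validators accept human-friendly aliases for enum values, so the test below
# can pass date_field="last update" (with a space) and still resolve to
# DateField.LAST_UPDATE.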

# noinspection PyTypeChecker
@pytest.mark.asyncio
async def test_trial_query_field_validation_date_range():
    """Test date range parameters validation."""
    # Min date only with date field
    query = TrialQuery(min_date="2020-01-01", date_field=DateField.STUDY_START)
    assert query.min_date == "2020-01-01"
    assert query.date_field == DateField.STUDY_START

    # Min and max date with date field using lazy mapping
    query = TrialQuery(
        min_date="2020-01-01",
        max_date="2021-12-31",
        date_field="last update",  # space not underscore.
    )
    assert query.min_date == "2020-01-01"
    assert query.max_date == "2021-12-31"
    assert query.date_field == DateField.LAST_UPDATE

    # Check parameter generation
    params = await convert_query(query)
    assert "filter.advanced" in params
    assert (
        "AREA[LastUpdatePostDate]RANGE[2020-01-01,2021-12-31]"
        in params["filter.advanced"][0]
    )


# noinspection PyTypeChecker
def test_trial_query_field_validation_primary_purpose():
    """Test primary purpose parameter validation."""
    # Exact match uppercase
    query = TrialQuery(primary_purpose=PrimaryPurpose.TREATMENT)
    assert query.primary_purpose == PrimaryPurpose.TREATMENT

    # Exact match lowercase
    query = TrialQuery(primary_purpose=PrimaryPurpose.PREVENTION)
    assert query.primary_purpose == PrimaryPurpose.PREVENTION

    # Case-insensitive
    query = TrialQuery(primary_purpose="ScReeNING")
    assert query.primary_purpose == PrimaryPurpose.SCREENING

    # Invalid
    with pytest.raises(ValueError):
        TrialQuery(primary_purpose="invalid")


def test_inject_ids_with_many_ids_and_condition():
    """Test _inject_ids function with 300 IDs and a condition to ensure filter.ids is used."""
    # Create a params dict with a condition (indicating other filters present)
    params = {
        "query.cond": ["melanoma"],
        "format": ["json"],
        "markupFormat": ["markdown"],
    }

    # Generate 300 NCT IDs
    nct_ids = [f"NCT{str(i).zfill(8)}" for i in range(1, 301)]

    # Call _inject_ids with has_other_filters=True
    _inject_ids(params, nct_ids, has_other_filters=True)

    # Assert that filter.ids is used (not query.id)
    assert "filter.ids" in params
    assert "query.id" not in params

    # Verify the IDs are properly formatted
    ids_param = params["filter.ids"][0]
    assert ids_param.startswith("NCT")
    assert "NCT00000001" in ids_param
    assert "NCT00000300" in ids_param

    # Verify it's a comma-separated list
    assert "," in ids_param
    assert ids_param.count(",") == 299  # 300 IDs = 299 commas


def test_inject_ids_without_other_filters():
    """Test _inject_ids function with only NCT IDs (no other filters)."""
    # Create a minimal params dict
    params = {
        "format": ["json"],
        "markupFormat": ["markdown"],
    }

    # Use a small number of NCT IDs
    nct_ids = ["NCT00000001", "NCT00000002", "NCT00000003"]

    # Call _inject_ids with has_other_filters=False
    _inject_ids(params, nct_ids, has_other_filters=False)

    # Assert that query.id is used (not filter.ids) for small lists
    assert "query.id" in params
    assert "filter.ids" not in params

    # Verify the format
    assert params["query.id"][0] == "NCT00000001,NCT00000002,NCT00000003"


def test_inject_ids_large_list_without_filters():
    """Test _inject_ids with a large ID list but no other filters."""
    params = {
        "format": ["json"],
        "markupFormat": ["markdown"],
    }

    # Generate enough IDs to exceed 1800 character limit
    nct_ids = [f"NCT{str(i).zfill(8)}" for i in range(1, 201)]  # ~2200 chars

    # Call _inject_ids with has_other_filters=False
    _inject_ids(params, nct_ids, has_other_filters=False)

    # Assert that filter.ids is used for large lists even without other filters
    assert "filter.ids" in params
    assert "query.id" not in params


# Tests for new Essie builder functions
def test_build_prior_therapy_essie():
    """Test building Essie fragments for prior therapies."""
    # Single therapy
    fragments = _build_prior_therapy_essie(["osimertinib"])
    assert len(fragments) == 1
    assert (
        fragments[0]
        == 'AREA[EligibilityCriteria]("osimertinib" AND (prior OR previous OR received))'
    )

    # Multiple therapies
    fragments = _build_prior_therapy_essie(["osimertinib", "erlotinib"])
    assert len(fragments) == 2
    assert (
        fragments[0]
        == 'AREA[EligibilityCriteria]("osimertinib" AND (prior OR previous OR received))'
    )
    assert (
        fragments[1]
        == 'AREA[EligibilityCriteria]("erlotinib" AND (prior OR previous OR received))'
    )

    # Empty strings are filtered out
    fragments = _build_prior_therapy_essie(["osimertinib", "", "erlotinib"])
    assert len(fragments) == 2


def test_build_progression_essie():
    """Test building Essie fragments for progression on therapy."""
    fragments = _build_progression_essie(["pembrolizumab"])
    assert len(fragments) == 1
    assert (
        fragments[0]
        == 'AREA[EligibilityCriteria]("pembrolizumab" AND (progression OR resistant OR refractory))'
    )


def test_build_required_mutations_essie():
    """Test building Essie fragments for required mutations."""
    fragments = _build_required_mutations_essie(["EGFR L858R", "T790M"])
    assert len(fragments) == 2
    assert fragments[0] == 'AREA[EligibilityCriteria]("EGFR L858R")'
    assert fragments[1] == 'AREA[EligibilityCriteria]("T790M")'


def test_build_excluded_mutations_essie():
    """Test building Essie fragments for excluded mutations."""
    fragments = _build_excluded_mutations_essie(["KRAS G12C"])
    assert len(fragments) == 1
    assert fragments[0] == 'AREA[EligibilityCriteria](NOT "KRAS G12C")'


def test_build_biomarker_expression_essie():
    """Test building Essie fragments for biomarker expression."""
    biomarkers = {"PD-L1": "≥50%", "TMB": "≥10 mut/Mb"}
    fragments = _build_biomarker_expression_essie(biomarkers)
    assert len(fragments) == 2
    assert 'AREA[EligibilityCriteria]("PD-L1" AND "≥50%")' in fragments
    assert 'AREA[EligibilityCriteria]("TMB" AND "≥10 mut/Mb")' in fragments

    # Empty values are filtered out
    biomarkers = {"PD-L1": "≥50%", "TMB": "", "HER2": "positive"}
    fragments = _build_biomarker_expression_essie(biomarkers)
    assert len(fragments) == 2


def test_build_line_of_therapy_essie():
    """Test building Essie fragment for line of therapy."""
    # First line
    fragment = _build_line_of_therapy_essie(LineOfTherapy.FIRST_LINE)
    assert (
        fragment
        == 'AREA[EligibilityCriteria]("first line" OR "first-line" OR "1st line" OR "frontline" OR "treatment naive" OR "previously untreated")'
    )

    # Second line
    fragment = _build_line_of_therapy_essie(LineOfTherapy.SECOND_LINE)
    assert (
        fragment
        == 'AREA[EligibilityCriteria]("second line" OR "second-line" OR "2nd line" OR "one prior line" OR "1 prior line")'
    )

    # Third line plus
    fragment = _build_line_of_therapy_essie(LineOfTherapy.THIRD_LINE_PLUS)
    assert (
        fragment
        == 'AREA[EligibilityCriteria]("third line" OR "third-line" OR "3rd line" OR "≥2 prior" OR "at least 2 prior" OR "heavily pretreated")'
    )


def test_build_brain_mets_essie():
    """Test building Essie fragment for brain metastases filter."""
    # Allow brain mets (no filter)
    fragment = _build_brain_mets_essie(True)
    assert fragment == ""

    # Exclude brain mets
    fragment = _build_brain_mets_essie(False)
    assert fragment == 'AREA[EligibilityCriteria](NOT "brain metastases")'
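
# Editorial note (not in the original file): "Essie" is the expression syntax
# used by ClinicalTrials.gov search; AREA[FieldName](...) scopes a boolean
# expression to a single study field such as EligibilityCriteria. convert_query
# ANDs these fragments into query.term, as the next test demonstrates.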

@pytest.mark.asyncio
async def test_convert_query_with_eligibility_fields():
    """Test conversion of query with new eligibility-focused fields."""
    query = TrialQuery(
        conditions=["lung cancer"],
        prior_therapies=["osimertinib"],
        progression_on=["erlotinib"],
        required_mutations=["EGFR L858R"],
        excluded_mutations=["T790M"],
        biomarker_expression={"PD-L1": "≥50%"},
        line_of_therapy=LineOfTherapy.SECOND_LINE,
        allow_brain_mets=False,
    )
    params = await convert_query(query)

    # Check that query.term contains all the Essie fragments
    assert "query.term" in params
    term = params["query.term"][0]

    # Prior therapy
    assert (
        'AREA[EligibilityCriteria]("osimertinib" AND (prior OR previous OR received))'
        in term
    )
    # Progression
    assert (
        'AREA[EligibilityCriteria]("erlotinib" AND (progression OR resistant OR refractory))'
        in term
    )
    # Required mutation
    assert 'AREA[EligibilityCriteria]("EGFR L858R")' in term
    # Excluded mutation
    assert 'AREA[EligibilityCriteria](NOT "T790M")' in term
    # Biomarker expression
    assert 'AREA[EligibilityCriteria]("PD-L1" AND "≥50%")' in term
    # Line of therapy
    assert 'AREA[EligibilityCriteria]("second line" OR "second-line"' in term
    # Brain mets exclusion
    assert 'AREA[EligibilityCriteria](NOT "brain metastases")' in term

    # All fragments should be combined with AND
    assert " AND " in term


@pytest.mark.asyncio
async def test_convert_query_with_custom_fields_and_page_size():
    """Test conversion of query with custom return fields and page size."""
    query = TrialQuery(
        conditions=["diabetes"],
        return_fields=["NCTId", "BriefTitle", "OverallStatus"],
        page_size=100,
    )
    params = await convert_query(query)

    assert "fields" in params
    assert params["fields"] == ["NCTId,BriefTitle,OverallStatus"]
    assert "pageSize" in params
    assert params["pageSize"] == ["100"]


@pytest.mark.asyncio
async def test_convert_query_eligibility_with_existing_terms():
    """Test that eligibility Essie fragments are properly combined with existing terms."""
    query = TrialQuery(
        terms=["immunotherapy"],
        prior_therapies=["chemotherapy"],
    )
    params = await convert_query(query)

    assert "query.term" in params
    term = params["query.term"][0]

    # Should contain both the original term and the new Essie fragment
    assert "immunotherapy" in term
    assert (
        'AREA[EligibilityCriteria]("chemotherapy" AND (prior OR previous OR received))'
        in term
    )
    # Should be combined with AND
    assert "immunotherapy AND AREA[EligibilityCriteria]" in term
```

--------------------------------------------------------------------------------
/tests/data/pubtator/pubtator3_paper.txt:
--------------------------------------------------------------------------------

```
Nucleic Acids Research, 2024, 52, W540–W546
https://doi.org/10.1093/nar/gkae235
Advance access publication date: 4 April 2024
Web Server issue

PubTator 3.0: an AI-powered literature resource for unlocking biomedical knowledge

Chih-Hsuan Wei†, Alexis Allot†, Po-Ting Lai, Robert Leaman, Shubo Tian, Ling Luo, Qiao Jin, Zhizheng Wang, Qingyu Chen and Zhiyong Lu*

National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA

*To whom correspondence should be addressed. Tel: +1 301 594 7089; Email: [email protected]
†The first two authors should be regarded as Joint First Authors.

Present addresses:
Alexis Allot, The Neuro (Montreal Neurological Institute-Hospital), McGill University, Montreal, Quebec H3A 2B4, Canada.
Ling Luo, School of Computer Science and Technology, Dalian University of Technology, 116024 Dalian, China.
Qingyu Chen, Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, CT 06510, USA.

Abstract

PubTator 3.0 (https://www.ncbi.nlm.nih.gov/research/pubtator3/) is a biomedical literature resource using state-of-the-art AI techniques to offer semantic and relation searches for key concepts like proteins, genetic variants, diseases and chemicals. It currently provides over one billion entity and relation annotations across approximately 36 million PubMed abstracts and 6 million full-text articles from the PMC open access subset, updated weekly.
PubTator 3.0’s online interface and API utilize these precomputed entity relations and synonyms to provide advanced search capabilities and enable large-scale analyses, streamlining many complex information needs. We showcase the retrieval quality of PubTator 3.0 using a series of entity pair queries, demonstrating that PubTator 3.0 retrieves a greater number of articles than either PubMed or Google Scholar, with higher precision in the top 20 results. We further show that integrating ChatGPT (GPT-4) with PubTator APIs dramatically improves the factuality and verifiability of its responses. In summary, PubTator 3.0 offers a comprehensive set of features and tools that allow researchers to navigate the ever-expanding wealth of biomedical literature, expediting research and unlocking valuable insights for scientific discovery.

Received: January 18, 2024. Revised: March 2, 2024. Editorial Decision: March 16, 2024. Accepted: March 21, 2024.
Published by Oxford University Press on behalf of Nucleic Acids Research 2024. This work is written by (a) US Government employee(s) and is in the public domain in the US.

Graphical abstract

Introduction

The biomedical literature is a primary resource to address information needs across the biological and clinical sciences (1); however, the requirements for literature search vary widely. Activities such as formulating a research hypothesis require an exploratory approach, whereas tasks like interpreting the clinical significance of genetic variants are more focused. Traditional keyword-based search methods have long formed the foundation of biomedical literature search (2). While generally effective for basic search, these methods also have significant limitations, such as missing relevant articles due to differing terminology or including irrelevant articles because surface-level term matches cannot adequately represent the required association between query terms. These limitations cost time and risk information needs remaining unmet.

Natural language processing (NLP) methods provide substantial value for creating bioinformatics resources (3–5), and may improve literature search by enabling semantic and relation search (6). In semantic search, users indicate specific concepts of interest (entities) for which the system has precomputed matches regardless of the terminology used. Relation search increases precision by allowing users to specify the type of relationship desired between entities, such as whether a chemical enhances or reduces expression of a gene. In this regard, we present PubTator 3.0, a novel resource engineered to support semantic and relation search in the biomedical literature. Its search capabilities allow users to explore automated entity annotations for six key biomedical entities: genes, diseases, chemicals, genetic variants, species, and cell lines. PubTator 3.0 also identifies and makes searchable 12 common types of relations between entities, enhancing its utility for both targeted and exploratory searches. Focusing on relations and entity types of interest across the biomedical sciences allows PubTator 3.0 to retrieve information precisely while providing broad utility (see detailed comparisons with its predecessor in Supplementary Table S1).

System overview

The PubTator 3.0 online interface, illustrated in Figure 1 and Supplementary Figure S1, is designed for interactive literature exploration, supporting semantic, relation, keyword, and Boolean queries. An auto-complete function provides semantic search suggestions to assist users with query formulation. For example, it automatically suggests replacing either ‘COVID-19’ or ‘SARS-CoV-2 infection’ with the semantic term ‘@DISEASE_COVID_19’. Relation queries – new to PubTator 3.0 – provide increased precision, allowing users to target articles which discuss specific relationships between entities. PubTator 3.0 offers unified search results, simultaneously searching approximately 36 million PubMed abstracts and over 6 million full-text articles from the PMC Open Access Subset (PMC-OA), improving access to the substantial amount of relevant information present in the article full text (7). Search results are prioritized based on the depth of the relationship between the query terms: articles containing identifiable relations between semantic terms receive the highest priority, while articles where semantic or keyword terms co-occur nearby (e.g. within the same sentence) receive secondary priority. Search results are also prioritized based on the article section where the match appears (e.g. matches within the title receive higher priority). Users can further refine results by employing filters, narrowing articles returned to specific publication types, journals, or article sections.

PubTator 3.0 is supported by an NLP pipeline, depicted in Figure 2A. This pipeline, run weekly, first identifies articles newly added to PubMed and PMC-OA. Articles are then processed through three major steps: (i) named entity recognition, provided by the recently developed deep-learning transformer model AIONER (8), (ii) identifier mapping and (iii) relation extraction, performed by BioREx (9) of 12 common types of relations (described in Supplementary Table S2). In total, PubTator 3.0 contains over 1.6 billion entity annotations (4.6 million unique identifiers) and 33 million relations (8.8 million unique pairs). It provides enhanced entity recognition and normalization performance over its previous version, PubTator 2 (10), also known as PubTator Central (Figure 2B and Supplementary Table S3). We show the relation extraction performance of PubTator 3.0 in Figure 2C and its comparison results to the previous state-of-the-art systems (11–13) on the BioCreative V Chemical-Disease Relation (14) corpus, finding that PubTator 3.0 provided substantially higher accuracy. Moreover, when evaluating a randomized sample of entity pair queries compared to PubMed and Google Scholar, PubTator 3.0 consistently returns a greater number of articles with higher precision in the top 20 results (Figure 2D and Supplementary Table S4).

Materials and methods

Data sources and article processing

PubTator 3.0 downloads new articles weekly from the BioC PubMed API (https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/BioC-PubMed/) and the BioC PMC API (https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/BioC-PMC/) in BioC-XML format (16). Local abbreviations are identified using Ab3P (17). Article text and extracted data are stored internally using MongoDB and indexed for search with Solr, ensuring robust and scalable accessibility unconstrained by external dependencies such as the NCBI eUtils API.

Entity recognition and normalization/linking

PubTator 3.0 uses AIONER (8), a recently developed named entity recognition (NER) model, to recognize entities of six types: genes/proteins, chemicals, diseases, species, genetic variants, and cell lines. AIONER utilizes a flexible tagging scheme to integrate training data created separately into a single resource. These training datasets include NLM-Gene (18), NLM-Chem (19), NCBI-Disease (20), BC5CDR (14), tmVar3 (21), Species-800 (22), BioID (23) and BioRED (15). This consolidation creates a larger training set, improving the model’s ability to generalize to unseen data. Furthermore, it enables recognizing multiple entity types simultaneously, enhancing efficiency and simplifying the challenge of distinguishing boundaries between entities that reference others, such as the disorder ‘Alpha-1 antitrypsin deficiency’ and the protein ‘Alpha-1 antitrypsin’. We previously evaluated the performance of AIONER on 14 benchmark datasets (8), including the test sets for the aforementioned training sets. This evaluation demonstrated that AIONER’s performance surpasses or matches previous state-of-the-art methods.

Entity mentions found by AIONER are normalized (linked) to a unique identifier in an appropriate entity database. Normalization is performed by a module designed for (or adapted to) each entity type, using the latest version. The recently upgraded GNorm2 system (24) normalizes genes to NCBI Gene identifiers and species mentions to NCBI Taxonomy. tmVar3 (21), also recently upgraded, normalizes genetic variants; it uses dbSNP identifiers for variants listed in dbSNP and HGVS format otherwise. Chemicals are normalized by the NLM-Chem tagger (19) to MeSH identifiers (25). TaggerOne (26) normalizes diseases to MeSH and cell lines to Cellosaurus (27) using a new normalization-only mode. This mode only applies the normalization model, which converts both mentions and lexicon names into high-dimensional TF-IDF vectors and learns a mapping, as before. However, it now augments the training data by mapping each lexicon name to itself, resulting in a large performance improvement for names present in the lexicon but not in the annotated training data. These enhancements provide a significant overall improvement in entity normalization performance (Supplementary Table S3).

Relation extraction

Relations for PubTator 3.0 are extracted by the unified relation extraction model BioREx (9), designed to simultaneously extract 12 types of relations across eight entity type pairs: chemical–chemical, chemical–disease, chemical–gene, chemical–variant, disease–gene, disease–variant, gene–gene and variant–variant. Detailed definitions of these relation types and their corresponding entity pairs are presented in Supplementary Table S2. Deep-learning methods for relation extraction, such as BioREx, require ample training data. However, training data for relation extraction is fragmented into many datasets, often tailored to specific entity pairs. BioREx overcomes this limitation with a data-centric approach, reconciling discrepancies between disparate training datasets to construct a comprehensive, unified dataset. We evaluated the relations extracted by BioREx using performance on manually annotated relation extraction datasets as well as a comparative analysis between BioREx and notable comparable systems. BioREx established a new performance benchmark on the BioRED corpus test set (15), elevating the performance from 74.4% (F-score) to 79.6%, and demonstrating higher performance than alternative models such as transfer learning (TL), multi-task learning (MTL), and state-of-the-art models trained on isolated datasets (9). For PubTator 3.0, we replaced its deep learning module, PubMedBERT (28), with LinkBERT (29), further increasing the performance to 82.0%. Furthermore, we conducted a comparative analysis between BioREx and SemRep (11), a widely used rule-based method for extracting diverse relations, the CD-REST (13) system, and the previous state-of-the-art system (12), using the BioCreative V Chemical Disease Relation corpus test set (14). Our evaluation demonstrated that PubTator 3.0 provided substantially higher F-score than previous methods.

Programmatic access and data formats

PubTator 3.0 offers programmatic access through its API and bulk download. The API (https://www.ncbi.nlm.nih.gov/research/pubtator3/) supports keyword, entity and relation search, and also supports exporting annotations in XML and JSON-based BioC (16) formats and tab-delimited free text. The PubTator 3.0 FTP site (https://ftp.ncbi.nlm.nih.gov/pub/lu/PubTator3) provides bulk downloads of annotated articles and extraction summaries for entities and relations. Programmatic access supports more flexible query options; for example, the information need ‘what chemicals reduce expression of JAK1?’ can be answered directly via API (e.g. https://www.ncbi.nlm.nih.gov/research/pubtator3-api/relations?e1=@GENE_JAK1&type=negative_correlate&e2=Chemical) or by filtering the bulk relations file. Additionally, the PubTator 3.0 API supports annotation of user-defined free text.

Figure 1. PubTator 3.0 system overview and search results page: 1. Query auto-complete enhances search accuracy and synonym matching. 2. Natural language processing (NLP)-enhanced relevance: Search results are prioritized according to the strength of the relationship between the entities queried. 3. Users can further refine results with facet filters—section, journal and type. 4. Search results include highlighted entity snippets explaining relevance. 5. Histogram visualizes number of results by publication year. 6. Entity highlighting can be switched on or off according to user preference.

Case study I: entity relation queries

We analyzed the retrieval quality of PubTator 3.0 by preparing a series of 12 entity pairs to serve as case studies for comparison between PubTator 3.0, PubMed and Google Scholar. To provide an equal comparison, we filtered about 30% of the Google Scholar results for articles not present in PubMed. To ensure that the number of results would remain low enough to allow filtering Google Scholar results for articles not in PubMed, we identified entity pairs first discussed together in the literature in 2022 or later. We then randomly selected two entity pairs of each of the following types: disease/gene, chemical/disease, chemical/gene, chemical/chemical, gene/gene and disease/variant. None of the relation pairs selected appears in the training set. The comparison was performed with respect to a snapshot of the search results returned by all search engines on 19 May 2023. We manually evaluated the top 20 results for each system and each query; articles were judged to be relevant if they mentioned both entities in the query and supported a relationship between them. Two curators independently judged each article, and discrepancies were discussed until agreement. The curators were not blinded to the retrieval method but were required to record the text supporting the relationship, if relevant. This experiment evaluated the relevance of the top 20 results for each retrieval method, regardless of whether the article appeared in PubMed.

Our analysis is summarized in Figure 2D, and Supplementary Table S4 presents a detailed comparison of the quality of retrieved results between PubTator 3.0, PubMed and Google Scholar. Our results demonstrate that PubTator 3.0 retrieves a greater number of articles than the comparison systems and its precision is higher for the top 20 results. For instance, PubTator 3.0 returned 346 articles for the query ‘GLPG0634 + ulcerative colitis’, and manual review of the top 20 articles showed that all contained statements about an association between GLPG0634 and ulcerative colitis. In contrast, PubMed only returned a total of 18 articles, with only 12 mentioning an association. Moreover, when searching for ‘COVID-19 + PON1’, PubTator 3.0 returns 212 articles in PubMed, surpassing the 43 articles obtained from Google Scholar, only 29 of which are sourced from PubMed. These disparities can be attributed to several factors: (i) PubTator 3.0’s search includes full texts available in PMC-OA, resulting in significantly broader coverage of articles, (ii) entity normalization improves recall, for example, by matching ‘paraoxonase 1’ to ‘PON1’, (iii) PubTator 3.0 prioritizes articles containing relations between the query entities, (iv) PubTator 3.0 prioritizes articles where the entities appear nearby, rather than distant paragraphs. Across the 12 information retrieval case studies, PubTator 3.0 demonstrated an overall precision of 90.0% for the top 20 articles (216 out of 240), which is significantly higher than PubMed’s precision of 81.6% (84 out of 103) and Google Scholar’s precision of 48.5% (98 out of 202).

Figure 2. (A) The PubTator 3.0 processing pipeline: AIONER (8) identifies six types of entities in PubMed abstracts and PMC-OA full-text articles. Entity annotations are associated with database identifiers by specialized mappers and BioREx (9) identifies relations between entities. Extracted data is stored in MongoDB and made searchable using Solr. (B) Entity recognition performance for each entity type compared with PubTator2 (also known as PubTatorCentral) (13) on the BioRED corpus (15). (C) Relation extraction performance compared with SemRep (11) and notable previous best systems (12,13) on the BioCreative V Chemical-Disease Relation (14) corpus. (D) Comparison of information retrieval for PubTator 3.0, PubMed, and Google Scholar for entity pair queries, with respect to total article count and top-20 article precision.

Case study II: retrieval-augmented generation

In the era of large language models (LLMs), PubTator 3.0 can also enhance their factual accuracy via retrieval-augmented generation. Despite their strong language ability, LLMs are prone to generating incorrect assertions, sometimes known as hallucinations (30,31). For example, when requested to cite sources for questions such as ‘which diseases can doxorubicin treat’, GPT-4 frequently provides seemingly plausible but nonexistent references. Augmenting GPT-4 with PubTator 3.0 APIs can anchor the model’s response to verifiable references via the extracted relations, significantly reducing hallucinations.

We assessed the citation accuracy of responses from three GPT-4 variations: PubTator-augmented GPT-4, PubMed-augmented GPT-4 and standard GPT-4. We performed a qualitative evaluation based on eight questions selected as follows. We identified entities mentioned in the PubMed query logs and randomly selected from entities searched both frequently and rarely. We then identified the common queries for each entity that request relational information and adapted one into a natural language question. Each question is therefore grounded on common information needs of real PubMed users. For example, the questions ‘What can be caused by tocilizumab?’ and ‘What can be treated by doxorubicin?’ are adapted from the user queries ‘tocilizumab side effects’ and ‘doxorubicin treatment’ respectively. Such questions typically require extracting information from multiple articles and an understanding of biomedical entities and relationship descriptions. Supplementary Table S5 lists the questions chosen.

We augmented the GPT-4 large language model (LLM) with PubTator 3.0 via the function calling mechanism of the OpenAI ChatCompletion API. This integration involved prompting GPT-4 with descriptions of three PubTator APIs: (i) find entity ID, which retrieves PubTator entity identifiers; (ii) find related entities, which identifies related entities based on an input entity and specified relations and (iii) export relevant search results, which returns PubMed article identifiers containing textual evidence for specific entity relationships. Our instructions prompted GPT-4 to decompose user questions into sub-questions addressable by these APIs, execute the function calls, and synthesize the responses into a coherent final answer. Our prompt promoted a summarized response by instructing GPT-4 to start its message with ‘Summary:’ and requested the response include citations to the articles providing evidence. The PubMed augmentation experiments provided GPT-4 with access to PubMed database search via the National Center for Biotechnology Information (NCBI) E-utils APIs (32). We used Azure OpenAI Services (version 2023-07-01-preview) and GPT-4 (version 2023-06-13) and set the decoding temperature to zero to obtain deterministic outputs. The full prompts are provided in Supplementary Table S6.

PubTator-augmented GPT-4 generally processed the questions in three steps: (i) finding the standard entity identifiers, (ii) finding its related entity identifiers and (iii) searching PubMed articles. For example, to answer ‘What drugs can treat breast cancer?’, GPT-4 first found the PubTator entity identifier for breast cancer (@DISEASE_Breast_Cancer) using the Find Entity ID API. It then used the Find Related Entities API to identify entities related to @DISEASE_Breast_Cancer through a ‘treat’ relation. For demonstration purposes, we limited the maximum number of output entities to five. Finally, GPT-4 called the Export Relevant Search Results API for the PubMed article identifiers containing evidence for these relationships. The raw responses to each prompt for each method are provided in Supplementary Table S6.

We manually evaluated the accuracy of the citations in the responses by reviewing each PubMed article and verifying whether each PubMed article cited supported the stated relationship (e.g. Tamoxifen treating breast cancer). Supplementary Table S5 reports the proportion of the cited articles with valid supporting evidence for each method. GPT-4 frequently generated fabricated citations, widely known as the hallucination issue. While PubMed-augmented GPT-4 showed a higher proportion of accurate citations, some articles cited did not support the relation claims. This is likely because PubMed is based on keyword and Boolean search and does not support queries for specific relationships. Responses generated by PubTator-augmented GPT-4 demonstrated the highest level of citation accuracy, underscoring the potential of PubTator 3.0 as a high-quality knowledge source for addressing biomedical information needs through retrieval-augmented generation with LLMs such as GPT-4. In our experiment, using Azure for ChatGPT, the cost was approximately $1 for two questions with GPT-4-Turbo, or 40 questions when downgraded to GPT-3.5-Turbo, including the cost of input/output tokens.

Discussion

Previous versions of PubTator have fulfilled over one billion API requests since 2015, supporting a wide range of research applications. Numerous studies have harnessed PubTator annotations for disease-specific gene research, including efforts to prioritize candidate genes (33), determine gene–phenotype associations (34), and identify the genetic underpinnings of ...

Conclusion

PubTator 3.0 offers a comprehensive set of features and tools that allow researchers to navigate the ever-expanding wealth of biomedical literature, expediting research and unlocking valuable insights for scientific discovery. The PubTator 3.0 interface, API, and bulk file downloads are available at https://www.ncbi.nlm.nih.gov/research/pubtator3/.

Data availability

Data is available through the online interface at https://www.ncbi.nlm.nih.gov/research/pubtator3/, through the API at https://www.ncbi.nlm.nih.gov/research/pubtator3/api or bulk FTP download at https://ftp.ncbi.nlm.nih.gov/pub/lu/PubTator3/. The source code for each component of PubTator 3.0 is openly accessible. The AIONER named entity recognizer is available at https://github.com/ncbi/AIONER. GNorm2, for gene name normalization, is available at https://github.com/ncbi/GNorm2. The tmVar3 variant name normalizer is available at https://github.com/ncbi/tmVar3. The NLM-Chem Tagger, for chemical name normalization, is available at https://ftp.ncbi.nlm.nih.gov/pub/lu/NLMChem. The TaggerOne system, for disease and cell line normalization, is available at https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/taggerone. The BioREx relation extraction system is available at https://github.com/ncbi/BioREx. The code for customizing ChatGPT with the PubTator 3.0 API is available at https://github.com/ncbi-nlp/pubtator-gpt. The details of the applications, performance, evaluation data, and citations for each tool are shown in Supplementary Table S7. All source code is also available at https://doi.org/10.5281/zenodo.10839630.

Supplementary data

Supplementary Data are available at NAR Online.

Funding

Intramural Research Program of the National Library of Medicine (NLM), National Institutes of Health; ODSS Support of the Exploration of Cloud in NIH Intramural Research. Funding for open access charge: Intramural Research Program of the National Library of Medicine, National Institutes of Health.

Conflict of interest statement

None declared.
```
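
The relations endpoint quoted in the paper's "Programmatic access and data formats" section can be exercised directly. Below is a minimal sketch, assuming only the endpoint URL and parameters given verbatim in the text; the shape of the JSON response is not documented there and is treated as an opaque payload.

```python
# Minimal sketch (editorial example, not part of the repository): query the
# PubTator 3.0 relations API for chemicals that negatively correlate with
# JAK1 expression, using the URL and parameters quoted in the paper above.
import requests

url = "https://www.ncbi.nlm.nih.gov/research/pubtator3-api/relations"
params = {
    "e1": "@GENE_JAK1",            # source entity: the JAK1 gene
    "type": "negative_correlate",  # relation type: reduced expression
    "e2": "Chemical",              # target entity type
}

response = requests.get(url, params=params, timeout=30)
response.raise_for_status()
# The response is JSON; its exact schema is an assumption here.
print(response.json())
```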