This is page 10 of 15. Use http://codebase.md/genomoncology/biomcp?lines=false&page={x} to view the full context. # Directory Structure ``` ├── .github │ ├── actions │ │ └── setup-python-env │ │ └── action.yml │ ├── dependabot.yml │ └── workflows │ ├── ci.yml │ ├── deploy-docs.yml │ ├── main.yml.disabled │ ├── on-release-main.yml │ └── validate-codecov-config.yml ├── .gitignore ├── .pre-commit-config.yaml ├── BIOMCP_DATA_FLOW.md ├── CHANGELOG.md ├── CNAME ├── codecov.yaml ├── docker-compose.yml ├── Dockerfile ├── docs │ ├── apis │ │ ├── error-codes.md │ │ ├── overview.md │ │ └── python-sdk.md │ ├── assets │ │ ├── biomcp-cursor-locations.png │ │ ├── favicon.ico │ │ ├── icon.png │ │ ├── logo.png │ │ ├── mcp_architecture.txt │ │ └── remote-connection │ │ ├── 00_connectors.png │ │ ├── 01_add_custom_connector.png │ │ ├── 02_connector_enabled.png │ │ ├── 03_connect_to_biomcp.png │ │ ├── 04_select_google_oauth.png │ │ └── 05_success_connect.png │ ├── backend-services-reference │ │ ├── 01-overview.md │ │ ├── 02-biothings-suite.md │ │ ├── 03-cbioportal.md │ │ ├── 04-clinicaltrials-gov.md │ │ ├── 05-nci-cts-api.md │ │ ├── 06-pubtator3.md │ │ └── 07-alphagenome.md │ ├── blog │ │ ├── ai-assisted-clinical-trial-search-analysis.md │ │ ├── images │ │ │ ├── deep-researcher-video.png │ │ │ ├── researcher-announce.png │ │ │ ├── researcher-drop-down.png │ │ │ ├── researcher-prompt.png │ │ │ ├── trial-search-assistant.png │ │ │ └── what_is_biomcp_thumbnail.png │ │ └── researcher-persona-resource.md │ ├── changelog.md │ ├── CNAME │ ├── concepts │ │ ├── 01-what-is-biomcp.md │ │ ├── 02-the-deep-researcher-persona.md │ │ └── 03-sequential-thinking-with-the-think-tool.md │ ├── developer-guides │ │ ├── 01-server-deployment.md │ │ ├── 02-contributing-and-testing.md │ │ ├── 03-third-party-endpoints.md │ │ ├── 04-transport-protocol.md │ │ ├── 05-error-handling.md │ │ ├── 06-http-client-and-caching.md │ │ ├── 07-performance-optimizations.md │ │ └── generate_endpoints.py │ ├── faq-condensed.md │ ├── FDA_SECURITY.md │ ├── genomoncology.md │ ├── getting-started │ │ ├── 01-quickstart-cli.md │ │ ├── 02-claude-desktop-integration.md │ │ └── 03-authentication-and-api-keys.md │ ├── how-to-guides │ │ ├── 01-find-articles-and-cbioportal-data.md │ │ ├── 02-find-trials-with-nci-and-biothings.md │ │ ├── 03-get-comprehensive-variant-annotations.md │ │ ├── 04-predict-variant-effects-with-alphagenome.md │ │ ├── 05-logging-and-monitoring-with-bigquery.md │ │ └── 06-search-nci-organizations-and-interventions.md │ ├── index.md │ ├── policies.md │ ├── reference │ │ ├── architecture-diagrams.md │ │ ├── quick-architecture.md │ │ ├── quick-reference.md │ │ └── visual-architecture.md │ ├── robots.txt │ ├── stylesheets │ │ ├── announcement.css │ │ └── extra.css │ ├── troubleshooting.md │ ├── tutorials │ │ ├── biothings-prompts.md │ │ ├── claude-code-biomcp-alphagenome.md │ │ ├── nci-prompts.md │ │ ├── openfda-integration.md │ │ ├── openfda-prompts.md │ │ ├── pydantic-ai-integration.md │ │ └── remote-connection.md │ ├── user-guides │ │ ├── 01-command-line-interface.md │ │ ├── 02-mcp-tools-reference.md │ │ └── 03-integrating-with-ides-and-clients.md │ └── workflows │ └── all-workflows.md ├── example_scripts │ ├── mcp_integration.py │ └── python_sdk.py ├── glama.json ├── LICENSE ├── lzyank.toml ├── Makefile ├── mkdocs.yml ├── package-lock.json ├── package.json ├── pyproject.toml ├── README.md ├── scripts │ ├── check_docs_in_mkdocs.py │ ├── check_http_imports.py │ └── generate_endpoints_doc.py ├── smithery.yaml ├── src │ └── biomcp │ ├── 
__init__.py │ ├── __main__.py │ ├── articles │ │ ├── __init__.py │ │ ├── autocomplete.py │ │ ├── fetch.py │ │ ├── preprints.py │ │ ├── search_optimized.py │ │ ├── search.py │ │ └── unified.py │ ├── biomarkers │ │ ├── __init__.py │ │ └── search.py │ ├── cbioportal_helper.py │ ├── circuit_breaker.py │ ├── cli │ │ ├── __init__.py │ │ ├── articles.py │ │ ├── biomarkers.py │ │ ├── diseases.py │ │ ├── health.py │ │ ├── interventions.py │ │ ├── main.py │ │ ├── openfda.py │ │ ├── organizations.py │ │ ├── server.py │ │ ├── trials.py │ │ └── variants.py │ ├── connection_pool.py │ ├── constants.py │ ├── core.py │ ├── diseases │ │ ├── __init__.py │ │ ├── getter.py │ │ └── search.py │ ├── domain_handlers.py │ ├── drugs │ │ ├── __init__.py │ │ └── getter.py │ ├── exceptions.py │ ├── genes │ │ ├── __init__.py │ │ └── getter.py │ ├── http_client_simple.py │ ├── http_client.py │ ├── individual_tools.py │ ├── integrations │ │ ├── __init__.py │ │ ├── biothings_client.py │ │ └── cts_api.py │ ├── interventions │ │ ├── __init__.py │ │ ├── getter.py │ │ └── search.py │ ├── logging_filter.py │ ├── metrics_handler.py │ ├── metrics.py │ ├── openfda │ │ ├── __init__.py │ │ ├── adverse_events_helpers.py │ │ ├── adverse_events.py │ │ ├── cache.py │ │ ├── constants.py │ │ ├── device_events_helpers.py │ │ ├── device_events.py │ │ ├── drug_approvals.py │ │ ├── drug_labels_helpers.py │ │ ├── drug_labels.py │ │ ├── drug_recalls_helpers.py │ │ ├── drug_recalls.py │ │ ├── drug_shortages_detail_helpers.py │ │ ├── drug_shortages_helpers.py │ │ ├── drug_shortages.py │ │ ├── exceptions.py │ │ ├── input_validation.py │ │ ├── rate_limiter.py │ │ ├── utils.py │ │ └── validation.py │ ├── organizations │ │ ├── __init__.py │ │ ├── getter.py │ │ └── search.py │ ├── parameter_parser.py │ ├── prefetch.py │ ├── query_parser.py │ ├── query_router.py │ ├── rate_limiter.py │ ├── render.py │ ├── request_batcher.py │ ├── resources │ │ ├── __init__.py │ │ ├── getter.py │ │ ├── instructions.md │ │ └── researcher.md │ ├── retry.py │ ├── router_handlers.py │ ├── router.py │ ├── shared_context.py │ ├── thinking │ │ ├── __init__.py │ │ ├── sequential.py │ │ └── session.py │ ├── thinking_tool.py │ ├── thinking_tracker.py │ ├── trials │ │ ├── __init__.py │ │ ├── getter.py │ │ ├── nci_getter.py │ │ ├── nci_search.py │ │ └── search.py │ ├── utils │ │ ├── __init__.py │ │ ├── cancer_types_api.py │ │ ├── cbio_http_adapter.py │ │ ├── endpoint_registry.py │ │ ├── gene_validator.py │ │ ├── metrics.py │ │ ├── mutation_filter.py │ │ ├── query_utils.py │ │ ├── rate_limiter.py │ │ └── request_cache.py │ ├── variants │ │ ├── __init__.py │ │ ├── alphagenome.py │ │ ├── cancer_types.py │ │ ├── cbio_external_client.py │ │ ├── cbioportal_mutations.py │ │ ├── cbioportal_search_helpers.py │ │ ├── cbioportal_search.py │ │ ├── constants.py │ │ ├── external.py │ │ ├── filters.py │ │ ├── getter.py │ │ ├── links.py │ │ └── search.py │ └── workers │ ├── __init__.py │ ├── worker_entry_stytch.js │ ├── worker_entry.js │ └── worker.py ├── tests │ ├── bdd │ │ ├── cli_help │ │ │ ├── help.feature │ │ │ └── test_help.py │ │ ├── conftest.py │ │ ├── features │ │ │ └── alphagenome_integration.feature │ │ ├── fetch_articles │ │ │ ├── fetch.feature │ │ │ └── test_fetch.py │ │ ├── get_trials │ │ │ ├── get.feature │ │ │ └── test_get.py │ │ ├── get_variants │ │ │ ├── get.feature │ │ │ └── test_get.py │ │ ├── search_articles │ │ │ ├── autocomplete.feature │ │ │ ├── search.feature │ │ │ ├── test_autocomplete.py │ │ │ └── test_search.py │ │ ├── search_trials │ │ │ ├── search.feature │ │ │ └── 
test_search.py │ │ ├── search_variants │ │ │ ├── search.feature │ │ │ └── test_search.py │ │ └── steps │ │ └── test_alphagenome_steps.py │ ├── config │ │ └── test_smithery_config.py │ ├── conftest.py │ ├── data │ │ ├── ct_gov │ │ │ ├── clinical_trials_api_v2.yaml │ │ │ ├── trials_NCT04280705.json │ │ │ └── trials_NCT04280705.txt │ │ ├── myvariant │ │ │ ├── myvariant_api.yaml │ │ │ ├── myvariant_field_descriptions.csv │ │ │ ├── variants_full_braf_v600e.json │ │ │ ├── variants_full_braf_v600e.txt │ │ │ └── variants_part_braf_v600_multiple.json │ │ ├── openfda │ │ │ ├── drugsfda_detail.json │ │ │ ├── drugsfda_search.json │ │ │ ├── enforcement_detail.json │ │ │ └── enforcement_search.json │ │ └── pubtator │ │ ├── pubtator_autocomplete.json │ │ └── pubtator3_paper.txt │ ├── integration │ │ ├── test_openfda_integration.py │ │ ├── test_preprints_integration.py │ │ ├── test_simple.py │ │ └── test_variants_integration.py │ ├── tdd │ │ ├── articles │ │ │ ├── test_autocomplete.py │ │ │ ├── test_cbioportal_integration.py │ │ │ ├── test_fetch.py │ │ │ ├── test_preprints.py │ │ │ ├── test_search.py │ │ │ └── test_unified.py │ │ ├── conftest.py │ │ ├── drugs │ │ │ ├── __init__.py │ │ │ └── test_drug_getter.py │ │ ├── openfda │ │ │ ├── __init__.py │ │ │ ├── test_adverse_events.py │ │ │ ├── test_device_events.py │ │ │ ├── test_drug_approvals.py │ │ │ ├── test_drug_labels.py │ │ │ ├── test_drug_recalls.py │ │ │ ├── test_drug_shortages.py │ │ │ └── test_security.py │ │ ├── test_biothings_integration_real.py │ │ ├── test_biothings_integration.py │ │ ├── test_circuit_breaker.py │ │ ├── test_concurrent_requests.py │ │ ├── test_connection_pool.py │ │ ├── test_domain_handlers.py │ │ ├── test_drug_approvals.py │ │ ├── test_drug_recalls.py │ │ ├── test_drug_shortages.py │ │ ├── test_endpoint_documentation.py │ │ ├── test_error_scenarios.py │ │ ├── test_europe_pmc_fetch.py │ │ ├── test_mcp_integration.py │ │ ├── test_mcp_tools.py │ │ ├── test_metrics.py │ │ ├── test_nci_integration.py │ │ ├── test_nci_mcp_tools.py │ │ ├── test_network_policies.py │ │ ├── test_offline_mode.py │ │ ├── test_openfda_unified.py │ │ ├── test_pten_r173_search.py │ │ ├── test_render.py │ │ ├── test_request_batcher.py.disabled │ │ ├── test_retry.py │ │ ├── test_router.py │ │ ├── test_shared_context.py.disabled │ │ ├── test_unified_biothings.py │ │ ├── thinking │ │ │ ├── __init__.py │ │ │ └── test_sequential.py │ │ ├── trials │ │ │ ├── test_backward_compatibility.py │ │ │ ├── test_getter.py │ │ │ └── test_search.py │ │ ├── utils │ │ │ ├── test_gene_validator.py │ │ │ ├── test_mutation_filter.py │ │ │ ├── test_rate_limiter.py │ │ │ └── test_request_cache.py │ │ ├── variants │ │ │ ├── constants.py │ │ │ ├── test_alphagenome_api_key.py │ │ │ ├── test_alphagenome_comprehensive.py │ │ │ ├── test_alphagenome.py │ │ │ ├── test_cbioportal_mutations.py │ │ │ ├── test_cbioportal_search.py │ │ │ ├── test_external_integration.py │ │ │ ├── test_external.py │ │ │ ├── test_extract_gene_aa_change.py │ │ │ ├── test_filters.py │ │ │ ├── test_getter.py │ │ │ ├── test_links.py │ │ │ └── test_search.py │ │ └── workers │ │ └── test_worker_sanitization.js │ └── test_pydantic_ai_integration.py ├── THIRD_PARTY_ENDPOINTS.md ├── tox.ini ├── uv.lock └── wrangler.toml ``` # Files -------------------------------------------------------------------------------- /docs/how-to-guides/04-predict-variant-effects-with-alphagenome.md: -------------------------------------------------------------------------------- ```markdown # How to Predict Variant Effects with AlphaGenome This 
guide demonstrates how to use Google DeepMind's AlphaGenome to predict regulatory effects of genetic variants on gene expression, chromatin accessibility, and splicing. ## Overview AlphaGenome predicts how DNA variants affect: - **Gene Expression**: Log-fold changes in nearby genes - **Chromatin Accessibility**: ATAC-seq/DNase-seq signal changes - **Splicing**: Effects on splice sites and exon inclusion - **Regulatory Elements**: Impact on enhancers, promoters, and TFBS - **3D Chromatin**: Changes in chromatin interactions For technical details on the AlphaGenome integration, see the [AlphaGenome API Reference](../backend-services-reference/07-alphagenome.md). ## Setup and API Key ### Get Your API Key 1. Visit [AlphaGenome Portal](https://deepmind.google.com/science/alphagenome) 2. Register for non-commercial use 3. Receive API key via email For detailed setup instructions, see [Authentication and API Keys](../getting-started/03-authentication-and-api-keys.md#alphagenome). ### Configure API Key **Option 1: Environment Variable (Personal Use)** ```bash export ALPHAGENOME_API_KEY="your-key-here" ``` **Option 2: Per-Request (AI Assistants)** ``` "Predict effects of BRAF V600E. My AlphaGenome API key is YOUR_KEY_HERE" ``` **Option 3: Configuration File** ```python # .env file ALPHAGENOME_API_KEY=your-key-here ``` ### Install AlphaGenome (Optional) For local predictions: ```bash git clone https://github.com/google-deepmind/alphagenome.git cd alphagenome && pip install . ``` ## Basic Variant Prediction ### Simple Prediction Predict effects of BRAF V600E mutation: ```bash # CLI biomcp variant predict chr7 140753336 A T # Python result = await client.variants.predict( chromosome="chr7", position=140753336, reference="A", alternate="T" ) # MCP Tool result = await alphagenome_predictor( chromosome="chr7", position=140753336, reference="A", alternate="T" ) ``` ### Understanding Results ```python # Gene expression changes for gene in result.gene_expression: print(f"{gene.name}: {gene.log2_fold_change}") # Positive = increased expression # Negative = decreased expression # |value| > 1.0 = strong effect # Chromatin accessibility for region in result.chromatin: print(f"{region.type}: {region.change}") # Positive = more open chromatin # Negative = more closed chromatin # Splicing effects for splice in result.splicing: print(f"{splice.event}: {splice.delta_psi}") # PSI = Percent Spliced In # Positive = increased inclusion ``` ## Tissue-Specific Predictions ### Single Tissue Analysis Predict effects in specific tissues using UBERON terms: ```python # Breast tissue analysis result = await alphagenome_predictor( chromosome="chr17", position=41246481, reference="G", alternate="A", tissue_types=["UBERON:0000310"] # breast ) # Common tissue codes: # UBERON:0000310 - breast # UBERON:0002107 - liver # UBERON:0002367 - prostate # UBERON:0000955 - brain # UBERON:0002048 - lung # UBERON:0001155 - colon ``` ### Multi-Tissue Comparison Compare effects across tissues: ```python tissues = [ "UBERON:0000310", # breast "UBERON:0002107", # liver "UBERON:0002048" # lung ] results = {} for tissue in tissues: results[tissue] = await alphagenome_predictor( chromosome="chr17", position=41246481, reference="G", alternate="A", tissue_types=[tissue] ) # Compare gene expression across tissues for tissue, result in results.items(): print(f"\n{tissue}:") for gene in result.gene_expression[:3]: print(f" {gene.name}: {gene.log2_fold_change}") ``` ## Analysis Window Sizes ### Choosing the Right Interval Different interval sizes capture 
different regulatory effects:

```python
# Short-range (promoter effects)
result_2kb = await alphagenome_predictor(
    chromosome="chr7",
    position=140753336,
    reference="A",
    alternate="T",
    interval_size=2048  # 2kb
)

# Medium-range (enhancer-promoter)
result_128kb = await alphagenome_predictor(
    chromosome="chr7",
    position=140753336,
    reference="A",
    alternate="T",
    interval_size=131072  # 128kb (default)
)

# Long-range (TAD-level effects)
result_1mb = await alphagenome_predictor(
    chromosome="chr7",
    position=140753336,
    reference="A",
    alternate="T",
    interval_size=1048576  # 1Mb
)
```

**Interval Size Guide:**

- **2kb**: Promoter variants, TSS mutations
- **16kb**: Local regulatory elements
- **128kb**: Enhancer-promoter interactions (default)
- **512kb**: Long-range regulatory elements
- **1Mb**: TAD boundaries, super-enhancers

## Clinical Workflows

### Workflow 1: VUS (Variant of Unknown Significance) Analysis

```python
async def analyze_vus(chromosome: str, position: int, ref: str, alt: str):
    # Step 1: Think about the analysis
    await think(
        thought=f"Analyzing VUS at {chromosome}:{position} {ref}>{alt}",
        thoughtNumber=1
    )

    # Step 2: Get variant annotations
    variant_id = f"{chromosome}:g.{position}{ref}>{alt}"
    try:
        known_variant = await variant_getter(variant_id)
        if known_variant.clinical_significance:
            return f"Already classified: {known_variant.clinical_significance}"
    except Exception:
        pass  # Variant not in databases

    # Step 3: Predict regulatory effects
    prediction = await alphagenome_predictor(
        chromosome=chromosome,
        position=position,
        reference=ref,
        alternate=alt,
        interval_size=131072
    )

    # Step 4: Analyze impact
    high_impact_genes = [
        g for g in prediction.gene_expression
        if abs(g.log2_fold_change) > 1.0
    ]

    # Step 5: Search literature
    if high_impact_genes:
        gene_symbols = [g.name for g in high_impact_genes[:3]]
        articles = await article_searcher(
            genes=gene_symbols,
            keywords=["pathogenic", "disease", "mutation"]
        )

    return {
        "variant": f"{chromosome}:{position} {ref}>{alt}",
        "high_impact_genes": high_impact_genes,
        "regulatory_assessment": assess_regulatory_impact(prediction),
        "literature_support": len(articles) if high_impact_genes else 0
    }

def assess_regulatory_impact(prediction):
    """Classify regulatory impact severity"""
    max_expression_change = max(
        abs(g.log2_fold_change)
        for g in prediction.gene_expression
    ) if prediction.gene_expression else 0

    if max_expression_change > 2.0:
        return "HIGH - Strong regulatory effect"
    elif max_expression_change > 1.0:
        return "MODERATE - Significant regulatory effect"
    elif max_expression_change > 0.5:
        return "LOW - Mild regulatory effect"
    else:
        return "MINIMAL - No significant regulatory effect"
```

### Workflow 2: Non-coding Variant Prioritization

```python
async def prioritize_noncoding_variants(variants: list[dict], disease_genes: list[str]):
    """Rank non-coding variants by predicted impact on disease genes"""

    results = []
    for variant in variants:
        # Predict effects
        prediction = await alphagenome_predictor(
            chromosome=variant["chr"],
            position=variant["pos"],
            reference=variant["ref"],
            alternate=variant["alt"]
        )

        # Check impact on disease genes
        disease_impact = {}
        for gene in prediction.gene_expression:
            if gene.name in disease_genes:
                disease_impact[gene.name] = gene.log2_fold_change

        # Calculate priority score
        if disease_impact:
            max_impact = max(abs(v) for v in disease_impact.values())
            results.append({
                "variant": variant,
                "disease_genes_affected": disease_impact,
                "priority_score": max_impact,
                "chromatin_changes": len([c for c in prediction.chromatin if c.change > 0.5])
            })

    # Sort by priority
    results.sort(key=lambda x: x["priority_score"], reverse=True)
    return results

# Example usage
variants_to_test = [
    {"chr": "chr17", "pos": 41246000, "ref": "A", "alt": "G"},
    {"chr": "chr17", "pos": 41246500, "ref": "C", "alt": "T"},
    {"chr": "chr17", "pos": 41247000, "ref": "G", "alt": "A"}
]

breast_cancer_genes = ["BRCA1", "BRCA2", "TP53", "PTEN"]
prioritized = await prioritize_noncoding_variants(variants_to_test, breast_cancer_genes)
```

### Workflow 3: Splicing Analysis

```python
async def analyze_splicing_variant(gene: str, exon: int, variant_pos: int, ref: str, alt: str):
    """Analyze potential splicing effects of a variant"""

    # Get gene information
    gene_info = await gene_getter(gene)
    chromosome = f"chr{gene_info.genomic_location.chr}"

    # Predict splicing effects
    prediction = await alphagenome_predictor(
        chromosome=chromosome,
        position=variant_pos,
        reference=ref,
        alternate=alt,
        interval_size=16384  # Smaller window for splicing
    )

    # Analyze splicing predictions
    splicing_effects = []
    for event in prediction.splicing:
        if abs(event.delta_psi) > 0.1:  # 10% change in splicing
            splicing_effects.append({
                "type": event.event_type,
                "change": event.delta_psi,
                "affected_exon": event.exon,
                "interpretation": interpret_splicing(event)
            })

    # Search for similar splicing variants
    articles = await article_searcher(
        genes=[gene],
        keywords=[f"exon {exon}", "splicing", "splice site"]
    )

    return {
        "variant": f"{gene} exon {exon} {ref}>{alt}",
        "splicing_effects": splicing_effects,
        "likely_consequence": predict_consequence(splicing_effects),
        "literature_precedent": len(articles)
    }

def interpret_splicing(event):
    """Interpret splicing changes"""
    if event.delta_psi > 0.5:
        return "Strong increase in exon inclusion"
    elif event.delta_psi > 0.1:
        return "Moderate increase in exon inclusion"
    elif event.delta_psi < -0.5:
        return "Strong exon skipping"
    elif event.delta_psi < -0.1:
        return "Moderate exon skipping"
    else:
        return "Minimal splicing change"

def predict_consequence(splicing_effects):
    """Summarize the most likely consequence (simple illustrative heuristic)"""
    if not splicing_effects:
        return "No significant splicing change predicted"
    strongest = max(splicing_effects, key=lambda e: abs(e["change"]))
    return strongest["interpretation"]
```

## Research Applications

### Enhancer Variant Analysis

```python
async def analyze_enhancer_variant(chr: str, pos: int, ref: str, alt: str, target_gene: str):
    """Analyze variant in potential enhancer region"""

    # Use larger window to capture enhancer-promoter interactions
    prediction = await alphagenome_predictor(
        chromosome=chr,
        position=pos,
        reference=ref,
        alternate=alt,
        interval_size=524288  # 512kb
    )

    # Find target gene effect
    target_effect = None
    for gene in prediction.gene_expression:
        if gene.name == target_gene:
            target_effect = gene.log2_fold_change
            break

    # Analyze chromatin changes
    chromatin_opening = sum(
        1 for c in prediction.chromatin
        if c.change > 0 and c.type == "enhancer"
    )

    return {
        "variant_location": f"{chr}:{pos}",
        "target_gene": target_gene,
        "expression_change": target_effect,
        "enhancer_activity": "increased" if chromatin_opening > 0 else "decreased",
        "likely_enhancer": abs(target_effect or 0) > 0.5 and chromatin_opening > 0
    }
```

### Pharmacogenomic Predictions

```python
async def predict_drug_response_variant(drug_target: str, variant: dict):
    """Predict how variant affects drug target expression"""

    # Get drug information
    drug_info = await drug_getter(drug_target)
    target_genes = drug_info.targets

    # Predict variant effects
    prediction = await alphagenome_predictor(
        chromosome=variant["chr"],
        position=variant["pos"],
        reference=variant["ref"],
        alternate=variant["alt"],
        tissue_types=["UBERON:0002107"]  # liver for drug metabolism
    )

    # Check effects on drug targets
    target_effects = {}
    for gene in prediction.gene_expression:
        if gene.name in target_genes:
target_effects[gene.name] = gene.log2_fold_change # Interpret results if any(abs(effect) > 1.0 for effect in target_effects.values()): response = "Likely altered drug response" elif any(abs(effect) > 0.5 for effect in target_effects.values()): response = "Possible altered drug response" else: response = "Unlikely to affect drug response" return { "drug": drug_target, "variant": variant, "target_effects": target_effects, "prediction": response, "recommendation": "Consider dose adjustment" if "altered" in response else "Standard dosing" } ``` ## Best Practices ### 1. Validate Input Coordinates ```python # Always use "chr" prefix chromosome = "chr7" # ✅ Correct # chromosome = "7" # ❌ Wrong # Use 1-based positions (not 0-based) position = 140753336 # ✅ 1-based ``` ### 2. Handle API Errors Gracefully ```python try: result = await alphagenome_predictor(...) except Exception as e: if "API key" in str(e): print("Please provide AlphaGenome API key") elif "Invalid sequence" in str(e): print("Check chromosome and position") else: print(f"Prediction failed: {e}") ``` ### 3. Combine with Other Tools ```python # Complete variant analysis pipeline async def comprehensive_variant_analysis(variant_id: str): # 1. Get known annotations known = await variant_getter(variant_id) # 2. Predict regulatory effects prediction = await alphagenome_predictor( chromosome=f"chr{known.chr}", position=known.pos, reference=known.ref, alternate=known.alt ) # 3. Search literature articles = await article_searcher( variants=[variant_id], genes=[known.gene.symbol] ) # 4. Find relevant trials trials = await trial_searcher( other_terms=[known.gene.symbol, "mutation"] ) return { "annotations": known, "predictions": prediction, "literature": articles, "trials": trials } ``` ### 4. Interpret Results Appropriately ```python def interpret_expression_change(log2_fc): """Convert log2 fold change to interpretation""" if log2_fc > 2.0: return "Very strong increase (>4x)" elif log2_fc > 1.0: return "Strong increase (2-4x)" elif log2_fc > 0.5: return "Moderate increase (1.4-2x)" elif log2_fc < -2.0: return "Very strong decrease (<0.25x)" elif log2_fc < -1.0: return "Strong decrease (0.25-0.5x)" elif log2_fc < -0.5: return "Moderate decrease (0.5-0.7x)" else: return "Minimal change" ``` ## Limitations and Considerations ### Technical Limitations - **Human only**: GRCh38 reference genome - **SNVs only**: No indels or structural variants - **Exact coordinates**: Must have precise genomic position - **Sequence context**: Requires reference sequence match ### Interpretation Caveats - **Predictions not certainties**: Validate with functional studies - **Context matters**: Cell type, developmental stage affect outcomes - **Indirect effects**: May miss complex regulatory cascades - **Population variation**: Individual genetic background influences ## Troubleshooting ### Common Issues **"API key required"** - Set environment variable or provide per-request - Check key validity at AlphaGenome portal **"Invalid sequence length"** - Verify chromosome format (use "chr" prefix) - Check position is within chromosome bounds - Ensure ref/alt are single nucleotides **"No results returned"** - May be no genes in analysis window - Try larger interval size - Check if variant is in gene desert **Installation issues** - Ensure Python 3.10+ - Try `pip install --upgrade pip` first - Check for conflicting protobuf versions ## Next Steps - Explore [comprehensive variant annotations](03-get-comprehensive-variant-annotations.md) - Learn about [article 
searches](01-find-articles-and-cbioportal-data.md) for variants - Set up [logging and monitoring](05-logging-and-monitoring-with-bigquery.md) ``` -------------------------------------------------------------------------------- /docs/how-to-guides/06-search-nci-organizations-and-interventions.md: -------------------------------------------------------------------------------- ```markdown # How to Search NCI Organizations and Interventions This guide demonstrates how to use BioMCP's NCI-specific tools to search for cancer research organizations, interventions (drugs, devices, procedures), and biomarkers. ## Prerequisites All NCI tools require an API key from [api.cancer.gov](https://api.cancer.gov): ```bash # Set as environment variable export NCI_API_KEY="your-key-here" # Or provide per-request in your prompts "Find cancer centers in Boston, my NCI API key is YOUR_KEY" ``` ## Organization Search and Lookup ### Understanding Organization Search The NCI Organization database contains: - Cancer research centers and hospitals - Clinical trial sponsors - Academic institutions - Pharmaceutical companies - Government facilities ### Basic Organization Search Find organizations by name: ```bash # CLI biomcp organization search --name "MD Anderson" --api-key YOUR_KEY # Python orgs = await nci_organization_searcher( name="MD Anderson", api_key="your-key" ) # MCP/AI Assistant "Search for MD Anderson Cancer Center, my NCI API key is YOUR_KEY" ``` ### Location-Based Search **CRITICAL**: Always use city AND state together to avoid Elasticsearch errors! ```python # ✅ CORRECT - City and state together orgs = await nci_organization_searcher( city="Houston", state="TX", api_key="your-key" ) # ❌ WRONG - Will cause API error orgs = await nci_organization_searcher( city="Houston", # Missing state! api_key="your-key" ) # ❌ WRONG - Will cause API error orgs = await nci_organization_searcher( state="TX", # Missing city! 
api_key="your-key" ) ``` ### Organization Types Search by organization type: ```python # Find academic cancer centers academic_centers = await nci_organization_searcher( organization_type="Academic", api_key="your-key" ) # Find pharmaceutical companies pharma_companies = await nci_organization_searcher( organization_type="Industry", api_key="your-key" ) # Find government research facilities gov_facilities = await nci_organization_searcher( organization_type="Government", api_key="your-key" ) ``` Valid organization types: - `Academic` - Universities and medical schools - `Industry` - Pharmaceutical and biotech companies - `Government` - NIH, FDA, VA hospitals - `Community` - Community hospitals and clinics - `Network` - Research networks and consortiums - `Other` - Other organization types ### Getting Organization Details Retrieve complete information about a specific organization: ```python # Get organization by ID org_details = await nci_organization_getter( organization_id="NCI-2011-03337", api_key="your-key" ) # Returns: # - Full name and aliases # - Contact information # - Address and location # - Associated clinical trials # - Organization type and status ``` ### Practical Organization Workflows #### Find Regional Cancer Centers ```python async def find_cancer_centers_by_region(state: str, cities: list[str]): """Find all cancer centers in specific cities within a state""" all_centers = [] for city in cities: # ALWAYS use city + state together centers = await nci_organization_searcher( city=city, state=state, organization_type="Academic", api_key=os.getenv("NCI_API_KEY") ) all_centers.extend(centers) # Remove duplicates unique_centers = {org['id']: org for org in all_centers} return list(unique_centers.values()) # Example: Find cancer centers in major Texas cities texas_centers = await find_cancer_centers_by_region( state="TX", cities=["Houston", "Dallas", "San Antonio", "Austin"] ) ``` #### Find Trial Sponsors ```python async def find_trial_sponsors_by_type(org_type: str, name_filter: str = None): """Find organizations sponsoring trials""" # Search organizations orgs = await nci_organization_searcher( name=name_filter, organization_type=org_type, api_key=os.getenv("NCI_API_KEY") ) # For each org, get details including trial count sponsors = [] for org in orgs[:10]: # Limit to avoid rate limits details = await nci_organization_getter( organization_id=org['id'], api_key=os.getenv("NCI_API_KEY") ) if details.get('trial_count', 0) > 0: sponsors.append(details) return sorted(sponsors, key=lambda x: x.get('trial_count', 0), reverse=True) # Find pharmaceutical companies with active trials pharma_sponsors = await find_trial_sponsors_by_type("Industry") ``` ## Intervention Search and Lookup ### Understanding Interventions Interventions in clinical trials include: - **Drugs**: Chemotherapy, targeted therapy, immunotherapy - **Devices**: Medical devices, diagnostic tools - **Procedures**: Surgical techniques, radiation protocols - **Biologicals**: Cell therapies, vaccines, antibodies - **Behavioral**: Lifestyle interventions, counseling - **Other**: Dietary supplements, alternative therapies ### Drug Search Find specific drugs or drug classes: ```bash # CLI - Find a specific drug biomcp intervention search --name pembrolizumab --type Drug --api-key YOUR_KEY # CLI - Find drug class biomcp intervention search --name "PD-1 inhibitor" --type Drug --api-key YOUR_KEY ``` ```python # Python - Search with synonyms drugs = await nci_intervention_searcher( name="pembrolizumab", intervention_type="Drug", 
synonyms=True, # Include Keytruda, MK-3475, etc. api_key="your-key" ) # Search for drug combinations combos = await nci_intervention_searcher( name="nivolumab AND ipilimumab", intervention_type="Drug", api_key="your-key" ) ``` ### Device and Procedure Search ```python # Find medical devices devices = await nci_intervention_searcher( intervention_type="Device", name="robot", # Surgical robots api_key="your-key" ) # Find procedures procedures = await nci_intervention_searcher( intervention_type="Procedure", name="minimally invasive", api_key="your-key" ) # Find radiation protocols radiation = await nci_intervention_searcher( intervention_type="Radiation", name="proton beam", api_key="your-key" ) ``` ### Getting Intervention Details ```python # Get complete intervention information intervention = await nci_intervention_getter( intervention_id="INT123456", api_key="your-key" ) # Returns: # - Official name and synonyms # - Intervention type and subtype # - Mechanism of action (for drugs) # - FDA approval status # - Associated clinical trials # - Manufacturer information ``` ### Practical Intervention Workflows #### Drug Development Pipeline ```python async def analyze_drug_pipeline(drug_target: str): """Analyze drugs in development for a specific target""" # Search for drugs targeting specific pathway drugs = await nci_intervention_searcher( name=drug_target, intervention_type="Drug", api_key=os.getenv("NCI_API_KEY") ) pipeline = { "preclinical": [], "phase1": [], "phase2": [], "phase3": [], "approved": [] } for drug in drugs: # Get detailed information details = await nci_intervention_getter( intervention_id=drug['id'], api_key=os.getenv("NCI_API_KEY") ) # Categorize by development stage if details.get('fda_approved'): pipeline['approved'].append(details) else: # Check associated trials for phase trial_phases = details.get('trial_phases', []) if 'PHASE3' in trial_phases: pipeline['phase3'].append(details) elif 'PHASE2' in trial_phases: pipeline['phase2'].append(details) elif 'PHASE1' in trial_phases: pipeline['phase1'].append(details) else: pipeline['preclinical'].append(details) return pipeline # Analyze PD-1/PD-L1 inhibitor pipeline pd1_pipeline = await analyze_drug_pipeline("PD-1 inhibitor") ``` #### Compare Similar Interventions ```python async def compare_interventions(intervention_names: list[str]): """Compare multiple interventions side by side""" comparisons = [] for name in intervention_names: # Search for intervention results = await nci_intervention_searcher( name=name, synonyms=True, api_key=os.getenv("NCI_API_KEY") ) if results: # Get detailed info for first match details = await nci_intervention_getter( intervention_id=results[0]['id'], api_key=os.getenv("NCI_API_KEY") ) comparisons.append({ "name": details['name'], "type": details['type'], "synonyms": details.get('synonyms', []), "fda_approved": details.get('fda_approved', False), "trial_count": len(details.get('trials', [])), "mechanism": details.get('mechanism_of_action', 'Not specified') }) return comparisons # Compare checkpoint inhibitors comparison = await compare_interventions([ "pembrolizumab", "nivolumab", "atezolizumab", "durvalumab" ]) ``` ## Biomarker Search ### Understanding Biomarker Types The NCI API supports two biomarker types: - `reference_gene` - Gene-based biomarkers (e.g., EGFR, BRAF) - `branch` - Pathway/branch biomarkers **Note**: You cannot search by gene symbol directly; use the name parameter. 
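For example, a gene-based biomarker such as EGFR is located by passing the symbol as the `name` value. The sketch below follows the call signatures used elsewhere in this guide; the commented-out `gene` parameter is hypothetical and does not exist in the API:

```python
# ❌ Hypothetical - the API has no dedicated gene parameter
# await nci_biomarker_searcher(gene="EGFR", api_key="your-key")

# ✅ Pass the gene symbol via the name parameter instead
egfr_biomarkers = await nci_biomarker_searcher(
    name="EGFR",
    biomarker_type="reference_gene",  # gene-based biomarkers
    api_key="your-key"
)
```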
### Basic Biomarker Search ```python # Search for PD-L1 biomarkers pdl1_biomarkers = await nci_biomarker_searcher( name="PD-L1", api_key="your-key" ) # Search for specific biomarker type gene_biomarkers = await nci_biomarker_searcher( biomarker_type="reference_gene", api_key="your-key" ) ``` ### Biomarker Analysis Workflow ```python async def analyze_trial_biomarkers(disease: str): """Find biomarkers used in trials for a disease""" # Get all biomarkers all_biomarkers = await nci_biomarker_searcher( biomarker_type="reference_gene", api_key=os.getenv("NCI_API_KEY") ) # Filter by disease association disease_biomarkers = [] for biomarker in all_biomarkers: if disease.lower() in str(biomarker).lower(): disease_biomarkers.append(biomarker) # Group by frequency biomarker_counts = {} for bio in disease_biomarkers: name = bio.get('name', 'Unknown') biomarker_counts[name] = biomarker_counts.get(name, 0) + 1 # Sort by frequency return sorted( biomarker_counts.items(), key=lambda x: x[1], reverse=True ) # Find most common biomarkers in lung cancer trials lung_biomarkers = await analyze_trial_biomarkers("lung cancer") ``` ## Combined Workflows ### Regional Drug Development Analysis ```python async def analyze_regional_drug_development( state: str, cities: list[str], drug_class: str ): """Analyze drug development in a specific region""" # Step 1: Find organizations in the region organizations = [] for city in cities: orgs = await nci_organization_searcher( city=city, state=state, organization_type="Industry", api_key=os.getenv("NCI_API_KEY") ) organizations.extend(orgs) # Step 2: Find drugs of interest drugs = await nci_intervention_searcher( name=drug_class, intervention_type="Drug", api_key=os.getenv("NCI_API_KEY") ) # Step 3: Cross-reference trials regional_development = [] for drug in drugs[:10]: # Limit for performance drug_details = await nci_intervention_getter( intervention_id=drug['id'], api_key=os.getenv("NCI_API_KEY") ) # Check if any trials are sponsored by regional orgs for trial in drug_details.get('trials', []): for org in organizations: if org['id'] in str(trial): regional_development.append({ 'drug': drug_details['name'], 'organization': org['name'], 'location': f"{org.get('city', '')}, {org.get('state', '')}", 'trial': trial }) return regional_development # Analyze immunotherapy development in California ca_immuno = await analyze_regional_drug_development( state="CA", cities=["San Francisco", "San Diego", "Los Angeles"], drug_class="immunotherapy" ) ``` ### Organization to Intervention Pipeline ```python async def org_to_intervention_pipeline(org_name: str): """Trace from organization to their interventions""" # Find organization orgs = await nci_organization_searcher( name=org_name, api_key=os.getenv("NCI_API_KEY") ) if not orgs: return None # Get organization details org_details = await nci_organization_getter( organization_id=orgs[0]['id'], api_key=os.getenv("NCI_API_KEY") ) # Get their trials org_trials = org_details.get('trials', []) # Extract unique interventions interventions = set() for trial_id in org_trials[:20]: # Sample trials trial = await trial_getter( nct_id=trial_id, source="nci", api_key=os.getenv("NCI_API_KEY") ) if trial.get('interventions'): interventions.update(trial['interventions']) # Get details for each intervention intervention_details = [] for intervention_name in interventions: results = await nci_intervention_searcher( name=intervention_name, api_key=os.getenv("NCI_API_KEY") ) if results: intervention_details.append(results[0]) return { 'organization': 
org_details,
        'trial_count': len(org_trials),
        'interventions': intervention_details
    }

# Analyze Genentech's intervention portfolio
genentech_portfolio = await org_to_intervention_pipeline("Genentech")
```

## Best Practices

### 1. Always Use City + State Together

```python
# ✅ GOOD - Prevents API errors
await nci_organization_searcher(city="Boston", state="MA")

# ❌ BAD - Will cause Elasticsearch error
await nci_organization_searcher(city="Boston")
```

### 2. Handle Rate Limits

```python
import asyncio

async def search_with_rate_limit(searches: list):
    """Execute searches with rate limiting"""
    results = []
    for search in searches:
        result = await search()
        results.append(result)
        # Add delay to respect rate limits
        await asyncio.sleep(0.1)  # At most 10 requests per second
    return results
```

### 3. Use Pagination for Large Results

```python
async def get_all_organizations(org_type: str):
    """Get all organizations of a type using pagination"""
    all_orgs = []
    page = 1

    while True:
        orgs = await nci_organization_searcher(
            organization_type=org_type,
            page=page,
            page_size=100,  # Maximum allowed
            api_key=os.getenv("NCI_API_KEY")
        )

        if not orgs:
            break

        all_orgs.extend(orgs)
        page += 1

        # Note: Total count may not be available
        if len(orgs) < 100:
            break

    return all_orgs
```

### 4. Cache Results

```python
# functools.lru_cache does not work with async functions (it would cache
# the coroutine object, not the result), so use a simple manual cache.
_org_cache: dict = {}

async def cached_org_search(city: str, state: str, org_type: str):
    """Cache organization searches to reduce API calls"""
    key = (city, state, org_type)
    if key not in _org_cache:
        _org_cache[key] = await nci_organization_searcher(
            city=city,
            state=state,
            organization_type=org_type,
            api_key=os.getenv("NCI_API_KEY")
        )
    return _org_cache[key]
```

## Troubleshooting

### Common Errors and Solutions

1. **"Search Too Broad" Error**

   - Always use city + state together for location searches
   - Add more specific filters (name, type)
   - Reduce page_size parameter

2. **"NCI API key required"**

   - Set NCI_API_KEY environment variable
   - Or provide api_key parameter in function calls
   - Or include in prompt: "my NCI API key is YOUR_KEY"

3. **No Results Found**

   - Check spelling of organization/drug names
   - Try partial name matches
   - Remove filters and broaden search
   - Enable synonyms for intervention searches

4. 
**Rate Limit Exceeded**

   - Add delays between requests
   - Reduce concurrent requests
   - Cache frequently accessed data
   - Consider upgrading API key tier

### Debugging Tips

```python
# Enable debug logging
import logging
logging.basicConfig(level=logging.DEBUG)

# Test API key
async def test_nci_connection():
    try:
        result = await nci_organization_searcher(
            name="Mayo",
            api_key=os.getenv("NCI_API_KEY")
        )
        print(f"✅ API key valid, found {len(result)} results")
    except Exception as e:
        print(f"❌ API key error: {e}")

# Check specific organization exists
async def verify_org_id(org_id: str):
    try:
        org = await nci_organization_getter(
            organization_id=org_id,
            api_key=os.getenv("NCI_API_KEY")
        )
        print(f"✅ Organization found: {org['name']}")
    except Exception:
        print(f"❌ Organization ID not found: {org_id}")
```

## Next Steps

- Review [NCI prompts examples](../tutorials/nci-prompts.md) for AI assistant usage
- Explore [trial search with biomarkers](02-find-trials-with-nci-and-biothings.md)
- Learn about [variant effect prediction](04-predict-variant-effects-with-alphagenome.md)
- Set up [API authentication](../getting-started/03-authentication-and-api-keys.md)
```

--------------------------------------------------------------------------------
/tests/tdd/test_router.py:
--------------------------------------------------------------------------------

```python
"""Comprehensive tests for the unified router module."""

import json
from unittest.mock import patch

import pytest

from biomcp.exceptions import (
    InvalidDomainError,
    InvalidParameterError,
    QueryParsingError,
    SearchExecutionError,
)
from biomcp.router import fetch, format_results, search


class TestFormatResults:
    """Test the format_results function."""

    def test_format_article_results(self):
        """Test formatting article results."""
        results = [
            {
                "pmid": "12345",
                "title": "Test Article",
                "abstract": "This is a test abstract",
                # Note: url in input is ignored, always generates PubMed URL
            }
        ]

        # Mock thinking tracker to prevent reminder
        with patch("biomcp.router.get_thinking_reminder", return_value=""):
            formatted = format_results(results, "article", 1, 10, 1)

        assert "results" in formatted
        assert len(formatted["results"]) == 1
        result = formatted["results"][0]
        assert result["id"] == "12345"
        assert result["title"] == "Test Article"
        assert "test abstract" in result["text"]
        assert result["url"] == "https://pubmed.ncbi.nlm.nih.gov/12345/"

    def test_format_trial_results_api_v2(self):
        """Test formatting trial results with API v2 structure."""
        results = [
            {
                "protocolSection": {
                    "identificationModule": {
                        "nctId": "NCT12345",
                        "briefTitle": "Test Trial",
                    },
                    "descriptionModule": {
                        "briefSummary": "This is a test trial summary"
                    },
                    "statusModule": {"overallStatus": "RECRUITING"},
                    "designModule": {"phases": ["PHASE3"]},
                }
            }
        ]

        # Mock thinking tracker to prevent reminder
        with patch("biomcp.router.get_thinking_reminder", return_value=""):
            formatted = format_results(results, "trial", 1, 10, 1)

        assert "results" in formatted
        assert len(formatted["results"]) == 1
        result = formatted["results"][0]
        assert result["id"] == "NCT12345"
        assert result["title"] == "Test Trial"
        assert "test trial summary" in result["text"]
        assert "NCT12345" in result["url"]

    def test_format_trial_results_legacy(self):
        """Test formatting trial results with legacy structure."""
        results = [
            {
                "NCT Number": "NCT67890",
                "Study Title": "Legacy Trial",
                "Brief Summary": "Legacy trial summary",
                "Study Status": "COMPLETED",
                "Phases": "Phase 2",
            }
        ]

        # Mock thinking tracker to prevent reminder
        with 
patch("biomcp.router.get_thinking_reminder", return_value=""): formatted = format_results(results, "trial", 1, 10, 1) assert "results" in formatted assert len(formatted["results"]) == 1 result = formatted["results"][0] assert result["id"] == "NCT67890" assert result["title"] == "Legacy Trial" assert "Legacy trial summary" in result["text"] def test_format_variant_results(self): """Test formatting variant results.""" results = [ { "_id": "chr7:g.140453136A>T", "dbsnp": {"rsid": "rs121913529"}, "dbnsfp": {"genename": "BRAF"}, "clinvar": {"rcv": {"clinical_significance": "Pathogenic"}}, } ] # Mock thinking tracker to prevent reminder with patch("biomcp.router.get_thinking_reminder", return_value=""): formatted = format_results(results, "variant", 1, 10, 1) assert "results" in formatted assert len(formatted["results"]) == 1 result = formatted["results"][0] assert result["id"] == "chr7:g.140453136A>T" assert "BRAF" in result["title"] assert "Pathogenic" in result["text"] assert "rs121913529" in result["url"] def test_format_results_invalid_domain(self): """Test format_results with invalid domain.""" with pytest.raises(InvalidDomainError) as exc_info: format_results([], "invalid_domain", 1, 10, 0) assert "Unknown domain: invalid_domain" in str(exc_info.value) def test_format_results_malformed_data(self): """Test format_results handles malformed data gracefully.""" results = [ {"title": "Good Article", "pmid": "123"}, None, # Malformed - will be skipped { "invalid": "data" }, # Missing required fields but won't fail (treated as preprint) ] # Mock thinking tracker to prevent reminder with patch("biomcp.router.get_thinking_reminder", return_value=""): formatted = format_results(results, "article", 1, 10, 3) # Should skip None but include the third (treated as preprint with empty fields) assert len(formatted["results"]) == 2 assert formatted["results"][0]["id"] == "123" assert formatted["results"][1]["id"] == "" # Empty ID for invalid data @pytest.mark.asyncio class TestSearchFunction: """Test the unified search function.""" async def test_search_article_domain(self): """Test search with article domain.""" mock_result = json.dumps([ {"pmid": "123", "title": "Test", "abstract": "Abstract"} ]) with patch( "biomcp.articles.unified.search_articles_unified" ) as mock_search: mock_search.return_value = mock_result # Mock thinking tracker to prevent reminder with patch("biomcp.router.get_thinking_reminder", return_value=""): result = await search( query="", domain="article", genes="BRAF", diseases=["cancer"], page_size=10, ) assert "results" in result assert len(result["results"]) == 1 assert result["results"][0]["id"] == "123" async def test_search_trial_domain(self): """Test search with trial domain.""" mock_result = json.dumps({ "studies": [ { "protocolSection": { "identificationModule": {"nctId": "NCT123"}, } } ] }) with patch("biomcp.trials.search.search_trials") as mock_search: mock_search.return_value = mock_result # Mock thinking tracker to prevent reminder with patch("biomcp.router.get_thinking_reminder", return_value=""): result = await search( query="", domain="trial", conditions=["cancer"], phase="Phase 3", page_size=20, ) assert "results" in result mock_search.assert_called_once() async def test_search_variant_domain(self): """Test search with variant domain.""" mock_result = json.dumps([ {"_id": "rs123", "gene": {"symbol": "BRAF"}} ]) with patch("biomcp.variants.search.search_variants") as mock_search: mock_search.return_value = mock_result # Mock thinking tracker to prevent reminder with 
patch("biomcp.router.get_thinking_reminder", return_value=""): result = await search( query="", domain="variant", genes="BRAF", significance="pathogenic", page_size=10, ) assert "results" in result assert len(result["results"]) == 1 async def test_search_unified_query(self): """Test search with unified query language.""" with patch("biomcp.router._unified_search") as mock_unified: mock_unified.return_value = { "results": [{"id": "1", "title": "Test"}] } result = await search( query="gene:BRAF AND disease:cancer", max_results_per_domain=20, ) assert "results" in result mock_unified.assert_called_once_with( query="gene:BRAF AND disease:cancer", max_results_per_domain=20, domains=None, explain_query=False, ) async def test_search_no_domain_or_query(self): """Test search without domain or query raises error.""" with pytest.raises(InvalidParameterError) as exc_info: await search(query="") assert "query or domain" in str(exc_info.value) async def test_search_invalid_domain(self): """Test search with invalid domain.""" with pytest.raises(InvalidDomainError): await search(query="", domain="invalid_domain") async def test_search_get_schema(self): """Test search with get_schema flag.""" result = await search(query="", get_schema=True) assert "domains" in result assert "cross_domain_fields" in result assert "domain_fields" in result assert isinstance(result["cross_domain_fields"], dict) async def test_search_pagination_validation(self): """Test search with invalid pagination parameters.""" with pytest.raises(InvalidParameterError) as exc_info: await search( query="", domain="article", page=0, # Invalid - must be >= 1 page_size=10, ) assert "page" in str(exc_info.value) async def test_search_parameter_parsing(self): """Test parameter parsing for list inputs.""" mock_result = json.dumps([]) with patch( "biomcp.articles.unified.search_articles_unified" ) as mock_search: mock_search.return_value = mock_result # Test with JSON array string await search( query="", domain="article", genes='["BRAF", "KRAS"]', diseases="cancer,melanoma", # Comma-separated ) # Check the request was parsed correctly call_args = mock_search.call_args[0][0] assert call_args.genes == ["BRAF", "KRAS"] assert call_args.diseases == ["cancer", "melanoma"] @pytest.mark.asyncio class TestFetchFunction: """Test the unified fetch function.""" async def test_fetch_article(self): """Test fetching article details.""" mock_result = json.dumps([ { "pmid": 12345, "title": "Test Article", "abstract": "Full abstract", "full_text": "Full text content", } ]) with patch("biomcp.articles.fetch.fetch_articles") as mock_fetch: mock_fetch.return_value = mock_result result = await fetch( domain="article", id="12345", ) assert result["id"] == "12345" assert result["title"] == "Test Article" assert result["text"] == "Full text content" assert "metadata" in result async def test_fetch_article_invalid_pmid(self): """Test fetching article with invalid identifier.""" result = await fetch(domain="article", id="not_a_number") # Should return an error since "not_a_number" is neither a valid PMID nor DOI assert "error" in result assert "Invalid identifier format" in result["error"] assert "not_a_number" in result["error"] async def test_fetch_trial_all_sections(self): """Test fetching trial with all sections.""" mock_protocol = json.dumps({ "title": "Test Trial", "nct_id": "NCT123", "brief_summary": "Summary", }) mock_locations = json.dumps({"locations": [{"city": "Boston"}]}) mock_outcomes = json.dumps({ "outcomes": {"primary_outcomes": ["Outcome1"]} }) 
mock_references = json.dumps({"references": [{"pmid": "456"}]}) with ( patch("biomcp.trials.getter._trial_protocol") as mock_p, patch("biomcp.trials.getter._trial_locations") as mock_l, patch("biomcp.trials.getter._trial_outcomes") as mock_o, patch("biomcp.trials.getter._trial_references") as mock_r, ): mock_p.return_value = mock_protocol mock_l.return_value = mock_locations mock_o.return_value = mock_outcomes mock_r.return_value = mock_references result = await fetch(domain="trial", id="NCT123", detail="all") assert result["id"] == "NCT123" assert "metadata" in result assert "locations" in result["metadata"] assert "outcomes" in result["metadata"] assert "references" in result["metadata"] async def test_fetch_trial_invalid_detail(self): """Test fetching trial with invalid detail parameter.""" with pytest.raises(InvalidParameterError) as exc_info: await fetch( domain="trial", id="NCT123", detail="invalid_section", ) assert "one of:" in str(exc_info.value) async def test_fetch_variant(self): """Test fetching variant details.""" mock_result = json.dumps([ { "_id": "rs123", "gene": {"symbol": "BRAF"}, "clinvar": {"clinical_significance": "Pathogenic"}, "tcga": {"cancer_types": {}}, "external_links": {"dbSNP": "https://example.com"}, } ]) with patch("biomcp.variants.getter.get_variant") as mock_get: mock_get.return_value = mock_result result = await fetch(domain="variant", id="rs123") assert result["id"] == "rs123" assert "TCGA Data: Available" in result["text"] assert "external_links" in result["metadata"] async def test_fetch_variant_list_response(self): """Test fetching variant when API returns list.""" mock_result = json.dumps([ {"_id": "rs123", "gene": {"symbol": "BRAF"}} ]) with patch("biomcp.variants.getter.get_variant") as mock_get: mock_get.return_value = mock_result result = await fetch(domain="variant", id="rs123") assert result["id"] == "rs123" async def test_fetch_invalid_domain(self): """Test fetch with invalid domain.""" with pytest.raises(InvalidDomainError): await fetch(domain="invalid", id="123") async def test_fetch_error_handling(self): """Test fetch error handling.""" with patch("biomcp.articles.fetch.fetch_articles") as mock_fetch: mock_fetch.side_effect = Exception("API Error") with pytest.raises(SearchExecutionError) as exc_info: await fetch(domain="article", id="123") assert "Failed to execute search" in str(exc_info.value) async def test_fetch_domain_auto_detection_pmid(self): """Test domain auto-detection for PMID.""" with patch("biomcp.articles.fetch._article_details") as mock_fetch: mock_fetch.return_value = json.dumps([ {"pmid": "12345", "title": "Test"} ]) # Numeric ID should auto-detect as article result = await fetch(id="12345") assert result["id"] == "12345" mock_fetch.assert_called_once() async def test_fetch_domain_auto_detection_nct(self): """Test domain auto-detection for NCT ID.""" with patch("biomcp.trials.getter.get_trial") as mock_get: mock_get.return_value = json.dumps({ "protocolSection": { "identificationModule": {"briefTitle": "Test Trial"} } }) # NCT ID should auto-detect as trial result = await fetch(id="NCT12345") assert "NCT12345" in result["url"] mock_get.assert_called() async def test_fetch_domain_auto_detection_doi(self): """Test domain auto-detection for DOI.""" with patch("biomcp.articles.fetch._article_details") as mock_fetch: mock_fetch.return_value = json.dumps([ {"doi": "10.1038/nature12345", "title": "Test"} ]) # DOI should auto-detect as article await fetch(id="10.1038/nature12345") mock_fetch.assert_called_once() async def 
test_fetch_domain_auto_detection_variant(self): """Test domain auto-detection for variant IDs.""" with patch("biomcp.variants.getter.get_variant") as mock_get: mock_get.return_value = json.dumps([{"_id": "rs12345"}]) # rsID should auto-detect as variant await fetch(id="rs12345") mock_get.assert_called_once() # Test HGVS notation with patch("biomcp.variants.getter.get_variant") as mock_get: mock_get.return_value = json.dumps([ {"_id": "chr7:g.140453136A>T"} ]) await fetch(id="chr7:g.140453136A>T") mock_get.assert_called_once() @pytest.mark.asyncio class TestUnifiedSearch: """Test the _unified_search internal function.""" async def test_unified_search_explain_query(self): """Test unified search with explain_query flag.""" from biomcp.router import _unified_search result = await _unified_search( query="gene:BRAF AND disease:cancer", explain_query=True ) assert "original_query" in result assert "parsed_structure" in result assert "routing_plan" in result assert "schema" in result async def test_unified_search_execution(self): """Test unified search normal execution.""" from biomcp.router import _unified_search with patch("biomcp.query_router.execute_routing_plan") as mock_execute: mock_execute.return_value = { "articles": json.dumps([{"pmid": "123", "title": "Article 1"}]) } result = await _unified_search( query="gene:BRAF", max_results_per_domain=10 ) assert "results" in result assert isinstance(result["results"], list) async def test_unified_search_parse_error(self): """Test unified search with invalid query.""" from biomcp.router import _unified_search with patch("biomcp.query_parser.QueryParser.parse") as mock_parse: mock_parse.side_effect = Exception("Parse error") with pytest.raises(QueryParsingError): await _unified_search( query="invalid::query", max_results_per_domain=10 ) ``` -------------------------------------------------------------------------------- /src/biomcp/integrations/biothings_client.py: -------------------------------------------------------------------------------- ```python """BioThings API client for unified access to the BioThings suite. The BioThings suite (https://biothings.io) provides high-performance biomedical data APIs including: - MyGene.info - Gene annotations and information - MyVariant.info - Genetic variant annotations (existing integration enhanced) - MyDisease.info - Disease ontology and synonyms - MyChem.info - Drug/chemical annotations and information This module provides a centralized client for interacting with all BioThings APIs, handling common concerns like error handling, rate limiting, and response parsing. While MyVariant.info has specialized modules for complex variant operations, this client provides the base layer for all BioThings API interactions. """ import logging from typing import Any from urllib.parse import quote from pydantic import BaseModel, Field from .. 
import http_client from ..constants import ( MYVARIANT_GET_URL, ) logger = logging.getLogger(__name__) # BioThings API endpoints MYGENE_BASE_URL = "https://mygene.info/v3" MYGENE_QUERY_URL = f"{MYGENE_BASE_URL}/query" MYGENE_GET_URL = f"{MYGENE_BASE_URL}/gene" MYDISEASE_BASE_URL = "https://mydisease.info/v1" MYDISEASE_QUERY_URL = f"{MYDISEASE_BASE_URL}/query" MYDISEASE_GET_URL = f"{MYDISEASE_BASE_URL}/disease" MYCHEM_BASE_URL = "https://mychem.info/v1" MYCHEM_QUERY_URL = f"{MYCHEM_BASE_URL}/query" MYCHEM_GET_URL = f"{MYCHEM_BASE_URL}/chem" class GeneInfo(BaseModel): """Gene information from MyGene.info.""" gene_id: str = Field(alias="_id") symbol: str | None = None name: str | None = None summary: str | None = None alias: list[str] | None = Field(default_factory=list) entrezgene: int | str | None = None ensembl: dict[str, Any] | None = None refseq: dict[str, Any] | None = None type_of_gene: str | None = None taxid: int | None = None class DiseaseInfo(BaseModel): """Disease information from MyDisease.info.""" disease_id: str = Field(alias="_id") name: str | None = None mondo: dict[str, Any] | None = None definition: str | None = None synonyms: list[str] | None = Field(default_factory=list) xrefs: dict[str, Any] | None = None phenotypes: list[dict[str, Any]] | None = None class DrugInfo(BaseModel): """Drug/chemical information from MyChem.info.""" drug_id: str = Field(alias="_id") name: str | None = None tradename: list[str] | None = Field(default_factory=list) drugbank_id: str | None = None chebi_id: str | None = None chembl_id: str | None = None pubchem_cid: str | None = None unii: str | dict[str, Any] | None = None inchikey: str | None = None formula: str | None = None description: str | None = None indication: str | None = None pharmacology: dict[str, Any] | None = None mechanism_of_action: str | None = None class BioThingsClient: """Unified client for BioThings APIs (MyGene, MyVariant, MyDisease, MyChem).""" def __init__(self): """Initialize the BioThings client.""" self.logger = logger async def get_gene_info( self, gene_id_or_symbol: str, fields: list[str] | None = None ) -> GeneInfo | None: """Get gene information from MyGene.info. 
Args: gene_id_or_symbol: Gene ID (Entrez, Ensembl) or symbol (e.g., "TP53") fields: Optional list of fields to return Returns: GeneInfo object or None if not found """ try: # First, try direct GET (works for Entrez IDs) if gene_id_or_symbol.isdigit(): return await self._get_gene_by_id(gene_id_or_symbol, fields) # For symbols, we need to query first query_result = await self._query_gene(gene_id_or_symbol) if not query_result: return None # Get the best match gene_id = query_result[0].get("_id") if not gene_id: return None # Now get full details return await self._get_gene_by_id(gene_id, fields) except Exception as e: self.logger.warning( f"Failed to get gene info for {gene_id_or_symbol}: {e}" ) return None async def _query_gene(self, symbol: str) -> list[dict[str, Any]] | None: """Query MyGene.info for a gene symbol.""" params = { "q": f"symbol:{quote(symbol)}", "species": "human", "fields": "_id,symbol,name,taxid", "size": 5, } response, error = await http_client.request_api( url=MYGENE_QUERY_URL, request=params, method="GET", domain="mygene", ) if error or not response: return None hits = response.get("hits", []) # Filter for human genes (taxid 9606) human_hits = [h for h in hits if h.get("taxid") == 9606] return human_hits if human_hits else hits async def _get_gene_by_id( self, gene_id: str, fields: list[str] | None = None ) -> GeneInfo | None: """Get gene details by ID from MyGene.info.""" if fields is None: fields = [ "symbol", "name", "summary", "alias", "type_of_gene", "ensembl", "refseq", "entrezgene", ] params = {"fields": ",".join(fields)} response, error = await http_client.request_api( url=f"{MYGENE_GET_URL}/{gene_id}", request=params, method="GET", domain="mygene", ) if error or not response: return None try: return GeneInfo(**response) except Exception as e: self.logger.warning(f"Failed to parse gene response: {e}") return None async def batch_get_genes( self, gene_ids: list[str], fields: list[str] | None = None ) -> list[GeneInfo]: """Get multiple genes in a single request. Args: gene_ids: List of gene IDs or symbols fields: Optional list of fields to return Returns: List of GeneInfo objects """ if not gene_ids: return [] if fields is None: fields = ["symbol", "name", "summary", "alias", "type_of_gene"] # MyGene supports POST for batch queries data = { "ids": ",".join(gene_ids), "fields": ",".join(fields), "species": "human", } response, error = await http_client.request_api( url=MYGENE_GET_URL, request=data, method="POST", domain="mygene", ) if error or not response: return [] results = [] for item in response: try: if "notfound" not in item: results.append(GeneInfo(**item)) except Exception as e: self.logger.warning(f"Failed to parse gene in batch: {e}") continue return results async def get_disease_info( self, disease_id_or_name: str, fields: list[str] | None = None ) -> DiseaseInfo | None: """Get disease information from MyDisease.info. 
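        Illustrative usage (an added documentation sketch, not from the
        original source; the MONDO ID is shown only as an example of the
        recognized prefixes):

            client = BioThingsClient()
            disease = await client.get_disease_info("melanoma")       # name: query first
            disease = await client.get_disease_info("MONDO:0005105")  # known prefix: direct GET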
Args: disease_id_or_name: Disease ID (MONDO, DOID) or name fields: Optional list of fields to return Returns: DiseaseInfo object or None if not found """ try: # Check if it's an ID (starts with known prefixes) if any( disease_id_or_name.upper().startswith(prefix) for prefix in ["MONDO:", "DOID:", "OMIM:", "MESH:"] ): return await self._get_disease_by_id( disease_id_or_name, fields ) # Otherwise, query by name query_result = await self._query_disease(disease_id_or_name) if not query_result: return None # Get the best match disease_id = query_result[0].get("_id") if not disease_id: return None # Now get full details return await self._get_disease_by_id(disease_id, fields) except Exception as e: self.logger.warning( f"Failed to get disease info for {disease_id_or_name}: {e}" ) return None async def _query_disease(self, name: str) -> list[dict[str, Any]] | None: """Query MyDisease.info for a disease name.""" params = { "q": quote(name), "fields": "_id,name,mondo", "size": 10, } response, error = await http_client.request_api( url=MYDISEASE_QUERY_URL, request=params, method="GET", domain="mydisease", ) if error or not response: return None return response.get("hits", []) async def _get_disease_by_id( self, disease_id: str, fields: list[str] | None = None ) -> DiseaseInfo | None: """Get disease details by ID from MyDisease.info.""" if fields is None: fields = [ "name", "mondo", "definition", "synonyms", "xrefs", "phenotypes", ] params = {"fields": ",".join(fields)} response, error = await http_client.request_api( url=f"{MYDISEASE_GET_URL}/{quote(disease_id, safe='')}", request=params, method="GET", domain="mydisease", ) if error or not response: return None try: # Extract definition from mondo if available if "mondo" in response and isinstance(response["mondo"], dict): if ( "definition" in response["mondo"] and "definition" not in response ): response["definition"] = response["mondo"]["definition"] # Extract synonyms from mondo if "synonym" in response["mondo"]: mondo_synonyms = response["mondo"]["synonym"] if isinstance(mondo_synonyms, dict): # Handle exact synonyms exact = mondo_synonyms.get("exact", []) if isinstance(exact, list): response["synonyms"] = exact elif isinstance(mondo_synonyms, list): response["synonyms"] = mondo_synonyms return DiseaseInfo(**response) except Exception as e: self.logger.warning(f"Failed to parse disease response: {e}") return None async def get_disease_synonyms(self, disease_id_or_name: str) -> list[str]: """Get disease synonyms for query expansion. Args: disease_id_or_name: Disease ID or name Returns: List of synonyms including the original term """ disease_info = await self.get_disease_info(disease_id_or_name) if not disease_info: return [disease_id_or_name] synonyms = [disease_id_or_name] if disease_info.name and disease_info.name != disease_id_or_name: synonyms.append(disease_info.name) if disease_info.synonyms: synonyms.extend(disease_info.synonyms) # Remove duplicates while preserving order seen = set() unique_synonyms = [] for syn in synonyms: if syn.lower() not in seen: seen.add(syn.lower()) unique_synonyms.append(syn) return unique_synonyms[ :5 ] # Limit to top 5 to avoid overly broad searches async def get_drug_info( self, drug_id_or_name: str, fields: list[str] | None = None ) -> DrugInfo | None: """Get drug/chemical information from MyChem.info. Args: drug_id_or_name: Drug ID (DrugBank, ChEMBL, etc.) 
or name fields: Optional list of fields to return Returns: DrugInfo object or None if not found """ try: # Check if it's an ID (starts with known prefixes) if any( drug_id_or_name.upper().startswith(prefix) for prefix in ["DRUGBANK:", "DB", "CHEMBL", "CHEBI:", "CID"] ): return await self._get_drug_by_id(drug_id_or_name, fields) # Otherwise, query by name query_result = await self._query_drug(drug_id_or_name) if not query_result: return None # Get the best match drug_id = query_result[0].get("_id") if not drug_id: return None # Now get full details return await self._get_drug_by_id(drug_id, fields) except Exception as e: self.logger.warning( f"Failed to get drug info for {drug_id_or_name}: {e}" ) return None async def _query_drug(self, name: str) -> list[dict[str, Any]] | None: """Query MyChem.info for a drug name.""" params = { "q": quote(name), "fields": "_id,name,drugbank.name,chebi.name,chembl.pref_name,unii.display_name", "size": 10, } response, error = await http_client.request_api( url=MYCHEM_QUERY_URL, request=params, method="GET", domain="mychem", ) if error or not response: return None hits = response.get("hits", []) # Sort hits to prioritize those with actual drug names def score_hit(hit): score = hit.get("_score", 0) # Boost score if hit has drug name fields if hit.get("drugbank", {}).get("name"): score += 10 if hit.get("chembl", {}).get("pref_name"): score += 5 if hit.get("unii", {}).get("display_name"): score += 3 return score hits.sort(key=score_hit, reverse=True) return hits async def _get_drug_by_id( self, drug_id: str, fields: list[str] | None = None ) -> DrugInfo | None: """Get drug details by ID from MyChem.info.""" if fields is None: fields = [ "name", "drugbank", "chebi", "chembl", "pubchem", "unii", "inchikey", "formula", "description", "indication", "pharmacology", "mechanism_of_action", ] params = {"fields": ",".join(fields)} response, error = await http_client.request_api( url=f"{MYCHEM_GET_URL}/{quote(drug_id, safe='')}", request=params, method="GET", domain="mychem", ) if error or not response: return None try: # Handle array response (multiple results) if isinstance(response, list): if not response: return None # Take the first result response = response[0] # Extract fields from nested structures self._extract_drugbank_fields(response) self._extract_chebi_fields(response) self._extract_chembl_fields(response) self._extract_pubchem_fields(response) self._extract_unii_fields(response) return DrugInfo(**response) except Exception as e: self.logger.warning(f"Failed to parse drug response: {e}") return None def _extract_drugbank_fields(self, response: dict[str, Any]) -> None: """Extract DrugBank fields from response.""" if "drugbank" in response and isinstance(response["drugbank"], dict): db = response["drugbank"] response["drugbank_id"] = db.get("id") response["name"] = response.get("name") or db.get("name") response["tradename"] = db.get("products", {}).get("name", []) if isinstance(response["tradename"], str): response["tradename"] = [response["tradename"]] response["indication"] = db.get("indication") response["mechanism_of_action"] = db.get("mechanism_of_action") response["description"] = db.get("description") def _extract_chebi_fields(self, response: dict[str, Any]) -> None: """Extract ChEBI fields from response.""" if "chebi" in response and isinstance(response["chebi"], dict): response["chebi_id"] = response["chebi"].get("id") if not response.get("name"): response["name"] = response["chebi"].get("name") def _extract_chembl_fields(self, response: dict[str, 
Any]) -> None: """Extract ChEMBL fields from response.""" if "chembl" in response and isinstance(response["chembl"], dict): response["chembl_id"] = response["chembl"].get( "molecule_chembl_id" ) if not response.get("name"): response["name"] = response["chembl"].get("pref_name") def _extract_pubchem_fields(self, response: dict[str, Any]) -> None: """Extract PubChem fields from response.""" if "pubchem" in response and isinstance(response["pubchem"], dict): response["pubchem_cid"] = str(response["pubchem"].get("cid", "")) def _extract_unii_fields(self, response: dict[str, Any]) -> None: """Extract UNII fields from response.""" if "unii" in response and isinstance(response["unii"], dict): unii_data = response["unii"] # Set UNII code response["unii"] = unii_data.get("unii", "") # Use display name as drug name if not already set if not response.get("name") and unii_data.get("display_name"): response["name"] = unii_data["display_name"] # Use NCIT description if no description if not response.get("description") and unii_data.get( "ncit_description" ): response["description"] = unii_data["ncit_description"] async def get_variant_info( self, variant_id: str, fields: list[str] | None = None ) -> dict[str, Any] | None: """Get variant information from MyVariant.info. This is a wrapper around the existing MyVariant integration. Args: variant_id: Variant ID (rsID, HGVS) fields: Optional list of fields to return Returns: Variant data dictionary or None if not found """ params = {"fields": "all" if fields is None else ",".join(fields)} response, error = await http_client.request_api( url=f"{MYVARIANT_GET_URL}/{variant_id}", request=params, method="GET", domain="myvariant", ) if error or not response: return None return response ``` -------------------------------------------------------------------------------- /docs/user-guides/02-mcp-tools-reference.md: -------------------------------------------------------------------------------- ```markdown # MCP Tools Reference BioMCP provides 35 specialized tools for biomedical research through the Model Context Protocol (MCP). This reference covers all available tools, their parameters, and usage patterns. ## Related Guides - **Conceptual Overview**: [Sequential Thinking with the Think Tool](../concepts/03-sequential-thinking-with-the-think-tool.md) - **Practical Examples**: See the [How-to Guides](../how-to-guides/01-find-articles-and-cbioportal-data.md) for real-world usage patterns - **Integration Setup**: [Claude Desktop Integration](../getting-started/02-claude-desktop-integration.md) ## Tool Categories | Category | Count | Tools | | ------------------- | ----- | -------------------------------------------------------------- | | **Core Tools** | 3 | `search`, `fetch`, `think` | | **Article Tools** | 2 | `article_searcher`, `article_getter` | | **Trial Tools** | 6 | `trial_searcher`, `trial_getter`, + 4 detail getters | | **Variant Tools** | 3 | `variant_searcher`, `variant_getter`, `alphagenome_predictor` | | **BioThings Tools** | 3 | `gene_getter`, `disease_getter`, `drug_getter` | | **NCI Tools** | 6 | Organization, intervention, biomarker, and disease tools | | **OpenFDA Tools** | 12 | Adverse events, labels, devices, approvals, recalls, shortages | ## Core Unified Tools ### 1. 
search **Universal search across all biomedical domains with unified query language.** ```python search( query: str = None, # Unified query syntax domain: str = None, # Target domain genes: list[str] = None, # Gene symbols diseases: list[str] = None, # Disease/condition terms variants: list[str] = None, # Variant notations chemicals: list[str] = None, # Drug/chemical names keywords: list[str] = None, # Additional keywords conditions: list[str] = None, # Trial conditions interventions: list[str] = None,# Trial interventions lat: float = None, # Latitude for trials long: float = None, # Longitude for trials page: int = 1, # Page number page_size: int = 10, # Results per page api_key: str = None # For NCI domains ) -> dict ``` **Domains:** `article`, `trial`, `variant`, `gene`, `drug`, `disease`, `nci_organization`, `nci_intervention`, `nci_biomarker`, `nci_disease`, `fda_adverse`, `fda_label`, `fda_device`, `fda_approval`, `fda_recall`, `fda_shortage` **Query Language Examples:** - `"gene:BRAF AND disease:melanoma"` - `"drugs.tradename:gleevec"` - `"gene:TP53 AND (mutation OR variant)"` **Usage Examples:** ```python # Domain-specific search search(domain="article", genes=["BRAF"], diseases=["melanoma"]) # Unified query language search(query="gene:EGFR AND mutation:T790M") # Clinical trials by location search(domain="trial", conditions=["lung cancer"], lat=40.7128, long=-74.0060) # FDA adverse events search(domain="fda_adverse", chemicals=["aspirin"]) # FDA drug approvals search(domain="fda_approval", chemicals=["keytruda"]) ``` ### 2. fetch **Retrieve detailed information for any biomedical record.** ```python fetch( id: str, # Record identifier domain: str = None, # Domain (auto-detected if not provided) detail: str = None, # Specific section for trials api_key: str = None # For NCI records ) -> dict ``` **Supported IDs:** - Articles: PMID (e.g., "38768446"), DOI (e.g., "10.1101/2024.01.20") - Trials: NCT ID (e.g., "NCT03006926") - Variants: HGVS, rsID, genomic coordinates - Genes/Drugs/Diseases: Names or database IDs - FDA Records: Report IDs, Application Numbers (e.g., "BLA125514"), Recall Numbers, etc. **Detail Options for Trials:** `protocol`, `locations`, `outcomes`, `references`, `all` **Usage Examples:** ```python # Fetch article by PMID fetch(id="38768446", domain="article") # Fetch trial with specific details fetch(id="NCT03006926", domain="trial", detail="locations") # Auto-detect domain fetch(id="rs121913529") # Variant fetch(id="BRAF") # Gene # Fetch FDA records fetch(id="BLA125514", domain="fda_approval") # Drug approval fetch(id="D-0001-2023", domain="fda_recall") # Drug recall ``` ### 3. think **Sequential thinking tool for structured problem-solving.** ```python think( thought: str, # Current reasoning step thoughtNumber: int, # Sequential number (1, 2, 3...) totalThoughts: int = None, # Estimated total thoughts nextThoughtNeeded: bool = True # Continue thinking? ) -> str ``` **CRITICAL:** Always use `think` BEFORE any other BioMCP operation! **Usage Pattern:** ```python # Step 1: Problem decomposition think( thought="Breaking down query: need to find BRAF inhibitor trials...", thoughtNumber=1, nextThoughtNeeded=True ) # Step 2: Search strategy think( thought="Will search trials for BRAF V600E melanoma, then articles...", thoughtNumber=2, nextThoughtNeeded=True ) # Final step: Synthesis think( thought="Ready to synthesize findings from 5 trials and 12 articles...", thoughtNumber=3, nextThoughtNeeded=False # Analysis complete ) ``` ## Article Tools ### 4. 
article_searcher **Search PubMed/PubTator3 for biomedical literature.** ```python article_searcher( chemicals: list[str] = None, diseases: list[str] = None, genes: list[str] = None, keywords: list[str] = None, # Supports OR with "|" variants: list[str] = None, include_preprints: bool = True, include_cbioportal: bool = True, page: int = 1, page_size: int = 10 ) -> str ``` **Features:** - Automatic cBioPortal integration for gene searches - Preprint inclusion from bioRxiv/medRxiv - OR logic in keywords: `"V600E|p.V600E|c.1799T>A"` **Example:** ```python # Search with multiple filters article_searcher( genes=["BRAF"], diseases=["melanoma"], keywords=["resistance|resistant"], include_cbioportal=True ) ``` ### 5. article_getter **Fetch detailed article information.** ```python article_getter( pmid: str # PubMed ID, PMC ID, or DOI ) -> str ``` **Supports:** - PubMed IDs: "38768446" - PMC IDs: "PMC7498215" - DOIs: "10.1101/2024.01.20.23288905" ## Trial Tools ### 6. trial_searcher **Search ClinicalTrials.gov with comprehensive filters.** ```python trial_searcher( conditions: list[str] = None, interventions: list[str] = None, other_terms: list[str] = None, recruiting_status: str = "ANY", # "OPEN", "CLOSED", "ANY" phase: str = None, # "PHASE1", "PHASE2", etc. lat: float = None, # Location-based search long: float = None, distance: int = None, # Miles from coordinates age_group: str = None, # "CHILD", "ADULT", "OLDER_ADULT" sex: str = None, # "MALE", "FEMALE", "ALL" study_type: str = None, # "INTERVENTIONAL", "OBSERVATIONAL" funder_type: str = None, # "NIH", "INDUSTRY", etc. page: int = 1, page_size: int = 10 ) -> str ``` **Location Search Example:** ```python # Trials near Boston trial_searcher( conditions=["breast cancer"], lat=42.3601, long=-71.0589, distance=50, recruiting_status="OPEN" ) ``` ### 7-11. Trial Detail Getters ```python # Get complete trial information trial_getter(nct_id: str) -> str # Get specific sections trial_protocol_getter(nct_id: str) -> str # Core protocol info trial_locations_getter(nct_id: str) -> str # Sites and contacts trial_outcomes_getter(nct_id: str) -> str # Outcome measures trial_references_getter(nct_id: str) -> str # Publications ``` ## Variant Tools ### 12. variant_searcher **Search MyVariant.info for genetic variants.** ```python variant_searcher( gene: str = None, hgvs: str = None, hgvsp: str = None, # Protein HGVS hgvsc: str = None, # Coding DNA HGVS rsid: str = None, region: str = None, # "chr7:140753336-140753337" significance: str = None, # Clinical significance frequency_min: float = None, frequency_max: float = None, cadd_score_min: float = None, sift_prediction: str = None, polyphen_prediction: str = None, sources: list[str] = None, include_cbioportal: bool = True, page: int = 1, page_size: int = 10 ) -> str ``` **Significance Options:** `pathogenic`, `likely_pathogenic`, `uncertain_significance`, `likely_benign`, `benign` **Example:** ```python # Find rare pathogenic BRCA1 variants variant_searcher( gene="BRCA1", significance="pathogenic", frequency_max=0.001, cadd_score_min=20 ) ``` ### 13. variant_getter **Fetch comprehensive variant details.** ```python variant_getter( variant_id: str, # HGVS, rsID, or MyVariant ID include_external: bool = True # Include TCGA, 1000 Genomes ) -> str ``` ### 14. 
alphagenome_predictor **Predict variant effects using Google DeepMind's AlphaGenome.** ```python alphagenome_predictor( chromosome: str, # e.g., "chr7" position: int, # 1-based position reference: str, # Reference allele alternate: str, # Alternate allele interval_size: int = 131072, # Analysis window tissue_types: list[str] = None, # UBERON terms significance_threshold: float = 0.5, api_key: str = None # AlphaGenome API key ) -> str ``` **Requires:** AlphaGenome API key (environment variable or per-request) **Tissue Examples:** - `UBERON:0002367` - prostate gland - `UBERON:0001155` - colon - `UBERON:0002048` - lung **Example:** ```python # Predict BRAF V600E effects alphagenome_predictor( chromosome="chr7", position=140753336, reference="A", alternate="T", tissue_types=["UBERON:0002367"], # prostate api_key="your-key" ) ``` ## BioThings Tools ### 15. gene_getter **Get gene information from MyGene.info.** ```python gene_getter( gene_id_or_symbol: str # Gene symbol or Entrez ID ) -> str ``` **Returns:** Official name, aliases, summary, genomic location, database links ### 16. disease_getter **Get disease information from MyDisease.info.** ```python disease_getter( disease_id_or_name: str # Disease name or ontology ID ) -> str ``` **Returns:** Definition, synonyms, MONDO/DOID IDs, associated phenotypes ### 17. drug_getter **Get drug/chemical information from MyChem.info.** ```python drug_getter( drug_id_or_name: str # Drug name or database ID ) -> str ``` **Returns:** Chemical structure, mechanism, indications, trade names, identifiers ## NCI-Specific Tools All NCI tools require an API key from [api.cancer.gov](https://api.cancer.gov). ### 18-19. Organization Tools ```python # Search organizations nci_organization_searcher( name: str = None, organization_type: str = None, city: str = None, # Must use with state state: str = None, # Must use with city api_key: str = None ) -> str # Get organization details nci_organization_getter( organization_id: str, api_key: str = None ) -> str ``` ### 20-21. Intervention Tools ```python # Search interventions nci_intervention_searcher( name: str = None, intervention_type: str = None, # "Drug", "Device", etc. synonyms: bool = True, api_key: str = None ) -> str # Get intervention details nci_intervention_getter( intervention_id: str, api_key: str = None ) -> str ``` ### 22. Biomarker Search ```python nci_biomarker_searcher( name: str = None, biomarker_type: str = None, api_key: str = None ) -> str ``` ### 23. Disease Search (NCI) ```python nci_disease_searcher( name: str = None, include_synonyms: bool = True, category: str = None, api_key: str = None ) -> str ``` ## OpenFDA Tools All OpenFDA tools support optional API keys for higher rate limits (240/min vs 40/min). Get a free key at [open.fda.gov/apis/authentication](https://open.fda.gov/apis/authentication/). ### 24. openfda_adverse_searcher **Search FDA Adverse Event Reporting System (FAERS).** ```python openfda_adverse_searcher( drug: str = None, reaction: str = None, serious: bool = None, # Filter serious events only limit: int = 25, skip: int = 0, api_key: str = None # Optional OpenFDA API key ) -> str ``` **Example:** ```python # Find serious bleeding events for warfarin openfda_adverse_searcher( drug="warfarin", reaction="bleeding", serious=True, api_key="your-key" # Optional ) ``` ### 25. openfda_adverse_getter **Get detailed adverse event report.** ```python openfda_adverse_getter( report_id: str, # Safety report ID api_key: str = None ) -> str ``` ### 26. 
openfda_label_searcher **Search FDA drug product labels.** ```python openfda_label_searcher( name: str = None, indication: str = None, # Search by indication boxed_warning: bool = False, # Filter for boxed warnings section: str = None, # Specific label section limit: int = 25, skip: int = 0, api_key: str = None ) -> str ``` ### 27. openfda_label_getter **Get complete drug label information.** ```python openfda_label_getter( set_id: str, # Label set ID sections: list[str] = None, # Specific sections to retrieve api_key: str = None ) -> str ``` **Label Sections:** `indications_and_usage`, `contraindications`, `warnings_and_precautions`, `dosage_and_administration`, `adverse_reactions`, `drug_interactions`, `pregnancy`, `pediatric_use`, `geriatric_use` ### 28. openfda_device_searcher **Search FDA device adverse event reports (MAUDE).** ```python openfda_device_searcher( device: str = None, manufacturer: str = None, problem: str = None, product_code: str = None, # FDA product code genomics_only: bool = True, # Filter genomic/diagnostic devices limit: int = 25, skip: int = 0, api_key: str = None ) -> str ``` **Note:** FDA uses abbreviated device names (e.g., "F1CDX" for "FoundationOne CDx"). ### 29. openfda_device_getter **Get detailed device event report.** ```python openfda_device_getter( mdr_report_key: str, # MDR report key api_key: str = None ) -> str ``` ### 30. openfda_approval_searcher **Search FDA drug approval records (Drugs@FDA).** ```python openfda_approval_searcher( drug: str = None, application_number: str = None, # NDA/BLA number approval_year: str = None, # YYYY format limit: int = 25, skip: int = 0, api_key: str = None ) -> str ``` ### 31. openfda_approval_getter **Get drug approval details.** ```python openfda_approval_getter( application_number: str, # NDA/BLA number api_key: str = None ) -> str ``` ### 32. openfda_recall_searcher **Search FDA drug recall records.** ```python openfda_recall_searcher( drug: str = None, recall_class: str = None, # "1", "2", or "3" status: str = None, # "ongoing" or "completed" reason: str = None, since_date: str = None, # YYYYMMDD format limit: int = 25, skip: int = 0, api_key: str = None ) -> str ``` **Recall Classes:** - Class 1: Dangerous or defective products that could cause serious health problems or death - Class 2: Products that might cause temporary health problems or pose slight threat - Class 3: Products unlikely to cause adverse health consequences ### 33. openfda_recall_getter **Get drug recall details.** ```python openfda_recall_getter( recall_number: str, # FDA recall number api_key: str = None ) -> str ``` ### 34. openfda_shortage_searcher **Search FDA drug shortage database.** ```python openfda_shortage_searcher( drug: str = None, status: str = None, # "current" or "resolved" therapeutic_category: str = None, limit: int = 25, skip: int = 0, api_key: str = None ) -> str ``` ### 35. openfda_shortage_getter **Get drug shortage details.** ```python openfda_shortage_getter( drug_name: str, api_key: str = None ) -> str ``` ## Best Practices ### 1. Always Think First ```python # ✅ CORRECT - Think before searching think(thought="Planning BRAF melanoma research...", thoughtNumber=1) results = article_searcher(genes=["BRAF"], diseases=["melanoma"]) # ❌ INCORRECT - Skipping think tool results = article_searcher(genes=["BRAF"]) # Poor results! ``` ### 2. 
Use Unified Tools for Flexibility ```python # Unified search supports complex queries results = search(query="gene:EGFR AND (mutation:T790M OR mutation:C797S)") # Unified fetch auto-detects domain details = fetch(id="NCT03006926") # Knows it's a trial ``` ### 3. Leverage Domain-Specific Features ```python # Article search with cBioPortal articles = article_searcher( genes=["KRAS"], include_cbioportal=True # Adds cancer genomics context ) # Variant search with multiple filters variants = variant_searcher( gene="TP53", significance="pathogenic", frequency_max=0.01, cadd_score_min=25 ) ``` ### 4. Handle API Keys Properly ```python # For personal use - environment variable # export NCI_API_KEY="your-key" nci_results = search(domain="nci_organization", name="Mayo Clinic") # For shared environments - per-request nci_results = search( domain="nci_organization", name="Mayo Clinic", api_key="user-provided-key" ) ``` ### 5. Use Appropriate Page Sizes ```python # Large result sets - increase page_size results = article_searcher( genes=["TP53"], page_size=50 # Get more results at once ) # Iterative exploration - use pagination page1 = trial_searcher(conditions=["cancer"], page=1, page_size=10) page2 = trial_searcher(conditions=["cancer"], page=2, page_size=10) ``` ## Error Handling All tools include comprehensive error handling: - **Invalid parameters**: Clear error messages with valid options - **API failures**: Graceful degradation with informative messages - **Rate limits**: Automatic retry with exponential backoff - **Missing API keys**: Helpful instructions for obtaining keys ## Tool Selection Guide | If you need to... | Use this tool | | ------------------------------ | ------------------------------------------------- | | Search across multiple domains | `search` with query language | | Get any record by ID | `fetch` with auto-detection | | Plan your research approach | `think` (always first!) | | Find recent papers | `article_searcher` | | Locate clinical trials | `trial_searcher` | | Analyze genetic variants | `variant_searcher` + `variant_getter` | | Predict variant effects | `alphagenome_predictor` | | Get gene/drug/disease info | `gene_getter`, `drug_getter`, `disease_getter` | | Access NCI databases | `nci_*` tools with API key | | Check drug adverse events | `openfda_adverse_searcher` | | Review FDA drug labels | `openfda_label_searcher` + `openfda_label_getter` | | Investigate device issues | `openfda_device_searcher` | | Find drug approvals | `openfda_approval_searcher` | | Check drug recalls | `openfda_recall_searcher` | | Monitor drug shortages | `openfda_shortage_searcher` | ## Next Steps - Review [Sequential Thinking](../concepts/03-sequential-thinking-with-the-think-tool.md) methodology - Explore [How-to Guides](../how-to-guides/01-find-articles-and-cbioportal-data.md) for complex workflows - Set up [API Keys](../getting-started/03-authentication-and-api-keys.md) for enhanced features ``` -------------------------------------------------------------------------------- /src/biomcp/domain_handlers.py: -------------------------------------------------------------------------------- ```python """Domain-specific result handlers for BioMCP. This module contains formatting functions for converting raw API responses from different biomedical data sources into a standardized format. 
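Illustrative usage (an added documentation sketch, not part of the original
module docstring):

    from biomcp.domain_handlers import get_domain_handler

    handler = get_domain_handler("article")
    row = handler.format_result(
        {"pmid": "12345", "title": "Example title", "abstract": "Example abstract."}
    )
    # row now holds the standardized keys: id, title, snippet, url, metadata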
""" import logging from typing import Any from biomcp.constants import ( DEFAULT_SIGNIFICANCE, DEFAULT_TITLE, METADATA_AUTHORS, METADATA_COMPLETION_DATE, METADATA_CONSEQUENCE, METADATA_GENE, METADATA_JOURNAL, METADATA_PHASE, METADATA_RSID, METADATA_SIGNIFICANCE, METADATA_SOURCE, METADATA_START_DATE, METADATA_STATUS, METADATA_YEAR, RESULT_ID, RESULT_METADATA, RESULT_SNIPPET, RESULT_TITLE, RESULT_URL, SNIPPET_LENGTH, ) logger = logging.getLogger(__name__) class ArticleHandler: """Handles formatting for article/publication results.""" @staticmethod def format_result(result: dict[str, Any]) -> dict[str, Any]: """Format a single article result. Args: result: Raw article data from PubTator3 or preprint APIs Returns: Standardized article result with id, title, snippet, url, and metadata """ if "pmid" in result: # PubMed article # Clean up title - remove extra spaces title = result.get("title", "").strip() title = " ".join(title.split()) # Normalize whitespace # Use default if empty if not title: title = DEFAULT_TITLE return { RESULT_ID: result["pmid"], RESULT_TITLE: title, RESULT_SNIPPET: result.get("abstract", "")[:SNIPPET_LENGTH] + "..." if result.get("abstract") else "", RESULT_URL: f"https://pubmed.ncbi.nlm.nih.gov/{result['pmid']}/", RESULT_METADATA: { METADATA_YEAR: result.get("pub_year") or ( result.get("date", "")[:4] if result.get("date") else None ), METADATA_JOURNAL: result.get("journal", ""), METADATA_AUTHORS: result.get("authors", [])[:3], }, } else: # Preprint result return { RESULT_ID: result.get("doi", result.get("id", "")), RESULT_TITLE: result.get("title", ""), RESULT_SNIPPET: result.get("abstract", "")[:SNIPPET_LENGTH] + "..." if result.get("abstract") else "", RESULT_URL: result.get("url", ""), RESULT_METADATA: { METADATA_YEAR: result.get("pub_year"), METADATA_SOURCE: result.get("source", ""), METADATA_AUTHORS: result.get("authors", [])[:3], }, } class TrialHandler: """Handles formatting for clinical trial results.""" @staticmethod def format_result(result: dict[str, Any]) -> dict[str, Any]: """Format a single trial result. Handles both ClinicalTrials.gov API v2 nested structure and legacy formats. 
Args: result: Raw trial data from ClinicalTrials.gov API Returns: Standardized trial result with id, title, snippet, url, and metadata """ # Handle ClinicalTrials.gov API v2 nested structure if "protocolSection" in result: # API v2 format - extract from nested modules protocol = result.get("protocolSection", {}) identification = protocol.get("identificationModule", {}) status = protocol.get("statusModule", {}) description = protocol.get("descriptionModule", {}) nct_id = identification.get("nctId", "") brief_title = identification.get("briefTitle", "") official_title = identification.get("officialTitle", "") brief_summary = description.get("briefSummary", "") overall_status = status.get("overallStatus", "") start_date = status.get("startDateStruct", {}).get("date", "") completion_date = status.get( "primaryCompletionDateStruct", {} ).get("date", "") # Extract phase from designModule design = protocol.get("designModule", {}) phases = design.get("phases", []) phase = phases[0] if phases else "" elif "NCT Number" in result: # Legacy flat format from search results nct_id = result.get("NCT Number", "") brief_title = result.get("Study Title", "") official_title = "" # Not available in this format brief_summary = result.get("Brief Summary", "") overall_status = result.get("Study Status", "") phase = result.get("Phases", "") start_date = result.get("Start Date", "") completion_date = result.get("Completion Date", "") else: # Original legacy format or simplified structure nct_id = result.get("nct_id", "") brief_title = result.get("brief_title", "") official_title = result.get("official_title", "") brief_summary = result.get("brief_summary", "") overall_status = result.get("overall_status", "") phase = result.get("phase", "") start_date = result.get("start_date", "") completion_date = result.get("primary_completion_date", "") return { RESULT_ID: nct_id, RESULT_TITLE: brief_title or official_title or DEFAULT_TITLE, RESULT_SNIPPET: brief_summary[:SNIPPET_LENGTH] + "..." if brief_summary else "", RESULT_URL: f"https://clinicaltrials.gov/study/{nct_id}", RESULT_METADATA: { METADATA_STATUS: overall_status, METADATA_PHASE: phase, METADATA_START_DATE: start_date, METADATA_COMPLETION_DATE: completion_date, }, } class VariantHandler: """Handles formatting for genetic variant results.""" @staticmethod def format_result(result: dict[str, Any]) -> dict[str, Any]: """Format a single variant result. 
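        Illustrative MyVariant.info input shape (abridged, added here for
        documentation only; the BRAF V600E values are real but incomplete):

            {
                "_id": "chr7:g.140453136A>T",
                "dbnsfp": {"genename": "BRAF", "hgvsp": ["p.V600E"]},
                "dbsnp": {"rsid": "rs113488022"},
                "clinvar": {"rcv": {"clinical_significance": "Pathogenic"}},
            }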
Args: result: Raw variant data from MyVariant.info API Returns: Standardized variant result with id, title, snippet, url, and metadata """ # Extract gene symbol - MyVariant.info stores this in multiple locations gene = ( result.get("dbnsfp", {}).get("genename", "") or result.get("dbsnp", {}).get("gene", {}).get("symbol", "") or "" ) # Handle case where gene is a list if isinstance(gene, list): gene = gene[0] if gene else "" # Extract rsid rsid = result.get("dbsnp", {}).get("rsid", "") or "" # Extract clinical significance clinvar = result.get("clinvar", {}) significance = "" if isinstance(clinvar.get("rcv"), dict): significance = clinvar["rcv"].get("clinical_significance", "") elif isinstance(clinvar.get("rcv"), list) and clinvar["rcv"]: significance = clinvar["rcv"][0].get("clinical_significance", "") # Build a meaningful title hgvs = "" if "dbnsfp" in result and "hgvsp" in result["dbnsfp"]: hgvs = result["dbnsfp"]["hgvsp"] if isinstance(hgvs, list): hgvs = hgvs[0] if hgvs else "" title = f"{gene} {hgvs}".strip() or result.get("_id", DEFAULT_TITLE) return { RESULT_ID: result.get("_id", ""), RESULT_TITLE: title, RESULT_SNIPPET: f"Clinical significance: {significance or DEFAULT_SIGNIFICANCE}", RESULT_URL: f"https://www.ncbi.nlm.nih.gov/snp/{rsid}" if rsid else "", RESULT_METADATA: { METADATA_GENE: gene, METADATA_RSID: rsid, METADATA_SIGNIFICANCE: significance, METADATA_CONSEQUENCE: result.get("cadd", {}).get( "consequence", "" ), }, } class GeneHandler: """Handles formatting for gene information results from MyGene.info.""" @staticmethod def format_result(result: dict[str, Any]) -> dict[str, Any]: """Format a single gene result. Args: result: Raw gene data from MyGene.info API Returns: Standardized gene result with id, title, snippet, url, and metadata """ # Extract gene information gene_id = result.get("_id", result.get("entrezgene", "")) symbol = result.get("symbol", "") name = result.get("name", "") summary = result.get("summary", "") # Build title title = ( f"{symbol}: {name}" if symbol and name else symbol or name or DEFAULT_TITLE ) # Create snippet from summary snippet = ( summary[:SNIPPET_LENGTH] + "..." if summary and len(summary) > SNIPPET_LENGTH else summary ) return { RESULT_ID: str(gene_id), RESULT_TITLE: title, RESULT_SNIPPET: snippet or "No summary available", RESULT_URL: f"https://www.genenames.org/data/gene-symbol-report/#!/symbol/{symbol}" if symbol else "", RESULT_METADATA: { "entrezgene": result.get("entrezgene"), "symbol": symbol, "name": name, "type_of_gene": result.get("type_of_gene", ""), "ensembl": result.get("ensembl", {}).get("gene") if isinstance(result.get("ensembl"), dict) else None, "refseq": result.get("refseq", {}), }, } class DrugHandler: """Handles formatting for drug/chemical information results from MyChem.info.""" @staticmethod def format_result(result: dict[str, Any]) -> dict[str, Any]: """Format a single drug result. Args: result: Raw drug data from MyChem.info API Returns: Standardized drug result with id, title, snippet, url, and metadata """ # Extract drug information drug_id = result.get("_id", "") name = result.get("name", "") drugbank_id = result.get("drugbank_id", "") description = result.get("description", "") indication = result.get("indication", "") # Build title title = name or drug_id or DEFAULT_TITLE # Create snippet from description or indication snippet_text = indication or description snippet = ( snippet_text[:SNIPPET_LENGTH] + "..." 
if snippet_text and len(snippet_text) > SNIPPET_LENGTH else snippet_text ) # Determine URL based on available IDs url = "" if drugbank_id: url = f"https://www.drugbank.ca/drugs/{drugbank_id}" elif result.get("pubchem_cid"): url = f"https://pubchem.ncbi.nlm.nih.gov/compound/{result['pubchem_cid']}" return { RESULT_ID: drug_id, RESULT_TITLE: title, RESULT_SNIPPET: snippet or "No description available", RESULT_URL: url, RESULT_METADATA: { "drugbank_id": drugbank_id, "chembl_id": result.get("chembl_id", ""), "pubchem_cid": result.get("pubchem_cid", ""), "chebi_id": result.get("chebi_id", ""), "formula": result.get("formula", ""), "tradename": result.get("tradename", []), }, } class DiseaseHandler: """Handles formatting for disease information results from MyDisease.info.""" @staticmethod def format_result(result: dict[str, Any]) -> dict[str, Any]: """Format a single disease result. Args: result: Raw disease data from MyDisease.info API Returns: Standardized disease result with id, title, snippet, url, and metadata """ # Extract disease information disease_id = result.get("_id", "") name = result.get("name", "") definition = result.get("definition", "") mondo_info = result.get("mondo", {}) # Build title title = name or disease_id or DEFAULT_TITLE # Create snippet from definition snippet = ( definition[:SNIPPET_LENGTH] + "..." if definition and len(definition) > SNIPPET_LENGTH else definition ) # Extract MONDO ID for URL mondo_id = mondo_info.get("id") if isinstance(mondo_info, dict) else "" url = ( f"https://monarchinitiative.org/disease/{mondo_id}" if mondo_id else "" ) return { RESULT_ID: disease_id, RESULT_TITLE: title, RESULT_SNIPPET: snippet or "No definition available", RESULT_URL: url, RESULT_METADATA: { "mondo_id": mondo_id, "definition": definition, "synonyms": result.get("synonyms", []), "xrefs": result.get("xrefs", {}), "phenotypes": len(result.get("phenotypes", [])), }, } class NCIOrganizationHandler: """Handles formatting for NCI organization results.""" @staticmethod def format_result(result: dict[str, Any]) -> dict[str, Any]: """Format a single NCI organization result. Args: result: Raw organization data from NCI CTS API Returns: Standardized organization result with id, title, snippet, url, and metadata """ org_id = result.get("id", result.get("org_id", "")) name = result.get("name", "Unknown Organization") org_type = result.get("type", result.get("category", "")) city = result.get("city", "") state = result.get("state", "") # Build location string location_parts = [p for p in [city, state] if p] location = ", ".join(location_parts) if location_parts else "" # Create snippet snippet_parts = [] if org_type: snippet_parts.append(f"Type: {org_type}") if location: snippet_parts.append(f"Location: {location}") snippet = " | ".join(snippet_parts) or "No details available" return { RESULT_ID: org_id, RESULT_TITLE: name, RESULT_SNIPPET: snippet, RESULT_URL: "", # NCI doesn't provide direct URLs to organizations RESULT_METADATA: { "type": org_type, "city": city, "state": state, "country": result.get("country", ""), }, } class NCIInterventionHandler: """Handles formatting for NCI intervention results.""" @staticmethod def format_result(result: dict[str, Any]) -> dict[str, Any]: """Format a single NCI intervention result. 
Args: result: Raw intervention data from NCI CTS API Returns: Standardized intervention result with id, title, snippet, url, and metadata """ int_id = result.get("id", result.get("intervention_id", "")) name = result.get("name", "Unknown Intervention") int_type = result.get("type", result.get("category", "")) synonyms = result.get("synonyms", []) # Create snippet snippet_parts = [] if int_type: snippet_parts.append(f"Type: {int_type}") if synonyms: if isinstance(synonyms, list) and synonyms: snippet_parts.append( f"Also known as: {', '.join(synonyms[:3])}" ) elif isinstance(synonyms, str): snippet_parts.append(f"Also known as: {synonyms}") snippet = " | ".join(snippet_parts) or "No details available" return { RESULT_ID: int_id, RESULT_TITLE: name, RESULT_SNIPPET: snippet, RESULT_URL: "", # NCI doesn't provide direct URLs to interventions RESULT_METADATA: { "type": int_type, "synonyms": synonyms, "description": result.get("description", ""), }, } class NCIBiomarkerHandler: """Handles formatting for NCI biomarker results.""" @staticmethod def format_result(result: dict[str, Any]) -> dict[str, Any]: """Format a single NCI biomarker result. Args: result: Raw biomarker data from NCI CTS API Returns: Standardized biomarker result with id, title, snippet, url, and metadata """ bio_id = result.get("id", result.get("biomarker_id", "")) name = result.get("name", "Unknown Biomarker") gene = result.get("gene", result.get("gene_symbol", "")) bio_type = result.get("type", result.get("category", "")) assay_type = result.get("assay_type", "") # Build title title = name if gene and gene not in name: title = f"{gene} - {name}" # Create snippet snippet_parts = [] if bio_type: snippet_parts.append(f"Type: {bio_type}") if assay_type: snippet_parts.append(f"Assay: {assay_type}") snippet = ( " | ".join(snippet_parts) or "Biomarker for trial eligibility" ) return { RESULT_ID: bio_id, RESULT_TITLE: title, RESULT_SNIPPET: snippet, RESULT_URL: "", # NCI doesn't provide direct URLs to biomarkers RESULT_METADATA: { "gene": gene, "type": bio_type, "assay_type": assay_type, "trial_count": result.get("trial_count", 0), }, } class NCIDiseaseHandler: """Handles formatting for NCI disease vocabulary results.""" @staticmethod def format_result(result: dict[str, Any]) -> dict[str, Any]: """Format a single NCI disease result. 
Args: result: Raw disease data from NCI CTS API Returns: Standardized disease result with id, title, snippet, url, and metadata """ disease_id = result.get("id", result.get("disease_id", "")) name = result.get( "name", result.get("preferred_name", "Unknown Disease") ) category = result.get("category", result.get("type", "")) synonyms = result.get("synonyms", []) # Create snippet snippet_parts = [] if category: snippet_parts.append(f"Category: {category}") if synonyms: if isinstance(synonyms, list) and synonyms: snippet_parts.append( f"Also known as: {', '.join(synonyms[:3])}" ) if len(synonyms) > 3: snippet_parts.append(f"and {len(synonyms) - 3} more") elif isinstance(synonyms, str): snippet_parts.append(f"Also known as: {synonyms}") snippet = " | ".join(snippet_parts) or "NCI cancer vocabulary term" return { RESULT_ID: disease_id, RESULT_TITLE: name, RESULT_SNIPPET: snippet, RESULT_URL: "", # NCI doesn't provide direct URLs to disease terms RESULT_METADATA: { "category": category, "synonyms": synonyms, "codes": result.get("codes", {}), }, } def get_domain_handler( domain: str, ) -> ( type[ArticleHandler] | type[TrialHandler] | type[VariantHandler] | type[GeneHandler] | type[DrugHandler] | type[DiseaseHandler] | type[NCIOrganizationHandler] | type[NCIInterventionHandler] | type[NCIBiomarkerHandler] | type[NCIDiseaseHandler] ): """Get the appropriate handler class for a domain. Args: domain: The domain name ('article', 'trial', 'variant', 'gene', 'drug', 'disease', 'nci_organization', 'nci_intervention', 'nci_biomarker', 'nci_disease') Returns: The handler class for the domain Raises: ValueError: If domain is not recognized """ handlers: dict[ str, type[ArticleHandler] | type[TrialHandler] | type[VariantHandler] | type[GeneHandler] | type[DrugHandler] | type[DiseaseHandler] | type[NCIOrganizationHandler] | type[NCIInterventionHandler] | type[NCIBiomarkerHandler] | type[NCIDiseaseHandler], ] = { "article": ArticleHandler, "trial": TrialHandler, "variant": VariantHandler, "gene": GeneHandler, "drug": DrugHandler, "disease": DiseaseHandler, "nci_organization": NCIOrganizationHandler, "nci_intervention": NCIInterventionHandler, "nci_biomarker": NCIBiomarkerHandler, "nci_disease": NCIDiseaseHandler, } handler = handlers.get(domain) if handler is None: raise ValueError(f"Unknown domain: {domain}") return handler ``` -------------------------------------------------------------------------------- /src/biomcp/variants/external.py: -------------------------------------------------------------------------------- ```python """External data sources for enhanced variant annotations.""" import asyncio import json import logging import re from typing import Any from urllib.parse import quote from pydantic import BaseModel, Field from .. 
import http_client # Import CBioPortalVariantData from the new module from .cbio_external_client import CBioPortalVariantData logger = logging.getLogger(__name__) # TCGA/GDC API endpoints GDC_BASE = "https://api.gdc.cancer.gov" GDC_SSMS_ENDPOINT = f"{GDC_BASE}/ssms" # Simple Somatic Mutations # 1000 Genomes API endpoints ENSEMBL_REST_BASE = "https://rest.ensembl.org" ENSEMBL_VARIATION_ENDPOINT = f"{ENSEMBL_REST_BASE}/variation/human" # Import constants class TCGAVariantData(BaseModel): """TCGA/GDC variant annotation data.""" cosmic_id: str | None = None tumor_types: list[str] = Field(default_factory=list) mutation_frequency: float | None = None mutation_count: int | None = None affected_cases: int | None = None consequence_type: str | None = None clinical_significance: str | None = None class ThousandGenomesData(BaseModel): """1000 Genomes variant annotation data.""" global_maf: float | None = Field( None, description="Global minor allele frequency" ) afr_maf: float | None = Field(None, description="African population MAF") amr_maf: float | None = Field(None, description="American population MAF") eas_maf: float | None = Field( None, description="East Asian population MAF" ) eur_maf: float | None = Field(None, description="European population MAF") sas_maf: float | None = Field( None, description="South Asian population MAF" ) ancestral_allele: str | None = None most_severe_consequence: str | None = None # CBioPortalVariantData is now imported from cbio_external_client.py class EnhancedVariantAnnotation(BaseModel): """Enhanced variant annotation combining multiple sources.""" variant_id: str tcga: TCGAVariantData | None = None thousand_genomes: ThousandGenomesData | None = None cbioportal: CBioPortalVariantData | None = None error_sources: list[str] = Field(default_factory=list) class TCGAClient: """Client for TCGA/GDC API.""" async def get_variant_data( self, variant_id: str ) -> TCGAVariantData | None: """Fetch variant data from TCGA/GDC. 
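        Illustrative usage (an added documentation sketch; assumes a live
        connection to api.gdc.cancer.gov):

            client = TCGAClient()
            data = await client.get_variant_data("BRAF V600E")  # routed to gene_aa_change
            data = await client.get_variant_data("chr7:g.140753336A>T")  # routed to genomic_dna_change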
Args: variant_id: Can be gene AA change (e.g., "BRAF V600E") or genomic coordinates """ try: # Determine the search field based on variant_id format # If it looks like "GENE AA_CHANGE" format, use gene_aa_change field if " " in variant_id and not variant_id.startswith("chr"): search_field = "gene_aa_change" search_value = variant_id else: # Otherwise try genomic_dna_change search_field = "genomic_dna_change" search_value = variant_id # First, search for the variant params = { "filters": json.dumps({ "op": "in", "content": { "field": search_field, "value": [search_value], }, }), "fields": "cosmic_id,genomic_dna_change,gene_aa_change,ssm_id", "format": "json", "size": "5", # Get a few in case of multiple matches } response, error = await http_client.request_api( url=GDC_SSMS_ENDPOINT, method="GET", request=params, domain="gdc", ) if error or not response: return None data = response.get("data", {}) hits = data.get("hits", []) if not hits: return None # Get the first hit hit = hits[0] ssm_id = hit.get("ssm_id") cosmic_id = hit.get("cosmic_id") # For gene_aa_change searches, verify we have the right variant if search_field == "gene_aa_change": gene_aa_changes = hit.get("gene_aa_change", []) if ( isinstance(gene_aa_changes, list) and search_value not in gene_aa_changes ): # This SSM has multiple AA changes, but not the one we're looking for return None if not ssm_id: return None # Now query SSM occurrences to get project information occ_params = { "filters": json.dumps({ "op": "in", "content": {"field": "ssm.ssm_id", "value": [ssm_id]}, }), "fields": "case.project.project_id", "format": "json", "size": "2000", # Get more occurrences } occ_response, occ_error = await http_client.request_api( url="https://api.gdc.cancer.gov/ssm_occurrences", method="GET", request=occ_params, domain="gdc", ) if occ_error or not occ_response: # Return basic info without occurrence data cosmic_id_str = ( cosmic_id[0] if isinstance(cosmic_id, list) and cosmic_id else cosmic_id ) return TCGAVariantData( cosmic_id=cosmic_id_str, tumor_types=[], affected_cases=0, consequence_type="missense_variant", # Most COSMIC variants are missense ) # Process occurrence data occ_data = occ_response.get("data", {}) occ_hits = occ_data.get("hits", []) # Count by project project_counts = {} for occ in occ_hits: case = occ.get("case", {}) project = case.get("project", {}) if project_id := project.get("project_id"): project_counts[project_id] = ( project_counts.get(project_id, 0) + 1 ) # Extract tumor types tumor_types = [] total_cases = 0 for project_id, count in project_counts.items(): # Extract tumor type from project ID # TCGA format: "TCGA-LUAD" -> "LUAD" # Other formats: "MMRF-COMMPASS" -> "MMRF-COMMPASS", "CPTAC-3" -> "CPTAC-3" if project_id.startswith("TCGA-") and "-" in project_id: tumor_type = project_id.split("-")[-1] tumor_types.append(tumor_type) else: # For non-TCGA projects, use the full project ID tumor_types.append(project_id) total_cases += count # Handle cosmic_id as list cosmic_id_str = ( cosmic_id[0] if isinstance(cosmic_id, list) and cosmic_id else cosmic_id ) return TCGAVariantData( cosmic_id=cosmic_id_str, tumor_types=tumor_types, affected_cases=total_cases, consequence_type="missense_variant", # Default for now ) except (KeyError, ValueError, TypeError, IndexError) as e: # Log the error for debugging while gracefully handling API response issues # KeyError: Missing expected fields in API response # ValueError: Invalid data format or conversion issues # TypeError: Unexpected data types in response # IndexError: 
Array access issues with response data logger.warning( f"Failed to fetch TCGA variant data for {variant_id}: {type(e).__name__}: {e}" ) return None class ThousandGenomesClient: """Client for 1000 Genomes data via Ensembl REST API.""" def _extract_population_frequencies( self, populations: list[dict] ) -> dict[str, Any]: """Extract population frequencies from Ensembl response.""" # Note: Multiple entries per population (one per allele), we want the alternate allele frequency # The reference allele will have higher frequency for rare variants pop_data: dict[str, float] = {} for pop in populations: pop_name = pop.get("population", "") frequency = pop.get("frequency", 0) # Map 1000 Genomes population codes - taking the minor allele frequency if pop_name == "1000GENOMES:phase_3:ALL": if "global_maf" not in pop_data or frequency < pop_data.get( "global_maf", 1 ): pop_data["global_maf"] = frequency elif pop_name == "1000GENOMES:phase_3:AFR": if "afr_maf" not in pop_data or frequency < pop_data.get( "afr_maf", 1 ): pop_data["afr_maf"] = frequency elif pop_name == "1000GENOMES:phase_3:AMR": if "amr_maf" not in pop_data or frequency < pop_data.get( "amr_maf", 1 ): pop_data["amr_maf"] = frequency elif pop_name == "1000GENOMES:phase_3:EAS": if "eas_maf" not in pop_data or frequency < pop_data.get( "eas_maf", 1 ): pop_data["eas_maf"] = frequency elif pop_name == "1000GENOMES:phase_3:EUR": if "eur_maf" not in pop_data or frequency < pop_data.get( "eur_maf", 1 ): pop_data["eur_maf"] = frequency elif pop_name == "1000GENOMES:phase_3:SAS" and ( "sas_maf" not in pop_data or frequency < pop_data.get("sas_maf", 1) ): pop_data["sas_maf"] = frequency return pop_data async def get_variant_data( self, variant_id: str ) -> ThousandGenomesData | None: """Fetch variant data from 1000 Genomes via Ensembl.""" try: # Try to get rsID or use the variant ID directly encoded_id = quote(variant_id, safe="") url = f"{ENSEMBL_VARIATION_ENDPOINT}/{encoded_id}" # Request with pops=1 to get population data params = {"content-type": "application/json", "pops": "1"} response, error = await http_client.request_api( url=url, method="GET", request=params, domain="ensembl", ) if error or not response: return None # Extract population frequencies populations = response.get("populations", []) pop_data = self._extract_population_frequencies(populations) # Get most severe consequence consequence = None if mappings := response.get("mappings", []): # Extract consequences from transcript consequences all_consequences = [] for mapping in mappings: if transcript_consequences := mapping.get( "transcript_consequences", [] ): for tc in transcript_consequences: if consequence_terms := tc.get( "consequence_terms", [] ): all_consequences.extend(consequence_terms) if all_consequences: # Take the first unique consequence seen = set() unique_consequences = [] for c in all_consequences: if c not in seen: seen.add(c) unique_consequences.append(c) consequence = ( unique_consequences[0] if unique_consequences else None ) # Only return data if we found population frequencies if pop_data: return ThousandGenomesData( **pop_data, ancestral_allele=response.get("ancestral_allele"), most_severe_consequence=consequence, ) else: # No population data found return None except (KeyError, ValueError, TypeError, AttributeError) as e: # Log the error for debugging while gracefully handling API response issues # KeyError: Missing expected fields in API response # ValueError: Invalid data format or conversion issues # TypeError: Unexpected data types in response # 
AttributeError: Missing attributes on response objects logger.warning( f"Failed to fetch 1000 Genomes data for {variant_id}: {type(e).__name__}: {e}" ) return None class ExternalVariantAggregator: """Aggregates variant data from multiple external sources.""" def __init__(self): self.tcga_client = TCGAClient() self.thousand_genomes_client = ThousandGenomesClient() # Import here to avoid circular imports from .cbio_external_client import CBioPortalExternalClient self.cbioportal_client = CBioPortalExternalClient() def _extract_gene_aa_change( self, variant_data: dict[str, Any] ) -> str | None: """Extract gene and AA change in format like 'BRAF V600A' from variant data.""" logger.info("_extract_gene_aa_change called") try: # First try to get gene name from CADD data gene_name = None if (cadd := variant_data.get("cadd")) and ( gene := cadd.get("gene") ): gene_name = gene.get("genename") # If not found in CADD, try other sources if not gene_name: # Try docm if docm := variant_data.get("docm"): gene_name = docm.get("gene") or docm.get("genename") # Try dbnsfp if not gene_name and (dbnsfp := variant_data.get("dbnsfp")): gene_name = dbnsfp.get("genename") if not gene_name: return None # Now try to get the protein change aa_change = None # Try to get from docm first (it has clean p.V600A format) if (docm := variant_data.get("docm")) and ( aa := docm.get("aa_change") ): # Convert p.V600A to V600A aa_change = aa.replace("p.", "") # Try hgvsp if not found if ( not aa_change and (hgvsp_list := variant_data.get("hgvsp")) and isinstance(hgvsp_list, list) and hgvsp_list ): # Take the first one and clean it hgvsp = hgvsp_list[0] # Remove p. prefix aa_change = hgvsp.replace("p.", "") # Handle formats like Val600Ala -> V600A if "Val" in aa_change or "Ala" in aa_change: # Try to extract the short form match = re.search(r"[A-Z]\d+[A-Z]", aa_change) if match: aa_change = match.group() # Try CADD data if ( not aa_change and (cadd := variant_data.get("cadd")) and (gene_info := cadd.get("gene")) and (prot := gene_info.get("prot")) ): protpos = prot.get("protpos") if protpos and cadd.get("oaa") and cadd.get("naa"): aa_change = f"{cadd['oaa']}{protpos}{cadd['naa']}" if gene_name and aa_change: result = f"{gene_name} {aa_change}" logger.info(f"Extracted gene/AA change: {result}") return result logger.warning( f"Failed to extract gene/AA change: gene_name={gene_name}, aa_change={aa_change}" ) return None except ( KeyError, ValueError, TypeError, AttributeError, re.error, ) as e: # Log the error for debugging while gracefully handling data extraction issues # KeyError: Missing expected fields in variant data # ValueError: Invalid data format or conversion issues # TypeError: Unexpected data types in variant data # AttributeError: Missing attributes on data objects # re.error: Regular expression matching errors logger.warning( f"Failed to extract gene/AA change from variant data: {type(e).__name__}: {e}" ) return None async def get_enhanced_annotations( self, variant_id: str, include_tcga: bool = True, include_1000g: bool = True, include_cbioportal: bool = True, variant_data: dict[str, Any] | None = None, ) -> EnhancedVariantAnnotation: """Fetch and aggregate variant annotations from external sources. 
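        Illustrative usage (an added documentation sketch; the selected
        sources are queried concurrently via asyncio.gather below):

            aggregator = ExternalVariantAggregator()
            annotation = await aggregator.get_enhanced_annotations(
                "rs113488022",
                include_tcga=True,
                include_1000g=True,
                include_cbioportal=False,  # needs variant_data to derive the gene/AA change
            )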

    async def get_enhanced_annotations(
        self,
        variant_id: str,
        include_tcga: bool = True,
        include_1000g: bool = True,
        include_cbioportal: bool = True,
        variant_data: dict[str, Any] | None = None,
    ) -> EnhancedVariantAnnotation:
        """Fetch and aggregate variant annotations from external sources.

        Args:
            variant_id: The variant identifier (rsID or HGVS)
            include_tcga: Whether to include TCGA data
            include_1000g: Whether to include 1000 Genomes data
            include_cbioportal: Whether to include cBioPortal data
            variant_data: Optional variant data from MyVariant.info to extract gene/protein info
        """
        logger.info(
            f"get_enhanced_annotations called for {variant_id}, include_cbioportal={include_cbioportal}"
        )
        tasks: list[Any] = []
        task_names = []

        # Extract gene/AA change once for sources that need it
        gene_aa_change = None
        if variant_data:
            logger.info(
                f"Extracting gene/AA from variant_data keys: {list(variant_data.keys())}"
            )
            gene_aa_change = self._extract_gene_aa_change(variant_data)
        else:
            logger.warning("No variant_data provided for gene/AA extraction")

        if include_tcga:
            # Try to extract gene and protein change from variant data for TCGA
            tcga_id = gene_aa_change if gene_aa_change else variant_id
            tasks.append(self.tcga_client.get_variant_data(tcga_id))
            task_names.append("tcga")

        if include_1000g:
            tasks.append(
                self.thousand_genomes_client.get_variant_data(variant_id)
            )
            task_names.append("thousand_genomes")

        if include_cbioportal and gene_aa_change:
            # cBioPortal requires gene/AA format
            logger.info(
                f"Adding cBioPortal task with gene_aa_change: {gene_aa_change}"
            )
            tasks.append(
                self.cbioportal_client.get_variant_data(gene_aa_change)
            )
            task_names.append("cbioportal")
        elif include_cbioportal and not gene_aa_change:
            logger.warning(
                "Skipping cBioPortal: no gene/AA change could be extracted"
            )

        # Run all queries in parallel
        results = await asyncio.gather(*tasks, return_exceptions=True)

        # Build the enhanced annotation
        annotation = EnhancedVariantAnnotation(variant_id=variant_id)

        for _i, (result, name) in enumerate(
            zip(results, task_names, strict=False)
        ):
            if isinstance(result, Exception):
                annotation.error_sources.append(name)
            elif result is not None:
                setattr(annotation, name, result)
            else:
                # No data found for this source
                pass

        return annotation


def format_enhanced_annotations(
    annotation: EnhancedVariantAnnotation,
) -> dict[str, Any]:
    """Format enhanced annotations for display."""
    formatted: dict[str, Any] = {
        "variant_id": annotation.variant_id,
        "external_annotations": {},
    }
    external_annot = formatted["external_annotations"]

    if annotation.tcga:
        external_annot["tcga"] = {
            "tumor_types": annotation.tcga.tumor_types,
            "affected_cases": annotation.tcga.affected_cases,
            "cosmic_id": annotation.tcga.cosmic_id,
            "consequence": annotation.tcga.consequence_type,
        }

    if annotation.thousand_genomes:
        external_annot["1000_genomes"] = {
            "global_maf": annotation.thousand_genomes.global_maf,
            "population_frequencies": {
                "african": annotation.thousand_genomes.afr_maf,
                "american": annotation.thousand_genomes.amr_maf,
                "east_asian": annotation.thousand_genomes.eas_maf,
                "european": annotation.thousand_genomes.eur_maf,
                "south_asian": annotation.thousand_genomes.sas_maf,
            },
            "ancestral_allele": annotation.thousand_genomes.ancestral_allele,
            "consequence": annotation.thousand_genomes.most_severe_consequence,
        }

    if annotation.cbioportal:
        cbio_data: dict[str, Any] = {
            "studies": annotation.cbioportal.studies,
            "total_cases": annotation.cbioportal.total_cases,
        }
        # Add cancer type distribution if available
        if annotation.cbioportal.cancer_type_distribution:
            cbio_data["cancer_types"] = (
                annotation.cbioportal.cancer_type_distribution
            )
        # Add mutation type distribution if available
        if annotation.cbioportal.mutation_types:
            cbio_data["mutation_types"] = annotation.cbioportal.mutation_types
        # Add hotspot count if > 0
        if annotation.cbioportal.hotspot_count > 0:
            cbio_data["hotspot_samples"] = annotation.cbioportal.hotspot_count
        # Add mean VAF if available
        if annotation.cbioportal.mean_vaf is not None:
            cbio_data["mean_vaf"] = annotation.cbioportal.mean_vaf
        # Add sample type distribution if available
        if annotation.cbioportal.sample_types:
            cbio_data["sample_types"] = annotation.cbioportal.sample_types
        external_annot["cbioportal"] = cbio_data

    if annotation.error_sources:
        external_annot["errors"] = annotation.error_sources

    return formatted
```

--------------------------------------------------------------------------------
/tests/tdd/trials/test_search.py:
--------------------------------------------------------------------------------

```python
import pytest

from biomcp.trials.search import (
    CLOSED_STATUSES,
    AgeGroup,
    DateField,
    InterventionType,
    LineOfTherapy,
    PrimaryPurpose,
    RecruitingStatus,
    SortOrder,
    SponsorType,
    StudyDesign,
    StudyType,
    TrialPhase,
    TrialQuery,
    _build_biomarker_expression_essie,
    _build_brain_mets_essie,
    _build_excluded_mutations_essie,
    _build_line_of_therapy_essie,
    _build_prior_therapy_essie,
    _build_progression_essie,
    _build_required_mutations_essie,
    _inject_ids,
    convert_query,
)


@pytest.mark.asyncio
async def test_convert_query_basic_parameters():
    """Test basic parameter conversion from TrialQuery to API format."""
    query = TrialQuery(conditions=["lung cancer"])
    params = await convert_query(query)

    assert "markupFormat" in params
    assert params["markupFormat"] == ["markdown"]
    assert "query.cond" in params
    assert params["query.cond"] == ["lung cancer"]
    assert "filter.overallStatus" in params
    assert "RECRUITING" in params["filter.overallStatus"][0]


@pytest.mark.asyncio
async def test_convert_query_multiple_conditions():
    """Test conversion of multiple conditions to API format."""
    query = TrialQuery(conditions=["lung cancer", "metastatic"])
    params = await convert_query(query)

    assert "query.cond" in params
    # The query should contain the original terms, but may have expanded synonyms
    cond_value = params["query.cond"][0]
    assert "lung cancer" in cond_value
    assert "metastatic" in cond_value
    assert cond_value.startswith("(") and cond_value.endswith(")")


@pytest.mark.asyncio
async def test_convert_query_terms_parameter():
    """Test conversion of terms parameter to API format."""
    query = TrialQuery(terms=["immunotherapy"])
    params = await convert_query(query)

    assert "query.term" in params
    assert params["query.term"] == ["immunotherapy"]


@pytest.mark.asyncio
async def test_convert_query_interventions_parameter():
    """Test conversion of interventions parameter to API format."""
    query = TrialQuery(interventions=["pembrolizumab"])
    params = await convert_query(query)

    assert "query.intr" in params
    assert params["query.intr"] == ["pembrolizumab"]


@pytest.mark.asyncio
async def test_convert_query_nct_ids():
    """Test conversion of NCT IDs to API format."""
    query = TrialQuery(nct_ids=["NCT04179552"])
    params = await convert_query(query)

    assert "query.id" in params
    assert params["query.id"] == ["NCT04179552"]
    # Note: The implementation keeps filter.overallStatus when using nct_ids
    # So we don't assert its absence
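
# Editorial note (assumption, not in the original file): convert_query emits
# ClinicalTrials.gov API v2 style parameters, where every value is a list of
# strings (e.g. {"query.cond": ["lung cancer"]}), suitable for encoding as
# repeated query-string parameters.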

@pytest.mark.asyncio
async def test_convert_query_recruiting_status():
    """Test conversion of recruiting status to API format."""
    # Test open status
    query = TrialQuery(recruiting_status=RecruitingStatus.OPEN)
    params = await convert_query(query)

    assert "filter.overallStatus" in params
    assert "RECRUITING" in params["filter.overallStatus"][0]

    # Test closed status
    query = TrialQuery(recruiting_status=RecruitingStatus.CLOSED)
    params = await convert_query(query)

    assert "filter.overallStatus" in params
    assert all(
        status in params["filter.overallStatus"][0]
        for status in CLOSED_STATUSES
    )

    # Test any status
    query = TrialQuery(recruiting_status=RecruitingStatus.ANY)
    params = await convert_query(query)

    assert "filter.overallStatus" not in params


@pytest.mark.asyncio
async def test_convert_query_location_parameters():
    """Test conversion of location parameters to API format."""
    query = TrialQuery(lat=40.7128, long=-74.0060, distance=10)
    params = await convert_query(query)

    assert "filter.geo" in params
    assert params["filter.geo"] == ["distance(40.7128,-74.006,10mi)"]


@pytest.mark.asyncio
async def test_convert_query_study_type():
    """Test conversion of study type to API format."""
    query = TrialQuery(study_type=StudyType.INTERVENTIONAL)
    params = await convert_query(query)

    assert "filter.advanced" in params
    assert "AREA[StudyType]Interventional" in params["filter.advanced"][0]


@pytest.mark.asyncio
async def test_convert_query_phase():
    """Test conversion of phase to API format."""
    query = TrialQuery(phase=TrialPhase.PHASE3)
    params = await convert_query(query)

    assert "filter.advanced" in params
    assert "AREA[Phase]PHASE3" in params["filter.advanced"][0]


@pytest.mark.asyncio
async def test_convert_query_date_range():
    """Test conversion of date range to API format."""
    query = TrialQuery(
        min_date="2020-01-01",
        max_date="2020-12-31",
        date_field=DateField.LAST_UPDATE,
    )
    params = await convert_query(query)

    assert "filter.advanced" in params
    assert (
        "AREA[LastUpdatePostDate]RANGE[2020-01-01,2020-12-31]"
        in params["filter.advanced"][0]
    )

    # Test min date only
    query = TrialQuery(
        min_date="2021-01-01",
        date_field=DateField.STUDY_START,
    )
    params = await convert_query(query)

    assert "filter.advanced" in params
    assert (
        "AREA[StartDate]RANGE[2021-01-01,MAX]" in params["filter.advanced"][0]
    )


@pytest.mark.asyncio
async def test_convert_query_sort_order():
    """Test conversion of sort order to API format."""
    query = TrialQuery(sort=SortOrder.RELEVANCE)
    params = await convert_query(query)

    assert "sort" in params
    assert params["sort"] == ["@relevance"]

    query = TrialQuery(sort=SortOrder.LAST_UPDATE)
    params = await convert_query(query)

    assert "sort" in params
    assert params["sort"] == ["LastUpdatePostDate:desc"]


@pytest.mark.asyncio
async def test_convert_query_intervention_type():
    """Test conversion of intervention type to API format."""
    query = TrialQuery(intervention_type=InterventionType.DRUG)
    params = await convert_query(query)

    assert "filter.advanced" in params
    assert "AREA[InterventionType]Drug" in params["filter.advanced"][0]


@pytest.mark.asyncio
async def test_convert_query_sponsor_type():
    """Test conversion of sponsor type to API format."""
    query = TrialQuery(sponsor_type=SponsorType.ACADEMIC)
    params = await convert_query(query)

    assert "filter.advanced" in params
    assert "AREA[SponsorType]Academic" in params["filter.advanced"][0]


@pytest.mark.asyncio
async def test_convert_query_study_design():
    """Test conversion of study design to API format."""
    query = TrialQuery(study_design=StudyDesign.RANDOMIZED)
    params = await convert_query(query)

    assert "filter.advanced" in params
    assert "AREA[StudyDesign]Randomized" in params["filter.advanced"][0]
"AREA[StdAge]Adult" in params["filter.advanced"][0] @pytest.mark.asyncio async def test_convert_query_primary_purpose(): """Test conversion of primary purpose to API format.""" query = TrialQuery(primary_purpose=PrimaryPurpose.TREATMENT) params = await convert_query(query) assert "filter.advanced" in params assert ( "AREA[DesignPrimaryPurpose]Treatment" in params["filter.advanced"][0] ) @pytest.mark.asyncio async def test_convert_query_next_page_hash(): """Test conversion of next_page_hash to API format.""" query = TrialQuery(next_page_hash="abc123") params = await convert_query(query) assert "pageToken" in params assert params["pageToken"] == ["abc123"] @pytest.mark.asyncio async def test_convert_query_complex_parameters(): """Test conversion of multiple parameters to API format.""" query = TrialQuery( conditions=["diabetes"], terms=["obesity"], interventions=["metformin"], primary_purpose=PrimaryPurpose.TREATMENT, study_type=StudyType.INTERVENTIONAL, intervention_type=InterventionType.DRUG, recruiting_status=RecruitingStatus.OPEN, phase=TrialPhase.PHASE3, age_group=AgeGroup.ADULT, sort=SortOrder.RELEVANCE, ) params = await convert_query(query) assert "query.cond" in params # Disease synonym expansion may add synonyms to diabetes assert "diabetes" in params["query.cond"][0] assert "query.term" in params assert params["query.term"] == ["obesity"] assert "query.intr" in params assert params["query.intr"] == ["metformin"] assert "filter.advanced" in params assert ( "AREA[DesignPrimaryPurpose]Treatment" in params["filter.advanced"][0] ) assert "AREA[StudyType]Interventional" in params["filter.advanced"][0] assert "AREA[InterventionType]Drug" in params["filter.advanced"][0] assert "AREA[Phase]PHASE3" in params["filter.advanced"][0] assert "AREA[StdAge]Adult" in params["filter.advanced"][0] assert "filter.overallStatus" in params assert "RECRUITING" in params["filter.overallStatus"][0] assert "sort" in params assert params["sort"] == ["@relevance"] # Test TrialQuery field validation for CLI input processing # noinspection PyTypeChecker def test_trial_query_field_validation_basic(): """Test basic field validation for TrialQuery.""" # Test list fields conversion query = TrialQuery(conditions="diabetes") assert query.conditions == ["diabetes"] query = TrialQuery(interventions="metformin") assert query.interventions == ["metformin"] query = TrialQuery(terms="blood glucose") assert query.terms == ["blood glucose"] query = TrialQuery(nct_ids="NCT01234567") assert query.nct_ids == ["NCT01234567"] # noinspection PyTypeChecker def test_trial_query_field_validation_recruiting_status(): """Test recruiting status field validation.""" # Exact match uppercase query = TrialQuery(recruiting_status="OPEN") assert query.recruiting_status == RecruitingStatus.OPEN # Exact match lowercase query = TrialQuery(recruiting_status="closed") assert query.recruiting_status == RecruitingStatus.CLOSED # Invalid value with pytest.raises(ValueError) as excinfo: TrialQuery(recruiting_status="invalid") assert "validation error for TrialQuery" in str(excinfo.value) # noinspection PyTypeChecker @pytest.mark.asyncio async def test_trial_query_field_validation_combined(): """Test combined parameters validation.""" query = TrialQuery( conditions=["diabetes", "obesity"], interventions="metformin", recruiting_status="open", study_type="interventional", lat=40.7128, long=-74.0060, distance=10, ) assert query.conditions == ["diabetes", "obesity"] assert query.interventions == ["metformin"] assert query.recruiting_status == 
RecruitingStatus.OPEN assert query.study_type == StudyType.INTERVENTIONAL assert query.lat == 40.7128 assert query.long == -74.0060 assert query.distance == 10 # Check that the query can be converted to parameters properly params = await convert_query(query) assert "query.cond" in params # The query should contain the original terms, but may have expanded synonyms cond_value = params["query.cond"][0] assert "diabetes" in cond_value assert "obesity" in cond_value assert cond_value.startswith("(") and cond_value.endswith(")") assert "query.intr" in params assert "metformin" in params["query.intr"][0] assert "filter.geo" in params assert "distance(40.7128,-74.006,10mi)" in params["filter.geo"][0] # noinspection PyTypeChecker @pytest.mark.asyncio async def test_trial_query_field_validation_terms(): """Test terms parameter validation.""" # Single term as string query = TrialQuery(terms="cancer") assert query.terms == ["cancer"] # Multiple terms as list query = TrialQuery(terms=["cancer", "therapy"]) assert query.terms == ["cancer", "therapy"] # Check parameter generation params = await convert_query(query) assert "query.term" in params assert "(cancer OR therapy)" in params["query.term"][0] # noinspection PyTypeChecker @pytest.mark.asyncio async def test_trial_query_field_validation_nct_ids(): """Test NCT IDs parameter validation.""" # Single NCT ID query = TrialQuery(nct_ids="NCT01234567") assert query.nct_ids == ["NCT01234567"] # Multiple NCT IDs query = TrialQuery(nct_ids=["NCT01234567", "NCT89012345"]) assert query.nct_ids == ["NCT01234567", "NCT89012345"] # Check parameter generation params = await convert_query(query) assert "query.id" in params assert "NCT01234567,NCT89012345" in params["query.id"][0] # noinspection PyTypeChecker @pytest.mark.asyncio async def test_trial_query_field_validation_date_range(): """Test date range parameters validation.""" # Min date only with date field query = TrialQuery(min_date="2020-01-01", date_field=DateField.STUDY_START) assert query.min_date == "2020-01-01" assert query.date_field == DateField.STUDY_START # Min and max date with date field using lazy mapping query = TrialQuery( min_date="2020-01-01", max_date="2021-12-31", date_field="last update", # space not underscore. 
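
# Editorial note (assumption, not in the original file): TrialQuery field
# validators accept human-friendly aliases for enum values, so the test below
# can pass date_field="last update" (with a space) and still resolve to
# DateField.LAST_UPDATE.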

# noinspection PyTypeChecker
@pytest.mark.asyncio
async def test_trial_query_field_validation_date_range():
    """Test date range parameters validation."""
    # Min date only with date field
    query = TrialQuery(min_date="2020-01-01", date_field=DateField.STUDY_START)
    assert query.min_date == "2020-01-01"
    assert query.date_field == DateField.STUDY_START

    # Min and max date with date field using lazy mapping
    query = TrialQuery(
        min_date="2020-01-01",
        max_date="2021-12-31",
        date_field="last update",  # space not underscore.
    )
    assert query.min_date == "2020-01-01"
    assert query.max_date == "2021-12-31"
    assert query.date_field == DateField.LAST_UPDATE

    # Check parameter generation
    params = await convert_query(query)
    assert "filter.advanced" in params
    assert (
        "AREA[LastUpdatePostDate]RANGE[2020-01-01,2021-12-31]"
        in params["filter.advanced"][0]
    )


# noinspection PyTypeChecker
def test_trial_query_field_validation_primary_purpose():
    """Test primary purpose parameter validation."""
    # Exact match uppercase
    query = TrialQuery(primary_purpose=PrimaryPurpose.TREATMENT)
    assert query.primary_purpose == PrimaryPurpose.TREATMENT

    # Exact match lowercase
    query = TrialQuery(primary_purpose=PrimaryPurpose.PREVENTION)
    assert query.primary_purpose == PrimaryPurpose.PREVENTION

    # Case-insensitive
    query = TrialQuery(primary_purpose="ScReeNING")
    assert query.primary_purpose == PrimaryPurpose.SCREENING

    # Invalid
    with pytest.raises(ValueError):
        TrialQuery(primary_purpose="invalid")


def test_inject_ids_with_many_ids_and_condition():
    """Test _inject_ids function with 300 IDs and a condition to ensure filter.ids is used."""
    # Create a params dict with a condition (indicating other filters present)
    params = {
        "query.cond": ["melanoma"],
        "format": ["json"],
        "markupFormat": ["markdown"],
    }

    # Generate 300 NCT IDs
    nct_ids = [f"NCT{str(i).zfill(8)}" for i in range(1, 301)]

    # Call _inject_ids with has_other_filters=True
    _inject_ids(params, nct_ids, has_other_filters=True)

    # Assert that filter.ids is used (not query.id)
    assert "filter.ids" in params
    assert "query.id" not in params

    # Verify the IDs are properly formatted
    ids_param = params["filter.ids"][0]
    assert ids_param.startswith("NCT")
    assert "NCT00000001" in ids_param
    assert "NCT00000300" in ids_param

    # Verify it's a comma-separated list
    assert "," in ids_param
    assert ids_param.count(",") == 299  # 300 IDs = 299 commas


def test_inject_ids_without_other_filters():
    """Test _inject_ids function with only NCT IDs (no other filters)."""
    # Create a minimal params dict
    params = {
        "format": ["json"],
        "markupFormat": ["markdown"],
    }

    # Use a small number of NCT IDs
    nct_ids = ["NCT00000001", "NCT00000002", "NCT00000003"]

    # Call _inject_ids with has_other_filters=False
    _inject_ids(params, nct_ids, has_other_filters=False)

    # Assert that query.id is used (not filter.ids) for small lists
    assert "query.id" in params
    assert "filter.ids" not in params

    # Verify the format
    assert params["query.id"][0] == "NCT00000001,NCT00000002,NCT00000003"


def test_inject_ids_large_list_without_filters():
    """Test _inject_ids with a large ID list but no other filters."""
    params = {
        "format": ["json"],
        "markupFormat": ["markdown"],
    }

    # Generate enough IDs to exceed 1800 character limit
    nct_ids = [f"NCT{str(i).zfill(8)}" for i in range(1, 201)]  # ~2200 chars

    # Call _inject_ids with has_other_filters=False
    _inject_ids(params, nct_ids, has_other_filters=False)

    # Assert that filter.ids is used for large lists even without other filters
    assert "filter.ids" in params
    assert "query.id" not in params


# Tests for new Essie builder functions
def test_build_prior_therapy_essie():
    """Test building Essie fragments for prior therapies."""
    # Single therapy
    fragments = _build_prior_therapy_essie(["osimertinib"])
    assert len(fragments) == 1
    assert (
        fragments[0]
        == 'AREA[EligibilityCriteria]("osimertinib" AND (prior OR previous OR received))'
    )

    # Multiple therapies
    fragments = _build_prior_therapy_essie(["osimertinib", "erlotinib"])
    assert len(fragments) == 2
    assert (
        fragments[0]
        == 'AREA[EligibilityCriteria]("osimertinib" AND (prior OR previous OR received))'
    )
    assert (
        fragments[1]
        == 'AREA[EligibilityCriteria]("erlotinib" AND (prior OR previous OR received))'
    )

    # Empty strings are filtered out
    fragments = _build_prior_therapy_essie(["osimertinib", "", "erlotinib"])
    assert len(fragments) == 2


def test_build_progression_essie():
    """Test building Essie fragments for progression on therapy."""
    fragments = _build_progression_essie(["pembrolizumab"])
    assert len(fragments) == 1
    assert (
        fragments[0]
        == 'AREA[EligibilityCriteria]("pembrolizumab" AND (progression OR resistant OR refractory))'
    )


def test_build_required_mutations_essie():
    """Test building Essie fragments for required mutations."""
    fragments = _build_required_mutations_essie(["EGFR L858R", "T790M"])
    assert len(fragments) == 2
    assert fragments[0] == 'AREA[EligibilityCriteria]("EGFR L858R")'
    assert fragments[1] == 'AREA[EligibilityCriteria]("T790M")'


def test_build_excluded_mutations_essie():
    """Test building Essie fragments for excluded mutations."""
    fragments = _build_excluded_mutations_essie(["KRAS G12C"])
    assert len(fragments) == 1
    assert fragments[0] == 'AREA[EligibilityCriteria](NOT "KRAS G12C")'


def test_build_biomarker_expression_essie():
    """Test building Essie fragments for biomarker expression."""
    biomarkers = {"PD-L1": "≥50%", "TMB": "≥10 mut/Mb"}
    fragments = _build_biomarker_expression_essie(biomarkers)
    assert len(fragments) == 2
    assert 'AREA[EligibilityCriteria]("PD-L1" AND "≥50%")' in fragments
    assert 'AREA[EligibilityCriteria]("TMB" AND "≥10 mut/Mb")' in fragments

    # Empty values are filtered out
    biomarkers = {"PD-L1": "≥50%", "TMB": "", "HER2": "positive"}
    fragments = _build_biomarker_expression_essie(biomarkers)
    assert len(fragments) == 2


def test_build_line_of_therapy_essie():
    """Test building Essie fragment for line of therapy."""
    # First line
    fragment = _build_line_of_therapy_essie(LineOfTherapy.FIRST_LINE)
    assert (
        fragment
        == 'AREA[EligibilityCriteria]("first line" OR "first-line" OR "1st line" OR "frontline" OR "treatment naive" OR "previously untreated")'
    )

    # Second line
    fragment = _build_line_of_therapy_essie(LineOfTherapy.SECOND_LINE)
    assert (
        fragment
        == 'AREA[EligibilityCriteria]("second line" OR "second-line" OR "2nd line" OR "one prior line" OR "1 prior line")'
    )

    # Third line plus
    fragment = _build_line_of_therapy_essie(LineOfTherapy.THIRD_LINE_PLUS)
    assert (
        fragment
        == 'AREA[EligibilityCriteria]("third line" OR "third-line" OR "3rd line" OR "≥2 prior" OR "at least 2 prior" OR "heavily pretreated")'
    )


def test_build_brain_mets_essie():
    """Test building Essie fragment for brain metastases filter."""
    # Allow brain mets (no filter)
    fragment = _build_brain_mets_essie(True)
    assert fragment == ""

    # Exclude brain mets
    fragment = _build_brain_mets_essie(False)
    assert fragment == 'AREA[EligibilityCriteria](NOT "brain metastases")'
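
# Editorial note (not in the original file): "Essie" is the expression syntax
# used by ClinicalTrials.gov search; AREA[FieldName](...) scopes a boolean
# expression to a single study field such as EligibilityCriteria. convert_query
# ANDs these fragments into query.term, as the next test demonstrates.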

@pytest.mark.asyncio
async def test_convert_query_with_eligibility_fields():
    """Test conversion of query with new eligibility-focused fields."""
    query = TrialQuery(
        conditions=["lung cancer"],
        prior_therapies=["osimertinib"],
        progression_on=["erlotinib"],
        required_mutations=["EGFR L858R"],
        excluded_mutations=["T790M"],
        biomarker_expression={"PD-L1": "≥50%"},
        line_of_therapy=LineOfTherapy.SECOND_LINE,
        allow_brain_mets=False,
    )
    params = await convert_query(query)

    # Check that query.term contains all the Essie fragments
    assert "query.term" in params
    term = params["query.term"][0]

    # Prior therapy
    assert (
        'AREA[EligibilityCriteria]("osimertinib" AND (prior OR previous OR received))'
        in term
    )
    # Progression
    assert (
        'AREA[EligibilityCriteria]("erlotinib" AND (progression OR resistant OR refractory))'
        in term
    )
    # Required mutation
    assert 'AREA[EligibilityCriteria]("EGFR L858R")' in term
    # Excluded mutation
    assert 'AREA[EligibilityCriteria](NOT "T790M")' in term
    # Biomarker expression
    assert 'AREA[EligibilityCriteria]("PD-L1" AND "≥50%")' in term
    # Line of therapy
    assert 'AREA[EligibilityCriteria]("second line" OR "second-line"' in term
    # Brain mets exclusion
    assert 'AREA[EligibilityCriteria](NOT "brain metastases")' in term

    # All fragments should be combined with AND
    assert " AND " in term


@pytest.mark.asyncio
async def test_convert_query_with_custom_fields_and_page_size():
    """Test conversion of query with custom return fields and page size."""
    query = TrialQuery(
        conditions=["diabetes"],
        return_fields=["NCTId", "BriefTitle", "OverallStatus"],
        page_size=100,
    )
    params = await convert_query(query)

    assert "fields" in params
    assert params["fields"] == ["NCTId,BriefTitle,OverallStatus"]
    assert "pageSize" in params
    assert params["pageSize"] == ["100"]


@pytest.mark.asyncio
async def test_convert_query_eligibility_with_existing_terms():
    """Test that eligibility Essie fragments are properly combined with existing terms."""
    query = TrialQuery(
        terms=["immunotherapy"],
        prior_therapies=["chemotherapy"],
    )
    params = await convert_query(query)

    assert "query.term" in params
    term = params["query.term"][0]

    # Should contain both the original term and the new Essie fragment
    assert "immunotherapy" in term
    assert (
        'AREA[EligibilityCriteria]("chemotherapy" AND (prior OR previous OR received))'
        in term
    )
    # Should be combined with AND
    assert "immunotherapy AND AREA[EligibilityCriteria]" in term
```

--------------------------------------------------------------------------------
/tests/data/pubtator/pubtator3_paper.txt:
--------------------------------------------------------------------------------

```
Nucleic Acids Research, 2024, 52, W540–W546
https://doi.org/10.1093/nar/gkae235
Advance access publication date: 4 April 2024
Web Server issue

PubTator 3.0: an AI-powered literature resource for unlocking biomedical knowledge

Chih-Hsuan Wei†, Alexis Allot†, Po-Ting Lai, Robert Leaman, Shubo Tian, Ling Luo, Qiao Jin, Zhizheng Wang, Qingyu Chen and Zhiyong Lu*

National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA

*To whom correspondence should be addressed. Tel: +1 301 594 7089; Email: [email protected]
†The first two authors should be regarded as Joint First Authors.

Present addresses:
Alexis Allot, The Neuro (Montreal Neurological Institute-Hospital), McGill University, Montreal, Quebec H3A 2B4, Canada.
Ling Luo, School of Computer Science and Technology, Dalian University of Technology, 116024 Dalian, China.
Qingyu Chen, Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, CT 06510, USA.

Abstract

PubTator 3.0 (https://www.ncbi.nlm.nih.gov/research/pubtator3/) is a biomedical literature resource using state-of-the-art AI techniques to offer semantic and relation searches for key concepts like proteins, genetic variants, diseases and chemicals. It currently provides over one billion entity and relation annotations across approximately 36 million PubMed abstracts and 6 million full-text articles from the PMC open access subset, updated weekly.
PubTator 3.0’s online interface and API utilize these precomputed entity relations and synonyms to provide advanced search capabilities and enable large-scale analyses, streamlining many complex information needs. We showcase the retrieval quality of PubTator 3.0 using a series of entity pair queries, demonstrating that PubTator 3.0 retrieves a greater number of articles than either PubMed or Google Scholar, with higher precision in the top 20 results. We further show that integrating ChatGPT (GPT-4) with PubTator APIs dramatically improves the factuality and verifiability of its responses. In summary, PubTator 3.0 offers a comprehensive set of features and tools that allow researchers to navigate the ever-expanding wealth of biomedical literature, expediting research and unlocking valuable insights for scientific discovery.

Received: January 18, 2024. Revised: March 2, 2024. Editorial Decision: March 16, 2024. Accepted: March 21, 2024.
Published by Oxford University Press on behalf of Nucleic Acids Research 2024. This work is written by (a) US Government employee(s) and is in the public domain in the US.

Graphical abstract

Introduction

The biomedical literature is a primary resource to address information needs across the biological and clinical sciences (1); however, the requirements for literature search vary widely. Activities such as formulating a research hypothesis require an exploratory approach, whereas tasks like interpreting the clinical significance of genetic variants are more focused. Traditional keyword-based search methods have long formed the foundation of biomedical literature search (2). While generally effective for basic search, these methods also have significant limitations, such as missing relevant articles due to differing terminology or including irrelevant articles because surface-level term matches cannot adequately represent the required association between query terms. These limitations cost time and risk information needs remaining unmet.

Natural language processing (NLP) methods provide substantial value for creating bioinformatics resources (3–5), and may improve literature search by enabling semantic and relation search (6). In semantic search, users indicate specific concepts of interest (entities) for which the system has precomputed matches regardless of the terminology used. Relation search increases precision by allowing users to specify the type of relationship desired between entities, such as whether a chemical enhances or reduces expression of a gene. In this regard, we present PubTator 3.0, a novel resource engineered to support semantic and relation search in the biomedical literature. Its search capabilities allow users to explore automated entity annotations for six key biomedical entities: genes, diseases, chemicals, genetic variants, species, and cell lines. PubTator 3.0 also identifies and makes searchable 12 common types of relations between entities, enhancing its utility for both targeted and exploratory searches. Focusing on relations and entity types of interest across the biomedical sciences allows PubTator 3.0 to retrieve information precisely while providing broad utility (see detailed comparisons with its predecessor in Supplementary Table S1).

System overview

The PubTator 3.0 online interface, illustrated in Figure 1 and Supplementary Figure S1, is designed for interactive literature exploration, supporting semantic, relation, keyword, and Boolean queries. An auto-complete function provides semantic search suggestions to assist users with query formulation. For example, it automatically suggests replacing either ‘COVID-19’ or ‘SARS-CoV-2 infection’ with the semantic term ‘@DISEASE_COVID_19’. Relation queries – new to PubTator 3.0 – provide increased precision, allowing users to target articles which discuss specific relationships between entities. PubTator 3.0 offers unified search results, simultaneously searching approximately 36 million PubMed abstracts and over 6 million full-text articles from the PMC Open Access Subset (PMC-OA), improving access to the substantial amount of relevant information present in the article full text (7). Search results are prioritized based on the depth of the relationship between the query terms: articles containing identifiable relations between semantic terms receive the highest priority, while articles where semantic or keyword terms co-occur nearby (e.g. within the same sentence) receive secondary priority. Search results are also prioritized based on the article section where the match appears (e.g. matches within the title receive higher priority). Users can further refine results by employing filters, narrowing articles returned to specific publication types, journals, or article sections.

PubTator 3.0 is supported by an NLP pipeline, depicted in Figure 2A. This pipeline, run weekly, first identifies articles newly added to PubMed and PMC-OA. Articles are then processed through three major steps: (i) named entity recognition, provided by the recently developed deep-learning transformer model AIONER (8), (ii) identifier mapping and (iii) relation extraction, performed by BioREx (9) of 12 common types of relations (described in Supplementary Table S2). In total, PubTator 3.0 contains over 1.6 billion entity annotations (4.6 million unique identifiers) and 33 million relations (8.8 million unique pairs). It provides enhanced entity recognition and normalization performance over its previous version, PubTator 2 (10), also known as PubTator Central (Figure 2B and Supplementary Table S3). We show the relation extraction performance of PubTator 3.0 in Figure 2C and its comparison results to the previous state-of-the-art systems (11–13) on the BioCreative V Chemical-Disease Relation (14) corpus, finding that PubTator 3.0 provided substantially higher accuracy. Moreover, when evaluating a randomized sample of entity pair queries compared to PubMed and Google Scholar, PubTator 3.0 consistently returns a greater number of articles with higher precision in the top 20 results (Figure 2D and Supplementary Table S4).

Materials and methods

Data sources and article processing

PubTator 3.0 downloads new articles weekly from the BioC PubMed API (https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/BioC-PubMed/) and the BioC PMC API (https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/BioC-PMC/) in BioC-XML format (16). Local abbreviations are identified using Ab3P (17). Article text and extracted data are stored internally using MongoDB and indexed for search with Solr, ensuring robust and scalable accessibility unconstrained by external dependencies such as the NCBI eUtils API.

Entity recognition and normalization/linking

PubTator 3.0 uses AIONER (8), a recently developed named entity recognition (NER) model, to recognize entities of six types: genes/proteins, chemicals, diseases, species, genetic variants, and cell lines. AIONER utilizes a flexible tagging scheme to integrate training data created separately into a single resource. These training datasets include NLM-Gene (18), NLM-Chem (19), NCBI-Disease (20), BC5CDR (14), tmVar3 (21), Species-800 (22), BioID (23) and BioRED (15). This consolidation creates a larger training set, improving the model’s ability to generalize to unseen data. Furthermore, it enables recognizing multiple entity types simultaneously, enhancing efficiency and simplifying the challenge of distinguishing boundaries between entities that reference others, such as the disorder ‘Alpha-1 antitrypsin deficiency’ and the protein ‘Alpha-1 antitrypsin’. We previously evaluated the performance of AIONER on 14 benchmark datasets (8), including the test sets for the aforementioned training sets. This evaluation demonstrated that AIONER’s performance surpasses or matches previous state-of-the-art methods.

Entity mentions found by AIONER are normalized (linked) to a unique identifier in an appropriate entity database. Normalization is performed by a module designed for (or adapted to) each entity type, using the latest version. The recently upgraded GNorm2 system (24) normalizes genes to NCBI Gene identifiers and species mentions to NCBI Taxonomy. tmVar3 (21), also recently upgraded, normalizes genetic variants; it uses dbSNP identifiers for variants listed in dbSNP and HGVS format otherwise. Chemicals are normalized by the NLM-Chem tagger (19) to MeSH identifiers (25). TaggerOne (26) normalizes diseases to MeSH and cell lines to Cellosaurus (27) using a new normalization-only mode. This mode only applies the normalization model, which converts both mentions and lexicon names into high-dimensional TF-IDF vectors and learns a mapping, as before. However, it now augments the training data by mapping each lexicon name to itself, resulting in a large performance improvement for names present in the lexicon but not in the annotated training data. These enhancements provide a significant overall improvement in entity normalization performance (Supplementary Table S3).

Relation extraction

Relations for PubTator 3.0 are extracted by the unified relation extraction model BioREx (9), designed to simultaneously extract 12 types of relations across eight entity type pairs: chemical–chemical, chemical–disease, chemical–gene, chemical–variant, disease–gene, disease–variant, gene–gene and variant–variant. Detailed definitions of these relation types and their corresponding entity pairs are presented in Supplementary Table S2. Deep-learning methods for relation extraction, such as BioREx, require ample training data. However, training data for relation extraction is fragmented into many datasets, often tailored to specific entity pairs. BioREx overcomes this limitation with a data-centric approach, reconciling discrepancies between disparate training datasets to construct a comprehensive, unified dataset. We evaluated the relations extracted by BioREx using performance on manually annotated relation extraction datasets as well as a comparative analysis between BioREx and notable comparable systems. BioREx established a new performance benchmark on the BioRED corpus test set (15), elevating the performance from 74.4% (F-score) to 79.6%, and demonstrating higher performance than alternative models such as transfer learning (TL), multi-task learning (MTL), and state-of-the-art models trained on isolated datasets (9). For PubTator 3.0, we replaced its deep learning module, PubMedBERT (28), with LinkBERT (29), further increasing the performance to 82.0%. Furthermore, we conducted a comparative analysis between BioREx and SemRep (11), a widely used rule-based method for extracting diverse relations, the CD-REST (13) system, and the previous state-of-the-art system (12), using the BioCreative V Chemical Disease Relation corpus test set (14). Our evaluation demonstrated that PubTator 3.0 provided substantially higher F-score than previous methods.

Programmatic access and data formats

PubTator 3.0 offers programmatic access through its API and bulk download. The API (https://www.ncbi.nlm.nih.gov/research/pubtator3/) supports keyword, entity and relation search, and also supports exporting annotations in XML and JSON-based BioC (16) formats and tab-delimited free text. The PubTator 3.0 FTP site (https://ftp.ncbi.nlm.nih.gov/pub/lu/PubTator3) provides bulk downloads of annotated articles and extraction summaries for entities and relations. Programmatic access supports more flexible query options; for example, the information need ‘what chemicals reduce expression of JAK1?’ can be answered directly via API (e.g. https://www.ncbi.nlm.nih.gov/research/pubtator3-api/relations?e1=@GENE_JAK1&type=negative_correlate&e2=Chemical) or by filtering the bulk relations file. Additionally, the PubTator 3.0 API supports annotation of user-defined free text.

Figure 1. PubTator 3.0 system overview and search results page: 1. Query auto-complete enhances search accuracy and synonym matching. 2. Natural language processing (NLP)-enhanced relevance: Search results are prioritized according to the strength of the relationship between the entities queried. 3. Users can further refine results with facet filters—section, journal and type. 4. Search results include highlighted entity snippets explaining relevance. 5. Histogram visualizes number of results by publication year. 6. Entity highlighting can be switched on or off according to user preference.

Case study I: entity relation queries

We analyzed the retrieval quality of PubTator 3.0 by preparing a series of 12 entity pairs to serve as case studies for comparison between PubTator 3.0, PubMed and Google Scholar. To provide an equal comparison, we filtered about 30% of the Google Scholar results for articles not present in PubMed. To ensure that the number of results would remain low enough to allow filtering Google Scholar results for articles not in PubMed, we identified entity pairs first discussed together in the literature in 2022 or later. We then randomly selected two entity pairs of each of the following types: disease/gene, chemical/disease, chemical/gene, chemical/chemical, gene/gene and disease/variant. None of the relation pairs selected appears in the training set. The comparison was performed with respect to a snapshot of the search results returned by all search engines on 19 May 2023. We manually evaluated the top 20 results for each system and each query; articles were judged to be relevant if they mentioned both entities in the query and supported a relationship between them. Two curators independently judged each article, and discrepancies were discussed until agreement. The curators were not blinded to the retrieval method but were required to record the text supporting the relationship, if relevant. This experiment evaluated the relevance of the top 20 results for each retrieval method, regardless of whether the article appeared in PubMed.

Our analysis is summarized in Figure 2D, and Supplementary Table S4 presents a detailed comparison of the quality of retrieved results between PubTator 3.0, PubMed and Google Scholar. Our results demonstrate that PubTator 3.0 retrieves a greater number of articles than the comparison systems and its precision is higher for the top 20 results. For instance, PubTator 3.0 returned 346 articles for the query ‘GLPG0634 + ulcerative colitis’, and manual review of the top 20 articles showed that all contained statements about an association between GLPG0634 and ulcerative colitis. In contrast, PubMed only returned a total of 18 articles, with only 12 mentioning an association. Moreover, when searching for ‘COVID-19 + PON1’, PubTator 3.0 returns 212 articles in PubMed, surpassing the 43 articles obtained from Google Scholar, only 29 of which are sourced from PubMed. These disparities can be attributed to several factors: (i) PubTator 3.0’s search includes full texts available in PMC-OA, resulting in significantly broader coverage of articles, (ii) entity normalization improves recall, for example, by matching ‘paraoxonase 1’ to ‘PON1’, (iii) PubTator 3.0 prioritizes articles containing relations between the query entities, (iv) PubTator 3.0 prioritizes articles where the entities appear nearby, rather than distant paragraphs. Across the 12 information retrieval case studies, PubTator 3.0 demonstrated an overall precision of 90.0% for the top 20 articles (216 out of 240), which is significantly higher than PubMed’s precision of 81.6% (84 out of 103) and Google Scholar’s precision of 48.5% (98 out of 202).

Figure 2. (A) The PubTator 3.0 processing pipeline: AIONER (8) identifies six types of entities in PubMed abstracts and PMC-OA full-text articles. Entity annotations are associated with database identifiers by specialized mappers and BioREx (9) identifies relations between entities. Extracted data is stored in MongoDB and made searchable using Solr. (B) Entity recognition performance for each entity type compared with PubTator2 (also known as PubTatorCentral) (13) on the BioRED corpus (15). (C) Relation extraction performance compared with SemRep (11) and notable previous best systems (12,13) on the BioCreative V Chemical-Disease Relation (14) corpus. (D) Comparison of information retrieval for PubTator 3.0, PubMed, and Google Scholar for entity pair queries, with respect to total article count and top-20 article precision.

Case study II: retrieval-augmented generation

In the era of large language models (LLMs), PubTator 3.0 can also enhance their factual accuracy via retrieval-augmented generation. Despite their strong language ability, LLMs are prone to generating incorrect assertions, sometimes known as hallucinations (30,31). For example, when requested to cite sources for questions such as ‘which diseases can doxorubicin treat’, GPT-4 frequently provides seemingly plausible but nonexistent references. Augmenting GPT-4 with PubTator 3.0 APIs can anchor the model’s response to verifiable references via the extracted relations, significantly reducing hallucinations.

We assessed the citation accuracy of responses from three GPT-4 variations: PubTator-augmented GPT-4, PubMed-augmented GPT-4 and standard GPT-4. We performed a qualitative evaluation based on eight questions selected as follows. We identified entities mentioned in the PubMed query logs and randomly selected from entities searched both frequently and rarely. We then identified the common queries for each entity that request relational information and adapted one into a natural language question. Each question is therefore grounded on common information needs of real PubMed users. For example, the questions ‘What can be caused by tocilizumab?’ and ‘What can be treated by doxorubicin?’ are adapted from the user queries ‘tocilizumab side effects’ and ‘doxorubicin treatment’ respectively. Such questions typically require extracting information from multiple articles and an understanding of biomedical entities and relationship descriptions. Supplementary Table S5 lists the questions chosen.

We augmented the GPT-4 large language model (LLM) with PubTator 3.0 via the function calling mechanism of the OpenAI ChatCompletion API. This integration involved prompting GPT-4 with descriptions of three PubTator APIs: (i) find entity ID, which retrieves PubTator entity identifiers; (ii) find related entities, which identifies related entities based on an input entity and specified relations and (iii) export relevant search results, which returns PubMed article identifiers containing textual evidence for specific entity relationships. Our instructions prompted GPT-4 to decompose user questions into sub-questions addressable by these APIs, execute the function calls, and synthesize the responses into a coherent final answer. Our prompt promoted a summarized response by instructing GPT-4 to start its message with ‘Summary:’ and requested the response include citations to the articles providing evidence. The PubMed augmentation experiments provided GPT-4 with access to PubMed database search via the National Center for Biotechnology Information (NCBI) E-utils APIs (32). We used Azure OpenAI Services (version 2023-07-01-preview) and GPT-4 (version 2023-06-13) and set the decoding temperature to zero to obtain deterministic outputs. The full prompts are provided in Supplementary Table S6.

PubTator-augmented GPT-4 generally processed the questions in three steps: (i) finding the standard entity identifiers, (ii) finding its related entity identifiers and (iii) searching PubMed articles. For example, to answer ‘What drugs can treat breast cancer?’, GPT-4 first found the PubTator entity identifier for breast cancer (@DISEASE_Breast_Cancer) using the Find Entity ID API. It then used the Find Related Entities API to identify entities related to @DISEASE_Breast_Cancer through a ‘treat’ relation. For demonstration purposes, we limited the maximum number of output entities to five. Finally, GPT-4 called the Export Relevant Search Results API for the PubMed article identifiers containing evidence for these relationships. The raw responses to each prompt for each method are provided in Supplementary Table S6.

We manually evaluated the accuracy of the citations in the responses by reviewing each PubMed article and verifying whether each PubMed article cited supported the stated relationship (e.g. Tamoxifen treating breast cancer). Supplementary Table S5 reports the proportion of the cited articles with valid supporting evidence for each method. GPT-4 frequently generated fabricated citations, widely known as the hallucination issue. While PubMed-augmented GPT-4 showed a higher proportion of accurate citations, some articles cited did not support the relation claims. This is likely because PubMed is based on keyword and Boolean search and does not support queries for specific relationships. Responses generated by PubTator-augmented GPT-4 demonstrated the highest level of citation accuracy, underscoring the potential of PubTator 3.0 as a high-quality knowledge source for addressing biomedical information needs through retrieval-augmented generation with LLMs such as GPT-4. In our experiment, using Azure for ChatGPT, the cost was approximately $1 for two questions with GPT-4-Turbo, or 40 questions when downgraded to GPT-3.5-Turbo, including the cost of input/output tokens.

Discussion

Previous versions of PubTator have fulfilled over one billion API requests since 2015, supporting a wide range of research applications. Numerous studies have harnessed PubTator annotations for disease-specific gene research, including efforts to prioritize candidate genes (33), determine gene–phenotype associations (34), and identify the genetic underpinnings of ...

Conclusion

PubTator 3.0 offers a comprehensive set of features and tools that allow researchers to navigate the ever-expanding wealth of biomedical literature, expediting research and unlocking valuable insights for scientific discovery. The PubTator 3.0 interface, API, and bulk file downloads are available at https://www.ncbi.nlm.nih.gov/research/pubtator3/.

Data availability

Data is available through the online interface at https://www.ncbi.nlm.nih.gov/research/pubtator3/, through the API at https://www.ncbi.nlm.nih.gov/research/pubtator3/api or bulk FTP download at https://ftp.ncbi.nlm.nih.gov/pub/lu/PubTator3/. The source code for each component of PubTator 3.0 is openly accessible. The AIONER named entity recognizer is available at https://github.com/ncbi/AIONER. GNorm2, for gene name normalization, is available at https://github.com/ncbi/GNorm2. The tmVar3 variant name normalizer is available at https://github.com/ncbi/tmVar3. The NLM-Chem Tagger, for chemical name normalization, is available at https://ftp.ncbi.nlm.nih.gov/pub/lu/NLMChem. The TaggerOne system, for disease and cell line normalization, is available at https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/taggerone. The BioREx relation extraction system is available at https://github.com/ncbi/BioREx. The code for customizing ChatGPT with the PubTator 3.0 API is available at https://github.com/ncbi-nlp/pubtator-gpt. The details of the applications, performance, evaluation data, and citations for each tool are shown in Supplementary Table S7. All source code is also available at https://doi.org/10.5281/zenodo.10839630.

Supplementary data

Supplementary Data are available at NAR Online.

Funding

Intramural Research Program of the National Library of Medicine (NLM), National Institutes of Health; ODSS Support of the Exploration of Cloud in NIH Intramural Research. Funding for open access charge: Intramural Research Program of the National Library of Medicine, National Institutes of Health.

Conflict of interest statement

None declared.
```
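
The relations endpoint quoted in the paper's "Programmatic access and data formats" section can be exercised directly. Below is a minimal sketch, assuming only the endpoint URL and parameters given verbatim in the text; the shape of the JSON response is not documented there and is treated as an opaque payload.

```python
# Minimal sketch (editorial example, not part of the repository): query the
# PubTator 3.0 relations API for chemicals that negatively correlate with
# JAK1 expression, using the URL and parameters quoted in the paper above.
import requests

url = "https://www.ncbi.nlm.nih.gov/research/pubtator3-api/relations"
params = {
    "e1": "@GENE_JAK1",            # source entity: the JAK1 gene
    "type": "negative_correlate",  # relation type: reduced expression
    "e2": "Chemical",              # target entity type
}

response = requests.get(url, params=params, timeout=30)
response.raise_for_status()
# The response is JSON; its exact schema is an assumption here.
print(response.json())
```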