This is page 9 of 15. Use http://codebase.md/genomoncology/biomcp?page={x} to view the full context.

# Directory Structure

```
├── .github
│   ├── actions
│   │   └── setup-python-env
│   │       └── action.yml
│   ├── dependabot.yml
│   └── workflows
│       ├── ci.yml
│       ├── deploy-docs.yml
│       ├── main.yml.disabled
│       ├── on-release-main.yml
│       └── validate-codecov-config.yml
├── .gitignore
├── .pre-commit-config.yaml
├── BIOMCP_DATA_FLOW.md
├── CHANGELOG.md
├── CNAME
├── codecov.yaml
├── docker-compose.yml
├── Dockerfile
├── docs
│   ├── apis
│   │   ├── error-codes.md
│   │   ├── overview.md
│   │   └── python-sdk.md
│   ├── assets
│   │   ├── biomcp-cursor-locations.png
│   │   ├── favicon.ico
│   │   ├── icon.png
│   │   ├── logo.png
│   │   ├── mcp_architecture.txt
│   │   └── remote-connection
│   │       ├── 00_connectors.png
│   │       ├── 01_add_custom_connector.png
│   │       ├── 02_connector_enabled.png
│   │       ├── 03_connect_to_biomcp.png
│   │       ├── 04_select_google_oauth.png
│   │       └── 05_success_connect.png
│   ├── backend-services-reference
│   │   ├── 01-overview.md
│   │   ├── 02-biothings-suite.md
│   │   ├── 03-cbioportal.md
│   │   ├── 04-clinicaltrials-gov.md
│   │   ├── 05-nci-cts-api.md
│   │   ├── 06-pubtator3.md
│   │   └── 07-alphagenome.md
│   ├── blog
│   │   ├── ai-assisted-clinical-trial-search-analysis.md
│   │   ├── images
│   │   │   ├── deep-researcher-video.png
│   │   │   ├── researcher-announce.png
│   │   │   ├── researcher-drop-down.png
│   │   │   ├── researcher-prompt.png
│   │   │   ├── trial-search-assistant.png
│   │   │   └── what_is_biomcp_thumbnail.png
│   │   └── researcher-persona-resource.md
│   ├── changelog.md
│   ├── CNAME
│   ├── concepts
│   │   ├── 01-what-is-biomcp.md
│   │   ├── 02-the-deep-researcher-persona.md
│   │   └── 03-sequential-thinking-with-the-think-tool.md
│   ├── developer-guides
│   │   ├── 01-server-deployment.md
│   │   ├── 02-contributing-and-testing.md
│   │   ├── 03-third-party-endpoints.md
│   │   ├── 04-transport-protocol.md
│   │   ├── 05-error-handling.md
│   │   ├── 06-http-client-and-caching.md
│   │   ├── 07-performance-optimizations.md
│   │   └── generate_endpoints.py
│   ├── faq-condensed.md
│   ├── FDA_SECURITY.md
│   ├── genomoncology.md
│   ├── getting-started
│   │   ├── 01-quickstart-cli.md
│   │   ├── 02-claude-desktop-integration.md
│   │   └── 03-authentication-and-api-keys.md
│   ├── how-to-guides
│   │   ├── 01-find-articles-and-cbioportal-data.md
│   │   ├── 02-find-trials-with-nci-and-biothings.md
│   │   ├── 03-get-comprehensive-variant-annotations.md
│   │   ├── 04-predict-variant-effects-with-alphagenome.md
│   │   ├── 05-logging-and-monitoring-with-bigquery.md
│   │   └── 06-search-nci-organizations-and-interventions.md
│   ├── index.md
│   ├── policies.md
│   ├── reference
│   │   ├── architecture-diagrams.md
│   │   ├── quick-architecture.md
│   │   ├── quick-reference.md
│   │   └── visual-architecture.md
│   ├── robots.txt
│   ├── stylesheets
│   │   ├── announcement.css
│   │   └── extra.css
│   ├── troubleshooting.md
│   ├── tutorials
│   │   ├── biothings-prompts.md
│   │   ├── claude-code-biomcp-alphagenome.md
│   │   ├── nci-prompts.md
│   │   ├── openfda-integration.md
│   │   ├── openfda-prompts.md
│   │   ├── pydantic-ai-integration.md
│   │   └── remote-connection.md
│   ├── user-guides
│   │   ├── 01-command-line-interface.md
│   │   ├── 02-mcp-tools-reference.md
│   │   └── 03-integrating-with-ides-and-clients.md
│   └── workflows
│       └── all-workflows.md
├── example_scripts
│   ├── mcp_integration.py
│   └── python_sdk.py
├── glama.json
├── LICENSE
├── lzyank.toml
├── Makefile
├── mkdocs.yml
├── package-lock.json
├── package.json
├── pyproject.toml
├── README.md
├── scripts
│   ├── check_docs_in_mkdocs.py
│   ├── check_http_imports.py
│   └── generate_endpoints_doc.py
├── smithery.yaml
├── src
│   └── biomcp
│       ├── __init__.py
│       ├── __main__.py
│       ├── articles
│       │   ├── __init__.py
│       │   ├── autocomplete.py
│       │   ├── fetch.py
│       │   ├── preprints.py
│       │   ├── search_optimized.py
│       │   ├── search.py
│       │   └── unified.py
│       ├── biomarkers
│       │   ├── __init__.py
│       │   └── search.py
│       ├── cbioportal_helper.py
│       ├── circuit_breaker.py
│       ├── cli
│       │   ├── __init__.py
│       │   ├── articles.py
│       │   ├── biomarkers.py
│       │   ├── diseases.py
│       │   ├── health.py
│       │   ├── interventions.py
│       │   ├── main.py
│       │   ├── openfda.py
│       │   ├── organizations.py
│       │   ├── server.py
│       │   ├── trials.py
│       │   └── variants.py
│       ├── connection_pool.py
│       ├── constants.py
│       ├── core.py
│       ├── diseases
│       │   ├── __init__.py
│       │   ├── getter.py
│       │   └── search.py
│       ├── domain_handlers.py
│       ├── drugs
│       │   ├── __init__.py
│       │   └── getter.py
│       ├── exceptions.py
│       ├── genes
│       │   ├── __init__.py
│       │   └── getter.py
│       ├── http_client_simple.py
│       ├── http_client.py
│       ├── individual_tools.py
│       ├── integrations
│       │   ├── __init__.py
│       │   ├── biothings_client.py
│       │   └── cts_api.py
│       ├── interventions
│       │   ├── __init__.py
│       │   ├── getter.py
│       │   └── search.py
│       ├── logging_filter.py
│       ├── metrics_handler.py
│       ├── metrics.py
│       ├── openfda
│       │   ├── __init__.py
│       │   ├── adverse_events_helpers.py
│       │   ├── adverse_events.py
│       │   ├── cache.py
│       │   ├── constants.py
│       │   ├── device_events_helpers.py
│       │   ├── device_events.py
│       │   ├── drug_approvals.py
│       │   ├── drug_labels_helpers.py
│       │   ├── drug_labels.py
│       │   ├── drug_recalls_helpers.py
│       │   ├── drug_recalls.py
│       │   ├── drug_shortages_detail_helpers.py
│       │   ├── drug_shortages_helpers.py
│       │   ├── drug_shortages.py
│       │   ├── exceptions.py
│       │   ├── input_validation.py
│       │   ├── rate_limiter.py
│       │   ├── utils.py
│       │   └── validation.py
│       ├── organizations
│       │   ├── __init__.py
│       │   ├── getter.py
│       │   └── search.py
│       ├── parameter_parser.py
│       ├── prefetch.py
│       ├── query_parser.py
│       ├── query_router.py
│       ├── rate_limiter.py
│       ├── render.py
│       ├── request_batcher.py
│       ├── resources
│       │   ├── __init__.py
│       │   ├── getter.py
│       │   ├── instructions.md
│       │   └── researcher.md
│       ├── retry.py
│       ├── router_handlers.py
│       ├── router.py
│       ├── shared_context.py
│       ├── thinking
│       │   ├── __init__.py
│       │   ├── sequential.py
│       │   └── session.py
│       ├── thinking_tool.py
│       ├── thinking_tracker.py
│       ├── trials
│       │   ├── __init__.py
│       │   ├── getter.py
│       │   ├── nci_getter.py
│       │   ├── nci_search.py
│       │   └── search.py
│       ├── utils
│       │   ├── __init__.py
│       │   ├── cancer_types_api.py
│       │   ├── cbio_http_adapter.py
│       │   ├── endpoint_registry.py
│       │   ├── gene_validator.py
│       │   ├── metrics.py
│       │   ├── mutation_filter.py
│       │   ├── query_utils.py
│       │   ├── rate_limiter.py
│       │   └── request_cache.py
│       ├── variants
│       │   ├── __init__.py
│       │   ├── alphagenome.py
│       │   ├── cancer_types.py
│       │   ├── cbio_external_client.py
│       │   ├── cbioportal_mutations.py
│       │   ├── cbioportal_search_helpers.py
│       │   ├── cbioportal_search.py
│       │   ├── constants.py
│       │   ├── external.py
│       │   ├── filters.py
│       │   ├── getter.py
│       │   ├── links.py
│       │   └── search.py
│       └── workers
│           ├── __init__.py
│           ├── worker_entry_stytch.js
│           ├── worker_entry.js
│           └── worker.py
├── tests
│   ├── bdd
│   │   ├── cli_help
│   │   │   ├── help.feature
│   │   │   └── test_help.py
│   │   ├── conftest.py
│   │   ├── features
│   │   │   └── alphagenome_integration.feature
│   │   ├── fetch_articles
│   │   │   ├── fetch.feature
│   │   │   └── test_fetch.py
│   │   ├── get_trials
│   │   │   ├── get.feature
│   │   │   └── test_get.py
│   │   ├── get_variants
│   │   │   ├── get.feature
│   │   │   └── test_get.py
│   │   ├── search_articles
│   │   │   ├── autocomplete.feature
│   │   │   ├── search.feature
│   │   │   ├── test_autocomplete.py
│   │   │   └── test_search.py
│   │   ├── search_trials
│   │   │   ├── search.feature
│   │   │   └── test_search.py
│   │   ├── search_variants
│   │   │   ├── search.feature
│   │   │   └── test_search.py
│   │   └── steps
│   │       └── test_alphagenome_steps.py
│   ├── config
│   │   └── test_smithery_config.py
│   ├── conftest.py
│   ├── data
│   │   ├── ct_gov
│   │   │   ├── clinical_trials_api_v2.yaml
│   │   │   ├── trials_NCT04280705.json
│   │   │   └── trials_NCT04280705.txt
│   │   ├── myvariant
│   │   │   ├── myvariant_api.yaml
│   │   │   ├── myvariant_field_descriptions.csv
│   │   │   ├── variants_full_braf_v600e.json
│   │   │   ├── variants_full_braf_v600e.txt
│   │   │   └── variants_part_braf_v600_multiple.json
│   │   ├── openfda
│   │   │   ├── drugsfda_detail.json
│   │   │   ├── drugsfda_search.json
│   │   │   ├── enforcement_detail.json
│   │   │   └── enforcement_search.json
│   │   └── pubtator
│   │       ├── pubtator_autocomplete.json
│   │       └── pubtator3_paper.txt
│   ├── integration
│   │   ├── test_openfda_integration.py
│   │   ├── test_preprints_integration.py
│   │   ├── test_simple.py
│   │   └── test_variants_integration.py
│   ├── tdd
│   │   ├── articles
│   │   │   ├── test_autocomplete.py
│   │   │   ├── test_cbioportal_integration.py
│   │   │   ├── test_fetch.py
│   │   │   ├── test_preprints.py
│   │   │   ├── test_search.py
│   │   │   └── test_unified.py
│   │   ├── conftest.py
│   │   ├── drugs
│   │   │   ├── __init__.py
│   │   │   └── test_drug_getter.py
│   │   ├── openfda
│   │   │   ├── __init__.py
│   │   │   ├── test_adverse_events.py
│   │   │   ├── test_device_events.py
│   │   │   ├── test_drug_approvals.py
│   │   │   ├── test_drug_labels.py
│   │   │   ├── test_drug_recalls.py
│   │   │   ├── test_drug_shortages.py
│   │   │   └── test_security.py
│   │   ├── test_biothings_integration_real.py
│   │   ├── test_biothings_integration.py
│   │   ├── test_circuit_breaker.py
│   │   ├── test_concurrent_requests.py
│   │   ├── test_connection_pool.py
│   │   ├── test_domain_handlers.py
│   │   ├── test_drug_approvals.py
│   │   ├── test_drug_recalls.py
│   │   ├── test_drug_shortages.py
│   │   ├── test_endpoint_documentation.py
│   │   ├── test_error_scenarios.py
│   │   ├── test_europe_pmc_fetch.py
│   │   ├── test_mcp_integration.py
│   │   ├── test_mcp_tools.py
│   │   ├── test_metrics.py
│   │   ├── test_nci_integration.py
│   │   ├── test_nci_mcp_tools.py
│   │   ├── test_network_policies.py
│   │   ├── test_offline_mode.py
│   │   ├── test_openfda_unified.py
│   │   ├── test_pten_r173_search.py
│   │   ├── test_render.py
│   │   ├── test_request_batcher.py.disabled
│   │   ├── test_retry.py
│   │   ├── test_router.py
│   │   ├── test_shared_context.py.disabled
│   │   ├── test_unified_biothings.py
│   │   ├── thinking
│   │   │   ├── __init__.py
│   │   │   └── test_sequential.py
│   │   ├── trials
│   │   │   ├── test_backward_compatibility.py
│   │   │   ├── test_getter.py
│   │   │   └── test_search.py
│   │   ├── utils
│   │   │   ├── test_gene_validator.py
│   │   │   ├── test_mutation_filter.py
│   │   │   ├── test_rate_limiter.py
│   │   │   └── test_request_cache.py
│   │   ├── variants
│   │   │   ├── constants.py
│   │   │   ├── test_alphagenome_api_key.py
│   │   │   ├── test_alphagenome_comprehensive.py
│   │   │   ├── test_alphagenome.py
│   │   │   ├── test_cbioportal_mutations.py
│   │   │   ├── test_cbioportal_search.py
│   │   │   ├── test_external_integration.py
│   │   │   ├── test_external.py
│   │   │   ├── test_extract_gene_aa_change.py
│   │   │   ├── test_filters.py
│   │   │   ├── test_getter.py
│   │   │   ├── test_links.py
│   │   │   └── test_search.py
│   │   └── workers
│   │       └── test_worker_sanitization.js
│   └── test_pydantic_ai_integration.py
├── THIRD_PARTY_ENDPOINTS.md
├── tox.ini
├── uv.lock
└── wrangler.toml
```

# Files

--------------------------------------------------------------------------------
/src/biomcp/variants/cbioportal_search.py:
--------------------------------------------------------------------------------

```python
"""cBioPortal search enhancements for variant queries."""

import asyncio
import logging
from typing import Any

from pydantic import BaseModel, Field

from ..utils.cbio_http_adapter import CBioHTTPAdapter
from ..utils.gene_validator import is_valid_gene_symbol, sanitize_gene_symbol
from ..utils.request_cache import request_cache
from .cancer_types import get_cancer_keywords

logger = logging.getLogger(__name__)

# Cache for frequently accessed data
_cancer_type_cache: dict[str, dict[str, Any]] = {}
_gene_panel_cache: dict[str, list[str]] = {}


class GeneHotspot(BaseModel):
    """Hotspot mutation information."""

    position: int
    amino_acid_change: str
    count: int
    frequency: float
    cancer_types: list[str] = Field(default_factory=list)


class CBioPortalSearchSummary(BaseModel):
    """Summary data from cBioPortal for a gene search."""

    gene: str
    total_mutations: int = 0
    total_samples_tested: int = 0
    mutation_frequency: float = 0.0
    hotspots: list[GeneHotspot] = Field(default_factory=list)
    cancer_distribution: dict[str, int] = Field(default_factory=dict)
    study_coverage: dict[str, Any] = Field(default_factory=dict)
    top_studies: list[str] = Field(default_factory=list)


class CBioPortalSearchClient:
    """Client for cBioPortal search operations."""

    def __init__(self):
        self.http_adapter = CBioHTTPAdapter()

    @request_cache(ttl=900)  # Cache for 15 minutes
    async def get_gene_search_summary(
        self, gene: str, max_studies: int = 10
    ) -> CBioPortalSearchSummary | None:
        """Get summary statistics for a gene across cBioPortal.

        Args:
            gene: Gene symbol (e.g., "BRAF")
            max_studies: Maximum number of studies to query

        Returns:
            Summary statistics or None if gene not found
        """
        # Validate and sanitize gene symbol
        if not is_valid_gene_symbol(gene):
            logger.warning(f"Invalid gene symbol: {gene}")
            return None

        gene = sanitize_gene_symbol(gene)

        try:
            # Get gene info first
            gene_data, error = await self.http_adapter.get(
                f"/genes/{gene}", endpoint_key="cbioportal_genes"
            )
            if error or not gene_data:
                logger.warning(f"Gene {gene} not found in cBioPortal")
                return None

            gene_id = gene_data.get("entrezGeneId")
            if not gene_id:
                return None

            # Get cancer type keywords for this gene
            cancer_keywords = get_cancer_keywords(gene)

            # Get relevant molecular profiles in parallel with cancer types
            profiles_task = self._get_relevant_profiles(gene, cancer_keywords)
            cancer_types_task = self._get_cancer_types()

            profiles, cancer_types = await asyncio.gather(
                profiles_task, cancer_types_task
            )

            if not profiles:
                logger.info(f"No relevant profiles found for {gene}")
                return None

            # Query mutations from top studies
            selected_profiles = profiles[:max_studies]
            mutation_summary = await self._get_mutation_summary(
                gene_id, selected_profiles, cancer_types
            )

            # Build summary
            summary = CBioPortalSearchSummary(
                gene=gene,
                total_mutations=mutation_summary.get("total_mutations", 0),
                total_samples_tested=mutation_summary.get("total_samples", 0),
                mutation_frequency=mutation_summary.get("frequency", 0.0),
                hotspots=mutation_summary.get("hotspots", []),
                cancer_distribution=mutation_summary.get(
                    "cancer_distribution", {}
                ),
                study_coverage={
                    "total_studies": len(profiles),
                    "queried_studies": len(selected_profiles),
                    "studies_with_data": mutation_summary.get(
                        "studies_with_data", 0
                    ),
                },
                top_studies=[
                    p.get("studyId", "")
                    for p in selected_profiles
                    if p.get("studyId")
                ][:5],
            )

            return summary

        except TimeoutError:
            logger.error(
                f"cBioPortal API timeout for gene {gene}. "
                "The API may be slow or unavailable. Try again later."
            )
            return None
        except ConnectionError as e:
            logger.error(
                f"Network error accessing cBioPortal for gene {gene}: {e}. "
                "Check your internet connection."
            )
            return None
        except Exception as e:
            logger.error(
                f"Unexpected error getting cBioPortal summary for {gene}: "
                f"{type(e).__name__}: {e}. "
                "This may be a temporary issue. If it persists, please report it."
            )
            return None

    async def _get_cancer_types(self) -> dict[str, dict[str, Any]]:
        """Get cancer type hierarchy (cached)."""
        if _cancer_type_cache:
            return _cancer_type_cache

        try:
            cancer_types, error = await self.http_adapter.get(
                "/cancer-types",
                endpoint_key="cbioportal_cancer_types",
                cache_ttl=86400,  # Cache for 24 hours
            )
            if not error and cancer_types:
                # Build lookup by ID
                for ct in cancer_types:
                    ct_id = ct.get("cancerTypeId")
                    if ct_id:
                        _cancer_type_cache[ct_id] = ct
            return _cancer_type_cache
        except Exception as e:
            logger.warning(f"Failed to get cancer types: {e}")
            return {}

    async def _get_relevant_profiles(
        self,
        gene: str,
        cancer_keywords: list[str],
    ) -> list[dict[str, Any]]:
        """Get molecular profiles relevant to the gene."""
        try:
            # Get all mutation profiles
            all_profiles, error = await self.http_adapter.get(
                "/molecular-profiles",
                params={"molecularAlterationType": "MUTATION_EXTENDED"},
                endpoint_key="cbioportal_molecular_profiles",
                cache_ttl=3600,  # Cache for 1 hour
            )

            if error or not all_profiles:
                return []

            # Filter by cancer keywords
            relevant_profiles = []
            for profile in all_profiles:
                study_id = profile.get("studyId", "").lower()
                if any(keyword in study_id for keyword in cancer_keywords):
                    relevant_profiles.append(profile)

            # Sort by sample count (larger studies first)
            # Note: We'd need to fetch study details for actual sample counts
            # For now, prioritize known large studies
            priority_studies = [
                "msk_impact",
                "tcga",
                "genie",
                "metabric",
                "broad",
            ]

            def study_priority(profile):
                study_id = profile.get("studyId", "").lower()
                for i, priority in enumerate(priority_studies):
                    if priority in study_id:
                        return i
                return len(priority_studies)

            relevant_profiles.sort(key=study_priority)

            return relevant_profiles

        except Exception as e:
            logger.warning(f"Failed to get profiles: {e}")
            return []

    async def _get_mutation_summary(
        self,
        gene_id: int,
        profiles: list[dict[str, Any]],
        cancer_types: dict[str, dict[str, Any]],
    ) -> dict[str, Any]:
        """Get mutation summary across selected profiles."""
        # Batch mutations queries for better performance
        BATCH_SIZE = (
            5  # Process 5 profiles at a time to avoid overwhelming the API
        )

        mutation_results = []
        study_ids = []

        for i in range(0, len(profiles), BATCH_SIZE):
            batch = profiles[i : i + BATCH_SIZE]
            batch_tasks = []
            batch_study_ids = []

            for profile in batch:
                profile_id = profile.get("molecularProfileId")
                study_id = profile.get("studyId")
                if profile_id and study_id:
                    task = self._get_profile_mutations(
                        gene_id, profile_id, study_id
                    )
                    batch_tasks.append(task)
                    batch_study_ids.append(study_id)

            if batch_tasks:
                # Execute batch in parallel
                batch_results = await asyncio.gather(
                    *batch_tasks, return_exceptions=True
                )
                mutation_results.extend(batch_results)
                study_ids.extend(batch_study_ids)

            # Small delay between batches to avoid rate limiting
            if i + BATCH_SIZE < len(profiles):
                await asyncio.sleep(0.05)  # 50ms delay

        results = mutation_results

        # Process results using helper function
        from .cbioportal_search_helpers import (
            format_hotspots,
            process_mutation_results,
        )

        mutation_data = await process_mutation_results(
            list(zip(results, study_ids, strict=False)),
            cancer_types,
            self,
        )

        # Calculate frequency
        frequency = (
            mutation_data["total_mutations"] / mutation_data["total_samples"]
            if mutation_data["total_samples"] > 0
            else 0.0
        )

        # Format hotspots
        hotspots = format_hotspots(
            mutation_data["hotspot_counts"], mutation_data["total_mutations"]
        )

        return {
            "total_mutations": mutation_data["total_mutations"],
            "total_samples": mutation_data["total_samples"],
            "frequency": frequency,
            "hotspots": hotspots,
            "cancer_distribution": mutation_data["cancer_distribution"],
            "studies_with_data": mutation_data["studies_with_data"],
        }

    async def _get_profile_mutations(
        self,
        gene_id: int,
        profile_id: str,
        study_id: str,
    ) -> dict[str, Any] | None:
        """Get mutations for a gene in a specific profile."""
        try:
            # Get sample count for the study
            samples, samples_error = await self.http_adapter.get(
                f"/studies/{study_id}/samples",
                params={"projection": "SUMMARY"},
                endpoint_key="cbioportal_studies",
                cache_ttl=3600,  # Cache for 1 hour
            )

            sample_count = len(samples) if samples and not samples_error else 0

            # Get mutations
            mutations, mut_error = await self.http_adapter.get(
                f"/molecular-profiles/{profile_id}/mutations",
                params={
                    "sampleListId": f"{study_id}_all",
                    "geneIdType": "ENTREZ_GENE_ID",
                    "geneIds": str(gene_id),
                    "projection": "SUMMARY",
                },
                endpoint_key="cbioportal_mutations",
                cache_ttl=900,  # Cache for 15 minutes
            )

            if not mut_error and mutations:
                return {"mutations": mutations, "sample_count": sample_count}

        except Exception as e:
            logger.debug(
                f"Failed to get mutations for {profile_id}: {type(e).__name__}"
            )

        return None

    async def _get_study_cancer_type(
        self,
        study_id: str,
        cancer_types: dict[str, dict[str, Any]],
    ) -> str:
        """Get cancer type name for a study."""
        try:
            study, error = await self.http_adapter.get(
                f"/studies/{study_id}",
                endpoint_key="cbioportal_studies",
                cache_ttl=3600,  # Cache for 1 hour
            )
            if not error and study:
                cancer_type_id = study.get("cancerTypeId")
                if cancer_type_id and cancer_type_id in cancer_types:
                    return cancer_types[cancer_type_id].get("name", "Unknown")
                elif cancer_type := study.get("cancerType"):
                    return cancer_type.get("name", "Unknown")
        except Exception:
            logger.debug(f"Failed to get cancer type for study {study_id}")

        # Fallback: infer from study ID
        study_lower = study_id.lower()
        if "brca" in study_lower or "breast" in study_lower:
            return "Breast Cancer"
        elif "lung" in study_lower or "nsclc" in study_lower:
            return "Lung Cancer"
        elif "coad" in study_lower or "colorectal" in study_lower:
            return "Colorectal Cancer"
        elif "skcm" in study_lower or "melanoma" in study_lower:
            return "Melanoma"
        elif "prad" in study_lower or "prostate" in study_lower:
            return "Prostate Cancer"

        return "Unknown"


def format_cbioportal_search_summary(
    summary: CBioPortalSearchSummary | None,
) -> str:
    """Format cBioPortal search summary for display."""
    if not summary:
        return ""

    lines = [
        f"\n### cBioPortal Summary for {summary.gene}",
        f"- **Mutation Frequency**: {summary.mutation_frequency:.1%} ({summary.total_mutations:,} mutations in {summary.total_samples_tested:,} samples)",
        f"- **Studies**: {summary.study_coverage.get('studies_with_data', 0)} of {summary.study_coverage.get('queried_studies', 0)} studies have mutations",
    ]

    if summary.hotspots:
        lines.append("\n**Top Hotspots:**")
        for hs in summary.hotspots[:3]:
            lines.append(
                f"- {hs.amino_acid_change}: {hs.count} cases ({hs.frequency:.1%}) in {', '.join(hs.cancer_types[:3])}"
            )

    if summary.cancer_distribution:
        lines.append("\n**Cancer Type Distribution:**")
        for cancer_type, count in sorted(
            summary.cancer_distribution.items(),
            key=lambda x: x[1],
            reverse=True,
        )[:5]:
            lines.append(f"- {cancer_type}: {count} mutations")

    return "\n".join(lines)
```

--------------------------------------------------------------------------------
/src/biomcp/query_router.py:
--------------------------------------------------------------------------------

```python
"""Query router for unified search in BioMCP."""

import asyncio
from dataclasses import dataclass
from typing import Any

from biomcp.articles.search import PubmedRequest
from biomcp.articles.unified import search_articles_unified
from biomcp.query_parser import ParsedQuery
from biomcp.trials.search import TrialQuery, search_trials
from biomcp.variants.search import VariantQuery, search_variants


@dataclass
class RoutingPlan:
    """Plan for routing a query to appropriate tools."""

    tools_to_call: list[str]
    field_mappings: dict[str, dict[str, Any]]
    coordination_strategy: str = "parallel"


class QueryRouter:
    """Routes unified queries to appropriate domain-specific tools."""

    def route(self, parsed_query: ParsedQuery) -> RoutingPlan:
        """Determine which tools to call based on query fields."""
        tools_to_call = []
        field_mappings = {}

        # Check which domains are referenced
        domains_referenced = self._get_referenced_domains(parsed_query)

        # Build field mappings for each domain
        domain_mappers = {
            "articles": ("article_searcher", self._map_article_fields),
            "trials": ("trial_searcher", self._map_trial_fields),
            "variants": ("variant_searcher", self._map_variant_fields),
            "genes": ("gene_searcher", self._map_gene_fields),
            "drugs": ("drug_searcher", self._map_drug_fields),
            "diseases": ("disease_searcher", self._map_disease_fields),
        }

        for domain, (tool_name, mapper_func) in domain_mappers.items():
            if domain in domains_referenced:
                tools_to_call.append(tool_name)
                field_mappings[tool_name] = mapper_func(parsed_query)

        return RoutingPlan(
            tools_to_call=tools_to_call,
            field_mappings=field_mappings,
            coordination_strategy="parallel",
        )

    def _get_referenced_domains(self, parsed_query: ParsedQuery) -> set[str]:
        """Get all domains referenced in the query."""
        domains_referenced = set()

        # Check domain-specific fields
        for domain, fields in parsed_query.domain_specific_fields.items():
            if fields:
                domains_referenced.add(domain)

        # Check cross-domain fields (these trigger multiple searches)
        if parsed_query.cross_domain_fields:
            cross_domain_mappings = {
                "gene": ["articles", "variants", "genes", "trials"],
                "disease": ["articles", "trials", "diseases"],
                "variant": ["articles", "variants"],
                "chemical": ["articles", "trials", "drugs"],
                "drug": ["articles", "trials", "drugs"],
            }

            for field, domains in cross_domain_mappings.items():
                if field in parsed_query.cross_domain_fields:
                    domains_referenced.update(domains)

        return domains_referenced

    def _map_article_fields(self, parsed_query: ParsedQuery) -> dict[str, Any]:
        """Map query fields to article searcher parameters."""
        mapping: dict[str, Any] = {}

        # Map cross-domain fields
        if "gene" in parsed_query.cross_domain_fields:
            mapping["genes"] = [parsed_query.cross_domain_fields["gene"]]
        if "disease" in parsed_query.cross_domain_fields:
            mapping["diseases"] = [parsed_query.cross_domain_fields["disease"]]
        if "variant" in parsed_query.cross_domain_fields:
            mapping["variants"] = [parsed_query.cross_domain_fields["variant"]]

        # Map article-specific fields
        article_fields = parsed_query.domain_specific_fields.get(
            "articles", {}
        )
        if "title" in article_fields:
            mapping["keywords"] = [article_fields["title"]]
        if "author" in article_fields:
            mapping["keywords"] = mapping.get("keywords", []) + [
                article_fields["author"]
            ]
        if "journal" in article_fields:
            mapping["keywords"] = mapping.get("keywords", []) + [
                article_fields["journal"]
            ]

        # Extract mutation patterns from raw query
        import re

        raw_query = parsed_query.raw_query
        # Look for mutation patterns like F57Y, F57*, V600E
        mutation_patterns = re.findall(r"\b[A-Z]\d+[A-Z*]\b", raw_query)
        if mutation_patterns:
            if "keywords" not in mapping:
                mapping["keywords"] = []
            mapping["keywords"].extend(mutation_patterns)

        return mapping

    def _map_trial_fields(self, parsed_query: ParsedQuery) -> dict[str, Any]:
        """Map query fields to trial searcher parameters."""
        mapping: dict[str, Any] = {}

        # Map cross-domain fields
        if "disease" in parsed_query.cross_domain_fields:
            mapping["conditions"] = [
                parsed_query.cross_domain_fields["disease"]
            ]

        # Gene searches in trials might look for targeted therapies
        if "gene" in parsed_query.cross_domain_fields:
            gene = parsed_query.cross_domain_fields["gene"]
            # Search for gene-targeted interventions
            mapping["keywords"] = [gene]

        # Map trial-specific fields
        trial_fields = parsed_query.domain_specific_fields.get("trials", {})
        if "condition" in trial_fields:
            mapping["conditions"] = [trial_fields["condition"]]
        if "intervention" in trial_fields:
            mapping["interventions"] = [trial_fields["intervention"]]
        if "phase" in trial_fields:
            mapping["phase"] = f"PHASE{trial_fields['phase']}"
        if "status" in trial_fields:
            mapping["recruiting_status"] = trial_fields["status"].upper()

        return mapping

    def _map_variant_fields(self, parsed_query: ParsedQuery) -> dict[str, Any]:
        """Map query fields to variant searcher parameters."""
        mapping: dict[str, Any] = {}

        # Map cross-domain fields
        if "gene" in parsed_query.cross_domain_fields:
            mapping["gene"] = parsed_query.cross_domain_fields["gene"]
        if "variant" in parsed_query.cross_domain_fields:
            variant = parsed_query.cross_domain_fields["variant"]
            # Check if it's an rsID or protein change
            if variant.startswith("rs"):
                mapping["rsid"] = variant
            else:
                mapping["hgvsp"] = variant

        # Map variant-specific fields
        variant_fields = parsed_query.domain_specific_fields.get(
            "variants", {}
        )
        if "rsid" in variant_fields:
            mapping["rsid"] = variant_fields["rsid"]
        if "gene" in variant_fields:
            mapping["gene"] = variant_fields["gene"]
        if "significance" in variant_fields:
            mapping["significance"] = variant_fields["significance"]
        if "frequency" in variant_fields:
            # Parse frequency operators
            freq = variant_fields["frequency"]
            if freq.startswith("<"):
                mapping["max_frequency"] = float(freq[1:])
            elif freq.startswith(">"):
                mapping["min_frequency"] = float(freq[1:])

        return mapping

    def _map_gene_fields(self, parsed_query: ParsedQuery) -> dict[str, Any]:
        """Map query fields to gene searcher parameters."""
        mapping: dict[str, Any] = {}

        # Map cross-domain fields
        if "gene" in parsed_query.cross_domain_fields:
            mapping["query"] = parsed_query.cross_domain_fields["gene"]

        # Map gene-specific fields
        gene_fields = parsed_query.domain_specific_fields.get("genes", {})
        if "symbol" in gene_fields:
            mapping["query"] = gene_fields["symbol"]
        elif "name" in gene_fields:
            mapping["query"] = gene_fields["name"]
        elif "type" in gene_fields:
            mapping["type_of_gene"] = gene_fields["type"]

        return mapping

    def _map_drug_fields(self, parsed_query: ParsedQuery) -> dict[str, Any]:
        """Map query fields to drug searcher parameters."""
        mapping: dict[str, Any] = {}

        # Map cross-domain fields
        if "chemical" in parsed_query.cross_domain_fields:
            mapping["query"] = parsed_query.cross_domain_fields["chemical"]
        elif "drug" in parsed_query.cross_domain_fields:
            mapping["query"] = parsed_query.cross_domain_fields["drug"]

        # Map drug-specific fields
        drug_fields = parsed_query.domain_specific_fields.get("drugs", {})
        if "name" in drug_fields:
            mapping["query"] = drug_fields["name"]
        elif "tradename" in drug_fields:
            mapping["query"] = drug_fields["tradename"]
        elif "indication" in drug_fields:
            mapping["indication"] = drug_fields["indication"]

        return mapping

    def _map_disease_fields(self, parsed_query: ParsedQuery) -> dict[str, Any]:
        """Map query fields to disease searcher parameters."""
        mapping: dict[str, Any] = {}

        # Map cross-domain fields
        if "disease" in parsed_query.cross_domain_fields:
            mapping["query"] = parsed_query.cross_domain_fields["disease"]

        # Map disease-specific fields
        disease_fields = parsed_query.domain_specific_fields.get(
            "diseases", {}
        )
        if "name" in disease_fields:
            mapping["query"] = disease_fields["name"]
        elif "mondo" in disease_fields:
            mapping["query"] = disease_fields["mondo"]
        elif "synonym" in disease_fields:
            mapping["query"] = disease_fields["synonym"]

        return mapping


async def execute_routing_plan(
    plan: RoutingPlan, output_json: bool = True
) -> dict[str, Any]:
    """Execute a routing plan by calling the appropriate tools."""
    tasks = []
    task_names = []

    for tool_name in plan.tools_to_call:
        params = plan.field_mappings[tool_name]

        if tool_name == "article_searcher":
            request = PubmedRequest(**params)
            tasks.append(
                search_articles_unified(
                    request,
                    include_pubmed=True,
                    include_preprints=False,
                    output_json=output_json,
                )
            )
            task_names.append("articles")

        elif tool_name == "trial_searcher":
            query = TrialQuery(**params)
            tasks.append(search_trials(query, output_json=output_json))
            task_names.append("trials")

        elif tool_name == "variant_searcher":
            variant_query = VariantQuery(**params)
            tasks.append(
                search_variants(variant_query, output_json=output_json)
            )
            task_names.append("variants")

        elif tool_name == "gene_searcher":
            # For gene search, we'll use the BioThingsClient directly
            from biomcp.integrations.biothings_client import BioThingsClient

            client = BioThingsClient()
            query_str = params.get("query", "")
            tasks.append(_search_genes(client, query_str, output_json))
            task_names.append("genes")

        elif tool_name == "drug_searcher":
            # For drug search, we'll use the BioThingsClient directly
            from biomcp.integrations.biothings_client import BioThingsClient

            client = BioThingsClient()
            query_str = params.get("query", "")
            tasks.append(_search_drugs(client, query_str, output_json))
            task_names.append("drugs")

        elif tool_name == "disease_searcher":
            # For disease search, we'll use the BioThingsClient directly
            from biomcp.integrations.biothings_client import BioThingsClient

            client = BioThingsClient()
            query_str = params.get("query", "")
            tasks.append(_search_diseases(client, query_str, output_json))
            task_names.append("diseases")

    # Execute all searches in parallel
    results = await asyncio.gather(*tasks, return_exceptions=True)

    # Package results
    output: dict[str, Any] = {}
    for name, result in zip(task_names, results, strict=False):
        if isinstance(result, Exception):
            output[name] = {"error": str(result)}
        else:
            output[name] = result

    return output


async def _search_genes(client, query: str, output_json: bool) -> Any:
    """Search for genes using BioThingsClient."""
    results = await client._query_gene(query)
    if not results:
        return [] if output_json else "No genes found"

    # Fetch full details for each result
    detailed_results = []
    for result in results[:10]:  # Limit to 10 results
        gene_id = result.get("_id")
        if gene_id:
            full_gene = await client._get_gene_by_id(gene_id)
            if full_gene:
                detailed_results.append(full_gene.model_dump(by_alias=True))

    if output_json:
        import json

        return json.dumps(detailed_results)
    else:
        return detailed_results


async def _search_drugs(client, query: str, output_json: bool) -> Any:
    """Search for drugs using BioThingsClient."""
    results = await client._query_drug(query)
    if not results:
        return [] if output_json else "No drugs found"

    # Fetch full details for each result
    detailed_results = []
    for result in results[:10]:  # Limit to 10 results
        drug_id = result.get("_id")
        if drug_id:
            full_drug = await client._get_drug_by_id(drug_id)
            if full_drug:
                detailed_results.append(full_drug.model_dump(by_alias=True))

    if output_json:
        import json

        return json.dumps(detailed_results)
    else:
        return detailed_results


async def _search_diseases(client, query: str, output_json: bool) -> Any:
    """Search for diseases using BioThingsClient."""
    results = await client._query_disease(query)
    if not results:
        return [] if output_json else "No diseases found"

    # Fetch full details for each result
    detailed_results = []
    for result in results[:10]:  # Limit to 10 results
        disease_id = result.get("_id")
        if disease_id:
            full_disease = await client._get_disease_by_id(disease_id)
            if full_disease:
                detailed_results.append(full_disease.model_dump(by_alias=True))

    if output_json:
        import json

        return json.dumps(detailed_results)
    else:
        return detailed_results
```

--------------------------------------------------------------------------------
/docs/user-guides/03-integrating-with-ides-and-clients.md:
--------------------------------------------------------------------------------

```markdown
# Integrating with IDEs and Clients

BioMCP can be integrated into your development workflow through multiple approaches. This guide covers integration with IDEs, Python applications, and MCP-compatible clients.

## Integration Methods Overview | Method | Best For | Installation | Usage Pattern | | -------------- | ------------------------- | ------------ | ------------------------ | | **Cursor IDE** | Interactive development | Smithery CLI | Natural language queries | | **Python SDK** | Application development | pip/uv | Direct function calls | | **MCP Client** | AI assistants & protocols | Subprocess | Tool-based communication | ## Cursor IDE Integration Cursor IDE provides the most seamless integration for interactive biomedical research during development. ### Installation 1. **Prerequisites:** - [Cursor IDE](https://cursor.sh/) installed - [Smithery](https://smithery.ai/) account and token 2. **Install BioMCP:** ```bash npx -y @smithery/cli@latest install @genomoncology/biomcp --client cursor ``` 3. **Configuration:** - The Smithery CLI automatically configures Cursor - No manual configuration needed ### Usage in Cursor Once installed, you can query biomedical data using natural language: #### Clinical Trials ``` "Find Phase 3 clinical trials for lung cancer with immunotherapy" ``` #### Research Articles ``` "Summarize recent research on EGFR mutations in lung cancer" ``` #### Genetic Variants ``` "What's the clinical significance of the BRAF V600E mutation?" ``` #### Complex Queries ``` "Compare treatment outcomes for ALK-positive vs EGFR-mutant NSCLC" ``` ### Cursor Tips 1. **Be Specific**: Include gene names, disease types, and treatment modalities 2. **Iterate**: Refine queries based on initial results 3. **Cross-Reference**: Ask for both articles and trials on the same topic 4. **Export Results**: Copy formatted results for documentation ## Python SDK Integration The Python SDK provides programmatic access to BioMCP for building applications. 
### Installation ```bash # Using pip pip install biomcp-python # Using uv uv add biomcp-python # For scripts uv pip install biomcp-python ``` ### Basic Usage ```python import asyncio from biomcp import BioMCP async def main(): # Initialize client client = BioMCP() # Search for articles articles = await client.articles.search( genes=["BRAF"], diseases=["melanoma"], limit=5 ) # Search for trials trials = await client.trials.search( conditions=["breast cancer"], interventions=["CDK4/6 inhibitor"], recruiting_status="RECRUITING" ) # Get variant details variant = await client.variants.get("rs121913529") return articles, trials, variant # Run the async function results = asyncio.run(main()) ``` ### Advanced Features #### Domain-Specific Modules ```python from biomcp import BioMCP from biomcp.variants import search_variants, get_variant from biomcp.trials import search_trials, get_trial from biomcp.articles import search_articles, fetch_articles # Direct module usage async def variant_analysis(): # Search pathogenic TP53 variants results = await search_variants( gene="TP53", significance="pathogenic", frequency_max=0.01, limit=20 ) # Get detailed annotations for variant in results: details = await get_variant(variant.id) print(f"{variant.id}: {details.clinical_significance}") ``` #### Output Formats ```python # JSON for programmatic use articles_json = await client.articles.search( genes=["KRAS"], format="json" ) # Markdown for display articles_md = await client.articles.search( genes=["KRAS"], format="markdown" ) ``` #### Error Handling ```python from biomcp.exceptions import BioMCPError, APIError, ValidationError try: results = await client.articles.search(genes=["INVALID_GENE"]) except ValidationError as e: print(f"Invalid input: {e}") except APIError as e: print(f"API error: {e}") except BioMCPError as e: print(f"General error: {e}") ``` ### Example: Building a Variant Report ```python import asyncio from biomcp import BioMCP async def generate_variant_report(gene: 
str, mutation: str): client = BioMCP() # 1. Get gene information gene_info = await client.genes.get(gene) # 2. Search for the specific variant variants = await client.variants.search( gene=gene, keywords=[mutation] ) # 3. Find relevant articles articles = await client.articles.search( genes=[gene], keywords=[mutation], limit=10 ) # 4. Look for clinical trials trials = await client.trials.search( conditions=["cancer"], other_terms=[f"{gene} {mutation}"], recruiting_status="RECRUITING" ) # 5. Generate report report = f""" # Variant Report: {gene} {mutation} ## Gene Information - **Official Name**: {gene_info.name} - **Summary**: {gene_info.summary} ## Variant Details Found {len(variants)} matching variants ## Literature ({len(articles)} articles) Recent publications discussing this variant... ## Clinical Trials ({len(trials)} active trials) Currently recruiting studies... """ return report # Generate report report = asyncio.run(generate_variant_report("BRAF", "V600E")) print(report) ``` ## MCP Client Integration The Model Context Protocol (MCP) provides a standardized way to integrate BioMCP with AI assistants and other tools. ### Understanding MCP MCP is a protocol for communication between: - **Clients**: AI assistants, IDEs, or custom applications - **Servers**: Tool providers like BioMCP ### Critical Requirement: Think Tool **IMPORTANT**: When using MCP, you MUST call the `think` tool first before any search or fetch operations. This ensures systematic analysis and optimal results. 
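
One way to make the think-first rule hard to forget is to enforce it on the client side. The following is a minimal sketch, not part of BioMCP: `ThinkFirstSession` and `EchoSession` are hypothetical names introduced for illustration, and the wrapper only assumes that the wrapped object exposes an async `call_tool(name, arguments=...)` method.

```python
import asyncio
from typing import Any


class ThinkFirstSession:
    """Hypothetical wrapper: rejects tool calls made before `think`."""

    def __init__(self, session: Any) -> None:
        self._session = session
        self._has_thought = False

    async def call_tool(self, name: str, arguments: dict[str, Any]) -> Any:
        # Any tool other than `think` is refused until `think` has succeeded once
        if name != "think" and not self._has_thought:
            raise RuntimeError(f"Call 'think' before '{name}'")
        result = await self._session.call_tool(name, arguments=arguments)
        if name == "think":
            self._has_thought = True
        return result


class EchoSession:
    """Hypothetical stand-in for a real MCP session, used only in this demo."""

    async def call_tool(self, name: str, arguments: dict[str, Any]) -> str:
        return f"called {name}"


async def demo() -> list[str]:
    guarded = ThinkFirstSession(EchoSession())
    outcomes: list[str] = []
    try:
        # Searching before thinking is rejected by the wrapper
        await guarded.call_tool("article_searcher", {"genes": ["BRAF"]})
    except RuntimeError as exc:
        outcomes.append(f"rejected: {exc}")
    await guarded.call_tool(
        "think",
        {"thought": "Plan the BRAF search", "thoughtNumber": 1, "nextThoughtNeeded": True},
    )
    # After `think`, the same search goes through
    outcomes.append(await guarded.call_tool("article_searcher", {"genes": ["BRAF"]}))
    return outcomes


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In a real client, the wrapper would be constructed around the session object in place of `EchoSession`; the guard is purely local and does not change what the server itself requires.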
### Basic MCP Integration ```python import asyncio import subprocess from mcp import ClientSession, StdioServerParameters from mcp.client.stdio import stdio_client async def run_biomcp_query(): # Start BioMCP server server_params = StdioServerParameters( command="uv", args=["run", "--with", "biomcp-python", "biomcp", "run"], env={"PYTHONUNBUFFERED": "1"} ) async with stdio_client(server_params) as (read, write): async with ClientSession(read, write) as session: # Initialize and discover tools await session.initialize() tools = await session.list_tools() # CRITICAL: Always think first! await session.call_tool( "think", arguments={ "thought": "Analyzing BRAF V600E in melanoma...", "thoughtNumber": 1, "nextThoughtNeeded": True } ) # Now search for articles result = await session.call_tool( "article_searcher", arguments={ "genes": ["BRAF"], "diseases": ["melanoma"], "keywords": ["V600E"] } ) return result # Run the query result = asyncio.run(run_biomcp_query()) ``` ### Available MCP Tools BioMCP provides 24 tools through MCP: #### Core Tools (Always Use First) - `think` - Sequential reasoning (MANDATORY first step) - `search` - Unified search across domains - `fetch` - Retrieve specific records #### Domain-Specific Tools - **Articles**: `article_searcher`, `article_getter` - **Trials**: `trial_searcher`, `trial_getter`, plus detail getters - **Variants**: `variant_searcher`, `variant_getter`, `alphagenome_predictor` - **BioThings**: `gene_getter`, `disease_getter`, `drug_getter` - **NCI**: Organization, intervention, biomarker, disease tools ### MCP Integration Patterns #### Pattern 1: AI Assistant Integration ```python # Example for integrating with an AI assistant class BioMCPAssistant: def __init__(self): self.session = None async def connect(self): # Initialize MCP connection server_params = StdioServerParameters( command="biomcp", args=["run"] ) # ... connection setup ... async def process_query(self, user_query: str): # 1. 
Always think first await self.think_about_query(user_query) # 2. Determine appropriate tools tools_needed = self.analyze_query(user_query) # 3. Execute tool calls results = [] for tool in tools_needed: result = await self.session.call_tool(tool.name, tool.args) results.append(result) # 4. Synthesize results return self.format_response(results) ``` #### Pattern 2: Custom Client Implementation ```python import json from typing import Any, Dict class BioMCPClient: """Custom client for specific biomedical workflows""" async def variant_to_trials_pipeline(self, variant_id: str): """Find trials for patients with specific variants""" # Step 1: Think and plan await self.think( "Planning variant-to-trials search pipeline...", thoughtNumber=1 ) # Step 2: Get variant details variant = await self.call_tool("variant_getter", { "variant_id": variant_id }) # Step 3: Extract gene and disease associations gene = variant.get("gene", {}).get("symbol") diseases = self.extract_diseases(variant) # Step 4: Search for relevant trials trials = await self.call_tool("trial_searcher", { "conditions": diseases, "other_terms": [f"{gene} mutation"], "recruiting_status": "RECRUITING" }) return { "variant": variant, "associated_trials": trials } ``` ### MCP Best Practices 1. **Always Think First** ```python # ✅ Correct await think(thought="Planning research...", thoughtNumber=1) await search(...) # ❌ Wrong - skips thinking await search(...) # Will produce poor results ``` 2. **Use Appropriate Tools** ```python # For broad searches across domains await call_tool("search", {"query": "gene:BRAF AND melanoma"}) # For specific domain searches await call_tool("article_searcher", {"genes": ["BRAF"]}) ``` 3. 
**Handle Tool Responses**

   ```python
   try:
       result = await session.call_tool("variant_getter", {
           "variant_id": "rs121913529"
       })
       # call_tool returns a result object: tool-level failures set isError,
       # and the payload is carried in result.content
       if result.isError:
           handle_error(result.content)
       else:
           process_variant(result.content)
   except Exception as e:
       logger.error(f"Tool call failed: {e}")
   ```

## Choosing the Right Integration

### Use Cursor IDE When:

- Doing interactive research during development
- Exploring biomedical data for new projects
- Need quick answers without writing code
- Want natural language queries

### Use Python SDK When:

- Building production applications
- Need type-safe interfaces
- Want direct function calls
- Require custom error handling

### Use MCP Client When:

- Integrating with AI assistants
- Building protocol-compliant tools
- Need standardized tool interfaces
- Want language-agnostic integration

## Integration Examples

### Example 1: Research Dashboard (Python SDK)

```python
from biomcp import BioMCP
import streamlit as st

async def create_dashboard():
    client = BioMCP()
    st.title("Biomedical Research Dashboard")

    # Gene input
    gene = st.text_input("Enter gene symbol:", "BRAF")

    if st.button("Search"):
        # Fetch comprehensive data
        col1, col2 = st.columns(2)

        with col1:
            st.subheader("Recent Articles")
            articles = await client.articles.search(genes=[gene], limit=5)
            for article in articles:
                st.write(f"- [{article.title}]({article.url})")

        with col2:
            st.subheader("Active Trials")
            trials = await client.trials.search(
                other_terms=[gene],
                recruiting_status="RECRUITING",
                limit=5
            )
            for trial in trials:
                st.write(f"- [{trial.nct_id}]({trial.url})")
```

### Example 2: Variant Analysis Pipeline (MCP)

```python
async def comprehensive_variant_analysis(session, hgvs: str):
    """Complete variant analysis workflow using MCP"""
    # Think about the analysis
    await session.call_tool("think", {
        "thought": f"Planning comprehensive analysis for {hgvs}",
        "thoughtNumber": 1
    })

    # Get variant details
    # NOTE: the dict-style access below assumes the result has been parsed
    # into a dict; parse it first if your client returns a raw result object
    variant = await session.call_tool("variant_getter", {
        "variant_id": hgvs
    })

    # Search related articles
    articles = await session.call_tool("article_searcher", {
        "variants": [hgvs],
        "limit": 10
    })

    # Find applicable trials
    gene = variant.get("gene", {}).get("symbol")
    trials = await session.call_tool("trial_searcher", {
        "other_terms": [f"{gene} mutation"],
        "recruiting_status": "RECRUITING"
    })

    # Predict functional effects if genomic coordinates available
    prediction = None  # stays None when coordinates are missing
    if variant.get("chrom") and variant.get("pos"):
        prediction = await session.call_tool("alphagenome_predictor", {
            "chromosome": f"chr{variant['chrom']}",
            "position": variant["pos"],
            "reference": variant["ref"],
            "alternate": variant["alt"]
        })

    return {
        "variant": variant,
        "articles": articles,
        "trials": trials,
        "prediction": prediction
    }
```

## Troubleshooting

### Common Issues

1. **"Think tool not called" errors**
   - Always call think before other operations
   - Include thoughtNumber parameter

2. **API rate limits**
   - Add delays between requests
   - Use API keys for higher limits

3. **Connection failures**
   - Check network connectivity
   - Verify server is running
   - Ensure correct installation

4. 
**Invalid gene symbols** - Use official HGNC symbols - Check [genenames.org](https://www.genenames.org) ### Debug Mode Enable debug logging: ```python # Python SDK import logging logging.basicConfig(level=logging.DEBUG) # MCP Client server_params = StdioServerParameters( command="biomcp", args=["run", "--log-level", "DEBUG"] ) ``` ## Next Steps - Explore [tool-specific documentation](02-mcp-tools-reference.md) - Review [API authentication](../getting-started/03-authentication-and-api-keys.md) - Check [example workflows](../how-to-guides/01-find-articles-and-cbioportal-data.md) for your use case ``` -------------------------------------------------------------------------------- /docs/user-guides/01-command-line-interface.md: -------------------------------------------------------------------------------- ```markdown # Command Line Interface Reference BioMCP provides a comprehensive command-line interface for biomedical data retrieval and analysis. This guide covers all available commands, options, and usage patterns. ## Installation ```bash # Using uv (recommended) uv tool install biomcp # Using pip pip install biomcp-python ``` ## Global Options These options work with all commands: ```bash biomcp [OPTIONS] COMMAND [ARGS]... 
Options: --version Show the version and exit --help Show help message and exit ``` ## Commands Overview | Domain | Commands | Purpose | | ---------------- | -------------------- | ----------------------------------------------- | | **article** | search, get | Search and retrieve biomedical literature | | **trial** | search, get | Find and fetch clinical trial information | | **variant** | search, get, predict | Analyze genetic variants and predict effects | | **gene** | get | Retrieve gene information and annotations | | **drug** | get | Look up drug/chemical information | | **disease** | get | Get disease definitions and synonyms | | **organization** | search | Search NCI organization database | | **intervention** | search | Find interventions (drugs, devices, procedures) | | **biomarker** | search | Search biomarkers used in trials | | **health** | check | Monitor API status and system health | ## Article Commands For practical examples and workflows, see [How to Find Articles and cBioPortal Data](../how-to-guides/01-find-articles-and-cbioportal-data.md). ### article search Search PubMed/PubTator3 for biomedical literature with automatic cBioPortal integration. 
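
When this command is driven from a script, it can help to assemble the argument list in code rather than by string concatenation. The sketch below is illustrative only: `build_article_search_args` is a hypothetical helper, not part of BioMCP, and it uses only the `--gene`, `--disease`, `--keyword`, `--limit`, and `--format` flags documented for this command. Keyword alternatives are joined with `|`, the OR syntax that `--keyword` supports.

```python
import shlex
from typing import Optional, Sequence


def build_article_search_args(
    gene: Optional[str] = None,
    disease: Optional[str] = None,
    keyword_variants: Sequence[str] = (),
    limit: Optional[int] = None,
    as_json: bool = False,
) -> list[str]:
    """Build an argv list for `biomcp article search` (hypothetical helper)."""
    args = ["biomcp", "article", "search"]
    if gene:
        args += ["--gene", gene]
    if disease:
        args += ["--disease", disease]
    if keyword_variants:
        # Alternatives are OR-ed by joining them with the pipe character
        args += ["--keyword", "|".join(keyword_variants)]
    if limit is not None:
        args += ["--limit", str(limit)]
    if as_json:
        args += ["--format", "json"]
    return args


if __name__ == "__main__":
    argv = build_article_search_args(
        gene="PTEN",
        keyword_variants=["R173", "Arg173", "p.R173"],
        as_json=True,
    )
    # shlex.join quotes the pipe-separated keyword so it is safe to paste into a shell
    print(shlex.join(argv))
```

Passing the list directly to `subprocess.run` avoids shell quoting altogether; `shlex.join` is only needed when the command is displayed or logged.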
```bash biomcp article search [OPTIONS] ``` **Options:** - `--gene, -g TEXT`: Gene symbol(s) to search for - `--variant, -v TEXT`: Genetic variant(s) to search for - `--disease, -d TEXT`: Disease/condition(s) to search for - `--chemical, -c TEXT`: Chemical/drug name(s) to search for - `--keyword, -k TEXT`: Keyword(s) to search for (supports OR with `|`) - `--pmid TEXT`: Specific PubMed ID(s) to retrieve - `--limit INTEGER`: Maximum results to return (default: 10) - `--no-preprints`: Exclude preprints from results - `--no-cbioportal`: Disable automatic cBioPortal integration - `--format [json|markdown]`: Output format (default: markdown) **Examples:** ```bash # Basic gene search with automatic cBioPortal data biomcp article search --gene BRAF --disease melanoma # Multiple filters biomcp article search --gene EGFR --disease "lung cancer" --chemical erlotinib # OR logic in keywords (find different variant notations) biomcp article search --gene PTEN --keyword "R173|Arg173|p.R173" # Exclude preprints biomcp article search --gene TP53 --no-preprints --limit 20 # JSON output for programmatic use biomcp article search --gene KRAS --format json > results.json ``` ### article get Retrieve a specific article by PubMed ID or DOI. ```bash biomcp article get IDENTIFIER ``` **Arguments:** - `IDENTIFIER`: PubMed ID (e.g., "38768446") or DOI (e.g., "10.1101/2024.01.20.23288905") **Examples:** ```bash # Get article by PubMed ID biomcp article get 38768446 # Get preprint by DOI biomcp article get "10.1101/2024.01.20.23288905" ``` ## Trial Commands For practical examples and workflows, see [How to Find Trials with NCI and BioThings](../how-to-guides/02-find-trials-with-nci-and-biothings.md). ### trial search Search ClinicalTrials.gov or NCI CTS API for clinical trials. 
```bash biomcp trial search [OPTIONS] ``` **Basic Options:** - `--condition TEXT`: Disease/condition to search - `--intervention TEXT`: Treatment/intervention to search - `--term TEXT`: General search terms - `--nct-id TEXT`: Specific NCT ID(s) - `--limit INTEGER`: Maximum results (default: 10) - `--source [ctgov|nci]`: Data source (default: ctgov) - `--api-key TEXT`: API key for NCI source **Study Characteristics:** - `--status TEXT`: Trial status (RECRUITING, ACTIVE_NOT_RECRUITING, etc.) - `--study-type TEXT`: Type of study (INTERVENTIONAL, OBSERVATIONAL) - `--phase TEXT`: Trial phase (EARLY_PHASE1, PHASE1, PHASE2, PHASE3, PHASE4) - `--study-purpose TEXT`: Primary purpose (TREATMENT, PREVENTION, etc.) - `--age-group TEXT`: Target age group (CHILD, ADULT, OLDER_ADULT) **Location Options:** - `--country TEXT`: Country name - `--state TEXT`: State/province - `--city TEXT`: City name - `--latitude FLOAT`: Geographic latitude - `--longitude FLOAT`: Geographic longitude - `--distance INTEGER`: Search radius in miles **Advanced Filters:** - `--start-date TEXT`: Trial start date (YYYY-MM-DD) - `--end-date TEXT`: Trial end date (YYYY-MM-DD) - `--intervention-type TEXT`: Type of intervention - `--sponsor-type TEXT`: Type of sponsor - `--is-fda-regulated`: FDA-regulated trials only - `--expanded-access`: Trials offering expanded access **Examples:** ```bash # Find recruiting melanoma trials biomcp trial search --condition melanoma --status RECRUITING # Search by location (requires coordinates) biomcp trial search --condition "lung cancer" \ --latitude 41.4993 --longitude -81.6944 --distance 50 # Use NCI source with advanced filters biomcp trial search --condition melanoma --source nci \ --required-mutations "BRAF V600E" --allow-brain-mets true \ --api-key YOUR_KEY # Multiple filters biomcp trial search --condition "breast cancer" \ --intervention "CDK4/6 inhibitor" --phase PHASE3 \ --status RECRUITING --country "United States" ``` ### trial get Retrieve detailed information 
about a specific clinical trial. ```bash biomcp trial get NCT_ID [OPTIONS] ``` **Arguments:** - `NCT_ID`: Clinical trial identifier (e.g., NCT03006926) **Options:** - `--include TEXT`: Specific sections to include (Protocol, Locations, References, Outcomes) - `--source [ctgov|nci]`: Data source (default: ctgov) - `--api-key TEXT`: API key for NCI source **Examples:** ```bash # Get basic trial information biomcp trial get NCT03006926 # Get specific sections biomcp trial get NCT03006926 --include Protocol --include Locations # Use NCI source biomcp trial get NCT04280705 --source nci --api-key YOUR_KEY ``` ## Variant Commands For practical examples and workflows, see: - [Get Comprehensive Variant Annotations](../how-to-guides/03-get-comprehensive-variant-annotations.md) - [Predict Variant Effects with AlphaGenome](../how-to-guides/04-predict-variant-effects-with-alphagenome.md) ### variant search Search MyVariant.info for genetic variant annotations. ```bash biomcp variant search [OPTIONS] ``` **Options:** - `--gene TEXT`: Gene symbol - `--hgvs TEXT`: HGVS notation - `--rsid TEXT`: dbSNP rsID - `--chromosome TEXT`: Chromosome - `--start INTEGER`: Genomic start position - `--end INTEGER`: Genomic end position - `--assembly [hg19|hg38]`: Genome assembly (default: hg38) - `--significance TEXT`: Clinical significance - `--min-frequency FLOAT`: Minimum allele frequency - `--max-frequency FLOAT`: Maximum allele frequency - `--min-cadd FLOAT`: Minimum CADD score - `--polyphen TEXT`: PolyPhen prediction - `--sift TEXT`: SIFT prediction - `--sources TEXT`: Data sources to include - `--limit INTEGER`: Maximum results (default: 10) - `--no-cbioportal`: Disable cBioPortal integration **Examples:** ```bash # Search pathogenic BRCA1 variants biomcp variant search --gene BRCA1 --significance pathogenic # Search by HGVS notation biomcp variant search --hgvs "NM_007294.4:c.5266dupC" # Filter by frequency and prediction scores biomcp variant search --gene TP53 --max-frequency 0.01 \ 
--min-cadd 20 --polyphen possibly_damaging # Search genomic region biomcp variant search --chromosome 7 --start 140753336 --end 140753337 ``` ### variant get Retrieve detailed information about a specific variant. ```bash biomcp variant get VARIANT_ID [OPTIONS] ``` **Arguments:** - `VARIANT_ID`: Variant identifier (HGVS, rsID, or genomic) **Options:** - `--json, -j`: Output in JSON format - `--include-external / --no-external`: Include/exclude external annotations (default: include) - `--assembly TEXT`: Genome assembly (hg19 or hg38, default: hg19) **Examples:** ```bash # Get variant by HGVS (defaults to hg19) biomcp variant get "NM_007294.4:c.5266dupC" # Get variant by rsID biomcp variant get rs121913529 # Specify hg38 assembly biomcp variant get rs113488022 --assembly hg38 # JSON output with hg38 biomcp variant get rs113488022 --json --assembly hg38 # Without external annotations biomcp variant get rs113488022 --no-external # Get variant by genomic coordinates biomcp variant get "chr17:g.43082434G>A" ``` ### variant predict Predict variant effects using Google DeepMind's AlphaGenome (requires API key). 
```bash
biomcp variant predict CHROMOSOME POSITION REFERENCE ALTERNATE [OPTIONS]
```

**Arguments:**

- `CHROMOSOME`: Chromosome (e.g., chr7)
- `POSITION`: Genomic position
- `REFERENCE`: Reference allele
- `ALTERNATE`: Alternate allele

**Options:**

- `--tissue TEXT`: Tissue type(s) using UBERON ontology
- `--interval INTEGER`: Analysis window size (default: 20000)
- `--api-key TEXT`: AlphaGenome API key

**Examples:**

```bash
# Basic prediction (requires ALPHAGENOME_API_KEY env var)
biomcp variant predict chr7 140753336 A T

# Tissue-specific prediction
biomcp variant predict chr7 140753336 A T \
  --tissue UBERON:0002367  # prostate gland

# With per-request API key
biomcp variant predict chr7 140753336 A T --api-key YOUR_KEY
```

## Gene/Drug/Disease Commands

For practical examples using BioThings integration, see [How to Find Trials with NCI and BioThings](../how-to-guides/02-find-trials-with-nci-and-biothings.md#biothings-integration-for-enhanced-search).

### gene get

Retrieve gene information from MyGene.info.

```bash
biomcp gene get GENE_NAME
```

**Examples:**

```bash
# Get gene information
biomcp gene get TP53
biomcp gene get BRAF
```

### drug get

Retrieve drug/chemical information from MyChem.info.

```bash
biomcp drug get DRUG_NAME
```

**Examples:**

```bash
# Get drug information
biomcp drug get imatinib
biomcp drug get pembrolizumab
```

### disease get

Retrieve disease information from MyDisease.info.

```bash
biomcp disease get DISEASE_NAME
```

**Examples:**

```bash
# Get disease information
biomcp disease get melanoma
biomcp disease get "non-small cell lung cancer"
```

## NCI-Specific Commands

These commands require an NCI API key. 
For setup instructions and usage examples, see: - [Authentication and API Keys](../getting-started/03-authentication-and-api-keys.md#nci-clinical-trials-api) - [How to Find Trials with NCI and BioThings](../how-to-guides/02-find-trials-with-nci-and-biothings.md#using-nci-api-advanced-features) ### organization search Search NCI's organization database. ```bash biomcp organization search [OPTIONS] ``` **Options:** - `--name TEXT`: Organization name - `--city TEXT`: City location - `--state TEXT`: State/province - `--country TEXT`: Country - `--org-type TEXT`: Organization type - `--api-key TEXT`: NCI API key **Example:** ```bash biomcp organization search --name "MD Anderson" \ --city Houston --state TX --api-key YOUR_KEY ``` ### intervention search Search NCI's intervention database. ```bash biomcp intervention search [OPTIONS] ``` **Options:** - `--name TEXT`: Intervention name - `--intervention-type TEXT`: Type (Drug, Device, Procedure, etc.) - `--api-key TEXT`: NCI API key **Example:** ```bash biomcp intervention search --name pembrolizumab \ --intervention-type Drug --api-key YOUR_KEY ``` ### biomarker search Search biomarkers used in clinical trials. ```bash biomcp biomarker search [OPTIONS] ``` **Options:** - `--gene TEXT`: Gene symbol - `--biomarker-type TEXT`: Type of biomarker - `--api-key TEXT`: NCI API key **Example:** ```bash biomcp biomarker search --gene EGFR \ --biomarker-type mutation --api-key YOUR_KEY ``` ## Health Command For monitoring API status before bulk operations, see the [Performance Optimizations Guide](../developer-guides/07-performance-optimizations.md). ### health check Monitor API endpoints and system health. 
```bash biomcp health check [OPTIONS] ``` **Options:** - `--apis-only`: Check only API endpoints - `--system-only`: Check only system resources - `--verbose, -v`: Show detailed information **Examples:** ```bash # Full health check biomcp health check # Check APIs only biomcp health check --apis-only # Detailed system check biomcp health check --system-only --verbose ``` ## Output Formats Most commands support both human-readable markdown and machine-readable JSON output: ```bash # Default markdown output biomcp article search --gene BRAF # JSON for programmatic use biomcp article search --gene BRAF --format json # Save to file biomcp trial search --condition melanoma --format json > trials.json ``` ## Environment Variables Configure default behavior with environment variables: ```bash # API Keys export NCI_API_KEY="your-nci-key" export ALPHAGENOME_API_KEY="your-alphagenome-key" export CBIO_TOKEN="your-cbioportal-token" # Logging export BIOMCP_LOG_LEVEL="DEBUG" export BIOMCP_CACHE_DIR="/path/to/cache" ``` ## Getting Help Every command has a built-in help flag: ```bash # General help biomcp --help # Command-specific help biomcp article search --help biomcp trial get --help biomcp variant predict --help ``` ## Tips and Best Practices 1. **Use Official Gene Symbols**: Always use HGNC-approved gene symbols (e.g., "TP53" not "p53") 2. **Combine Filters**: Most commands support multiple filters for precise results: ```bash biomcp article search --gene EGFR --disease "lung cancer" \ --chemical erlotinib --keyword "resistance" ``` 3. **Handle Large Results**: Use `--limit` and `--format json` for processing: ```bash biomcp article search --gene BRCA1 --limit 100 --format json | \ jq '.results[] | {pmid: .pmid, title: .title}' ``` 4. **Location Searches**: Always provide both latitude and longitude: ```bash # Find trials near Boston biomcp trial search --condition cancer \ --latitude 42.3601 --longitude -71.0589 --distance 25 ``` 5. 
**Use OR Logic**: The pipe character enables flexible searches: ```bash # Find articles mentioning any form of a variant biomcp article search --gene BRAF --keyword "V600E|p.V600E|c.1799T>A" ``` 6. **Check API Health**: Before bulk operations, verify API status: ```bash biomcp health check --apis-only ``` ## Next Steps - Set up [API keys](../getting-started/03-authentication-and-api-keys.md) for enhanced features - Explore [MCP tools](02-mcp-tools-reference.md) for AI integration - Read [how-to guides](../how-to-guides/01-find-articles-and-cbioportal-data.md) for complex workflows ``` -------------------------------------------------------------------------------- /CHANGELOG.md: -------------------------------------------------------------------------------- ```markdown # Changelog All notable changes to the BioMCP project will be documented in this file. The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). 
## [0.6.7] - 2025-08-13 ### Fixed - **MCP Resource Encoding** - Fixed character encoding error when loading resources on Windows (Issue #63): - Added explicit UTF-8 encoding for reading `instructions.md` and `researcher.md` resource files - Resolves "'charmap' codec can't decode byte 0x8f" error on Windows systems - Ensures cross-platform compatibility for resource loading ### Changed - **Documentation** - Clarified sequential thinking integration: - Updated `researcher-persona-resource.md` to remove references to external `sequential-thinking` MCP server - Clarified that the `think` tool is built into BioMCP (no external dependencies needed) - Updated configuration examples to show only BioMCP server is required ## [0.6.6] - 2025-08-08 ### Fixed - **Windows Compatibility** - Fixed fcntl module import error on Windows (Issue #57): - Added conditional import with try/except for fcntl module - File locking now only applies on Unix systems - Windows users get full functionality without file locking - Refactored cache functions to reduce code complexity ### Changed - **Documentation** - Updated Docker instructions in README (Issue #58): - Added `docker build -t biomcp:latest .` command before `docker run` - Clarified that biomcp:latest is a local build, not pulled from Docker Hub ## [0.6.5] - 2025-08-07 ### Added - **OpenFDA Integration** - Comprehensive FDA regulatory data access: - **12 New MCP Tools** for adverse events, drug labels, device events, drug approvals, recalls, and shortages - Each domain includes searcher and getter tools for flexible data retrieval - Unified search support with `domain="fda_*"` parameters - Enhanced CLI commands for all OpenFDA endpoints - Smart caching and rate limiting for API efficiency - Comprehensive error handling and data validation ### Changed - Improved API key support across all OpenFDA tools - Enhanced documentation for FDA data integration ## [0.6.4] - 2025-08-06 ### Changed - **Documentation Restructure** - Major 
documentation improvements: - Simplified navigation structure for better user experience - Fixed code block formatting and layout issues - Removed unnecessary sections and redundant content - Improved overall documentation readability and organization - Enhanced mobile responsiveness ## [0.6.3] - 2025-08-05 ### Added - **NCI Clinical Trials Search API Integration** - Enhanced cancer trial search capabilities: - Dual source support for trial search/getter tools (ClinicalTrials.gov + NCI) - NCI API key handling via `NCI_API_KEY` environment variable or parameter - Advanced trial filters: biomarkers, prior therapy, brain metastases acceptance - **6 New MCP Tools** for NCI-specific searches: - `nci_organization_searcher` / `nci_organization_getter`: Cancer centers, hospitals, research institutions - `nci_intervention_searcher` / `nci_intervention_getter`: Drugs, devices, procedures, biologicals - `nci_biomarker_searcher`: Trial eligibility biomarkers (reference genes, branches) - `nci_disease_searcher`: NCI's controlled vocabulary of cancer conditions - **OR Query Support**: All NCI endpoints support OR queries (e.g., "PD-L1 OR CD274") - Real-time access to NCI's curated cancer trials database - Automatic cBioPortal integration for gene searches - Proper NCI parameter mapping (org_city, org_state_or_province, etc.) - Comprehensive error handling for Elasticsearch limits ### Changed - Enhanced unified search router to properly handle NCI domains - Trial search/getter tools now accept `source` parameter ("clinicaltrials" or "nci") - Improved domain-specific search logic for query+domain combinations ## [0.6.2] - 2025-08-05 Note: Initial NCI integration release - see v0.6.3 for the full implementation. 
## [0.6.1] - 2025-08-03 ### Fixed - **Dependency Management** - Fixed alphagenome dependency to enable PyPI publishing - Made alphagenome an optional dependency - Resolved packaging conflicts for distribution ## [0.6.0] - 2025-08-02 ### Added - **Streamable HTTP Transport Protocol** - Modern MCP transport implementation: - Single `/mcp` endpoint for all communication - Session management with persistent session IDs - Event resumption support for reliability - On-demand streaming for long operations - Configurable HTTP server modes (STDIO, HTTP, Worker) - Better scalability for cloud deployments - Full MCP specification compliance (2025-03-26) ### Changed - Improved Cloudflare Worker integration - Enhanced transport layer with comprehensive testing - Updated deployment configurations for HTTP mode ## [0.5.0] - 2025-07-31 ### Added - **BioThings API Integration** - Real-time biomedical data access: - **MyGene.info**: Gene annotations, summaries, aliases, and database links - **MyChem.info**: Drug/chemical information, identifiers, mechanisms of action - **MyDisease.info**: Disease definitions, synonyms, MONDO/DOID mappings - **3 New MCP Tools**: `gene_getter`, `drug_getter`, `disease_getter` - Automatic synonym expansion for enhanced trial searches - Batch optimization for multiple gene lookups - Live data fetching ensures current information ### Changed - Enhanced unified search capabilities with BioThings data - Expanded query language support for gene, drug, and disease queries - Improved trial searches with automatic disease synonym expansion ## [0.4.7] - 2025-07-30 ### Added - **BioThings Integration** for real-time biomedical data access: - **New MCP Tools** (3 tools added, total now 17): - `gene_getter`: Query MyGene.info for gene information (symbols, names, summaries) - `drug_getter`: Query MyChem.info for drug/chemical data (formulas, indications, mechanisms) - `disease_getter`: Query MyDisease.info for disease information (definitions, synonyms, 
ontologies) - **Unified Search/Fetch Enhancement**: - Added `gene`, `drug`, `disease` as new searchable domains alongside article, trial, variant - Integrated into unified search syntax: `search(domain="gene", keywords=["BRAF"])` - Query language support: `gene:BRAF`, `drug:pembrolizumab`, `disease:melanoma` - Full fetch support: `fetch(domain="drug", id="DB00945")` - **Clinical Trial Enhancement**: - Automatic disease synonym expansion for trial searches - Real-time synonym lookup from MyDisease.info - Example: searching for "GIST" automatically includes "gastrointestinal stromal tumor" - **Smart Caching & Performance**: - Batch operations for multiple gene/drug lookups - Intelligent caching with TTL (gene: 24h, drug: 48h, disease: 72h) - Rate limiting to respect API guidelines ### Changed - Trial search now expands disease terms by default (disable with `expand_synonyms=False`) - Enhanced error handling for BioThings API responses - Improved network reliability with automatic retries ## [0.4.6] - 2025-07-09 ### Added - MkDocs documentation deployment ## [0.4.5] - 2025-07-09 ### Added - Unified search and fetch tools following OpenAI MCP guidelines - Additional variant sources (TCGA/GDC, 1000 Genomes) enabled by default in fetch operations - Additional article sources (bioRxiv, medRxiv, Europe PMC) enabled by default in search operations ### Changed - Consolidated 10 separate MCP tools into 2 unified tools (search and fetch) - Updated response formats to comply with OpenAI MCP specifications ### Fixed - OpenAI MCP compliance issues to enable integration ## [0.4.4] - 2025-07-08 ### Added - **Performance Optimizations**: - Connection pooling with event loop lifecycle management (30% latency reduction) - Parallel test execution with pytest-xdist (5x faster test runs) - Request batching for cBioPortal API calls (80% fewer API calls) - Smart caching with LRU eviction and fast hash keys (10x faster cache operations) - Major performance improvements achieving ~3x faster 
test execution (120s → 42s) ### Fixed - Non-critical ASGI errors suppressed - Performance issues in article_searcher ## [0.4.3] - 2025-07-08 ### Added - Complete HTTP centralization and improved code quality - Comprehensive constants module for better maintainability - Domain-specific handlers for result formatting - Parameter parser for robust input validation - Custom exception hierarchy for better error handling ### Changed - Refactored domain handlers to use static methods for better performance - Enhanced type safety throughout the codebase - Refactored complex functions to meet code quality standards ### Fixed - Type errors in router.py for full mypy compliance - Complex functions exceeding cyclomatic complexity thresholds ## [0.4.2] - 2025-07-07 ### Added - Europe PMC DOI support for article fetching - Pagination support for Europe PMC searches - OR logic support for variant notation searches (e.g., R173 vs Arg173 vs p.R173) ### Changed - Enhanced variant notation search capabilities ## [0.4.1] - 2025-07-03 ### Added - AlphaGenome as an optional dependency to predict variant effects on gene regulation - Per-request API key support for AlphaGenome integration - AI predictions to complement existing database lookups ### Security - Comprehensive sanitization in Cloudflare Worker to prevent sensitive data logging - Secure usage in hosted environments where users provide their own keys ## [0.4.0] - 2025-06-27 ### Added - **cBioPortal Integration** for article searches: - Automatic gene-level mutation summaries when searching with gene parameters - Mutation-specific search capabilities (e.g., BRAF V600E, SRSF2 F57\*) - Dynamic cancer type resolution using cBioPortal API - Smart caching and rate limiting for optimal performance ## [0.3.3] - 2025-06-20 ### Changed - Release workflow updates ## [0.3.2] - 2025-06-20 ### Changed - Release workflow updates ## [0.3.1] - 2025-06-20 ### Fixed - Build and release process improvements ## [0.3.0] - 2025-06-20 ### Added - 
Expanded search capabilities - Integration tests for MCP server functionality - Utility modules for gene validation, mutation filtering, and request caching ## [0.2.1] - 2025-06-19 ### Added - Remote MCP policies ## [0.2.0] - 2025-06-17 ### Added - Sequential thinking tool for systematic problem-solving - Session-based thinking to replace global state - Extracted router handlers to reduce complexity ### Changed - Replaced global state in thinking module with session management ### Removed - Global state from sequential thinking module ### Fixed - Race conditions in sequential thinking with concurrent usage ## [0.1.11] - 2025-06-12 ### Added - Advanced eligibility criteria filters to clinical trial search ## [0.1.10] - 2025-05-21 ### Added - OAuth support on the Cloudflare worker via Stytch ## [0.1.9] - 2025-05-17 ### Fixed - Refactor: Bump minimum Python version to 3.10 ## [0.1.8] - 2025-05-14 ### Fixed - Article searcher fixes ## [0.1.7] - 2025-05-07 ### Added - Remote OAuth support ## [0.1.6] - 2025-05-05 ### Added - Updates to handle cursor integration ## [0.1.5] - 2025-05-01 ### Added - Updates to smithery yaml to account for object types needed for remote calls - Documentation and Lzyank updates ## [0.1.3] - 2025-05-01 ### Added - Health check functionality to assist with API call issues - System resources and network & environment information gathering - Remote MCP capability via Cloudflare using SSE ## [0.1.2] - 2025-04-18 ### Added - Researcher persona and BioMCP v0.1.2 release - Deep Researcher Persona blog post - Researcher persona video demo ## [0.1.1] - 2025-04-14 ### Added - Claude Desktop and MCP Inspector tutorials - Improved Claude Desktop Tutorial for BioMCP - Troubleshooting guide and blog post ### Fixed - Log tool names as comma separated string - Server hanging issues - Error responses in variant count check ## [0.1.0] - 2025-04-08 ### Added - Initial release of BioMCP - PubMed/PubTator3 article search integration - ClinicalTrials.gov trial 
search integration - MyVariant.info variant search integration - CLI interface for direct usage - MCP server for AI assistant integration - Cloudflare Worker support for remote deployment - Comprehensive test suite with pytest-bdd - GenomOncology introduction - Blog post on AI-assisted clinical trial search - MacOS troubleshooting guide ### Security - API keys properly externalized - Input validation using Pydantic models - Safe string handling in all API calls [Unreleased]: https://github.com/genomoncology/biomcp/compare/v0.6.6...HEAD [0.6.6]: https://github.com/genomoncology/biomcp/releases/tag/v0.6.6 [0.6.5]: https://github.com/genomoncology/biomcp/releases/tag/v0.6.5 [0.6.4]: https://github.com/genomoncology/biomcp/releases/tag/v0.6.4 [0.6.3]: https://github.com/genomoncology/biomcp/releases/tag/v0.6.3 [0.6.2]: https://github.com/genomoncology/biomcp/releases/tag/v0.6.2 [0.6.1]: https://github.com/genomoncology/biomcp/releases/tag/v0.6.1 [0.6.0]: https://github.com/genomoncology/biomcp/releases/tag/v0.6.0 [0.5.0]: https://github.com/genomoncology/biomcp/releases/tag/v0.5.0 [0.4.7]: https://github.com/genomoncology/biomcp/releases/tag/v0.4.7 [0.4.6]: https://github.com/genomoncology/biomcp/releases/tag/v0.4.6 [0.4.5]: https://github.com/genomoncology/biomcp/releases/tag/v0.4.5 [0.4.4]: https://github.com/genomoncology/biomcp/releases/tag/v0.4.4 [0.4.3]: https://github.com/genomoncology/biomcp/releases/tag/v0.4.3 [0.4.2]: https://github.com/genomoncology/biomcp/releases/tag/v0.4.2 [0.4.1]: https://github.com/genomoncology/biomcp/releases/tag/v0.4.1 [0.4.0]: https://github.com/genomoncology/biomcp/releases/tag/v0.4.0 [0.3.3]: https://github.com/genomoncology/biomcp/releases/tag/v0.3.3 [0.3.2]: https://github.com/genomoncology/biomcp/releases/tag/v0.3.2 [0.3.1]: https://github.com/genomoncology/biomcp/releases/tag/v0.3.1 [0.3.0]: https://github.com/genomoncology/biomcp/releases/tag/v0.3.0 [0.2.1]: https://github.com/genomoncology/biomcp/releases/tag/v0.2.1 [0.2.0]: 
https://github.com/genomoncology/biomcp/releases/tag/v0.2.0 [0.1.11]: https://github.com/genomoncology/biomcp/releases/tag/v0.1.11 [0.1.10]: https://github.com/genomoncology/biomcp/releases/tag/v0.1.10 [0.1.9]: https://github.com/genomoncology/biomcp/releases/tag/v0.1.9 [0.1.8]: https://github.com/genomoncology/biomcp/releases/tag/v0.1.8 [0.1.7]: https://github.com/genomoncology/biomcp/releases/tag/v0.1.7 [0.1.6]: https://github.com/genomoncology/biomcp/releases/tag/v0.1.6 [0.1.5]: https://github.com/genomoncology/biomcp/releases/tag/v0.1.5 [0.1.3]: https://github.com/genomoncology/biomcp/releases/tag/v0.1.3 [0.1.2]: https://github.com/genomoncology/biomcp/releases/tag/v0.1.2 [0.1.1]: https://github.com/genomoncology/biomcp/releases/tag/v0.1.1 [0.1.0]: https://github.com/genomoncology/biomcp/releases/tag/v0.1.0 ``` -------------------------------------------------------------------------------- /docs/developer-guides/02-contributing-and-testing.md: -------------------------------------------------------------------------------- ```markdown # Contributing and Testing Guide This guide covers how to contribute to BioMCP and run the comprehensive test suite. ## Getting Started ### Prerequisites - Python 3.10 or higher - [uv](https://docs.astral.sh/uv/) package manager - Git - Node.js (for MCP Inspector) ### Initial Setup 1. **Fork and clone the repository:** ```bash git clone https://github.com/YOUR_USERNAME/biomcp.git cd biomcp ``` 2. **Install dependencies and setup:** ```bash # Recommended: Use make for complete setup make install # Alternative: Manual setup uv sync --all-extras uv run pre-commit install ``` 3. **Verify installation:** ```bash # Run server biomcp run # Run tests make test-offline ``` ## Development Workflow ### 1. Create Feature Branch ```bash git checkout -b feature/your-feature-name ``` ### 2. 
Make Changes Follow these principles: - **Keep changes minimal and focused** - **Follow existing code patterns** - **Add tests for new functionality** - **Update documentation as needed** ### 3. Quality Checks **MANDATORY: Run these before considering work complete:** ```bash # Step 1: Code quality checks make check # This runs: # - ruff check (linting) # - ruff format (code formatting) # - mypy (type checking) # - pre-commit hooks # - deptry (dependency analysis) ``` ### 4. Run Tests ```bash # Step 2: Run appropriate test suite make test # Full suite (requires network) # OR make test-offline # Unit tests only (no network) ``` **Both quality checks and tests MUST pass before submitting changes.** ## Testing Strategy ### Test Categories #### Unit Tests - Fast, reliable tests without external dependencies - Mock all external API calls - Always run in CI/CD ```python # Example unit test @patch('httpx.AsyncClient.get') async def test_article_search(mock_get): mock_get.return_value.json.return_value = {"results": [...]} result = await article_searcher(genes=["BRAF"]) assert len(result) > 0 ``` #### Integration Tests - Test real API interactions - May fail due to network/API issues - Run separately in CI with `continue-on-error` ```python # Example integration test @pytest.mark.integration async def test_real_pubmed_search(): result = await article_searcher(genes=["TP53"], limit=5) assert len(result) == 5 assert all("TP53" in r.text for r in result) ``` ### Running Tests #### Command Options ```bash # Run all tests make test uv run python -m pytest # Run only unit tests (fast, offline) make test-offline uv run python -m pytest -m "not integration" # Run only integration tests uv run python -m pytest -m "integration" # Run specific test file uv run python -m pytest tests/tdd/test_article_search.py # Run with coverage make cov uv run python -m pytest --cov --cov-report=html # Run tests verbosely uv run python -m pytest -v # Run tests and stop on first failure uv run python 
-m pytest -x ``` #### Test Discovery Tests are organized in: - `tests/tdd/` - Unit and integration tests - `tests/bdd/` - Behavior-driven development tests - `tests/data/` - Test fixtures and sample data ### Writing Tests #### Test Structure ```python import pytest from unittest.mock import patch, MagicMock from biomcp.articles import article_searcher class TestArticleSearch: """Test article search functionality""" @pytest.fixture def mock_response(self): """Sample API response""" return { "results": [ {"pmid": "12345", "title": "BRAF in melanoma"} ] } @patch('httpx.AsyncClient.get') async def test_basic_search(self, mock_get, mock_response): """Test basic article search""" # Setup: httpx's Response.json() is synchronous, so the response is a MagicMock mock_get.return_value = MagicMock() mock_get.return_value.json.return_value = mock_response # Execute result = await article_searcher(genes=["BRAF"]) # Assert assert len(result) == 1 assert "BRAF" in result[0].title ``` #### Async Testing ```python import pytest from httpx import AsyncClient @pytest.mark.asyncio async def test_async_function(): """Test async functionality""" result = await some_async_function() assert result is not None # Or use pytest-asyncio fixtures @pytest.fixture async def async_client(): async with AsyncClient() as client: yield client ``` #### Mocking External APIs ```python from unittest.mock import patch @patch('biomcp.integrations.pubmed.search') def test_with_mock(mock_search): # Configure mock mock_search.return_value = [{ "pmid": "12345", "title": "Test Article" }] # Test code that uses the mocked function result = search_articles("BRAF") # Verify mock was called correctly mock_search.assert_called_once_with("BRAF") ``` ## MCP Inspector Testing The MCP Inspector provides an interactive way to test MCP tools. ### Setup ```bash # Install inspector npm install -g @modelcontextprotocol/inspector # Run BioMCP with inspector make inspector # OR npx @modelcontextprotocol/inspector uv run --with biomcp-python biomcp run ``` ### Testing Tools 1.
**Connect to server** in the inspector UI 2. **View available tools** in the tools panel 3. **Test individual tools** with sample inputs #### Example Tool Tests ```javascript // Test article search { "tool": "article_searcher", "arguments": { "genes": ["BRAF"], "diseases": ["melanoma"], "limit": 5 } } // Test trial search { "tool": "trial_searcher", "arguments": { "conditions": ["lung cancer"], "recruiting_status": "OPEN", "limit": 10 } } // Test think tool (ALWAYS first!) { "tool": "think", "arguments": { "thought": "Planning to search for BRAF mutations", "thoughtNumber": 1, "nextThoughtNeeded": true } } ``` ### Debugging with Inspector 1. **Check request/response**: View raw MCP messages 2. **Verify parameters**: Ensure correct argument format 3. **Test error handling**: Try invalid inputs 4. **Monitor performance**: Check response times ## Code Style and Standards ### Python Style - **Formatter**: ruff (line length: 79) - **Type hints**: Required for all functions - **Docstrings**: Google style for all public functions ```python def search_articles( genes: list[str], limit: int = 10 ) -> list[Article]: """Search for articles by gene names. Args: genes: List of gene symbols to search limit: Maximum number of results Returns: List of Article objects Raises: ValueError: If genes list is empty """ if not genes: raise ValueError("Genes list cannot be empty") # Implementation... ``` ### Pre-commit Hooks Automatically run on commit: - ruff formatting - ruff linting - mypy type checking - File checks (YAML, TOML, merge conflicts) Manual run: ```bash uv run pre-commit run --all-files ``` ## Continuous Integration ### GitHub Actions Workflow The CI pipeline runs: 1. **Linting and Formatting** 2. **Type Checking** 3. **Unit Tests** (required to pass) 4. **Integration Tests** (allowed to fail) 5. 
**Coverage Report** ### CI Configuration ```yaml # .github/workflows/ci.yml structure jobs: test: strategy: matrix: python-version: ["3.10", "3.11", "3.12"] steps: - uses: actions/checkout@v4 - uses: astral-sh/setup-uv@v2 - run: make check - run: make test-offline ``` ## Debugging and Troubleshooting ### Common Issues #### Test Failures ```bash # Run a failed test with more detail uv run python -m pytest -vvs tests/path/to/test.py::test_name # Debug with print statements uv run python -m pytest -s # Don't capture stdout # Use debugger uv run python -m pytest --pdb # Drop to debugger on failure ``` #### Integration Test Issues Common causes: - **Rate limiting**: Add delays or use mocks - **API changes**: Update test expectations - **Network issues**: Check connectivity - **API keys**: Ensure valid keys for NCI tests ## Integration Testing ### Overview BioMCP includes integration tests that make real API calls to external services. These tests verify that our integrations work correctly with live data but can be affected by API availability, rate limits, and data changes. ### Running Integration Tests ```bash # Run all tests including integration make test # Run only integration tests uv run python -m pytest -m integration # Skip integration tests uv run python -m pytest -m "not integration" ``` ### Handling Flaky Tests Integration tests may fail or skip for various reasons: 1. **API Unavailability** - **Symptom**: Tests skip with "API returned no data" message - **Cause**: The external service is down or experiencing issues - **Action**: Re-run tests later or check service status 2. **Rate Limiting** - **Symptom**: Multiple test failures after initial successes - **Cause**: Too many requests in a short time - **Action**: Run tests with delays between them or use API tokens 3.
**Data Changes** - **Symptom**: Assertions about specific data fail - **Cause**: The external data has changed (e.g., new mutations discovered) - **Action**: Update tests to use more flexible assertions ### Integration Test Design Principles #### 1. Graceful Skipping Tests should skip rather than fail when: - API returns no data - Service is unavailable - Rate limits are hit ```python if not data or data.total_count == 0: pytest.skip("API returned no data - possible service issue") ``` #### 2. Flexible Assertions Avoid assertions on specific data values that might change: ❌ **Bad**: Expecting exact mutation counts ```python assert summary.total_mutations == 1234 ``` ✅ **Good**: Checking data exists and has reasonable structure ```python assert summary.total_mutations > 0 assert hasattr(summary, 'hotspots') ``` #### 3. Retry Logic For critical tests, implement a retry with a delay between attempts: ```python import asyncio async def fetch_with_retry(client, resource, max_attempts=2, delay=1.0): for attempt in range(max_attempts): result = await client.get(resource) if result and result.data: return result if attempt < max_attempts - 1: await asyncio.sleep(delay) return None ``` #### 4.
Cache Management Clear caches before tests to ensure fresh data: ```python from biomcp.utils.request_cache import clear_cache await clear_cache() ``` ### Common Integration Test Patterns #### Testing Search Functionality ```python @pytest.mark.integration async def test_gene_search(self): client = SearchClient() results = await client.search("BRAF") # Flexible assertions assert results is not None if results.count > 0: assert results.items[0].gene_symbol == "BRAF" else: pytest.skip("No results returned - API may be unavailable") ``` #### Testing Data Retrieval ```python @pytest.mark.integration async def test_variant_details(self): client = VariantClient() variant = await client.get_variant("rs121913529") if not variant: pytest.skip("Variant not found - may have been removed from database") # Check structure, not specific values assert hasattr(variant, 'chromosome') assert hasattr(variant, 'position') ``` ### Debugging Failed Integration Tests 1. **Enable Debug Logging** ```bash BIOMCP_LOG_LEVEL=DEBUG pytest tests/integration/test_failing.py -v ``` 2. **Check API Status** - PubMed: https://www.ncbi.nlm.nih.gov/home/about/website-updates/ - ClinicalTrials.gov: https://clinicaltrials.gov/about/announcements - cBioPortal: https://www.cbioportal.org/ 3. **Inspect Response Data** ```python if not expected_data: print(f"Unexpected response: {response}") pytest.skip("Data structure changed") ``` ### Environment Variables for Testing #### API Tokens Some services provide higher rate limits with authentication: ```bash export CBIO_TOKEN="your-token-here" export PUBMED_API_KEY="your-key-here" ``` #### Offline Mode Test offline behavior: ```bash export BIOMCP_OFFLINE=true pytest tests/ ``` #### Custom Timeouts Adjust timeouts for slow connections: ```bash export BIOMCP_REQUEST_TIMEOUT=60 pytest tests/integration/ ``` ### CI/CD Considerations 1. 
**Separate Test Runs** ```yaml - name: Unit Tests run: pytest -m "not integration" - name: Integration Tests run: pytest -m integration continue-on-error: true ``` 2. **Scheduled Runs** ```yaml on: schedule: - cron: "0 6 * * *" # Daily at 06:00 UTC ``` 3. **Result Monitoring**: Track integration test success rates over time to identify patterns. ### Integration Testing Best Practices 1. **Keep integration tests focused** - Test integration points, not business logic 2. **Use reasonable timeouts** - Don't wait forever for slow APIs 3. **Document expected failures** - Add comments explaining why tests might skip 4. **Monitor external changes** - Subscribe to API change notifications 5. **Provide escape hatches** - Allow skipping integration tests when needed #### Type Checking Errors ```bash # Check specific file uv run mypy src/biomcp/specific_file.py # Ignore a specific error by adding this inline comment to the offending line # type: ignore[error-code] # Show error codes uv run mypy --show-error-codes ``` ### Performance Testing ```python import time import pytest @pytest.mark.performance def test_search_performance(): """Ensure search completes within time limit""" start = time.perf_counter() result = search_articles(["TP53"], limit=100) duration = time.perf_counter() - start assert duration < 5.0 # Should complete within 5 seconds assert len(result) == 100 ``` ## Submitting Changes ### Pull Request Process 1. **Ensure all checks pass:** ```bash make check && make test ``` 2. **Update documentation** if needed 3. **Commit with a clear message:** ```bash git add . git commit -m "feat: add support for variant batch queries - Add batch_variant_search function - Update tests for batch functionality - Document batch size limits" ``` 4. **Push to your fork:** ```bash git push origin feature/your-feature-name ``` 5.
**Create Pull Request** with: - Clear description of changes - Link to related issues - Test results summary ### Code Review Guidelines Your PR will be reviewed for: - **Code quality** and style consistency - **Test coverage** for new features - **Documentation** updates - **Performance** impact - **Security** considerations ## Best Practices ### DO: - Write tests for new functionality - Follow existing patterns - Keep PRs focused and small - Update documentation - Run full test suite locally ### DON'T: - Skip tests to "save time" - Mix unrelated changes in one PR - Ignore linting warnings - Commit sensitive data - Break existing functionality ## Additional Resources - [MCP Documentation](https://modelcontextprotocol.org) - [pytest Documentation](https://docs.pytest.org) - [Type Hints Guide](https://mypy.readthedocs.io) - [Ruff Documentation](https://docs.astral.sh/ruff) ## Getting Help - **GitHub Issues**: Report bugs or request features - **Issues**: Ask questions or share ideas - **Pull Requests**: Submit contributions - **Documentation**: Check existing docs first Remember: Quality over speed. Take time to write good tests and clean code! ``` -------------------------------------------------------------------------------- /src/biomcp/cli/openfda.py: -------------------------------------------------------------------------------- ```python """ OpenFDA CLI commands for BioMCP. 
""" import asyncio from typing import Annotated import typer from rich.console import Console from ..openfda import ( get_adverse_event, get_device_event, get_drug_approval, get_drug_label, get_drug_recall, get_drug_shortage, search_adverse_events, search_device_events, search_drug_approvals, search_drug_labels, search_drug_recalls, search_drug_shortages, ) console = Console() # Create separate Typer apps for each subdomain adverse_app = typer.Typer( no_args_is_help=True, help="Search and retrieve FDA drug adverse event reports (FAERS)", ) label_app = typer.Typer( no_args_is_help=True, help="Search and retrieve FDA drug product labels (SPL)", ) device_app = typer.Typer( no_args_is_help=True, help="Search and retrieve FDA device adverse event reports (MAUDE)", ) approval_app = typer.Typer( no_args_is_help=True, help="Search and retrieve FDA drug approval records (Drugs@FDA)", ) recall_app = typer.Typer( no_args_is_help=True, help="Search and retrieve FDA drug recall records (Enforcement)", ) shortage_app = typer.Typer( no_args_is_help=True, help="Search and retrieve FDA drug shortage information", ) # Adverse Events Commands @adverse_app.command("search") def search_adverse_events_cli( drug: Annotated[ str | None, typer.Option("--drug", "-d", help="Drug name to search for"), ] = None, reaction: Annotated[ str | None, typer.Option( "--reaction", "-r", help="Adverse reaction to search for" ), ] = None, serious: Annotated[ bool | None, typer.Option("--serious/--all", help="Filter for serious events only"), ] = None, limit: Annotated[ int, typer.Option("--limit", "-l", help="Maximum number of results") ] = 25, page: Annotated[ int, typer.Option("--page", "-p", help="Page number (1-based)") ] = 1, api_key: Annotated[ str | None, typer.Option( "--api-key", help="OpenFDA API key (overrides OPENFDA_API_KEY env var)", ), ] = None, ): """Search FDA adverse event reports for drugs.""" skip = (page - 1) * limit try: results = asyncio.run( search_adverse_events( drug=drug, 
reaction=reaction, serious=serious, limit=limit, skip=skip, api_key=api_key, ) ) console.print(results) except Exception as e: console.print(f"[red]Error: {e}[/red]") raise typer.Exit(1) from e @adverse_app.command("get") def get_adverse_event_cli( report_id: Annotated[str, typer.Argument(help="Safety report ID")], api_key: Annotated[ str | None, typer.Option( "--api-key", help="OpenFDA API key (overrides OPENFDA_API_KEY env var)", ), ] = None, ): """Get detailed information for a specific adverse event report.""" try: result = asyncio.run(get_adverse_event(report_id, api_key=api_key)) console.print(result) except Exception as e: console.print(f"[red]Error: {e}[/red]") raise typer.Exit(1) from e # Drug Label Commands @label_app.command("search") def search_drug_labels_cli( name: Annotated[ str | None, typer.Option("--name", "-n", help="Drug name to search for"), ] = None, indication: Annotated[ str | None, typer.Option( "--indication", "-i", help="Search for drugs indicated for this condition", ), ] = None, boxed_warning: Annotated[ bool, typer.Option( "--boxed-warning", help="Filter for drugs with boxed warnings" ), ] = False, section: Annotated[ str | None, typer.Option( "--section", "-s", help="Specific label section to search" ), ] = None, limit: Annotated[ int, typer.Option("--limit", "-l", help="Maximum number of results") ] = 25, page: Annotated[ int, typer.Option("--page", "-p", help="Page number (1-based)") ] = 1, api_key: Annotated[ str | None, typer.Option( "--api-key", help="OpenFDA API key (overrides OPENFDA_API_KEY env var)", ), ] = None, ): """Search FDA drug product labels.""" skip = (page - 1) * limit try: results = asyncio.run( search_drug_labels( name=name, indication=indication, boxed_warning=boxed_warning, section=section, limit=limit, skip=skip, api_key=api_key, ) ) console.print(results) except Exception as e: console.print(f"[red]Error: {e}[/red]") raise typer.Exit(1) from e @label_app.command("get") def get_drug_label_cli( set_id: 
Annotated[str, typer.Argument(help="Label set ID")], sections: Annotated[ str | None, typer.Option( "--sections", help="Comma-separated list of sections to retrieve" ), ] = None, api_key: Annotated[ str | None, typer.Option( "--api-key", help="OpenFDA API key (overrides OPENFDA_API_KEY env var)", ), ] = None, ): """Get detailed drug label information.""" section_list = None if sections: section_list = [s.strip() for s in sections.split(",")] try: result = asyncio.run( get_drug_label(set_id, section_list, api_key=api_key) ) console.print(result) except Exception as e: console.print(f"[red]Error: {e}[/red]") raise typer.Exit(1) from e # Device Event Commands @device_app.command("search") def search_device_events_cli( device: Annotated[ str | None, typer.Option("--device", "-d", help="Device name to search for"), ] = None, manufacturer: Annotated[ str | None, typer.Option("--manufacturer", "-m", help="Manufacturer name"), ] = None, problem: Annotated[ str | None, typer.Option("--problem", "-p", help="Device problem description"), ] = None, product_code: Annotated[ str | None, typer.Option("--product-code", help="FDA product code") ] = None, genomics_only: Annotated[ bool, typer.Option( "--genomics-only/--all-devices", help="Filter to genomic/diagnostic devices", ), ] = True, limit: Annotated[ int, typer.Option("--limit", "-l", help="Maximum number of results") ] = 25, page: Annotated[ int, typer.Option("--page", help="Page number (1-based)") ] = 1, api_key: Annotated[ str | None, typer.Option( "--api-key", help="OpenFDA API key (overrides OPENFDA_API_KEY env var)", ), ] = None, ): """Search FDA device adverse event reports.""" skip = (page - 1) * limit try: results = asyncio.run( search_device_events( device=device, manufacturer=manufacturer, problem=problem, product_code=product_code, genomics_only=genomics_only, limit=limit, skip=skip, api_key=api_key, ) ) console.print(results) except Exception as e: console.print(f"[red]Error: {e}[/red]") raise typer.Exit(1) from 
e @device_app.command("get") def get_device_event_cli( mdr_report_key: Annotated[str, typer.Argument(help="MDR report key")], api_key: Annotated[ str | None, typer.Option( "--api-key", help="OpenFDA API key (overrides OPENFDA_API_KEY env var)", ), ] = None, ): """Get detailed information for a specific device event report.""" try: result = asyncio.run(get_device_event(mdr_report_key, api_key=api_key)) console.print(result) except Exception as e: console.print(f"[red]Error: {e}[/red]") raise typer.Exit(1) from e # Drug Approval Commands @approval_app.command("search") def search_drug_approvals_cli( drug: Annotated[ str | None, typer.Option("--drug", "-d", help="Drug name to search for"), ] = None, application: Annotated[ str | None, typer.Option( "--application", "-a", help="NDA or BLA application number" ), ] = None, year: Annotated[ str | None, typer.Option("--year", "-y", help="Approval year (YYYY format)"), ] = None, limit: Annotated[ int, typer.Option("--limit", "-l", help="Maximum number of results") ] = 25, page: Annotated[ int, typer.Option("--page", "-p", help="Page number (1-based)") ] = 1, api_key: Annotated[ str | None, typer.Option( "--api-key", help="OpenFDA API key (overrides OPENFDA_API_KEY env var)", ), ] = None, ): """Search FDA drug approval records.""" skip = (page - 1) * limit try: results = asyncio.run( search_drug_approvals( drug=drug, application_number=application, approval_year=year, limit=limit, skip=skip, api_key=api_key, ) ) console.print(results) except Exception as e: console.print(f"[red]Error: {e}[/red]") raise typer.Exit(1) from e @approval_app.command("get") def get_drug_approval_cli( application: Annotated[ str, typer.Argument(help="NDA or BLA application number") ], api_key: Annotated[ str | None, typer.Option( "--api-key", help="OpenFDA API key (overrides OPENFDA_API_KEY env var)", ), ] = None, ): """Get detailed drug approval information.""" try: result = asyncio.run(get_drug_approval(application, api_key=api_key)) 
console.print(result) except Exception as e: console.print(f"[red]Error: {e}[/red]") raise typer.Exit(1) from e # Drug Recall Commands @recall_app.command("search") def search_drug_recalls_cli( drug: Annotated[ str | None, typer.Option("--drug", "-d", help="Drug name to search for"), ] = None, recall_class: Annotated[ str | None, typer.Option( "--class", "-c", help="Recall classification (1, 2, or 3)" ), ] = None, status: Annotated[ str | None, typer.Option( "--status", "-s", help="Recall status (ongoing, completed)" ), ] = None, reason: Annotated[ str | None, typer.Option("--reason", "-r", help="Search in recall reason"), ] = None, since: Annotated[ str | None, typer.Option("--since", help="Show recalls after date (YYYYMMDD)"), ] = None, limit: Annotated[ int, typer.Option("--limit", "-l", help="Maximum number of results") ] = 25, page: Annotated[ int, typer.Option("--page", "-p", help="Page number (1-based)") ] = 1, api_key: Annotated[ str | None, typer.Option( "--api-key", help="OpenFDA API key (overrides OPENFDA_API_KEY env var)", ), ] = None, ): """Search FDA drug recall records.""" skip = (page - 1) * limit try: results = asyncio.run( search_drug_recalls( drug=drug, recall_class=recall_class, status=status, reason=reason, since_date=since, limit=limit, skip=skip, api_key=api_key, ) ) console.print(results) except Exception as e: console.print(f"[red]Error: {e}[/red]") raise typer.Exit(1) from e @recall_app.command("get") def get_drug_recall_cli( recall_number: Annotated[str, typer.Argument(help="FDA recall number")], api_key: Annotated[ str | None, typer.Option( "--api-key", help="OpenFDA API key (overrides OPENFDA_API_KEY env var)", ), ] = None, ): """Get detailed drug recall information.""" try: result = asyncio.run(get_drug_recall(recall_number, api_key=api_key)) console.print(result) except Exception as e: console.print(f"[red]Error: {e}[/red]") raise typer.Exit(1) from e # Drug Shortage Commands @shortage_app.command("search") def 
search_drug_shortages_cli(
    drug: Annotated[
        str | None,
        typer.Option("--drug", "-d", help="Drug name to search for"),
    ] = None,
    status: Annotated[
        str | None,
        typer.Option(
            "--status", "-s", help="Shortage status (current, resolved)"
        ),
    ] = None,
    category: Annotated[
        str | None,
        typer.Option("--category", "-c", help="Therapeutic category"),
    ] = None,
    limit: Annotated[
        int, typer.Option("--limit", "-l", help="Maximum number of results")
    ] = 25,
    page: Annotated[
        int, typer.Option("--page", "-p", help="Page number (1-based)")
    ] = 1,
    api_key: Annotated[
        str | None,
        typer.Option(
            "--api-key",
            help="OpenFDA API key (overrides OPENFDA_API_KEY env var)",
        ),
    ] = None,
):
    """Search FDA drug shortage records."""
    skip = (page - 1) * limit

    try:
        results = asyncio.run(
            search_drug_shortages(
                drug=drug,
                status=status,
                therapeutic_category=category,
                limit=limit,
                skip=skip,
                api_key=api_key,
            )
        )
        console.print(results)
    except Exception as e:
        console.print(f"[red]Error: {e}[/red]")
        raise typer.Exit(1) from e


@shortage_app.command("get")
def get_drug_shortage_cli(
    drug: Annotated[str, typer.Argument(help="Drug name")],
    api_key: Annotated[
        str | None,
        typer.Option(
            "--api-key",
            help="OpenFDA API key (overrides OPENFDA_API_KEY env var)",
        ),
    ] = None,
):
    """Get detailed drug shortage information."""
    try:
        result = asyncio.run(get_drug_shortage(drug, api_key=api_key))
        console.print(result)
    except Exception as e:
        console.print(f"[red]Error: {e}[/red]")
        raise typer.Exit(1) from e


# Main OpenFDA app that combines all subcommands
openfda_app = typer.Typer(
    no_args_is_help=True,
    help="Search and retrieve data from FDA's openFDA API",
)

# Add subcommands
openfda_app.add_typer(
    adverse_app, name="adverse", help="Drug adverse events (FAERS)"
)
openfda_app.add_typer(
    label_app, name="label", help="Drug product labels (SPL)"
)
openfda_app.add_typer(
    device_app, name="device", help="Device adverse events (MAUDE)"
)
openfda_app.add_typer(
    approval_app, name="approval", help="Drug approvals (Drugs@FDA)"
)
openfda_app.add_typer(
    recall_app, name="recall", help="Drug recalls (Enforcement)"
)
openfda_app.add_typer(shortage_app, name="shortage", help="Drug shortages")
```

--------------------------------------------------------------------------------
/src/biomcp/articles/preprints.py:
--------------------------------------------------------------------------------

```python
"""Preprint search functionality for bioRxiv/medRxiv and Europe PMC."""

import asyncio
import json
import logging
from datetime import datetime
from typing import Any

from pydantic import BaseModel, Field

from .. import http_client, render
from ..constants import (
    BIORXIV_BASE_URL,
    BIORXIV_DEFAULT_DAYS_BACK,
    BIORXIV_MAX_PAGES,
    BIORXIV_RESULTS_PER_PAGE,
    EUROPE_PMC_BASE_URL,
    EUROPE_PMC_PAGE_SIZE,
    MEDRXIV_BASE_URL,
    SYSTEM_PAGE_SIZE,
)
from ..core import PublicationState
from .search import PubmedRequest, ResultItem, SearchResponse

logger = logging.getLogger(__name__)


class BiorxivRequest(BaseModel):
    """Request parameters for bioRxiv/medRxiv API."""

    query: str
    interval: str = Field(
        default="", description="Date interval in YYYY-MM-DD/YYYY-MM-DD format"
    )
    cursor: int = Field(default=0, description="Starting position")


class BiorxivResult(BaseModel):
    """Individual result from bioRxiv/medRxiv."""

    doi: str | None = None
    title: str | None = None
    authors: str | None = None
    author_corresponding: str | None = None
    author_corresponding_institution: str | None = None
    date: str | None = None
    version: int | None = None
    type: str | None = None
    license: str | None = None
    category: str | None = None
    jatsxml: str | None = None
    abstract: str | None = None
    published: str | None = None
    server: str | None = None

    def to_result_item(self) -> ResultItem:
        """Convert to standard ResultItem format."""
        authors_list = []
        if self.authors:
            authors_list = [
                author.strip() for author in self.authors.split(";")
            ]

        return ResultItem(
            pmid=None,
            pmcid=None,
            title=self.title,
            journal=f"{self.server or 'bioRxiv'} (preprint)",
            authors=authors_list,
            date=self.date,
            doi=self.doi,
            abstract=self.abstract,
            publication_state=PublicationState.PREPRINT,
            source=self.server or "bioRxiv",
        )


class BiorxivResponse(BaseModel):
    """Response from bioRxiv/medRxiv API."""

    collection: list[BiorxivResult] = Field(default_factory=list)
    messages: list[dict[str, Any]] = Field(default_factory=list)
    total: int = Field(default=0, alias="total")


class EuropePMCRequest(BaseModel):
    """Request parameters for Europe PMC API."""

    query: str
    format: str = "json"
    pageSize: int = Field(default=25, le=1000)
    cursorMark: str = Field(default="*")
    src: str = Field(default="PPR", description="Source: PPR for preprints")


class EuropePMCResult(BaseModel):
    """Individual result from Europe PMC."""

    id: str | None = None
    source: str | None = None
    pmid: str | None = None
    pmcid: str | None = None
    doi: str | None = None
    title: str | None = None
    authorString: str | None = None
    journalTitle: str | None = None
    pubYear: str | None = None
    firstPublicationDate: str | None = None
    abstractText: str | None = None

    def to_result_item(self) -> ResultItem:
        """Convert to standard ResultItem format."""
        authors_list = []
        if self.authorString:
            authors_list = [
                author.strip() for author in self.authorString.split(",")
            ]

        return ResultItem(
            pmid=int(self.pmid) if self.pmid and self.pmid.isdigit() else None,
            pmcid=self.pmcid,
            title=self.title,
            journal=f"{self.journalTitle or 'Preprint Server'} (preprint)",
            authors=authors_list,
            date=self.firstPublicationDate or self.pubYear,
            doi=self.doi,
            abstract=self.abstractText,
            publication_state=PublicationState.PREPRINT,
            source="Europe PMC",
        )


class EuropePMCResponse(BaseModel):
    """Response from Europe PMC API."""

    hitCount: int = Field(default=0)
    nextCursorMark: str | None = None
    resultList: dict[str, Any] = Field(default_factory=dict)

    @property
    def results(self) -> list[EuropePMCResult]:
        result_data = self.resultList.get("result", [])
        return [EuropePMCResult(**r) for r in result_data]


class PreprintSearcher:
    """Handles
    searching across multiple preprint sources."""

    def __init__(self):
        self.biorxiv_client = BiorxivClient()
        self.europe_pmc_client = EuropePMCClient()

    async def search(
        self,
        request: PubmedRequest,
        include_biorxiv: bool = True,
        include_europe_pmc: bool = True,
    ) -> SearchResponse:
        """Search across preprint sources and merge results."""
        query = self._build_query(request)

        tasks = []
        if include_biorxiv:
            tasks.append(self.biorxiv_client.search(query))
        if include_europe_pmc:
            tasks.append(self.europe_pmc_client.search(query))

        results_lists = await asyncio.gather(*tasks, return_exceptions=True)

        all_results = []
        for results in results_lists:
            if isinstance(results, list):
                all_results.extend(results)

        # Remove duplicates based on DOI
        seen_dois = set()
        unique_results = []
        for result in all_results:
            if result.doi and result.doi in seen_dois:
                continue
            if result.doi:
                seen_dois.add(result.doi)
            unique_results.append(result)

        # Sort by date (newest first)
        unique_results.sort(key=lambda x: x.date or "0000-00-00", reverse=True)

        # Limit results
        limited_results = unique_results[:SYSTEM_PAGE_SIZE]

        return SearchResponse(
            results=limited_results,
            page_size=len(limited_results),
            current=0,
            count=len(limited_results),
            total_pages=1,
        )

    def _build_query(self, request: PubmedRequest) -> str:
        """Build query string from structured request.

        Note: Preprint servers use plain text search, not PubMed syntax.
        """
        query_parts = []

        if request.keywords:
            query_parts.extend(request.keywords)
        if request.genes:
            query_parts.extend(request.genes)
        if request.diseases:
            query_parts.extend(request.diseases)
        if request.chemicals:
            query_parts.extend(request.chemicals)
        if request.variants:
            query_parts.extend(request.variants)

        return " ".join(query_parts) if query_parts else ""


class BiorxivClient:
    """Client for bioRxiv/medRxiv API.

    IMPORTANT LIMITATION: bioRxiv/medRxiv APIs do not provide a search endpoint.
    This implementation works around this limitation by:
    1.
    Fetching articles from a date range (last 365 days by default)
    2. Filtering results client-side based on query match in title/abstract

    This approach has limitations but is optimized for performance:
    - Searches up to 1 year of preprints by default (configurable)
    - Uses pagination to avoid fetching all results at once
    - May still miss older preprints beyond the date range

    Consider using Europe PMC for more comprehensive preprint search capabilities,
    as it has proper search functionality without date limitations.
    """

    async def search(  # noqa: C901
        self,
        query: str,
        server: str = "biorxiv",
        days_back: int = BIORXIV_DEFAULT_DAYS_BACK,
    ) -> list[ResultItem]:
        """Search bioRxiv or medRxiv for articles.

        Note: Due to API limitations, this performs client-side filtering on
        recent articles only. See class docstring for details.
        """
        if not query:
            return []

        base_url = (
            BIORXIV_BASE_URL if server == "biorxiv" else MEDRXIV_BASE_URL
        )

        # Optimize by only fetching recent articles (the last `days_back` days)
        from datetime import timedelta

        today = datetime.now()
        start_date = today - timedelta(days=days_back)
        interval = f"{start_date.year}-{start_date.month:02d}-{start_date.day:02d}/{today.year}-{today.month:02d}-{today.day:02d}"

        # Prepare query terms for better matching
        query_terms = query.lower().split()

        filtered_results = []
        cursor = 0
        max_pages = (
            BIORXIV_MAX_PAGES  # Limit pagination to avoid excessive API calls
        )

        for page in range(max_pages):
            request = BiorxivRequest(
                query=query, interval=interval, cursor=cursor
            )
            url = f"{base_url}/{request.interval}/{request.cursor}"

            response, error = await http_client.request_api(
                url=url,
                method="GET",
                request={},
                response_model_type=BiorxivResponse,
                domain="biorxiv",
                cache_ttl=300,  # Cache for 5 minutes
            )

            if error or not response:
                logger.warning(
                    f"Failed to fetch {server} articles page {page} for query '{query}': {error if error else 'No response'}"
                )
                break

            # Filter results based on query
            page_filtered = 0
            for result in response.collection:
                # Create searchable text from title and abstract
                searchable_text = ""
                if result.title:
                    searchable_text += result.title.lower() + " "
                if result.abstract:
                    searchable_text += result.abstract.lower()

                # Check if all query terms are present (AND logic)
                if all(term in searchable_text for term in query_terms):
                    filtered_results.append(result.to_result_item())
                    page_filtered += 1

                    # Stop if we have enough results
                    if len(filtered_results) >= SYSTEM_PAGE_SIZE:
                        return filtered_results[:SYSTEM_PAGE_SIZE]

            # If this page had no matches and we have some results, stop pagination
            if page_filtered == 0 and filtered_results:
                break

            # Move to next page
            cursor += len(response.collection)

            # Stop if we've processed all available results
            if (
                len(response.collection) < BIORXIV_RESULTS_PER_PAGE
            ):  # bioRxiv typically returns this many per page
                break

        return filtered_results[:SYSTEM_PAGE_SIZE]


class EuropePMCClient:
    """Client for Europe PMC API."""

    async def search(
        self, query: str, max_results: int = SYSTEM_PAGE_SIZE
    ) -> list[ResultItem]:
        """Search Europe PMC for preprints with pagination support."""
        results: list[ResultItem] = []
        cursor_mark = "*"
        page_size = min(
            EUROPE_PMC_PAGE_SIZE, max_results
        )  # Europe PMC optimal page size

        while len(results) < max_results:
            request = EuropePMCRequest(
                query=f"(SRC:PPR) AND ({query})" if query else "SRC:PPR",
                pageSize=page_size,
                cursorMark=cursor_mark,
            )

            params = request.model_dump(exclude_none=True)

            response, error = await http_client.request_api(
                url=EUROPE_PMC_BASE_URL,
                method="GET",
                request=params,
                response_model_type=EuropePMCResponse,
                domain="europepmc",
                cache_ttl=300,  # Cache for 5 minutes
            )

            if error or not response:
                logger.warning(
                    f"Failed to fetch Europe PMC preprints for query '{query}': {error if error else 'No response'}"
                )
                break

            # Add results
            page_results = [
                result.to_result_item() for result in response.results
            ]
            results.extend(page_results)

            # Check if we have more pages
            if (
                not response.nextCursorMark
                or response.nextCursorMark ==
                cursor_mark
            ):
                break

            # Check if we got fewer results than requested (last page)
            if len(page_results) < page_size:
                break

            cursor_mark = response.nextCursorMark

            # Adjust page size for last request if needed
            remaining = max_results - len(results)
            if remaining < page_size:
                page_size = remaining

        return results[:max_results]


async def fetch_europe_pmc_article(
    doi: str,
    output_json: bool = False,
) -> str:
    """Fetch a single article from Europe PMC by DOI."""
    # Europe PMC search API can fetch article details by DOI
    request = EuropePMCRequest(
        query=f'DOI:"{doi}"',
        pageSize=1,
        src="PPR",  # Preprints source
    )

    params = request.model_dump(exclude_none=True)

    response, error = await http_client.request_api(
        url=EUROPE_PMC_BASE_URL,
        method="GET",
        request=params,
        response_model_type=EuropePMCResponse,
        domain="europepmc",
    )

    if error:
        data: list[dict[str, Any]] = [
            {"error": f"Error {error.code}: {error.message}"}
        ]
    elif response and response.results:
        # Convert Europe PMC result to Article format for consistency
        europe_pmc_result = response.results[0]
        article_data = {
            "pmid": None,  # Europe PMC preprints don't have PMIDs
            "pmcid": europe_pmc_result.pmcid,
            "doi": europe_pmc_result.doi,
            "title": europe_pmc_result.title,
            "journal": f"{europe_pmc_result.journalTitle or 'Preprint Server'} (preprint)",
            "date": europe_pmc_result.firstPublicationDate
            or europe_pmc_result.pubYear,
            "authors": [
                author.strip()
                for author in (europe_pmc_result.authorString or "").split(",")
            ],
            "abstract": europe_pmc_result.abstractText,
            "full_text": "",  # Europe PMC API doesn't provide full text for preprints
            "pubmed_url": None,
            "pmc_url": f"https://europepmc.org/article/PPR/{doi}"
            if doi
            else None,
            "source": "Europe PMC",
        }
        data = [article_data]
    else:
        data = [{"error": "Article not found in Europe PMC"}]

    if data and not output_json:
        return render.to_markdown(data)
    else:
        return json.dumps(data, indent=2)


async def search_preprints(
    request: PubmedRequest,
    include_biorxiv: bool = True,
    include_europe_pmc: bool =
    True,
    output_json: bool = False,
) -> str:
    """Search for preprints across multiple sources."""
    searcher = PreprintSearcher()
    response = await searcher.search(
        request,
        include_biorxiv=include_biorxiv,
        include_europe_pmc=include_europe_pmc,
    )

    if response and response.results:
        data = [
            result.model_dump(mode="json", exclude_none=True)
            for result in response.results
        ]
    else:
        data = []

    if data and not output_json:
        return render.to_markdown(data)
    else:
        return json.dumps(data, indent=2)
```

--------------------------------------------------------------------------------
/src/biomcp/query_parser.py:
--------------------------------------------------------------------------------

```python
"""Query parser for unified search language in BioMCP."""

from dataclasses import dataclass
from enum import Enum
from typing import Any


class Operator(str, Enum):
    """Query operators."""

    EQ = ":"
    GT = ">"
    LT = "<"
    GTE = ">="
    LTE = "<="
    RANGE = ".."
    AND = "AND"
    OR = "OR"
    NOT = "NOT"


class FieldType(str, Enum):
    """Field data types."""

    STRING = "string"
    NUMBER = "number"
    DATE = "date"
    ENUM = "enum"
    BOOLEAN = "boolean"


@dataclass
class FieldDefinition:
    """Definition of a searchable field."""

    name: str
    domain: str  # "trials", "articles", "variants", "cross"
    type: FieldType
    operators: list[str]
    example_values: list[str]
    description: str
    underlying_api_field: str
    aliases: list[str] | None = None


@dataclass
class QueryTerm:
    """Parsed query term."""

    field: str
    operator: Operator
    value: Any
    domain: str | None = None
    is_negated: bool = False


@dataclass
class ParsedQuery:
    """Parsed query structure."""

    terms: list[QueryTerm]
    cross_domain_fields: dict[str, Any]
    domain_specific_fields: dict[str, dict[str, Any]]
    raw_query: str


class QueryParser:
    """Parser for unified search queries."""

    def __init__(self):
        self.field_registry = self._build_field_registry()

    def _build_field_registry(self) -> dict[str, FieldDefinition]:
        """Build the field registry with all searchable fields."""
        registry = {}

        # Cross-domain
        # fields
        cross_domain_fields = [
            FieldDefinition(
                name="gene",
                domain="cross",
                type=FieldType.STRING,
                operators=[Operator.EQ],
                example_values=["BRAF", "TP53", "EGFR"],
                description="Gene symbol",
                underlying_api_field="gene",
            ),
            FieldDefinition(
                name="variant",
                domain="cross",
                type=FieldType.STRING,
                operators=[Operator.EQ],
                example_values=["V600E", "L858R", "rs113488022"],
                description="Variant notation or rsID",
                underlying_api_field="variant",
            ),
            FieldDefinition(
                name="disease",
                domain="cross",
                type=FieldType.STRING,
                operators=[Operator.EQ],
                example_values=["melanoma", "lung cancer", "diabetes"],
                description="Disease or condition",
                underlying_api_field="disease",
            ),
        ]

        # Trial-specific fields
        trial_fields = [
            FieldDefinition(
                name="trials.condition",
                domain="trials",
                type=FieldType.STRING,
                operators=[Operator.EQ],
                example_values=["melanoma", "lung cancer"],
                description="Clinical trial condition",
                underlying_api_field="conditions",
            ),
            FieldDefinition(
                name="trials.intervention",
                domain="trials",
                type=FieldType.STRING,
                operators=[Operator.EQ],
                example_values=["osimertinib", "pembrolizumab"],
                description="Trial intervention",
                underlying_api_field="interventions",
            ),
            FieldDefinition(
                name="trials.phase",
                domain="trials",
                type=FieldType.ENUM,
                operators=[Operator.EQ],
                example_values=["1", "2", "3", "4"],
                description="Trial phase",
                underlying_api_field="phase",
            ),
            FieldDefinition(
                name="trials.status",
                domain="trials",
                type=FieldType.ENUM,
                operators=[Operator.EQ],
                example_values=["recruiting", "active", "completed"],
                description="Trial recruitment status",
                underlying_api_field="recruiting_status",
            ),
        ]

        # Article-specific fields
        article_fields = [
            FieldDefinition(
                name="articles.title",
                domain="articles",
                type=FieldType.STRING,
                operators=[Operator.EQ],
                example_values=["EGFR mutations", "cancer therapy"],
                description="Article title",
                underlying_api_field="title",
            ),
            FieldDefinition(
                name="articles.author",
                domain="articles",
                type=FieldType.STRING,
                operators=[Operator.EQ],
                example_values=["Smith J", "Johnson A"],
                description="Article author",
                underlying_api_field="author",
            ),
            FieldDefinition(
                name="articles.journal",
                domain="articles",
                type=FieldType.STRING,
                operators=[Operator.EQ],
                example_values=["Nature", "Science", "Cell"],
                description="Journal name",
                underlying_api_field="journal",
            ),
            FieldDefinition(
                name="articles.date",
                domain="articles",
                type=FieldType.DATE,
                operators=[Operator.GT, Operator.LT, Operator.RANGE],
                example_values=[">2023-01-01", "2023-01-01..2024-01-01"],
                description="Publication date",
                underlying_api_field="date",
            ),
        ]

        # Variant-specific fields
        variant_fields = [
            FieldDefinition(
                name="variants.rsid",
                domain="variants",
                type=FieldType.STRING,
                operators=[Operator.EQ],
                example_values=["rs113488022", "rs121913529"],
                description="dbSNP rsID",
                underlying_api_field="rsid",
            ),
            FieldDefinition(
                name="variants.gene",
                domain="variants",
                type=FieldType.STRING,
                operators=[Operator.EQ],
                example_values=["BRAF", "TP53"],
                description="Gene containing variant",
                underlying_api_field="gene",
            ),
            FieldDefinition(
                name="variants.significance",
                domain="variants",
                type=FieldType.ENUM,
                operators=[Operator.EQ],
                example_values=["pathogenic", "benign", "uncertain"],
                description="Clinical significance",
                underlying_api_field="significance",
            ),
            FieldDefinition(
                name="variants.frequency",
                domain="variants",
                type=FieldType.NUMBER,
                operators=[Operator.LT, Operator.GT],
                example_values=["<0.01", ">0.05"],
                description="Population allele frequency",
                underlying_api_field="frequency",
            ),
        ]

        # Gene-specific fields
        gene_fields = [
            FieldDefinition(
                name="genes.symbol",
                domain="genes",
                type=FieldType.STRING,
                operators=[Operator.EQ],
                example_values=["BRAF", "TP53", "EGFR"],
                description="Gene symbol",
                underlying_api_field="symbol",
            ),
            FieldDefinition(
                name="genes.name",
                domain="genes",
                type=FieldType.STRING,
                operators=[Operator.EQ],
                example_values=[
                    "tumor protein p53",
                    "epidermal growth factor receptor",
                ],
                description="Gene name",
                underlying_api_field="name",
            ),
            FieldDefinition(
                name="genes.type",
                domain="genes",
                type=FieldType.STRING,
                operators=[Operator.EQ],
                example_values=["protein-coding", "pseudo", "ncRNA"],
                description="Gene type",
                underlying_api_field="type_of_gene",
            ),
        ]

        # Drug-specific fields
        drug_fields = [
            FieldDefinition(
                name="drugs.name",
                domain="drugs",
                type=FieldType.STRING,
                operators=[Operator.EQ],
                example_values=["imatinib", "aspirin", "metformin"],
                description="Drug name",
                underlying_api_field="name",
            ),
            FieldDefinition(
                name="drugs.tradename",
                domain="drugs",
                type=FieldType.STRING,
                operators=[Operator.EQ],
                example_values=["Gleevec", "Tylenol", "Lipitor"],
                description="Drug trade name",
                underlying_api_field="tradename",
            ),
            FieldDefinition(
                name="drugs.indication",
                domain="drugs",
                type=FieldType.STRING,
                operators=[Operator.EQ],
                example_values=["leukemia", "hypertension", "diabetes"],
                description="Drug indication",
                underlying_api_field="indication",
            ),
        ]

        # Disease-specific fields
        disease_fields = [
            FieldDefinition(
                name="diseases.name",
                domain="diseases",
                type=FieldType.STRING,
                operators=[Operator.EQ],
                example_values=["melanoma", "breast cancer", "diabetes"],
                description="Disease name",
                underlying_api_field="name",
            ),
            FieldDefinition(
                name="diseases.mondo",
                domain="diseases",
                type=FieldType.STRING,
                operators=[Operator.EQ],
                example_values=["MONDO:0005105", "MONDO:0007254"],
                description="MONDO disease ID",
                underlying_api_field="mondo_id",
            ),
            FieldDefinition(
                name="diseases.synonym",
                domain="diseases",
                type=FieldType.STRING,
                operators=[Operator.EQ],
                example_values=["cancer", "tumor", "neoplasm"],
                description="Disease synonym",
                underlying_api_field="synonyms",
            ),
        ]

        # Build registry
        for field_list in [
            cross_domain_fields,
            trial_fields,
            article_fields,
            variant_fields,
            gene_fields,
            drug_fields,
            disease_fields,
        ]:
            for field in field_list:
                registry[field.name] = field

        return registry

    def parse(self, query: str) -> ParsedQuery:
        """Parse a unified search query."""
        # Simple
        # tokenization - in production, use a proper parser
        terms = self._tokenize(query)
        parsed_terms = []
        cross_domain = {}
        domain_specific: dict[str, dict[str, Any]] = {
            "trials": {},
            "articles": {},
            "variants": {},
            "genes": {},
            "drugs": {},
            "diseases": {},
        }

        for term in terms:
            if ":" in term:
                field, value = term.split(":", 1)

                # Check if it's a known field
                if field in self.field_registry:
                    field_def = self.field_registry[field]
                    parsed_term = QueryTerm(
                        field=field,
                        operator=Operator.EQ,
                        value=value.strip('"'),
                        domain=field_def.domain,
                    )
                    parsed_terms.append(parsed_term)

                    # Categorize the term
                    if field_def.domain == "cross":
                        cross_domain[field] = value.strip('"')
                    else:
                        domain = (
                            field.split(".")[0]
                            if "." in field
                            else field_def.domain
                        )
                        if domain not in domain_specific:
                            domain_specific[domain] = {}
                        field_name = (
                            field.split(".")[-1] if "." in field else field
                        )
                        domain_specific[domain][field_name] = value.strip('"')

        return ParsedQuery(
            terms=parsed_terms,
            cross_domain_fields=cross_domain,
            domain_specific_fields=domain_specific,
            raw_query=query,
        )

    def _tokenize(self, query: str) -> list[str]:
        """Simple tokenizer for query strings."""
        # This is a simplified tokenizer - in production, use a proper lexer
        # For now, split on AND/OR/NOT while preserving field:value pairs
        tokens = []
        current_token = ""
        in_quotes = False

        for char in query:
            if char == '"':
                in_quotes = not in_quotes
                current_token += char
            elif char == " " and not in_quotes:
                if current_token:
                    tokens.append(current_token)
                    current_token = ""
            else:
                current_token += char

        if current_token:
            tokens.append(current_token)

        # Filter out boolean operators for now
        return [t for t in tokens if t not in ["AND", "OR", "NOT"]]

    def get_schema(self) -> dict[str, Any]:
        """Get the complete field schema for discovery."""
        schema: dict[str, Any] = {
            "domains": [
                "trials",
                "articles",
                "variants",
                "genes",
                "drugs",
                "diseases",
            ],
            "cross_domain_fields": {},
            "domain_fields": {
                "trials": {},
                "articles": {},
                "variants": {},
                "genes": {},
                "drugs":
                {},
                "diseases": {},
            },
            "operators": [op.value for op in Operator],
            "examples": [
                "gene:BRAF AND trials.condition:melanoma",
                "articles.date:>2023 AND disease:cancer",
                "variants.significance:pathogenic AND gene:TP53",
                "genes.symbol:BRAF AND genes.type:protein-coding",
                "drugs.tradename:gleevec",
                "diseases.name:melanoma",
            ],
        }

        for field_name, field_def in self.field_registry.items():
            field_info = {
                "type": field_def.type.value,
                "operators": field_def.operators,
                "examples": field_def.example_values,
                "description": field_def.description,
            }

            if field_def.domain == "cross":
                schema["cross_domain_fields"][field_name] = field_info
            else:
                domain = field_name.split(".")[0]
                field_short_name = field_name.split(".")[-1]
                schema["domain_fields"][domain][field_short_name] = field_info

        return schema
```

--------------------------------------------------------------------------------
/src/biomcp/resources/instructions.md:
--------------------------------------------------------------------------------

```markdown
# BioMCP Instructions for the Biomedical Assistant

Welcome to **BioMCP** – your unified interface to access key biomedical data sources. This document serves as an internal instruction set for the biomedical assistant (LLM) to ensure a clear, well-reasoned, and accurate response to user queries.

---

## CRITICAL: Always Use the 'think' Tool FIRST

**The 'think' tool is MANDATORY and must be your FIRST action when using BioMCP.**

🚨 **REQUIRED USAGE:**

- You MUST call 'think' BEFORE any search or fetch operations
- EVERY biomedical research query requires thinking first
- ALL multi-step analyses must begin with the think tool
- ANY task using BioMCP tools requires prior planning with think

⚠️ **WARNING:** Skipping the 'think' tool will result in:

- Incomplete analysis
- Poor search strategies
- Missing critical connections
- Suboptimal results

Start EVERY BioMCP interaction with the 'think' tool. Use it throughout your analysis to track progress.
Only set nextThoughtNeeded=false when your analysis is complete.

---

## 1. Purpose of BioMCP

BioMCP (Biomedical Model Context Protocol) standardizes access to multiple biomedical data sources. It transforms complex, filter-intensive queries into natural language interactions. The assistant should leverage this capability to:

- Integrate clinical trial data, literature, variant annotations, and comprehensive biomedical information from multiple resources.
- Synthesize the results into a coherent, accurate, and concise answer.
- Enhance user trust by providing key snippets and citations (with clickable URLs) from the original materials, unless the user opts to omit them.

---

## 2. Available Data Sources

BioMCP provides access to the following biomedical databases:

### Literature & Clinical Sources

- **PubMed/PubTator3**: Peer-reviewed biomedical literature with entity annotations
- **bioRxiv/medRxiv**: Preprint servers (included by default in article searches)
- **Europe PMC**: Additional literature including preprints
- **ClinicalTrials.gov**: Clinical trial registry with comprehensive trial data

### BioThings Suite APIs

- **MyVariant.info**: Genetic variant annotations and population frequencies
- **MyGene.info**: Real-time gene information, aliases, and summaries
- **MyDisease.info**: Disease ontology, definitions, and synonym expansion
- **MyChem.info**: Drug/chemical properties, mechanisms, and identifiers

### Cancer & Genomic Resources

- **cBioPortal**: Cancer genomics data (automatically integrated with gene searches)
- **TCGA/GDC**: The Cancer Genome Atlas data for variants
- **1000 Genomes**: Population frequency data via Ensembl

---

## 3. Internal Workflow for Query Handling

When a user query is received (for example, "Please investigate ALK rearrangements in advanced NSCLC..."), the assistant should follow these steps:

### A.
ALWAYS Start with the 'think' Tool

- **Use 'think' immediately:** For ANY biomedical research query, you MUST begin by invoking the 'think' tool to break down the problem systematically.
- **Initial thought should:** Parse the user's natural language query and extract relevant details such as gene variants (e.g., ALK rearrangements), disease type (advanced NSCLC), and treatment focus (combinations of ALK inhibitors with immunotherapy).
- **Continue thinking:** Use additional 'think' calls to plan your approach, identify data sources needed, and track your analysis progress.

### B. Plan and Explain the Tool Sequence (via the 'think' Tool)

- **Use 'think' to plan:** Continue using the 'think' tool to outline your reasoning and planned tool sequence:
  - **Step 1:** Use gene_getter to understand ALK gene function and context.
  - **Step 2:** Use disease_getter to get comprehensive information about NSCLC, including synonyms for better search coverage.
  - **Step 3:** Use ClinicalTrials.gov to retrieve clinical trial data related to the query (disease synonyms are automatically expanded).
  - **Step 4:** Use PubMed (via PubTator3) to fetch relevant literature discussing outcomes or synergy. Note: Preprints from bioRxiv/medRxiv are included by default, and cBioPortal cancer genomics data is automatically integrated for gene-based searches.
  - **Step 5:** Query MyVariant.info for variant annotations (noting limitations for gene fusions if applicable).
  - **Step 6:** If specific drugs are mentioned, use drug_getter for mechanism of action and properties.
- **Transparency:** Clearly indicate which tool is being called for which part of the query.

#### Search Syntax Enhancement: OR Logic for Keywords

When searching articles, the keywords parameter now supports OR logic using the pipe (|) separator:

**Syntax**: `keyword1|keyword2|keyword3`

**Examples**:

- `"R173|Arg173|p.R173"` - Finds articles mentioning any of these variant notations
- `"V600E|p.V600E|c.1799T>A"` - Handles different mutation nomenclatures
- `"immunotherapy|checkpoint inhibitor|PD-1"` - Searches for related treatment terms
- `"NSCLC|non-small cell lung cancer"` - Covers abbreviations and full terms

**Important Notes**:

- OR logic only applies within a single keyword parameter
- Multiple keywords are still combined with AND logic
- Example: keywords=["BRAF|B-RAF", "therapy|treatment"] means:
  - (BRAF OR B-RAF) AND (therapy OR treatment)

This feature is particularly useful for:

- Handling different nomenclatures for the same concept
- Searching for synonyms or related terms
- Dealing with abbreviations and full names
- Finding articles that use different notations for variants

### C. Execute and Synthesize Results

- **Combine Data:** After retrieving results from each tool, synthesize the information into a final answer.
- **Include Citations with URLs:** Always include clickable URLs from the original sources in your citations. Extract URLs (Pubmed_Url, Doi_Url, Study_Url, etc.) from function results and incorporate these into your response when referencing specific findings or papers.
- **Follow-up Opportunity:** If the response leaves any ambiguity or if additional information might be helpful, prompt the user for follow-up questions.

---

## 3. Best Practices for the Biomedical Assistant

- **Understanding the Query:** Focus on accurately interpreting the user's query, rather than instructing the user on query formulation.
- **Reasoning Transparency:** Briefly explain your thought process and the sequence of tool calls before presenting the final answer.
- **Conciseness and Clarity:** Ensure your final response is succinct and well-organized, using bullet points or sections as needed.
- **Citation Inclusion Mandatory:** Provide key snippets and links to the original materials (e.g., clinical trial records, PubMed articles, ClinVar entries, COSMIC database) to support the answer. ALWAYS include clickable URLs to these resources when referencing specific findings or data.
- **User Follow-up Questions Before Startup:** If anything is unclear in the user's query or if more details would improve the answer, politely request additional clarification.
- **Audience Awareness:** Structure your response with both depth for specialists and clarity for general audiences. Begin with accessible explanations before delving into scientific details.
- **Organization and Clarity:** Ensure your final response is well-structured, accessible, and easy to navigate by:
  - Using descriptive section headings and subheadings to organize information logically
  - Employing consistent formatting with bulleted or numbered lists to break down complex information
  - Starting each major section with a plain-language summary before exploring technical details
  - Creating clear visual separation between different topics
  - Using concise sentence structures while maintaining informational depth
  - Explicitly differentiating between established practices and experimental approaches
  - Including brief transition sentences between major sections
  - Presenting clinical trial data in consistent formats
  - Using strategic white space to improve readability
  - Summarizing key takeaways at the end of major sections when appropriate

---

## 4. Visual Organization and Formatting

- **Comparison Tables:** When comparing two or more entities (like mutation classes, treatment approaches, or clinical trials), create a comparison table to highlight key differences at a glance.
Tables should have clear headers, consistent formatting, and focus on the most important distinguishing features.
- **Format Optimization:** Use formatting elements strategically: tables for comparisons, bullet points for lists, headings for section organization, and whitespace for readability.
- **Visual Hierarchy:** For complex biomedical topics, create a visual hierarchy that helps readers quickly identify key information.
- **Balance Between Comprehensiveness and Clarity:** While providing comprehensive information, prioritize clarity and accessibility. Organize content from most important/general to more specialized details.
- **Section Summaries:** Conclude sections with key takeaways that highlight the practical implications of the scientific information.

---

## 5. Example Scenarios

### Example 1: ALK Rearrangements in Advanced NSCLC

For a query such as:

```
Please investigate ALK rearrangements in advanced NSCLC, particularly any clinical trials exploring combinations of ALK inhibitors and immunotherapy.
```

The assistant should:

1. **Start with the 'think' Tool:**

   - Invoke 'think' with thoughtNumber=1 to understand the query focus on ALK rearrangements in advanced NSCLC with combination treatments
   - Use thoughtNumber=2 to plan the research approach and identify needed data sources

2.
**Execute Tool Calls (tracking with 'think'):**

   - **First:** Use gene_getter("ALK") to understand the gene's function and role in cancer (document findings in thoughtNumber=3)
   - **Second:** Use disease_getter("NSCLC") to get disease information and synonyms like "non-small cell lung cancer" (document in thoughtNumber=4)
   - **Third:** Query ClinicalTrials.gov for ALK+ NSCLC trials that combine ALK inhibitors with immunotherapy (document findings in thoughtNumber=5)
   - **Fourth:** Query PubMed to retrieve key articles discussing treatment outcomes or synergy (document in thoughtNumber=6)
   - **Fifth:** Check MyVariant.info for any annotations on ALK fusions or rearrangements (document in thoughtNumber=7)
   - **Sixth:** If specific ALK inhibitors are mentioned, use drug_getter to understand their mechanisms (document in thoughtNumber=8)

3. **Synthesize and Report (via 'think'):** Use final thoughts to synthesize findings before producing the answer that includes:

   - A concise summary of clinical trials with comparison tables like:

     | **Trial**        | **Combination**        | **Patient Population**         | **Results** | **Safety Profile**                              | **Reference**                                                    |
     | ---------------- | ---------------------- | ------------------------------ | ----------- | ----------------------------------------------- | ---------------------------------------------------------------- |
     | CheckMate 370    | Crizotinib + Nivolumab | 13 treatment-naive ALK+ NSCLC  | 38% ORR     | 5/13 with grade ≥3 hepatic toxicities; 2 deaths | [Schenk et al., 2023](https://pubmed.ncbi.nlm.nih.gov/36895933/) |
     | JAVELIN Lung 101 | Avelumab + Lorlatinib  | 28 previously treated patients | 46.4% ORR   | No DLTs; milder toxicity                        | [NCT02584634](https://clinicaltrials.gov/study/NCT02584634)      |

   - Key literature findings with proper citations: "A review by Schenk concluded that combining ALK inhibitors with checkpoint inhibitors resulted in 'significant toxicities without clear improvement in patient outcomes' [https://pubmed.ncbi.nlm.nih.gov/36895933/](https://pubmed.ncbi.nlm.nih.gov/36895933/)."

   - Tables comparing response rates:

     | **Study**             | **Patient Population** | **Immunotherapy Agent**       | **Response Rate** | **Reference**                                                 |
     | --------------------- | ---------------------- | ----------------------------- | ----------------- | ------------------------------------------------------------- |
     | ATLANTIC Trial        | 11 ALK+ NSCLC          | Durvalumab                    | 0%                | [Link to study](https://pubmed.ncbi.nlm.nih.gov/36895933/)    |
     | IMMUNOTARGET Registry | 19 ALK+ NSCLC          | Various PD-1/PD-L1 inhibitors | 0%                | [Link to registry](https://pubmed.ncbi.nlm.nih.gov/36895933/) |

   - Variant information with proper attribution.

4. **Offer Follow-up:** Conclude by asking if further details are needed or if any part of the answer should be clarified.

### Example 2: BRAF Mutation Classes in Cancer Therapeutics

For a query such as:

```
Please investigate the differences in BRAF Class I (e.g., V600E) and Class III (e.g., D594G) mutations that lead to different therapeutic strategies in cancers like melanoma or colorectal carcinoma.
```

The assistant should:

1. **Understand and Clarify:** Identify that the query focuses on comparing two specific BRAF mutation classes (Class I/V600E vs. Class III/D594G) and their therapeutic implications in melanoma and colorectal cancer.

2. **Plan Tool Calls:**

   - **First:** Search PubMed literature to understand the molecular differences between BRAF Class I and Class III mutations.
   - **Second:** Explore specific variant details using the variant search tool to understand the characteristics of these mutations.
   - **Third:** Look for clinical trials involving these mutation types to identify therapeutic strategies.

3.
**Synthesize and Report:** Create a comprehensive comparison that includes:

   - Comparison tables highlighting key differences between mutation classes:

     | Feature                      | Class I (e.g., V600E)          | Class III (e.g., D594G)                    |
     | ---------------------------- | ------------------------------ | ------------------------------------------ |
     | **Signaling Mechanism**      | Constitutively active monomers | Kinase-impaired heterodimers               |
     | **RAS Dependency**           | RAS-independent                | RAS-dependent                              |
     | **Dimerization Requirement** | Function as monomers           | Require heterodimerization with CRAF       |
     | **Therapeutic Response**     | Responsive to BRAF inhibitors  | Paradoxically activated by BRAF inhibitors |

   - Specific therapeutic strategies with clickable citation links:

     - For Class I: BRAF inhibitors as demonstrated in [Davies et al.](https://pubmed.ncbi.nlm.nih.gov/35869122/)
     - For Class III: Alternative approaches such as MEK inhibitors shown in [Śmiech et al.](https://pubmed.ncbi.nlm.nih.gov/33198372/)

   - Cancer-specific implications with relevant clinical evidence:

     - Melanoma treatment differences including clinical trial data from [NCT05767879](https://clinicaltrials.gov/study/NCT05767879)
     - Colorectal cancer approaches citing research from [Liu et al.](https://pubmed.ncbi.nlm.nih.gov/37760573/)

4. **Offer Follow-up:** Conclude by asking if the user would like more detailed information on specific aspects, such as resistance mechanisms, emerging therapies, or mutation detection methods.

```

--------------------------------------------------------------------------------
/docs/tutorials/openfda-prompts.md:
--------------------------------------------------------------------------------

```markdown
# OpenFDA Example Prompts for AI Agents

This document provides example prompts that demonstrate effective use of BioMCP's OpenFDA integration for various precision oncology use cases.

## Drug Safety Assessment

### Basic Safety Profile

```
What are the most common adverse events reported for pembrolizumab?
Include both serious and non-serious events. ``` **Expected BioMCP Usage:** 1. `think` - Plan safety assessment approach 2. `openfda_adverse_searcher(drug="pembrolizumab", limit=50)` 3. Analyze and summarize top reactions ### Comparative Safety Analysis ``` Compare the adverse event profiles of imatinib and dasatinib for CML treatment. Focus on serious events and their frequencies. ``` **Expected BioMCP Usage:** 1. `think` - Plan comparative analysis 2. `openfda_adverse_searcher(drug="imatinib", serious=True)` 3. `openfda_adverse_searcher(drug="dasatinib", serious=True)` 4. Compare and contrast findings ### Drug Interaction Investigation ``` A patient on warfarin needs to start erlotinib for NSCLC. What drug interactions and adverse events should we monitor based on FDA data? ``` **Expected BioMCP Usage:** 1. `think` - Consider interaction risks 2. `openfda_label_searcher(name="erlotinib")` - Check drug interactions section 3. `openfda_adverse_searcher(drug="erlotinib", reaction="bleeding")` 4. `openfda_adverse_searcher(drug="erlotinib", reaction="INR")` ## Treatment Planning ### Indication Verification ``` Is trastuzumab deruxtecan FDA-approved for HER2-low breast cancer? What are the specific approved indications? ``` **Expected BioMCP Usage:** 1. `think` - Plan indication search 2. `openfda_label_searcher(name="trastuzumab deruxtecan")` 3. `openfda_label_getter(set_id="...")` - Get full indications section 4. Extract and summarize approved uses ### Contraindication Screening ``` Patient has severe hepatic impairment. Which targeted therapy drugs for melanoma have contraindications or warnings for liver dysfunction? ``` **Expected BioMCP Usage:** 1. `think` - Identify melanoma drugs to check 2. `openfda_label_searcher(indication="melanoma")` 3. For each drug: `openfda_label_getter(set_id="...", sections=["contraindications", "warnings_and_precautions"])` 4. 
Summarize liver-related contraindications ### Dosing Guidelines ``` What is the FDA-recommended dosing for osimertinib in EGFR-mutated NSCLC, including dose modifications for adverse events? ``` **Expected BioMCP Usage:** 1. `think` - Plan dosing information retrieval 2. `openfda_label_searcher(name="osimertinib")` 3. `openfda_label_getter(set_id="...", sections=["dosage_and_administration", "dose_modifications"])` 4. Extract dosing guidelines ## Device Reliability Assessment ### Genomic Test Reliability ``` What adverse events have been reported for NGS-based cancer diagnostic devices? Show me any false positive or accuracy issues. ``` **Expected BioMCP Usage:** 1. `think` - Consider test reliability factors 2. `openfda_device_searcher(genomics_only=True, limit=25)` - Get all genomic device events 3. `openfda_device_searcher(problem="false positive", genomics_only=True)` 4. `openfda_device_searcher(problem="accuracy", genomics_only=True)` 5. For significant events: `openfda_device_getter(mdr_report_key="...")` **Note:** The FDA database uses abbreviated names (e.g., "F1CDX" instead of "FoundationOne CDx"). For specific devices, try: `openfda_device_searcher(device="F1CDX")` or search by key terms. ### Laboratory Equipment Issues ``` Our lab uses Illumina sequencers. What device malfunctions have been reported that could impact our genomic testing workflow? ``` **Expected BioMCP Usage:** 1. `think` - Assess potential workflow impacts 2. `openfda_device_searcher(manufacturer="Illumina", genomics_only=True)` 3. Analyze problem patterns 4. `openfda_device_getter(mdr_report_key="...")` for critical issues ## Comprehensive Drug Evaluation ### New Drug Assessment ``` Provide a comprehensive safety and efficacy profile for sotorasib (Lumakras) including FDA approval, indications, major warnings, and post-market adverse events. ``` **Expected BioMCP Usage:** 1. `think` - Plan comprehensive assessment 2. `drug_getter("sotorasib")` - Basic drug info 3. 
`openfda_label_searcher(name="sotorasib")` 4. `openfda_label_getter(set_id="...")` - Full label 5. `openfda_adverse_searcher(drug="sotorasib", serious=True)` 6. `trial_searcher(interventions=["sotorasib"])` - Ongoing trials ### Risk-Benefit Analysis ``` For a 75-year-old patient with metastatic melanoma, analyze the risk-benefit profile of nivolumab plus ipilimumab combination therapy based on FDA data. ``` **Expected BioMCP Usage:** 1. `think` - Structure risk-benefit analysis 2. `openfda_label_searcher(name="nivolumab")` 3. `openfda_label_searcher(name="ipilimumab")` 4. `openfda_label_getter(set_id="...", sections=["geriatric_use", "warnings_and_precautions"])` 5. `openfda_adverse_searcher(drug="nivolumab", serious=True)` 6. `openfda_adverse_searcher(drug="ipilimumab", serious=True)` ## Special Populations ### Pregnancy Considerations ``` Which FDA-approved lung cancer treatments have pregnancy category data or specific warnings for pregnant patients? ``` **Expected BioMCP Usage:** 1. `think` - Plan pregnancy safety search 2. `openfda_label_searcher(indication="lung cancer")` 3. For each drug: `openfda_label_getter(set_id="...", sections=["pregnancy", "use_in_specific_populations"])` 4. Compile pregnancy categories and warnings ### Pediatric Oncology ``` What FDA-approved indications and safety data exist for using checkpoint inhibitors in pediatric cancer patients? ``` **Expected BioMCP Usage:** 1. `think` - Identify checkpoint inhibitors 2. `openfda_label_searcher(name="pembrolizumab")` 3. `openfda_label_getter(set_id="...", sections=["pediatric_use"])` 4. `openfda_adverse_searcher(drug="pembrolizumab")` - Filter for pediatric if possible 5. Repeat for other checkpoint inhibitors ## Complex Queries ### Multi-Drug Regimen Safety ``` Analyze potential safety concerns for the FOLFOX chemotherapy regimen (5-FU, leucovorin, oxaliplatin) based on FDA adverse event data. ``` **Expected BioMCP Usage:** 1. `think` - Plan multi-drug analysis 2. 
`openfda_adverse_searcher(drug="fluorouracil")` 3. `openfda_adverse_searcher(drug="leucovorin")` 4. `openfda_adverse_searcher(drug="oxaliplatin")` 5. Identify overlapping toxicities 6. `openfda_label_searcher(name="oxaliplatin")` - Check for combination warnings ### Biomarker-Driven Treatment Selection ``` For a patient with BRAF V600E mutant melanoma with brain metastases, what FDA-approved treatments are available and what are their CNS-specific efficacy and safety considerations? ``` **Expected BioMCP Usage:** 1. `think` - Structure biomarker-driven search 2. `article_searcher(genes=["BRAF"], variants=["V600E"], diseases=["melanoma"])` 3. `openfda_label_searcher(indication="melanoma")` 4. For BRAF inhibitors: `openfda_label_getter(set_id="...", sections=["clinical_studies", "warnings_and_precautions"])` 5. `openfda_adverse_searcher(drug="dabrafenib", reaction="seizure")` 6. `openfda_adverse_searcher(drug="vemurafenib", reaction="brain")` ### Treatment Failure Analysis ``` A patient's lung adenocarcinoma progressed on osimertinib. Based on FDA data, what are the documented resistance mechanisms and alternative approved treatments? ``` **Expected BioMCP Usage:** 1. `think` - Analyze resistance and alternatives 2. `openfda_label_getter(set_id="...", sections=["clinical_studies"])` for osimertinib 3. `article_searcher(genes=["EGFR"], keywords=["resistance", "osimertinib"])` 4. `openfda_label_searcher(indication="non-small cell lung cancer")` 5. `trial_searcher(conditions=["NSCLC"], keywords=["osimertinib resistant"])` ## Safety Monitoring ### Post-Market Surveillance ``` Have there been any new safety signals for CDK4/6 inhibitors (palbociclib, ribociclib, abemaciclib) in the past year? ``` **Expected BioMCP Usage:** 1. `think` - Plan safety signal detection 2. `openfda_adverse_searcher(drug="palbociclib", limit=100)` 3. `openfda_adverse_searcher(drug="ribociclib", limit=100)` 4. `openfda_adverse_searcher(drug="abemaciclib", limit=100)` 5. 
Analyze for unusual patterns or frequencies ### Rare Adverse Event Investigation ``` Investigate reports of pneumonitis associated with immune checkpoint inhibitors. Which drugs have the highest frequency and what are the typical outcomes? ``` **Expected BioMCP Usage:** 1. `think` - Structure pneumonitis investigation 2. `openfda_adverse_searcher(drug="pembrolizumab", reaction="pneumonitis")` 3. `openfda_adverse_searcher(drug="nivolumab", reaction="pneumonitis")` 4. `openfda_adverse_searcher(drug="atezolizumab", reaction="pneumonitis")` 5. Compare frequencies and outcomes 6. `openfda_adverse_getter(report_id="...")` for severe cases ## Quality Assurance ### Diagnostic Test Validation ``` What quality issues have been reported for liquid biopsy ctDNA tests that could affect treatment decisions? ``` **Expected BioMCP Usage:** 1. `think` - Identify quality factors 2. `openfda_device_searcher(device="liquid biopsy", genomics_only=True)` 3. `openfda_device_searcher(device="ctDNA", genomics_only=True)` 4. `openfda_device_searcher(device="circulating tumor", genomics_only=True)` 5. Analyze failure modes ## Tips for Effective Prompts 1. **Be specific about the data needed**: Specify if you want adverse events, labels, or device data 2. **Include relevant filters**: Mention if focusing on serious events, specific populations, or genomic devices 3. **Request appropriate analysis**: Ask for comparisons, trends, or specific data points 4. **Consider multiple data sources**: Combine OpenFDA with literature and trial data for comprehensive answers 5. **Include time frames when relevant**: Though OpenFDA doesn't support date filtering in queries, you can ask for analysis of recent reports ## Integration Examples ### Combining with Literature Search ``` Find FDA adverse events for venetoclax in CLL, then search for published case reports that provide more clinical context for the most serious events. 
``` ### Combining with Clinical Trials ``` What adverse events are reported for FDA-approved CAR-T therapies, and how do these compare to adverse events being monitored in current clinical trials? ``` ### Combining with Variant Data ``` For patients with RET fusion-positive cancers, what FDA-approved targeted therapies are available and what are their mutation-specific response rates? ``` ## Using Your OpenFDA API Key The OpenFDA API has rate limits: 40 requests/minute without a key, or 240 requests/minute with a key. You can get a free API key at https://open.fda.gov/apis/authentication/ ### Method 1: Include API Key in Your Prompt You can provide your API key directly in your conversation with the AI: ``` My OpenFDA API key is: YOUR_API_KEY_HERE Search for all serious adverse events reported for pembrolizumab in the last year. Include both death and hospitalization events. ``` ``` Using my OpenFDA API key YOUR_API_KEY_HERE, compare the safety profiles of all FDA-approved BRAF inhibitors for melanoma treatment. ``` ``` I have an OpenFDA API key: YOUR_API_KEY_HERE Analyze device malfunction reports for all NGS-based cancer diagnostic tests. Focus on false positive and false negative results that could impact treatment decisions. ``` ### Method 2: Set Environment Variable (Recommended for Regular Use) If you're using BioMCP regularly, set the environment variable once: ```bash export OPENFDA_API_KEY="YOUR_API_KEY_HERE" ``` Then your prompts don't need to include the key: ``` Retrieve the complete FDA label for trastuzumab deruxtecan, including all warnings and contraindications for HER2-low breast cancer patients. 
``` ### Method 3: CLI Usage with API Key When using the BioMCP CLI directly: ```bash # Pass API key as parameter biomcp openfda adverse search --drug pembrolizumab --serious --api-key YOUR_API_KEY_HERE # Or set environment variable first export OPENFDA_API_KEY="YOUR_API_KEY_HERE" biomcp openfda device search --manufacturer Illumina --genomics-only ``` ### Example Prompts with API Key #### Comprehensive Drug Safety Analysis ``` My OpenFDA API key is: YOUR_API_KEY_HERE Perform a comprehensive safety analysis of sotorasib (Lumakras) including: 1. All serious adverse events from post-market surveillance 2. Complete FDA label with all sections 3. Any device issues if it's a companion diagnostic drug 4. Compare its safety profile to other KRAS G12C inhibitors if available This is for a clinical review, so I need detailed data from all available FDA sources. ``` #### Large-Scale Adverse Event Analysis ``` Using my OpenFDA API key YOUR_API_KEY_HERE, analyze adverse events for all FDA-approved checkpoint inhibitors (pembrolizumab, nivolumab, atezolizumab, durvalumab, avelumab, cemiplimab). For each drug: - Get the top 20 most frequent adverse events - Identify immune-related adverse events - Check for any black box warnings in their labels - Note any fatal events This requires many API calls, so please use my API key for higher rate limits. ``` #### Multi-Device Comparison ``` I have an OpenFDA API key: YOUR_API_KEY_HERE Compare all FDA adverse event reports for NGS-based companion diagnostic devices from major manufacturers (Foundation Medicine, Guardant Health, Tempus, etc.). Focus on: - Test failure rates - Sample quality issues - False positive/negative reports - Software-related problems This analysis requires querying multiple device records, so the API key will help avoid rate limiting. ``` #### Batch Label Retrieval ``` My OpenFDA API key is YOUR_API_KEY_HERE. 
Retrieve the complete FDA labels for all CDK4/6 inhibitors (palbociclib, ribociclib, abemaciclib) and extract: - Approved indications - Dose modifications for adverse events - Drug-drug interactions - Special population considerations Then create a comparison table of their safety profiles and dosing guidelines. ``` ### When to Provide an API Key You should provide your API key when: 1. **Performing large-scale analyses** requiring many API calls 2. **Conducting comprehensive safety reviews** across multiple drugs/devices 3. **Running batch operations** like comparing multiple products 4. **Doing rapid iterative searches** that might hit rate limits 5. **Performing systematic reviews** requiring extensive data retrieval ### API Key Security Notes - Never share your actual API key in public forums or repositories - The AI will use your key only for the current session - Keys passed as parameters override environment variables - The FDA API key is free and can be regenerated if compromised ## Important Notes - Always expect the AI to use the `think` tool first for complex queries - The AI should include appropriate disclaimers about adverse events not proving causation - Results are limited by FDA's data availability and reporting patterns - The AI should suggest when additional data sources might provide complementary information - With an API key, you can make 240 requests/minute vs 40 without ## Known Limitations ### Drug Shortage Data **Important:** The FDA does not currently provide a machine-readable API for drug shortage data. The shortage search tools will return an informative message directing users to the FDA's web-based shortage database. This is a limitation of FDA's current data infrastructure, not a bug in BioMCP. 
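Returning briefly to the API key rules stated earlier in this tutorial (a key passed as a parameter overrides the `OPENFDA_API_KEY` environment variable, and the quoted limits are 40 requests/minute without a key versus 240 with one), the following is a minimal sketch of how a client script might apply them. The helper names `resolve_api_key` and `min_seconds_between_requests` are hypothetical and are not part of BioMCP; the limits are simply the figures quoted in this document, not values fetched from the FDA.

```python
import os

# Requests-per-minute limits as quoted in this tutorial (assumed, not queried from the FDA).
RATE_LIMIT_WITHOUT_KEY = 40
RATE_LIMIT_WITH_KEY = 240


def resolve_api_key(explicit_key=None, environ=None):
    """Hypothetical helper: an explicit key wins over OPENFDA_API_KEY; None if neither is set."""
    env = os.environ if environ is None else environ
    return explicit_key or env.get("OPENFDA_API_KEY") or None


def min_seconds_between_requests(api_key):
    """Smallest steady delay that keeps a request stream under the per-minute limit."""
    limit = RATE_LIMIT_WITH_KEY if api_key else RATE_LIMIT_WITHOUT_KEY
    return 60.0 / limit


if __name__ == "__main__":
    key = resolve_api_key()
    print("Key configured:", key is not None)
    print("Suggested delay between requests (s):", min_seconds_between_requests(key))
```

A script issuing many sequential OpenFDA queries could sleep for the suggested delay between calls; this only paces a single client and does not account for other processes sharing the same key or IP address.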
Alternative resources for drug shortage information:

- FDA Drug Shortages Database: https://www.accessdata.fda.gov/scripts/drugshortages/
- ASHP Drug Shortages: https://www.ashp.org/drug-shortages/current-shortages

### Other Limitations

- Device adverse event reports use abbreviated device names (e.g., "F1CDX" instead of "FoundationOne CDx")
- Adverse event reports represent voluntary submissions and may not reflect true incidence rates
- Recall information may have a delay of 24-48 hours from initial FDA announcement

```

--------------------------------------------------------------------------------
/docs/tutorials/pydantic-ai-integration.md:
--------------------------------------------------------------------------------

```markdown
# Pydantic AI Integration Guide

This guide explains how to integrate BioMCP with Pydantic AI for building biomedical AI agents.

## Server Modes and Endpoints

BioMCP can be run in three server modes for Pydantic AI integration. They rely on two transports: stdio for `stdio` mode, and streamable HTTP for both `streamable_http` and `worker` modes.

### Available Transport Modes

| Mode              | Endpoints                  | Pydantic AI Client        | Use Case                        |
| ----------------- | -------------------------- | ------------------------- | ------------------------------- |
| `stdio`           | N/A (subprocess)           | `MCPServerStdio`          | Local development, testing      |
| `streamable_http` | `POST /mcp`, `GET /health` | `MCPServerStreamableHTTP` | Production HTTP deployments     |
| `worker`          | `POST /mcp`, `GET /health` | `MCPServerStreamableHTTP` | HTTP mode using streamable HTTP |

Both `streamable_http` and `worker` modes now use FastMCP's native streamable HTTP implementation for full MCP protocol compliance. The SSE-based transport has been deprecated.

## Working Examples for Pydantic AI

Here are the recommended configurations for connecting Pydantic AI to BioMCP:

### 1.
STDIO Mode (Recommended for Local Development) This mode runs BioMCP as a subprocess without needing an HTTP server: ```python import asyncio import os from pydantic_ai import Agent from pydantic_ai.mcp import MCPServerStdio async def main(): # Run BioMCP as a subprocess server = MCPServerStdio( "python", args=["-m", "biomcp", "run", "--mode", "stdio"] ) # Use a real LLM model (requires API key) model = "openai:gpt-4o-mini" # Set OPENAI_API_KEY environment variable agent = Agent(model, toolsets=[server]) async with agent: # Example query that returns real results result = await agent.run( "Find articles about BRAF V600E mutations in melanoma" ) print(result.output) if __name__ == "__main__": asyncio.run(main()) ``` ### 2. Streamable HTTP Mode (Recommended for Production) For production deployments with proper MCP compliance (requires pydantic-ai>=0.6.9): ```python import asyncio import os from pydantic_ai import Agent from pydantic_ai.mcp import MCPServerStreamableHTTP async def main(): # Connect to the /mcp endpoint server = MCPServerStreamableHTTP("http://localhost:8000/mcp") # Use a real LLM model (requires API key) # Options: openai:gpt-4o-mini, anthropic:claude-3-haiku-20240307, groq:llama-3.1-70b-versatile model = "openai:gpt-4o-mini" # Set OPENAI_API_KEY environment variable agent = Agent(model, toolsets=[server]) async with agent: # Example queries that return real results result = await agent.run( "Find recent articles about BRAF V600E in melanoma" ) print(result.output) if __name__ == "__main__": asyncio.run(main()) ``` To run the server for this mode: ```bash # Using streamable_http mode (recommended) biomcp run --mode streamable_http --host 0.0.0.0 --port 8000 # Or using worker mode (also uses streamable HTTP) biomcp run --mode worker --host 0.0.0.0 --port 8000 # Or using Docker docker run -p 8000:8000 genomoncology/biomcp:latest biomcp run --mode streamable_http ``` ### 3. 
Direct JSON-RPC Mode (Alternative HTTP)

You can also use the JSON-RPC endpoint at the root path:

```python
import asyncio

import httpx


async def call_biomcp_jsonrpc(method, params=None):
    """Direct JSON-RPC calls to BioMCP"""
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:8000/",
            json={
                "jsonrpc": "2.0",
                "id": 1,
                "method": method,
                "params": params or {},
            },
        )
        return response.json()


async def main():
    # Example usage: `await` is only valid inside a coroutine
    result = await call_biomcp_jsonrpc("tools/list")
    print("Available tools:", result)


if __name__ == "__main__":
    asyncio.run(main())
```

## Troubleshooting Common Issues

### Issue: TestModel returns empty results

**Cause**: TestModel is a mock model for testing - it doesn't execute real searches.

**Solution**: This is expected behavior. TestModel returns `{"search":{"results":[]}}` by design. To get real results:

- Use a real LLM model with API key: `Agent("openai:gpt-4o-mini", toolsets=[server])`
- Use Groq for free tier: Sign up at console.groq.com, get API key, use `Agent("groq:llama-3.1-70b-versatile", toolsets=[server])`
- Or use BioMCP CLI directly (no API key needed): `biomcp article search --gene BRAF`

### Issue: Connection refused

**Solution**: Ensure the server is running with the correct host binding:

```bash
biomcp run --mode worker --host 0.0.0.0 --port 8000
```

### Issue: CORS errors in browser

**Solution**: The server includes CORS headers by default. If you still have issues, check if a proxy or firewall is blocking the headers.

### Issue: Health endpoint returns 404

**Solution**: The health endpoint is available at `GET /health` in both worker and streamable_http modes. Ensure you're using the latest version:

```bash
pip install --upgrade biomcp-python
```

### Issue: SSE endpoint not found

**Solution**: The SSE transport has been deprecated.
Use streamable HTTP mode instead: ```python # Old (deprecated) # from pydantic_ai.mcp import MCPServerSSE # server = MCPServerSSE("http://localhost:8000/sse") # New (recommended) from pydantic_ai.mcp import MCPServerStreamableHTTP server = MCPServerStreamableHTTP("http://localhost:8000/mcp") ``` ## Testing Your Connection Here are test scripts to verify your setup for different modes: ### Testing STDIO Mode (Local Development) ```python import asyncio from pydantic_ai import Agent from pydantic_ai.models.test import TestModel from pydantic_ai.mcp import MCPServerStdio async def test_stdio_connection(): # Use TestModel to verify connection (won't return real data) server = MCPServerStdio( "python", args=["-m", "biomcp", "run", "--mode", "stdio"] ) agent = Agent( model=TestModel(call_tools=["search"]), toolsets=[server] ) async with agent: print(f"✅ STDIO Connection successful!") # Test a simple search (returns mock data) result = await agent.run("Test search for BRAF") print(f"✅ Tool execution successful!") print(f"Note: TestModel returns mock data: {result.output}") if __name__ == "__main__": asyncio.run(test_stdio_connection()) ``` ### Testing Streamable HTTP Mode (Production) First, ensure the server is running: ```bash # Start the server in a separate terminal biomcp run --mode streamable_http --port 8000 ``` Then test the connection: ```python import asyncio from pydantic_ai import Agent from pydantic_ai.models.test import TestModel from pydantic_ai.mcp import MCPServerStreamableHTTP async def test_streamable_http_connection(): # Connect to the running server's /mcp endpoint server = MCPServerStreamableHTTP("http://localhost:8000/mcp") # Create agent with TestModel (no API keys needed) agent = Agent( model=TestModel(call_tools=["search"]), toolsets=[server] ) async with agent: print("✅ Streamable HTTP Connection successful!") # Test a query result = await agent.run("Find articles about BRAF") print("✅ Tool execution successful!") if result.output: print(f"📄 
Received {len(result.output)} characters of output") if __name__ == "__main__": asyncio.run(test_streamable_http_connection()) ``` ### Important: Understanding TestModel vs Real Results **TestModel is a MOCK model** - it doesn't execute real searches: - TestModel simulates tool calls but returns empty results: `{"search":{"results":[]}}` - This is by design - TestModel is for testing the connection flow, not getting real data - To get actual search results, you need to use a real LLM model **To get real results:** 1. **Use a real LLM model** (requires API key): ```python # Replace TestModel with a real model agent = Agent( "openai:gpt-4o-mini", # or "anthropic:claude-3-haiku" toolsets=[server] ) ``` 2. **Use BioMCP CLI directly** (no API key needed): ```bash # Get real search results via CLI biomcp article search --gene BRAF --disease melanoma --json ``` 3. **For integration testing** without API keys: ```python import subprocess import json # Use CLI to get real results result = subprocess.run( ["biomcp", "article", "search", "--gene", "BRAF", "--json"], capture_output=True, text=True ) data = json.loads(result.stdout) print(f"Found {len(data['articles'])} real articles") ``` **Note**: The Streamable HTTP tests in our test suite verify this functionality works correctly. If you encounter connection issues, ensure: 1. The server is fully started before connecting 2. You're using pydantic-ai >= 0.6.9 3. The port is not blocked by a firewall ### Complete Working Example with Real Results Here's a complete example that connects to BioMCP via Streamable HTTP and retrieves real biomedical data: ```python #!/usr/bin/env python3 """ Working example of Pydantic AI + BioMCP with Streamable HTTP. This will get real search results from your BioMCP server. 
Requires one of:
- export OPENAI_API_KEY='your-key'
- export ANTHROPIC_API_KEY='your-key'
- export GROQ_API_KEY='your-key' (free tier at console.groq.com)
"""

import asyncio
import os
from pydantic_ai import Agent
from pydantic_ai.mcp import MCPServerStreamableHTTP


async def main():
    # Server configuration
    SERVER_URL = "http://localhost:8000/mcp"  # Adjust port as needed

    # Detect which API key is available
    if os.getenv("OPENAI_API_KEY"):
        model = "openai:gpt-4o-mini"
        print("Using OpenAI GPT-4o-mini")
    elif os.getenv("ANTHROPIC_API_KEY"):
        model = "anthropic:claude-3-haiku-20240307"
        print("Using Claude 3 Haiku")
    elif os.getenv("GROQ_API_KEY"):
        model = "groq:llama-3.1-70b-versatile"  # Free tier available
        print("Using Groq Llama 3.1")
    else:
        print("No API key found! Please set OPENAI_API_KEY, ANTHROPIC_API_KEY, or GROQ_API_KEY")
        return

    # Connect to BioMCP server
    server = MCPServerStreamableHTTP(SERVER_URL)
    agent = Agent(model, toolsets=[server])

    async with agent:
        print("Connected to BioMCP!\n")

        # Search for articles (includes cBioPortal data for genes)
        result = await agent.run(
            "Search for 2 recent articles about BRAF V600E mutations in melanoma. "
            "List the title and first author for each."
        )
        print("Article Search Results:")
        print(result.output)
        print("\n" + "="*60 + "\n")

        # Search for clinical trials
        result2 = await agent.run(
            "Find 2 clinical trials for melanoma with BRAF mutations "
            "that are currently recruiting. Show NCT ID and title."
        )
        print("Clinical Trial Results:")
        print(result2.output)
        print("\n" + "="*60 + "\n")

        # Search for variant information
        result3 = await agent.run(
            "Search for pathogenic TP53 variants. Show 2 examples."
        )
        print("Variant Search Results:")
        print(result3.output)


if __name__ == "__main__":
    # Start your BioMCP server first:
    # biomcp run --mode streamable_http --port 8000
    asyncio.run(main())
```

**Running this example:**

1. Start the BioMCP server:

   ```bash
   biomcp run --mode streamable_http --port 8000
   ```

2. Set your API key (choose one):

   ```bash
   export OPENAI_API_KEY='your-key'      # OpenAI
   export ANTHROPIC_API_KEY='your-key'   # Anthropic
   export GROQ_API_KEY='your-key'        # Groq (free tier available)
   ```

3. Run the script:

   ```bash
   python biomcp_example.py
   ```

This will return actual biomedical data from PubMed, ClinicalTrials.gov, and variant databases!

## Using BioMCP Tools with Pydantic AI

Once connected, you can use BioMCP's biomedical research tools:

```python
import asyncio
import os
from pydantic_ai import Agent
from pydantic_ai.mcp import MCPServerStdio


async def biomedical_research_example():
    # Launch BioMCP as a subprocess and talk to it over STDIO
    server = MCPServerStdio(
        "python",
        args=["-m", "biomcp", "run", "--mode", "stdio"]
    )

    # Choose model based on available API key
    if os.getenv("OPENAI_API_KEY"):
        model = "openai:gpt-4o-mini"
    elif os.getenv("GROQ_API_KEY"):
        model = "groq:llama-3.1-70b-versatile"  # Free tier available
    else:
        raise ValueError("Please set OPENAI_API_KEY or GROQ_API_KEY")

    agent = Agent(model, toolsets=[server])

    async with agent:
        # Important: Always use the think tool first for complex queries
        result = await agent.run("""
            First use the think tool to plan your approach, then:
            1. Search for articles about immunotherapy resistance in melanoma
            2. Find clinical trials testing combination therapies
            3. Look up genetic markers associated with treatment response
        """)

        print(result.output)


if __name__ == "__main__":
    asyncio.run(biomedical_research_example())
```

## Production Deployment Considerations

For production deployments:

1. **Use STDIO mode** for local development or when running in containerized environments where the agent and BioMCP can run in the same container
2. **Use Streamable HTTP mode** when you need HTTP-based communication between separate services (recommended for production)
3. **Both `worker` and `streamable_http` modes** now use the same underlying streamable HTTP transport
4. **Require a real LLM model** - TestModel won't work for production as it only returns mock data
5. **Consider API costs** - Use cheaper models like `gpt-4o-mini` or Groq's free tier for testing
6. **Implement proper error handling** and retry logic for network failures
7. **Set appropriate timeouts** for long-running biomedical searches
8. **Cache frequently accessed data** to reduce API calls to backend services

### Important Notes

- **Real LLM required for results**: TestModel is only for testing connections - use a real LLM (OpenAI, Anthropic, Groq) to get actual biomedical data
- **SSE transport is deprecated**: The old SSE-based transport (`/sse` endpoint) has been removed in favor of streamable HTTP
- **Worker mode now uses streamable HTTP**: The `worker` mode has been updated to use streamable HTTP transport internally
- **Health endpoint**: The `/health` endpoint is available in both HTTP modes for monitoring
- **Free tier option**: Groq offers a free API tier at console.groq.com for testing without costs

## Migration Guide from SSE to Streamable HTTP

If you're upgrading from an older version that used SSE transport:

### Code Changes

```python
# Old code (deprecated)
from pydantic_ai.mcp import MCPServerSSE
server = MCPServerSSE("http://localhost:8000/sse")

# New code (recommended)
from pydantic_ai.mcp import MCPServerStreamableHTTP
server = MCPServerStreamableHTTP("http://localhost:8000/mcp")
```

### Server Command Changes

```bash
# Old: SSE endpoints were at /sse
# biomcp run --mode worker  # Used to expose /sse endpoint

# New: Both modes now use /mcp endpoint with streamable HTTP
biomcp run --mode worker            # Now uses /mcp with streamable HTTP
biomcp run --mode streamable_http   # Also uses /mcp with streamable HTTP
```

### Key Differences

1. **Endpoint Change**: `/sse` → `/mcp`
2. **Protocol**: Server-Sent Events → Streamable HTTP (supports both JSON and SSE)
3. **Client Library**: `MCPServerSSE` → `MCPServerStreamableHTTP`
4. **Compatibility**: Requires pydantic-ai >= 0.6.9 for `MCPServerStreamableHTTP`

## Next Steps

- Review the [MCP Tools Reference](../user-guides/02-mcp-tools-reference.md) for available biomedical research tools
- See [CLI Guide](../user-guides/01-command-line-interface.md) for more server configuration options
- Check [Transport Protocol Guide](../developer-guides/04-transport-protocol.md) for detailed protocol information

## Support

If you continue to experience issues:

1. Verify your BioMCP version: `biomcp --version`
2. Check server logs for error messages
3. Open an issue on [GitHub](https://github.com/genomoncology/biomcp/issues) with:
   - Your BioMCP version
   - Server startup command
   - Complete error messages
   - Minimal reproduction code
```
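The production considerations in this guide recommend retry logic and timeouts for network failures. The following is a minimal sketch of one way to do that with only the standard library. The helper name `run_with_retry` and its parameters are hypothetical (they are not part of BioMCP or Pydantic AI), and the set of exceptions worth retrying is an assumption that you may need to adjust for your client and transport.

```python
import asyncio
import random


async def run_with_retry(make_call, *, attempts=3, timeout_s=30.0, base_delay_s=0.5):
    """Await make_call() with a per-attempt timeout and exponential backoff.

    make_call is a zero-argument callable that returns a fresh awaitable
    on every invocation (hypothetical helper, not a BioMCP API).
    """
    last_error = None
    for attempt in range(attempts):
        try:
            # Bound each attempt so a slow backend search cannot hang forever
            return await asyncio.wait_for(make_call(), timeout=timeout_s)
        except (asyncio.TimeoutError, OSError) as exc:
            # Assumption: timeouts and OS-level network errors are retryable
            last_error = exc
            if attempt < attempts - 1:
                # Exponential backoff with a little jitter between attempts
                delay = base_delay_s * (2 ** attempt) + random.uniform(0, 0.1)
                await asyncio.sleep(delay)
    raise RuntimeError(f"call failed after {attempts} attempts") from last_error
```

With an agent like the ones shown earlier, a call could then be wrapped as `await run_with_retry(lambda: agent.run("..."), timeout_s=60.0)`. Passing a callable rather than a coroutine matters here, because a coroutine object can only be awaited once and each retry needs a fresh one.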