This is page 5 of 19. Use http://codebase.md/beehiveinnovations/gemini-mcp-server?page={x} to view the full context.

# Directory Structure

```
├── .claude
│   ├── commands
│   │   └── fix-github-issue.md
│   └── settings.json
├── .coveragerc
├── .dockerignore
├── .env.example
├── .gitattributes
├── .github
│   ├── FUNDING.yml
│   ├── ISSUE_TEMPLATE
│   │   ├── bug_report.yml
│   │   ├── config.yml
│   │   ├── documentation.yml
│   │   ├── feature_request.yml
│   │   └── tool_addition.yml
│   ├── pull_request_template.md
│   └── workflows
│       ├── docker-pr.yml
│       ├── docker-release.yml
│       ├── semantic-pr.yml
│       ├── semantic-release.yml
│       └── test.yml
├── .gitignore
├── .pre-commit-config.yaml
├── AGENTS.md
├── CHANGELOG.md
├── claude_config_example.json
├── CLAUDE.md
├── clink
│   ├── __init__.py
│   ├── agents
│   │   ├── __init__.py
│   │   ├── base.py
│   │   ├── claude.py
│   │   ├── codex.py
│   │   └── gemini.py
│   ├── constants.py
│   ├── models.py
│   ├── parsers
│   │   ├── __init__.py
│   │   ├── base.py
│   │   ├── claude.py
│   │   ├── codex.py
│   │   └── gemini.py
│   └── registry.py
├── code_quality_checks.ps1
├── code_quality_checks.sh
├── communication_simulator_test.py
├── conf
│   ├── __init__.py
│   ├── azure_models.json
│   ├── cli_clients
│   │   ├── claude.json
│   │   ├── codex.json
│   │   └── gemini.json
│   ├── custom_models.json
│   ├── dial_models.json
│   ├── gemini_models.json
│   ├── openai_models.json
│   ├── openrouter_models.json
│   └── xai_models.json
├── config.py
├── docker
│   ├── README.md
│   └── scripts
│       ├── build.ps1
│       ├── build.sh
│       ├── deploy.ps1
│       ├── deploy.sh
│       └── healthcheck.py
├── docker-compose.yml
├── Dockerfile
├── docs
│   ├── adding_providers.md
│   ├── adding_tools.md
│   ├── advanced-usage.md
│   ├── ai_banter.md
│   ├── ai-collaboration.md
│   ├── azure_openai.md
│   ├── configuration.md
│   ├── context-revival.md
│   ├── contributions.md
│   ├── custom_models.md
│   ├── docker-deployment.md
│   ├── gemini-setup.md
│   ├── getting-started.md
│   ├── index.md
│   ├── locale-configuration.md
│   ├── logging.md
│   ├── model_ranking.md
│   ├── testing.md
│   ├── tools
│   │   ├── analyze.md
│   │   ├── apilookup.md
│   │   ├── challenge.md
│   │   ├── chat.md
│   │   ├── clink.md
│   │   ├── codereview.md
│   │   ├── consensus.md
│   │   ├── debug.md
│   │   ├── docgen.md
│   │   ├── listmodels.md
│   │   ├── planner.md
│   │   ├── precommit.md
│   │   ├── refactor.md
│   │   ├── secaudit.md
│   │   ├── testgen.md
│   │   ├── thinkdeep.md
│   │   ├── tracer.md
│   │   └── version.md
│   ├── troubleshooting.md
│   ├── vcr-testing.md
│   └── wsl-setup.md
├── examples
│   ├── claude_config_macos.json
│   └── claude_config_wsl.json
├── LICENSE
├── providers
│   ├── __init__.py
│   ├── azure_openai.py
│   ├── base.py
│   ├── custom.py
│   ├── dial.py
│   ├── gemini.py
│   ├── openai_compatible.py
│   ├── openai.py
│   ├── openrouter.py
│   ├── registries
│   │   ├── __init__.py
│   │   ├── azure.py
│   │   ├── base.py
│   │   ├── custom.py
│   │   ├── dial.py
│   │   ├── gemini.py
│   │   ├── openai.py
│   │   ├── openrouter.py
│   │   └── xai.py
│   ├── registry_provider_mixin.py
│   ├── registry.py
│   ├── shared
│   │   ├── __init__.py
│   │   ├── model_capabilities.py
│   │   ├── model_response.py
│   │   ├── provider_type.py
│   │   └── temperature.py
│   └── xai.py
├── pyproject.toml
├── pytest.ini
├── README.md
├── requirements-dev.txt
├── requirements.txt
├── run_integration_tests.ps1
├── run_integration_tests.sh
├── run-server.ps1
├── run-server.sh
├── scripts
│   └── sync_version.py
├── server.py
├── simulator_tests
│   ├── __init__.py
│   ├── base_test.py
│   ├── conversation_base_test.py
│   ├── log_utils.py
│   ├── test_analyze_validation.py
│   ├── test_basic_conversation.py
│   ├── test_chat_simple_validation.py
│   ├── test_codereview_validation.py
│   ├── test_consensus_conversation.py
│   ├── test_consensus_three_models.py
│   ├── test_consensus_workflow_accurate.py
│   ├── test_content_validation.py
│   ├── test_conversation_chain_validation.py
│   ├── test_cross_tool_comprehensive.py
│   ├── test_cross_tool_continuation.py
│   ├── test_debug_certain_confidence.py
│   ├── test_debug_validation.py
│   ├── test_line_number_validation.py
│   ├── test_logs_validation.py
│   ├── test_model_thinking_config.py
│   ├── test_o3_model_selection.py
│   ├── test_o3_pro_expensive.py
│   ├── test_ollama_custom_url.py
│   ├── test_openrouter_fallback.py
│   ├── test_openrouter_models.py
│   ├── test_per_tool_deduplication.py
│   ├── test_planner_continuation_history.py
│   ├── test_planner_validation_old.py
│   ├── test_planner_validation.py
│   ├── test_precommitworkflow_validation.py
│   ├── test_prompt_size_limit_bug.py
│   ├── test_refactor_validation.py
│   ├── test_secaudit_validation.py
│   ├── test_testgen_validation.py
│   ├── test_thinkdeep_validation.py
│   ├── test_token_allocation_validation.py
│   ├── test_vision_capability.py
│   └── test_xai_models.py
├── systemprompts
│   ├── __init__.py
│   ├── analyze_prompt.py
│   ├── chat_prompt.py
│   ├── clink
│   │   ├── codex_codereviewer.txt
│   │   ├── default_codereviewer.txt
│   │   ├── default_planner.txt
│   │   └── default.txt
│   ├── codereview_prompt.py
│   ├── consensus_prompt.py
│   ├── debug_prompt.py
│   ├── docgen_prompt.py
│   ├── generate_code_prompt.py
│   ├── planner_prompt.py
│   ├── precommit_prompt.py
│   ├── refactor_prompt.py
│   ├── secaudit_prompt.py
│   ├── testgen_prompt.py
│   ├── thinkdeep_prompt.py
│   └── tracer_prompt.py
├── tests
│   ├── __init__.py
│   ├── CASSETTE_MAINTENANCE.md
│   ├── conftest.py
│   ├── gemini_cassettes
│   │   ├── chat_codegen
│   │   │   └── gemini25_pro_calculator
│   │   │       └── mldev.json
│   │   ├── chat_cross
│   │   │   └── step1_gemini25_flash_number
│   │   │       └── mldev.json
│   │   └── consensus
│   │       └── step2_gemini25_flash_against
│   │           └── mldev.json
│   ├── http_transport_recorder.py
│   ├── mock_helpers.py
│   ├── openai_cassettes
│   │   ├── chat_cross_step2_gpt5_reminder.json
│   │   ├── chat_gpt5_continuation.json
│   │   ├── chat_gpt5_moon_distance.json
│   │   ├── consensus_step1_gpt5_for.json
│   │   └── o3_pro_basic_math.json
│   ├── pii_sanitizer.py
│   ├── sanitize_cassettes.py
│   ├── test_alias_target_restrictions.py
│   ├── test_auto_mode_comprehensive.py
│   ├── test_auto_mode_custom_provider_only.py
│   ├── test_auto_mode_model_listing.py
│   ├── test_auto_mode_provider_selection.py
│   ├── test_auto_mode.py
│   ├── test_auto_model_planner_fix.py
│   ├── test_azure_openai_provider.py
│   ├── test_buggy_behavior_prevention.py
│   ├── test_cassette_semantic_matching.py
│   ├── test_challenge.py
│   ├── test_chat_codegen_integration.py
│   ├── test_chat_cross_model_continuation.py
│   ├── test_chat_openai_integration.py
│   ├── test_chat_simple.py
│   ├── test_clink_claude_agent.py
│   ├── test_clink_claude_parser.py
│   ├── test_clink_codex_agent.py
│   ├── test_clink_gemini_agent.py
│   ├── test_clink_gemini_parser.py
│   ├── test_clink_integration.py
│   ├── test_clink_parsers.py
│   ├── test_clink_tool.py
│   ├── test_collaboration.py
│   ├── test_config.py
│   ├── test_consensus_integration.py
│   ├── test_consensus_schema.py
│   ├── test_consensus.py
│   ├── test_conversation_continuation_integration.py
│   ├── test_conversation_field_mapping.py
│   ├── test_conversation_file_features.py
│   ├── test_conversation_memory.py
│   ├── test_conversation_missing_files.py
│   ├── test_custom_openai_temperature_fix.py
│   ├── test_custom_provider.py
│   ├── test_debug.py
│   ├── test_deploy_scripts.py
│   ├── test_dial_provider.py
│   ├── test_directory_expansion_tracking.py
│   ├── test_disabled_tools.py
│   ├── test_docker_claude_desktop_integration.py
│   ├── test_docker_config_complete.py
│   ├── test_docker_healthcheck.py
│   ├── test_docker_implementation.py
│   ├── test_docker_mcp_validation.py
│   ├── test_docker_security.py
│   ├── test_docker_volume_persistence.py
│   ├── test_file_protection.py
│   ├── test_gemini_token_usage.py
│   ├── test_image_support_integration.py
│   ├── test_image_validation.py
│   ├── test_integration_utf8.py
│   ├── test_intelligent_fallback.py
│   ├── test_issue_245_simple.py
│   ├── test_large_prompt_handling.py
│   ├── test_line_numbers_integration.py
│   ├── test_listmodels_restrictions.py
│   ├── test_listmodels.py
│   ├── test_mcp_error_handling.py
│   ├── test_model_enumeration.py
│   ├── test_model_metadata_continuation.py
│   ├── test_model_resolution_bug.py
│   ├── test_model_restrictions.py
│   ├── test_o3_pro_output_text_fix.py
│   ├── test_o3_temperature_fix_simple.py
│   ├── test_openai_compatible_token_usage.py
│   ├── test_openai_provider.py
│   ├── test_openrouter_provider.py
│   ├── test_openrouter_registry.py
│   ├── test_parse_model_option.py
│   ├── test_per_tool_model_defaults.py
│   ├── test_pii_sanitizer.py
│   ├── test_pip_detection_fix.py
│   ├── test_planner.py
│   ├── test_precommit_workflow.py
│   ├── test_prompt_regression.py
│   ├── test_prompt_size_limit_bug_fix.py
│   ├── test_provider_retry_logic.py
│   ├── test_provider_routing_bugs.py
│   ├── test_provider_utf8.py
│   ├── test_providers.py
│   ├── test_rate_limit_patterns.py
│   ├── test_refactor.py
│   ├── test_secaudit.py
│   ├── test_server.py
│   ├── test_supported_models_aliases.py
│   ├── test_thinking_modes.py
│   ├── test_tools.py
│   ├── test_tracer.py
│   ├── test_utf8_localization.py
│   ├── test_utils.py
│   ├── test_uvx_resource_packaging.py
│   ├── test_uvx_support.py
│   ├── test_workflow_file_embedding.py
│   ├── test_workflow_metadata.py
│   ├── test_workflow_prompt_size_validation_simple.py
│   ├── test_workflow_utf8.py
│   ├── test_xai_provider.py
│   ├── transport_helpers.py
│   └── triangle.png
├── tools
│   ├── __init__.py
│   ├── analyze.py
│   ├── apilookup.py
│   ├── challenge.py
│   ├── chat.py
│   ├── clink.py
│   ├── codereview.py
│   ├── consensus.py
│   ├── debug.py
│   ├── docgen.py
│   ├── listmodels.py
│   ├── models.py
│   ├── planner.py
│   ├── precommit.py
│   ├── refactor.py
│   ├── secaudit.py
│   ├── shared
│   │   ├── __init__.py
│   │   ├── base_models.py
│   │   ├── base_tool.py
│   │   ├── exceptions.py
│   │   └── schema_builders.py
│   ├── simple
│   │   ├── __init__.py
│   │   └── base.py
│   ├── testgen.py
│   ├── thinkdeep.py
│   ├── tracer.py
│   ├── version.py
│   └── workflow
│       ├── __init__.py
│       ├── base.py
│       ├── schema_builders.py
│       └── workflow_mixin.py
├── utils
│   ├── __init__.py
│   ├── client_info.py
│   ├── conversation_memory.py
│   ├── env.py
│   ├── file_types.py
│   ├── file_utils.py
│   ├── image_utils.py
│   ├── model_context.py
│   ├── model_restrictions.py
│   ├── security_config.py
│   ├── storage_backend.py
│   └── token_utils.py
└── zen-mcp-server
```

# Files

--------------------------------------------------------------------------------
/tests/test_openrouter_registry.py:
--------------------------------------------------------------------------------

```python
"""Tests for OpenRouter model registry functionality."""

import json
import os
import tempfile
from unittest.mock import patch

import pytest

from providers.registries.openrouter import OpenRouterModelRegistry
from providers.shared import ModelCapabilities, ProviderType


class TestOpenRouterModelRegistry:
    """Test cases for OpenRouter model registry."""

    def test_registry_initialization(self):
        """Test registry initializes with default config."""
        registry = OpenRouterModelRegistry()

        # Should load models from default location
        assert len(registry.list_models()) > 0
        assert len(registry.list_aliases()) > 0

    def test_custom_config_path(self):
        """Test registry with custom config path."""
        # Create temporary config
        config_data = {
            "models": [
                {
                    "model_name": "test/model-1",
                    "aliases": ["test1", "t1"],
                    "context_window": 4096,
                    "max_output_tokens": 2048,
                }
            ]
        }

        with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) as f:
            json.dump(config_data, f)
            temp_path = f.name

        try:
            registry = OpenRouterModelRegistry(config_path=temp_path)
            assert len(registry.list_models()) == 1
            assert "test/model-1" in registry.list_models()
            assert "test1" in registry.list_aliases()
            assert "t1" in registry.list_aliases()
        finally:
            os.unlink(temp_path)

    def test_environment_variable_override(self):
        """Test OPENROUTER_MODELS_CONFIG_PATH environment variable."""
        # Create custom config
        config_data = {
            "models": [
                {"model_name": "env/model", "aliases": ["envtest"], "context_window": 8192, "max_output_tokens": 4096}
            ]
        }

        with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) as f:
            json.dump(config_data, f)
            temp_path = f.name

        try:
            # Set environment variable
            original_env = os.environ.get("OPENROUTER_MODELS_CONFIG_PATH")
            os.environ["OPENROUTER_MODELS_CONFIG_PATH"] = temp_path

            # Create registry without explicit path
            registry = OpenRouterModelRegistry()

            # Should load from environment path
            assert "env/model" in registry.list_models()
            assert "envtest" in registry.list_aliases()

        finally:
            # Restore environment
            if original_env is not None:
                os.environ["OPENROUTER_MODELS_CONFIG_PATH"] = original_env
            else:
                del os.environ["OPENROUTER_MODELS_CONFIG_PATH"]
            os.unlink(temp_path)

    def test_alias_resolution(self):
        """Test alias resolution functionality."""
        registry = OpenRouterModelRegistry()

        # Test various aliases
        test_cases = [
            ("opus", "anthropic/claude-opus-4.1"),
            ("OPUS", "anthropic/claude-opus-4.1"),  # Case insensitive
            ("sonnet", "anthropic/claude-sonnet-4.5"),
            ("o3", "openai/o3"),
            ("deepseek", "deepseek/deepseek-r1-0528"),
            ("mistral", "mistralai/mistral-large-2411"),
        ]

        for alias, expected_model in test_cases:
            config = registry.resolve(alias)
            assert config is not None, f"Failed to resolve alias '{alias}'"
            assert config.model_name == expected_model

    def test_direct_model_name_lookup(self):
        """Test looking up models by their full name."""
        registry = OpenRouterModelRegistry()

        # Should be able to look up by full model name
        config = registry.resolve("anthropic/claude-opus-4.1")
        assert config is not None
        assert config.model_name == "anthropic/claude-opus-4.1"

        config = registry.resolve("openai/o3")
        assert config is not None
        assert config.model_name == "openai/o3"

    def test_unknown_model_resolution(self):
        """Test resolution of unknown models."""
        registry = OpenRouterModelRegistry()

        # Unknown aliases should return None
        assert registry.resolve("unknown-alias") is None
        assert registry.resolve("") is None
        assert registry.resolve("non-existent") is None

    def test_model_capabilities_conversion(self):
        """Test that registry returns ModelCapabilities directly."""
        registry = OpenRouterModelRegistry()

        config = registry.resolve("opus")
        assert config is not None

        # Registry now returns ModelCapabilities objects directly
        assert config.provider == ProviderType.OPENROUTER
        assert config.model_name == "anthropic/claude-opus-4.1"
        assert config.friendly_name == "OpenRouter (anthropic/claude-opus-4.1)"
        assert config.context_window == 200000
        assert not config.supports_extended_thinking

    def test_duplicate_alias_detection(self):
        """Test that duplicate aliases are detected."""
        config_data = {
            "models": [
                {"model_name": "test/model-1", "aliases": ["dupe"], "context_window": 4096, "max_output_tokens": 2048},
                {
                    "model_name": "test/model-2",
                    "aliases": ["DUPE"],  # Same alias, different case
                    "context_window": 8192,
                    "max_output_tokens": 2048,
                },
            ]
        }

        with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) as f:
            json.dump(config_data, f)
            temp_path = f.name

        try:
            with pytest.raises(ValueError, match="Duplicate alias"):
                OpenRouterModelRegistry(config_path=temp_path)
        finally:
            os.unlink(temp_path)

    def test_backwards_compatibility_max_tokens(self):
        """Test that legacy max_tokens field maps to max_output_tokens."""
        config_data = {
            "models": [
                {
                    "model_name": "test/old-model",
                    "aliases": ["old"],
                    "max_tokens": 16384,  # Old field name should cause error
                    "supports_extended_thinking": False,
                }
            ]
        }

        with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) as f:
            json.dump(config_data, f)
            temp_path = f.name

        try:
            with patch.dict("os.environ", {}, clear=True):
                with pytest.raises(ValueError, match="max_output_tokens"):
                    OpenRouterModelRegistry(config_path=temp_path)
        finally:
            os.unlink(temp_path)

    def test_missing_config_file(self):
        """Test behavior with missing config file."""
        # Use a non-existent path
        with patch.dict("os.environ", {}, clear=True):
            registry = OpenRouterModelRegistry(config_path="/non/existent/path.json")

        # Should initialize with empty maps
        assert len(registry.list_models()) == 0
        assert len(registry.list_aliases()) == 0
        assert registry.resolve("anything") is None

    def test_invalid_json_config(self):
        """Test handling of invalid JSON."""
        with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) as f:
            f.write("{ invalid json }")
            temp_path = f.name

        try:
            registry = OpenRouterModelRegistry(config_path=temp_path)
            # Should handle gracefully and initialize empty
            assert len(registry.list_models()) == 0
            assert len(registry.list_aliases()) == 0
        finally:
            os.unlink(temp_path)

    def test_model_with_all_capabilities(self):
        """Test model with all capability flags."""
        from providers.shared import TemperatureConstraint

        caps = ModelCapabilities(
            provider=ProviderType.OPENROUTER,
            model_name="test/full-featured",
            friendly_name="OpenRouter (test/full-featured)",
            aliases=["full"],
            context_window=128000,
            max_output_tokens=8192,
            supports_extended_thinking=True,
            supports_system_prompts=True,
            supports_streaming=True,
            supports_function_calling=True,
            supports_json_mode=True,
            description="Fully featured test model",
            temperature_constraint=TemperatureConstraint.create("range"),
        )
        assert caps.context_window == 128000
        assert caps.supports_extended_thinking
        assert caps.supports_system_prompts
        assert caps.supports_streaming
        assert caps.supports_function_calling
        # Note: supports_json_mode is not in ModelCapabilities yet

```

--------------------------------------------------------------------------------
/docs/tools/docgen.md:
--------------------------------------------------------------------------------

```markdown
# DocGen Tool - Comprehensive Documentation Generation

**Generates comprehensive documentation with complexity analysis through workflow-driven investigation**

The `docgen` tool creates thorough documentation by analyzing your code structure, understanding function complexity, and documenting gotchas and unexpected behaviors that developers need to know. This workflow tool guides Claude through systematic investigation of code functionality, architectural patterns, and documentation needs across multiple steps before generating comprehensive documentation with complexity analysis and call flow information.

## How the Workflow Works

The docgen tool implements a **structured workflow** for comprehensive documentation generation:

**Investigation Phase (Claude-Led):**
1. **Step 1 (Discovery)**: Claude discovers ALL files needing documentation and reports exact count
2. **Step 2+ (Documentation)**: Claude documents files one-by-one with complete coverage validation
3. **Throughout**: Claude tracks progress with counters and enforces modern documentation styles
4. **Completion**: Only when all files are documented (num_files_documented = total_files_to_document)

**Documentation Generation Phase:**
After Claude completes the investigation:
- Complete documentation strategy with style consistency
- Function/method documentation with complexity analysis
- Call flow and dependency documentation
- Gotchas and unexpected behavior documentation
- Final polished documentation following project standards

This workflow ensures methodical analysis before documentation generation, resulting in more comprehensive and valuable documentation.
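
The counter-based completion rule can be summarized in a short sketch. This is illustrative only, using the workflow parameter names documented below rather than the tool's actual implementation:

```python
# Illustrative sketch of the counter-based completion rule described above;
# these helpers are hypothetical and only mirror the documented parameter names.
def documentation_complete(num_files_documented: int,
                           total_files_to_document: int) -> bool:
    """Docgen may only finish once every discovered file is documented."""
    return num_files_documented == total_files_to_document


def expected_total_steps(total_files_to_document: int) -> int:
    """Total steps are one discovery step plus one step per file."""
    return 1 + total_files_to_document
```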

## Model Recommendation

Documentation generation excels with analytical models like Gemini Pro or O3, which can understand complex code relationships, identify non-obvious behaviors, and generate thorough documentation that covers gotchas and edge cases. The combination of large context windows and analytical reasoning enables generation of documentation that helps prevent integration issues and developer confusion.

## Example Prompts

**Basic Usage:**
```
"Use zen to generate documentation for the UserManager class"
"Document the authentication module with complexity analysis using gemini pro"
"Add comprehensive documentation to all methods in src/payment_processor.py"
```

## Key Features

- **Systematic file-by-file approach** - Complete documentation with progress tracking and validation
- **Modern documentation styles** - Enforces /// for Objective-C/Swift, /** */ for Java/JavaScript, etc.
- **Complexity analysis** - Big O notation for algorithms and performance characteristics
- **Call flow documentation** - Dependencies and method relationships
- **Counter-based completion** - Prevents stopping until all files are documented
- **Large file handling** - Systematic portion-by-portion documentation for comprehensive coverage
- **Final verification scan** - Mandatory check to ensure no functions are missed
- **Bug tracking** - Surfaces code issues without altering logic
- **Configuration parameters** - Control complexity analysis, call flow, and inline comments

## Tool Parameters

**Workflow Parameters (used during step-by-step process):**
- `step`: Current step description - discovery phase (step 1) or documentation phase (step 2+)
- `step_number`: Current step number in documentation sequence (required)
- `total_steps`: Dynamically calculated as 1 + total_files_to_document
- `next_step_required`: Whether another step is needed
- `findings`: Discoveries about code structure and documentation needs (required)
- `relevant_files`: Files being actively documented in current step
- `num_files_documented`: Counter tracking completed files (required)
- `total_files_to_document`: Total count of files needing documentation (required)

**Configuration Parameters (required fields):**
- `document_complexity`: Include Big O complexity analysis (default: true)
- `document_flow`: Include call flow and dependency information (default: true)
- `update_existing`: Update existing documentation when incorrect/incomplete (default: true)
- `comments_on_complex_logic`: Add inline comments for complex algorithmic steps (default: true)
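
For orientation, a hypothetical step-2 request might combine the workflow and configuration parameters like this (paths, counts, and findings are placeholders):

```python
# Hypothetical docgen step-2 arguments; values are placeholders for illustration.
docgen_step = {
    "step": "Document src/payment_processor.py with complexity and call-flow notes",
    "step_number": 2,
    "total_steps": 4,  # 1 discovery step + 3 files to document
    "next_step_required": True,
    "findings": "PaymentProcessor.charge() retries transient gateway errors",
    "relevant_files": ["/abs/path/src/payment_processor.py"],
    "num_files_documented": 1,
    "total_files_to_document": 3,
    "document_complexity": True,
    "document_flow": True,
    "update_existing": True,
    "comments_on_complex_logic": True,
}
```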

## Usage Examples

**Class Documentation:**
```
"Generate comprehensive documentation for the PaymentProcessor class including complexity analysis"
```

**Module Documentation:**
```
"Document all functions in the authentication module with call flow information"
```

**API Documentation:**
```
"Create documentation for the REST API endpoints in api/users.py with parameter gotchas"
```

**Algorithm Documentation:**
```
"Document the sorting algorithm in utils/sort.py with Big O analysis and edge cases"
```

**Library Documentation:**
```
"Add comprehensive documentation to the utility library with usage examples and warnings"
```

## Documentation Standards

**Function/Method Documentation:**
- Parameter types and descriptions
- Return value documentation with types
- Algorithmic complexity analysis (Big O notation)
- Call flow and dependency information
- Purpose and behavior explanation
- Exception types and conditions

**Gotchas and Edge Cases:**
- Parameter combinations that produce unexpected results
- Hidden dependencies on global state or environment
- Order-dependent operations where sequence matters
- Performance implications and bottlenecks
- Thread safety considerations
- Platform-specific behavior differences

**Code Quality Documentation:**
- Inline comments for complex logic
- Design pattern explanations
- Architectural decision rationale
- Usage examples and best practices
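
The example below shows the level of detail these standards aim for in a Python docstring. The function and its behavior are invented purely for illustration:

```python
def find_user(users: list[dict], email: str) -> dict | None:
    """Return the first user record matching the given email, or None.

    Args:
        users: User records as dictionaries with an "email" key.
        email: Address to search for; compared case-insensitively.

    Returns:
        The matching user dictionary, or None if no user matches.

    Complexity:
        O(n) time over the user list, O(1) extra space.

    Call flow:
        Called by login and password-reset handlers; calls no other
        project functions.

    Gotchas:
        Assumes emails are unique; with duplicates only the first match
        is returned.
    """
    target = email.lower()
    for user in users:
        if user.get("email", "").lower() == target:
            return user
    return None
```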

## Documentation Features Generated

**Complexity Analysis:**
- Time complexity (Big O notation)
- Space complexity when relevant
- Worst-case, average-case, and best-case scenarios
- Performance characteristics and bottlenecks

**Call Flow Documentation:**
- Which methods/functions this code calls
- Which methods/functions call this code
- Key dependencies and interactions
- Side effects and state modifications
- Data flow through functions

**Gotchas Documentation:**
- Non-obvious parameter interactions
- Hidden state dependencies
- Silent failure conditions
- Resource management requirements
- Version compatibility issues
- Platform-specific behaviors

## Incremental Documentation Approach

**Key Benefits:**
- **Immediate value delivery** - Code becomes more maintainable right away
- **Iterative improvement** - Pattern recognition across multiple analysis rounds
- **Quality validation** - Testing documentation effectiveness during workflow
- **Reduced cognitive load** - Focus on one function/method at a time

**Workflow Process:**
1. **Analyze and Document**: Examine each function and immediately add documentation
2. **Continue Analyzing**: Move to next function while building understanding
3. **Refine and Standardize**: Review and improve previously added documentation

## Language Support

**Modern Documentation Style Enforcement:**
- **Python**: Triple-quote docstrings with type hints
- **Objective-C**: /// comments
- **Swift**: /// comments
- **JavaScript/TypeScript**: /** */ JSDoc style
- **Java**: /** */ Javadoc style  
- **C#**: /// XML documentation comments
- **C/C++**: /// for documentation comments
- **Go**: // comments above functions/types
- **Rust**: /// for documentation comments

## Documentation Quality Features

**Comprehensive Coverage:**
- All public methods and functions
- Complex private methods requiring explanation
- Class and module-level documentation
- Configuration and setup requirements

**Developer-Focused:**
- Clear explanations of non-obvious behavior
- Usage examples for complex APIs
- Warning about common pitfalls
- Integration guidance and best practices

**Maintainable Format:**
- Consistent documentation style
- Appropriate level of detail
- Cross-references and links
- Version and compatibility notes

## Best Practices

- **Use systematic approach**: Tool now documents all files with progress tracking and validation
- **Trust the counters**: Tool prevents premature completion until all files are documented
- **Large files handled**: Tool automatically processes large files in systematic portions
- **Modern styles enforced**: Tool ensures correct documentation style per language
- **Configuration matters**: Enable complexity analysis and call flow for comprehensive docs
- **Bug tracking**: Tool surfaces issues without altering code - review findings after completion

## When to Use DocGen vs Other Tools

- **Use `docgen`** for: Creating comprehensive documentation, adding missing docs, improving existing documentation
- **Use `analyze`** for: Understanding code structure without generating documentation
- **Use `codereview`** for: Reviewing code quality including documentation completeness
- **Use `refactor`** for: Restructuring code before documentation (cleaner code = better docs)
```

--------------------------------------------------------------------------------
/clink/registry.py:
--------------------------------------------------------------------------------

```python
"""Configuration registry for clink CLI integrations."""

from __future__ import annotations

import json
import logging
import shlex
from collections.abc import Iterable
from pathlib import Path

from clink.constants import (
    CONFIG_DIR,
    DEFAULT_TIMEOUT_SECONDS,
    INTERNAL_DEFAULTS,
    PROJECT_ROOT,
    USER_CONFIG_DIR,
    CLIInternalDefaults,
)
from clink.models import (
    CLIClientConfig,
    CLIRoleConfig,
    ResolvedCLIClient,
    ResolvedCLIRole,
)
from utils.env import get_env
from utils.file_utils import read_json_file

logger = logging.getLogger("clink.registry")

CONFIG_ENV_VAR = "CLI_CLIENTS_CONFIG_PATH"


class RegistryLoadError(RuntimeError):
    """Raised when configuration files are invalid or missing critical data."""


class ClinkRegistry:
    """Loads CLI client definitions and exposes them for schema generation/runtime use."""

    def __init__(self) -> None:
        self._clients: dict[str, ResolvedCLIClient] = {}
        self._load()

    def _load(self) -> None:
        self._clients.clear()
        for config_path in self._iter_config_files():
            try:
                data = read_json_file(str(config_path))
            except json.JSONDecodeError as exc:
                raise RegistryLoadError(f"Invalid JSON in {config_path}: {exc}") from exc

            if not data:
                logger.debug("Skipping empty configuration file: %s", config_path)
                continue

            config = CLIClientConfig.model_validate(data)
            resolved = self._resolve_config(config, source_path=config_path)
            key = resolved.name.lower()
            if key in self._clients:
                logger.info("Overriding CLI configuration for '%s' from %s", resolved.name, config_path)
            else:
                logger.debug("Loaded CLI configuration for '%s' from %s", resolved.name, config_path)
            self._clients[key] = resolved

        if not self._clients:
            raise RegistryLoadError(
                "No CLI clients configured. Ensure conf/cli_clients contains at least one definition or set "
                f"{CONFIG_ENV_VAR}."
            )

    def reload(self) -> None:
        """Reload configurations from disk."""
        self._load()

    def list_clients(self) -> list[str]:
        return sorted(client.name for client in self._clients.values())

    def list_roles(self, cli_name: str) -> list[str]:
        config = self.get_client(cli_name)
        return sorted(config.roles.keys())

    def get_client(self, cli_name: str) -> ResolvedCLIClient:
        key = cli_name.lower()
        if key not in self._clients:
            available = ", ".join(self.list_clients())
            raise KeyError(f"CLI '{cli_name}' is not configured. Available clients: {available}")
        return self._clients[key]

    # ------------------------------------------------------------------
    # Internal helpers
    # ------------------------------------------------------------------

    def _iter_config_files(self) -> Iterable[Path]:
        search_paths: list[Path] = []

        # 1. Built-in configs
        search_paths.append(CONFIG_DIR)

        # 2. CLI_CLIENTS_CONFIG_PATH environment override (file or directory)
        env_path_raw = get_env(CONFIG_ENV_VAR)
        if env_path_raw:
            env_path = Path(env_path_raw).expanduser()
            search_paths.append(env_path)

        # 3. User overrides in ~/.zen/cli_clients
        search_paths.append(USER_CONFIG_DIR)

        seen: set[Path] = set()

        for base in search_paths:
            if not base:
                continue
            if base in seen:
                continue
            seen.add(base)

            if base.is_file() and base.suffix.lower() == ".json":
                yield base
                continue

            if base.is_dir():
                for path in sorted(base.glob("*.json")):
                    if path.is_file():
                        yield path
            else:
                logger.debug("Configuration path does not exist: %s", base)

    def _resolve_config(self, raw: CLIClientConfig, *, source_path: Path) -> ResolvedCLIClient:
        if not raw.name:
            raise RegistryLoadError(f"CLI configuration at {source_path} is missing a 'name' field")

        normalized_name = raw.name.strip()
        internal_defaults = INTERNAL_DEFAULTS.get(normalized_name.lower())
        if internal_defaults is None:
            raise RegistryLoadError(f"CLI '{raw.name}' is not supported by clink")

        executable = self._resolve_executable(raw, internal_defaults, source_path)

        internal_args = list(internal_defaults.additional_args) if internal_defaults else []
        config_args = list(raw.additional_args)

        timeout_seconds = raw.timeout_seconds or (
            internal_defaults.timeout_seconds if internal_defaults else DEFAULT_TIMEOUT_SECONDS
        )

        parser_name = internal_defaults.parser
        if not parser_name:
            raise RegistryLoadError(
                f"CLI '{raw.name}' must define a parser either in configuration or internal defaults"
            )

        runner_name = internal_defaults.runner if internal_defaults else None

        env = self._merge_env(raw, internal_defaults)
        working_dir = self._resolve_optional_path(raw.working_dir, source_path.parent)
        roles = self._resolve_roles(raw, internal_defaults, source_path)

        output_to_file = raw.output_to_file

        return ResolvedCLIClient(
            name=normalized_name,
            executable=executable,
            internal_args=internal_args,
            config_args=config_args,
            env=env,
            timeout_seconds=int(timeout_seconds),
            parser=parser_name,
            runner=runner_name,
            roles=roles,
            output_to_file=output_to_file,
            working_dir=working_dir,
        )

    def _resolve_executable(
        self,
        raw: CLIClientConfig,
        internal_defaults: CLIInternalDefaults | None,
        source_path: Path,
    ) -> list[str]:
        command = raw.command
        if not command:
            raise RegistryLoadError(f"CLI '{raw.name}' must specify a 'command' in configuration")
        return shlex.split(command)

    def _merge_env(
        self,
        raw: CLIClientConfig,
        internal_defaults: CLIInternalDefaults | None,
    ) -> dict[str, str]:
        merged: dict[str, str] = {}
        if internal_defaults and internal_defaults.env:
            merged.update(internal_defaults.env)
        merged.update(raw.env)
        return merged

    def _resolve_roles(
        self,
        raw: CLIClientConfig,
        internal_defaults: CLIInternalDefaults | None,
        source_path: Path,
    ) -> dict[str, ResolvedCLIRole]:
        roles: dict[str, CLIRoleConfig] = dict(raw.roles)

        default_role_prompt = internal_defaults.default_role_prompt if internal_defaults else None
        if "default" not in roles:
            roles["default"] = CLIRoleConfig(prompt_path=default_role_prompt)
        elif roles["default"].prompt_path is None and default_role_prompt:
            roles["default"].prompt_path = default_role_prompt

        resolved: dict[str, ResolvedCLIRole] = {}
        for role_name, role_config in roles.items():
            prompt_path_str = role_config.prompt_path or default_role_prompt
            if not prompt_path_str:
                raise RegistryLoadError(f"Role '{role_name}' for CLI '{raw.name}' must define a prompt_path")
            prompt_path = self._resolve_prompt_path(prompt_path_str, source_path.parent)
            resolved[role_name] = ResolvedCLIRole(
                name=role_name,
                prompt_path=prompt_path,
                role_args=list(role_config.role_args),
                description=role_config.description,
            )
        return resolved

    def _resolve_prompt_path(self, prompt_path: str, base_dir: Path) -> Path:
        resolved = self._resolve_path(prompt_path, base_dir)
        if not resolved.exists():
            raise RegistryLoadError(f"Prompt file not found: {resolved}")
        return resolved

    def _resolve_optional_path(self, candidate: str | None, base_dir: Path) -> Path | None:
        if not candidate:
            return None
        return self._resolve_path(candidate, base_dir)

    def _resolve_path(self, candidate: str, base_dir: Path) -> Path:
        path = Path(candidate)
        if path.is_absolute():
            return path

        candidate_path = (base_dir / path).resolve()
        if candidate_path.exists():
            return candidate_path

        project_relative = (PROJECT_ROOT / path).resolve()
        return project_relative


_REGISTRY: ClinkRegistry | None = None


def get_registry() -> ClinkRegistry:
    global _REGISTRY
    if _REGISTRY is None:
        _REGISTRY = ClinkRegistry()
    return _REGISTRY

```

--------------------------------------------------------------------------------
/docs/tools/testgen.md:
--------------------------------------------------------------------------------

```markdown
# TestGen Tool - Comprehensive Test Generation

**Generates thorough test suites with edge case coverage through workflow-driven investigation**

The `testgen` tool creates comprehensive test suites by analyzing your code paths, understanding intricate dependencies, and identifying realistic edge cases and failure scenarios that need test coverage. This workflow tool guides Claude through systematic investigation of code functionality, critical paths, edge cases, and integration points across multiple steps before generating comprehensive tests with realistic failure mode analysis.

## Thinking Mode

**Default is `medium` (8,192 tokens) for extended thinking models.** Use `high` for complex systems with many interactions or `max` for critical systems requiring exhaustive test coverage.

## How the Workflow Works

The testgen tool implements a **structured workflow** for comprehensive test generation:

**Investigation Phase (Claude-Led):**
1. **Step 1**: Claude describes the test generation plan and begins analyzing code functionality
2. **Step 2+**: Claude examines critical paths, edge cases, error handling, and integration points
3. **Throughout**: Claude tracks findings, test scenarios, and coverage gaps
4. **Completion**: Once investigation is thorough, Claude signals completion

**Test Generation Phase:**
After Claude completes the investigation:
- Complete test scenario catalog with all edge cases
- Framework-specific test generation
- Realistic failure mode coverage
- Final test suite with comprehensive coverage

This workflow ensures methodical analysis before test generation, resulting in more thorough and valuable test suites.
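
As an illustration, a final investigation step might look like the sketch below, using the workflow parameter names documented under Tool Parameters (file paths and findings are placeholders):

```python
# Illustrative final investigation step; all values are placeholders.
final_step = {
    "step": "Investigation complete: critical paths and edge cases catalogued",
    "step_number": 3,
    "total_steps": 3,
    "next_step_required": False,  # signals that test generation can begin
    "findings": "login() locks the account after five failed attempts",
    "files_checked": ["/abs/path/src/auth/user.py", "/abs/path/src/auth/limits.py"],
    "relevant_files": ["/abs/path/src/auth/user.py"],
    "relevant_context": ["User.login", "User._check_rate_limit"],
    "confidence": "high",
}
```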

## Model Recommendation

Test generation excels with extended reasoning models like Gemini Pro or O3, which can analyze complex code paths, understand intricate dependencies, and identify comprehensive edge cases. The combination of large context windows and advanced reasoning enables generation of thorough test suites that cover realistic failure scenarios and integration points that shorter-context models might overlook.

## Example Prompts

**Basic Usage:**
```
"Use zen to generate tests for User.login() method"
"Generate comprehensive tests for the sorting method in src/new_sort.py using o3"
"Create tests for edge cases not already covered in our tests using gemini pro"
```

## Key Features

- **Multi-agent workflow** analyzing code paths and identifying realistic failure modes
- **Generates framework-specific tests** following project conventions
- **Supports test pattern following** when examples are provided
- **Dynamic token allocation** (25% for test examples, 75% for main code)
- **Prioritizes smallest test files** for pattern detection
- **Can reference existing test files**: `"Generate tests following patterns from tests/unit/"`
- **Specific code coverage** - target specific functions/classes rather than testing everything
- **Image support**: Test UI components, analyze visual requirements: `"Generate tests for this login form using the UI mockup screenshot"`
- **Edge case identification**: Systematic discovery of boundary conditions and error states
- **Realistic failure mode analysis**: Understanding what can actually go wrong in production
- **Integration test support**: Tests that cover component interactions and system boundaries

## Tool Parameters

**Workflow Investigation Parameters (used during step-by-step process):**
- `step`: Current investigation step description (required for each step)
- `step_number`: Current step number in test generation sequence (required)
- `total_steps`: Estimated total investigation steps (adjustable)
- `next_step_required`: Whether another investigation step is needed
- `findings`: Discoveries about functionality and test scenarios (required)
- `files_checked`: All files examined during investigation
- `relevant_files`: Files directly needing tests (required in step 1)
- `relevant_context`: Methods/functions/classes requiring test coverage
- `confidence`: Confidence level in test plan completeness (exploring/low/medium/high/certain)

**Initial Configuration (used in step 1):**
- `prompt`: Description of what to test, testing objectives, and specific scope/focus areas (required)
- `model`: auto|pro|flash|flash-2.0|flashlite|o3|o3-mini|o4-mini|gpt4.1|gpt5|gpt5-mini|gpt5-nano (default: server default)
- `test_examples`: Optional existing test files or directories to use as style/pattern reference (absolute paths)
- `thinking_mode`: minimal|low|medium|high|max (default: medium, Gemini only)
- `use_assistant_model`: Whether to use expert test generation phase (default: true, set to false to use Claude only)

## Usage Examples

**Method-Specific Tests:**
```
"Generate tests for User.login() method covering authentication success, failure, and edge cases"
```

**Class Testing:**
```
"Use pro to generate comprehensive tests for PaymentProcessor class with max thinking mode"
```

**Following Existing Patterns:**
```
"Generate tests for new authentication module following patterns from tests/unit/auth/"
```

**UI Component Testing:**
```
"Generate tests for this login form component using the UI mockup screenshot"
```

**Algorithm Testing:**
```
"Create thorough tests for the sorting algorithm in utils/sort.py, focus on edge cases and performance"
```

**Integration Testing:**
```
"Generate integration tests for the payment processing pipeline from order creation to completion"
```

## Test Generation Strategy

**Code Path Analysis:**
- Identifies all execution paths through the code
- Maps conditional branches and loops
- Discovers error handling paths
- Analyzes state transitions

**Edge Case Discovery:**
- Boundary value analysis (empty, null, max values)
- Invalid input scenarios
- Race conditions and timing issues
- Resource exhaustion cases

**Failure Mode Analysis:**
- External dependency failures
- Network and I/O errors
- Authentication and authorization failures
- Data corruption scenarios

**Framework Detection:**
The tool automatically detects and generates tests for:
- **Python**: pytest, unittest, nose2
- **JavaScript**: Jest, Mocha, Jasmine, Vitest
- **Java**: JUnit 4/5, TestNG, Mockito
- **C#**: NUnit, MSTest, xUnit
- **Swift**: XCTest
- **Go**: testing package
- **And more**: Adapts to project conventions
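
The shape of the generated tests follows the detected framework. For Python with pytest, edge-case coverage typically looks like the illustrative sketch below (the module and function under test are placeholders):

```python
# Illustrative pytest-style edge-case tests; myapp.validation.parse_quantity
# is a hypothetical function used only for demonstration.
import pytest

from myapp.validation import parse_quantity


@pytest.mark.parametrize(
    "raw,expected",
    [
        ("1", 1),            # happy path
        ("0", 0),            # lower boundary
        ("999999", 999999),  # large but valid value
    ],
)
def test_parse_quantity_valid(raw, expected):
    assert parse_quantity(raw) == expected


@pytest.mark.parametrize("raw", ["", "   ", "-1", "abc", None])
def test_parse_quantity_rejects_invalid_input(raw):
    # Boundary and invalid inputs should fail loudly, not silently coerce.
    with pytest.raises(ValueError):
        parse_quantity(raw)
```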

## Test Categories Generated

**Unit Tests:**
- Function/method behavior validation
- Input/output verification
- Error condition handling
- State change verification

**Integration Tests:**
- Component interaction testing
- API endpoint validation
- Database integration
- External service mocking

**Edge Case Tests:**
- Boundary conditions
- Invalid inputs
- Resource limits
- Concurrent access

**Performance Tests:**
- Response time validation
- Memory usage checks
- Load handling
- Scalability verification

## Best Practices

- **Be specific about scope**: Target specific functions/classes rather than requesting tests for everything
- **Provide test examples**: Include existing test files for pattern consistency
- **Focus on critical paths**: Prioritize testing of business-critical functionality
- **Include visual context**: Screenshots or mockups for UI component testing
- **Describe testing objectives**: Explain what aspects are most important to test
- **Consider test maintenance**: Request readable, maintainable test code

## Test Quality Features

**Realistic Test Data:**
- Generates meaningful test data that represents real-world scenarios
- Avoids trivial test cases that don't add value
- Creates data that exercises actual business logic

**Comprehensive Coverage:**
- Happy path scenarios
- Error conditions and exceptions
- Edge cases and boundary conditions
- Integration points and dependencies

**Maintainable Code:**
- Clear test names that describe what's being tested
- Well-organized test structure
- Appropriate use of setup/teardown
- Minimal test data and mocking

## Advanced Features

**Pattern Following:**
When test examples are provided, the tool analyzes:
- Naming conventions and structure
- Assertion patterns and style
- Mocking and setup approaches
- Test data organization

**Large Context Analysis:**
With models like Gemini Pro, the tool can:
- Analyze extensive codebases for comprehensive test coverage
- Understand complex interactions across multiple modules
- Generate integration tests that span multiple components

**Visual Testing:**
For UI components and visual elements:
- Generate tests based on visual requirements
- Create accessibility testing scenarios
- Test responsive design behaviors

## When to Use TestGen vs Other Tools

- **Use `testgen`** for: Creating comprehensive test suites, filling test coverage gaps, testing new features
- **Use `debug`** for: Diagnosing specific test failures or runtime issues
- **Use `codereview`** for: Reviewing existing test quality and coverage
- **Use `analyze`** for: Understanding existing test structure without generating new tests

```

--------------------------------------------------------------------------------
/utils/client_info.py:
--------------------------------------------------------------------------------

```python
"""
Client Information Utility for MCP Server

This module provides utilities to extract and format client information
from the MCP protocol's clientInfo sent during initialization.

It also provides friendly name mapping and caching for consistent client
identification across the application.
"""

import logging
from typing import Any, Optional

logger = logging.getLogger(__name__)

# Global cache for client information
_client_info_cache: Optional[dict[str, Any]] = None

# Mapping of known client names to friendly names
# This is case-insensitive and checks if the key is contained in the client name
CLIENT_NAME_MAPPINGS = {
    # Claude variants
    "claude-ai": "Claude",
    "claude": "Claude",
    "claude-desktop": "Claude",
    "claude-code": "Claude",
    "anthropic": "Claude",
    # Gemini variants
    "gemini-cli-mcp-client": "Gemini",
    "gemini-cli": "Gemini",
    "gemini": "Gemini",
    "google": "Gemini",
    # Other known clients
    "cursor": "Cursor",
    "vscode": "VS Code",
    "codeium": "Codeium",
    "copilot": "GitHub Copilot",
    # Generic MCP clients
    "mcp-client": "MCP Client",
    "test-client": "Test Client",
}

# Default friendly name when no match is found
DEFAULT_FRIENDLY_NAME = "Claude"


def get_friendly_name(client_name: str) -> str:
    """
    Map a client name to a friendly name.

    Args:
        client_name: The raw client name from clientInfo

    Returns:
        A friendly name for display (e.g., "Claude", "Gemini")
    """
    if not client_name:
        return DEFAULT_FRIENDLY_NAME

    # Convert to lowercase for case-insensitive matching
    client_name_lower = client_name.lower()

    # Check each mapping - using 'in' to handle partial matches
    for key, friendly_name in CLIENT_NAME_MAPPINGS.items():
        if key.lower() in client_name_lower:
            return friendly_name

    # If no match found, return the default
    return DEFAULT_FRIENDLY_NAME


def get_cached_client_info() -> Optional[dict[str, Any]]:
    """
    Get cached client information if available.

    Returns:
        Cached client info dictionary or None
    """
    global _client_info_cache
    return _client_info_cache


def get_client_info_from_context(server: Any) -> Optional[dict[str, Any]]:
    """
    Extract client information from the MCP server's request context.

    The MCP protocol sends clientInfo during initialization containing:
    - name: The client application name (e.g., "Claude Code", "Claude Desktop")
    - version: The client version string

    This function also adds a friendly_name field and caches the result.

    Args:
        server: The MCP server instance

    Returns:
        Dictionary with client info or None if not available:
        {
            "name": "claude-ai",
            "version": "1.0.0",
            "friendly_name": "Claude"
        }
    """
    global _client_info_cache

    # Return cached info if available
    if _client_info_cache is not None:
        return _client_info_cache

    try:
        # Try to access the request context and session
        if not server:
            return None

        # Check if server has request_context property
        request_context = None
        try:
            request_context = server.request_context
        except AttributeError:
            logger.debug("Server does not have request_context property")
            return None

        if not request_context:
            logger.debug("Request context is None")
            return None

        # Try to access session from request context
        session = None
        try:
            session = request_context.session
        except AttributeError:
            logger.debug("Request context does not have session property")
            return None

        if not session:
            logger.debug("Session is None")
            return None

        # Try to access client params from session
        client_params = None
        try:
            # The clientInfo is stored in _client_params.clientInfo
            client_params = session._client_params
        except AttributeError:
            logger.debug("Session does not have _client_params property")
            return None

        if not client_params:
            logger.debug("Client params is None")
            return None

        # Try to extract clientInfo
        client_info = None
        try:
            client_info = client_params.clientInfo
        except AttributeError:
            logger.debug("Client params does not have clientInfo property")
            return None

        if not client_info:
            logger.debug("Client info is None")
            return None

        # Extract name and version
        result = {}

        try:
            result["name"] = client_info.name
        except AttributeError:
            logger.debug("Client info does not have name property")

        try:
            result["version"] = client_info.version
        except AttributeError:
            logger.debug("Client info does not have version property")

        if not result:
            return None

        # Add friendly name
        raw_name = result.get("name", "")
        result["friendly_name"] = get_friendly_name(raw_name)

        # Cache the result
        _client_info_cache = result
        logger.debug(f"Cached client info: {result}")

        return result

    except Exception as e:
        logger.debug(f"Error extracting client info: {e}")
        return None


def format_client_info(client_info: Optional[dict[str, Any]], use_friendly_name: bool = True) -> str:
    """
    Format client information for display.

    Args:
        client_info: Dictionary with client info or None
        use_friendly_name: If True, use the friendly name instead of raw name

    Returns:
        Formatted string like "Claude v1.0.0" or "Claude"
    """
    if not client_info:
        return DEFAULT_FRIENDLY_NAME

    if use_friendly_name:
        name = client_info.get("friendly_name", client_info.get("name", DEFAULT_FRIENDLY_NAME))
    else:
        name = client_info.get("name", "Unknown")

    version = client_info.get("version", "")

    if version and not use_friendly_name:
        return f"{name} v{version}"
    else:
        # For friendly names, we just return the name without version
        return name


def get_client_friendly_name() -> str:
    """
    Get the cached client's friendly name.

    This is a convenience function that returns just the friendly name
    from the cached client info, defaulting to "Claude" if not available.

    Returns:
        The friendly name (e.g., "Claude", "Gemini")
    """
    cached_info = get_cached_client_info()
    if cached_info:
        return cached_info.get("friendly_name", DEFAULT_FRIENDLY_NAME)
    return DEFAULT_FRIENDLY_NAME


def log_client_info(server: Any, logger_instance: Optional[logging.Logger] = None) -> None:
    """
    Log client information extracted from the server.

    Args:
        server: The MCP server instance
        logger_instance: Optional logger to use (defaults to module logger)
    """
    log = logger_instance or logger

    client_info = get_client_info_from_context(server)
    if client_info:
        # Log with both raw and friendly names for debugging
        raw_name = client_info.get("name", "Unknown")
        friendly_name = client_info.get("friendly_name", DEFAULT_FRIENDLY_NAME)
        version = client_info.get("version", "")

        if raw_name != friendly_name:
            log.info(f"MCP Client Connected: {friendly_name} (raw: {raw_name} v{version})")
        else:
            log.info(f"MCP Client Connected: {friendly_name} v{version}")

        # Log to activity logger as well
        try:
            activity_logger = logging.getLogger("mcp_activity")
            activity_logger.info(f"CLIENT_IDENTIFIED: {friendly_name} (name={raw_name}, version={version})")
        except Exception:
            pass
    else:
        log.debug("Could not extract client info from MCP protocol")


# Example usage in tools:
#
# from utils.client_info import get_client_friendly_name, get_cached_client_info
#
# # In a tool's execute method:
# def execute(self, arguments: dict[str, Any]) -> list[TextContent]:
#     # Get the friendly name of the connected client
#     client_name = get_client_friendly_name()  # Returns "Claude" or "Gemini" etc.
#
#     # Or get full cached info if needed
#     client_info = get_cached_client_info()
#     if client_info:
#         raw_name = client_info['name']        # e.g., "claude-ai"
#         version = client_info['version']      # e.g., "1.0.0"
#         friendly = client_info['friendly_name'] # e.g., "Claude"
#
#     # Customize response based on client
#     if client_name == "Claude":
#         response = f"Hello from Zen MCP Server to {client_name}!"
#     elif client_name == "Gemini":
#         response = f"Greetings {client_name}, welcome to Zen MCP Server!"
#     else:
#         response = f"Welcome {client_name}!"

```

--------------------------------------------------------------------------------
/docs/tools/clink.md:
--------------------------------------------------------------------------------

```markdown
# Clink Tool - CLI-to-CLI Bridge

**Spawn AI subagents, connect external CLIs, orchestrate isolated contexts – all without leaving your session**

The `clink` tool transforms your CLI into a multi-agent orchestrator. Launch isolated Codex instances from _within_ Codex, delegate to Gemini's 1M context, or run specialized Claude agents—all while preserving conversation continuity. Instead of context-switching or token bloat, spawn fresh subagents that handle complex tasks in isolation and return only the results you need.

> **CAUTION**: Clink launches real CLI agents with relaxed permission flags (Gemini ships with `--yolo`, Codex with `--dangerously-bypass-approvals-and-sandbox`, Claude with `--permission-mode acceptEdits`) so they can edit files and run tools autonomously via MCP. If that’s more access than you want, remove those flags—the CLI can still open/read files and report findings, it just won’t auto-apply edits. You can also tighten role prompts or system prompts with stop-words/guardrails, or disable clink entirely. Otherwise, keep the shipped presets confined to workspaces you fully trust.

## Why Use Clink (CLI + Link)?

### Codex-within-Codex: The Ultimate Context Management

**The Problem**: You're deep in a Codex session debugging authentication. Now you need a comprehensive security audit, but that'll consume 50K tokens of context you can't spare.

**The Solution**: Spawn a fresh Codex subagent in an isolated context:
```bash
clink with codex codereviewer to audit auth/ for OWASP Top 10 vulnerabilities
```

The subagent:
- Launches in a **pristine context** with full token budget
- Performs deep analysis using its own MCP tools and web search
- Returns **only the final security report** (not intermediate steps)
- Your main session stays **laser-focused** on debugging

**Works with any supported CLI**: Codex can spawn Codex / Claude Code / Gemini CLI subagents, or mix and match between different CLIs.

---

### Cross-CLI Orchestration

**Scenario 1**: You're in Codex and need Gemini's 1M context window to analyze a massive legacy codebase.

**Without clink**: Open new terminal → run `gemini` → lose conversation context → manually copy/paste findings → context mismatch hell.

**With clink**: `"clink with gemini to map dependencies across this 500-file monorepo"` – Gemini processes, returns insights, conversation flows seamlessly.

**Scenario 2**: Use [`consensus`](consensus.md) to debate features with multiple models, then hand off to Gemini for implementation.

```
"Use consensus with pro and gpt5 to decide whether to add dark mode or offline support next"
[consensus runs, models deliberate, recommendation emerges]

Use continuation with clink - implement the recommended feature
```

Gemini receives the full conversation context from `consensus` including the consensus prompt + replies, understands the chosen feature, technical constraints discussed, and can start implementation immediately. No re-explaining, no context loss - true conversation continuity across tools and models.

## Key Features

- **Stay in one CLI**: No switching between terminal sessions or losing context
- **Full conversation continuity**: Gemini's responses participate in the same conversation thread
- **Role-based prompts**: Pre-configured roles for planning, code review, or general questions
- **Full CLI capabilities**: Gemini can use its own web search, file tools, and latest features
- **Token efficiency**: File references (not full content) to conserve tokens
- **Cross-tool collaboration**: Combine with other Zen tools like `planner` → `clink` → `codereview`
- **Free tier available**: Gemini offers 1,000 requests/day free with a personal Google account - great for cost savings across tools

## Available Roles

**Default Role** - General questions, summaries, quick answers
```
Use clink to ask gemini about the latest React 19 features
```

**Planner Role** - Strategic planning with multi-phase approach
```
clink with gemini with planner role to map out our microservices migration strategy
```

**Code Reviewer Role** - Focused code analysis with severity levels
```
Use clink codereviewer role to review auth.py for security issues
```

You can make your own custom roles in `conf/cli_clients/` or tweak any of the shipped presets.

## Tool Parameters

- `prompt`: Your question or task for the external CLI (required)
- `cli_name`: Which CLI to use - `gemini` (default), `claude`, `codex`, or add your own in `conf/cli_clients/`
- `role`: Preset role - `default`, `planner`, `codereviewer` (default: `default`)
- `files`: Optional file paths for context (references only, CLI opens files itself)
- `images`: Optional image paths for visual context
- `continuation_id`: Continue previous clink conversations
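
For reference, the raw tool arguments for a typical call might look like the sketch below (argument names mirror the list above; the path and values are illustrative):

```json
{
  "prompt": "Review payment_service.py for race conditions and concurrency issues",
  "cli_name": "gemini",
  "role": "codereviewer",
  "files": ["/absolute/path/to/payment_service.py"]
}
```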

## Usage Examples

**Architecture Planning:**
```
Use clink with gemini planner to design a 3-phase rollout plan for our feature flags system
```

**Code Review with Context:**
```
clink to gemini codereviewer: Review payment_service.py for race conditions and concurrency issues
```

**Codex Code Review:**
```
"clink with codex cli and perform a full code review using the codereview role"
```

**Quick Research Question:**
```
"Ask gemini via clink: What are the breaking changes in TypeScript 5.5?"
```

**Multi-Tool Workflow:**
```
"Use planner to outline the refactor, then clink gemini planner for validation,
then codereview to verify the implementation"
```

**Leveraging Gemini's Web Search:**
```
"Clink gemini to research current best practices for Kubernetes autoscaling in 2025"
```

## How Clink Works

1. **Your request** - You ask your current CLI to use `clink` with a specific CLI and role
2. **Background execution** - Zen spawns the configured CLI (e.g., `gemini --output-format json`)
3. **Context forwarding** - Your prompt, files (as references), and conversation history are sent as part of the prompt
4. **CLI processing** - Gemini (or other CLI) uses its own tools: web search, file access, thinking modes
5. **Seamless return** - Results flow back into your conversation with full context preserved
6. **Continuation support** - Future tools and models can reference Gemini's findings via [continuation support](../context-revival.md) within Zen. 
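
For example, a follow-up call can reuse the id returned by an earlier clink invocation (argument names as documented under Tool Parameters; the id value is simply whatever the previous call returned):

```json
{
  "prompt": "Summarize the riskiest finding from your earlier review",
  "cli_name": "gemini",
  "role": "codereviewer",
  "continuation_id": "<id returned by the previous clink call>"
}
```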

## Best Practices

- **Pre-authenticate CLIs**: Install and configure the relevant CLI before using clink (e.g., `npm install -g @google/gemini-cli` for Gemini)
- **Choose appropriate roles**: Use `planner` for strategy, `codereviewer` for code, `default` for general questions
- **Leverage CLI strengths**: Gemini's 1M context for large codebases, web search for current docs
- **Combine with Zen tools**: Chain `clink` with `planner`, `codereview`, `debug` for powerful workflows
- **File efficiency**: Pass file paths, let the CLI decide what to read (saves tokens)

## Configuration

Clink configurations live in `conf/cli_clients/`. We ship presets for the supported CLIs:

- `gemini.json` – runs `gemini --telemetry false --yolo -o json`
- `claude.json` – runs `claude --print --output-format json --permission-mode acceptEdits --model sonnet`
- `codex.json` – runs `codex exec --json --dangerously-bypass-approvals-and-sandbox`

> **CAUTION**: These flags intentionally bypass each CLI's safety prompts so they can edit files or launch tools autonomously via MCP. Only enable them in trusted sandboxes and tailor role prompts or CLI configs if you need more guardrails.

Each preset points to role-specific prompts in `systemprompts/clink/`. Duplicate those files to add more roles or adjust CLI flags.

> **Why `--yolo` for Gemini?** The Gemini CLI currently requires automatic approvals to execute its own tools (for example `run_shell_command`). Without the flag it errors with `Tool "run_shell_command" not found in registry`. See [issue #5382](https://github.com/google-gemini/gemini-cli/issues/5382) for more details.

**Adding new CLIs**: Drop a JSON config into `conf/cli_clients/`, create role prompts in `systemprompts/clink/`, and register a parser/agent if the CLI outputs a new format.
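
As a purely hypothetical sketch of what such a config might contain (the authoritative schema is whatever the shipped presets use, so copy `gemini.json` or `codex.json` and adapt it rather than relying on these exact field names):

```json
{
  "_comment": "hypothetical sketch only - see the shipped presets for the real schema",
  "name": "mycli",
  "command": "mycli --json",
  "roles": {
    "default": {"prompt_path": "systemprompts/clink/mycli_default.txt"}
  }
}
```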

## When to Use Clink vs Other Tools

- **Use `clink`** for: Leveraging external CLI capabilities (Gemini's web search, 1M context), specialized CLI features, cross-CLI collaboration
- **Use `chat`** for: Direct model-to-model conversations within Zen
- **Use `planner`** for: Zen's native planning workflows with step validation
- **Use `codereview`** for: Zen's structured code review with severity levels

## Setup Requirements

Ensure the relevant CLI is installed and configured:

- [Claude Code](https://www.anthropic.com/claude-code)
- [Gemini CLI](https://github.com/google-gemini/gemini-cli)
- [Codex CLI](https://github.com/openai/codex)

## Related Guides

- [Chat Tool](chat.md) - Direct model conversations
- [Planner Tool](planner.md) - Zen's native planning workflows
- [CodeReview Tool](codereview.md) - Structured code reviews
- [Context Revival](../context-revival.md) - Continuing conversations across tools
- [Advanced Usage](../advanced-usage.md) - Complex multi-tool workflows

```

--------------------------------------------------------------------------------
/systemprompts/precommit_prompt.py:
--------------------------------------------------------------------------------

```python
"""
Precommit tool system prompt
"""

PRECOMMIT_PROMPT = """
ROLE
You are an expert pre-commit reviewer and senior engineering partner,
conducting a pull-request style review as the final gatekeeper for
production code.
As a polyglot programming expert with an encyclopedic knowledge of design patterns,
anti-patterns, and language-specific idioms, your responsibility goes beyond
surface-level correctness to rigorous, predictive analysis. Your review must
assess whether the changes:
- Introduce patterns or decisions that may become future technical debt.
- Create brittle dependencies or tight coupling that will hinder maintenance.
- Omit critical validation, error handling, or test scaffolding that will
  cause future failures.
- Interact negatively with other parts of the codebase, even those not
  directly touched.

Your task is to perform rigorous mental static analysis, simulating how new
inputs and edge cases flow through the changed code to predict failures. Think
like an engineer responsible for this code months from now, debugging a
production incident.

In addition to reviewing correctness, completeness, and quality of the change,
apply long-term architectural thinking. Your feedback helps ensure this code
won't cause silent regressions, developer confusion, or downstream side effects
later.

CRITICAL LINE NUMBER INSTRUCTIONS
Code is presented with line number markers "LINE│ code". These markers are for
reference ONLY and MUST NOT be included in any code you generate.
Always reference specific line numbers in your replies to locate exact
positions. Include a very short code excerpt alongside each finding for clarity.
Never include "LINE│" markers in generated code snippets.

INPUTS PROVIDED
1. Git diff (staged or branch comparison)
2. Original request / acceptance criteria or context around what changed
3. File names and related code

SCOPE & FOCUS
- Review ONLY the changes in the diff and their immediate context.
- Reconstruct what changed, why it was changed, and what outcome it is supposed to deliver.
- Classify the diff (bug fix, improvement, new feature, refactor, etc.) and
confirm the implementation matches that intent.
- If the change is a bug fix, determine whether it addresses the root cause and
whether a materially safer or more maintainable fix was available.
- Evaluate whether the change achieves its stated goals without introducing
regressions, especially when new methods, public APIs, or behavioral fixes are
involved.
- Assess potential repercussions: downstream consumers, compatibility
contracts, documentation, dependencies, and operational impact.
- Anchor every observation in the provided request, commit message, tests, and
diff evidence; avoid speculation beyond available context.
- Surface any assumptions or missing context explicitly. If clarity is
impossible without more information, use the structured response to request it.
- Ensure the changes correctly implement the request and are secure, performant, and maintainable.
- Do not propose broad refactors or unrelated improvements. Stay strictly within the boundaries of the provided changes.

REVIEW PROCESS & MENTAL MODEL
1.  **Identify Context:** Note the tech stack, frameworks, and existing patterns.
2.  **Infer Intent & Change Type:** Determine what changed, why it changed, how
it is expected to behave, and categorize it (bug fix, feature, improvement,
refactor, etc.). Tie this back to the stated request, commit message, and
available tests so conclusions stay grounded; for bug fixes, confirm the root
cause is resolved and note if a materially better remedy exists.
3.  **Perform Deep Static Analysis of the Diff:**
    - **Verify Objectives:** Confirm the modifications actually deliver the
      intended behavior and align with the inferred goals.
    - **Trace Data Flow:** Follow variables and data structures through the
      new/modified logic.
    - **Simulate Edge Cases:** Mentally test with `null`/`nil`, empty
      collections, zero, negative numbers, and extremely large values.
    - **Assess Side Effects:** Consider the impact on callers, downstream
      consumers, and shared state (e.g., databases, caches).
4.  **Assess Ripple Effects:** Identify compatibility shifts, documentation
    impacts, regression risks, and untested surfaces introduced by the change.
5.  **Prioritize Issues:** Detect and rank issues by severity (CRITICAL → HIGH → MEDIUM → LOW).
6.  **Recommend Fixes:** Provide specific, actionable solutions for each issue.
7.  **Acknowledge Positives:** Reinforce sound patterns and well-executed code.
8.  **Avoid Over-engineering:** Do not suggest solutions that add unnecessary
    complexity for hypothetical future problems.

CORE ANALYSIS (Applied to the diff)
- **Security:** Does this change introduce injection risks, auth flaws, data
  exposure, or unsafe dependencies?
- **Bugs & Logic Errors:** Does this change introduce off-by-one errors, null
  dereferences, incorrect logic, or race conditions?
- **Performance:** Does this change introduce inefficient loops, blocking I/O on
  critical paths, or resource leaks?
- **Code Quality:** Does this change add unnecessary complexity, duplicate logic
  (DRY), or violate architectural principles (SOLID)?

ADDITIONAL ANALYSIS (only when relevant)
- Language/runtime concerns – memory management, concurrency, exception
  handling
    - Carefully assess the code's context and purpose before raising
      concurrency-related concerns. Confirm the presence of shared state, race
      conditions, or unsafe access patterns before flagging any issues to avoid
      false positives.
    - Also carefully evaluate concurrency and parallelism risks only after
      confirming that the code runs in an environment where such concerns are
      applicable. Avoid flagging issues unless shared state, asynchronous
      execution, or multi-threaded access are clearly possible based on
      context.
- System/integration – config handling, external calls, operational impact
- Testing – coverage gaps for new logic
    - If no tests are found in the project, do not flag test coverage as an issue unless the change introduces logic
      that is high-risk or complex.
    - In such cases, offer a low-severity suggestion encouraging basic tests, rather than marking it as a required fix.
- Change-specific pitfalls – unused new functions, partial enum updates, scope creep, risky deletions
- Determine if there are any new dependencies added but not declared, or new functionality added but not used
- Determine unintended side effects: could changes in file_A break module_B even if module_B wasn't changed?
- Flag changes unrelated to the original request that may introduce needless complexity or an anti-pattern
- Determine if there are code removal risks: was removed code truly dead, or could removal break functionality?
- Missing documentation around new methods / parameters, or missing comments around complex logic and code that
  requires it

OUTPUT FORMAT

### Repository Summary
**Repository:** /path/to/repo
- Files changed: X
- Overall assessment: brief statement with critical issue count

MANDATORY: You must ONLY respond in the following format. List issues by
severity and include ONLY the severities that apply:

[CRITICAL] Short title
- File: /absolute/path/to/file.py:line
- Description: what & why
- Fix: specific change (code snippet if helpful)

[HIGH] ...

[MEDIUM] ...

[LOW] ...

GIVE RECOMMENDATIONS:
Make a final, short, and focused statement or bullet list:
- Top priority fixes that MUST IMMEDIATELY be addressed before commit
- Notable positives to retain

Be thorough yet actionable. Focus on the diff, map every issue to a concrete
fix, and keep comments aligned with the stated implementation goals. Your goal
is to help flag anything that could potentially slip through and break
critical, production quality code.

STRUCTURED RESPONSES FOR SPECIAL CASES
To ensure predictable interactions, use the following JSON formats for specific
scenarios. Your entire response in these cases must be the JSON object and
nothing else.

1. IF MORE INFORMATION IS NEEDED
If you need additional context (e.g., related files, configuration,
dependencies) to provide a complete and accurate review, you MUST respond ONLY
with this JSON format (and nothing else). Do NOT ask for the same file you've
been provided unless its content is missing or incomplete:
{
  "status": "files_required_to_continue",
  "mandatory_instructions": "<your critical instructions for the agent>",
  "files_needed": ["[file name here]", "[or some folder/]"]
}

2. IF SCOPE TOO LARGE FOR FOCUSED REVIEW
If the codebase is too large or complex to review effectively in a single
response, you MUST request the agent to provide smaller, more focused subsets
for review. Respond ONLY with this JSON format (and nothing else):
{
  "status": "focused_review_required",
  "reason": "<brief explanation of why the scope is too large>",
  "suggestion": "<e.g., 'Review authentication module (auth.py, login.py)' or
  'Focus on data layer (models/)' or
  'Review payment processing functionality'>"
}
"""

```

--------------------------------------------------------------------------------
/simulator_tests/test_xai_models.py:
--------------------------------------------------------------------------------

```python
#!/usr/bin/env python3
"""
X.AI GROK Model Tests

Tests that verify X.AI GROK functionality including:
- Model alias resolution (grok, grok3, grokfast map to actual GROK models)
- GROK-3 and GROK-3-fast models work correctly
- Conversation continuity works with GROK models
- API integration and response validation
"""


from .base_test import BaseSimulatorTest


class XAIModelsTest(BaseSimulatorTest):
    """Test X.AI GROK model functionality and integration"""

    @property
    def test_name(self) -> str:
        return "xai_models"

    @property
    def test_description(self) -> str:
        return "X.AI GROK model functionality and integration"

    def run_test(self) -> bool:
        """Test X.AI GROK model functionality"""
        try:
            self.logger.info("Test: X.AI GROK model functionality and integration")

            # Check if X.AI API key is configured and not empty
            import os

            xai_key = os.environ.get("XAI_API_KEY", "")
            is_valid = bool(xai_key and xai_key != "your_xai_api_key_here" and xai_key.strip())

            if not is_valid:
                self.logger.info("  ⚠️  X.AI API key not configured or empty - skipping test")
                self.logger.info("  ℹ️  This test requires XAI_API_KEY to be set in .env with a valid key")
                return True  # Return True to indicate test is skipped, not failed

            # Setup test files for later use
            self.setup_test_files()

            # Test 1: 'grok' alias (should map to grok-4)
            self.logger.info("  1: Testing 'grok' alias (should map to grok-4)")

            response1, continuation_id = self.call_mcp_tool(
                "chat",
                {
                    "prompt": "Say 'Hello from GROK model!' and nothing else.",
                    "model": "grok",
                    "temperature": 0.1,
                },
            )

            if not response1:
                self.logger.error("  ❌ GROK alias test failed")
                return False

            self.logger.info("  ✅ GROK alias call completed")
            if continuation_id:
                self.logger.info(f"  ✅ Got continuation_id: {continuation_id}")

            # Test 2: Direct grok-3 model name
            self.logger.info("  2: Testing direct model name (grok-3)")

            response2, _ = self.call_mcp_tool(
                "chat",
                {
                    "prompt": "Say 'Hello from GROK-3!' and nothing else.",
                    "model": "grok-3",
                    "temperature": 0.1,
                },
            )

            if not response2:
                self.logger.error("  ❌ Direct GROK-3 model test failed")
                return False

            self.logger.info("  ✅ Direct GROK-3 model call completed")

            # Test 3: grok-3-fast model
            self.logger.info("  3: Testing GROK-3-fast model")

            response3, _ = self.call_mcp_tool(
                "chat",
                {
                    "prompt": "Say 'Hello from GROK-3-fast!' and nothing else.",
                    "model": "grok-3-fast",
                    "temperature": 0.1,
                },
            )

            if not response3:
                self.logger.error("  ❌ GROK-3-fast model test failed")
                return False

            self.logger.info("  ✅ GROK-3-fast model call completed")

            # Test 4: Shorthand aliases
            self.logger.info("  4: Testing shorthand aliases (grok3, grokfast)")

            response4, _ = self.call_mcp_tool(
                "chat",
                {
                    "prompt": "Say 'Hello from grok3 alias!' and nothing else.",
                    "model": "grok3",
                    "temperature": 0.1,
                },
            )

            if not response4:
                self.logger.error("  ❌ grok3 alias test failed")
                return False

            response5, _ = self.call_mcp_tool(
                "chat",
                {
                    "prompt": "Say 'Hello from grokfast alias!' and nothing else.",
                    "model": "grokfast",
                    "temperature": 0.1,
                },
            )

            if not response5:
                self.logger.error("  ❌ grokfast alias test failed")
                return False

            self.logger.info("  ✅ Shorthand aliases work correctly")

            # Test 5: Conversation continuity with GROK models
            self.logger.info("  5: Testing conversation continuity with GROK")

            response6, new_continuation_id = self.call_mcp_tool(
                "chat",
                {
                    "prompt": "Remember this number: 87. What number did I just tell you?",
                    "model": "grok",
                    "temperature": 0.1,
                },
            )

            if not response6 or not new_continuation_id:
                self.logger.error("  ❌ Failed to start conversation with continuation_id")
                return False

            # Continue the conversation
            response7, _ = self.call_mcp_tool(
                "chat",
                {
                    "prompt": "What was the number I told you earlier?",
                    "model": "grok",
                    "continuation_id": new_continuation_id,
                    "temperature": 0.1,
                },
            )

            if not response7:
                self.logger.error("  ❌ Failed to continue conversation")
                return False

            # Check if the model remembered the number
            if "87" in response7:
                self.logger.info("  ✅ Conversation continuity working with GROK")
            else:
                self.logger.warning("  ⚠️  Model may not have remembered the number")

            # Test 6: Validate X.AI API usage from logs
            self.logger.info("  6: Validating X.AI API usage in logs")
            logs = self.get_recent_server_logs()

            # Check for X.AI API calls
            xai_logs = [line for line in logs.split("\n") if "x.ai" in line.lower()]
            xai_api_logs = [line for line in logs.split("\n") if "api.x.ai" in line]
            grok_logs = [line for line in logs.split("\n") if "grok" in line.lower()]

            # Check for specific model resolution
            grok_resolution_logs = [
                line
                for line in logs.split("\n")
                if ("Resolved model" in line and "grok" in line.lower()) or ("grok" in line and "->" in line)
            ]

            # Check for X.AI provider usage
            xai_provider_logs = [line for line in logs.split("\n") if "XAI" in line or "X.AI" in line]

            # Log findings
            self.logger.info(f"   X.AI-related logs: {len(xai_logs)}")
            self.logger.info(f"   X.AI API logs: {len(xai_api_logs)}")
            self.logger.info(f"   GROK-related logs: {len(grok_logs)}")
            self.logger.info(f"   Model resolution logs: {len(grok_resolution_logs)}")
            self.logger.info(f"   X.AI provider logs: {len(xai_provider_logs)}")

            # Sample log output for debugging
            if self.verbose and xai_logs:
                self.logger.debug("  📋 Sample X.AI logs:")
                for log in xai_logs[:3]:
                    self.logger.debug(f"    {log}")

            if self.verbose and grok_logs:
                self.logger.debug("  📋 Sample GROK logs:")
                for log in grok_logs[:3]:
                    self.logger.debug(f"    {log}")

            # Success criteria
            grok_mentioned = len(grok_logs) > 0
            api_used = len(xai_api_logs) > 0 or len(xai_logs) > 0
            provider_used = len(xai_provider_logs) > 0

            success_criteria = [
                ("GROK models mentioned in logs", grok_mentioned),
                ("X.AI API calls made", api_used),
                ("X.AI provider used", provider_used),
                ("All model calls succeeded", True),  # We already checked this above
                ("Conversation continuity works", True),  # We already tested this
            ]

            passed_criteria = sum(1 for _, passed in success_criteria if passed)
            self.logger.info(f"   Success criteria met: {passed_criteria}/{len(success_criteria)}")

            for criterion, passed in success_criteria:
                status = "✅" if passed else "❌"
                self.logger.info(f"    {status} {criterion}")

            if passed_criteria >= 3:  # At least 3 out of 5 criteria
                self.logger.info("  ✅ X.AI GROK model tests passed")
                return True
            else:
                self.logger.error("  ❌ X.AI GROK model tests failed")
                return False

        except Exception as e:
            self.logger.error(f"X.AI GROK model test failed: {e}")
            return False
        finally:
            self.cleanup_test_files()


def main():
    """Run the X.AI GROK model tests"""
    import sys

    verbose = "--verbose" in sys.argv or "-v" in sys.argv
    test = XAIModelsTest(verbose=verbose)

    success = test.run_test()
    sys.exit(0 if success else 1)


if __name__ == "__main__":
    main()

```

--------------------------------------------------------------------------------
/simulator_tests/test_openrouter_fallback.py:
--------------------------------------------------------------------------------

```python
#!/usr/bin/env python3
"""
OpenRouter Fallback Test

Tests that verify the system correctly falls back to OpenRouter when:
- Only OPENROUTER_API_KEY is configured
- Native models (flash, pro) are requested but map to OpenRouter equivalents
- Auto mode correctly selects OpenRouter models
"""


from .base_test import BaseSimulatorTest


class OpenRouterFallbackTest(BaseSimulatorTest):
    """Test OpenRouter fallback behavior when it's the only provider"""

    @property
    def test_name(self) -> str:
        return "openrouter_fallback"

    @property
    def test_description(self) -> str:
        return "OpenRouter fallback behavior when only provider"

    def run_test(self) -> bool:
        """Test OpenRouter fallback behavior"""
        try:
            self.logger.info("Test: OpenRouter fallback behavior when only provider available")

            # Check if ONLY OpenRouter API key is configured (this is a fallback test)
            import os

            has_openrouter = bool(os.environ.get("OPENROUTER_API_KEY"))
            has_gemini = bool(os.environ.get("GEMINI_API_KEY"))
            has_openai = bool(os.environ.get("OPENAI_API_KEY"))

            if not has_openrouter:
                self.logger.info("  ⚠️  OpenRouter API key not configured - skipping test")
                self.logger.info("  ℹ️  This test requires OPENROUTER_API_KEY to be set in .env")
                return True  # Return True to indicate test is skipped, not failed

            if has_gemini or has_openai:
                self.logger.info("  ⚠️  Other API keys configured - this is not a fallback scenario")
                self.logger.info("  ℹ️  This test requires ONLY OpenRouter to be configured (no Gemini/OpenAI keys)")
                self.logger.info("  ℹ️  Current setup has multiple providers, so fallback behavior doesn't apply")
                return True  # Return True to indicate test is skipped, not failed

            # Setup test files
            self.setup_test_files()

            # Test 1: Auto mode should work with OpenRouter
            self.logger.info("  1: Testing auto mode with OpenRouter as only provider")

            response1, continuation_id = self.call_mcp_tool(
                "chat",
                {
                    "prompt": "What is 2 + 2? Give a brief answer.",
                    # No model specified - should use auto mode
                    "temperature": 0.1,
                },
            )

            if not response1:
                self.logger.error("  ❌ Auto mode with OpenRouter failed")
                return False

            self.logger.info("  ✅ Auto mode call completed with OpenRouter")

            # Test 2: Flash model should map to OpenRouter equivalent
            self.logger.info("  2: Testing flash model mapping to OpenRouter")

            # Use codereview tool to test a different tool type
            test_code = """def calculate_sum(numbers):
    total = 0
    for num in numbers:
        total += num
    return total"""

            test_file = self.create_additional_test_file("sum_function.py", test_code)

            response2, _ = self.call_mcp_tool(
                "codereview",
                {
                    "step": "Quick review of this sum function for quality and potential issues",
                    "step_number": 1,
                    "total_steps": 1,
                    "next_step_required": False,
                    "findings": "Starting code review of sum function",
                    "relevant_files": [test_file],
                    "model": "flash",
                    "temperature": 0.1,
                },
            )

            if not response2:
                self.logger.error("  ❌ Flash model mapping to OpenRouter failed")
                return False

            self.logger.info("  ✅ Flash model successfully mapped to OpenRouter")

            # Test 3: Pro model should map to OpenRouter equivalent
            self.logger.info("  3: Testing pro model mapping to OpenRouter")

            response3, _ = self.call_mcp_tool(
                "analyze",
                {
                    "step": "Analyze the structure of this Python code",
                    "step_number": 1,
                    "total_steps": 1,
                    "next_step_required": False,
                    "findings": "Starting code structure analysis",
                    "relevant_files": [self.test_files["python"]],
                    "model": "pro",
                    "temperature": 0.1,
                },
            )

            if not response3:
                self.logger.error("  ❌ Pro model mapping to OpenRouter failed")
                return False

            self.logger.info("  ✅ Pro model successfully mapped to OpenRouter")

            # Test 4: Debug tool with OpenRouter
            self.logger.info("  4: Testing debug tool with OpenRouter")

            response4, _ = self.call_mcp_tool(
                "debug",
                {
                    "step": "Why might a function return None instead of a value?",
                    "step_number": 1,
                    "total_steps": 1,
                    "next_step_required": False,
                    "findings": "Starting debug investigation of None return values",
                    "model": "flash",  # Should map to OpenRouter
                    "temperature": 0.1,
                },
            )

            if not response4:
                self.logger.error("  ❌ Debug tool with OpenRouter failed")
                return False

            self.logger.info("  ✅ Debug tool working with OpenRouter")

            # Test 5: Validate logs show OpenRouter is being used
            self.logger.info("  5: Validating OpenRouter is the active provider")
            logs = self.get_recent_server_logs()

            # Check for provider fallback logs
            fallback_logs = [
                line
                for line in logs.split("\n")
                if "No Gemini API key found" in line
                or "No OpenAI API key found" in line
                or "Only OpenRouter available" in line
                or "Using OpenRouter" in line
            ]

            # Check for OpenRouter provider initialization
            provider_logs = [
                line
                for line in logs.split("\n")
                if "OpenRouter provider" in line or "OpenRouterProvider" in line or "openrouter.ai/api/v1" in line
            ]

            # Check for model resolution through OpenRouter (accept both legacy and 2.5-style model ids)
            model_resolution_logs = [
                line
                for line in logs.split("\n")
                if ("Resolved model" in line and "via OpenRouter" in line)
                or ("Model alias" in line and "resolved to" in line)
                or "gemini-flash" in line
                or "gemini-2.5-flash" in line
                or "gemini-pro" in line
                or "gemini-2.5-pro" in line
            ]

            # Log findings
            self.logger.info(f"   Fallback indication logs: {len(fallback_logs)}")
            self.logger.info(f"   OpenRouter provider logs: {len(provider_logs)}")
            self.logger.info(f"   Model resolution logs: {len(model_resolution_logs)}")

            # Sample logs for debugging
            if self.verbose:
                if fallback_logs:
                    self.logger.debug("  📋 Sample fallback logs:")
                    for log in fallback_logs[:3]:
                        self.logger.debug(f"    {log}")

                if provider_logs:
                    self.logger.debug("  📋 Sample provider logs:")
                    for log in provider_logs[:3]:
                        self.logger.debug(f"    {log}")

            # Success criteria
            openrouter_active = len(provider_logs) > 0
            models_resolved = len(model_resolution_logs) > 0
            all_tools_worked = True  # We checked this above

            success_criteria = [
                ("OpenRouter provider active", openrouter_active),
                ("Models resolved through OpenRouter", models_resolved),
                ("All tools worked with OpenRouter", all_tools_worked),
            ]

            passed_criteria = sum(1 for _, passed in success_criteria if passed)
            self.logger.info(f"   Success criteria met: {passed_criteria}/{len(success_criteria)}")

            for criterion, passed in success_criteria:
                status = "✅" if passed else "❌"
                self.logger.info(f"    {status} {criterion}")

            if passed_criteria >= 2:  # At least 2 out of 3 criteria
                self.logger.info("  ✅ OpenRouter fallback test passed")
                return True
            else:
                self.logger.error("  ❌ OpenRouter fallback test failed")
                return False

        except Exception as e:
            self.logger.error(f"OpenRouter fallback test failed: {e}")
            return False
        finally:
            self.cleanup_test_files()


def main():
    """Run the OpenRouter fallback tests"""
    import sys

    verbose = "--verbose" in sys.argv or "-v" in sys.argv
    test = OpenRouterFallbackTest(verbose=verbose)

    success = test.run_test()
    sys.exit(0 if success else 1)


if __name__ == "__main__":
    main()

```

--------------------------------------------------------------------------------
/simulator_tests/test_openrouter_models.py:
--------------------------------------------------------------------------------

```python
#!/usr/bin/env python3
"""
OpenRouter Model Tests

Tests that verify OpenRouter functionality including:
- Model alias resolution (flash, pro, o3, etc. map to OpenRouter equivalents)
- Multiple OpenRouter models work correctly
- Conversation continuity works with OpenRouter models
- Error handling when models are not available
"""


from .base_test import BaseSimulatorTest


class OpenRouterModelsTest(BaseSimulatorTest):
    """Test OpenRouter model functionality and alias mapping"""

    @property
    def test_name(self) -> str:
        return "openrouter_models"

    @property
    def test_description(self) -> str:
        return "OpenRouter model functionality and alias mapping"

    def run_test(self) -> bool:
        """Test OpenRouter model functionality"""
        try:
            self.logger.info("Test: OpenRouter model functionality and alias mapping")

            # Check if OpenRouter API key is configured
            import os

            has_openrouter = bool(os.environ.get("OPENROUTER_API_KEY"))

            if not has_openrouter:
                self.logger.info("  ⚠️  OpenRouter API key not configured - skipping test")
                self.logger.info("  ℹ️  This test requires OPENROUTER_API_KEY to be set in .env")
                return True  # Return True to indicate test is skipped, not failed

            # Setup test files for later use
            self.setup_test_files()

            # Test 1: Flash alias mapping to OpenRouter
            self.logger.info("  1: Testing 'flash' alias (should map to google/gemini-2.5-flash)")

            response1, continuation_id = self.call_mcp_tool(
                "chat",
                {
                    "prompt": "Say 'Hello from Flash model!' and nothing else.",
                    "model": "flash",
                    "temperature": 0.1,
                },
            )

            if not response1:
                self.logger.error("  ❌ Flash alias test failed")
                return False

            self.logger.info("  ✅ Flash alias call completed")
            if continuation_id:
                self.logger.info(f"  ✅ Got continuation_id: {continuation_id}")

            # Test 2: Pro alias mapping to OpenRouter
            self.logger.info("  2: Testing 'pro' alias (should map to google/gemini-2.5-pro)")

            response2, _ = self.call_mcp_tool(
                "chat",
                {
                    "prompt": "Say 'Hello from Pro model!' and nothing else.",
                    "model": "pro",
                    "temperature": 0.1,
                },
            )

            if not response2:
                self.logger.error("  ❌ Pro alias test failed")
                return False

            self.logger.info("  ✅ Pro alias call completed")

            # Test 3: O3 alias mapping to OpenRouter (should map to openai/gpt-4o)
            self.logger.info("  3: Testing 'o3' alias (should map to openai/gpt-4o)")

            response3, _ = self.call_mcp_tool(
                "chat",
                {
                    "prompt": "Say 'Hello from O3 model!' and nothing else.",
                    "model": "o3",
                    "temperature": 0.1,
                },
            )

            if not response3:
                self.logger.error("  ❌ O3 alias test failed")
                return False

            self.logger.info("  ✅ O3 alias call completed")

            # Test 4: Direct OpenRouter model name
            self.logger.info("  4: Testing direct OpenRouter model name (anthropic/claude-3-haiku)")

            response4, _ = self.call_mcp_tool(
                "chat",
                {
                    "prompt": "Say 'Hello from Claude Haiku!' and nothing else.",
                    "model": "anthropic/claude-3-haiku",
                    "temperature": 0.1,
                },
            )

            if not response4:
                self.logger.error("  ❌ Direct OpenRouter model test failed")
                return False

            self.logger.info("  ✅ Direct OpenRouter model call completed")

            # Test 5: OpenRouter alias from config
            self.logger.info("  5: Testing OpenRouter alias from config ('opus' -> anthropic/claude-opus-4)")

            response5, _ = self.call_mcp_tool(
                "chat",
                {
                    "prompt": "Say 'Hello from Opus!' and nothing else.",
                    "model": "opus",
                    "temperature": 0.1,
                },
            )

            if not response5:
                self.logger.error("  ❌ OpenRouter alias test failed")
                return False

            self.logger.info("  ✅ OpenRouter alias call completed")

            # Test 6: Conversation continuity with OpenRouter models
            self.logger.info("  6: Testing conversation continuity with OpenRouter")

            response6, new_continuation_id = self.call_mcp_tool(
                "chat",
                {
                    "prompt": "Remember this number: 42. What number did I just tell you?",
                    "model": "sonnet",  # Claude Sonnet via OpenRouter
                    "temperature": 0.1,
                },
            )

            if not response6 or not new_continuation_id:
                self.logger.error("  ❌ Failed to start conversation with continuation_id")
                return False

            # Continue the conversation
            response7, _ = self.call_mcp_tool(
                "chat",
                {
                    "prompt": "What was the number I told you earlier?",
                    "model": "sonnet",
                    "continuation_id": new_continuation_id,
                    "temperature": 0.1,
                },
            )

            if not response7:
                self.logger.error("  ❌ Failed to continue conversation")
                return False

            # Check if the model remembered the number
            if "42" in response7:
                self.logger.info("  ✅ Conversation continuity working with OpenRouter")
            else:
                self.logger.warning("  ⚠️  Model may not have remembered the number")

            # Test 7: Validate OpenRouter API usage from logs
            self.logger.info("  7: Validating OpenRouter API usage in logs")
            logs = self.get_recent_server_logs()

            # Check for OpenRouter API calls
            openrouter_logs = [line for line in logs.split("\n") if "openrouter" in line.lower()]
            openrouter_api_logs = [line for line in logs.split("\n") if "openrouter.ai/api/v1" in line]

            # Check for specific model mappings (accept both legacy and 2.5-style model ids,
            # consistent with the mappings asserted in Tests 1 and 2 above)
            flash_mapping_logs = [
                line
                for line in logs.split("\n")
                if "google/gemini-flash" in line or "google/gemini-2.5-flash" in line
            ]

            pro_mapping_logs = [
                line
                for line in logs.split("\n")
                if "google/gemini-pro" in line or "google/gemini-2.5-pro" in line
            ]

            # Log findings
            self.logger.info(f"   OpenRouter-related logs: {len(openrouter_logs)}")
            self.logger.info(f"   OpenRouter API logs: {len(openrouter_api_logs)}")
            self.logger.info(f"   Flash mapping logs: {len(flash_mapping_logs)}")
            self.logger.info(f"   Pro mapping logs: {len(pro_mapping_logs)}")

            # Sample log output for debugging
            if self.verbose and openrouter_logs:
                self.logger.debug("  📋 Sample OpenRouter logs:")
                for log in openrouter_logs[:5]:
                    self.logger.debug(f"    {log}")

            # Success criteria
            openrouter_api_used = len(openrouter_api_logs) > 0
            models_mapped = len(flash_mapping_logs) > 0 or len(pro_mapping_logs) > 0

            success_criteria = [
                ("OpenRouter API calls made", openrouter_api_used),
                ("Model aliases mapped correctly", models_mapped),
                ("All model calls succeeded", True),  # We already checked this above
            ]

            passed_criteria = sum(1 for _, passed in success_criteria if passed)
            self.logger.info(f"   Success criteria met: {passed_criteria}/{len(success_criteria)}")

            for criterion, passed in success_criteria:
                status = "✅" if passed else "❌"
                self.logger.info(f"    {status} {criterion}")

            if passed_criteria >= 2:  # At least 2 out of 3 criteria
                self.logger.info("  ✅ OpenRouter model tests passed")
                return True
            else:
                self.logger.error("  ❌ OpenRouter model tests failed")
                return False

        except Exception as e:
            self.logger.error(f"OpenRouter model test failed: {e}")
            return False
        finally:
            self.cleanup_test_files()


def main():
    """Run the OpenRouter model tests"""
    import sys

    verbose = "--verbose" in sys.argv or "-v" in sys.argv
    test = OpenRouterModelsTest(verbose=verbose)

    success = test.run_test()
    sys.exit(0 if success else 1)


if __name__ == "__main__":
    main()

```

--------------------------------------------------------------------------------
/simulator_tests/test_cross_tool_continuation.py:
--------------------------------------------------------------------------------

```python
#!/usr/bin/env python3
"""
Cross-Tool Continuation Test

Tests comprehensive cross-tool continuation scenarios to ensure
conversation context is maintained when switching between different tools.
"""

from .conversation_base_test import ConversationBaseTest


class CrossToolContinuationTest(ConversationBaseTest):
    """Test comprehensive cross-tool continuation scenarios"""

    @property
    def test_name(self) -> str:
        return "cross_tool_continuation"

    @property
    def test_description(self) -> str:
        return "Cross-tool conversation continuation scenarios"

    def run_test(self) -> bool:
        """Test comprehensive cross-tool continuation scenarios"""
        try:
            self.logger.info("🔧 Test: Cross-tool continuation scenarios")

            # Setup test environment for conversation testing
            self.setUp()

            success_count = 0
            total_scenarios = 3

            # Scenario 1: chat -> thinkdeep -> codereview
            if self._test_chat_thinkdeep_codereview():
                success_count += 1

            # Scenario 2: analyze -> debug -> thinkdeep
            if self._test_analyze_debug_thinkdeep():
                success_count += 1

            # Scenario 3: Multi-file cross-tool continuation
            if self._test_multi_file_continuation():
                success_count += 1

            self.logger.info(
                f"  ✅ Cross-tool continuation scenarios completed: {success_count}/{total_scenarios} scenarios passed"
            )

            # Consider successful if at least one scenario worked
            return success_count > 0

        except Exception as e:
            self.logger.error(f"Cross-tool continuation test failed: {e}")
            return False
        finally:
            self.cleanup_test_files()

    def _test_chat_thinkdeep_codereview(self) -> bool:
        """Test chat -> thinkdeep -> codereview scenario"""
        try:
            self.logger.info("  1: Testing chat -> thinkdeep -> codereview")

            # Start with chat
            chat_response, chat_id = self.call_mcp_tool(
                "chat",
                {
                    "prompt": "Please use low thinking mode. Look at this Python code and tell me what you think about it",
                    "absolute_file_paths": [self.test_files["python"]],
                    "model": "flash",
                },
            )

            if not chat_response or not chat_id:
                self.logger.error("Failed to start chat conversation")
                return False

            # Continue with thinkdeep
            thinkdeep_response, _ = self.call_mcp_tool(
                "thinkdeep",
                {
                    "step": "Think deeply about potential performance issues in this code. Please use low thinking mode.",
                    "step_number": 1,
                    "total_steps": 1,
                    "next_step_required": False,
                    "findings": "Building on previous chat analysis to examine performance issues",
                    "relevant_files": [self.test_files["python"]],  # Same file should be deduplicated
                    "continuation_id": chat_id,
                    "model": "flash",
                },
            )

            if not thinkdeep_response:
                self.logger.error("Failed chat -> thinkdeep continuation")
                return False

            # Continue with codereview
            codereview_response, _ = self.call_mcp_tool(
                "codereview",
                {
                    "step": "Building on our previous analysis, provide a comprehensive code review",
                    "step_number": 1,
                    "total_steps": 1,
                    "next_step_required": False,
                    "findings": "Continuing from previous chat and thinkdeep analysis for comprehensive review",
                    "relevant_files": [self.test_files["python"]],  # Same file should be deduplicated
                    "continuation_id": chat_id,
                    "model": "flash",
                },
            )

            if not codereview_response:
                self.logger.error("Failed thinkdeep -> codereview continuation")
                return False

            self.logger.info("  ✅ chat -> thinkdeep -> codereview working")
            return True

        except Exception as e:
            self.logger.error(f"Chat -> thinkdeep -> codereview scenario failed: {e}")
            return False

    def _test_analyze_debug_thinkdeep(self) -> bool:
        """Test analyze -> debug -> thinkdeep scenario"""
        try:
            self.logger.info("  2: Testing analyze -> debug -> thinkdeep")

            # Start with analyze
            analyze_response, analyze_id = self.call_mcp_tool(
                "analyze",
                {
                    "step": "Analyze this code for quality and performance issues",
                    "step_number": 1,
                    "total_steps": 1,
                    "next_step_required": False,
                    "findings": "Starting analysis of Python code for quality and performance issues",
                    "relevant_files": [self.test_files["python"]],
                    "model": "flash",
                },
            )

            if not analyze_response or not analyze_id:
                self.logger.warning("Failed to start analyze conversation, skipping scenario 2")
                return False

            # Continue with debug
            debug_response, _ = self.call_mcp_tool(
                "debug",
                {
                    "step": "Based on our analysis, help debug the performance issue in fibonacci",
                    "step_number": 1,
                    "total_steps": 1,
                    "next_step_required": False,
                    "findings": "Building on previous analysis to debug specific performance issue",
                    "relevant_files": [self.test_files["python"]],  # Same file should be deduplicated
                    "continuation_id": analyze_id,
                    "model": "flash",
                },
            )

            if not debug_response:
                self.logger.warning("  ⚠️ analyze -> debug continuation failed")
                return False

            # Continue with thinkdeep
            final_response, _ = self.call_mcp_tool(
                "thinkdeep",
                {
                    "step": "Think deeply about the architectural implications of the issues we've found. Please use low thinking mode.",
                    "step_number": 1,
                    "total_steps": 1,
                    "next_step_required": False,
                    "findings": "Building on analysis and debug findings to explore architectural implications",
                    "relevant_files": [self.test_files["python"]],  # Same file should be deduplicated
                    "continuation_id": analyze_id,
                    "model": "flash",
                },
            )

            if not final_response:
                self.logger.warning("  ⚠️ debug -> thinkdeep continuation failed")
                return False

            self.logger.info("  ✅ analyze -> debug -> thinkdeep working")
            return True

        except Exception as e:
            self.logger.error(f"Analyze -> debug -> thinkdeep scenario failed: {e}")
            return False

    def _test_multi_file_continuation(self) -> bool:
        """Test multi-file cross-tool continuation"""
        try:
            self.logger.info("  3: Testing multi-file cross-tool continuation")

            # Start with both files
            multi_response, multi_id = self.call_mcp_tool(
                "chat",
                {
                    "prompt": "Please use low thinking mode. Analyze both the Python code and configuration file",
                    "absolute_file_paths": [self.test_files["python"], self.test_files["config"]],
                    "model": "flash",
                },
            )

            if not multi_response or not multi_id:
                self.logger.warning("Failed to start multi-file conversation, skipping scenario 3")
                return False

            # Switch to codereview with same files (should use conversation history)
            multi_review, _ = self.call_mcp_tool(
                "codereview",
                {
                    "step": "Review both files in the context of our previous discussion",
                    "step_number": 1,
                    "total_steps": 1,
                    "next_step_required": False,
                    "findings": "Continuing multi-file analysis with code review perspective",
                    "relevant_files": [self.test_files["python"], self.test_files["config"]],  # Same files
                    "continuation_id": multi_id,
                    "model": "flash",
                },
            )

            if not multi_review:
                self.logger.warning("  ⚠️ Multi-file cross-tool continuation failed")
                return False

            self.logger.info("  ✅ Multi-file cross-tool continuation working")
            return True

        except Exception as e:
            self.logger.error(f"Multi-file continuation scenario failed: {e}")
            return False

```

--------------------------------------------------------------------------------
/providers/registries/base.py:
--------------------------------------------------------------------------------

```python
"""Shared infrastructure for JSON-backed model registries."""

from __future__ import annotations

import importlib.resources
import json
import logging
from collections.abc import Iterable
from dataclasses import fields
from pathlib import Path

from utils.env import get_env
from utils.file_utils import read_json_file

from ..shared import ModelCapabilities, ProviderType, TemperatureConstraint

logger = logging.getLogger(__name__)


CAPABILITY_FIELD_NAMES = {field.name for field in fields(ModelCapabilities)}


class CustomModelRegistryBase:
    """Load and expose capability metadata from a JSON manifest."""

    def __init__(
        self,
        *,
        env_var_name: str,
        default_filename: str,
        config_path: str | None = None,
    ) -> None:
        self._env_var_name = env_var_name
        self._default_filename = default_filename
        self._use_resources = False
        self._resource_package = "conf"
        self._default_path = Path(__file__).resolve().parents[3] / "conf" / default_filename

        if config_path:
            self.config_path = Path(config_path)
        else:
            env_path = get_env(env_var_name)
            if env_path:
                self.config_path = Path(env_path)
            else:
                try:
                    resource = importlib.resources.files(self._resource_package).joinpath(default_filename)
                    if hasattr(resource, "read_text"):
                        self._use_resources = True
                        self.config_path = None
                    else:
                        raise AttributeError("resource accessor not available")
                except Exception:
                    self.config_path = Path(__file__).resolve().parents[3] / "conf" / default_filename

        self.alias_map: dict[str, str] = {}
        self.model_map: dict[str, ModelCapabilities] = {}
        self._extras: dict[str, dict] = {}

    def reload(self) -> None:
        data = self._load_config_data()
        configs = [config for config in self._parse_models(data) if config is not None]
        self._build_maps(configs)

    def list_models(self) -> list[str]:
        return list(self.model_map.keys())

    def list_aliases(self) -> list[str]:
        return list(self.alias_map.keys())

    def resolve(self, name_or_alias: str) -> ModelCapabilities | None:
        key = name_or_alias.lower()
        canonical = self.alias_map.get(key)
        if canonical:
            return self.model_map.get(canonical)

        for model_name in self.model_map:
            if model_name.lower() == key:
                return self.model_map[model_name]
        return None

    def get_capabilities(self, name_or_alias: str) -> ModelCapabilities | None:
        return self.resolve(name_or_alias)

    def get_entry(self, model_name: str) -> dict | None:
        return self._extras.get(model_name)

    def get_model_config(self, model_name: str) -> ModelCapabilities | None:
        """Backwards-compatible accessor for registries expecting this helper."""

        return self.model_map.get(model_name) or self.resolve(model_name)

    def iter_entries(self) -> Iterable[tuple[str, ModelCapabilities, dict]]:
        for model_name, capability in self.model_map.items():
            yield model_name, capability, self._extras.get(model_name, {})

    # ------------------------------------------------------------------
    # Internal helpers
    # ------------------------------------------------------------------
    def _load_config_data(self) -> dict:
        if self._use_resources:
            try:
                resource = importlib.resources.files(self._resource_package).joinpath(self._default_filename)
                if hasattr(resource, "read_text"):
                    config_text = resource.read_text(encoding="utf-8")
                else:  # pragma: no cover - legacy Python fallback
                    with resource.open("r", encoding="utf-8") as handle:
                        config_text = handle.read()
                data = json.loads(config_text)
            except FileNotFoundError:
                logger.debug("Packaged %s not found", self._default_filename)
                return {"models": []}
            except Exception as exc:
                logger.warning("Failed to read packaged %s: %s", self._default_filename, exc)
                return {"models": []}
            return data or {"models": []}

        if not self.config_path:
            raise FileNotFoundError("Registry configuration path is not set")

        if not self.config_path.exists():
            logger.debug("Model registry config not found at %s", self.config_path)
            if self.config_path == self._default_path:
                fallback = Path.cwd() / "conf" / self._default_filename
                if fallback != self.config_path and fallback.exists():
                    logger.debug("Falling back to %s", fallback)
                    self.config_path = fallback
                else:
                    return {"models": []}
            else:
                return {"models": []}

        data = read_json_file(str(self.config_path))
        return data or {"models": []}

    @property
    def use_resources(self) -> bool:
        return self._use_resources

    def _parse_models(self, data: dict) -> Iterable[ModelCapabilities | None]:
        for raw in data.get("models", []):
            if not isinstance(raw, dict):
                continue
            yield self._convert_entry(raw)

    def _convert_entry(self, raw: dict) -> ModelCapabilities | None:
        entry = dict(raw)
        model_name = entry.get("model_name")
        if not model_name:
            return None

        aliases = entry.get("aliases")
        if isinstance(aliases, str):
            entry["aliases"] = [alias.strip() for alias in aliases.split(",") if alias.strip()]

        entry.setdefault("friendly_name", self._default_friendly_name(model_name))

        temperature_hint = entry.get("temperature_constraint")
        if isinstance(temperature_hint, str):
            entry["temperature_constraint"] = TemperatureConstraint.create(temperature_hint)
        elif temperature_hint is None:
            entry["temperature_constraint"] = TemperatureConstraint.create("range")

        if "max_tokens" in entry:
            raise ValueError(
                "`max_tokens` is no longer supported. Use `max_output_tokens` in your model configuration."
            )

        unknown_keys = set(entry.keys()) - CAPABILITY_FIELD_NAMES - self._extra_keys()
        if unknown_keys:
            raise ValueError("Unsupported fields in model configuration: " + ", ".join(sorted(unknown_keys)))

        capability, extras = self._finalise_entry(entry)
        capability.provider = self._provider_default()
        self._extras[capability.model_name] = extras or {}
        return capability

    def _default_friendly_name(self, model_name: str) -> str:
        return model_name

    def _extra_keys(self) -> set[str]:
        return set()

    def _provider_default(self) -> ProviderType:
        return ProviderType.OPENROUTER

    def _finalise_entry(self, entry: dict) -> tuple[ModelCapabilities, dict]:
        return ModelCapabilities(**{k: v for k, v in entry.items() if k in CAPABILITY_FIELD_NAMES}), {}

    def _build_maps(self, configs: Iterable[ModelCapabilities]) -> None:
        alias_map: dict[str, str] = {}
        model_map: dict[str, ModelCapabilities] = {}

        for config in configs:
            if not config:
                continue
            model_map[config.model_name] = config

            model_name_lower = config.model_name.lower()
            if model_name_lower not in alias_map:
                alias_map[model_name_lower] = config.model_name

            for alias in config.aliases:
                alias_lower = alias.lower()
                if alias_lower in alias_map and alias_map[alias_lower] != config.model_name:
                    raise ValueError(
                        f"Duplicate alias '{alias}' found for models '{alias_map[alias_lower]}' and '{config.model_name}'"
                    )
                alias_map[alias_lower] = config.model_name

        self.alias_map = alias_map
        self.model_map = model_map


class CapabilityModelRegistry(CustomModelRegistryBase):
    """Registry that returns :class:`ModelCapabilities` objects with alias support."""

    def __init__(
        self,
        *,
        env_var_name: str,
        default_filename: str,
        provider: ProviderType,
        friendly_prefix: str,
        config_path: str | None = None,
    ) -> None:
        self._provider = provider
        self._friendly_prefix = friendly_prefix
        super().__init__(
            env_var_name=env_var_name,
            default_filename=default_filename,
            config_path=config_path,
        )
        self.reload()

    def _provider_default(self) -> ProviderType:
        return self._provider

    def _default_friendly_name(self, model_name: str) -> str:
        return self._friendly_prefix.format(model=model_name)

    def _finalise_entry(self, entry: dict) -> tuple[ModelCapabilities, dict]:
        filtered = {k: v for k, v in entry.items() if k in CAPABILITY_FIELD_NAMES}
        filtered.setdefault("provider", self._provider_default())
        capability = ModelCapabilities(**filtered)
        return capability, {}

```
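
For reference, a minimal usage sketch (not part of the repository) showing how a `CapabilityModelRegistry` resolves aliases from a JSON manifest; the environment variable name and manifest filename below are hypothetical:

```python
from providers.registries.base import CapabilityModelRegistry
from providers.shared import ProviderType

# Hypothetical manifest conf/example_models.json with entries shaped like the
# ones in conf/openai_models.json (model_name, aliases, context_window, ...).
registry = CapabilityModelRegistry(
    env_var_name="EXAMPLE_MODELS_CONFIG_PATH",  # hypothetical override variable
    default_filename="example_models.json",
    provider=ProviderType.OPENROUTER,
    friendly_prefix="Example ({model})",
)

# Lookups are case-insensitive and work for both canonical names and aliases.
caps = registry.resolve("GPT5")
if caps:
    print(caps.model_name, caps.context_window)
```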

--------------------------------------------------------------------------------
/conf/openai_models.json:
--------------------------------------------------------------------------------

```json
{
  "_README": {
    "description": "Model metadata for native OpenAI API access.",
    "documentation": "https://github.com/BeehiveInnovations/zen-mcp-server/blob/main/docs/custom_models.md",
    "usage": "Models listed here are exposed directly through the OpenAI provider. Aliases are case-insensitive.",
    "field_notes": "Matches providers/shared/model_capabilities.py.",
    "field_descriptions": {
      "model_name": "The model identifier (e.g., 'gpt-5', 'o3-pro')",
      "aliases": "Array of short names users can type instead of the full model name",
      "context_window": "Total number of tokens the model can process (input + output combined)",
      "max_output_tokens": "Maximum number of tokens the model can generate in a single response",
      "max_thinking_tokens": "Maximum reasoning/thinking tokens the model will allocate when extended thinking is requested",
      "supports_extended_thinking": "Whether the model supports extended reasoning tokens (currently none do via OpenRouter or custom APIs)",
      "supports_json_mode": "Whether the model can guarantee valid JSON output",
      "supports_function_calling": "Whether the model supports function/tool calling",
      "supports_images": "Whether the model can process images/visual input",
      "max_image_size_mb": "Maximum total size in MB for all images combined (capped at 40MB max for custom models)",
      "supports_temperature": "Whether the model accepts temperature parameter in API calls (set to false for O3/O4 reasoning models)",
      "temperature_constraint": "Type of temperature constraint: 'fixed' (fixed value), 'range' (continuous range), 'discrete' (specific values), or omit for default range",
      "use_openai_response_api": "Set to true when the model must use the /responses endpoint (reasoning models like GPT-5 Pro). Leave false/omit for standard chat completions.",
      "default_reasoning_effort": "Default reasoning effort level for models that support it (e.g., 'low', 'medium', 'high'). Omit if not applicable.",
      "description": "Human-readable description of the model",
      "intelligence_score": "1-20 human rating used as the primary signal for auto-mode model ordering",
      "allow_code_generation": "Whether this model can generate and suggest fully working code - complete with functions, files, and detailed implementation instructions - for your AI tool to use right away. Only set this to 'true' for a model more capable than the AI model / CLI you're currently using."
    }
  },
  "models": [
    {
      "model_name": "gpt-5",
      "friendly_name": "OpenAI (GPT-5)",
      "aliases": [
        "gpt5",
        "gpt-5"
      ],
      "intelligence_score": 16,
      "description": "GPT-5 (400K context, 128K output) - Advanced model with reasoning support",
      "context_window": 400000,
      "max_output_tokens": 128000,
      "supports_extended_thinking": true,
      "supports_system_prompts": true,
      "supports_streaming": false,
      "supports_function_calling": true,
      "supports_json_mode": true,
      "supports_images": true,
      "supports_temperature": true,
      "max_image_size_mb": 20.0,
      "temperature_constraint": "fixed"
    },
    {
      "model_name": "gpt-5-pro",
      "friendly_name": "OpenAI (GPT-5 Pro)",
      "aliases": [
        "gpt5pro",
        "gpt5-pro"
      ],
      "intelligence_score": 18,
      "description": "GPT-5 Pro (400K context, 272K output) - Very advanced, reasoning model",
      "context_window": 400000,
      "max_output_tokens": 272000,
      "supports_extended_thinking": true,
      "supports_system_prompts": true,
      "supports_streaming": false,
      "supports_function_calling": true,
      "supports_json_mode": true,
      "supports_images": true,
      "supports_temperature": true,
      "max_image_size_mb": 20.0,
      "use_openai_response_api": true,
      "default_reasoning_effort": "high",
      "allow_code_generation": true,
      "temperature_constraint": "fixed"
    },
    {
      "model_name": "gpt-5-mini",
      "friendly_name": "OpenAI (GPT-5-mini)",
      "aliases": [
        "gpt5-mini",
        "gpt5mini",
        "mini"
      ],
      "intelligence_score": 15,
      "description": "GPT-5-mini (400K context, 128K output) - Efficient variant with reasoning support",
      "context_window": 400000,
      "max_output_tokens": 128000,
      "supports_extended_thinking": true,
      "supports_system_prompts": true,
      "supports_streaming": false,
      "supports_function_calling": true,
      "supports_json_mode": true,
      "supports_images": true,
      "supports_temperature": true,
      "max_image_size_mb": 20.0,
      "temperature_constraint": "fixed"
    },
    {
      "model_name": "gpt-5-nano",
      "friendly_name": "OpenAI (GPT-5 nano)",
      "aliases": [
        "gpt5nano",
        "gpt5-nano",
        "nano"
      ],
      "intelligence_score": 13,
      "description": "GPT-5 nano (400K context) - Fastest, cheapest version of GPT-5 for summarization and classification tasks",
      "context_window": 400000,
      "max_output_tokens": 128000,
      "supports_extended_thinking": true,
      "supports_system_prompts": true,
      "supports_streaming": true,
      "supports_function_calling": true,
      "supports_json_mode": true,
      "supports_images": true,
      "supports_temperature": true,
      "max_image_size_mb": 20.0,
      "temperature_constraint": "fixed"
    },
    {
      "model_name": "o3",
      "friendly_name": "OpenAI (O3)",
      "intelligence_score": 14,
      "description": "Strong reasoning (200K context) - Logical problems, code generation, systematic analysis",
      "context_window": 200000,
      "max_output_tokens": 65536,
      "supports_extended_thinking": false,
      "supports_system_prompts": true,
      "supports_streaming": true,
      "supports_function_calling": true,
      "supports_json_mode": true,
      "supports_images": true,
      "supports_temperature": false,
      "max_image_size_mb": 20.0,
      "temperature_constraint": "fixed"
    },
    {
      "model_name": "o3-mini",
      "friendly_name": "OpenAI (O3-mini)",
      "aliases": [
        "o3mini"
      ],
      "intelligence_score": 12,
      "description": "Fast O3 variant (200K context) - Balanced performance/speed, moderate complexity",
      "context_window": 200000,
      "max_output_tokens": 65536,
      "supports_extended_thinking": false,
      "supports_system_prompts": true,
      "supports_streaming": true,
      "supports_function_calling": true,
      "supports_json_mode": true,
      "supports_images": true,
      "supports_temperature": false,
      "max_image_size_mb": 20.0,
      "temperature_constraint": "fixed"
    },
    {
      "model_name": "o3-pro",
      "friendly_name": "OpenAI (O3-Pro)",
      "aliases": [
        "o3pro"
      ],
      "intelligence_score": 15,
      "description": "Professional-grade reasoning with advanced capabilities (200K context)",
      "context_window": 200000,
      "max_output_tokens": 65536,
      "supports_extended_thinking": false,
      "supports_system_prompts": true,
      "supports_streaming": true,
      "supports_function_calling": true,
      "supports_json_mode": true,
      "supports_images": true,
      "supports_temperature": false,
      "max_image_size_mb": 20.0,
      "use_openai_response_api": true,
      "temperature_constraint": "fixed"
    },
    {
      "model_name": "o4-mini",
      "friendly_name": "OpenAI (O4-mini)",
      "aliases": [
        "o4mini"
      ],
      "intelligence_score": 11,
      "description": "Latest reasoning model (200K context) - Optimized for shorter contexts, rapid reasoning",
      "context_window": 200000,
      "supports_extended_thinking": false,
      "supports_system_prompts": true,
      "supports_streaming": true,
      "supports_function_calling": true,
      "supports_json_mode": true,
      "supports_images": true,
      "supports_temperature": false,
      "max_image_size_mb": 20.0,
      "temperature_constraint": "fixed"
    },
    {
      "model_name": "gpt-4.1",
      "friendly_name": "OpenAI (GPT 4.1)",
      "aliases": [
        "gpt4.1"
      ],
      "intelligence_score": 13,
      "description": "GPT-4.1 (1M context) - Advanced reasoning model with large context window",
      "context_window": 1000000,
      "max_output_tokens": 32768,
      "supports_extended_thinking": false,
      "supports_system_prompts": true,
      "supports_streaming": true,
      "supports_function_calling": true,
      "supports_json_mode": true,
      "supports_images": true,
      "supports_temperature": true,
      "max_image_size_mb": 20.0
    },
    {
      "model_name": "gpt-5-codex",
      "friendly_name": "OpenAI (GPT-5 Codex)",
      "aliases": [
        "gpt5-codex",
        "codex",
        "gpt-5-code",
        "gpt5-code"
      ],
      "intelligence_score": 17,
      "description": "GPT-5 Codex (400K context) Specialized for coding, refactoring, and software architecture.",
      "context_window": 400000,
      "max_output_tokens": 128000,
      "supports_extended_thinking": true,
      "supports_system_prompts": true,
      "supports_streaming": true,
      "supports_function_calling": true,
      "supports_json_mode": true,
      "supports_images": true,
      "supports_temperature": true,
      "max_image_size_mb": 20.0,
      "use_openai_response_api": true
    }
  ]
}

```
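
For reference, a small sketch (not part of the repository) that loads this manifest and prints each model with its aliases and context window, assuming it is run from the repository root:

```python
import json
from pathlib import Path

manifest = json.loads(Path("conf/openai_models.json").read_text(encoding="utf-8"))
for model in manifest["models"]:
    aliases = ", ".join(model.get("aliases", [])) or "-"
    print(f"{model['model_name']:<12}  aliases: {aliases:<40}  context: {model['context_window']:,}")
```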

--------------------------------------------------------------------------------
/tests/test_docker_security.py:
--------------------------------------------------------------------------------

```python
"""
Tests for Docker security configuration and best practices
"""

import os
from pathlib import Path
from unittest.mock import patch

import pytest


class TestDockerSecurity:
    """Test Docker security configuration"""

    @pytest.fixture(autouse=True)
    def setup(self):
        """Setup for each test"""
        self.project_root = Path(__file__).parent.parent
        self.dockerfile_path = self.project_root / "Dockerfile"
        self.compose_path = self.project_root / "docker-compose.yml"

    def test_non_root_user_configuration(self):
        """Test that container runs as non-root user"""
        if not self.dockerfile_path.exists():
            pytest.skip("Dockerfile not found")

        content = self.dockerfile_path.read_text()

        # Check for user creation or switching
        user_indicators = ["USER " in content, "useradd" in content, "adduser" in content, "RUN addgroup" in content]

        assert any(user_indicators), "Container should run as non-root user"

    def test_no_unnecessary_privileges(self):
        """Test that container doesn't request unnecessary privileges"""
        if not self.compose_path.exists():
            pytest.skip("docker-compose.yml not found")

        content = self.compose_path.read_text()

        # Check that dangerous options are not used
        dangerous_options = ["privileged: true", "--privileged", "cap_add:", "SYS_ADMIN"]

        for option in dangerous_options:
            assert option not in content, f"Dangerous option {option} should not be used"

    def test_read_only_filesystem(self):
        """Test read-only filesystem configuration where applicable"""
        if not self.compose_path.exists():
            pytest.skip("docker-compose.yml not found")

        content = self.compose_path.read_text()

        # Check for read-only configurations
        if "read_only:" in content:
            assert "read_only: true" in content, "Read-only filesystem should be properly configured"

    def test_environment_variable_security(self):
        """Test secure handling of environment variables"""
        # Ensure sensitive data is not hardcoded
        sensitive_patterns = ["password", "secret", "key", "token"]

        for file_path in [self.dockerfile_path, self.compose_path]:
            if not file_path.exists():
                continue

            content = file_path.read_text().lower()

            # Check that we don't have hardcoded secrets
            for pattern in sensitive_patterns:
                # Allow variable names but not actual values
                lines = content.split("\n")
                for line in lines:
                    if f"{pattern}=" in line and not line.strip().startswith("#"):
                        # Check if it looks like a real value vs variable name
                        if '"' in line or "'" in line:
                            value_part = line.split("=")[1].strip()
                            if len(value_part) > 10 and not value_part.startswith("$"):
                                pytest.fail(f"Potential hardcoded secret in {file_path}: {line.strip()}")

    def test_network_security(self):
        """Test network security configuration"""
        if not self.compose_path.exists():
            pytest.skip("docker-compose.yml not found")

        content = self.compose_path.read_text()

        # Check for custom network (better than default bridge)
        if "networks:" in content:
            assert (
                "driver: bridge" in content or "external:" in content
            ), "Custom networks should use bridge driver or be external"

    def test_volume_security(self):
        """Test volume security configuration"""
        if not self.compose_path.exists():
            pytest.skip("docker-compose.yml not found")

        content = self.compose_path.read_text()

        # Check that sensitive host paths are not mounted
        dangerous_mounts = ["/:/", "/var/run/docker.sock:", "/etc/passwd:", "/etc/shadow:", "/root:"]

        for mount in dangerous_mounts:
            assert mount not in content, f"Dangerous mount {mount} should not be used"

    def test_secret_management(self):
        """Test that secrets are properly managed"""
        # Check for Docker secrets usage in compose file
        if self.compose_path.exists():
            content = self.compose_path.read_text()

            # If secrets are used, they should be properly configured
            if "secrets:" in content:
                assert "external: true" in content or "file:" in content, "Secrets should be external or file-based"

    def test_container_capabilities(self):
        """Test container capabilities are properly restricted"""
        if not self.compose_path.exists():
            pytest.skip("docker-compose.yml not found")

        content = self.compose_path.read_text()

        # Check for capability restrictions
        if "cap_drop:" in content:
            assert "ALL" in content, "Should drop all capabilities by default"

        # If capabilities are added, they should be minimal
        if "cap_add:" in content:
            dangerous_caps = ["SYS_ADMIN", "NET_ADMIN", "SYS_PTRACE"]
            for cap in dangerous_caps:
                assert cap not in content, f"Dangerous capability {cap} should not be added"


class TestDockerSecretsHandling:
    """Test Docker secrets and API key handling"""

    def test_env_file_not_in_image(self):
        """Test that .env files are not copied into Docker image"""
        project_root = Path(__file__).parent.parent
        dockerfile = project_root / "Dockerfile"

        if dockerfile.exists():
            content = dockerfile.read_text()

            # .env files should not be copied
            assert "COPY .env" not in content, ".env file should not be copied into image"

    def test_dockerignore_for_sensitive_files(self):
        """Test that .dockerignore excludes sensitive files"""
        project_root = Path(__file__).parent.parent
        dockerignore = project_root / ".dockerignore"

        if dockerignore.exists():
            content = dockerignore.read_text()

            sensitive_files = [".env", "*.key", "*.pem", ".git"]

            for file_pattern in sensitive_files:
                if file_pattern not in content:
                    # Warning rather than failure for flexibility
                    import warnings

                    warnings.warn(f"Consider adding {file_pattern} to .dockerignore", UserWarning, stacklevel=2)

    @patch.dict(os.environ, {}, clear=True)
    def test_no_default_api_keys(self):
        """Test that no default API keys are present"""
        # Ensure no API keys are set by default
        api_key_vars = ["GEMINI_API_KEY", "OPENAI_API_KEY", "XAI_API_KEY", "ANTHROPIC_API_KEY"]

        for var in api_key_vars:
            assert os.getenv(var) is None, f"{var} should not have a default value"

    def test_api_key_format_validation(self):
        """Test API key format validation if implemented"""
        # Test cases for API key validation
        test_cases = [
            {"key": "", "valid": False},
            {"key": "test", "valid": False},  # Too short
            {"key": "sk-" + "x" * 40, "valid": True},  # OpenAI format
            {"key": "AIza" + "x" * 35, "valid": True},  # Google format
        ]

        for case in test_cases:
            # This would test actual validation if implemented
            # For now, just check the test structure
            assert isinstance(case["valid"], bool)
            assert isinstance(case["key"], str)


class TestDockerComplianceChecks:
    """Test Docker configuration compliance with security standards"""

    def test_dockerfile_best_practices(self):
        """Test Dockerfile follows security best practices"""
        project_root = Path(__file__).parent.parent
        dockerfile = project_root / "Dockerfile"

        if not dockerfile.exists():
            pytest.skip("Dockerfile not found")

        content = dockerfile.read_text()

        # Check for multi-stage builds (reduces attack surface)
        if "FROM" in content:
            from_count = content.count("FROM")
            if from_count > 1:
                assert "AS" in content, "Multi-stage builds should use named stages"

        # Check for specific user ID (better than name-only)
        if "USER" in content:
            user_lines = [line for line in content.split("\n") if line.strip().startswith("USER")]
            for line in user_lines:
                # Could be improved to check for numeric UID
                assert len(line.strip()) > 5, "USER directive should be specific"

    def test_container_security_context(self):
        """Test container security context configuration"""
        project_root = Path(__file__).parent.parent
        compose_file = project_root / "docker-compose.yml"

        if compose_file.exists():
            content = compose_file.read_text()

            # Check for security context if configured
            security_options = ["security_opt:", "no-new-privileges:", "read_only:"]

            # At least one security option should be present
            security_configured = any(opt in content for opt in security_options)

            if not security_configured:
                import warnings

                warnings.warn("Consider adding security options to docker-compose.yml", UserWarning, stacklevel=2)

```
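
The comment in `test_dockerfile_best_practices` notes that the `USER` check could be tightened to require a numeric UID. A possible sketch of that stricter check (an assumption, not an existing test in the repository):

```python
import re
from pathlib import Path

def user_directives_use_numeric_uid(dockerfile: Path) -> bool:
    """Return True if every USER directive specifies a numeric UID (optionally UID:GID)."""
    content = dockerfile.read_text()
    user_lines = [line.strip() for line in content.splitlines() if line.strip().startswith("USER")]
    return all(re.fullmatch(r"USER\s+\d+(:\d+)?", line) for line in user_lines)

# Example: user_directives_use_numeric_uid(Path("Dockerfile"))
```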

--------------------------------------------------------------------------------
/simulator_tests/test_per_tool_deduplication.py:
--------------------------------------------------------------------------------

```python
#!/usr/bin/env python3
"""
Per-Tool File Deduplication Test

Tests file deduplication for each individual MCP tool to ensure
that files are properly deduplicated within single-tool conversations.
Validates that:
1. Files are embedded only once in conversation history
2. Continuation calls don't re-read existing files
3. New files are still properly embedded
4. Server logs show deduplication behavior
"""

import os

from .conversation_base_test import ConversationBaseTest


class PerToolDeduplicationTest(ConversationBaseTest):
    """Test file deduplication for each individual tool"""

    @property
    def test_name(self) -> str:
        return "per_tool_deduplication"

    @property
    def test_description(self) -> str:
        return "File deduplication for individual tools"

    # create_additional_test_file method now inherited from base class

    def run_test(self) -> bool:
        """Test file deduplication with realistic precommit/codereview workflow"""
        try:
            self.logger.info("📄 Test: Simplified file deduplication with precommit/codereview workflow")

            # Setup test environment for conversation testing
            self.setUp()

            # Setup test files
            self.setup_test_files()

            # Create a short dummy file for quick testing in the current repo
            dummy_content = """def add(a, b):
    return a + b  # Missing type hints

def divide(x, y):
    return x / y  # No zero check
"""
            # Create the file in the current git repo directory to make it show up in git status
            dummy_file_path = os.path.join(os.getcwd(), "dummy_code.py")
            with open(dummy_file_path, "w") as f:
                f.write(dummy_content)

            # Get timestamp for log filtering
            import datetime

            start_time = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%S")

            # Step 1: precommit tool with dummy file (low thinking mode)
            self.logger.info("  Step 1: precommit tool with dummy file")
            precommit_params = {
                "step": "Initial analysis of dummy_code.py for commit readiness. Please give me a quick one line reply.",
                "step_number": 1,
                "total_steps": 2,
                "next_step_required": True,
                "findings": "Starting pre-commit validation of dummy_code.py",
                "path": os.getcwd(),  # Use current working directory as the git repo path
                "relevant_files": [dummy_file_path],
                "thinking_mode": "low",
                "model": "flash",
            }

            response1, continuation_id = self.call_mcp_tool("precommit", precommit_params)
            if not response1:
                self.logger.error("  ❌ Step 1: precommit tool failed")
                return False

            if not continuation_id:
                self.logger.error("  ❌ Step 1: precommit tool didn't provide continuation_id")
                return False

            # Validate continuation_id format (should be UUID)
            if len(continuation_id) < 32:
                self.logger.error(f"  ❌ Step 1: Invalid continuation_id format: {continuation_id}")
                return False

            self.logger.info(f"  ✅ Step 1: precommit completed with continuation_id: {continuation_id[:8]}...")

            # Step 2: codereview tool with same file (NO continuation - fresh conversation)
            self.logger.info("  Step 2: codereview tool with same file (fresh conversation)")
            codereview_params = {
                "step": "Initial code review of dummy_code.py for quality and best practices. Please give me a quick one line reply.",
                "step_number": 1,
                "total_steps": 1,
                "next_step_required": False,
                "findings": "Starting code review of dummy_code.py",
                "relevant_files": [dummy_file_path],
                "thinking_mode": "low",
                "model": "flash",
            }

            response2, _ = self.call_mcp_tool("codereview", codereview_params)
            if not response2:
                self.logger.error("  ❌ Step 2: codereview tool failed")
                return False

            self.logger.info("  ✅ Step 2: codereview completed (fresh conversation)")

            # Step 3: Create new file and continue with precommit
            self.logger.info("  Step 3: precommit continuation with old + new file")
            new_file_content = """def multiply(x, y):
    return x * y

def subtract(a, b):
    return a - b
"""
            # Create another temp file in the current repo for git changes
            new_file_path = os.path.join(os.getcwd(), "new_feature.py")
            with open(new_file_path, "w") as f:
                f.write(new_file_content)

            # Continue precommit with both files
            continue_params = {
                "continuation_id": continuation_id,
                "step": "Continue analysis with new_feature.py added. Please give me a quick one line reply about both files.",
                "step_number": 2,
                "total_steps": 2,
                "next_step_required": False,
                "findings": "Continuing pre-commit validation with both dummy_code.py and new_feature.py",
                "path": os.getcwd(),  # Use current working directory as the git repo path
                "relevant_files": [dummy_file_path, new_file_path],  # Old + new file
                "thinking_mode": "low",
                "model": "flash",
            }

            response3, _ = self.call_mcp_tool("precommit", continue_params)
            if not response3:
                self.logger.error("  ❌ Step 3: precommit continuation failed")
                return False

            self.logger.info("  ✅ Step 3: precommit continuation completed")

            # Validate results in server logs
            self.logger.info("  📋 Validating conversation history and file deduplication...")
            logs = self.get_server_logs_since(start_time)

            # Check for conversation history building
            conversation_logs = [
                line for line in logs.split("\n") if "conversation" in line.lower() or "history" in line.lower()
            ]

            # Check for file embedding/deduplication
            embedding_logs = [
                line
                for line in logs.split("\n")
                if "[FILE_PROCESSING]" in line or "embedding" in line.lower() or "[FILES]" in line
            ]

            # Check for continuation evidence
            continuation_logs = [
                line for line in logs.split("\n") if "continuation" in line.lower() or continuation_id[:8] in line
            ]

            # Check for both files mentioned
            dummy_file_mentioned = any("dummy_code.py" in line for line in logs.split("\n"))
            new_file_mentioned = any("new_feature.py" in line for line in logs.split("\n"))

            # Print diagnostic information
            self.logger.info(f"   Conversation logs found: {len(conversation_logs)}")
            self.logger.info(f"   File embedding logs found: {len(embedding_logs)}")
            self.logger.info(f"   Continuation logs found: {len(continuation_logs)}")
            self.logger.info(f"   Dummy file mentioned: {dummy_file_mentioned}")
            self.logger.info(f"   New file mentioned: {new_file_mentioned}")

            if self.verbose:
                self.logger.debug("  📋 Sample embedding logs:")
                for log in embedding_logs[:5]:  # Show first 5
                    if log.strip():
                        self.logger.debug(f"    {log.strip()}")

                self.logger.debug("  📋 Sample continuation logs:")
                for log in continuation_logs[:3]:  # Show first 3
                    if log.strip():
                        self.logger.debug(f"    {log.strip()}")

            # Determine success criteria
            success_criteria = [
                len(embedding_logs) > 0,  # File embedding occurred
                len(continuation_logs) > 0,  # Continuation worked
                dummy_file_mentioned,  # Original file processed
                new_file_mentioned,  # New file processed
            ]

            passed_criteria = sum(success_criteria)
            total_criteria = len(success_criteria)

            self.logger.info(f"   Success criteria met: {passed_criteria}/{total_criteria}")

            if passed_criteria == total_criteria:  # All criteria must pass
                self.logger.info("  ✅ File deduplication workflow test: PASSED")
                return True
            else:
                self.logger.warning("  ⚠️ File deduplication workflow test: FAILED")
                self.logger.warning("  💡 Check server logs for detailed file embedding and continuation activity")
                return False

        except Exception as e:
            self.logger.error(f"File deduplication workflow test failed: {e}")
            return False
        finally:
            # Clean up temp files created in current repo
            temp_files = ["dummy_code.py", "new_feature.py"]
            for temp_file in temp_files:
                temp_path = os.path.join(os.getcwd(), temp_file)
                if os.path.exists(temp_path):
                    os.remove(temp_path)
                    self.logger.debug(f"Removed temp file: {temp_path}")
            self.cleanup_test_files()

```
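
The test creates its temporary files directly in the working git repository (so `precommit` sees them in `git status`) and removes them in the `finally` block. A hedged alternative sketch that wraps the same create/remove cycle in a context manager, assuming the same working-directory convention; `repo_temp_file` is hypothetical:

```python
import os
from contextlib import contextmanager

@contextmanager
def repo_temp_file(name: str, content: str):
    """Create a file in the current working directory and remove it afterwards."""
    path = os.path.join(os.getcwd(), name)
    with open(path, "w") as handle:
        handle.write(content)
    try:
        yield path
    finally:
        if os.path.exists(path):
            os.remove(path)

# Example:
# with repo_temp_file("dummy_code.py", "def add(a, b):\n    return a + b\n") as path:
#     ...  # pass `path` in relevant_files when calling the tools
```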

--------------------------------------------------------------------------------
/tests/openai_cassettes/chat_cross_step2_gpt5_reminder.json:
--------------------------------------------------------------------------------

```json
{
  "interactions": [
    {
      "request": {
        "content": {
          "messages": [
            {
              "content": "\nYou are a senior engineering thought-partner collaborating with another AI agent. Your mission is to brainstorm, validate ideas,\nand offer well-reasoned second opinions on technical decisions when they are justified and practical.\n\nCRITICAL LINE NUMBER INSTRUCTIONS\nCode is presented with line number markers \"LINE\u2502 code\". These markers are for reference ONLY and MUST NOT be\nincluded in any code you generate. Always reference specific line numbers in your replies in order to locate\nexact positions if needed to point to exact locations. Include a very short code excerpt alongside for clarity.\nInclude context_start_text and context_end_text as backup references. Never include \"LINE\u2502\" markers in generated code\nsnippets.\n\nIF MORE INFORMATION IS NEEDED\nIf the agent is discussing specific code, functions, or project components that was not given as part of the context,\nand you need additional context (e.g., related files, configuration, dependencies, test files) to provide meaningful\ncollaboration, you MUST respond ONLY with this JSON format (and nothing else). Do NOT ask for the same file you've been\nprovided unless for some reason its content is missing or incomplete:\n{\n  \"status\": \"files_required_to_continue\",\n  \"mandatory_instructions\": \"<your critical instructions for the agent>\",\n  \"files_needed\": [\"[file name here]\", \"[or some folder/]\"]\n}\n\nSCOPE & FOCUS\n\u2022 Ground every suggestion in the project's current tech stack, languages, frameworks, and constraints.\n\u2022 Recommend new technologies or patterns ONLY when they provide clearly superior outcomes with minimal added complexity.\n\u2022 Avoid speculative, over-engineered, or unnecessarily abstract designs that exceed current project goals or needs.\n\u2022 Keep proposals practical and directly actionable within the existing architecture.\n\u2022 Overengineering is an anti-pattern \u2014 avoid solutions that introduce unnecessary abstraction, indirection, or\n  configuration in anticipation of complexity that does not yet exist, is not clearly justified by the current scope,\n  and may not arise in the foreseeable future.\n\nCOLLABORATION APPROACH\n1. Treat the collaborating agent as an equally senior peer. Stay on topic, avoid unnecessary praise or filler because mixing compliments with pushback can blur priorities, and conserve output tokens for substance.\n2. Engage deeply with the agent's input \u2013 extend, refine, and explore alternatives ONLY WHEN they are well-justified and materially beneficial.\n3. Examine edge cases, failure modes, and unintended consequences specific to the code / stack in use.\n4. Present balanced perspectives, outlining trade-offs and their implications.\n5. Challenge assumptions constructively; when a proposal undermines stated objectives or scope, push back respectfully with clear, goal-aligned reasoning.\n6. Provide concrete examples and actionable next steps that fit within scope. Prioritize direct, achievable outcomes.\n7. 
Ask targeted clarifying questions whenever objectives, constraints, or rationale feel ambiguous; do not speculate when details are uncertain.\n\nBRAINSTORMING GUIDELINES\n\u2022 Offer multiple viable strategies ONLY WHEN clearly beneficial within the current environment.\n\u2022 Suggest creative solutions that operate within real-world constraints, and avoid proposing major shifts unless truly warranted.\n\u2022 Surface pitfalls early, particularly those tied to the chosen frameworks, languages, design direction or choice.\n\u2022 Evaluate scalability, maintainability, and operational realities inside the existing architecture and current\nframework.\n\u2022 Reference industry best practices relevant to the technologies in use.\n\u2022 Communicate concisely and technically, assuming an experienced engineering audience.\n\nREMEMBER\nAct as a peer, not a lecturer. Avoid overcomplicating. Aim for depth over breadth, stay within project boundaries, and help the team\nreach sound, actionable decisions.\n",
              "role": "system"
            },
            {
              "content": "=== CONVERSATION HISTORY (CONTINUATION) ===\nThread: dbadc23e-c0f4-4853-982f-6c5bc722b5de\nTool: chat\nTurn 3/50\nYou are continuing this conversation thread from where it left off.\n\nPrevious conversation turns:\n\n--- Turn 1 (Agent using chat) ---\nPick a number between 1 and 10 and respond with JUST that number.\n\n--- Turn 2 (gemini-2.5-flash using chat via google) ---\n7\n\n---\n\nAGENT'S TURN: Evaluate this perspective alongside your analysis to form a comprehensive solution and continue with the user's request and task at hand.\n\n--- Turn 3 (Agent) ---\nRemind me, what number did you pick, respond with JUST that number.\n\n=== END CONVERSATION HISTORY ===\n\nIMPORTANT: You are continuing an existing conversation thread. Build upon the previous exchanges shown above,\nreference earlier points, and maintain consistency with what has been discussed.\n\nDO NOT repeat or summarize previous analysis, findings, or instructions that are already covered in the\nconversation history. Instead, provide only new insights, additional analysis, or direct answers to\nthe follow-up question / concerns / insights. Assume the user has read the prior conversation.\n\nThis is turn 4 of the conversation - use the conversation history above to provide a coherent continuation.\n\n=== NEW USER INPUT ===\n=== USER REQUEST ===\nRemind me, what number did you pick, respond with JUST that number.\n=== END REQUEST ===\n\nPlease provide a thoughtful, comprehensive response:",
              "role": "user"
            }
          ],
          "model": "gpt-5",
          "stream": false,
          "temperature": 1.0
        },
        "headers": {
          "accept": "application/json",
          "accept-encoding": "gzip, deflate",
          "authorization": "Bearer SANITIZED",
          "connection": "keep-alive",
          "content-length": "5587",
          "content-type": "application/json",
          "host": "api.openai.com",
          "user-agent": "OpenAI/Python 2.1.0",
          "x-stainless-arch": "arm64",
          "x-stainless-async": "false",
          "x-stainless-lang": "python",
          "x-stainless-os": "MacOS",
          "x-stainless-package-version": "2.1.0",
          "x-stainless-read-timeout": "900.0",
          "x-stainless-retry-count": "0",
          "x-stainless-runtime": "CPython",
          "x-stainless-runtime-version": "3.12.11"
        },
        "method": "POST",
        "path": "/v1/chat/completions",
        "url": "https://api.openai.com/v1/chat/completions"
      },
      "response": {
        "content": {
          "data": "ewogICJpZCI6ICJjaGF0Y21wbC1DTXRaVXZHWjN3S3RTMWxEVTgxUXQxT3g2dnNtciIsCiAgIm9iamVjdCI6ICJjaGF0LmNvbXBsZXRpb24iLAogICJjcmVhdGVkIjogMTc1OTU3Mjg2OCwKICAibW9kZWwiOiAiZ3B0LTUtMjAyNS0wOC0wNyIsCiAgImNob2ljZXMiOiBbCiAgICB7CiAgICAgICJpbmRleCI6IDAsCiAgICAgICJtZXNzYWdlIjogewogICAgICAgICJyb2xlIjogImFzc2lzdGFudCIsCiAgICAgICAgImNvbnRlbnQiOiAiNyIsCiAgICAgICAgInJlZnVzYWwiOiBudWxsLAogICAgICAgICJhbm5vdGF0aW9ucyI6IFtdCiAgICAgIH0sCiAgICAgICJmaW5pc2hfcmVhc29uIjogInN0b3AiCiAgICB9CiAgXSwKICAidXNhZ2UiOiB7CiAgICAicHJvbXB0X3Rva2VucyI6IDEwNTUsCiAgICAiY29tcGxldGlvbl90b2tlbnMiOiAyNjYsCiAgICAidG90YWxfdG9rZW5zIjogMTMyMSwKICAgICJwcm9tcHRfdG9rZW5zX2RldGFpbHMiOiB7CiAgICAgICJjYWNoZWRfdG9rZW5zIjogMCwKICAgICAgImF1ZGlvX3Rva2VucyI6IDAKICAgIH0sCiAgICAiY29tcGxldGlvbl90b2tlbnNfZGV0YWlscyI6IHsKICAgICAgInJlYXNvbmluZ190b2tlbnMiOiAyNTYsCiAgICAgICJhdWRpb190b2tlbnMiOiAwLAogICAgICAiYWNjZXB0ZWRfcHJlZGljdGlvbl90b2tlbnMiOiAwLAogICAgICAicmVqZWN0ZWRfcHJlZGljdGlvbl90b2tlbnMiOiAwCiAgICB9CiAgfSwKICAic2VydmljZV90aWVyIjogImRlZmF1bHQiLAogICJzeXN0ZW1fZmluZ2VycHJpbnQiOiBudWxsCn0K",
          "encoding": "base64",
          "size": 774
        },
        "headers": {
          "access-control-expose-headers": "X-Request-ID",
          "alt-svc": "h3=\":443\"; ma=86400",
          "cf-cache-status": "DYNAMIC",
          "cf-ray": "9893e998cd90f08b-DXB",
          "connection": "keep-alive",
          "content-encoding": "gzip",
          "content-type": "application/json",
          "date": "Sat, 04 Oct 2025 10:14:32 GMT",
          "openai-organization": "beehive-innovations-fze",
          "openai-processing-ms": "3725",
          "openai-project": "proj_QP57xBVPOlWpp0vuJEPGwXK3",
          "openai-version": "2020-10-01",
          "server": "cloudflare",
          "set-cookie": "__cf_bm=cyePl915F03L6RqnIdyla05Q1NzsdFJkMGvh3F89Q6Q-(XXX) XXX-XXXX-0.0.0.0-gBMxI3BY11pPcnlWTVD3TZiEcmP5Q5vbBrFFQoOwTFwRmSZpcanQETT3_6dQmMMX6vIGW8Gi3W44gI3ERJAyj7aROYPS6Ii7CkNPa2qxP04; path=/; expires=Sat, 04-Oct-25 10:44:32 GMT; domain=.api.openai.com; HttpOnly; Secure; SameSite=None, _cfuvid=e5KUvSkbb2EWE.MCk6ma4sq3qlfQOWx.geZuS4ggYfI-175(XXX) XXX-XXXX-0.0.0.0-604800000; path=/; domain=.api.openai.com; HttpOnly; Secure; SameSite=None",
          "strict-transport-security": "max-age=31536000; includeSubDomains; preload",
          "transfer-encoding": "chunked",
          "x-content-type-options": "nosniff",
          "x-envoy-upstream-service-time": "3885",
          "x-openai-proxy-wasm": "v0.1",
          "x-ratelimit-limit-requests": "500",
          "x-ratelimit-limit-tokens": "500000",
          "x-ratelimit-remaining-requests": "499",
          "x-ratelimit-remaining-tokens": "498657",
          "x-ratelimit-reset-requests": "120ms",
          "x-ratelimit-reset-tokens": "161ms",
          "x-request-id": "req_36d40cbab28f4a2cb8fd48aea5a4f394"
        },
        "reason_phrase": "OK",
        "status_code": 200
      }
    }
  ]
}
```
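
Response bodies in these OpenAI cassettes are stored base64-encoded. A small sketch (not part of the repository) for decoding one during debugging, assuming it is run from the repository root:

```python
import base64
import json
from pathlib import Path

cassette = json.loads(Path("tests/openai_cassettes/chat_cross_step2_gpt5_reminder.json").read_text())
content = cassette["interactions"][0]["response"]["content"]
body = base64.b64decode(content["data"]) if content.get("encoding") == "base64" else content["data"].encode()
print(json.loads(body)["choices"][0]["message"]["content"])  # prints "7" for this recording
```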

--------------------------------------------------------------------------------
/tests/gemini_cassettes/chat_cross/step1_gemini25_flash_number/mldev.json:
--------------------------------------------------------------------------------

```json
{
  "replay_id": "chat_cross/step1_gemini25_flash_number/mldev",
  "interactions": [
    {
      "request": {
        "method": "post",
        "url": "{MLDEV_URL_PREFIX}/models/gemini-2.5-flash:generateContent",
        "headers": {
          "Content-Type": "application/json",
          "x-goog-api-key": "{REDACTED}",
          "user-agent": "google-genai-sdk/{VERSION_NUMBER} {LANGUAGE_LABEL}/{VERSION_NUMBER}",
          "x-goog-api-client": "google-genai-sdk/{VERSION_NUMBER} {LANGUAGE_LABEL}/{VERSION_NUMBER}"
        },
        "body_segments": [
          {
            "contents": [
              {
                "parts": [
                  {
                    "text": "\nYou are a senior engineering thought-partner collaborating with another AI agent. Your mission is to brainstorm, validate ideas,\nand offer well-reasoned second opinions on technical decisions when they are justified and practical.\n\nCRITICAL LINE NUMBER INSTRUCTIONS\nCode is presented with line number markers \"LINE│ code\". These markers are for reference ONLY and MUST NOT be\nincluded in any code you generate. Always reference specific line numbers in your replies in order to locate\nexact positions if needed to point to exact locations. Include a very short code excerpt alongside for clarity.\nInclude context_start_text and context_end_text as backup references. Never include \"LINE│\" markers in generated code\nsnippets.\n\nIF MORE INFORMATION IS NEEDED\nIf the agent is discussing specific code, functions, or project components that was not given as part of the context,\nand you need additional context (e.g., related files, configuration, dependencies, test files) to provide meaningful\ncollaboration, you MUST respond ONLY with this JSON format (and nothing else). Do NOT ask for the same file you've been\nprovided unless for some reason its content is missing or incomplete:\n{\n  \"status\": \"files_required_to_continue\",\n  \"mandatory_instructions\": \"<your critical instructions for the agent>\",\n  \"files_needed\": [\"[file name here]\", \"[or some folder/]\"]\n}\n\nSCOPE & FOCUS\n• Ground every suggestion in the project's current tech stack, languages, frameworks, and constraints.\n• Recommend new technologies or patterns ONLY when they provide clearly superior outcomes with minimal added complexity.\n• Avoid speculative, over-engineered, or unnecessarily abstract designs that exceed current project goals or needs.\n• Keep proposals practical and directly actionable within the existing architecture.\n• Overengineering is an anti-pattern — avoid solutions that introduce unnecessary abstraction, indirection, or\n  configuration in anticipation of complexity that does not yet exist, is not clearly justified by the current scope,\n  and may not arise in the foreseeable future.\n\nCOLLABORATION APPROACH\n1. Treat the collaborating agent as an equally senior peer. Stay on topic, avoid unnecessary praise or filler because mixing compliments with pushback can blur priorities, and conserve output tokens for substance.\n2. Engage deeply with the agent's input – extend, refine, and explore alternatives ONLY WHEN they are well-justified and materially beneficial.\n3. Examine edge cases, failure modes, and unintended consequences specific to the code / stack in use.\n4. Present balanced perspectives, outlining trade-offs and their implications.\n5. Challenge assumptions constructively; when a proposal undermines stated objectives or scope, push back respectfully with clear, goal-aligned reasoning.\n6. Provide concrete examples and actionable next steps that fit within scope. Prioritize direct, achievable outcomes.\n7. 
Ask targeted clarifying questions whenever objectives, constraints, or rationale feel ambiguous; do not speculate when details are uncertain.\n\nBRAINSTORMING GUIDELINES\n• Offer multiple viable strategies ONLY WHEN clearly beneficial within the current environment.\n• Suggest creative solutions that operate within real-world constraints, and avoid proposing major shifts unless truly warranted.\n• Surface pitfalls early, particularly those tied to the chosen frameworks, languages, design direction or choice.\n• Evaluate scalability, maintainability, and operational realities inside the existing architecture and current\nframework.\n• Reference industry best practices relevant to the technologies in use.\n• Communicate concisely and technically, assuming an experienced engineering audience.\n\nREMEMBER\nAct as a peer, not a lecturer. Avoid overcomplicating. Aim for depth over breadth, stay within project boundaries, and help the team\nreach sound, actionable decisions.\n\n\n=== USER REQUEST ===\nPick a number between 1 and 10 and respond with JUST that number.\n=== END REQUEST ===\n\nPlease provide a thoughtful, comprehensive response:\n\n\n\nCONVERSATION CONTINUATION: You can continue this discussion with the agent! (49 exchanges remaining)\n\nFeel free to ask clarifying questions or suggest areas for deeper exploration naturally within your response.\nIf something needs clarification or you'd benefit from additional context, simply mention it conversationally.\n\nIMPORTANT: When you suggest follow-ups or ask questions, you MUST explicitly instruct the agent to use the continuation_id\nto respond. Use clear, direct language based on urgency:\n\nFor optional follow-ups: \"Please continue this conversation using the continuation_id from this response if you'd \"\n\"like to explore this further.\"\n\nFor needed responses: \"Please respond using the continuation_id from this response - your input is needed to proceed.\"\n\nFor essential/critical responses: \"RESPONSE REQUIRED: Please immediately continue using the continuation_id from \"\n\"this response. Cannot proceed without your clarification/input.\"\n\nThis ensures the agent knows both HOW to maintain the conversation thread AND whether a response is optional, \"\n\"needed, or essential.\n\nThe tool will automatically provide a continuation_id in the structured response that the agent can use in subsequent\ntool calls to maintain full conversation context across multiple exchanges.\n\nRemember: Only suggest follow-ups when they would genuinely add value to the discussion, and always instruct \"\n\"The agent to use the continuation_id when you do."
                  }
                ]
              }
            ],
            "generationConfig": {
              "temperature": 0.2,
              "candidateCount": 1,
              "thinkingConfig": {
                "thinkingBudget": 8110
              }
            }
          }
        ]
      },
      "response": {
        "status_code": 200,
        "headers": {
          "content-type": "application/json; charset=UTF-8",
          "vary": "Origin, X-Origin, Referer",
          "content-encoding": "gzip",
          "date": "Sat, 04 Oct 2025 10:14:27 GMT",
          "server": "scaffolding on HTTPServer2",
          "x-xss-protection": "0",
          "x-frame-options": "SAMEORIGIN",
          "x-content-type-options": "nosniff",
          "server-timing": "gfet4t7; dur=1246",
          "alt-svc": "h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000",
          "transfer-encoding": "chunked"
        },
        "body_segments": [
          {
            "candidates": [
              {
                "content": {
                  "parts": [
                    {
                      "text": "7"
                    }
                  ],
                  "role": "model"
                },
                "finishReason": "STOP",
                "index": 0
              }
            ],
            "usageMetadata": {
              "promptTokenCount": 1085,
              "candidatesTokenCount": 1,
              "totalTokenCount": 1149,
              "promptTokensDetails": [
                {
                  "modality": "TEXT",
                  "tokenCount": 1085
                }
              ],
              "thoughtsTokenCount": 63
            },
            "modelVersion": "gemini-2.5-flash",
            "responseId": "g_PgaIL5LL6VkdUPgr3q2A8"
          }
        ],
        "byte_segments": [],
        "sdk_response_segments": [
          {
            "sdk_http_response": {
              "headers": {
                "content-type": "application/json; charset=UTF-8",
                "vary": "Origin, X-Origin, Referer",
                "content-encoding": "gzip",
                "date": "Sat, 04 Oct 2025 10:14:27 GMT",
                "server": "scaffolding on HTTPServer2",
                "x-xss-protection": "0",
                "x-frame-options": "SAMEORIGIN",
                "x-content-type-options": "nosniff",
                "server-timing": "gfet4t7; dur=1246",
                "alt-svc": "h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000",
                "transfer-encoding": "chunked"
              }
            },
            "candidates": [
              {
                "content": {
                  "parts": [
                    {
                      "text": "7"
                    }
                  ],
                  "role": "model"
                },
                "finish_reason": "STOP",
                "index": 0
              }
            ],
            "model_version": "gemini-2.5-flash",
            "response_id": "g_PgaIL5LL6VkdUPgr3q2A8",
            "usage_metadata": {
              "candidates_token_count": 1,
              "prompt_token_count": 1085,
              "prompt_tokens_details": [
                {
                  "modality": "TEXT",
                  "token_count": 1085
                }
              ],
              "thoughts_token_count": 63,
              "total_token_count": 1149
            }
          }
        ]
      }
    }
  ]
}
```

--------------------------------------------------------------------------------
/utils/model_restrictions.py:
--------------------------------------------------------------------------------

```python
"""
Model Restriction Service

This module provides centralized management of model usage restrictions
based on environment variables. It allows organizations to limit which
models can be used from each provider for cost control, compliance, or
standardization purposes.

Environment Variables:
- OPENAI_ALLOWED_MODELS: Comma-separated list of allowed OpenAI models
- GOOGLE_ALLOWED_MODELS: Comma-separated list of allowed Gemini models
- XAI_ALLOWED_MODELS: Comma-separated list of allowed X.AI GROK models
- OPENROUTER_ALLOWED_MODELS: Comma-separated list of allowed OpenRouter models
- DIAL_ALLOWED_MODELS: Comma-separated list of allowed DIAL models

Example:
    OPENAI_ALLOWED_MODELS=o3-mini,o4-mini
    GOOGLE_ALLOWED_MODELS=flash
    XAI_ALLOWED_MODELS=grok-3,grok-3-fast
    OPENROUTER_ALLOWED_MODELS=opus,sonnet,mistral
"""

import logging
from collections import defaultdict
from typing import Any, Optional

from providers.shared import ProviderType
from utils.env import get_env

logger = logging.getLogger(__name__)


class ModelRestrictionService:
    """Central authority for environment-driven model allowlists.

    Role
        Interpret ``*_ALLOWED_MODELS`` environment variables, keep their
        entries normalised (lowercase), and answer whether a provider/model
        pairing is permitted.

    Responsibilities
        * Parse, cache, and expose per-provider restriction sets
        * Validate configuration by cross-checking each entry against the
          provider’s alias-aware model list
        * Offer helper methods such as ``is_allowed`` and ``filter_models`` to
          enforce policy everywhere model names appear (tool selection, CLI
          commands, etc.).
    """

    # Environment variable names
    ENV_VARS = {
        ProviderType.OPENAI: "OPENAI_ALLOWED_MODELS",
        ProviderType.GOOGLE: "GOOGLE_ALLOWED_MODELS",
        ProviderType.XAI: "XAI_ALLOWED_MODELS",
        ProviderType.OPENROUTER: "OPENROUTER_ALLOWED_MODELS",
        ProviderType.DIAL: "DIAL_ALLOWED_MODELS",
    }

    def __init__(self):
        """Initialize the restriction service by loading from environment."""
        self.restrictions: dict[ProviderType, set[str]] = {}
        self._alias_resolution_cache: dict[ProviderType, dict[str, str]] = defaultdict(dict)
        self._load_from_env()

    def _load_from_env(self) -> None:
        """Load restrictions from environment variables."""
        for provider_type, env_var in self.ENV_VARS.items():
            env_value = get_env(env_var)

            if env_value is None or env_value == "":
                # Not set or empty - no restrictions (allow all models)
                logger.debug(f"{env_var} not set or empty - all {provider_type.value} models allowed")
                continue

            # Parse comma-separated list
            models = set()
            for model in env_value.split(","):
                cleaned = model.strip().lower()
                if cleaned:
                    models.add(cleaned)

            if models:
                self.restrictions[provider_type] = models
                self._alias_resolution_cache[provider_type] = {}
                logger.info(f"{provider_type.value} allowed models: {sorted(models)}")
            else:
                # All entries were empty after cleaning - treat as no restrictions
                logger.debug(f"{env_var} contains only whitespace - all {provider_type.value} models allowed")

    def validate_against_known_models(self, provider_instances: dict[ProviderType, Any]) -> None:
        """
        Validate restrictions against known models from providers.

        This should be called after providers are initialized to warn about
        typos or invalid model names in the restriction lists.

        Args:
            provider_instances: Dictionary of provider type to provider instance
        """
        for provider_type, allowed_models in self.restrictions.items():
            provider = provider_instances.get(provider_type)
            if not provider:
                continue

            # Get all supported models using the clean polymorphic interface
            try:
                # Gather canonical models and aliases with consistent formatting
                all_models = provider.list_models(
                    respect_restrictions=False,
                    include_aliases=True,
                    lowercase=True,
                    unique=True,
                )
                supported_models = set(all_models)
            except Exception as e:
                logger.debug(f"Could not get model list from {provider_type.value} provider: {e}")
                supported_models = set()

            # Check each allowed model
            for allowed_model in allowed_models:
                if allowed_model not in supported_models:
                    logger.warning(
                        f"Model '{allowed_model}' in {self.ENV_VARS[provider_type]} "
                        f"is not a recognized {provider_type.value} model. "
                        f"Please check for typos. Known models: {sorted(supported_models)}"
                    )

    def is_allowed(self, provider_type: ProviderType, model_name: str, original_name: Optional[str] = None) -> bool:
        """
        Check if a model is allowed for a specific provider.

        Args:
            provider_type: The provider type (OPENAI, GOOGLE, etc.)
            model_name: The canonical model name (after alias resolution)
            original_name: The original model name before alias resolution (optional)

        Returns:
            True if allowed (or no restrictions), False if restricted
        """
        if provider_type not in self.restrictions:
            # No restrictions for this provider
            return True

        allowed_set = self.restrictions[provider_type]

        if len(allowed_set) == 0:
            # Empty set - allowed
            return True

        # Check both the resolved name and original name (if different)
        names_to_check = {model_name.lower()}
        if original_name and original_name.lower() != model_name.lower():
            names_to_check.add(original_name.lower())

        # If any of the names is in the allowed set, it's allowed
        if any(name in allowed_set for name in names_to_check):
            return True

        # Attempt to resolve canonical names for allowed aliases using provider metadata.
        try:
            from providers.registry import ModelProviderRegistry

            provider = ModelProviderRegistry.get_provider(provider_type)
        except Exception:  # pragma: no cover - registry lookup failure shouldn't break validation
            provider = None

        if provider:
            cache = self._alias_resolution_cache.setdefault(provider_type, {})

            for allowed_entry in list(allowed_set):
                normalized_resolved = cache.get(allowed_entry)

                if not normalized_resolved:
                    try:
                        resolved = provider._resolve_model_name(allowed_entry)
                    except Exception:  # pragma: no cover - resolution failures are treated as non-matches
                        continue

                    if not resolved:
                        continue

                    normalized_resolved = resolved.lower()
                    cache[allowed_entry] = normalized_resolved

                if normalized_resolved in names_to_check:
                    allowed_set.add(normalized_resolved)
                    cache[normalized_resolved] = normalized_resolved
                    return True

        return False

    def get_allowed_models(self, provider_type: ProviderType) -> Optional[set[str]]:
        """
        Get the set of allowed models for a provider.

        Args:
            provider_type: The provider type

        Returns:
            Set of allowed model names, or None if no restrictions
        """
        return self.restrictions.get(provider_type)

    def has_restrictions(self, provider_type: ProviderType) -> bool:
        """
        Check if a provider has any restrictions.

        Args:
            provider_type: The provider type

        Returns:
            True if restrictions exist, False otherwise
        """
        return provider_type in self.restrictions

    def filter_models(self, provider_type: ProviderType, models: list[str]) -> list[str]:
        """
        Filter a list of models based on restrictions.

        Args:
            provider_type: The provider type
            models: List of model names to filter

        Returns:
            Filtered list containing only allowed models
        """
        if not self.has_restrictions(provider_type):
            return models

        return [m for m in models if self.is_allowed(provider_type, m)]

    def get_restriction_summary(self) -> dict[str, Any]:
        """
        Get a summary of all restrictions for logging/debugging.

        Returns:
            Dictionary with provider names and their restrictions
        """
        summary = {}
        for provider_type, allowed_set in self.restrictions.items():
            if allowed_set:
                summary[provider_type.value] = sorted(allowed_set)
            else:
                summary[provider_type.value] = "none (provider disabled)"

        return summary


# Global instance (singleton pattern)
_restriction_service: Optional[ModelRestrictionService] = None


def get_restriction_service() -> ModelRestrictionService:
    """
    Get the global restriction service instance.

    Returns:
        The singleton ModelRestrictionService instance
    """
    global _restriction_service
    if _restriction_service is None:
        _restriction_service = ModelRestrictionService()
    return _restriction_service

```
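
A minimal usage sketch of how this service is typically consumed (assuming `OPENAI_ALLOWED_MODELS` is visible to `utils.env.get_env` before the singleton is first constructed; provider registration is elided):

```python
# Sketch only: consult the allowlist before dispatching a model request.
import os

# Assumption: the variable is set before the singleton is first constructed.
os.environ.setdefault("OPENAI_ALLOWED_MODELS", "o3-mini,o4-mini")

from providers.shared import ProviderType
from utils.model_restrictions import get_restriction_service

service = get_restriction_service()

# Single check: names on the allowlist pass, everything else is rejected
# (aliases may still resolve via provider metadata when a provider is registered).
print(service.is_allowed(ProviderType.OPENAI, "o3-mini"))   # expected: True
print(service.is_allowed(ProviderType.OPENAI, "gpt-4.1"))   # expected: False

# Filter a candidate list down to what policy permits.
print(service.filter_models(ProviderType.OPENAI, ["o3-mini", "o4-mini", "gpt-4.1"]))
```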

--------------------------------------------------------------------------------
/tests/test_model_enumeration.py:
--------------------------------------------------------------------------------

```python
"""
Integration tests for model enumeration across all provider combinations.

These tests ensure that the _get_available_models() method correctly returns
all expected models based on which providers are configured via environment variables.
"""

import importlib
import json
import os

import pytest

from providers.registry import ModelProviderRegistry
from tools.analyze import AnalyzeTool


@pytest.mark.no_mock_provider
class TestModelEnumeration:
    """Test model enumeration with various provider configurations"""

    def setup_method(self):
        """Set up clean state before each test."""
        # Save original environment state
        self._original_env = {
            "DEFAULT_MODEL": os.environ.get("DEFAULT_MODEL", ""),
            "GEMINI_API_KEY": os.environ.get("GEMINI_API_KEY", ""),
            "OPENAI_API_KEY": os.environ.get("OPENAI_API_KEY", ""),
            "XAI_API_KEY": os.environ.get("XAI_API_KEY", ""),
            "OPENROUTER_API_KEY": os.environ.get("OPENROUTER_API_KEY", ""),
            "CUSTOM_API_URL": os.environ.get("CUSTOM_API_URL", ""),
        }

        # Clear provider registry
        ModelProviderRegistry._instance = None

    def teardown_method(self):
        """Clean up after each test."""
        # Restore original environment
        for key, value in self._original_env.items():
            if value:
                os.environ[key] = value
            elif key in os.environ:
                del os.environ[key]

        # Reload config
        import config

        importlib.reload(config)

        # Clear provider registry
        ModelProviderRegistry._instance = None

    def _setup_environment(self, provider_config):
        """Helper to set up environment variables for testing."""
        # Clear all provider-related env vars first
        for key in ["GEMINI_API_KEY", "OPENAI_API_KEY", "XAI_API_KEY", "OPENROUTER_API_KEY", "CUSTOM_API_URL"]:
            if key in os.environ:
                del os.environ[key]

        # Set new values
        for key, value in provider_config.items():
            if value is not None:
                os.environ[key] = value

        # Set auto mode only if not explicitly set in provider_config
        if "DEFAULT_MODEL" not in provider_config:
            os.environ["DEFAULT_MODEL"] = "auto"

        # Reload config to pick up changes
        import config

        importlib.reload(config)

        # Note: tools.base has been refactored to tools.shared.base_tool and tools.simple.base
        # No longer need to reload as configuration is handled at provider level

    def test_no_models_when_no_providers_configured(self):
        """Test that no native models are included when no providers are configured."""
        self._setup_environment({})  # No providers configured

        tool = AnalyzeTool()
        models = tool._get_available_models()

        # After the fix, models should only be shown from enabled providers
        # With no API keys configured, no providers should be enabled
        # Only OpenRouter aliases might still appear if they're in the registry

        # Filter out OpenRouter aliases that might still appear
        non_openrouter_models = [
            m for m in models if "/" not in m and m not in ["gemini", "pro", "flash", "opus", "sonnet", "haiku"]
        ]

        # No native provider models should be present without API keys
        assert (
            len(non_openrouter_models) == 0
        ), f"No native models should be available without API keys, but found: {non_openrouter_models}"

    def test_openrouter_models_without_api_key(self):
        """Test that OpenRouter models are NOT included when API key is not configured."""
        self._setup_environment({})  # No OpenRouter key

        tool = AnalyzeTool()
        models = tool._get_available_models()

        # OpenRouter-specific models should NOT be present
        openrouter_only_models = ["opus", "sonnet", "haiku"]
        found_count = sum(1 for m in openrouter_only_models if m in models)

        assert found_count == 0, "OpenRouter models should not be included without API key"

    def test_custom_models_without_custom_url(self):
        """Test that custom models are NOT included when CUSTOM_API_URL is not configured."""
        self._setup_environment({})  # No custom URL

        tool = AnalyzeTool()
        models = tool._get_available_models()

        # Custom-only models should NOT be present
        custom_only_models = ["local-llama", "llama3.2"]
        found_count = sum(1 for m in custom_only_models if m in models)

        assert found_count == 0, "Custom models should not be included without CUSTOM_API_URL"

    def test_custom_models_not_exposed_with_openrouter_only(self):
        """Ensure OpenRouter access alone does not surface custom-only endpoints."""
        self._setup_environment({"OPENROUTER_API_KEY": "test-openrouter-key"})

        tool = AnalyzeTool()
        models = tool._get_available_models()

        for alias in ("local-llama", "llama3.2"):
            assert alias not in models, f"Custom model alias '{alias}' should remain hidden without CUSTOM_API_URL"

    def test_no_duplicates_with_overlapping_providers(self):
        """Test that models aren't duplicated when multiple providers offer the same model."""
        self._setup_environment(
            {
                "OPENAI_API_KEY": "test",
                "OPENROUTER_API_KEY": "test",  # OpenRouter also offers OpenAI models
            }
        )

        tool = AnalyzeTool()
        models = tool._get_available_models()

        # Count occurrences of each model
        model_counts = {}
        for model in models:
            model_counts[model] = model_counts.get(model, 0) + 1

        # Check no duplicates
        duplicates = {m: count for m, count in model_counts.items() if count > 1}
        assert len(duplicates) == 0, f"Found duplicate models: {duplicates}"

    @pytest.mark.parametrize(
        "model_name,should_exist",
        [
            ("flash", False),  # Gemini - not available without API key
            ("o3", False),  # OpenAI - not available without API key
            ("grok", False),  # X.AI - not available without API key
            ("gemini-2.5-flash", False),  # Full Gemini name - not available without API key
            ("o4-mini", False),  # OpenAI variant - not available without API key
            ("grok-3-fast", False),  # X.AI variant - not available without API key
        ],
    )
    def test_specific_native_models_only_with_api_keys(self, model_name, should_exist):
        """Test that native models are only present when their provider has API keys configured."""
        self._setup_environment({})  # No providers

        tool = AnalyzeTool()
        models = tool._get_available_models()

        if should_exist:
            assert model_name in models, f"Model {model_name} should be present"
        else:
            assert model_name not in models, f"Native model {model_name} should not be present without API key"

    def test_openrouter_free_model_aliases_available(self, tmp_path, monkeypatch):
        """Free OpenRouter variants should expose both canonical names and aliases."""
        # Configure environment with OpenRouter access only
        self._setup_environment({"OPENROUTER_API_KEY": "test-openrouter-key"})

        # Create a temporary OpenRouter model config with a free variant
        custom_config = {
            "models": [
                {
                    "model_name": "deepseek/deepseek-r1:free",
                    "aliases": ["deepseek-free", "r1-free"],
                    "context_window": 163840,
                    "max_output_tokens": 8192,
                    "supports_extended_thinking": False,
                    "supports_json_mode": True,
                    "supports_function_calling": False,
                    "supports_images": False,
                    "max_image_size_mb": 0.0,
                    "description": "DeepSeek R1 free tier variant",
                }
            ]
        }

        config_path = tmp_path / "openrouter_models.json"
        config_path.write_text(json.dumps(custom_config), encoding="utf-8")
        monkeypatch.setenv("OPENROUTER_MODELS_CONFIG_PATH", str(config_path))

        # Reset cached registries so the temporary config is loaded
        from tools.shared.base_tool import BaseTool

        monkeypatch.setattr(BaseTool, "_openrouter_registry_cache", None, raising=False)

        from providers.openrouter import OpenRouterProvider

        monkeypatch.setattr(OpenRouterProvider, "_registry", None, raising=False)

        # Rebuild the provider registry with OpenRouter registered
        ModelProviderRegistry._instance = None
        from providers.shared import ProviderType

        ModelProviderRegistry.register_provider(ProviderType.OPENROUTER, OpenRouterProvider)

        tool = AnalyzeTool()
        models = tool._get_available_models()

        assert "deepseek/deepseek-r1:free" in models, "Canonical free model name should be available"
        assert "deepseek-free" in models, "Free model alias should be included for MCP validation"


# DELETED: test_auto_mode_behavior_with_environment_variables
# This test was fundamentally broken due to registry corruption.
# It cleared ModelProviderRegistry._instance without re-registering providers,
# causing impossible test conditions (expecting models when no providers exist).
# Functionality is already covered by test_auto_mode_comprehensive.py

# DELETED: test_auto_mode_model_selection_validation
# DELETED: test_environment_variable_precedence
# Both tests suffered from the same registry corruption issue as the deleted test above.
# They cleared ModelProviderRegistry._instance without re-registering providers,
# causing empty model lists and impossible test conditions.
# Auto mode functionality is already comprehensively tested in test_auto_mode_comprehensive.py

```

--------------------------------------------------------------------------------
/tests/test_workflow_file_embedding.py:
--------------------------------------------------------------------------------

```python
"""
Unit tests for workflow file embedding behavior

Tests the critical file embedding logic for workflow tools:
- Intermediate steps: Only reference file names (save Claude's context)
- Final steps: Embed full file content for expert analysis
"""

import os
import tempfile
from unittest.mock import Mock, patch

import pytest

from tools.workflow.workflow_mixin import BaseWorkflowMixin


class TestWorkflowFileEmbedding:
    """Test workflow file embedding behavior"""

    def setup_method(self):
        """Set up test fixtures"""
        # Create a mock workflow tool
        self.mock_tool = Mock()
        self.mock_tool.get_name.return_value = "test_workflow"

        # Bind the methods we want to test - use bound methods
        self.mock_tool._should_embed_files_in_workflow_step = (
            BaseWorkflowMixin._should_embed_files_in_workflow_step.__get__(self.mock_tool)
        )
        self.mock_tool._force_embed_files_for_expert_analysis = (
            BaseWorkflowMixin._force_embed_files_for_expert_analysis.__get__(self.mock_tool)
        )

        # Create test files
        self.test_files = []
        for i in range(2):
            fd, path = tempfile.mkstemp(suffix=f"_test_{i}.py")
            with os.fdopen(fd, "w") as f:
                f.write(f"# Test file {i}\nprint('hello world {i}')\n")
            self.test_files.append(path)

    def teardown_method(self):
        """Clean up test files"""
        for file_path in self.test_files:
            try:
                os.unlink(file_path)
            except OSError:
                pass

    def test_intermediate_step_no_embedding(self):
        """Test that intermediate steps only reference files, don't embed"""
        # Intermediate step: step_number=1, next_step_required=True
        step_number = 1
        continuation_id = None  # New conversation
        is_final_step = False  # next_step_required=True

        should_embed = self.mock_tool._should_embed_files_in_workflow_step(step_number, continuation_id, is_final_step)

        assert should_embed is False, "Intermediate steps should NOT embed files"

    def test_intermediate_step_with_continuation_no_embedding(self):
        """Test that intermediate steps with continuation only reference files"""
        # Intermediate step with continuation: step_number=2, next_step_required=True
        step_number = 2
        continuation_id = "test-thread-123"  # Continuing conversation
        is_final_step = False  # next_step_required=True

        should_embed = self.mock_tool._should_embed_files_in_workflow_step(step_number, continuation_id, is_final_step)

        assert should_embed is False, "Intermediate steps with continuation should NOT embed files"

    def test_final_step_embeds_files(self):
        """Test that final steps embed full file content for expert analysis"""
        # Final step: any step_number, next_step_required=False
        step_number = 3
        continuation_id = "test-thread-123"
        is_final_step = True  # next_step_required=False

        should_embed = self.mock_tool._should_embed_files_in_workflow_step(step_number, continuation_id, is_final_step)

        assert should_embed is True, "Final steps SHOULD embed files for expert analysis"

    def test_final_step_new_conversation_embeds_files(self):
        """Test that final steps in new conversations embed files"""
        # Final step in new conversation (rare but possible): step_number=1, next_step_required=False
        step_number = 1
        continuation_id = None  # New conversation
        is_final_step = True  # next_step_required=False (one-step workflow)

        should_embed = self.mock_tool._should_embed_files_in_workflow_step(step_number, continuation_id, is_final_step)

        assert should_embed is True, "Final steps in new conversations SHOULD embed files"

    @patch("utils.file_utils.read_files")
    @patch("utils.file_utils.expand_paths")
    @patch("utils.conversation_memory.get_thread")
    @patch("utils.conversation_memory.get_conversation_file_list")
    def test_comprehensive_file_collection_for_expert_analysis(
        self, mock_get_conversation_file_list, mock_get_thread, mock_expand_paths, mock_read_files
    ):
        """Test that expert analysis collects relevant files from current workflow and conversation history"""
        # Setup test files for different sources
        conversation_files = [self.test_files[0]]  # relevant_files from conversation history
        current_relevant_files = [
            self.test_files[0],
            self.test_files[1],
        ]  # current step's relevant_files (overlap with conversation)

        # Setup mocks
        mock_thread_context = Mock()
        mock_get_thread.return_value = mock_thread_context
        mock_get_conversation_file_list.return_value = conversation_files
        mock_expand_paths.return_value = self.test_files
        mock_read_files.return_value = "# File content\nprint('test')"

        # Mock model context for token allocation
        mock_model_context = Mock()
        mock_token_allocation = Mock()
        mock_token_allocation.file_tokens = 100000
        mock_model_context.calculate_token_allocation.return_value = mock_token_allocation

        # Set up the tool methods and state
        self.mock_tool.get_current_model_context.return_value = mock_model_context
        self.mock_tool.wants_line_numbers_by_default.return_value = True
        self.mock_tool.get_name.return_value = "test_workflow"

        # Set up consolidated findings
        self.mock_tool.consolidated_findings = Mock()
        self.mock_tool.consolidated_findings.relevant_files = set(current_relevant_files)

        # Set up current arguments with continuation
        self.mock_tool._current_arguments = {"continuation_id": "test-thread-123"}
        self.mock_tool.get_current_arguments.return_value = {"continuation_id": "test-thread-123"}

        # Bind the method we want to test
        self.mock_tool._prepare_files_for_expert_analysis = (
            BaseWorkflowMixin._prepare_files_for_expert_analysis.__get__(self.mock_tool)
        )
        self.mock_tool._force_embed_files_for_expert_analysis = (
            BaseWorkflowMixin._force_embed_files_for_expert_analysis.__get__(self.mock_tool)
        )

        # Call the method
        file_content = self.mock_tool._prepare_files_for_expert_analysis()

        # Verify it collected files from conversation history
        mock_get_thread.assert_called_once_with("test-thread-123")
        mock_get_conversation_file_list.assert_called_once_with(mock_thread_context)

        # Verify it called read_files with ALL unique relevant files
        # Should include files from: conversation_files + current_relevant_files
        # But deduplicated: [test_files[0], test_files[1]] (unique set)
        expected_unique_files = list(set(conversation_files + current_relevant_files))

        # The actual call will be with whatever files were collected and deduplicated
        mock_read_files.assert_called_once()
        call_args = mock_read_files.call_args
        called_files = call_args[0][0]  # First positional argument

        # Verify all expected files are included
        for expected_file in expected_unique_files:
            assert expected_file in called_files, f"Expected file {expected_file} not found in {called_files}"

        # Verify return value
        assert file_content == "# File content\nprint('test')"

    @patch("utils.file_utils.read_files")
    @patch("utils.file_utils.expand_paths")
    def test_force_embed_bypasses_conversation_history(self, mock_expand_paths, mock_read_files):
        """Test that _force_embed_files_for_expert_analysis bypasses conversation filtering"""
        # Setup mocks
        mock_expand_paths.return_value = self.test_files
        mock_read_files.return_value = "# File content\nprint('test')"

        # Mock model context for token allocation
        mock_model_context = Mock()
        mock_token_allocation = Mock()
        mock_token_allocation.file_tokens = 100000
        mock_model_context.calculate_token_allocation.return_value = mock_token_allocation

        # Set up the tool methods
        self.mock_tool.get_current_model_context.return_value = mock_model_context
        self.mock_tool.wants_line_numbers_by_default.return_value = True

        # Call the method
        file_content, processed_files = self.mock_tool._force_embed_files_for_expert_analysis(self.test_files)

        # Verify it called read_files directly (bypassing conversation history filtering)
        mock_read_files.assert_called_once_with(
            self.test_files,
            max_tokens=100000,
            reserve_tokens=1000,
            include_line_numbers=True,
        )

        # Verify it expanded paths to get individual files
        mock_expand_paths.assert_called_once_with(self.test_files)

        # Verify return values
        assert file_content == "# File content\nprint('test')"
        assert processed_files == self.test_files

    def test_embedding_decision_logic_comprehensive(self):
        """Comprehensive test of the embedding decision logic"""
        test_cases = [
            # (step_number, continuation_id, is_final_step, expected_embed, description)
            (1, None, False, False, "Step 1 new conversation, intermediate"),
            (1, None, True, True, "Step 1 new conversation, final (one-step workflow)"),
            (2, "thread-123", False, False, "Step 2 with continuation, intermediate"),
            (2, "thread-123", True, True, "Step 2 with continuation, final"),
            (5, "thread-456", False, False, "Step 5 with continuation, intermediate"),
            (5, "thread-456", True, True, "Step 5 with continuation, final"),
        ]

        for step_number, continuation_id, is_final_step, expected_embed, description in test_cases:
            should_embed = self.mock_tool._should_embed_files_in_workflow_step(
                step_number, continuation_id, is_final_step
            )

            assert should_embed == expected_embed, f"Failed for: {description}"


if __name__ == "__main__":
    pytest.main([__file__])

```

--------------------------------------------------------------------------------
/tests/test_model_metadata_continuation.py:
--------------------------------------------------------------------------------

```python
"""
Test model metadata preservation during conversation continuation.

This test verifies that when using continuation_id without specifying a model,
the system correctly retrieves and uses the model from the previous conversation
turn instead of defaulting to DEFAULT_MODEL or the custom provider's default.

Bug: https://github.com/BeehiveInnovations/zen-mcp-server/issues/111
"""

from unittest.mock import MagicMock, patch

import pytest

from server import reconstruct_thread_context
from utils.conversation_memory import add_turn, create_thread, get_thread
from utils.model_context import ModelContext


class TestModelMetadataContinuation:
    """Test model metadata preservation during conversation continuation."""

    @pytest.mark.asyncio
    async def test_model_preserved_from_previous_turn(self):
        """Test that model is correctly retrieved from previous conversation turn."""
        # Create a thread with a turn that has a specific model
        thread_id = create_thread("chat", {"prompt": "test"})

        # Add an assistant turn with a specific model
        success = add_turn(
            thread_id, "assistant", "Here's my response", model_name="deepseek-r1-8b", model_provider="custom"
        )
        assert success

        # Test continuation without model should use previous turn's model
        arguments = {"continuation_id": thread_id}  # No model specified

        # Mock dependencies to avoid side effects
        with patch("utils.model_context.ModelContext.calculate_token_allocation") as mock_calc:
            mock_calc.return_value = MagicMock(
                total_tokens=200000,
                content_tokens=160000,
                response_tokens=40000,
                file_tokens=64000,
                history_tokens=64000,
            )

            with patch("utils.conversation_memory.build_conversation_history") as mock_build:
                mock_build.return_value = ("=== CONVERSATION HISTORY ===\n", 1000)

                # Call the actual function
                enhanced_args = await reconstruct_thread_context(arguments)

                # Verify model was retrieved from thread
                assert enhanced_args.get("model") == "deepseek-r1-8b"

                # Verify ModelContext would use the correct model
                model_context = ModelContext.from_arguments(enhanced_args)
                assert model_context.model_name == "deepseek-r1-8b"

    @pytest.mark.asyncio
    async def test_reconstruct_thread_context_preserves_model(self):
        """Test that reconstruct_thread_context preserves model from previous turn."""
        # Create thread with assistant turn
        thread_id = create_thread("chat", {"prompt": "initial"})
        add_turn(thread_id, "assistant", "Initial response", model_name="o3-mini", model_provider="openai")

        # Test reconstruction without specifying model
        arguments = {"continuation_id": thread_id, "prompt": "follow-up question"}

        # Mock the model context to avoid initialization issues in tests
        with patch("utils.model_context.ModelContext.calculate_token_allocation") as mock_calc:
            mock_calc.return_value = MagicMock(
                total_tokens=200000,
                content_tokens=160000,
                response_tokens=40000,
                file_tokens=64000,
                history_tokens=64000,
            )

            with patch("utils.conversation_memory.build_conversation_history") as mock_build:
                mock_build.return_value = ("=== CONVERSATION HISTORY ===\n", 1000)

                enhanced_args = await reconstruct_thread_context(arguments)

                # Verify model was retrieved from thread
                assert enhanced_args.get("model") == "o3-mini"

    @pytest.mark.asyncio
    async def test_multiple_turns_uses_last_assistant_model(self):
        """Test that with multiple turns, the last assistant turn's model is used."""
        thread_id = create_thread("chat", {"prompt": "analyze this"})

        # Add multiple turns with different models
        add_turn(thread_id, "assistant", "First response", model_name="gemini-2.5-flash", model_provider="google")
        add_turn(thread_id, "user", "Another question")
        add_turn(thread_id, "assistant", "Second response", model_name="o3", model_provider="openai")
        add_turn(thread_id, "user", "Final question")

        arguments = {"continuation_id": thread_id}

        # Mock dependencies
        with patch("utils.model_context.ModelContext.calculate_token_allocation") as mock_calc:
            mock_calc.return_value = MagicMock(
                total_tokens=200000,
                content_tokens=160000,
                response_tokens=40000,
                file_tokens=64000,
                history_tokens=64000,
            )

            with patch("utils.conversation_memory.build_conversation_history") as mock_build:
                mock_build.return_value = ("=== CONVERSATION HISTORY ===\n", 1000)

                # Call the actual function
                enhanced_args = await reconstruct_thread_context(arguments)

                # Should use the most recent assistant model
                assert enhanced_args.get("model") == "o3"

    @pytest.mark.asyncio
    async def test_no_previous_assistant_turn_defaults(self):
        """Test behavior when there's no previous assistant turn."""
        # Save and set DEFAULT_MODEL for test
        import importlib
        import os

        original_default = os.environ.get("DEFAULT_MODEL", "")
        os.environ["DEFAULT_MODEL"] = "auto"
        import config
        import utils.model_context

        importlib.reload(config)
        importlib.reload(utils.model_context)

        try:
            thread_id = create_thread("chat", {"prompt": "test"})

            # Only add user turns
            add_turn(thread_id, "user", "First question")
            add_turn(thread_id, "user", "Second question")

            arguments = {"continuation_id": thread_id}

            # Mock dependencies
            with patch("utils.model_context.ModelContext.calculate_token_allocation") as mock_calc:
                mock_calc.return_value = MagicMock(
                    total_tokens=200000,
                    content_tokens=160000,
                    response_tokens=40000,
                    file_tokens=64000,
                    history_tokens=64000,
                )

                with patch("utils.conversation_memory.build_conversation_history") as mock_build:
                    mock_build.return_value = ("=== CONVERSATION HISTORY ===\n", 1000)

                    # Call the actual function
                    enhanced_args = await reconstruct_thread_context(arguments)

                    # Should not have set a model
                    assert enhanced_args.get("model") is None

                    # ModelContext should use DEFAULT_MODEL
                    model_context = ModelContext.from_arguments(enhanced_args)
                    from config import DEFAULT_MODEL

                    assert model_context.model_name == DEFAULT_MODEL
        finally:
            # Restore original value
            if original_default:
                os.environ["DEFAULT_MODEL"] = original_default
            else:
                os.environ.pop("DEFAULT_MODEL", None)
            importlib.reload(config)
            importlib.reload(utils.model_context)

    @pytest.mark.asyncio
    async def test_explicit_model_overrides_previous_turn(self):
        """Test that explicitly specifying a model overrides the previous turn's model."""
        thread_id = create_thread("chat", {"prompt": "test"})
        add_turn(thread_id, "assistant", "Response", model_name="gemini-2.5-flash", model_provider="google")

        arguments = {"continuation_id": thread_id, "model": "o3"}  # Explicitly specified

        # Mock dependencies
        with patch("utils.model_context.ModelContext.calculate_token_allocation") as mock_calc:
            mock_calc.return_value = MagicMock(
                total_tokens=200000,
                content_tokens=160000,
                response_tokens=40000,
                file_tokens=64000,
                history_tokens=64000,
            )

            with patch("utils.conversation_memory.build_conversation_history") as mock_build:
                mock_build.return_value = ("=== CONVERSATION HISTORY ===\n", 1000)

                # Call the actual function
                enhanced_args = await reconstruct_thread_context(arguments)

                # Should keep the explicit model
                assert enhanced_args.get("model") == "o3"

    @pytest.mark.asyncio
    async def test_thread_chain_model_preservation(self):
        """Test model preservation across thread chains (parent-child relationships)."""
        # Create parent thread
        parent_id = create_thread("chat", {"prompt": "analyze"})
        add_turn(parent_id, "assistant", "Analysis", model_name="gemini-2.5-pro", model_provider="google")

        # Create child thread using a simple tool instead of workflow tool
        child_id = create_thread("chat", {"prompt": "review"}, parent_thread_id=parent_id)

        # Child thread should be able to access parent's model through chain traversal
        # NOTE: Current implementation only checks current thread (not parent threads)
        context = get_thread(child_id)
        assert context.parent_thread_id == parent_id

        arguments = {"continuation_id": child_id}

        # Mock dependencies
        with patch("utils.model_context.ModelContext.calculate_token_allocation") as mock_calc:
            mock_calc.return_value = MagicMock(
                total_tokens=200000,
                content_tokens=160000,
                response_tokens=40000,
                file_tokens=64000,
                history_tokens=64000,
            )

            with patch("utils.conversation_memory.build_conversation_history") as mock_build:
                mock_build.return_value = ("=== CONVERSATION HISTORY ===\n", 1000)

                # Call the actual function
                enhanced_args = await reconstruct_thread_context(arguments)

                # No turns in child thread yet, so model should not be set
                assert enhanced_args.get("model") is None

```

--------------------------------------------------------------------------------
/simulator_tests/log_utils.py:
--------------------------------------------------------------------------------

```python
"""
Centralized log utility for simulator tests.

This module provides common log reading and parsing functionality
used across multiple simulator test files to reduce code duplication.
"""

import logging
import re
import subprocess
from typing import Optional, Union


class LogUtils:
    """Centralized logging utilities for simulator tests."""

    # Log file paths
    MAIN_LOG_FILE = "logs/mcp_server.log"
    ACTIVITY_LOG_FILE = "logs/mcp_activity.log"

    @classmethod
    def get_server_logs_since(cls, since_time: Optional[str] = None) -> str:
        """
        Get server logs from both main and activity log files.

        Args:
            since_time: Currently ignored, returns all available logs

        Returns:
            Combined logs from both log files
        """
        try:
            main_logs = ""
            activity_logs = ""

            # Read main server log
            try:
                with open(cls.MAIN_LOG_FILE) as f:
                    main_logs = f.read()
            except FileNotFoundError:
                pass

            # Read activity log
            try:
                with open(cls.ACTIVITY_LOG_FILE) as f:
                    activity_logs = f.read()
            except FileNotFoundError:
                pass

            return main_logs + "\n" + activity_logs

        except Exception as e:
            logging.warning(f"Failed to read server logs: {e}")
            return ""

    @classmethod
    def get_recent_server_logs(cls, lines: int = 500) -> str:
        """
        Get recent server logs from the main log file.

        Args:
            lines: Number of recent lines to retrieve (default: 500)

        Returns:
            Recent log content as string
        """
        try:
            with open(cls.MAIN_LOG_FILE) as f:
                all_lines = f.readlines()
                recent_lines = all_lines[-lines:] if len(all_lines) > lines else all_lines
                return "".join(recent_lines)
        except FileNotFoundError:
            logging.warning(f"Log file {cls.MAIN_LOG_FILE} not found")
            return ""
        except Exception as e:
            logging.warning(f"Failed to read recent server logs: {e}")
            return ""

    @classmethod
    def get_server_logs_subprocess(cls, lines: int = 500) -> str:
        """
        Get server logs using subprocess (alternative method).

        Args:
            lines: Number of recent lines to retrieve

        Returns:
            Recent log content as string
        """
        try:
            result = subprocess.run(
                ["tail", "-n", str(lines), cls.MAIN_LOG_FILE], capture_output=True, text=True, timeout=10
            )
            return result.stdout + result.stderr
        except Exception as e:
            logging.warning(f"Failed to get server logs via subprocess: {e}")
            return ""

    @classmethod
    def check_server_logs_for_errors(cls, lines: int = 500) -> list[str]:
        """
        Check server logs for error messages.

        Args:
            lines: Number of recent lines to check

        Returns:
            List of error messages found
        """
        logs = cls.get_recent_server_logs(lines)
        error_patterns = [r"ERROR.*", r"CRITICAL.*", r"Failed.*", r"Exception.*", r"Error:.*"]

        errors = []
        for line in logs.split("\n"):
            for pattern in error_patterns:
                if re.search(pattern, line, re.IGNORECASE):
                    errors.append(line.strip())
                    break

        return errors

    @classmethod
    def extract_conversation_usage_logs(cls, logs: str) -> list[dict[str, int]]:
        """
        Extract token budget calculation information from logs.

        Args:
            logs: Log content to parse

        Returns:
            List of dictionaries containing token usage data
        """
        usage_data = []
        pattern = r"\[CONVERSATION_DEBUG\] Token budget calculation:"

        for line in logs.split("\n"):
            if re.search(pattern, line):
                # Parse the token usage information
                usage_info = {}

                # Extract total capacity
                capacity_match = re.search(r"Total capacity: ([\d,]+)", line)
                if capacity_match:
                    usage_info["total_capacity"] = int(capacity_match.group(1).replace(",", ""))

                # Extract content allocation
                content_match = re.search(r"Content allocation: ([\d,]+)", line)
                if content_match:
                    usage_info["content_allocation"] = int(content_match.group(1).replace(",", ""))

                # Extract conversation tokens
                conv_match = re.search(r"Conversation tokens: ([\d,]+)", line)
                if conv_match:
                    usage_info["conversation_tokens"] = int(conv_match.group(1).replace(",", ""))

                # Extract remaining tokens
                remaining_match = re.search(r"Remaining tokens: ([\d,]+)", line)
                if remaining_match:
                    usage_info["remaining_tokens"] = int(remaining_match.group(1).replace(",", ""))

                if usage_info:
                    usage_data.append(usage_info)

        return usage_data

    @classmethod
    def extract_conversation_token_usage(cls, logs: str) -> list[int]:
        """
        Extract conversation token usage values from logs.

        Args:
            logs: Log content to parse

        Returns:
            List of token usage values
        """
        pattern = r"Conversation history token usage:\s*([\d,]+)"
        usage_values = []

        for match in re.finditer(pattern, logs):
            usage_value = int(match.group(1).replace(",", ""))
            usage_values.append(usage_value)

        return usage_values

    @classmethod
    def extract_thread_creation_logs(cls, logs: str) -> list[dict[str, str]]:
        """
        Extract thread creation logs with parent relationships.

        Args:
            logs: Log content to parse

        Returns:
            List of dictionaries with thread relationship data
        """
        thread_data = []
        pattern = r"\[THREAD\] Created new thread (\w+)(?: with parent (\w+))?"

        for match in re.finditer(pattern, logs):
            thread_info = {"thread_id": match.group(1), "parent_id": match.group(2) if match.group(2) else None}
            thread_data.append(thread_info)

        return thread_data

    @classmethod
    def extract_history_traversal_logs(cls, logs: str) -> list[dict[str, Union[str, int]]]:
        """
        Extract conversation history traversal logs.

        Args:
            logs: Log content to parse

        Returns:
            List of dictionaries with traversal data
        """
        traversal_data = []
        pattern = r"\[THREAD\] Retrieved chain of (\d+) messages for thread (\w+)"

        for match in re.finditer(pattern, logs):
            traversal_info = {"chain_length": int(match.group(1)), "thread_id": match.group(2)}
            traversal_data.append(traversal_info)

        return traversal_data

    @classmethod
    def validate_file_deduplication_in_logs(cls, logs: str, tool_name: str, test_file: str) -> bool:
        """
        Validate that logs show file deduplication behavior.

        Args:
            logs: Log content to parse
            tool_name: Name of the tool being tested
            test_file: Name of the test file to check for deduplication

        Returns:
            True if deduplication evidence is found, False otherwise
        """
        # Look for embedding calculation
        embedding_pattern = f"Calculating embeddings for {test_file}"
        has_embedding = bool(re.search(embedding_pattern, logs))

        # Look for filtering message
        filtering_pattern = f"Filtering {test_file} to prevent duplication"
        has_filtering = bool(re.search(filtering_pattern, logs))

        # Look for skip message
        skip_pattern = f"Skipping {test_file} \\(already processed"
        has_skip = bool(re.search(skip_pattern, logs))

        # Look for tool-specific processing
        tool_pattern = f"\\[{tool_name.upper()}\\].*{test_file}"
        has_tool_processing = bool(re.search(tool_pattern, logs, re.IGNORECASE))

        # Deduplication is confirmed if we see evidence of processing and filtering/skipping
        return has_embedding and (has_filtering or has_skip) and has_tool_processing

    @classmethod
    def search_logs_for_pattern(
        cls, pattern: str, logs: Optional[str] = None, case_sensitive: bool = False
    ) -> list[str]:
        """
        Search logs for a specific pattern.

        Args:
            pattern: Regex pattern to search for
            logs: Log content to search (if None, reads recent logs)
            case_sensitive: Whether the search should be case sensitive

        Returns:
            List of matching lines
        """
        if logs is None:
            logs = cls.get_recent_server_logs()

        flags = 0 if case_sensitive else re.IGNORECASE
        matches = []

        for line in logs.split("\n"):
            if re.search(pattern, line, flags):
                matches.append(line.strip())

        return matches

    @classmethod
    def get_log_file_info(cls) -> dict[str, dict[str, Union[str, int, bool]]]:
        """
        Get information about log files.

        Returns:
            Dictionary with file information for each log file
        """
        import os

        file_info = {}

        for log_file in [cls.MAIN_LOG_FILE, cls.ACTIVITY_LOG_FILE]:
            if os.path.exists(log_file):
                stat = os.stat(log_file)
                file_info[log_file] = {
                    "exists": True,
                    "size_bytes": stat.st_size,
                    "size_mb": round(stat.st_size / (1024 * 1024), 2),
                    "last_modified": stat.st_mtime,
                    "readable": os.access(log_file, os.R_OK),
                }
            else:
                file_info[log_file] = {
                    "exists": False,
                    "size_bytes": 0,
                    "size_mb": 0,
                    "last_modified": 0,
                    "readable": False,
                }

        return file_info

```
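
A brief sketch of how a simulator test might use these helpers after a scenario run (log paths and patterns follow the constants above; the assertions are illustrative only):

```python
# Sketch only: post-run log checks in a simulator test.
from simulator_tests.log_utils import LogUtils

# Tail the main server log and scan it for error-level lines.
recent = LogUtils.get_recent_server_logs(lines=300)
errors = LogUtils.check_server_logs_for_errors(lines=300)
assert not errors, f"Server reported errors during the scenario: {errors[:3]}"

# Pull conversation token-budget entries and sanity-check each allocation.
for usage in LogUtils.extract_conversation_usage_logs(recent):
    if "total_capacity" in usage and "content_allocation" in usage:
        assert usage["content_allocation"] <= usage["total_capacity"]
```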

--------------------------------------------------------------------------------
/docs/tools/debug.md:
--------------------------------------------------------------------------------

```markdown
# Debug Tool - Systematic Investigation & Expert Analysis

**Step-by-step investigation followed by expert debugging assistance**

The `debug` workflow guides Claude through a systematic investigation process: methodical code examination, evidence 
collection, and hypothesis formation across multiple steps. Once the investigation is complete, the tool can optionally 
request expert analysis from the selected AI model based on all the gathered findings.

## Example Prompts

```
Get gemini to debug why my API returns 400 errors randomly with the full stack trace: [paste traceback]
```

You can also ask it to debug on its own, with no external model required (**recommended in most cases**).
```
Use debug tool to find out why the app is crashing, here are some app logs [paste app logs] and a crash trace: [paste crash trace]
```

## How It Works 

The debug tool implements a **systematic investigation methodology** where Claude is guided through structured debugging steps:

**Investigation Phase:**
1. **Step 1**: Claude describes the issue and begins thinking deeply about possible underlying causes, side-effects, and contributing factors
2. **Step 2+**: Claude examines relevant code, traces errors, tests hypotheses, and gathers evidence
3. **Throughout**: Claude tracks findings, relevant files, methods, and evolving hypotheses with confidence levels
4. **Backtracking**: Claude can revise previous steps when new insights emerge
5. **Completion**: Once investigation is thorough, Claude signals completion

**Expert Analysis Phase:**
After Claude completes the investigation, it automatically calls the selected AI model (unless confidence is **certain**, 
in which case expert analysis is bypassed) with:
- Complete investigation summary with all steps and findings
- Relevant files and methods identified during investigation  
- Final hypothesis and confidence assessment
- Error context and supporting evidence
- Visual debugging materials if provided

This structured approach ensures Claude performs methodical groundwork before expert analysis, resulting in significantly better debugging outcomes and more efficient token usage.

**Special Note**: If you want Claude to perform the entire debugging investigation without calling another model, you can include "don't use any other model" in your prompt, and Claude will complete the full workflow independently.

## Key Features

- **Multi-step investigation process** with evidence collection and hypothesis evolution
- **Systematic code examination** with file and method tracking throughout investigation
- **Confidence assessment and revision** capabilities for investigative steps
- **Backtracking support** to revise previous steps when new insights emerge
- **Expert analysis integration** that provides final debugging recommendations based on complete investigation
- **Error context support**: Stack traces, logs, and runtime information
- **Visual debugging**: Include error screenshots, stack traces, console output
- **Conversation threading**: Continue investigations across multiple sessions
- **Large context analysis**: Handle extensive log files and multiple related code files
- **Multi-language support**: Debug issues across Python, JavaScript, Java, C#, Swift, and more
- **Web search integration**: Identifies when additional research would help solve problems

## Tool Parameters

**Investigation Step Parameters** (a sketch of a complete step call follows these lists):
- `step`: Current investigation step description (required)
- `step_number`: Current step number in investigation sequence (required)
- `total_steps`: Estimated total investigation steps (adjustable as process evolves)
- `next_step_required`: Whether another investigation step is needed
- `findings`: Discoveries and evidence collected in this step (required)
- `files_checked`: All files examined during investigation (tracks exploration path)
- `relevant_files`: Files directly tied to the root cause or its effects
- `relevant_methods`: Specific methods/functions involved in the issue
- `hypothesis`: Current best guess about the underlying cause
- `confidence`: Confidence level in current hypothesis (exploring/low/medium/high/certain)
- `continuation_id`: Thread ID for continuing investigations across sessions
- `images`: Visual debugging materials (error screenshots, logs, etc.)

**Model Selection:**
- `model`: auto|pro|flash|flash-2.0|flashlite|o3|o3-mini|o4-mini|gpt4.1|gpt5|gpt5-mini|gpt5-nano (default: server default)
- `thinking_mode`: minimal|low|medium|high|max (default: medium, Gemini only)
- `use_assistant_model`: Whether to use expert analysis phase (default: true, set to false to use Claude only)
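
For illustration, the parameters above might be combined into a single investigation-step payload along these lines (all values are hypothetical, and the exact call format depends on your MCP client):

```
# Hypothetical arguments for one investigation step -- values are made up
# for illustration; the exact call format depends on your MCP client.
debug_step = {
    "step": "Trace why /login intermittently returns 401 despite valid tokens",
    "step_number": 2,
    "total_steps": 4,                 # estimate; can be revised in later steps
    "next_step_required": True,
    "findings": "Token validation passes, but session lookup sometimes misses the cache",
    "files_checked": ["auth.py", "session_manager.py"],
    "relevant_files": ["session_manager.py"],
    "relevant_methods": ["SessionManager.get_session"],
    "hypothesis": "Race between session write and first authenticated request",
    "confidence": "medium",
    "model": "pro",                   # optional; omit to use the server default
}
```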

## Usage Examples

**Error Debugging:**
```
Debug this TypeError: 'NoneType' object has no attribute 'split' in my parser.py
```

**With Stack Trace:**
```
Use gemini to debug why my API returns 500 errors with this stack trace: [paste full traceback]
```

**With File Context:**
```
Debug the authentication failure in auth.py and user_model.py without using an external model
```

**Performance Debugging:**
```
Debug without using an external model to find out why the app is consuming excessive memory during bulk edit operations
```

**Runtime Environment Issues:**
```
Debug deployment issues with server startup failures, here's the runtime info: [environment details]
```

## Investigation Methodology

The debug tool enforces a thorough, structured investigation process:

**Step-by-Step Investigation (Claude-Led):**
1. **Initial Problem Description:** Claude describes the issue and begins thinking about possible causes, side-effects, and contributing factors
2. **Code Examination:** Claude systematically examines relevant files, traces execution paths, and identifies suspicious patterns
3. **Evidence Collection:** Claude gathers findings, tracks files checked, and identifies methods/functions involved
4. **Hypothesis Formation:** Claude develops working theories about the root cause with confidence assessments
5. **Iterative Refinement:** Claude can backtrack and revise previous steps as understanding evolves
6. **Investigation Completion:** Claude signals when sufficient evidence has been gathered

**Expert Analysis Phase (when another AI model is used):**
Once investigation is complete, the selected AI model performs:
- **Root Cause Analysis:** Deep analysis of all investigation findings and evidence
- **Solution Recommendations:** Specific fixes with implementation guidance
- **Prevention Strategies:** Measures to avoid similar issues in the future
- **Testing Approaches:** Validation methods for proposed solutions

**Key Benefits:**
- **Methodical Evidence Collection:** Ensures no critical information is missed
- **Progressive Understanding:** Hypotheses evolve as investigation deepens
- **Complete Context:** Expert analysis receives full investigation history
- **Efficient Token Usage:** Structured approach prevents redundant back-and-forth
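
To illustrate the last two points, the accumulated steps can be thought of as being folded into a single expert-analysis request. This is a simplified sketch of the idea, not the server's actual implementation:

```
# Simplified sketch of folding all investigation steps into one expert-analysis
# request -- not the server's actual code, just the idea of "complete context".
def build_expert_context(steps: list[dict]) -> str:
    lines = [f"Step {s['step_number']}: {s['findings']}" for s in steps]
    final = steps[-1]
    lines.append(f"Hypothesis: {final['hypothesis']} (confidence: {final['confidence']})")
    lines.append("Relevant files: " + ", ".join(final.get("relevant_files", [])))
    return "\n".join(lines)
```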

## Debugging Categories

**Runtime Errors:**
- Exceptions and crashes
- Null pointer/reference errors
- Type errors and casting issues
- Memory leaks and resource exhaustion

**Logic Errors:**
- Incorrect algorithm implementation
- Off-by-one errors and boundary conditions
- State management issues
- Race conditions and concurrency bugs

**Integration Issues:**
- API communication failures
- Database connection problems
- Third-party service integration
- Configuration and environment issues

**Performance Problems:**
- Slow response times
- Memory usage spikes
- CPU-intensive operations
- I/O bottlenecks

## Best Practices

**For Investigation Steps:**
- **Be thorough in step descriptions**: Explain what you're examining and why
- **Track all files examined**: Include even files that don't contain the bug (tracks investigation path)
- **Document findings clearly**: Summarize discoveries, suspicious patterns, and evidence
- **Evolve hypotheses**: Update theories as investigation progresses
- **Use backtracking wisely**: Revise previous steps when new insights emerge
- **Include visual evidence**: Screenshots, error dialogs, console output

**For Initial Problem Description:**
- **Provide complete error context**: Full stack traces, error messages, and logs
- **Describe expected vs actual behavior**: Clear symptom description
- **Include environment details**: Runtime versions, configuration, deployment context
- **Mention previous attempts**: What debugging steps have already been tried
- **Be specific about occurrence**: When, where, and how the issue manifests

## Advanced Features

**Large Log Analysis:**
With models like Gemini Pro (1M context), you can include extensive log files for comprehensive analysis:
```
"Debug application crashes using these large log files: app.log, error.log, system.log"
```

**Multi-File Investigation:**
Analyze multiple related files simultaneously to understand complex issues:
```
"Debug the data processing pipeline issues across processor.py, validator.py, and output_handler.py"
```

**Web Search Integration:**
The tool can recommend specific searches for error messages, known issues, or documentation:
```
After analysis: "Recommended searches for Claude: 'Django 4.2 migration error specific_error_code', 'PostgreSQL connection pool exhaustion solutions'"
```

## When to Use Debug vs Other Tools

- **Use `debug`** for: Specific runtime errors, exceptions, crashes, performance issues requiring systematic investigation
- **Use `codereview`** for: Finding potential bugs in code without specific errors or symptoms
- **Use `analyze`** for: Understanding code structure and flow without troubleshooting specific issues
- **Use `precommit`** for: Validating changes before commit to prevent introducing bugs

## Investigation Example

**Step 1:** "The user authentication fails intermittently with no error logs. I need to investigate the auth flow and identify where failures might occur silently."

**Step 2:** "Examined auth.py and found three potential failure points: token validation, database connectivity, and session management. No obvious bugs yet but need to trace execution flow."

**Step 3:** "Found suspicious async/await pattern in session_manager.py lines 45-67. The await might be missing exception handling. This could explain silent failures."

**Completion:** Investigation reveals likely root cause in exception handling, ready for expert analysis with full context.
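
Expressed as tool parameters, the same investigation might look roughly like this (a condensed, hypothetical sketch with illustrative values):

```
# Condensed, hypothetical mapping of the example above onto step parameters.
steps = [
    {"step_number": 1, "confidence": "exploring",
     "findings": "Intermittent auth failures with no error logs; need to trace the auth flow",
     "hypothesis": "A failure point in the auth flow is being swallowed silently"},
    {"step_number": 2, "confidence": "low",
     "findings": "Three candidate failure points: token validation, DB connectivity, session management",
     "files_checked": ["auth.py"]},
    {"step_number": 3, "confidence": "high", "next_step_required": False,
     "findings": "Missing exception handling around an await in session_manager.py lines 45-67",
     "relevant_files": ["session_manager.py"],
     "hypothesis": "An unhandled async exception causes silent auth failures"},
]
# With next_step_required=False and confidence below "certain",
# the full history above would be handed to the expert analysis phase.
```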

```