# Directory Structure

```
├── .env.sample
├── .gemini
│   └── settings.json
├── .gitignore
├── .python-version
├── .specify
│   ├── memory
│   │   └── constitution.md
│   ├── scripts
│   │   └── bash
│   │       ├── check-implementation-prerequisites.sh
│   │       ├── check-task-prerequisites.sh
│   │       ├── common.sh
│   │       ├── create-new-feature.sh
│   │       ├── get-feature-paths.sh
│   │       ├── setup-plan.sh
│   │       └── update-agent-context.sh
│   └── templates
│       ├── agent-file-template.md
│       ├── plan-template.md
│       ├── spec-template.md
│       └── tasks-template.md
├── package.json
├── pyproject.toml
├── README.md
├── reddit-research-agent.md
├── reports
│   ├── ai-llm-weekly-trends-reddit-analysis-2025-01-20.md
│   ├── saas-solopreneur-reddit-communities.md
│   ├── top-50-active-AI-subreddits.md
│   ├── top-50-subreddits-saas-ai-builders.md
│   └── top-50-subreddits-saas-solopreneurs.md
├── server.json
├── specs
│   ├── 003-fastmcp-context-integration.md
│   ├── 003-implementation-summary.md
│   ├── 003-phase-1-context-integration.md
│   ├── 003-phase-2-progress-monitoring.md
│   ├── agent-reasoning-visibility.md
│   ├── agentic-discovery-architecture.md
│   ├── chroma-proxy-architecture.md
│   ├── deep-research-reddit-architecture.md
│   └── reddit-research-agent-spec.md
├── src
│   ├── __init__.py
│   ├── chroma_client.py
│   ├── config.py
│   ├── models.py
│   ├── resources.py
│   ├── server.py
│   └── tools
│       ├── __init__.py
│       ├── comments.py
│       ├── discover.py
│       ├── posts.py
│       └── search.py
├── tests
│   ├── test_context_integration.py
│   └── test_tools.py
└── uv.lock
```

# Files

--------------------------------------------------------------------------------
/specs/agentic-discovery-architecture.md:
--------------------------------------------------------------------------------

```markdown
# Agentic Discovery Architecture with OpenAI Agents SDK

## Overview
This document outlines the refactoring of the monolithic `discover.py` tool into a modular, agentic architecture using OpenAI's Python Agents SDK. Each agent has a single, well-defined responsibility and can hand off to other specialized agents as needed.

### Why Agentic Architecture?

The current monolithic `discover.py` file (400+ lines) combines multiple concerns:
- Query processing and analysis
- API interaction and error handling
- Scoring and ranking algorithms
- Result formatting and synthesis
- Batch operations management

This creates several problems:
1. **Testing Complexity**: Can't test scoring without API calls
2. **Limited Reusability**: Can't use validation logic elsewhere
3. **Performance Issues**: Sequential processing of batch requests
4. **Maintenance Burden**: Changes risk breaking unrelated functionality
5. **Scaling Challenges**: Adding features requires modifying core logic

The agentic approach solves these issues by decomposing functionality into specialized, autonomous agents that collaborate through well-defined interfaces.

## Architecture Principles

1. **Single Responsibility**: Each agent performs one specific task excellently
2. **Composability**: Agents can be combined in different ways for various workflows
3. **Testability**: Each agent can be tested in isolation
4. **Observability**: Full tracing of agent decision-making process
5. **Efficiency**: Smart routing and parallel execution where possible

## Directory Structure

```
reddit-research-mcp/src/
├── agents/
│   ├── __init__.py
│   ├── discovery_orchestrator.py
│   ├── query_analyzer.py
│   ├── subreddit_scorer.py
│   ├── search_executor.py
│   ├── batch_manager.py
│   ├── validator.py
│   └── synthesizer.py
├── models/
│   ├── __init__.py
│   ├── discovery_context.py
│   └── discovery_models.py
└── tools/
    └── discover_agent.py
```

## Agent Specifications

### 1. Discovery Orchestrator Agent
**File**: `agents/discovery_orchestrator.py`

**Purpose**: Routes discovery requests to the appropriate specialized agent based on query type and requirements.

**Why This Agent?**
The Discovery Orchestrator serves as the intelligent entry point that prevents inefficient processing. In the monolithic approach, every query goes through the same pipeline regardless of complexity. This agent enables:
- **Smart Routing**: Simple queries skip unnecessary analysis steps
- **Resource Optimization**: Uses appropriate agents based on query complexity
- **Error Isolation**: Failures in one path don't affect others
- **Scalability**: New discovery strategies can be added without modifying core logic

**Architectural Role**:
- **Entry Point**: First agent in every discovery workflow
- **Traffic Director**: Routes to specialized agents based on intent
- **Fallback Handler**: Manages errors and edge cases gracefully
- **Performance Optimizer**: Chooses the fastest path for each query type

**Problem Solved**:
The monolithic `discover.py` processes all queries identically, wasting resources on simple validations and lacking optimization for batch operations. The orchestrator eliminates this inefficiency.

**Key Interactions**:
- **Receives**: Raw discovery requests from the main entry point
- **Delegates To**: Query Analyzer (complex), Batch Manager (multiple), Validator (verification), Search Executor (simple)
- **Returns**: Final results from delegated agents

**Key Responsibilities**:
- Analyze incoming discovery requests
- Determine optimal discovery strategy
- Route to appropriate specialized agent
- Handle edge cases and errors gracefully

**Model**: `gpt-4o-mini` (lightweight routing decisions)

**Handoffs**:
- Query Analyzer (for complex queries)
- Batch Manager (for multiple queries)
- Validator (for direct validation)
- Search Executor (for simple searches)

**Implementation**:
```python
from agents import Agent
from agents.extensions.handoff_prompt import RECOMMENDED_PROMPT_PREFIX

from src.models.discovery_context import DiscoveryContext
# Specialized agents, each defined in its own module (specified below)
from src.agents.query_analyzer import query_analyzer
from src.agents.batch_manager import batch_manager
from src.agents.validator import validator
from src.agents.search_executor import search_executor

discovery_orchestrator = Agent[DiscoveryContext](
    name="Discovery Orchestrator",
    instructions=f"""{RECOMMENDED_PROMPT_PREFIX}
    You are a routing agent for Reddit discovery requests.

    Analyze the incoming request and determine the best path:
    - Complex queries needing analysis → Query Analyzer
    - Batch/multiple queries → Batch Manager
    - Direct subreddit validation → Validator
    - Simple searches → Search Executor

    Consider efficiency and accuracy when routing.
    """,
    model="gpt-4o-mini",
    handoffs=[query_analyzer, batch_manager, validator, search_executor]
)
```

### 2. Query Analyzer Agent
**File**: `agents/query_analyzer.py`

**Purpose**: Analyzes and enhances search queries for better results.

**Why This Agent?**
Reddit's search API is notoriously limited and literal. The Query Analyzer transforms vague or complex user queries into optimized search strategies. This agent provides:
- **Semantic Understanding**: Interprets user intent beyond literal keywords
- **Query Expansion**: Adds synonyms and related terms for comprehensive results
- **Search Strategy**: Determines the best approach (broad vs. specific search)
- **Intent Classification**: Distinguishes topic exploration from specific community search

**Architectural Role**:
- **Query Preprocessor**: Enhances queries before they hit the Reddit API
- **Intent Detector**: Classifies what the user is really looking for
- **Strategy Advisor**: Recommends search approaches to downstream agents
- **NLP Specialist**: Applies language understanding to improve results

**Problem Solved**:
The monolithic approach uses raw queries directly, leading to poor results when users use natural language or ambiguous terms. This agent bridges the gap between human expression and API requirements.

**Key Interactions**:
- **Receives From**: Discovery Orchestrator (complex queries)
- **Processes**: Raw user queries into structured search plans
- **Hands Off To**: Search Executor (with enhanced query and strategy)
- **Provides**: Keywords, expanded terms, and intent classification

**Key Responsibilities**:
- Extract keywords and intent
- Expand query with related terms
- Classify query type (topic, community, specific)
- Generate search strategies

**Tools**:
```python
# Shared imports assumed by the tool sketches throughout this spec;
# QueryExpansion and QueryIntent are assumed to live in discovery_models.
from typing import List
from agents import function_tool, RunContextWrapper

from src.models.discovery_context import DiscoveryContext
from src.models.discovery_models import QueryExpansion, QueryIntent

@function_tool
def extract_keywords(wrapper: RunContextWrapper[DiscoveryContext], text: str) -> List[str]:
    """Extract meaningful keywords from query text."""
    # Implementation from current discover.py
    ...

@function_tool
def expand_query(wrapper: RunContextWrapper[DiscoveryContext], query: str) -> QueryExpansion:
    """Expand query with synonyms and related terms."""
    # Generate variations and related terms
    ...

@function_tool
def classify_intent(wrapper: RunContextWrapper[DiscoveryContext], query: str) -> QueryIntent:
    """Classify the intent behind the query."""
    # Return: topic_search, community_search, validation, etc.
    ...
```

**Output Type**:
```python
class AnalyzedQuery(BaseModel):
    original: str
    keywords: List[str]
    expanded_terms: List[str]
    intent: QueryIntent
    suggested_strategy: str
    confidence: float
```

**Model**: `gpt-4o` (complex language understanding)

**Handoffs**: Search Executor (with enhanced query)
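
For concreteness, here is a minimal sketch of how this agent might be wired together with the SDK. The instruction text is illustrative, the tools are the stubs defined above, and the `search_executor` import assumes the module layout from the directory structure:

```python
from agents import Agent

from src.agents.search_executor import search_executor
from src.models.discovery_context import DiscoveryContext
from src.models.discovery_models import AnalyzedQuery  # assumed home of the output model

query_analyzer = Agent[DiscoveryContext](
    name="Query Analyzer",
    instructions=(
        "Analyze the incoming discovery query: extract keywords, expand the "
        "query with related terms, and classify its intent. Then hand off to "
        "the Search Executor with the enhanced query."
    ),
    model="gpt-4o",
    tools=[extract_keywords, expand_query, classify_intent],  # defined above
    output_type=AnalyzedQuery,
    handoffs=[search_executor],
)
```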

### 3. Subreddit Scorer Agent
**File**: `agents/subreddit_scorer.py`

**Purpose**: Scores and ranks subreddit relevance with detailed confidence metrics.

**Why This Agent?**
Reddit's search API returns results in arbitrary order with many false positives. The Subreddit Scorer applies sophisticated ranking algorithms to surface the most relevant communities. This agent provides:
- **Multi-Factor Scoring**: Combines name match, description relevance, and activity levels
- **False Positive Detection**: Identifies and penalizes misleading matches
- **Confidence Metrics**: Provides transparency about why results are ranked as they are
- **Activity Weighting**: Prioritizes active communities over dead ones

**Architectural Role**:
- **Quality Filter**: Ensures only relevant results reach the user
- **Ranking Engine**: Orders results by true relevance, not API defaults
- **Confidence Calculator**: Provides scoring transparency
- **Post-Processor**: Refines raw search results into useful recommendations

**Problem Solved**:
In the monolithic version, scoring logic is embedded throughout, making it hard to tune or test, and false positives (like "pythonball" for "python") pollute results. This agent centralizes and refines the scoring logic.

**Key Interactions**:
- **Receives From**: Search Executor (raw search results)
- **Processes**: Unranked subreddits into a scored, ranked list
- **Sends To**: Result Synthesizer (for final formatting)
- **Collaborates With**: Batch Manager (for scoring multiple search results)

**Key Responsibilities**:
- Calculate name match scores
- Evaluate description relevance
- Assess community activity
- Apply penalties for false positives
- Generate confidence scores

**Tools**:
```python
@function_tool
def calculate_name_match(wrapper: RunContextWrapper[DiscoveryContext],
                         subreddit_name: str, query: str) -> float:
    """Calculate how well subreddit name matches query."""
    # Implementation from current discover.py
    ...

@function_tool
def calculate_description_score(wrapper: RunContextWrapper[DiscoveryContext],
                                description: str, query: str) -> float:
    """Score based on query presence in description."""
    # Implementation from current discover.py
    ...

@function_tool
def calculate_activity_score(wrapper: RunContextWrapper[DiscoveryContext],
                             subscribers: int) -> float:
    """Score based on community size and activity."""
    # Implementation from current discover.py
    ...

@function_tool
def calculate_penalties(wrapper: RunContextWrapper[DiscoveryContext],
                        subreddit_name: str, query: str) -> float:
    """Apply penalties for likely false positives."""
    # Implementation from current discover.py
    ...
```

**Output Type**:
```python
class ScoredSubreddit(BaseModel):
    name: str
    confidence: float
    match_type: str
    score_breakdown: Dict[str, float]
    ranking: int
```

**Model**: `gpt-4o-mini` (mathematical calculations)

**Tool Use Behavior**: `stop_on_first_tool` (direct scoring results)
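
The spec lists the individual factor scores but not how they combine. One plausible blend, a sketch with illustrative weights that would need tuning against real relevance data:

```python
def combine_scores(name_match: float, description: float,
                   activity: float, penalties: float) -> float:
    """Blend per-factor scores into a single confidence value in [0, 1].

    The weights are placeholders for illustration, not tuned values.
    """
    raw = 0.45 * name_match + 0.30 * description + 0.25 * activity - penalties
    return max(0.0, min(1.0, raw))

# e.g. an exact name match with a moderately relevant, active community:
# combine_scores(1.0, 0.6, 0.7, 0.0) -> 0.805
```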

### 4. Search Executor Agent
**File**: `agents/search_executor.py`

**Purpose**: Executes Reddit API searches efficiently with error handling.

**Why This Agent?**
Direct API interaction requires careful error handling, rate-limit management, and caching. The Search Executor isolates all Reddit API complexity from the other agents. This agent provides:
- **API Abstraction**: Other agents don't need to know Reddit API details
- **Error Resilience**: Handles rate limits, timeouts, and API failures gracefully
- **Caching Layer**: Prevents redundant API calls for identical queries
- **Result Validation**: Ensures data integrity before passing results downstream

**Architectural Role**:
- **API Gateway**: Single point of contact with the Reddit API
- **Error Handler**: Manages all API-related failures and retries
- **Cache Manager**: Stores and retrieves recent search results
- **Data Validator**: Ensures results are complete and valid

**Problem Solved**:
The monolithic approach mixes API calls with business logic, making it hard to handle errors consistently or implement caching. This agent centralizes all API interaction concerns.

**Key Interactions**:
- **Receives From**: Query Analyzer (enhanced queries) or Orchestrator (simple queries)
- **Interacts With**: Reddit API via PRAW client
- **Sends To**: Subreddit Scorer (for ranking)
- **Caches**: Results in context for reuse by other agents

**Key Responsibilities**:
- Execute Reddit API search calls
- Handle API errors and rate limits
- Validate returned results
- Cache results for efficiency (see the cache-aware sketch below)

**Tools**:
```python
@function_tool
async def search_reddit(wrapper: RunContextWrapper[DiscoveryContext],
                        query: str, limit: int = 250) -> List[RawSubreddit]:
    """Execute Reddit search API call."""
    reddit = wrapper.context.reddit_client
    results = []
    for subreddit in reddit.subreddits.search(query, limit=limit):
        results.append(RawSubreddit.from_praw(subreddit))
    return results

@function_tool
def handle_api_error(wrapper: RunContextWrapper[DiscoveryContext],
                     error: Exception) -> ErrorStrategy:
    """Determine how to handle API errors."""
    # Retry logic, fallback strategies, etc.
    ...
```

**Output Type**:
```python
class SearchResults(BaseModel):
    query: str
    results: List[RawSubreddit]
    total_found: int
    api_calls: int
    cached: bool
    errors: List[str]
```

**Model**: `gpt-4o-mini` (simple execution)

**Handoffs**: Subreddit Scorer (for ranking results)
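
The caching responsibility described above isn't visible in the stub. A sketch of a cache-aware variant of `search_reddit`, assuming the simple keyed-dict cache and `cache_ttl` from the shared context defined later in this document:

```python
import time

@function_tool
async def search_reddit(wrapper: RunContextWrapper[DiscoveryContext],
                        query: str, limit: int = 250) -> List[RawSubreddit]:
    """Execute a Reddit search, reusing cached results while they are fresh."""
    ctx = wrapper.context
    key = f"search:{query}:{limit}"  # cache key scheme is an assumption
    hit = ctx.cache.get(key)
    if hit and time.time() - hit["at"] < ctx.discovery_config.cache_ttl:
        return hit["results"]  # served from cache, no API call

    results = [RawSubreddit.from_praw(sr)
               for sr in ctx.reddit_client.subreddits.search(query, limit=limit)]
    ctx.api_call_counter += 1
    ctx.cache[key] = {"at": time.time(), "results": results}
    return results
```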

### 5. Batch Discovery Manager Agent
**File**: `agents/batch_manager.py`

**Purpose**: Manages batch discovery operations for multiple queries.

**Why This Agent?**
Users often need to discover communities across multiple related topics. The Batch Manager orchestrates parallel searches efficiently. This agent provides:
- **Parallel Execution**: Runs multiple searches concurrently for speed
- **Deduplication**: Removes duplicate subreddits across different searches
- **API Optimization**: Minimizes total API calls through smart batching
- **Result Aggregation**: Combines multiple search results intelligently

**Architectural Role**:
- **Parallel Coordinator**: Manages multiple Search Executor instances
- **Resource Manager**: Optimizes API usage across batch operations
- **Result Aggregator**: Merges and deduplicates results from multiple searches
- **Performance Optimizer**: Ensures batch operations complete quickly

**Problem Solved**:
The monolithic approach processes batch queries sequentially, leading to slow performance. It also lacks sophisticated deduplication and aggregation logic for multiple searches.

**Key Interactions**:
- **Receives From**: Discovery Orchestrator (batch requests)
- **Spawns**: Multiple Search Executor agents in parallel
- **Coordinates**: Parallel execution and result collection
- **Sends To**: Result Synthesizer (aggregated results)

**Key Responsibilities**:
- Coordinate multiple search operations
- Optimize API calls through batching
- Aggregate results from multiple searches
- Manage parallel execution

**Tools**:
```python
@function_tool
async def coordinate_batch(wrapper: RunContextWrapper[DiscoveryContext],
                           queries: List[str]) -> BatchPlan:
    """Plan optimal batch execution strategy."""
    # Determine parallelization, caching opportunities
    ...

@function_tool
def merge_batch_results(wrapper: RunContextWrapper[DiscoveryContext],
                        results: List[SearchResults]) -> BatchResults:
    """Merge results from multiple searches."""
    # Deduplicate, aggregate, summarize
    ...
```

**Model**: `gpt-4o` (complex coordination)

**Handoffs**: Multiple Search Executor agents (in parallel)

**Implementation Note**: Uses dynamic handoff creation for parallel execution (one possible realization is sketched below)
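
The SDK does not fan out handoffs in parallel by itself, so one way to realize the coordination described above is a plain `asyncio` fan-out over Search Executor runs. A sketch, with names assumed from this spec:

```python
import asyncio
from typing import List

from agents import Runner

from src.agents.search_executor import search_executor
from src.models.discovery_context import DiscoveryContext
from src.models.discovery_models import SearchResults  # assumed home of this model

async def run_batch(queries: List[str], context: DiscoveryContext) -> List[SearchResults]:
    """Fan out one Search Executor run per query and gather the results."""
    runs = [
        Runner.run(
            starting_agent=search_executor,
            input=f"Search subreddits for: {q}",
            context=context,
        )
        for q in queries
    ]
    return [r.final_output for r in await asyncio.gather(*runs)]
```

Because PRAW itself is synchronous, true concurrency would also require pushing the underlying API calls onto worker threads (for example with `asyncio.to_thread`) or switching to Async PRAW.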

### 6. Subreddit Validator Agent
**File**: `agents/validator.py`

**Purpose**: Validates subreddit existence and accessibility.

**Why This Agent?**
Users often have specific subreddit names that need verification. The Validator provides quick, focused validation without the overhead of a full search. This agent provides:
- **Direct Validation**: Checks specific subreddit names efficiently
- **Access Verification**: Confirms subreddits are public and accessible
- **Alternative Suggestions**: Recommends similar communities if validation fails
- **Metadata Retrieval**: Gets detailed info about valid subreddits

**Architectural Role**:
- **Verification Specialist**: Focused solely on validation tasks
- **Fast Path**: Provides quick responses for known subreddit names
- **Fallback Provider**: Suggests alternatives when validation fails
- **Metadata Fetcher**: Retrieves comprehensive subreddit information

**Problem Solved**:
The monolithic approach treats validation as a special case of search, which is inefficient. Users waiting to verify "r/python" shouldn't trigger a full search pipeline.

**Key Interactions**:
- **Receives From**: Discovery Orchestrator (direct validation requests)
- **Validates**: Specific subreddit names via Reddit API
- **Returns**: Validation status with metadata or alternatives
- **May Trigger**: Search Executor (to find alternatives if validation fails)

**Key Responsibilities**:
- Check if subreddit exists
- Verify accessibility (not private/banned)
- Get detailed subreddit information
- Suggest alternatives if invalid

**Tools**:
```python
@function_tool
def validate_subreddit(wrapper: RunContextWrapper[DiscoveryContext],
                       subreddit_name: str) -> ValidationResult:
    """Validate if subreddit exists and is accessible."""
    # Implementation from current discover.py
    ...

@function_tool
def get_subreddit_info(wrapper: RunContextWrapper[DiscoveryContext],
                       subreddit_name: str) -> SubredditInfo:
    """Get detailed information about a subreddit."""
    # Fetch all metadata
    ...
```

**Output Type**:
```python
class ValidationResult(BaseModel):
    valid: bool
    name: str
    reason: Optional[str]
    info: Optional[SubredditInfo]
    suggestions: List[str]
```

**Model**: `gpt-4o-mini` (simple validation)
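
The existence check itself is small. A sketch of what `validate_subreddit` might do with PRAW, assuming the usual `prawcore` exception semantics (redirect for missing subreddits, 403 for private or quarantined ones):

```python
import prawcore

@function_tool
def validate_subreddit(wrapper: RunContextWrapper[DiscoveryContext],
                       subreddit_name: str) -> ValidationResult:
    """Check that a subreddit exists and is publicly accessible."""
    name = subreddit_name.removeprefix("r/")
    try:
        sub = wrapper.context.reddit_client.subreddit(name)
        sub.id  # PRAW objects are lazy; attribute access forces the fetch
        return ValidationResult(valid=True, name=sub.display_name,
                                reason=None, info=None, suggestions=[])
    except prawcore.exceptions.Redirect:
        return ValidationResult(valid=False, name=name,
                                reason="not_found", info=None, suggestions=[])
    except prawcore.exceptions.Forbidden:
        return ValidationResult(valid=False, name=name,
                                reason="private_or_banned", info=None, suggestions=[])
```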

### 7. Result Synthesizer Agent
**File**: `agents/synthesizer.py`

**Purpose**: Synthesizes and formats final discovery results.

**Why This Agent?**
Raw scored results need intelligent synthesis to be truly useful. The Result Synthesizer transforms data into actionable insights. This agent provides:
- **Intelligent Summarization**: Creates meaningful summaries from result patterns
- **Actionable Recommendations**: Suggests next steps based on results
- **Flexible Formatting**: Adapts output format to the use case
- **Insight Generation**: Identifies patterns and relationships in results

**Architectural Role**:
- **Final Processor**: Last agent before results return to the user
- **Insight Generator**: Transforms data into understanding
- **Format Adapter**: Ensures results match the expected output format
- **Recommendation Engine**: Provides actionable next steps

**Problem Solved**:
The monolithic approach mixes result formatting throughout the code, making it hard to maintain consistent output or add new insights. This agent centralizes all presentation logic.

**Key Interactions**:
- **Receives From**: Subreddit Scorer or Batch Manager (scored/aggregated results)
- **Synthesizes**: Raw data into formatted, insightful output
- **Generates**: Summaries, recommendations, and metadata
- **Returns**: Final formatted results to the orchestrator

**Key Responsibilities**:
- Format results for presentation
- Generate summaries and insights
- Create recommendations
- Add metadata and next actions

**Tools**:
```python
@function_tool
def format_results(wrapper: RunContextWrapper[DiscoveryContext],
                   results: List[ScoredSubreddit]) -> FormattedResults:
    """Format results for final output."""
    # Structure for easy consumption
    ...

@function_tool
def generate_recommendations(wrapper: RunContextWrapper[DiscoveryContext],
                             results: FormattedResults) -> List[str]:
    """Generate actionable recommendations."""
    # Next steps, additional searches, etc.
    ...
```

**Output Type**:
```python
class DiscoveryOutput(BaseModel):
    results: List[FormattedSubreddit]
    summary: DiscoverySummary
    recommendations: List[str]
    metadata: DiscoveryMetadata
```

**Model**: `gpt-4o` (synthesis and insights)

## Agent Collaboration Workflow

### Example: Complex Query Discovery

When a user searches for "machine learning communities for beginners":

1. **Discovery Orchestrator** receives the request, identifies its complexity, and routes to the Query Analyzer
2. **Query Analyzer** extracts keywords ["machine learning", "beginners", "ML", "learn"], expands the query, and identifies the intent as "topic_search"
3. **Search Executor** runs enhanced searches for each term variation
4. **Subreddit Scorer** ranks results, penalizing advanced communities and boosting beginner-friendly ones
5. **Result Synthesizer** formats the top results with recommendations for getting started

### Example: Batch Validation

When validating multiple subreddit names ["r/python", "r/datascience", "r/doesnotexist"]:

1. **Discovery Orchestrator** identifies a validation request and routes to the Batch Manager
2. **Batch Manager** spawns three parallel Validator agents
3. **Validators** check each subreddit simultaneously
4. **Result Synthesizer** aggregates the validation results and suggests alternatives for invalid entries

## Shared Models and Context

### Discovery Context
**File**: `models/discovery_context.py`

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

import praw

# Defined before DiscoveryContext so its default_factory can resolve
@dataclass
class QueryMetadata:
    original_query: str
    intent: str
    timestamp: float
    user_preferences: Dict[str, Any]

@dataclass
class DiscoveryConfig:
    include_nsfw: bool = False
    max_api_calls: int = 10
    cache_ttl: int = 300
    default_limit: int = 10

@dataclass
class DiscoveryContext:
    reddit_client: praw.Reddit
    query_metadata: Optional[QueryMetadata] = None
    discovery_config: DiscoveryConfig = field(default_factory=DiscoveryConfig)
    api_call_counter: int = 0
    cache: Dict[str, Any] = field(default_factory=dict)
```

### Discovery Models
**File**: `models/discovery_models.py`

```python
from pydantic import BaseModel
from typing import List, Dict, Optional, Literal

class QueryIntent(BaseModel):
    type: Literal["topic_search", "community_search", "validation", "batch"]
    confidence: float

class RawSubreddit(BaseModel):
    name: str
    title: str
    description: str
    subscribers: int
    over_18: bool
    created_utc: float
    url: str

    @classmethod
    def from_praw(cls, subreddit):
        """Create from PRAW subreddit object."""
        return cls(
            name=subreddit.display_name,
            title=subreddit.title,
            description=subreddit.public_description[:100],  # truncated to keep payloads small
            subscribers=subreddit.subscribers,
            over_18=subreddit.over18,
            created_utc=subreddit.created_utc,
            url=f"https://reddit.com/r/{subreddit.display_name}"
        )

class ConfidenceScore(BaseModel):
    overall: float
    name_match: float
    description_match: float
    activity_score: float
    penalties: float

class DiscoverySummary(BaseModel):
    total_found: int
    returned: int
    coverage: Literal["comprehensive", "good", "partial", "limited"]
    top_by_confidence: List[str]
    confidence_distribution: Dict[str, int]
```
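
A small usage sketch of how the shared context ties the agents together (credentials elided; any authenticated PRAW client works):

```python
import praw

from src.models.discovery_context import DiscoveryContext, DiscoveryConfig

reddit = praw.Reddit(client_id="...", client_secret="...",
                     user_agent="reddit-research-mcp")

context = DiscoveryContext(
    reddit_client=reddit,
    discovery_config=DiscoveryConfig(include_nsfw=False, default_limit=10),
)

# Every tool receives this same object via RunContextWrapper, so the cache
# and the API-call counter are shared across all agents in one discovery run.
assert context.api_call_counter == 0 and context.cache == {}
```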

## Main Entry Point

### Discover Agent Tool
**File**: `tools/discover_agent.py`

```python
from typing import List, Optional

import praw

from agents import Runner, RunConfig

from src.models.discovery_context import DiscoveryContext, DiscoveryConfig
from src.models.discovery_models import DiscoveryOutput  # defined with the synthesizer's models
from src.agents import discovery_orchestrator

async def discover_subreddits_agent(
    query: Optional[str] = None,
    queries: Optional[List[str]] = None,
    reddit: praw.Reddit = None,
    limit: int = 10,
    include_nsfw: bool = False
) -> DiscoveryOutput:
    """
    Agentic version of discover_subreddits using the OpenAI Agents SDK.

    Maintains backward compatibility with the existing interface.
    """
    # Initialize the shared context for the whole agent run
    context = DiscoveryContext(
        reddit_client=reddit,
        discovery_config=DiscoveryConfig(
            include_nsfw=include_nsfw,
            default_limit=limit
        )
    )

    # Prepare input
    if queries:
        input_text = f"Batch discovery for queries: {queries}"
    else:
        input_text = f"Discover subreddits for: {query}"

    # Run discovery through the orchestrator
    result = await Runner.run(
        starting_agent=discovery_orchestrator,
        input=input_text,
        context=context,
        max_turns=20,  # max_turns is an argument to Runner.run, not RunConfig
        run_config=RunConfig(
            workflow_name="Reddit Discovery",
            trace_metadata={"query": query or queries}
        )
    )

    return result.final_output
```

## Implementation Strategy

### Phase 1: Foundation (Week 1)
1. Set up project structure and dependencies
2. Create base models and context objects
3. Implement Search Executor and Validator agents
4. Basic integration tests

### Phase 2: Core Agents (Week 2)
1. Implement Query Analyzer with NLP tools
2. Create Subreddit Scorer with confidence metrics
3. Build Result Synthesizer
4. Add comprehensive testing

### Phase 3: Orchestration (Week 3)
1. Implement Discovery Orchestrator with routing logic
2. Create Batch Manager for parallel execution
3. Add handoff patterns and error handling
4. Integrate with the existing MCP server

### Phase 4: Optimization (Week 4)
1. Add caching layer
2. Optimize model selection per agent
3. Implement tracing and monitoring
4. Performance testing and tuning

## Benefits Over Current Implementation

1. **Modularity**: Each agent is independent and focused
2. **Scalability**: Easy to add new discovery strategies
3. **Observability**: Full tracing of the decision process
4. **Testability**: Each agent can be unit tested
5. **Flexibility**: Agents can be reused in different workflows
6. **Performance**: Parallel execution and smart caching
7. **Maintainability**: Clear separation of concerns

## Migration Path

1. **Parallel Development**: Build the new system alongside the existing one
2. **Feature Flag**: Toggle between the old and new implementations
3. **Gradual Rollout**: Test with a subset of queries first
4. **Backward Compatible**: Same interface as the current discover.py
5. **Monitoring**: Compare results between old and new

## Testing Strategy

### Unit Tests
- Each agent tested independently
- Mock Reddit client and context (see the sketch below)
- Test all tools and handoffs

### Integration Tests
- End-to-end discovery workflows
- Multiple query types
- Error scenarios

### Performance Tests
- API call optimization
- Caching effectiveness
- Parallel execution benefits
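
As a shape for the unit tier, a hedged pytest sketch that exercises the scorer without touching the network. It assumes the scorer keeps its math in plain helpers (the name `name_match_score` is illustrative) that the `@function_tool` wrappers delegate to:

```python
from unittest.mock import MagicMock

from src.models.discovery_context import DiscoveryContext
# Hypothetical plain helper behind the calculate_name_match tool
from src.agents.subreddit_scorer import name_match_score

def make_context() -> DiscoveryContext:
    # A stubbed PRAW client keeps the unit tier fully offline.
    return DiscoveryContext(reddit_client=MagicMock())

def test_exact_match_outranks_false_positive():
    assert name_match_score("python", "python") > name_match_score("pythonball", "python")

def test_context_starts_clean():
    ctx = make_context()
    assert ctx.cache == {} and ctx.api_call_counter == 0
```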

## Monitoring and Observability

1. **Tracing**: Full agent decision tree
2. **Metrics**: API calls, latency, cache hits
3. **Logging**: Structured logs per agent
4. **Debugging**: Replay agent conversations

## Future Enhancements

1. **Learning**: Agents improve from feedback
2. **Personalization**: User-specific discovery preferences
3. **Advanced NLP**: Better query understanding
4. **Community Graph**: Relationship mapping between subreddits
5. **Trend Detection**: Identify emerging communities

## Conclusion

This agentic architecture transforms the monolithic discover.py into a flexible, scalable system of specialized agents. Each agent excels at its specific task while the orchestrator ensures optimal routing and efficiency. The result is a more maintainable, testable, and powerful discovery system that can evolve with changing requirements.
```