This is page 1 of 2. Use http://codebase.md/ghuntley/how-to-ralph-wiggum?lines=true&page={x} to view the full context.
# Directory Structure
```
├── .gitignore
├── .vscode
│ └── settings.json
├── files
│ ├── AGENTS.md
│ ├── IMPLEMENTATION_PLAN.md
│ ├── loop.sh
│ ├── PROMPT_build.md
│ └── PROMPT_plan.md
├── index.html
├── README.md
└── references
├── nah.png
├── ralph-diagram.png
└── sandbox-environments.md
```
# Files
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
```
1 | /.DS_Store
```
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
```markdown
1 | # The Ralph Playbook
2 |
3 | December 2025 boiled [Ralph's](https://ghuntley.com/ralph/) powerful yet dumb little face to the top of most AI-related timelines.
4 |
5 | I try to pay attention to the crazy-smart insights [@GeoffreyHuntley](https://x.com/GeoffreyHuntley) shares, but I can't say Ralph really clicked for me this summer. Now, all of the recent hubbub has made it hard to ignore.
6 |
7 | [@mattpocockuk](https://x.com/mattpocockuk/status/2008200878633931247) and [@ryancarson](https://x.com/ryancarson/status/2008548371712135632)'s overviews helped a lot - right until Geoff came in and [said 'nah'](https://x.com/GeoffreyHuntley/status/2008731415312236984).
8 |
9 | <img src="references/nah.png" alt="nah" width="500" />
10 |
11 | ## So what is the optimal way to Ralph?
12 |
13 | Many folks seem to be getting good results with various shapes - but I wanted to read the tea leaves as closely as possible from the person who not only captured this approach but also has had the most ass-time in the seat putting it through its paces.
14 |
15 | So I dug in to really _RTFM_ on [recent videos](https://www.youtube.com/watch?v=O2bBWDoxO4s) and Geoff's [original post](https://ghuntley.com/ralph/) to try and untangle for myself what works best.
16 |
17 | Below is the result - a (likely OCD-fueled) Ralph Playbook that organizes the miscellaneous details for putting this all into practice, hopefully w/o neutering it in the process.
18 |
19 | > Digging into all of this has also brought to mind some possibly valuable [additional enhancements](#enhancements) to the core approach that aim to stay aligned with the guidelines that make Ralph work so well.
20 |
21 | > [!TIP] [📖 View as Formatted Guide →](https://ClaytonFarr.github.io/ralph-playbook/)
22 |
23 | ---
24 |
25 | ## Table of Contents
26 |
27 | - [Workflow](#workflow)
28 | - [Key Principles](#key-principles)
29 | - [Loop Mechanics](#loop-mechanics)
30 | - [Files](#files)
31 | - [Enhancements?](#enhancements)
32 |
33 | ---
34 |
35 | ## Workflow
36 |
37 | A picture is worth a thousand tweets and an hour-long video. Geoff's [overview here](https://ghuntley.com/ralph/) (sign up to his newsletter to see full article) really helped clarify the workflow details for moving from 1) idea → 2) individual JTBD-aligned specs → 3) comprehensive implementation plan → 4) Ralph work loops.
38 |
39 | 
40 |
41 | ### 🗘 Three Phases, Two Prompts, One Loop
42 |
43 | This diagram clarified for me that Ralph isn't just "a loop that codes." It's a funnel with 3 Phases, 2 Prompts, and 1 Loop.
44 |
45 | #### Phase 1. Define Requirements (LLM conversation)
46 |
47 | - Discuss project ideas → identify Jobs to Be Done (JTBD)
48 | - Break individual JTBD into topic(s) of concern
49 | - Use subagents to load info from URLs into context
50 | - LLM understands JTBD topic of concern: subagent writes `specs/FILENAME.md` for each topic
51 |
52 | #### Phase 2 / 3. Run Ralph Loop (two modes, swap `PROMPT.md` as needed)
53 |
54 | Same loop mechanism, different prompts for different objectives:
55 |
56 | | Mode | When to use | Prompt focus |
57 | | ---------- | -------------------------------------- | ------------------------------------------------------- |
58 | | _PLANNING_ | No plan exists, or plan is stale/wrong | Generate/update `IMPLEMENTATION_PLAN.md` only |
59 | | _BUILDING_ | Plan exists | Implement from plan, commit, update plan as side effect |
60 |
61 | _Prompt differences per mode:_
62 |
63 | - 'PLANNING' prompt does gap analysis (specs vs code) and outputs a prioritized TODO list—no implementation, no commits.
64 | - 'BUILDING' prompt assumes plan exists, picks tasks from it, implements, runs tests (backpressure), commits.
65 |
66 | _Why use the loop for both modes?_
67 |
68 | - BUILDING requires it: inherently iterative (many tasks × fresh context = isolation)
69 | - PLANNING uses it for consistency: same execution model, though often completes in 1-2 iterations
70 | - Flexibility: if plan needs refinement, loop allows multiple passes reading its own output
71 | - Simplicity: one mechanism for everything; clean file I/O; easy stop/restart
72 |
73 | _Context loaded each iteration:_ `PROMPT.md` + `AGENTS.md`
74 |
75 | _PLANNING mode loop lifecycle:_
76 |
77 | 1. Subagents study `specs/*` and existing `/src`
78 | 2. Compare specs against code (gap analysis)
79 | 3. Create/update `IMPLEMENTATION_PLAN.md` with prioritized tasks
80 | 4. No implementation
81 |
82 | _BUILDING mode loop lifecycle:_
83 |
84 | 1. _Orient_ – subagents study `specs/*` (requirements)
85 | 2. _Read plan_ – study `IMPLEMENTATION_PLAN.md`
86 | 3. _Select_ – pick the most important task
87 | 4. _Investigate_ – subagents study relevant `/src` ("don't assume not implemented")
88 | 5. _Implement_ – N subagents for file operations
89 | 6. _Validate_ – 1 subagent for build/tests (backpressure)
90 | 7. _Update `IMPLEMENTATION_PLAN.md`_ – mark task done, note discoveries/bugs
91 | 8. _Update `AGENTS.md`_ – if operational learnings
92 | 9. _Commit_
93 | 10. _Loop ends_ → context cleared → next iteration starts fresh
94 |
95 | #### Concepts
96 |
97 | | Term | Definition |
98 | | ----------------------- | --------------------------------------------------------------- |
99 | | _Job to be Done (JTBD)_ | High-level user need or outcome |
100 | | _Topic of Concern_ | A distinct aspect/component within a JTBD |
101 | | _Spec_ | Requirements doc for one topic of concern (`specs/FILENAME.md`) |
102 | | _Task_ | Unit of work derived from comparing specs to code |
103 |
104 | _Relationships:_
105 |
106 | - 1 JTBD → multiple topics of concern
107 | - 1 topic of concern → 1 spec
108 | - 1 spec → multiple tasks (specs are larger than tasks)
109 |
110 | _Example:_
111 |
112 | - JTBD: "Help designers create mood boards"
113 | - Topics: image collection, color extraction, layout, sharing
114 | - Each topic → one spec file
115 | - Each spec → many tasks in implementation plan
116 |
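A minimal sketch of how that example might land on disk (filenames hypothetical; one spec per topic of concern):

```
specs/
├── image-collection.md   # topic: collect images for a board
├── color-extraction.md   # topic: extract dominant colors
├── layout.md             # topic: arrange board layouts
└── sharing.md            # topic: share/export boards
```
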
117 | _Topic Scope Test: "One Sentence Without 'And'"_
118 |
119 | - Can you describe the topic of concern in one sentence without conjoining unrelated capabilities?
120 | - ✓ "The color extraction system analyzes images to identify dominant colors"
121 | - ✗ "The user system handles authentication, profiles, and billing" → 3 topics
122 | - If you need "and" to describe what it does, it's probably multiple topics
123 |
124 | ---
125 |
126 | ## Key Principles
127 |
128 | ### ⏳ Context Is _Everything_
129 |
130 | - Of an advertised 200K+ token window, only ~176K is truly usable
131 | - And staying at 40-60% context utilization keeps the model in its "smart zone"
132 | - Tight tasks + 1 task per loop = _100% smart zone context utilization_
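
Concretely, with the numbers above (rough arithmetic, not hard limits):

```
200K advertised    → ~176K truly usable
40-60% of 176K     → ~70K-106K "smart zone" budget per iteration
PROMPT.md + AGENTS.md + one task's reads/writes should fit in that budget
```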
133 |
134 | This informs and drives everything else:
135 |
136 | - _Use the main agent/context as a scheduler_
137 | - Don't allocate expensive work to main context; spawn subagents whenever possible instead
138 | - _Use subagents as memory extension_
139 | - Each subagent gets its own ~156K context that's garbage collected when it finishes
140 | - Fan out to avoid polluting main context
141 | - _Simplicity and brevity win_
142 | - Applies to number of parts in system, loop config, and content
143 | - Verbose inputs degrade determinism
144 | - _Prefer Markdown over JSON_
145 | - To define and track work, for better token efficiency
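
For example, the same task costs noticeably fewer tokens as a markdown bullet than as structured JSON (token counts illustrative):

```
- [ ] Extract 5-10 dominant colors from uploads          ← ~12 tokens

{"task":{"id":17,"status":"pending","description":
 "Extract 5-10 dominant colors from uploads"}}           ← ~30 tokens
```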
146 |
147 | ### 🧭 Steering Ralph: Patterns + Backpressure
148 |
149 | Creating the right signals & gates to steer Ralph's successful output is **critical**. You can steer from two directions:
150 |
151 | - _Steer upstream_
152 | - Ensure deterministic setup:
153 | - Allocate first ~5,000 tokens for specs
154 | - Every loop's context is allocated with the same files so the model starts from a known state (`PROMPT.md` + `AGENTS.md`)
155 | - Your existing code shapes what gets used and generated
156 | - If Ralph is generating wrong patterns, add/update utilities and existing code patterns to steer it toward correct ones
157 | - _Steer downstream_
158 | - Create backpressure via tests, typechecks, lints, builds, etc. that will reject invalid/unacceptable work
159 | - Prompt says "run tests" generically; `AGENTS.md` specifies the actual commands that make backpressure project-specific
160 | - Backpressure can extend beyond code validation: some acceptance criteria resist programmatic checks - creative quality, aesthetics, UX feel. LLM-as-judge tests can provide backpressure for subjective criteria with binary pass/fail. ([More detailed thoughts below](#non-deterministic-backpressure) on how to approach this with Ralph.)
161 | - _Remind Ralph to create/use backpressure_
162 | - e.g. the build prompt's guardrail: "Important: When authoring documentation, capture the why — tests and implementation importance."
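
As a sketch, backpressure in a JS/TS project might be nothing more than chained gates that must pass before a commit happens (script names are assumptions; `AGENTS.md` supplies the real ones):

```bash
# Hypothetical validation gate: the commit only happens if every check passes
npm run typecheck && npm run lint && npm test && \
  git add -A && git commit -m "feat: extract dominant colors"
```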
163 |
164 | ### 🙏 Let Ralph Ralph
165 |
166 | Ralph's effectiveness comes from how much you trust it to do the right thing (eventually) and how well you enable it to do so.
167 |
168 | - _Let Ralph Ralph_
169 | - Lean into LLM's ability to self-identify, self-correct and self-improve
170 | - Applies to implementation plan, task definition and prioritization
171 | - Eventual consistency achieved through iteration
172 | - _Use protection_
173 | - To operate autonomously, Ralph requires `--dangerously-skip-permissions` - asking for approval on every tool call would break the loop. This bypasses Claude's permission system entirely - so a sandbox becomes your only security boundary.
174 | - Philosophy: "It's not if it gets popped, it's when. And what is the blast radius?"
175 | - Running without a sandbox exposes credentials, browser cookies, SSH keys, and access tokens on your machine
176 | - Run in isolated environments with minimum viable access:
177 | - Only the API keys and deploy keys needed for the task
178 | - No access to private data beyond requirements
179 | - Restrict network connectivity where possible
180 | - Options: Docker sandboxes (local; sketched below), Fly Sprites/E2B/etc. (remote/production) - [additional notes](references/sandbox-environments.md)
181 | - Additional escape hatches: Ctrl+C stops the loop; `git reset --hard` reverts uncommitted changes; regenerate plan if trajectory goes wrong
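
A minimal local Docker sketch, assuming a Node project and a task-scoped key (image, package, and env names are assumptions; adjust for your stack):

```bash
# Throwaway container that sees only the repo and one scoped API key.
# Tighten network access further where your setup allows.
docker run --rm -it \
  -v "$PWD":/workspace -w /workspace \
  -e ANTHROPIC_API_KEY="$RALPH_SCOPED_KEY" \
  node:22 \
  bash -c "npm install -g @anthropic-ai/claude-code && ./loop.sh 20"
```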
182 |
183 | ### 🚦 Move Outside the Loop
184 |
185 | To get the most out of Ralph, you need to get out of his way. Ralph should be doing _all_ of the work, including deciding which planned work to implement next and how to implement it. Your job is now to sit on the loop, not in it - to engineer the setup and environment that will allow Ralph to succeed.
186 |
187 | _Observe and course correct_ – especially early on, sit and watch. What patterns emerge? Where does Ralph go wrong? What signs does he need? The prompts you start with won't be the prompts you end with - they evolve through observed failure patterns.
188 |
189 | _Tune it like a guitar_ – instead of prescribing everything upfront, observe and adjust reactively. When Ralph fails a specific way, add a sign to help him next time.
190 |
191 | But signs aren't just prompt text. They're _anything_ Ralph can discover:
192 |
193 | - Prompt guardrails - explicit instructions like "don't assume not implemented"
194 | - `AGENTS.md` - operational learnings about how to build/test
195 | - Utilities in your codebase - when you add a pattern, Ralph discovers it and follows it
196 | - Other discoverable, relevant inputs…
197 |
198 | And remember, _the plan is disposable:_
199 |
200 | - If it's wrong, throw it out, and start over
201 | - Regeneration cost is one Planning loop; cheap compared to Ralph going in circles
202 | - Regenerate when:
203 | - Ralph is going off track (implementing wrong things, duplicating work)
204 | - Plan feels stale or doesn't match current state
205 | - Too much clutter from completed items
206 | - You've made significant spec changes
207 | - You're confused about what's actually done
208 |
209 | ---
210 |
211 | ## Loop Mechanics
212 |
213 | ### Outer Loop Control
214 |
215 | Geoff's initial minimal form of `loop.sh` script:
216 |
217 | ```bash
218 | while :; do cat PROMPT.md | claude ; done
219 | ```
220 |
221 | _Note:_ The same approach can be used with other CLIs; e.g. `amp`, `codex`, `opencode`, etc.
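
A slightly friendlier variant of the same idea - still just a dumb loop, with an iteration counter and a pause so Ctrl+C has a clean window between runs:

```bash
i=0
while :; do
  i=$((i + 1))
  echo "=== Ralph iteration $i ==="
  cat PROMPT.md | claude
  sleep 2 # breathing room to Ctrl+C between iterations
done
```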
222 |
223 | _What controls task continuation?_
224 |
225 | The continuation mechanism is elegantly simple:
226 |
227 | 1. _Bash loop runs_ → feeds `PROMPT.md` to claude
228 | 2. _PROMPT.md instructs_ → "Study IMPLEMENTATION_PLAN.md and choose the most important thing"
229 | 3. _Agent completes one task_ → updates IMPLEMENTATION_PLAN.md on disk, commits, exits
230 | 4. _Bash loop restarts immediately_ → fresh context window
231 | 5. _Agent reads updated plan_ → picks next most important thing
232 |
233 | _Key insight:_ The IMPLEMENTATION_PLAN.md file persists on disk between iterations and acts as shared state between otherwise isolated loop executions. Each iteration deterministically loads the same files (`PROMPT.md` + `AGENTS.md` + `specs/*`) and reads the current state from disk.
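
A hypothetical mid-run plan file (the format is Ralph's to choose; this just illustrates the state that persists across iterations):

```markdown
- [x] Scaffold color extraction module (commit a1b2c3)
- [ ] Extract 5-10 dominant colors from uploaded images ← next "most important"
- [ ] BUG: grayscale images return empty palette (found during tests)
```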
234 |
235 | _No sophisticated orchestration needed_ - just a dumb bash loop that keeps restarting the agent, and the agent figures out what to do next by reading the plan file each time.
236 |
237 | ### Inner Loop Control (Task Execution)
238 |
239 | A single task execution has no hard technical limit. Control relies on:
240 |
241 | - _Scope discipline_ - PROMPT.md instructs "one task" and "commit when tests pass"
242 | - _Backpressure_ - tests/build failures force the agent to fix issues before committing
243 | - _Natural completion_ - agent exits after successful commit
244 |
245 | _Ralph can go in circles, ignore instructions, or take wrong directions_ - this is expected and part of the tuning process. When Ralph "tests you" by failing in specific ways, you add guardrails to the prompt or adjust backpressure mechanisms. The nondeterminism is manageable through observation and iteration.
246 |
247 | ### Enhanced Loop Example
248 |
249 | Wraps core loop with mode selection (plan/build), max-iterations support, and git push after each iteration.
250 |
251 | _This enhancement uses two saved prompt files:_
252 |
253 | - `PROMPT_plan.md` - Planning mode (gap analysis, generates/updates plan)
254 | - `PROMPT_build.md` - Building mode (implements from plan)
255 |
256 | ```bash
257 | #!/bin/bash
258 | # Usage: ./loop.sh [plan] [max_iterations]
259 | # Examples:
260 | # ./loop.sh # Build mode, unlimited iterations
261 | # ./loop.sh 20 # Build mode, max 20 iterations
262 | # ./loop.sh plan # Plan mode, unlimited iterations
263 | # ./loop.sh plan 5 # Plan mode, max 5 iterations
264 |
265 | # Parse arguments
266 | if [ "$1" = "plan" ]; then
267 | # Plan mode
268 | MODE="plan"
269 | PROMPT_FILE="PROMPT_plan.md"
270 | MAX_ITERATIONS=${2:-0}
271 | elif [[ "$1" =~ ^[0-9]+$ ]]; then
272 | # Build mode with max iterations
273 | MODE="build"
274 | PROMPT_FILE="PROMPT_build.md"
275 | MAX_ITERATIONS=$1
276 | else
277 | # Build mode, unlimited (no arguments or invalid input)
278 | MODE="build"
279 | PROMPT_FILE="PROMPT_build.md"
280 | MAX_ITERATIONS=0
281 | fi
282 |
283 | ITERATION=0
284 | CURRENT_BRANCH=$(git branch --show-current)
285 |
286 | echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
287 | echo "Mode: $MODE"
288 | echo "Prompt: $PROMPT_FILE"
289 | echo "Branch: $CURRENT_BRANCH"
290 | [ $MAX_ITERATIONS -gt 0 ] && echo "Max: $MAX_ITERATIONS iterations"
291 | echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
292 |
293 | # Verify prompt file exists
294 | if [ ! -f "$PROMPT_FILE" ]; then
295 | echo "Error: $PROMPT_FILE not found"
296 | exit 1
297 | fi
298 |
299 | while true; do
300 | if [ $MAX_ITERATIONS -gt 0 ] && [ $ITERATION -ge $MAX_ITERATIONS ]; then
301 | echo "Reached max iterations: $MAX_ITERATIONS"
302 | break
303 | fi
304 |
305 | # Run Ralph iteration with selected prompt
306 | # -p: Headless mode (non-interactive, reads from stdin)
307 | # --dangerously-skip-permissions: Auto-approve all tool calls (YOLO mode)
308 | # --output-format=stream-json: Structured output for logging/monitoring
309 | # --model opus: Primary agent uses Opus for complex reasoning (task selection, prioritization)
310 | # Can use 'sonnet' in build mode for speed if plan is clear and tasks well-defined
311 | # --verbose: Detailed execution logging
312 | cat "$PROMPT_FILE" | claude -p \
313 | --dangerously-skip-permissions \
314 | --output-format=stream-json \
315 | --model opus \
316 | --verbose
317 |
318 | # Push changes after each iteration
319 | git push origin "$CURRENT_BRANCH" || {
320 | echo "Failed to push. Creating remote branch..."
321 | git push -u origin "$CURRENT_BRANCH"
322 | }
323 |
324 | ITERATION=$((ITERATION + 1))
325 | echo -e "\n\n======================== LOOP $ITERATION ========================\n"
326 | done
327 | ```
328 |
329 | _Mode selection:_
330 |
331 | - No keyword → Uses `PROMPT_build.md` for building (implementation)
332 | - `plan` keyword → Uses `PROMPT_plan.md` for planning (gap analysis, plan generation)
333 |
334 | _Max-iterations:_
335 |
336 | - Limits the _outer loop_ (number of tasks attempted; NOT tool calls within a single task)
337 | - Each iteration = one fresh context window = one task from IMPLEMENTATION_PLAN.md = one commit
338 | - `./loop.sh` runs unlimited (manual stop with Ctrl+C)
339 | - `./loop.sh 20` runs max 20 iterations then stops
340 |
341 | _Claude CLI flags:_
342 |
343 | - `-p` (headless mode): Enables non-interactive operation, reads prompt from stdin
344 | - `--dangerously-skip-permissions`: Bypasses all permission prompts for fully automated runs
345 | - `--output-format=stream-json`: Outputs structured JSON for logging/monitoring/visualization (see the logging sketch after this list)
346 | - `--model opus`: Primary agent uses Opus for task selection, prioritization, and coordination (can use `sonnet` for speed if tasks are clear)
347 | - `--verbose`: Provides detailed execution logging
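
Since `--output-format=stream-json` emits one JSON event per line, the loop can keep an audit trail of each iteration by appending to a log file (file name is an assumption):

```bash
cat "$PROMPT_FILE" | claude -p \
  --dangerously-skip-permissions \
  --output-format=stream-json \
  --model opus \
  --verbose | tee -a ralph-run.jsonl # append structured events for later review
```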
348 |
349 | ---
350 |
351 | ## Files
352 |
353 | ```
354 | project-root/
355 | ├── loop.sh # Ralph loop script
356 | ├── PROMPT_build.md # Build mode instructions
357 | ├── PROMPT_plan.md # Plan mode instructions
358 | ├── AGENTS.md # Operational guide loaded each iteration
359 | ├── IMPLEMENTATION_PLAN.md # Prioritized task list (generated/updated by Ralph)
360 | ├── specs/ # Requirement specs (one per JTBD topic)
361 | │ ├── [jtbd-topic-a].md
362 | │ └── [jtbd-topic-b].md
363 | ├── src/ # Application source code
364 | └── src/lib/ # Shared utilities & components
365 | ```
366 |
367 | ### `loop.sh`
368 |
369 | The outer loop script that orchestrates Ralph iterations.
370 |
371 | See [Loop Mechanics](#loop-mechanics) section for detailed implementation examples and configuration options.
372 |
373 | _Setup:_ Make the script executable before first use:
374 |
375 | ```bash
376 | chmod +x loop.sh
377 | ```
378 |
379 | _Core function:_ Continuously feeds prompt file to claude, manages iteration limits, and pushes changes after each task completion.
380 |
381 | ### PROMPTS
382 |
383 | The instruction set for each loop iteration. Swap between PLANNING and BUILDING versions as needed.
384 |
385 | _Prompt Structure:_
386 |
387 | | Section | Purpose |
388 | | ---------------------- | ----------------------------------------------------- |
389 | | _Phase 0_ (0a, 0b, 0c) | Orient: study specs, source location, current plan |
390 | | _Phase 1-4_ | Main instructions: task, validation, commit |
391 | | _999... numbering_ | Guardrails/invariants (higher number = more critical) |
392 |
393 | _Key Language Patterns_ (Geoff's specific phrasing):
394 |
395 | - "study" (not "read" or "look at")
396 | - "don't assume not implemented" (critical - the Achilles' heel)
397 | - "using parallel subagents" / "up to N subagents"
398 | - "only 1 subagent for build/tests" (backpressure control)
399 | - "Think extra hard" (now "Ultrathink)
400 | - "capture the why"
401 | - "keep it up to date"
402 | - "if functionality is missing then it's your job to add it"
403 | - "resolve them or document them"
404 |
405 | #### `PROMPT_plan.md` Template
406 |
407 | _Notes:_
408 |
409 | - Update the [project-specific goal] placeholder below.
410 | - Subagent names below presume you're using Claude.
411 |
412 | ```
413 | 0a. Study `specs/*` with up to 250 parallel Sonnet subagents to learn the application specifications.
414 | 0b. Study @IMPLEMENTATION_PLAN.md (if present) to understand the plan so far.
415 | 0c. Study `src/lib/*` with up to 250 parallel Sonnet subagents to understand shared utilities & components.
416 | 0d. For reference, the application source code is in `src/*`.
417 |
418 | 1. Study @IMPLEMENTATION_PLAN.md (if present; it may be incorrect) and use up to 500 Sonnet subagents to study existing source code in `src/*` and compare it against `specs/*`. Use an Opus subagent to analyze findings, prioritize tasks, and create/update @IMPLEMENTATION_PLAN.md as a bullet point list sorted in priority of items yet to be implemented. Ultrathink. Consider searching for TODO, minimal implementations, placeholders, skipped/flaky tests, and inconsistent patterns. Study @IMPLEMENTATION_PLAN.md to determine starting point for research and keep it up to date with items considered complete/incomplete using subagents.
419 |
420 | IMPORTANT: Plan only. Do NOT implement anything. Do NOT assume functionality is missing; confirm with code search first. Treat `src/lib` as the project's standard library for shared utilities and components. Prefer consolidated, idiomatic implementations there over ad-hoc copies.
421 |
422 | ULTIMATE GOAL: We want to achieve [project-specific goal]. Consider missing elements and plan accordingly. If an element is missing, search first to confirm it doesn't exist, then if needed author the specification at specs/FILENAME.md. If you create a new element then document the plan to implement it in @IMPLEMENTATION_PLAN.md using a subagent.
423 | ```
424 |
425 | #### `PROMPT_build.md` Template
426 |
427 | _Note:_ Subagent names below presume you're using Claude.
428 |
429 | ```
430 | 0a. Study `specs/*` with up to 500 parallel Sonnet subagents to learn the application specifications.
431 | 0b. Study @IMPLEMENTATION_PLAN.md.
432 | 0c. For reference, the application source code is in `src/*`.
433 |
434 | 1. Your task is to implement functionality per the specifications using parallel subagents. Follow @IMPLEMENTATION_PLAN.md and choose the most important item to address. Before making changes, search the codebase (don't assume not implemented) using Sonnet subagents. You may use up to 500 parallel Sonnet subagents for searches/reads and only 1 Sonnet subagent for build/tests. Use Opus subagents when complex reasoning is needed (debugging, architectural decisions).
435 | 2. After implementing functionality or resolving problems, run the tests for that unit of code that was improved. If functionality is missing then it's your job to add it as per the application specifications. Ultrathink.
436 | 3. When you discover issues, immediately update @IMPLEMENTATION_PLAN.md with your findings using a subagent. When resolved, update and remove the item.
437 | 4. When the tests pass, update @IMPLEMENTATION_PLAN.md, then `git add -A` then `git commit` with a message describing the changes. After the commit, `git push`.
438 |
439 | 99999. Important: When authoring documentation, capture the why — tests and implementation importance.
440 | 999999. Important: Single sources of truth, no migrations/adapters. If tests unrelated to your work fail, resolve them as part of the increment.
441 | 9999999. As soon as there are no build or test errors create a git tag. If there are no git tags start at 0.0.0 and increment patch by 1 for example 0.0.1 if 0.0.0 does not exist.
442 | 99999999. You may add extra logging if required to debug issues.
443 | 999999999. Keep @IMPLEMENTATION_PLAN.md current with learnings using a subagent — future work depends on this to avoid duplicating efforts. Update especially after finishing your turn.
444 | 9999999999. When you learn something new about how to run the application, update @AGENTS.md using a subagent but keep it brief. For example if you run commands multiple times before learning the correct command then that file should be updated.
445 | 99999999999. For any bugs you notice, resolve them or document them in @IMPLEMENTATION_PLAN.md using a subagent even if it is unrelated to the current piece of work.
446 | 999999999999. Implement functionality completely. Placeholders and stubs waste efforts and time redoing the same work.
447 | 9999999999999. When @IMPLEMENTATION_PLAN.md becomes large periodically clean out the items that are completed from the file using a subagent.
448 | 99999999999999. If you find inconsistencies in the specs/* then use an Opus 4.5 subagent with 'ultrathink' requested to update the specs.
449 | 999999999999999. IMPORTANT: Keep @AGENTS.md operational only — status updates and progress notes belong in `IMPLEMENTATION_PLAN.md`. A bloated AGENTS.md pollutes every future loop's context.
450 | ```
451 |
452 | ### `AGENTS.md`
453 |
454 | Single, canonical "heart of the loop" - a concise, operational "how to run/build" guide.
455 |
456 | - NOT a changelog or progress diary
457 | - Describes how to build/run the project
458 | - Captures operational learnings that improve the loop
459 | - Keep brief (~60 lines)
460 |
461 | Status, progress, and planning belong in `IMPLEMENTATION_PLAN.md`, not here.
462 |
463 | _Loopback / Immediate Self-Evaluation:_
464 |
465 | AGENTS.md should contain the project-specific commands that enable loopback - the ability for Ralph to immediately evaluate his work within the same loop. This includes:
466 |
467 | - Build commands
468 | - Test commands (targeted and full suite)
469 | - Typecheck/lint commands
470 | - Any other validation tools
471 |
472 | The BUILDING prompt says "run tests" generically; AGENTS.md specifies the actual commands. This is how backpressure gets wired in per-project.
473 |
474 | #### Example
475 |
476 | ```
477 | ## Build & Run
478 |
479 | Succinct rules for how to BUILD the project:
480 |
481 | ## Validation
482 |
483 | Run these after implementing to get immediate feedback:
484 |
485 | - Tests: `[test command]`
486 | - Typecheck: `[typecheck command]`
487 | - Lint: `[lint command]`
488 |
489 | ## Operational Notes
490 |
491 | Succinct learnings about how to RUN the project:
492 |
493 | ...
494 |
495 | ### Codebase Patterns
496 |
497 | ...
498 | ```
499 |
500 | ### `IMPLEMENTATION_PLAN.md`
501 |
502 | Prioritized bullet-point list of tasks derived from gap analysis (specs vs code) - generated by Ralph.
503 |
504 | - _Created_ via PLANNING mode
505 | - _Updated_ during BUILDING mode (mark complete, add discoveries, note bugs)
506 | - _Can be regenerated_ – Geoff: "I have deleted the TODO list multiple times" → switch to PLANNING mode
507 | - _Self-correcting_ – BUILDING mode can even create new specs if missing
508 |
509 | The circularity is intentional: eventual consistency through iteration.
510 |
511 | _No pre-specified template_ - let Ralph/LLM dictate and manage format that works best for it.
512 |
513 | ### `specs/*`
514 |
515 | One markdown file per topic of concern. These are the source of truth for what should be built.
516 |
517 | - Created during Requirements phase (human + LLM conversation)
518 | - Consumed by both PLANNING and BUILDING modes
519 | - Can be updated if inconsistencies discovered (rare, use subagent)
520 |
521 | _No pre-specified template_ - let Ralph/LLM dictate and manage format that works best for it.
522 |
523 | ### `src/` and `src/lib/`
524 |
525 | Application source code and shared utilities/components.
526 |
527 | Referenced in `PROMPT.md` templates for orientation steps.
528 |
529 | ---
530 |
531 | ## Enhancements?
532 |
533 | I'm (Clayton) still determining the value/viability of these possible enhancements, but the opportunities sound promising:
536 |
537 | - [Claude's AskUserQuestionTool for Planning](#use-claudes-askuserquestiontool-for-planning) - use Claude's built-in interview tool to systematically clarify JTBD, edge cases, and acceptance criteria for specs.
538 | - [Acceptance-Driven Backpressure](#acceptance-driven-backpressure) - Derive test requirements during planning from acceptance criteria. Prevents "cheating" - can't claim done without appropriate tests passing.
539 | - [Non-Deterministic Backpressure](#non-deterministic-backpressure) - Using LLM-as-judge for tests against subjective tasks (tone, aesthetics, UX). Binary pass/fail reviews that iterate until pass.
540 | - [Ralph-Friendly Work Branches](#ralph-friendly-work-branches) - Asking Ralph to "filter to feature X" at runtime is unreliable. Instead, create scoped plan per branch upfront.
541 | - [JTBD → Story Map → SLC Release](#jtbd--story-map--slc-release) - Push the power of "Letting Ralph Ralph" to connect JTBD's audience and activities to Simple/Lovable/Complete releases.
542 |
543 | ---
544 |
545 | ### Use Claude's AskUserQuestionTool for Planning
546 |
547 | During Phase 1 (Define Requirements), use Claude's built-in `AskUserQuestionTool` to systematically explore JTBD, topics of concern, edge cases, and acceptance criteria through structured interview before writing specs.
548 |
549 | _When to use:_ Minimal/vague initial requirements, need to clarify constraints, or multiple valid approaches exist.
550 |
551 | _Invoke:_ "Interview me using AskUserQuestion to understand [JTBD/topic/acceptance criteria/...]"
552 |
553 | Claude will ask targeted questions to clarify requirements and ensure alignment before producing `specs/*.md` files.
554 |
555 | _Flow:_
556 |
557 | 1. Start with known information →
558 | 2. _Claude interviews via AskUserQuestion_ →
559 | 3. Iterate until clear →
560 | 4. Claude writes specs with acceptance criteria →
561 | 5. Proceed to planning/building
562 |
563 | No code or prompt changes needed - this simply enhances Phase 1 using existing Claude Code capabilities.
564 |
565 | _Inspiration_ - [Thariq's X post](https://x.com/trq212/status/2005315275026260309)
566 |
567 | ---
568 |
569 | ### Acceptance-Driven Backpressure
570 |
571 | Geoff's Ralph _implicitly_ connects specs → implementation → tests through emergent iteration. This enhancement would make that connection _explicit_ by deriving test requirements during planning, creating a direct line from "what success looks like" to "what verifies it."
572 |
573 | This enhancement connects acceptance criteria (in specs) directly to test requirements (in implementation plan), improving backpressure quality by:
574 |
575 | - _Preventing cheating_ - Can't claim done without required tests derived from acceptance criteria
576 | - _Enabling TDD workflow_ - Test requirements known before implementation starts
577 | - _Improving convergence_ - Clear completion signal (required tests pass) vs ambiguous ("seems done?")
578 | - _Maintaining determinism_ - Test requirements in plan (known state) not emergent (probabilistic)
579 |
580 | #### Compatibility with Core Philosophy
581 |
582 | | Principle | Maintained? | How |
583 | | --------------------- | ----------- | ----------------------------------------------------------- |
584 | | Monolithic operation | ✅ Yes | One agent, one task, one loop at a time |
585 | | Backpressure critical | ✅ Yes | Tests are the mechanism, just derived explicitly now |
586 | | Context efficiency | ✅ Yes | Planning decides tests once vs building rediscovering |
587 | | Deterministic setup | ✅ Yes | Test requirements in plan (known state) not emergent |
588 | | Let Ralph Ralph | ✅ Yes | Ralph still prioritizes and chooses implementation approach |
589 | | Plan is disposable | ✅ Yes | Wrong test requirements? Regenerate plan |
590 | | "Capture the why" | ✅ Yes | Test intent documented in plan before implementation |
591 | | No cheating | ✅ Yes | Required tests prevent placeholder implementations |
592 |
593 | #### The Prescriptiveness Balance
594 |
595 | The critical distinction:
596 |
597 | _Acceptance criteria_ (in specs) = Behavioral outcomes, observable results, what success looks like
598 |
599 | - ✅ "Extracts 5-10 dominant colors from any uploaded image"
600 | - ✅ "Processes images <5MB in <100ms"
601 | - ✅ "Handles edge cases: grayscale, single-color, transparent backgrounds"
602 |
603 | _Test requirements_ (in implementation plan) = Verification points derived from acceptance criteria
604 |
605 | - ✅ "Required tests: Extract 5-10 colors, Performance <100ms, Handle grayscale edge case"
606 |
607 | _Implementation approach_ (up to Ralph) = Technical decisions about how to achieve it
608 |
609 | - ❌ "Use K-means clustering with 3 iterations and LAB color space conversion"
610 |
611 | The key: _Specify WHAT to verify (outcomes), not HOW to implement (approach)_
612 |
613 | This maintains the "Let Ralph Ralph" principle - Ralph decides implementation details while having clear success signals.
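
Put together, a plan entry under this enhancement might look like this (hypothetical task):

```markdown
- [ ] Color extraction: identify dominant colors in uploaded images
  - Required tests (from specs/color-extraction.md acceptance criteria):
    - returns 5-10 colors for representative images
    - processes images <5MB in <100ms
    - handles grayscale edge case
```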
614 |
615 | #### Architecture: Three-Phase Connection
616 |
617 | ```
618 | Phase 1: Requirements Definition
619 | specs/*.md + Acceptance Criteria
620 | ↓
621 | Phase 2: Planning (derives test requirements)
622 | IMPLEMENTATION_PLAN.md + Required Tests
623 | ↓
624 | Phase 3: Building (implements with tests)
625 | Implementation + Tests → Backpressure
626 | ```
627 |
628 | #### Phase 1: Requirements Definition
629 |
630 | During the human + LLM conversation that produces specs:
631 |
632 | - Discuss JTBD and break into topics of concern
633 | - Use subagents to load external context as needed
634 | - _Discuss and define acceptance criteria_ - what observable, verifiable outcomes indicate success
635 | - Keep criteria behavioral (outcomes), not implementation (how to build it)
636 | - LLM writes specs including acceptance criteria however makes most sense for the spec
637 | - Acceptance criteria become the foundation for deriving test requirements in planning phase
638 |
639 | #### Phase 2: Planning Mode Enhancement
640 |
641 | Modify `PROMPT_plan.md` instruction 1 to include test derivation. Add after the first sentence:
642 |
643 | ```markdown
644 | For each task in the plan, derive required tests from acceptance criteria in specs - what specific outcomes need verification (behavior, performance, edge cases). Tests verify WHAT works, not HOW it's implemented. Include as part of task definition.
645 | ```
646 |
647 | #### Phase 3: Building Mode Enhancement
648 |
649 | Modify `PROMPT_build.md` instructions:
650 |
651 | _Instruction 1:_ Add after "choose the most important item to address":
652 |
653 | ```markdown
654 | Tasks include required tests - implement tests as part of task scope.
655 | ```
656 |
657 | _Instruction 2:_ Replace "run the tests for that unit of code" with:
658 |
659 | ```markdown
660 | run all required tests specified in the task definition. All required tests must exist and pass before the task is considered complete.
661 | ```
662 |
663 | _Prepend new guardrail_ (in the 9s sequence):
664 |
665 | ```markdown
666 | 999. Required tests derived from acceptance criteria must exist and pass before committing. Tests are part of implementation scope, not optional. Test-driven development approach: tests can be written first or alongside implementation.
667 | ```
668 |
669 | ---
670 |
671 | ### Non-Deterministic Backpressure
672 |
673 | Some acceptance criteria resist programmatic validation:
674 |
675 | - _Creative quality_ - Writing tone, narrative flow, engagement
676 | - _Aesthetic judgments_ - Visual harmony, design balance, brand consistency
677 | - _UX quality_ - Intuitive navigation, clear information hierarchy
678 | - _Content appropriateness_ - Context-aware messaging, audience fit
679 |
680 | These require human-like judgment but need backpressure to meet acceptance criteria during building loop.
681 |
682 | _Solution:_ Add LLM-as-Judge tests as backpressure with binary pass/fail.
683 |
684 | LLM reviews are non-deterministic (same artifact may receive different judgments across runs). This aligns with Ralph philosophy: "deterministically bad in an undeterministic world." The loop provides eventual consistency through iteration—reviews run until pass, accepting natural variance.
685 |
686 | #### What Needs to Be Created (First Step)
687 |
688 | Create two files in `src/lib/`:
689 |
690 | ```
691 | src/lib/
692 | llm-review.ts # Core fixture - single function, clean API
693 | llm-review.test.ts # Reference examples showing the pattern (Ralph learns from these)
694 | ```
695 |
696 | ##### `llm-review.ts` - Binary pass/fail API Ralph discovers:
697 |
698 | ```typescript
699 | interface ReviewResult {
700 | pass: boolean;
701 | feedback?: string; // Only present when pass=false
702 | }
703 |
704 | function createReview(config: {
705 | criteria: string; // What to evaluate (behavioral, observable)
706 | artifact: string; // Text content OR screenshot path
707 | intelligence?: "fast" | "smart"; // Optional, defaults to 'fast'
708 | }): Promise<ReviewResult>;
709 | ```
710 |
711 | _Multimodal support:_ Both intelligence levels would use a multimodal model (text + vision). Artifact type detection is automatic:
712 |
713 | - Text evaluation: `artifact: "Your content here"` → Routes as text input
714 | - Vision evaluation: `artifact: "./tmp/screenshot.png"` → Routes as vision input (detects .png, .jpg, .jpeg extensions)
715 |
716 | _Intelligence levels_ (quality of judgment, not capability type):
717 |
718 | - `fast` (default): Quick, cost-effective models for straightforward evaluations
719 | - Example: Gemini 3.0 Flash (multimodal, fast, cheap)
720 | - `smart`: Higher-quality models for nuanced aesthetic/creative judgment
721 | - Example: GPT 5.1 (multimodal, better judgment, higher cost)
722 |
723 | The fixture implementation selects appropriate models. (Examples are current options, not requirements.)
724 |
725 | ##### `llm-review.test.ts` - Shows Ralph how to use it (text and vision examples):
726 |
727 | ```typescript
728 | import { createReview } from "@/lib/llm-review";
729 |
730 | // Example 1: Text evaluation
731 | test("welcome message tone", async () => {
732 | const message = generateWelcomeMessage();
733 | const result = await createReview({
734 | criteria:
735 | "Message uses warm, conversational tone appropriate for design professionals while clearly conveying value proposition",
736 | artifact: message, // Text content
737 | });
738 | expect(result.pass).toBe(true);
739 | });
740 |
741 | // Example 2: Vision evaluation (screenshot path)
742 | test("dashboard visual hierarchy", async () => {
743 | await page.screenshot({ path: "./tmp/dashboard.png" });
744 | const result = await createReview({
745 | criteria:
746 | "Layout demonstrates clear visual hierarchy with obvious primary action",
747 | artifact: "./tmp/dashboard.png", // Screenshot path
748 | });
749 | expect(result.pass).toBe(true);
750 | });
751 |
752 | // Example 3: Smart intelligence for complex judgment
753 | test("brand visual consistency", async () => {
754 | await page.screenshot({ path: "./tmp/homepage.png" });
755 | const result = await createReview({
756 | criteria:
757 | "Visual design maintains professional brand identity suitable for financial services while avoiding corporate sterility",
758 | artifact: "./tmp/homepage.png",
759 | intelligence: "smart", // Complex aesthetic judgment
760 | });
761 | expect(result.pass).toBe(true);
762 | });
763 | ```
764 |
765 | _Ralph learns from these examples:_ Both text and screenshots work as artifacts. Choose based on what needs evaluation. The fixture handles the rest internally.
766 |
767 | _Future extensibility:_ Current design uses single `artifact: string` for simplicity. Can expand to `artifact: string | string[]` if clear patterns emerge requiring multiple artifacts (before/after comparisons, consistency across items, multi-perspective evaluation). Composite screenshots or concatenated text could handle most multi-item needs.
768 |
769 | #### Integration with Ralph Workflow
770 |
771 | _Planning Phase_ - Update `PROMPT_plan.md`:
772 |
773 | After:
774 |
775 | ```
776 | ...Study @IMPLEMENTATION_PLAN.md to determine starting point for research and keep it up to date with items considered complete/incomplete using subagents.
777 | ```
778 |
779 | Insert this:
780 |
781 | ```
782 | When deriving test requirements from acceptance criteria, identify whether verification requires programmatic validation (measurable, inspectable) or human-like judgment (perceptual quality, tone, aesthetics). Both types are equally valid backpressure mechanisms. For subjective criteria that resist programmatic validation, explore src/lib for non-deterministic evaluation patterns.
783 | ```
784 |
785 | _Building Phase_ - Update `PROMPT_build.md`:
786 |
787 | Prepend new guardrail (in the 9s sequence):
788 |
789 | ```markdown
790 | 9999. Create tests to verify implementation meets acceptance criteria and include both conventional tests (behavior, performance, correctness) and perceptual quality tests (for subjective criteria, see src/lib patterns).
791 | ```
792 |
793 | _Discovery, not documentation:_ Ralph learns LLM review patterns from `llm-review.test.ts` examples during `src/lib` exploration (Phase 0c). No AGENTS.md updates needed - the code examples are the documentation.
794 |
795 | #### Compatibility with Core Philosophy
796 |
797 | | Principle | Maintained? | How |
798 | | --------------------- | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------- |
799 | | Backpressure critical | ✅ Yes | Extends backpressure to non-programmatic acceptance |
800 | | Deterministic setup | ⚠️ Partial | Criteria in plan (deterministic), evaluation non-deterministic but converges through iteration. Intentional tradeoff for subjective quality. |
801 | | Context efficiency | ✅ Yes | Fixture reused via `src/lib`, small test definitions |
802 | | Let Ralph Ralph | ✅ Yes | Ralph discovers pattern, chooses when to use, writes criteria |
803 | | Plan is disposable | ✅ Yes | Review requirements part of plan, regenerate if wrong |
804 | | Simplicity wins | ✅ Yes | Single function, binary result, no scoring complexity |
805 | | Add signs for Ralph | ✅ Yes | Light prompt additions, learning from code exploration |
806 |
807 | ---
808 |
809 | ### Ralph-Friendly Work Branches
810 |
811 | _The Critical Principle:_ Geoff's Ralph works from a single, disposable plan where Ralph picks "most important." To use branches with Ralph while maintaining this pattern, you must scope at plan creation, not at task selection.
812 |
813 | _Why this matters:_
814 |
815 | - ❌ _Wrong approach_: Create full plan, then ask Ralph to "filter" tasks at runtime → unreliable (70-80%), violates determinism
816 | - ✅ _Right approach_: Create a scoped plan upfront for each work branch → deterministic, simple, maintains "plan is disposable"
817 |
818 | _Solution:_ Add a `plan-work` mode to create a work-scoped IMPLEMENTATION_PLAN.md on the current branch. User creates work branch, then runs `plan-work` with a natural language description of the work focus. The LLM uses this description to scope the plan. Post planning, Ralph builds from this already-scoped plan with zero semantic filtering - just picks "most important" as always.
819 |
820 | _Terminology:_ "Work" is intentionally broad - it can describe features, topics of concern, refactoring efforts, infrastructure changes, bug fixes, or any coherent body of related changes. The work description you pass to `plan-work` is natural language for the LLM - it can be prose, not constrained by git branch naming rules.
821 |
822 | #### Design Principles
823 |
824 | - ✅ _Each Ralph session operates monolithically_ on ONE body of work per branch
825 | - ✅ _User creates branches manually_ - full control over naming conventions and strategy (e.g. worktrees)
826 | - ✅ _Natural language work descriptions_ - pass prose to LLM, unconstrained by git naming rules
827 | - ✅ _Scoping at plan creation_ (deterministic) not task selection (probabilistic)
828 | - ✅ _Single plan per branch_ - one IMPLEMENTATION_PLAN.md per branch
829 | - ✅ _Plan remains disposable_ - regenerate scoped plan when wrong/stale for a branch
830 | - ✅ No dynamic branch switching within a loop session
831 | - ✅ Maintains simplicity and determinism
832 | - ✅ Optional - main branch workflow still works
833 | - ✅ No semantic filtering at build time - Ralph just picks "most important"
834 |
835 | #### Workflow
836 |
837 | _1. Full Planning (on main branch)_
838 |
839 | ```bash
840 | ./loop.sh plan
841 | # Generate full IMPLEMENTATION_PLAN.md for entire project
842 | ```
843 |
844 | _2. Create Work Branch_
845 |
846 | User performs:
847 |
848 | ```bash
849 | git checkout -b ralph/user-auth-oauth
850 | # Create branch with whatever naming convention you prefer
851 | # Suggestion: ralph/* prefix for work branches
852 | ```
853 |
854 | _3. Scoped Planning (on work branch)_
855 |
856 | ```bash
857 | ./loop.sh plan-work "user authentication system with OAuth and session management"
858 | # Pass natural language description - LLM uses this to scope the plan
859 | # Creates focused IMPLEMENTATION_PLAN.md with only tasks for this work
860 | ```
861 |
862 | _4. Build from Plan (on work branch)_
863 |
864 | ```bash
865 | ./loop.sh
866 | # Ralph builds from scoped plan (no filtering needed)
867 | # Picks most important task from already-scoped plan
868 | ```
869 |
870 | _5. PR Creation (when work complete)_
871 |
872 | User performs:
873 |
874 | ```bash
875 | gh pr create --base main --head ralph/user-auth-oauth --fill
876 | ```
877 |
878 | #### Work-Scoped Loop Script
879 |
880 | Extends the base enhanced loop script to add work branch support with scoped planning:
881 |
882 | ```bash
883 | #!/bin/bash
884 | set -euo pipefail
885 |
886 | # Usage:
887 | # ./loop.sh [plan] [max_iterations] # Plan/build on current branch
888 | # ./loop.sh plan-work "work description" # Create scoped plan on current branch
889 | # Examples:
890 | # ./loop.sh # Build mode, unlimited
891 | # ./loop.sh 20 # Build mode, max 20
892 | # ./loop.sh plan 5 # Full planning, max 5
893 | # ./loop.sh plan-work "user auth" # Scoped planning
894 |
895 | # Parse arguments
896 | MODE="build"
897 | PROMPT_FILE="PROMPT_build.md"
898 |
899 | if [ "$1" = "plan" ]; then
900 | # Full planning mode
901 | MODE="plan"
902 | PROMPT_FILE="PROMPT_plan.md"
903 | MAX_ITERATIONS=${2:-0}
904 | elif [ "$1" = "plan-work" ]; then
905 | # Scoped planning mode
906 | if [ -z "$2" ]; then
907 | echo "Error: plan-work requires a work description"
908 | echo "Usage: ./loop.sh plan-work \"description of the work\""
909 | exit 1
910 | fi
911 | MODE="plan-work"
912 | WORK_DESCRIPTION="$2"
913 | PROMPT_FILE="PROMPT_plan_work.md"
914 | MAX_ITERATIONS=${3:-5} # Default 5 for work planning
915 | elif [[ "$1" =~ ^[0-9]+$ ]]; then
916 | # Build mode with max iterations
917 | MAX_ITERATIONS=$1
918 | else
919 | # Build mode, unlimited
920 | MAX_ITERATIONS=0
921 | fi
922 |
923 | ITERATION=0
924 | CURRENT_BRANCH=$(git branch --show-current)
925 |
926 | # Validate branch for plan-work mode
927 | if [ "$MODE" = "plan-work" ]; then
928 | if [ "$CURRENT_BRANCH" = "main" ] || [ "$CURRENT_BRANCH" = "master" ]; then
929 | echo "Error: plan-work should be run on a work branch, not main/master"
930 | echo "Create a work branch first: git checkout -b ralph/your-work"
931 | exit 1
932 | fi
933 |
934 | echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
935 | echo "Mode: plan-work"
936 | echo "Branch: $CURRENT_BRANCH"
937 | echo "Work: $WORK_DESCRIPTION"
938 | echo "Prompt: $PROMPT_FILE"
939 | echo "Plan: Will create scoped IMPLEMENTATION_PLAN.md"
940 | [ "$MAX_ITERATIONS" -gt 0 ] && echo "Max: $MAX_ITERATIONS iterations"
941 | echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
942 |
943 | # Warn about uncommitted changes to IMPLEMENTATION_PLAN.md
944 | if [ -f "IMPLEMENTATION_PLAN.md" ] && ! git diff --quiet IMPLEMENTATION_PLAN.md 2>/dev/null; then
945 | echo "Warning: IMPLEMENTATION_PLAN.md has uncommitted changes that will be overwritten"
946 | read -p "Continue? [y/N] " -n 1 -r
947 | echo
948 | [[ ! $REPLY =~ ^[Yy]$ ]] && exit 1
949 | fi
950 |
951 | # Export work description for PROMPT_plan_work.md
952 | export WORK_SCOPE="$WORK_DESCRIPTION"
953 | else
954 | # Normal plan/build mode
955 | echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
956 | echo "Mode: $MODE"
957 | echo "Branch: $CURRENT_BRANCH"
958 | echo "Prompt: $PROMPT_FILE"
959 | echo "Plan: IMPLEMENTATION_PLAN.md"
960 | [ "$MAX_ITERATIONS" -gt 0 ] && echo "Max: $MAX_ITERATIONS iterations"
961 | echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
962 | fi
963 |
964 | # Verify prompt file exists
965 | if [ ! -f "$PROMPT_FILE" ]; then
966 | echo "Error: $PROMPT_FILE not found"
967 | exit 1
968 | fi
969 |
970 | # Main loop
971 | while true; do
972 | if [ "$MAX_ITERATIONS" -gt 0 ] && [ "$ITERATION" -ge "$MAX_ITERATIONS" ]; then
973 | echo "Reached max iterations: $MAX_ITERATIONS"
974 |
975 | if [ "$MODE" = "plan-work" ]; then
976 | echo ""
977 | echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
978 | echo "Scoped plan created: $WORK_DESCRIPTION"
979 | echo "To build, run:"
980 | echo " ./loop.sh 20"
981 | echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
982 | fi
983 | break
984 | fi
985 |
986 | # Run Ralph iteration with selected prompt
987 | # -p: Headless mode (non-interactive, reads from stdin)
988 | # --dangerously-skip-permissions: Auto-approve all tool calls (YOLO mode)
989 | # --output-format=stream-json: Structured output for logging/monitoring
990 | # --model opus: Primary agent uses Opus for complex reasoning (task selection, prioritization)
991 | # Can use 'sonnet' for speed if plan is clear and tasks well-defined
992 | # --verbose: Detailed execution logging
993 |
994 | # For plan-work mode, substitute ${WORK_SCOPE} in prompt before piping
995 | if [ "$MODE" = "plan-work" ]; then
996 | envsubst < "$PROMPT_FILE" | claude -p \
997 | --dangerously-skip-permissions \
998 | --output-format=stream-json \
999 | --model opus \
1000 | --verbose
1001 | else
1002 | cat "$PROMPT_FILE" | claude -p \
1003 | --dangerously-skip-permissions \
1004 | --output-format=stream-json \
1005 | --model opus \
1006 | --verbose
1007 | fi
1008 |
1009 | # Push to current branch
1010 | CURRENT_BRANCH=$(git branch --show-current)
1011 | git push origin "$CURRENT_BRANCH" || {
1012 | echo "Failed to push. Creating remote branch..."
1013 | git push -u origin "$CURRENT_BRANCH"
1014 | }
1015 |
1016 | ITERATION=$((ITERATION + 1))
1017 | echo -e "\n\n======================== LOOP $ITERATION ========================\n"
1018 | done
1019 | ```
1020 |
1021 | #### `PROMPT_plan_work.md` Template
1022 |
1023 | _Note:_ Identical to `PROMPT_plan.md` but with scoping instructions and `WORK_SCOPE` env var substituted (automatically by the loop script).
1024 |
1025 | ```
1026 | 0a. Study `specs/*` with up to 250 parallel Sonnet subagents to learn the application specifications.
1027 | 0b. Study @IMPLEMENTATION_PLAN.md (if present) to understand the plan so far.
1028 | 0c. Study `src/lib/*` with up to 250 parallel Sonnet subagents to understand shared utilities & components.
1029 | 0d. For reference, the application source code is in `src/*`.
1030 |
1031 | 1. You are creating a SCOPED implementation plan for work: "${WORK_SCOPE}". Study @IMPLEMENTATION_PLAN.md (if present; it may be incorrect) and use up to 500 Sonnet subagents to study existing source code in `src/*` and compare it against `specs/*`. Use an Opus subagent to analyze findings, prioritize tasks, and create/update @IMPLEMENTATION_PLAN.md as a bullet point list sorted in priority of items yet to be implemented. Ultrathink. Consider searching for TODO, minimal implementations, placeholders, skipped/flaky tests, and inconsistent patterns. Study @IMPLEMENTATION_PLAN.md to determine starting point for research and keep it up to date with items considered complete/incomplete using subagents.
1032 |
1033 | IMPORTANT: This is SCOPED PLANNING for "${WORK_SCOPE}" only. Create a plan containing ONLY tasks directly related to this work scope. Be conservative - if uncertain whether a task belongs to this work, exclude it. The plan can be regenerated if too narrow. Plan only. Do NOT implement anything. Do NOT assume functionality is missing; confirm with code search first. Treat `src/lib` as the project's standard library for shared utilities and components. Prefer consolidated, idiomatic implementations there over ad-hoc copies.
1034 |
1035 | ULTIMATE GOAL: We want to achieve the scoped work "${WORK_SCOPE}". Consider missing elements related to this work and plan accordingly. If an element is missing, search first to confirm it doesn't exist, then if needed author the specification at specs/FILENAME.md. If you create a new element then document the plan to implement it in @IMPLEMENTATION_PLAN.md using a subagent.
1036 | ```
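
For reference, `envsubst` (from GNU gettext) replaces `${WORK_SCOPE}` in the template with the exported value before the prompt reaches claude:

```bash
export WORK_SCOPE="user auth with OAuth"
echo 'Scoped plan for: "${WORK_SCOPE}"' | envsubst
# → Scoped plan for: "user auth with OAuth"
```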
1037 |
1038 | #### Compatibility with Core Philosophy
1039 |
1040 | | Principle | Maintained? | How |
1041 | | ---------------------- | ----------- | ------------------------------------------------------------------------ |
1042 | | Monolithic operation | ✅ Yes | Ralph still operates as single process within branch |
1043 | | One task per loop | ✅ Yes | Unchanged |
1044 | | Fresh context | ✅ Yes | Unchanged |
1045 | | Deterministic | ✅ Yes | Scoping at plan creation (deterministic), not runtime (prob.) |
1046 | | Simple | ✅ Yes | Optional enhancement, main workflow still works |
1047 | | Plan-driven | ✅ Yes | One IMPLEMENTATION_PLAN.md per branch |
1048 | | Single source of truth | ✅ Yes | One plan per branch - scoped plan replaces full plan on branch |
1049 | | Plan is disposable | ✅ Yes | Regenerate scoped plan anytime: `./loop.sh plan-work "work description"` |
1050 | | Markdown over JSON | ✅ Yes | Still markdown plans |
1051 | | Let Ralph Ralph | ✅ Yes | Ralph picks "most important" from already-scoped plan - no filter |
1052 |
1053 | ---
1054 |
1055 | ### JTBD → Story Map → SLC Release
1056 |
1057 | #### Topics of Concern → Activities
1058 |
1059 | Geoff's [suggested workflow](https://ghuntley.com/content/images/size/w2400/2025/07/The-ralph-Process.png) already aligns planning with Jobs-to-be-Done — breaking JTBDs into topics of concern, which in turn become specs. I love this and I think there's an opportunity to lean further into the product benefits this approach affords by reframing _topics of concern_ as _activities_.
1060 |
1061 | Activities are verbs in a journey ("upload photo", "extract colors") rather than capabilities ("color extraction system"). They're naturally scoped by user intent.
1062 |
1063 | > Topics: "color extraction", "layout engine" → capability-oriented
1064 | > Activities: "upload photo", "see extracted colors", "arrange layout" → journey-oriented
1065 |
1066 | #### Activities → User Journey
1067 |
1068 | Activities — and their constituent steps — sequence naturally into a user flow, creating a _journey structure_ that makes gaps and dependencies visible. A _[User Story Map](https://www.nngroup.com/articles/user-story-mapping/)_ organizes activities as columns (the journey backbone) with capability depths as rows — the full space of what _could_ be built:
1069 |
1070 | ```
1071 | UPLOAD → EXTRACT → ARRANGE → SHARE
1072 |
1073 | basic auto manual export
1074 | bulk palette templates collab
1075 | batch AI themes auto-layout embed
1076 | ```
1077 |
1078 | #### User Journey → Release Slices
1079 |
1080 | Horizontal slices through the map become candidate releases. Not every activity needs new capability in every release — some cells stay empty, and that's fine if the slice is still coherent:
1081 |
1082 | ```
1083 | UPLOAD → EXTRACT → ARRANGE → SHARE
1084 |
1085 | Release 1: basic auto export
1086 | ───────────────────────────────────────────────────
1087 | Release 2: palette manual
1088 | ───────────────────────────────────────────────────
1089 | Release 3: batch AI themes templates embed
1090 | ```
1091 |
1092 | #### Release Slices → SLC Releases
1093 |
1094 | The story map gives you _structure_ for slicing. Jason Cohen's _[Simple, Lovable, Complete (SLC)](https://longform.asmartbear.com/slc/)_ gives you _criteria_ for what makes a slice good:
1095 |
1096 | - _Simple_ — Narrow scope you can ship fast. Not every activity, not every depth.
1097 | - _Lovable_ — People actually want to use it. Delightful within its boundaries.
1098 | - _Complete_ — Fully accomplishes a job within that scope. Not a broken preview.
1099 |
1100 | _Why SLC over MVP?_ MVPs optimize for learning at the customer's expense — "minimum" often means broken or frustrating. SLC flips this: learn in-market _while_ delivering real value. If it succeeds, you have optionality. If it fails, you still treated users well.
1101 |
1102 | Each slice can become a release with a clear value proposition and identity:
1103 |
1104 | ```
1105 | UPLOAD → EXTRACT → ARRANGE → SHARE
1106 |
1107 | Palette Picker: basic auto export
1108 | ───────────────────────────────────────────────────
1109 | Mood Board: palette manual
1110 | ───────────────────────────────────────────────────
1111 | Design Studio: batch AI themes templates embed
1112 | ```
1113 |
1114 | - _Palette Picker_ — Upload, extract, export. Instant value from day one.
1115 | - _Mood Board_ — Adds arrangement. Creative expression enters the journey.
1116 | - _Design Studio_ — Professional features: batch processing, AI themes, embeddable output.
1117 |
1118 | ---
1119 |
1120 | #### Operationalizing with Ralph
1121 |
1122 | The concepts above — activities, story maps, SLC releases — are the _thinking tools_. How do we translate them into Ralph's workflow?
1123 |
1124 | _Default Ralph approach:_
1125 |
1126 | 1. _Define Requirements_: Human + LLM define JTBD topics of concern → `specs/*.md`
1127 | 2. _Create Tasks Plan_: LLM analyzes all specs + current code → `IMPLEMENTATION_PLAN.md`
1128 | 3. _Build_: Ralph builds against full scope
1129 |
1130 | This works well for capability-focused work (features, refactors, infrastructure). But it doesn't naturally produce valuable (SLC) product releases - it produces "whatever the specs describe".
1131 |
1132 | _Activities → SLC Release approach:_
1133 |
1134 | To get SLC releases, we need to ground activities in audience context. Audience defines WHO has the JTBDs, which in turn informs WHAT activities matter and what "lovable" means.
1135 |
1136 | ```
1137 | Audience (who)
1138 | └── has JTBDs (why)
1139 | └── fulfilled by Activities (how)
1140 | ```
1141 |
1142 | ##### Workflow
1143 |
1144 | _I. Requirements Phase (2 steps):_
1145 |
1146 | Still performed in LLM conversations with the human, similar to the default Ralph approach.
1147 |
1148 | 1. _Define audience and their JTBDs_ — WHO are we building for and what OUTCOMES do they want?
1149 |
1150 | - Human + LLM discuss and determine the audience(s) and their JTBDs (outcomes they want)
1151 | - May contain multiple connected audiences (e.g. "designer" creates, "client" reviews)
1152 | - Generates `AUDIENCE_JTBD.md`
1153 |
1154 | 2. _Define activities_ — WHAT do users do to accomplish their JTBDs?
1155 |
1156 | - Informed by `AUDIENCE_JTBD.md`
1157 | - For each JTBD, identify activities necessary to accomplish it
1158 | - For each activity, determine:
1159 | - Capability depths (basic → enhanced) — levels of sophistication
1160 | - Desired outcome(s) at each depth — what does success look like?
1161 | - Generates `specs/*.md` (one per activity)
1162 |
1163 | The discrete steps within activities remain implicit; the LLM can infer them during planning.
1164 |
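As a purely illustrative sketch (the real structure is whatever the requirements conversation produces), `AUDIENCE_JTBD.md` might look like this:

```
# AUDIENCE_JTBD.md (illustrative)

## Audience: Designer (primary)
- JTBD: Capture a client's space
- JTBD: Explore concepts
- JTBD: Present options to the client

## Audience: Client (connected)
- JTBD: Review concepts and give feedback
```
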
1165 | _II. Planning Phase:_
1166 |
1167 | Performed in Ralph loop with _updated_ planning prompt.
1168 |
1169 | - LLM analyzes:
1170 | - `AUDIENCE_JTBD.md` (who, desired outcomes)
1171 | - `specs/*` (what could be built)
1172 | - Current code state (what exists)
1173 | - LLM determines next SLC slice (which activities, at what capability depths) and plans tasks for that slice
1174 | - LLM generates `IMPLEMENTATION_PLAN.md`
1175 | - _Human verifies_ plan before building:
1176 | - Does the scope represent a coherent SLC release?
1177 | - Are the right activities included at the right depths?
1178 | - If wrong → re-run planning loop to regenerate plan, optionally updating inputs or planning prompt
1179 | - If right → proceed to building
1180 |
1181 | _III. Building Phase:_
1182 |
1183 | Performed in Ralph loop with standard building prompt.
1184 |
1185 | ##### Updated Planning Prompt
1186 |
1187 | Variant of `PROMPT_plan.md` that adds audience context and SLC-oriented slice recommendation.
1188 |
1189 | _Notes:_
1190 |
1191 | - Unlike the default template, this does not have a `[project-specific goal]` placeholder — the goal is implicit: recommend the most valuable next release for the audience.
1192 | - The subagent names below presume Claude models (Sonnet/Opus).
1193 |
1194 | ```
1195 | 0a. Study @AUDIENCE_JTBD.md to understand who we're building for and their Jobs to Be Done.
1196 | 0b. Study `specs/*` with up to 250 parallel Sonnet subagents to learn JTBD activities.
1197 | 0c. Study @IMPLEMENTATION_PLAN.md (if present) to understand the plan so far.
1198 | 0d. Study `src/lib/*` with up to 250 parallel Sonnet subagents to understand shared utilities & components.
1199 | 0e. For reference, the application source code is in `src/*`.
1200 |
1201 | 1. Sequence the activities in `specs/*` into a user journey map for the audience in @AUDIENCE_JTBD.md. Consider how activities flow into each other and what dependencies exist.
1202 |
1203 | 2. Determine the next SLC release. Use up to 500 Sonnet subagents to compare `src/*` against `specs/*`. Use an Opus subagent to analyze findings. Ultrathink. Given what's already implemented recommend which activities (at what capability depths) form the most valuable next release. Prefer thin horizontal slices - the narrowest scope that still delivers real value. A good slice is Simple (narrow, achievable), Lovable (people want to use it), and Complete (fully accomplishes a meaningful job, not a broken preview).
1204 |
1205 | 3. Use an Opus subagent (ultrathink) to analyze and synthesize the findings, prioritize tasks, and create/update @IMPLEMENTATION_PLAN.md as a bullet point list sorted in priority of items yet to be implemented for the recommended SLC release. Begin plan with a summary of the recommended SLC release (what's included and why), then list prioritized tasks for that scope. Consider TODOs, placeholders, minimal implementations, skipped tests - but scoped to the release. Note discoveries outside scope as future work.
1206 |
1207 | IMPORTANT: Plan only. Do NOT implement anything. Do NOT assume functionality is missing; confirm with code search first. Treat `src/lib` as the project's standard library for shared utilities and components. Prefer consolidated, idiomatic implementations there over ad-hoc copies.
1208 |
1209 | ULTIMATE GOAL: We want to achieve the most valuable next release for the audience in @AUDIENCE_JTBD.md. Consider missing elements and plan accordingly. If an element is missing, search first to confirm it doesn't exist, then if needed author the specification at specs/FILENAME.md. If you create a new element then document the plan to implement it in @IMPLEMENTATION_PLAN.md using a subagent.
1210 | ```
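
Operationally, this variant can simply replace `PROMPT_plan.md` (which `./loop.sh plan` reads), or be wired into `loop.sh` as its own mode alongside the scoped-work command.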
1211 |
1212 | ##### Notes
1213 |
1214 | _Why `AUDIENCE_JTBD.md` as a separate artifact:_
1215 |
1216 | - Single source of truth — prevents drift across specs
1217 | - Enables holistic reasoning: "What does this audience need MOST?"
1218 | - JTBDs captured alongside audience (the "why" lives with the "who")
1219 | - Referenced twice: during spec creation AND SLC planning
1220 | - Keeps activity specs focused on WHAT, not repeating WHO
1221 |
1222 | _Cardinalities:_
1223 |
1224 | - One audience → many JTBDs ("Designer" has "capture space", "explore concepts", "present to client")
1225 | - One JTBD → many activities ("capture space" includes upload, measurements, room detection)
1226 | - One activity → can serve multiple JTBDs ("upload photo" serves both "capture" and "gather inspiration")
1227 |
```
--------------------------------------------------------------------------------
/files/AGENTS.md:
--------------------------------------------------------------------------------
```markdown
1 | ## Build & Run
2 |
3 | Succinct rules for how to BUILD the project:
4 |
5 | ## Validation
6 |
7 | Run these after implementing to get immediate feedback:
8 |
9 | - Tests: `[test command]`
10 | - Typecheck: `[typecheck command]`
11 | - Lint: `[lint command]`
12 |
13 | ## Operational Notes
14 |
15 | Succinct learnings about how to RUN the project:
16 |
17 | ...
18 |
19 | ### Codebase Patterns
20 |
21 | ...
22 |
```
--------------------------------------------------------------------------------
/files/IMPLEMENTATION_PLAN.md:
--------------------------------------------------------------------------------
```markdown
1 | <!-- Generated by LLM with content and structure it deems most appropriate -->
2 |
```
--------------------------------------------------------------------------------
/.vscode/settings.json:
--------------------------------------------------------------------------------
```json
1 | {
2 | "cSpell.words": [
3 | "backpressure",
4 | "collab",
5 | "elif",
6 | "envsubst",
7 | "farr",
8 | "geoff",
9 | "geoff's",
10 | "ghuntley",
11 | "grayscale",
12 | "inspectable",
13 | "mattpocockuk",
14 | "monolithically",
15 | "multimodal",
16 | "opencode",
17 | "operationalizing",
18 | "pipefail",
19 | "ryancarson",
20 | "stdlib",
21 | "thariq's",
22 | "typechecks",
23 | "ultrathink",
24 | "undeterministic",
25 | "worktrees"
26 | ]
27 | }
28 |
```
--------------------------------------------------------------------------------
/files/PROMPT_plan.md:
--------------------------------------------------------------------------------
```markdown
1 | 0a. Study `specs/*` with up to 250 parallel Sonnet subagents to learn the application specifications.
2 | 0b. Study @IMPLEMENTATION_PLAN.md (if present) to understand the plan so far.
3 | 0c. Study `src/lib/*` with up to 250 parallel Sonnet subagents to understand shared utilities & components.
4 | 0d. For reference, the application source code is in `src/*`.
5 |
6 | 1. Study @IMPLEMENTATION_PLAN.md (if present; it may be incorrect) and use up to 500 Sonnet subagents to study existing source code in `src/*` and compare it against `specs/*`. Use an Opus subagent to analyze findings, prioritize tasks, and create/update @IMPLEMENTATION_PLAN.md as a bullet point list sorted in priority of items yet to be implemented. Ultrathink. Consider searching for TODO, minimal implementations, placeholders, skipped/flaky tests, and inconsistent patterns. Study @IMPLEMENTATION_PLAN.md to determine starting point for research and keep it up to date with items considered complete/incomplete using subagents.
7 |
8 | IMPORTANT: Plan only. Do NOT implement anything. Do NOT assume functionality is missing; confirm with code search first. Treat `src/lib` as the project's standard library for shared utilities and components. Prefer consolidated, idiomatic implementations there over ad-hoc copies.
9 |
10 | ULTIMATE GOAL: We want to achieve [project-specific goal]. Consider missing elements and plan accordingly. If an element is missing, search first to confirm it doesn't exist, then if needed author the specification at specs/FILENAME.md. If you create a new element then document the plan to implement it in @IMPLEMENTATION_PLAN.md using a subagent.
11 |
```
--------------------------------------------------------------------------------
/files/loop.sh:
--------------------------------------------------------------------------------
```bash
1 | #!/bin/bash
2 | # Usage: ./loop.sh [plan] [max_iterations]
3 | # Examples:
4 | # ./loop.sh # Build mode, unlimited iterations
5 | # ./loop.sh 20 # Build mode, max 20 iterations
6 | # ./loop.sh plan # Plan mode, unlimited iterations
7 | # ./loop.sh plan 5 # Plan mode, max 5 iterations
8 |
9 | # Parse arguments
10 | if [ "$1" = "plan" ]; then
11 | # Plan mode
12 | MODE="plan"
13 | PROMPT_FILE="PROMPT_plan.md"
14 | MAX_ITERATIONS=${2:-0}
15 | elif [[ "$1" =~ ^[0-9]+$ ]]; then
16 | # Build mode with max iterations
17 | MODE="build"
18 | PROMPT_FILE="PROMPT_build.md"
19 | MAX_ITERATIONS=$1
20 | else
21 | # Build mode, unlimited (no arguments or invalid input)
22 | MODE="build"
23 | PROMPT_FILE="PROMPT_build.md"
24 | MAX_ITERATIONS=0
25 | fi
26 |
27 | ITERATION=0
28 | CURRENT_BRANCH=$(git branch --show-current)
29 |
30 | echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
31 | echo "Mode: $MODE"
32 | echo "Prompt: $PROMPT_FILE"
33 | echo "Branch: $CURRENT_BRANCH"
34 | [ $MAX_ITERATIONS -gt 0 ] && echo "Max: $MAX_ITERATIONS iterations"
35 | echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
36 |
37 | # Verify prompt file exists
38 | if [ ! -f "$PROMPT_FILE" ]; then
39 | echo "Error: $PROMPT_FILE not found"
40 | exit 1
41 | fi
42 |
43 | while true; do
44 | if [ $MAX_ITERATIONS -gt 0 ] && [ $ITERATION -ge $MAX_ITERATIONS ]; then
45 | echo "Reached max iterations: $MAX_ITERATIONS"
46 | break
47 | fi
48 |
49 | # Run Ralph iteration with selected prompt
50 | # -p: Headless mode (non-interactive, reads from stdin)
51 | # --dangerously-skip-permissions: Auto-approve all tool calls (YOLO mode)
52 | # --output-format=stream-json: Structured output for logging/monitoring
53 | # --model opus: Primary agent uses Opus for complex reasoning (task selection, prioritization)
54 | # Can use 'sonnet' in build mode for speed if plan is clear and tasks well-defined
55 | # --verbose: Detailed execution logging
56 | cat "$PROMPT_FILE" | claude -p \
57 | --dangerously-skip-permissions \
58 | --output-format=stream-json \
59 | --model opus \
60 | --verbose
61 |
62 | # Push changes after each iteration
63 | git push origin "$CURRENT_BRANCH" || {
64 | echo "Failed to push. Creating remote branch..."
65 | git push -u origin "$CURRENT_BRANCH"
66 | }
67 |
68 | ITERATION=$((ITERATION + 1))
69 | echo -e "\n\n======================== LOOP $ITERATION ========================\n"
70 | done
```
--------------------------------------------------------------------------------
/files/PROMPT_build.md:
--------------------------------------------------------------------------------
```markdown
1 | 0a. Study `specs/*` with up to 500 parallel Sonnet subagents to learn the application specifications.
2 | 0b. Study @IMPLEMENTATION_PLAN.md.
3 | 0c. For reference, the application source code is in `src/*`.
4 |
5 | 1. Your task is to implement functionality per the specifications using parallel subagents. Follow @IMPLEMENTATION_PLAN.md and choose the most important item to address. Before making changes, search the codebase (don't assume not implemented) using Sonnet subagents. You may use up to 500 parallel Sonnet subagents for searches/reads and only 1 Sonnet subagent for build/tests. Use Opus subagents when complex reasoning is needed (debugging, architectural decisions).
6 | 2. After implementing functionality or resolving problems, run the tests for that unit of code that was improved. If functionality is missing then it's your job to add it as per the application specifications. Ultrathink.
7 | 3. When you discover issues, immediately update @IMPLEMENTATION_PLAN.md with your findings using a subagent. When resolved, update and remove the item.
8 | 4. When the tests pass, update @IMPLEMENTATION_PLAN.md, then `git add -A` then `git commit` with a message describing the changes. After the commit, `git push`.
9 |
10 | 99999. Important: When authoring documentation, capture the why — tests and implementation importance.
11 | 999999. Important: Single sources of truth, no migrations/adapters. If tests unrelated to your work fail, resolve them as part of the increment.
12 | 9999999. As soon as there are no build or test errors create a git tag. If there are no git tags start at 0.0.0 and increment patch by 1 for example 0.0.1 if 0.0.0 does not exist.
13 | 99999999. You may add extra logging if required to debug issues.
14 | 999999999. Keep @IMPLEMENTATION_PLAN.md current with learnings using a subagent — future work depends on this to avoid duplicating efforts. Update especially after finishing your turn.
15 | 9999999999. When you learn something new about how to run the application, update @AGENTS.md using a subagent but keep it brief. For example if you run commands multiple times before learning the correct command then that file should be updated.
16 | 99999999999. For any bugs you notice, resolve them or document them in @IMPLEMENTATION_PLAN.md using a subagent even if it is unrelated to the current piece of work.
17 | 999999999999. Implement functionality completely. Placeholders and stubs waste effort and time redoing the same work.
18 | 9999999999999. When @IMPLEMENTATION_PLAN.md becomes large periodically clean out the items that are completed from the file using a subagent.
19 | 99999999999999. If you find inconsistencies in the specs/* then use an Opus 4.5 subagent with 'ultrathink' requested to update the specs.
20 | 999999999999999. IMPORTANT: Keep @AGENTS.md operational only — status updates and progress notes belong in `IMPLEMENTATION_PLAN.md`. A bloated AGENTS.md pollutes every future loop's context.
```
--------------------------------------------------------------------------------
/references/sandbox-environments.md:
--------------------------------------------------------------------------------
```markdown
1 | <!-- cSpell:disable -->
2 |
3 | # Sandbox Environments for AI Agent Workflows
4 |
5 | _Security model:_ The sandbox (Docker/E2B) provides the security boundary. Inside the sandbox, Claude runs with full permissions because the container itself is isolated.
6 |
7 | _Security philosophy:_
8 |
9 | > "It's not if it gets popped, it's when it gets popped. And what is the blast radius?"
10 |
11 | Run on dedicated VMs or local Docker sandboxes. Restrict network connectivity, provide only necessary credentials, and ensure no access to private data beyond what the task requires.
12 |
13 | ---
14 |
15 | ## Options
16 |
17 | ### Sprites (Fly.io)
18 |
19 | - Persistent Linux environments that survive between executions indefinitely
20 | - Firecracker VM isolation with up to 8 vCPUs and 8GB RAM
21 | - Fast checkpoint/restore (~300ms create, <1s restore)
22 | - Auto-sleep after 30 seconds of inactivity
23 | - Unique HTTPS URL per Sprite for webhooks, APIs, public access
24 | - Layer 3 network policies for egress control (whitelist domains or use default LLM-friendly list)
25 | - CLI, REST API, JavaScript SDK, Go SDK (Python and Elixir coming soon)
26 | - Pre-installed tools: Claude Code, Codex CLI, Gemini CLI, Python 3.13, Node.js 22.20
27 | - $30 free credits to start (~500 Sprites worth)
28 |
29 | _Philosophy:_ Fly.io argues that "ephemeral sandboxes are obsolete" and that AI agents need persistent computers, not disposable containers. Sprites treat sandboxes as "actual computers" where data, packages, and services persist across executions on ext4 NVMe storage—no need to rebuild environments repeatedly. As they put it: "Claude doesn't want a stateless container."
30 |
31 | _Unique Features:_
32 |
33 | - _Stateful persistence_: Files, packages, databases survive between runs indefinitely
34 | - _Transactional snapshots_: Copy-on-write checkpoints capture entire disk state; stores last 5 checkpoints
35 | - _Idle cost optimization_: Auto-sleep when inactive (30s timeout), resume on request (<1s wake)
36 | - _Cold start_: Creation in 1-2 seconds, restore under 1 second
37 | - _Claude integration_: Pre-installed skills teach Claude how to use Sprites (port forwarding, etc.)
38 | - _Storage billing_: Pay only for blocks written, not allocated space; TRIM-friendly
39 | - _No time limits_: Unlike ephemeral sandboxes (typically 15-minute limits), Sprites support long-running workloads
40 |
41 | _Pricing:_
42 |
43 | | Resource | Cost | Minimum |
44 | | -------- | ---------------- | ------------------- |
45 | | CPU | $0.07/CPU-hour | 6.25% utilization/s |
46 | | Memory | $0.04375/GB-hour | 250MB per second |
47 | | Storage | $0.00068/GB-hour | Actual blocks only |
48 |
49 | - Free trial: $30 in credits (~500 Sprites)
50 | - Plan: $20/month includes monthly credits; overages at published rates
51 | - Example costs: 4-hour coding session ~$0.46, web app with 30 active hours ~$4/month
52 |
53 | _Specs:_
54 |
55 | | Spec | Value |
56 | | ------------ | -------------------------------------------------------------- |
57 | | Isolation | Firecracker microVM (hardware-isolated) |
58 | | Resources | Up to 8 vCPUs, 8GB RAM per execution (fixed, not configurable) |
59 | | Storage | 100GB initial ext4 partition on NVMe, auto-scaling capacity |
60 | | Cold Start | <1 second restore, 1-2 seconds creation |
61 | | Timeout | None (persistent); auto-sleeps after 30 seconds inactivity |
62 | | Active Limit | 10 simultaneous active Sprites on base plan; unlimited cold |
63 | | Network | Port 8080 proxied for HTTP services; isolated networks |
64 |
65 | _Limitations:_
66 |
67 | - Resource caps (8 vCPU, 8GB RAM, 100GB storage) not configurable yet
68 | - 30-second idle timeout not configurable
69 | - Region selection not available (auto-assigned based on geographic location)
70 | - Maximum 10 active sprites on base plan (unlimited cold/inactive sprites allowed)
71 | - Best for personal/organizational tools; not designed for million-user scale apps
72 |
73 | _Links:_
74 |
75 | - Official: https://sprites.dev/
76 | - Documentation: https://docs.sprites.dev/
77 | - Fly.io Blog: https://fly.io/blog/code-and-let-live/
78 | - JavaScript SDK: https://github.com/superfly/sprites-js
79 | - Go SDK: https://github.com/superfly/sprites-go
80 | - Elixir SDK: https://github.com/superfly/sprites-ex
81 | - Community: https://community.fly.io/c/sprites/
82 |
83 | ---
84 |
85 | ### E2B
86 |
87 | - Purpose-built for AI agents and LLM workflows
88 | - Pre-built template `anthropic-claude-code` ships with Claude Code CLI ready
89 | - Single-line SDK calls in Python or JavaScript (v1.5.1+)
90 | - Full filesystem + git for progress.txt, prd.json, and repo operations
91 | - 24-hour session limits on Pro plan (1 hour on Hobby)
92 | - Native access to 200+ MCP tools via Docker partnership (GitHub, Notion, Stripe, etc.)
93 | - Configurable compute: 1-8 vCPU, 512MB-8GB RAM
94 |
95 | _Philosophy:_ E2B believes AI agents need transient, immutable workloads with hardware-level kernel isolation. Each sandbox runs in its own Firecracker microVM, providing the same isolation as AWS Lambda. The focus is on developer experience—one SDK call to create a sandbox.
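
To get a feel for that "one SDK call" flow, here is a minimal sketch, assuming the v1-style `e2b` Python SDK and the `anthropic-claude-code` template mentioned above (check the linked docs for current signatures):

```python
from e2b import Sandbox  # pip install e2b

# Create a sandbox from the pre-built Claude Code template,
# run a command inside it, then tear it down.
sandbox = Sandbox(template="anthropic-claude-code")
result = sandbox.commands.run("claude --version")
print(result.stdout)
sandbox.kill()
```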
96 |
97 | _Unique Features:_
98 |
99 | - _Fast cold start_: ~150-200ms via Firecracker microVMs (fastest among the microVM options listed here)
100 | - _Pre-built Claude template_: Zero-setup Claude Code integration
101 | - _Docker MCP Partnership_: Native access to 200+ MCP tools from Docker's catalog
102 | - _Pause/Resume (Beta)_: Save full VM state including memory (~4s per 1GB to pause, ~1s to resume, state persists up to 30 days)
103 | - _Network controls_: `allowInternetAccess` toggle, `network.allowOut`/`network.denyOut` for granular CIDR/domain filtering
104 | - _Domain filtering_: Works for HTTP (port 80) and TLS (port 443) via SNI inspection
105 |
106 | _Pricing:_
107 |
108 | | Plan | Monthly Fee | Session Limit | Notes |
109 | | ---------- | ----------- | ------------- | --------------------------- |
110 | | Hobby | $0 | 1 hour | + $100 one-time credit |
111 | | Pro | $150 | 24 hours | + usage costs |
112 | | Enterprise | Custom | Custom | SSO, SLA, dedicated support |
113 |
114 | _Usage Rates (per second):_
115 |
116 | | Resource | Rate |
117 | | -------- | -------------- |
118 | | 2 vCPU | $0.000028/s |
119 | | Memory | $0.0000045/GiB |
120 |
121 | _Specs:_
122 |
123 | | Spec | Value |
124 | | ------------- | -------------------------------------- |
125 | | Isolation | Firecracker microVM |
126 | | Cold Start | ~150-200ms |
127 | | Timeout | 1 hour (Hobby), 24 hours (Pro) |
128 | | Compute | 1-8 vCPU, 512MB-8GB RAM (configurable) |
129 | | Filesystem | Full Linux with git support |
130 | | Pre-installed | Node.js, curl, ripgrep, Claude Code |
131 |
132 | _Limitations:_
133 |
134 | - No native sandbox clone/fork functionality
135 | - No bulk file reading API
136 | - Domain filtering limited to HTTP/HTTPS ports (UDP/QUIC not supported)
137 | - Self-hosted version lacks built-in network policies
138 | - Occasional 502 timeout errors on long operations
139 | - Sandbox "not found" errors near timeout boundaries
140 |
141 | _Links:_
142 |
143 | - Official: https://e2b.dev/
144 | - Documentation: https://e2b.dev/docs
145 | - Pricing: https://e2b.dev/pricing
146 | - Python Guide: https://e2b.dev/blog/python-guide-run-claude-code-in-an-e2b-sandbox
147 | - JavaScript Guide: https://e2b.dev/blog/javascript-guide-run-claude-code-in-an-e2b-sandbox
148 | - Claude Code Template: https://e2b.dev/docs/code-interpreter/claude-code
149 | - MCP Server: https://github.com/e2b-dev/mcp-server
150 | - GitHub: https://github.com/e2b-dev/E2B
151 |
152 | ---
153 |
154 | ### Modal
155 |
156 | Modal Sandboxes are the Modal primitive for safely running untrusted code from LLMs, users, or third-party sources. Built on Modal's serverless container fabric with gVisor isolation.
157 |
158 | _Key Features:_
159 |
160 | - Pure Python SDK for defining sandboxes with one line of code (also JS/Go SDKs)
161 | - Execute arbitrary commands with `sandbox.exec()` and stream output
162 | - Autoscale from zero to 10,000+ concurrent sandboxes
163 | - Dynamic image definition at runtime from model output
164 | - Built-in tunneling for HTTP/WebSocket connections to sandbox servers
165 | - Granular egress policies via CIDR allowlists
166 | - Named sandboxes for persistent reference and pooling
167 | - Production-proven: Lovable and Quora run millions of code executions daily
168 |
169 | _Philosophy:_ Modal treats sandboxes as secure, ephemeral compute units that inherit its serverless fabric. The focus is on Python-first AI/ML workloads with aggressive cost optimization through scale-to-zero, trading cold start latency for resource efficiency.
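
A minimal sketch of the `sandbox.exec()` flow noted above, assuming the current `modal` Python SDK (the app name here is a hypothetical placeholder):

```python
import modal  # pip install modal

# Sandboxes attach to a Modal App; look one up by name, creating it if needed.
app = modal.App.lookup("agent-sandboxes", create_if_missing=True)

sb = modal.Sandbox.create(app=app)  # default image, scale-to-zero billing
proc = sb.exec("python", "-c", "print('hello from the sandbox')")
print(proc.stdout.read())  # read the command's streamed output
sb.terminate()
```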
170 |
171 | _Unique Features:_
172 |
173 | - _Sandbox Connect Tokens_: Authenticated HTTP/WebSocket access with unspoofable `X-Verified-User-Data` headers for access control
174 | - _Memory Snapshots_: Capture container memory state to reduce cold starts to <3s even with large dependencies like PyTorch
175 | - _Idle Timeout_: Auto-terminate sandboxes after configurable inactivity period
176 | - _Filesystem Snapshots_: Preserve state across sandbox instances for 24+ hour workflows
177 | - _No pre-provisioning_: Sandboxes created on-demand without capacity planning
178 |
179 | _Pricing (as of late 2025, after 65% price reduction):_
180 |
181 | | Plan | Monthly Fee | Credits Included | Seats | Container Limits |
182 | | ---------- | ----------- | ---------------- | ----- | ------------------------------- |
183 | | Starter | $0 | $30/month | 3 | 100 containers, 10 GPU |
184 | | Team | $250 | $100/month | ∞ | 1,000 containers, 50 GPU |
185 | | Enterprise | Custom | Volume discounts | ∞ | Custom limits, HIPAA, SSO, etc. |
186 |
187 | _Compute Rates (per second):_
188 |
189 | | Resource | Rate | Notes |
190 | | --------------------- | ---------------- | ---------------------------- |
191 | | Sandbox/Notebook CPU | $0.00003942/core | Per physical core (= 2 vCPU) |
192 | | Standard Function CPU | $0.0000131/core | Per physical core |
193 | | Memory | $0.00000222/GiB | Pay for actual usage |
194 | | GPU (A10G) | $0.000306/s | ~$1.10/hr |
195 | | GPU (A100 40GB) | $0.000583/s | ~$2.10/hr |
196 | | GPU (H100) | $0.001097/s | ~$3.95/hr |
197 |
198 | _Special Credits:_ Startups up to $25k, Academics up to $10k free compute
199 |
200 | _Specs:_
201 |
202 | | Spec | Value |
203 | | ------------------ | --------------------------------------------- |
204 | | Isolation | gVisor (Google's container runtime) |
205 | | Cold Start | ~1s container boot, 2-5s typical with imports |
206 | | With Snapshots | <3s even with large dependencies |
207 | | Default Timeout | 5 minutes |
208 | | Max Timeout | 24 hours (use snapshots for longer) |
209 | | Idle Timeout | Configurable auto-termination |
210 | | Filesystem | Ephemeral (use Volumes for persistence) |
211 | | Network Default | Secure-by-default, no incoming connections |
212 | | Egress Control | `block_network=True` or `cidr_allowlist` |
213 | | Concurrent Scaling | 10,000+ sandboxes |
214 |
215 | _Volumes (Persistent Storage):_
216 |
217 | - High-performance distributed filesystem (up to 2.5 GB/s bandwidth)
218 | - Volumes v2 (beta): No file count limit, 1 TiB max file size, HIPAA-compliant deletion
219 | - Explicit `commit()` required to persist changes (see the sketch below)
220 | - Last-write-wins for concurrent modifications to same file
221 | - Best for model weights, checkpoints, and datasets
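
Because the explicit `commit()` requirement is easy to miss, a minimal sketch (volume and path names are hypothetical):

```python
import modal

app = modal.App("volume-demo")
vol = modal.Volume.from_name("model-cache", create_if_missing=True)

@app.function(volumes={"/cache": vol})
def save_checkpoint():
    with open("/cache/ckpt.txt", "w") as f:
        f.write("state")
    vol.commit()  # without this, writes are not visible to other containers
```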
222 |
223 | _Limitations:_
224 |
225 | - Cold start penalties when containers spin down (2-5s typical)
226 | - No on-premises deployment option
227 | - Sandboxes cannot access other Modal workspace resources by default
228 | - Single-language focus (Python-optimized, less suited for multi-language untrusted code)
229 | - Volumes require explicit reload to see changes from other containers
230 | - Less suited for persistent, long-lived environments vs microVM solutions
231 |
232 | _Modal vs E2B for AI Agents:_
233 |
234 | | Aspect | Modal | E2B |
235 | | ---------------- | ------------------------------- | ------------------------------ |
236 | | Isolation | gVisor containers | Firecracker microVMs |
237 | | Cold Start | 2-5s typical, <3s with snapshot | ~150ms |
238 | | Session Duration | Up to 24h (stateless) | Up to 24h (Pro), persistent |
239 | | Self-Hosting | No (managed only) | Experimental |
240 | | Multi-Language | Python-focused | Python, JS, Ruby, C++ |
241 | | Network Control | Granular egress policies | Allow/deny lists |
242 | | Best For | Python ML/AI, batch workloads | Multi-language agent sandboxes |
243 |
244 | _Links:_
245 |
246 | - Sandbox Product: https://modal.com/use-cases/sandboxes
247 | - Sandbox Docs: https://modal.com/docs/guide/sandboxes
248 | - Sandbox Networking: https://modal.com/docs/guide/sandbox-networking
249 | - API Reference: https://modal.com/docs/reference/modal.Sandbox
250 | - Safe Code Execution Example: https://modal.com/docs/examples/safe_code_execution
251 | - Coding Agent Example: https://modal.com/docs/examples/agent
252 | - Pricing: https://modal.com/pricing
253 | - Cold Start Guide: https://modal.com/docs/guide/cold-start
254 | - Volumes: https://modal.com/docs/guide/volumes
255 | - Security: https://modal.com/docs/guide/security
256 |
257 | ---
258 |
259 | ### Cloudflare Sandboxes
260 |
261 | - Open Beta (announced June 2025), still experimental
262 | - Edge-native (330+ global locations)
263 | - Pay for active CPU only (not provisioned resources)
264 | - Best if already in Cloudflare ecosystem
265 | - R2 bucket mounting via FUSE enables data persistence (added November 2025)
266 | - Git operations support (added August 2025)
267 | - Rich output: charts, tables, HTML, JSON, images
268 |
269 | _Philosophy:_ Cloudflare takes a security-first approach using a "bindings" model where code has zero network access by default and can only access external APIs through explicitly defined bindings. This eliminates entire classes of security vulnerabilities by making capabilities explicitly opt-in.
270 |
271 | _Unique Features:_
272 |
273 | - _Edge-native execution_: Run sandboxes in 330+ global locations
274 | - _Bindings model_: Zero network access by default; explicit opt-in for external APIs
275 | - _R2 FUSE mounting_: S3-compatible storage mounting for persistence (R2, S3, GCS, Backblaze B2, MinIO)
276 | - _Preview URLs_: Public URLs for exposing services from sandboxes
277 | - _`keepAlive: true`_: Option for indefinite runtime
278 |
279 | _Pricing (as of November 2025):_
280 |
281 | | Resource | Cost | Included (Workers Paid) |
282 | | -------------- | ---------------- | ----------------------- |
283 | | Base Plan | $5/month | - |
284 | | CPU | $0.000020/vCPU-s | 375 vCPU-minutes |
285 | | Memory | $0.0000025/GiB-s | 25 GiB-hours |
286 | | Disk | $0.00000007/GB-s | 200 GB-hours |
287 | | Network Egress | $0.025-$0.05/GB | Varies by region |
288 |
289 | _Instance Types (added October 2025):_
290 |
291 | | Type | vCPU | Memory | Disk |
292 | | ---------- | ---- | ------- | ----- |
293 | | lite | 1/16 | 256 MiB | 2 GB |
294 | | basic | 1/4 | 1 GiB | 4 GB |
295 | | standard-1 | 1 | 3 GiB | 5 GB |
296 | | standard-2 | 2 | 6 GiB | 10 GB |
297 | | standard-4 | 4 | 12 GiB | 20 GB |
298 |
299 | _Specs:_
300 |
301 | | Spec | Value |
302 | | -------------- | --------------------------------------- |
303 | | Isolation | Container |
304 | | Cold Start | 1-5 seconds |
305 | | Edge Locations | 330+ global |
306 | | Storage | Ephemeral; persistent via R2 FUSE mount |
307 | | Network | Bindings model (zero access by default) |
308 | | Max Memory | 400 GiB concurrent (account limit) |
309 | | Max CPU | 100 vCPU concurrent (account limit) |
310 | | Max Disk | 2 TB concurrent |
311 | | Image Storage | 50 GB per account |
312 |
313 | _Limitations:_
314 |
315 | - Cold starts 1-5 seconds (slower than Workers' milliseconds)
316 | - Binary network controls without bindings
317 | - Bucket mounting only works with `wrangler deploy`, not `wrangler dev`
318 | - SDK/container version must match
319 | - Sandbox ID case sensitivity issues with preview URLs
320 | - Still in open beta; ecosystem maturing
321 |
322 | _Links:_
323 |
324 | - Official: https://sandbox.cloudflare.com/
325 | - SDK Documentation: https://developers.cloudflare.com/sandbox/
326 | - Containers Pricing: https://developers.cloudflare.com/containers/pricing/
327 | - Container Limits: https://developers.cloudflare.com/containers/platform-details/limits/
328 | - Persistent Storage Tutorial: https://developers.cloudflare.com/sandbox/tutorials/persistent-storage/
329 | - GitHub SDK: https://github.com/cloudflare/sandbox-sdk
330 |
331 | ---
332 |
333 | ## Comparison Table
334 |
335 | | Feature | Sprites | E2B | Modal | Cloudflare |
336 | | ---------------- | ------------------- | ------------------- | -------------------- | ------------------ |
337 | | Setup | Easy | Very Easy | Easy | Easy |
338 | | Free Tier | $30 credit | $100 credit | $30/month | $5/mo Workers Paid |
339 | | Isolation | Firecracker microVM | Firecracker microVM | gVisor container | Container |
340 | | Cold Start | <1 second | ~150ms | 2-5s (or <3s w/snap) | 1-5 seconds |
341 | | Max Timeout | None (persistent) | 24 hours (Pro) | 24 hours | Configurable |
342 | | Claude CLI | Pre-installed | Prebuilt template | Manual | Manual |
343 | | Git Support | Yes | Yes | Yes | Yes |
344 | | Persistent Files | Yes (permanent) | 24 hours | Via Volumes | Via R2 FUSE mount |
345 | | Checkpoints | Yes (~300ms) | Pause/Resume (Beta) | Memory Snapshots | No |
346 | | Network Controls | Layer 3 policies | Allow/deny lists | CIDR allowlists | Bindings model |
347 | | Edge Locations | Fly.io regions | - | - | 330+ global |
348 | | Max Concurrent | 10 active (base) | Plan-based | 10,000+ | Plan-based |
349 | | Self-Hosting | Fly.io only | Experimental | No | No |
350 | | MCP Tools | - | 200+ (Docker) | - | - |
351 | | Best For | Long-running agents | AI agent loops | Python ML workloads | Edge apps |
352 |
353 | ---
354 |
355 | ## Other Options
356 |
357 | ### Daytona
358 |
359 | Founded by the creators of Codeanywhere (2009), Daytona pivoted in February 2025 from development environments to AI code execution infrastructure. 35,000+ GitHub stars (AGPL-3.0 license).
360 |
361 | _Key Features:_
362 |
363 | - Sub-90ms sandbox creation (container-based, faster than E2B's ~150ms microVM)
364 | - Python SDK (`daytona_sdk` on PyPI) and TypeScript SDK
365 | - Official LangChain integration (`langchain-daytona-data-analysis`)
366 | - MCP Server support for Claude/Anthropic integrations
367 | - OCI/Docker image compatibility
368 | - Built-in Git and LSP support
369 | - GPU support for ML workloads (enterprise tier)
370 | - Unlimited persistence (sandboxes can live forever via object storage archiving)
371 | - Virtual desktops (Linux, Windows, macOS with programmatic control)
372 |
373 | _Philosophy:_ Daytona believes AI will automate the majority of programming tasks. Their agent-agnostic architecture enables parallel sandboxed environments for testing solutions simultaneously without affecting the developer's primary workspace.
374 |
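A rough sketch of the Python SDK flow mentioned above. The `daytona_sdk` API has shifted across versions, so treat the class and method names here as assumptions to verify against the docs:

```python
from daytona_sdk import Daytona, CreateSandboxParams  # pip install daytona_sdk

daytona = Daytona()  # reads DAYTONA_API_KEY from the environment
sandbox = daytona.create(CreateSandboxParams(language="python"))

response = sandbox.process.code_run("print('hello from daytona')")
print(response.result)

daytona.remove(sandbox)  # clean up (method name varies by SDK version)
```
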
383 | _Pricing:_
384 |
385 | | Item | Cost |
386 | | --------------- | --------------------------------------------- |
387 | | Free Credits | $200 (requires credit card) |
388 | | Startup Program | Up to $50k in credits |
389 | | Small Sandbox | ~$0.067/hour (1 vCPU, 1 GiB RAM) |
390 | | Billing | Pay-per-second; stopped/archived minimal cost |
391 |
392 | _Specs:_
393 |
394 | | Spec | Default | Maximum |
395 | | ---------- | ----------- | -------------------- |
396 | | vCPU | 1 | 4 (contact for more) |
397 | | RAM        | 1 GiB       | 8 GiB                |
398 | | Disk       | 3 GiB       | 10 GiB               |
399 | | Auto-stop | 15 min idle | Disabled |
400 | | Cold start | ~90ms | - |
401 | | Isolation | Docker/OCI | Kata/Sysbox optional |
402 |
403 | _Network Egress Tiers:_
404 |
405 | - Tier 1 & 2: Restricted network access
406 | - Tier 3 & 4: Full internet with custom CIDR rules
407 | - All tiers whitelist essential services (NPM, PyPI, GitHub, Anthropic/OpenAI APIs, etc.)
408 |
409 | _Limitations:_
410 |
411 | - Container isolation by default (not microVM like E2B)
412 | - Cannot snapshot running sandboxes
413 | - Long-session stability still maturing
414 | - Young ecosystem compared to E2B
415 | - Requires credit card for free credits
416 |
417 | _Daytona vs E2B:_
418 |
419 | | Aspect | Daytona | E2B |
420 | | ---------------- | ------------------ | ------------------- |
421 | | Isolation | Docker containers | Firecracker microVM |
422 | | Cold start | ~90ms | ~150ms |
423 | | Free credits | $200 (CC required) | $100 (no CC) |
424 | | Max session | Unlimited | 24 hours (Pro) |
425 | | GitHub stars | 35k+ | 10k+ |
426 | | Network controls | Tier-based | Allow/deny lists |
427 |
428 | _Links:_
429 |
430 | - Official: https://www.daytona.io/
431 | - Documentation: https://www.daytona.io/docs/en/
432 | - GitHub: https://github.com/daytonaio/daytona
433 | - Python SDK: https://pypi.org/project/daytona_sdk/
434 | - Network Limits: https://www.daytona.io/docs/en/network-limits/
435 | - Sandbox Management: https://www.daytona.io/docs/en/sandbox-management/
436 | - LangChain Integration: https://docs.langchain.com/oss/python/integrations/tools/daytona_data_analysis
437 | - MCP Servers Guide: https://www.daytona.io/dotfiles/production-ready-mcp-servers-at-scale-with-claude-daytona
438 |
439 | ---
440 |
441 | ### Google Cloud Run
442 |
443 | Google Cloud's serverless container platform with strong security isolation, designed for production workloads at scale.
444 |
445 | _Key Features:_
446 |
447 | - Two-layer sandbox isolation (hardware + kernel)
448 | - Automatic scaling (including scale-to-zero)
449 | - Pay-per-second billing (100ms granularity)
450 | - NVIDIA L4 GPU support for AI inference (24 GB VRAM)
451 | - Direct VPC egress with firewall controls
452 | - Cloud Storage and NFS volume mounts for persistence
453 | - Request timeout up to 60 minutes (services), 7 days (jobs)
454 | - Up to 1000 concurrent requests per instance
455 | - Built-in HTTPS, IAM, Secret Manager integration
456 | - Source-based deployment (no Dockerfile required)
457 |
458 | _Philosophy:_ Google's approach treats Cloud Run as the "easy button" for serverless containers. Unlike dedicated AI sandbox providers, Cloud Run is a general-purpose platform that happens to work well for AI agents. The security model provides defense-in-depth through gVisor (1st gen) or Linux microVMs (2nd gen), with seccomp filtering in both. For AI-specific workloads, Google offers Agent Engine (fully managed) and GKE Agent Sandbox (Kubernetes-native) as alternatives.
459 |
460 | _Unique Features:_
461 |
462 | - _Dual execution environments_: 1st gen (gVisor-based, smaller attack surface) or 2nd gen (Linux microVM, more compatibility)
463 | - _GPU scale-to-zero_: L4 GPUs spin down when idle, eliminating GPU idle costs
464 | - _Startup CPU Boost_: Temporarily increases CPU during cold start (up to 50% faster startups for Java)
465 | - _VPC Flow Logs_: Full visibility into network traffic for compliance
466 | - _Network tags_: Granular firewall rules via VPC network tags on revisions
467 | - _Volume mounts_: Cloud Storage FUSE or NFS (Cloud Filestore) for persistent data
468 |
469 | _Pricing (Tier 1 regions, e.g., us-central1):_
470 |
471 | | Resource | On-Demand | Free Tier (Monthly) |
472 | | -------- | --------------------- | ------------------------------------ |
473 | | CPU | $0.000024/vCPU-second | 180,000 vCPU-seconds (~50 hrs) |
474 | | Memory | $0.0000025/GiB-second | 360,000 GiB-seconds (~100 hrs @ 1GB) |
475 | | Requests | $0.40/million | 2 million |
476 | | GPU (L4) | ~$0.67/hour | None |
477 |
478 | - Always-on billing is ~30% cheaper than on-demand
479 | - Tier 2 regions (Asia, South America) are ~40% more expensive
480 | - GPU requires minimum 4 vCPU + 16 GiB memory
481 | - New customers: $300 free credits for 90 days
482 |
483 | _Specs:_
484 |
485 | | Spec | Value |
486 | | --------------- | ---------------------------------------------------------------------- |
487 | | Isolation | Two-layer: hardware (x86 virtualization) + kernel (gVisor/microVM) |
488 | | Cold Start | 2-5 seconds typical; sub-second with Startup CPU Boost + min instances |
489 | | Max Timeout | 60 minutes (services), 168 hours/7 days (jobs) |
490 | | Max Memory | 32 GiB |
491 | | Max CPU | 8 vCPU (or 4 vCPU with GPU) |
492 | | Max Concurrency | 1000 requests/instance (default 80) |
493 | | Max Instances | 1000 (configurable) |
494 | | GPU | NVIDIA L4 (24 GB VRAM), 1 per instance, <5s startup |
495 | | Storage | Ephemeral; use Cloud Storage or NFS mounts for persistence |
496 |
497 | _Network/Egress Controls:_
498 |
499 | - Direct VPC egress without Serverless VPC Access connector
500 | - Network tags on service revisions for firewall rules
501 | - VPC Service Controls for data exfiltration prevention
502 | - Organization policies to enforce VPC-only egress
503 | - Cloud NAT supported for outbound IP control
504 | - VPC Flow Logs for traffic visibility
505 |
506 | _Limitations:_
507 |
508 | - No persistent local disk (must use Cloud Storage or NFS volume mounts)
509 | - Cold start latency higher than E2B/Sprites (2-5s vs <1s) without pre-warming
510 | - Setup complexity: Requires GCP project, billing, IAM configuration
511 | - VPC complexity: Network egress controls require VPC setup
512 | - Job connection breaks: Jobs >1 hour may experience connection breaks during maintenance
513 | - GPU regions limited: L4 GPUs only available in select regions
514 | - No pre-built AI agent template (unlike E2B)
515 | - Memory/session management must be built manually
516 |
517 | _Google Cloud AI Agent Options Comparison:_
518 |
519 | | Criteria | Cloud Run | Agent Engine | GKE Agent Sandbox |
520 | | ---------------------- | ---------------------------- | ---------------- | ----------------------- |
521 | | Setup Complexity | Medium | Low | High |
522 | | Infrastructure Control | Medium | Low | High |
523 | | Memory/Session Mgmt | Manual | Built-in | Manual |
524 | | Isolation | gVisor/microVM | Built-in sandbox | gVisor + Kata |
525 | | Cold Start | 2-5s (sub-second w/pre-warm) | Sub-second | Sub-second (warm pools) |
526 | | Best For | Flexible serverless | Fastest to prod | Enterprise scale |
527 |
528 | _SDK/API Options:_
529 |
530 | - Python: `pip install google-cloud-run` (see the sketch after this list)
531 | - Node.js: `npm install @google-cloud/run`
532 | - Go, Java, .NET, Ruby, PHP, Rust client libraries available
533 | - REST API and gcloud CLI
534 | - Terraform provider for IaC
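
A small example of the Python client; the project and region are hypothetical placeholders, and listing services is just a representative call:

```python
from google.cloud import run_v2  # pip install google-cloud-run

# List Cloud Run services in one region of a project.
client = run_v2.ServicesClient()
parent = "projects/YOUR_PROJECT/locations/us-central1"
for service in client.list_services(parent=parent):
    print(service.name, service.uri)
```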
535 |
536 | _Links:_
537 |
538 | - Documentation: https://cloud.google.com/run/docs
539 | - AI Agents Guide: https://cloud.google.com/run/docs/ai-agents
540 | - Pricing: https://cloud.google.com/run/pricing
541 | - Security Design: https://cloud.google.com/run/docs/securing/security
542 | - Quotas & Limits: https://cloud.google.com/run/quotas
543 | - GPU Support: https://cloud.google.com/run/docs/configuring/services/gpu
544 | - VPC Egress: https://cloud.google.com/run/docs/configuring/vpc-direct-vpc
545 | - Volume Mounts: https://cloud.google.com/run/docs/configuring/services/cloud-storage-volume-mounts
546 | - Python SDK: https://pypi.org/project/google-cloud-run/
547 | - Quickstarts: https://cloud.google.com/run/docs/quickstarts
548 |
549 | ---
550 |
551 | ### Replit
552 |
553 | Full development environment with built-in LLM agent (Agent 3).
554 |
555 | _Key Features:_
556 |
557 | - Agent 3 can run autonomously for up to 200 minutes without supervision
558 | - Self-testing loop: executes code, identifies errors, fixes, and reruns until tests pass
559 | - Proprietary testing system claimed to be 3x faster and 10x more cost-effective than Computer Use models
560 | - Can build other agents and automations from natural language descriptions
561 | - Built on 10+ years of infrastructure investment (custom file system, VM orchestration)
562 |
563 | _Philosophy:_ Replit positions itself as an "agent-first" platform focused on eliminating "accidental complexity" (per CEO Amjad Masad). Target audience: anyone who wants to build software, not just engineers. The goal is to make Agent the primary interface for software creation.
564 |
565 | _Unique Features:_
566 |
567 | - _Agent 3 autonomy_: Up to 200 minutes of autonomous execution
568 | - _Self-testing loop_: Automatic error detection and fixing
569 | - _Agent building_: Can create other agents from natural language
570 | - _Full IDE integration_: Complete development environment, not just sandbox
571 | - _MCP support_: Integration guide available
572 |
573 | _Pricing:_
574 |
575 | | Plan | Monthly | Agent Access | Compute | Storage |
576 | | ---------- | -------- | ------------------- | ----------------- | ------- |
577 | | Starter | Free | Limited (daily cap) | 1 vCPU, 2 GiB RAM | 2 GiB |
578 | | Core | $25 | Full Agent 3 | 4 vCPU, 8 GiB RAM | 50 GiB |
579 | | Teams | $40/user | Full Agent 3 + RBAC | 4 vCPU, 8 GiB RAM | 50 GiB |
580 | | Enterprise | Custom | Full + SSO/SAML | Custom | Custom |
581 |
582 | - Core plan includes $25/month in AI credits
583 | - Teams plan includes $40/month in credits + 50 viewer seats
584 | - Annual billing: ~20% discount
585 |
586 | _Specs:_
587 |
588 | | Spec | Starter | Core/Teams |
589 | | ---------- | ------------ | ------------- |
590 | | vCPU | 1 | 4 |
591 | | RAM | 2 GiB | 8 GiB |
592 | | Storage | 2 GiB | 50 GiB |
593 | | Agent Time | Daily limits | Up to 200 min |
594 |
595 | _Limitations:_
596 |
597 | - **No public API for programmatic Agent access** — designed exclusively for in-browser interactive use, not for CI/CD pipelines or external autonomous agent orchestration
598 | - Agent frequently gets stuck in loops on simple tasks
599 | - Over-autonomy risk (can override user intent)
600 | - External API authentication problems reported
601 | - Unpredictable credit consumption ($100-300/month reported overages)
602 | - Over 60% of developers report agent stalls/errors regularly (per surveys)
603 | - Notable July 2025 incident where Agent deleted a production database
604 |
605 | _Links:_
606 |
607 | - Official: https://replit.com/
608 | - Pricing: https://replit.com/pricing
609 | - Agent 3 Announcement: https://blog.replit.com/introducing-agent-3-our-most-autonomous-agent-yet
610 | - 2025 Year in Review: https://blog.replit.com/2025-replit-in-review
611 | - AI Billing Docs: https://docs.replit.com/billing/ai-billing
612 | - MCP Integration Guide: https://docs.replit.com/tutorials/mcp-in-3
613 | - Fast Mode Docs: https://docs.replit.com/replitai/fast-mode
614 |
615 | ---
616 |
617 | ## Local Docker Options
618 |
619 | ### Docker Official Sandboxes
620 |
621 | _Quick Start:_
622 |
623 | ```bash
624 | docker sandbox run claude # Basic
625 | docker sandbox run -w ~/my-project claude # Custom workspace
626 | docker sandbox run claude "your task" # With prompt
627 | docker sandbox run claude -c # Continue last session
628 | ```
629 |
630 | _Key Details:_
631 |
632 | - Credentials stored in persistent volume `docker-claude-sandbox-data`
633 | - `--dangerously-skip-permissions` enabled by default
634 | - Base image includes: Node.js, Python 3, Go, Git, Docker CLI, GitHub CLI, ripgrep, jq
635 | - Container persists in background; re-running reuses same container
636 | - Non-root user with sudo access
637 |
638 | _Links:_ https://docs.docker.com/ai/sandboxes/claude-code/
639 |
640 | ---
641 |
642 | ## Comparison: E2B vs Docker Local
643 |
644 | | Aspect | E2B (Cloud) | Docker Local |
645 | | ----------------- | ------------------- | ---------------------- |
646 | | Setup | SDK call | `docker sandbox run` |
647 | | Isolation | Firecracker microVM | Container |
648 | | Cost | ~$0.05/hr | Free (your hardware) |
649 | | Max Duration | 24 hours | Unlimited |
650 | | Network | Full internet | Full internet |
651 | | State Persistence | Session-based | Volume-based |
652 | | Multi-tenant Safe | Yes | No (local only) |
653 | | Best For | Production, CI/CD | Local dev, prototyping |
654 |
655 | ---
656 |
657 | ## Recommendation for This Project
658 |
659 | ### For Production/Multi-tenant: Use E2B
660 |
661 | 1. Pre-built Claude Code template = zero setup friction
662 | 2. 24-hour sessions handle long-running autonomous agents
663 | 3. Full filesystem for progress.txt, prd.json, git repos
664 | 4. Proven in production (Lovable, Quora use it)
665 | 5. True isolation (Firecracker microVM)
666 | 6. 200+ MCP tools via Docker partnership
667 |
668 | ### For Long-Running Persistent Agents: Use Sprites
669 |
670 | 1. No session time limits (persistent environments)
671 | 2. Transactional snapshots for version control of entire OS
672 | 3. Auto-sleep when idle reduces costs
673 | 4. Pre-installed Claude Code and AI CLI tools
674 | 5. Best for agents that need to maintain state across days/weeks
675 |
676 | ### For Local Development: Use Docker Sandboxes
677 |
678 | 1. _Quick prototyping_: `docker sandbox run claude`
679 | 2. _With git automation_: `claude-sandbox` (TextCortex)
680 | 3. _Minimal setup_: Uses persistent credentials volume
681 | 4. Free - runs on your own hardware
682 | 5. Unlimited session duration
683 |
```