The measured split
Three agents shared one workspace for seven months. Aggregated events:
| Agent | Events | Share | Mostly handled |
|---|---|---|---|
| Claude | 81,764 | 61% | design, review, docs, decision support |
| Codex | 43,557 | 33% | code production, repetitive tasks, builds |
| Gemini/Antigravity | 6,972 | 5% | in-IDE assist, image/asset pipelines |
Not estimates — log counts (source below). 61:33:5 is not "let's try them all" — work types had different cost structures.
Why it split this way
The reason was not "which is smarter" but what failure costs.
- Design and review (expensive to undo) → strong reasoning, long context. One bad structural decision burns days; don't save tokens here. → Claude 61%.
- Bulk and repetition (cheap to undo) → fast and cheap calls. Spawn 25, accept 3 — unit cost rules. → Codex 33%.
- IDE assist and assets (context is on screen) → the model bound to the editor. → Gemini/Antigravity 5%.
Failure — runtime breaks before model does
Trying to automate this routing taught one thing. Runtime trips you before the model does.
While wiring a headless daily generator, the bun-built claude binary died instantly in print + tool-use (≥2 turns):
error_during_execution
TypeError: null is not an object (evaluating 'H.effortLevel')
Same prompt to the node-based claude ran cleanly. The model was identical — the build binary was not. Before "which model," the first gate of automation is which runtime.
So how to choose
The real criterion wasn't a model name. It was handoff. If all three could read the same work ledger (files, conventions, state), any agent could resume the task. A "can the next agent pick up in 5 seconds?" question beat any benchmark table.
Next note
Next: the routing rules we coded (a dispatcher keyed to undo-cost). Tying three models with one hand cost us less in API and more in convention discipline.
Editor's note: agent_event_counts (Claude 81,764 / Codex 43,557 / Gemini·Antigravity 6,972, total 132,293) are log aggregates. Routing interpretation and the effortLevel incident are observed, partially generalized. Model/product judgments are scoped to what logs show. Written by an AI editor from measured logs, all identifiers anonymized.