Mixing Claude, Codex and Gemini in one workspace — what 132K events revealed

The measured split

Three agents shared one workspace for seven months. Aggregated events:

Agent	Events	Share	Mostly handled
Claude	81,764	61%	design, review, docs, decision support
Codex	43,557	33%	code production, repetitive tasks, builds
Gemini/Antigravity	6,972	5%	in-IDE assist, image/asset pipelines

These are log counts, not estimates (source in the editor's note at the end). The roughly 61:33:5 split did not come from "let's try them all"; it came from work types having different cost structures.
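As a sanity check, the shares can be recomputed from the raw counts. This is a minimal sketch: the names EVENTS and shares are illustrative and are not taken from the logging tooling. To one decimal place the split comes out as 61.8 / 32.9 / 5.3, which the table reports as whole percentages.

```python
# Event counts exactly as reported in the table.
EVENTS = {
    "Claude": 81_764,
    "Codex": 43_557,
    "Gemini/Antigravity": 6_972,
}


def shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each agent's share of all events as a percentage (one decimal)."""
    total = sum(counts.values())
    return {agent: round(100 * n / total, 1) for agent, n in counts.items()}


TOTAL = sum(EVENTS.values())
SHARES = shares(EVENTS)

if __name__ == "__main__":
    print(f"total events: {TOTAL:,}")
    for agent, pct in SHARES.items():
        print(f"{agent:<20}{pct:>5.1f}%")
```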

Why it split this way

The reason was not "which is smarter" but what failure costs.

Design and review (expensive to undo) → strong reasoning, long context. One bad structural decision burns days; don't save tokens here. → Claude 61%.
Bulk and repetition (cheap to undo) → fast and cheap calls. Spawn 25 attempts, accept 3; unit cost decides. → Codex 33%.
IDE assist and assets (context is on screen) → the model bound to the editor. → Gemini/Antigravity 5%.
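The three rules above can be read as a small routing function. What follows is a hypothetical sketch, not the dispatcher used in this workspace: the Task fields, the UndoCost enum, the agent labels as strings, and the order in which the rules are checked are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class UndoCost(Enum):
    HIGH = "high"  # a wrong result costs days (design, review)
    LOW = "low"    # a wrong result is simply discarded (bulk, repetition)


@dataclass(frozen=True)
class Task:
    name: str
    undo_cost: UndoCost
    needs_editor_context: bool = False  # the relevant context is on screen in the IDE


def route(task: Task) -> str:
    """Pick an agent label from the cost of undoing a bad result."""
    # Assumption: editor-bound work is checked first, since the model
    # attached to the editor already holds the context.
    if task.needs_editor_context:
        return "gemini"
    if task.undo_cost is UndoCost.HIGH:
        return "claude"
    return "codex"


if __name__ == "__main__":
    for t in (
        Task("schema redesign", UndoCost.HIGH),
        Task("rename 200 call sites", UndoCost.LOW),
        Task("export sprite sheet", UndoCost.LOW, needs_editor_context=True),
    ):
        print(f"{t.name} -> {route(t)}")
```

Keeping the routing key to one or two fields makes the rule easy to audit: when a task lands on the wrong agent, the question is whether its undo cost was misjudged, not which model is smarter.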

Failure: the runtime breaks before the model does

Trying to automate this routing taught one thing: the runtime trips you before the model does.

While wiring a headless daily generator, the bun-built claude binary died immediately when run in print mode with tool use (two or more turns):

error_during_execution
TypeError: null is not an object (evaluating 'H.effortLevel')

The same prompt ran cleanly on the node-based claude. The model was identical; the binary was not. Before asking "which model," the first gate of automation is which runtime.
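One way to make "which runtime" an explicit gate is to smoke-test each candidate binary before the scheduled job depends on it. The sketch below is an assumption, not the check used here: runtime_is_usable and first_usable are hypothetical helpers, and the probe commands are supplied by the caller, so no particular CLI flags are implied. A useful probe should exercise the same mode that failed (headless, tool use, two or more turns), because a trivial one-turn probe would not have caught this incident.

```python
import subprocess
import sys
from typing import Optional

# Marker taken from the failure output quoted above.
FAILURE_MARKER = "error_during_execution"


def runtime_is_usable(cmd: list[str], timeout_s: float = 60.0) -> bool:
    """Run a probe command; pass only on exit code 0 with no failure marker."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except (OSError, subprocess.TimeoutExpired):
        # Binary missing, not executable, or hung: treat as unusable.
        return False
    output = result.stdout + result.stderr
    return result.returncode == 0 and FAILURE_MARKER not in output


def first_usable(candidates: dict[str, list[str]]) -> Optional[str]:
    """Return the name of the first candidate whose probe passes, else None."""
    for name, cmd in candidates.items():
        if runtime_is_usable(cmd):
            return name
    return None


if __name__ == "__main__":
    # Stand-in probes using the Python interpreter; real use would pass
    # the actual agent CLI invocations for each build.
    candidates = {
        "broken-build": [
            sys.executable, "-c",
            "import sys; print('error_during_execution'); sys.exit(1)",
        ],
        "working-build": [sys.executable, "-c", "print('ok')"],
    }
    print(first_usable(candidates))
```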

So how to choose

The real criterion wasn't a model name. It was handoff. If all three could read the same work ledger (files, conventions, state), any agent could resume the task. Asking "can the next agent pick up in 5 seconds?" told us more than any benchmark table.
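A shared ledger can be as simple as an append-only file that every agent reads before starting and writes before stopping. This is a minimal sketch under assumptions: the JSON Lines format, the LedgerEntry fields, and the helper names are hypothetical and are not the conventions used in this workspace.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional


@dataclass
class LedgerEntry:
    task_id: str
    agent: str      # who wrote this entry
    status: str     # e.g. "in_progress", "blocked", "done"
    next_step: str  # what the next agent should do first


def append_entry(ledger: Path, entry: LedgerEntry) -> None:
    """Append one entry as a single JSON line; existing lines are never rewritten."""
    with ledger.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(entry)) + "\n")


def latest_state(ledger: Path, task_id: str) -> Optional[LedgerEntry]:
    """Return the most recent entry for a task, or None if there is none."""
    if not ledger.exists():
        return None
    latest = None
    with ledger.open(encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            record = json.loads(line)
            if record["task_id"] == task_id:
                latest = LedgerEntry(**record)
    return latest
```

The next_step field is the part that serves the 5-second test: an agent resuming the task reads one line and knows where to start, without replaying the previous agent's session.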

Next note

Next: the routing rules we coded (a dispatcher keyed to undo-cost). Tying three models together with one hand cost us less in API spend and more in convention discipline.

Editor's note: agent_event_counts (Claude 81,764; Codex 43,557; Gemini/Antigravity 6,972; total 132,293) are log aggregates. The routing interpretation and the effortLevel incident are observed and partially generalized. Model and product judgments are scoped to what the logs show. Written by an AI editor from measured logs, with all identifiers anonymized.