Built to Be Replaced

We have rebuilt the harness more times than we have rebuilt anything else we own. Each time it felt like an admission of failure, like we'd gotten the foundation wrong and were finally paying for it. It took a few rounds to see it the other way around. The rebuilds weren't the tax on a bad design. They were the design working.

A harness is never finished, because the thing it wraps keeps changing underneath it. What you throw away each time, and what you refuse to, turns out to be the clearest map you have of what was ever real.

A harness is a pile of assumptions about what the model can't do

Every harness is built around the model's weaknesses on the day you wrote it. The model returns broken structure, so you wrap it in a wall of validation. It forgets the format halfway through, so you re-state the format every turn. It hallucinates a tool's arguments, so you add a guardrail. It runs out of room, so you write a compaction strategy to throw old context away. None of this is wrong. All of it is scaffolding poured against a specific, dated weakness.

Then a better model arrives, and the weakness is gone, and the scaffolding you were proudest of is now the thing standing between you and the upgrade. The validation wall, against a model that returns clean structure, is just a wall. The guardrail starts blocking a capability the model has since learned to handle. The compaction strategy is throwing away context the new model was big enough to use. The crutch outlives the limp, and a crutch nobody needs is just a thing to trip on.

The logic you're proudest of is usually the first thing the next model makes you delete.

The gains came from removing, not adding

The instinct, when an agent is unreliable, is to add. One more check, one more fallback, one more clever layer. For a while we did that, and the harness grew dense and defensive and slow to change.

The rebuilds that actually moved things were the opposite. They were subtractions. We pulled out elaborate, hand-written tool definitions and handed the model a plainer, more general way to act. We tore out coordination machinery and replaced it with a simple, honest handoff. Almost every rebuild that helped, the diff was mostly red. This isn't only our experience; the teams moving fastest in public keep reporting the same shape, their biggest gains coming from deleting structure rather than piling more on. As the model gets stronger, the job stops being "build more scaffolding" and becomes "get out of its way."

Every rebuild that helped, the diff was mostly red.

Some rebuilds you earn at two in the morning

Not every rebuild comes from the model getting smarter. Some come from a page in the dark.

When a customer fires off three messages in a row, our loop drops its half-written reply and starts over on the combined text. It's a clean idea and it reads beautifully: don't answer the first message when the next two change the question. It worked until the day it didn't. A customer double-messaged in the half-second after the loop had already placed an order. The restart fired, the combined text ran from the top, and it placed the order again. One quick double-message, two orders.

The fix was not a patch on the restart. It was a new thing the whole loop now believes: the moment a tool with a real side effect runs, the turn is committed, and nothing restarts it. That belief didn't exist in the old harness and couldn't be bolted on as a special case. We rebuilt the loop around it. The incident didn't cost us a bug. It bought us a principle.

What survives every rebuild was never about the model

Here is the pattern it took all the rebuilds to notice. Lay the old harness next to the new one and the churn is never spread evenly. The parsing, the prompt gymnastics, the per-turn reminders, the special cases, all of it gets gutted and rewritten. But the spine never moves. The loop. The gates that ask a human before anything irreversible. The memory strategy. The log of what happened and why.

Those parts survive because they were never bets on the model. They're what we learned about the customer, the work, and the cost of being wrong, and none of it expires when the weights do. The half of the harness you keep rebuilding was always the half quietly assuming the model would stay the same. The half you never touch was never about the model at all.

Each rebuild deletes what we assumed about the model. It never touches what we learned about the customer.

Build it to be replaced

So we stopped trying to build the harness that lasts. That was the wrong target, and chasing it only made the scaffolding more elaborate and more expensive to tear out. A harness written this year is a thing of this year. It is fit to today's model, and today's model is temporary, so the harness is temporary too, and pretending otherwise just raises the cost of the rebuild you already know is coming.

The better target is the opposite of permanence. Keep the seams loose. Keep the model-specific cleverness in places you can reach with one hand, so that when the next model lands you can pull the old assumptions out in an afternoon instead of a quarter. The teams that move fastest aren't the ones who guessed the durable architecture on the first try. Nobody guesses that. They're the ones who made the next rebuild cheap, because they were honest that it was coming.

The model arrives finished and is gone in a season. The harness is never finished, and that is the point.