The Harness

A language model, on its own, does one thing. Give it some text and it predicts what comes next. That's the whole act. It has no memory of yesterday, no way to read a file, no hands. Leave it alone and it answers once, then goes quiet.

Everything that makes a model feel like an agent lives outside the model. The remembering, the tool use, the twenty minutes of work that comes back finished: none of that is the model itself. It's the harness, the runtime around the model that turns a thing predicting tokens into a system that gets work done.

We spent a year arguing about which model is smartest. Build anything real and you learn a quieter truth. The harness matters more than the model inside it. The model is the part you download. The harness is the part you build.

Loop. The spine, repeated every turn: assemble context → call the model → run the tool it asked for → fold the result back → repeat, until the model answers with text instead of a tool call. The prompt enters once at the top-left; Done exits once.

The loop is the spine

A single model call is a sentence. A loop is a paragraph that knows where it's going.

The pattern is almost embarrassingly simple. Call the model, read what it asked for, run the tool, feed the result back, and ask again, until it says it's done. That cycle is the whole difference between a chatbot and an agent. Most agents that work in production are some version of this loop, kept honest.

The art isn't in the loop's shape. It's in knowing when to stop, when to retry, and how to keep each pass from poisoning the next.

Tools are the hands

A model can say "I'll check the database." It can't check anything. The harness is what turns that intention into a real query and brings the rows back.

Tools are how a model reaches the world: shell, files, search, an HTTP call, an MCP server. The model proposes and the harness disposes. The unglamorous half of tool-building is the part nobody demos: validating the model's arguments, handling the call that times out, returning an error the model can actually recover from. A tool that fails loudly teaches the model to try again. A tool that fails silently teaches it nothing.

Context is the real bottleneck

Benchmarks don't tell you this part. At small scale, the model's intelligence is the ceiling. At real scale, the context is.

Every pass through the loop adds to a finite window: the system prompt, the history, the last tool's output, all of it. Run long enough and you hit the wall. So the harness has to manage memory, dull work and decisive work at the same time. It prunes what's stale. It summarises what's bulky. It compacts the conversation mid-task without losing the thread, and hands off cleanly when one window ends and the next begins.

In practice this looks mundane. We keep the last couple dozen turns in full and a rolling summary of everything older. When a customer comes back hours later we don't replay the thread, we stamp the gap instead, a quiet "three hours later" the model can read, so it feels the time pass without paying to carry the history.

At a handful of requests, you tune prompts. At billions of tokens a month, you tune memory.

Managing memory isn't only about what you keep. It's about where you put it. The model has no memory of yesterday. Every call, it reads the whole conversation fresh, top to bottom, so where a sentence sits is when it happened. Hand it a price in the middle of the conversation and it narrates the lookup: "let me just pull that up...". Move the same sentence up into the system prompt, the part it reads as what was already true before you spoke, and now it just says the price, flat, the way you recite your own phone number. Same fact, same model. The only thing that moved was where in the story the sentence sat. Everyone knows position changes what a model pays attention to. Fewer notice it changes what the model thinks it is.

When I think about what actually breaks as a system grows, and Easy AI now runs tens of billions of tokens a month, it's almost never that the model got dumber. It's that the context got messy. Reliability at scale is a memory problem wearing a model's clothes.

Control is what lets you trust it

An agent with real tools is an agent that can do real damage. The same call that reads a row can delete one.

So the last piece of the harness is restraint: permissions, sandboxes, a human asked before anything irreversible, hard limits that stop a runaway before it runs. This is the part that feels like friction right up until the moment it saves you. You don't give an agent the keys to production and hope. You give it a gate, and you decide what passes through.

One limit we learned the hard way. When a customer fires off three messages in a row, the loop drops its half-written reply and starts over on the combined text. But the instant a tool with a real side effect runs, like placing an order, that restart stays locked for the rest of the turn. The first version didn't have the lock, and one quick double-message became two orders.

The harness also keeps the record of all of it: what the model saw, what it tried, what it was allowed to do, what failed, and what was worth carrying forward. Not for tidiness. An agent you can't replay is an agent you can't trust, because you can't tell a real success from a lucky one.

A good harness lifts the model

There's a flip side to all this, and it's the bet we made at Easy AI early, back when the leaderboards were the whole conversation. A good harness doesn't just survive a weaker model. It lifts one.

Give a small, cheap model a tight loop, clean context, and tools that fail loudly, and it'll do work you'd have sworn needed a frontier model. We run open models a fraction the size of the famous ones, and they hold. Not because they're secretly brilliant. Because the harness carries the reliability they can't. Where the customer actually stands, a small model in a good harness beats a frontier model in a careless one, and it isn't close.

There's a name for part of this now: routing, the cheap model for the easy turns and the frontier one for the rest. We do it too, and it works. But routing is what you collect after the harness has already made size matter less. The scaffolding quietly does the work people keep waiting on more parameters to do.

A frontier model is a better guess. A good harness is why you stop needing one.

Why the harness outlasts the model

Models will keep changing. You'll swap the one underneath this paragraph for a better one within months, and again after that. What you won't throw away is the loop you tuned, the tools you hardened, the memory strategy you earned the hard way, the gates you placed with care.

The intelligence arrives in a download. The reliability you build by hand, in the scaffolding around the model, in all the small decisions nobody sees. That's what makes people trust the system at nine in the morning, when it matters.

The model is the part everyone talks about. The harness is the part that ships.