The Economics of a Token

At one request a day, a token costs nothing worth counting. At tens of billions a month, it is the line on the invoice that quietly decides whether there is a business behind the product at all.

And a request is never one token, or even one call. A single thing a customer asks for fans out inside the loop into a handful of model calls, sometimes a dozen, each one reasoning, reaching for a tool, reading the result, checking its own work. Every one of those calls re-reads the conversation from the top, because the model has no memory and the whole story gets handed back each time. You are not paying per message. You are paying per message, times every step it takes to answer, times every word you chose to carry along for the ride.
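
To put rough numbers on that multiplication, here is a minimal sketch. It assumes every call re-sends the whole context so far and that each step adds a fixed number of tokens. The function name and every figure in it are illustrative, not measurements from our system.

```python
def billed_input_tokens(base_context: int, steps: int, added_per_step: int) -> int:
    """Input tokens billed for one request that fans out into `steps` calls.

    Assumption: each call re-reads everything accumulated so far.
    """
    total = 0
    context = base_context
    for _ in range(steps):
        total += context            # the whole story is handed back on every call
        context += added_per_step   # reasoning, tool output and checks pile up
    return total


# Hypothetical figures: a 2,000-token starting context, 500 tokens added per step.
one_call = billed_input_tokens(base_context=2_000, steps=1, added_per_step=500)
ten_calls = billed_input_tokens(base_context=2_000, steps=10, added_per_step=500)
print(one_call, ten_calls)  # 2000 42500
```

Under these made-up numbers, ten steps cost about twenty-one times what one step does, not ten times, because the context keeps growing while it is being re-read.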

A token is rent, not a purchase

This is the part the pricing page hides. A token looks like something you buy: a price, a quantity, a line item. But the model forgets everything between turns, so the words you keep aren't bought once and owned. They are re-read on the next pass, and the one after that, billed again every time. You are not buying intelligence by the token. You are renting attention, and the rent comes due every loop.

Once you see it as rent, the question stops being "how cheap is a token" and becomes "what is this word earning, every single turn it stays in the window." Most words, asked that way, don't make rent.
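
One way to ask that question of a real context is to rank what is in the window by the rent it pays: its size times the number of calls it is re-sent on. The sketch below is an assumption about how such a ledger might look; `ContextItem` and the sample entries are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    name: str
    tokens: int
    calls_kept: int  # how many model calls this item is re-sent on

    @property
    def rent(self) -> int:
        """Token-calls billed for keeping this item in the window."""
        return self.tokens * self.calls_kept


# Made-up entries for illustration only.
items = [
    ContextItem("system prompt", 800, 12),
    ContextItem("stale tool output", 3_000, 9),
    ContextItem("latest user message", 60, 3),
]

by_rent = sorted(items, key=lambda item: item.rent, reverse=True)
for item in by_rent:
    print(f"{item.name}: {item.rent} token-calls")
```

In this invented example the stale tool output pays the most rent by a wide margin, which makes it the first thing to ask "what is this earning?" about.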

Cheaper tokens made us spend more

The price of a token has been falling like a stone, roughly tenfold a year for as long as anyone's been counting. The natural assumption is that the bill falls with it. It doesn't. Ours went up.

A cheaper token doesn't shrink what you spend; it lowers the bar for what you'll attempt. When a call is nearly free you add one more verification step, one more retry, one more pass of reasoning you'd never have paid for a year ago. The floor drops and the building gets taller. The savings the price cut promised get spent, instantly and invisibly, on ambition you didn't used to be able to afford.

So the discipline matters more as tokens get cheaper, not less. The cheaper they are, the faster they accumulate when nobody is watching, and the easier it is to mistake a falling unit price for a falling bill.
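
The arithmetic behind that mistake is short. In the sketch below the tenfold price drop is the rough figure quoted above; the token counts and task volume are invented purely to show the shape of the effect.

```python
def monthly_bill(price_per_million: float, tokens_per_task: int, tasks: int) -> float:
    """Bill = unit price x tokens per task x number of tasks."""
    return price_per_million * tokens_per_task * tasks / 1_000_000


# Hypothetical: the price falls tenfold, but each task now spends fifteen
# times the tokens on extra verification, retries and reasoning passes.
last_year = monthly_bill(price_per_million=10.0, tokens_per_task=4_000, tasks=1_000_000)
this_year = monthly_bill(price_per_million=1.0, tokens_per_task=60_000, tasks=1_000_000)
print(last_year, this_year)  # 40000.0 60000.0
```

The unit price fell by ninety percent and the bill still rose by half, because tokens per task grew faster than the price fell.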

A cheaper token never shrank the bill. It only raised the ceiling on what we'd try.

The cheapest token is the one you didn't send

If you can't win on price, you win on volume, by sending fewer tokens. This is where memory stops being a feature and becomes a budget.

We keep the last couple dozen turns in full and a rolling summary of everything older, because the recent turns earn their place and the old ones rarely do. When a customer comes back hours later, we don't replay the thread; we stamp the gap with a quiet "three hours later" that the model can read. Replaying the full history was a real cost, and it bought an outcome we could get for a fraction of the tokens, so we stopped paying it. None of this is clever. It is just refusing to send words that aren't earning their rent.
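
A minimal sketch of that policy might look like the following. It is not our production code: `Turn`, `build_context`, the 24-turn window and the one-hour gap threshold are all assumptions made for illustration, and the rolling summary is taken as an input rather than generated here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Turn:
    role: str
    text: str
    at: datetime


def describe_gap(gap: timedelta) -> str:
    """Render a pause as a short stamp the model can read."""
    hours = int(gap.total_seconds() // 3600)
    if hours >= 1:
        return f"[{hours} hour{'s' if hours != 1 else ''} later]"
    return f"[{int(gap.total_seconds() // 60)} minutes later]"


def build_context(summary, turns, now, keep_last=24, gap_threshold=timedelta(hours=1)):
    """Summary of older turns, the most recent turns in full, then a gap stamp."""
    recent = turns[-keep_last:] if keep_last > 0 else []
    lines = []
    if summary:
        lines.append(f"Summary of earlier conversation: {summary}")
    lines.extend(f"{turn.role}: {turn.text}" for turn in recent)
    # Stamp the silence instead of replaying more history.
    if recent and now - recent[-1].at >= gap_threshold:
        lines.append(describe_gap(now - recent[-1].at))
    return lines


# Hypothetical conversation: 30 turns a minute apart, then a three-hour pause.
start = datetime(2024, 1, 1, 9, 0)
turns = [Turn("user", f"message {i}", start + timedelta(minutes=i)) for i in range(30)]
context = build_context(
    "Customer asked about a refund.",
    turns,
    now=start + timedelta(hours=3, minutes=29),
)
print(len(context), context[-1])  # 26 [3 hours later]
```

Thirty turns go in and twenty-six lines come out: one summary, twenty-four recent turns, one stamp. The six oldest turns survive only as whatever the summary says about them.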

The cheapest token is the one you were about to send and didn't.

Price per token is the wrong number

The number on the pricing page is per token. The number that lands on your invoice is per finished task, and the two come apart faster than anyone expects. A model that is cheap per token but second-guesses itself, narrates, retries, and wanders can cost more to reach an answer than a pricier one that gets there in three clean turns. You don't buy tokens. You buy outcomes, and tokens are merely how you're billed for them.
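
The gap between the two numbers is easy to see in a sketch. The prices, tokens per turn and turn counts below are invented for illustration, and the function assumes every turn costs about the same, which real loops rarely manage.

```python
def cost_per_task(price_per_million: float, tokens_per_turn: int, turns_to_finish: int) -> float:
    """Cost of one finished task, assuming every turn bills a similar token count."""
    return price_per_million * tokens_per_turn * turns_to_finish / 1_000_000


# Hypothetical: a cheap model that wanders versus a pricier one that finishes.
wandering = cost_per_task(price_per_million=0.50, tokens_per_turn=6_000, turns_to_finish=14)
direct = cost_per_task(price_per_million=2.00, tokens_per_turn=4_000, turns_to_finish=3)
print(f"wandering: ${wandering:.3f}  direct: ${direct:.3f}")
```

With these made-up figures the model that costs four times as much per token is a little over forty percent cheaper per finished task.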

This is also the real argument for a small model, and it isn't the one people usually make. The point of a small, cheap model (open weights, a fraction of the size of the famous ones) isn't only that each token costs less. It's that inside a tight loop with a clean context, it finishes in fewer tokens. Cheaper per token and fewer tokens to the answer: the saving multiplies. At our volume that per-token rent is also why the model runs on our own hardware instead of someone else's meter, because past a certain number of tokens a month, renting each one simply stops making sense.
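
Where that "certain number" sits depends on figures we are not publishing here, but the break-even itself is simple to sketch. Everything below is an assumption: the function name, the fixed monthly hardware cost, and both per-token prices are placeholders.

```python
def break_even_tokens_per_month(
    fixed_monthly_cost: float,
    metered_price_per_million: float,
    own_marginal_price_per_million: float = 0.0,
) -> float:
    """Monthly token volume above which owned hardware beats the meter."""
    saving_per_million = metered_price_per_million - own_marginal_price_per_million
    if saving_per_million <= 0:
        raise ValueError("self-hosting never breaks even at these prices")
    return fixed_monthly_cost / saving_per_million * 1_000_000


# Hypothetical: $20,000 a month in hardware and operations, $0.50 per million
# tokens on the meter, $0.10 per million in power and upkeep on owned machines.
threshold = break_even_tokens_per_month(20_000, 0.50, 0.10)
print(f"{threshold / 1e9:.0f} billion tokens a month")
```

Under these placeholder numbers the meter wins below about fifty billion tokens a month and loses above it. The real threshold moves with every input, which is why it is worth recomputing rather than remembering.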

What this is not is an argument for a drawer full of models. A drawer of models is a drawer of quirks, each with its own way of failing and its own place to drop context on the handoff. Routing the occasional hard turn to something larger is a footnote you add after the small model is already finishing the work, not the plan itself. The lever was never how many models you get to choose between. It was building one that finishes.

Reliability and cost are the same line

This is the part that surprised me. I expected cost work and reliability work to pull against each other, the cheap thing being the flimsy thing. At scale they turned out to be the same job. The messy context that blew the budget was the same messy context that made the model lose the thread. Every token we pruned for the invoice was a token that was no longer there to distract.

So we stopped treating the bill as something to apologise for and started reading it as a diagnostic. A conversation that costs too much is usually a conversation the harness is handling badly. The invoice was telling us where the system was sloppy, in the one language nobody on the team could argue with.

The model sets what a token can do. The harness sets how many you spend to get there. Only one of those shows up on the invoice every morning.