Benchmark, part two

When does self-hosting actually pay?

My benchmark settled quality and latency. The 30B I hosted matched the best cloud model within noise, five to ten times faster. This piece answers the question everyone asked next: the money. The answer is less flattering to self-hosting than the internet suggests.

Inputs: Measured token profiles
Prices: List, July 2026, sourced
Output: A formula you can re-run
Extends: The 1,200-call benchmark
Published: Jul 4, 2026

The short version

The API stays cheaper longer than you think

When I published the benchmark, one question kept coming back: surely self-hosting beats paying per token? Fair question. I had measured quality and speed, but not money. So I priced it out, using the benchmark's own logged rows.

Everything below is priced against my real workload. One call goes out with an image; a few hundred tokens come back. That is the whole unit of work.

So what does that unit cost on the cloud? Call this the cloud bill: a price you pay per thousand calls, with nothing owned and nothing fixed. At July 2026 list prices, a thousand of those calls runs between a dollar and $4.92. The top of that range is the strongest model in my benchmark, priced from its own logged tokens. The receipts are in the next section. Either way: pocket change at small volume.

Now the other option: renting a GPU and serving the same model yourself. Renting is the opposite shape of cost. It is a flat monthly bill, not a price per call. It arrives whether you make ten calls or ten million.

A flat monthly bill against a per-call price means there is a crossover volume. I did the division. Against the strongest cloud bill, renting a GPU starts winning somewhere between 33,000 and 267,000 calls a month, depending on which card you rent. Owning a card changes the math again. A used card plus its electricity runs about $60 a month, and that owning cost breaks even near 12,000 calls a month. Below those volumes, the answer is boring. Just pay per token.

So why self-host at all? Because cost was never the strongest argument. In the benchmark, the model I hosted answered in 0.77 seconds. The cloud arms took 3.6 to 7.8 seconds. And the images never left the machine. If your product needs the speed or the privacy, the rest of this page prices that choice at your volume.

The anchors

Four numbers that set the frame

Per-call costs are computed from the benchmark's measured token medians at each provider's July 2026 list prices; hardware costs are named providers' list prices on the same date.

$4.92: the cloud bill per 1,000 calls on the strongest arm, from its measured median of roughly 340 input and 260 output tokens per call.

$1.24: the cloud bill per 1,000 calls on the budget arm, same method. The cheap tiers are genuinely cheap.

$160 a month to rent the cheapest credible 24 GB GPU that can serve this 30B model.

~12,000 calls a month where owning a card overtakes the strongest cloud bill, electricity included.

The cloud side

A call costs half a cent, at worst

The benchmark logged the token usage of all 1,200 analysis calls. So the per-call cost here is arithmetic. Multiply the logged tokens by each provider's list price and you have the bill I actually paid.

Start with the strongest arm. Its median call sent about 340 tokens in and got about 260 back, and the extended-thinking trace is billed inside that output, because that is how the provider counts it. Price those tokens at the model's list rates ($3 per million in, $15 per million out) and the cloud bill comes to $4.92 per thousand calls. That is less than half a cent per call. The budget arm came in cheaper still, at $1.24 per thousand calls.

One detail surprises people. The image itself is billed as input tokens, and providers count the same image very differently. One arm's median call carried roughly 1,270 input tokens, nearly all of them the image. Sounds expensive. It is not: at flash-class pricing of $0.50 per million input tokens, that whole call costs about a tenth of a cent. The cheap tiers are genuinely hard to beat.
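The per-call arithmetic in this section can be sketched in a few lines of Python. The function name is mine, not something from the benchmark repo; the token medians and list prices are the ones quoted above.

```python
def cost_per_1000_calls(input_tokens, output_tokens, usd_per_m_in, usd_per_m_out):
    """Cloud bill in USD for 1,000 calls with the given per-call token counts."""
    per_call = (input_tokens * usd_per_m_in + output_tokens * usd_per_m_out) / 1_000_000
    return per_call * 1000


# Strongest arm: about 340 tokens in and 260 out, at $3 / $15 per million.
strongest = cost_per_1000_calls(340, 260, 3.00, 15.00)
print(f"strongest arm: ${strongest:.2f} per 1,000 calls")

# Image-heavy arm, input side only: about 1,270 tokens at $0.50 per million.
# The output price for that tier is not quoted here, so output is left at zero.
image_input = cost_per_1000_calls(1270, 0, 0.50, 0.0)
print(f"image-heavy input: ${image_input:.3f} per 1,000 calls")
```

Swap in your own logged medians and your provider's current list prices; the shape of the calculation does not change.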

And you can cut these prices further before touching any hardware. Batch APIs halve the bill at all three major providers, if you can wait up to a day for answers. Prompt caching cuts repeated context to a tenth of the input price. Offline, repetitive workload? The API side of this argument just got twice as strong.
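A rough sketch of how those two discounts change the per-call price, under assumptions of my own: the function and parameter names are hypothetical, the 200-token cached share is an illustrative split rather than a measured one, and any cache-write surcharge is ignored. Whether batch and caching discounts stack depends on the provider, so the example applies them one at a time.

```python
def per_call_usd(input_tokens, output_tokens, usd_per_m_in, usd_per_m_out,
                 cached_input_tokens=0, cache_price_factor=0.1, batch_factor=1.0):
    """Per-call cost in USD with optional prompt caching and batch pricing.

    cached_input_tokens: part of the input billed at the cached rate.
    cache_price_factor:  cached input price as a fraction of list (a tenth here).
    batch_factor:        0.5 for a batch API that halves the bill, 1.0 otherwise.
    """
    fresh_tokens = input_tokens - cached_input_tokens
    input_cost = (fresh_tokens + cached_input_tokens * cache_price_factor) * usd_per_m_in
    output_cost = output_tokens * usd_per_m_out
    return batch_factor * (input_cost + output_cost) / 1_000_000


list_price = per_call_usd(340, 260, 3.00, 15.00)
batched = per_call_usd(340, 260, 3.00, 15.00, batch_factor=0.5)
# Hypothetical split: 200 of the 340 input tokens are a repeated, cached prompt.
cached = per_call_usd(340, 260, 3.00, 15.00, cached_input_tokens=200)

for label, usd in [("list", list_price), ("batch", batched), ("cached prompt", cached)]:
    print(f"{label:>13}: ${usd * 1000:.2f} per 1,000 calls")
```

On this workload the output tokens dominate the bill, so caching moves the price much less than batching does.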

The self-hosted side

Renting is expensive. Owning is cheap.

My benchmark ran on a workstation card: an RTX 6000 Ada with 48 GB of VRAM. It served the 30B vision model and a small 3B side by side through llama.cpp. Renting that same capability in a datacenter is the honest comparison, and it costs about $1,314 a month (OVH's list price for the L40S, the card's 48 GB datacenter sibling). That is the top of the renting range.

But the 30B alone needs less card than I gave it. Its quantized weights come to about 19 GB, so people commonly run it on 24 GB consumer cards. You give up context headroom, and you run one model instead of two. That class of card rents for $160 to $290 a month on RunPod, Vast, or a dedicated Hetzner box. So renting spans a wide band: $160 a month at the low end, $1,314 a month at the datacenter top. I did not benchmark on those cards. Treat this tier as sizing guidance, not measurement.

Owning is where the numbers really turn. A used 24 GB RTX 3090 traded at $750 to $1,100 to buy through the first half of 2026. Spread that purchase over three years and the card costs about $25 a month to own. Electricity adds the rest: this class of card pulls around 350 watts under sustained load, which at Dutch household prices adds a few tens of dollars a month if it never sleeps. Realistically it sleeps a lot, and you pay less. Call the owning cost about $60 a month, everything included.
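The owning cost splits into two terms, and a short sketch makes them easy to re-run. The function names are mine. The purchase prices, the three-year spread, and the 350 watts come from the paragraph above; the load hours and the tariff in the electricity example are placeholders I picked for illustration, not figures from this article, and idle power draw is ignored.

```python
def amortized_monthly(purchase_usd, months=36):
    """Purchase price spread evenly over the ownership period."""
    return purchase_usd / months


def electricity_monthly(watts, hours_under_load, usd_per_kwh):
    """Energy cost for the hours the card actually spends under load."""
    return watts / 1000 * hours_under_load * usd_per_kwh


low = amortized_monthly(750)
high = amortized_monthly(1100)
print(f"card, amortized: ${low:.2f} to ${high:.2f} a month")

# Placeholder inputs: 100 load hours a month at $0.30 per kWh. Replace both
# with your own duty cycle and your own tariff.
power = electricity_monthly(350, hours_under_load=100, usd_per_kwh=0.30)
print(f"electricity at the placeholder inputs: ${power:.2f} a month")
```

The amortized range brackets the roughly $25 a month quoted above. The electricity term is the one that depends on your tariff and on how much the card sleeps, so it is the one worth replacing with your own numbers.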

What about capacity? Not the problem you might expect. Even at a conservative 70 tokens per second, one card can serve hundreds of thousands of calls like mine every month. For a small product, the card is never the bottleneck. The bill is.
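A back-of-envelope version of that capacity claim, as a sketch under simplifying assumptions that are mine: calls run one at a time, only the roughly 260 output tokens per call are counted, and prompt processing and image encoding are ignored. The utilization parameter is hypothetical, there to model a card that is busy only part of the time.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # a 30-day month


def calls_per_month(tokens_per_second, output_tokens_per_call, utilization=1.0):
    """Upper bound on serial calls a single card can generate in a month."""
    seconds_per_call = output_tokens_per_call / tokens_per_second
    return int(SECONDS_PER_MONTH * utilization / seconds_per_call)


always_busy = calls_per_month(70, 260)
quarter_busy = calls_per_month(70, 260, utilization=0.25)
print(f"100% busy: {always_busy:,} calls a month")
print(f" 25% busy: {quarter_busy:,} calls a month")
```

Even the quarter-busy figure sits well above the owning break-even, which is why capacity drops out of the decision at this scale.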

The formula

Run it on your own volume

The whole comparison reduces to one line. Take your flat monthly GPU cost. Divide it by what one call costs you on the cloud bill. The result is your break-even: the monthly call volume where owning or renting starts paying for itself. Past that volume, every call saves you money. Below it, the card is a subscription you did not need.

Here is that line worked with my numbers, against the strongest cloud bill of $4.92 per thousand calls. Owning a card at $60 a month breaks even near 12,000 calls a month. Renting a 24 GB card breaks even near 33,000 calls a month at the $160 low end and near 59,000 at $290. The big 48 GB datacenter rental, at $1,314 a month, needs about 267,000 calls a month. Now point the same math at the cheapest cloud bill, roughly a dollar per thousand calls: renting essentially never pays, and even owning a card needs steady volume before it wins.
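The one-line formula, written out so you can re-run it. This is a minimal sketch: the function name and the labels are mine, and the monthly costs and the $4.92 cloud bill are the figures used on this page.

```python
def break_even_calls(monthly_fixed_usd, cloud_usd_per_1000_calls):
    """Monthly call volume at which a flat hardware bill equals the cloud bill."""
    return monthly_fixed_usd / (cloud_usd_per_1000_calls / 1000)


STRONGEST_CLOUD_BILL = 4.92  # USD per 1,000 calls

options = [
    ("own a used 24 GB card", 60),
    ("rent 24 GB, low end", 160),
    ("rent 24 GB, high end", 290),
    ("rent 48 GB L40S", 1314),
]

for label, monthly_usd in options:
    volume = break_even_calls(monthly_usd, STRONGEST_CLOUD_BILL)
    print(f"{label:>22}: break-even near {volume:,.0f} calls a month")
```

Replace the cloud bill with your own cost per thousand calls and the monthly figures with your own quotes. Below the printed volume, the per-call price wins.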

Here's the thing about most published break-even analyses. They model giant 70B-plus systems on multi-GPU datacenter rigs, at $3,000 to $20,000 a month of fixed cost. At those prices, of course you need billions of tokens to justify self-hosting. A 30B-class model on one card is a different economic regime. The fixed side is an order of magnitude cheaper. And it happens to be where small production workloads actually live.

The crossover

Below 12,000 calls a month, pay per call

The cloud bill climbs with volume; the two hardware options are flat monthly costs. Owning a used card runs about $60 a month (the purchase spread over three years, plus electricity), and the strongest cloud bill, at $4.92 per thousand calls, reaches that level near 12,000 calls a month. It takes until about 59,000 calls a month to reach the $290 cost of renting a GPU.
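The same crossover can be tabulated instead of plotted. A small sketch with names of my own choosing, using the $4.92 cloud bill, the $60 owning cost, and the $290 rental from this page:

```python
CLOUD_USD_PER_CALL = 4.92 / 1000           # strongest cloud bill
FLAT_USD = {"own": 60.0, "rent": 290.0}    # flat monthly hardware costs


def monthly_bills(calls):
    """Monthly cost of each option at a given call volume."""
    return {"cloud": calls * CLOUD_USD_PER_CALL, **FLAT_USD}


def cheapest(calls):
    """Name of the lowest-cost option at that volume."""
    bills = monthly_bills(calls)
    return min(bills, key=bills.get)


for calls in (5_000, 12_000, 20_000, 59_000, 100_000):
    bills = monthly_bills(calls)
    print(f"{calls:>7,} calls: cloud ${bills['cloud']:>7.2f}  "
          f"own ${bills['own']:.0f}  rent ${bills['rent']:.0f}  -> {cheapest(calls)}")
```

Owning is always below renting in this table, so the rental column only matters if buying a card is not an option for you; in that case compare the cloud column against $290 directly.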

The honest conclusion

Cost is the third-best reason to self-host

For what it is worth, the answer surprised me too. I went in assuming the expensive options earned their price. Then the benchmark showed the cloud's thinking mode adding nothing I could measure, while charging for it. And now the cost math says the cheap API tiers are close to unbeatable at low volume. Twice wrong, in the same direction.

So here is the honest decision rule. Low or spiky volume, and flash-tier quality passes your bar? Pay per token and move on. Self-hosting earns its keep in three cases: steady volume on hardware you own, a latency gap your users can feel, or data that cannot leave your infrastructure. Notice that two of those three have nothing to do with the invoice. In my deployment, they were the ones that mattered.

Weighing this for a real product? Send me your workload shape and your current bill. I will run this arithmetic on your numbers and tell you which side of the line you are on, at no charge.

The benchmark this extends: quality, latency, raw rows
The system and data behind both write-ups, re-runnable
Send me your bill or your architecture

Sources

Measured token medians, latencies, and the benchmark hardware (RTX 6000 Ada, llama.cpp): the benchmark's public repo, accessed 2026-07-04
Anthropic list prices ($3/$15 per million tokens on the mid tier, $1/$5 budget tier): platform.claude.com, accessed 2026-07-04
Google Gemini flash-class list prices ($0.50 per million input tokens) and image token counting: ai.google.dev, accessed 2026-07-04
OpenAI list prices and 50% batch discount: developers.openai.com, accessed 2026-07-04
24 GB GPU rental range, roughly $160 to $500 a month across tiers: runpod.io, accessed 2026-07-04
RTX 4090 marketplace rental floor and median: vast.ai, accessed 2026-07-04
Dedicated 20 GB Ada GPU server at 234 euros a month plus a 114 euro setup fee: hetzner.com, accessed 2026-07-04
L40S 48 GB at $1.80 an hour, about $1,314 a month continuous: ovhcloud.com, accessed 2026-07-04
Used RTX 3090 street prices through H1 2026 ($750 to $1,100, rising): craftrigs.com, accessed 2026-07-04
Measured inference power draw (roughly 280 to 410 watts on a 24 GB card under load): gigagpu.com, accessed 2026-07-04
Dutch and German household electricity prices: euenergyprices.eu, accessed 2026-07-04
30B-class throughput on 24 GB consumer cards (roughly 70 to 196 tokens a second measured): runaihome.com, accessed 2026-07-04