Back

Benchmark

A self-hosted model matched cloud quality at a tenth of the latency.

Instead of arguing about self-hosted versus cloud, I ran the numbers on a real workload: 1,200 vision-analysis calls across six backend arms, same scenes, two judges with no stake in the result. The data and the script are public, so you can re-run it.

Calls
1,200 across six arms
Judges
Two, blind
Run
May 2026
Data
Public, re-runnable
Published

The short version

The finding, first

On this workload a 30-billion-parameter vision model I hosted myself matched the best cloud model's answer quality to within statistical noise, answered five to ten times faster, and cost nothing per call beyond hardware it was already running. The self-hosted model returned in about 0.77 seconds; the cloud models took 3.6 to 7.8 seconds for the same work.

That is one workload, not a law. But it is a measured result with the raw data attached, not an opinion. Everything below is how it was run and what each number means, so you can judge whether it transfers to your case, or clone it and find out.

Results

Four numbers that carry it

From 1,200 calls across six backend arms, three cloud models, a no-extended-thinking control, and two self-hosted, on the same scenes.

0.77sfor the self-hosted 30B model to answer, against 3.6 to 7.8s for the cloud models on the same scenes.
5 to 10xfaster self-hosted, with no per-call fee on top of the hardware it already ran on.
6.05 / 6.60blind-judged quality, self-hosted against the best cloud model, out of 9. Inside the noise, not a real gap.
1,200vision-analysis calls, six backend arms, scored by two judges that had no stake in the outcome.

Method

How it was run

The benchmark came out of a real system, not a lab harness. I had built a device that watches a room and works out what matters in it: bare-metal firmware on an STM32U5 that sleeps and wakes itself, an agentic backend, and a vision model doing the actual analysis. That gave me a genuine workload to measure against.

Six backend arms went in: three cloud models, one of them run a second time with extended thinking switched off as a control, and two models I hosted myself. Each was shown the same scenes, with repeated measurements per backend so a single lucky or unlucky call could not swing it. 1,200 vision-analysis calls in total. Latency and cost were recorded for every call, and answer quality was scored 0 to 9 by two independent judges on a shared subset of the runs. The full write-up is results/findings_v3.md in the repo, and the runner and raw rows are beside it so the whole thing re-runs.

Findings

What the numbers said

Speed was not close. The self-hosted 30B model answered in about 0.77 seconds; the cloud models took 3.6 to 7.8 seconds for the same work, five to ten times slower, and billed per call on top of it.

Quality was close enough not to matter. On the 0-to-9 scale the self-hosted model scored 6.05 against the best cloud model's 6.60. Two blind judges, same scenes: a gap that size sits inside the noise, not in the answers.

The expensive option's headline feature bought nothing here. The cloud model's extended-thinking mode added no measurable quality on this workload while charging more tokens and more time for it.

And the cheap seat has a floor. The smaller of the two models I hosted, a 3B, was simply too small for the job: weakest on quality and unable to use tools at all, which is worth knowing before you pick it to save money.

Why this matters

Measure before you commit a model

The expensive default bought nothing I could measure on this job, and the cheapest option quietly failed at the thing that mattered. Neither is something you can feel from the outside.

That is the whole argument for measuring before you put a model into production. The right choice is workload-specific, and a benchmark on your own traffic is cheap next to a year of the wrong inference bill. When I take on an AI-cost problem, this is the first thing I do: run your workload against the alternatives and let the numbers pick.