Benchmark
A self-hosted model matched cloud quality at a tenth of the latency.
Instead of arguing about self-hosted versus cloud, I ran the numbers on a real workload: 1,200 vision-analysis calls across six backend arms, same scenes, two judges with no stake in the result. The data and the script are public, so you can re-run it.
The short version
The finding, first
On this workload a 30-billion-parameter vision model I hosted myself matched the best cloud model's answer quality to within statistical noise, answered five to ten times faster, and cost nothing per call beyond hardware it was already running. The self-hosted model returned in about 0.77 seconds; the cloud models took 3.6 to 7.8 seconds for the same work.
That is one workload, not a law. But it is a measured result with the raw data attached, not an opinion. Everything below is how it was run and what each number means, so you can judge whether it transfers to your case, or clone it and find out.
Results
Four numbers that carry it
From 1,200 calls across six backend arms, three cloud models, a no-extended-thinking control, and two self-hosted, on the same scenes.
Method
How it was run
The benchmark came out of a real system, not a lab harness. I had built a device that watches a room and works out what matters in it: bare-metal firmware on an STM32U5 that sleeps and wakes itself, an agentic backend, and a vision model doing the actual analysis. That gave me a genuine workload to measure against.
Six backend arms went in: three cloud models, one of them run a second time with extended thinking switched off as a control, and two models I hosted myself. Each was shown the same scenes, with repeated measurements per backend so a single lucky or unlucky call could not swing it. 1,200 vision-analysis calls in total. Latency and cost were recorded for every call, and answer quality was scored 0 to 9 by two independent judges on a shared subset of the runs. The full write-up is results/findings_v3.md in the repo, and the runner and raw rows are beside it so the whole thing re-runs.
Findings
What the numbers said
Speed was not close. The self-hosted 30B model answered in about 0.77 seconds; the cloud models took 3.6 to 7.8 seconds for the same work, five to ten times slower, and billed per call on top of it.
Quality was close enough not to matter. On the 0-to-9 scale the self-hosted model scored 6.05 against the best cloud model's 6.60. Two blind judges, same scenes: a gap that size sits inside the noise, not in the answers.
The expensive option's headline feature bought nothing here. The cloud model's extended-thinking mode added no measurable quality on this workload while charging more tokens and more time for it.
And the cheap seat has a floor. The smaller of the two models I hosted, a 3B, was simply too small for the job: weakest on quality and unable to use tools at all, which is worth knowing before you pick it to save money.
Why this matters
Measure before you commit a model
The expensive default bought nothing I could measure on this job, and the cheapest option quietly failed at the thing that mattered. Neither is something you can feel from the outside.
That is the whole argument for measuring before you put a model into production. The right choice is workload-specific, and a benchmark on your own traffic is cheap next to a year of the wrong inference bill. When I take on an AI-cost problem, this is the first thing I do: run your workload against the alternatives and let the numbers pick.