
Field notes

Where a demo ends and production begins.

A demo proves an idea works once. Production is where it has to keep working while the bill climbs, things fail in ways you cannot reproduce, and the load stops being polite. The failures cluster in three places. Here is each one, and the measured proof behind how I handle it.

Topic: Production AI infrastructure
Failure modes: Cost, reliability, scale
Proof: Public and re-runnable
Published

The short version

Three places it breaks

Almost anything works in a demo. Put it in front of real users and it tends to fail in one of three places: the cost gets away from you, it breaks somewhere you cannot see, or it buckles under load. AI makes each sharper, because a model is an expensive, non-deterministic dependency you now have to run in production like everything else.

None of these are solved by a bigger model or a better slide. They are engineering problems with measurable answers. Below is how I think about each, backed by public, re-runnable proof rather than a claim.

Cost

The bill you cannot explain

At small scale the cloud and inference bills are invisible. Real traffic turns them into a number nobody can account for. With AI the common failure is paying top dollar for the biggest model on a job a smaller one does just as well, or running a cheap model that retries three times and ends up costing more than the right one used once.
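To make the retry arithmetic concrete, here is a minimal sketch. Every price, token count, and attempt count in it is an invented assumption for illustration, not a figure from any real provider or from the benchmark.

```python
def cost_per_task(price_per_1k_tokens: float, tokens_per_call: int, attempts: int) -> float:
    """Total spend to finish one task, counting every attempt, not just the last."""
    return price_per_1k_tokens * tokens_per_call / 1000 * attempts


# Hypothetical numbers: a cheap model that needs three attempts per task
# versus a pricier model that gets it right on the first call.
cheap_total = cost_per_task(price_per_1k_tokens=0.002, tokens_per_call=2000, attempts=3)
right_total = cost_per_task(price_per_1k_tokens=0.005, tokens_per_call=2000, attempts=1)

print(f"cheap model, 3 attempts: ${cheap_total:.4f} per task")
print(f"right model, 1 attempt:  ${right_total:.4f} per task")
```

With these assumed inputs the "cheap" model is the more expensive one per finished task. The useful unit is cost per completed task, not price per token.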

I do not guess at this; I measure it. In my own benchmark a self-hosted 30B model matched the best cloud model's quality within statistical noise at five to ten times the speed and no per-call cost. The point is not that self-hosting always wins. It is that the right model is workload-specific, and a benchmark on your own traffic is cheap next to a year of the wrong bill.
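One common way to check whether two models differ by more than noise is a bootstrap confidence interval on the gap between their mean scores. The sketch below is an illustration of that general technique, not necessarily the method the benchmark uses; the function name and the score lists you would pass in are assumptions.

```python
import random
import statistics


def bootstrap_diff_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(scores_a) - mean(scores_b).

    If the interval contains 0, the data does not separate the two models.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    diffs = []
    for _ in range(n_resamples):
        # Resample each model's scores with replacement, same size as the original.
        resample_a = rng.choices(scores_a, k=len(scores_a))
        resample_b = rng.choices(scores_b, k=len(scores_b))
        diffs.append(statistics.fmean(resample_a) - statistics.fmean(resample_b))
    diffs.sort()
    lower = diffs[int(alpha / 2 * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

Called with per-prompt quality scores from each model on the same workload, an interval that straddles zero is what "within statistical noise" would mean under this approach.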

Reliability

The failures that throw no error

The outages that hurt are the ones with nothing in the logs: no trace of what happened, nothing built to be watched, so every incident is a guessing game. The worst of them never throw an error at all, which is the exact failure mode an AI feature adds when a model quietly returns something plausible and wrong.

The fix is unglamorous and it works: put in the instrumentation and the failure handling that turn a mystery into something you can see, then fix. On MetricHost that meant full observability across two regions so a problem showed up as a signal, not a support ticket.
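As a small illustration of turning a silent model failure into a signal, the sketch below validates each reply against a rule and counts the outcome. It is not MetricHost code: the triage task, the label set, and the `outcomes` counter (a stand-in for a real metrics client) are all assumptions.

```python
import json
from collections import Counter

# Stand-in for a real metrics client; in production this would be exported.
outcomes = Counter()

# Hypothetical task: the model must classify a support ticket into one of these.
ALLOWED_LABELS = {"refund", "billing", "other"}


def check_reply(raw: str) -> bool:
    """Return True if the reply is usable, and record the outcome either way."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        outcomes["invalid_json"] += 1
        return False
    label = data.get("label") if isinstance(data, dict) else None
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        # Well-formed but wrong: no exception would ever surface this.
        outcomes["bad_label"] += 1
        return False
    outcomes["ok"] += 1
    return True
```

A check like this only catches answers that break a rule you can state up front. What it buys is a rate you can graph and alert on, so a rise in `bad_label` shows up before a customer reports it.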

Scale

Fine at a hundred, not at ten thousand

Infrastructure that held for a hundred users falls over at ten thousand. This is the heavy work that keeps a product standing under real traffic: running across regions so people reach the nearest one, walling each customer's data off from the next, the networking, and backups that actually restore when it counts.

MetricHost is the worked example: a multi-region platform on k3s and Cilium/eBPF, with idle servers that hibernate and wake on player connect so capacity is not burned at 3am. That last part is also a cost lever, which is the usual truth in production: reliability, scale, and cost are the same system looked at from three sides.
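The hibernate-and-wake idea can be reduced to a small decision function. This is a sketch of the general pattern, not MetricHost's implementation: the `Server` fields, the 15-minute idle limit, and the action names are assumptions chosen for illustration.

```python
from dataclasses import dataclass

IDLE_LIMIT_SECONDS = 15 * 60  # assumed threshold; tune against real traffic


@dataclass
class Server:
    name: str
    players: int
    last_activity: float  # Unix timestamp of the last player activity
    hibernating: bool = False


def next_action(server: Server, now: float, incoming_connection: bool = False) -> str:
    """Decide what to do with one server at time `now`."""
    if server.hibernating:
        # A sleeping server costs nothing until someone tries to connect.
        return "wake" if incoming_connection else "stay_hibernated"
    idle_for = now - server.last_activity
    if server.players == 0 and not incoming_connection and idle_for >= IDLE_LIMIT_SECONDS:
        return "hibernate"
    return "stay_running"
```

Keeping the decision pure, with time passed in rather than read from the clock, makes the policy easy to test and to replay against recorded traffic before trusting it with real capacity.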

The through-line

Measure, then build

The thread through all three is the same: decide with numbers, not with opinions or defaults. Measure where the money goes before you cut it, instrument the failure before you chase it, and load-test the shape before you trust it.

That discipline is also why I can hand you proof instead of a pitch. The benchmark re-runs, the architecture is written up, and even the tool that governs this site is public. If one of these three is your problem right now, send me the detail and I will tell you where I would start.