Case study

A one-page concept, taken to a multi-region game-hosting platform.

A game studio came to me with a pitch deck and no platform team. Six months later it was live in production across two regions. I set the architecture and the plan and directed the build, bringing in one other engineer to build parts of it to that structure. The code belongs to the studio and stays private, so this page describes how it works and why I built it the way I did.

Engagement: ~6 months
My role: Architect, lead, client lead
Regions: 2 live
Stack: k3s · Go · Java 21 · Next.js

The short version

What it is, in plain terms

MetricHost lets that studio sell server hosting to their own players and turn a profit doing it. The hard part of that business is that an idle server costs the same as a busy one, so the platform puts idle servers to sleep and wakes them the moment a player joins. The player never notices, and that reclaimed capacity is where most of the margin comes from.

It runs across two regions so players reach the nearest one, each customer's data stays walled off from the next, and it holds up under real load. When the hosting market shifted and the numbers changed, I gave the studio the honest math instead of quietly running up their bill. The rest of this page is how it works, for anyone who wants the detail.

The problem

A pitch, and no one to build it

The pitch was a one-page concept and a list of games to run: Minecraft, Valheim, Rust, that kind of thing. What the job actually needed was a real distributed system, and there was no one in-house to build it.

The off-the-shelf panels don't get you there. Pterodactyl and AMP are monoliths: one process running everything, no way to scale a single piece. They also leave idle servers running at full cost, and for a hosting business that bill is the whole game.

So they brought me in to design and build it, with one other engineer taking the frontend and parts of the control plane.

The shape of it

Five planes, each with one job

Nothing here can take down something it has no business touching. A billing deploy never restarts a running game server. A hacked game mod can't reach the platform. Operators work in their own isolated space, behind their own login.

A player's connection skips the application entirely. Their game traffic goes straight through a fast TCP proxy to their own server, so they're never stuck waiting behind someone else's billing request. The dashboard and the APIs are a separate layer: one gateway in front of nine backend services. New capacity is set up on its own by a small control plane written in Go. Operators get their own console. Underneath all of it sit the databases, a cache, an event stream, and file storage.

It all runs on self-hosted Kubernetes (k3s), across two regions, with nothing exposed to the public internet (traffic comes in over Cloudflare tunnels). Inside the cluster, the network decides who can talk to whom based on the workload itself, not its IP address, and enforces that down in the operating system, so a hacked game mod can't even open a connection to the billing service. I went with self-hosted Kubernetes over a managed cloud service for one stubborn reason: the managed offerings won't let a machine use disk as overflow memory, and that overflow is exactly what the cost savings below depend on.

The decisions

Six calls, and what each one cost me

A platform isn't interesting for its parts list. It's interesting for what you chose, what you gave up to choose it, and whether the trade held.

  1. Idle servers wind down in stages instead of running flat out.

    A server sitting empty holds onto the same memory as a full one, and for a hosting business that idle memory is just margin you're burning. So empty servers wind down in steps. First the CPU drops to almost nothing while the memory stays put; a server in this state wakes in well under a second. If a server stays empty longer, its cold memory is moved out to fast local disk while the server keeps running; from there it wakes in a few seconds. Either way, the freed-up memory goes back to the pool for a server that's actually busy.

    Tradeoff: Each step frees more memory but takes a little longer to wake, so every game sets its own limits: how quickly its servers wind down, and how far. Players never notice. When someone reconnects, the proxy holds the connection and buffers their first moves while the server comes back.
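
The staging rule above can be sketched as a small pure function. This is an illustration, not the platform's code: the names (`Stage`, `Policy`, `StageFor`) and the thresholds in `main` are assumptions, and the real system would also have to apply each stage to the running server.

```go
package main

import (
	"fmt"
	"time"
)

// Stage is how far an empty server has been wound down.
type Stage int

const (
	Active    Stage = iota // players connected, full resources
	Throttled              // CPU near zero, memory still resident
	PagedOut               // cold memory moved to fast local disk
)

// Policy holds hypothetical per-game limits: how quickly a server
// winds down, and how far it is allowed to go.
type Policy struct {
	ThrottleAfter time.Duration
	PageOutAfter  time.Duration
	MaxStage      Stage
}

// StageFor returns the stage a server should be in, given its player
// count and how long it has been empty.
func StageFor(players int, idle time.Duration, p Policy) Stage {
	if players > 0 {
		return Active
	}
	stage := Active
	if idle >= p.ThrottleAfter {
		stage = Throttled
	}
	if idle >= p.PageOutAfter {
		stage = PagedOut
	}
	if stage > p.MaxStage {
		stage = p.MaxStage // this game opted out of deeper stages
	}
	return stage
}

func main() {
	// Assumed thresholds, for illustration only.
	p := Policy{ThrottleAfter: 5 * time.Minute, PageOutAfter: 30 * time.Minute, MaxStage: PagedOut}
	for _, idle := range []time.Duration{time.Minute, 10 * time.Minute, time.Hour} {
		fmt.Println(idle, StageFor(0, idle, p))
	}
}
```

Setting `MaxStage` to `Throttled` models a game that accepts the sub-second wake but not the few-second one.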

  2. Adding a server is a workflow that survives a crash, not a loop and a retry.

    Bringing up a new game server isn't one step, it's several: reserve a machine, boot it, join it to the cluster, lock it down, mark it ready. A control plane written in Go runs that whole sequence as a durable workflow (Temporal), so if it crashes halfway through, it picks up exactly where it left off instead of stranding a half-built server that nobody cleans up. It decides when to add servers from real demand, the ones players are actually waiting on, not a timer. And it keeps one spare server built and ready, so when a rush arrives all at once, the first players in aren't stuck waiting for a cold machine to boot.

    Tradeoff: Anything that can spend money on its own needs guardrails: a budget ceiling, a hard cap on how many servers it can start, a cooldown, a limit on parallel builds. Plus a background job that cleans up stray servers, but only ever the ones the control plane itself created. The always-on baseline is tagged differently, so the cleanup can never touch it.
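
A minimal sketch of those guardrails, in plain Go without the Temporal SDK. Every name here is hypothetical (`Guardrails`, `State`, `CheckScaleUp`, `Reapable`, the `ManagedBy` label values), and the numbers in `main` are made up. It shows only the shape of the checks: refuse a scale-up when any limit is hit, and reap only what the autoscaler created.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Guardrails are hypothetical limits on what the autoscaler may do unattended.
type Guardrails struct {
	BudgetCeiling     float64       // max projected spend
	MaxServers        int           // hard cap on servers it may start
	MaxParallelBuilds int           // limit on builds in flight
	Cooldown          time.Duration // minimum gap between scale-ups
}

// State is a snapshot of what the control plane currently runs.
type State struct {
	ProjectedSpend float64
	Servers        int
	BuildsInFlight int
	LastScaleUp    time.Time
}

// CheckScaleUp returns nil if adding one server at costPerServer is allowed.
func (g Guardrails) CheckScaleUp(s State, costPerServer float64, now time.Time) error {
	switch {
	case s.Servers >= g.MaxServers:
		return errors.New("server cap reached")
	case s.BuildsInFlight >= g.MaxParallelBuilds:
		return errors.New("too many builds in flight")
	case now.Sub(s.LastScaleUp) < g.Cooldown:
		return errors.New("cooling down")
	case s.ProjectedSpend+costPerServer > g.BudgetCeiling:
		return errors.New("budget ceiling reached")
	}
	return nil
}

// Node is a machine with a label saying who created it.
type Node struct {
	Name      string
	ManagedBy string // "autoscaler" or "baseline" (assumed label values)
	Stray     bool   // half-built or otherwise orphaned
}

// Reapable returns only stray nodes the autoscaler itself created.
func Reapable(nodes []Node) []Node {
	var out []Node
	for _, n := range nodes {
		if n.Stray && n.ManagedBy == "autoscaler" {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	g := Guardrails{BudgetCeiling: 500, MaxServers: 10, MaxParallelBuilds: 2, Cooldown: 2 * time.Minute}
	s := State{ProjectedSpend: 480, Servers: 4}
	fmt.Println(g.CheckScaleUp(s, 40, time.Now()))
	fmt.Println(Reapable([]Node{{"a", "autoscaler", true}, {"b", "baseline", true}}))
}
```

Filtering on the creator label rather than on node state alone is what keeps the always-on baseline out of reach of the cleanup job.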

  3. The regional database is the truth. The global index is best-effort.

    Services don't call each other directly to get work done. They post events to a shared stream (Redpanda) and move on, so something like a user deletion spreads to every service on its own. With this many services, that's the only way GDPR erasure stays manageable.

    Tradeoff: There's a catch worth being honest about. Gateways find a server by reading one shared, global index, but each region has its own separate event stream, so a server created in one region doesn't show up in that index on its own. To close the gap, the regional service writes to the index itself. It does this deliberately as a best effort: the write only happens after the server is already saved safely in its own region, and if the index write fails, it's logged and dropped, never rolled back. The reasoning is simple. A shaky lookup table should never be able to erase a server that really exists. And a missed write fixes itself: until the index catches up, gateways just can't route players to that server yet, and the server's next status change tries the write again.
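
The ordering described above (regional save first, index write second, failure logged and dropped) can be sketched with in-memory stand-ins. The interfaces and type names are hypothetical, and `flakyIndex` exists only to simulate one failed index write.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

type Server struct {
	ID, Region, Status string
}

// RegionalStore is the source of truth; GlobalIndex is the shared lookup table.
type RegionalStore interface{ Save(Server) error }
type GlobalIndex interface{ Put(id, region string) error }

type Service struct {
	Store RegionalStore
	Index GlobalIndex
}

// Upsert saves to the regional store first. Only a failure there is
// returned to the caller. The index write that follows is best effort:
// on error it is logged and dropped, and the regional save stands.
func (s Service) Upsert(srv Server) error {
	if err := s.Store.Save(srv); err != nil {
		return err
	}
	if err := s.Index.Put(srv.ID, srv.Region); err != nil {
		log.Printf("index write failed for %s (retried on next status change): %v", srv.ID, err)
	}
	return nil
}

// memStore is an in-memory stand-in for the regional database.
type memStore map[string]Server

func (m memStore) Save(s Server) error { m[s.ID] = s; return nil }

// flakyIndex fails the first `failures` writes, then succeeds.
type flakyIndex struct {
	failures int
	entries  map[string]string
}

func (f *flakyIndex) Put(id, region string) error {
	if f.failures > 0 {
		f.failures--
		return errors.New("index unavailable")
	}
	f.entries[id] = region
	return nil
}

func main() {
	store := memStore{}
	index := &flakyIndex{failures: 1, entries: map[string]string{}}
	svc := Service{Store: store, Index: index}

	_ = svc.Upsert(Server{ID: "srv-1", Region: "us", Status: "starting"})
	fmt.Println(len(store), len(index.entries)) // 1 0: saved, not yet routable

	// The next status change goes through the same path and retries the write.
	_ = svc.Upsert(Server{ID: "srv-1", Region: "us", Status: "running"})
	fmt.Println(len(store), len(index.entries)) // 1 1
}
```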

  4. The admin console is slow on purpose.

    Operators get their own console and their own API, walled off and read-only by default. Two database connections back it: one that can only read, one that can write to exactly two tables. The dangerous moves, impersonating a user or issuing a refund, take two people: one to ask, a different one to approve.

    Tradeoff: More friction for the operator, which is the point. You can't approve your own request, the request expires, and every write needs a fresh second factor. For a console that can reach into a customer's account, I'll take the friction.
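
Those three rules fit in one check. A sketch under stated assumptions: the type names, the error values, and the five-minute freshness window (`maxFactorAge`) are all invented for illustration, and a real console would enforce this server-side next to the write connection.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var (
	ErrSelfApproval = errors.New("requester cannot approve their own request")
	ErrExpired      = errors.New("request has expired")
	ErrStaleFactor  = errors.New("second factor is not fresh")
)

// maxFactorAge is an assumed freshness window for the second factor.
const maxFactorAge = 5 * time.Minute

// Request is a pending dangerous action, such as a refund.
type Request struct {
	Action    string
	Requester string
	ExpiresAt time.Time
}

// Approval is a second operator signing off.
type Approval struct {
	Approver       string
	SecondFactorAt time.Time // when the approver last passed a second factor
}

// Authorize returns nil only if a different person approves an unexpired
// request with a fresh second factor.
func Authorize(r Request, a Approval, now time.Time) error {
	switch {
	case a.Approver == r.Requester:
		return ErrSelfApproval
	case !now.Before(r.ExpiresAt):
		return ErrExpired
	case a.SecondFactorAt.After(now) || now.Sub(a.SecondFactorAt) > maxFactorAge:
		return ErrStaleFactor
	}
	return nil
}

func main() {
	now := time.Now()
	req := Request{Action: "refund", Requester: "alice", ExpiresAt: now.Add(15 * time.Minute)}

	fmt.Println(Authorize(req, Approval{Approver: "alice", SecondFactorAt: now}, now))
	fmt.Println(Authorize(req, Approval{Approver: "bob", SecondFactorAt: now.Add(-time.Hour)}, now))
	fmt.Println(Authorize(req, Approval{Approver: "bob", SecondFactorAt: now}, now))
}
```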

  5. US traffic goes to US, not through Europe first.

    Console, RCON, the file manager, live metrics: they're interactive, and a hop across the Atlantic on every action feels broken. So per-server operations go straight to the region that owns the server. The frontend looks up where a server lives and routes there directly over Cloudflare's edge.

    Tradeoff: That splits the world in two: each region is its own data plane, but auth, billing, and the registry stay global so a session works everywhere. The seam between them was the thing I had to get exactly right, which is the cross-region index write in decision 3.
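
The lookup-then-route step can be sketched as below. The real frontend is Next.js; this uses Go only to stay consistent with the other sketches on this page. The registry shape, the hostnames, and the URL layout are placeholders, not the platform's actual endpoints.

```go
package main

import (
	"errors"
	"fmt"
)

var ErrUnknownServer = errors.New("server not found in registry")

// Registry maps a server ID to the region that owns it (a stand-in for
// the global registry lookup).
type Registry map[string]string

// regionEndpoints are placeholder per-region API hosts.
var regionEndpoints = map[string]string{
	"eu": "https://eu.api.example.invalid",
	"us": "https://us.api.example.invalid",
}

// EndpointFor builds the URL for a per-server operation (console, RCON,
// files, metrics) in the region that owns the server.
func EndpointFor(reg Registry, serverID, path string) (string, error) {
	region, ok := reg[serverID]
	if !ok {
		return "", ErrUnknownServer
	}
	base, ok := regionEndpoints[region]
	if !ok {
		return "", fmt.Errorf("no endpoint for region %q", region)
	}
	return base + "/servers/" + serverID + path, nil
}

func main() {
	reg := Registry{"srv-1": "us", "srv-2": "eu"}
	for _, id := range []string{"srv-1", "srv-2", "srv-3"} {
		url, err := EndpointFor(reg, id, "/console")
		fmt.Println(url, err)
	}
}
```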

  6. Moved the network to Cilium eBPF, carefully.

    k3s ships with flannel, and its default network policy enforcement runs through long iptables chains that get slower as you add services. Cilium does it in the kernel with eBPF, keyed on which workload is talking instead of which IP. That's what makes a rule like "game servers can't reach billing" both cheap and genuinely enforced, not just configured.

    Tradeoff: Swapping the network layer can take the whole cluster down with it, so I rehearsed the full cutover on a throwaway cluster first (routing, DNS, policy, pod-to-pod) before going near production, then cleaned up the dead flannel interfaces. kube-proxy stayed in place. The win here is the policy datapath, not ripping kube-proxy out.
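
To make "keyed on which workload is talking" concrete, here is a toy model of label-based, default-deny policy. It is not Cilium's API or its datapath, and the labels, rule, and port are invented. It only illustrates why no IP address appears anywhere in the decision.

```go
package main

import "fmt"

// Labels identify a workload by what it is, not where it runs.
type Labels map[string]string

// Rule allows traffic from workloads matching From to workloads
// matching To on one port.
type Rule struct {
	From, To Labels
	Port     int
}

// matches reports whether the workload carries every label in the
// selector. An empty selector matches any workload.
func matches(selector, workload Labels) bool {
	for k, v := range selector {
		if workload[k] != v {
			return false
		}
	}
	return true
}

// Allowed is default-deny: traffic passes only if some rule permits it.
func Allowed(rules []Rule, src, dst Labels, port int) bool {
	for _, r := range rules {
		if r.Port == port && matches(r.From, src) && matches(r.To, dst) {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical policy: only the gateway may call billing.
	rules := []Rule{
		{From: Labels{"app": "gateway"}, To: Labels{"app": "billing"}, Port: 8080},
	}
	gateway := Labels{"app": "gateway"}
	game := Labels{"app": "game-server", "game": "minecraft"}
	billing := Labels{"app": "billing"}

	fmt.Println(Allowed(rules, gateway, billing, 8080)) // true
	fmt.Println(Allowed(rules, game, billing, 8080))    // false
}
```

Because the decision depends on labels, a game server that is rescheduled and gets a new IP is still denied without any rule changing.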

What shipped

Measured, not rounded up

Engagement facts, not vanity counts. The architecture behind them is documented in full in the public repo.

58→86%: a server's margin once its idle memory is handed back (platform's own model)
<1s: to wake a lightly idle server, so players never feel the savings
2: regions, players routed to the nearest
25→8 min: backend CI after the rework

How it ended

The mechanism worked. The vendor moved the floor.

The wind-down did its job: idle memory goes back to the pool, which on the platform's own math is the difference between barely covering costs and reselling that memory as headroom.

Then the vendor raised the price of its machines overnight. That's a different kind of problem. Freeing up memory makes each server cheaper to run; it does nothing when the machines themselves cost more. What I did about that, the math and the call it led to, is its own write-up below.