
The infrastructure choices behind Agent Hive: Fly machines per colony, Supabase Postgres for control-plane state, Vercel for the dashboard, and why this stack.

People ask what we run on. The short answer: Fly machines for the agent runtime, Supabase for control-plane storage, Vercel for the dashboard, and OpenAI, Anthropic, and Gemini for the models. The longer answer is a set of trade-offs worth writing down, because each piece is load-bearing for a different reason.
Every Agent Hive customer gets a private Fly machine when they sign up. That machine boots a clean Hive runtime image with the customer's credentials mounted as Fly secrets, opens a private network connection back to our control plane, and starts accepting jobs.
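A minimal sketch of what the first step of that boot sequence could look like. Fly exposes secrets to a machine as environment variables, so the runtime can validate them before it connects to anything. The variable names (`HIVE_COLONY_ID`, `HIVE_CONTROL_PLANE_URL`, `HIVE_MODEL_API_KEY`) and the `RuntimeConfig` shape are hypothetical, chosen for illustration rather than taken from the real Hive runtime.

```typescript
// Hypothetical config shape; the real runtime's fields are not documented here.
interface RuntimeConfig {
  colonyId: string;
  controlPlaneUrl: string;
  modelApiKey: string;
}

// Hypothetical secret names. Fly secrets arrive as environment variables.
const REQUIRED_VARS = [
  "HIVE_COLONY_ID",
  "HIVE_CONTROL_PLANE_URL",
  "HIVE_MODEL_API_KEY",
] as const;

// Fail fast at boot if any secret is missing, instead of failing on the first job.
function loadRuntimeConfig(env: Record<string, string | undefined>): RuntimeConfig {
  const missing = REQUIRED_VARS.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`missing required secrets: ${missing.join(", ")}`);
  }
  return {
    colonyId: env["HIVE_COLONY_ID"] as string,
    controlPlaneUrl: env["HIVE_CONTROL_PLANE_URL"] as string,
    modelApiKey: env["HIVE_MODEL_API_KEY"] as string,
  };
}
```

In a Node process this would be called once at startup with `process.env`; passing the environment in as an argument keeps the function easy to test.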
Why a machine per customer, not a pool of shared workers?
The thing Fly is bad at is being a database. Their volumes work for local state but they are not the durable, queryable store you want for the control plane.
Everything Agent Hive needs to query (the colony list, the org chart, the issues table, the approvals queue, the cost rollups) lives in a single shared Supabase Postgres instance. Each tenant's data is isolated at the row level: every query goes through a service-role connection that pins a colony_id claim, and Postgres row-level security policies enforce the rest.
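The isolation described above can be sketched as two pieces: a policy per tenant table, and a statement that pins the claim for the current transaction. This is an illustration under assumptions, not the actual Hive schema. The table names, the policy name, and the choice to carry the claim in the `request.jwt.claims` setting (the convention PostgREST uses) are all hypothetical. Note that Postgres only applies a policy to roles that do not bypass row-level security.

```typescript
// A parameterized statement, in the shape most Postgres clients accept.
type Statement = { text: string; values: string[] };

// Build the DDL for one tenant table. Table and policy names are hypothetical.
function tenantPolicySql(table: string): string {
  // Identifiers cannot be bound as parameters, so validate before interpolating.
  if (!/^[a-z_][a-z0-9_]*$/.test(table)) {
    throw new Error(`unsafe table name: ${table}`);
  }
  return [
    `alter table ${table} enable row level security;`,
    `create policy ${table}_colony_isolation on ${table}`,
    `  using (colony_id::text = current_setting('request.jwt.claims', true)::jsonb ->> 'colony_id');`,
  ].join("\n");
}

// Pin the claim for the current transaction only (third argument `true` = local).
function pinColonyClaim(colonyId: string): Statement {
  return {
    text: "select set_config('request.jwt.claims', $1, true)",
    values: [JSON.stringify({ colony_id: colonyId })],
  };
}
```

A request handler would run `pinColonyClaim(...)` as the first statement of a transaction and then issue ordinary queries; rows belonging to other colonies never match the policy, so a missing `where` clause in application code does not leak data.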
Two practical reasons we picked Supabase here.
What we would do differently if we started today: model the costs and metrics tables as time-series from day one, instead of starting with a normal Postgres table and migrating later. We are in the middle of that migration now and it is the kind of work nobody enjoys.
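To make "time-series from day one" concrete, here is a minimal sketch of the shape: append-only cost events carrying a timestamp, rolled up into fixed hourly buckets per colony. The `CostEvent` and `HourlyCost` types and their field names are hypothetical, and an in-memory rollup stands in for whatever the database would do; it is not a description of the migration in progress.

```typescript
// Hypothetical event shape: one row per billable action, never updated in place.
interface CostEvent {
  colonyId: string;
  atMs: number; // event time, milliseconds since the Unix epoch
  costCents: number; // integer cents, to avoid floating-point drift
}

interface HourlyCost {
  colonyId: string;
  hourStartMs: number;
  costCents: number;
}

const HOUR_MS = 60 * 60 * 1000;

// Sum events into (colony, hour) buckets, returned in time order.
function rollupByHour(events: CostEvent[]): HourlyCost[] {
  const buckets = new Map<string, HourlyCost>();
  for (const e of events) {
    const hourStartMs = Math.floor(e.atMs / HOUR_MS) * HOUR_MS;
    const key = `${e.colonyId}|${hourStartMs}`;
    const bucket = buckets.get(key);
    if (bucket) {
      bucket.costCents += e.costCents;
    } else {
      buckets.set(key, { colonyId: e.colonyId, hourStartMs, costCents: e.costCents });
    }
  }
  return Array.from(buckets.values()).sort(
    (a, b) => a.hourStartMs - b.hourStartMs || a.colonyId.localeCompare(b.colonyId),
  );
}
```

The design point is that the raw events are the source of truth and every rollup is derived from them, which is harder to retrofit onto a table that was originally updated in place.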
The Agent Hive dashboard is a Next.js 15 app on Vercel. We run the App Router, server components everywhere we can, and a thin client-side layer for the realtime cursor + presence work and the Cmd-K command bar. The dashboard is at apps/web in the monorepo.
Vercel is not the cheapest hosting option, and we are aware of every dollar of edge network spend. The reason we keep paying for it is the deploy story. The dashboard ships maybe 15 times a day during a heads-down sprint week, and the "git push, get a preview URL, share it with a customer in 90 seconds" loop is one of the few things in our toolchain that is genuinely faster than the alternative.
If we ever build a self-hosted Agent Hive (we have customers asking), the dashboard part is the hardest to port. The Fly + Supabase pieces are portable to anything that runs a Linux container; the Vercel-shaped dashboard would need to be repackaged for plain Node. We will cross that bridge when a customer signs a contract that requires it.
The agent runtime is bring-your-own-key for any of: Anthropic Claude, OpenAI GPT-4 family, Google Gemini. Most of our managed-tier customers are running Claude Sonnet because the tool-use reliability is, at the moment, the best of the three for this kind of workload. We do not have a strong opinion about which model you should pick; we have a strong opinion that the choice should be a single env-var change, not a code rewrite.
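One way the "single env-var change" could look in code is a small resolver that maps a provider name to the key it needs. This is a sketch under assumptions: `HIVE_MODEL_PROVIDER` is a hypothetical variable name, and the per-provider key variables follow common convention rather than anything confirmed about the Hive runtime.

```typescript
type Provider = "anthropic" | "openai" | "gemini";

// Which environment variable holds the customer's key for each provider.
// These names are assumptions for illustration.
const API_KEY_VAR: Record<Provider, string> = {
  anthropic: "ANTHROPIC_API_KEY",
  openai: "OPENAI_API_KEY",
  gemini: "GEMINI_API_KEY",
};

function isProvider(value: string): value is Provider {
  return Object.prototype.hasOwnProperty.call(API_KEY_VAR, value);
}

// Resolve the provider and its key from the environment, failing loudly on bad input.
function resolveModel(
  env: Record<string, string | undefined>,
): { provider: Provider; apiKey: string } {
  const raw = (env["HIVE_MODEL_PROVIDER"] ?? "").toLowerCase();
  if (!isProvider(raw)) {
    throw new Error(`unsupported HIVE_MODEL_PROVIDER: "${raw}"`);
  }
  const keyVar = API_KEY_VAR[raw];
  const apiKey = env[keyVar];
  if (!apiKey) {
    throw new Error(`${keyVar} must be set for provider ${raw}`);
  }
  return { provider: raw, apiKey };
}
```

Everything downstream takes the resolved `provider` value, so switching models means changing one variable and supplying the matching key, with no code edits.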
To be honest about it, there are a few things this stack does not handle well yet; they are on the roadmap.
The shape will keep evolving. The point of writing this down is so the next thing we change has a documented baseline to change from.