People ask what we run on. The short answer: Fly machines for the agent runtime, Supabase for control-plane storage, Vercel for the dashboard, OpenAI and Anthropic and Gemini for the models. The longer answer is a set of trade-offs worth writing down, because each piece is load-bearing for a different reason.

Per-colony Fly machines

Every Agent Hive customer gets a private Fly machine when they sign up. That machine boots a clean Hive runtime image with the customer's credentials mounted as Fly secrets, opens a private network connection back to our control plane, and starts accepting jobs.
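
As a rough sketch of the provisioning step, the request that creates a colony's machine might be assembled like this. The endpoint path follows Fly's public Machines API (`POST /v1/apps/{app}/machines`); the one-app-per-colony naming scheme, the image name, the `HIVE_COLONY_ID` variable, and the guest sizing are all assumptions for illustration, not Agent Hive's actual values. Credentials are left out of the payload on purpose, on the assumption that they are set as Fly secrets on the app rather than passed in the machine config.

```typescript
// Hypothetical request shape. Only the endpoint path and the top-level
// `region`, `config.image`, `config.env`, and `config.guest` fields are
// taken from Fly's Machines API; every concrete value is an assumption.
interface MachineRequest {
  method: "POST";
  url: string;
  body: {
    name: string;
    region: string;
    config: {
      image: string;
      env: Record<string, string>;
      guest: { cpu_kind: "shared" | "performance"; cpus: number; memory_mb: number };
    };
  };
}

function buildColonyMachineRequest(colonyId: string, region: string): MachineRequest {
  // Reject ids that would not be safe inside an app name or URL path.
  if (!/^[a-z0-9-]+$/.test(colonyId)) {
    throw new Error(`invalid colony id: ${colonyId}`);
  }
  const app = `hive-colony-${colonyId}`; // assumed: one Fly app per colony
  return {
    method: "POST",
    url: `https://api.machines.dev/v1/apps/${app}/machines`,
    body: {
      name: `${app}-runtime`,
      region,
      config: {
        image: "registry.fly.io/hive-runtime:latest", // assumed image name
        env: { HIVE_COLONY_ID: colonyId }, // non-secret configuration only
        guest: { cpu_kind: "shared", cpus: 1, memory_mb: 512 }, // assumed sizing
      },
    },
  };
}
```

The function only builds the request; sending it (with an API token in the `Authorization` header) is left to whatever HTTP client the control plane already uses.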

Why a machine per customer, not a pool of shared workers?

Isolation is cheap, not free, and we want the cheap kind. A pool of shared workers is operationally simpler at first, but the first time a customer's tool call leaks a token into another customer's process environment, you have an incident that will end the company. We are not willing to underwrite that risk. A machine per customer means the failure surface for credential bleed is a virtual-machine boundary (each Fly machine is a Firecracker microVM with its own Linux kernel), which is the strongest one available without going to dedicated hardware.
Fly machines are fast to start. A cold start, from "issue API call" to "Hive runtime ready," is about 6 seconds at the moment. That is fast enough to provision a colony in under a minute end-to-end, including the Postgres schema setup and the dashboard render. Cold starts on EC2 or GCE for an equivalent VM would put us five to ten times slower, which would break the "60-second sign-up" promise.
We can stop them when nobody is using them. Most colonies are idle most of the time. Fly's "scale to zero" is real. A machine costs us cents per hour while it is running, and zero while it is stopped. Compared to a 24/7 EC2 worker pool, the bill is a different shape of curve entirely.
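
To make the "different shape of curve" concrete, here is a small cost model. The hourly rate and the usage fractions are made-up inputs, not our actual bill; the point is only that a stop-when-idle fleet is billed on active hours, while an always-on worker is billed on wall-clock hours.

```typescript
// Illustrative only: rates and usage fractions are hypothetical inputs.
const HOURS_PER_MONTH = 730;

// Monthly cost of one machine that is billed only while it is running.
function scaleToZeroMonthlyCost(hourlyRate: number, activeFraction: number): number {
  if (activeFraction < 0 || activeFraction > 1) {
    throw new RangeError("activeFraction must be between 0 and 1");
  }
  return hourlyRate * HOURS_PER_MONTH * activeFraction;
}

// Monthly cost of one worker that runs around the clock.
function alwaysOnMonthlyCost(hourlyRate: number): number {
  return hourlyRate * HOURS_PER_MONTH;
}

// Fleet cost when every colony has its own stop-when-idle machine.
function fleetMonthlyCost(hourlyRate: number, activeFractions: number[]): number {
  return activeFractions.reduce(
    (sum, fraction) => sum + scaleToZeroMonthlyCost(hourlyRate, fraction),
    0,
  );
}
```

With an assumed rate of 0.02 per hour, a colony that is active 5% of the time costs roughly 0.73 per month under this model, against roughly 14.60 for the same machine left running all month.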

The thing Fly is bad at is being a database. Their volumes work for local state but they are not the durable, queryable store you want for the control plane.

Supabase Postgres for the control plane

Everything Agent Hive needs to query (the colony list, the org chart, the issues table, the approvals queue, the cost rollups) lives in a single shared Supabase Postgres instance. Each tenant's data is row-level-isolated; every query goes through a service-role connection that pins a colony_id claim, and Postgres row-level-security policies enforce the rest.
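
The isolation rule can be pictured with a toy in-memory model. This is not our Postgres policy and not Supabase's API; the `Issue` shape and function names are hypothetical. It only illustrates the idea that the tenant check is applied to every read before the caller's own filter, so a query can narrow what it sees but never widen it.

```typescript
// Toy model of row-level isolation; names and shapes are assumptions.
interface Issue {
  colony_id: string;
  title: string;
}

interface Claims {
  colony_id: string;
}

// Stand-in for a row-level-security policy: a row is visible only when
// its colony_id matches the claim pinned on the connection.
function tenantPolicy(row: { colony_id: string }, claims: Claims): boolean {
  return row.colony_id === claims.colony_id;
}

// Every read passes through the policy first; the caller's filter can
// only narrow the result further.
function selectIssues(
  table: readonly Issue[],
  claims: Claims,
  where: (row: Issue) => boolean = () => true,
): Issue[] {
  return table.filter((row) => tenantPolicy(row, claims) && where(row));
}
```

Even a deliberately broad filter such as `() => true` returns only the rows for the colony named in the claims.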

Two practical reasons we picked Supabase here.

Realtime over the wire, included. The dashboard is push, not poll. Supabase ships a realtime channel layer that pipes Postgres changes to subscribed browser clients with no extra service to operate. Every "live cost," "approval pending," and "agent step" update in the dashboard flows through a Supabase realtime channel. We considered building this ourselves on top of Postgres LISTEN/NOTIFY and a custom WebSocket service; the Supabase version is fine, and we get to ship instead of operating a WebSocket service.
Branching for migrations. Every PR that touches the schema gets a Supabase branch. We run the migration against the branch, run the test suite against the branch, and only land if both pass. The first time we did a hot schema change on production data without a branch, we lost about two hours. After the third time we used branching, we never went back.
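
On the push side, the client work mostly comes down to folding change events into local state. The sketch below assumes an event shape loosely modeled on Supabase's Postgres change payloads (`eventType`, `new`, `old`); the `Approval` type and the reducer are hypothetical and are not lifted from our dashboard code.

```typescript
// Hypothetical row type for an approvals queue.
interface Approval {
  id: string;
  status: "pending" | "approved" | "rejected";
}

// Assumed event shape, loosely following Postgres change payloads.
type ChangeEvent =
  | { eventType: "INSERT"; new: Approval }
  | { eventType: "UPDATE"; new: Approval }
  | { eventType: "DELETE"; old: { id: string } };

// Pure reducer: returns a new map with the change applied, leaving the
// previous state untouched so UI frameworks can detect the update.
function applyChange(
  state: ReadonlyMap<string, Approval>,
  event: ChangeEvent,
): Map<string, Approval> {
  const next = new Map(state);
  switch (event.eventType) {
    case "INSERT":
    case "UPDATE":
      next.set(event.new.id, event.new);
      break;
    case "DELETE":
      next.delete(event.old.id);
      break;
  }
  return next;
}
```

Because each event carries the full new row, the client never has to re-query the table to stay current, which is the practical difference between push and poll.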

What we would do differently if we started today: model the costs and metrics tables as time-series from day one, instead of starting with a normal Postgres table and migrating later. We are in the middle of that migration now and it is the kind of work nobody enjoys.

Vercel for the dashboard

The Agent Hive dashboard is a Next.js 15 app on Vercel. We run the App Router, server components everywhere we can, and a thin client-side layer for the realtime cursor + presence work and the Cmd-K command bar. The dashboard is at apps/web in the monorepo.

Vercel is not the cheapest hosting option, and we are aware of every dollar of edge network spend. The reason we keep paying for it is the deploy story. The dashboard ships maybe 15 times a day during a heads-down sprint week, and the "git push, get a preview URL, share it with a customer in 90 seconds" loop is one of the few things in our toolchain that is genuinely faster than the alternative.

If we ever build a self-hosted Agent Hive (we have customers asking), the dashboard part is the hardest to port. The Fly + Supabase pieces are portable to anything that runs a Linux container; the Vercel-shaped dashboard would need to be repackaged for plain Node. We will cross that bridge when a customer signs a contract that requires it.

The model layer

The agent runtime is bring-your-own-key for any of: Anthropic Claude, OpenAI GPT-4 family, Google Gemini. Most of our managed-tier customers are running Claude Sonnet because the tool-use reliability is, at the moment, the best of the three for this kind of workload. We do not have a strong opinion about which model you should pick; we have a strong opinion that the choice should be a single env-var change, not a code rewrite.
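
One way to keep the provider choice down to a single env-var change is a small registry keyed by provider name. This is a minimal sketch under stated assumptions: `HIVE_MODEL_PROVIDER` and the API key variable names are hypothetical, and the real runtime may resolve its configuration differently.

```typescript
// Hypothetical variable names; a real runtime may use different ones.
type Env = Record<string, string | undefined>;

const PROVIDERS = {
  anthropic: { keyVar: "ANTHROPIC_API_KEY" },
  openai: { keyVar: "OPENAI_API_KEY" },
  gemini: { keyVar: "GEMINI_API_KEY" },
} as const;

type Provider = keyof typeof PROVIDERS;

// Own-property check, so inherited names like "toString" are rejected.
function isProvider(value: string): value is Provider {
  return Object.prototype.hasOwnProperty.call(PROVIDERS, value);
}

// Reads the provider name and the matching bring-your-own key from env.
function resolveProvider(env: Env): { provider: Provider; apiKey: string } {
  const raw = env["HIVE_MODEL_PROVIDER"];
  if (raw === undefined || !isProvider(raw)) {
    throw new Error(
      `HIVE_MODEL_PROVIDER must be one of: ${Object.keys(PROVIDERS).join(", ")}`,
    );
  }
  const keyVar = PROVIDERS[raw].keyVar;
  const apiKey = env[keyVar];
  if (!apiKey) {
    throw new Error(`${keyVar} is not set`);
  }
  return { provider: raw, apiKey };
}
```

Passing the environment in as a plain object, rather than reading a global, keeps the function easy to test and makes the single point of configuration explicit.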

What this stack does not solve

A few things this stack does not handle well, all of which are on the roadmap.

Cross-region failover. Today, your colony lives in one Fly region. If that region goes down, your agents pause. Adding a hot standby in a second region is on the list but is not in flight yet.
On-prem. A customer who needs the whole thing inside their own VPC is asking for a different shape of product, which we will build when the contract is signed but not before.
GPU-resident models. Some customers want a self-hosted Llama or Qwen running next to the agents. The Fly machine shape supports it; we just have not built the runtime integration. On the list, behind a customer asking.

The shape will keep evolving. The point of writing this down is so the next thing we change has a documented baseline to change from.

Running agents on Fly and Supabase
