Otherix - Architecture
Otherix is a self-hosted control plane for running QEMU virtual machines across a fleet of Linux hypervisor nodes. This document is a high-level orientation for engineers with a DevOps/SRE background: what the moving parts are, how they talk, and where state lives.
What it is
Otherix manages the full lifecycle of VMs (create, start/stop, console, delete) on bare
hypervisors. A VM is created directly from a disk-image URL - there is no template or image
registry. Users own VMs; nodes, networks, and storage pools are shared infrastructure managed
by administrators. The system is operated through a kubectl-style CLI (otherix) that speaks a
REST API, including declarative multi-document YAML manifests (otherix create -f).
It is deliberately small in surface: no multi-tenancy, no external database, no Postgres/Redis, no libvirt. Just three Go binaries and embedded etcd.
The three processes
operator ── otherix (CLI) ──REST──▶ otherix-api ──embeds──▶ etcd
                                         │                  (single stateful store, in-process)
                                         │ mTLS over a cluster CA
                                         ▼
                                   otherix-agent (one per hypervisor node)
                                         │
                                         ▼
                              QEMU + Linux networking
otherix-api - the control plane. A single self-contained process that:
- serves the public REST API (JWT or otx_ API-token auth);
- embeds an etcd member in-process - the only stateful component;
- runs the async job dispatcher and periodic maintenance loops in the same process (there is no
separate scheduler or reconciler daemon);
- talks to agents over mutual TLS.
For HA, multiple otherix-api replicas self-cluster into one etcd Raft cluster and share work by
claiming jobs off the etcd-backed queue.
otherix-agent - the per-node daemon (Linux only). Owns the QEMU processes, the node's local
image cache, and the host networking data plane (bridges, tap devices, VXLAN overlay, WireGuard
mesh, nftables NAT). It keeps its own state on the local filesystem and never touches the control
plane's etcd - all coordination flows through a periodic heartbeat.
otherix - the CLI. Cobra-based, kubectl/gh ergonomics, multi-cluster config under
~/.otherix/config.
Build note: make build produces otherix-api, otherix-agent, and otherix. The control plane is one process; "scheduler" and "reconciler" are in-process loops, not separate binaries.
How state is split
There are two sources of truth, and keeping them separate is the core design choice:
| | Desired state | Observed state |
|---|---|---|
| What | What the user asked for: VM cpu/memory/disks, networks, pools | What is actually running: qemu pid, phase, metrics, cached images |
| Where | Control-plane etcd (internal/etcdstore) | On the agent's local disk |
| How it moves | Pushed down in each heartbeat response | Reported up in each heartbeat request |
The control plane writes desired state and the agent's reconcilers converge toward it,
reporting observed state back. A VM's API view shows both: top-level fields are what you want;
a derived status reflects what the agent last reported. This is the same desired/observed loop
Kubernetes uses, on a 30-second heartbeat.
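The desired/observed comparison can be sketched roughly as follows. This is a minimal illustration, not Otherix's wire format: the type names (DesiredVM, ObservedVM), the phase strings, and the Plan function are all hypothetical. Note that the sketch only flags VMs that are observed but not desired; it does not delete them.

```go
package main

import (
	"fmt"
	"sort"
)

// DesiredVM is what the control plane pushes down (hypothetical shape).
type DesiredVM struct {
	ID      string
	Running bool
}

// ObservedVM is what the agent reports up (hypothetical shape).
type ObservedVM struct {
	ID    string
	Phase string // assumed values: "running" or "stopped"
}

// Plan compares desired and observed state and returns the actions a
// node-side reconciler would take to converge, sorted for stable output.
func Plan(desired []DesiredVM, observed []ObservedVM) []string {
	phases := make(map[string]string, len(observed))
	for _, o := range observed {
		phases[o.ID] = o.Phase
	}
	var actions []string
	for _, d := range desired {
		phase, known := phases[d.ID]
		switch {
		case !known:
			actions = append(actions, "create "+d.ID)
		case d.Running && phase != "running":
			actions = append(actions, "start "+d.ID)
		case !d.Running && phase == "running":
			actions = append(actions, "stop "+d.ID)
		}
		delete(phases, d.ID)
	}
	// Anything left is running on the node but unknown to the control plane.
	for id := range phases {
		actions = append(actions, "flag "+id)
	}
	sort.Strings(actions)
	return actions
}

func main() {
	desired := []DesiredVM{{ID: "web", Running: true}, {ID: "db", Running: false}}
	observed := []ObservedVM{{ID: "db", Phase: "running"}}
	fmt.Println(Plan(desired, observed))
}
```

Keeping the comparison a pure function of the two inputs makes it easy to re-run on every heartbeat without carrying extra state.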
Storage: embedded etcd
etcd runs inside otherix-api (no network hop for the control plane's own reads/writes). It is the
single stateful service - there is no SQL database and no schema migrations; structure is enforced
in application code (internal/etcdstore).
- Keys are laid out as /otherix/<resource>/<id> (JSON values), with uniq/... keys for uniqueness (e.g. VM name, user email) and index/... keys for list ordering.
- Atomicity comes from etcd transactions: a row plus its uniqueness guards plus its async job all commit in one compare-and-set transaction. Large cleanup sweeps are chunked under etcd's per-transaction op limit.
- HA is etcd Raft. Replicas join as learners and are auto-promoted once caught up; peer (Raft) traffic is mutual-TLS, chained to the cluster CA.
- Backups are periodic etcd snapshots written to a configurable directory with retention.
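To illustrate the key layout and the chunking of large sweeps, here is a small sketch. The helper names (resourceKey, chunkKeys) are hypothetical, and the limit of 128 is etcd's default for --max-txn-ops; a given deployment may set a different value.

```go
package main

import "fmt"

// maxTxnOps mirrors etcd's default --max-txn-ops (128). This is an
// assumption about the deployment, not a value taken from Otherix.
const maxTxnOps = 128

// resourceKey builds a key in the /otherix/<resource>/<id> layout.
func resourceKey(resource, id string) string {
	return "/otherix/" + resource + "/" + id
}

// chunkKeys splits keys into batches of at most size entries so that each
// batch fits in a single transaction.
func chunkKeys(keys []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	var batches [][]string
	for len(keys) > 0 {
		n := size
		if len(keys) < n {
			n = len(keys)
		}
		batches = append(batches, keys[:n])
		keys = keys[n:]
	}
	return batches
}

func main() {
	keys := make([]string, 0, 300)
	for i := 0; i < 300; i++ {
		keys = append(keys, resourceKey("tasks", fmt.Sprintf("t%03d", i)))
	}
	for i, batch := range chunkKeys(keys, maxTxnOps) {
		fmt.Printf("txn %d: %d deletes\n", i, len(batch))
	}
}
```

Each batch is atomic on its own, but the sweep as a whole is not, so a chunked cleanup has to be safe to resume after a partial run.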
Cluster-wide settings (default pool, overlay supernet, VNI range, underlay MTU) live in a single etcd document, seeded once at boot, rather than being duplicated across each replica's config file.
Async operations and the job queue
Anything that touches a node or takes more than a moment is asynchronous:
- The API validates the request, writes a task plus a job to etcd in one transaction, and returns 202 Accepted with a task id.
- The in-process dispatcher claims the job (one replica wins via a compare-and-set), calls the owning agent, and polls it to completion.
- The client polls GET /v1/tasks/{id} (or uses --wait) until the task reaches success/failed/cancelled.
Sync 200 is reserved for genuinely fast operations (pause/resume, console-token issuance). VM
create/delete, start/stop/poweroff/reboot, and pool scans are all async.
The queue is etcd-backed (claim-by-revision, per-job attempt budget). Job redelivery is safe: a task carries the agent's task id, so a control-plane restart resumes polling instead of re-running the operation, and a committed-terminal task is never reopened.
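The claim-by-revision idea can be illustrated with an in-memory stand-in for the queue. Everything here is a sketch: the queue and job types are hypothetical, and a real implementation would express the same comparison as an etcd transaction on the key's mod revision rather than holding a mutex.

```go
package main

import (
	"fmt"
	"sync"
)

// job is a queue entry. Rev changes on every write, like an etcd mod revision.
type job struct {
	Owner    string
	Attempts int
	Rev      int64
}

// queue is an in-memory stand-in for the etcd-backed queue (hypothetical).
type queue struct {
	mu   sync.Mutex
	rev  int64
	jobs map[string]*job
}

func newQueue() *queue { return &queue{jobs: map[string]*job{}} }

// put enqueues a job and returns the revision at which it was written.
func (q *queue) put(id string) int64 {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.rev++
	q.jobs[id] = &job{Rev: q.rev}
	return q.rev
}

// claim succeeds only if the job is still at the revision the caller read
// and has attempts left. Any successful claim bumps the revision, so a
// second claimant holding the old revision loses.
func (q *queue) claim(id, replica string, seenRev int64, maxAttempts int) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	j, ok := q.jobs[id]
	if !ok || j.Rev != seenRev || j.Attempts >= maxAttempts {
		return false
	}
	q.rev++
	j.Owner, j.Attempts, j.Rev = replica, j.Attempts+1, q.rev
	return true
}

func main() {
	q := newQueue()
	rev := q.put("job-1")
	// Two replicas read the same revision and race to claim.
	fmt.Println("api-1:", q.claim("job-1", "api-1", rev, 3))
	fmt.Println("api-2:", q.claim("job-1", "api-2", rev, 3))
}
```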
Periodic loops handle node liveness (promote healthy / mark unreachable / mark gone by heartbeat freshness), expired-row cleanup, storage-pool scan triggers, and etcd backups.
Security model
- Users authenticate with passwords (argon2id) to get a short-lived JWT (15 min) plus a rotating refresh token; or with long-lived otx_ API tokens. Refresh tokens carry theft detection - reusing a revoked one burns the whole token family.
- RBAC has four fixed roles (admin, operator, developer, viewer) and a static permission matrix with own/any scopes. Cross-user resources return 404, never 403, so existence is never leaked.
- Agents authenticate to the control plane with mutual TLS. A per-cluster CA is generated on first boot; each node enrolls by submitting a CSR with a one-time join token, and its identity is its certificate CN (node-<name>). The control-plane API verifies agents by certificate fingerprint.
- Bootstrap is trust-on-first-use: an enrolling agent fetches the cluster CA over a pinned fingerprint, then all subsequent traffic is verified against that CA.
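The own/any scoping and the 404-not-403 rule from the RBAC bullet can be sketched as a lookup in a static matrix. The matrix entries, the action names, and the authorize function are hypothetical; only the four role names and the two scopes come from this document. Returning 403 when a caller lacks a permission on their own resource is also an assumption of the sketch.

```go
package main

import "fmt"

type scope int

const (
	scopeNone scope = iota // permission not granted
	scopeOwn               // granted on the caller's own resources
	scopeAny               // granted on every user's resources
)

// matrix maps role -> action -> scope. Entries are illustrative only.
var matrix = map[string]map[string]scope{
	"admin":     {"vm.read": scopeAny, "vm.delete": scopeAny},
	"operator":  {"vm.read": scopeAny, "vm.delete": scopeOwn},
	"developer": {"vm.read": scopeOwn, "vm.delete": scopeOwn},
	"viewer":    {"vm.read": scopeOwn},
}

// authorize returns the HTTP status for user (holding role) performing
// action on a resource owned by owner.
func authorize(role, action, user, owner string) int {
	switch matrix[role][action] {
	case scopeAny:
		return 200
	case scopeOwn:
		if user == owner {
			return 200
		}
	default:
		if user == owner {
			return 403 // the caller already knows the resource exists
		}
	}
	// Another user's resource: answer as if it did not exist.
	return 404
}

func main() {
	fmt.Println(authorize("developer", "vm.read", "alice", "alice"))
	fmt.Println(authorize("developer", "vm.read", "alice", "bob"))
	fmt.Println(authorize("viewer", "vm.delete", "alice", "alice"))
}
```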
Networking (agent data plane)
Each node's agent programs the host network directly via netlink/nftables (no libvirt, no OVS):
- a Linux bridge per managed network, with tap devices wired into QEMU;
- a VXLAN overlay for cross-node VM traffic, with a controller-authoritative forwarding database (the control plane computes the FDB; agents do not learn);
- a WireGuard mesh carrying the overlay between nodes (the control plane recomputes the full peer set each heartbeat; agents just apply it);
- nftables masquerade for VM egress to the internet.
MTUs are sized for the encapsulation stack (1500 underlay, 1440 WireGuard, 1390 overlay).
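The MTU figures follow from subtracting each layer's encapsulation overhead. The sketch below reproduces the arithmetic, assuming an IPv4 underlay (each outer IPv6 header would cost 20 bytes more); the constant and function names are hypothetical.

```go
package main

import "fmt"

const (
	// Outer IPv4 (20) + UDP (8) + WireGuard data header and auth tag (32).
	wireGuardOverhead = 60
	// Outer IPv4 (20) + UDP (8) + VXLAN header (8) + inner Ethernet (14).
	vxlanOverhead = 50
)

// stackMTUs derives the WireGuard and overlay MTUs from the underlay MTU.
func stackMTUs(underlay int) (wireGuard, overlay int) {
	wireGuard = underlay - wireGuardOverhead
	overlay = wireGuard - vxlanOverhead
	return wireGuard, overlay
}

func main() {
	wg, overlay := stackMTUs(1500)
	fmt.Printf("underlay=1500 wireguard=%d overlay=%d\n", wg, overlay)
}
```

With a 1500-byte underlay this yields 1440 for WireGuard and 1390 for the overlay, matching the values above.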
Overlay egress (anycast gateway + DNS). VMs on a private overlay reach the internet through
per-node SNAT - the model many Kubernetes CNI plugins use: each node masquerades its own overlay traffic out its
uplink, so there is no central egress gateway to bottleneck or fail. The subtlety is live migration:
a VM has to keep working after it moves to another node, so everything it depends on is made
anycast - identical on every node. Its default gateway is a fixed link-local address
(169.254.1.1, with a deterministic per-overlay MAC) present on every node's overlay bridge, so the
local node always answers ARP and routes the VM's egress no matter where it runs (being link-local,
it also costs no address from the tenant's subnet). DNS is served at that same address by a small
per-node forwarder that relays to whatever upstream resolver the node itself uses. The guiding rule:
a VM is only ever handed anycast addresses (gateway, DNS), never a node-specific one - so nothing it
caches breaks when it migrates. (The node also installs a route for the overlay subnet back to the
bridge, so return traffic finds the VM; egress is opt-in per network.)
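One way to obtain a deterministic per-overlay gateway MAC is to hash a stable overlay identifier. The sketch below is an assumption about how such a value could be derived, not a description of Otherix's actual scheme; gatewayMAC and the identifier format are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"net"
)

// gatewayMAC derives a stable MAC address from an overlay identifier.
// Every node computing it for the same overlay gets the same address.
func gatewayMAC(overlayID string) net.HardwareAddr {
	sum := sha256.Sum256([]byte(overlayID))
	mac := make(net.HardwareAddr, 6)
	copy(mac, sum[:6])
	// Set the locally administered bit and clear the multicast bit so the
	// result is a valid unicast address outside any vendor's OUI space.
	mac[0] = (mac[0] | 0x02) &^ 0x01
	return mac
}

func main() {
	fmt.Println(gatewayMAC("overlay-100"))
	fmt.Println(gatewayMAC("overlay-100")) // same input, same MAC
}
```

Because the address depends only on the overlay identifier, a migrated VM's cached ARP entry for 169.254.1.1 would stay valid on the destination node.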
VM lifecycle on a node
QEMU is driven directly (no libvirt), controlled over a QMP socket. The agent:
- materializes the disk image from a per-pool, basename-keyed cache with IfNotPresent semantics and
optional SHA-256 enforcement;
- builds a NoCloud cloud-init seed ISO when user-data is supplied;
- launches qemu-system-{x86_64,aarch64} (KVM when /dev/kvm is present, TCG otherwise);
- exposes the serial console over a WebSocket bridge for otherix vm console.
Lifecycle operations are guarded per-VM so concurrent requests can't race, and graceful stop never force-kills - destructive actions fail toward inaction.
Repository map
| Path | What lives there |
|---|---|
| cmd/{api,agent,cli} | The three binary entry points |
| internal/api | REST server, router, middleware, handlers, response envelope |
| internal/etcd | Embedded etcd runtime, clustering, backups |
| internal/etcdstore | The control-plane store over etcd (key schema, transactions) |
| internal/store | Shared row/params/result types and error sentinels |
| internal/auth | Passwords, JWT, tokens, RBAC, CSR signing |
| internal/worker | Async dispatcher + periodic scheduler |
| internal/agent | Node runtime: VM manager, QEMU, netfabric, heartbeat, reconcilers |
| internal/agentapi | Generated CP-to-agent API (from api/openapi/agent.yaml) |
| internal/config | koanf-based config (env prefix OTHERIX_) |
| api/openapi | The two API contracts (control-plane + agent) |
| deploy/, dev/ | Container images, example configs, Lima dev tooling, smoke tests |
Operational notes
- Config is YAML plus environment overrides (OTHERIX_ prefix, __ for nesting). The control plane needs a data directory for etcd and a place for its CA and certs; the agent is configured by the one-shot otherix-agent bootstrap command and then runs otherix-agent serve.
- Images: control plane is distroless; the agent image is for dev/CI only (running real VMs needs host KVM and privileges).
- Tests: unit tests run with make test; etcd-backed integration tests (store + API end-to-end, all embedding etcd in-process, no Docker) with make test-etcd; the Linux data-plane suite (real bridges/taps/VXLAN/WireGuard in network namespaces) with make test-netfabric.
- Architectures: amd64 and arm64. The agent is Linux only; macOS is supported as a development platform via Lima.
- Upgrade ordering (CP before agents): the heartbeat receiver decodes with DisallowUnknownFields, so the heartbeat wire contract is neither forward- nor backward-compatible across a field add or removal. When a release changes the heartbeat shape, upgrade the control plane first, then the agents. During the window a not-yet-upgraded agent may be rejected (400) and demoted ready -> unreachable -> gone; this is non-destructive (running VMs are preserved, vm_runtime is never orphaned by the demotion) and self-heals once the agent is upgraded. Upgrading agents first can break heartbeats against the old CP.
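The strictness behind the upgrade-ordering rule comes from Go's encoding/json: a decoder with DisallowUnknownFields enabled returns an error for any field the target struct does not declare. The sketch below demonstrates that behaviour; the heartbeatV1 struct and its fields are hypothetical, not the real heartbeat contract.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// heartbeatV1 is a hypothetical heartbeat shape known to an older receiver.
type heartbeatV1 struct {
	Node string `json:"node"`
	VMs  int    `json:"vms"`
}

// decodeStrict decodes body and rejects any field heartbeatV1 does not declare.
func decodeStrict(body []byte) (heartbeatV1, error) {
	var hb heartbeatV1
	dec := json.NewDecoder(bytes.NewReader(body))
	dec.DisallowUnknownFields()
	err := dec.Decode(&hb)
	return hb, err
}

func main() {
	_, err := decodeStrict([]byte(`{"node":"n1","vms":2}`))
	fmt.Println("same version:", err)

	// A sender one release ahead adds a field the receiver has never seen.
	_, err = decodeStrict([]byte(`{"node":"n1","vms":2,"gpu_count":1}`))
	fmt.Println("newer sender:", err)
}
```

The second decode fails even though every field the receiver needs is present, which is why a shape change on either side has to be rolled out in a fixed order.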