Certificates¶
Otherix runs its own internal PKI. A single cluster CA anchors three certificate types: the etcd peer (Raft) certs that secure replica-to-replica traffic, the per-replica control-plane server certs, and the per-node agent certs. All of it is provisioned automatically on first boot - operators normally do not touch a certificate file. This page documents the lifecycle so you can reason about trust, expiry, and recovery.
The cluster CA¶
The cluster CA is the root of trust for everything below it.
- Algorithm / validity: ECDSA P-384, self-signed, ~10-year validity, CN=otherix-cluster-ca, marked as a CA.
- Generated once, on first boot of a single or bootstrap replica.
- Stored in two places that must stay consistent:
  - On disk, at cluster_ca.cert_file / cluster_ca.key_file (default /var/lib/otherix/ca/cluster-ca.{crt,key}). The CA must be on disk before etcd starts, because the peer-mTLS plane needs a CA-signed cert pre-start.
  - In etcd, as the active ca_certs row. On boot the api-server syncs the on-disk CA into etcd (BootstrapClusterCA) so the /v1/ca endpoint and the node-join CSR signer have an active row.
On a single/bootstrap node the on-disk CA is generated on first boot and
reloaded on every restart. A join node does not mint its own CA: it fetches
the existing cluster's CA over the join protocol (see below) and persists it
before its etcd member starts.
Joiner CA fetch (trust on first use)¶
A joining replica with no CA on disk fetches it from an existing replica via
POST /v1/cluster/join. The transport uses trust-on-first-use: the connection
skips chain verification (the target's serving cert does not chain to the CA being
fetched), and the joiner instead pins the returned CA against an operator-supplied
ca_fingerprint (SHA-256 hex). A mismatch aborts the join. This is the same TOFU
model agents use to bootstrap.
On-disk CA and etcd CA must agree
If the on-disk CA fingerprint does not match the active ca_certs row in etcd,
the api-server refuses to start (CA divergence) rather than serve two trust
roots. This happens if you wipe the etcd data dir but keep an old on-disk CA
(or vice-versa), or restore an etcd snapshot whose CA differs from the on-disk
files. Keep the two in sync. For dev, running make etcd-reset wipes the etcd data dir
and the on-disk CA together so the next boot regenerates a matched pair.
See Backups for restore-time guidance.
Per-replica control-plane cert¶
Each replica presents one leaf cert on both its inbound agent listener and its
outbound dials to agents (ExtKeyUsage serverAuth + clientAuth). The loader
(LoadOrGenerateCPCert) selects one of three modes:
- Mode A - operator files. Set cp_cert.cert_file and cp_cert.key_file to use an externally-issued cert (corporate CA, Let's Encrypt, FIPS scenarios). Both must be set; a configured file that is missing is fatal. The cluster CA is still loaded from etcd to validate inbound agent client certs.
- Mode B - local cache. With cp_cert.local_cache.enabled: true, a cert cached at /var/lib/otherix/certs/cp-cert.{crt,key} is reused across restarts when it still chains to the current CA, is not near expiry, and its SANs cover the expected set. Otherwise it falls through to Mode C and the regenerated cert is re-cached.
- Mode C - auto-generate (default). A fresh ECDSA P-384 leaf signed by the cluster CA, ~365-day validity (cp_cert.validity, minimum 24h), CN=otherix-cp-replica.
SANs¶
The auto-generated cert's SAN set is the auto-detected baseline - localhost,
127.0.0.1, os.Hostname(), and the non-wildcard listen address - unioned with
any cp_cert.additional_sans you configure, deduped. The agent must be able to
reach the CP at a name or IP present in this SAN set. If agents dial the CP by a
hostname or VIP that auto-detection misses, add it to cp_cert.additional_sans.
etcd peer (Raft) certs¶
Replica-to-replica etcd traffic uses mutual TLS. Before the member starts, the
api-server provisions peer material from the on-disk cluster CA
(ProvisionPeerCert):
- Auto-generate (default): a fresh peer leaf signed by the cluster CA, written under etcd.peer_auto_dir (default /var/lib/otherix/peer/), with the CA cert materialized as the peer trust file. SANs derive from the member's peer_url host plus the localhost baseline.
- Operator override: set all three of etcd.peer_cert_file, etcd.peer_key_file, etcd.peer_ca_file to pre-distribute peer certs. The files must exist and the leaf must chain to the cluster CA (verified at boot, so a wrong-CA cert fails fast instead of being rejected opaquely at Raft-handshake time).
Peer URLs are therefore https whenever peer mTLS material is set, which is
always in production.
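The two provisioning paths can be sketched in Go as a pair of small helpers: one derives the peer SAN set from a member's peer URL, the other decides between auto-generate and operator override. Both are illustrative only. The names peerSANs and peerMode are hypothetical, and treating a partially configured override as an error is an assumption; the page only states that all three settings are needed for the override.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// peerSANs derives the peer-cert SAN set from a member's peer URL host plus
// the localhost baseline.
func peerSANs(peerURL string) ([]string, error) {
	u, err := url.Parse(peerURL)
	if err != nil {
		return nil, err
	}
	host := u.Hostname() // strips the port and any IPv6 brackets
	if host == "" {
		return nil, fmt.Errorf("peer URL %q has no host", peerURL)
	}
	sans := []string{"localhost", "127.0.0.1"}
	if host != "localhost" && host != "127.0.0.1" {
		sans = append(sans, host)
	}
	return sans, nil
}

// peerMode picks between auto-generate and operator override. Rejecting a
// partially configured override is an assumption of this sketch.
func peerMode(certFile, keyFile, caFile string) (string, error) {
	set := 0
	for _, f := range []string{certFile, keyFile, caFile} {
		if f != "" {
			set++
		}
	}
	switch set {
	case 0:
		return "auto-generate", nil
	case 3:
		return "operator-override", nil
	default:
		return "", errors.New("set all three of etcd.peer_cert_file, etcd.peer_key_file, etcd.peer_ca_file, or none")
	}
}

func main() {
	sans, err := peerSANs("https://etcd-1.example.internal:2380") // example URL
	fmt.Println(sans, err)

	mode, err := peerMode("", "", "")
	fmt.Println(mode, err)

	_, err = peerMode("/path/to/peer.crt", "", "") // only one of three set
	fmt.Println(err)
}
```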
Node (agent) certs¶
Each agent gets a client cert when it joins, via the node-join CSR flow:
- The agent generates a keypair and submits a CSR to POST /v1/nodes/join with a plaintext join token.
- The CP validates the token and signs the CSR with the cluster CA. The cert template is server-authoritative: CN=node-<name>, SAN includes node-<name>.agents.otherix.local, ~1-year validity. The CSR's own Subject/SAN are ignored (defense against CN injection).
- Agent identity is its cert CN. The agent parses node-<name> from its own cert at startup to learn its node name; the CP binds heartbeats to that identity.
The agent persists its cert material under /var/lib/otherix/certs/ and reuses it
across restarts. Bootstrap is a one-time event; routine agent restarts do not
re-issue.
Rotation¶
Rotation is not yet automated
There is no automatic certificate rotation. When a cert approaches expiry the operator re-issues it manually:
- CP replica cert: in Mode C it regenerates every boot, so restarting the replica issues a fresh leaf. In Mode B, delete the cached cert and restart. In Mode A, replace the operator files and restart.
- Node cert: re-run the agent bootstrap to obtain a fresh node cert (delete the existing cert material first; see Troubleshooting).
- Cluster CA / peer cert: CA rotation is planned but not implemented. The 10-year CA validity makes this a long-horizon concern; peer certs regenerate from the CA on each boot.
Track cert expiry out of band (e.g. monitor not_after from the boot logs, or
inspect a member with openssl s_client).