A practical reference for the failure modes operators hit most. Each row gives the
likely cause and how to inspect. Most diagnostics start from one of:

- otherix node list / otherix node get <name>: node status, last heartbeat.
- otherix vm get <name>: VM status and, for a failed async operation, the task error code and message.
- otherix pool list / otherix pool get <name>: pool status, available space.
- The task error itself: async operations (vm create/delete, lifecycle, scan) report a stable code and a human-readable message. The CLI prints CP-side failures as api_error: <code>: <message>.
- journalctl -u otherix-agent on the node: agent-side bootstrap, heartbeat, and qemu logs.
- The api-server logs: cluster, CA, and worker activity.
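
When scripting around the CLI, the api_error: <code>: <message> line can be split into its stable code and its message. The following is a minimal sketch, assuming the line has exactly the shape shown above. `ApiError` and `parse_api_error` are hypothetical helper names, not part of otherix, and the sample message text is invented for illustration.

```python
from typing import NamedTuple, Optional


class ApiError(NamedTuple):
    code: str
    message: str


def parse_api_error(line: str) -> Optional[ApiError]:
    """Split 'api_error: <code>: <message>' into code and message.

    Returns None when the line does not start with the api_error prefix.
    """
    prefix = "api_error:"
    text = line.strip()
    if not text.startswith(prefix):
        return None
    rest = text[len(prefix):].strip()
    # Split on the first colon only, so the message may itself contain colons.
    code, sep, message = rest.partition(":")
    return ApiError(code=code.strip(), message=message.strip() if sep else "")


# Hypothetical sample line; only the code is taken from this reference.
err = parse_api_error("api_error: node_not_ready: target node is not ready")
print(err.code)
```

Branching on the code rather than the message keeps scripts stable, since only the code is described as stable.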
| Symptom | Likely cause | How to inspect / recover |
| --- | --- | --- |
| Agent registered (CSR redeemed) but heartbeat not arriving | Usually an mTLS handshake failure, or the agent cannot reach the CP | journalctl -u otherix-agent -f for heartbeat lines or TLS handshake errors. Confirm the agent reaches the CP at a name/IP in the CP cert SAN set (see Certificates); add missing names to cp_cert.additional_sans |
| Node flips ready -> unreachable | Heartbeats stopped arriving for longer than the stale threshold (default 90s) | Check the agent process is alive and the network path to the CP. The node returns to ready automatically once heartbeats resume |
| Node advances to gone | Unreachable past the gone-grace window (default 5m) | Recover the agent and restart it; if the node is permanently retired, no action needed |
| agent_unreachable on an operation | The CP could not reach the agent's mTLS endpoint when dispatching work | Verify the agent is running and its advertised endpoint is reachable from the CP. Check otherix node get <name> for last heartbeat |
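
The ready, unreachable, and gone transitions above depend only on the age of the last heartbeat, so they can be estimated from the timestamp that otherix node get reports. This is a rough sketch, not the control plane's actual logic. It uses the documented defaults (90s stale threshold, 5m gone-grace) and assumes the grace window starts when the node becomes unreachable. `node_status` is a hypothetical helper, and other node states (for example cordoned) are ignored.

```python
from datetime import datetime, timedelta, timezone

STALE_THRESHOLD = timedelta(seconds=90)  # documented default
GONE_GRACE = timedelta(minutes=5)        # documented default


def node_status(last_heartbeat: datetime, now: datetime) -> str:
    """Estimate the node status from the age of its last heartbeat."""
    age = now - last_heartbeat
    if age <= STALE_THRESHOLD:
        return "ready"
    # Assumption: the grace window is measured from the moment the node
    # crossed the stale threshold, not from the last heartbeat itself.
    if age <= STALE_THRESHOLD + GONE_GRACE:
        return "unreachable"
    return "gone"


now = datetime.now(timezone.utc)
print(node_status(now - timedelta(seconds=120), now))
```

If either default has been changed in the cluster configuration, adjust the two constants to match.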
| Symptom | Likely cause | How to inspect / recover |
| --- | --- | --- |
| api_error: unauthenticated from the CLI | Access token expired (JWTs default to a 15-min TTL) or the stored token was revoked | Re-login, or use a long-lived otx_* API token (otherix config add cluster stores one). If the cluster CA rotated, refresh the stored credential |
| otherix vm get shows status: creating indefinitely | The owning agent is unreachable, so the create task cannot progress | Check otherix node get <node> for the node's status / last heartbeat; confirm the agent endpoint is reachable from the CP |
| Task error qemu_spawn_failed | qemu could not launch on the node. Commonly KVM is unavailable (e.g. nested virt not exposed), so the agent falls back to TCG software emulation; a malformed cmdline can still fail | Read the task error message. On the node, journalctl -u otherix-agent shows the qemu invocation. TCG is functional but slow; provision KVM for real workloads |
| Task error image_unavailable | The agent could not materialize the image URL onto the chosen pool during create (download or cache write failed) | Read the task error; the underlying agent fetch failure surfaces under its own code (below). Fix the cause and retry |
| Task error checksum_mismatch | The downloaded image's SHA-256 does not match the --image-sha256 you supplied (or the cached sidecar) | The URL is serving different bytes than expected. Verify the URL and checksum, then retry |
| Task error download_failed | The agent could not fetch the image URL (network, 404, auth) | Confirm the URL is reachable from the node. Check agent logs for the HTTP error, fix, and retry |
| Task error node_not_ready | The target node is not in ready state at dispatch | otherix node list; wait for the node to reach ready, or remove a --node hint pinning an unavailable node |
| Task error vm_create_failed / vm_delete_failed | Catch-all for a create/delete failure not covered by a more specific code | Read the task error message for the underlying cause |
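
For checksum_mismatch, it helps to download the image once by hand and compute its SHA-256 before retrying. A minimal sketch using only the Python standard library; `sha256_of` and `matches_expected` are illustrative helper names, and the expected digest is whatever value was passed as --image-sha256.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: Path, expected_hex: str) -> bool:
    """Compare the file's digest with the expected value, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()
```

If the digest of the manually downloaded file differs from the supplied value, either the URL now serves different bytes or the checksum was recorded incorrectly.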
| Symptom | Likely cause | How to inspect / recover |
| --- | --- | --- |
| | | otherix cluster set-default-pool <name>, or pass --pool explicitly |
| pool_not_found | The named pool does not exist anywhere in the cluster | otherix pool list to see registered pools; create it or fix the name |
| pool_not_on_node on vm create --node X | The pool name has no instance on node X | otherix pool list --node X; pick a different node, register the pool on X, or drop the --node hint |
| pool_name_ambiguous | The pool name resolves to multiple instances and the request did not disambiguate | Specify the node (--node) so a single pool instance is selected |
| no_eligible_nodes on vm create | The pool exists but every hosting node is cordoned, unreachable, not yet ready, or out of capacity (CPU / memory / disk, including node-pressure exclusion) | otherix node list and otherix pool list for status and free space. Uncordon a node, free capacity, or add a node. The 409 payload lists per-candidate utilization |
| path_not_allowed on pool create | The pool path is not under any storage_pools.allowed_path_prefixes entry | Pick a path under an existing prefix, or widen the allowlist in api.yaml |
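
The three pool lookup codes above differ only in how a pool name and an optional node hint match the registered pool instances. The sketch below models that distinction to make it easier to reason about. It is an illustration, not the server's resolver: `PoolInstance` and `resolve_pool` are hypothetical names, and it assumes every lookup without a node hint that finds multiple instances counts as ambiguous.

```python
from typing import List, NamedTuple, Optional


class PoolInstance(NamedTuple):
    name: str
    node: str


def resolve_pool(instances: List[PoolInstance], name: str,
                 node: Optional[str] = None) -> str:
    """Return 'ok' or the error code the lookup would map to."""
    named = [p for p in instances if p.name == name]
    if not named:
        return "pool_not_found"        # no instance anywhere in the cluster
    if node is not None:
        on_node = [p for p in named if p.node == node]
        return "ok" if on_node else "pool_not_on_node"
    if len(named) > 1:
        return "pool_name_ambiguous"   # several instances, no node hint
    return "ok"


pools = [PoolInstance("fast", "n1"), PoolInstance("fast", "n2"),
         PoolInstance("bulk", "n1")]
print(resolve_pool(pools, "fast"))
```

Passing a node, the equivalent of --node, is what turns the ambiguous case into a single selected instance.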
These surface during join-node startup or agent bootstrap; inspect the
api-server logs (joiner) or journalctl -u otherix-agent (agent).
| Symptom | Likely cause | How to inspect / recover |
| --- | --- | --- |
| cluster CA fingerprint mismatch (joiner) or bootstrap: CA fingerprint mismatch (agent) | The pinned ca_fingerprint does not match the CA the server returned: an operator typo or, rarely, a MITM | Re-check the pinned fingerprint against the live cluster CA (mint a fresh join token to read the current fingerprint). If it genuinely differs, treat as a security event |
| token_expired / HTTP 401 token_expired | Join token TTL elapsed | Mint a fresh join token and retry |
| token_exhausted | A multi-use join token hit its max_uses cap | Mint a fresh token |
| cluster CA divergence: on-disk CA ... does not match active etcd CA | The on-disk cluster CA and the active ca_certs row in etcd disagree (e.g. wiped etcd data dir with a stale on-disk CA, or a restore mismatch) | Make the two consistent: restore the matching CA or, for dev, run make etcd-reset, which wipes both together. See Certificates and Backups |
| Joiner registered as learner but never becomes a voter | The learner has not caught up, or the promote loop cannot reach quorum | Watch GET /v1/cluster/members; the ~15s promote loop converts caught-up learners automatically. Check the joiner's etcd logs for replication progress |
| cert <path> exists but key <path> missing (agent) | Partial bootstrap state: a mid-flight crash or manual file deletion left a cert without its key (or vice versa) | Delete the orphaned cert + key + CA material under the agent's cert dir (/var/lib/otherix/certs/), then re-run the agent bootstrap. Identity is derived from the cert CN, so there is no sidecar file to clean up |
| connection refused fetching /v1/ca or /v1/cluster/join | The target CP is not running or not reachable from the joining host | Confirm the CP is up (curl -k https://<cp-host>:<port>/healthz) and that the join host can reach it |
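
When chasing a CA fingerprint mismatch, recomputing the fingerprint from the CA certificate rules out a typo in the pinned value. The exact fingerprint format otherix uses is not specified in this reference. The sketch below assumes a hex-encoded SHA-256 over the DER form of the certificate, which is a common convention but must be confirmed for your deployment. `ca_fingerprint` and `fingerprint_matches` are hypothetical helper names.

```python
import hashlib
import hmac
import ssl


def ca_fingerprint(pem_cert: str) -> str:
    """Hex SHA-256 of the DER-encoded certificate (assumed format)."""
    der = ssl.PEM_cert_to_DER_cert(pem_cert.strip())
    return hashlib.sha256(der).hexdigest()


def fingerprint_matches(pem_cert: str, pinned: str) -> bool:
    """Compare a pinned fingerprint with the one computed from the PEM."""
    # Tolerate upper case and colon separators in the pinned value.
    normalized = pinned.strip().lower().replace(":", "")
    return hmac.compare_digest(ca_fingerprint(pem_cert), normalized)
```

A mismatch that survives this check, with the correct certificate and the correct pinned value, should be handled as the security event described above rather than worked around.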
Liveness / readiness: /healthz (process up) and /readyz (dependencies
reachable) sit outside /v1/. A 503 from /readyz means the api-server cannot
reach its store.
Cluster membership: GET /v1/cluster/members shows voters and learners -
the first thing to check on any HA / quorum issue. See
High availability.
Worker activity: if async tasks stay pending forever, confirm
workers.enabled: true and that the agent client is provisioned - a CP with
workers disabled accepts tasks but never runs them.
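
The liveness and readiness checks described above can be combined into one quick triage step. This is a minimal sketch using only the standard library. It assumes both endpoints answer 200 when healthy, which this reference does not state explicitly; only the 503 from /readyz is documented. `probe` and `classify` are hypothetical helper names.

```python
import ssl
import urllib.error
import urllib.request
from typing import Optional


def probe(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status for url, or None if the connection failed."""
    # Like `curl -k`: certificate verification is off, for a quick check only.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # non-2xx responses, e.g. 503 from /readyz
    except (urllib.error.URLError, OSError):
        return None      # refused, timed out, DNS failure, ...


def classify(healthz: Optional[int], readyz: Optional[int]) -> str:
    """Turn the two probe results into a one-line diagnosis."""
    if healthz is None:
        return "api-server not running or not reachable"
    if healthz != 200:
        return f"unexpected /healthz status {healthz}"
    if readyz == 503:
        return "process up, but the api-server cannot reach its store"
    if readyz == 200:
        return "ready"
    return f"unexpected /readyz status {readyz}"
```

In practice, call `probe` once for https://<cp-host>:<port>/healthz and once for the matching /readyz URL, then pass both results to `classify`.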