Scheduler configuration guide¶
Otherix's VM placement scheduler ranks candidate (node, pool) targets
by a kubernetes-style LeastAllocated score across the enabled resource
dimensions. This document covers the per-resource knobs operators
expose in placement.resources and the safety trade-offs of memory
and disk overcommit.
Companion references:
- deploy/config/api.example.yaml: annotated production config.
- dev/config/api.yaml: strict defaults, parallel structure.
Per-resource placement settings¶
Each of cpu, memory, and disk is configured independently:
placement:
  algorithm: "resource_aware"   # or "least_vm_count"
  resources:
    cpu:
      enabled: true
      overcommit_ratio: 1.0
    memory:
      enabled: true
      overcommit_ratio: 1.0
    disk:
      enabled: true
      overcommit_ratio: 1.0
- enabled (default true): include the resource in the fit check and the scoring formula. Disabling drops the resource from both. The LeastAllocated denominator floats with the count of enabled dimensions, so disabling one does not bias the score scale.
- overcommit_ratio (default 1.0): multiplier on the per-resource effective availability:
  - 1.0: strict, no overcommit (production-safe default).
  - > 1.0: overcommit, capacity inflated.
  - < 1.0: reserve headroom, capacity deflated.
Validation rejects overcommit_ratio <= 0 at startup. Validation
warnings — not errors — are logged at startup for any > 1.0
configuration so operators see the trade-off they opted into on each
api-server restart.
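The ratio semantics above can be sketched in a few lines. This is a minimal illustration, not Otherix's implementation: the function names are hypothetical, and it assumes the ratio is applied as a plain multiplier on the raw availability reported for a resource.

```python
def validate_overcommit_ratio(resource: str, ratio: float) -> None:
    """Mirror the startup rule: a ratio <= 0 is a hard configuration error."""
    if ratio <= 0:
        raise ValueError(
            f"placement.resources.{resource}.overcommit_ratio must be > 0, got {ratio}"
        )


def effective_available(raw_available: float, ratio: float) -> float:
    """Assumed model: availability is scaled by the ratio (inflated above 1.0, deflated below)."""
    return raw_available * ratio


def fits(requested: float, raw_available: float, ratio: float) -> bool:
    """Fit check against the scaled availability."""
    return requested <= effective_available(raw_available, ratio)
```

Under this model a 20 GiB request does not fit on a node with 16 GiB free at ratio 1.0, but does fit at ratio 1.5, which is exactly the trade-off the startup warning calls out.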
Startup warnings¶
otherix-api emits a slog Warn line at startup for each per-resource
overcommit and for the all-resources-disabled scenario:
WARN placement config warning="placement.resources.memory.overcommit_ratio=1.50 — overcommit enabled (OOM kill risk under memory pressure); see docs/scheduler-configuration.md"
Ratios above 2.0 use stronger "extreme overcommit" language. All
three resources disabled emits a separate warning that scoring degrades
to count-based fallback (Least-VM-count parity without switching the cluster
algorithm).
Default config (every resource enabled at ratio 1.0) emits zero warnings — the absence on subsequent restarts confirms the cluster is running with strict accounting.
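As a rough illustration of the warning rules described in this section, the sketch below derives the warning list from a parsed resources mapping. The function name and message wording are hypothetical (only the sample log line above is from the real output), and it assumes a ratio warning fires whether or not the resource is enabled.

```python
def placement_warnings(resources: dict) -> list:
    """Return one warning per overcommitted resource, plus one if nothing is enabled."""
    warnings = []
    for name, cfg in resources.items():
        ratio = cfg.get("overcommit_ratio", 1.0)
        key = f"placement.resources.{name}.overcommit_ratio={ratio:.2f}"
        if ratio > 2.0:
            # Ratios above 2.0 get the stronger wording.
            warnings.append(f"{key}: extreme overcommit")
        elif ratio > 1.0:
            warnings.append(f"{key}: overcommit enabled")
    if not any(cfg.get("enabled", True) for cfg in resources.values()):
        warnings.append("all placement resources disabled: scoring falls back to VM count")
    return warnings


default_config = {
    "cpu": {"enabled": True, "overcommit_ratio": 1.0},
    "memory": {"enabled": True, "overcommit_ratio": 1.0},
    "disk": {"enabled": True, "overcommit_ratio": 1.0},
}
print(placement_warnings(default_config))  # the default config yields no warnings
```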
Memory overcommit safety¶
memory.overcommit_ratio > 1.0 enables placement decisions that pack
VMs whose combined memory requests exceed physical host RAM. This
is dangerous:
- VMs allocate memory lazily. Actual usage trails allocation, so combined working sets often fit even during aggressive overcommit.
- BUT when cumulative actual usage exceeds (physical RAM + swap), the Linux OOM killer reaps a process — typically QEMU — to recover memory. Affected VMs crash abruptly.
- No graceful degradation. The VM's guest OS does not get notified; qemu just disappears.
Host configuration before enabling memory overcommit¶
- Provision adequate swap. Rule of thumb: swap ≥ (overcommit_ratio - 1) × RAM. Example: 16 GiB RAM with ratio 1.5 → at least 8 GiB swap recommended. Swap I/O is far slower than RAM, so this is a safety net, not a performance feature.
- Review the vm.overcommit_memory sysctl on each agent host:
  - 0 (default, heuristic): kernel decides per allocation. Usually fine for moderate ratios (≤ 1.5) on well-provisioned hosts.
  - 1 (always allow): kernel never refuses an allocation. Risky; defers OOM to actual fault time.
  - 2 (strict accounting): kernel refuses allocations that would push commit charge past swap + vm.overcommit_ratio% × RAM. Predictable but requires careful sizing of the kernel vm.overcommit_ratio sysctl (distinct from Otherix's per-resource overcommit_ratio config).
The conservative choice for production-ish Otherix homelabs is
either 0 or 2 with the kernel sysctl deliberately tuned. Do not
set 1.
- Monitor swap usage continuously. Frequent swap activity is a leading indicator of memory pressure approaching kernel-OOM thresholds. Plumb host swap into your metrics pipeline before enabling overcommit.
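The sizing arithmetic from the list above, written out as a small sketch. The helper names are hypothetical, and the commit-limit formula is the simplified form quoted in this guide (it ignores the huge-page adjustment the kernel also applies).

```python
def min_swap_gib(ram_gib: float, overcommit_ratio: float) -> float:
    """Rule of thumb: swap >= (overcommit_ratio - 1) x RAM, never negative."""
    return max(0.0, (overcommit_ratio - 1.0) * ram_gib)


def commit_limit_gib(ram_gib: float, swap_gib: float, kernel_ratio_percent: float) -> float:
    """Approximate commit limit under vm.overcommit_memory=2:
    swap + vm.overcommit_ratio% x RAM."""
    return swap_gib + ram_gib * kernel_ratio_percent / 100.0


# 16 GiB host at Otherix memory ratio 1.5 -> 8.0 GiB swap recommended.
print(min_swap_gib(16, 1.5))
# Same host with 8 GiB swap and the kernel sysctl at 50 -> 16.0 GiB commit limit.
print(commit_limit_gib(16, 8, 50))
```

Note that in this example the kernel commit limit (16 GiB) is well below the 24 GiB the scheduler would be willing to place at ratio 1.5, which is why mode 2 needs the kernel sysctl sized deliberately.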
Recommended ratios¶
| Use case | CPU | Memory | Disk |
|---|---|---|---|
| Production-grade isolation | 1.0 | 1.0 | 1.0 |
| Homelab dev VMs (idle-heavy) | 2.0 | 1.2 | 1.1 |
| Testing / experiments / scratch | 4.0 | 1.5 | 1.5 |
CPU overcommit at typical ratios (≤ 4.0) is benign — Linux schedulers handle vCPU oversubscription gracefully. The performance trade-off (context-switch overhead, noisy neighbour) is recoverable.
Memory overcommit above 1.5 is not recommended outside throwaway test fleets. Disk overcommit above 1.5 is risky for any VM whose disks may grow toward their allocated capacity.
Disk overcommit considerations¶
disk.overcommit_ratio > 1.0 lets Otherix over-subscribe pool capacity
based on the assumption that sparse qcow2 disks consume less than their
allocated size. Risks:
- VMs that write heavily can grow toward their allocated capacity. Multiple VMs hitting that limit simultaneously can fill the pool.
- Pool scans (default 15 min, configurable via workers.storage_pool_scan) detect filling pools but do not prevent placement decisions made before the scan lands.
- Cascading "no space left on device" errors corrupt guest filesystems and can wedge running VMs.
The pool_effective_capacity view already subtracts pending-VM disk
commitments from scan-reported availability, so the scheduler never
double-counts. Disk overcommit explicitly opts out of that safety and
trusts the operator's growth model.
Safe disk overcommit requires:
- Sparse qcow2 disk format (Otherix default).
- Pool monitoring with alerting before fill.
- Awareness of typical workload growth patterns in the cluster.
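One way to reason about a growth model is to compute the worst case: what happens if every sparse disk in a pool grows to its full allocated size. The sketch below is an operator-side illustration with hypothetical names; it is not something Otherix computes for you.

```python
def oversubscription(pool_capacity_gib: float, allocated_disk_gib: list) -> float:
    """Sum of allocated disk sizes divided by physical pool capacity."""
    return sum(allocated_disk_gib) / pool_capacity_gib


def worst_case_shortfall_gib(pool_capacity_gib: float, allocated_disk_gib: list) -> float:
    """Bytes the pool would be short if every disk filled completely."""
    return max(0.0, sum(allocated_disk_gib) - pool_capacity_gib)


disks = [400, 400, 300]          # allocated sizes in GiB
print(oversubscription(1000, disks))
print(worst_case_shortfall_gib(1000, disks))
```

A non-zero shortfall is the amount of growth the pool cannot absorb; alerting thresholds should fire well before actual usage approaches that gap.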
Disabling resources entirely¶
Setting enabled: false removes the resource from placement
consideration:
- Fit check skipped — no constraint on the dimension.
- Scoring formula adjusts — the denominator drops by one. Remaining enabled dimensions are scored at full magnitude (no scale shift).
- All three disabled → scoring degrades to count-based (LeastAllocated
with zero enabled dimensions is undefined; the scheduler falls back
to Least-VM-count parity, which is the same algorithm
least_vm_count applies cluster-wide).
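The floating denominator can be illustrated with a simplified LeastAllocated-style score: the mean free fraction over the enabled dimensions only. This is a sketch under stated assumptions (unweighted mean, free fractions already computed, hypothetical function name), not the scheduler's exact formula.

```python
from typing import Optional


def least_allocated_score(free_fraction: dict, enabled: dict) -> Optional[float]:
    """Average the free fraction (0.0 to 1.0) across enabled dimensions.

    Returns None when no dimension is enabled, signalling that the caller
    should fall back to VM-count ordering.
    """
    dims = [name for name, on in enabled.items() if on]
    if not dims:
        return None
    return sum(free_fraction[d] for d in dims) / len(dims)


free = {"cpu": 0.5, "memory": 0.25, "disk": 0.75}
print(least_allocated_score(free, {"cpu": True, "memory": True, "disk": True}))
print(least_allocated_score(free, {"cpu": True, "memory": True, "disk": False}))
```

Because the divisor is the number of enabled dimensions, the result stays on the same 0 to 1 scale whether two or three resources participate.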
When to disable:
- CPU (enabled: false): VMs are mostly idle and memory/disk are the binding constraints. Rare; CPU is usually the cheapest signal to keep.
- Memory (enabled: false): almost never. Memory pressure is the most common failure mode on Linux VM hosts.
- Disk (enabled: false): shared network storage where per-pool capacity reflects a pooled remote resource rather than node-local bytes. Pool capacity tracking still happens; it just doesn't influence placement.
Node-pressure detection¶
Independent of the capacity / overcommit knobs above, the scheduler
honours pressure conditions surfaced under placement.pressure.*.
Pressure detection excludes stressed nodes (or pools, for disk
pressure) from placement — a hard constraint, no operator override.
Three pressure types land together with Sub-iteration B:
- memory — per-node, heartbeat-driven. Free memory below threshold. OOM-kill risk.
- system_disk — per-node, heartbeat-driven. Free root-filesystem bytes below threshold. Agent crash / log truncation risk.
- disk — per-pool, scan-driven. Free pool bytes below threshold. Pool-scoped — other pools on the same node remain eligible.
Memory pressure¶
placement:
  pressure:
    memory:
      enabled: true            # disable to skip detection entirely
      threshold_percent: 10    # flag when free memory < 10% of total
      consecutive_required: 3  # ~90 s sustained at default heartbeat cadence
Mechanics:
- CP-side computation. The api-server compares memory_available_mib / memory_total_mib reported in every heartbeat against threshold_percent. The agent does not pre-compute pressure; operators tune one set of thresholds in the api config rather than per-agent.
- Asymmetric debouncing. Pressure is set after consecutive_required consecutive below-threshold observations (default ≈ 90 s at the 30 s heartbeat cadence). Recovery is immediate: the first at-or-above-threshold observation clears the flag and resets the counter. Slow-to-set / fast-to-clear prevents flapping on transient spikes while restoring eligibility promptly.
- Hard exclusion. Pressured nodes are filtered out of ListEligiblePoolsByName. The 409 no_eligible_nodes envelope carries details.reason: "node_pressure" and a filtered_due_to_pressure list when pressure accounts for the exclusion.
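The asymmetric debouncing described above can be sketched as a small pure transition function. The type and function names are hypothetical; the behaviour follows the rules in the list (slow to set, immediate to clear, counter reset on recovery).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PressureState:
    under_pressure: bool = False
    consecutive_below: int = 0


def observe(state: PressureState, available: int, total: int,
            threshold_percent: int = 10, consecutive_required: int = 3) -> PressureState:
    """Apply one heartbeat observation and return the next state."""
    # Integer comparison avoids float rounding: available/total < threshold/100.
    below = available * 100 < total * threshold_percent
    if not below:
        # Fast to clear: one healthy observation resets flag and counter.
        return PressureState(False, 0)
    count = state.consecutive_below + 1
    # Slow to set: the flag only flips once enough consecutive readings accumulate.
    return PressureState(state.under_pressure or count >= consecutive_required, count)


state = PressureState()
for available_mib in (500, 500, 500):      # 5% free of 10000 MiB, three heartbeats
    state = observe(state, available_mib, 10000)
print(state.under_pressure)                # True after the third observation
```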
System disk pressure¶
placement:
  pressure:
    system_disk:
      enabled: true            # disable to skip detection entirely
      threshold_percent: 10    # flag when free root-fs bytes < 10% of total
      consecutive_required: 3  # ~90 s sustained at default heartbeat cadence
Mechanics mirror memory pressure exactly: same heartbeat-driven CP-side
computation, same asymmetric debouncing, same hard-exclusion
semantics. Only the raw metric differs: the agent reads
syscall.Statfs("/") on every heartbeat and reports
system_disk_total_bytes / system_disk_available_bytes (both
nullable when the syscall fails — pressure state carries forward
across NULL observations). A system_disk-pressured node is excluded
from placement entirely; agent logs / NVRAM allocation / QMP sockets
all live on the root filesystem, so disk exhaustion threatens the
agent itself rather than just any individual VM.
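For readers who want to see what the raw metric looks like, here is a Python analogue of the agent-side read. The real agent uses Go's syscall.Statfs; this sketch uses os.statvfs instead, and the function name and the (None, None) convention for a failed call are assumptions made for illustration.

```python
import os


def read_system_disk(path: str = "/"):
    """Return (total_bytes, available_bytes), or (None, None) if the call fails."""
    try:
        st = os.statvfs(path)
    except OSError:
        # Mirrors the nullable fields: the caller keeps the previous pressure state.
        return None, None
    total = st.f_blocks * st.f_frsize
    available = st.f_bavail * st.f_frsize   # space usable by unprivileged processes
    return total, available


total_bytes, available_bytes = read_system_disk("/")
if total_bytes:
    print(f"root fs free: {available_bytes * 100 // total_bytes}%")
```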
Pool disk pressure¶
placement:
  pressure:
    disk:
      enabled: true            # disable to skip detection entirely
      threshold_percent: 15    # flag when free pool bytes < 15% of capacity
      consecutive_required: 1  # single scan sets pressure (scans are 15 min apart)
Mechanics:
- Scan-driven CP computation. The scan worker reads pool available_bytes / capacity_bytes from the agent's scan response and runs the same transition function inside the same transaction as UpsertStoragePoolUsage. Scan-completion-only: heartbeat-time observations are not relevant for pool metrics.
- Single-observation default. Scans are 15 min apart (default workers.storage_pool_scan.interval), so a single sub-threshold observation is a sufficient signal; operators can raise consecutive_required for multi-scan debouncing if their scan cadence is more aggressive.
- Pool-scoped exclusion. Pressured pools are filtered from ListEligiblePoolsByName; other pools on the same node remain eligible. The 409 envelope's filtered_due_to_pressure payload carries an explicit pool field for these entries.
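The difference between node-scoped and pool-scoped exclusion can be sketched as a filter over (node, pool) candidates. The Candidate type, the function name, and the exact shape of the filtered entries are hypothetical; only the presence of a pool field on pool-scoped entries comes from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    node: str
    pool: str
    node_pressure: bool   # memory or system_disk pressure on the node
    pool_pressure: bool   # disk pressure on this specific pool


def split_eligible(candidates):
    """Return (eligible candidates, entries filtered due to pressure)."""
    eligible, filtered = [], []
    for c in candidates:
        if c.node_pressure:
            entry = {"node": c.node}                   # whole node excluded
        elif c.pool_pressure:
            entry = {"node": c.node, "pool": c.pool}   # only this pool excluded
        else:
            eligible.append(c)
            continue
        if entry not in filtered:
            filtered.append(entry)
    return eligible, filtered


candidates = [
    Candidate("node-a", "fast", node_pressure=False, pool_pressure=True),
    Candidate("node-a", "bulk", node_pressure=False, pool_pressure=False),
    Candidate("node-b", "fast", node_pressure=True, pool_pressure=False),
]
eligible, filtered = split_eligible(candidates)
```

In this example node-a's bulk pool stays eligible even though its fast pool is pressured, while every pool on node-b is excluded.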
Operator visibility¶
otherix node list shows a computed STATUS column combining raw
status with active node-scoped pressure conditions (memory +
system_disk):
ready
under_pressure # memory OR system_disk pressure set
cordoned, under_pressure
unreachable # pressure data suppressed (stale)
otherix node get <node> renders a pressure: section with one row per
node-scoped condition.
otherix pool list shows pool STATUS independently: a pool with
disk_pressure reads under_pressure regardless of its owning node's
condition.
otherix pool get <pool> adds a pressure: section with the disk:
condition state.
Shared filesystem behavior¶
In typical homelab setups the storage pool's path lives on the
same filesystem as / — a single physical disk hosts both the
agent runtime and VM images. When that filesystem fills, both
system_disk_pressure (heartbeat-driven, node-level) and
disk_pressure (scan-driven, pool-level) fire on the same physical
condition. Both surface in the CLI and the envelope. This is accurate —
the resource is genuinely shared — not duplication. Multi-mount
setups (advanced) detect independently per filesystem.
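To check whether a given host falls into the shared-filesystem case, comparing device IDs is usually enough. The sketch below is an operator-side check with a hypothetical function name; bind mounts and btrfs subvolumes can report different device IDs for the same physical disk, so treat a False result as a hint rather than proof.

```python
import os


def shares_filesystem_with_root(pool_path: str) -> bool:
    """True when pool_path and / report the same device ID."""
    return os.stat(pool_path).st_dev == os.stat("/").st_dev


print(shares_filesystem_with_root("/"))  # the root path trivially shares with itself
```

When this returns True for a pool path, expect system_disk and disk pressure to fire together as the disk fills.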
When to tune¶
- Lower threshold_percent if the cluster runs hot and operators want later (less conservative) flagging. 5% is a reasonable floor; below that, OOM / disk-full becomes statistically imminent before pressure can fire.
- Raise threshold_percent for predictable workloads or conservative homelab setups. 15-20% memory / 20-25% disk is typical for clusters carrying interactive workloads.
- Lower consecutive_required to 1 to react on first observation; useful for testing the path or for clusters with long heartbeat intervals.
- Set enabled: false to disable detection entirely. Pressure columns still update on the row (the transition function is a no-op when disabled), but flags never set and the SQL filter never excludes anyone. Acceptable for homelab clusters where the operator prefers strict capacity-only scheduling.
Algorithm selection¶
placement.algorithm:
- resource_aware (default): kubernetes-style LeastAllocated across the enabled resources. Tie-break: node name lowercase lexicographic.
- least_vm_count: Phase 1.5 algorithm. Picks the node with the fewest pinned VMs. Same tie-break. Operator opt-out for clusters that cannot rely on heartbeat metrics or want pure count-based spread.
Per-resource configuration is meaningful only with resource_aware. Under
least_vm_count, the resources block is parsed and validated but does
not influence placement decisions. Validation warnings still fire,
which keeps the configuration honest if an operator flips back to
resource_aware later.
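The count-based selection and its tie-break are simple enough to show directly. This is a minimal sketch with a hypothetical function name, assuming the input is a mapping from node name to pinned VM count.

```python
def pick_least_vm_count(vm_counts: dict) -> str:
    """Pick the node with the fewest pinned VMs; break ties by lowercased name."""
    return min(vm_counts, key=lambda name: (vm_counts[name], name.lower()))


print(pick_least_vm_count({"Beta": 2, "alpha": 2, "gamma": 3}))  # alpha
```

Sorting on the tuple (count, lowercased name) makes the result deterministic regardless of the order in which nodes are listed.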
Operator workflow when changing config¶
- Edit placement.resources in your api yaml.
- Restart otherix-api. Watch for startup warnings; each overcommit choice produces one line.
- Validate placement behaviour: otherix node list shows raw vs effective free CPU / memory; otherix pool list shows raw vs effective free disk. The "(effective N free)" suffix renders when pending placements have not yet been observed by a heartbeat / scan.
- Trigger a test VM create. The scheduler's decision should respect the new ratios. vm create against a saturated pool returns 409 no_eligible_nodes with a structured details.node_utilization payload listing each candidate by name + per-resource usage.
There is currently no API or CLI endpoint to read or change the placement config at runtime — it's a deploy-time decision. Restart required to apply changes; the apartment graceful-shutdown path keeps in-flight tasks alive to the next replica.
See also¶
- Linux kernel: vm.overcommit_memory, vm.overcommit_ratio sysctls (man 5 proc).
- QEMU memory management: tcg / kvm allocation semantics, ballooning (not yet exposed by Otherix).