Kokoro

Surviving a Data Surge as a Single Operator: An Eight-Layer Operations Stack on Constrained Hardware

Author

Kokoro

Date Published

May 2026

Abstract

This document presents an operations stack for running production services under two constraints — a single operator and budget hardware (single-digit-GB RAM, single-digit-core hosts) — through the lens of a concrete failure scenario: a sudden surge in ingest volume during a market volatility event. The reference workload continuously ingests raw cryptocurrency exchange data into TimescaleDB and serves it to a WebGPU charting frontend, across a small fleet of Linux hosts on a WireGuard mesh. The central design goal is automation: with one operator and no escalation path, the system must detect, remediate, and recover from common faults without human intervention, and escalate to the operator only when automated recovery has provably failed. The document traces how a volume surge propagates into failure, how the stack automatically diagnoses and responds to it, and how each of eight layers absorbs the condition or, failing that, surfaces it honestly. A managed-cloud (AWS) treatment of the same surge serves as a contrast case, clarifying which problems the stack automates away on cheap local mechanisms and which it deliberately declines to solve with elastic capacity.

1. The scenario

Steady state is undramatic. Each exchange streams updates at a roughly stable rate; an ingest process per source normalizes and writes to TimescaleDB; queue depths sit near zero; write latency is flat. The system is sized for this baseline with modest headroom, because the hardware budget does not permit sizing for peak.

A volatility event breaks the baseline. Order-book churn and trade frequency rise sharply on every exchange at once, so the surge is correlated across all sources rather than confined to one. Ingest message rate climbs several-fold within minutes. This is the load profile the system was never sized for, and the one that matters most, because it coincides with the moments when the served data is most valuable.

The operative question concerns behavior at the ceiling rather than provisioning above it: as the system crosses its capacity limit, does it degrade in a way an unattended automated layer can manage, escalating to the single operator only when automation has exhausted its options? The budget forecloses the alternative of simply provisioning for peak, which makes automated graceful degradation the design's central requirement.

2. Constraints as design premises

Two constraints generate every decision below.

Single operator. No on-call rotation, no escalation. Every alert terminates at one person, so the design must minimize total notification volume rather than route between responders. Operational debt compounds without a second observer: an unsuppressed flaky alert raises the noise floor and hides the next real signal. Documentation has exactly one reader, displaced in time.

Constrained hardware. Capacity cannot be expanded to absorb a fault. Every problem is resolved in software or accepted as a fixed cost. This is precisely why the surge scenario is the right stress test: the textbook answer (add capacity) is unavailable by construction.

A derived principle: minimize novel operational surface. Each custom daemon or exotic component is one more thing with no external operator base to draw on during an incident. Selection favors the most widely deployed option that satisfies the requirement.

3. How the surge propagates — the diagnosis

Before describing the solution, trace the failure in the order it actually unfolds. The value of doing this explicitly is that the propagation order dictates where each defensive layer must sit.

  1. Ingest rate exceeds write throughput. TimescaleDB write latency rises; the per-source ingest queues, previously near-empty, begin to fill. No threshold has tripped yet. This is the earliest observable, and it is sub-alert: only a dashboard shows it.
  2. Queues grow; memory pressure rises. Buffered messages consume RAM, the scarcest resource on the hardware. On a constrained host, this is the first hard limit reached, ahead of disk or CPU.
  3. Write-ahead and ingest logs accumulate faster. Higher event volume means higher log volume. Disk consumption accelerates — the slow background process becomes a fast one.
  4. A threshold finally trips. Typically disk (DiskFillingFast) or memory (HighMemory) crosses its bound. The alert layer activates here — after the dashboard-visible drift in step 1.
  5. Risk of cascade. If a host exhausts memory and the ingest process is killed, or disk fills and writes fail, the host can go unresponsive. NodeDown then fires, and every derivative threshold for that host trips behind it.

The diagnosis yields a design requirement: the stack must surface the condition at step 1 (drift, sub-threshold), act automatically at step 4 (threshold crossed), and prevent step 5 from multiplying into a notification cascade.
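
Steps 1 and 2 can be put in rough numbers. The sketch below estimates how long a host has between the start of queue growth and memory exhaustion. The rates, message size, and memory budget in the usage note are illustrative assumptions, not measurements from the reference workload, and the function name is hypothetical.

```python
def seconds_until_memory_limit(ingest_rate, write_rate, bytes_per_message,
                               memory_budget_bytes, queued_messages=0):
    """Estimate seconds until buffered messages fill the memory budget.

    ingest_rate and write_rate are in messages per second.
    Returns None when writes keep up with ingest, because the queue
    does not grow and the limit is never reached.
    """
    growth = ingest_rate - write_rate  # net messages per second entering the queue
    if growth <= 0:
        return None
    remaining = memory_budget_bytes - queued_messages * bytes_per_message
    if remaining <= 0:
        return 0.0  # the budget is already spent
    return remaining / (growth * bytes_per_message)
```

With assumed figures of 2,000 messages per second at baseline, 2,500 messages per second of write throughput, 512-byte messages, and a 1 GiB queue budget, the baseline returns None. A fourfold surge to 8,000 messages per second leaves about 381 seconds, a little over six minutes, which is the window in which the later layers must act.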

4. The contrast case: how AWS would absorb this

It is worth stating the managed-cloud answer plainly, because it clarifies the boundary of what the constrained stack automates.

On AWS, the canonical response to a correlated ingest surge is to decouple and elasticize. A managed buffer (Kinesis Data Streams or SQS) absorbs the rate spike so producers never block on the database; consumers drain the buffer at their own pace. Compute autoscales on queue depth or CPU. Storage (e.g. a managed time-series or relational service) scales largely without operator action, and CloudWatch alarms observe the whole pipeline. The surge is absorbed by temporary capacity: the system grows to meet the load and shrinks afterward, and the operator pays for the peak only while it lasts.

This is the correct architecture when the budget permits elastic spend. It does not fit the stated constraints: the hardware budget is fixed and small, and elastic spend during the most volatile periods, when load and therefore cost are least predictable, is exactly what must be avoided.

The constrained stack makes a different trade. It accepts that arbitrary load cannot be absorbed, and engineers the crossing of the capacity ceiling to produce graceful, observable, bounded degradation. The recovery work a managed platform performs automatically — buffering, draining, reclaiming, restarting — is reproduced here by cheap, local, deterministic automation running on the hosts themselves. The objective is the same automated self-recovery AWS provides; the difference is that it runs on a fixed-cost substrate and escalates to the operator only when its own actions have provably failed. The remaining sections specify those automated mechanisms against the propagation order from Section 3.

5. Metrics — observing step 1 before it becomes step 4

node_exporter runs on every host; a single Prometheus on the hub scrapes at 15-second intervals with 30-day retention. The pull model keeps hosts stateless: each advertises a metrics endpoint over the WireGuard mesh, and Prometheus discovers and scrapes. Host unreachability is itself an observation — the synthetic up metric evaluates to 0 and NodeDown fires on absence rather than on a positive failure signal, which is essential for detecting step 5.

Storage is ~10 GB of TSDB per 30-day window, within margin on a 100 GB volume.
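
The storage figure can be sanity-checked with the sizing rule from the Prometheus documentation: disk space is roughly retention time in seconds, times ingested samples per second, times bytes per sample, where a sample averages 1 to 2 bytes after compression. The series count used in the usage note is an assumption chosen for illustration; the source does not state it.

```python
def tsdb_bytes(active_series, scrape_interval_s, retention_days,
               bytes_per_sample=2.0):
    """Approximate Prometheus TSDB size for a retention window.

    bytes_per_sample defaults to the upper end of the 1-2 byte range
    given in the Prometheus storage documentation.
    """
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 86_400
    return retention_seconds * samples_per_second * bytes_per_sample
```

For an assumed 28,000 active series scraped every 15 seconds and kept for 30 days, this gives about 9.7 GB, in line with the ~10 GB figure. Because size scales linearly with series count, doubling the number of exported metrics would double the footprint.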

Application metrics — per-exchange ingest rate, write latency, queue depth — are what make step 1 visible. Native exporters cover host vitals; the textfile collector covers the rest (a scheduled job writes a file, node_exporter exposes it as a gauge), so a custom metric costs five lines of shell and one rule rather than a bespoke exporter.
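
A minimal sketch of the textfile pattern follows, written in Python rather than the shell the text mentions. The metric name ingest_queue_depth, the label value, and the function names are hypothetical. The sketch assumes label values contain no quotes or backslashes, and it relies on the textfile collector reading only files that end in .prom, so the temporary file is ignored until it is renamed.

```python
import os
import tempfile


def render_gauge(name, help_text, samples):
    """Render gauge samples in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        selector = f"{{{label_str}}}" if label_str else ""
        lines.append(f"{name}{selector} {value}")
    return "\n".join(lines) + "\n"


def write_textfile(directory, filename, content):
    """Write the file atomically so a scrape never sees a partial file."""
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(content)
        os.chmod(tmp_path, 0o644)  # mkstemp creates the file as 0600
        os.replace(tmp_path, os.path.join(directory, filename))
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A scheduled job would call render_gauge with the current queue depths and pass the result to write_textfile, pointing at whatever directory node_exporter's textfile collector is configured to read.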

6. Dashboards — making drift legible

Alertmanager reports threshold violations; Grafana reports state change. The surge's earliest signature is a state change with no violation, so the dashboard layer is where step 1 is caught.

Two dashboards are maintained, both provisioned from version-controlled JSON (UI-built dashboards are non-reproducible and lost with the Grafana data directory):

  1. Host vitals — CPU, memory, disk, network, load, one row per host. Consulted first on any alert to localize the condition to one host or the fleet. During a correlated surge it shows the fleet-wide pattern immediately.
  2. Pipeline state — per-exchange ingest rate, TimescaleDB write latency, queue depth, error count. This is where the operator sees queue depth climbing in step 1, before any alert exists.

Grafana is reachable only over the WireGuard mesh; exposing it publicly would add an authentication-maintenance burden disproportionate to value.

7. Alerting — activating at step 4 without storms

Alertmanager performs routing, grouping, and suppression.

Routing is severity-keyed. Tickets (high disk, CPU, memory) route to email with a 4-hour repeat interval; pages (NodeDown, DiskFillingFast) route to the same destination with a 1-hour repeat interval, raising re-notification frequency for unacknowledged high-severity conditions. One Alertmanager expresses two urgency tiers without a separate paging service.

Grouping by alertname + node collapses a single fault into one notification rather than one per affected metric — directly relevant to a surge, where many thresholds move together.

Inhibition is the step-5 defense: when NodeDown fires, the host's HighCPU, HighDisk, and HighMemory rules are suppressed. They are not false, since a failing host trips every threshold on its way down, but they add no information beyond NodeDown and would otherwise convert one failure into a cascade.

Silences cover planned operations, bounded in time, so deliberate actions do not page.
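
The combined effect of inhibition and grouping can be illustrated with a small model. This is not Alertmanager's implementation, which is configured declaratively through inhibit_rules and group_by; it is a sketch of the same logic on plain dictionaries, with hypothetical function names and host names.

```python
from collections import defaultdict

# Threshold alerts that carry no extra information once the host is down.
INHIBITED_BY_NODE_DOWN = {"HighCPU", "HighDisk", "HighMemory"}


def apply_inhibition(alerts):
    """Drop threshold alerts for any node that also has NodeDown firing."""
    down_nodes = {a["node"] for a in alerts if a["alertname"] == "NodeDown"}
    return [
        a for a in alerts
        if not (a["alertname"] in INHIBITED_BY_NODE_DOWN
                and a["node"] in down_nodes)
    ]


def group_alerts(alerts):
    """Collapse alerts into one group per (alertname, node) pair."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[(alert["alertname"], alert["node"])].append(alert)
    return dict(groups)


def notifications(alerts):
    """Return the sorted group keys that would each produce one notification."""
    return sorted(group_alerts(apply_inhibition(alerts)))
```

If host-a reports NodeDown, HighCPU, and HighDisk on two mount points while host-b reports HighMemory, the model yields two notifications: NodeDown for host-a and HighMemory for host-b. Five firing alerts reach the operator as two messages.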

SMTP egress constraint. The hub runs on DigitalOcean, which blocks outbound SMTP (ports 25/465/587), so Alertmanager cannot send mail directly. Resolution is a stateless HTTPS relay: a ~30-line Cloudflare Worker accepts Alertmanager's webhook JSON payload (schema version 4) and forwards it to the Resend HTTPS API for delivery. It holds no state and runs within both services' free tiers. A non-standard SMTP port such as 2525 is often left open and could in principle carry the mail, but routing through an HTTPS API call removes the dependency on any port remaining unblocked, which is the more durable design. Deployments where SMTP is permitted omit this bridge; the pattern applies to any provider with equivalent egress restrictions.
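
The relay's core job is a pure transformation from one JSON shape to another. The author's relay is a Cloudflare Worker; the Python sketch below shows only the mapping step and is not that code. The input fields (version, status, commonLabels, alerts) are part of Alertmanager's webhook payload. The output field names (from, to, subject, text) follow Resend's send-email request as understood here and should be checked against the current API reference. The addresses, the summary annotation, and the function name are assumptions.

```python
def build_email(payload, sender, recipient):
    """Map an Alertmanager webhook payload (version "4") to an email body."""
    if payload.get("version") != "4":
        raise ValueError(f"unsupported payload version: {payload.get('version')!r}")

    alerts = payload["alerts"]
    name = payload.get("commonLabels", {}).get("alertname", "alerts")
    subject = f"[{payload['status'].upper()}] {name} ({len(alerts)})"

    lines = []
    for alert in alerts:
        labels = alert.get("labels", {})
        summary = alert.get("annotations", {}).get("summary", "")
        line = (f"{alert['status']}: {labels.get('alertname', '?')} "
                f"on {labels.get('node', '?')}")
        if summary:
            line += f": {summary}"
        lines.append(line)

    # The caller would POST this dictionary as JSON to the mail API.
    return {
        "from": sender,
        "to": [recipient],
        "subject": subject,
        "text": "\n".join(lines),
    }
```

Keeping the transformation free of network calls is what lets the relay stay stateless: the only side effect is the single outbound HTTPS request made by the caller.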

8. Reactive remediation — performing the recovery AWS would automate

This is the layer where the system's automation does its primary work: it substitutes for managed-platform auto-recovery by attempting remediation autonomously, escalating to the operator only after an automated fix has been tried and has failed. The closed control loop is what lets a single operator run the fleet — most faults are resolved by the machine before a human is ever aware of them.

The control loop:

  1. Alertmanager fires a webhook to a receiver on the affected host.
  2. The receiver invokes a remediation script. For the surge case, this is the disk-pressure path: prune unused Docker resources, vacuum journald, remove superseded kernels — reclaiming the space that step 3's log acceleration consumed.
  3. If remediation clears the threshold, the alert resolves at the next scrape and rule evaluation with no operator notification. The surge was survived unattended.
  4. If remediation runs but the threshold remains violated — the surge is large enough that cleanup cannot keep pace — the script exits non-zero. The alert continues firing, routes to higher severity, and pages the operator. The system has correctly escalated a problem it could not solve alone.

The loop's integrity rests on the exit-honest property: the script returns non-zero whenever it has not resolved the condition. A script that always returns success trains the operator to disregard resolved-notifications, destroying the meaning of the resolved state. Honest exit codes preserve the guarantee that "resolved" means resolved — which is exactly the guarantee a single operator must be able to trust during a surge.
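
The exit-honest property can be sketched as a function whose return value depends only on a fresh measurement taken after cleanup, never on whether the cleanup commands themselves succeeded. The function names and the single example step are hypothetical; the toolkit's actual script may be structured differently.

```python
import shutil
import subprocess
import sys


def used_percent(path="/"):
    """Percentage of the filesystem at path that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def vacuum_journal():
    """Example cleanup step; its own exit status is deliberately ignored."""
    subprocess.run(["journalctl", "--vacuum-time=14d"], check=False)


def remediate(threshold_percent, measure, cleanup_steps):
    """Return a process exit code: 0 only if usage is below the threshold."""
    if measure() < threshold_percent:
        return 0  # nothing to do; the condition has already cleared
    for step in cleanup_steps:
        try:
            step()
        except Exception as error:
            # A failed step does not decide the outcome; the measurement does.
            print(f"cleanup step failed: {error}", file=sys.stderr)
    return 0 if measure() < threshold_percent else 1
```

A script entry point would call sys.exit(remediate(threshold, used_percent, [vacuum_journal])) with the same threshold the alert rule uses. Passing measure as a parameter keeps the decision logic testable without filling a real disk.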

The receiver is adnanh/webhook: a single Go binary from the Ubuntu repository (apt install webhook), JSON-configured, no added runtime.
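
As an illustration of the JSON configuration, the sketch below generates a one-hook configuration file. The keys id, execute-command, command-working-directory, and trigger-rule are part of webhook's hook definition format; the hook id and the script path in the usage note are hypothetical. The trigger rule assumes the hook should run only for firing notifications, using the top-level status field of Alertmanager's payload, so that resolved notifications do not start a cleanup.

```python
import json


def disk_cleanup_hook(script_path):
    """Build one hook definition that runs script_path on firing alerts."""
    return {
        "id": "disk-cleanup",
        "execute-command": script_path,
        "command-working-directory": "/",
        "trigger-rule": {
            "match": {
                "type": "value",
                "value": "firing",
                "parameter": {"source": "payload", "name": "status"},
            }
        },
    }


def render_hooks(hooks):
    """Serialize a list of hook definitions as the JSON array webhook reads."""
    return json.dumps(hooks, indent=2) + "\n"
```

Calling render_hooks([disk_cleanup_hook("/usr/local/bin/disk-cleanup")]) produces a file that webhook can load with its -hooks option; Alertmanager's webhook receiver would then target the hook's URL on the affected host.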

9. Preventive maintenance — keeping headroom for the next surge

Reactive remediation alone is insufficient. Without scheduled maintenance, state accumulates between fires and every reactive invocation is forced into aggressive cleanup — meaning the surge arrives with no headroom to spend. A daily systemd timer performs the slow, low-risk reclamation in calm conditions: docker system prune (unused >7 days), journalctl --vacuum-time=14d, apt autoremove --purge -y.
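
The three preventive steps can be sketched as a command list plus a runner that attempts every step even when one fails. The function names are hypothetical. The sketch substitutes apt-get for apt because the apt front end does not promise a stable interface for scripts, and it expresses the seven-day Docker cutoff as an until=168h filter.

```python
import subprocess


def maintenance_commands(docker_max_age_hours=168, journal_retention="14d"):
    """Return the preventive cleanup commands as argument lists."""
    return [
        ["docker", "system", "prune", "--force",
         "--filter", f"until={docker_max_age_hours}h"],
        ["journalctl", f"--vacuum-time={journal_retention}"],
        ["apt-get", "autoremove", "--purge", "-y"],
    ]


def run_maintenance(commands, runner=subprocess.run):
    """Run every command and return the list of those that failed."""
    failed = []
    for command in commands:
        try:
            result = runner(command, check=False)
        except OSError:
            # The binary is missing or not executable on this host.
            failed.append(command)
            continue
        if result.returncode != 0:
            failed.append(command)
    return failed
```

Running all steps regardless of individual failures suits the preventive path: a host without Docker should still have its journal vacuumed. A non-empty return value can be turned into a non-zero exit status so the failure shows up in the unit's journal.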

systemd timers are preferred over cron for three properties: Persistent=true runs a missed job at the next boot; output integrates with journalctl; and the unit graph expresses ordering (After=network-online.target) declaratively. Plain cron provides none of these.

10. Cascade prevention

Three mechanisms ensure a surge-induced failure does not amplify.

  • Inhibit rules (Section 7) prevent one host failure from producing one notification per metric.
  • Concurrency guards. The reactive script acquires an exclusive flock at startup; a second invocation arriving while the first runs exits cleanly rather than executing a concurrent prune. Overlap is likely precisely during a surge, when an alert fires mid-preventive-run.
  • External watchdog. If the surge takes out the monitoring host itself (memory exhaustion, disk full), the monitoring system goes silent and cannot report its own failure. A separate uptime monitor (uptime-kuma) on an independent host HTTP-checks the Prometheus and Alertmanager APIs every minute and notifies the operator directly via Resend. It shares no host with what it watches, which is the precondition for detecting its failure; the Resend delivery API is the only component common to both notification paths.

Together: no single failure produces a cascade, and failure of the alerting system is itself detected.
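
The concurrency guard can be sketched with a non-blocking exclusive flock. The text describes the guard in a shell script; the Python version below shows the same mechanism under the assumption of a Linux host, with hypothetical function names and a lock path chosen by the caller.

```python
import fcntl
import os


def try_acquire(lock_path):
    """Return a descriptor holding the exclusive lock, or None if it is taken."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def release(fd):
    """Release the lock and close the descriptor."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

A cleanup script would call try_acquire at startup and exit with status 0 when it returns None, since another run is already doing the work. The kernel drops the lock when the holding process exits, so a crashed run cannot leave the guard stuck.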

11. Runbooks

A runbook annotation is only as useful as the runbook it points to. One Markdown file per alert is version-controlled alongside the rule, with a fixed six-section structure (Symptoms, First look, Common causes, Mitigation, Investigation, Communication) that supports scanning rather than sequential reading under load. Runbooks are revised after each incident; the runbook is the persistence mechanism for incident lessons. The surge described here is exactly the kind of event whose resolution should land back in the relevant runbooks.
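
Because the six-section structure is fixed, it can be checked mechanically before a runbook is committed. The sketch below assumes each section is a second-level Markdown heading ("## Symptoms" and so on); that heading level is an assumption, as are the function names.

```python
REQUIRED_SECTIONS = [
    "Symptoms", "First look", "Common causes",
    "Mitigation", "Investigation", "Communication",
]


def section_headings(markdown_text):
    """Return the text of every second-level heading, in document order."""
    return [
        line[3:].strip()
        for line in markdown_text.splitlines()
        if line.startswith("## ")
    ]


def runbook_problems(markdown_text):
    """Return a list of structural problems; an empty list means it passes."""
    found = section_headings(markdown_text)
    problems = [
        f"missing section: {name}"
        for name in REQUIRED_SECTIONS if name not in found
    ]
    present = [name for name in found if name in REQUIRED_SECTIONS]
    expected = [name for name in REQUIRED_SECTIONS if name in present]
    if present != expected:
        problems.append("sections out of order or duplicated")
    return problems
```

Run over every runbook file in a pre-commit hook or CI job, a check like this keeps the structure from drifting as runbooks are revised after incidents.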

12. Configuration persistence

Every install script, config file, and systemd unit is version-controlled, with the organization name parameterized rather than literal. A hard-coded namespace imposes a future rename cost; the namespace is supplied through $INSTALL_NAMESPACE.
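
Namespace parameterization amounts to deriving every artifact name from one variable. The sketch below is a hypothetical helper, not the toolkit's install code; the hostbase default matches the toolkit's stated default prefix, and the artifact name in the usage note is illustrative.

```python
import os

DEFAULT_NAMESPACE = "hostbase"


def namespaced(artifact, environ=None):
    """Prefix an artifact name with INSTALL_NAMESPACE or the default."""
    env = os.environ if environ is None else environ
    # An unset or empty variable falls back to the default namespace.
    namespace = env.get("INSTALL_NAMESPACE") or DEFAULT_NAMESPACE
    return f"{namespace}-{artifact}"
```

With the variable unset, namespaced("disk-cleanup.timer") yields hostbase-disk-cleanup.timer; with INSTALL_NAMESPACE=acme it yields acme-disk-cleanup.timer. Routing every unit, script, and config name through one such function is what keeps a rename to a single-variable change.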

13. Operating principles

  1. Automate the response, escalate the exception. The default path for any common fault is autonomous machine remediation; a human is involved only when automation has run and failed. Every alert therefore requires a remediation script or, at minimum, a runbook — a notification that advances nothing toward resolution is operational debt.
  2. Remediation loops must close end-to-end. Fire, receive, remediate, report exit status honestly. A closed loop is what permits unattended survival of events like the surge.
  3. Preventive maintenance runs in calm conditions. Headroom must be created automatically and continuously beforehand, so the reactive path is not forced to scavenge it during the surge.
  4. Remediation exit codes must reflect actual outcome. False success destroys the meaning of the resolved state and breaks the trust the operator places in unattended automation.
  5. Capacity is one answer to load among several. Where elastic capacity is unavailable by budget, automated graceful, observable, bounded degradation serves the same end at fixed cost.

14. Reference implementation

The disk-cleanup pattern and supporting components are released as an open-source Linux-host maintenance toolkit:

https://gitlab.com/kokoro-oss/host-baseline

Four modules:

  • disk-cleanup — preventive timer and reactive webhook for disk-threshold alerts.
  • host-alerts — nine-alert Prometheus rule bundle covering disk, CPU, memory, load, swap, node-down, NTP drift, and reboot-required conditions.
  • reboot-policy — unattended-upgrades for security patches plus a scheduled weekly reboot window.
  • runbook-stubs — nine starter runbooks mapped one-to-one to the alerts.

The toolkit is licensed under Apache-2.0 and has no dependencies beyond Prometheus, Alertmanager, and adnanh/webhook. The default install namespaces artifacts under hostbase-*; setting INSTALL_NAMESPACE produces a clean re-brand. The repository is small enough to audit in full before adoption.

15. Known gaps

  • Backup. Ad-hoc rsync snapshots only; a borg-based module under the same install convention is planned.
  • Intrusion mitigation. fail2ban defaults and a banned-IP-rate alert are absent.
  • Multi-host orchestration. Manual per-host script execution, adequate at single-digit host counts; configuration-management tooling would be required beyond that.
  • Secret management. Secrets reside in permission-restricted environment files, secured by convention; a dedicated vault is a planned, separately scoped effort.

16. Accepted failure modes

A design defined by a fixed budget is also defined by what it declines to defend against. The following failures are understood and accepted as costs rather than engineered away, and naming them is part of operating the system honestly.

  • Surges beyond the escalation horizon. The reactive layer buys time and the operator is the backstop, but a surge severe enough to exhaust disk or memory faster than cleanup reclaims it, and faster than a human can intervene, will lose ingest data for its duration. The system guarantees bounded, observable degradation, not lossless capture under arbitrary load. Lossless capture at peak is precisely the elastic-capacity guarantee the budget forecloses.
  • Correlated multi-host failure. Inhibition and the external watchdog prevent a single failure from cascading into noise, but a fault common to several hosts at once — a shared dependency, a simultaneous resource exhaustion during the same surge — presents as multiple real failures requiring serial manual attention. There is no automated cross-host failover, because there is no spare capacity to fail over to.
  • The restart data-loss window. A host reboot, whether from the weekly maintenance window or a remediation of last resort, drops in-flight ingest for the duration of the restart. Persistence resumes on boot; the gap during the window is accepted.
  • Single-region hub dependency. Prometheus and Alertmanager run on one hub, and the email relay is fed only by that hub. The external watchdog detects hub failure, but recovery is manual. A multi-region control plane is out of scope at this scale.

Each of these has the same root: the absence of spare capacity is the defining constraint, and several classes of failure can only be solved by spending capacity. The design's position is that surviving these events gracefully and visibly, at fixed cost, is a better trade for a single operator than paying continuously for headroom that the rare event would consume.

17. Context

Published operations material bifurcates into enterprise SRE, which assumes a team and elastic capacity, and hobbyist self-hosting, which assumes low stakes. The intermediate case — a single operator running production services with real users under a fixed hardware budget — is underdocumented. The surge scenario above is where that case is tested, and where the difference between absorbing load and surviving it becomes the entire design.


Kokoro · May 2026