The core concepts and invariants that define PilotLight Architecture, and how they apply to distributed systems.
Modern cloud engineering is dominated by tooling: Kubernetes, observability platforms, CI/CD, autoscaling, service meshes, and AIOps. These tools matter, but they often obscure a deeper truth: distributed systems are dynamical systems.
They change over time. They respond to perturbations. They contain feedback loops. They settle into stable modes, oscillate under stress, or diverge into failure.
PilotLight Architecture borrows the “pilot light” metaphor from disaster recovery and gives it deeper meaning: the minimal invariants and control loops that must remain lit even when the system is pushed into chaos.
By reasoning in terms of state, attractors, feedback, damping, and divergence, we move from reactive firefighting to deliberate system design.
Core Concepts
State
Everything measurable that evolves over time:
- queue depth, retry rate, CPU/memory usage
- pod count, request latency, connection pool saturation
Observability is the measurement of this state, but it is not neutral. It actively consumes resources and can alter system behavior.
Attractors
Regions of stability the system naturally gravitates toward:
- steady-state pod count under normal load
- stable latency at expected traffic
- equilibrium queue depth
- warm cache behavior
Well-designed systems return to attractors after disruption.
Perturbations
Events that displace the system from its attractor:
- traffic spikes, deployments, node failures
- dependency slowdowns, network partitions, config changes
Resilient systems absorb them. Fragile ones amplify them.
Feedback Loops
- Positive feedback (dangerous): retries increase load → higher latency → more retries
- Negative feedback (stabilizing): autoscaler adds pods → latency drops → scaling stabilizes
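To make the stabilizing case concrete, here is a minimal sketch of a proportional scaler, modeled loosely on the ratio-based rule used by Kubernetes' Horizontal Pod Autoscaler (the function name and numbers here are illustrative):

```python
import math

def desired_replicas(current: int, observed_ms: float, target_ms: float) -> int:
    """Negative feedback: scale replicas by the ratio of observed load
    to target load. As latency falls back toward the target, the
    correction shrinks, so the loop settles instead of amplifying."""
    return max(1, math.ceil(current * observed_ms / target_ms))

print(desired_replicas(4, 200, 100))  # latency 2x target -> 8 replicas
print(desired_replicas(8, 100, 100))  # at target -> no change, stays at 8
```

The positive loop has no such self-limiting term: each retry adds load, which raises latency, which produces more retries.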
Oscillation
Swinging behavior caused by feedback that is delayed or too aggressive:
- autoscaler flapping
- health checks bouncing
- traffic shifting back and forth
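A toy difference equation shows why delay alone is enough to produce swings; the model and numbers are purely illustrative:

```python
def simulate(gain: float, delay: int, steps: int = 12) -> list[float]:
    """d[t+1] = d[t] - gain * d[t - delay], where d is the deviation
    from target and the correction acts on a reading `delay` steps old."""
    d = [1.0] * (delay + 1)  # start displaced from the attractor
    for _ in range(steps):
        d.append(d[-1] - gain * d[-1 - delay])
    return [round(x, 2) for x in d]

print(simulate(gain=0.5, delay=0))  # prompt feedback: smooth decay to 0
print(simulate(gain=0.5, delay=2))  # stale feedback: overshoots and rings
```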
Damping
Mechanisms that reduce oscillation and prevent runaway behavior:
- exponential backoff, cooldown windows, rate limiting
- circuit breakers, queue buffering
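As a sketch of the first of these mechanisms, exponential backoff with full jitter both spreads retries out over time and decorrelates clients that failed together (the base and cap values are illustrative):

```python
import random

def backoff_delay(attempt: int, base: float = 0.1, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter: the delay window doubles
    with each attempt (up to a cap) and the actual wait is drawn
    uniformly from it, so synchronized failures do not produce
    synchronized retries."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

for attempt in (0, 5, 10):
    print(attempt, round(backoff_delay(attempt), 3))  # windows: 0.1s, 3.2s, 30s
```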
Divergence
When positive feedback dominates and damping is insufficient, the system moves farther from stability:
- retry storms
- cascading failures
- memory exhaustion spirals
This is where outages happen.
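A back-of-the-envelope model shows how quickly retries amplify load. Assuming independent failures with probability p and up to n retries per request, the expected number of attempts is 1 + p + p² + … + pⁿ:

```python
def load_amplification(failure_rate: float, max_retries: int) -> float:
    """Expected attempts per logical request when each failure is
    retried: attempt k happens only if the first k attempts failed."""
    return sum(failure_rate ** k for k in range(max_retries + 1))

print(round(load_amplification(0.01, 3), 3))  # healthy: 1.010, retries are free
print(round(load_amplification(0.90, 3), 3))  # failing: 3.439, ~3.4x the load
                                              # aimed at an already-failing system
```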
How These Concepts Interact
Distributed systems rarely show these behaviors in isolation. They evolve together under pressure.
A typical lifecycle looks like this:
1. Steady State (Attractor)
The system operates near equilibrium: stable latency, predictable throughput, balanced resources.
2. Perturbation
A traffic spike, deployment, or node failure displaces the system. Queues grow, latency rises, utilization spikes.
3. Feedback Takes Over
- Negative feedback tries to restore balance (autoscaling, backpressure)
- Positive feedback can amplify problems (retries, error storms)
4. Three Possible Trajectories
- Stabilization (desired): negative feedback + strong damping → the system returns to its attractor.
- Oscillation (common): delayed or aggressive feedback → autoscaling flaps, latency swings.
- Divergence (failure): positive feedback dominates → retry storms, cascading collapse.
5. Damping Decides the Outcome
The strength of the damping mechanisms determines whether the system stabilizes or diverges. Weak damping turns minor perturbations into major outages.
Shorthand:
Attractor → Perturbation → Feedback → (Stabilization | Oscillation | Divergence) → Damping → Attractor?
Although it’s never quite this clean.
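Still, the shorthand can be compressed into a one-variable linear model, purely illustrative, in which a single coefficient, the balance of feedback gain against damping, selects the trajectory:

```python
def trajectory(gain: float, damping: float, steps: int = 8) -> list[float]:
    """d[t+1] = (1 + gain - damping) * d[t], where d is the deviation
    left behind by a perturbation, gain is the net positive feedback,
    and damping pulls the system back toward its attractor."""
    d, out = 1.0, []
    for _ in range(steps):
        d *= 1 + gain - damping
        out.append(round(d, 2))
    return out

print(trajectory(gain=0.3, damping=0.6))  # stabilization: decays toward 0
print(trajectory(gain=0.3, damping=1.7))  # oscillation: overcorrection flips sign
print(trajectory(gain=0.3, damping=0.1))  # divergence: grows without bound
```

Real systems are nonlinear and delayed, but the same balance decides the outcome.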
The Telemetry Death Spiral: A Real-World Example
Nowhere is this dynamic more visible, or more dangerous, than in observability itself.
Under normal conditions, telemetry cost is negligible. But when the system enters divergence:
- Error rates explode
- Log storms, high-cardinality metrics, and trace volume surge
- Telemetry begins competing with the primary workload for CPU, memory, and network
This creates a classic positive feedback loop:
Failure → More Errors → Telemetry Explosion → Resource Contention → Higher Latency → More Retries → More Failure
The very act of observing the system accelerates its collapse.
Load Shedding as Damping
The solution is deliberate damping: reduce telemetry load to restore stability.
Examples:
- Sampling traces instead of full capture
- Dropping high-cardinality labels
- Disabling debug logs
- Exporter throttling or kill-switches
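A minimal sketch of the first and last of these ideas: a head-based sampler whose rate backs off as the error rate climbs, plus an operator kill-switch. The thresholds and names are illustrative, not tuned recommendations:

```python
import random

def should_record_trace(error_rate: float, kill_switch: bool = False) -> bool:
    """Sampling rate falls as the system degrades, so telemetry volume
    damps the failure instead of amplifying it."""
    if kill_switch:
        return False              # heartbeat-only mode: shed all traces
    if error_rate < 0.01:
        rate = 1.0                # steady state: capture everything
    elif error_rate < 0.10:
        rate = 0.1                # perturbed: keep 1 trace in 10
    else:
        rate = 0.01               # unstable: keep 1 trace in 100
    return random.random() < rate
```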
Minimal Viable Telemetry
| Mode | State | Observability Strategy | Goal |
|---|---|---|---|
| 1. Steady | Attractor | Full traces, high-res metrics, debug | Maximum insight |
| 2. Perturbation | Displaced | Sampled traces, aggregate metrics | Reduced overhead + early warning |
| 3. Divergence | Unstable | Heartbeat only, kill-switches active | Preserve invariants, enable recovery |
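The table implies a small state machine. One way to drive it, sketched below with illustrative signals and thresholds, is to make mode transitions themselves damped with hysteresis, so the telemetry controller does not flap:

```python
from enum import Enum

class Mode(Enum):
    STEADY = 1        # full traces, high-res metrics, debug
    PERTURBATION = 2  # sampled traces, aggregate metrics
    DIVERGENCE = 3    # heartbeat only, kill-switches active

def next_mode(current: Mode, error_rate: float, queue_growth: float) -> Mode:
    """Entry thresholds sit above exit thresholds (hysteresis), so a
    system hovering near a boundary does not bounce between modes."""
    if error_rate > 0.10 or queue_growth > 0.50:
        return Mode.DIVERGENCE
    if error_rate > 0.02 or queue_growth > 0.10:
        return Mode.PERTURBATION
    if current is not Mode.STEADY and error_rate > 0.005:
        return Mode.PERTURBATION  # require clear calm before relaxing fully
    return Mode.STEADY
```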
The PilotLight
In extreme instability, non-essential components shut down. What remains is the PilotLight:
- Hard invariants that must always hold
- Essential control loops
- Minimal telemetry (heartbeats)
- Stabilization logic (damping mechanisms)
The PilotLight is the set of invariants that remain true even when all other state variables are in chaos. Its job is to guarantee the system can still return to an attractor.
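Structurally, the PilotLight is small enough to read in one screen. A sketch, with hypothetical invariant probes standing in for real checks:

```python
import time

# Hypothetical invariants: each probe returns True while its guarantee holds.
INVARIANTS = {
    "accepting_connections": lambda: True,  # stand-in for a real TCP probe
    "queue_below_hard_limit": lambda: True,
    "control_loops_alive": lambda: True,
}

def pilot_light_tick() -> dict:
    """One beat of the loop that must stay lit: check the hard
    invariants and emit a single tiny heartbeat record, nothing more."""
    status = {name: probe() for name, probe in INVARIANTS.items()}
    return {"ts": time.time(), "ok": all(status.values()), "invariants": status}

print(pilot_light_tick())  # the heartbeat: the only telemetry that never sheds
```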
Key Takeaway
In many organizations, “more monitoring” is a reflex: a way to feel productive without actually increasing understanding.
It’s the architectural equivalent of shouting louder when you aren’t being understood.
Beyond a certain point, more metrics do not improve clarity. They create noise.
PilotLight Architecture reframes observability: not as data accumulation, but as system modeling.
We move from being data hoarders to system cartographers.
| Traditional Monitoring | Dynamic Mapping (PilotLight) |
|---|---|
| “What is CPU usage?” | State Identification: Which variables define system behavior? |
| “Alert if latency > 200ms” | Attractor Mapping: What stable region does the system return to? |
| “Add more logs” | Perturbation Correlation: What events displace the system? |
| “We need more dashboards” | Feedback Diagnosis: Are we stabilizing or amplifying? |
Outlining the Shape of the System
Observability is the process of outlining the shape of system dynamics over time.
In a healthy system, monitoring confirms that the state remains within a basin of attraction.
When perturbations occur, the goal is not more data; it is visibility into how the system responds:
- does damping engage?
- does feedback stabilize or amplify?
- does the system return to equilibrium?
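Those three questions can be asked of data you already collect. A sketch of the middle one: classify the response by whether successive deviations from the setpoint contract or grow (the input series and threshold are illustrative):

```python
def feedback_verdict(deviations: list[float]) -> str:
    """Average ratio of successive absolute deviations from setpoint:
    below 1 means the state is contracting back into its basin of
    attraction; above 1 means feedback is amplifying the displacement."""
    ratios = [b / a for a, b in zip(deviations, deviations[1:]) if a > 0]
    if not ratios:
        return "stabilizing"  # already sitting at the setpoint
    return "stabilizing" if sum(ratios) / len(ratios) < 1.0 else "amplifying"

print(feedback_verdict([8.0, 4.0, 2.0, 1.0]))  # -> stabilizing
print(feedback_verdict([1.0, 2.0, 4.0, 8.0]))  # -> amplifying
```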
The PilotLight Approach
The reality is simple:
More metrics can become a distraction.
If a metric does not help identify an attractor, detect a perturbation, or signal divergence, it is likely noise, and noise is not free: as the telemetry death spiral shows, it can subsidize a future collapse.
“More monitoring” should not mean more data.
It should mean better understanding of system dynamics.
The goal is not zero downtime.
It is predictable recovery.
This is the PilotLight Approach.