Chaos Engineering for Node + TypeScript: Survive Process Roulette

2026-03-05

Use ‘process roulette’ to teach chaos engineering for TypeScript + Node: graceful shutdowns, health checks, fault injection, and resilience testing.

When your Node service dies at 2 a.m., what did the machine learn?

If you run TypeScript services on Node in production, your worst nightmares look like this: sudden process exits, stuck event loops, or a rolling restart that causes user-facing errors. Chaos engineering doesn't have to be drama — it can be rehearsal. In 2026, teams increasingly build chaos into CI/CD and shift-left resilience testing. This article shows how to use the playful idea of "process roulette" — programs that randomly kill processes — to teach rigorous chaos engineering for TypeScript + Node services: fault injection, graceful shutdowns, health checks, and automated resilience tests.

Why process roulette is a useful teaching metaphor

Process roulette — literally killing a process at random — exposes weak spots quickly. It forces you to answer: can my service stop without losing data? Can Kubernetes and my process manager route traffic away? Do health checks reflect real readiness or just a running event loop? Embracing controlled failure lets you harden systems before real incidents.

Overview: Goals and the 2026 context

Goals for your TypeScript Node service:

  • Detect failures quickly with accurate liveness/readiness and app-level checks.
  • Shutdown gracefully so in-flight work completes or is safely cancelled.
  • Introduce fault injection into staging and CI with observability and SLO-based guardrails.
  • Automate chaos as code with tools like LitmusChaos, Gremlin, or Kubernetes-native controllers.

2026 trends to keep in mind:

  • Shift-left chaos: teams run chaos scenarios in ephemeral staging before merging. GitOps and CI pipelines now commonly include chaos test stages.
  • Better observability: OpenTelemetry and SLO-driven chaos tests are standard. Instrumentation is a prerequisite for safe experiments.
  • Serverless & container runtime nuances: more systems run with sandboxed runtimes (gVisor, WASM), which change failure surface area — but process-level chaos still matters for Node containers.

Start with a minimal principle: never surprise the orchestrator

Whether your service is run by pm2, systemd, Docker, or Kubernetes, the orchestrator expects certain signals. On Linux, SIGTERM is the polite request to stop. Docker and Kubernetes send SIGTERM (and later SIGKILL) when stopping a container. Your TypeScript code must:

  • Respond to SIGTERM/SIGINT.
  • Stop accepting new work (drain listeners, stop consuming from queues).
  • Complete or cancel in-flight work within a configured timeout.
  • Exit with appropriate status codes.

Graceful shutdown pattern in TypeScript

Below is a practical pattern using Node's AbortController and async shutdown handlers. It works with HTTP servers, message queue consumers, and background jobs.

import http from 'http';
// AbortController is a global in Node.js (v15+); no import is needed.

const controller = new AbortController();
const { signal } = controller;

const server = http.createServer((req, res) => {
  // Example work that listens to the shutdown signal
  if (signal.aborted) {
    res.writeHead(503);
    return res.end('shutting down');
  }

  // Normal request handling...
  res.end('ok');
});

const shutdownHandlers: Array<() => Promise<void>> = [];

function registerShutdown(handler: () => Promise<void>) {
  shutdownHandlers.push(handler);
}

async function gracefulShutdown(reason = 'SIGTERM') {
  console.log('Shutdown requested:', reason);
  controller.abort(); // inform handlers

  // Stop accepting new connections
  server.close((err) => {
    if (err) console.error('Error closing server', err);
  });

  // Run cleanup handlers with timeout
  const timeout = setTimeout(() => {
    console.warn('Forcing exit after timeout');
    process.exit(1);
  }, 30_000);

  try {
    await Promise.all(shutdownHandlers.map((h) => h()));
    clearTimeout(timeout);
    process.exit(0);
  } catch (e) {
    console.error('Shutdown error', e);
    process.exit(1);
  }
}

process.on('SIGTERM', () => gracefulShutdown('SIGTERM'));
process.on('SIGINT', () => gracefulShutdown('SIGINT'));

server.listen(3000);

// Example registered shutdown task: close DB
registerShutdown(async () => {
  // await db.close();
});

Key takeaways:

  • Use AbortController to broadcast a shutdown signal to async tasks.
  • Stop accepting new requests immediately, then wait for in-flight work to finish.
  • Enforce a maximum shutdown timeout to avoid hanging indefinitely.

Health checks: not all 'alive' checks are equal

Liveness checks answer: Is the process responsive or stuck? Readiness checks answer: Can this instance receive new traffic?

Design rules:

  • Make liveness fast and conservative — detect event-loop stalls or deadlocks, not transient DB latency.
  • Make readiness reflect dependencies: DB connection, cache primed, or migrations applied.
  • Expose application-level endpoints such as /health/live and /health/ready.

Example using a terminus-style approach (conceptual)

import http from 'http';

function liveness() {
  // Quick sanity checks: event loop delay, process memory
  return Promise.resolve();
}

function readiness() {
  // Check DB connectivity, external caches, or feature flags
  // (`db` is a placeholder for your database client)
  return db.ping();
}

// Hook these into /health/live and /health/ready routes
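One way to wire those checks into routes is with Node's built-in http module. This is a minimal sketch: the two check functions are placeholders for the real probes described above, and the port is arbitrary.

```typescript
import http from 'node:http';

// Placeholder checks: substitute your real event-loop and dependency probes.
async function liveness(): Promise<void> { /* always passes in this sketch */ }
async function readiness(): Promise<void> { /* e.g. await db.ping() */ }

const healthServer = http.createServer(async (req, res) => {
  const check = req.url === '/health/live' ? liveness
    : req.url === '/health/ready' ? readiness
    : null;
  if (!check) {
    res.writeHead(404).end();
    return;
  }
  try {
    await check();
    res.writeHead(200, { 'content-type': 'application/json' });
    res.end(JSON.stringify({ status: 'ok' }));
  } catch (e) {
    // A failed check returns 503 so the orchestrator stops routing traffic here.
    res.writeHead(503, { 'content-type': 'application/json' });
    res.end(JSON.stringify({ status: 'unavailable' }));
  }
});

healthServer.listen(3001);
```

Serving health endpoints from the same process (rather than a sidecar) keeps them honest: if the event loop is stuck, the probe times out, which is exactly the signal you want.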

In Kubernetes, set:

  • livenessProbe to a fast endpoint that detects stuck processes.
  • readinessProbe to an endpoint that checks upstream dependencies.
  • Configure initialDelaySeconds and failureThreshold conservatively to avoid flapping on deployments.
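In a PodSpec, that configuration might look like the fragment below. The image name is hypothetical and all delay/threshold values are illustrative; tune them to your service's startup profile.

```yaml
containers:
  - name: api
    image: my-service:latest   # hypothetical image name
    ports:
      - containerPort: 3000
    livenessProbe:
      httpGet:
        path: /health/live
        port: 3000
      initialDelaySeconds: 10
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 3000
      initialDelaySeconds: 5
      periodSeconds: 5
      failureThreshold: 3
```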

Fault injection: from toy scripts to production-grade experiments

Fault injection ranges from a local script that randomly exits a process to orchestrated chaos runs against a staging cluster. Use three levels:

  1. Developer-level: small tools that randomly exit your local process to validate graceful shutdown code.
  2. Staging-level: controlled chaos experiments (pod kill, network partition) with observability and rollback.
  3. Pre-production: automated chaos-as-code in CI that runs against ephemeral clusters and fails the pipeline when SLOs are violated.

Process roulette: a tiny TypeScript fault-injector

Build a compact dev tool that randomly kills a process to test your shutdown handlers. Use cautiously — only on dev/staging instances.

#!/usr/bin/env ts-node
// process-roulette.ts — kills the process randomly, used for dev testing

const minMs = 5_000;
const maxMs = 60_000;

function randBetween(a: number, b: number) {
  return Math.floor(Math.random() * (b - a)) + a;
}

const delay = randBetween(minMs, maxMs);
console.log(`Process roulette armed: will kill process in ${delay}ms`);

setTimeout(() => {
  console.warn('Process roulette: exiting now (SIGTERM)');
  // Simulate a polite stop so graceful shutdown handlers can run
  process.kill(process.pid, 'SIGTERM');
}, delay);

Tie this into your dev images or a sidecar container in a staging Pod to simulate sudden exits. The right approach depends on your objectives: exit quickly to test orchestrator restart behavior, or send SIGTERM to exercise graceful shutdown.

Chaos at scale: Kubernetes and resilience testing frameworks

For cluster-level experiments, use community tools:

  • LitmusChaos — Kubernetes-native chaos experiments (pod-kill, network-loss, IO stress).
  • Chaos Mesh — CRD-based chaos in Kubernetes with rich scenarios.
  • Gremlin — commercial platform that supports Kubernetes, cloud, and host-level attacks.

2025–2026 trend: chaos-as-code and GitOps. Teams define chaos experiments as YAML and run them as part of ephemeral environment creation in CI. Combine these with OpenTelemetry traces and SLO checks to stop experiments when service health is compromised.

Sample pipeline stage (conceptual)

  1. Deploy branch to ephemeral Kubernetes namespace (Helm + GitHub Actions).
  2. Run smoke tests and baseline SLO measurement (latency, error rate).
  3. Apply chaos experiment (pod-kill of 30% of replicas over 5 minutes).
  4. Measure SLOs; fail pipeline if error budget exceeded.
  5. Collect traces, metrics, and logs automatically for postmortem.
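The SLO gate in step 4 can be as simple as a pure function the pipeline runs against measured numbers. This is a sketch; the field names and thresholds are illustrative, not from any particular tool.

```typescript
interface SloResult {
  errorRate: number;    // fraction of failed requests during the chaos run
  p99LatencyMs: number; // observed p99 latency
}

interface SloBudget {
  maxErrorRate: number;
  maxP99LatencyMs: number;
}

// Returns the list of violated SLOs; an empty list means the pipeline may pass.
function checkSlo(result: SloResult, budget: SloBudget): string[] {
  const violations: string[] = [];
  if (result.errorRate > budget.maxErrorRate) {
    violations.push(`error rate ${result.errorRate} > ${budget.maxErrorRate}`);
  }
  if (result.p99LatencyMs > budget.maxP99LatencyMs) {
    violations.push(`p99 ${result.p99LatencyMs}ms > ${budget.maxP99LatencyMs}ms`);
  }
  return violations;
}

// Example: a chaos run that stayed within budget
const ok = checkSlo(
  { errorRate: 0.004, p99LatencyMs: 420 },
  { maxErrorRate: 0.01, maxP99LatencyMs: 500 },
);
console.log(ok.length === 0 ? 'SLOs met' : `FAIL: ${ok.join('; ')}`);
```

In CI, the script would exit non-zero on any violation so the pipeline stage fails.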

Resilience patterns and TypeScript-specific practices

Beyond crashes, make your code resilient to partial failures:

  • Circuit breakers: use libraries like opossum or write a typed wrapper to avoid cascading failures.
  • Retries with backoff: implement idempotent retries, respecting AbortSignal for cancellations during shutdown.
  • Bulkheads: separate job types and resource pools (e.g., dedicated worker threads) so one failure class doesn't take down everything.
  • Typed errors: use discriminated unions in TypeScript for error handling so retry logic is deterministic.

Example: typed retry that respects AbortSignal

type RetryableError = { kind: 'RateLimit' } | { kind: 'Transient' } | { kind: 'Permanent' };

async function retryWithBackoff<T>(
  fn: (signal: AbortSignal) => Promise<T>,
  signal: AbortSignal,
  attempts = 3,
): Promise<T> {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await fn(signal);
    } catch (e) {
      if (signal.aborted) throw new Error('aborted');
      const err = e as RetryableError;
      if (err.kind === 'Permanent') throw err;
      // Exponential backoff that the shutdown signal can interrupt
      const delay = Math.pow(2, attempt) * 100;
      await new Promise<void>((resolve, reject) => {
        const t = setTimeout(resolve, delay);
        signal.addEventListener('abort', () => {
          clearTimeout(t);
          reject(new Error('aborted'));
        }, { once: true });
      });
    }
  }
  throw new Error('failed after retries');
}
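For the circuit-breaker pattern, in production you would likely reach for opossum, but the core idea fits in a few typed lines. This is a minimal sketch without half-open probing, which a real implementation needs.

```typescript
type BreakerState = 'closed' | 'open';

class CircuitBreaker<T> {
  private failures = 0;
  private state: BreakerState = 'closed';
  private openedAt = 0;

  constructor(
    private fn: () => Promise<T>,
    private threshold = 5,           // consecutive failures before opening
    private resetTimeoutMs = 10_000, // how long to stay open
  ) {}

  async fire(): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() - this.openedAt < this.resetTimeoutMs) {
        // Fail fast without touching the struggling dependency.
        throw new Error('circuit open: failing fast');
      }
      this.state = 'closed'; // naive reset; real breakers go half-open first
      this.failures = 0;
    }
    try {
      const result = await this.fn();
      this.failures = 0;
      return result;
    } catch (e) {
      if (++this.failures >= this.threshold) {
        this.state = 'open';
        this.openedAt = Date.now();
      }
      throw e;
    }
  }
}
```

opossum layers half-open probing, fallbacks, and event hooks on top of this core idea, which is why a library is the right call once you move past the learning stage.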

Developer ergonomics: tooling, tsconfig, and linters for safer shutdowns

TypeScript and build-time tools can help prevent risky patterns:

  • ESLint rules to catch process.exit() or un-awaited promises. For example, n/no-process-exit (from eslint-plugin-n) and @typescript-eslint/no-floating-promises, or a custom rule to flag calls from production code.
  • tsconfig with strict checks so error types are explicit: "strict": true, "noImplicitAny": true.
  • Build pipeline that fails if tests do not cover shutdown and health endpoints. Add chaos tests as separate stage.
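An ESLint config fragment along these lines enforces both rules. This assumes eslint-plugin-n and @typescript-eslint are installed; note that no-floating-promises needs type information, hence parserOptions.project.

```json
{
  "parser": "@typescript-eslint/parser",
  "parserOptions": { "project": "./tsconfig.json" },
  "plugins": ["n", "@typescript-eslint"],
  "rules": {
    "n/no-process-exit": "error",
    "@typescript-eslint/no-floating-promises": "error"
  }
}
```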

Example snippet for tsconfig.json (recommended for services):

{
  "compilerOptions": {
    "target": "ES2022",
    "module": "CommonJS",
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true,
    "forceConsistentCasingInFileNames": true,
    "outDir": "dist",
    "sourceMap": true
  }
}

Observability: how to measure whether chaos experiments are safe

Before running chaos, secure observability:

  • Traces (OpenTelemetry): capture spans for critical flows to diagnose timeouts or retries.
  • Metrics: request latency, error rates, active connections, queue sizes, consumer lag.
  • Logs: structured logs with correlation IDs; include shutdown and handler lifecycle events.
  • SLOs: define error budget and latency SLOs — these are the pass/fail criteria for automated experiments.
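Structured shutdown logs with correlation IDs need nothing beyond the standard library. A sketch, where the field names are a suggestion rather than a standard:

```typescript
import { randomUUID } from 'node:crypto';

// Emit one JSON object per line so log pipelines can parse lifecycle events.
function formatEvent(event: string, fields: Record<string, unknown> = {}): string {
  return JSON.stringify({ ts: new Date().toISOString(), event, ...fields });
}

// A correlation ID ties one shutdown's events together across log lines.
const shutdownId = randomUUID();
console.log(formatEvent('shutdown.requested', { shutdownId, reason: 'SIGTERM' }));
console.log(formatEvent('shutdown.handler.done', { shutdownId, handler: 'db.close' }));
```

During a chaos run, filtering on one shutdownId reconstructs the exact order of drain, handler, and exit events for the postmortem.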

Practical checklist to harden a Node + TypeScript service (actionable)

  1. Implement graceful shutdown using AbortController and shutdown handlers; test locally with a process-roulette script.
  2. Expose /health/live and /health/ready; implement conservative liveness checks and dependency-aware readiness checks.
  3. Add ESLint rules to ban process.exit calls in production code and enable TypeScript strict mode.
  4. Instrument with OpenTelemetry; ensure traces include shutdown and retry paths.
  5. Automate chaos experiments in staging. Start small: single pod kill, then scale to % of replicas.
  6. Integrate chaos-as-code into CI with SLO assertions. Fail the pipeline when error budgets are exceeded.
  7. Document runbooks and add quick telemetry dashboards for chaos experiments.

Common gotchas and how to avoid them

  • Relying on process uptime: A "healthy" process may still be unable to serve traffic — make readiness checks dependable.
  • Forgotten infinite timers: background timers or unhandled promises can keep Node from exiting; ensure you cancel them on shutdown.
  • Blocking the event loop: expensive synchronous work will block responsiveness and make liveness checks lie. Move heavy CPU tasks to workers.
  • Not testing AbortSignal paths: make sure your retry and database libraries respect AbortSignal so shutdown actually cancels outstanding work.
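On the event-loop point, Node's perf_hooks module can back a liveness check that actually detects stalls rather than lying about them. A sketch; the 200 ms threshold is an arbitrary example.

```typescript
import { monitorEventLoopDelay } from 'node:perf_hooks';

// Continuously sample event-loop delay; the histogram reports nanoseconds.
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

const MAX_DELAY_MS = 200; // illustrative threshold

// Liveness passes only while the p99 event-loop delay stays under threshold.
function eventLoopHealthy(): boolean {
  const p99Ms = histogram.percentile(99) / 1e6; // ns -> ms
  return p99Ms < MAX_DELAY_MS;
}
```

Wiring eventLoopHealthy() into the /health/live handler means a blocked event loop surfaces as a failed probe instead of a mystery outage.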

Integrating with pm2, Docker, and Kubernetes — concrete tips

  • pm2: enable gracefulReload and set wait_ready to true if you use process messaging for readiness. Use pm2 kill only in controlled scenarios.
  • Docker: use HEALTHCHECK and STOPSIGNAL in Dockerfile. Configure container stop grace period with --stop-timeout or PodSpec terminationGracePeriodSeconds.
  • Kubernetes: prefer readiness probes that call application-level checks. Use preStop hook to delay termination while you drain connections (but prefer application-driven draining via SIGTERM).
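In a Dockerfile, the two directives mentioned above look like this. The image tag and dist/server.js path are illustrative; the health check uses node itself so the image doesn't need curl.

```dockerfile
FROM node:22-slim
WORKDIR /app
COPY dist/ ./dist/
# SIGTERM is Docker's default stop signal; stating it makes the contract explicit.
STOPSIGNAL SIGTERM
# Mark the container unhealthy if the readiness endpoint fails repeatedly.
HEALTHCHECK --interval=10s --timeout=3s --retries=3 \
  CMD node -e "fetch('http://localhost:3000/health/ready').then(r => process.exit(r.ok ? 0 : 1)).catch(() => process.exit(1))"
EXPOSE 3000
CMD ["node", "dist/server.js"]
```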

Putting it together: an experiment plan you can run this week

  1. Add the graceful shutdown pattern to your service (15–60 minutes).
  2. Write two endpoints: /health/live and /health/ready. Test them locally (30 minutes).
  3. Run a local process-roulette script while you curl the health endpoints — verify readiness flips and shutdown logs appear (15–30 minutes).
  4. Create a staging chaos test: kill one pod and watch traffic shift and retries (1–2 hours). Collect traces and measure error rate.
  5. Automate the same scenario in CI with SLO checks gating merges (1–2 days to mature into process).

Final thoughts: treat chaos as rehearsed, observable, and reversible

Process roulette is a fun provocation, but the discipline is the lesson: controlled failure reveals assumptions. In 2026, chaos engineering is no longer optional for teams running user-facing distributed systems — it's part of professional engineering hygiene. Make failure safe by building graceful shutdown, accurate health checks, observability, and automated experiments into your TypeScript Node toolchain.

"You can't fix what you don't rehearse." — a modern ops proverb

Actionable next steps

  • Copy the graceful shutdown example into your repo and run it under a process-roulette script.
  • Add basic readiness and liveness endpoints and wire them into your Kubernetes probes.
  • Set up a single chaos experiment in staging (pod kill) and require OpenTelemetry traces for every experiment run.

If you want a checklist, starter repo, and a CI pipeline snippet that runs chaos-as-code against ephemeral clusters, grab the companion GitHub repo linked from this article (or contact us for a workshop).

Call to action

Start rehearsing failure this week: implement graceful shutdown, add health checks, and run a one-click chaos experiment in your staging environment. If you'd like, download the example TypeScript repo and CI templates I use to run safe chaos runs in ephemeral Kubernetes namespaces — or book a hands-on session to harden your services with SLO-driven chaos tests. Don't wait for the on-call page to teach you the lesson.
