System Nīti

Chanakya's Manual for Designing Resilient Systems
Before you start some work, always ask three questions — Why am I doing it, What the results might be, and Will I be successful? — Chanakya

You who would be ruler of systems, listen first to the politics of life. In courts, a whisper kills kingdoms; in markets, a slow leak bankrupts treasuries. The state that forgets its walls falls; the software that forgets its contracts collapses. Systems are small republics — they have ministers (processes), spies (telemetry), treasuries (resources), borders (APIs), traitors (bugs) and heirs (maintenance teams). Design your system as a wise king designs a realm: with foresight, with ruthless priorities, and with the humility to expect betrayal.

If you read no further than this opening, remember this like a campaign order: build so that your system survives the crisis you cannot imagine. For those who survive past these opening salvos, I will now strip ceremony and hand you the doctrine, the field-manual, the step-by-step war-plan for constructing systems that endure.

I

The Counsel: Start With Purpose, Not Features

Sutra: A throne without cause is empty — a system without clear intent is fragile.

1. Three Questions (Chanakya's first litmus):

  • Why are we building this? (Business need, user need, metric to change)
  • What will success look like? (SLOs, throughput, latency, adoption targets)
  • What is the failure mode we must survive? (Data loss, extended downtime, security breach)

2. Write down the non-negotiables. Not "nice-to-haves." Not "maybe-laters." The list that follows these three questions becomes your acceptance criteria and boundary for trade-offs.

3. Quantify everything. "Fast" becomes p95 < 100ms; "reliable" becomes availability 99.95%; "cheap" becomes $X per 1M requests. Numbers make politics into engineering.

Checklist (Requirements Sprint):

  • Business intent documented (one paragraph)
  • Primary users & flows mapped
  • Success metrics (SLOs) defined with error budget
  • Critical failure modes enumerated and prioritized

II

The Foundation: Architecture as a Defensive Plan

Sutra: Build the fort before you invite the market.

Architecture is not aesthetic. It is survival planning. Your architecture must answer: how will the system behave under normal load, under burst, under network partitions, under disk corruption, under long-term neglect?

1. Start with a layered model

  • Control plane: orchestration, config, leader election
  • Data plane: actual request handling, storage
  • Infrastructure plane: networking, infra-as-code, CI/CD
  • Observation plane: metrics, logs, traces, alerts

Separate concerns so that failures in one plane minimally affect others.

2. Choose your data model early

OLTP (ACID DB) vs OLAP (column store) vs event log (append-only). Map each domain entity to the most appropriate store. Do not generalize one DB for everything just because it's convenient.

3. Domain-driven decomposition

Design boundaries by business domain; avoid breaking on implementation details. For each service: define its contract (API), data ownership (who is the source of truth), and expected load profile.

Do not splinter into microservices because microservices are trendy. Microservices should serve ownership and scaling; otherwise they are bureaucracy.

III

The Treasury: Resource Management & Cost Discipline

Sutra: Extravagance is the first ruin of kings.

Resources are your treasury: CPU, memory, I/O, network, and cloud dollars. Treat them as finite and precious.

1. Capacity planning

Estimate QPS, request size, processing time, and database ops per request. Convert these to CPU, memory, and I/O needs; include headroom (a safety margin, typically 2–3x for initial planning).
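
A rough sketch of that conversion, assuming a CPU-bound service. The function names, the 60% utilization target, and the workload numbers are illustrative assumptions, not measurements:

```python
import math

def cores_needed(peak_qps: float, cpu_seconds_per_request: float,
                 target_utilization: float = 0.6, headroom: float = 2.0) -> int:
    """Estimate CPU cores for a service (hypothetical helper).

    peak_qps * cpu_seconds_per_request is the number of cores kept busy at
    peak. Dividing by target_utilization leaves room for queueing, and
    headroom is the 2-3x planning margin.
    """
    busy_cores = peak_qps * cpu_seconds_per_request
    return math.ceil(busy_cores / target_utilization * headroom)

def db_ops_per_second(peak_qps: float, db_ops_per_request: float) -> float:
    """Database operations per second implied by the request rate."""
    return peak_qps * db_ops_per_request

# Assumed workload: 2,000 requests/s, 5 ms of CPU each, 3 DB ops each.
print(cores_needed(2000, 0.005))    # 10 busy cores / 0.6 * 2.0, rounded up: 34
print(db_ops_per_second(2000, 3))   # 6000
```

The same shape works for memory and I/O: multiply the per-request cost by peak rate, then apply the margin.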

2. Cost as a first-class constraint

Design caching, batching, and asynchronous patterns to reduce resource spend. Use cost-aware autoscaling; never autoscale without upper bounds.

3. Throttling and quotas

Rate limit at edge (API gateway) and per-tenant. Provide graceful responses (429 + Retry-After) and well-documented quotas.
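
One common way to implement such limits is a token bucket per tenant. The sketch below is a minimal in-memory version; the class name, `check_quota`, and the default rate and capacity are assumptions for illustration, and a real gateway would keep this state in shared storage:

```python
import math

class TokenBucket:
    """Allows bursts up to `capacity`, refilling `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated_at = now

    def try_acquire(self, now: float):
        """Return (status_code, headers) for one request arriving at `now`."""
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return 200, {}
        # Seconds until one full token is available, rounded up for the header.
        wait = (1.0 - self.tokens) / self.rate
        return 429, {"Retry-After": str(math.ceil(wait))}

buckets = {}  # tenant id -> TokenBucket (hypothetical per-tenant quota table)

def check_quota(tenant: str, now: float, rate: float = 1.0, capacity: float = 2.0):
    """Look up (or create) the tenant's bucket and try to admit one request."""
    bucket = buckets.setdefault(tenant, TokenBucket(rate, capacity, now))
    return bucket.try_acquire(now)
```

Passing `now` explicitly keeps the sketch deterministic; in a service it would come from a monotonic clock.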

Checklist (Resource Discipline):

  • Load model & cost projection documented
  • Rate limits and quotas defined
  • Autoscaling policy and bounds configured

IV

The Spies: Observability and Intelligence

Sutra: Spies are the eyes of the king; telemetry is the mind of the system.

A blind ruler reacts slowly. Observability gives you early warning and the ability to reason after incidents.

1. Instrument everything from day one

  • Metrics (counters, gauges, histograms). Focus on p50, p95, p99 latencies.
  • Structured logs with trace and request identifiers.
  • Distributed tracing to connect flows across services.
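
As an illustration of structured logs that carry trace and request identifiers, here is a small sketch using only the Python standard library. The field names, the `make_logger` helper, and the `checkout` logger are assumptions; a real system would usually take the trace ID from its tracing library rather than generate one:

```python
import io
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Populated through the `extra` argument of the logging call.
            "trace_id": getattr(record, "trace_id", None),
            "request_id": getattr(record, "request_id", None),
        })

def make_logger(name: str, stream) -> logging.Logger:
    """Hypothetical helper: a logger that writes JSON lines to `stream`."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger

stream = io.StringIO()  # stands in for stdout or a log shipper
log = make_logger("checkout", stream)
trace_id = uuid.uuid4().hex
log.info("order placed", extra={"trace_id": trace_id, "request_id": "req-123"})
```

Because every line is a JSON object with the same keys, logs from different services can be joined on `trace_id` after an incident.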

2. Design meaningful dashboards and alerts

Dashboards for business KPIs, system health (latency, errors, saturation), and infra. Alerts tuned to actionability; paging only when human intervention is needed. Use alert severity and runbooks.

3. Use SLOs & error budgets to govern change

Create SLOs: latency & availability targets per service. Use error budget consumption as a governance mechanism for releases and experiments.

Example: SLO = 99.95% availability measured monthly; error budget = 0.05% of the month, about 21.9 minutes (an average month has 43,830 minutes).
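
The arithmetic behind an error budget can be checked in a few lines. This sketch assumes a time-based availability SLO and an average month of 365.25 / 12 days; `release_allowed` and its 80% freeze threshold are a hypothetical policy, not a standard:

```python
MINUTES_PER_MONTH = 365.25 * 24 * 60 / 12  # 43,830 minutes in an average month

def error_budget_minutes(slo_percent: float,
                         window_minutes: float = MINUTES_PER_MONTH) -> float:
    """Minutes of allowed unavailability for a given availability SLO."""
    return (100.0 - slo_percent) / 100.0 * window_minutes

def release_allowed(slo_percent: float, bad_minutes_so_far: float,
                    freeze_at: float = 0.8) -> bool:
    """Assumed policy: freeze risky releases once 80% of the budget is spent."""
    return bad_minutes_so_far < freeze_at * error_budget_minutes(slo_percent)

print(f"{error_budget_minutes(99.95):.1f}")  # about 21.9 minutes per month
print(f"{error_budget_minutes(99.99):.1f}")  # about 4.4 minutes per month
```

Under these assumptions a 99.95% target leaves roughly 21.9 minutes per month, while a budget near 4.4 minutes corresponds to a 99.99% target.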

V

The Ministers: Ownership, Teams & Governance

Sutra: A throne without loyal ministers is bare; a system without ownership rots.

Design for teams as much as you design for code.

1. Team boundaries mirror service boundaries

Small, cross-functional teams own services end-to-end: code, infra, runbooks, SLOs.

2. Clear on-call responsibilities

No black boxes. On-call must know how to run, debug, and roll back.

3. Code review and architecture review

Change control with lightweight guardrails: design docs for non-trivial changes, sanity tests, and staged rollout.

Beware central platforms that solve everything and then block teams. A platform should empower, not enslave.

VI

The Borders: API Design and Contracts

Sutra: Promises bind the weak; contracts bind the strong.

APIs are borders between polities. Make them stable, versioned, and backward-compatible.

1. Contract-first API design

Define API schemas (OpenAPI/Protobuf) and generate client/server stubs. Mock the API and integrate early.

2. Versioning and evolution

Use additive changes; deprecate before removal. Prefer feature flags and negotiation over hard-breaking upgrades.

3. Defensive design

Timeouts, retries with exponential backoff and jitter, idempotency tokens for safe retries.

Writes that are naturally idempotent (set a value, delete by id) are safe to retry as they are; for non-idempotent writes, require a client-supplied idempotency key and deduplicate on the server.
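
A minimal sketch of server-side deduplication keyed on an idempotency key. `ChargeService` and its fields are hypothetical, and the in-memory dictionary stands in for what would need to be a durable store with an atomic check-and-set:

```python
class ChargeService:
    """Replays the stored response when the same idempotency key is seen again."""

    def __init__(self):
        self.charges = []   # side effects actually performed
        self._seen = {}     # idempotency key -> stored response

    def create_charge(self, idempotency_key: str, account: str,
                      amount_cents: int) -> dict:
        if idempotency_key in self._seen:
            # A retry of a request we already handled: no second charge.
            return self._seen[idempotency_key]
        self.charges.append((account, amount_cents))
        response = {"charge_id": len(self.charges), "status": "captured"}
        self._seen[idempotency_key] = response
        return response

service = ChargeService()
first = service.create_charge("key-1", "acct-42", 1999)
retry = service.create_charge("key-1", "acct-42", 1999)  # client retried after a timeout
```

The client generates the key once per logical operation and reuses it on every retry, so a timeout followed by a retry cannot charge twice.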

VII

The Army: Scaling & Data Sharding

Sutra: Armies that cannot coordinate lose battles; data that cannot be partitioned chokes systems.

1. Horizontal scaling first

Design services to be stateless where possible; stateful components require partitioning.

2. Sharding strategies

  • Range-based sharding (ordered keys): efficient for range queries, requires rebalancing.
  • Hash-based sharding: uniform distribution, harder to do range queries.
  • Hybrid: hash by customer, range by time, etc.

3. Rebalancing & consistent hashing

Use consistent hashing or virtual nodes for smoother shard movement. Plan for resharding operations and maintenance windows.

Example: For a multi-tenant billing system, shard by tenant-id (hash), and within tenant use range-partitioned tables for time-series invoice data.
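
To make consistent hashing with virtual nodes concrete, here is a small sketch. The `HashRing` class, the shard names, and the choice of 100 virtual nodes are assumptions for illustration:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() varies between runs)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

class HashRing:
    def __init__(self, nodes=(), vnodes: int = 100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Each physical node owns many small arcs of the ring.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def lookup(self, key: str) -> str:
        """Return the node owning the first ring position at or after the key."""
        if not self._ring:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]

ring = HashRing(["shard-a", "shard-b", "shard-c"])
owner = ring.lookup("tenant-1234")
```

When a shard is added, only the keys that land on the new shard's arcs move; every other key keeps its owner, which is what makes shard movement smoother than rehashing everything modulo the shard count.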

VIII

The Bargain: Caching & Consistency

Sutra: Trade-offs are the currency of rulers.

Caching is powerful but makes correctness harder. Consistency models are a contract you choose.

1. Caching patterns

  • Read-through cache: Cache sits in front of DB, fetch on cache-miss.
  • Write-through / write-behind: Synchronous vs asynchronous update patterns.

Cache invalidation is famously one of the hard problems in computing. Favor simple, conservative strategies: TTLs, explicit invalidation on write.
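
A sketch of a read-through cache that combines both conservative strategies, a TTL and explicit invalidation on write. `ReadThroughCache`, the `load` callback, and the injectable clock are assumptions made so the example stays self-contained:

```python
import time

class ReadThroughCache:
    def __init__(self, load, ttl_seconds: float, clock=time.monotonic):
        self._load = load          # fetches the value from the source of truth
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}         # key -> (value, expires_at)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]        # fresh hit
        value = self._load(key)    # miss or expired: read through to the store
        self._entries[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key) -> None:
        """Call after every write to the underlying store."""
        self._entries.pop(key, None)

db = {"user:1": "Ada"}  # stands in for the database
cache = ReadThroughCache(load=db.__getitem__, ttl_seconds=60)
name = cache.get("user:1")
```

The TTL bounds how stale an entry can get even if an invalidation is missed, which is why the two strategies are usually paired.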

2. Consistency choices

  • Strong consistency (linearizability) when correctness matters (banking).
  • Eventual consistency where availability and latency matter (social feed).
  • Use causal or session consistency when user-experience requires ordering guarantees.

3. Transactions & Idempotency

Use distributed transactions sparingly. Prefer saga patterns for cross-service workflows: orchestration or choreography, with a compensating action for every step that may need to be undone.
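
A minimal sketch of the orchestrated form of a saga: run the steps in order and, on failure, run the compensations of the completed steps in reverse. `run_saga` and the step names are hypothetical, and a real orchestrator would also persist its progress so it can resume after a crash:

```python
def run_saga(steps):
    """Run (name, action, compensation) steps in order.

    Returns (True, None) on success, or (False, failed_step_name) after
    undoing every completed step in reverse order.
    """
    completed = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            for _done_name, undo in reversed(completed):
                undo()
            return False, name
        completed.append((name, compensate))
    return True, None

log = []

def charge_card():
    raise RuntimeError("card declined")  # simulated downstream failure

order_saga = [
    ("reserve_stock", lambda: log.append("reserve"), lambda: log.append("release")),
    ("charge_card", charge_card, lambda: log.append("refund")),
    ("ship", lambda: log.append("ship"), lambda: log.append("cancel_shipment")),
]
result = run_saga(order_saga)
```

Here the stock reservation is released because the charge failed, and the shipping step never runs. Compensations should themselves be idempotent, since they may be retried.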

Practical Example: payment processing requires strong consistency. Use two-phase commit only if unavoidable; prefer a single-service transaction, with event-driven updates for downstream processing.

IX

The Siege: Faults, Circuit Breakers and Backpressure

Sutra: When the enemy battering-ram approaches, do not flood the gate with defenders; funnel and control the flow.

1. Expect failures — design for them

  • Implement circuit breakers to avoid cascading failures.
  • Use bulkheads to isolate failure domains.
  • Apply backpressure at ingress: reject, queue, or shed load.

2. Retry and idempotency

  • Retry with exponential backoff and jitter.
  • Ensure operations are idempotent when retried; if not, employ deduplication.
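
Retry with exponential backoff and jitter can be sketched as follows. This uses the "full jitter" variant (a random delay between zero and the exponential ceiling); the function name, defaults, and the set of retryable exceptions are assumptions:

```python
import random
import time

def call_with_retries(operation, retries: int = 3, base: float = 0.1,
                      cap: float = 5.0, sleep=time.sleep, rng=random.random,
                      retry_on=(ConnectionError, TimeoutError)):
    """Call `operation`, retrying transient failures up to `retries` times.

    The delay before retry number i is rng() * min(cap, base * 2**i), so
    clients that failed together do not all retry at the same instant.
    """
    for attempt in range(retries + 1):
        try:
            return operation()
        except retry_on:
            if attempt == retries:
                raise  # budget exhausted: surface the last error
            sleep(rng() * min(cap, base * 2 ** attempt))

calls = []

def flaky():
    """Simulated dependency that fails twice, then succeeds."""
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("transient")
    return "ok"

outcome = call_with_retries(flaky, sleep=lambda seconds: None)
```

Only wrap operations that are safe to repeat; anything else needs an idempotency key or deduplication first.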

3. Graceful degradation

Serve reduced functionality instead of full failure: degrade recommendation features under load, keep payments working.

Express each policy with explicit, tunable parameters:

  • Circuit breaker trips if error rate > X% and number of requests > Y in window Z.
  • Retry up to N times with backoff base B and jitter J.
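
The circuit-breaker rule above maps fairly directly onto code. In this sketch X, Y, and Z become `max_error_rate`, `min_requests`, and `window_seconds`; the cooldown and the simplified half-open behaviour (any request may probe after the cooldown) are assumptions added for illustration:

```python
from collections import deque

class CircuitBreaker:
    def __init__(self, max_error_rate, min_requests, window_seconds, cooldown_seconds):
        self.max_error_rate = max_error_rate      # X, as a fraction (0.5 = 50%)
        self.min_requests = min_requests          # Y
        self.window_seconds = window_seconds      # Z
        self.cooldown_seconds = cooldown_seconds  # assumed: how long to stay open
        self._events = deque()                    # (timestamp, succeeded)
        self._opened_at = None                    # None means closed

    def allow(self, now) -> bool:
        """Closed: allow. Open: reject until the cooldown passes, then allow probes."""
        if self._opened_at is None:
            return True
        return now - self._opened_at >= self.cooldown_seconds

    def record(self, now, succeeded: bool) -> None:
        if self._opened_at is not None:
            # Probe result: success closes the breaker, failure restarts the cooldown.
            if succeeded:
                self._opened_at = None
                self._events.clear()
            else:
                self._opened_at = now
            return
        self._events.append((now, succeeded))
        # Drop events that have fallen out of the sliding window.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            self._events.popleft()
        total = len(self._events)
        failures = sum(1 for _, ok in self._events if not ok)
        if total > self.min_requests and failures / total > self.max_error_rate:
            self._opened_at = now

breaker = CircuitBreaker(max_error_rate=0.5, min_requests=4,
                         window_seconds=10, cooldown_seconds=30)
```

The minimum request count matters: without it, a single failed request on a quiet service would read as a 100% error rate and trip the breaker.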

X

The Archivist: Data Retention, Backups & Recovery

Sutra: A king without records forgets his lineage; a system without backups forgets its state.

1. Backup strategy

  • RPO & RTO determine backup cadence.
  • For databases: point-in-time recovery, incremental backups, tested restores.
  • For object storage: replication across regions + lifecycle policies.

2. Test restores regularly

Restore process must be part of CI/CD; test in isolated infra.

3. Migration and schema evolution

Use backward-compatible schema changes (add optional columns, create new tables). Online migration strategies: dual-write + backfill, blue-green DB migrations.

XI

The Law: Security & Least Privilege

Sutra: An open gate invites invasion; secrecy and limits protect the realm.

1. Principle of least privilege

IAM roles per service with minimal permissions. Short-lived credentials and mutual TLS for service-to-service auth.

2. Threat modeling

Threat model early. Build the list of assets, threats and mitigations. Automate security controls: dependency scanning, secret scanning, SAST/DAST pipelines.

3. Data protection

Encrypt at rest and in transit. Mask PII and enforce access policies; audit all access.

Maintain and rehearse an incident response plan and a public communication template.

XII

The Market: Deployment Strategy & CI/CD

Sutra: A wise king does not burn the city with each celebration.

1. Deployment patterns

  • Canary releases: route small % of traffic to new version; monitor errors and latency.
  • Blue-green: instant rollback by switching routing.
  • Feature flags: decouple deploy from release.

2. Automation and reproducibility

  • Infrastructure as code.
  • Immutable images/containers.
  • Declarative manifests and scripted rollbacks.

3. Release governance

Gate releases by SLO health, smoke tests, and canary stability. Automate post-deploy verification.
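
An automated canary gate might look like the sketch below. The metric names, thresholds, and the three verdicts are illustrative assumptions rather than recommended values, and the baseline is assumed to have served at least one request:

```python
def canary_verdict(baseline: dict, canary: dict, min_requests: int = 500,
                   max_error_rate_increase: float = 0.005,
                   max_p95_ratio: float = 1.2) -> str:
    """Compare canary metrics against the stable version.

    Returns "wait" until the canary has seen enough traffic, "rollback" if
    errors or p95 latency regress beyond the thresholds, else "promote".
    """
    if canary["requests"] < min_requests:
        return "wait"
    baseline_error_rate = baseline["errors"] / baseline["requests"]
    canary_error_rate = canary["errors"] / canary["requests"]
    if canary_error_rate - baseline_error_rate > max_error_rate_increase:
        return "rollback"
    if canary["p95_ms"] > baseline["p95_ms"] * max_p95_ratio:
        return "rollback"
    return "promote"

stable = {"requests": 100_000, "errors": 100, "p95_ms": 80}
candidate = {"requests": 1_000, "errors": 1, "p95_ms": 85}
verdict = canary_verdict(stable, candidate)  # "promote"
```

Comparing against the live baseline rather than fixed thresholds keeps the gate meaningful when overall traffic conditions change during the rollout.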

XIII

The Trial by Fire: Chaos & Resilience Testing

Sutra: As soon as the fear approaches near, attack and destroy it.

Chaos engineering is you bringing fear close, so you learn to destroy it on your terms.

1. Start small

Fault-injection on non-critical paths. Inject latency, drop packets, kill instances.

2. Runbook-driven experiments

Predefine hypothesis, metrics, blast radius, and rollback plan. Use chaos results to improve runbooks, timeouts, and redundancy.

XIV

The Heirs: Documentation & Succession

Sutra: If knowledge dies with the king, the realm enters anarchy.

1. Living docs

System overviews, data flows, API contracts — single source of truth (e.g., an internal docs portal). Runbooks for incidents & postmortems.

2. Onboarding as part of code quality

A new engineer should be able to ship a small, safe change in 1–2 weeks. If not, the system is too complex.

XV

The Last Counsel: Governance, Cost & Ethics

Sutra: Not knowledge alone, nor force alone; with both the king endures.

Governance: balance autonomy with platform conventions. Enforce standards without becoming a blockade.

Cost: attribute spend to each service; teams own their cost and its optimization.

Ethics: guard user data. Trade-offs made for speed must still respect privacy.

Closing Campaign Orders — A Practical Runbook

  1. Define success — pick 3 metrics (latency p95, availability %, cost per million).
  2. Draw boundaries — one page diagram of services & data ownership.
  3. Choose stores — map each data domain to DB type and partition key.
  4. Instrument — wire metrics, traces, logs, and a dashboard.
  5. Set SLOs — error budget & release policy.
  6. Automate deployment — CI/CD, canaries, feature flags.
  7. Practice failure — scheduled chaos, restore drills, postmortems.
  8. Document — publish runbooks and onboarding paths.
  9. Govern — retrospective on cost & risk every quarter.
  10. Repeat — architecture evolves; revisit the three questions every sprint.

Build with the humility of the defeated and the strategy of the conqueror. A great system is not one that dazzles at launch; it is the one that stands silent through ten winters of traffic, betrayal, and greed. Design not for glory, but for surviving the day the unknown arrives.