System Nīti — Chanakya's Manual for Designing Resilient Systems

You who would be ruler of systems, listen first to the politics of life. In courts, a whisper kills kingdoms; in markets, a slow leak bankrupts treasuries. The state that forgets its walls falls; the software that forgets its contracts collapses. Systems are small republics — they have ministers (processes), spies (telemetry), treasuries (resources), borders (APIs), traitors (bugs) and heirs (maintenance teams). Design your system as a wise king designs a realm: with foresight, with ruthless priorities, and with the humility to expect betrayal.

If you read no further than this opening, remember this like a campaign order: build so that your system survives the crisis you cannot imagine. For those who survive past these opening salvos, I will now strip ceremony and hand you the doctrine, the field-manual, the step-by-step war-plan for constructing systems that endure.

I

The Counsel: Start With Purpose, Not Features

Sutra: A throne without cause is empty — a system without clear intent is fragile.

1. Three Questions (Chanakya's first litmus):

Why are we building this? (Business need, user need, metric to change)
What will success look like? (SLOs, throughput, latency, adoption targets)
What is the failure mode we must survive? (Data loss, extended downtime, security breach)

2. Write down the non-negotiables. Not "nice-to-haves." Not "maybe-laters." The list that follows these three questions becomes your acceptance criteria and boundary for trade-offs.

3. Quantify everything. "Fast" becomes p95 < 100ms; "reliable" becomes availability 99.95%; "cheap" becomes $X per 1M requests. Numbers make politics into engineering.
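
To show how a phrase like "p95 < 100ms" turns into a check a machine can run, here is a minimal sketch. The nearest-rank percentile method, the function names, and the sample latencies are assumptions made for illustration, not something this manual prescribes.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

def meets_latency_target(samples_ms, pct=95, target_ms=100.0):
    """True when the chosen percentile is strictly below the target."""
    return percentile(samples_ms, pct) < target_ms

# Hypothetical latencies in milliseconds; one slow outlier dominates the tail.
latencies_ms = [12, 15, 18, 22, 25, 31, 40, 55, 80, 240]
print(percentile(latencies_ms, 50), percentile(latencies_ms, 95))
print(meets_latency_target(latencies_ms))
```

The median looks healthy while the tail does not, which is why the target is stated as a percentile rather than an average.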

Checklist (Requirements Sprint):

Business intent documented (one paragraph)
Primary users & flows mapped
Success metrics (SLOs) defined with error budget
Critical failure modes enumerated and prioritized

II

The Foundation: Architecture as a Defensive Plan

Sutra: Build the fort before you invite the market.

Architecture is not aesthetic. It is survival planning. Your architecture must answer: how will the system behave under normal load, under burst, under network partitions, under disk corruption, under long-term neglect?

1. Start with a layered model

Control plane: orchestration, config, leader election
Data plane: actual request handling, storage
Infrastructure plane: networking, infra-as-code, CI/CD
Observation plane: metrics, logs, traces, alerts

Separate concerns so that failures in one plane minimally affect others.

2. Choose your data model early

OLTP (ACID DB) vs OLAP (column store) vs event log (append-only). Map each domain entity to the most appropriate store. Do not generalize one DB for everything just because it's convenient.

3. Domain-driven decomposition

Design boundaries by business domain; avoid breaking on implementation details. For each service: define its contract (API), data ownership (who is the source of truth), and expected load profile.

Avoid splintering into microservices merely because microservices are trendy. Microservices should serve ownership and scaling; otherwise they are bureaucracy.

III

The Treasury: Resource Management & Cost Discipline

Sutra: Extravagance is the first ruin of kings.

Resources are your treasury: CPU, memory, I/O, network, and cloud dollars. Treat them as finite and precious.

1. Capacity planning

Estimate QPS, request size, processing time, and database operations per request. Convert these into CPU, memory, and I/O needs, and include headroom (a safety margin, typically 2–3x for initial planning).
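
The conversion from load estimates to resource needs can be sketched as plain arithmetic. Everything below is an assumption for illustration: the LoadModel name, its fields, and the example figures (2000 QPS, 5 ms of CPU per request, 3 database operations, 4 KiB requests, 2x headroom).

```python
import math
from dataclasses import dataclass

@dataclass
class LoadModel:
    peak_qps: int             # requests per second at peak
    cpu_ms_per_request: int   # CPU time consumed by one request
    db_ops_per_request: int   # database operations issued by one request
    request_bytes: int        # average request size
    headroom: float = 2.0     # safety margin multiplier

    def cpu_cores(self) -> int:
        # One core supplies 1000 ms of CPU time per second.
        busy_cores = self.peak_qps * self.cpu_ms_per_request / 1000
        return math.ceil(busy_cores * self.headroom)

    def db_ops_per_second(self) -> int:
        return math.ceil(self.peak_qps * self.db_ops_per_request * self.headroom)

    def ingress_megabits_per_second(self) -> float:
        return self.peak_qps * self.request_bytes * 8 * self.headroom / 1_000_000

model = LoadModel(peak_qps=2000, cpu_ms_per_request=5, db_ops_per_request=3, request_bytes=4096)
print(model.cpu_cores(), model.db_ops_per_second(), model.ingress_megabits_per_second())
```

This ignores memory, queueing effects, and uneven load across instances; it is a first-pass estimate to be corrected by load tests.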

2. Cost as a first-class constraint

Design caching, batching, and asynchronous patterns to reduce resource spend. Use cost-aware autoscaling, and never let a service autoscale without an upper bound.

3. Throttling and quotas

Rate limit at edge (API gateway) and per-tenant. Provide graceful responses (429 + Retry-After) and well-documented quotas.
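
One common way to implement per-tenant limits with a 429 + Retry-After response is a token bucket. The sketch below is an assumption about how such a limiter might look; the class names, the rate and burst values, and the explicit `now` argument (used instead of a real clock so the behaviour is reproducible) are all illustrative.

```python
import math

class TokenBucket:
    """Allows `rate` requests per second on average, with bursts up to `burst`."""

    def __init__(self, rate, burst, now):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.updated = now

    def check(self, now):
        # Refill in proportion to elapsed time, never above the burst size.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return 200, {}
        # Tell the client how long until one full token is available.
        wait_seconds = (1.0 - self.tokens) / self.rate
        return 429, {"Retry-After": str(math.ceil(wait_seconds))}

class TenantLimiter:
    """Keeps one bucket per tenant so a noisy tenant cannot starve the others."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.buckets = {}

    def check(self, tenant_id, now):
        if tenant_id not in self.buckets:
            self.buckets[tenant_id] = TokenBucket(self.rate, self.burst, now)
        return self.buckets[tenant_id].check(now)

limiter = TenantLimiter(rate=1.0, burst=2)
responses = [limiter.check("tenant-a", now=0.0) for _ in range(3)]
print(responses)
```

The third request in the same instant is rejected with a Retry-After hint, while other tenants are unaffected.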

Checklist (Resource Discipline):

Load model & cost projection documented
Rate limits and quotas defined
Autoscaling policy and bounds configured

IV

The Spies: Observability and Intelligence

Sutra: Spies are the eyes of the king; telemetry is the mind of the system.

A blind ruler reacts slowly. Observability gives you early warning and the ability to reason after incidents.

1. Instrument everything from day one

Metrics (counters, gauges, histograms). Focus on p50, p95, p99 latencies.
Structured logs with trace and request identifiers.
Distributed tracing to connect flows across services.
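
As a small illustration of structured logs that carry trace and request identifiers, the sketch below emits one JSON object per line using only the standard library. The field names (ts, msg, trace_id, request_id) and the helper name are assumptions, not a standard.

```python
import json
import time
import uuid

def log_event(message, trace_id, request_id, **fields):
    """Emit one JSON log line; identifiers let logs be joined with traces later."""
    record = {
        "ts": time.time(),
        "msg": message,
        "trace_id": trace_id,
        "request_id": request_id,
        **fields,
    }
    line = json.dumps(record, sort_keys=True)
    print(line)
    return line

# The trace id is shared by every service handling the flow;
# the request id is unique to this hop.
trace_id = uuid.uuid4().hex
line = log_event("order accepted", trace_id, uuid.uuid4().hex, status=200, latency_ms=42)
```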

2. Design meaningful dashboards and alerts

Dashboards for business KPIs, system health (latency, errors, saturation), and infra. Alerts tuned to actionability; paging only when human intervention is needed. Use alert severity and runbooks.

3. Use SLOs & error budgets to govern change

Create SLOs: latency & availability targets per service. Use error budget consumption as a governance mechanism for releases and experiments.

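Example: SLO = 99.95% of requests succeed over a rolling 30-day window; error budget = 0.05% of that window, about 21.6 minutes of full downtime per month. (A budget of roughly 4.38 minutes per month corresponds to a 99.99% target.)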
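
The arithmetic that turns an availability target into an error budget can be sketched as follows. The 30-day window and the function names are assumptions; a time-based budget (minutes of full downtime) is used here for simplicity, though a request-based budget works the same way.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Minutes of total downtime allowed by an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100

def budget_remaining(slo_percent, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative means overspent)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return (budget - downtime_minutes) / budget

for target in (99.9, 99.95, 99.99):
    print(target, round(error_budget_minutes(target), 2))
```

A release policy might, for instance, freeze risky changes once budget_remaining drops below some agreed fraction; the threshold itself is a governance decision.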

V

The Ministers: Ownership, Teams & Governance

Sutra: A throne without loyal ministers is bare; a system without ownership rots.

Design for teams as much as you design for code.

1. Team boundaries mirror service boundaries

Small, cross-functional teams own services end-to-end: code, infra, runbooks, SLOs.

2. Clear on-call responsibilities

No black boxes. On-call must know how to run, debug, and rollback.

3. Code review and architecture review

Change control with lightweight guardrails: design docs for non-trivial changes, sanity tests, and staged rollout.

Avoid central platforms that try to solve everything and then block teams. A platform should empower, not enslave.

VI

The Borders: API Design and Contracts

Sutra: Promises bind the weak; contracts bind the strong.

APIs are borders between polities. Make them stable, versioned, and backward-compatible.

1. Contract-first API design

Define API schemas (OpenAPI/Protobuf) and generate client/server stubs. Mock the API and integrate early.

2. Versioning and evolution

Use additive changes; deprecate before removal. Prefer feature flags and negotiation over hard-breaking upgrades.

3. Defensive design

Timeouts, retries with exponential backoff and jitter, idempotency tokens for safe retries.

Idempotent writes can be retried as they are. For non-idempotent writes, require a client-supplied idempotency key and deduplicate on the server.
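
Server-side deduplication with idempotency keys can be sketched as below. The PaymentService class, its in-memory dictionary, and the response shape are hypothetical; a real service would keep the key-to-result mapping in durable storage and expire old keys.

```python
class PaymentService:
    """Hypothetical service: a retried charge with the same key is applied once."""

    def __init__(self):
        self._results = {}   # idempotency key -> stored response
        self.charges = []    # side effects actually performed

    def charge(self, idempotency_key, account, amount_cents):
        # A repeated key means a retry: replay the stored response, do no new work.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges.append((account, amount_cents))
        result = {"charge_id": len(self.charges), "status": "ok"}
        self._results[idempotency_key] = result
        return result

service = PaymentService()
first = service.charge("key-123", "acct-1", 500)
retry_result = service.charge("key-123", "acct-1", 500)  # client retried after a timeout
print(first == retry_result, len(service.charges))
```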

VII

The Army: Scaling & Data Sharding

Sutra: Armies that cannot coordinate lose battles; data that cannot be partitioned chokes systems.

1. Horizontal scaling first

Design services to be stateless where possible; stateful components require partitioning.

2. Sharding strategies

Range-based sharding (ordered keys): efficient for range queries, requires rebalancing.
Hash-based sharding: uniform distribution, harder to do range queries.
Hybrid: hash by customer, range by time, etc.

3. Rebalancing & consistent hashing

Use consistent hashing or virtual nodes for smoother shard movement. Plan for resharding operations and maintenance windows.
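
To make the idea concrete, here is a minimal consistent-hash ring with virtual nodes. The HashRing class, the choice of SHA-256, the 64 virtual nodes per shard, and the shard names are assumptions for the sketch, not a recommendation for any particular system.

```python
import bisect
import hashlib

def _hash(key):
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

class HashRing:
    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each node owns many small arcs, which evens out the load.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def lookup(self, key):
        if not self._ring:
            raise LookupError("ring is empty")
        # First virtual node at or after the key's position, wrapping around.
        index = bisect.bisect_left(self._ring, (_hash(key),))
        return self._ring[index % len(self._ring)][1]

ring = HashRing(["shard-a", "shard-b", "shard-c"])
keys = [f"tenant-{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}
ring.add("shard-d")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(len(moved), "of", len(keys), "keys moved")
```

Adding a shard moves only the keys that now belong to the new shard; every other key stays where it was, which is what keeps resharding cheap.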

Example: For a multi-tenant billing system, shard by tenant-id (hash), and within tenant use range-partitioned tables for time-series invoice data.

VIII

The Bargain: Caching & Consistency

Sutra: Trade-offs are the currency of rulers.

Caching is powerful but makes correctness harder. Consistency models are a contract you choose.

1. Caching patterns

Read-through cache: Cache sits in front of DB, fetch on cache-miss.
Write-through / write-behind: Synchronous vs asynchronous update patterns.

Cache invalidation is famously one of the hard problems in computing. Favor simple, conservative strategies: TTLs and explicit invalidation on write.
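
A read-through cache with a TTL and explicit invalidation on write might look like the sketch below. The ReadThroughCache class, the injected clock, and the dictionary standing in for the database are assumptions chosen so the behaviour is easy to follow.

```python
class ReadThroughCache:
    def __init__(self, load, ttl_seconds, clock):
        self._load = load          # function that reads from the backing store
        self._ttl = ttl_seconds
        self._clock = clock        # injected so time can be controlled
        self._entries = {}         # key -> (value, expires_at)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]        # fresh hit
        value = self._load(key)    # miss or expired: read through to the store
        self._entries[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key):
        # Call this on every write to the backing store.
        self._entries.pop(key, None)

store = {"user:1": "alice"}        # stands in for the database
loads = []                          # records every read that reached the store
current_time = [0.0]

def load_from_store(key):
    loads.append(key)
    return store[key]

cache = ReadThroughCache(load_from_store, ttl_seconds=60, clock=lambda: current_time[0])
print(cache.get("user:1"), cache.get("user:1"), len(loads))
```

The TTL bounds how stale a value can become if an invalidation is ever missed; the explicit invalidation keeps the common case fresh.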

2. Consistency choices

Strong consistency (linearizability) when correctness matters (banking).
Eventual consistency where availability and latency matter (social feed).
Use causal or session consistency when user-experience requires ordering guarantees.

3. Transactions & Idempotency

Use distributed transactions sparingly. Prefer saga patterns for cross-service workflows: orchestrator or choreography with event compensation.
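
An orchestrated saga with compensation can be sketched in a few lines. The run_saga function, the step names, and the simulated card failure are hypothetical; a production orchestrator would also persist its progress so it can resume after a crash.

```python
def run_saga(steps):
    """Run (name, action, compensate) steps in order.

    If a step fails, undo the completed steps in reverse order.
    Returns (succeeded, log).
    """
    completed = []
    log = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            return False, log
        log.append(f"done:{name}")
        completed.append((name, compensate))
    return True, log

events = []

def fail_payment():
    raise RuntimeError("card declined")  # simulated downstream failure

steps = [
    ("reserve_stock", lambda: events.append("stock reserved"), lambda: events.append("stock released")),
    ("charge_card", fail_payment, lambda: events.append("charge refunded")),
    ("ship_order", lambda: events.append("shipped"), lambda: events.append("shipment cancelled")),
]
ok, log = run_saga(steps)
print(ok, log)
```

The failed step itself is not compensated, since it never took effect; only the steps that completed before it are undone.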

Practical Example: Payment processing—require strong consistency. Use two-phase commit only if unavoidable; prefer single-service transaction and event-driven updates for downstream processing.

IX

The Siege: Faults, Circuit Breakers and Backpressure

Sutra: When the enemy battering-ram approaches, do not flood the gate with defenders; funnel and control the flow.

1. Expect failures — design for them

Implement circuit breakers to avoid cascading failures.
Use bulkheads to isolate failure domains.
Apply backpressure at ingress: reject, queue, or shed load.

2. Retry and idempotency

Retry with exponential backoff and jitter.
Ensure operations are idempotent when retried; if not, employ deduplication.
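
Retrying with exponential backoff and jitter can be sketched as follows. The TransientError class, the "full jitter" variant (a random delay between zero and the capped exponential), and the default values are assumptions; the sleep and random sources are injectable so the behaviour can be checked without waiting.

```python
import random
import time

class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""

def retry(operation, retries=3, base=0.1, cap=2.0, sleep=time.sleep, rng=random.random):
    """Call operation(); on TransientError wait, then try again, up to `retries` times."""
    attempt = 0
    while True:
        try:
            return operation()
        except TransientError:
            if attempt >= retries:
                raise  # budget exhausted: surface the failure to the caller
            # Full jitter: random delay in [0, min(cap, base * 2**attempt)).
            sleep(rng() * min(cap, base * 2 ** attempt))
            attempt += 1

calls = []

def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise TransientError("try again")
    return "ok"

print(retry(flaky, sleep=lambda seconds: None))
```

Only errors known to be transient are retried here; retrying every exception would hide real bugs and add load during an outage.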

3. Graceful degradation

Serve reduced functionality instead of full failure: degrade recommendation features under load, keep payments working.

Circuit breaker trips if error rate > X% and number of requests > Y in window Z.
Retry up to N times with backoff base B and jitter J.
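
The trip rule above (error rate above X% with more than Y requests in window Z) can be sketched as a small state machine. The CircuitBreaker class, the cooldown parameter, and the simplified half-open behaviour (all calls are allowed once the cooldown has passed, and the first recorded result decides) are assumptions for illustration.

```python
from collections import deque

class CircuitBreaker:
    def __init__(self, error_rate_threshold, min_requests, window_seconds, cooldown_seconds):
        self.threshold = error_rate_threshold  # X, as a fraction (0.5 = 50%)
        self.min_requests = min_requests       # Y
        self.window = window_seconds           # Z
        self.cooldown = cooldown_seconds
        self._events = deque()                 # (timestamp, ok) inside the window
        self._opened_at = None                 # None means closed

    def allow(self, now):
        if self._opened_at is None:
            return True
        # After the cooldown, let probe traffic through (half-open).
        return now - self._opened_at >= self.cooldown

    def record(self, now, ok):
        if self._opened_at is not None:
            if ok:
                self._opened_at = None         # probe succeeded: close again
                self._events.clear()
            else:
                self._opened_at = now          # probe failed: restart the cooldown
            return
        self._events.append((now, ok))
        while self._events and self._events[0][0] <= now - self.window:
            self._events.popleft()             # drop events older than the window
        total = len(self._events)
        errors = sum(1 for _, good in self._events if not good)
        if total > self.min_requests and errors / total > self.threshold:
            self._opened_at = now              # trip

breaker = CircuitBreaker(error_rate_threshold=0.5, min_requests=4,
                         window_seconds=10, cooldown_seconds=30)
for second, ok in enumerate([True, False, False, False, False]):
    breaker.record(now=float(second), ok=ok)
print(breaker.allow(now=5.0), breaker.allow(now=34.0))
```

The minimum request count keeps the breaker from tripping on a single failure during quiet periods.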

X

The Archivist: Data Retention, Backups & Recovery

Sutra: A king without records forgets his lineage; a system without backups forgets its state.

1. Backup strategy

RPO & RTO determine backup cadence.
For databases: point-in-time recovery, incremental backups, tested restores.
For object storage: replication across regions + lifecycle policies.

2. Test restores regularly

Restore process must be part of CI/CD; test in isolated infra.

3. Migration and schema evolution

Use backward-compatible schema changes (add optional columns, create new tables). Online migration strategies: dual-write + backfill, blue-green DB migrations.

XI

The Law: Security & Least Privilege

Sutra: An open gate invites invasion; secrecy and limits protect the realm.

1. Principle of least privilege

IAM roles per service with minimal permissions. Short-lived credentials and mutual TLS for service-to-service auth.

2. Threat modeling

Threat model early. Build the list of assets, threats and mitigations. Automate security controls: dependency scanning, secret scanning, SAST/DAST pipelines.

3. Data protection

Encrypt at rest and in transit. Mask PII and enforce access policies; audit all access.

Maintain and rehearse an incident response plan and a public communication template.

XII

The Market: Deployment Strategy & CI/CD

Sutra: A wise king does not burn the city with each celebration.

1. Deployment patterns

Canary releases: route small % of traffic to new version; monitor errors and latency.
Blue-green: instant rollback by switching routing.
Feature flags: decouple deploy from release.
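
Both canary routing and percentage-based feature flags need a stable way to decide who is in the small group. One common approach, sketched here under assumed names, hashes the user and feature into a bucket from 0 to 99; the in_rollout function, the SHA-256 choice, and the feature name are hypothetical.

```python
import hashlib

def in_rollout(user_id, feature, percent):
    """Deterministically place a user in or out of a percentage rollout.

    The same user always lands in the same bucket for a given feature,
    so raising the percentage only ever adds users.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent

users = [f"user-{i}" for i in range(10_000)]
canary = [u for u in users if in_rollout(u, "new-checkout", 10)]
print(len(canary), "of", len(users), "users see the new version")
```

Including the feature name in the hash keeps the same users from being the test subjects of every rollout.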

2. Automation and reproducibility

Infrastructure as code.
Immutable images/containers.
Declarative manifests and scripted rollbacks.

3. Release governance

Gate releases by SLO health, smoke tests, and canary stability. Automate post-deploy verification.

XIII

The Trial by Fire: Chaos & Resilience Testing

Sutra: As soon as the fear approaches near, attack and destroy it.

Chaos engineering is you bringing fear close, so you learn to destroy it on your terms.

1. Start small

Fault-injection on non-critical paths. Inject latency, drop packets, kill instances.

2. Runbook-driven experiments

Predefine hypothesis, metrics, blast radius, and rollback plan. Use chaos results to improve runbooks, timeouts, and redundancy.

XIV

The Heirs: Documentation & Succession

Sutra: If knowledge dies with the king, the realm enters anarchy.

1. Living docs

System overviews, data flows, API contracts — single source of truth (e.g., an internal docs portal). Runbooks for incidents & postmortems.

2. Onboarding as part of code quality

A new engineer should be able to ship a small, safe change within 1–2 weeks. If not, the system is too complex.

XV

The Last Counsel: Governance, Cost & Ethics

Sutra: Not knowledge alone, nor force alone; with both the king endures.

Governance: balance autonomy with platform conventions. Enforce standards without becoming a blockade.

Cost: attribute spend to each service; teams own their cost and its optimization.

Ethics: guard user data. Trade-offs made for speed must be weighed against their privacy cost.

Closing Campaign Orders — A Practical Runbook

Define success — pick 3 metrics (latency p95, availability %, cost per million).
Draw boundaries — one page diagram of services & data ownership.
Choose stores — map each data domain to DB type and partition key.
Instrument — wire metrics, traces, logs, and a dashboard.
Set SLOs — error budget & release policy.
Automate deployment — CI/CD, canaries, feature flags.
Practice failure — scheduled chaos, restore drills, postmortems.
Document — publish runbooks and onboarding paths.
Govern — retrospective on cost & risk every quarter.
Repeat — architecture evolves; revisit the three questions every sprint.

Build with the humility of the defeated and the strategy of the conqueror. A great system is not one that dazzles at launch — it is the one that stands silent through ten winters of traffic, betrayal, and greed. Design not for glory, but for surviving the day the unknown arrives.