NĪTI
Chanakya's Manual for Designing Resilient Systems

System Nīti

Build so that your system survives the crisis you cannot imagine. A field manual for constructing systems that endure.

Chapters: 16 Sutras
Origin: Arthashastra
Purpose: Resilience
Preface

You who would be ruler of systems, listen first to the politics of life.

In courts, a whisper kills kingdoms; in markets, a slow leak bankrupts treasuries. The state that forgets its walls falls; the software that forgets its contracts collapses.

Systems are small republics — they have ministers (processes), spies (telemetry), treasuries (resources), borders (APIs), traitors (bugs) and heirs (maintenance teams). Design your system as a wise king designs a realm: with foresight, with ruthless priorities, and with the humility to expect betrayal.

If you read no further than this opening, remember this like a campaign order: build so that your system survives the crisis you cannot imagine.

01 / Requirements

A throne without cause is empty — a system without clear intent is fragile.

Three Questions. Non-negotiables. Quantify everything.

  • Why are we building this? Business need, user need, metric to change.
  • What will success look like? SLOs, throughput, latency, adoption targets.
  • What is the failure mode we must survive? Data loss, extended downtime, security breach.

Write down the non-negotiables. Not "nice-to-haves." Not "maybe-laters." The list that follows these three questions becomes your acceptance criteria and boundary for trade-offs.

Quantify everything. "Fast" becomes p95 < 100ms; "reliable" becomes availability 99.95%; "cheap" becomes $X per 1M requests. Numbers make politics into engineering.

  • Business intent documented in one paragraph
  • Primary users & flows mapped and validated
  • Success metrics (SLOs) defined with error budget
  • Critical failure modes enumerated and prioritized
02 / Architecture

Build the fort before you invite the market.

Architecture is not aesthetic. It is survival planning.

Your architecture must answer: how will the system behave under normal load, under burst, under network partitions, under disk corruption, under long-term neglect?

  • Control plane: orchestration, config, leader election
  • Data plane: actual request handling, storage
  • Infrastructure plane: networking, infra-as-code, CI/CD
  • Observation plane: metrics, logs, traces, alerts

Choose your data model early. OLTP vs OLAP vs event log. Map each domain entity to the most appropriate store. Do not generalize one DB for everything just because it's convenient.

Design boundaries by business domain; avoid splitting along implementation details. For each service: define its contract (API), data ownership (who is the source of truth), and expected load profile.

Avoid splintering into microservices just because microservices are trendy. Microservices should serve ownership and scaling; otherwise they are bureaucracy.

03 / Resources

Extravagance is the first ruin of kings.

Resources are your treasury. Treat them as finite and precious.

Capacity planning. Estimate QPS, request size, processing time, database ops per request. Convert to CPU/memory/IO needs; include headroom (safety margin, typically 2–3x for initial planning).
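The conversion from load to hardware can be sketched as plain arithmetic. This is a minimal sketch, not a sizing tool: the function names and the sample figures are hypothetical, and it assumes CPU-bound work with a steady request rate.

```python
import math


def required_cores(qps: float, cpu_seconds_per_request: float,
                   headroom: float = 2.0) -> int:
    """Cores needed to serve `qps`, padded by a headroom multiplier."""
    if qps < 0 or cpu_seconds_per_request < 0 or headroom < 1:
        raise ValueError("qps and cpu time must be >= 0, headroom >= 1")
    # At steady state, qps * cpu-seconds-per-request cores are fully busy.
    busy_cores = qps * cpu_seconds_per_request
    # round() guards against float noise before rounding up to whole cores.
    return math.ceil(round(busy_cores * headroom, 6))


def db_ops_per_second(qps: float, db_ops_per_request: float) -> float:
    """Database operations per second implied by the request rate."""
    return qps * db_ops_per_request


# Hypothetical load model: 2000 QPS, 5 ms of CPU and 3 DB ops per request.
cores = required_cores(qps=2000, cpu_seconds_per_request=0.005, headroom=2.5)
db_load = db_ops_per_second(qps=2000, db_ops_per_request=3)
```

The same shape works for memory and IO: multiply the per-request cost by the rate, then apply the headroom once at the end so the margin stays visible in the plan.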

Cost as a first-class constraint. Design caching, batching, and asynchronous patterns to reduce resource spend. Use cost-aware autoscaling; never autoscale without explicit upper bounds.

Throttling and quotas. Rate limit at edge (API gateway) and per-tenant. Provide graceful responses (429 + Retry-After) and well-documented quotas.
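One common way to implement such a limit is a token bucket. The sketch below is illustrative only: `TokenBucket`, `handle`, and the per-tenant registry are hypothetical names, and a real gateway would keep this state in a shared store rather than in process memory.

```python
import math
import time
from collections import defaultdict


class TokenBucket:
    """Allows `rate_per_sec` requests on average, with bursts up to `burst`."""

    def __init__(self, rate_per_sec: float, burst: int, now=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self._now = now
        self.updated = now()

    def try_acquire(self):
        """Return (allowed, retry_after_seconds)."""
        now = self._now()
        # Refill according to elapsed time, never beyond the burst capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True, 0
        # Seconds until one whole token is available, rounded up.
        return False, math.ceil((1 - self.tokens) / self.rate)


def handle(bucket: TokenBucket):
    """Return (status, headers): 200, or 429 with a Retry-After header."""
    allowed, retry_after = bucket.try_acquire()
    if allowed:
        return 200, {}
    return 429, {"Retry-After": str(retry_after)}


# Per-tenant limits: one bucket per tenant id (assumed quota: 5 rps, burst 10).
tenant_buckets = defaultdict(lambda: TokenBucket(rate_per_sec=5, burst=10))
status, headers = handle(tenant_buckets["tenant-42"])
```

The burst parameter is the design choice worth documenting in the published quota: it decides how much short-term spikiness a tenant is allowed before the 429s begin.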

  • Load model & cost projection documented
  • Rate limits and quotas defined
  • Autoscaling policy and bounds configured
04 / Observability

Spies are the eyes of the king; telemetry is the mind of the system.

A blind ruler reacts slowly. Observability gives you early warning.

Instrument everything from day one. Metrics (counters, gauges, histograms). Focus on p50, p95, p99 latencies. Structured logs with trace and request identifiers. Distributed tracing to connect flows across services.

Design meaningful dashboards and alerts. Dashboards for business KPIs, system health (latency, errors, saturation), and infra. Alerts tuned to actionability; paging only when human intervention is needed.

Use SLOs & error budgets to govern change. Create SLOs: latency & availability targets per service. Use error budget consumption as a governance mechanism for releases and experiments.

Example: SLO = 99.95% successful requests over a 30-day window; error budget = 0.05%, equivalent to about 21.6 minutes of full outage per month.
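The arithmetic behind a time-based error budget can be sketched as follows. The function names and the release threshold are assumptions for illustration; the sketch treats the budget as minutes of total unavailability in a fixed window.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full outage the SLO tolerates within the window."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1 - slo) * window_minutes


def budget_remaining(slo: float, bad_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the budget still unspent (negative once overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - bad_minutes) / budget


def release_allowed(slo: float, bad_minutes: float,
                    min_remaining: float = 0.25) -> bool:
    """Hypothetical policy: allow releases while 25% of the budget is left."""
    return budget_remaining(slo, bad_minutes) >= min_remaining


budget = error_budget_minutes(0.9995)  # 30-day window by default
```

Tying `release_allowed` to the deploy pipeline is one way to make the budget a governance mechanism rather than a dashboard number.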

05 / Ownership

A throne without loyal ministers is bare; a system without ownership rots.

Design for teams as much as you design for code.

Team boundaries mirror service boundaries. Small, cross-functional teams own services end-to-end: code, infra, runbooks, SLOs.

Clear on-call responsibilities. No black boxes. On-call must know how to run, debug, and roll back.

Code review and architecture review. Change control with lightweight guardrails: design docs for non-trivial changes, sanity tests, and staged rollout.

Avoid central platforms that solve everything and then block teams. A platform should empower, not enslave.

06 / APIs

Promises bind the weak; contracts bind the strong.

APIs are borders between polities. Make them stable, versioned, and backward-compatible.

Contract-first API design. Define API schemas (OpenAPI/Protobuf) and generate client/server stubs. Mock the API and integrate early.

Versioning and evolution. Use additive changes; deprecate before removal. Prefer feature flags and negotiation over hard-breaking upgrades.

Defensive design. Timeouts, retries with exponential backoff and jitter, idempotency tokens for safe retries.

Naturally idempotent writes (for example, a PUT of a full resource) can be retried as-is; for non-idempotent writes, require a client-supplied idempotency key and deduplicate on the server.
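Server-side deduplication by idempotency key can be sketched like this. It is a minimal in-memory illustration: `IdempotentHandler` is a hypothetical name, and a production version would persist keys with an expiry in a durable store.

```python
import threading


class IdempotentHandler:
    """Runs an operation at most once per idempotency key."""

    def __init__(self, operation):
        self._operation = operation
        self._results = {}          # idempotency key -> stored result
        self._lock = threading.Lock()

    def handle(self, idempotency_key: str, payload):
        # Holding the lock while the operation runs keeps the sketch simple:
        # a concurrent retry with the same key waits, then sees the result.
        with self._lock:
            if idempotency_key in self._results:
                return self._results[idempotency_key]
            result = self._operation(payload)
            self._results[idempotency_key] = result
            return result
```

A client that times out and retries with the same key receives the stored result instead of triggering the write a second time.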

07 / Scaling

Armies that cannot coordinate lose battles; data that cannot be partitioned chokes systems.

Horizontal scaling first. Design services to be stateless where possible.

  • Range-based sharding (ordered keys): efficient for range queries, requires rebalancing.
  • Hash-based sharding: uniform distribution, harder to do range queries.
  • Hybrid: hash by customer, range by time, etc.

Use consistent hashing or virtual nodes for smoother shard movement. Plan for resharding operations and maintenance windows.
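A consistent-hash ring with virtual nodes can be sketched in a few lines. `HashRing` and its parameters are hypothetical names; the sketch uses SHA-256 for placement and ignores the astronomically unlikely case of two virtual nodes landing on the same position.

```python
import bisect
import hashlib


def _position(value: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes=(), vnodes: int = 100):
        self.vnodes = vnodes
        self._points = []   # sorted ring positions
        self._owner = {}    # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Each node is placed at `vnodes` positions to even out its share.
        for i in range(self.vnodes):
            point = _position(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key: str) -> str:
        if not self._points:
            raise LookupError("ring is empty")
        # First position clockwise from the key, wrapping at the end.
        i = bisect.bisect_right(self._points, _position(key))
        return self._owner[self._points[i % len(self._points)]]
```

When a node joins, only the keys that now fall just before one of its virtual nodes move, and they all move to the new node; every other key keeps its shard.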

Example: For a multi-tenant billing system, shard by tenant-id (hash), and within tenant use range-partitioned tables for time-series invoice data.

08 / Caching

Trade-offs are the currency of rulers.

Caching is powerful but makes correctness harder. Consistency models are a contract you choose.

Read-through cache: the cache sits in front of the DB and loads from it on a miss. Write-through updates the cache and the DB synchronously; write-behind acknowledges the write and persists it asynchronously.

Cache invalidation is one of the notoriously hard problems in distributed systems. Favor simple, conservative strategies: TTLs, explicit invalidation on write.
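The two conservative strategies, a TTL plus explicit invalidation on write, can be combined in a small read-through cache. This is a single-process sketch with hypothetical names (`ReadThroughCache`, `write`); the dict standing in for the database is an assumption for illustration.

```python
import time


class ReadThroughCache:
    def __init__(self, loader, ttl_seconds: float, now=time.monotonic):
        self._loader = loader       # called on a miss to read the source of truth
        self._ttl = ttl_seconds
        self._now = now
        self._entries = {}          # key -> (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > self._now():
            return entry[0]         # fresh hit
        value = self._loader(key)   # miss or expired: read through
        self._entries[key] = (value, self._now() + self._ttl)
        return value

    def invalidate(self, key) -> None:
        self._entries.pop(key, None)


def write(db: dict, cache: ReadThroughCache, key, value) -> None:
    """Write to the source of truth first, then drop the cached copy."""
    db[key] = value
    cache.invalidate(key)
```

The TTL bounds how stale an entry can get if an invalidation is ever missed, which is why the two strategies are usually paired rather than chosen between.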

  • Strong consistency (linearizability) when correctness matters (banking).
  • Eventual consistency where availability and latency matter (social feed).
  • Causal or session consistency when user-experience requires ordering guarantees.

Use distributed transactions sparingly. Prefer saga patterns for cross-service workflows: orchestration or choreography, with compensating actions to undo completed steps.
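An orchestrated saga can be sketched as a list of (action, compensation) pairs. `run_saga` is a hypothetical helper; the sketch assumes compensations themselves succeed, whereas a real orchestrator would persist progress and retry failed compensations.

```python
from typing import Callable, List, Tuple

# Each step pairs a forward action with the compensation that undoes it.
Step = Tuple[Callable[[dict], None], Callable[[dict], None]]


def run_saga(steps: List[Step], ctx: dict) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed = []
    for action, compensate in steps:
        try:
            action(ctx)
        except Exception:
            # The failed step did not complete, so only earlier steps are undone.
            for undo in reversed(completed):
                undo(ctx)
            return False
        completed.append(compensate)
    return True
```

For an order workflow the steps might be reserve stock, charge the card, and book shipping; if shipping fails, the charge is refunded first and the stock released last.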

Practical Example: Payment processing—require strong consistency. Use two-phase commit only if unavoidable; prefer single-service transaction and event-driven updates for downstream processing.

09 / Resilience

When the enemy battering-ram approaches, do not flood the gate with defenders; funnel and control the flow.

Expect failures — design for them.

  • Implement circuit breakers to avoid cascading failures.
  • Use bulkheads to isolate failure domains.
  • Apply backpressure at ingress: reject, queue, or shed load.

Retry with exponential backoff and jitter. Ensure operations are idempotent when retried; if not, employ deduplication.
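A retry helper with exponential backoff and full jitter might look like the sketch below. The function name, the default limits, and the choice of which exceptions count as transient are all assumptions; the operation passed in must be safe to repeat.

```python
import random
import time


def retry(operation, max_attempts: int = 5, base: float = 0.1,
          cap: float = 5.0, retry_on=(ConnectionError, TimeoutError),
          sleep=time.sleep, rng=random.random):
    """Call `operation`, retrying transient errors with backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts - 1:
                raise   # out of attempts: surface the last error
            # Exponential ceiling (base * 2^attempt, capped), then full
            # jitter: sleep a random fraction of that ceiling.
            ceiling = min(cap, base * 2 ** attempt)
            sleep(rng() * ceiling)
```

Jitter matters because clients that fail together otherwise retry together, and the synchronized wave can knock over a service that was just recovering.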

Serve reduced functionality instead of full failure: degrade recommendation features under load, keep payments working.

Circuit breaker trips if error rate > X% and number of requests > Y in window Z. Retry up to N times with backoff base B and jitter J.
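The trip rule above (error rate > X% and more than Y requests in window Z) can be sketched as a small class. `CircuitBreaker`, `CircuitOpenError`, and the cooldown with a single half-open trial call are illustrative assumptions, and the sketch is not thread-safe.

```python
import time
from collections import deque


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected."""


class CircuitBreaker:
    def __init__(self, error_rate: float = 0.5, min_requests: int = 20,
                 window_s: float = 30.0, cooldown_s: float = 10.0,
                 now=time.monotonic):
        self.error_rate = error_rate      # X, as a fraction
        self.min_requests = min_requests  # Y
        self.window_s = window_s          # Z
        self.cooldown_s = cooldown_s
        self._now = now
        self._events = deque()            # (timestamp, succeeded)
        self._opened_at = None

    def call(self, fn):
        now = self._now()
        if self._opened_at is not None and now - self._opened_at < self.cooldown_s:
            raise CircuitOpenError("circuit open; failing fast")
        # Closed, or cooldown elapsed (half-open): let the call through.
        try:
            result = fn()
        except Exception:
            self._record(now, False)
            raise
        self._record(now, True)
        return result

    def _record(self, now: float, ok: bool) -> None:
        if self._opened_at is not None:
            # Half-open trial: success closes the circuit, failure re-opens it.
            if ok:
                self._opened_at = None
                self._events.clear()
            else:
                self._opened_at = now
            return
        self._events.append((now, ok))
        while self._events and self._events[0][0] <= now - self.window_s:
            self._events.popleft()        # drop events older than the window
        total = len(self._events)
        failures = sum(1 for _, good in self._events if not good)
        if total > self.min_requests and failures / total > self.error_rate:
            self._opened_at = now
```

The minimum request count keeps a single failure during a quiet period from tripping the breaker, which is the reason the rule has two conditions rather than one.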

10 / Backups

A king without records forgets his lineage; a system without backups forgets its state.

RPO & RTO determine backup cadence.

  • For databases: point-in-time recovery, incremental backups, tested restores.
  • For object storage: replication across regions + lifecycle policies.

Test restores regularly. Make the restore process part of CI/CD and run it in isolated infrastructure.

Use backward-compatible schema changes (add optional columns, create new tables). Online migration strategies: dual-write + backfill, blue-green DB migrations.

11 / Security

An open gate invites invasion; secrecy and limits protect the realm.

Principle of least privilege. Threat model early.

IAM roles per service with minimal permissions. Short-lived credentials and mutual TLS for service-to-service auth.

Threat model early. Build the list of assets, threats and mitigations. Automate security controls: dependency scanning, secret scanning, SAST/DAST pipelines.

Encrypt at rest and in transit. Mask PII and enforce access policies; audit all access. Maintain and rehearse an incident response plan and a public communication template.

12 / Deployment

A wise king does not burn the city with each celebration.

Canary releases. Blue-green. Feature flags.

  • Canary releases: route small % of traffic to new version; monitor errors and latency.
  • Blue-green: instant rollback by switching routing.
  • Feature flags: decouple deploy from release.
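Both canary routing and percentage-based feature flags reduce to the same question: is this unit inside the rollout? A deterministic hash gives a stable answer. The sketch below uses hypothetical names (`in_rollout`, `route`) and assumes a string id per user or request.

```python
import hashlib


def in_rollout(unit_id: str, feature: str, percent: float) -> bool:
    """True if `unit_id` falls inside the first `percent` of hash buckets."""
    digest = hashlib.sha256(f"{feature}:{unit_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000   # 0..9999
    # Each bucket is 0.01% wide, so percent * 100 buckets are enabled.
    return bucket < percent * 100


def route(request_id: str, canary_percent: float) -> str:
    """Send a stable slice of traffic to the canary, the rest to stable."""
    return "canary" if in_rollout(request_id, "canary", canary_percent) else "stable"
```

Because the bucket depends only on the feature name and the id, a user who is in the 5% slice stays in it at 20%, so widening a rollout never flips anyone back to the old version.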

Infrastructure as code. Immutable images/containers. Declarative manifests and scripted rollbacks.

Gate releases by SLO health, smoke tests, and canary stability. Automate post-deploy verification.

13 / Chaos

As soon as the fear approaches near, attack and destroy it.

Chaos engineering is you bringing fear close, so you learn to destroy it on your terms.

Start small. Fault-injection on non-critical paths. Inject latency, drop packets, kill instances.

Runbook-driven experiments. Predefine hypothesis, metrics, blast radius, and rollback plan. Use chaos results to improve runbooks, timeouts, and redundancy.
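A small fault-injection wrapper is enough to start on a non-critical path. This is a sketch under stated assumptions: `with_faults` is a hypothetical helper, the injected error type is arbitrary, and blast radius is limited only by which function you choose to wrap.

```python
import random
import time


def with_faults(fn, latency_s: float = 0.0, failure_rate: float = 0.0,
                rng=random.random, sleep=time.sleep):
    """Wrap `fn` so calls are delayed and fail at the configured rate."""
    def wrapped(*args, **kwargs):
        if latency_s > 0:
            sleep(latency_s)              # injected latency
        if rng() < failure_rate:
            raise ConnectionError("injected fault")
        return fn(*args, **kwargs)
    return wrapped
```

Setting both parameters back to zero is the rollback plan: the wrapper then passes every call through untouched.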

14 / Docs

If knowledge dies with the king, the realm enters anarchy.

Living docs. Onboarding as part of code quality.

System overviews, data flows, API contracts — single source of truth. Runbooks for incidents & postmortems.

A new engineer should be able to ship a small, safe change within 1–2 weeks. If not, the system is too complex.

15 / Governance

Not knowledge alone, nor force alone; with both the king endures.

Balance autonomy with platform conventions.

Governance: balance autonomy with platform conventions. Enforce standards without becoming a blockade.

Cost: attribute spend to each service; teams own their cost and its optimization.

Ethics: guard user data. Any trade-off made for speed must account for privacy.

Build with the humility of the defeated and the strategy of the conqueror. A great system is not one that dazzles at launch — it is the one that stands silent through ten winters of traffic, betrayal, and greed.

16 / The Decree

10 Steps to a Resilient Realm

Design not for glory, but for surviving the day the unknown arrives.

  1. Define success. Pick 3 metrics: latency p95, availability %, cost per million requests.
  2. Draw boundaries. One-page diagram of services & data ownership.
  3. Choose stores. Map each data domain to DB type and partition key.
  4. Instrument. Wire metrics, traces, logs, and a dashboard.
  5. Set SLOs. Error budget & release policy.
  6. Automate deployment. CI/CD, canaries, feature flags.
  7. Practice failure. Scheduled chaos, restore drills, postmortems.
  8. Document. Publish runbooks and onboarding paths.
  9. Govern. Retrospective on cost & risk every quarter.
  10. Repeat. Architecture evolves; revisit the three questions every sprint.
