A Field Manual · Chanakya, restated for engineers

System Nīti.

Build so that your system survives the crisis you cannot imagine. Fifteen sutras and one decree, for those who rule small republics of services, treasuries of data, and borders of APIs.


Requirements · Architecture · Resources · Observability · Ownership · APIs · Scaling · Caching · Resilience · Backups · Security · Deployment · Chaos · Documentation · Governance

You who would be ruler of systems, listen first to the politics of life.

In courts, a whisper kills kingdoms; in markets, a slow leak bankrupts treasuries. The state that forgets its walls falls; the software that forgets its contracts collapses.

Systems are small republics — they have ministers (processes), spies (telemetry), treasuries (resources), borders (APIs), traitors (bugs) and heirs (maintenance teams). Design your system as a wise king designs a realm: with foresight, with ruthless priorities, and with the humility to expect betrayal.

If you read no further than this opening, remember this like a campaign order: build so that your system survives the crisis you cannot imagine.

Sutra 01

A throne without cause is empty — a system without clear intent is fragile.

Three Questions. Non-negotiables. Quantify everything.

Why are we building this? Business need, user need, metric to change.
What will success look like? SLOs, throughput, latency, adoption targets.
What is the failure mode we must survive? Data loss, extended downtime, security breach.

Write down the non-negotiables. Not "nice-to-haves." Not "maybe-laters." The list that follows these three questions becomes your acceptance criteria and boundary for trade-offs.

Quantify everything. "Fast" becomes p95 < 100ms; "reliable" becomes availability 99.95%; "cheap" becomes $X per 1M requests. Numbers make politics into engineering.
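To make "fast becomes p95 < 100ms" concrete, here is a minimal sketch that computes a nearest-rank percentile over latency samples and checks it against a target. The function names and the sample data are illustrative assumptions, not part of any particular metrics library.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct% of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]


def meets_latency_target(samples_ms, pct=95, target_ms=100.0):
    """True when the chosen percentile is strictly below the target."""
    return percentile(samples_ms, pct) < target_ms


# Hypothetical request latencies in milliseconds.
latencies_ms = [12, 18, 25, 31, 40, 44, 52, 60, 75, 180]
print(percentile(latencies_ms, 95), meets_latency_target(latencies_ms))
```

A single slow request out of ten is enough to fail a p95 target here, which is the point of quantifying: the average of these samples looks healthy, the tail does not.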

Business intent documented in one paragraph
Primary users & flows mapped and validated
Success metrics (SLOs) defined with error budget
Critical failure modes enumerated and prioritized

Sutra 02

Build the fort before you invite the market.

Architecture is not aesthetic. It is survival planning.

Your architecture must answer: how will the system behave under normal load, under burst, under network partitions, under disk corruption, under long-term neglect?

Control plane — orchestration, config, leader election
Data plane — actual request handling, storage
Infrastructure plane — networking, infra-as-code, CI/CD
Observation plane — metrics, logs, traces, alerts

Choose your data model early. OLTP vs OLAP vs event log. Map each domain entity to the most appropriate store. Do not generalize one DB for everything just because it is convenient.

Design boundaries by business domain, not by implementation detail. For each service: define its contract (API), data ownership (source of truth), and expected load profile.

Avoid splintering into microservices because microservices are trendy. Microservices should serve ownership and scaling; otherwise they are bureaucracy.

Sutra 03

Extravagance is the first ruin of kings.

Resources are your treasury. Treat them as finite and precious.

Capacity planning. Estimate QPS, request size, processing time, database ops per request. Convert to CPU / memory / IO needs; include headroom (safety margin, typically 2–3× for initial planning).
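A first-pass capacity estimate can be a few lines of arithmetic. The sketch below assumes a simple model in which CPU demand is peak QPS multiplied by CPU time per request, then multiplied by a headroom factor; the `LoadModel` fields and the example numbers are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class LoadModel:
    peak_qps: float            # expected peak requests per second
    cpu_ms_per_request: float  # CPU time consumed by one request
    db_ops_per_request: float  # database operations issued per request
    headroom: float = 2.0      # safety margin multiplier


def required_cores(model):
    """Cores kept busy at peak, scaled by headroom and rounded up."""
    busy_cores = model.peak_qps * model.cpu_ms_per_request / 1000.0
    return math.ceil(busy_cores * model.headroom)


def required_db_ops_per_sec(model):
    """Database throughput to provision, including headroom."""
    return model.peak_qps * model.db_ops_per_request * model.headroom


model = LoadModel(peak_qps=2000, cpu_ms_per_request=5, db_ops_per_request=3)
print(required_cores(model), required_db_ops_per_sec(model))
```

This ignores memory, IO, and queueing effects near saturation, so treat the result as a starting point for load testing rather than a final number.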

Cost as a first-class constraint. Design caching, batching, and asynchronous patterns to reduce resource spend. Use cost-aware autoscaling with explicit minimum and maximum bounds; never autoscale without limits.

Throttling and quotas. Rate limit at edge (API gateway) and per-tenant. Provide graceful responses (429 + Retry-After) and well-documented quotas.
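One common way to implement per-tenant limits is a token bucket. The sketch below is an in-memory illustration only: `TokenBucket` and `TenantRateLimiter` are hypothetical names, time is passed in explicitly so the behavior is easy to test, and a real gateway would typically keep this state in shared storage.

```python
import math


class TokenBucket:
    """Refills at rate_per_sec up to capacity; each request costs one token."""

    def __init__(self, rate_per_sec, capacity, now=0.0):
        self.rate_per_sec = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.last_refill = now

    def try_acquire(self, now):
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_sec)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True, 0.0
        # Seconds until one full token is available again.
        return False, (1.0 - self.tokens) / self.rate_per_sec


class TenantRateLimiter:
    """Keeps one bucket per tenant and maps the outcome to an HTTP-style reply."""

    def __init__(self, rate_per_sec, burst):
        self._rate = rate_per_sec
        self._burst = burst
        self._buckets = {}

    def check(self, tenant_id, now):
        bucket = self._buckets.get(tenant_id)
        if bucket is None:
            bucket = self._buckets[tenant_id] = TokenBucket(self._rate, self._burst, now)
        allowed, wait_seconds = bucket.try_acquire(now)
        if allowed:
            return 200, {}
        # Retry-After carries whole seconds, so round the wait up.
        return 429, {"Retry-After": str(math.ceil(wait_seconds))}
```

The burst size bounds how far a tenant can exceed the steady rate, and the `Retry-After` value tells a well-behaved client when a retry can succeed.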

Load model & cost projection documented
Rate limits and quotas defined
Autoscaling policy and bounds configured

Sutra 04

Spies are the eyes of the king; telemetry is the mind of the system.

A blind ruler reacts slowly. Observability gives you early warning.

Instrument everything from day one. Metrics (counters, gauges, histograms). Focus on p50, p95, p99 latencies. Structured logs with trace and request identifiers. Distributed tracing to connect flows across services.

Design meaningful dashboards and alerts. Dashboards for business KPIs, system health (latency, errors, saturation), and infra. Alerts tuned to actionability; page only when human intervention is needed.

Use SLOs & error budgets to govern change. Create SLOs: latency & availability targets per service. Use error budget consumption as a governance mechanism for releases and experiments.

Example: SLO = 99.95% of requests succeed over a rolling 30-day window; the matching error budget is 0.05% of requests, or about 21.6 minutes of full downtime per 30-day month.
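The error budget arithmetic is worth checking by hand, because the budget shrinks quickly as nines are added: over a 30-day window, 99.95% allows about 21.6 minutes of full downtime, while 99.99% allows about 4.3 minutes. A small sketch, with hypothetical function names and a 30-day window as the stated assumption:

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of total unavailability the SLO permits in the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the request-based error budget still unspent.
    Negative values mean the budget is already exhausted."""
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures == 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return 1.0 - failed_requests / allowed_failures


print(round(error_budget_minutes(0.9995), 2))  # 30-day window
print(round(error_budget_minutes(0.9999), 2))
print(round(budget_remaining(0.9995, 1_000_000, 125), 2))
```

A release policy can then be expressed against `budget_remaining`, for example freezing risky launches once the remaining fraction falls below a threshold the team agrees on.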

Sutra 05

A throne without loyal ministers is bare; a system without ownership rots.

Design for teams as much as you design for code.

Team boundaries mirror service boundaries. Small, cross-functional teams own services end-to-end: code, infra, runbooks, SLOs.

Clear on-call responsibilities. No black boxes. On-call must know how to run, debug, and roll back.

Code review and architecture review. Change control with lightweight guardrails: design docs for non-trivial changes, sanity tests, and staged rollout.

Avoid central platforms that try to solve everything and then block teams. A platform should empower, not enslave.

Sutra 06

Promises bind the weak; contracts bind the strong.

APIs are borders between polities. Make them stable, versioned, and backward-compatible.

Contract-first API design. Define API schemas (OpenAPI / Protobuf) and generate client / server stubs. Mock the API and integrate early.

Versioning and evolution. Use additive changes; deprecate before removal. Prefer feature flags and negotiation over hard-breaking upgrades.

Defensive design. Timeouts, retries with exponential backoff and jitter, idempotency tokens for safe retries.

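Make retried writes safe with idempotency keys: the client sends a unique key per logical operation, and the server returns the stored result when it sees the same key again. Where clients cannot supply a key, deduplicate on the server side.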
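A minimal sketch of idempotency-key handling follows. `ChargeHandler` and its fields are hypothetical, and the dictionary stands in for what would need to be a durable store with an atomic check-and-insert in a real service.

```python
class ChargeHandler:
    """Applies each logical charge once, however often the request is retried."""

    def __init__(self):
        self._results = {}     # idempotency key -> stored response
        self.charges_made = 0  # side effects actually performed

    def charge(self, idempotency_key, amount_cents):
        # A repeated key means a retry: replay the stored response
        # instead of performing the side effect again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        result = {"charge_id": self.charges_made, "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


handler = ChargeHandler()
first = handler.charge("order-42-attempt", 500)
retry = handler.charge("order-42-attempt", 500)  # e.g. after a client timeout
print(first == retry, handler.charges_made)
```

The key identifies the operation, not the request, so a client should reuse the same key across retries and generate a new one only for a genuinely new operation.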

Sutra 07

Armies that cannot coordinate lose battles; data that cannot be partitioned chokes systems.

Horizontal scaling first. Design services to be stateless where possible.

Range-based sharding — ordered keys; efficient for range queries; requires rebalancing
Hash-based sharding — uniform distribution; harder to do range queries
Hybrid — hash by customer, range by time, and so on

Use consistent hashing or virtual nodes for smoother shard movement. Plan for resharding operations and maintenance windows.

Example: For a multi-tenant billing system, shard by tenant-id (hash); within a tenant, use range-partitioned tables for time-series invoice data.
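The billing example can be sketched with a consistent hash ring that uses virtual nodes for tenant placement and a month-based table name for the time range. The shard names, the `invoices_YYYY_MM` naming scheme, and the choice of 64 virtual nodes are assumptions made for illustration.

```python
import bisect
import hashlib
from datetime import date


def _hash(key):
    """Stable 64-bit hash (Python's built-in hash() varies between runs)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent hash ring; each shard owns several points (virtual nodes)."""

    def __init__(self, shards, vnodes=64):
        self._ring = sorted(
            (_hash(f"{shard}#{i}"), shard) for shard in shards for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def shard_for(self, tenant_id):
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, _hash(tenant_id)) % len(self._ring)
        return self._ring[idx][1]


def invoice_location(ring, tenant_id, issued_on):
    """Hash picks the shard; the invoice month picks the range partition."""
    return ring.shard_for(tenant_id), f"invoices_{issued_on:%Y_%m}"


ring = HashRing(["shard-a", "shard-b", "shard-c"])
print(invoice_location(ring, "tenant-1", date(2024, 3, 5)))
```

When a shard is added to the ring, a tenant either stays where it was or moves to the new shard, which is what keeps resharding incremental rather than a full reshuffle.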

Sutra 08

Trade-offs are the currency of rulers.

Caching is powerful but makes correctness harder. Consistency models are a contract you choose.

A read-through cache sits in front of the DB and fetches on a miss. Write-through updates the cache and the DB synchronously on every write; write-behind acknowledges the write from the cache and persists it to the DB asynchronously.

Cache invalidation is famously one of the hard problems in computing. Favor simple, conservative strategies: TTLs, explicit invalidation on write.
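The conservative strategy named above, a read-through cache with a TTL plus explicit invalidation on write, can be sketched in a few lines. `ReadThroughCache` is a hypothetical single-process illustration; the `loader` callable stands in for the database read, and the clock is injectable so expiry can be tested without sleeping.

```python
import time


class ReadThroughCache:
    """Serves from memory while an entry is fresh; otherwise loads and stores."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]
        value = self._loader(key)  # miss or expired: read the source of truth
        self._entries[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key):
        """Call after every write to the underlying store for this key."""
        self._entries.pop(key, None)
```

The TTL bounds how stale a value can get even if an invalidation is missed, which is why the two mechanisms are usually combined rather than chosen between.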

Strong consistency (linearizability) — when correctness matters (banking)
Eventual consistency — where availability and latency matter (social feed)
Causal or session consistency — when user experience requires ordering

Use distributed transactions sparingly. Prefer saga patterns for cross-service workflows: orchestrator or choreography with event compensation.
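An orchestrated saga can be reduced to its core idea: run steps in order and, if one fails, run the compensations of the completed steps in reverse. The sketch below is an in-process illustration with hypothetical step names; a production orchestrator would also persist progress so that it can resume after a crash.

```python
def run_saga(steps):
    """steps: list of (name, action, compensation) tuples.

    Returns (True, None) on success, or (False, failed_step_name) after
    compensating every step that had already completed, newest first.
    """
    completed = []
    for name, action, compensation in steps:
        try:
            action()
        except Exception:
            for _, undo in reversed(completed):
                undo()
            return False, name
        completed.append((name, compensation))
    return True, None


log = []


def decline_card():
    raise RuntimeError("card declined")


order_steps = [
    ("reserve_stock", lambda: log.append("reserve"), lambda: log.append("release")),
    ("charge_card", decline_card, lambda: log.append("refund")),
    ("ship", lambda: log.append("ship"), lambda: log.append("cancel_shipment")),
]
print(run_saga(order_steps), log)
```

Note that the failed step itself is not compensated, only the steps before it, so each action must either fully succeed or leave nothing behind.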

Practical: payments demand strong consistency. Use two-phase commit only if unavoidable; prefer a single-service transaction and event-driven updates for downstream processing.

Sutra 09

When the battering-ram approaches, do not flood the gate with defenders; funnel and control the flow.

Expect failures — design for them.

Implement circuit breakers to avoid cascading failures
Use bulkheads to isolate failure domains
Apply backpressure at ingress: reject, queue, or shed load

Retry with exponential backoff and jitter. Ensure operations are idempotent when retried; if not, employ deduplication.
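A retry loop with exponential backoff and jitter might look like the sketch below. It uses the "full jitter" variant, where each delay is drawn uniformly between zero and the capped exponential value; `TransientError` is a hypothetical marker for failures that are safe to retry, and `sleep` and `rng` are injectable so the loop can be tested without waiting.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 503s)."""


def retry(operation, max_retries, base_seconds, cap_seconds,
          sleep=time.sleep, rng=random.random):
    """Call operation(); on TransientError retry up to max_retries times.

    Delay before retry k (0-based) is rng() * min(cap, base * 2**k).
    Any other exception propagates immediately.
    """
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the failure
            sleep(rng() * min(cap_seconds, base_seconds * 2 ** attempt))
```

Jitter matters because clients that fail together otherwise retry together, turning one outage into a series of synchronized spikes. The loop is only safe around operations that are idempotent or deduplicated.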

Serve reduced functionality instead of full failure: degrade recommendation features under load, keep payments working.

Circuit breaker trips if error rate > X% and number of requests > Y in window Z. Retry up to N times with backoff base B and jitter J.
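The tripping rule above (error rate above X% with more than Y requests in window Z) can be sketched as a sliding-window breaker. This is a simplified illustration: the class name and parameters are hypothetical, timestamps are passed in explicitly, and after the cooldown the breaker simply closes with a fresh window instead of implementing a full half-open probe state.

```python
from collections import deque


class CircuitBreaker:
    def __init__(self, error_rate_threshold, min_requests,
                 window_seconds, cooldown_seconds):
        self.error_rate_threshold = error_rate_threshold  # X, as a fraction
        self.min_requests = min_requests                  # Y
        self.window_seconds = window_seconds              # Z
        self.cooldown_seconds = cooldown_seconds
        self._events = deque()  # (timestamp, succeeded)
        self._opened_at = None

    def allow(self, now):
        """False while open; closes again once the cooldown has elapsed."""
        if self._opened_at is None:
            return True
        if now - self._opened_at >= self.cooldown_seconds:
            self._opened_at = None
            self._events.clear()
            return True
        return False

    def record(self, now, succeeded):
        """Record an outcome and trip if the window breaches both limits."""
        self._events.append((now, succeeded))
        while self._events and self._events[0][0] <= now - self.window_seconds:
            self._events.popleft()
        total = len(self._events)
        failures = sum(1 for _, ok in self._events if not ok)
        if total > self.min_requests and failures / total > self.error_rate_threshold:
            self._opened_at = now
```

The minimum request count prevents a single failed call on a quiet service from opening the circuit, which is why the rule needs both conditions.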

Sutra 10

A king without records forgets his lineage; a system without backups forgets its state.

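RPO and RTO determine backup cadence and restore design: the recovery point objective bounds how much data you can afford to lose, and the recovery time objective bounds how long a restore may take.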

For databases: point-in-time recovery, incremental backups, tested restores
For object storage: replication across regions + lifecycle policies

Test restores regularly. Restore process must be part of CI/CD; test in isolated infra.

Use backward-compatible schema changes (add optional columns, create new tables). Online migration strategies: dual-write + backfill, blue-green DB migrations.

Sutra 11

An open gate invites invasion; secrecy and limits protect the realm.

Principle of least privilege. Threat model early.

IAM roles per service with minimal permissions. Short-lived credentials and mutual TLS for service-to-service auth.

Threat model early. Build the list of assets, threats and mitigations. Automate security controls: dependency scanning, secret scanning, SAST / DAST pipelines.

Encrypt at rest and in transit. Mask PII and enforce access policies; audit all access. Maintain and rehearse an incident response plan and a public communication template.

Sutra 12

A wise king does not burn the city with each celebration.

Canary releases. Blue-green. Feature flags.

Canary releases — route small % of traffic to new version; monitor errors and latency
Blue-green — instant rollback by switching routing
Feature flags — decouple deploy from release
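Both canary routing and percentage-based feature flags need a stable way to decide who is in the small percentage. One common approach, sketched here under the assumption that users are identified by a string ID, is to hash the flag name together with the user ID into one of 100 buckets. The function names and the flag name are hypothetical.

```python
import hashlib


def rollout_bucket(user_id, flag_name):
    """Map a user to a stable bucket in [0, 100) for a given flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100


def in_rollout(user_id, flag_name, percent):
    """True if the user falls inside the first `percent` buckets."""
    return rollout_bucket(user_id, flag_name) < percent


users = [f"user-{i}" for i in range(1000)]
exposed = sum(in_rollout(u, "new-checkout", 5) for u in users)
print(f"{exposed} of {len(users)} users see the new version")
```

Because the bucket is derived from the ID rather than drawn at random per request, a user stays on the same side of the flag across requests, and raising the percentage only adds users without removing any. Salting with the flag name keeps different flags from always exposing the same users.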

Infrastructure as code. Immutable images / containers. Declarative manifests and scripted rollbacks.

Gate releases by SLO health, smoke tests, and canary stability. Automate post-deploy verification.

Sutra 13

As soon as the fear approaches near, attack and destroy it.

Chaos engineering is you bringing fear close, so you learn to destroy it on your terms.

Start small. Fault-injection on non-critical paths. Inject latency, drop packets, kill instances.

Runbook-driven experiments. Predefine hypothesis, metrics, blast radius, and rollback plan. Use chaos results to improve runbooks, timeouts, and redundancy.

Sutra 14

If knowledge dies with the king, the realm enters anarchy.

Living docs. Onboarding as part of code quality.

System overviews, data flows, API contracts — single source of truth. Runbooks for incidents & postmortems.

A new engineer should be able to ship a small, safe change within 1–2 weeks. If not, the system is too complex.

Sutra 15

Not knowledge alone, nor force alone; with both the king endures.

Balance autonomy with platform conventions.

Governance: balance autonomy with platform conventions. Enforce standards without becoming a blockade.

Cost: attribute spend by service; teams own their cost and its optimization.

Ethics: guard user data. Trade-offs made for speed must be weighed with privacy in mind.

Build with the humility of the defeated and the strategy of the conqueror. A great system is not one that dazzles at launch — it is the one that stands silent through ten winters of traffic, betrayal, and greed.

— System Nīti, on the nature of resilience

16 The Decree

10 Steps to a Resilient Realm

Design not for glory, but for surviving the day the unknown arrives.

Define success

Pick 3 metrics: latency p95, availability %, cost per million.

Draw boundaries

One page diagram of services & data ownership.

Choose stores

Map each data domain to DB type and partition key.

Instrument

Wire metrics, traces, logs, and a dashboard.

Set SLOs

Error budget & release policy.

Automate deployment

CI / CD, canaries, feature flags.

Practice failure

Scheduled chaos, restore drills, postmortems.

Document

Publish runbooks and onboarding paths.

Govern

Retrospective on cost & risk every quarter.

Repeat

Architecture evolves; revisit the three questions every sprint.
