API / Microservices Design Patterns Interview Questions
The Decompose by Business Capability pattern assigns one microservice per business capability — a stable, high-level function the organisation performs to deliver value. A business capability answers "what does this part of the business do?" not "how does the software do it?", so capabilities make durable service boundaries that outlast technology changes.
For an online retailer, a first-pass capability map might look like: Order Management (placement, tracking, cancellations), Inventory Management (stock levels, reservations, replenishment), Customer Management (profiles, preferences, loyalty), Payment Processing (authorisation, settlement, refunds), and Fulfilment (pick, pack, dispatch). Each becomes a microservice candidate.
How to identify capabilities in practice:
- Capability mapping workshops — work top-down with business stakeholders. Ask "what does this team do?" rather than "what systems do they use?". Group answers into named, stable functions.
- Organisational alignment — each recognisable business unit typically owns one or more capabilities. Conway's Law predicts that service structure will mirror team communication patterns.
- Stability test — a valid capability has existed for years and will persist even if the technology stack is replaced entirely. "Accept and fulfil orders" has been a retail capability since long before e-commerce.
- Independence check — if a capability can be changed and deployed without touching a sibling capability, it is a good service boundary. Frequent cross-team coordination to deploy one service is a signal that the boundary is wrong.
Sub-capabilities (e.g., Refunds within Order Management) may become separate services only when they evolve at a different pace, have distinct scaling needs, or are owned by a dedicated team. The deliverable is a capability map — a living diagram that drives service topology and team ownership throughout the programme.
The Decompose by Subdomain pattern uses Eric Evans' Domain-Driven Design taxonomy to carve out service boundaries. A subdomain is a coherent slice of the problem domain. Instead of decomposing by technical layer or org chart, you model the real-world domain first, then map each subdomain to one or a small cluster of services.
DDD classifies every subdomain into one of three types, which directly influence investment and build-vs-buy decisions:
- Core subdomain — the competitive differentiator. This is where the business wins or loses. Custom-built with the best engineers. Example: personalised recommendation engine at a streaming service.
- Supporting subdomain — necessary but not differentiating. Still custom-built but simpler. Example: a notification service that sends emails and push alerts.
- Generic subdomain — commodity functionality available off-the-shelf. Buy or use open source; do not reinvent. Example: user authentication and identity management (Keycloak, Auth0).
A Bounded Context defines the explicit boundary within which a specific domain model applies. Two subdomains may share a term — "Customer" in Sales has a credit limit and purchase history; "Customer" in Support has open tickets and SLA status — but forcing one "Customer" object to serve both contexts creates a bloated model. A Bounded Context keeps these clean and separate.
The relationship between subdomains and Bounded Contexts is often 1:1, but a large core subdomain may be split into multiple Bounded Contexts for team autonomy. Each Bounded Context is a strong microservice candidate with its own data store and ubiquitous language. The Context Map documents how contexts integrate: via Shared Kernel, Customer/Supplier, Conformist, Anti-Corruption Layer, or Open Host Service relationships — each implying different levels of coupling.
The Strangler Fig pattern — coined by Martin Fowler after the strangler fig tree that gradually wraps and replaces its host — is an incremental migration strategy for moving functionality out of a monolith into microservices. Instead of a risky "big bang" rewrite, you build new services alongside the running monolith, route traffic to them one capability at a time, and eventually decommission the hollowed-out monolith.
The three-step migration cycle for each capability:
- Insert a facade — place a reverse proxy, API gateway, or routing layer in front of the monolith. All traffic flows through this facade, which initially passes everything unchanged to the monolith.
- Extract and build — implement the selected capability as a new microservice with its own data store. Migrate the relevant data. Run dark launches or a Parallel Run to validate correctness.
- Redirect — update the facade to route requests for that capability to the new service. The monolith no longer handles it. Repeat for the next capability.
Over many iterations, the monolith shrinks (it is "strangled") until it handles no capabilities and can be switched off.
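A minimal sketch of the facade's routing decision — the paths, service names, and ports are illustrative assumptions, and a real deployment would typically express the same rules in a reverse proxy or API gateway configuration:

// Hypothetical Strangler Fig routing facade (Java): extracted capabilities go to new
// services, everything else still falls through to the monolith
import java.util.LinkedHashMap;
import java.util.Map;

class StranglerFacadeRouter {
    // Grows one entry at a time as capabilities are extracted
    private static final Map<String, String> EXTRACTED = new LinkedHashMap<>();
    static {
        EXTRACTED.put("/api/payments/", "http://payment-service:8080");
        EXTRACTED.put("/api/inventory/", "http://inventory-service:8081");
    }
    private static final String MONOLITH = "http://legacy-monolith:8080";

    static String targetFor(String path) {
        return EXTRACTED.entrySet().stream()
                .filter(e -> path.startsWith(e.getKey()))
                .map(Map.Entry::getValue)
                .findFirst()
                .orElse(MONOLITH); // not yet extracted — still handled by the monolith
    }
}

Once the routing table covers every capability, the monolith entry serves no traffic and can be decommissioned.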
When to use it: Use the Strangler Fig whenever the monolith is too large, complex, or poorly understood to rewrite safely in one step; when the business cannot afford a freeze on new feature delivery during migration; or when the team needs to build confidence in microservice patterns before committing fully. It is the default recommended strategy for production monolith migrations.
When not to use it: If the monolith is small and the codebase is well understood, a targeted rewrite may be faster. Also avoid if the monolith's architecture makes clean extraction practically impossible without massive refactoring first — in that case, Branch by Abstraction (Q5) must precede the extraction.
The Anti-Corruption Layer (ACL) is a translation boundary placed at the edge of a service to prevent an external model — typically from a legacy system or a foreign bounded context — from contaminating the service's own domain model. Without it, the consuming service must adopt the vocabulary, data shapes, and assumptions of the external system, gradually corrupting its clean internal design.
The ACL consists of three collaborating components:
- Facade — presents a clean interface to the internal domain, hiding the existence of the external system entirely.
- Adapter — calls the external system's API, reads its events from a message broker, or queries its data store on behalf of the facade.
- Translator/Mapper — converts between the external model and the internal domain model in both directions. For example, a legacy ERP might call a product a "SKU Item" with a flat unitPrice integer field; the translator converts this into the internal Product aggregate with a proper Money value object (see the sketch below).
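A minimal sketch of that translation under the assumptions above — SkuItemDto, Product, and Money are illustrative names, not a prescribed API:

// Hypothetical ACL translator: legacy ERP "SKU Item" → internal Product aggregate
import java.math.BigDecimal;
import java.util.Currency;

record Money(BigDecimal amount, Currency currency) {}               // internal value object
record Product(String productId, String name, Money unitPrice) {}   // simplified internal aggregate

class SkuItemDto {                  // shape dictated by the legacy ERP
    String skuCode;
    String description;
    long unitPrice;                 // flat integer in minor units (e.g. cents)
    String currencyCode;
}

class LegacyProductTranslator {
    Product toDomain(SkuItemDto dto) {
        Money price = new Money(
                BigDecimal.valueOf(dto.unitPrice).movePointLeft(2),  // minor units → decimal amount
                Currency.getInstance(dto.currencyCode));
        return new Product(dto.skuCode, dto.description, price);
    }
}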
Key use cases:
- Integrating with a legacy monolith during a Strangler Fig migration — the new service has its own clean model while the ACL handles translation to/from the old system.
- Consuming a third-party SaaS API without exposing its schema to your core domain model.
- Bridging two bounded contexts that use overlapping but divergent concepts of the same entity.
The ACL ensures that internal domain objects evolve independently of whatever is outside the boundary. When the external system changes its model, only the ACL needs updating — not the core domain logic. This is especially valuable during long-running migrations where the legacy system remains in use for months or years.
Branch by Abstraction is an incremental migration technique that replaces an existing component without disrupting the codebase or requiring a long-lived code branch. The key mechanic is introducing an abstraction (an interface or abstract class) over the existing component so that all callers depend on the abstraction rather than the concrete implementation. The replacement is then developed behind that same abstraction and switched in when ready.
The four-step process:
- Create the abstraction — introduce an interface or abstract class that captures the component's contract. Update all callers to program against the abstraction, not the concrete class. The system still works exactly as before; only the coupling direction has changed.
- Implement the new version — build the replacement (e.g., a microservice client stub that calls the extracted service) behind the same abstraction. Both the old and new implementations exist simultaneously in the main branch.
- Route clients progressively — use a configuration flag, feature toggle, or simple factory to direct some or all callers to the new implementation while the old one remains available as a fallback.
- Remove the old implementation — once the new implementation is validated in production and all traffic is routed to it, delete the old code. The abstraction itself may also be removed if it no longer serves a purpose.
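A minimal Java sketch of the mechanics; PricingEngine, the two implementations, and the toggle flag are illustrative assumptions:

// Step 1: the abstraction every caller depends on
interface PricingEngine {
    double priceFor(String sku);
}

// Old implementation — the logic still living inside the monolith
class LegacyPricingEngine implements PricingEngine {
    public double priceFor(String sku) { /* existing in-process calculation */ return 9.99; }
}

// Step 2: new implementation — a client stub calling the extracted service
class RemotePricingEngine implements PricingEngine {
    public double priceFor(String sku) { /* HTTP/gRPC call to pricing-service */ return 9.99; }
}

// Step 3: a toggle/factory routes callers; the old path remains as a fallback
class PricingEngineFactory {
    static PricingEngine create(boolean useRemotePricing) {
        return useRemotePricing ? new RemotePricingEngine() : new LegacyPricingEngine();
    }
}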
The critical property of this pattern is that the main codebase remains in a releasable state throughout the migration. There is no feature branch that diverges from main for weeks; the entire process happens in small, shippable increments on the trunk. This makes it a natural complement to the Strangler Fig pattern: Branch by Abstraction prepares the seam at which the Strangler Fig can extract a capability.
The Parallel Run pattern runs an old and a new implementation simultaneously against the same live production input, comparing their outputs to verify correctness before committing to the new system. The legacy system's response is always returned to the caller — it remains the source of truth. The new system's response is captured asynchronously, compared in the background, and any discrepancies are surfaced to developers.
The flow for each production request:
- An intercepting component (a routing layer, the calling service, or a library) fans the request out to both the legacy system and the new service.
- The legacy system's response is returned to the caller immediately — no user impact if the new service is slow or fails.
- The new service's response is captured asynchronously and compared with the legacy response field-by-field.
- Mismatches are logged with enough context for developers to reproduce and diagnose the discrepancy.
- When the mismatch rate reaches zero over a sustained period, the new service takes over as authoritative and the legacy path is removed.
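A minimal sketch of the intercepting component, assuming both implementations can be invoked behind the same interface; the legacy answer is always returned immediately and the comparison happens off the hot path:

// Hypothetical parallel-run wrapper: legacy result is authoritative, new result is only compared
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

class ParallelRun<T> {
    T run(Supplier<T> legacy, Supplier<T> candidate, String requestContext) {
        T legacyResult = legacy.get();                      // always returned to the caller
        CompletableFuture
                .supplyAsync(candidate)                     // new implementation runs asynchronously
                .thenAccept(candidateResult -> {
                    if (!legacyResult.equals(candidateResult)) {
                        // log enough context to reproduce and diagnose the mismatch
                        System.err.printf("MISMATCH [%s]: legacy=%s candidate=%s%n",
                                requestContext, legacyResult, candidateResult);
                    }
                })
                .exceptionally(ex -> null);                 // candidate failures never reach the user
        return legacyResult;
    }
}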
GitHub's open-source Scientist library (Ruby) popularised this technique under the name "controlled experiments". The pattern is particularly valuable for stateful, business-critical calculations — pricing engines, financial reconciliations, eligibility rules — where unit tests cannot fully cover the diversity of real production inputs.
The key safety guarantee: the new service can produce wrong answers, time out, or crash during the parallel phase, and no user is ever affected. This makes it possible to run experiments on 100% of production traffic while accepting zero user-facing risk from the new implementation.
The Bulkhead pattern — named after the watertight compartments in a ship's hull that prevent a single breach from flooding the entire vessel — partitions a system into isolated failure domains so that a critical failure in one domain cannot cascade to others. In the context of service decomposition, it means deliberately grouping services, their infrastructure, and their resource pools into segments that share no mutable state or critical resources with adjacent segments.
A concrete decomposition example: an e-commerce platform partitions into a Browse & Search bulkhead (product catalog, search index, recommendations) and a Checkout & Payments bulkhead (cart, order placement, payment gateway). Even if the Elasticsearch cluster powering search becomes overloaded or crashes entirely, the checkout flow is completely unaffected — it uses a separate set of services, database clusters, thread pools, and message broker topics.
Isolation strategies applied at each level:
- Process isolation — separate containers or OS processes mean a crash or OOM in one service does not affect another.
- Thread/connection pool isolation — each downstream dependency gets its own bounded pool, preventing a slow dependency from exhausting shared resources (this is the resource-level Bulkhead, covered in Q28).
- Infrastructure isolation — separate database clusters, separate message broker partitions, and separate network segments per bulkhead limit the blast radius of an infrastructure failure.
- Deployment isolation — placing bulkheads in separate Kubernetes namespaces, availability zones, or cloud regions ensures that a zone-level outage affects only one bulkhead.
The trade-off is cost: infrastructure isolation requires duplicated resources. Bulkheads are most justified on revenue-critical paths where the cost of cascading failure — lost transactions, SLA breaches, reputational damage — outweighs the overhead of duplication.
The Database per Service pattern mandates that each microservice owns its own persistent data store exclusively. No other service may directly read or write to that store — access is only possible through the owning service's published API. The store may be a separate schema in the same RDBMS engine, a fully separate server instance, or an entirely different database technology chosen to match the service's data model and access patterns.
The core problem it solves is structural data coupling. When services share a database, a schema change in one table can silently break every other service that reads it. Two teams must coordinate every deployment that touches shared tables, making independent deployment — a foundational goal of microservices — impossible in practice.
Benefits enabled by this pattern:
- Independent deployment — schema migrations are scoped to a single service. No cross-team release coordination required.
- Polyglot persistence — each service chooses the database best suited to its workload: relational for orders, document store for product catalog, time-series for IoT metrics, graph for social connections.
- Fault isolation — a database outage in one service does not directly cascade to other services that have separate stores.
- Independent scaling — a high-read service can add read replicas or a caching layer without affecting other services' data infrastructure.
The trade-off is that cross-service queries cannot use SQL JOINs. Queries that used to be a single SQL statement across multiple tables must now be composed at the application level using the API Composition pattern (Q14) or a dedicated CQRS read model (Q12). Cross-service writes must use the Saga pattern (Q10) rather than a single ACID transaction.
The Shared Database anti-pattern occurs when two or more microservices bypass each other's APIs to directly read from and write to the same database schema. It is the most common mistake teams make when splitting a monolith, because it initially appears to be the easiest path — split the code but keep a single database.
Why it fundamentally undermines the microservices model:
- Schema coupling — any team wanting to rename a column, add a NOT NULL constraint, or change a table structure must coordinate with every team whose service touches that table. A simple schema change becomes a multi-team, multi-sprint event.
- Loss of independent deployability — if Service A changes the orders table, Service B and Service C must be updated and redeployed simultaneously. The services cannot be independently deployed.
- Hidden dependencies — the coupling is invisible at the API level. No OpenAPI spec or contract test captures it. It surfaces unexpectedly as a runtime breakage during incidents.
- Technology lock-in — all services are forced to use the same database engine, preventing polyglot persistence and specialised data modelling.
- Operational coupling — a runaway query or bulk migration in one service can saturate the shared database's connection pool and I/O capacity, degrading every other service that shares the same instance.
The correct alternative is the Database per Service pattern (Q8), with cross-service reads handled via API Composition (Q14) or CQRS (Q12), and cross-service writes handled via Sagas (Q10). The short-term pain of separating data stores pays back quickly in deployment independence and incident isolation.
The Saga pattern manages a long-running business transaction that spans multiple services without using a distributed two-phase commit (2PC). A Saga is a sequence of local transactions: each step performs a local commit and then publishes an event or sends a command to trigger the next step. If a step fails, the Saga executes compensating transactions — semantic undos — for each previously completed step.
Example — Place Order Saga:
FORWARD STEPS
1. Order Service → INSERT order (status=PENDING)
→ emit OrderCreated
2. Inventory Svc → UPDATE stock (reserve qty)
→ emit StockReserved OR StockReservationFailed
3. Payment Svc → charge customer card
→ emit PaymentProcessed OR PaymentFailed
4. Order Service → UPDATE order (status=CONFIRMED)
COMPENSATION (if PaymentFailed at step 3)
← Inventory Svc → release reservation (compensate step 2)
← Order Svc → UPDATE order (status=CANCELLED) (compensate step 1)
Key properties:
- ACD, not full ACID — Sagas provide Atomicity (all steps complete or are compensated), Consistency (at application level), and Durability, but not Isolation. Intermediate states are visible to concurrent operations, requiring careful handling of anomalies such as dirty reads and lost updates.
- Eventual consistency — the system reaches a globally consistent state eventually, not immediately after each step.
- Two coordination styles — Choreography (event-driven, no central coordinator) and Orchestration (central saga orchestrator directs participants). See Q11 for the comparison.
Sagas are the standard replacement for distributed transactions in microservice architectures because they work across heterogeneous data stores and do not require all participating services to hold locks simultaneously.
Both styles implement the Saga pattern (Q10) but differ fundamentally in how the steps are coordinated. In Choreography, there is no central authority: each service listens for domain events published by the preceding step and reacts autonomously, emitting its own event to trigger the next participant. In Orchestration, a dedicated Saga orchestrator sends explicit commands to each participant and receives success/failure responses, driving the workflow from a single place.
| Aspect | Choreography | Orchestration |
|---|---|---|
| Coordination | Implicit via domain events on a message broker | Explicit commands from a central orchestrator |
| Coupling | Services are coupled to event topics, not to each other | Orchestrator is coupled to each participant service |
| Visibility | Flow is distributed across services; hard to visualise end-to-end | Entire saga flow is explicit in the orchestrator's state machine |
| Debugging | Requires correlating events across multiple logs and services | Orchestrator state shows exact step and failure point |
| Scalability | Good — no central bottleneck | Orchestrator can become a bottleneck at very high throughput |
| Best for | Simple 2–3 step sagas; teams that already use event-driven patterns | Complex multi-step sagas with many compensations and error paths |
Choreography example — OrderService emits OrderCreated; InventoryService listens and emits StockReserved; PaymentService listens and emits PaymentProcessed; OrderService listens and marks the order confirmed. No single component knows the full flow.
Orchestration example (using AWS Step Functions or Temporal):
OrderSagaOrchestrator:
1. send ReserveStockCommand → InventoryService
2. on StockReserved: send ChargePaymentCommand → PaymentService
3. on PaymentProcessed: send ConfirmOrderCommand → OrderService
4. on PaymentFailed: send ReleaseStockCommand → InventoryService (compensate)
CQRS separates a service's data model into two distinct paths: a Command side that handles writes (state changes) and a Query side that handles reads. Each side can use a different data store, different data model, and even a different technology stack, optimised independently for its purpose.
On the Command side, commands express intent (PlaceOrder, UpdateShippingAddress) and mutate the authoritative write model. On the Query side, pre-built, denormalised read models serve specific views efficiently — for example, an order summary view that joins order, customer, and product data is materialised as a single flat document that can be served without any JOINs.
// Command side (write)
class PlaceOrderCommand { orderId, customerId, items[] }
class OrderCommandHandler {
handle(PlaceOrderCommand cmd) {
Order order = new Order(cmd.orderId, cmd.customerId, cmd.items);
orderRepository.save(order); // write to normalised DB
eventBus.publish(new OrderPlaced(order)); // update read models
}
}
// Query side (read)
class OrderSummaryQuery { orderId }
class OrderQueryHandler {
handle(OrderSummaryQuery q) {
return orderSummaryReadModel.findById(q.orderId); // pre-built view
}
}
When to use CQRS:
- Read and write traffic profiles differ significantly (e.g., 100:1 read-to-write ratio) and require separate scaling strategies.
- Complex domain logic on the write side conflicts with simple, fast reads that need denormalised projections.
- You are using Event Sourcing (Q13), which pairs naturally with CQRS because events update separate read projections.
- The service needs to serve multiple different view shapes to different consumers (mobile, web, analytics) without a one-size-fits-all query model.
CQRS adds complexity: two models to maintain, eventual consistency between them (the read side lags the write side), and more infrastructure. It is over-engineering for simple CRUD services where a single model suffices.
Event Sourcing stores the state of a domain entity not as its current snapshot in a row, but as an append-only log of every domain event that has ever happened to it. The current state is derived on demand by replaying all events for that entity from the beginning (or from the most recent snapshot). Nothing is ever deleted or updated in place — the event log is immutable.
// Event store: append-only
events for Order#42:
[1] OrderPlaced { customerId: C1, items: [...] } t=09:00
[2] AddressUpdated { newAddress: "123 Main St" } t=09:05
[3] PaymentReceived { amount: 59.99, txnId: T77 } t=09:07
[4] OrderShipped { carrier: "UPS", trackingId: U99 } t=09:30
// Replay to derive current state
Order order = new Order();
events.forEach(e -> order.apply(e));
// order.status == SHIPPED
Key properties:
- Full audit trail — every state transition is recorded with its timestamp and actor. No separate audit log needed.
- Temporal queries — you can reconstruct the state of any entity at any point in time by replaying up to a given event position.
- Event replay — if a bug introduced wrong state, replay events through the fixed logic to regenerate correct state.
- Snapshots — for entities with thousands of events, periodic snapshots cache the state at a point in time, allowing replay to start from the snapshot rather than event zero.
Complementing CQRS: Event Sourcing and CQRS are a natural pairing. Every event written to the write-side event store is also published to subscribers that update one or more denormalised read projections (the CQRS query side). Each projection can be an independent view optimised for a specific consumer: a mobile summary, a warehouse pick list, an analytics fact table. When a projection's logic changes, it can be rebuilt by replaying the complete event history.
The API Composition pattern implements a query that requires data from multiple microservices by having an API composer — typically the API gateway, a BFF, or a dedicated aggregation service — call each relevant service in parallel, then join and transform the results in memory before returning a single response to the caller. It replaces the cross-service SQL JOIN that becomes impossible when each service owns its own database.
For example, a "Get Order Details" screen needs data from three services: Order Service (order status, items, timestamps), Customer Service (name, shipping address), and Product Service (product names, images). The composer calls all three — ideally in parallel — and merges the results into one response document.
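A minimal composer sketch for that screen, assuming each service exposes an asynchronous client; the client interfaces and the OrderDetails shape are illustrative:

// Hypothetical API composer: three parallel calls merged in memory into one response
import java.util.List;
import java.util.concurrent.CompletableFuture;

record OrderDetails(Object order, Object customer, List<Object> products) {}

class OrderDetailsComposer {
    private final OrderClient orders;
    private final CustomerClient customers;
    private final ProductClient products;

    OrderDetailsComposer(OrderClient o, CustomerClient c, ProductClient p) {
        this.orders = o; this.customers = c; this.products = p;
    }

    CompletableFuture<OrderDetails> getOrderDetails(String orderId) {
        CompletableFuture<Object> orderF          = orders.getOrder(orderId);
        CompletableFuture<Object> customerF       = customers.getCustomerForOrder(orderId);
        CompletableFuture<List<Object>> productsF = products.getProductsForOrder(orderId);

        // Join in memory once all three responses have arrived
        return CompletableFuture.allOf(orderF, customerF, productsF)
                .thenApply(v -> new OrderDetails(orderF.join(), customerF.join(), productsF.join()));
    }
}

interface OrderClient    { CompletableFuture<Object> getOrder(String id); }
interface CustomerClient { CompletableFuture<Object> getCustomerForOrder(String id); }
interface ProductClient  { CompletableFuture<List<Object>> getProductsForOrder(String id); }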
The approach works well when:
- The amount of data being joined is manageable in memory (hundreds to low thousands of records).
- Queries do not require complex aggregations such as GROUP BY, SUM, or window functions across large datasets.
- Response-time SLAs are met even after adding the latency of parallel service calls.
Limitations of API Composition:
- No transactional consistency — data is fetched from multiple services at different instants, so results may reflect slightly different states.
- Scalability — joining large datasets (e.g., all orders in the last year with full customer profiles) in memory is expensive and may exhaust the composer's heap.
- Complexity — partial failures (one service is down) must be handled gracefully; the composer must decide whether to return partial results or fail the request.
For queries that are too complex or too large for in-memory joining, the CQRS pattern (Q12) with a pre-built, denormalised read model is the preferred alternative.
The dual-write problem arises when a service must atomically write to its own database and publish a message to a message broker in a single operation. If it writes to the DB but crashes before publishing, other services never learn about the change. If it publishes first but the DB write fails, it emits an event for something that never happened. Standard distributed transactions (2PC) across a database and a broker are too heavy and often unsupported.
The Outbox Pattern solves this by writing both the domain change and the message to-be-published in a single local database transaction, then using a relay process to forward outbox records to the broker asynchronously.
-- Same local DB transaction:
BEGIN;
INSERT INTO orders (id, status, ...) VALUES (42, 'PENDING', ...);
INSERT INTO outbox (id, aggregate_type, aggregate_id, event_type, payload)
VALUES (gen_uuid(), 'Order', 42, 'OrderPlaced', '{...json...}');
COMMIT;
-- Relay process (Message Relay / Transactional Outbox Relay):
LOOP:
rows = SELECT * FROM outbox WHERE published = false ORDER BY created_at LIMIT 100;
FOR EACH row:
broker.publish(topic=row.event_type, body=row.payload);
UPDATE outbox SET published = true WHERE id = row.id;
Two relay strategies exist:
- Polling publisher — a background thread or scheduled job polls the outbox table and publishes unpublished records. Simple but adds slight latency and DB load.
- Transaction log tailing — tools like Debezium use the database's CDC (Change Data Capture) log (e.g., PostgreSQL WAL, MySQL binlog) to detect outbox inserts and publish them. Near-real-time with minimal DB overhead.
The relay must publish at-least-once (idempotency key = outbox row ID), so consumers must implement the Idempotent Consumer pattern (Q22) to handle rare duplicates gracefully.
In a Saga (Q10), when a step fails, previously completed steps cannot be undone with a database ROLLBACK because each step has already committed its local transaction and those locks are released. Instead, the Saga executes compensating transactions — purpose-built operations that reverse the business effect of each completed step in reverse order.
A compensating transaction is a semantic undo, not a technical rollback. The key distinction:
- A technical rollback is performed by the database engine before a transaction commits — it undoes uncommitted SQL statements.
- A compensating transaction is a new, forward-moving operation that creates the business-level opposite of an already-committed action.
Example compensations:
- Forward step: reserve 5 units of stock → Compensation: release 5 units of stock reservation
- Forward step: charge customer 9.99 → Compensation: refund customer 9.99
- Forward step: create order in PENDING status → Compensation: update order status to CANCELLED
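A minimal sketch of how an orchestrator can pair each forward step with its compensation and unwind completed steps in reverse order; the SagaStep interface is an illustrative assumption:

// Hypothetical saga step with a paired semantic undo
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

interface SagaStep {
    void execute();        // forward local transaction (commits immediately)
    void compensate();     // business-level opposite of the committed action
}

class PlaceOrderSaga {
    void run(List<SagaStep> steps) {
        Deque<SagaStep> completed = new ArrayDeque<>();
        for (SagaStep step : steps) {
            try {
                step.execute();
                completed.push(step);
            } catch (RuntimeException failure) {
                // unwind already-committed steps in reverse order
                while (!completed.isEmpty()) {
                    completed.pop().compensate();   // each compensation must be idempotent
                }
                throw failure;
            }
        }
    }
}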
Important edge cases:
- Pivot transactions — not all Saga steps can be compensated. A step that sends a physical shipment or charges a non-refundable fee is a pivot transaction; if it succeeds, the Saga must complete rather than roll back.
- Retriable transactions — some steps after the pivot are guaranteed to succeed eventually (e.g., updating an order status). These steps are retried until success rather than being compensated.
- Idempotency — compensating transactions may be retried if the Saga coordination infrastructure fails, so each compensation must be idempotent.
The API Gateway is a single entry point that sits between external clients and the internal microservice topology. Rather than exposing each service's API directly to the internet, all traffic flows through the gateway. It handles cross-cutting concerns so that individual services do not have to implement them repeatedly.
Core gateway responsibilities:
- Request routing — maps an inbound URL/path to the appropriate downstream service endpoint.
- Authentication and authorisation — validates JWT tokens or API keys, optionally enriches the request with user identity claims before forwarding.
- SSL/TLS termination — terminates HTTPS at the gateway; downstream traffic on the internal network may use HTTP or mTLS separately.
- Rate limiting and throttling — enforces per-client request budgets to prevent abuse.
- Request/response transformation — rewrites headers, translates between REST and gRPC, aggregates partial responses.
# Example: Kong or Nginx API Gateway routing rule
/api/orders/* → http://order-service:8080
/api/products/* → http://product-service:8081
/api/customers/* → http://customer-service:8082
A general-purpose gateway serves all client types (mobile, web, third-party) through a single API surface. The Backend for Frontend (BFF) pattern (Q18) splits this into client-type-specific gateways when different clients have materially different needs — mobile needs smaller payloads and fewer fields; web needs richer aggregated responses. Moving client-specific transformation logic into a BFF prevents the general gateway from accumulating ever-growing, client-specific business logic that makes it fragile and slow to change.
The rule of thumb: use a general gateway for cross-cutting platform concerns (auth, SSL, rate limiting). Use a BFF for client-tailored data shaping and aggregation. The two can coexist — the BFF sits behind the general gateway.
The Backend for Frontend (BFF) pattern creates a dedicated API backend for each distinct client type — one BFF for the mobile app, one for the web SPA, one for third-party integrations. Each BFF is owned by the team building that frontend and is free to shape, aggregate, and optimise responses exactly as its client needs, without compromising the API shape that other clients rely on.
The driving insight is that different clients have genuinely different needs. A mobile app on a 4G connection needs lightweight payloads with only the fields it displays. A web dashboard needs richer, pre-aggregated data across multiple services. A third-party partner API needs a stable, versioned contract independent of UI feature work.
Client request: GET /mobile/orders/42
Mobile BFF:
parallel fetch:
order = orderService.get(42) // id, status, total only
status = shippingService.track(42) // latest event only
return { id, status, total, latestTracking } // 4 fields, ~200 bytes
Client request: GET /web/orders/42
Web BFF:
parallel fetch:
order = orderService.get(42) // full order model
customer = customerService.get(order.customerId)
items = productService.getBulk(order.itemIds)
return { order, customer, itemDetails } // rich object, ~4 KB
When to use BFF over a general gateway:
- Multiple client types exist with divergent payload, filtering, or aggregation requirements.
- Frontend teams are blocked by a shared gateway team whenever they need API changes.
- Mobile clients suffer from over-fetching because the API was designed for a richer web client.
BFFs and a general API gateway often coexist: the general gateway sits at the edge and handles cross-cutting concerns (auth, SSL, DDoS); each BFF sits behind it and handles client-specific orchestration. A BFF is not a replacement for the gateway — it is a specialisation layer on top.
A Service Mesh is an infrastructure layer that handles all service-to-service communication concerns — traffic management, mutual TLS, retries, circuit breaking, observability — without requiring application code to implement any of it. It consists of two planes:
- Data plane — a sidecar proxy (Envoy, Linkerd-proxy) injected into every pod. All inbound and outbound network traffic for the application container flows through the sidecar. The sidecar applies policies, collects telemetry, and enforces mTLS transparently.
- Control plane — manages and configures the sidecar fleet (Istio Pilot, Linkerd control plane). It distributes routing rules, certificates, and traffic policies to each proxy. The control plane is never in the hot path of production traffic.
# Istio VirtualService — traffic splitting for canary release
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
name: reviews
spec:
hosts: [reviews]
http:
- route:
- destination:
host: reviews
subset: v1
weight: 90
- destination:
host: reviews
subset: v2
weight: 10
What Envoy (data plane) handles per-request:
- mTLS — terminates inbound TLS and initiates outbound TLS with the peer's certificate, providing service identity without code changes.
- Retries and timeouts — configurable retry budgets and per-route timeouts enforced at the proxy level.
- Circuit breaking — ejects unhealthy upstream hosts from the load-balancing pool.
- Distributed tracing — propagates W3C traceparent headers and emits Zipkin-compatible spans.
- Traffic splitting — routes a percentage of traffic to canary versions of a service without touching application code.
The Service Mesh is appropriate when an organisation operates many services written in multiple languages and wants consistent, policy-driven networking without embedding SDK-level resilience logic in every service.
The Message Broker pattern introduces a durable intermediary — the broker (Apache Kafka, RabbitMQ, AWS SQS/SNS) — between a producer service and one or more consumer services. The producer publishes a message to the broker and returns immediately without waiting for consumers to process it. Consumers pull messages when ready. The broker stores messages until they are acknowledged, providing durability and decoupling.
// Producer (Java/Kafka)
ProducerRecord<String, String> record =
new ProducerRecord<>("order-events", orderId, orderEventJson);
kafkaProducer.send(record); // non-blocking; returns immediately
// No dependency on whether any consumer is alive
// Consumer (Java/Kafka)
ConsumerRecords<String, String> records =
    kafkaConsumer.poll(Duration.ofMillis(500));
for (ConsumerRecord<String, String> r : records) {
    orderEventHandler.handle(r.value());
}
kafkaConsumer.commitSync(); // acknowledge offsets only after every polled record is processed
Types of messaging models:
- Publish/Subscribe (topic) — one message is delivered to all subscribed consumer groups. Used for domain events (e.g., Kafka topics). Multiple independent consumers receive the same event.
- Point-to-point (queue) — each message is delivered to exactly one consumer. Used for task distribution (e.g., SQS, RabbitMQ queues). Competing consumers load-balance across queue messages.
The pattern solves temporal coupling: with direct HTTP calls, the caller must wait for the receiver to be available. With a broker, the producer can publish even if all consumers are down — messages accumulate and are processed when consumers recover. It also provides rate smoothing: if a producer bursts messages faster than consumers can process, the broker absorbs the burst and consumers drain at their own pace.
The Request-Reply pattern enables synchronous-like request/response semantics over an asynchronous message channel. The requestor sends a message to a request channel, attaches a unique Correlation ID and a reply-to address (a dedicated reply channel or a temporary queue), and waits for a response. The responder processes the request, copies the Correlation ID into its reply, and publishes the response to the reply channel. The requestor matches incoming responses to pending requests using the Correlation ID.
// Requestor
String correlationId = UUID.randomUUID().toString();
Message request = MessageBuilder
.withBody(payload)
.setHeader("correlationId", correlationId)
.setHeader("replyTo", "order-reply-queue")
.build();
requestChannel.send(request);
CompletableFuture<Response> pending = new CompletableFuture<>();
pendingRequests.put(correlationId, pending); // registry of in-flight requests keyed by Correlation ID
// ... asynchronously wait for response on "order-reply-queue"
// Responder
Message request = requestChannel.receive();
String corrId = request.getHeaders().get("correlationId");
Response resp = processRequest(request.getBody());
Message reply = MessageBuilder.withBody(resp)
.setHeader("correlationId", corrId).build();
replyChannel.send(reply);
// Requestor matches incoming reply
String corrId = reply.getHeaders().get("correlationId");
pendingRequests.remove(corrId).complete(reply.getBody());
The Correlation ID is essential when multiple in-flight requests use the same reply channel: without it, the requestor cannot determine which response corresponds to which request. If two concurrent requests share the same Correlation ID, each requestor will receive the wrong reply or the ID collision will cause a missed response. IDs must therefore be globally unique (UUID v4 is standard) and the pending-request registry must be thread-safe.
The Idempotent Consumer pattern ensures that processing the same message more than once produces the same outcome as processing it exactly once. It is essential because virtually all message brokers (Kafka, RabbitMQ, SQS) guarantee at-least-once delivery — a message may be redelivered after a consumer crashes before acknowledging, after a network partition, or during broker rebalancing. Without idempotency, redelivery causes duplicate side effects: double charges, duplicate shipments, over-reserved inventory.
// Idempotent consumer using deduplication table
public void handleOrderPlaced(OrderPlacedEvent event) {
String msgId = event.getMessageId(); // unique per message
if (processedMessages.exists(msgId)) {
log.info("Duplicate message {}, skipping", msgId);
return; // idempotency guard: already processed
}
// process inside a transaction that also inserts the msgId
transactionTemplate.execute(status -> {
orderRepository.createFrom(event);
processedMessages.insert(msgId, Instant.now());
return null;
});
}
Implementation strategies:
- Deduplication table — persist message IDs (or idempotency keys) in a table. Before processing, check if the ID exists. Insert the ID and process in the same transaction so a crash between processing and acknowledging still results in a consistent state on retry.
- Natural idempotency — design operations that are inherently idempotent. UPDATE orders SET status='CONFIRMED' WHERE id=42 is idempotent; running it twice has no extra effect. But INSERT INTO charges (amount, orderId) VALUES (59.99, 42) is not — it creates duplicate rows.
- Conditional update — use an optimistic locking version or a state-machine check (WHERE status='PENDING') to ensure the operation only applies in the correct state, making reprocessing a no-op if the state has already advanced.
The deduplication store must be co-located or transactionally integrated with the main data store, otherwise the check-then-insert itself has a race condition.
Event-Driven Architecture (EDA) structures communication around events — immutable records of something that has happened. A producer emits an event to a broker and moves on without knowing or caring who consumes it. Consumers subscribe to events and react asynchronously and independently. No party waits for another.
In synchronous request/response, the caller blocks until the called service returns a result. This creates three forms of coupling:
- Temporal coupling — caller and callee must both be alive at the same instant.
- Behavioural coupling — the caller depends on the callee's response structure and error codes.
- Performance coupling — the caller's response time is bounded below by the callee's processing time.
EDA eliminates all three. The producer has no knowledge of its consumers; a new consumer can subscribe to existing events without any producer code change. The system can scale consumer instances independently, and a slow consumer does not delay the producer.
Trade-offs of EDA:
- Eventual consistency — consumers process events asynchronously; the system is not instantly consistent after an event is published.
- Harder debugging — end-to-end request flows are reconstructed by correlating events across distributed logs rather than reading a single call stack.
- Event schema evolution — changing an event's schema is a breaking change for all consumers; requires careful versioning (additive changes only, or explicit versioned event types).
- No immediate response — EDA is unsuitable for interactions that require a synchronous return value (e.g., a login that must return a JWT).
EDA and request/response often coexist in the same system: synchronous for user-facing read operations that need immediate results; asynchronous events for state change propagation across service boundaries.
These three responsibilities are often all assigned to an API Gateway, but they serve distinct purposes and are worth understanding separately.
| Responsibility | What it does | Example |
|---|---|---|
| Gateway Routing | Forwards an inbound request to a single downstream service based on URL path, host, or header | GET /orders/* → Order Service; GET /products/* → Product Service |
| Gateway Aggregation | Fans a single inbound request out to multiple downstream services, waits for all responses, and merges them into one reply | A dashboard request calls Order Service, Customer Service, and Loyalty Service in parallel and returns a single combined payload |
| Gateway Offloading | Handles cross-cutting concerns on behalf of all services, so each service does not need to implement them individually | SSL/TLS termination, JWT validation, rate limiting, request logging, response compression, CORS headers |
In practice all three often live in the same gateway process, but separating them conceptually helps when deciding how to split responsibilities between a general API gateway (routing + offloading) and a BFF (aggregation + client-specific transformation).
Gateway Offloading deserves special emphasis: it prevents copy-paste of security and infrastructure code across dozens of services. A service that relies on the gateway for SSL termination, rate limiting, and JWT validation contains zero infrastructure boilerplate — only domain logic. If the JWT validation algorithm changes, one gateway configuration update covers every service instantly.
The Circuit Breaker pattern prevents cascading failures by detecting when a downstream service is unavailable and fast-failing subsequent calls instead of letting them queue up and exhaust threads. It is named after the electrical circuit breaker that trips when current exceeds a safe threshold.
The pattern has three states:
- Closed (normal) — calls pass through to the downstream service. The breaker counts failures. When the failure rate (or failure count) exceeds a configured threshold within a time window, the breaker trips to Open.
- Open — all calls are immediately rejected with an error (or the Fallback is invoked) without contacting the downstream service. A timer starts. This gives the failing service time to recover without being bombarded with traffic.
- Half-Open — after the timer expires, the breaker allows a limited number of probe requests through. If they succeed, the breaker transitions back to Closed. If they fail, it returns to Open.
// Resilience4j Circuit Breaker (Java)
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
.failureRateThreshold(50) // open if >50% fail
.waitDurationInOpenState(Duration.ofSeconds(30))
.permittedNumberOfCallsInHalfOpenState(5)
.slidingWindowSize(10)
.build();
CircuitBreaker cb = CircuitBreakerRegistry.of(config)
.circuitBreaker("inventoryService");
Supplier<Stock> decorated = CircuitBreaker
.decorateSupplier(cb, () -> inventoryClient.getStock(sku));
Try<Stock> result = Try.ofSupplier(decorated)
.recover(CallNotPermittedException.class, e -> defaultStock());
The circuit breaker is most effective when combined with a Fallback (Q31) — when the circuit is Open, the fallback returns a cached or degraded response so the caller can still serve the request in a degraded mode rather than propagating a hard error to the user.
The Retry pattern automatically re-attempts a failed operation a limited number of times before declaring it a final failure. On its own, retrying immediately (fixed delay or no delay) can overwhelm a struggling downstream service. Exponential backoff solves this by increasing the delay between retries exponentially:
delay(attempt) = base * 2^attempt
// with base = 1 s: attempt 0 → 1 s, attempt 1 → 2 s, attempt 2 → 4 s, attempt 3 → 8 s ...
Adding jitter (random noise) to the backoff prevents the thundering herd problem — when many clients retry simultaneously after a shared outage, they all back off to the same windows and hammer the recovering service in synchronized bursts. With jitter, each client's retry time is randomised within the backoff window:
// Full jitter (recommended by AWS)
delay(attempt) = random(0, base * 2^attempt)
// Decorrelated jitter
delay(n) = min(cap, random(base, delay(n-1) * 3))
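A minimal Java sketch of a retry loop with exponential backoff and full jitter; the attempt cap, base delay, and the simplification that every caught exception is retryable are assumptions for the example:

// Hypothetical retry helper: exponential backoff with full jitter
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

class RetryWithJitter {
    static <T> T call(Supplier<T> operation, int maxAttempts, long baseMillis, long capMillis)
            throws InterruptedException {
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return operation.get();
            } catch (RuntimeException e) {
                last = e;                                                        // assumed retryable
                long ceiling = Math.min(capMillis, baseMillis * (1L << attempt));
                long sleepMs = ThreadLocalRandom.current().nextLong(ceiling + 1); // random(0, ceiling)
                Thread.sleep(sleepMs);
            }
        }
        throw last;   // final failure after the retry budget is exhausted
    }
}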
When NOT to retry:
- Non-idempotent operations — a POST /charges that creates a charge must not be retried without idempotency keys; retrying will create duplicate charges.
- 4xx client errors — HTTP 400 (Bad Request), 401 (Unauthorised), 403 (Forbidden), 404 (Not Found). These indicate problems with the request itself; retrying will produce the same failure.
- 429 Too Many Requests — only retry after respecting the Retry-After header; retrying aggressively makes the rate-limit situation worse.
- Circuit is Open — if the circuit breaker has already tripped, adding retries amplifies the load on an already-failing service.
- Deadlines exceeded — if the caller's overall timeout budget is already exhausted, retrying only prolongs the client's wait without any chance of success within budget.
The Timeout pattern sets an upper bound on how long a caller will wait for a response from a downstream service. Without timeouts, a slow or unresponsive service causes the calling service's request-handling threads to block indefinitely. When enough threads are blocked, the caller's thread pool is exhausted, and it can no longer serve any incoming requests — the failure cascades upstream.
There are two distinct timeout types to configure on every HTTP/gRPC client:
- Connection timeout — the maximum time allowed to establish the TCP connection (and TLS handshake) to the server. Without one, a connect attempt to an unreachable or overloaded host can hang for the OS default, often well over a minute. A connection timeout of 1–3 seconds is typical for internal services.
- Read (socket/response) timeout — the maximum time to wait for the server to send its response after the connection is established. This covers the time the server spends processing the request. Set this to slightly above the service's P99 latency under normal load.
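A minimal sketch using the JDK's built-in HTTP client to show where the two knobs live; the 2-second connect and 3-second response values are illustrative, not recommendations:

// Connection timeout vs. response timeout with java.net.http (Java 11+)
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class TimeoutExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))     // TCP connect (and TLS) must finish in 2 s
                .build();

        HttpRequest request = HttpRequest.newBuilder(
                        URI.create("http://inventory-service:8080/stock/sku-42"))
                .timeout(Duration.ofSeconds(3))            // overall wait for the response on this request
                .GET()
                .build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
    }
}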
For asynchronous operations, a deadline (a fixed wall-clock time by which the entire operation must complete) is preferable to a per-hop timeout, because per-hop timeouts can accumulate across a call chain: no single hop exceeds its own budget, yet the total still exceeds the end-user SLA.
Timeout values must be tuned carefully. A timeout that is too short causes unnecessary failures during legitimate traffic spikes; too long defeats the purpose by allowing thread exhaustion before the timeout fires. Use the service's P99 latency measurements as the baseline and add a safety margin (e.g., P99 + 50%).
The Timeout pattern works best in combination with the Circuit Breaker (Q25): once timeouts accumulate and the failure rate crosses the circuit breaker threshold, the circuit opens and stops further timeouts from occurring, protecting the thread pool proactively.
The Bulkhead pattern at the resource level isolates the thread pools and connection pools used to call different downstream dependencies, so that a slow or failed dependency cannot monopolise the shared pool and block calls to unrelated services.
Without Bulkhead: all outbound calls from Service A (to Inventory, Payment, and Notification) share one thread pool. If Inventory becomes slow and holds threads for 30 seconds each, it quickly fills the entire pool. Calls to Payment and Notification then queue up even though those services are healthy.
With Bulkhead: each downstream dependency gets its own bounded pool. Inventory gets 10 threads; Payment gets 10 threads; Notification gets 5 threads. A stalled Inventory pool only blocks Inventory calls.
Two isolation strategies:
- Thread pool isolation — each dependency's calls execute on a dedicated thread pool. The calling thread is released immediately; the dedicated pool thread handles the blocking call. Supports async timeouts because the pool thread can be interrupted. Higher overhead (context switching between pools).
- Semaphore isolation — each dependency is limited to N concurrent calls using a semaphore. No separate thread pool; the calling thread itself makes the blocking call, limited by the semaphore count. Lower overhead but no support for independent timeout interruption — a hung call holds the semaphore and the calling thread.
// Resilience4j Bulkhead (semaphore)
BulkheadConfig config = BulkheadConfig.custom()
.maxConcurrentCalls(10)
.maxWaitDuration(Duration.ofMillis(100))
.build();
Bulkhead bh = BulkheadRegistry.of(config).bulkhead("inventoryService");
Supplier<Stock> decorated = Bulkhead.decorateSupplier(bh,
() -> inventoryClient.getStock(sku));
The Health Check API pattern exposes an HTTP endpoint (typically /health, /actuator/health, or /healthz) that returns the current operational status of a service instance. Load balancers, orchestrators (Kubernetes), and service registries poll this endpoint to determine whether traffic should be routed to an instance and whether it should be restarted.
Two semantically distinct probe types are important (especially in Kubernetes):
- Liveness probe — answers "is this process still alive and not deadlocked?" If it fails, Kubernetes restarts the container. Should only check internal process health — not external dependencies.
- Readiness probe — answers "is this instance ready to serve traffic?" If it fails, the instance is temporarily removed from the load balancer pool (but not restarted). Should check whether all required dependencies (database, downstream services) are reachable.
// Spring Boot Actuator /actuator/health response
{
"status": "UP",
"components": {
"db": {
"status": "UP",
"details": { "database": "PostgreSQL", "result": 1 }
},
"redis": { "status": "UP" },
"diskSpace": {
"status": "UP",
"details": { "free": 10737418240, "threshold": 10485760 }
}
}
}
// Return HTTP 200 when UP; HTTP 503 when DOWN or DEGRADED
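A minimal sketch of a custom dependency check, assuming Spring Boot Actuator; the PaymentGatewayClient and its ping() method are illustrative:

// Hypothetical readiness-style dependency check contributed to /actuator/health
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component
class PaymentGatewayHealthIndicator implements HealthIndicator {
    private final PaymentGatewayClient client;   // assumed downstream client

    PaymentGatewayHealthIndicator(PaymentGatewayClient client) { this.client = client; }

    @Override
    public Health health() {
        try {
            long latencyMs = client.ping();      // must be cheap and fast
            return Health.up().withDetail("latencyMs", latencyMs).build();
        } catch (Exception e) {
            return Health.down(e).build();       // appears as "status": "DOWN" in the JSON above
        }
    }
}

interface PaymentGatewayClient { long ping() throws Exception; }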
Best practices for health endpoints:
- Return a structured JSON body, not just an HTTP status code, so operators can diagnose which dependency is unhealthy.
- Never put liveness and readiness logic in the same endpoint if they have different semantics.
- Keep health checks fast (under 1 second); slow health checks look like outages to the load balancer.
- Include a startup probe for slow-starting services to prevent Kubernetes from restarting them prematurely.
The Rate Limiting pattern caps the number of requests a client (identified by IP, API key, or user ID) can make within a time window. When the limit is exceeded, the server rejects excess requests with an HTTP 429 (Too Many Requests) and optionally includes a Retry-After header. It protects services from accidental or malicious overload, enforces fair-use quotas, and prevents a single client from exhausting shared resources.
Five commonly used algorithms:
- Fixed Window Counter — count requests in fixed time windows (e.g., 0–60 s, 60–120 s). Simple and cheap. Weakness: a burst can occur at the boundary — up to 2× the limit in a single window transition.
- Sliding Window Log — store the exact timestamp of each request. Count requests in the rolling window ending at "now". Precise but memory-intensive (O(N) per client).
- Sliding Window Counter — approximate the sliding window by blending the current and previous fixed-window counts using elapsed time fraction. Good accuracy at low memory cost.
- Token Bucket — a bucket fills with tokens at a fixed rate (e.g., 10 tokens/second, bucket size 100). Each request consumes one token. If the bucket is empty, reject. Allows controlled bursting up to the bucket size.
- Leaky Bucket — requests fill a queue (the "bucket"). The bucket drains at a fixed constant rate. Smooths bursty input to a steady output. Excess requests that overflow the bucket are rejected.
# Redis-based Token Bucket (pseudocode)
state  = redis.hgetall("rate:" + clientId)           # { tokens, lastRefillMs } or empty
now    = currentTimeMillis()
tokens = bucketCapacity if state is empty
         else min(bucketCapacity, state.tokens + (now - state.lastRefillMs) / 1000 * refillRatePerSecond)
if tokens < 1:
    return HTTP 429                                   # bucket empty — reject
redis.hset("rate:" + clientId, tokens = tokens - 1, lastRefillMs = now)
redis.expire("rate:" + clientId, windowSeconds)
# proceed with request
Rate limits are commonly enforced at the API Gateway using Redis (for distributed state across multiple gateway replicas) with the Token Bucket or Sliding Window Counter algorithm.
The Fallback pattern provides an alternative response path when a downstream call fails — whether due to a timeout, an exception, or a Circuit Breaker (Q25) in the Open state. Instead of propagating a hard error to the caller (and potentially all the way to the user), the fallback returns a degraded but functional result that lets the system continue operating at reduced capability.
Common fallback strategies:
- Cached response — return the last successfully retrieved value from a local or distributed cache. Works well for product catalogs, user preferences, and feature flags where slightly stale data is acceptable.
- Default/stub value — return a sensible default. An unavailable recommendation engine falls back to a static "top-10 bestsellers" list.
- Degraded feature — disable the feature entirely and return a response that omits the failed component. A loyalty-points display is hidden rather than blocking the checkout page.
- Alternate service — route to a secondary service (e.g., a read replica, a lower-SLA provider, or a local fallback implementation).
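A minimal plain-Java sketch of the cached-response strategy (Product, ProductClient, and the placeholder default are illustrative; libraries such as Resilience4j can attach the same behaviour declaratively):

// Hypothetical cached-response fallback around a downstream call
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

record Product(String sku, String name) {
    static Product placeholder(String sku) { return new Product(sku, "temporarily unavailable"); }
}

interface ProductClient { Product fetch(String sku); }

class ProductCatalogWithFallback {
    private final ProductClient client;
    private final Map<String, Product> lastKnownGood = new ConcurrentHashMap<>();

    ProductCatalogWithFallback(ProductClient client) { this.client = client; }

    Product getProduct(String sku) {
        try {
            Product fresh = client.fetch(sku);       // normal path
            lastKnownGood.put(sku, fresh);           // refresh the fallback cache
            return fresh;
        } catch (Exception downstreamFailure) {
            Product cached = lastKnownGood.get(sku); // degraded path: slightly stale data
            return cached != null ? cached : Product.placeholder(sku);  // last resort: a stub value
        }
    }
}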
Relationship to Circuit Breaker: The Circuit Breaker detects failure and short-circuits calls. The Fallback defines what to do when the circuit is open (or any call fails). They are complementary: the circuit breaker decides that the call should not be attempted; the fallback provides the alternative response. Resilience4j, Hystrix, and similar libraries allow both to be configured on the same decorated method.
A fallback should never do expensive work — it must return quickly. If the fallback itself can fail, it needs its own timeout and should itself degrade gracefully. A fallback that calls yet another slow service is an anti-pattern.
Both Throttling and Rate Limiting control the flow of requests to protect a service from overload, but they differ in what they do to excess traffic.
| Aspect | Rate Limiting | Throttling |
|---|---|---|
| What happens to excess requests | Rejected immediately — HTTP 429 returned | Slowed down, queued, or delayed — response takes longer |
| Client experience | Hard error; client must back off and retry | Slower response; client waits longer but eventually receives a response |
| Use case | Enforcing hard quotas per client (API monetisation, abuse prevention) | Graceful degradation under peak load; ensuring critical traffic is served first |
| Implementation | Counter/token check before processing | Priority queue, token bucket drain with delay, or thread pool queue with bounded size |
In practice, throttling often applies to internal flows — for example, a batch processing service that reads from a database throttles its own read rate to avoid saturating the DB connection pool. Rate limiting is more commonly applied at the external boundary (API gateway) to control client behaviour.
A service can apply both simultaneously: rate limit external clients to prevent abuse (hard cap), while internally throttling its own outbound calls to downstream services (graceful slowdown) to stay within those services' capacity. Throttling at the outbound call level also prevents the Retry storm anti-pattern, where many retries overload a recovering downstream service.
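A minimal sketch of that internal, outbound-side throttling using Guava's RateLimiter (the 50-reads-per-second figure and the repository interface are illustrative); excess work is delayed rather than rejected, which is exactly the distinction in the table above:

// Hypothetical self-throttled batch reader: excess calls are slowed, not rejected
import com.google.common.util.concurrent.RateLimiter;
import java.util.List;

class ThrottledBatchReader {
    // Cap this service's own read pressure on the shared database at ~50 queries/second
    private final RateLimiter limiter = RateLimiter.create(50.0);

    void process(List<String> orderIds, OrderRepository repository) {
        for (String id : orderIds) {
            limiter.acquire();                 // blocks (delays) until a permit is available
            repository.loadAndProcess(id);
        }
    }
}

interface OrderRepository { void loadAndProcess(String id); }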
The Log Aggregation pattern collects log output from every service instance and ships it to a centralised store where it can be searched, correlated, and analysed in one place. Without aggregation, diagnosing an incident across 50 service instances means SSHing into individual machines — impractical at scale.
A typical pipeline (EFK/ELK stack):
- Emit structured logs — each service writes JSON-formatted log events to stdout (preferred in containers) or a log file. Structured logs include fields like timestamp, level, service, traceId, message.
- Log shipper — Fluentd or Filebeat runs as a DaemonSet (one per node in Kubernetes) and tails container log files, applying parsing, filtering, and enrichment rules before forwarding.
- Aggregator/processor — Logstash or Fluentd aggregator buffers, transforms (grok patterns, field extraction), and routes events.
- Storage and search — Elasticsearch (or OpenSearch) indexes log events for full-text and structured queries.
- Visualisation — Kibana (or OpenSearch Dashboards) provides dashboards, search, and alerting.
# Structured JSON log line emitted by a service
{
"timestamp": "2026-04-22T09:01:23.456Z",
"level": "ERROR",
"service": "order-service",
"traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
"spanId": "00f067aa0ba902b7",
"message": "Payment gateway timeout for order 42",
"orderId": 42,
"durationMs": 5001
}
The Correlation ID / Trace ID field is essential: it ties together all log lines from a single end-to-end request across every service that handled it, even when each service writes its logs to a different local file. A single Kibana query on traceId=4bf92f3... shows the entire request journey in chronological order.
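A minimal sketch of how the traceId typically reaches every log line, using SLF4J's MDC (assumes a JSON log encoder such as logstash-logback-encoder that copies MDC fields into the output; class and method names are illustrative):
// Attach the trace ID to the logging context so every line from this request carries it
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

class PaymentLogging {
    private static final Logger log = LoggerFactory.getLogger("order-service");

    static void logTimeout(String traceId, long orderId) {
        MDC.put("traceId", traceId); // normally set once per request by a tracing/servlet filter
        try {
            log.error("Payment gateway timeout for order {}", orderId); // traceId joins the JSON output
        } finally {
            MDC.remove("traceId");   // avoid leaking the ID into unrelated requests on this thread
        }
    }
}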
The Application Metrics pattern instruments each service to emit numeric measurements — counters, gauges, histograms, and summaries — that describe its runtime behaviour. These metrics feed dashboards, alerting rules, and capacity-planning models that plain logs cannot efficiently support (logs are for discrete events; metrics are for continuous numerical trends).
Common metric types:
- Counter — monotonically increasing (e.g., total HTTP requests served, total errors). Never decremented except on process restart.
- Gauge — a value that goes up and down (e.g., current active connections, JVM heap used, queue depth).
- Histogram — distributes observations into configurable buckets (e.g., request latency distribution, enabling P50/P95/P99 calculations).
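As an illustration of the histogram type, a minimal sketch using a Micrometer Timer, which records a latency distribution from which percentiles can be published (metric name and percentile choices are illustrative):
// Histogram-backed timer exposing P50/P95/P99 latency
import io.micrometer.core.instrument.Metrics;
import io.micrometer.core.instrument.Timer;

class RequestTiming {
    private static final Timer requestTimer = Timer.builder("http_server_requests")
            .publishPercentiles(0.5, 0.95, 0.99)   // publish P50/P95/P99 from the distribution
            .register(Metrics.globalRegistry);

    static void handle(Runnable handler) {
        requestTimer.record(handler);              // times the wrapped work and records one observation
    }
}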
Pull model (Prometheus): The Prometheus server periodically scrapes a /metrics HTTP endpoint on each service instance. The service maintains in-memory metric state; Prometheus pulls it on its own schedule.
// Micrometer / Prometheus metric registration (Java)
Counter httpRequests = Counter.builder("http_requests_total")
.tag("method", "GET").tag("status", "200")
.register(Metrics.globalRegistry);
httpRequests.increment();
// Prometheus scrapes GET /actuator/prometheus every 15s
Push model (StatsD, Prometheus Pushgateway): The service actively sends metric updates to a collection agent or gateway. Used for short-lived workloads (batch jobs, serverless functions) that do not run long enough to be scraped.
| Aspect | Pull (Prometheus) | Push (StatsD / Pushgateway) |
|---|---|---|
| Discovery | Prometheus discovers targets via service discovery | Service knows the collector address |
| Short-lived jobs | Poor fit — job may finish before being scraped | Good fit — pushes before exit |
| Load on service | Scrape adds a momentary HTTP request | Service bears cost of every metric push |
Audit Logging records a tamper-evident, chronological trail of who performed what action on which resource and when. It is distinct from application or debug logging: application logs record technical events (exceptions, slow queries, service calls) for operational troubleshooting; audit logs record business-level events for compliance, forensics, and accountability — and must be retained even when the original data is deleted.
Events that should always be captured:
- Authentication events — successful logins, failed login attempts, logouts, and token refresh operations. Essential for detecting credential-stuffing attacks and session anomalies.
- Authorisation decisions — both grants and denials. A denied access attempt to a sensitive endpoint may indicate a privilege-escalation attempt.
- Privileged record reads — when a user or service reads personally identifiable information (PII), financial records, or health data. Commonly required under regulations such as GDPR, HIPAA, and PCI-DSS.
- Create, Update, Delete on critical entities — changes to user accounts, payment methods, configuration, permissions, and order state.
- Administrative actions — role assignments, system configuration changes, secret rotation, and feature flag toggles.
Key properties of a well-designed audit log:
- Immutability — audit records must not be deletable or modifiable after writing. Use append-only stores (AWS CloudTrail, Kafka with infinite retention, a WORM-locked S3 bucket).
- Attribution — every entry must record the identity of the actor (user ID, service principal, IP address) and the target resource.
- Tamper detection — hash chaining or cryptographic signing of records allows detection of modifications.
- Separation from application logs — audit logs should flow through a separate pipeline with stricter retention policies and access controls.
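A minimal sketch of a hash-chained audit record illustrating attribution and tamper detection (field names and the chaining scheme are illustrative, not any specific product's format):
// Each record's hash covers its own fields plus the previous record's hash, so
// modifying any historical entry breaks every hash that follows it.
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.time.Instant;
import java.util.HexFormat;

record AuditRecord(String actor, String action, String resource,
                   Instant timestamp, String previousHash, String recordHash) {

    static AuditRecord create(String actor, String action, String resource,
                              String previousHash) throws Exception {
        Instant now = Instant.now();
        String payload = actor + "|" + action + "|" + resource + "|" + now + "|" + previousHash;
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(payload.getBytes(StandardCharsets.UTF_8));
        return new AuditRecord(actor, action, resource, now,
                previousHash, HexFormat.of().formatHex(digest));
    }
}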
Distributed Tracing reconstructs the end-to-end path of a single request as it flows across multiple microservices, providing a flamegraph of timings that reveals where latency accumulates and where failures occur. Without it, correlating logs from 10 services for a single slow request requires manual, error-prone cross-referencing.
Key concepts:
- Trace — the complete journey of a request, identified by a globally unique traceId. All spans belonging to the same originating request share this ID.
- Span — a named, timed unit of work within a single service (e.g., "HTTP GET /orders/42", "DB SELECT orders", "Kafka publish OrderPlaced"). A span records start time, duration, service name, tags (metadata), and the parentSpanId linking it to its caller.
- Context propagation — the tracing library injects the trace context into outbound call headers; the receiving service extracts it and creates a child span with the correct traceId and parentSpanId.
# W3C Trace Context header (standard across OpenTelemetry, Jaeger, Zipkin)
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
# format: version - traceId (128-bit hex) - spanId (64-bit) - flags
# Outgoing HTTP request from Service A to Service B includes this header.
# Service B extracts traceId + parentSpanId, creates its own child span.
# All spans are sent asynchronously to Jaeger / Zipkin / AWS X-Ray.
Two propagation standards are in common use:
- W3C Trace Context (traceparent + tracestate headers) — the W3C standard, supported natively by OpenTelemetry.
- Zipkin B3 headers (X-B3-TraceId, X-B3-SpanId, X-B3-ParentSpanId, X-B3-Sampled) — older but still widely used by Istio, Zipkin, and some Jaeger deployments.
For async messaging, the trace context is injected into message headers (Kafka record headers, AMQP headers) so traces span across broker boundaries.
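A minimal sketch of creating a span with the OpenTelemetry Java API (assumes an already-configured SDK exporting to a backend; span and attribute names are illustrative):
// Create a child span for a unit of work; context is propagated to anything called inside the scope
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

class OrderQuery {
    private static final Tracer tracer = GlobalOpenTelemetry.getTracer("order-service");

    static void loadOrder(long orderId) {
        Span span = tracer.spanBuilder("DB SELECT orders").startSpan();
        try (Scope ignored = span.makeCurrent()) {   // spans created in here become children of this one
            span.setAttribute("orderId", orderId);
            // ... run the query ...
        } finally {
            span.end();                              // records duration; the SDK exports it asynchronously
        }
    }
}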
The Access Token pattern uses short-lived cryptographically signed tokens — most commonly JWTs issued via OAuth 2.0 / OpenID Connect — to authenticate client requests to microservices. The client authenticates once with an Authorization Server (Keycloak, Okta, Cognito) and receives a JWT. Subsequent requests carry this token; any service that holds the corresponding public key can validate it locally without calling the Auth Server on every request.
JWT structure — three Base64URL-encoded segments separated by dots (header.payload.signature):
// Header (algorithm and token type)
{ "alg": "RS256", "typ": "JWT" }
// Payload (claims)
{
"sub": "user-42",
"iss": "https://auth.example.com",
"aud": "order-service",
"exp": 1714003200, // expiry Unix timestamp
"scope": "orders:read orders:write"
}
// Signature: RS256(base64(header) + "." + base64(payload), privateKey)
// HTTP request
Authorization: Bearer eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9...
Validation at the receiving service (API Gateway or service itself):
- Decode the header to get the signing algorithm and key ID (kid).
- Fetch (or cache) the public key from the Auth Server's JWKS endpoint.
- Verify the signature using the public key.
- Check exp (not expired), iss (trusted issuer), and aud (this service is the intended audience).
- Extract sub and scope claims to enforce authorisation.
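A minimal sketch of these checks using the jjwt library (0.11.x API; the issuer, audience, and key handling are illustrative — in practice the public key is fetched and cached from the JWKS endpoint by kid):
// Signature, expiry, issuer, and audience are all verified in one parse call
import io.jsonwebtoken.Claims;
import io.jsonwebtoken.Jws;
import io.jsonwebtoken.Jwts;
import java.security.PublicKey;

class TokenValidator {
    // Throws a JwtException if the signature is invalid, the token is expired,
    // or the issuer/audience claims do not match the expected values.
    static Claims validate(String token, PublicKey authServerKey) {
        Jws<Claims> jws = Jwts.parserBuilder()
                .setSigningKey(authServerKey)             // public key obtained from the JWKS endpoint
                .requireIssuer("https://auth.example.com")
                .requireAudience("order-service")
                .build()
                .parseClaimsJws(token);
        return jws.getBody();                             // contains sub, scope, exp, ...
    }
}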
Short token lifetimes (5–15 minutes) limit the blast radius if a token is stolen. Refresh tokens (longer-lived, stored securely by the client) are exchanged for new access tokens when the old one expires, without requiring re-authentication.
Mutual TLS (mTLS) extends standard one-way TLS by requiring both sides of a connection to present and verify X.509 certificates. In a microservices context it provides two things simultaneously: an encrypted channel (confidentiality and integrity) and verified service identity (authentication) — without any application-level token or API key. The services prove who they are via their certificates, issued by a trusted internal Certificate Authority.
Standard TLS vs mTLS:
- Standard (one-way) TLS — only the server presents a certificate. The client verifies the server's identity but the server does not verify the client. Used for browser-to-server HTTPS.
- mTLS — both client and server present certificates. Each side verifies the other's certificate against a shared CA. This proves the client is a legitimate service instance, not just any caller that can reach the network.
# Istio PeerAuthentication — enforce STRICT mTLS in a namespace
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: default
namespace: production
spec:
mtls:
mode: STRICT # reject any plaintext or one-way TLS traffic
---
# Istio automatically rotates certificates via its Citadel CA.
# Envoy sidecars handle the TLS handshake transparently.
# Application code sees plain HTTP internally — zero code changes.
In a service mesh (Istio, Linkerd), mTLS is fully transparent to application code: the sidecar proxy handles the TLS handshake using certificates provisioned by the control-plane CA. Certificates are short-lived (e.g., 24 hours) and rotated automatically, eliminating the risk of a compromised long-lived credential.
mTLS is the recommended pattern for east-west (service-to-service) authentication within a cluster. It replaces shared API keys and static secrets with cryptographic identity tied to a specific service workload.
The Secrets Management pattern centralises the storage, access control, rotation, and auditing of sensitive credentials — database passwords, API keys, TLS certificates, encryption keys — in a dedicated secrets store rather than hardcoding them in environment variables, config files, or source code. The goal is to ensure that a compromised container image, log file, or configuration repository cannot expose production credentials.
HashiCorp Vault provides several key capabilities:
- Dynamic secrets — instead of storing a long-lived database password, Vault generates a unique, short-TTL (e.g., 1-hour) database credential on demand for each service instance. When the lease expires, Vault revokes it automatically. If credentials leak, they expire quickly — limiting the blast radius.
- Encryption as a Service — services can ask Vault to encrypt/decrypt data without ever holding the encryption key themselves.
- Leasing and renewal — every secret is issued with a lease. Services renew leases before expiry; Vault revokes them if renewal stops (e.g., after a service crash).
- Audit log — every secret access is logged with the requesting entity, timestamp, and secret path.
AWS Secrets Manager provides:
- Automatic rotation of RDS database credentials on a configurable schedule (Lambda-powered).
- IAM-based access control — only services with the correct IAM role can retrieve a secret.
- Cross-account and cross-region replication for disaster recovery.
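A minimal sketch of retrieving a secret at runtime with the AWS SDK for Java v2 (the secret name is illustrative):
// Fetch a credential from AWS Secrets Manager instead of baking it into config or images
import software.amazon.awssdk.services.secretsmanager.SecretsManagerClient;
import software.amazon.awssdk.services.secretsmanager.model.GetSecretValueRequest;

class DbCredentials {
    static String fetchPassword() {
        try (SecretsManagerClient client = SecretsManagerClient.create()) {   // uses the pod's IAM role
            return client.getSecretValue(GetSecretValueRequest.builder()
                            .secretId("prod/order-service/db-password")       // hypothetical secret name
                            .build())
                    .secretString();                                          // never log or persist this
        }
    }
}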
In Kubernetes, secrets are typically injected at pod creation via a Vault Agent sidecar or the Vault Secrets Operator (CSI provider), making the secret available as an in-memory file or environment variable at runtime — never baked into the container image or stored in etcd in plaintext.
The Sidecar pattern deploys a helper container alongside the main application container in the same pod (Kubernetes) or VM instance. The sidecar shares the same network namespace, localhost address space, and optionally a shared volume with the main container. It handles cross-cutting concerns so the main application stays free of infrastructure boilerplate.
# Kubernetes Pod with a Fluentd log-shipper sidecar
apiVersion: v1
kind: Pod
metadata:
name: order-service
spec:
containers:
- name: order-service # main application
image: myregistry/order-service:2.1
volumeMounts:
- name: logs
mountPath: /var/log/app
- name: log-shipper # sidecar
image: fluent/fluentd:v1.16
volumeMounts:
- name: logs
mountPath: /var/log/app # reads same log directory
env:
- name: FLUENTD_CONF
value: fluent.conf
volumes:
- name: logs
emptyDir: {}
Common sidecar responsibilities:
- Log shipping — tail application log files and forward to Elasticsearch or a log aggregation pipeline (as in the example above).
- Metrics collection — scrape or poll the application's metrics and expose them in Prometheus format, or push to StatsD.
- Service proxy — Envoy/Linkerd-proxy sidecars intercept all inbound and outbound traffic, handling mTLS, retries, circuit breaking, and tracing without code changes in the main app. (This is the Service Mesh data plane.)
- Configuration reload — watch a ConfigMap or Vault path and write updated configuration to a shared volume that the main app reads without restarting.
- Secret rotation — fetch short-lived secrets from Vault and refresh them in a shared in-memory file before they expire.
The key architectural property: the main application is unaware of its sidecar. It reads log files or environment variables as normal; it makes outbound HTTP calls normally. The sidecar intercepts or supplements transparently. This allows infrastructure capabilities to be upgraded or replaced independently of the application.
The Ambassador pattern is a specialisation of the Sidecar pattern focused on outbound (egress) connections. The ambassador container acts as a local proxy for all traffic the main container sends to external services. Instead of the main application connecting directly to downstream services (with all the attendant concerns of retry logic, circuit breaking, timeouts, and connection pooling), it connects to localhost:<port> on the ambassador, which handles all of that transparently.
# Pod with an Envoy Ambassador for outbound calls
apiVersion: v1
kind: Pod
spec:
containers:
- name: order-service
image: myregistry/order-service:2.1
env:
- name: INVENTORY_URL
value: http://localhost:9901/inventory # ambassador port
- name: envoy-ambassador
image: envoyproxy/envoy:v1.29
args: ["-c", "/etc/envoy/envoy.yaml"]
volumeMounts:
- name: envoy-config
mountPath: /etc/envoy
# envoy.yaml configures upstream cluster, retries, circuit breaker, TLS
Ambassador responsibilities:
- Retry and circuit breaking — the ambassador retries transient failures with exponential backoff; opens the circuit when the upstream is unhealthy.
- Connection pooling — maintains a warm pool of HTTP/2 or gRPC connections to upstream services, avoiding per-request TCP handshake overhead.
- Protocol translation — a legacy service that only speaks HTTP/1.1 can be transparently proxied to an upstream that expects HTTP/2 or gRPC.
- mTLS to upstream — the ambassador terminates plaintext from the main app and re-establishes mTLS to the upstream, so the main app does not need TLS libraries.
- Telemetry — emits distributed trace spans and latency metrics for every outbound call.
The main application's code is simplified to a plain HTTP call to localhost. All network resilience logic lives in the ambassador configuration and can be changed without redeploying the application.
The Adapter pattern (container / structural variant) places a sidecar container alongside the main container to normalise the main container's output into a standard interface that the surrounding infrastructure expects — without modifying the main application. It is essentially a structural translator that makes a non-conforming service look conforming to monitoring, logging, or management infrastructure.
Concrete examples:
- Legacy log format to structured JSON — a legacy Java service writes logs in a custom text format. An Adapter sidecar reads the log file, parses the custom format, and re-emits structured JSON to stdout so the standard Fluentd pipeline can process it identically to every other service.
- Non-standard metrics to Prometheus format — a third-party binary exposes metrics on a proprietary UDP endpoint. An Adapter sidecar reads those metrics and exposes them at /metrics in Prometheus exposition format, making the service scrapeable by Prometheus without any changes to the binary.
- Legacy health endpoint normalisation — a vendor application returns health status in a non-standard format. The Adapter translates it to the standard { "status": "UP" } JSON that Kubernetes probes expect (a minimal sketch follows below).
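A minimal sketch of that health-endpoint adapter, using only the JDK's built-in HTTP server and client (ports, paths, and the legacy response format are illustrative):
// Adapter sidecar: probes hit :8081/healthz; the legacy app's own status endpoint is left untouched
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class HealthAdapter {
    public static void main(String[] args) throws Exception {
        HttpClient legacyClient = HttpClient.newHttpClient();
        HttpServer server = HttpServer.create(new InetSocketAddress(8081), 0);
        server.createContext("/healthz", exchange -> {
            // Ask the legacy app (same pod, so localhost) and translate its answer.
            String legacy = legacyClient.sendAsync(
                            HttpRequest.newBuilder(URI.create("http://localhost:8080/legacy-status")).build(),
                            HttpResponse.BodyHandlers.ofString())
                    .join().body();
            byte[] body = (legacy.contains("OK") ? "{\"status\":\"UP\"}" : "{\"status\":\"DOWN\"}")
                    .getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
    }
}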
Adapter vs Ambassador: Both are sidecar specialisations. The Ambassador manages outbound connectivity (egress traffic, retries, circuit breaking). The Adapter manages output format normalisation (translating the service's emitted data — logs, metrics, health — into a standard interface). An Ambassador speaks on behalf of the app to the outside world; an Adapter speaks on behalf of the app to the infrastructure tooling.
The Canary Deployment pattern releases a new version of a service to a small percentage of production traffic first, monitors it closely for errors, latency regressions, and business metric anomalies, then gradually increases its traffic share until it serves 100% — at which point the old version is decommissioned. The name comes from the "canary in a coal mine" practice of using a small probe to detect danger before full exposure.
Blue-Green Deployment maintains two complete, identical production environments — Blue (current live) and Green (new version). Traffic is switched from Blue to Green all at once (or very rapidly). If Green has a problem, rollback is instant: switch traffic back to Blue.
| Aspect | Canary | Blue-Green |
|---|---|---|
| Traffic shift | Gradual (1% → 10% → 50% → 100%) | All-at-once switch |
| User exposure to new version | Small initial subset of users | All users at switch time |
| Infrastructure cost | Only canary instances needed alongside full production | Two full production environments run simultaneously |
| Rollback speed | Seconds (redirect traffic back) | Seconds (flip the switch) |
| Best for | Validating new features on real traffic before full exposure | Pre-validated releases where instant cutover and instant rollback are required |
# Istio VirtualService — 5% canary traffic split
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata: { name: order-service }
spec:
hosts: [order-service]
http:
- route:
- destination: { host: order-service, subset: v1 }
weight: 95
- destination: { host: order-service, subset: v2-canary }
weight: 5
Canary deployments require automated monitoring with clear rollback triggers: if canary error rate or P99 latency exceeds a threshold within the observation window, traffic is automatically redirected back to the stable version.
The Service Registry is a database of network locations (host + port) for all running service instances, kept up-to-date by registration (at startup) and deregistration (at shutdown or failure). Service Discovery is the mechanism by which a service caller looks up the current location of a dependency at runtime, replacing hardcoded hostnames with dynamic lookups.
Two styles of service discovery:
Client-side discovery — the calling service queries the registry directly, receives a list of healthy instances, and performs its own load balancing (round-robin, random, least-connections).
// Spring Cloud / Netflix Eureka client-side discovery
@Bean
@LoadBalanced // Ribbon / Spring Cloud LoadBalancer resolves logical names via Eureka
RestTemplate restTemplate() { return new RestTemplate(); }

// Call by logical service name — resolved to a healthy instance's real IP:port
String result = restTemplate.getForObject(
    "http://inventory-service/api/stock/SKU-99", String.class);
Server-side discovery — the caller sends the request to a load balancer (AWS ALB, HAProxy, Nginx, Kubernetes Service). The load balancer queries the registry and forwards to a healthy instance. The caller needs no registry SDK.
| Aspect | Client-side | Server-side |
|---|---|---|
| Who queries the registry | The calling service (via SDK) | The load balancer |
| Client SDK dependency | Required (Eureka client, Ribbon) | Not required — any HTTP client works |
| Language support | Needs SDK for each language | Works for any language / protocol |
| Examples | Netflix Eureka + Ribbon, Consul + Fabio | Kubernetes Service + kube-proxy, AWS ALB + ECS |
Kubernetes uses server-side discovery natively: a Service resource provides a stable DNS name and VIP; kube-proxy (in iptables or IPVS mode) routes traffic to healthy pods via Endpoints, which are kept current by the Endpoints controller watching pod readiness probes.
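From the caller's point of view, server-side discovery in Kubernetes is just a call to the Service's stable DNS name — a minimal sketch (service name, namespace, and path are illustrative):
// No registry SDK needed: kube-proxy picks a healthy pod behind the Service VIP
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class InventoryLookup {
    private static final HttpClient http = HttpClient.newHttpClient();

    static String stockFor(String sku) {
        return http.sendAsync(
                        HttpRequest.newBuilder(URI.create(
                                "http://inventory-service.production.svc.cluster.local/api/stock/" + sku)).build(),
                        HttpResponse.BodyHandlers.ofString())
                .join()
                .body();
    }
}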
These two patterns describe how service instances get their network location recorded in (and removed from) the Service Registry — not how callers use it.
Self Registration — the service instance itself registers with the registry on startup and deregisters on orderly shutdown. It is also responsible for sending heartbeats so the registry can detect failed instances and remove stale entries.
- Example: A Spring Boot service with the Netflix Eureka client calls eurekaClient.register() at startup and eurekaClient.deregister() in a shutdown hook.
- Drawback: every service must import and configure the registry client library. The service is now coupled to the specific registry technology. If the registry changes, all services need updating.
Third-Party Registration — an external Registrar component monitors service instances (via the platform's event stream) and registers/deregisters them on their behalf. The service itself has zero registry awareness.
- Example (Kubernetes): The Endpoints controller watches pod events. When a pod passes its readiness probe, the controller adds its IP to the Endpoints resource for the corresponding Service. When the pod fails its probe or is deleted, the controller removes it. The pod never calls the registry directly.
- Example (Consul + Docker): A Registrator daemon on each host listens for Docker start/stop events and updates Consul accordingly.
| Aspect | Self Registration | Third-Party Registration |
|---|---|---|
| Coupling | Service coupled to registry client library | Service has no registry dependency |
| Complexity | Simpler — no extra component | Requires a Registrar / controller process |
| Language support | Needs SDK per language | Language-agnostic |
| Used by | Netflix Eureka, Consul client mode | Kubernetes, Consul + Registrator |
