
Monitoring and Observability Interview Questions

1. What is the difference between monitoring and observability?
2. What are the three pillars of observability?
3. What is a Service Level Indicator (SLI) and how does it differ from an SLO and SLA?
4. What is an error budget and how is it used in SRE?
5. What is distributed tracing and how does it work?
6. What is OpenTelemetry and why has it become the industry standard?
7. What is the RED method for monitoring microservices?
8. What are the Four Golden Signals defined by Google SRE?
9. What is Prometheus and how does its pull-based scraping model work?
10. What is Grafana and how does it integrate with Prometheus?
11. What is structured logging and why is it preferred over plain-text logs?
12. What is log aggregation and what tools are commonly used for it?
13. What is alerting fatigue and how can you reduce it?
14. What is the USE method and when should you apply it?
15. What is cardinality in metrics and why does high cardinality cause problems?
16. What is tail-based sampling in distributed tracing and when should you use it?
17. What is a health check endpoint and what should it return?
18. What is synthetic monitoring and how does it differ from real user monitoring (RUM)?
19. What are Core Web Vitals and why do they matter for observability?
20. What is application performance monitoring (APM) and how does it differ from infrastructure monitoring?
21. What is eBPF and how is it revolutionizing observability?
22. What is Jaeger and how does it work as a distributed tracing backend?
23. What is MTTR and MTTD and why do they matter to SRE teams?
24. What is anomaly detection in observability and what are its limitations?
25. What is a runbook and how should it be linked to monitoring alerts?
26. What is a service mesh and how does it enhance observability?
27. What is a postmortem and what makes one blameless?
28. What is the difference between blackbox monitoring and whitebox monitoring?
29. What is Kubernetes monitoring and what are the key components to observe?
30. What is a metric histogram and why is it used for latency measurement?
31. What is chaos engineering and how does it relate to observability?
32. What is log sampling and when should you apply it?
33. What is the difference between push-based and pull-based metrics collection?
34. What is distributed systems observability and what challenges does it introduce compared to monolith observability?
35. What is Datadog and what differentiates it from open-source observability stacks?
36. What is on-call rotation and what makes an on-call experience sustainable?
37. What is continuous profiling and how does it differ from traditional profiling?
38. What is a flame graph and how do you read it?
39. What is the role of an observability platform in incident response?
40. What is OpenMetrics and how does it relate to Prometheus exposition format?
41. What is a dead man's switch alert and when should you use it?
42. What is Thanos and how does it extend Prometheus for large-scale deployments?
43. How does observability apply to event-driven and asynchronous architectures?
44. What is the difference between an alert and a notification in observability?
45. What is observability-driven development (ODD) and how does it shift monitoring left?

1. What is the difference between monitoring and observability?

Monitoring and observability are related but distinct concepts. Monitoring is the practice of collecting predefined metrics, logs, and alerts to track whether a system is behaving as expected. You decide upfront what to watch — CPU usage, request rate, error count — and dashboards or alerts fire when thresholds are breached. It answers the question: Is something wrong?

Observability goes further. A system is observable if you can understand its internal state purely from its external outputs — metrics, logs, and traces — without deploying new instrumentation every time a new failure mode appears. It answers: Why is something wrong? The term originates from control theory: a system is observable if its internal states can be inferred from its inputs and outputs.

In practice, monitoring is a subset of observability. You can have monitoring without observability (dashboards that tell you something is broken but not why), but you cannot have genuine observability without a solid monitoring foundation. High-cardinality telemetry, distributed tracing, and structured logging are the tools that push a system from merely monitored to truly observable.

Monitoring vs Observability
Aspect           | Monitoring                     | Observability
Core question    | Is something broken?           | Why is it broken?
Setup            | Predefined metrics and alerts  | Rich, queryable telemetry
Cardinality      | Low — fixed dimensions         | High — arbitrary dimensions
Unknown failures | Hard to detect                 | Explorable after the fact
What is the primary question that observability answers that monitoring alone cannot?
Observability is a concept borrowed from which field?
2. What are the three pillars of observability?

The three pillars of observability are metrics, logs, and traces. Together they give operators three different lenses through which to understand system behavior.

Metrics are numeric time-series data — counters, gauges, and histograms. They are cheap to store and query at scale, making them ideal for dashboards and alerting. Tools like Prometheus scrape and store metrics; Grafana visualizes them. Metrics excel at answering questions like "What is the 99th-percentile latency over the last hour?"

Logs are discrete, timestamped records of events — structured (JSON) or unstructured (plain text). They carry rich context: request IDs, user agents, stack traces. ELK Stack (Elasticsearch, Logstash, Kibana) and Loki are popular log aggregation platforms. Logs are expensive at high volume but irreplaceable when debugging specific incidents.

Traces track a single request as it propagates across multiple services. Each hop is a span; the collection of spans for one request is a trace. Distributed tracing tools like Jaeger, Zipkin, and AWS X-Ray stitch spans together using a shared trace ID injected into request headers. Traces reveal latency bottlenecks that neither metrics nor logs can localize on their own.

Modern observability platforms — Datadog, New Relic, Grafana Cloud — correlate all three pillars so you can jump from a latency spike on a metric dashboard directly into the traces and logs for that time window.

Which pillar is best suited for tracking the end-to-end latency of a request across five microservices?
What data structure does Prometheus use to store metrics?
3. What is a Service Level Indicator (SLI) and how does it differ from an SLO and SLA?

An SLI (Service Level Indicator) is a specific, measurable signal that reflects user experience — typically a ratio or rate. Common SLIs include availability (percentage of successful HTTP requests), latency (fraction of requests served under 200 ms), and error rate (5xx responses divided by total requests).

An SLO (Service Level Objective) is a target set on top of an SLI. For example: "99.9% of requests must succeed over a rolling 28-day window." The SLO is an internal agreement between engineering and product; it defines what "good enough" looks like and drives the error budget concept.

An SLA (Service Level Agreement) is a contractual commitment made to customers, usually with financial penalties for breach. SLAs are typically looser than SLOs — the SLO is the engineering guardrail that keeps the team well inside the SLA boundary.

The hierarchy is: SLI → SLO → SLA. You measure with SLIs, aim for SLOs, and promise SLAs. Getting this order wrong — alerting directly on SLA thresholds — leaves no runway to detect and fix issues before customers are impacted.

SLI / SLO / SLA Comparison
Term | What it is        | Audience          | Example
SLI  | Measured signal   | Engineering       | 99.95% success rate (last 7 days)
SLO  | Internal target   | Eng + Product     | ≥ 99.9% success rate over 28 days
SLA  | Customer contract | Customers / Legal | ≥ 99.5% uptime or credits issued
Why is the SLO typically set stricter than the SLA?
Which term describes a contractual commitment to customers that usually carries financial penalties?
4. What is an error budget and how is it used in SRE?

An error budget is the allowable amount of unreliability a service can have within a given SLO window. If your SLO promises 99.9% availability over 30 days, you have 0.1% of that window to spend on failures — roughly 43.2 minutes of downtime. That 43.2 minutes is your error budget.

The budget is consumed whenever the SLI falls below the SLO target. Consumption is tracked in real time. When the budget is healthy (plenty remaining), teams have license to deploy frequently and take calculated risks. When the budget is nearly exhausted, deployments freeze until reliability recovers — this is the error-budget policy.

Error budgets eliminate the adversarial relationship between development velocity and reliability. Developers are incentivized to invest in reliability work because burning the budget costs them deployment freedom. SREs can quantify risk without saying "no" to every release: instead, the budget says how much risk is left.

The burn rate concept extends this further. A burn rate of 1 means you are consuming the budget exactly in line with the window. A burn rate of 10 means you will exhaust the budget ten times faster than the SLO window allows — a signal to page on-call immediately rather than wait for a daily report.
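
A minimal Python sketch of the arithmetic, assuming a 99.9% availability SLO over a 30-day window (values illustrative):

SLO_TARGET = 0.999               # 99.9% availability
WINDOW_MINUTES = 30 * 24 * 60    # 30-day rolling window

# Error budget in minutes: (1 - SLO) * window
budget_minutes = (1 - SLO_TARGET) * WINDOW_MINUTES
print(f"Error budget: {budget_minutes:.1f} minutes")   # ~43.2 minutes

def burn_rate(failed_requests: int, total_requests: int) -> float:
    """Burn rate = observed error ratio / allowed error ratio.
    1.0 consumes the budget exactly over the SLO window; 10.0 exhausts it ten times faster."""
    observed = failed_requests / total_requests
    allowed = 1 - SLO_TARGET
    return observed / allowed

print(burn_rate(failed_requests=50, total_requests=10_000))   # 0.5% errors vs 0.1% allowed -> ~5.0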

If a service has a 99.9% availability SLO over 30 days, approximately how many minutes of downtime does the error budget allow?
What happens when an error budget is exhausted under a strict error-budget policy?
5. What is distributed tracing and how does it work?

Distributed tracing is a technique for following a single request as it moves through multiple services in a distributed system. Without it, when a user reports slowness, you might see a problem in Service C but have no idea whether Service A or B caused it.

The mechanism works through context propagation. When a request enters the system, the first service generates a globally unique trace ID and a span ID for its own unit of work. Before calling a downstream service, it injects these IDs into the outgoing request headers — the W3C Trace Context standard (traceparent header) is the modern way to do this. The receiving service extracts those IDs, creates a child span linked to the parent span, and continues the chain.

Each span records: service name, operation name, start timestamp, duration, status, and any custom attributes (user ID, query string, etc.). The tracing backend — Jaeger, Zipkin, Tempo — collects all spans and stitches them into a tree called a trace. The waterfall view of that tree immediately shows which service added how much latency.

OpenTelemetry is now the de-facto standard for instrumentation: you add the OTel SDK to your service, configure an exporter (OTLP), and the spans flow to your backend of choice without vendor lock-in.
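
A conceptual Python sketch of the traceparent header format (not a production tracer; real services use an instrumentation SDK such as OpenTelemetry, as noted above):

import secrets

def new_traceparent() -> str:
    trace_id = secrets.token_hex(16)    # 128-bit trace ID shared by every span in the trace
    span_id = secrets.token_hex(8)      # 64-bit span ID for this service's unit of work
    return f"00-{trace_id}-{span_id}-01"    # version-traceid-spanid-flags (01 = sampled)

def child_traceparent(parent: str) -> str:
    version, trace_id, parent_span_id, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"   # same trace ID, new span ID

incoming = new_traceparent()              # created by the first service at the edge
outgoing = child_traceparent(incoming)    # injected into headers of the downstream call
print(incoming, outgoing, sep="\n")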

What HTTP header does the W3C Trace Context specification define for propagating trace context?
What is the name of the collection of all spans for a single request in distributed tracing?
6. What is OpenTelemetry and why has it become the industry standard?

OpenTelemetry (OTel) is a vendor-neutral, open-source observability framework that provides APIs, SDKs, and a collector for generating, collecting, and exporting telemetry data — metrics, logs, and traces — from applications. It was formed in 2019 by merging OpenCensus (Google) and OpenTracing (CNCF), and is now a CNCF incubating project with the broadest vendor support of any observability standard.

The key reason for its adoption is vendor neutrality. Before OTel, switching from Jaeger to Datadog required re-instrumenting every service. With OTel, you instrument once using the OTel SDK, export over the OTLP (OpenTelemetry Protocol) wire format, and change only the exporter configuration when switching backends. Every major observability vendor — Datadog, Honeycomb, New Relic, Grafana, AWS — accepts OTLP today.

OTel's architecture has three layers: the API (interface your application code calls), the SDK (the implementation with sampling, batching, and export logic), and the Collector (an agent/gateway that receives, processes, and re-exports telemetry). The Collector enables pipeline transforms — tail-based sampling, attribute filtering, routing to multiple backends — without touching application code.

Auto-instrumentation agents (Java agent, Python auto-instrumentation) can add traces to many frameworks with zero code changes, making adoption practical even for large legacy codebases.
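
A minimal Python sketch of the SDK setup, assuming the opentelemetry-sdk and opentelemetry-exporter-otlp-proto-grpc packages are installed and an OTLP-capable backend listens on localhost:4317:

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Wire the SDK: a provider with a service name, batching, and an OTLP exporter.
provider = TracerProvider(resource=Resource.create({"service.name": "payments"}))
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True)))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("charge-card") as span:
    span.set_attribute("payment.method", "card")
    # Switching backends later means changing only the exporter endpoint, not this code.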

Which two projects merged to form OpenTelemetry in 2019?
What OTel component allows pipeline-level transforms like tail sampling without changing application code?
7. What is the RED method for monitoring microservices?

The RED method, introduced by Tom Wilkie, defines three golden signals specifically suited to request-driven microservices:

R — Rate: The number of requests per second the service is receiving. This tells you about load and traffic patterns. Sudden drops can indicate that upstream services stopped calling — often a sign of their own failure.

E — Errors: The number of failed requests per second (or as a percentage of total requests). This directly reflects user impact. Separate 4xx client errors from 5xx server errors because they have different root-cause implications.

D — Duration: The distribution of latency for requests — specifically, percentiles (p50, p95, p99). Averages hide tail latency; the 99th percentile often reflects the experience of your most valuable users or the slowest database queries.

The RED method maps naturally to HTTP services, gRPC endpoints, and message consumers. It complements the USE method (Utilization, Saturation, Errors), which is better for resource-level monitoring (CPU, disk, network). In a mature microservices setup, you apply RED at every service boundary and USE to every infrastructure resource they depend on.
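
A short Python sketch of RED instrumentation using the prometheus_client library (metric and label names are conventional, not mandated):

import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Requests received", ["method", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds", ["method"])

def handle_request(method: str) -> None:
    start = time.monotonic()
    status = "200"
    try:
        pass          # the real handler's work would go here
    except Exception:
        status = "500"
        raise
    finally:
        REQUESTS.labels(method=method, status=status).inc()              # Rate + Errors
        LATENCY.labels(method=method).observe(time.monotonic() - start)  # Duration

start_http_server(8000)   # exposes /metrics for Prometheus to scrape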

In the RED method, why should you use latency percentiles (p99) rather than averages?
Which complementary method is better suited to monitoring CPU and disk resources than RED?

8. What are the Four Golden Signals defined by Google SRE?

Google's SRE book defines four signals that, when monitored together, give a comprehensive picture of a user-facing service's health:

1. Latency — The time it takes to serve a request. Critically, you must distinguish latency of successful requests from latency of failed requests. A 500 error that returns in 1 ms will skew your latency distribution favorably but hides the real problem.

2. Traffic — A measure of demand placed on the system. For web services this is requests per second; for audio streaming it might be bits per second; for key-value stores it might be transactions per second.

3. Errors — The rate of requests that fail, either explicitly (HTTP 500), implicitly (HTTP 200 with wrong content), or by policy (any request over 1 second is considered an error).

4. Saturation — How full the service is. This is the resource most constrained — CPU, memory, I/O, or queue depth. Saturation often predicts impending failure before errors or latency degrade. At 100% saturation, the service is overloaded.

The Four Golden Signals are broader than RED: they include Saturation, which RED omits, making them better for evaluating whether a service has headroom or is approaching its limits.

Which of the Four Golden Signals does the RED method NOT include?
Why should failed request latency be tracked separately from successful request latency?
9. What is Prometheus and how does its pull-based scraping model work?

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud and now a CNCF graduated project. It stores all data as time-series: streams of timestamped float64 values identified by a metric name and a set of key-value labels.

What makes Prometheus distinctive is its pull-based scraping model. Instead of applications pushing metrics to a central server, Prometheus periodically sends HTTP GET requests to a /metrics endpoint on each target. The response is in Prometheus exposition format — a plain-text, line-by-line format listing metric names, label sets, and values. Prometheus stores the scraped data in its local TSDB (time-series database).
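
A scrape response in the exposition format looks roughly like this (metric names and values illustrative):

# HELP http_requests_total Total HTTP requests handled
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027
http_requests_total{method="POST",status="500"} 3
# HELP process_resident_memory_bytes Resident memory size in bytes
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 2.4567808e+07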

Targets are discovered through service discovery mechanisms: static configuration, Kubernetes API, Consul, EC2 tags, and many others. This means Prometheus automatically starts scraping new pods as they come up and stops when they go down — no manual registration required.

The pull model has important operational implications: Prometheus controls the scrape rate, failed targets are immediately visible as scrape failures, and there is no need for agents to know the Prometheus address. The trade-off is that short-lived jobs (batch jobs) may finish before the scrape happens — solved by the Pushgateway, which acts as an intermediary for ephemeral workloads to push metrics to.

PromQL (Prometheus Query Language) is used to query and aggregate these time-series, feeding both Grafana dashboards and Alertmanager rules.

How does Prometheus collect metrics from applications by default?
Why is the Pushgateway needed in a Prometheus setup?
10. What is Grafana and how does it integrate with Prometheus?

Grafana is an open-source analytics and visualization platform that lets you query, visualize, and alert on metrics from a wide variety of data sources — Prometheus, Loki, Tempo, InfluxDB, Elasticsearch, CloudWatch, and many more — all from a single UI.

The integration with Prometheus works through a data source plugin. You configure Prometheus as a data source in Grafana by providing its HTTP endpoint. From that point, Grafana panels can issue PromQL queries against Prometheus's API and render the results as time-series graphs, heatmaps, stat panels, or tables.

Grafana does not store your metrics — it is a read-only query and visualization layer on top of Prometheus (or any other backend). This separation of concerns means you can switch visualization tools without touching the data store, and you can query the same Prometheus instance from multiple Grafana instances.

Grafana also has its own alerting engine (Grafana Unified Alerting) that can evaluate PromQL expressions and route alerts through contact points — Slack, PagerDuty, email — similar to Prometheus Alertmanager. In many organizations both are used: Alertmanager for rule evaluation close to the data, Grafana alerting for cross-datasource rules.

The Grafana LGTM stack (Loki, Grafana, Tempo, Mimir) has become a popular open-source alternative to commercial platforms, with Mimir providing horizontally scalable long-term storage for Prometheus metrics.

Does Grafana store metrics data itself?
What does the M in the LGTM stack stand for?
11. What is structured logging and why is it preferred over plain-text logs?

Structured logging is the practice of emitting log records as machine-parseable data — typically JSON — rather than free-form text strings. Each log entry is a document with well-defined fields: timestamp, level, message, service, trace_id, user_id, and any other contextual fields relevant to the operation.

Plain-text logs look like: 2024-01-15 10:32:01 ERROR Failed to connect to DB after 3 retries. To extract the retry count, you write a fragile regex. Structured logs look like:

{"timestamp": "2024-01-15T10:32:01Z", "level": "ERROR", "event": "db_connect_failed", "retries": 3, "db_host": "postgres-primary", "trace_id": "abc123"}

The advantages are significant. Log aggregation systems (Elasticsearch, Loki, Splunk, CloudWatch Logs Insights) can index every field automatically, enabling fast, precise queries like level=ERROR AND retries > 2 AND db_host=postgres-primary without regex parsing. You can correlate logs with traces using trace_id directly. You can build metrics from log field counts without parsing.

In Java, libraries like Logback with Logstash encoder, Log4j2 with JSON layout, or SLF4J with structured argument APIs make structured logging straightforward. In Python, structlog or the standard logging module with a JSON formatter achieve the same result.
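
A minimal Python sketch using structlog, one of the options mentioned above (field names are illustrative; exact output ordering may differ):

import structlog

structlog.configure(processors=[
    structlog.processors.TimeStamper(fmt="iso"),
    structlog.processors.add_log_level,
    structlog.processors.JSONRenderer(),
])

log = structlog.get_logger(service="checkout")
log.error("db_connect_failed", retries=3, db_host="postgres-primary", trace_id="abc123")
# Emits one JSON document containing event, level, timestamp, and the bound fields.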

What is the primary advantage of structured logs over plain-text logs for incident investigation?
Which field in a structured log record enables direct correlation with distributed traces?
12. What is log aggregation and what tools are commonly used for it?

Log aggregation is the process of collecting log data from many sources — application instances, containers, VMs, serverless functions — into a centralized system where it can be searched, analyzed, and retained. Without aggregation, debugging a failure across 50 pods means SSHing into each one individually, which is impractical and slow.

The classic stack is the ELK Stack: Elasticsearch stores and indexes logs, Logstash or Beats (Filebeat, Metricbeat) ship logs from hosts, and Kibana provides the visualization and search UI. Logstash parses and transforms logs before indexing; Filebeat is a lightweight shipper that tails log files and forwards them to Logstash or directly to Elasticsearch.

Grafana Loki takes a different approach: it indexes only metadata labels (pod name, namespace, app) rather than full-text indexing the log content. This makes it far cheaper to run at scale. Queries use LogQL, which filters by label first, then applies line filters on the matching log streams. The trade-off is that ad-hoc field searches are slower than Elasticsearch for unstructured text.

Splunk is the dominant enterprise option with powerful search (SPL), rich dashboards, and deep integrations, but at significantly higher cost.

Cloud-native options include AWS CloudWatch Logs, Google Cloud Logging, and Azure Monitor Logs, each tightly integrated with their respective compute platforms.

How does Grafana Loki differ from Elasticsearch in terms of log indexing?
What is the role of Filebeat in the ELK stack?
13. What is alerting fatigue and how can you reduce it?

Alerting fatigue occurs when on-call engineers receive so many alerts — many of which are non-actionable, duplicate, or transient — that they begin ignoring or acknowledging them without investigation. It is one of the most damaging failure modes in an observability program because it means real incidents go undetected while engineers burn out.

The root causes are typically: alerting on symptoms rather than user impact, overly sensitive thresholds, missing deduplication, no alert routing (everything goes to one channel), and alerts that fire at 2 AM for issues that can safely wait until morning.

Practical remedies include:

Alert on SLO burn rate, not individual metrics. Instead of alerting when CPU > 80%, alert when the error budget is burning faster than a sustainable rate. This ties every alert to actual user impact.

Use multi-window, multi-burn-rate alerting (as described in the Google SRE Workbook). A fast burn rate fires immediately; a slower burn rate fires after accumulating over a longer window. This avoids noisy one-minute spikes while still catching slow, steady degradation.

Group and deduplicate using Alertmanager's grouping and inhibition rules. One database outage should produce one alert, not 500 alerts from every service that depends on that database.

Regularly prune alerts by reviewing which fired in the last 30 days. Alerts that consistently go unactioned should be removed or turned into tickets.
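
The multi-window, multi-burn-rate idea can be sketched in a few lines of Python, assuming the per-window error ratios are computed elsewhere (for example via PromQL); the thresholds below are the commonly cited SRE Workbook examples for a 30-day SLO:

SLO = 0.999
ALLOWED_ERROR_RATIO = 1 - SLO

def burn_rate(error_ratio: float) -> float:
    return error_ratio / ALLOWED_ERROR_RATIO

def should_page(err_ratio_1h: float, err_ratio_5m: float) -> bool:
    # Fast burn: roughly 2% of the 30-day budget consumed within one hour.
    # The short window confirms the burn is still happening now, not an old spike.
    return burn_rate(err_ratio_1h) >= 14.4 and burn_rate(err_ratio_5m) >= 14.4

def should_ticket(err_ratio_3d: float, err_ratio_6h: float) -> bool:
    # Slow burn: roughly 10% of the budget consumed over three days; a ticket, not a page.
    return burn_rate(err_ratio_3d) >= 1 and burn_rate(err_ratio_6h) >= 1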

What is the core principle of SLO-based alerting that makes it less noisy than threshold-based alerting?
In Alertmanager, what feature prevents 500 derivative alerts from firing when a single upstream database goes down?
14. What is the USE method and when should you apply it?

The USE method was defined by Brendan Gregg as a systematic way to analyze performance problems in any system resource. USE stands for:

U — Utilization: The percentage of time the resource is busy. A CPU at 90% utilization is heavily loaded. Disk at 100% utilization (100% of I/O time spent servicing requests) is a bottleneck.

S — Saturation: The degree to which the resource has extra work it cannot service yet — the queue or backlog. A CPU can be at 80% utilization but have a run queue of 20 waiting threads — that is saturation. Saturation predicts whether more requests will be delayed.

E — Errors: The count of error events. These can be hard errors (disk read failures) or soft errors (corrected ECC memory errors). Errors at the resource level often precede visible application-level failures.

The USE method applies to every physical and virtual resource: CPUs, memory, disks, network interfaces, storage controllers, buses. You iterate through each resource and check U, S, E. The first resource where any of these is abnormal is likely your bottleneck.

Apply USE for infrastructure-level diagnosis — especially when investigating capacity issues, noisy-neighbor problems in shared cloud environments, or hardware degradation. For request-driven microservice diagnosis, RED is more appropriate. Together, USE + RED give you both the resource and the service-level view.

A CPU shows 70% utilization but the run queue length is 40. Which USE dimension signals the real problem here?
Soft ECC memory errors that are automatically corrected fall into which USE dimension?
15. What is cardinality in metrics and why does high cardinality cause problems?

Cardinality in metrics refers to the number of unique label value combinations that a metric can produce. A metric like http_requests_total{method, status_code, endpoint} with 5 methods, 20 status codes, and 1,000 endpoints generates up to 100,000 unique time-series. Each unique combination is called a label set or series.

High cardinality causes problems in time-series databases like Prometheus because each unique series requires its own storage, indexing, and memory overhead. The Prometheus TSDB keeps an inverted index of all label values in RAM. When you add a label like user_id or request_id — which can have millions of values — the number of series explodes. This is called a cardinality explosion, and it can OOM-kill a Prometheus server within minutes.

Common cardinality pitfalls include: adding user IDs, session tokens, IP addresses, or UUIDs as metric labels; using unbounded string values as labels; or creating per-endpoint metrics for every URL path in an API (especially with path parameters like /user/{id}).

Solutions include: using logs or traces for high-cardinality data instead of metrics; normalizing high-cardinality labels into fixed buckets; using recording rules to pre-aggregate before storage; or migrating to backends that handle high cardinality better than vanilla Prometheus, such as VictoriaMetrics or Thanos.
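
For example, one mitigation is to normalize unbounded URL paths into a fixed set of route templates before using them as a label. A Python sketch (the pattern list is illustrative):

import re

ROUTE_TEMPLATES = [
    (re.compile(r"^/user/\d+$"), "/user/{id}"),
    (re.compile(r"^/order/[0-9a-f-]{36}$"), "/order/{uuid}"),
]

def normalize_path(path: str) -> str:
    for pattern, template in ROUTE_TEMPLATES:
        if pattern.match(path):
            return template
    return "/other"    # unknown paths collapse into a single bucket instead of new series

print(normalize_path("/user/8675309"))   # -> "/user/{id}": one series, not one per user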

Why should you never use request_id or user_id as a Prometheus metric label?
Which Prometheus feature can reduce cardinality by pre-computing aggregated values before they are stored?
16. What is tail-based sampling in distributed tracing and when should you use it?

Tail-based sampling is a tracing strategy where the decision about whether to keep or discard a trace is made after the entire trace is complete, not at the moment the root span starts. This contrasts with head-based sampling, where a random coin flip at the entry point determines whether the trace is recorded — before you know if anything interesting will happen.

The problem with head-based sampling is that it discards traces at random, including most of the interesting ones. If 1% of requests produce errors and you sample 10% of all traces, you capture only 10% of your error traces, and the kept error traces amount to just ~0.1% of total traffic. The errors — the cases you most need to debug — are systematically underrepresented.

Tail-based sampling solves this by buffering spans in a collector (such as the OpenTelemetry Collector's tail sampling processor) until the trace is complete. Only then is the sampling policy evaluated: keep every trace that contains an error, keep every trace whose latency exceeds a threshold (for example, the p99 target), and keep perhaps 1% of the healthy, fast traces. This ensures errors and slow traces are always captured at 100%, while routine traffic is sampled down.
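
Conceptually, the keep/discard decision looks like the following Python sketch (in a real deployment this logic runs inside the collector and is configured declaratively; thresholds are illustrative):

import random

def keep_trace(spans: list[dict]) -> bool:
    if any(span.get("status") == "ERROR" for span in spans):
        return True                                    # keep 100% of error traces
    root = next(s for s in spans if s.get("parent_id") is None)
    if root["duration_ms"] > 2_000:
        return True                                    # keep 100% of slow traces
    return random.random() < 0.01                      # keep ~1% of routine, healthy traffic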

The trade-off is infrastructure complexity: the collector must hold spans in memory long enough for late-arriving spans to complete the trace (typically 10–30 seconds), requiring significant RAM and careful timeout tuning. If the collector crashes mid-window, partial traces are lost.

Use tail-based sampling in production microservices where error rates are low (less than 5%) and capturing all error traces is a hard requirement for debugging.

What is the key advantage of tail-based sampling over head-based sampling for error traces?
What infrastructure challenge does tail-based sampling introduce compared to head-based sampling?
17. What is a health check endpoint and what should it return?

A health check endpoint is an HTTP endpoint — typically /health, /healthz, or /actuator/health — that exposes the current health status of a service. Load balancers, orchestrators like Kubernetes, and monitoring systems poll this endpoint to determine whether the service is ready to receive traffic.

There are two distinct types of health checks that should be implemented separately:

Liveness probe: Answers the question "Is the application alive or should it be restarted?" It should only check whether the process is responsive — not whether its dependencies are healthy. If the liveness probe checks the database and the database goes down, Kubernetes would restart every pod unnecessarily, causing a cascading failure.

Readiness probe: Answers the question "Is the application ready to serve traffic?" This is where dependency checks belong. If the application cannot connect to its database, it should return a non-200 response here, and the load balancer will stop routing requests to it until it recovers.

A good health check response includes: overall status (UP/DOWN/DEGRADED), individual component statuses (database, cache, downstream services), response time of each dependency check, and optionally version information. Spring Boot Actuator's /actuator/health endpoint follows this structure natively and aggregates individual health indicators.

Health checks should be fast (under 100 ms) and should not perform expensive operations — otherwise the health check itself becomes a bottleneck under load.
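
A minimal Python sketch of separate liveness and readiness endpoints using Flask (the framework and the check_database helper are illustrative, not prescribed by Kubernetes):

from flask import Flask, jsonify

app = Flask(__name__)

def check_database() -> bool:
    # e.g. run "SELECT 1" with a short timeout; hard-coded here for brevity
    return True

@app.route("/healthz")      # liveness: is the process up and able to respond?
def liveness():
    return jsonify(status="UP"), 200

@app.route("/readyz")       # readiness: are dependencies reachable?
def readiness():
    db_ok = check_database()
    body = {"status": "UP" if db_ok else "DOWN",
            "components": {"database": "UP" if db_ok else "DOWN"}}
    return jsonify(body), 200 if db_ok else 503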

Why should a Kubernetes liveness probe NOT check database connectivity?
What does a readiness probe failure cause in a Kubernetes deployment?
18. What is synthetic monitoring and how does it differ from real user monitoring (RUM)?

Synthetic monitoring (also called active monitoring) involves simulating user interactions with your application using scripted probes that run on a schedule, independent of real user traffic. The probes check that key user journeys — login, checkout, search — work correctly and measure their performance. Tools like Datadog Synthetics, Pingdom, Grafana k6, and AWS CloudWatch Synthetics run these scripts from multiple geographic regions around the clock.

The advantage of synthetic monitoring is that it detects issues even when real user traffic is zero — overnight, during off-peak hours, or before a region is publicly available. It also provides a consistent, reproducible baseline since the same script runs every time, making performance regressions easy to spot.

Real User Monitoring (RUM) collects telemetry from actual user browsers or mobile apps as they interact with your application. JavaScript agents (Datadog RUM, New Relic Browser, Google Analytics) capture page load times, Core Web Vitals (LCP, CLS, INP), JavaScript errors, and user session data. RUM reflects the actual diversity of user environments: different browsers, network conditions, geographies, and device capabilities.

The two approaches are complementary. Synthetic monitoring provides consistent baselines and catches regressions before users see them; RUM reveals how real users across the globe experience your application and surfaces issues that synthetic scripts cannot replicate (e.g., third-party script failures on specific browser versions).
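
A synthetic probe can be as simple as the following Python sketch, run on a schedule from each region (URL and latency threshold are illustrative):

import time
import requests

def check_checkout_page() -> dict:
    start = time.monotonic()
    resp = requests.get("https://shop.example.com/checkout", timeout=5)
    elapsed = time.monotonic() - start
    return {
        "ok": resp.status_code == 200 and elapsed < 1.0,
        "status_code": resp.status_code,
        "latency_seconds": round(elapsed, 3),
    }

print(check_checkout_page())   # a probe runner or cron job would record this result as metrics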

What is a key advantage of synthetic monitoring over RUM during off-peak hours?
Which type of monitoring would best reveal a third-party ad script crashing only on Safari 17 for users in Germany?
19. What are Core Web Vitals and why do they matter for observability?

Core Web Vitals are a set of user-experience metrics defined by Google that measure loading performance, interactivity, and visual stability. They are directly included in Google's search ranking algorithm, making them both an observability concern and a business one.

The three current Core Web Vitals are:

LCP — Largest Contentful Paint: Measures when the largest image or text block in the viewport is rendered. Good: under 2.5 seconds. Poor: over 4 seconds. LCP is affected by slow server response times, render-blocking resources, and slow image loading.

INP — Interaction to Next Paint (replaced FID in 2024): Measures the latency of all user interactions (clicks, key presses) and reports the worst-case one. Good: under 200 ms. It replaces FID (First Input Delay) because FID only measured the first interaction, missing long-running JavaScript tasks mid-session.

CLS — Cumulative Layout Shift: Measures unexpected layout shifts — content jumping around while the page loads. Good: under 0.1. CLS is caused by images without dimensions, dynamically injected content above existing content, and web fonts causing FOUT (Flash of Unstyled Text).

From an observability perspective, Core Web Vitals are RUM metrics — they must be collected from real user browsers using the Web Vitals JavaScript library or a RUM agent. They complement server-side latency metrics because a server can respond in 50 ms while LCP is still 5 seconds due to client-side rendering bottlenecks.

Which Core Web Vital replaced First Input Delay (FID) in 2024 and why is it considered an improvement?
Why can server-side latency metrics be misleading about a page's actual LCP score?
20. What is application performance monitoring (APM) and how does it differ from infrastructure monitoring?

Application Performance Monitoring (APM) focuses on the behavior and performance of your application code — transaction tracing, method-level timing, database query performance, external API call latency, memory allocations, and error rates at the code level. APM tools like Datadog APM, New Relic APM, Dynatrace, and Elastic APM instrument your code (often via agents) to collect this data with minimal manual effort.

Infrastructure monitoring, in contrast, focuses on the resources that your application runs on: CPU utilization, memory, disk I/O, network throughput, and availability of the underlying VMs, containers, or bare-metal hosts. Tools like Prometheus + Node Exporter, Datadog Infrastructure, or CloudWatch cover this layer.

The distinction matters for diagnosis. If your service's p99 latency spikes:

  • Infrastructure monitoring tells you whether the host is CPU-throttled or network-saturated.
  • APM tells you which specific database query or downstream API call accounts for the added latency, and on which line of code it originates.

Modern APM platforms increasingly blur this distinction by correlating application traces with host metrics and logs in a single UI, but the conceptual separation remains useful: infrastructure monitoring is about the box; APM is about the code running on that box.

An APM tool identifies that a specific SQL query is responsible for 80% of a service's latency. Could infrastructure monitoring alone have identified the same root cause?
How do most APM agents instrument Java applications?
21. What is eBPF and how is it revolutionizing observability?

eBPF (extended Berkeley Packet Filter) is a Linux kernel technology that allows sandboxed programs to run inside the kernel without modifying kernel source code or loading kernel modules. Originally designed for network packet filtering, eBPF has been extended to support arbitrary kernel and user-space event tracing — making it a powerful foundation for low-overhead observability.

In observability, eBPF programs attach to kernel hooks (kprobes, uprobes, tracepoints, network sockets) and fire when specific events occur: a system call is made, a TCP connection is established, a function is entered, or a packet arrives. The program can read kernel data structures, compute statistics, and emit events to user space — all with near-zero overhead because it runs in the kernel itself, eliminating context-switch costs.

The observability revolution eBPF enables is zero-instrumentation tracing. Tools like Cilium (network observability), Pixie (Kubernetes observability), and Falco (security monitoring) use eBPF to capture HTTP requests, database queries, DNS lookups, and system calls across all pods in a Kubernetes cluster — without adding a single line of instrumentation to application code or restarting any process.

This is particularly valuable for legacy applications that cannot be easily re-instrumented, polyglot environments, or organizations that want full observability from day one of deployment without waiting for developer instrumentation work.

What makes eBPF-based observability tools like Pixie different from traditional APM agents?
Which eBPF-based tool focuses primarily on Kubernetes network observability and security policy enforcement?
22. What is Jaeger and how does it work as a distributed tracing backend?

Jaeger is an open-source distributed tracing platform originally developed by Uber and now a CNCF graduated project. It collects, stores, and visualizes distributed traces from microservices, making it possible to reconstruct the end-to-end journey of any request.

Jaeger's architecture consists of several components:

Jaeger Agent: A network daemon deployed alongside each application (typically as a sidecar or DaemonSet) that listens for spans via UDP (using the Thrift compact or Thrift binary protocol) and batches them to the Collector. UDP is chosen so that emitting a span is non-blocking — sending a span should never block the application thread.

Jaeger Collector: Receives spans from agents or directly from applications via gRPC/HTTP, validates them, processes them through a pipeline (sampling, indexing), and writes them to the storage backend.

Storage backends: Jaeger supports Elasticsearch (for full-text search on tags and logs), Cassandra (for high-write-throughput production deployments), and in-memory storage (for development/testing only). For production, Elasticsearch is the most common choice.

Jaeger Query and UI: Exposes an HTTP API and web UI for searching traces by service, operation, duration, and tags, and renders the waterfall view showing span hierarchy and timing.

Jaeger supports OpenTelemetry natively via OTLP, making it straightforward to migrate from proprietary Jaeger client libraries to the OTel SDK while keeping Jaeger as the backend.

Why does the Jaeger Agent use UDP rather than TCP to receive spans from applications?
Which Jaeger storage backend is most commonly used in production for its full-text tag search capabilities?
23. What is MTTR and MTTD and why do they matter to SRE teams?

MTTR (Mean Time To Recover) and MTTD (Mean Time To Detect) are reliability engineering metrics that quantify two key phases of an incident lifecycle.

MTTD — Mean Time To Detect is the average time between when a failure actually begins and when the monitoring system (or a customer) first detects it. A low MTTD means your alerting and observability systems are working well — issues are caught quickly, before they impact many users or accumulate large error budget burns.

MTTR — Mean Time To Recover is the average time from detection to full service restoration. MTTR encompasses diagnosis time (finding the root cause), mitigation time (deploying a fix or rollback), and verification time (confirming recovery). A low MTTR reflects good runbooks, good observability (fast diagnosis), fast deployment pipelines, and practiced incident response processes.

These metrics directly reflect observability maturity. If MTTD is high, alerts are too slow or missing entirely. If MTTR is high despite fast detection, either the debugging experience is poor (missing traces or logs), deployments are slow, or on-call engineers lack the knowledge to diagnose the system. Observability improvements — better traces, correlated logs, runbooks linked from alerts — directly reduce MTTR.

DORA (DevOps Research and Assessment) research identifies MTTR as one of four key metrics for elite engineering organizations, alongside deployment frequency, lead time, and change failure rate.

If MTTD is high but MTTR is low, what does this suggest about the team's observability?
According to DORA research, MTTR is one of how many key software delivery metrics?
24. What is anomaly detection in observability and what are its limitations?

Anomaly detection in observability is the automated identification of data points or patterns in metrics, logs, or traces that deviate significantly from historical baselines or expected behavior. Instead of manually setting static thresholds like "alert if CPU > 80%", anomaly detection learns seasonal patterns (traffic spikes every Monday at 9 AM), trends (gradual memory leak over days), and correlated multi-metric behaviors, then alerts only when observed values fall outside learned normal ranges.

Common approaches include:

Statistical methods: Z-score, moving averages, exponential smoothing (Holt-Winters) detect deviations from a rolling baseline. Simple and interpretable.

Machine learning models: Isolation Forest, LSTM neural networks, and Prophet (Facebook's time-series library) can capture complex seasonal patterns. Datadog, New Relic, and Dynatrace all offer ML-based anomaly detection for metrics.
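
A minimal Python sketch of the statistical approach, using a rolling z-score (window size and threshold are illustrative):

from collections import deque
from statistics import mean, stdev

class ZScoreDetector:
    def __init__(self, window: int = 60, threshold: float = 3.0):
        self.history = deque(maxlen=window)     # rolling baseline of recent values
        self.threshold = threshold

    def is_anomalous(self, value: float) -> bool:
        anomalous = False
        if len(self.history) >= 2:
            mu, sigma = mean(self.history), stdev(self.history)
            anomalous = sigma > 0 and abs(value - mu) / sigma > self.threshold
        self.history.append(value)
        return anomalous          # stays False until enough baseline data exists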

Limitations are significant. First, anomaly detection requires training data — new services or after major refactors, there is no baseline. Second, it generates alert noise during legitimate events (product launches, holiday traffic spikes) which look like anomalies but are expected. Third, it operates on known metrics: it can only flag what it can see, not unknown-unknowns that were never instrumented. Fourth, understanding why a metric is anomalous still requires human investigation — anomaly detection replaces threshold-setting, not debugging. Fifth, precision-recall trade-offs mean reducing false positives often increases false negatives, and vice versa.

Why does anomaly detection often fire false positives during a planned product launch?
What fundamental observability gap does anomaly detection NOT address?
25. What is a runbook and how should it be linked to monitoring alerts?

A runbook (also called a playbook) is a documented set of procedures that an on-call engineer follows when a specific alert fires. A well-written runbook dramatically reduces MTTR by pre-answering the first questions an engineer asks: What is this alert? Why does it matter? What checks do I run first? What are the common causes? What do I do if cause A? What do I do if cause B? Who do I escalate to?

Runbooks should be linked directly from the alert definition so they are one click away at 3 AM when the engineer is half-awake. In Prometheus alerting rules, the convention is to add a runbook_url annotation to each rule, which Alertmanager passes through to the notification. Datadog, PagerDuty, and most alerting platforms support similar custom fields for this purpose.

A good runbook contains:

  • Alert summary: What fired and what service it covers.
  • Severity and SLO impact: Is the error budget burning? How fast?
  • Diagnosis steps: Specific PromQL or log queries to run, with expected outputs for common failure modes.
  • Mitigation options: Rollback command, feature flag to disable, cache flush procedure.
  • Escalation path: Who to call if the runbook does not resolve the issue within N minutes.
  • Post-incident: Link to the postmortem template.

Runbooks become stale quickly. Pair them with a last-updated timestamp and require post-incident updates when the actual resolution differed from documented steps.

In a Prometheus Alertmanager alert rule, which annotation field is the conventional place to link a runbook URL?
What is a key sign that a runbook has become stale and should be updated?
26. What is a service mesh and how does it enhance observability?

A service mesh is an infrastructure layer — deployed alongside your application services — that manages service-to-service communication. It intercepts network traffic using sidecar proxies (Envoy is the most common) injected into every pod, handling load balancing, mutual TLS, retries, circuit breaking, and observability without any application code changes.

From an observability perspective, a service mesh provides L7 telemetry automatically for every service-to-service call in the mesh. Because Envoy intercepts all HTTP/gRPC traffic, it can emit:

  • Metrics: Request rate, error rate, and latency (p50/p95/p99) per source-destination service pair — exactly the RED method signals, automatically, for every microservice.
  • Traces: Envoy can propagate trace context headers and generate spans for every hop, contributing to distributed traces without application-level instrumentation.
  • Access logs: Structured per-request logs with HTTP method, path, status, upstream cluster, and duration.

Istio (using Envoy) and Linkerd are the two dominant service meshes. Istio integrates with Prometheus (via native Envoy metrics scraping), Jaeger/Zipkin (for tracing), and Kiali (a service mesh topology visualization tool). Linkerd has its own lightweight Rust-based proxy with built-in Prometheus metrics.

The trade-off is operational complexity: managing a service mesh's control plane (istiod, Linkerd control plane) adds significant overhead, and sidecar injection adds latency and resource consumption per pod.

What is the sidecar proxy used by Istio for traffic interception and telemetry?
What observability benefit does a service mesh provide that would otherwise require manual SDK instrumentation in every microservice?
27. What is a postmortem and what makes one blameless?

A postmortem (also called an incident review or retrospective) is a structured document written after a significant incident. Its purpose is to understand what happened, why it happened, what impact it had, and how to prevent recurrence. In SRE culture, postmortems are treated as a learning opportunity, not a blame-assignment exercise.

A typical postmortem includes:

  • Incident summary: What broke, when, and for how long.
  • Impact: Number of affected users, revenue or SLO impact, error budget burned.
  • Timeline: Precise chronology of detection, escalation, diagnosis steps, mitigation, and resolution.
  • Root cause analysis: The chain of contributing factors (using 5 Whys, fishbone diagrams, or similar).
  • Action items: Specific, assigned, and time-bound follow-up tasks to prevent recurrence.

A blameless postmortem operates under the assumption that engineers make reasonable decisions given the information and tools available to them at the time. Rather than asking "Who caused the outage?", it asks "What conditions made this mistake possible?" and "How do we remove those conditions?" This approach, championed by John Allspaw at Etsy and codified in Google's SRE book, creates a psychologically safe environment where engineers honestly report their actions without fear of punishment.

Blameless postmortems produce higher-quality information because engineers do not hide or sanitize their actions. The result is better action items targeting systemic fixes (tooling, automation, process) rather than individual performance reviews.

What is the core question a blameless postmortem asks instead of who caused the outage?
Why do blameless postmortems produce more accurate incident timelines than blame-focused reviews?
28. What is the difference between blackbox monitoring and whitebox monitoring?

Blackbox monitoring treats the system under observation as a black box — you probe it from the outside and measure what you can observe without any inside knowledge. You send HTTP requests to an endpoint and measure whether you get a 200 response and in what time. The Prometheus Blackbox Exporter is the canonical tool: it probes HTTP, HTTPS, TCP, DNS, and ICMP endpoints and exposes the results as metrics. Synthetic monitoring (Pingdom, CloudWatch Synthetics) is also blackbox monitoring.

Blackbox monitoring catches failures from the user's perspective — it tells you whether the service appears healthy to external consumers. It works even when you have no access to the application internals, making it ideal for third-party APIs, vendor services, and legacy systems you cannot instrument.

Whitebox monitoring collects telemetry from inside the application: instrumented metrics (counters, histograms), structured logs, and distributed traces. It reveals internal behavior — database query times, thread pool utilization, cache hit rates, garbage collection pauses — that external probes cannot see. APM tools are entirely whitebox.

The two approaches are complementary, not alternatives. Blackbox monitoring catches: endpoint unavailability, DNS failures, TLS certificate expiry, and user-visible latency. Whitebox monitoring is needed to diagnose why those things happen. A balanced observability program deploys both: blackbox checks as the outermost user-facing signal, whitebox telemetry for diagnosis.

Which monitoring approach would first detect a TLS certificate that expires tomorrow?
What internal signal would only whitebox monitoring reveal that blackbox cannot?
29. What is Kubernetes monitoring and what are the key components to observe?

Kubernetes monitoring covers multiple layers, each requiring different tooling and instrumentation. A Kubernetes cluster has at minimum these observable layers:

Control plane components: The API server, etcd, scheduler, and controller manager each expose their own Prometheus metrics. API server latency and request rates, etcd database size and disk fsync latency, and scheduler binding latency are critical signals for cluster health. The kube-state-metrics exporter converts Kubernetes object state (pod phase, deployment replicas, node conditions) into Prometheus metrics.

Node-level resources: Node Exporter (or the Windows Exporter) runs as a DaemonSet and collects host-level metrics: CPU, memory, filesystem, and network. These feed the USE method analysis for each node.

Pod and container metrics: The kubelet exposes the cAdvisor metrics endpoint, which provides CPU, memory, and network usage per container. These are scraped by Prometheus and enable per-pod resource utilization dashboards.

Application metrics: Each application exposes its own /metrics endpoint. ServiceMonitor or PodMonitor custom resources (from the Prometheus Operator) tell Prometheus which services to scrape.

Events: Kubernetes events (OOMKilled, CrashLoopBackOff, ImagePullBackOff) are critical for understanding pod failure patterns. They can be shipped to a log aggregator using tools like eventrouter or a Kubernetes event exporter.

The kube-prometheus-stack Helm chart bundles Prometheus Operator, Alertmanager, Grafana, and a set of pre-built dashboards and alert rules, making it the fastest path to a complete Kubernetes monitoring setup.

Which tool converts Kubernetes object state (pod phase, deployment replicas) into Prometheus metrics?
What Prometheus Operator custom resource tells Prometheus which Kubernetes services to scrape for metrics?
30. What is a metric histogram and why is it used for latency measurement?

A histogram is a metric type that samples observations and counts them into configurable buckets, while also tracking a running count and sum. In Prometheus, a histogram metric creates multiple time-series: _bucket{le="0.1"} (count of observations ≤ 100 ms), _bucket{le="0.5"}, _bucket{le="1.0"}, etc., plus _count (total observations) and _sum (sum of all observed values).

For latency measurement, histograms are preferred over gauges or counters because they enable percentile calculations without storing every individual data point. The histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) PromQL expression computes an approximate p99 from the bucket counts — not from raw samples.

The approximation quality depends on bucket placement. Buckets should be placed where percentile boundaries are likely to fall. If your SLO threshold is 500 ms, you need a bucket at exactly 0.5 seconds; otherwise the quantile approximation at that threshold will be inaccurate.

Prometheus Native Histograms (introduced experimentally in Prometheus 2.40) eliminate the need for pre-configured buckets by using a sparse representation with exponentially-spaced buckets that adapt to the actual data distribution, providing accurate percentiles at any threshold without bucket configuration.

A summary is an alternative that computes quantiles client-side and exposes them directly. Summaries are accurate but cannot be aggregated across instances — avg(summary_quantile) across 10 pods is mathematically incorrect. Histograms aggregate correctly because bucket counts can be summed.
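
A short Python sketch with prometheus_client, defining buckets that include the 0.5 s SLO boundary discussed above (bucket list illustrative):

import time
from prometheus_client import Histogram

REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Request latency in seconds",
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),   # explicit 0.5 bucket for the SLO threshold
)

with REQUEST_LATENCY.time():    # records the elapsed time into the matching bucket
    time.sleep(0.12)            # stand-in for real request handling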

Why are Prometheus histograms preferred over summaries for fleet-wide latency percentile calculations?
If your SLO threshold is 500 ms latency, what must be true about your Prometheus histogram bucket configuration?
31. What is chaos engineering and how does it relate to observability?

Chaos engineering is the practice of intentionally injecting failures into a system in a controlled way to discover weaknesses before they cause unplanned outages. The discipline was pioneered by Netflix with Chaos Monkey (the first member of its broader Simian Army suite), which randomly terminated EC2 instances in production to verify that services could survive instance failures.

Chaos engineering tools include Chaos Monkey (instance termination), Gremlin (CPU, memory, network, and disk failure injection as a SaaS platform), Chaos Mesh (Kubernetes-native chaos experiments), and Litmus Chaos (another CNCF project for Kubernetes chaos).

The connection to observability is direct and bidirectional. Chaos experiments are only safe and useful if you have strong observability in place first:

  • You cannot run chaos safely without observability. If you cannot detect the blast radius of an experiment in real time, you risk turning a controlled test into a real incident. You need dashboards, alerts, and SLO tracking active before any experiment starts.
  • Chaos experiments validate your observability. Running a chaos experiment and checking whether your alerts fired, your runbooks worked, and your dashboards showed the failure is a direct test of whether your monitoring would catch the same failure if it happened unexpectedly.
  • Chaos reveals observability gaps. If an experiment causes a failure that goes undetected by your monitoring until you look at a dashboard manually, that is an observability gap to close.
What tool originally popularized chaos engineering by terminating random EC2 instances at Netflix?
What does it mean for your observability when a chaos experiment causes a failure that goes undetected by your alerts?
32. What is log sampling and when should you apply it?

Log sampling is the practice of recording only a fraction of log entries that match a certain pattern, rather than every single one. It is a strategy for controlling log volume and cost when some log types are emitted at very high rates and provide diminishing marginal value per entry.

The most common scenario is high-frequency success logs. If your API handles 50,000 requests per second and every successful request logs an INFO entry, you are generating 4.3 billion log lines per day — most of which are identical in structure and say everything is fine. Sampling 1 in 100 success logs while keeping 100% of warnings, errors, and slow requests reduces volume by ~99% without meaningfully hurting your ability to investigate incidents.

There are two main approaches:

Head-based (random) sampling: Log a fixed percentage of all events matching a rule. Simple to implement but may drop rare important events if they happen to fall in the unlogged fraction.

Adaptive sampling: Adjust the sampling rate dynamically based on volume — when the rate is low, log everything; when the rate spikes, keep a smaller fraction so total volume stays bounded. Compared with a fixed rate tuned for peak traffic, this preserves full fidelity for rare, low-volume events while still protecting the pipeline during surges.

Sampling should never be applied to error-level logs, security audit logs, or any log that is only emitted once per rare event. The critical rule: sample on volume, not on importance. Always emit 100% of high-severity events regardless of sampling configuration.
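
As a minimal sketch of head-based sampling (Python standard-library logging; the 1% ratio and logger name are illustrative), a filter can drop most INFO records while always passing warnings and errors:

  import logging
  import random

  class SamplingFilter(logging.Filter):
      """Pass every WARNING-or-above record; keep only a fraction of INFO records."""
      def __init__(self, info_keep_ratio=0.01):
          super().__init__()
          self.info_keep_ratio = info_keep_ratio

      def filter(self, record):
          if record.levelno >= logging.WARNING:
              return True                            # never sample warnings or errors
          return random.random() < self.info_keep_ratio

  logger = logging.getLogger("api")
  logger.setLevel(logging.INFO)
  handler = logging.StreamHandler()
  handler.addFilter(SamplingFilter(info_keep_ratio=0.01))   # ~1 in 100 INFO lines survive
  logger.addHandler(handler)

An adaptive variant would derive info_keep_ratio from the recent INFO rate instead of fixing it at a constant.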

Which log category should always be captured at 100% and never sampled?
What is the advantage of adaptive sampling over fixed-rate (random) log sampling?
33. What is the difference between push-based and pull-based metrics collection?

In a pull-based system, the monitoring server (like Prometheus) periodically initiates HTTP requests to each target's metrics endpoint and fetches the current metric values. The monitoring server controls the scrape interval and decides which targets to scrape.

In a push-based system, applications and agents send metrics to a central aggregation point as they are generated. StatsD, the InfluxDB line protocol, and AWS CloudWatch all use push models. Applications call a client library that buffers metrics and periodically flushes them to the aggregation server.

Pull advantages: The monitoring server always knows if a target is down (a scrape failure is itself a signal). No need to configure agents with the server's address. Easier to scale by adding new scrapers. Health of monitoring is transparent — you can check the scrape job.

Push advantages: Works naturally for short-lived jobs (batch, serverless functions, CI pipelines) that may finish before a pull can happen. Works when the monitored target is behind a firewall and cannot be reached by the monitoring server. Lower latency — metrics appear at the server as soon as they are generated, not on the next scrape cycle.

Hybrid approaches exist: Prometheus uses the Pushgateway for short-lived jobs, and the OpenTelemetry Collector accepts both push (OTLP) and pull (scraping Prometheus endpoints) depending on configuration. Many organizations use push for application telemetry and pull for infrastructure metrics.
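
A hedged sketch of the two models side by side, using the Python prometheus_client (endpoints, ports, and metric names are illustrative):

  from prometheus_client import (
      CollectorRegistry, Counter, start_http_server, push_to_gateway,
  )

  # Pull model: a long-running service exposes /metrics and Prometheus scrapes it
  # on whatever interval the scrape config defines.
  REQUESTS = Counter("app_requests_total", "Requests handled")

  def run_service():
      start_http_server(8000)          # scrape target at :8000/metrics
      REQUESTS.inc()                   # instrumented work would happen here

  # Push model: a short-lived batch job may exit before the next scrape,
  # so it pushes its metrics to a Pushgateway instead.
  def run_batch_job():
      registry = CollectorRegistry()
      processed = Counter("batch_records_processed_total",
                          "Records processed", registry=registry)
      processed.inc(42)
      push_to_gateway("pushgateway.example:9091", job="nightly_batch", registry=registry)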

In a pull-based monitoring system, how does the monitoring server detect that a target has gone down?
Why is push-based metrics collection preferred for AWS Lambda functions?
34. What is distributed systems observability and what challenges does it introduce compared to monolith observability?

Distributed systems observability refers to the ability to understand the internal state and behavior of a system that consists of multiple independently deployed services communicating over a network. Unlike a monolith — where all code runs in one process and profiling, logging, and debugging are straightforward — distributed systems introduce fundamental challenges that require purpose-built tooling.

Challenge 1 — No single log stream: A request touches 10 services; their logs are in 10 different places. Log correlation requires shared request IDs injected into every service's logs. Without structured logging and log aggregation, tracing a request manually is infeasible.
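
As a minimal sketch of that correlation (Python standard-library logging and contextvars; the field name and logger names are illustrative), each service stamps the shared request ID onto every log line it emits and forwards the same ID on outbound calls:

  import contextvars
  import logging
  import uuid

  request_id_var = contextvars.ContextVar("request_id", default="-")

  class RequestIdFilter(logging.Filter):
      def filter(self, record):
          record.request_id = request_id_var.get()   # attach the shared ID to every record
          return True

  handler = logging.StreamHandler()
  handler.setFormatter(logging.Formatter(
      "%(asctime)s %(levelname)s request_id=%(request_id)s %(name)s %(message)s"))
  handler.addFilter(RequestIdFilter())
  logging.getLogger().addHandler(handler)

  def handle_request(incoming_request_id=None):
      # Reuse the upstream ID if one arrived in the request headers; otherwise mint one.
      request_id_var.set(incoming_request_id or str(uuid.uuid4()))
      logging.getLogger("checkout").warning("payment provider timed out")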

Challenge 2 — Partial failures: In a monolith, a failure either crashes the process or it does not. In distributed systems, Service A can respond successfully while Service B, called internally by A, silently times out and returns a degraded result. These partial failures are invisible without distributed tracing and upstream error propagation.

Challenge 3 — Clock skew: Services run on different machines with clocks that drift. Log timestamps from different services cannot be naively sorted — a span ending at 10:00:00.001 on Service B might be recorded before a span starting at 10:00:00.000 on Service A due to clock drift. OpenTelemetry uses monotonic clocks within a single process and accepts some clock-skew inaccuracy across processes.

Challenge 4 — Attribution: When latency spikes, which of the 10 services caused it? Without traces linking spans causally, you are guessing. Distributed tracing was invented specifically to solve attribution in distributed systems.

In a distributed system, what mechanism allows logs from 10 different services to be correlated to a single request?
What type of failure is uniquely problematic in distributed systems but rare in a monolith?
35. What is Datadog and what differentiates it from open-source observability stacks?

Datadog is a cloud-based monitoring and observability platform that provides infrastructure monitoring, APM, log management, real user monitoring, synthetic testing, security monitoring, and more — all integrated in a single SaaS product. It is one of the dominant commercial observability platforms alongside New Relic, Dynatrace, and Elastic.

The key differentiators from an open-source stack (Prometheus + Grafana + Loki + Jaeger) include:

Unified correlation: Datadog stores metrics, logs, traces, and RUM data in a single platform with a shared data model. Jumping from a latency spike on a dashboard to the traces and logs for that exact time window is a single click. Open-source stacks require separate products that are manually integrated, and correlation often requires copy-pasting trace IDs across tools.

Long-term storage: Prometheus is not designed for multi-year retention at scale. Datadog stores metrics at full resolution for 15 months. Open-source solutions require adding Thanos or Cortex for long-term storage.

Automatic instrumentation and integrations: Datadog's Agent auto-discovers running processes and containers and enables integrations with hundreds of technologies (MySQL, Kafka, Redis, Kubernetes) with minimal configuration. Open-source requires manually deploying and maintaining separate exporters for each technology.

Cost: Datadog is significantly more expensive than self-hosted open-source, especially at scale. Pricing per host, per APM host, and per gigabyte of ingested logs can result in very large bills. Open-source stacks shift cost from licensing to operational engineering effort.

What is the key observability workflow advantage Datadog has over a typical Prometheus + Grafana + Jaeger open-source stack?
What open-source component is typically added to a Prometheus setup to provide multi-year metrics retention?
36. What is on-call rotation and what makes an on-call experience sustainable?

An on-call rotation is a scheduled arrangement where engineers take turns being the primary responder for production incidents outside normal business hours. When an alert fires, the on-call engineer receives a page (via PagerDuty, Opsgenie, or VictorOps) and is expected to acknowledge and begin investigating within a defined response time (typically 5–15 minutes).

On-call is sustainable when several conditions are met:

Low alert volume: If the on-call engineer is paged more than a few times per shift, something is wrong with the alerting system. Google's SRE book caps operational work (toil) at 50% of an SRE's time and recommends that engineers spend at most 25% of their time on-call. Sustained paging beyond those limits must trigger toil-reduction efforts.

Meaningful alerts: Every page should require a human decision. If an alert resolves itself without any action, it is either too sensitive or should auto-remediate. Pages that wake engineers at 3 AM for events that do not require action destroy morale and trust in the system.

Compensation: On-call work should be compensated — either financially (on-call pay) or with compensatory time off after a heavy on-call shift.

Escalation paths: The on-call engineer should not be alone. A clear secondary on-call, escalation contacts, and runbooks ensure that no single engineer is expected to know everything.

Post-incident investment: Each incident that required manual intervention is a toil-reduction opportunity. Sustainable on-call requires a cultural commitment to fix root causes rather than repeatedly firefighting the same issues.

According to Google's SRE principles, what percentage of an on-call engineer's time on operational/toil work should trigger remediation efforts?
What does it indicate if an on-call alert consistently resolves itself before the engineer takes any action?
37. What is continuous profiling and how does it differ from traditional profiling?

Continuous profiling is the practice of running lightweight profilers in production continuously (24/7), sampling CPU usage, memory allocations, goroutine counts, or mutex contention at low frequency, and storing the results in a queryable database. The key word is continuously — unlike traditional profiling, you do not need to predict when a performance problem will occur and manually attach a profiler to catch it.

Traditional profiling (using tools like JProfiler, YourKit, or Java Flight Recorder in triggered mode) is done on demand: a developer identifies a performance issue, attaches a profiler to the suspect process, reproduces the problem, and analyzes the profile. This works well in development but has two problems in production: the profiler overhead can be too high for continuous use (JProfiler in full instrumentation mode can add 20-200% overhead), and you cannot retroactively profile an incident that already passed.

Continuous profiling tools like Pyroscope (open-source), Parca (CNCF), Google Cloud Profiler, and Datadog Continuous Profiler use sampling-based profilers (typically around 100 Hz) that add roughly 1-5% overhead or less, making them safe for production. Results are stored with timestamps and labels, enabling queries like "show me the flame graph for the payment-service during last Tuesday's latency spike" — and letting you compare it directly against the same window from the previous week.

Continuous profiling connects naturally to the other observability pillars: when traces show a method is slow, the continuous profiler shows exactly which code path within that method consumes the time.
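
As a hedged sketch, assuming the open-source Pyroscope Python SDK (the pyroscope-io package) and an in-cluster server address, enabling continuous profiling is a one-time configuration call rather than an attach-on-demand workflow:

  import pyroscope

  pyroscope.configure(
      application_name="payment-service",        # how profiles are grouped in the UI
      server_address="http://pyroscope:4040",    # assumed Pyroscope server address
      tags={"region": "eu-west-1"},              # labels to filter and compare profiles by
  )
  # From here on, the agent samples call stacks continuously at low overhead;
  # no changes are needed around the code paths you later want to inspect.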

What is the key operational advantage of continuous profiling over triggered profiling during a production incident?
Why is sampling-based profiling preferred over full instrumentation profiling for continuous production use?
38. What is a flame graph and how do you read it?

A flame graph is a visualization of a stack trace profile that makes it easy to identify which functions consume the most CPU time, memory, or other resources. It was invented by Brendan Gregg (then at Joyent) to visualize profiler stack samples, such as perf(1) output on Linux systems.

Reading a flame graph:

Y-axis (vertical): Each row represents a stack frame. The bottom of the graph is the starting point (main, or the thread entry point). Moving upward, each row is the function called by the one below it. A tall column means a deep call stack — many nested function calls.

X-axis (horizontal): The width of each box represents the proportion of samples in which that function appeared in the call stack. A wider box means more time was spent in or below that function. The order within a row is sorted alphabetically, not temporally — left is not earlier.

Color: In most flame graph tools, color is used only for readability (to distinguish adjacent boxes). Red/orange flames suggest hotness in some tools (like speedscope), but this is cosmetic, not inherent to the format.

Finding the bottleneck: Look for wide boxes near the top of the graph — these are functions that appear frequently at the top of call stacks, meaning the CPU was executing them (not calling children). A wide box deep in the stack that has many narrow children indicates a dispatcher pattern, not necessarily a bottleneck.

Differential flame graphs compare two profiles (before and after a change) by coloring regressions red and improvements blue, making performance regressions visually obvious.

In a flame graph, what does the width of a function's box represent?
What does a differential flame graph use color coding to highlight?
39. What is the role of an observability platform in incident response?

An observability platform serves as the central nervous system of incident response. When an alert fires, the on-call engineer opens the platform and uses it through every phase of the incident lifecycle.

Detection phase: Alerts integrated with PagerDuty or Opsgenie fire when SLO burn rates exceed thresholds. The alert links directly to a dashboard showing the incident's scope: which services are affected, since when, and how much error budget has burned.

Triage phase: The engineer uses the platform to scope the blast radius. Dashboards show whether the issue is isolated to one region, one service version, or one dependency. Service maps (topology graphs) in Datadog, Dynatrace, or Grafana show real-time dependency health.

Diagnosis phase: The engineer pivots from the metric anomaly to distributed traces for that time window. Traces show which service added unexpected latency and where in the call chain. From a suspicious span, the engineer pivots to structured logs for that trace ID to see the exact error message and stack trace.

Mitigation phase: Feature flag systems (LaunchDarkly, Unleash) integrated with the observability platform let engineers disable a feature and immediately see the impact on error rate in the same dashboard. Deployment rollback triggers are linked from incident management tools.

Resolution verification: After mitigation, the platform provides the confirmation signal — SLO burn rate drops back to baseline, error rate returns to normal, traces show clean spans. The engineer can close the incident confidently based on data, not hope.

After deploying a hotfix during an incident, how should an engineer use the observability platform to confirm resolution?
What type of observability visualization shows the real-time dependency topology between microservices during triage?
40. What is OpenMetrics and how does it relate to Prometheus exposition format?

OpenMetrics is a specification for transmitting metrics at scale that evolved from the Prometheus text exposition format. It was accepted as a CNCF sandbox project and aims to be the standard for metrics exposition across the industry, not just within the Prometheus ecosystem.

The original Prometheus text format is simple: each line contains a metric name, label set, value, and optional timestamp. OpenMetrics extends this format with:

  • A required final EOF marker (# EOF) that allows parsers to detect incomplete responses.
  • Exemplars: Structured sample annotations that attach trace IDs to specific metric observations. For example, a histogram bucket observation can carry the trace ID of the request that fell into that bucket, enabling one-click navigation from a latency spike in a metric to the exact trace that caused it. This is the bridge between metrics and traces.
  • Mandatory type and unit metadata: Stronger requirements for # TYPE and # UNIT annotations make the format more self-describing.
  • Native support for created timestamps: the _created series records when a metric came into existence, which helps distinguish a genuinely new series from a counter reset.

Prometheus 2.x supports both the original text format and OpenMetrics (content negotiation via the Accept header). Most modern Prometheus client libraries can expose either format. The key practical feature that OpenMetrics enables is exemplars, which Grafana and Datadog can display as clickable trace links directly on metric graphs.
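
As a hedged sketch (Python prometheus_client with exemplar support; the metric name and trace ID are placeholders), an individual observation can carry the trace ID of the request that produced it, and the exemplar appears only when the OpenMetrics format is served:

  from prometheus_client import REGISTRY, Histogram
  from prometheus_client.openmetrics.exposition import generate_latest

  LATENCY = Histogram("checkout_latency_seconds", "Checkout latency",
                      buckets=[0.1, 0.5, 1.0])

  # The trace ID would normally come from the active span; here it is a placeholder.
  LATENCY.observe(0.73, exemplar={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"})

  # Exemplars are emitted only in the OpenMetrics exposition, roughly:
  #   checkout_latency_seconds_bucket{le="1.0"} 1.0 # {trace_id="..."} 0.73
  print(generate_latest(REGISTRY).decode())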

What OpenMetrics feature enables direct navigation from a metric data point to the specific distributed trace that caused it?
How does Prometheus select between the original text exposition format and OpenMetrics when scraping an endpoint?
41. What is a dead man's switch alert and when should you use it?

A dead man's switch alert (also called a heartbeat alert or watchdog alert) is an alert that fires when it stops receiving a signal, rather than when it detects a problem. The pattern inverts the usual alerting logic: instead of "alert when metric X exceeds threshold Y", it says "alert if I have not heard from system X in the past N minutes."

The canonical use case is monitoring your monitoring system. If Prometheus crashes, it cannot emit metrics, so all your normal alerts go silent — and you would never know. A dead man's switch in an external system (Alertmanager's Watchdog alert, PagerDuty's dead man's switch feature, or a separate uptime monitor like Better Uptime or StatusCake) expects a regular "I'm alive" ping from your monitoring system every N minutes. If the ping stops, the external system fires an alert.

Other use cases:

  • Scheduled batch jobs: Alert if the nightly ETL pipeline does not emit a completion metric within 2 hours of its scheduled start time.
  • Queue consumers: Alert if a Kafka consumer stops consuming (no heartbeat emitted) — possibly indicating it is deadlocked or crashed without surfacing an error.
  • Certificate renewal jobs: Ensure the cert-renewal cron job emits a success metric within 24 hours of its expected run time.

In the Prometheus ecosystem, the standard kube-prometheus rule set ships a Watchdog alert (expr: vector(1)) that fires continuously while the alerting pipeline is healthy. Routing this always-firing alert to an external dead man's switch service (such as Dead Man's Snitch or PagerDuty's equivalent) closes the loop: if the notification stream stops, the external service pages you.
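
For the scheduled-job use case, a minimal sketch using the Python prometheus_client and a Pushgateway (the metric name, gateway address, and two-hour window are illustrative): the job records a success heartbeat, and an alert fires only when that heartbeat goes stale.

  from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

  registry = CollectorRegistry()
  last_success = Gauge(
      "etl_last_success_timestamp_seconds",
      "Unix time of the last successful ETL run",
      registry=registry,
  )

  def run_nightly_etl():
      ...                                # the actual pipeline logic

  run_nightly_etl()
  last_success.set_to_current_time()     # heartbeat: only reached on success
  push_to_gateway("pushgateway.example:9091", job="nightly_etl", registry=registry)

  # A Prometheus alerting rule (expression only) then fires when the heartbeat is missing:
  #   time() - etl_last_success_timestamp_seconds > 2 * 3600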

Why is a dead man's switch alert necessary for your monitoring infrastructure itself?
For a nightly ETL pipeline scheduled at midnight, what dead man's switch condition would be appropriate?
42. What is Thanos and how does it extend Prometheus for large-scale deployments?

Thanos is an open-source, CNCF incubating project that extends Prometheus to provide highly available, long-term metrics storage at scale. It was created at Improbable and is now maintained by a broad community of contributors, and it addresses two fundamental limitations of standalone Prometheus: single-node storage limits and multi-cluster query federation.

Thanos's architecture uses a sidecar pattern: a Thanos Sidecar runs alongside each Prometheus server. It uploads completed TSDB blocks to an object store (S3, GCS, Azure Blob) every 2 hours. This provides unlimited long-term retention without changing how Prometheus works internally — Prometheus still handles recent data (last 2 hours) locally.

The Thanos Store Gateway makes historical blocks in object storage queryable by implementing the same gRPC StoreAPI that the Thanos Sidecar exposes in front of Prometheus. The Thanos Querier is a global query layer that fans out PromQL queries to multiple Prometheus instances and Thanos Store Gateways simultaneously, deduplicating results from replicated Prometheus servers (using the --query.replica-label flag).

The Thanos Compactor runs in the background to downsample old blocks (by default, 5-minute resolution for blocks older than roughly 40 hours and 1-hour resolution for blocks older than roughly 10 days) and delete expired blocks according to retention policies, keeping object storage costs manageable.

The Thanos Ruler runs recording rules and alerting rules against the global Thanos view, enabling cross-cluster alerting rules that a single Prometheus instance cannot evaluate.

How does the Thanos Sidecar move historical data from Prometheus to object storage?
What Thanos component handles downsampling of old metric blocks to reduce storage cost?
43. How does observability apply to event-driven and asynchronous architectures?

Observability in event-driven architectures (EDA) — systems built around message queues like Apache Kafka, RabbitMQ, or AWS SQS — presents distinct challenges because requests do not follow a synchronous request-response path. A single business transaction might produce events consumed by multiple services asynchronously, making traditional HTTP-trace-based observability incomplete.

Message tracing: The core technique is propagating trace context through message headers. Just as HTTP requests carry traceparent headers, Kafka messages carry trace context in their headers map. When a consumer reads a message and creates a child span, it extracts the producer's trace ID from the message headers. OpenTelemetry's Kafka instrumentation handles this automatically, enabling end-to-end traces that span the Kafka boundary.
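
As a hedged sketch using OpenTelemetry's Python propagation API (the producer/consumer calls follow kafka-python conventions and are illustrative), the producer serializes its context into the headers and the consumer restores it before starting its span:

  from opentelemetry import trace
  from opentelemetry.propagate import inject, extract

  tracer = trace.get_tracer("checkout")

  def publish(producer, topic, value):
      carrier = {}
      inject(carrier)                                        # writes the traceparent header
      headers = [(k, v.encode()) for k, v in carrier.items()]
      producer.send(topic, value=value, headers=headers)     # illustrative kafka-python call

  def consume(message):
      carrier = {k: v.decode() for k, v in (message.headers or [])}
      ctx = extract(carrier)                                 # restore the producer's context
      with tracer.start_as_current_span("process-order", context=ctx):
          ...                                                # child of the producer's span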

Consumer lag monitoring: In Kafka, consumer lag (the difference between the latest offset and the consumer group's committed offset) is the primary signal of throughput problems. A growing lag means the consumer is falling behind producers. Kafka's JMX metrics and the consumer group lag metrics exported by the Kafka Exporter for Prometheus (kafka_consumergroup_lag) are essential.

Message queue depth: In SQS or RabbitMQ, queue depth (number of messages waiting) and message age (oldest message waiting time) signal consumer health and backpressure.

Poison pill detection: Messages that consistently fail processing and end up in dead-letter queues (DLQs) must be monitored. A growing DLQ count with no alert is a silent data loss scenario.

How is trace context propagated across a Kafka message boundary in an event-driven architecture?
What does growing Kafka consumer lag indicate in terms of system health?
44. What is the difference between an alert and a notification in observability?

The terms "alert" and "notification" are often used interchangeably but represent different stages in the incident response pipeline. Understanding the distinction helps design more effective on-call systems.

An alert is the detection event itself — the result of evaluating a rule against metric or log data and finding that a condition is satisfied. In Prometheus, an alerting rule defines a PromQL expression and a duration threshold. When the expression is continuously true for the specified duration (e.g., 5 minutes), Prometheus changes the alert state from inactive to pending to firing. The alert is an internal state within the monitoring system.

A notification is how the alert is communicated to a human or another system. Alertmanager receives firing alerts from Prometheus, applies grouping, inhibition, and silencing, and then routes them to receivers — Slack channel, PagerDuty incident, email, or webhook. The notification is the downstream artifact of the alert.

This two-stage architecture is important because it allows sophisticated routing: the same alert can send a low-severity Slack message during business hours and a PagerDuty page at night. Alerts can be silenced during maintenance windows (suppressing notifications) without disabling the alerting rule. Multiple alerts can be grouped into a single notification to reduce noise.

Alertmanager also handles deduplication: if Prometheus sends the same alert 100 times (once per evaluation cycle), Alertmanager fires only one notification and re-notifies only after a configured repeat interval or when the alert recovers.

What Alertmanager feature prevents the same alert from generating hundreds of PagerDuty incidents during a prolonged outage?
How can you suppress Alertmanager notifications during a scheduled maintenance window without disabling the underlying alerting rule?
45. What is observability-driven development (ODD) and how does it shift monitoring left?

Observability-driven development (ODD) is a practice where engineers write instrumentation — metrics, logs, and trace spans — as a first-class part of feature development, not as an afterthought added after a service is deployed. The principle is "if you cannot observe it, you cannot reason about it in production", so instrumentation ships with features.

The shift-left metaphor comes from moving activities earlier in the development lifecycle. Traditional monitoring is bolted on post-deployment: ops teams add dashboards after a service is already in production and has caused its first incident. ODD moves this to the code review stage: observability is a requirement for merging, just like unit tests.

In practice, ODD includes:

  • Instrumentation in definition of done: A feature is not "done" until it has metrics for rate, error, and duration; structured log statements at key decision points; and trace spans for every external call.
  • Dashboard-first design: Engineers sketch what they want to see in production before writing the feature code, then instrument to produce those signals.
  • Local observability testing: Developers run Grafana and Loki locally (via docker-compose) and verify their instrumentation works before pushing to CI. Tools like Tilt and Skaffold enable local Kubernetes observability environments.
  • SLO definition at design time: The SLI and SLO for a new feature are defined before implementation, guiding what to instrument and how to alert.

ODD reduces MTTD for new features because the monitoring is ready from day one, rather than being retrofitted after the first production incident reveals it was missing.

In observability-driven development, when must instrumentation be completed relative to feature deployment?
What does the dashboard-first design principle in ODD mean for a developer writing a new feature?