BigData / Data Lake Interview questions
What is Kappa Architecture and when should it be used?
Kappa Architecture is a simplification of Lambda Architecture that eliminates the batch processing layer and uses a single stream processing path for both real-time and historical data. Proposed by Jay Kreps (co-creator of Apache Kafka), it rests on the argument that maintaining two separate code paths is unnecessarily complex when modern streaming systems can handle both use cases.
Core Principle: Everything is a stream. Historical data is just a stream you can replay.
Kappa Architecture Layers:
1. Streaming Layer: All data flows through a replayable log (like Apache Kafka). This log acts as both the real-time ingestion pipeline and the source of truth for reprocessing.
2. Serving Layer: Materialized views and aggregates computed from streams are stored in databases optimized for querying (Cassandra, DynamoDB, Elasticsearch, or data lake tables).
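As a rough illustration of the two layers, the sketch below models the log as an in-memory append-only list and the serving layer as a dictionary of counts. The names EventLog, PageViewCounts, and the "url" field are hypothetical and chosen only for this example; a real deployment would use a Kafka topic and a database instead.

```python
from collections import defaultdict


class EventLog:
    """Append-only, replayable log (an in-memory stand-in for a Kafka topic)."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1  # offset of the event just written

    def read(self, from_offset=0):
        # Yield (offset, event) pairs starting at from_offset.
        for offset in range(from_offset, len(self._events)):
            yield offset, self._events[offset]


class PageViewCounts:
    """Serving-layer view: page views per URL, materialized from the log."""

    def __init__(self):
        self.counts = defaultdict(int)
        self.next_offset = 0  # first offset that has not been applied yet

    def catch_up(self, log):
        # Apply every event the view has not seen; safe to call repeatedly.
        for offset, event in log.read(self.next_offset):
            self.counts[event["url"]] += 1
            self.next_offset = offset + 1
```

Because the view remembers only an offset, it can be discarded and rebuilt at any time by reading the log again from offset 0, which is the property the rest of the architecture depends on.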
How Kappa Handles Batch Requirements:
When batch-like processing is needed (algorithm changes, bug fixes, or schema evolution), Kappa Architecture replays the entire event stream through the new version of the processing logic. The stream processing job creates new output views while the old views continue serving queries. Once the new views catch up, traffic switches to them, and old views are decommissioned.
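The replay-and-switch sequence described above can be sketched as follows. This is a minimal in-memory model, not a production pattern: build_view, count_v1, count_v2, ServingLayer, and the "is_bot" field are all hypothetical names, and the "bug" being fixed (counting bot traffic) is invented for illustration.

```python
def build_view(events, process):
    """Replay every event through `process` and return the resulting view."""
    view = {}
    for event in events:
        process(view, event)
    return view


def count_v1(view, event):
    # Original logic: counts every event, including bot traffic.
    view[event["url"]] = view.get(event["url"], 0) + 1


def count_v2(view, event):
    # Corrected logic: bot traffic is skipped.
    if event.get("is_bot"):
        return
    view[event["url"]] = view.get(event["url"], 0) + 1


class ServingLayer:
    """Answers queries from whichever view is currently active."""

    def __init__(self, view):
        self.active = view

    def query(self, url):
        return self.active.get(url, 0)

    def switch_to(self, new_view):
        # Swap the active view and hand back the old one for decommissioning.
        old_view, self.active = self.active, new_view
        return old_view


events = [
    {"url": "/home"},
    {"url": "/home", "is_bot": True},
    {"url": "/home"},
    {"url": "/cart"},
]

serving = ServingLayer(build_view(events, count_v1))
new_view = build_view(events, count_v2)   # replay runs alongside the old view
before_switch = serving.query("/home")    # old view still answers queries
old_view = serving.switch_to(new_view)    # cut over once the replay is done
after_switch = serving.query("/home")
```

The old view stays untouched until the switch, so a failed or incorrect replay can be abandoned without affecting queries.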
Requirements for Kappa Architecture:
- Replayable Streams: Event logs must retain data long enough for full reprocessing (Kafka with appropriate retention policies)
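- Deterministic Processing: Stream processing must produce the same results each time the same data is replayed (for example, by relying on event time rather than processing time)
- Mature Streaming Framework: Requires a framework with stateful processing capabilities, such as Apache Flink, Kafka Streams, or Spark Structured Streaming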
- Fast Reprocessing: Must be able to replay historical data quickly enough to meet business requirements
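To make the Deterministic Processing requirement concrete, here is a small sketch of tumbling-window counting that depends only on data carried by each event. The function name window_counts and the "ts" and "key" fields are assumptions made for this example; "ts" is taken to be an event timestamp in whole seconds.

```python
from collections import Counter


def window_counts(events, window_seconds=60):
    """Count events per (window_start, key) using only each event's timestamp."""
    counts = Counter()
    for event in events:
        # The window is derived from the event's own timestamp, never from the
        # wall clock, so a replay places each event in the same window as the
        # original run did.
        window_start = event["ts"] - event["ts"] % window_seconds
        counts[(window_start, event["key"])] += 1
    return dict(counts)


events = [
    {"ts": 5, "key": "a"},
    {"ts": 59, "key": "a"},
    {"ts": 60, "key": "a"},
    {"ts": 61, "key": "b"},
]

first_run = window_counts(events)
replayed = window_counts(events)  # same input, same output
```

Had the window been computed from the time at which the job happened to process each event, a replay run months later would assign every event to a different window and produce different aggregates.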
When to Use Kappa Architecture:
- Streaming-first use cases (IoT, real-time analytics, event-driven applications)
- When historical reprocessing is infrequent
- When the development team's expertise is in streaming technologies
- When simplicity and single code path are prioritized
- When data volumes allow reasonable replay times (minutes to hours, not days)
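Whether replay time lands in the "minutes to hours" range can be estimated with simple arithmetic before committing to Kappa. The helper below is a back-of-the-envelope sketch; the function name replay_seconds and all figures used with it are hypothetical.

```python
def replay_seconds(backlog_events, replay_rate, ingest_rate=0.0):
    """Estimate seconds until a replaying job reaches the head of the log.

    New events keep arriving during the replay, so the backlog shrinks at
    only (replay_rate - ingest_rate) events per second.
    """
    net_rate = replay_rate - ingest_rate
    if net_rate <= 0:
        raise ValueError("replay must be faster than ingestion to ever catch up")
    return backlog_events / net_rate


# Illustrative numbers only: 10 billion retained events, replayed at
# 500,000 events/s while 100,000 new events/s continue to arrive.
seconds = replay_seconds(10_000_000_000, 500_000, ingest_rate=100_000)
hours = seconds / 3600
```

With these made-up figures the replay takes 25,000 seconds, a little under 7 hours. If the same estimate for a real workload comes out in days, that points toward the Lambda or hybrid options instead.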
When Lambda Architecture May Be Better:
- Very large historical datasets where stream replay is impractical
- Complex batch algorithms that don't translate well to streaming
- When batch and streaming require fundamentally different processing logic
- Regulatory requirements for immutable batch archives
Modern Hybrid Approach:
Many organizations use a pragmatic hybrid: stream processing for real-time needs, with periodic batch jobs for complex analytics or ML model training. Data lakehouses enable this by providing unified storage that both streaming and batch engines can access, avoiding the either/or choice between Lambda and Kappa.
