Big Data / Data Lake Interview Questions
What file formats are best suited for Data Lakes and why?
Choosing the right file format is crucial for data lake performance, storage efficiency, and query speed. The three dominant formats for analytics workloads are Parquet, ORC, and Avro, each optimized for different use cases.
| Format | Storage | Compression | Best For | Ecosystem |
|---|---|---|---|---|
| Parquet | Columnar | Excellent (Snappy, GZIP, LZ4) | Read-heavy analytics, BI queries | Spark, Hive, Presto, Athena |
| ORC | Columnar | Excellent (ZLIB, Snappy, LZO) | Hive-based workloads, complex types | Hive, Spark, Presto |
| Avro | Row-based | Good (Snappy, Deflate) | Streaming, schema evolution, write-heavy | Kafka, Flink, Spark Streaming |
Apache Parquet is the most widely adopted format for data lakes. As a columnar format, Parquet stores data by column rather than by row, enabling highly efficient compression (often 75% reduction) and query performance. When queries select specific columns, Parquet reads only those columns, dramatically reducing I/O. Parquet supports complex nested structures, predicate pushdown, and column pruning, making it ideal for analytical workloads. It integrates seamlessly with Spark, Athena, BigQuery, and Redshift Spectrum.
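The compression advantage of columnar storage can be illustrated without any Parquet library at all. The stdlib-only sketch below (synthetic data, zlib standing in for Snappy/GZIP) compares compressing the same records laid out row-wise versus column-wise; grouping a column's repetitive values together is what lets columnar formats compress so effectively:

```python
import json
import zlib

# Synthetic events: values within a column are repetitive, as in real analytics data.
rows = [
    {"user_id": i % 100, "country": "US" if i % 3 else "DE", "amount": round((i % 1000) * 0.01, 2)}
    for i in range(10_000)
]

# Row-oriented layout (Avro-like): whole records stored together, keys repeated per record.
row_bytes = json.dumps(rows).encode()

# Column-oriented layout (Parquet-like): each column's values stored contiguously.
columns = {key: [r[key] for r in rows] for key in rows[0]}
col_bytes = json.dumps(columns).encode()

row_compressed = len(zlib.compress(row_bytes))
col_compressed = len(zlib.compress(col_bytes))
print(f"row layout compressed:    {row_compressed} bytes")
print(f"column layout compressed: {col_compressed} bytes")
```

The column layout compresses to a fraction of the row layout's size because long runs of similar values (all `country` strings together, all `user_id` integers together) are exactly what dictionary- and run-length-style compressors exploit. Parquet applies this per column with per-column encodings, which is why 70%+ size reductions are common.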
Apache ORC (Optimized Row Columnar) is similar to Parquet but originated in the Hive ecosystem. ORC provides slightly better compression than Parquet and includes built-in indexes for fast lookups. ORC excels at handling complex nested types and supports ACID transactions natively in Hive. While less portable than Parquet, ORC remains popular in Hadoop-centric environments.
Apache Avro is a row-based format optimized for write-heavy and streaming workloads. Unlike columnar formats, Avro stores complete rows together, making it efficient for full-row reads and writes. Avro's killer feature is schema evolution—schemas are embedded in files, allowing readers and writers with different schema versions to communicate. This makes Avro ideal for Kafka streaming, CDC pipelines, and scenarios where schemas change frequently.
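Avro's reader/writer schema resolution can be sketched in plain Python. This is a simplified stand-in (real pipelines would use an Avro library such as `fastavro`, and the schema dicts below are illustrative, not full Avro schema syntax), but it captures the core rule: a field missing from old data is filled from the reader schema's default:

```python
# Reader schema (v2): a consumer that added an optional field with a default.
reader_schema = {
    "fields": [
        {"name": "id"},
        {"name": "name"},
        {"name": "email", "default": None},  # new field, defaulted for old records
    ]
}

def resolve(record, reader):
    """Avro-style resolution: fields absent from the record are filled from reader defaults."""
    out = {}
    for field in reader["fields"]:
        if field["name"] in record:
            out[field["name"]] = record[field["name"]]
        elif "default" in field:
            out[field["name"]] = field["default"]
        else:
            raise ValueError(f"no value or default for field {field['name']}")
    return out

old_record = {"id": 1, "name": "Ada"}      # written under the old v1 schema
print(resolve(old_record, reader_schema))  # → {'id': 1, 'name': 'Ada', 'email': None}
```

Because the writer's schema is embedded in every Avro file, a reader can always perform this resolution, which is what lets producers and consumers upgrade independently in Kafka and CDC pipelines.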
When to Use Each Format:
- Parquet: Default choice for analytics, BI, and data warehousing workloads
- ORC: Hive-based systems, ACID requirements, complex nested data
- Avro: Streaming ingestion, schema evolution, Kafka integration, archival storage
Modern data lakes often use a hybrid approach: ingest data in Avro for flexibility, then convert to Parquet for analytics in Silver/Gold layers.
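The Bronze-to-Silver conversion above amounts to pivoting row-oriented records into a columnar layout. A real pipeline would do this with Spark or a similar engine; the stdlib-only sketch below just shows the shape of the transformation and why the columnar result serves analytics better (a query touching one field reads one column, not every record):

```python
# Bronze layer: row-oriented records as they arrive from streaming ingestion (Avro-like).
bronze = [
    {"event": "click", "user": "u1", "ts": 100},
    {"event": "view",  "user": "u2", "ts": 101},
    {"event": "click", "user": "u1", "ts": 102},
]

def to_columnar(records):
    """Pivot row records into a column-oriented table (Parquet-like layout)."""
    return {key: [r[key] for r in records] for key in records[0]}

silver = to_columnar(bronze)

# An analytics query that touches only "event" now scans a single column:
click_count = silver["event"].count("click")
print(click_count)  # → 2
```

In practice the pivot, compression, and file writing are all handled by the engine; the point is that ingest-friendly row layout and query-friendly column layout are two views of the same data, converted once between layers.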
