BigData / Apache Parquet Interview Questions
What is the small file problem in Parquet-based data lakes, and how do you solve it?
The small file problem occurs when a Parquet dataset accumulates thousands of tiny files (often from streaming writes or over-partitioning). Each file requires a separate HDFS/S3 metadata operation and a separate footer read, causing significant overhead.
Effects:
- Slow query planning — the driver/namenode must enumerate and open many files.
- Weak Parquet statistics — tiny row groups yield min/max stats that prune little data, so predicate pushdown skips almost nothing.
- High object-store API costs (S3 LIST + GET per file).
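A quick way to check whether a dataset suffers from this is to count the files Spark actually reads for it. A minimal PySpark sketch, using a hypothetical `s3://my-bucket/events/` path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("small-file-check").getOrCreate()

# Hypothetical dataset path; substitute your own.
df = spark.read.parquet("s3://my-bucket/events/")

# inputFiles() returns a best-effort list of the files backing the DataFrame.
files = df.inputFiles()
print(f"{len(files)} Parquet files back this dataset")
# Thousands of files for a modestly sized table is a strong small-file signal.
```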
Solutions:
- Compaction job — periodically coalesce small files into larger ones. In Spark: `spark.read.parquet(path).coalesce(N).write.mode("overwrite").parquet(path)`. Reading and overwriting the same path in one job is unsafe in practice; see the safer sketch after this list.
- Delta Lake / Iceberg OPTIMIZE — `OPTIMIZE my_table;` rewrites small files into larger ones.
- Streaming micro-batch tuning — increase the trigger interval so each micro-batch writes fewer, larger files, or cap per-batch input with `maxFilesPerTrigger`.
- Hudi MOR → COW compaction — compact merge-on-read log files into base Parquet files.
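The one-line compaction above is only a sketch: Spark plans lazily, so overwriting a path while it is still being read either fails ("Cannot overwrite a path that is also being read from") or risks data loss. A safer pattern, shown below with hypothetical paths and a hypothetical `target_files` count, compacts into a staging location first:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-compaction").getOrCreate()

src = "s3://my-bucket/events/"          # hypothetical source path
staging = "s3://my-bucket/events_tmp/"  # hypothetical staging path
target_files = 32                       # aim for roughly 128 MB-1 GB per file

# repartition() shuffles, producing evenly sized output files;
# coalesce() skips the shuffle but can leave skewed partition sizes.
(spark.read.parquet(src)
      .repartition(target_files)
      .write.mode("overwrite")
      .parquet(staging))

# Swap the staging directory into place with filesystem/catalog tooling;
# the swap step is omitted here because it depends on your storage layer.
```

If the table is a Delta table, the equivalent from Python (delta-spark 2.0+) is `DeltaTable.forPath(spark, src).optimize().executeCompaction()`, which is what `OPTIMIZE my_table` runs under the hood.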
