BigData / Data Lake Interview questions
What is data compaction and why is it important in Data Lakes?
Data compaction merges many small files into fewer large files, addressing the 'small files problem' that plagues data lakes and degrades performance. Distributed processing systems like Spark and Hive struggle with millions of small files due to metadata overhead and inefficient parallelization.
The Small Files Problem:
Frequent writes, especially from streaming jobs, create large numbers of small files. The HDFS NameNode must hold metadata for every file in memory. Query engines typically schedule at least one task per file or split, so millions of files can overwhelm the scheduler. File open and close operations come to dominate processing time, and the number of concurrent connections to storage can hit system limits.
Impact: Slower queries (degradation of up to 100x is possible), metadata memory exhaustion, scheduler overload, and inflated S3 LIST API costs.
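To make the scale of the problem concrete, here is a rough back-of-the-envelope sketch in Python. The function names are hypothetical, and the figure of about 150 bytes of NameNode heap per namespace object (file or block) is a commonly cited rule of thumb, treated here as an assumption rather than a measured value.

```python
import math

MB = 1024 ** 2
TB = 1024 ** 4

def file_count(total_bytes: int, avg_file_bytes: int) -> int:
    """Number of files needed to hold total_bytes at a given average file size."""
    return math.ceil(total_bytes / avg_file_bytes)

def namenode_bytes(files: int, blocks_per_file: int = 1,
                   bytes_per_object: int = 150) -> int:
    """Rough NameNode heap estimate: one object per file plus one per block.

    bytes_per_object=150 is an assumed rule of thumb, not an exact figure.
    """
    return files * (1 + blocks_per_file) * bytes_per_object

dataset = 1 * TB
small_files = file_count(dataset, 1 * MB)        # 1 MB files
compacted_files = file_count(dataset, 256 * MB)  # 256 MB files

print("files at 1 MB each:  ", small_files)
print("files at 256 MB each:", compacted_files)
print("reduction factor:    ", small_files // compacted_files)
print("est. NameNode bytes for small files:", namenode_bytes(small_files))
```

Under these assumptions, the same 1 TB dataset needs 256 times fewer files after compaction, and a query engine that schedules one task per file has correspondingly fewer tasks to plan.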
Compaction Strategies:
1. Scheduled Compaction: Periodically merge small files. Examples: the Delta Lake OPTIMIZE command, Hudi compaction jobs, and the Iceberg rewrite_data_files operation.
2. Automatic Compaction: Compaction is triggered as part of the write path. Examples: Hudi auto-compaction and Delta Auto Optimize on write.
3. Z-Ordering/Data Clustering: During compaction, co-locate related data in the same files so queries can skip more data. Examples: Delta ZORDER BY and BigQuery clustering.
4. Bin-Packing: Rewrite files to target optimal size (128MB-1GB for Parquet).
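The bin-packing idea can be sketched in plain Python. This is a minimal illustration using a first-fit-decreasing heuristic, not the algorithm that Delta Lake, Iceberg, or Hudi actually use; `plan_compaction` and the sample file sizes are hypothetical.

```python
from typing import List

MB = 1024 ** 2

def plan_compaction(file_sizes: List[int], target_bytes: int) -> List[List[int]]:
    """Group file sizes into bins of at most target_bytes (first-fit decreasing).

    Each returned bin represents one output file to be written by compaction.
    Files already at or above the target get a bin of their own.
    """
    bins: List[List[int]] = []
    free: List[int] = []  # remaining capacity of each bin
    for size in sorted(file_sizes, reverse=True):
        if size >= target_bytes:
            bins.append([size])
            free.append(0)
            continue
        for i, capacity in enumerate(free):
            if size <= capacity:
                bins[i].append(size)
                free[i] -= size
                break
        else:
            # No existing bin has room, so open a new one.
            bins.append([size])
            free.append(target_bytes - size)
    return bins

sizes = [100 * MB, 60 * MB, 40 * MB, 30 * MB, 20 * MB, 6 * MB]
plan = plan_compaction(sizes, target_bytes=128 * MB)
for group in plan:
    print([s // MB for s in group], "->", sum(group) // MB, "MB")
```

Here six input files are planned into three output files. A real compaction job would also have to respect partition boundaries and rewrite the files atomically, which this sketch leaves out.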
When to Compact: After streaming ingestion sessions, when average file size drops below threshold, before heavy query workloads, during maintenance windows.
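The threshold-based trigger could look roughly like the following sketch. The function name and both threshold values are assumptions chosen for illustration; suitable values depend on the workload and storage system.

```python
from statistics import mean
from typing import Sequence

MB = 1024 ** 2

def should_compact(file_sizes: Sequence[int],
                   min_avg_bytes: int = 64 * MB,
                   max_files: int = 1000) -> bool:
    """Return True when a partition looks worth compacting.

    Triggers when the average file size falls below min_avg_bytes or the
    file count exceeds max_files. Both defaults are illustrative only.
    """
    if len(file_sizes) < 2:
        return False  # nothing to merge
    return mean(file_sizes) < min_avg_bytes or len(file_sizes) > max_files

streaming_partition = [5 * MB] * 200   # many small files from streaming writes
batch_partition = [256 * MB] * 10      # already well-sized files

print(should_compact(streaming_partition))
print(should_compact(batch_partition))
```

A scheduler could run a check like this per partition during a maintenance window and submit compaction only where it returns True.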
Best Practices: Target 128MB-1GB file sizes, compact during low-usage periods, monitor file counts, use Delta/Iceberg/Hudi for built-in compaction, automate compaction scheduling.
