BigData / Apache Parquet Interview Questions
What are best practices for writing Parquet files in production?
Producing high-quality Parquet files that perform well at query time requires attention at write time:
- Target 128 MB–512 MB row groups — too small bloats footer metadata and per-group overhead; too large makes min/max predicate skipping coarse.
- Sort data before writing on filter columns — tight min/max ranges per row group dramatically improve skipping.
- Choose the compression codec to match the bottleneck — ZSTD compresses harder and suits I/O-bound workloads; Snappy decompresses faster and suits CPU-bound ones.
- Enable dictionary encoding on low-cardinality columns — automatic in most frameworks but verify it is not being disabled.
- Partition on moderate-cardinality columns (e.g., date, country) — never on user IDs or UUIDs.
- Avoid tiny files — compact regularly if streaming or incremental writes produce many small files.
- Enable Bloom filters on high-cardinality equality columns (UUIDs, hashed IDs).
- Embed correct schema types — use TIMESTAMP_MICROS not INT96 (deprecated); use DECIMAL not DOUBLE for monetary values.
- Test with query benchmarks after schema changes to confirm no regression in pushdown effectiveness.
