BigData / Apache Parquet Interview Questions
How do you handle schema mismatches when merging multiple Parquet files?
When a dataset is composed of Parquet files written at different times (possibly with different schemas), you have several options:
1. Spark mergeSchema — the simplest approach; Spark reads the footer of every file, takes the union of their schemas, and fills columns missing from a given file with null:
df = spark.read.option("mergeSchema", "true").parquet("s3://datalake/events/")
2. AWS Glue Schema Registry — version schemas centrally; consumers register against a schema version and handle evolution rules explicitly.
3. Table formats (Iceberg / Delta / Hudi) — track schema history at the table level; ALTER TABLE ADD COLUMN is applied transactionally without rewriting files.
4. Manual reconciliation — use df.schema to inspect each file's schema, build a unified schema, then apply df.select(unified_cols) with lit(None).cast(dtype) for each missing column before unionByName.
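The manual-reconciliation step can be sketched without a Spark cluster by modeling each file's schema as a plain dict of column name to type string (a hypothetical stand-in for Spark's StructType, not a Spark API). The sketch builds the unified schema and reports, per file, which columns must be added as typed nulls before the union:

```python
def unify_schemas(schemas):
    """Merge per-file schemas into one unified schema.

    Raises ValueError on a type conflict, since a conflict needs an
    explicit cast decision rather than a silent winner.
    """
    unified = {}
    for schema in schemas:
        for col, dtype in schema.items():
            if col in unified and unified[col] != dtype:
                raise ValueError(f"type conflict on {col}: {unified[col]} vs {dtype}")
            unified.setdefault(col, dtype)
    return unified


def missing_columns(schema, unified):
    """Columns a given file lacks — the ones to fill with lit(None).cast(dtype)."""
    return {col: dtype for col, dtype in unified.items() if col not in schema}


# Two hypothetical file schemas: file_b added a user_id column later.
file_a = {"event_id": "string", "ts": "timestamp"}
file_b = {"event_id": "string", "ts": "timestamp", "user_id": "long"}

unified = unify_schemas([file_a, file_b])
print(missing_columns(file_a, unified))  # {'user_id': 'long'}
```

In real PySpark, the same loop would iterate over per-path DataFrames and emit `df.select(...)` with `lit(None).cast(dtype).alias(col)` for each missing column.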
Always validate after merging: check for unexpected nulls and for silent type widening (e.g., int promoted to long, or float to double).
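The widening check can be made concrete with a small helper (hypothetical, not a Spark API): compare each source file's schema against the merged schema and flag any column whose type changed, classifying known safe promotions separately from arbitrary changes.

```python
# Type pairs that count as benign widening; anything else is a hard change.
WIDENINGS = {("int", "long"), ("int", "double"), ("float", "double")}


def widened_columns(source_schema, merged_schema):
    """Report columns whose type differs between a source file and the merge.

    Returns (column, source_type, merged_type, kind) tuples, where kind is
    'widened' for a known promotion and 'changed' otherwise.
    """
    report = []
    for col, dtype in source_schema.items():
        merged_type = merged_schema.get(col)
        if merged_type is not None and merged_type != dtype:
            kind = "widened" if (dtype, merged_type) in WIDENINGS else "changed"
            report.append((col, dtype, merged_type, kind))
    return report


old = {"amount": "int", "label": "string"}
merged = {"amount": "long", "label": "string"}
print(widened_columns(old, merged))  # [('amount', 'int', 'long', 'widened')]
```

Running this against every source file after a mergeSchema read gives an audit trail of exactly which columns were coerced, instead of discovering the widening downstream.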
