
BigData / Apache Parquet Interview Questions

1. What is Apache Parquet and why is it used?
2. What are the advantages of Parquet over CSV?
3. How are Parquet files structured? (Row Groups, Column Chunks, Pages)
4. What is Schema Evolution in Parquet?
5. What is Column Pruning and Projection Pushdown in Parquet?
6. When would you choose Avro over Parquet?
7. How does Parquet handle compression and encoding?
8. What is the Vectorized Reader in Spark and how does it improve Parquet performance?
9. How do you handle schema mismatches when merging multiple Parquet files?
10. If a Spark query on Parquet is slow, what optimisation steps would you take?
11. How do you load Parquet files into Snowflake?
12. What are the supported data types in Parquet?
13. How do you read and write Parquet files in PySpark?
14. How do you read and write Parquet files in Python with PyArrow?
15. What is partitioning in Parquet and how does it improve query performance?
16. What are Bloom Filters in Parquet and when should you use them?
17. What is the difference between Parquet, ORC, and Avro?
18. What is Z-ordering (Z-order clustering) and how does it help Parquet queries?
19. What is Apache Iceberg and how does it use Parquet?
20. How does DuckDB query Parquet files and what makes it fast?
21. What is the Parquet file footer and why does the reader fetch it first?
22. How does Parquet support nested data (structs, lists, maps)?
23. What is the small file problem in Parquet-based data lakes and how do you solve it?
24. What is the difference between repartition and coalesce when writing Parquet files?
25. How does AWS Athena query Parquet files in S3?
26. What is predicate pushdown in Parquet and how does it work end-to-end?
27. What are best practices for writing Parquet files in production?
28. How does Google BigQuery use Parquet-style columnar storage internally?
29. What is Delta Lake and how does it extend Parquet for ACID transactions?
30. How do you perform upserts (MERGE INTO) on Parquet-based tables in Delta Lake?

1. What is Apache Parquet and why is it used?

Apache Parquet is an open-source, columnar storage file format designed for the Hadoop ecosystem and modern big-data platforms. Unlike row-based formats (CSV, JSON), Parquet stores each column's data contiguously on disk. This layout allows query engines to read only the columns they need, dramatically reducing I/O.

Key reasons Parquet is widely used:

  • Columnar layout — queries touch only the relevant columns, not full rows.
  • Efficient compression — columns hold homogeneous types, so codecs like Snappy and GZIP achieve higher ratios.
  • Predicate pushdown — embedded min/max statistics let engines skip entire row groups that cannot satisfy a WHERE clause.
  • Schema embedding — the schema is stored inside the file, eliminating external schema management.
  • Broad ecosystem support — Spark, Hive, Presto, Flink, DuckDB, Pandas, AWS Athena, and Snowflake all read/write Parquet natively.
Which storage layout does Apache Parquet use?
What enables Parquet to skip irrelevant row groups during a query?
2. What are the advantages of Parquet over CSV?

CSV is simple but untyped and inefficient at scale. Parquet improves on it in every dimension that matters for analytics:

| Aspect | CSV | Parquet |
| --- | --- | --- |
| Storage layout | Row-based | Columnar |
| Compression | Low (mixed types per row) | High (homogeneous per column) |
| Schema | None (inferred) | Embedded in file |
| Partial reads | Must scan full row | Read only needed columns |
| Predicate pushdown | No | Yes (min/max statistics) |
| Schema evolution | Manual | Built-in (add columns) |
| Ecosystem | Universal | Hadoop, Spark, cloud lakes |

For OLAP-style queries that aggregate a few columns across millions of rows, Parquet is typically 10–100× faster to query and 3–10× smaller than an equivalent CSV.

Why does Parquet compress data more efficiently than CSV?
Which CSV limitation does Parquet's embedded schema solve?
3. How are Parquet files structured? (Row Groups, Column Chunks, Pages)

Parquet organises data in a three-level hierarchy:

  • Row Group — a horizontal slice of the dataset, typically 128 MB–1 GB of data. Each row group contains one column chunk per column.
  • Column Chunk — all values for a single column within a row group. This is the unit of compression and encoding.
  • Page — the smallest addressable unit inside a column chunk (default 1 MB). Pages can be data pages, dictionary pages, or index pages.

At the end of the file, a footer stores the schema and per-column statistics (min, max, null count, distinct count). Readers fetch the footer first to plan which row groups and pages to skip.

Parquet File
├── Row Group 1 (128 MB)
│   ├── Column Chunk: id
│   ├── Column Chunk: name
│   └── Column Chunk: amount
├── Row Group 2
│   └── ...
└── Footer (schema + statistics)
What is the smallest addressable unit of storage inside a Parquet column chunk?
Where is the schema and per-column statistics stored inside a Parquet file?
4. What is Schema Evolution in Parquet?

Schema evolution is the ability to change a Parquet dataset's schema over time without rewriting existing files. Parquet natively supports:

  • Adding columns — new columns appear as null in older files when read together with newer files.
  • Renaming columns — supported via field IDs (used in formats like Iceberg on top of Parquet).
  • Widening types — e.g., INT32 → INT64 is safe; narrowing is not.

In Apache Spark, set mergeSchema = true to merge schemas across multiple Parquet files automatically:

df = spark.read.option("mergeSchema", "true").parquet("s3://bucket/data/")

Schema evolution is critical for long-lived data lakes where upstream producers add new fields without co-ordinating with all downstream consumers.

What value do older Parquet files return for a newly added column?
Which Spark read option enables automatic schema merging across Parquet files?
5. What is Column Pruning and Projection Pushdown in Parquet?

Column pruning (also called projection pushdown) is an optimisation where the query engine reads only the columns referenced in the query, ignoring all other column chunks on disk. Because Parquet stores each column separately, skipping unrequested columns costs nothing beyond reading the footer.

Example — a table has 50 columns but the query only needs 3:

SELECT user_id, revenue, country FROM events WHERE date = '2026-01-01';

The engine reads only the user_id, revenue, country, and date column chunks — the other 46 are never loaded from disk. Combined with predicate pushdown (row group skipping via statistics), this makes Parquet queries orders of magnitude faster than scanning full rows.

What does column pruning (projection pushdown) allow the query engine to do?
Which Parquet feature allows row groups to be skipped when a WHERE clause cannot be satisfied?
6. When would you choose Avro over Parquet?

Parquet and Avro serve different access patterns. The table below summarises when to prefer each:

| Criterion | Choose Parquet | Choose Avro |
| --- | --- | --- |
| Query pattern | OLAP — aggregate few columns over many rows | Row-based access — read/write entire records |
| Streaming | Batch / micro-batch | Streaming (Kafka, Flink event records) |
| Write frequency | Write-once, read-many | Frequent writes / appends |
| Schema evolution | Column adds; limited | Full backward/forward compatibility |
| Compression | Higher for analytics | Moderate; better for write throughput |

A common pattern: ingest events as Avro in Kafka, then compact and convert to Parquet in a data lake for batch analytics.

For a Kafka streaming pipeline where every field of each event is consumed, which format is the better fit?
Which workload is Parquet best suited for?
7. How does Parquet handle compression and encoding?

Parquet uses two complementary techniques to minimise storage:

Encoding — applied first, at the column level, to exploit data patterns:

  • Dictionary Encoding — replaces repeated values with integer codes. Ideal for low-cardinality columns (e.g., status, country).
  • Run-Length Encoding (RLE) — collapses consecutive identical values into (value, count) pairs.
  • Delta Encoding — stores differences between successive integers; great for timestamps and monotonic IDs.
  • Bit-packing — packs small integers into fewer bits.

Compression codec — applied after encoding to the byte stream:

  • Snappy (default in Spark) — fast, moderate ratio (~2–3×).
  • GZIP/Zlib — higher ratio (~4–6×) but slower CPU.
  • ZSTD — best balance; recommended for most new deployments.
  • LZO, LZ4, Brotli — specialised use cases.

Encoding and codec are configured independently per column, so hot columns can use fast codecs while archive columns use ZSTD for maximum compression.

Which Parquet encoding is best suited for a low-cardinality column like country or status?
Which compression codec offers the best balance of ratio and speed for modern Parquet deployments?

8. What is the Vectorized Reader in Spark and how does it improve Parquet performance?

The Vectorized Parquet Reader (introduced in Spark 2.0) reads a batch of rows at once directly into an in-memory columnar format (ColumnarBatch) rather than converting each row individually to a JVM object. This avoids object creation overhead and allows the JVM's JIT compiler to apply SIMD-style optimisations.

Key benefits:

  • Reduces per-row CPU overhead significantly.
  • Enables whole-stage code generation to operate on batches.
  • Speeds up filters and projections applied directly to column vectors.

Enable or verify via Spark config:

spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")  # default true

For complex nested schemas (e.g., deeply nested structs/maps), the vectorized reader may fall back to row-by-row mode automatically.

What memory format does the Spark Vectorized Parquet Reader output rows into?
When does Spark's vectorized Parquet reader fall back to row-by-row mode?
9. How do you handle schema mismatches when merging multiple Parquet files?

When a dataset is composed of Parquet files written at different times (possibly with different schemas), you have several options:

1. Spark mergeSchema — the simplest approach; Spark unions all schemas and fills missing columns with null:

df = spark.read.option("mergeSchema", "true").parquet("s3://datalake/events/")

2. AWS Glue Schema Registry — version schemas centrally; consumers register against a schema version and handle evolution rules explicitly.

3. Table formats (Iceberg / Delta / Hudi) — track schema history at the table level; ALTER TABLE ADD COLUMN is applied transactionally without rewriting files.

4. Manual reconciliation — use df.schema to inspect each file's schema, build a unified schema, then apply df.select(unified_cols) with lit(None).cast() for missing columns before union.

Always validate after merging: check for unexpected nulls or type coercions that widen types silently.

Which Spark read option automatically merges different schemas across Parquet files?
What does Spark fill in for columns that exist in newer files but are absent in older Parquet files when mergeSchema is used?
10. If a Spark query on Parquet is slow, what optimisation steps would you take?

Diagnosing and tuning slow Parquet queries in Spark follows a layered approach:

  1. Check partitioning — ensure the table is partitioned on commonly filtered, moderate-cardinality columns (e.g., date, region). Without partitions, Spark scans all files.
    df.write.partitionBy("date", "region").parquet("s3://path/")
  2. Verify column pruning — use SELECT col1, col2 rather than SELECT * so only required column chunks are read.
  3. Align filters with statistics — filter on columns that have good min/max spread so row group skipping is effective. Sort data by the filter column before writing (Z-ordering or file-level sorting).
  4. Right-size row groups — too-small row groups → many footer reads; too-large → coarse skipping. Target 128 MB–512 MB per row group.
  5. Check compression codec — Snappy (fast decompression) suits CPU-bound queries; ZSTD suits I/O-bound scenarios.
  6. Enable Adaptive Query Execution (AQE) — spark.sql.adaptive.enabled=true reoptimises joins and coalesces small partitions at runtime.
  7. Inspect the Spark UI — look at the scan stage: bytes read vs. bytes skipped tells you how effective pushdown is.
Which write-time practice most directly improves row-group skipping for date-range queries?
What Spark setting enables runtime re-optimisation of joins and partition coalescing?
11. How do you load Parquet files into Snowflake?

Snowflake can query and load Parquet files staged in cloud storage using a two-step approach:

Step 1 — Create a Parquet file format and stage the files (S3, Azure Blob, or GCS external stage):

CREATE OR REPLACE FILE FORMAT parquet_fmt TYPE = 'PARQUET';

CREATE OR REPLACE STAGE my_stage
  URL = 's3://my-bucket/parquet-data/'
  CREDENTIALS = (AWS_KEY_ID='...' AWS_SECRET_KEY='...');

Step 2a — Infer schema and query directly (no table creation needed):

SELECT $1:user_id::INT, $1:revenue::FLOAT
FROM @my_stage (FILE_FORMAT => 'parquet_fmt');

Step 2b — Use INFER_SCHEMA to auto-create table:

CREATE TABLE events USING TEMPLATE (
  SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
  FROM TABLE(INFER_SCHEMA(
    LOCATION => '@my_stage',
    FILE_FORMAT => 'parquet_fmt'
  ))
);
COPY INTO events FROM @my_stage
  FILE_FORMAT = (TYPE = 'PARQUET')
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

INFER_SCHEMA reads the Parquet footer metadata to derive column names and types automatically, eliminating manual DDL.

What Snowflake function reads a Parquet file's embedded schema to auto-generate a CREATE TABLE statement?
Which SQL command actually moves data from a Snowflake stage into a table?
12. What are the supported data types in Parquet?

Parquet defines primitive types (physical storage) and logical types (semantic meaning layered on top).

Primitive types: BOOLEAN, INT32, INT64, INT96 (legacy timestamps), FLOAT, DOUBLE, BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY.

Common logical types (annotations on primitives):

| Logical Type | Physical Mapping | Example |
| --- | --- | --- |
| STRING | BYTE_ARRAY (UTF-8) | "London" |
| DATE | INT32 (days since epoch) | 2026-01-01 |
| TIMESTAMP_MILLIS | INT64 | 1704067200000 |
| DECIMAL | INT32/INT64/BYTE_ARRAY | 99.99 |
| LIST | Repeated group | [1, 2, 3] |
| MAP | Repeated key-value group | {k: v} |
| ENUM | BYTE_ARRAY | "ACTIVE" |
| UUID | FIXED_LEN_BYTE_ARRAY(16) | RFC-4122 |

Understanding the primitive/logical split matters when debugging type mismatch errors between Spark, Hive, and other readers.

How does Parquet store a DATE column at the physical level?
What physical Parquet type is used to represent a STRING logical type?
13. How do you read and write Parquet files in PySpark?

Spark provides first-class Parquet support via the DataFrameReader and DataFrameWriter APIs.

Read:

# Read a single file or directory of Parquet files
df = spark.read.parquet("s3://my-bucket/events/")

# With options
df = (spark.read
      .option("mergeSchema", "true")
      .parquet("hdfs:///datalake/transactions/"))

Write:

# Overwrite with Snappy compression (default)
df.write.mode("overwrite").parquet("s3://my-bucket/output/")

# Partition by date and region, use ZSTD
(df.write
   .partitionBy("date", "region")
   .option("compression", "zstd")
   .mode("append")
   .parquet("s3://my-bucket/partitioned/"))

Register as temp view for SQL:

df.createOrReplaceTempView("events")
spark.sql("SELECT date, SUM(revenue) FROM events GROUP BY date").show()
Which PySpark write mode appends new data to an existing Parquet directory without overwriting existing files?
What method creates a SQL-queryable alias for a Spark DataFrame?
14. How do you read and write Parquet files in Python with PyArrow?

PyArrow provides a low-level Parquet library that is fast, pure Python-friendly, and interoperates with Pandas:

Write:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_pandas(df)
pq.write_table(table, "output.parquet", compression="zstd")

Read:

table = pq.read_table("output.parquet")
df = table.to_pandas()

Read specific columns only (column pruning):

table = pq.read_table("output.parquet", columns=["user_id", "revenue"])

Read with filter pushdown:

import pyarrow.dataset as ds
dataset = ds.dataset("s3://bucket/data/", format="parquet")
table = dataset.to_table(
    columns=["user_id", "revenue"],
    filter=ds.field("date") == "2026-01-01"
)

PyArrow's dataset API supports partitioned directories and applies filter pushdown automatically using Parquet statistics.

Which PyArrow function writes a PyArrow Table to a Parquet file on disk?
How do you push a column filter down to Parquet statistics when using the PyArrow dataset API?
15. What is partitioning in Parquet and how does it improve query performance?

Partitioning organises files into a directory hierarchy based on column values, following the Hive partition layout:

s3://bucket/events/
  date=2026-01-01/
    region=US/
      part-0000.parquet
  date=2026-01-02/
    region=US/
      part-0001.parquet

When a query filters on a partition column, the engine lists only matching subdirectories and skips all others — this is called partition pruning. No Parquet files in the skipped partitions are opened at all.

Example savings: a table with 3 years of daily data (1,095 partitions) filtering on a single date skips 99.9% of files before any Parquet-level statistics are consulted.

Trade-offs:

  • Partition on columns with moderate cardinality (dates, regions, categories) — not user IDs (too many small files).
  • Avoid over-partitioning: thousands of tiny files hurt throughput more than they help with pruning.
What is the performance optimisation called when a query skips entire directory partitions based on a filter column value?
Why should you avoid partitioning a Parquet dataset by a high-cardinality column like user_id?
16. What are Bloom Filters in Parquet and when should you use them?

A Bloom Filter is a probabilistic data structure that answers "is this value possibly in this row group?" with zero false negatives. Parquet 1.12+ supports per-column Bloom filters stored in the file footer.

They complement min/max statistics for high-cardinality columns where the min–max range spans the full value space (e.g., UUIDs, email addresses). Without a Bloom filter, the engine must open every row group; with one, it can skip groups that definitely do not contain the queried value.

Enable per column at write time in Spark — these are Parquet writer options, passed on the DataFrame writer (the output path here is illustrative):

(df.write
   .option("parquet.bloom.filter.enabled#user_id", "true")
   .option("parquet.bloom.filter.expected.ndv#user_id", "1000000")
   .parquet("s3://bucket/output/"))

When to use:

  • Equality filters (WHERE uuid = '...') on high-cardinality string/UUID columns.
  • Not useful for range queries (use sorted data + min/max for that).

Trade-off: adds size to the footer; tune expected.ndv (number of distinct values) to control false-positive rate vs. filter size.

For which query pattern are Parquet Bloom filters most beneficial?
What does a Parquet Bloom filter guarantee about its results?
17. What is the difference between Parquet, ORC, and Avro?

All three are Apache-ecosystem formats but optimised for different workloads:

| Feature | Parquet | ORC | Avro |
| --- | --- | --- | --- |
| Layout | Columnar | Columnar | Row-based |
| Ecosystem fit | Spark, Presto, Hive, cloud lakes | Hive-native; Spark | Kafka, Flink, Hadoop MR |
| Compression | Excellent (ZSTD, Snappy) | Excellent (ZLIB, Snappy, ZSTD) | Good (Deflate, Snappy) |
| Schema evolution | Column adds; mergeSchema | Column adds; type promotions | Full backward/forward compatibility |
| Nested data | Strong (Dremel-style) | Good | Native (JSON-like nested) |
| Best for | Analytics, data lake | Hive OLAP, HBase integration | Streaming, messaging, OLTP-like writes |

In practice: most modern data lake stacks (Databricks, AWS Lake Formation, Google BigQuery) default to Parquet. ORC remains dominant in Hive-heavy shops.

Which format has the strongest support for full backward and forward schema evolution?
Which format is native to Hive and historically dominant in Hive-centric data warehouses?
18. What is Z-ordering (Z-order clustering) and how does it help Parquet queries?

Z-ordering is a multi-dimensional data-skipping technique that physically co-locates related rows across multiple filter columns within Parquet files. It maps multiple column values to a single Z-order curve value and sorts data along that curve.

Problem it solves: Standard partitioning and sorting works well for one column. When queries filter on two or more independent columns (e.g., WHERE region='EU' AND product_category='Electronics'), traditional single-column sorting cannot co-locate all relevant rows.

Z-ordering interleaves the bits of the filter column values so that rows with similar values in multiple columns are physically adjacent.

In Delta Lake (Databricks):

OPTIMIZE events ZORDER BY (region, product_category);

After Z-ordering, the same query skips far more row groups because the min/max ranges for both columns are tight within each file.

Trade-off: Z-order OPTIMIZE is a full rewrite of affected files and should be scheduled periodically, not on every write.
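The bit-interleaving itself is simple — a plain-Python sketch of a two-column Z-order key (a toy version; production implementations handle more columns and normalise value ranges first):

```python
def z_order_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y into a single Z-order curve value."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # x bits at even positions
        key |= ((y >> i) & 1) << (2 * i + 1)   # y bits at odd positions
    return key

# Rows close in BOTH dimensions get nearby keys, so sorting by the key
# co-locates them in the same files / row groups.
rows = [(3, 5), (2, 4), (3, 4), (100, 2)]
print(sorted(rows, key=lambda r: z_order_key(*r)))
# [(2, 4), (3, 4), (3, 5), (100, 2)]
```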

What problem does Z-ordering solve that single-column sorting cannot address?
In Delta Lake, which SQL command triggers Z-order clustering on a table?
19. What is Apache Iceberg and how does it use Parquet?

Apache Iceberg is a high-performance open table format for huge analytic datasets, designed to replace traditional Hive table management. Iceberg stores actual data in Parquet (or ORC/Avro) files and adds a metadata layer on top.

Iceberg adds to raw Parquet:

  • ACID transactions — snapshot isolation for concurrent reads and writes.
  • Hidden partitioning — partition transforms (bucket, truncate, year/month/day) applied transparently without user-specified partition paths.
  • Full schema evolution — rename, reorder, drop columns tracked via field IDs.
  • Time travel — query any historical snapshot: SELECT * FROM events VERSION AS OF 12345;
  • Metadata statistics — manifest files track per-file min/max/null counts for fast planning.

Parquet is the default data format in Iceberg deployments (Spark, Flink, Trino, Hive). Iceberg handles all the metadata overhead so Parquet files can remain immutable and easily compactable.

What does Apache Iceberg add on top of raw Parquet files?
Which Iceberg SQL feature allows querying a table as it existed at a specific past snapshot?
20. How does DuckDB query Parquet files and what makes it fast?

DuckDB can query Parquet files directly without loading them into a database, using SQL:

-- Query a local Parquet file
SELECT region, SUM(revenue)
FROM read_parquet('events.parquet')
GROUP BY region;

-- Glob pattern for multiple files / partitioned directory
SELECT * FROM read_parquet('s3://bucket/data/**/*.parquet')
WHERE date = '2026-01-01';

DuckDB is fast on Parquet for several reasons:

  • Columnar execution engine — DuckDB processes data in column vectors natively, aligning perfectly with Parquet's columnar layout.
  • Push-down filters — WHERE conditions are pushed into the Parquet reader, exploiting row-group statistics and Bloom filters.
  • Projection pushdown — only referenced columns are read from disk.
  • Parallel I/O — multiple row groups are read in parallel across CPU cores.
  • Zero-copy reads — avoids deserialising data into intermediate objects.

This makes DuckDB an excellent tool for ad-hoc analytics on Parquet data lakes without a cluster.

Which DuckDB function reads Parquet files (including S3 paths and glob patterns)?
Why is DuckDB's columnar execution engine particularly well-suited for Parquet?
21. What is the Parquet file footer and why does the reader fetch it first?

The Parquet file footer is a serialised Thrift structure at the end of every Parquet file. It contains:

  • The full file schema (field names, types, nesting).
  • Per-row-group metadata: byte offsets, compressed/uncompressed sizes, row counts.
  • Per-column-chunk statistics: min value, max value, null count, distinct count.
  • Encoding and compression codec per column chunk.
  • Bloom filter offsets (if present).

Readers always fetch the footer first because it is small (typically kilobytes) and provides the complete map needed to plan which row groups and column chunks to read. Without the footer, the reader would have to scan the entire file sequentially.

The last 4 bytes of a Parquet file are the magic bytes PAR1; the 4 bytes before that are a 32-bit integer giving the footer length, so readers seek to the end of the file first.

What are the last 4 bytes of a valid Parquet file?
Why does a Parquet reader seek to the end of the file before reading any data?
22. How does Parquet support nested data (structs, lists, maps)?

Parquet uses the Dremel encoding (Google's paper, 2010) to represent arbitrarily nested data in a flat columnar layout. Two extra per-value integers are stored alongside each column's data:

  • Definition level — how many optional fields in the path are actually defined (non-null). Encodes null positions without storing nulls explicitly.
  • Repetition level — which repeated field in the path started a new list element. Encodes list boundaries.

Example — a column orders.items.price where items is a repeated field:

Row 1: orders.items = [{price: 10}, {price: 20}]
Row 2: orders.items = []  (empty list)

Stored as (row 2 contributes repetition/definition levels but no value):
  price values: [10, 20]
  repetition:   [0, 1, 0]  (0 = start of a new record, 1 = next element of the same list)
  definition:   [2, 2, 1]  (2 = price defined; 1 = items list present but empty, so no price value)

This lets engines reconstruct nested structures perfectly while retaining columnar access for analytics on leaf fields.

Which Google research paper's encoding technique does Parquet use for nested data?
What does the definition level encode in Parquet's nested column representation?
23. What is the small file problem in Parquet-based data lakes and how do you solve it?

The small file problem occurs when a Parquet dataset accumulates thousands of tiny files (often from streaming writes or over-partitioning). Each file requires a separate HDFS/S3 metadata operation and a separate footer read, causing significant overhead.

Effects:

  • Slow query planning — the driver/namenode must enumerate and open many files.
  • Inefficient Parquet statistics — small row groups give poor skipping benefits.
  • High object-store API costs (S3 LIST + GET per file).

Solutions:

  1. Compaction job — periodically coalesce small files into larger ones. In Spark (write to a new location — lazily reading and overwriting the same path in one job risks data loss):
    spark.read.parquet(src_path).coalesce(N).write.mode("overwrite").parquet(compacted_path)
  2. Delta Lake / Iceberg OPTIMIZE
    OPTIMIZE my_table;
  3. Streaming micro-batch tuning — increase trigger interval or use maxFilesPerTrigger to reduce write frequency.
  4. Hudi MOR → COW compaction — compact merge-on-read log files into base Parquet files.
What is the primary overhead caused by too many small Parquet files in a data lake?
Which Spark transformation consolidates many small partitions into fewer output files?
24. What is the difference between repartition and coalesce when writing Parquet files?

Both control the number of output Parquet files, but they work differently:

| Aspect | repartition(N) | coalesce(N) |
| --- | --- | --- |
| Shuffle | Full shuffle — all data redistributed across N partitions | No full shuffle — merges existing partitions locally |
| Output | Exactly N evenly-sized partitions | Up to N partitions; may be uneven |
| Direction | Can increase or decrease partitions | Can only decrease partitions |
| Cost | Higher — network + disk I/O | Lower — local merge |
| When to use | Need even file sizes; before a wide join | Reducing many small files cheaply after a filter |

For writing right-sized Parquet files, a common pattern is:

# After a heavy filter that leaves few rows, coalesce is cheaper
df.filter(df.date == "2026-01-01").coalesce(4).write.parquet(output_path)
Which Spark operation can both increase and decrease the number of partitions by performing a full shuffle?
After a heavy filter that leaves data very skewed across partitions, which is the cheaper way to reduce the number of output Parquet files?
25. How does AWS Athena query Parquet files in S3?

AWS Athena is a serverless interactive query service that uses Presto/Trino under the hood. It reads Parquet files directly from S3 using the Parquet columnar reader.

Steps to query Parquet in Athena:

  1. Define a Glue Data Catalog table pointing to the S3 prefix:
CREATE EXTERNAL TABLE events (
  user_id BIGINT,
  event_type STRING,
  revenue DOUBLE,
  event_date DATE
)
STORED AS PARQUET
LOCATION 's3://my-bucket/events/'
TBLPROPERTIES ("parquet.compress"="SNAPPY");
  2. Query as normal SQL — Athena applies column pruning and predicate pushdown automatically.

Cost optimisation: Athena charges per byte scanned. Parquet compression + columnar reads can reduce costs by 90%+ compared to querying CSV for the same logical data.

Partition projection or partition-filtered queries (WHERE event_date BETWEEN ...) reduce both scan size and latency.

What query engine powers AWS Athena under the hood?
Why does querying Parquet in Athena cost significantly less than querying CSV for the same data?
26. What is predicate pushdown in Parquet and how does it work end-to-end?

Predicate pushdown is the process of evaluating filter conditions as early as possible — before data reaches the compute layer — using metadata stored inside Parquet files.

End-to-end flow for WHERE amount > 1000:

  1. Reader fetches the Parquet footer and reads per-row-group statistics: min_amount, max_amount.
  2. For each row group, if max_amount ≤ 1000, the entire row group is skipped — no decompression, no I/O.
  3. For surviving row groups, the amount column chunk is decompressed and values are evaluated against > 1000 at the page level (with page-level indexes in Parquet 2.x).
  4. Only matching rows are returned to the engine.

Parquet 2.0 introduced column indexes (min/max per page, not just per row group), enabling finer-grained skipping within a row group.

Combined with partition pruning (directory skipping) and Bloom filters (equality skipping), predicate pushdown is the single most important performance feature of the Parquet format.

At what granularity does Parquet 1.x predicate pushdown (row group skipping) operate?
What feature introduced in Parquet 2.0 enables finer-grained skipping within a row group?
27. What are best practices for writing Parquet files in production?

Producing high-quality Parquet files that perform well at query time requires attention at write time:

  1. Target 128 MB–512 MB row groups — too small wastes footer reads; too large makes predicate skipping coarse.
  2. Sort data before writing on filter columns — tight min/max ranges per row group dramatically improve skipping.
  3. Choose the right compression codec — ZSTD for I/O-bound workloads; Snappy for CPU-bound (fast decompression).
  4. Enable dictionary encoding on low-cardinality columns — automatic in most frameworks but verify it is not being disabled.
  5. Partition on moderate-cardinality columns (e.g., date, country) — never on user IDs or UUIDs.
  6. Avoid tiny files — compact regularly if streaming or incremental writes produce many small files.
  7. Enable Bloom filters on high-cardinality equality columns (UUIDs, hashed IDs).
  8. Embed correct schema types — use TIMESTAMP_MICROS not INT96 (deprecated); use DECIMAL not DOUBLE for monetary values.
  9. Test with query benchmarks after schema changes to confirm no regression in pushdown effectiveness.
What is the recommended row group size range for production Parquet files?
Why is the deprecated INT96 timestamp type being replaced by TIMESTAMP_MICROS in Parquet?
28. How does Google BigQuery use Parquet-style columnar storage internally?

BigQuery uses Capacitor, its proprietary columnar format, which shares the same fundamental principles as Parquet: columnar layout, aggressive encoding, and embedded statistics. When you export from BigQuery or import into it, Parquet is the preferred external format.

Exporting BigQuery tables to Parquet:

bq extract \
  --destination_format PARQUET \
  --compression SNAPPY \
  myproject:mydataset.mytable \
  gs://my-bucket/export/*.parquet

Loading Parquet into BigQuery:

bq load \
  --source_format=PARQUET \
  myproject:mydataset.newtable \
  gs://my-bucket/data/*.parquet

BigQuery can also query external Parquet tables via BigLake without loading — applying its own predicate pushdown against Parquet statistics in GCS.

What is the name of BigQuery's internal columnar storage format?
Which bq CLI flag specifies the output format when extracting a BigQuery table to GCS?
29. What is Delta Lake and how does it extend Parquet for ACID transactions?

Delta Lake is an open-source storage layer built by Databricks that adds transactional guarantees on top of Parquet files stored in object storage. The core idea: all changes (inserts, updates, deletes) are written as immutable Parquet files and tracked via a JSON transaction log (the _delta_log directory).

ACID properties in Delta Lake:

  • Atomicity — a transaction either fully commits (log entry added) or is rolled back (log entry absent).
  • Consistency — schema enforcement prevents corrupt writes.
  • Isolation — Optimistic Concurrency Control (OCC) detects conflicts between concurrent writers.
  • Durability — log + data files in object storage are highly durable.

Data files remain Parquet; Delta adds a _delta_log/ with JSON commit files that record which Parquet files are added or removed in each transaction. Periodic checkpoint files (Parquet snapshots of the log) speed up log replay.

# Spark Delta write with schema enforcement
df.write.format("delta").mode("append").save("/delta/events")

# Time travel
spark.read.format("delta").option("versionAsOf", 5).load("/delta/events")
Where does Delta Lake store its transaction log alongside Parquet data files?
Which concurrency control mechanism does Delta Lake use to resolve conflicts between concurrent writers?
30. How do you perform upserts (MERGE INTO) on Parquet-based tables in Delta Lake?

Raw Parquet files are immutable — you cannot update individual rows. Delta Lake adds MERGE INTO support, which implements upserts by reading affected Parquet files, rewriting them with changes, and recording the transaction:

from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/delta/customers")

delta_table.alias("target").merge(
    updates_df.alias("source"),
    "target.customer_id = source.customer_id"
).whenMatchedUpdateAll(
).whenNotMatchedInsertAll(
).execute()

Or in SQL:

MERGE INTO customers AS target
USING updates AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

Under the hood: Delta identifies which Parquet files contain matching rows, rewrites only those files with updated values, and records the old files as removed and new files as added in the transaction log. Non-matching files are untouched.

Why can't you update individual rows directly in raw Parquet files?
In a Delta Lake MERGE INTO operation, what happens to the original Parquet files that contained matched rows?