
BigData / Apache Parquet Interview Questions

1. What is Apache Parquet and why is it used?
2. What are the advantages of Parquet over CSV?
3. How are Parquet files structured? (Row Groups, Column Chunks, Pages)
4. What is Schema Evolution in Parquet?
5. What is Column Pruning and Projection Pushdown in Parquet?
6. When would you choose Avro over Parquet?
7. How does Parquet handle compression and encoding?
8. What is the Vectorized Reader in Spark and how does it improve Parquet performance?
9. How do you handle schema mismatches when merging multiple Parquet files?
10. If a Spark query on Parquet is slow, what optimisation steps would you take?
11. How do you load Parquet files into Snowflake?
12. What are the supported data types in Parquet?
13. How do you read and write Parquet files in PySpark?
14. How do you read and write Parquet files in Python with PyArrow?
15. What is partitioning in Parquet and how does it improve query performance?
16. What are Bloom Filters in Parquet and when should you use them?
17. What is the difference between Parquet, ORC, and Avro?
18. What is Z-ordering (Z-order clustering) and how does it help Parquet queries?
19. What is Apache Iceberg and how does it use Parquet?
20. How does DuckDB query Parquet files and what makes it fast?
21. What is the Parquet file footer and why does the reader fetch it first?
22. How does Parquet support nested data (structs, lists, maps)?
23. What is the small file problem in Parquet-based data lakes and how do you solve it?
24. What is the difference between repartition and coalesce when writing Parquet files?
25. How does AWS Athena query Parquet files in S3?
26. What is predicate pushdown in Parquet and how does it work end-to-end?
27. What are best practices for writing Parquet files in production?
28. How does Google BigQuery use Parquet-style columnar storage internally?
29. What is Delta Lake and how does it extend Parquet for ACID transactions?
30. How do you perform upserts (MERGE INTO) on Parquet-based tables in Delta Lake?

1. What is Apache Parquet and why is it used?

Apache Parquet is an open-source, columnar storage file format designed for the Hadoop ecosystem and modern big-data platforms. Unlike row-based formats (CSV, JSON), Parquet stores each column's data contiguously on disk. This layout allows query engines to read only the columns they need, dramatically reducing I/O.

Key reasons Parquet is widely used:

  • Columnar layout — queries touch only the relevant columns, not full rows.
  • Efficient compression — columns hold homogeneous types, so codecs like Snappy and GZIP achieve higher ratios.
  • Predicate pushdown — embedded min/max statistics let engines skip entire row groups that cannot satisfy a WHERE clause.
  • Schema embedding — the schema is stored inside the file, eliminating external schema management.
  • Broad ecosystem support — Spark, Hive, Presto, Flink, DuckDB, Pandas, AWS Athena, and Snowflake all read/write Parquet natively.
Which storage layout does Apache Parquet use?
What enables Parquet to skip irrelevant row groups during a query?
2. What are the advantages of Parquet over CSV?

CSV is simple but untyped and inefficient at scale. Parquet improves on it in every dimension that matters for analytics:

| Aspect | CSV | Parquet |
| --- | --- | --- |
| Storage layout | Row-based | Columnar |
| Compression | Low (mixed types per row) | High (homogeneous per column) |
| Schema | None (inferred) | Embedded in file |
| Partial reads | Must scan full row | Read only needed columns |
| Predicate pushdown | No | Yes (min/max statistics) |
| Schema evolution | Manual | Built-in (add columns) |
| Ecosystem | Universal | Hadoop, Spark, cloud lakes |

For OLAP-style queries that aggregate a few columns across millions of rows, Parquet is typically 10–100× faster to query and 3–10× smaller than an equivalent CSV.

Why does Parquet compress data more efficiently than CSV?
Which CSV limitation does Parquet's embedded schema solve?
3. How are Parquet files structured? (Row Groups, Column Chunks, Pages)

Parquet organises data in a three-level hierarchy:

  • Row Group — a horizontal slice of the dataset, typically 128 MB–1 GB of data. Each row group contains one column chunk per column.
  • Column Chunk — all values for a single column within a row group. This is the unit of compression and encoding.
  • Page — the smallest addressable unit inside a column chunk (default 1 MB). Pages can be data pages, dictionary pages, or index pages.

At the end of the file, a footer stores the schema and per-column statistics (min, max, null count, distinct count). Readers fetch the footer first to plan which row groups and pages to skip.

Parquet File
├── Row Group 1 (128 MB)
│   ├── Column Chunk: id
│   ├── Column Chunk: name
│   └── Column Chunk: amount
├── Row Group 2
│   └── ...
└── Footer (schema + statistics)
What is the smallest addressable unit of storage inside a Parquet column chunk?
Where is the schema and per-column statistics stored inside a Parquet file?
4. What is Schema Evolution in Parquet?

Schema evolution is the ability to change a Parquet dataset's schema over time without rewriting existing files. Parquet natively supports:

  • Adding columns — new columns appear as null in older files when read together with newer files.
  • Renaming columns — supported via field IDs (used in formats like Iceberg on top of Parquet).
  • Widening types — e.g., INT32 → INT64 is safe; narrowing is not.

In Apache Spark, set mergeSchema = true to merge schemas across multiple Parquet files automatically:

df = spark.read.option("mergeSchema", "true").parquet("s3://bucket/data/")

Schema evolution is critical for long-lived data lakes where upstream producers add new fields without co-ordinating with all downstream consumers.

What value do older Parquet files return for a newly added column?
Which Spark read option enables automatic schema merging across Parquet files?
5. What is Column Pruning and Projection Pushdown in Parquet?

Column pruning (also called projection pushdown) is an optimisation where the query engine reads only the columns referenced in the query, ignoring all other column chunks on disk. Because Parquet stores each column separately, skipping unrequested columns costs nothing beyond reading the footer.

Example — a table has 50 columns but the query only needs 3:

SELECT user_id, revenue, country FROM events WHERE date = '2026-01-01';

The engine reads only the user_id, revenue, country, and date column chunks — the other 46 are never loaded from disk. Combined with predicate pushdown (row group skipping via statistics), this makes Parquet queries orders of magnitude faster than scanning full rows.

What does column pruning (projection pushdown) allow the query engine to do?
Which Parquet feature allows row groups to be skipped when a WHERE clause cannot be satisfied?
6. When would you choose Avro over Parquet?

Parquet and Avro serve different access patterns. The table below summarises when to prefer each:

| Criterion | Choose Parquet | Choose Avro |
| --- | --- | --- |
| Query pattern | OLAP — aggregate few columns over many rows | Row-based access — read/write entire records |
| Streaming | Batch / micro-batch | Streaming (Kafka, Flink event records) |
| Write frequency | Write-once, read-many | Frequent writes / appends |
| Schema evolution | Column adds; limited | Full backward/forward compatibility |
| Compression | Higher for analytics | Moderate; better for write throughput |

A common pattern: ingest events as Avro in Kafka, then compact and convert to Parquet in a data lake for batch analytics.

For a Kafka streaming pipeline where every field of each event is consumed, which format is the better fit?
Which workload is Parquet best suited for?
7. How does Parquet handle compression and encoding?

Parquet uses two complementary techniques to minimise storage:

Encoding — applied first, at the column level, to exploit data patterns:

  • Dictionary Encoding — replaces repeated values with integer codes. Ideal for low-cardinality columns (e.g., status, country).
  • Run-Length Encoding (RLE) — collapses consecutive identical values into (value, count) pairs.
  • Delta Encoding — stores differences between successive integers; great for timestamps and monotonic IDs.
  • Bit-packing — packs small integers into fewer bits.

Compression codec — applied after encoding to the byte stream:

  • Snappy (default in Spark) — fast, moderate ratio (~2–3×).
  • GZIP/Zlib — higher ratio (~4–6×) but slower CPU.
  • ZSTD — best balance; recommended for most new deployments.
  • LZO, LZ4, Brotli — specialised use cases.

Encoding and codec are configured independently per column, so hot columns can use fast codecs while archive columns use ZSTD for maximum compression.

Which Parquet encoding is best suited for a low-cardinality column like country or status?
Which compression codec offers the best balance of ratio and speed for modern Parquet deployments?

8. What is the Vectorized Reader in Spark and how does it improve Parquet performance?

The Vectorized Parquet Reader (introduced in Spark 2.0) reads a batch of rows at once directly into an in-memory columnar format (ColumnarBatch) rather than converting each row individually to a JVM object. This avoids object creation overhead and allows the JVM's JIT compiler to apply SIMD-style optimisations.

Key benefits:

  • Reduces per-row CPU overhead significantly.
  • Enables whole-stage code generation to operate on batches.
  • Speeds up filters and projections applied directly to column vectors.

Enable or verify via Spark config:

spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")  # default true

For complex nested schemas (e.g., deeply nested structs/maps), the vectorized reader may fall back to row-by-row mode automatically.

What memory format does the Spark Vectorized Parquet Reader output rows into?
When does Spark's vectorized Parquet reader fall back to row-by-row mode?
9. How do you handle schema mismatches when merging multiple Parquet files?

When a dataset is composed of Parquet files written at different times (possibly with different schemas), you have several options:

1. Spark mergeSchema — the simplest approach; Spark unions all schemas and fills missing columns with null:

df = spark.read.option("mergeSchema", "true").parquet("s3://datalake/events/")

2. AWS Glue Schema Registry — version schemas centrally; consumers register against a schema version and handle evolution rules explicitly.

3. Table formats (Iceberg / Delta / Hudi) — track schema history at the table level; ALTER TABLE ADD COLUMN is applied transactionally without rewriting files.

4. Manual reconciliation — use df.schema to inspect each file's schema, build a unified schema, then apply df.select(unified_cols) with lit(None).cast() for missing columns before union.

Always validate after merging: check for unexpected nulls or type coercions that widen types silently.

Which Spark read option automatically merges different schemas across Parquet files?
What does Spark fill in for columns that exist in newer files but are absent in older Parquet files when mergeSchema is used?
10. If a Spark query on Parquet is slow, what optimisation steps would you take?

Diagnosing and tuning slow Parquet queries in Spark follows a layered approach:

  1. Check partitioning — ensure the table is partitioned on commonly filtered, moderate-cardinality columns (e.g., date, region). Without partitions, Spark scans all files.
    df.write.partitionBy("date", "region").parquet("s3://path/")
  2. Verify column pruning — use SELECT col1, col2 rather than SELECT * so only required column chunks are read.
  3. Align filters with statistics — filter on columns that have good min/max spread so row group skipping is effective. Sort data by the filter column before writing (Z-ordering or file-level sorting).
  4. Right-size row groups — too-small row groups → many footer reads; too-large → coarse skipping. Target 128 MB–512 MB per row group.
  5. Check compression codec — Snappy (fast decompression) suits CPU-bound queries; ZSTD suits I/O-bound scenarios.
  6. Enable Adaptive Query Execution (AQE) — spark.sql.adaptive.enabled=true reoptimises joins and coalesces small partitions at runtime.
  7. Inspect the Spark UI — look at the scan stage: bytes read vs. bytes skipped tells you how effective pushdown is.
Which write-time practice most directly improves row-group skipping for date-range queries?
What Spark setting enables runtime re-optimisation of joins and partition coalescing?
11. How do you load Parquet files into Snowflake?

Snowflake can query and load Parquet files staged in cloud storage using a two-step approach:

Step 1 — Create a Parquet file format and stage the files (S3, Azure Blob, or GCS external stage):

CREATE OR REPLACE FILE FORMAT parquet_fmt TYPE = 'PARQUET';

CREATE OR REPLACE STAGE my_stage
  URL = 's3://my-bucket/parquet-data/'
  CREDENTIALS = (AWS_KEY_ID='...' AWS_SECRET_KEY='...');

Step 2a — Infer schema and query directly (no table creation needed):

SELECT $1:user_id::INT, $1:revenue::FLOAT
FROM @my_stage (FILE_FORMAT => 'parquet_fmt');

Step 2b — Use INFER_SCHEMA to auto-create table:

CREATE TABLE events USING TEMPLATE (
  SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
  FROM TABLE(INFER_SCHEMA(
    LOCATION => '@my_stage',
    FILE_FORMAT => 'parquet_fmt'
  ))
);
COPY INTO events FROM @my_stage
  FILE_FORMAT = (TYPE = 'PARQUET')
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

INFER_SCHEMA reads the Parquet footer metadata to derive column names and types automatically, eliminating manual DDL.

What Snowflake function reads a Parquet file's embedded schema to auto-generate a CREATE TABLE statement?
Which SQL command actually moves data from a Snowflake stage into a table?
12. What are the supported data types in Parquet?

Parquet defines primitive types (physical storage) and logical types (semantic meaning layered on top).

Primitive types: BOOLEAN, INT32, INT64, INT96 (legacy timestamps), FLOAT, DOUBLE, BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY.

Common logical types (annotations on primitives):

| Logical Type | Physical Mapping | Example |
| --- | --- | --- |
| STRING | BYTE_ARRAY (UTF-8) | "London" |
| DATE | INT32 (days since epoch) | 2026-01-01 |
| TIMESTAMP_MILLIS | INT64 | 1704067200000 |
| DECIMAL | INT32/INT64/BYTE_ARRAY | 99.99 |
| LIST | Repeated group | [1, 2, 3] |
| MAP | Repeated key-value group | {k: v} |
| ENUM | BYTE_ARRAY | "ACTIVE" |
| UUID | FIXED_LEN_BYTE_ARRAY(16) | RFC-4122 |

Understanding the primitive/logical split matters when debugging type mismatch errors between Spark, Hive, and other readers.

How does Parquet store a DATE column at the physical level?
What physical Parquet type is used to represent a STRING logical type?
13. How do you read and write Parquet files in PySpark?

Spark provides first-class Parquet support via the DataFrameReader and DataFrameWriter APIs.

Read:

# Read a single file or directory of Parquet files
df = spark.read.parquet("s3://my-bucket/events/")

# With options
df = (spark.read
      .option("mergeSchema", "true")
      .parquet("hdfs:///datalake/transactions/"))

Write:

# Overwrite with Snappy compression (default)
df.write.mode("overwrite").parquet("s3://my-bucket/output/")

# Partition by date and region, use ZSTD
(df.write
   .partitionBy("date", "region")
   .option("compression", "zstd")
   .mode("append")
   .parquet("s3://my-bucket/partitioned/"))

Register as temp view for SQL:

df.createOrReplaceTempView("events")
spark.sql("SELECT date, SUM(revenue) FROM events GROUP BY date").show()
Which PySpark write mode appends new data to an existing Parquet directory without overwriting existing files?
What method creates a SQL-queryable alias for a Spark DataFrame?
14. How do you read and write Parquet files in Python with PyArrow?

PyArrow provides a low-level Parquet library that is fast, pure Python-friendly, and interoperates with Pandas:

Write:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_pandas(df)
pq.write_table(table, "output.parquet", compression="zstd")

Read:

table = pq.read_table("output.parquet")
df = table.to_pandas()

Read specific columns only (column pruning):

table = pq.read_table("output.parquet", columns=["user_id", "revenue"])

Read with filter pushdown:

import pyarrow.dataset as ds
dataset = ds.dataset("s3://bucket/data/", format="parquet")
table = dataset.to_table(
    columns=["user_id", "revenue"],
    filter=ds.field("date") == "2026-01-01"
)

PyArrow's dataset API supports partitioned directories and applies filter pushdown automatically using Parquet statistics.

Which PyArrow function writes a PyArrow Table to a Parquet file on disk?
How do you push a column filter down to Parquet statistics when using the PyArrow dataset API?
15. What is partitioning in Parquet and how does it improve query performance?

Partitioning organises files into a directory hierarchy based on column values, following the Hive partition layout:

s3://bucket/events/
  date=2026-01-01/
    region=US/
      part-0000.parquet
  date=2026-01-02/
    region=US/
      part-0001.parquet

When a query filters on a partition column, the engine lists only matching subdirectories and skips all others — this is called partition pruning. No Parquet files in the skipped partitions are opened at all.

Example savings: a table with 3 years of daily data (1,095 partitions) filtering on a single date skips 99.9% of files before any Parquet-level statistics are consulted.

Trade-offs:

  • Partition on columns with moderate cardinality (dates, regions, categories) — not user IDs (too many small files).
  • Avoid over-partitioning: thousands of tiny files hurt throughput more than they help with pruning.
What is the performance optimisation called when a query skips entire directory partitions based on a filter column value?
Why should you avoid partitioning a Parquet dataset by a high-cardinality column like user_id?
16. What are Bloom Filters in Parquet and when should you use them?

A Bloom Filter is a probabilistic data structure that answers "is this value possibly in this row group?" with zero false negatives. Parquet 1.12+ supports per-column Bloom filters stored in the file footer.

They complement min/max statistics for high-cardinality columns where the min–max range spans the full value space (e.g., UUIDs, email addresses). Without a Bloom filter, the engine must open every row group; with one, it can skip groups that definitely do not contain the queried value.

Enable per column at write time in Spark — these are Parquet writer options, passed on the DataFrame writer (the output path here is illustrative):

(df.write
   .option("parquet.bloom.filter.enabled#user_id", "true")
   .option("parquet.bloom.filter.expected.ndv#user_id", "1000000")
   .parquet("s3://bucket/output/"))

When to use:

  • Equality filters (WHERE uuid = '...') on high-cardinality string/UUID columns.
  • Not useful for range queries (use sorted data + min/max for that).

Trade-off: adds size to the footer; tune expected.ndv (number of distinct values) to control false-positive rate vs. filter size.

For which query pattern are Parquet Bloom filters most beneficial?
What does a Parquet Bloom filter guarantee about its results?
17. What is the difference between Parquet, ORC, and Avro?

All three are Apache-ecosystem formats but optimised for different workloads:

| Feature | Parquet | ORC | Avro |
| --- | --- | --- | --- |
| Layout | Columnar | Columnar | Row-based |
| Ecosystem fit | Spark, Presto, Hive, cloud lakes | Hive-native; Spark | Kafka, Flink, Hadoop MR |
| Compression | Excellent (ZSTD, Snappy) | Excellent (ZLIB, Snappy, ZSTD) | Good (Deflate, Snappy) |
| Schema evolution | Column adds; mergeSchema | Column adds; type promotions | Full backward/forward compatibility |
| Nested data | Strong (Dremel-style) | Good | Native (JSON-like nested) |
| Best for | Analytics, data lake | Hive OLAP, HBase integration | Streaming, messaging, OLTP-like writes |

In practice: most modern data lake stacks (Databricks, AWS Lake Formation, Google BigQuery) default to Parquet. ORC remains dominant in Hive-heavy shops.

Which format has the strongest support for full backward and forward schema evolution?
Which format is native to Hive and historically dominant in Hive-centric data warehouses?
18. What is Z-ordering (Z-order clustering) and how does it help Parquet queries?

Z-ordering is a multi-dimensional data-skipping technique that physically co-locates related rows across multiple filter columns within Parquet files. It maps multiple column values to a single Z-order curve value and sorts data along that curve.

Problem it solves: Standard partitioning and sorting works well for one column. When queries filter on two or more independent columns (e.g., WHERE region='EU' AND product_category='Electronics'), traditional single-column sorting cannot co-locate all relevant rows.

Z-ordering interleaves the bits of the filter column values so that rows with similar values in multiple columns are physically adjacent.

In Delta Lake (Databricks):

OPTIMIZE events ZORDER BY (region, product_category);

After Z-ordering, the same query skips far more row groups because the min/max ranges for both columns are tight within each file.

Trade-off: Z-order OPTIMIZE is a full rewrite of affected files and should be scheduled periodically, not on every write.
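The bit-interleaving itself is simple — a plain-Python sketch of a two-column Z-order key (a toy version; production implementations handle more columns and normalise value ranges first):

```python
def z_order_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y into a single Z-order curve value."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # x bits at even positions
        key |= ((y >> i) & 1) << (2 * i + 1)   # y bits at odd positions
    return key

# Rows close in BOTH dimensions get nearby keys, so sorting by the key
# co-locates them in the same files / row groups.
rows = [(3, 5), (2, 4), (3, 4), (100, 2)]
print(sorted(rows, key=lambda r: z_order_key(*r)))
# [(2, 4), (3, 4), (3, 5), (100, 2)]
```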

What problem does Z-ordering solve that single-column sorting cannot address?
In Delta Lake, which SQL command triggers Z-order clustering on a table?
19. What is Apache Iceberg and how does it use Parquet?

Apache Iceberg is a high-performance open table format for huge analytic datasets, designed to replace traditional Hive table management. Iceberg stores actual data in Parquet (or ORC/Avro) files and adds a metadata layer on top.

Iceberg adds to raw Parquet:

  • ACID transactions — snapshot isolation for concurrent reads and writes.
  • Hidden partitioning — partition transforms (bucket, truncate, year/month/day) applied transparently without user-specified partition paths.
  • Full schema evolution — rename, reorder, drop columns tracked via field IDs.
  • Time travel — query any historical snapshot: SELECT * FROM events VERSION AS OF 12345;
  • Metadata statistics — manifest files track per-file min/max/null counts for fast planning.

Parquet is the default data format in Iceberg deployments (Spark, Flink, Trino, Hive). Iceberg handles all the metadata overhead so Parquet files can remain immutable and easily compactable.

What does Apache Iceberg add on top of raw Parquet files?
Which Iceberg SQL feature allows querying a table as it existed at a specific past snapshot?
20. How does DuckDB query Parquet files and what makes it fast?

DuckDB can query Parquet files directly without loading them into a database, using SQL:

-- Query a local Parquet file
SELECT region, SUM(revenue)
FROM read_parquet('events.parquet')
GROUP BY region;

-- Glob pattern for multiple files / partitioned directory
SELECT * FROM read_parquet('s3://bucket/data/**/*.parquet')
WHERE date = '2026-01-01';

DuckDB is fast on Parquet for several reasons:

  • Columnar execution engine — DuckDB processes data in column vectors natively, aligning perfectly with Parquet's columnar layout.
  • Push-down filters — WHERE conditions are pushed into the Parquet reader, exploiting row-group statistics and Bloom filters.
  • Projection pushdown — only referenced columns are read from disk.
  • Parallel I/O — multiple row groups are read in parallel across CPU cores.
  • Zero-copy reads — avoids deserialising data into intermediate objects.

This makes DuckDB an excellent tool for ad-hoc analytics on Parquet data lakes without a cluster.

Which DuckDB function reads Parquet files (including S3 paths and glob patterns)?
Why is DuckDB's columnar execution engine particularly well-suited for Parquet?
21. What is the Parquet file footer and why does the reader fetch it first?

The Parquet file footer is a serialised Thrift structure at the end of every Parquet file. It contains:

  • The full file schema (field names, types, nesting).
  • Per-row-group metadata: byte offsets, compressed/uncompressed sizes, row counts.
  • Per-column-chunk statistics: min value, max value, null count, distinct count.
  • Encoding and compression codec per column chunk.
  • Bloom filter offsets (if present).

Readers always fetch the footer first because it is small (typically kilobytes) and provides the complete map needed to plan which row groups and column chunks to read. Without the footer, the reader would have to scan the entire file sequentially.

The last 4 bytes of a Parquet file are the magic bytes PAR1; the 4 bytes before that are a 32-bit integer giving the footer length, so readers seek to the end of the file first.

What are the last 4 bytes of a valid Parquet file?
Why does a Parquet reader seek to the end of the file before reading any data?
22. How does Parquet support nested data (structs, lists, maps)?

Parquet uses the Dremel encoding (Google's paper, 2010) to represent arbitrarily nested data in a flat columnar layout. Two extra per-value integers are stored alongside each column's data:

  • Definition level — how many optional fields in the path are actually defined (non-null). Encodes null positions without storing nulls explicitly.
  • Repetition level — which repeated field in the path started a new list element. Encodes list boundaries.

Example — a column orders.items.price where items is a repeated field:

Row 1: orders.items = [{price: 10}, {price: 20}]
Row 2: orders.items = []  (empty list)

Stored as (row 2 contributes repetition/definition levels but no value):
  price values: [10, 20]
  repetition:   [0, 1, 0]  (0 = start of a new record, 1 = next element of the same list)
  definition:   [2, 2, 1]  (2 = price defined; 1 = items list present but empty, so no price value)

This lets engines reconstruct nested structures perfectly while retaining columnar access for analytics on leaf fields.

Which Google research paper's encoding technique does Parquet use for nested data?
What does the definition level encode in Parquet's nested column representation?
23. What is the small file problem in Parquet-based data lakes and how do you solve it?

The small file problem occurs when a Parquet dataset accumulates thousands of tiny files (often from streaming writes or over-partitioning). Each file requires a separate HDFS/S3 metadata operation and a separate footer read, causing significant overhead.

Effects:

  • Slow query planning — the driver/namenode must enumerate and open many files.
  • Inefficient Parquet statistics — small row groups give poor skipping benefits.
  • High object-store API costs (S3 LIST + GET per file).

Solutions:

  1. Compaction job — periodically coalesce small files into larger ones. In Spark (write to a new location — lazily reading and overwriting the same path in one job risks data loss):
    spark.read.parquet(src_path).coalesce(N).write.mode("overwrite").parquet(compacted_path)
  2. Delta Lake / Iceberg OPTIMIZE
    OPTIMIZE my_table;
  3. Streaming micro-batch tuning — increase trigger interval or use maxFilesPerTrigger to reduce write frequency.
  4. Hudi MOR → COW compaction — compact merge-on-read log files into base Parquet files.
What is the primary overhead caused by too many small Parquet files in a data lake?
Which Spark transformation consolidates many small partitions into fewer output files?
24. What is the difference between repartition and coalesce when writing Parquet files?

Both control the number of output Parquet files, but they work differently:

| Aspect | repartition(N) | coalesce(N) |
| --- | --- | --- |
| Shuffle | Full shuffle — all data redistributed across N partitions | No full shuffle — merges existing partitions locally |
| Output | Exactly N evenly-sized partitions | Up to N partitions; may be uneven |
| Direction | Can increase or decrease partitions | Can only decrease partitions |
| Cost | Higher — network + disk I/O | Lower — local merge |
| When to use | Need even file sizes; before a wide join | Reducing many small files cheaply after a filter |

For writing right-sized Parquet files, a common pattern is:

# After a heavy filter that leaves few rows, coalesce is cheaper
df.filter(df.date == "2026-01-01").coalesce(4).write.parquet(output_path)
Which Spark operation can both increase and decrease the number of partitions by performing a full shuffle?
After a heavy filter that leaves data very skewed across partitions, which is the cheaper way to reduce the number of output Parquet files?
25. How does AWS Athena query Parquet files in S3?

AWS Athena is a serverless interactive query service that uses Presto/Trino under the hood. It reads Parquet files directly from S3 using the Parquet columnar reader.

Steps to query Parquet in Athena:

  1. Define a Glue Data Catalog table pointing to the S3 prefix:
CREATE EXTERNAL TABLE events (
  user_id BIGINT,
  event_type STRING,
  revenue DOUBLE,
  event_date DATE
)
STORED AS PARQUET
LOCATION 's3://my-bucket/events/'
TBLPROPERTIES ("parquet.compress"="SNAPPY");
  2. Query as normal SQL — Athena applies column pruning and predicate pushdown automatically.

Cost optimisation: Athena charges per byte scanned. Parquet compression + columnar reads can reduce costs by 90%+ compared to querying CSV for the same logical data.

Partition projection or partition-filtered queries (WHERE event_date BETWEEN ...) reduce both scan size and latency.

What query engine powers AWS Athena under the hood?
Why does querying Parquet in Athena cost significantly less than querying CSV for the same data?
26. What is predicate pushdown in Parquet and how does it work end-to-end?

Predicate pushdown is the process of evaluating filter conditions as early as possible — before data reaches the compute layer — using metadata stored inside Parquet files.

End-to-end flow for WHERE amount > 1000:

  1. Reader fetches the Parquet footer and reads per-row-group statistics: min_amount, max_amount.
  2. For each row group, if max_amount ≤ 1000, the entire row group is skipped — no decompression, no I/O.
  3. For surviving row groups, the amount column chunk is decompressed and values are evaluated against > 1000 at the page level (with page-level indexes in Parquet 2.x).
  4. Only matching rows are returned to the engine.

Parquet 2.0 introduced column indexes (min/max per page, not just per row group), enabling finer-grained skipping within a row group.

Combined with partition pruning (directory skipping) and Bloom filters (equality skipping), predicate pushdown is the single most important performance feature of the Parquet format.

At what granularity does Parquet 1.x predicate pushdown (row group skipping) operate?
What feature introduced in Parquet 2.0 enables finer-grained skipping within a row group?
27. What are best practices for writing Parquet files in production?

Producing high-quality Parquet files that perform well at query time requires attention at write time:

  1. Target 128 MB–512 MB row groups — too small wastes footer reads; too large makes predicate skipping coarse.
  2. Sort data before writing on filter columns — tight min/max ranges per row group dramatically improve skipping.
  3. Choose the right compression codec — ZSTD for I/O-bound workloads; Snappy for CPU-bound (fast decompression).
  4. Enable dictionary encoding on low-cardinality columns — automatic in most frameworks but verify it is not being disabled.
  5. Partition on moderate-cardinality columns (e.g., date, country) — never on user IDs or UUIDs.
  6. Avoid tiny files — compact regularly if streaming or incremental writes produce many small files.
  7. Enable Bloom filters on high-cardinality equality columns (UUIDs, hashed IDs).
  8. Embed correct schema types — use TIMESTAMP_MICROS not INT96 (deprecated); use DECIMAL not DOUBLE for monetary values.
  9. Test with query benchmarks after schema changes to confirm no regression in pushdown effectiveness.
What is the recommended row group size range for production Parquet files?
Why is the deprecated INT96 timestamp type being replaced by TIMESTAMP_MICROS in Parquet?
28. How does Google BigQuery use Parquet-style columnar storage internally?

BigQuery uses Capacitor, its proprietary columnar format, which shares the same fundamental principles as Parquet: columnar layout, aggressive encoding, and embedded statistics. When you export from BigQuery or import into it, Parquet is the preferred external format.

Exporting BigQuery tables to Parquet:

bq extract \
  --destination_format PARQUET \
  --compression SNAPPY \
  myproject:mydataset.mytable \
  gs://my-bucket/export/*.parquet

Loading Parquet into BigQuery:

bq load \
  --source_format=PARQUET \
  myproject:mydataset.newtable \
  gs://my-bucket/data/*.parquet

BigQuery can also query external Parquet tables via BigLake without loading — applying its own predicate pushdown against Parquet statistics in GCS.

What is the name of BigQuery's internal columnar storage format?
Which bq CLI flag specifies the output format when extracting a BigQuery table to GCS?
29. What is Delta Lake and how does it extend Parquet for ACID transactions?

Delta Lake is an open-source storage layer built by Databricks that adds transactional guarantees on top of Parquet files stored in object storage. The core idea: all changes (inserts, updates, deletes) are written as immutable Parquet files and tracked via a JSON transaction log (the _delta_log directory).

ACID properties in Delta Lake:

  • Atomicity — a transaction either fully commits (log entry added) or is rolled back (log entry absent).
  • Consistency — schema enforcement prevents corrupt writes.
  • Isolation — Optimistic Concurrency Control (OCC) detects conflicts between concurrent writers.
  • Durability — log + data files in object storage are highly durable.

Data files remain Parquet; Delta adds a _delta_log/ with JSON commit files that record which Parquet files are added or removed in each transaction. Periodic checkpoint files (Parquet snapshots of the log) speed up log replay.

# Spark Delta write with schema enforcement
df.write.format("delta").mode("append").save("/delta/events")

# Time travel
spark.read.format("delta").option("versionAsOf", 5).load("/delta/events")
Where does Delta Lake store its transaction log alongside Parquet data files?
Which concurrency control mechanism does Delta Lake use to resolve conflicts between concurrent writers?
30. How do you perform upserts (MERGE INTO) on Parquet-based tables in Delta Lake?

Raw Parquet files are immutable — you cannot update individual rows. Delta Lake adds MERGE INTO support, which implements upserts by reading affected Parquet files, rewriting them with changes, and recording the transaction:

from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/delta/customers")

delta_table.alias("target").merge(
    updates_df.alias("source"),
    "target.customer_id = source.customer_id"
).whenMatchedUpdateAll(
).whenNotMatchedInsertAll(
).execute()

Or in SQL:

MERGE INTO customers AS target
USING updates AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

Under the hood: Delta identifies which Parquet files contain matching rows, rewrites only those files with updated values, and records the old files as removed and new files as added in the transaction log. Non-matching files are untouched.

Why can't you update individual rows directly in raw Parquet files?
In a Delta Lake MERGE INTO operation, what happens to the original Parquet files that contained matched rows?