
Database / Snowflake Interview Questions

1. What is Snowflake and how does its multi-cluster, shared-data architecture differ from traditional data warehouses?
2. What are the three layers of Snowflake's architecture (Storage, Compute, Cloud Services) and what does each do?
3. What is a Virtual Warehouse in Snowflake and how does it scale independently of storage?
4. What is the difference between Snowflake's columnar storage and row-based storage in traditional RDBMS?
5. What is the Snowflake Cloud Services layer and what components does it manage?
6. What are micro-partitions in Snowflake and how do they enable automatic data clustering?
7. What is data clustering in Snowflake and when should you define a clustering key?
8. What is the difference between the Snowflake Standard, Enterprise, Business Critical, and Virtual Private Snowflake editions?
9. How does Snowflake handle multi-cloud deployment and cross-cloud replication?
10. What is Snowflake's separation of compute from storage and what billing advantages does it provide?
11. What are the stages in Snowflake (internal vs external) and how do you use them for data loading?
12. How does the COPY INTO command work and what file formats does it support?
13. What is Snowpipe and how does it enable continuous / serverless data ingestion?
14. What is the difference between bulk loading with COPY INTO and micro-batch loading with Snowpipe?
15. How does Snowflake handle semi-structured data (JSON, Avro, Parquet, ORC) with the VARIANT type?
16. What are Snowflake Dynamic Tables and how do they differ from regular tables and views?
17. What is Time Travel in Snowflake and how does it work (retention period, UNDROP, AT/BEFORE)?
18. What is Fail-safe in Snowflake and how does it differ from Time Travel?
19. What is the Snowflake Query Profile and how do you use it to diagnose slow queries?
20. What is result caching in Snowflake and under what conditions does it apply?
21. What is the metadata cache (Cloud Services layer cache) and how does it speed up queries?
22. What is a clustering key and how does it improve partition pruning for large tables?
23. What are Snowflake Materialized Views and when should you use them over regular views?
24. What is the difference between a Snowflake View, Materialized View, and Dynamic Table?
25. How do you optimize query performance in Snowflake (warehouse sizing, clustering, pruning, result cache)?
26. What are Snowflake Streams and how do they implement Change Data Capture (CDC)?
27. What are Snowflake Tasks and how do you schedule SQL transformations with them?
28. How does Snowflake implement Role-Based Access Control (RBAC) and what are the system-defined roles?
29. What is column-level security in Snowflake (Dynamic Data Masking and Column-level Security policies)?
30. What is Row Access Policy in Snowflake and how does it implement row-level security?
31. How does Snowflake encrypt data at rest and in transit?
32. What is Snowflake's Tri-Secret Secure model and when is it used?
33. What are Snowflake Object Tags and Data Classification and how do they support governance?
34. What is the Snowflake Access History feature and how does it support audit and compliance?
35. What is Snowflake Secure Data Sharing and how does it work without copying data?
36. What is the Snowflake Data Marketplace and what types of data products are available?
37. What are Snowflake Data Clean Rooms and what privacy problems do they solve?
38. What is Snowpark and how does it allow Python/Java/Scala code to run inside Snowflake?
39. What are Snowflake Native Apps and how does the Native App Framework work?
40. What are Snowflake External Tables and when would you use them over internal tables?
41. What is Snowflake's multi-cluster warehouse and how does it handle concurrency auto-scaling?
42. What is Resource Monitor in Snowflake and how do you use it to control credit consumption?
43. How does Snowflake support ELT patterns and how does it compare to ETL?
44. What are common Snowflake anti-patterns and performance pitfalls to avoid?
45. How does Snowflake compare to BigQuery and Redshift in architecture and pricing model?

1. What is Snowflake and how does its multi-cluster, shared-data architecture differ from traditional data warehouses?

Snowflake is a cloud-native data platform delivered as a fully managed SaaS, built on cloud object storage (S3, Azure Blob, GCS). Its defining architectural trait is complete decoupling of storage from compute, setting it apart from every traditional warehouse model.

Traditional warehouses use either shared-nothing (Teradata, Redshift) where each node owns local disks — adding compute forces proportional storage growth — or shared-disk (Oracle RAC) where a central storage bus creates a bottleneck. Snowflake's multi-cluster shared-data model stores all data in cloud object storage and lets any number of independent Virtual Warehouses read from it simultaneously with no contention. An ETL warehouse and a BI warehouse can query the same petabyte at the same time, each billed only for their own compute seconds.

Architecture Comparison

Aspect | Shared-Nothing (Redshift) | Snowflake Multi-Cluster
Storage/compute coupling | Tightly coupled per node | Fully decoupled
Multiple concurrent workloads | One cluster, resource contention | Independent warehouses, no contention
Scale compute alone | No — nodes carry their own storage | Yes — resize warehouse without touching data
Idle cost | Pay for cluster 24/7 | Suspend warehouse = zero compute billing
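This isolation can be sketched directly in SQL; the warehouse and table names below are illustrative, not from any specific deployment:

-- Two independent warehouses, both reading the same shared data
CREATE WAREHOUSE etl_wh WAREHOUSE_SIZE = 'LARGE'  AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;
CREATE WAREHOUSE bi_wh  WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;

-- Session 1: ETL writes to the shared table
USE WAREHOUSE etl_wh;
INSERT INTO sales SELECT * FROM staging_sales;

-- Session 2: BI queries the same table concurrently, with no contention
USE WAREHOUSE bi_wh;
SELECT region, SUM(amount) FROM sales GROUP BY region;

Each session is billed only for its own warehouse's compute seconds; the shared data is stored (and billed) once.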
Which architectural model does Snowflake use where independent compute clusters read from a single shared cloud storage layer?
In a traditional shared-nothing data warehouse, adding more compute nodes results in what happening to storage?
2. What are the three layers of Snowflake's architecture (Storage, Compute, Cloud Services) and what does each do?

Snowflake is built on three independently operating layers, each with a clearly defined responsibility.

Storage Layer holds all data as columnar, compressed micro-partition files in cloud object storage (S3, Azure Blob, or GCS). Snowflake manages all storage operations automatically — no manual partitioning, indexing, or vacuuming required. The data is always available to any compute cluster that needs it.

Compute Layer consists of Virtual Warehouses — MPP clusters from X-Small to 6X-Large. Each warehouse is a temporary set of cloud compute instances that reads from storage, executes queries, and writes results. Warehouses start, resize, and suspend in seconds, completely independently of the data layer.

Cloud Services Layer is Snowflake's always-on brain. It handles query parsing and optimization, micro-partition metadata management (min/max values per column per partition), authentication and authorization, ACID transaction management, and infrastructure coordination. It runs continuously and is only billed when daily consumption exceeds 10% of compute credits.

-- Create a Virtual Warehouse in the Compute layer
CREATE WAREHOUSE reporting_wh
  WAREHOUSE_SIZE = 'LARGE'
  AUTO_SUSPEND = 300
  AUTO_RESUME = TRUE;

-- COUNT(*) is answered from Cloud Services metadata -- zero compute credits
SELECT COUNT(*) FROM orders;
Which Snowflake layer is responsible for query optimization, metadata management, and authentication?
Where is Snowflake table data physically stored in the Storage layer?
3. What is a Virtual Warehouse in Snowflake and how does it scale independently of storage?

A Virtual Warehouse is a named MPP compute cluster in Snowflake that executes SQL queries, loads data, and performs DML operations. Unlike traditional databases where compute and storage share the same physical nodes, a Virtual Warehouse borrows cloud compute on demand and reads data directly from the shared cloud object storage layer — no data is moved when a warehouse starts, resizes, or suspends.

Warehouses scale in two dimensions. Scale up (resize to a larger T-shirt size) gives more memory and CPU per query — useful for complex aggregations, large sorts, or multi-table joins. Sizes double credits per hour: X-Small=1 credit, Small=2, Medium=4, Large=8, X-Large=16, 2X-Large=32, 3X-Large=64, 4X-Large=128. Scale out (multi-cluster warehouses, Enterprise+) adds parallel clusters to handle concurrent user workloads. Resizing is near-instant and affects only compute — the storage layer is completely untouched.

Auto-suspend stops billing when idle (configurable in seconds, with a 60-second minimum). Auto-resume brings the warehouse back on the next query, typically within a second or two. Together they make pay-per-second compute practical for real workloads.

CREATE WAREHOUSE etl_wh
  WAREHOUSE_SIZE = 'X-LARGE'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE;

-- Resize without any data movement
ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = '2X-LARGE';

-- Suspend manually; billing stops immediately
ALTER WAREHOUSE etl_wh SUSPEND;
What happens to Snowflake data when a Virtual Warehouse is suspended?
How many Snowflake credits per hour does an X-Large warehouse consume?
4. What is the difference between Snowflake's columnar storage and row-based storage in traditional RDBMS?

In row-based storage (MySQL, PostgreSQL, Oracle), a complete row is written as a contiguous block on disk. This is optimal for OLTP: a single-row read or update is one I/O. But analytical queries that select 3 columns out of 100 must scan the full row for every record — wasting I/O on irrelevant data.

In columnar storage (Snowflake, BigQuery, Redshift), all values from the same column are stored contiguously. An analytical SELECT of 3 columns skips the other 97 entirely — no wasted I/O. Values of the same data type also compress far better: run-length encoding, delta encoding, and dictionary encoding on homogeneous column data yield 3–10x compression ratios versus mixed-type row blocks. Snowflake additionally stores per-column min/max metadata per micro-partition, enabling partition pruning that is impossible with row-oriented storage.
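As a quick illustration (the wide_events table here is hypothetical), a columnar engine answers the query below by reading only the two referenced column files in each micro-partition:

-- Only event_type and duration_ms are read; the other columns are never touched
SELECT event_type, AVG(duration_ms) AS avg_ms
FROM wide_events          -- assume a ~100-column event table
GROUP BY event_type;

-- A row store would read every full row to answer the same query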

Row-Based vs Columnar Storage

Aspect | Row-Based | Columnar (Snowflake)
Best for | OLTP (point reads/writes) | OLAP (scans, aggregations)
Compression | Moderate (mixed types per block) | High (same type per column)
Partial column query | Must read full row | Only reads selected columns
Single-row access | Fast — 1 I/O | Slower — reconstructs across column blocks
Why does columnar storage achieve higher compression ratios for analytical data?
For a system with frequent single-row reads and updates (OLTP), which storage format is more efficient?
5. What is the Snowflake Cloud Services layer and what components does it manage?

The Cloud Services layer is Snowflake's always-on intelligence layer. It runs within Snowflake's managed infrastructure — not on customer-provisioned resources — and coordinates everything before and after a Virtual Warehouse executes a query. It is billed only when its daily consumption exceeds 10% of compute credits; in most workloads this threshold is never crossed.

The six primary components it manages:

  • Query Parser and Optimizer — parses SQL, resolves names, builds the execution plan, and selects which micro-partitions to scan via partition pruning.
  • Metadata Service — stores min/max column values, null counts, row counts, and distinct value estimates per micro-partition for every table. Enables partition pruning and zero-credit COUNT(*) answers.
  • Authentication and Authorization — handles logins (password, key-pair, OAuth, SAML/SSO) and all RBAC privilege checks before a query touches any data.
  • Transaction Manager — enforces ACID semantics, manages multi-statement transactions, and maintains snapshot isolation for concurrent readers and writers.
  • Infrastructure Manager — provisions and monitors Virtual Warehouse compute nodes; drives auto-suspend and auto-resume events.
  • Security Service — manages the hierarchical AES-256 key model, handles Tri-Secret Secure key requests, and enforces column and row security policies at query runtime.
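A few statements that exercise these components directly (object and role names are illustrative):

-- Typically answered from Metadata Service statistics alone, no warehouse scan
SELECT COUNT(*) FROM orders;
SELECT MIN(order_date), MAX(order_date) FROM orders;

-- Privilege change enforced by the Authorization component on every later query
GRANT SELECT ON TABLE orders TO ROLE analyst_role;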
Which Cloud Services component uses per-partition min/max metadata to determine which micro-partitions a query must scan?
Snowflake's ACID transaction management and snapshot isolation for concurrent queries are managed by which layer?
6. What are micro-partitions in Snowflake and how do they enable automatic data clustering?

Micro-partitions are the fundamental storage units in Snowflake. When data is loaded, Snowflake automatically divides it into contiguous blocks of 50–500 MB of uncompressed data (typically ~16 MB compressed on disk). Each micro-partition is stored as a columnar compressed file in cloud object storage. The key optimization property comes from the rich metadata Snowflake stores for each partition in the Cloud Services layer — not the data itself.

Per micro-partition, Snowflake records: minimum and maximum column values for every column, null count, distinct value count, and row count. When the query optimizer evaluates a predicate like WHERE order_date = '2024-03-15', it reads this metadata and skips any micro-partition whose max(order_date) is earlier or whose min(order_date) is later. This is partition pruning — no data bytes are read from skipped partitions.

Automatic clustering refers to the fact that data loaded chronologically (e.g., daily appended events) naturally groups similar values into the same micro-partitions. For tables where load order does not match query patterns, explicit clustering keys can trigger reorganization. The better the clustering, the fewer micro-partitions must be scanned per predicate, directly reducing query cost.

-- Check clustering quality of the sales table
SELECT SYSTEM$CLUSTERING_INFORMATION('sales', '(order_date)');

-- Micro-partition statistics via Information Schema
SELECT table_name, clustering_key, row_count, bytes
FROM information_schema.tables
WHERE table_schema = 'PUBLIC';
What is the typical uncompressed size range of a single Snowflake micro-partition?
Which metadata stored in the Cloud Services layer allows Snowflake to skip entire micro-partitions without reading data?
7. What is data clustering in Snowflake and when should you define a clustering key?

Data clustering describes how closely the physical order of rows across micro-partitions aligns with the columns used in query predicates. When data is well-clustered on a column, rows with similar values land in the same micro-partitions. A range predicate on that column then prunes away most partitions because their min/max ranges do not overlap the filter value.

A clustering key is an explicit declaration telling Snowflake to sort and co-locate data based on one or more columns or expressions. The Automatic Clustering background service (Enterprise+) periodically re-clusters micro-partitions to maintain this sort order as new data arrives. Snowflake charges for the background compute used.

The key metric is average_depth from SYSTEM$CLUSTERING_INFORMATION(): a value of 1.0 means perfect clustering (no partition overlap); higher values mean more overlap and worse pruning efficiency.

Define a clustering key when: the table is larger than ~1 TB; queries frequently filter on a predictable column set (date, region, category); and data does not arrive pre-sorted. Skip it when: the table is small, accessed randomly, or frequently truncated and reloaded.

-- Define clustering key on sales table
ALTER TABLE sales CLUSTER BY (order_date, region);

-- Check current clustering quality
SELECT PARSE_JSON(
  SYSTEM$CLUSTERING_INFORMATION('sales', '(order_date, region)')
):average_depth::FLOAT AS avg_depth;

-- Query that benefits from clustering-based pruning
SELECT product_id, SUM(amount)
FROM sales
WHERE order_date BETWEEN '2024-01-01' AND '2024-03-31'
  AND region = 'EMEA'
GROUP BY product_id;
Which Snowflake function returns the clustering depth metric to assess partitioning quality on a given column set?
The Automatic Clustering background service that maintains sort order on clustered tables is available starting from which edition?

8. What is the difference between the Snowflake Standard, Enterprise, Business Critical, and Virtual Private Snowflake editions?

Snowflake offers four service editions, each a strict superset of the previous. Edition choice is driven primarily by compliance requirements, data sensitivity, and advanced concurrency needs.

Snowflake Edition Feature Matrix

Feature | Standard | Enterprise | Business Critical | VPS
Multi-cluster warehouses | No | Yes | Yes | Yes
Time Travel retention | Up to 1 day | Up to 90 days | Up to 90 days | Up to 90 days
Materialized Views | No | Yes | Yes | Yes
Column/row security policies | No | Yes | Yes | Yes
HIPAA / PCI DSS compliance | No | No | Yes | Yes
Tri-Secret Secure (CMK) | No | No | Yes (add-on) | Yes
AWS PrivateLink / Azure PL | No | No | Yes | Yes
Dedicated hardware isolation | No | No | No | Yes
Account failover/failback | No | No | Yes | Yes

Virtual Private Snowflake (VPS) runs on dedicated hardware fully isolated from all other Snowflake customers. It targets financial institutions and government agencies requiring hardware-level isolation and customer-managed encryption throughout the entire environment.

Which is the minimum Snowflake edition that enables multi-cluster warehouses?
A healthcare organization needing HIPAA compliance and customer-managed encryption keys requires at minimum which edition?
9. How does Snowflake handle multi-cloud deployment and cross-cloud replication?

Snowflake accounts are created on a specific cloud platform (AWS, Azure, or GCP) and in a specific region. While the primary account is tied to that choice, Snowflake supports replicating data and account objects across regions and even across cloud providers — enabling disaster recovery, regional data residency compliance, and low-latency access for global users.

Database Replication copies individual databases from a primary account to one or more secondary (read-only replica) accounts on any supported cloud/region combination. Replicas are kept in sync on a configurable refresh schedule. Replication Groups extend this to bundle multiple databases and other objects into a single consistent snapshot unit.

Failover Groups add the ability to promote a secondary account to primary (failover) and demote the original (failback), delivering true disaster recovery with a configurable RPO. Account-level replication (Business Critical+) also replicates users, roles, resource monitors, and warehouses.

  • Cross-cloud replication (e.g., AWS → Azure) works the same as cross-region but incurs cloud egress charges.
  • Within the same cloud provider, cross-region replication is cheaper.
  • Secondary accounts serve read-only queries and regional data sharing to local consumers.
  • Replication is billed: compute credits for the refresh run plus data transfer fees.
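A minimal replication sketch (organization, account, and database names are placeholders):

-- On the primary account: permit replication to a secondary account
ALTER DATABASE sales_db ENABLE REPLICATION TO ACCOUNTS myorg.dr_account;

-- On the secondary account: create the read-only replica
CREATE DATABASE sales_db AS REPLICA OF myorg.primary_account.sales_db;

-- Refresh the replica (typically scheduled via a Task for continuous sync)
ALTER DATABASE sales_db REFRESH;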
Which Snowflake object type, when promoted, allows a secondary account to become the new primary with read-write capability?
Which cloud platforms does Snowflake currently support for native account deployment?
10. What is Snowflake's separation of compute from storage and what billing advantages does it provide?

In Snowflake, storage and compute are billed as completely independent resources. Storage is charged at a flat rate per terabyte per month of compressed data in cloud object storage, regardless of query volume. Compute is charged per credit-second, and only when a Virtual Warehouse is actively in the STARTED state.

This separation creates three concrete billing advantages. First, an idle warehouse can be suspended in seconds, immediately stopping all compute billing while data remains intact and instantly accessible. Second, multiple warehouses reading the same data cost no extra storage — a marketing analytics warehouse and a data science warehouse can both query the same 50 TB of data, each paying only for their own compute time. Third, you can spin up a powerful warehouse for a one-hour intensive job and pay for only those 3,600 seconds of compute, rather than running that power 24/7.

Compare this to an Amazon Redshift provisioned cluster: unless it is manually paused, the cluster runs (and bills) continuously whether queries execute or not. Redshift Serverless partially addresses this, but Snowflake's model is more granular and flexible.

  • Storage: billed per TB/month compressed (typically 40–60% of uncompressed size).
  • Compute: billed per credit-second; 1 credit costs roughly $2–4 depending on edition and cloud region.
  • Cloud Services: free up to 10% of daily compute consumption; excess billed separately.
  • Data transfer: egress charges apply for cross-region or cross-cloud replication.
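To see where compute spend actually goes, you can query the ACCOUNT_USAGE share (this sketch assumes your role has access to the shared SNOWFLAKE database):

-- Credits consumed per warehouse over the last 7 days
SELECT warehouse_name, SUM(credits_used) AS credits
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits DESC;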
When a Snowflake Virtual Warehouse is suspended, what immediately changes about billing?
Two Virtual Warehouses querying the same 100 TB Snowflake table results in how much additional storage cost?
11. What are the stages in Snowflake (internal vs external) and how do you use them for data loading?

A stage in Snowflake is a named location holding data files before they are loaded into a table (or after being unloaded from one). Stages abstract the physical storage location so the COPY INTO command works identically whether files sit in Snowflake-managed storage or in your own cloud bucket.

Internal stages are managed by Snowflake within its own cloud storage. Three kinds exist: the user stage (@~, one per user, auto-created), the table stage (@%table_name, one per table, auto-created), and named internal stages (@my_stage, created explicitly, shareable across roles and sessions).

External stages point to customer-owned cloud storage (S3 bucket, Azure Blob container, GCS bucket). Defined with a URL and either inline credentials or a Storage Integration object (recommended — avoids embedding credentials in DDL). Supports all the same file formats as COPY INTO.

-- Create a named internal stage with a named file format
CREATE STAGE my_internal_stage
  FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = ',' SKIP_HEADER = 1);

-- Upload a local file to the internal stage (run in SnowSQL)
PUT file:///local/path/orders.csv @my_internal_stage;

-- List files in a stage
LIST @my_internal_stage;

-- Create an external stage pointing to S3 using a Storage Integration
CREATE STAGE my_s3_stage
  URL = 's3://my-bucket/data/'
  STORAGE_INTEGRATION = my_s3_integration
  FILE_FORMAT = (TYPE = 'PARQUET');
Which type of Snowflake internal stage is automatically available for every user without needing to create it?
Which command uploads a local file from a client machine to a Snowflake internal stage?
12. How does the COPY INTO command work and what file formats does it support?

COPY INTO is Snowflake's bulk data loading command. It reads files from a stage (internal or external) and loads them into a target table in a single transactional operation. The command tracks which files have already been loaded using a 64-day load metadata store — by default, re-running COPY INTO on the same files is a no-op, preventing accidental duplicates.

Supported file formats: CSV, JSON, Avro, ORC, Parquet, and XML. Format options are specified inline or via a named FILE FORMAT object. Key copy options control error handling and load behavior:

  • ON_ERROR = ABORT_STATEMENT (default) — rolls back the entire load on any error.
  • ON_ERROR = CONTINUE — skips bad rows and loads the rest.
  • ON_ERROR = SKIP_FILE — skips an entire file that contains any error.
  • PURGE = TRUE — deletes source files from the stage after a successful load.
  • FORCE = TRUE — reloads files even if already loaded (ignores load metadata).
  • VALIDATION_MODE = RETURN_ERRORS — dry-run; validates files without loading.
COPY INTO orders
FROM @my_s3_stage/orders/
FILE_FORMAT = (
  TYPE = 'CSV'
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  SKIP_HEADER = 1
  NULL_IF = ('NULL', 'null', ''))
ON_ERROR = 'CONTINUE'
PURGE = TRUE;

-- Validate without loading
COPY INTO orders FROM @my_s3_stage
VALIDATION_MODE = 'RETURN_ERRORS';
What does ON_ERROR = 'CONTINUE' do during a COPY INTO operation?
Which COPY INTO option forces Snowflake to reload files already successfully loaded in a previous run?
13. What is Snowpipe and how does it enable continuous / serverless data ingestion?

Snowpipe is Snowflake's serverless, event-driven data ingestion service. Unlike COPY INTO which requires a running Virtual Warehouse and an explicit trigger, Snowpipe uses Snowflake-managed serverless compute and fires automatically when new files arrive in a stage. Typical latency is under one minute from file arrival to data being queryable.

Snowpipe supports two trigger modes. Auto-ingest (preferred): you configure the cloud storage bucket to emit event notifications — Amazon SQS for S3, Azure Event Grid for Blob Storage, Google Pub/Sub for GCS — to Snowflake. When files land, the notification triggers the pipe automatically with no polling needed. REST API mode: call Snowpipe's REST endpoint programmatically with a list of file paths; useful when you control the producer and need deterministic triggering.

Billing is based on compute seconds consumed for loading, not warehouse credits. This makes Snowpipe cost-effective for small, frequent file arrivals that would otherwise require a warehouse to run continuously.

-- Create a pipe that auto-triggers on new S3 files
CREATE PIPE orders_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO orders
  FROM @my_s3_stage/orders/
  FILE_FORMAT = (TYPE = 'JSON');

-- After creation, retrieve the SQS ARN and configure S3 event notification
SHOW PIPES LIKE 'orders_pipe';
For AWS S3 auto-ingest, which AWS service delivers file-arrival event notifications to Snowpipe?
What compute resource executes Snowpipe data loads?
14. What is the difference between bulk loading with COPY INTO and micro-batch loading with Snowpipe?

COPY INTO and Snowpipe both load files from stages into Snowflake tables, but they differ fundamentally in trigger mechanism, compute model, transaction semantics, and cost structure.

COPY INTO vs Snowpipe

Dimension | COPY INTO (Bulk) | Snowpipe (Micro-batch)
Trigger | Explicit SQL command (manual, Task, or orchestrator) | Auto-ingest event notification or REST API
Compute | Runs on user Virtual Warehouse | Snowflake serverless compute
Billing | Credits per second of warehouse time | Compute-seconds for loading
Latency | Minutes to hours (batch cadence) | ~1 minute after file arrives
Transactions | Full ACID; can be rolled back | At-least-once; no explicit rollback
Error handling | ON_ERROR options; full rollback available | Failed files retried automatically
Optimal file size | 100–250 MB compressed | Works with smaller files; batches them
Best for | Nightly ETL, large batch loads | Near-real-time streaming ingestion
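In practice the bulk pattern is often driven by a Task on a fixed cadence, in contrast to Snowpipe's event trigger. A sketch (warehouse, stage, and table names are illustrative):

-- Bulk pattern: scheduled COPY INTO on a user warehouse
CREATE TASK load_orders_hourly
  WAREHOUSE = etl_wh
  SCHEDULE = '60 MINUTE'
AS
  COPY INTO orders
  FROM @my_s3_stage/orders/
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

-- Tasks are created suspended; resume to start the schedule
ALTER TASK load_orders_hourly RESUME;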
Which loading method is best for near-real-time ingestion of small, frequent files arriving continuously?
Snowpipe billing is based on which unit, distinct from regular warehouse credit billing?
15. How does Snowflake handle semi-structured data (JSON, Avro, Parquet, ORC) with the VARIANT type?

Snowflake stores semi-structured data in a VARIANT column — a flexible container that holds any valid JSON, Avro, Parquet, ORC, or XML value up to 16 MB per cell. Unlike traditional columns, VARIANT does not require a predefined schema. You can load raw JSON with an evolving structure and query the fields you need at read time via dot and bracket notation.

Internally, Snowflake parses and columnarizes VARIANT data at load time, extracting paths and types into an efficient encoding. This allows the optimizer to apply column-level statistics on frequently accessed paths, enabling partial partition pruning even for semi-structured data.

  • Dot notation: src:customer.address.city::STRING — navigates nested objects.
  • Bracket notation: src['customer']['address'] — equivalent to dot notation.
  • Array indexing: src:items[0]:product_id::NUMBER.
  • FLATTEN: lateral table function that explodes arrays or objects into rows for relational processing.
  • Type casting: ::STRING, ::NUMBER, ::TIMESTAMP_NTZ cast extracted values to typed columns for better statistics and comparison performance.
-- Table with a VARIANT column for raw JSON events
CREATE TABLE events (
  event_id  NUMBER AUTOINCREMENT,
  raw       VARIANT,
  loaded_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);

-- Query nested JSON using dot notation with type cast
SELECT
  raw:user_id::STRING     AS user_id,
  raw:event_type::STRING  AS event_type,
  raw:properties:page::STRING AS page
FROM events
WHERE raw:event_type::STRING = 'pageview';

-- FLATTEN an array of order items into rows
SELECT
  o.raw:order_id::STRING AS order_id,
  item.value:sku::STRING AS sku,
  item.value:qty::NUMBER AS qty
FROM orders o,
LATERAL FLATTEN(INPUT => o.raw:items) item;
What is the maximum size of a single VARIANT column value in Snowflake?
Which Snowflake table function expands a VARIANT array or object into multiple relational rows?
16. What are Snowflake Dynamic Tables and how do they differ from regular tables and views?

A Dynamic Table is a Snowflake object that automatically maintains the result of a defining SQL query and refreshes it to stay within a user-specified staleness limit. You declare what you want the table to contain and how fresh it must be — Snowflake handles all the scheduling, incremental or full refreshes, and dependency tracking.

The TARGET_LAG parameter is the core control: it sets the maximum acceptable staleness (e.g., '5 minutes', '1 hour'). Snowflake refreshes the Dynamic Table on a schedule that keeps actual lag within that target. Refreshes run on a user-specified warehouse or on serverless compute.

Dynamic Tables can be chained: if DT_B depends on DT_A, Snowflake refreshes them in the correct order automatically — similar to a DAG without explicit orchestration. This makes them ideal for multi-step ELT pipelines.

Regular Table vs View vs Dynamic Table

Aspect | Regular Table | View | Dynamic Table
Data stored | Yes | No | Yes
Auto-refreshed | No | N/A | Yes (TARGET_LAG)
Supports multi-table joins | Yes | Yes | Yes
Best for | Manual ETL loads | Query abstraction | Declarative pipeline results
CREATE DYNAMIC TABLE customer_summary
  TARGET_LAG = '10 minutes'
  WAREHOUSE = transform_wh
AS
  SELECT
    c.customer_id,
    c.name,
    COUNT(o.order_id)  AS total_orders,
    SUM(o.amount)      AS lifetime_value
  FROM customers c
  LEFT JOIN orders o ON c.customer_id = o.customer_id
  GROUP BY c.customer_id, c.name;
Which Dynamic Table parameter specifies the maximum allowable staleness of the table's data?
What advantage does a Snowflake Dynamic Table pipeline have over a regular Materialized View for multi-step transformations?
17. What is Time Travel in Snowflake and how does it work (retention period, UNDROP, AT/BEFORE)?

Time Travel is Snowflake's ability to query, clone, or restore data as it existed at any point within a configurable retention window. When a row is deleted or updated, Snowflake does not immediately remove the old micro-partitions — it retains them for the duration of the Time Travel period. The original data is accessible via special SQL syntax.

The AT clause queries data as it existed at a specific moment using a timestamp, a negative offset in seconds from now, or a statement ID. The BEFORE clause returns the state just before a named statement executed — useful for recovering from a mistaken DML. After the retention window expires, data moves to the Fail-safe tier.

Retention period: Standard edition = 1 day. Enterprise+ = configurable per object up to 90 days (set via DATA_RETENTION_TIME_IN_DAYS). Longer retention increases storage costs since old micro-partitions are retained. Transient and temporary tables have 0-day Fail-safe and max 1-day Time Travel.

-- Query orders as they existed one hour ago
SELECT * FROM orders AT (OFFSET => -3600);

-- Restore data before an accidental DELETE (use the query_id of the DELETE statement)
SELECT * FROM orders BEFORE (STATEMENT => '01ab23cd-...');

-- Clone a table to its state from yesterday as a point-in-time backup
CREATE TABLE orders_backup
  CLONE orders AT (TIMESTAMP => '2024-06-01 09:00:00'::TIMESTAMP_NTZ);

-- Recover a dropped table (must be within retention window)
UNDROP TABLE orders;
Which Time Travel SQL clause queries a table exactly 3 hours before the current time?
What SQL command recovers a Snowflake table that was accidentally dropped within the retention window?
18. What is Fail-safe in Snowflake and how does it differ from Time Travel?

Fail-safe is an additional 7-day data protection period that begins after a table's Time Travel retention window expires. It is Snowflake's last-resort disaster recovery safety net — but unlike Time Travel, it is not accessible to customers directly. Only Snowflake Support personnel can initiate a Fail-safe recovery, and only for genuine disaster scenarios.

Time Travel vs Fail-safe
Dimension | Time Travel | Fail-safe
Who can access | The account owner / any authorized user | Snowflake Support only
Retention duration | 0–90 days (configurable per object) | Fixed 7 days (not configurable)
Trigger | AT/BEFORE syntax; UNDROP command | Support ticket to Snowflake
Use case | Accidental DML recovery, cloning, historical analysis | Data-centre-level disaster recovery
Applies to | Permanent tables, schemas, databases | Permanent tables only
Transient/temp tables | 0 or 1 day | 0 days (no Fail-safe)
Storage cost | Yes — old micro-partitions billed | Yes — same rate as active storage

A practical implication: after you drop a table and the Time Travel window closes, you cannot recover it yourself. If the data is critical, open a Snowflake Support case within the 7-day Fail-safe window before it expires permanently.
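Both tiers show up on the storage bill, and the split can be monitored per table; a sketch against the ACCOUNT_USAGE layer (the database name is illustrative):

```sql
-- Per-table breakdown of active vs Time Travel vs Fail-safe storage
SELECT table_name,
       active_bytes,
       time_travel_bytes,
       failsafe_bytes
FROM snowflake.account_usage.table_storage_metrics
WHERE table_catalog = 'MYDB'   -- illustrative database name
ORDER BY failsafe_bytes DESC;
```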

Which statement correctly describes who can access Snowflake Fail-safe data?
What is the fixed duration of Snowflake's Fail-safe period for permanent tables?
19. What is the Snowflake Query Profile and how do you use it to diagnose slow queries?

Query Profile is Snowflake's built-in visual execution plan viewer, available in Snowsight under Query History for any completed or actively running query. It shows the full operator tree for a query execution — every node representing a distinct processing step — along with timing, row counts, and resource metrics at each step. It is your first stop when a query is slower than expected.

Key operator node types: TableScan (reads micro-partitions), Filter, Aggregate, Join, Sort, Exchange (shuffle data between nodes), Projection.

The most important diagnostic signals:

  • Partitions Scanned vs Partitions Total — if scanned is close to total, the query is not pruning well. Consider adding a clustering key on the filter columns.
  • Bytes Spilled to Local Storage — the warehouse ran out of in-memory space for intermediate results; consider scaling up the warehouse size.
  • Bytes Spilled to Remote Storage — spilling all the way to cloud object storage, which is slower and costlier still; a significantly larger warehouse or a query rewrite is needed.
  • Bytes Sent over Network in Join nodes — large data shuffle between nodes; pushing down more filters earlier can reduce this.
  • Compilation Time vs Execution Time — very high compilation time relative to execution indicates a complex query plan; simplify CTEs or break into steps.
  • % of execution per node — a single node consuming 90%+ of time is the bottleneck; focus optimization there.
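The spill signals above can also be hunted account-wide rather than one profile at a time; a sketch against ACCOUNT_USAGE.QUERY_HISTORY:

```sql
-- Worst remote-spilling queries of the last 7 days
SELECT query_id,
       warehouse_name,
       bytes_spilled_to_local_storage,
       bytes_spilled_to_remote_storage,
       total_elapsed_time / 1000 AS elapsed_s
FROM snowflake.account_usage.query_history
WHERE bytes_spilled_to_remote_storage > 0
  AND start_time > DATEADD('day', -7, CURRENT_TIMESTAMP())
ORDER BY bytes_spilled_to_remote_storage DESC
LIMIT 20;
```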
In Snowflake Query Profile, what does a large value for Bytes Spilled to Remote Storage indicate?
A Query Profile shows that 980 out of 1000 total partitions were scanned. What optimization would most likely reduce this?
20. What is result caching in Snowflake and under what conditions does it apply?

Result caching stores the complete output of a query in the Cloud Services layer for 24 hours. When an identical query runs again within that window, Snowflake returns the cached result instantly — no warehouse compute is consumed, no storage is read, and no credits are charged. This is Snowflake's most impactful zero-cost optimization for repeated analytical queries and dashboards.

For the result cache to apply, all of the following conditions must hold:

  • The SQL text is character-for-character identical (whitespace and comments count).
  • The same role context is active (ensures the user still has access to the underlying data).
  • No DML (INSERT, UPDATE, DELETE, MERGE, COPY INTO) has touched any underlying table since the result was cached.
  • The query contains no non-deterministic functions: CURRENT_TIMESTAMP(), CURRENT_DATE(), RANDOM(), UUID_STRING(), SEQ4(), etc.
  • The query does not call external functions.

Notably, the result cache is independent of the virtual warehouse: a cached result is returned even if the original query ran on a different warehouse, or no warehouse is running at all.

Common cache-busting mistakes: adding LIMIT 10, appending a comment, changing column aliases, or calling GETDATE() in a WHERE clause. Verify cache usage in the Query Profile: a cache hit is displayed as a single QUERY RESULT REUSE node instead of a full operator tree.

-- First execution: runs on warehouse, result cached
SELECT region, SUM(revenue) AS total
FROM sales
GROUP BY region;

-- Second execution (identical SQL): served from cache, 0 credits
SELECT region, SUM(revenue) AS total
FROM sales
GROUP BY region;

-- Disabling result cache for benchmarking
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
How long does Snowflake's result cache retain a query result?
Which function call in a query always prevents the Snowflake result cache from being used?
21. What is the metadata cache (Cloud Services layer cache) and how does it speed up queries?

The metadata cache lives entirely within the Cloud Services layer and is separate from both the result cache and the local SSD cache on Virtual Warehouse nodes. It stores structural and statistical information about every micro-partition for every table: per-column minimum and maximum values, null counts, row counts, and approximate distinct value counts. This metadata is updated in real time as data is loaded, inserted, or deleted.

The metadata cache serves two critical performance functions. First, the query optimizer reads it before any data scan begins, using column min/max ranges to decide which micro-partitions are irrelevant to the predicate (partition pruning) — this happens entirely within Cloud Services and costs no warehouse credits. Second, certain queries can be answered completely from metadata without reading a single data file, consuming zero compute credits:

  • SELECT COUNT(*) FROM orders — answered from stored row count.
  • SELECT MIN(order_date), MAX(order_date) FROM orders — answered from stored min/max.
  • SELECT COUNT(DISTINCT region) FROM orders (approximate) — can use HLL sketches.
  • Schema queries: DESCRIBE TABLE, SHOW TABLES.
Which query can Snowflake answer entirely from the metadata cache without consuming any warehouse compute credits?
What metadata information does the Cloud Services layer store per micro-partition to enable partition pruning?
22. What is a clustering key and how does it reduce partition pruning cost for large tables?

A clustering key is a column (or expression) declared on a Snowflake table that tells Snowflake to physically sort and co-locate rows with similar values into the same micro-partitions. When the key is a date column, all rows for a given month land in the same small set of partitions. A query filtering on that month then only reads those partitions — the rest are skipped by partition pruning.

Without a clustering key, data arrives in load order. On a table loaded from hundreds of daily files over several years, any given month's rows may be scattered across thousands of micro-partitions. A query for one month must scan all of them. With an effective clustering key, average_overlap_depth drops toward 1.0, meaning the value ranges of neighbouring micro-partitions barely overlap, so a selective predicate maps to only a handful of partitions.

The key metric is from SYSTEM$CLUSTERING_INFORMATION(): average_overlap_depth. A value near 1.0 = excellent clustering; a value near the total partition count = no clustering benefit. Automatic Clustering (Enterprise+) runs a background service that continuously re-clusters data as new rows arrive.

-- Define clustering key on a multi-billion-row fact table
ALTER TABLE sales CLUSTER BY (order_date, region);

-- Inspect clustering quality
SELECT
  PARSE_JSON(SYSTEM$CLUSTERING_INFORMATION('sales','(order_date, region)'))
    :average_overlap_depth::FLOAT AS overlap_depth;

-- This query now prunes away non-January partitions
SELECT product_id, SUM(revenue)
FROM sales
WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31'
  AND region = 'APAC'
GROUP BY product_id;
What metadata does the query optimizer use to skip micro-partitions not relevant to a WHERE clause predicate?
An average_overlap_depth of 1.0 from SYSTEM$CLUSTERING_INFORMATION indicates what about the table?
23. What are Snowflake Materialized Views and when should you use them over regular views?

A Materialized View (MV) in Snowflake is a pre-computed, physically stored snapshot of a SELECT query's result. Unlike a regular view that re-executes its SQL every time it is queried, an MV stores the result as a table that Snowflake automatically keeps fresh in the background as the base table changes. Reads from an MV are as fast as reads from a regular table.

Snowflake refreshes MVs incrementally — only the rows affected by changes to the base table are reprocessed, not the entire result set. This background refresh uses Snowflake-managed serverless compute and is billed separately from warehouse usage. The cost model is: storage for the materialized result + serverless compute credits for incremental maintenance.

Use an MV when: a complex aggregation is queried frequently, the base table changes predictably (not constantly), and query latency matters more than the added storage and maintenance cost. Skip MVs when: the base table is updated at very high frequency (refresh cost exceeds the query savings), the query joins multiple tables (MVs only support a single base table in Snowflake), or the query uses non-deterministic functions. Requires Enterprise edition or higher.

-- Create a Materialized View of daily sales aggregates
CREATE MATERIALIZED VIEW daily_sales_mv AS
  SELECT
    DATE_TRUNC('day', order_ts) AS sale_date,
    region,
    SUM(amount)                AS daily_revenue,
    COUNT(*)                   AS order_count
  FROM orders
  GROUP BY 1, 2;

-- Query the MV — reads pre-computed data, not the full orders table
SELECT * FROM daily_sales_mv
WHERE sale_date >= '2024-01-01';
Snowflake Materialized Views are available starting from which edition?
What is the primary limitation of Snowflake Materialized Views that Dynamic Tables do not share?
24. What is the difference between a Snowflake View, Materialized View, and Dynamic Table?

These three object types all provide an abstraction layer over raw tables, but they differ fundamentally in whether they store data physically, how they refresh, and what query patterns they support.

View vs Materialized View vs Dynamic Table
Dimension | Regular View | Materialized View | Dynamic Table
Physical data stored | No — virtual | Yes | Yes
Auto-refreshed | N/A — always current | Yes, incrementally by Snowflake | Yes, based on TARGET_LAG
Refresh trigger | On query (inline rewrite) | Base table DML (incremental) | Snowflake scheduler (lag-based)
Multi-table joins | Yes | No — single base table only | Yes
Pipeline chaining | No | No | Yes — chain DTs together
Extra storage cost | None | Yes | Yes
Background compute cost | None | Serverless (incremental) | Warehouse or serverless
Best for | Simple aliases, access control | Repeated aggregate on one table | Multi-step ELT pipelines

Choosing between them: use a View when you only need a query abstraction or access restriction with no performance concern. Use a Materialized View when you need fast repeated reads of a single-table aggregate with predictable change frequency. Use a Dynamic Table when you need declarative, refreshed pipeline results across multiple joined tables or multi-step transformation chains.

Which of the three object types — View, Materialized View, or Dynamic Table — stores no physical data and executes SQL inline at query time?
You need a transformation that joins five tables and must stay fresh within 5 minutes. Which object type is best suited?
25. How do you optimize query performance in Snowflake (warehouse sizing, clustering, pruning, result cache)?

Snowflake query optimization covers four levers, each addressing a different root cause. Using the wrong lever wastes money without helping performance.

Warehouse sizing (scale up) helps when a single query is slow due to complexity — large sorts, multi-way joins, or spilling to disk. Check Query Profile for Bytes Spilled to Local/Remote Storage. Doubling the warehouse size doubles available memory and parallelism.

Multi-cluster scale-out helps when many users are queuing, not when individual queries are slow. Check Queue Time in Query History. Scaling out adds parallel clusters, each serving a different concurrent query.

Clustering keys help when a large table (1 TB+) is scanned with selective WHERE predicates and partition scanned / partition total is high in Query Profile. Defining a clustering key on the filter column co-locates relevant micro-partitions, reducing scan volume dramatically.

Result cache: keep dashboard SQL deterministic and textually identical across runs. Standardize query text across users (no variable timestamps, no per-user LIMIT injections). Each reuse costs zero credits.
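The first two levers correspond to two different ALTER WAREHOUSE settings; a sketch (warehouse name and sizes are illustrative):

```sql
-- Scale UP: bigger nodes for one slow, spilling query
ALTER WAREHOUSE transform_wh SET WAREHOUSE_SIZE = 'LARGE';

-- Scale OUT: more clusters for queued concurrent users (Enterprise+)
ALTER WAREHOUSE transform_wh SET
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4
  SCALING_POLICY    = 'STANDARD';
```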

Additional optimizations:

  • Select only needed columns — never SELECT * on wide tables.
  • Push selective filters early in CTEs; avoid late filtering after expensive joins.
  • Use UNION ALL instead of UNION when duplicates are already excluded.
  • Cast VARIANT paths to typed columns to enable statistics-based pruning.
  • Use SHOW TABLES / INFORMATION_SCHEMA metadata queries instead of scanning data for schema information.
A query runs slowly because it sorts 8 TB of intermediate data and the Query Profile shows heavy spilling. The correct first action is what?
Adding CURRENT_TIMESTAMP() to an otherwise-identical repeated query causes which specific performance regression?
26. What are Snowflake Streams and how do they implement Change Data Capture (CDC)?

A Snowflake Stream is a named object that records row-level DML changes (INSERT, UPDATE, DELETE) made to a source table since the stream was last consumed. It works like a bookmark: each time you read the stream inside a DML transaction, the bookmark advances, and those changes are removed from the stream. Streams implement Change Data Capture without any external tooling or log-tailing.

Every row returned by a stream includes three metadata columns:

  • METADATA$ACTION: either 'INSERT' or 'DELETE'. An UPDATE to a row appears as a DELETE of the old version plus an INSERT of the new version — a before/after pair.
  • METADATA$ISUPDATE: TRUE on both rows of an UPDATE pair (the DELETE of the old version and the INSERT of the new one), distinguishing them from net-new inserts and standalone deletes.
  • METADATA$ROW_ID: a unique, stable identifier for the underlying row, allowing matching of DELETE/INSERT UPDATE pairs.

Stream types: Standard captures all DML (INSERT + UPDATE + DELETE). Append-only captures only INSERTs, ignoring updates and deletes — more efficient for append-heavy event tables. Insert-only is for External Tables.

-- Create a standard stream on the orders table
CREATE STREAM orders_stream ON TABLE orders;

-- Inspect changes since last consumption
SELECT * FROM orders_stream;

-- Consume the stream: copy inserts into a staging table
INSERT INTO orders_staging
SELECT order_id, customer_id, amount, order_ts
FROM orders_stream
WHERE METADATA$ACTION = 'INSERT'
  AND METADATA$ISUPDATE = FALSE;

-- After the INSERT above commits, the stream advances its offset
SELECT SYSTEM$STREAM_HAS_DATA('orders_stream');
Which metadata column in a Snowflake Stream indicates whether a changed row is the new-value half of an UPDATE operation?
Which Snowflake Stream type should you use when your pipeline only needs to react to new INSERT operations and not to UPDATEs or DELETEs?
27. What are Snowflake Tasks and how do you schedule SQL transformations with them?

A Task is a named Snowflake object that executes a single SQL statement (or a stored procedure call) on a defined schedule. Tasks are the native scheduling mechanism for data transformations inside Snowflake — no external orchestration tool is required for pipelines that can be expressed as SQL.

Schedule options: SCHEDULE = '5 MINUTE' runs the task every 5 minutes. SCHEDULE = 'USING CRON 0 6 * * MON-FRI America/New_York' uses standard cron syntax with a timezone. A task runs either on a user-specified warehouse or as a serverless task (Snowflake-managed compute, billed per second of use).

Task DAGs: a root task holds the schedule; child tasks use AFTER <parent_task> to declare dependencies. When the root fires, Snowflake runs the tree in dependency order, fan-out and fan-in patterns included. Tasks are created in SUSPENDED state and must be activated with ALTER TASK ... RESUME.

Combining Streams and Tasks is the idiomatic CDC pipeline: the task body checks SYSTEM$STREAM_HAS_DATA('stream_name') and only proceeds when new changes exist, avoiding empty executions.
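The stream-gated pattern above can also be expressed with the task's WHEN clause, so the scheduler skips the run entirely (at no warehouse cost) when the stream is empty; a sketch reusing the orders_stream from question 26:

```sql
-- Body executes only when orders_stream actually contains changes
CREATE TASK consume_orders_stream
  WAREHOUSE = transform_wh
  SCHEDULE  = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('ORDERS_STREAM')
AS
  INSERT INTO orders_staging
  SELECT order_id, customer_id, amount, order_ts
  FROM orders_stream
  WHERE METADATA$ACTION = 'INSERT';
```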

-- Root task: runs every 10 minutes on the transform warehouse
CREATE TASK load_staging
  WAREHOUSE = transform_wh
  SCHEDULE  = '10 MINUTE'
AS
  CALL load_staging_proc();

-- Child task: runs after load_staging completes
CREATE TASK build_aggregates
  WAREHOUSE = transform_wh
  AFTER load_staging
AS
  INSERT INTO daily_summary SELECT ...;

-- Both tasks must be resumed before they will fire
ALTER TASK build_aggregates RESUME;
ALTER TASK load_staging     RESUME;
After creating a Snowflake Task, what must you do before it will start executing on schedule?
In a Snowflake Task DAG, which task holds the SCHEDULE parameter that drives the entire pipeline?
28. How does Snowflake implement Role-Based Access Control (RBAC) and what are the system-defined roles?

In Snowflake's RBAC model, privileges are granted to roles, and roles are granted to users or other roles (creating a role hierarchy). A user acquires the union of all privileges from every role in their active role tree. At any moment, a session runs under one active role; the user can switch with USE ROLE.

Five system-defined roles exist in every account and cannot be dropped:

  • ACCOUNTADMIN — highest privilege; manages billing, account parameters, all objects. Should be used sparingly and with MFA.
  • SYSADMIN — manages all databases, schemas, warehouses, and objects. The recommended role for day-to-day data engineering work.
  • SECURITYADMIN — creates and manages users and roles; grants and revokes privileges. Cannot read data.
  • USERADMIN — creates users and roles (a subset of SECURITYADMIN's scope).
  • PUBLIC — automatically granted to every user; assign here only objects you want universally accessible.

Best practice: create custom roles and grant them up to SYSADMIN so ACCOUNTADMIN retains full visibility through inheritance. Never do daily work as ACCOUNTADMIN.

-- Grant SELECT on a table to a custom role
GRANT SELECT ON TABLE sales.orders TO ROLE analyst_role;

-- Grant the warehouse to the role
GRANT USAGE ON WAREHOUSE reporting_wh TO ROLE analyst_role;

-- Assign the custom role to a user
GRANT ROLE analyst_role TO USER alice;

-- Grant the custom role UP to SYSADMIN for visibility
GRANT ROLE analyst_role TO ROLE SYSADMIN;

-- Inspect what privileges a role holds
SHOW GRANTS TO ROLE analyst_role;
Which system-defined Snowflake role is the recommended choice for day-to-day warehouse and database management?
When you create a custom role in Snowflake, to which system role should it be granted so ACCOUNTADMIN retains visibility through role inheritance?
29. What is column-level security in Snowflake (Dynamic Data Masking and Column-level Security policies)?

Column-level security in Snowflake is implemented through Dynamic Data Masking (DDM) policies. A masking policy defines a SQL expression — typically a CASE statement on CURRENT_ROLE() or IS_ROLE_IN_SESSION() — that returns either the real column value or a masked substitute depending on who is querying. The masking is applied at query runtime; the original data is always stored unmasked in storage.

Key properties: the same policy can be applied to many columns; the policy can be changed or replaced without touching the table DDL; a privileged role always sees the real value; less-privileged roles see a masked version (e.g., ****, NULL, a truncated value, or a hash). Policy application is transparent to the user — their SQL is unchanged, they simply receive different output.

A Conditional Masking Policy can additionally inspect other column values in the same row when deciding how to mask (e.g., only mask the SSN column if the is_pii flag is true). This makes the policies context-sensitive.
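A conditional policy as described above takes the other columns as extra arguments; a sketch (table, column, and role names are illustrative):

```sql
-- Mask SSN only when the row's is_pii flag is true and the viewer lacks the role
CREATE MASKING POLICY ssn_conditional AS
  (val STRING, is_pii BOOLEAN) RETURNS STRING ->
  CASE
    WHEN NOT is_pii THEN val
    WHEN IS_ROLE_IN_SESSION('PII_VIEWER') THEN val
    ELSE '***-**-****'
  END;

-- The first USING column is the one masked; the rest feed the condition
ALTER TABLE customers MODIFY COLUMN ssn
  SET MASKING POLICY ssn_conditional USING (ssn, is_pii);
```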

-- Create a masking policy that hides emails from non-privileged roles
CREATE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE
    WHEN IS_ROLE_IN_SESSION('PII_VIEWER') THEN val
    ELSE REGEXP_REPLACE(val, '.+@', '****@')
  END;

-- Apply the policy to the email column
ALTER TABLE customers
  MODIFY COLUMN email
  SET MASKING POLICY email_mask;

-- Analyst (no PII_VIEWER role) sees: ****@example.com
-- PII_VIEWER role sees: alice@example.com
At what point does Snowflake's Dynamic Data Masking apply the masking expression to column data?
Which SQL command attaches an existing masking policy to a specific table column?
30. What is Row Access Policy in Snowflake and how does it implement row-level security?

A Row Access Policy (RAP) is a Snowflake security object that transparently injects an additional filter predicate into every query that touches the protected table. From the querying user's perspective, their SQL is unchanged — they simply receive fewer rows. The policy's logic determines which rows are visible based on the current user's identity or role.

The policy body is a SQL Boolean expression referencing CURRENT_USER(), CURRENT_ROLE(), or IS_ROLE_IN_SESSION(). A common pattern uses a mapping table: the policy joins the mapping table to determine which regions or tenants the current role may see, then returns only those rows.

A table or view can carry only one row access policy at a time, though a single policy can be attached to many tables. Policies can also be applied to views. When a table has both a row access policy and masking policies, the row access policy is evaluated first.

-- Mapping table: role -> allowed region
CREATE TABLE region_access_map (
  role_name VARCHAR,
  region    VARCHAR
);
INSERT INTO region_access_map VALUES
  ('EMEA_ANALYST', 'EMEA'),
  ('APAC_ANALYST', 'APAC'),
  ('GLOBAL_ANALYST', 'GLOBAL');

-- Row Access Policy: show all rows to GLOBAL_ANALYST,
-- otherwise filter by the mapped region
CREATE ROW ACCESS POLICY region_rap AS (row_region VARCHAR) RETURNS BOOLEAN ->
  EXISTS (
    SELECT 1 FROM region_access_map
    WHERE role_name = CURRENT_ROLE()
      AND (region = 'GLOBAL' OR region = row_region)
  );

-- Attach the policy to the orders table on the region column
ALTER TABLE orders
  ADD ROW ACCESS POLICY region_rap ON (region);
How does a Snowflake Row Access Policy restrict which rows a user sees from the user's SQL perspective?
Which Snowflake context function is commonly used inside a Row Access Policy to identify the current session's active role?
31. How does Snowflake encrypt data at rest and in transit?

Snowflake enforces encryption universally — it cannot be disabled. Every byte of customer data is always encrypted, both when stored in cloud object storage and when moving across networks.

Encryption at rest uses AES-256-GCM in a four-tier hierarchical key model:

  • File keys encrypt individual micro-partition files.
  • Table keys encrypt the file keys for a given table.
  • Account master keys encrypt the table keys for an account.
  • Root keys (Snowflake-held) encrypt the account master keys.

Snowflake rotates active encryption keys automatically once they are more than 30 days old; retired keys are kept only for decrypting existing data. Enterprise+ customers can additionally enable annual rekeying, which re-encrypts data protected by keys older than one year, via ALTER ACCOUNT SET PERIODIC_DATA_REKEYING = TRUE. The Tri-Secret Secure model (Business Critical+) adds a customer-managed key on top of this hierarchy.

Encryption in transit uses TLS 1.2 or higher for all connections: between client drivers (JDBC, ODBC, Python) and the Snowflake service endpoint, and between Snowflake's internal services and cloud object storage. Certificate pinning and OCSP validation are used to prevent man-in-the-middle attacks.

Customers in regulated industries should note that Snowflake's encryption is not customer-managed by default; for that, they need Business Critical edition and Tri-Secret Secure.

What encryption algorithm does Snowflake use to protect data at rest in cloud object storage?
How often does Snowflake automatically rotate encryption keys by default?
32. What is Snowflake's Tri-Secret Secure model and when is it used?

Tri-Secret Secure is Snowflake's highest tier of data protection. It ensures that data can only be decrypted when two independent keys are simultaneously present: Snowflake's own master key and a Customer-Managed Key (CMK) held in the customer's own cloud KMS (AWS KMS, Azure Key Vault, or Google Cloud KMS). Neither party can decrypt the data alone.

In the standard key model, Snowflake holds all key hierarchy tiers. In Tri-Secret Secure, the account master key is wrapped not just by Snowflake's root key but also by the customer's CMK. Snowflake must call the customer's KMS every time it needs to access encrypted data. If the customer revokes the CMK, all data in that account becomes immediately inaccessible — even to Snowflake engineers. This gives customers provable data sovereignty.

Use cases: financial institutions with regulatory mandates, government agencies, healthcare organizations requiring proof they can revoke Snowflake's access at any time. Requires Business Critical or Virtual Private Snowflake edition plus a Tri-Secret Secure subscription add-on.

Operational risk: if the customer loses or deletes the CMK without a backup, all data is permanently lost. The customer bears full responsibility for CMK lifecycle management.

-- Tri-Secret Secure has no self-service SQL switch. Conceptually:
--  1. The customer creates a key in their cloud KMS (e.g. AWS KMS or Azure Key Vault)
--  2. The customer grants Snowflake's service identity permission to use that key
--  3. Snowflake Support links the key into the account's composite master key
-- Verification and ongoing key management are handled together with Snowflake Support.
In Tri-Secret Secure, if a customer revokes their Customer-Managed Key, what happens to their Snowflake data?
Which minimum Snowflake edition is required to enable Tri-Secret Secure?
33. What are Snowflake Object Tags and Data Classification and how do they support governance?

Object Tags are key-value string metadata labels that can be attached to any Snowflake object — accounts, databases, schemas, tables, columns, warehouses, or other named resources. Tags give governance teams a structured way to label assets by sensitivity level, data domain, cost center, or regulatory category without changing the underlying schema.

Tag lineage is the automatic propagation behavior: a tag set on a schema propagates to all tables in that schema; a tag set on a table propagates to all its columns. Tags can be overridden at a lower level. This means you can mark an entire schema as sensitivity=high and every column inherits that classification automatically, saving thousands of individual tag assignments.

Data Classification uses Snowflake's SYSTEM$CLASSIFY() function (or the automated classification in Snowsight) to automatically inspect column names and sample values, then suggest PII categories: personal name, email address, phone number, SSN, date of birth, credit card number, IP address, and others. You can accept the suggestions and apply them as tags.

Governance views: SNOWFLAKE.ACCOUNT_USAGE.TAG_REFERENCES shows every tag assignment across the account. Combined with ACCESS_HISTORY, you can produce reports like 'who queried tagged PII columns in the last 30 days'.

-- Create a tag
CREATE TAG sensitivity ALLOWED_VALUES 'high', 'medium', 'low';

-- Apply a tag to a table (propagates to all columns)
ALTER TABLE customers SET TAG sensitivity = 'high';

-- Apply a different tag directly to one column
ALTER TABLE customers MODIFY COLUMN ssn SET TAG sensitivity = 'high';

-- Run automated classification on a table (SYSTEM$CLASSIFY is a stored procedure)
CALL SYSTEM$CLASSIFY('mydb.myschema.customers', {'auto_tag': true});
What is tag lineage in Snowflake's Object Tagging system?
Which Snowflake system function automatically detects PII categories by inspecting column names and sample data?
34. What is the Snowflake Access History feature and how does it support audit and compliance?

Access History records every query that accessed specific database objects — tracking exactly which tables and columns were read or written, by which user, running which role, at what time. It is the foundation for data lineage auditing, regulatory compliance (GDPR, CCPA, HIPAA), and detecting over-privileged accounts.

The primary view is SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY. Key columns include:

  • QUERY_ID and QUERY_START_TIME — the query being audited.
  • USER_NAME and ROLE_NAME — who ran the query.
  • DIRECT_OBJECTS_ACCESSED — tables/views the SQL explicitly named.
  • BASE_OBJECTS_ACCESSED — the underlying base tables behind any views, including column-level granularity.
  • OBJECTS_MODIFIED — tables written to via DML/COPY.

Access History has a latency of up to 3 hours before records appear in the ACCOUNT_USAGE views. For compliance-sensitive queries, this means near-real-time auditing is not possible through this view alone — pair it with query history and login history for a fuller picture.

Practical use: run a weekly query across ACCESS_HISTORY and TAG_REFERENCES to produce a report of which users accessed columns tagged as sensitivity=high, satisfying GDPR Article 30 record-keeping requirements.
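The weekly report described above amounts to joining the two views; a heavily hedged sketch, since both expose semi-structured columns that need flattening, and access_history stores fully qualified object names while tag_references does not (real code must normalize names before joining):

```sql
-- Sketch: users who read columns tagged sensitivity = 'high' in the last 30 days
SELECT ah.user_name,
       obj.value:"objectName"::STRING AS object_name,
       col.value:"columnName"::STRING AS column_name,
       ah.query_start_time
FROM snowflake.account_usage.access_history ah,
     LATERAL FLATTEN(input => ah.base_objects_accessed) obj,
     LATERAL FLATTEN(input => obj.value:"columns") col,
     snowflake.account_usage.tag_references tr
WHERE tr.tag_name    = 'SENSITIVITY'
  AND tr.tag_value   = 'high'
  AND tr.column_name = col.value:"columnName"::STRING
  AND ah.query_start_time > DATEADD('day', -30, CURRENT_TIMESTAMP());
```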

Which Snowflake view provides the Access History records showing exactly which tables and columns were accessed per query?
What is the approximate latency for new query records to become visible in SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY?
35. What is Snowflake Secure Data Sharing and how does it work without copying data?

Secure Data Sharing lets a Snowflake account (the data provider) grant another Snowflake account (the data consumer) live read access to specific objects — without ever copying, moving, or exporting the data. The consumer's Virtual Warehouse reads directly from the provider's cloud object storage using a metadata pointer. From the consumer's perspective, the shared data appears as a regular read-only database.

The Share object is the key primitive: the provider adds objects (tables, external tables, secure views, secure materialized views, secure UDFs) to a Share, then grants the Share to one or more consumer accounts. The Share is a metadata reference — it contains zero data bytes.

Two consumer types exist:

  • Full Snowflake account — an existing Snowflake customer; uses their own Virtual Warehouse to query the share. Zero extra cost to the provider for compute.
  • Reader Account — a free Snowflake account created and managed by the provider for consumers who do not have their own Snowflake subscription. The provider pays for the reader account's compute.

Limitation: cross-region or cross-cloud sharing requires the provider to first replicate data to a local account in the consumer's region before sharing. Shared data cannot be further re-shared by consumers by default.
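The provider-side flow above is only a handful of statements; a sketch with illustrative database, share, and account names (account identifier formats vary between organization-style and legacy locators):

```sql
-- Provider side: create the share and attach objects
CREATE SHARE sales_share;
GRANT USAGE  ON DATABASE sales_db               TO SHARE sales_share;
GRANT USAGE  ON SCHEMA   sales_db.public        TO SHARE sales_share;
GRANT SELECT ON TABLE    sales_db.public.orders TO SHARE sales_share;

-- Entitle a consumer account
ALTER SHARE sales_share ADD ACCOUNTS = consumer_org.consumer_account;

-- Consumer side: mount the share as a read-only database
CREATE DATABASE shared_sales FROM SHARE provider_org.provider_account.sales_share;
```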

When a consumer accesses a Snowflake Data Share, where is the actual data physically read from?
What type of consumer account allows sharing data with an organization that does not have its own Snowflake subscription?
36. What is the Snowflake Data Marketplace and what types of data products are available?

The Snowflake Data Marketplace is a hub within Snowflake where data providers publish and monetize data products, and data consumers discover and access them — all powered by Secure Data Sharing. When a consumer subscribes to a listing, they receive a live Share pointing to the provider's data, not a static file download. The data is always current, and no ETL pipeline is needed.

Three listing types are offered:

  • Free listings — immediately accessible to any Snowflake customer in the same region as the provider. Common for sample datasets, open government data, and marketing datasets.
  • Paid listings — require a commercial agreement brokered through Snowflake's billing system. The provider sets a subscription price and Snowflake handles invoicing.
  • Personalized listings — the consumer submits a request describing their data need; the provider manually reviews it, customizes the share, and approves. Used for proprietary or compliance-sensitive data.

Data categories available: financial markets (equities, FX, options), weather and climate, demographics and consumer behavior, ESG and sustainability data, geospatial and location data, healthcare and clinical trials, and firmographic business data. Major providers include Bloomberg, Refinitiv, AccuWeather, Foursquare, SafeGraph, and dozens of government agencies publishing open data.

How does a Snowflake Marketplace consumer receive access to a listed dataset?
Which Marketplace listing type requires the provider to manually review and approve the consumer's data request before granting access?
37. What are Snowflake Data Clean Rooms and what privacy problems do they solve?

Data Clean Rooms (DCRs) are privacy-preserving environments where two or more parties can run collaborative analytics on a union of their datasets without either party being able to see the other's raw records. Snowflake implements Clean Rooms on top of Secure Data Sharing and the Native App Framework.

The privacy problem they solve: an advertiser wants to measure how many of their customers saw a publisher's ad and subsequently made a purchase. This requires matching two customer lists. But the advertiser cannot share its customer list with the publisher and vice versa — both contain sensitive PII. A DCR lets them run the overlap-and-attribution query without either side ever getting a plain-text view of the other's underlying customer records.

Snowflake's Clean Room architecture enforces privacy through:

  • Pre-approved query templates — only specific, vetted SQL queries may be executed; no ad-hoc access.
  • Aggregation thresholds — results are suppressed if the matching cohort is smaller than a minimum size (e.g., <500 rows), preventing re-identification.
  • No raw data export — results are returned as aggregates; the underlying rows are never exposed.
  • Differential privacy options — statistical noise can be added to outputs to provide provable privacy guarantees.

Use cases span advertising (media measurement, audience activation), financial services (transaction reconciliation between banks), and healthcare (de-identified outcomes research across payer and provider datasets).
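The aggregation-threshold idea can be illustrated with a query of the kind a clean room might pre-approve. This is a conceptual sketch, not Snowflake's actual clean-room implementation — the table and column names are hypothetical:

```sql
-- Hypothetical vetted template: advertiser/publisher overlap measurement.
-- Matching is done on hashed identifiers, never raw PII, and only
-- aggregate counts are returned; cohorts under 500 matches are suppressed.
SELECT p.campaign_id,
       COUNT(DISTINCT a.customer_id) AS matched_customers
FROM advertiser.customers a
JOIN publisher.ad_exposures p
  ON a.email_hash = p.email_hash
GROUP BY p.campaign_id
HAVING COUNT(DISTINCT a.customer_id) >= 500;  -- aggregation threshold
```

Because consumers can run only templates like this (no ad-hoc SQL), neither party can write a query that isolates an individual record.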

What fundamental privacy problem do Snowflake Data Clean Rooms address?
Which two Snowflake platform capabilities form the technical foundation of Snowflake Data Clean Rooms?
38. What is Snowpark and how does it allow Python/Java/Scala code to run inside Snowflake?

Snowpark is Snowflake's developer framework that lets you write Python, Java, or Scala code that executes inside the Snowflake engine rather than pulling data to an external machine. The core abstraction is the Snowpark DataFrame, which works like Apache Spark or pandas DataFrames: you compose transformations lazily, and Snowflake compiles them into SQL and executes them on the Virtual Warehouse.

The primary benefit is that data never leaves Snowflake. Instead of pulling 100 GB to a Python server, applying transformations, and pushing results back, the transformation runs where the data lives — in Snowflake compute. This eliminates egress costs, reduces latency, and scales with warehouse size.

Deployment targets in Snowpark:

  • Snowpark Stored Procedures — Python/Java/Scala logic callable via CALL proc_name(), running on the caller's warehouse.
  • User-Defined Functions (UDFs) — scalar or tabular functions callable from any SQL query.
  • User-Defined Aggregate Functions (UDAFs) — custom aggregations in Python/Java.
  • Snowpark ML — train and deploy scikit-learn, XGBoost, LightGBM models entirely inside Snowflake.
  • Snowpark Container Services — run arbitrary Docker containers within Snowflake's network for full-custom workloads.
# Python Snowpark: run a transformation inside Snowflake
from snowflake.snowpark import Session

# connection_params: a dict of credentials (account, user, password, role, warehouse, ...)
session = Session.builder.configs(connection_params).create()

# Build a lazy DataFrame — nothing runs yet
df = session.table('raw_events')

df_clean = (
    df.filter(df['event_type'] == 'purchase')
      .select('user_id', 'amount', 'event_ts')
      .group_by('user_id')
      .agg({'amount': 'sum'})
)

# Action: push result into Snowflake -- runs as SQL on the warehouse
df_clean.write.save_as_table('user_purchase_summary', mode='overwrite')
What is the primary advantage of transforming data with Snowpark rather than pulling it to an external Python environment?
What does the Snowpark DataFrame API use to represent transformations before actually executing them against Snowflake?
39. What are Snowflake Native Apps and how does the Native App Framework work?

Snowflake Native Apps are full applications that a provider packages once and consumers install directly into their own Snowflake account. The Native App Framework is the platform layer that enables this — combining Snowpark code, data shares, Streamlit UI components, stored procedures, and UDFs into a versioned, deployable bundle called an Application Package.

The key architectural difference from traditional SaaS is where code and data meet: in conventional SaaS the customer's data is sent to the vendor's environment (requiring data egress and trust). In a Native App, the application is brought to the customer's data — the app code runs in the consumer's Snowflake account, using the consumer's compute, with no data leaving the consumer's environment.

Lifecycle: the provider develops the app locally, creates an Application Package, uploads it to the Marketplace or shares a private URL. Consumers find the app in their Snowflake Marketplace, click install, grant the requested privileges, and the app runs inside their account. The provider receives telemetry (if the consumer consents) but never sees the consumer's data.

  • Use cases: data enrichment APIs (address validation, geocoding), compliance audit tools that run inside a regulated environment, ML inference apps that score your own data, analytics dashboards with managed SQL logic.
Where does a Snowflake Native App's code execute after the consumer installs it?
What is the fundamental architectural difference between a Snowflake Native App and a traditional SaaS product?
40. What are Snowflake External Tables and when would you use them over internal tables?

An External Table is a read-only Snowflake object that maps a schema onto files stored in an external stage (S3, Azure Blob, GCS) without loading the data into Snowflake-managed storage. The files stay in your own cloud bucket; Snowflake queries them in place by scanning the raw files through the Virtual Warehouse.

External Tables support partitioning based on file path expressions (e.g., year/month/day directory structure) which allows partition pruning similar to Hive-style partitioning. The partition metadata is maintained in a Snowflake-managed catalog and must be kept current with ALTER EXTERNAL TABLE ... REFRESH or automated via cloud event notifications.

Key limitations compared to internal tables: read-only (no DML), no Time Travel, no Fail-safe, no automatic micro-partition optimization, and slower query performance since the data is not stored in Snowflake's compressed columnar format. Columns must also be defined explicitly as cast expressions over the VALUE VARIANT column (e.g., VALUE:field::TYPE).

When to use: querying a data lake without ingesting it (cost-sensitive archival data), sharing data already in S3/Azure Blob with Snowflake analysts, compliance scenarios where data must remain in a specific bucket, or bridging a data lake with Snowflake analytics without full migration.

-- Create an external table on an S3 stage.
-- The partition column is derived from the file path via METADATA$FILENAME
-- (assuming a layout like clickstream/2024-06-01/file.parquet),
-- so Snowflake can prune partitions without opening the files.
CREATE EXTERNAL TABLE ext_clickstream (
  date_part  DATE          AS TO_DATE(SPLIT_PART(METADATA$FILENAME, '/', 2)),
  event_ts   TIMESTAMP_NTZ AS (VALUE:event_ts::TIMESTAMP_NTZ),
  user_id    VARCHAR       AS (VALUE:user_id::VARCHAR),
  event_type VARCHAR       AS (VALUE:event_type::VARCHAR)
)
PARTITION BY (date_part)
LOCATION = @my_s3_stage/clickstream/
AUTO_REFRESH = TRUE
FILE_FORMAT = (TYPE = 'PARQUET');

-- Refresh metadata after new files arrive (if not using AUTO_REFRESH)
ALTER EXTERNAL TABLE ext_clickstream REFRESH;
What is the main performance disadvantage of querying Snowflake External Tables compared to internal tables?
Which SQL command updates an External Table's partition metadata to reflect newly arrived files in the external stage?
41. What is Snowflake's multi-cluster warehouse and how does it handle concurrency auto-scaling?

A multi-cluster warehouse is a Virtual Warehouse configured with a minimum and maximum cluster count. Instead of a single MPP cluster, Snowflake can dynamically spin up additional identical clusters to serve queued queries in parallel, eliminating the most common cause of poor concurrency performance: the query queue.

Two operating modes:

  • Auto-scale mode (default) — Snowflake starts a new cluster only when queries begin to queue on the existing clusters. Once the queue clears and Snowflake's periodic load checks show the extra cluster is no longer needed, it shuts down. This is credit-efficient for variable concurrency patterns.
  • Maximized mode — all clusters in the range are always running, regardless of load. Use this when you have consistently high, predictable concurrency and cannot afford any queue latency.

Two scaling policies control how aggressively new clusters start:

  • STANDARD (default) — starts a new cluster as soon as a single query is queued.
  • ECONOMY — starts a new cluster only when Snowflake estimates there is enough queued work to keep it busy for at least 6 minutes. Saves credits at the cost of higher queue latency for occasional spikes.

Multi-cluster warehouses require Enterprise edition or higher. Each running cluster consumes credits at the full rate for the configured warehouse size.

CREATE WAREHOUSE concurrent_wh
  WAREHOUSE_SIZE         = 'LARGE'
  MIN_CLUSTER_COUNT      = 1
  MAX_CLUSTER_COUNT      = 5
  SCALING_POLICY         = 'STANDARD'
  AUTO_SUSPEND           = 120
  AUTO_RESUME            = TRUE;
What is the key difference between Auto-scale and Maximized mode in a multi-cluster warehouse?
Multi-cluster warehouses are available from which minimum Snowflake edition?
42. What is Resource Monitor in Snowflake and how do you use it to control credit consumption?

A Resource Monitor is a named Snowflake object that watches credit consumption for one or more Virtual Warehouses (or the entire account) and automatically takes action when configurable thresholds are crossed. It is the primary guardrail against runaway warehouse spend — for example, preventing a poorly-written query or a forgotten warehouse from consuming an entire month's budget.

A Resource Monitor defines a credit quota (the total credits allowed within the reset period) and one or more threshold triggers (expressed as percentages of the quota). Each trigger has an action:

  • NOTIFY — sends an email alert to the monitor's notification list but takes no action against the warehouse.
  • SUSPEND — suspends the warehouse after all currently running queries complete.
  • SUSPEND_IMMEDIATE — immediately kills all running queries and suspends the warehouse without waiting.

Reset schedule: DAILY, WEEKLY, MONTHLY, or YEARLY — the quota resets on this cadence. Monitors can be set at account level (applies to all warehouses) or assigned to individual warehouses.

-- Create a Resource Monitor: 500 credits/month, suspend at 90%, alert at 75%
CREATE RESOURCE MONITOR monthly_budget
  CREDIT_QUOTA = 500
  FREQUENCY    = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
  TRIGGERS
    ON 75 PERCENT DO NOTIFY
    ON 90 PERCENT DO SUSPEND
    ON 100 PERCENT DO SUSPEND_IMMEDIATE;

-- Attach the monitor to a specific warehouse
ALTER WAREHOUSE reporting_wh SET RESOURCE_MONITOR = monthly_budget;
What does the SUSPEND_IMMEDIATE action on a Resource Monitor threshold do to the warehouse?
At which two levels can a Snowflake Resource Monitor be applied?
43. How does Snowflake support ELT patterns and how does it compare to ETL?

ELT (Extract, Load, Transform) loads raw data into Snowflake first, then transforms it in place using SQL or Snowpark. ETL (Extract, Transform, Load) transforms data in an external tool before loading it. Snowflake's MPP engine and separation of storage from compute make it well suited to ELT — the transformation step is exactly where Snowflake's horsepower should be applied, not a step to push outside the platform.

ELT advantages with Snowflake: no external transformation infrastructure to manage; transformations can access the full history of raw data at any time (Time Travel); warehouse can be sized specifically for heavy transform runs; SQL models are easy to version-control with dbt; Snowpark handles transformations that require Python or ML libraries.

Typical Snowflake ELT stack:

  • Ingestion: Fivetran, Airbyte, or Snowpipe pulls data into a raw schema.
  • Staging: dbt or Snowpark cleans and casts the raw data into typed staging tables.
  • Transformation: dbt models join, aggregate, and enrich staging data into semantic layer tables.
  • Serving: BI tools (Tableau, Looker, Power BI) query the semantic layer through a dedicated reporting warehouse.
-- ELT pattern: transform raw JSON into a typed table inside Snowflake
CREATE OR REPLACE TABLE orders_clean AS
SELECT
  raw:order_id::NUMBER        AS order_id,
  raw:customer_id::VARCHAR    AS customer_id,
  raw:amount::DECIMAL(10,2)   AS amount,
  raw:order_ts::TIMESTAMP_NTZ AS order_ts
FROM raw.orders_raw
WHERE raw:status::VARCHAR = 'completed';
Why does Snowflake's architecture particularly favor ELT over traditional ETL?
Which open-source transformation framework, built around SQL models, is most widely used for ELT pipelines in Snowflake?
44. What are common Snowflake anti-patterns and performance pitfalls to avoid?

Anti-patterns in Snowflake fall into four categories: security misuse, cost mismanagement, query performance degradation, and data loading inefficiency.

Common Snowflake Anti-Patterns
| Anti-Pattern | Root Cause | Fix |
| --- | --- | --- |
| Using ACCOUNTADMIN for daily work | Violates least-privilege; accidental privilege grants | Create custom roles; use SYSADMIN for object management |
| Never suspending idle warehouses | Compute bills tick even with zero queries | Set AUTO_SUSPEND = 60 seconds on all non-critical warehouses |
| High partition scan ratio on large tables | No clustering; data not co-located with query predicates | Define a clustering key on frequently filtered columns |
| Loading thousands of tiny files with COPY INTO | Small-file overhead: metadata cost > data cost | Merge files to 100–250 MB before loading; use Snowpipe for micro-batches |
| Breaking the result cache | Non-deterministic functions (CURRENT_TIMESTAMP, RANDOM) in repeated queries | Standardize query SQL; move timestamp filters to parameters |
| Storing all text as VARCHAR | No column statistics for numerics or dates; worse compression | Use typed columns (NUMBER, DATE, TIMESTAMP_NTZ) |
| UNION instead of UNION ALL | Implicit deduplication scan on large result sets | Use UNION ALL when input rows are already distinct |
| SELECT * on wide tables | Reads all columns; wastes I/O and memory on irrelevant data | Always specify required columns in SELECT list |

The single highest-impact change for most Snowflake accounts: set AUTO_SUSPEND = 60 on every warehouse that is not required to be always-on, and audit warehouses weekly in SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY for unexpected runtime.
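The weekly audit mentioned above can be run with a query along these lines (ACCOUNT_USAGE views lag real time by up to a few hours, so treat the numbers as near-current rather than live):

```sql
-- Weekly audit: credits consumed per warehouse over the last 7 days
SELECT warehouse_name,
       SUM(credits_used) AS credits_7d
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits_7d DESC;
```

Any warehouse with unexpectedly high credits here is a candidate for a shorter AUTO_SUSPEND, a smaller size, or a Resource Monitor.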

How does adding LIMIT 10 to an otherwise-identical repeated query affect Snowflake's result cache?
What is the 'small file problem' in Snowflake COPY INTO and which ingestion mechanism mitigates it?
45. How does Snowflake compare to BigQuery and Redshift in architecture and pricing model?

These three platforms dominate cloud data warehousing and each represents a distinct architectural philosophy.

Snowflake uses multi-cluster shared-data architecture: explicit Virtual Warehouses (MPP clusters) read from cloud object storage. Available on AWS, Azure, and GCP. Compute billed per credit-second while running; storage billed per TB/month compressed. Customers manage warehouse sizing and concurrency configuration.

BigQuery (Google Cloud only) is fully serverless — there are no clusters or warehouses to provision. The Dremel engine allocates compute slots automatically per query. Default pricing: on-demand (per TB of data scanned), or capacity pricing (flat-rate slot reservations). BigQuery abstracts all infrastructure; you just run SQL.

Amazon Redshift uses shared-nothing node clusters. RA3 nodes (modern Redshift) separate storage to S3 and compute to nodes, partially approaching Snowflake's model. Billed per node-hour (provisioned) or per RPU-second (Redshift Serverless). Deeper AWS service integration. Requires cluster sizing decisions (node type, count) and maintenance (vacuuming, sort/dist key tuning).

Snowflake vs BigQuery vs Redshift
| Dimension | Snowflake | BigQuery | Redshift |
| --- | --- | --- | --- |
| Architecture | Multi-cluster shared-data | Serverless (Dremel) | Shared-nothing cluster (RA3) |
| Cloud platforms | AWS, Azure, GCP | GCP only | AWS only |
| Compute model | Virtual Warehouses (user-managed) | Fully serverless, auto-slot allocation | Provisioned nodes or Serverless RPU |
| Compute billing | Credits/second while warehouse runs | Per TB scanned (on-demand) or slots | Node-hours or RPU-seconds |
| Storage billing | Per TB/month compressed | Per TB/month in GCS | Per TB/month in S3 (RA3) |
| Idle compute cost | Zero when warehouse suspended | Zero (serverless) | Full node-hour cost if always-on |
| Cross-cloud support | Yes (replicate across clouds) | No | No |
| Semi-structured data | Native VARIANT type | Native JSON/ARRAY types | SUPER type (limited) |
Which cloud data warehouse platform charges users based on bytes of data scanned per query in its default on-demand pricing mode?
Which platform requires you to choose a node type and cluster node count rather than a warehouse T-shirt size?