
BigData / Apache Airflow Interview Questions

1. What is Apache Airflow?
2. What is a DAG in Apache Airflow?
3. What are Operators in Apache Airflow?
4. What is the Airflow Scheduler?
5. What are the main components of Apache Airflow?
6. What is an Executor in Airflow and what types are available?
7. What is an Airflow Connection?
8. What is an Airflow Variable?
9. What is XCom in Airflow and how is it used?
10. What are Hooks in Apache Airflow?
11. What is the difference between a DAG Run and a Task Instance in Airflow?
12. What are Sensors in Apache Airflow?
13. What is catchup in Airflow and how does it work?
14. What is backfilling in Apache Airflow?
15. What is the TaskFlow API in Airflow?
16. What is the difference between schedule_interval and timetable in Airflow?
17. What is a SubDAG and why is it generally discouraged?
18. What is a TaskGroup in Airflow?
19. What is branching in Airflow and how is BranchPythonOperator used?
20. What are trigger rules in Airflow?
21. What is the Airflow metadata database and what does it store?
22. How does the CeleryExecutor work in Airflow?
23. What is the KubernetesExecutor and what are its benefits?
24. What are Pools in Apache Airflow?
25. What are Airflow Providers?
26. What is dynamic task mapping in Airflow?
27. What is the difference between depends_on_past and wait_for_downstream in Airflow?
28. What is the Airflow Web UI and what can you do with it?
29. What are Airflow task states and what do they mean?
30. What are retries and retry_delay in Airflow tasks?
31. What is a Deferrable Operator (async operator) in Airflow?
32. What are Airflow Plugins?
33. How does Airflow handle templating and macros?
34. What is idempotency in the context of Airflow tasks?
35. What are best practices for writing efficient Airflow DAGs?
36. What is the ExternalTaskSensor in Airflow?
37. What is the KubernetesPodOperator in Airflow?
38. What are SLAs in Apache Airflow and how are they configured?
39. How does Airflow handle task concurrency and parallelism?
40. What is an Airflow Dataset and how does data-driven scheduling work?
41. What is the difference between Airflow and Apache Spark?
42. How do you deploy Apache Airflow using Docker Compose?
43. What is Airflow on Kubernetes (KEDA) autoscaling?
44. What is the SparkSubmitOperator in Airflow?
45. What is Managed Airflow (MWAA) on AWS?
46. How does Airflow handle secrets management?
47. What is the difference between PythonOperator and PythonVirtualenvOperator?
48. What is the Grid view in Airflow 2.x?
49. What are common Airflow anti-patterns to avoid?
50. What is Airflow 2 vs Airflow 1 — key differences?

1. What is Apache Airflow?

Apache Airflow is an open-source platform for programmatically authoring, scheduling, and monitoring workflows. Originally created at Airbnb in 2014 and donated to the Apache Software Foundation, Airflow lets you define workflows as Directed Acyclic Graphs (DAGs) written in Python. The scheduler executes tasks on an array of workers while following the specified dependencies. The rich UI makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues.

Key capabilities include dynamic pipeline generation, extensible operators, and a web-based UI for visibility into pipeline state.

What language are Apache Airflow workflows defined in?
Which organization originally created Apache Airflow?
2. What is a DAG in Apache Airflow?

A DAG (Directed Acyclic Graph) is the core concept in Airflow. It is a collection of tasks organized with dependencies and relationships that define how they should run. The "directed" part means each edge has a direction (from one task to another). "Acyclic" means there are no loops: you cannot create a cycle where task A depends on task B which depends on task A.

A simple DAG example:

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

with DAG('my_dag', start_date=datetime(2024, 1, 1), schedule='@daily') as dag:
    t1 = PythonOperator(task_id='task_1', python_callable=lambda: print('Hello'))
    t2 = PythonOperator(task_id='task_2', python_callable=lambda: print('World'))
    t1 >> t2  # t2 runs after t1

What does the 'Acyclic' property in a DAG guarantee?
3. What are Operators in Apache Airflow?

Operators are the building blocks of a DAG. Each operator represents a single, idempotent unit of work — a task. Airflow provides a wide variety of built-in operators:

  • PythonOperator — executes a Python callable.
  • BashOperator — runs a Bash command.
  • EmailOperator — sends an email notification.
  • HttpOperator — makes an HTTP request.
  • SqlOperator / PostgresOperator — executes SQL against a database.

Operators are instantiated as Task objects inside a DAG context. They define what to do; the Airflow scheduler decides when to do it.

Which Airflow operator would you use to run a shell script?
4. What is the Airflow Scheduler?

The Airflow Scheduler is a core component that monitors DAGs and triggers task instances when their dependencies are met and the schedule interval fires. It continuously parses DAG files, checks the DAG schedule and any data interval conditions, and submits tasks to the executor for running.

Key points about the scheduler:

  • Reads DAG files from the configured dags_folder.
  • Creates DagRuns and TaskInstances in the metadata database.
  • Does not execute tasks itself — it delegates that to the Executor.
  • Runs continuously as a background process.
Does the Airflow Scheduler directly execute tasks?
5. What are the main components of Apache Airflow?

Apache Airflow consists of five main components:

  • Webserver — provides the UI to monitor, trigger, and debug DAGs.
  • Scheduler — triggers task instances per schedule and dependency rules.
  • Executor — determines how and where tasks run (locally, on Celery workers, on Kubernetes, etc.).
  • Metadata DB — stores the state of DAGs, runs, task instances, connections, and variables.
  • Workers — processes that actually execute tasks (used with CeleryExecutor/KubernetesExecutor).
Which Airflow component stores DAG run state?
6. What is an Executor in Airflow and what types are available?

The Executor defines how tasks are run. Airflow supports several executor types suited for different scale and infrastructure needs:

  • SequentialExecutor — runs one task at a time; only for development/testing.
  • LocalExecutor — runs tasks in parallel on the same machine using subprocesses.
  • CeleryExecutor — distributes tasks across a pool of Celery workers; suitable for production scale-out.
  • KubernetesExecutor — spins up a new Kubernetes pod per task; great for isolation and dynamic resource allocation.
  • CeleryKubernetesExecutor — hybrid of Celery and Kubernetes executors.
Which executor is recommended only for development and testing?
7. What is an Airflow Connection?

A Connection stores credentials and endpoint information for external systems such as databases, cloud providers, and APIs. Rather than hardcoding credentials in DAG code, you store them in Airflow's metadata database (or a secrets backend) and reference them by a conn_id.

Example — reading a Postgres connection:

from airflow.providers.postgres.hooks.postgres import PostgresHook

hook = PostgresHook(postgres_conn_id='my_postgres')
records = hook.get_records('SELECT * FROM users LIMIT 10')

Connections can be managed via the UI, CLI, or environment variables prefixed with AIRFLOW_CONN_.

How can Airflow Connections be stored securely without using the metadata DB?

8. What is an Airflow Variable?

Variables are key-value pairs stored in Airflow's metadata database. They provide a way to pass configuration or runtime values into DAGs without hardcoding them. Variables can be set via the UI, CLI, environment variables, or the Python API.

from airflow.models import Variable

# Read a variable (with default fallback)
env = Variable.get('environment', default_var='production')

# Deserialize a JSON-encoded variable
config = Variable.get('pipeline_config', deserialize_json=True)

For sensitive values, prefer Connections or a secrets backend over plain Variables, since Variable values are stored unencrypted by default.

Are Airflow Variables encrypted by default?
9. What is XCom in Airflow and how is it used?

XCom (cross-communication) is a mechanism that lets tasks exchange small amounts of data. A task can push a value into XCom and a downstream task can pull it.

def push_func(**context):
    context['ti'].xcom_push(key='result', value=42)

def pull_func(**context):
    val = context['ti'].xcom_pull(task_ids='push_task', key='result')
    print(f'Received: {val}')

XCom values are stored in the metadata database, so they should be used for small payloads (strings, numbers, short dicts). Passing large dataframes through XCom is an anti-pattern — use a shared storage layer such as S3 or GCS instead.

What is the recommended maximum size of data to pass via XCom?
10. What are Hooks in Apache Airflow?

Hooks are interfaces to external platforms and databases. They abstract connection handling, authentication, and API calls so operators and tasks don't need to manage those details directly. Hooks use Connections (stored in the metadata DB) to retrieve credentials.

Example using the S3Hook:

from airflow.providers.amazon.aws.hooks.s3 import S3Hook

hook = S3Hook(aws_conn_id='my_aws')
hook.load_file(filename='/tmp/data.csv', key='uploads/data.csv', bucket_name='my-bucket')

Common built-in hooks include PostgresHook, MySqlHook, HttpHook, S3Hook, BigQueryHook, and many provider-contributed hooks.

What is the primary purpose of a Hook in Airflow?
11. What is the difference between a DAG Run and a Task Instance in Airflow?

A DAG Run is an instantiation of the whole DAG for a specific data interval (logical date). It tracks the overall execution state (running, success, failed) for that interval.

A Task Instance is a specific execution of one task within a DAG Run. Each task instance belongs to exactly one DAG Run and has its own state (queued, running, success, failed, skipped, etc.).

Relationship: one DAG Run contains many Task Instances — one per task defined in the DAG.

What is the relationship between a DAG Run and Task Instances?
12. What are Sensors in Apache Airflow?

Sensors are a special type of operator that wait for a certain condition to become true before proceeding. They poke the condition at a configurable interval (poke_interval) and either block (poke mode) or release the worker slot between polls (reschedule mode).

Common sensors:

  • FileSensor — waits until a file appears on the filesystem.
  • S3KeySensor — waits for an object key to exist in S3.
  • HttpSensor — polls an HTTP endpoint until a condition is met.
  • ExternalTaskSensor — waits for a task in another DAG to succeed.

Using mode='reschedule' is best practice for long waits to avoid holding worker slots idle.
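The poke-mode behavior can be sketched in plain Python (the `wait_for` helper below is hypothetical, not Airflow's internal API): the worker blocks, re-evaluating the condition every `poke_interval` seconds until it passes or the timeout expires.

```python
import time

def wait_for(condition, poke_interval=5.0, timeout=60.0):
    """Minimal sketch of a poke-mode sensor loop: hold the worker slot,
    re-check the condition every poke_interval seconds, fail on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True  # condition met; downstream tasks may proceed
        time.sleep(poke_interval)
    raise TimeoutError('sensor timed out')

# Example: a condition that becomes true on the third check
checks = iter([False, False, True])
assert wait_for(lambda: next(checks), poke_interval=0.01, timeout=1) is True
```

Reschedule mode differs in that, between checks, the task instance is put back into the `up_for_reschedule` state and the slot is freed instead of sleeping inside the loop.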

Which sensor mode releases the worker slot while waiting?
13. What is catchup in Airflow and how does it work?

Catchup is a DAG parameter that controls whether Airflow should backfill all DAG Runs from the start_date up to the current date when a DAG is first enabled or its schedule is changed. When catchup=True (default), Airflow will create a DAG Run for every missed schedule interval. When catchup=False, only the most recent interval is scheduled.

with DAG(
    'my_dag',
    start_date=datetime(2024, 1, 1),
    schedule='@daily',
    catchup=False   # do not backfill past runs
) as dag:
    ...

For most production pipelines, setting catchup=False is safer to avoid unexpected mass backfill on re-activation.

What happens when catchup=True and a DAG has been paused for 30 days?
14. What is backfilling in Apache Airflow?

Backfilling is the process of running a DAG for historical date ranges that were not previously executed. You can trigger a backfill from the CLI:

airflow dags backfill --start-date 2024-01-01 --end-date 2024-03-31 my_dag

This creates DAG Runs for every schedule interval between the start and end dates. Backfilling is useful when you add a new DAG with start_date in the past and want to process historical data, or when a pipeline was down and you need to reprocess missed intervals.

How do you trigger a backfill for a DAG via the Airflow CLI?
15. What is the TaskFlow API in Airflow?

Introduced in Airflow 2.0, the TaskFlow API is a decorator-based approach to writing DAGs and tasks that reduces boilerplate. Instead of instantiating operator objects explicitly, you decorate plain Python functions with @task and define the DAG with @dag.

from airflow.decorators import dag, task
from datetime import datetime

@dag(schedule='@daily', start_date=datetime(2024, 1, 1), catchup=False)
def my_pipeline():
    @task
    def extract():
        return {'rows': 100}

    @task
    def transform(data: dict):
        return data['rows'] * 2

    transform(extract())

my_pipeline()

XCom passing is handled automatically when the return value of one @task is passed as an argument to another.

Which Airflow version introduced the TaskFlow API?
16. What is the difference between schedule_interval and timetable in Airflow?

schedule_interval (deprecated from Airflow 2.4) accepted cron strings or timedelta objects to define how frequently a DAG should run. schedule (the replacement) accepts the same values but also supports Timetables.

A Timetable is a plugin interface that gives full control over how data intervals and run times are calculated. This supports use cases like business-day schedules or irregular intervals that can't be expressed with cron alone.

from airflow.timetables.interval import CronDataIntervalTimetable

with DAG(
    'example',
    schedule=CronDataIntervalTimetable('0 9 * * MON-FRI', timezone='UTC'),
    ...
) as dag:
    ...
What does a Timetable enable that a cron expression alone cannot?
17. What is a SubDAG and why is it generally discouraged?

A SubDAG is a pattern where one DAG embeds another DAG as a single task using the SubDagOperator. It was used to group related tasks and visualize them as a unit in the UI.

SubDAGs are discouraged because:

  • They have their own scheduler state, which creates deadlocks in some executor configurations.
  • They cannot be parallelized easily.
  • They add complexity without clear benefit over TaskGroups.

The recommended replacement is TaskGroups, introduced in Airflow 2.0, which groups tasks visually without the performance and deadlock issues of SubDAGs.

What is the recommended replacement for SubDAGs in Airflow 2.x?
18. What is a TaskGroup in Airflow?

TaskGroups are a UI grouping mechanism for tasks introduced in Airflow 2.0. They collapse a set of related tasks into a single expandable node in the Graph view, making complex DAGs easier to read. Unlike SubDAGs, TaskGroups don't create a separate DAG object — they are purely visual.

from airflow.utils.task_group import TaskGroup

with DAG('etl_dag', ...) as dag:
    with TaskGroup('extract_group') as extract:
        t1 = BashOperator(task_id='extract_a', bash_command='...')
        t2 = BashOperator(task_id='extract_b', bash_command='...')
    load = BashOperator(task_id='load', bash_command='...')
    extract >> load
Do TaskGroups create separate DAG objects like SubDAGs?
19. What is branching in Airflow and how is BranchPythonOperator used?

Branching lets a DAG conditionally execute one or more downstream paths based on runtime logic. The BranchPythonOperator runs a Python callable that returns the task_id (or list of task_ids) of the branch(es) to follow. All other branches are skipped.

from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator

def choose_branch():
    import random
    return 'branch_a' if random.random() > 0.5 else 'branch_b'

branch = BranchPythonOperator(
    task_id='choose',
    python_callable=choose_branch
)
task_a = EmptyOperator(task_id='branch_a')
task_b = EmptyOperator(task_id='branch_b')
branch >> [task_a, task_b]

Tasks not selected by the branch get a skipped state, so downstream join tasks often need trigger_rule='none_failed_min_one_success'.

What state do unselected branches receive in a BranchPythonOperator flow?
20. What are trigger rules in Airflow?

Trigger rules define when a task should be triggered relative to its upstream tasks. The default is all_success, but many alternatives exist:

  • all_success — all upstream tasks succeeded (default).
  • all_failed — all upstream tasks failed.
  • all_done — all upstream tasks are in a terminal state.
  • one_success — at least one upstream task succeeded.
  • one_failed — at least one upstream task failed.
  • none_failed — no upstream task failed (skipped upstream tasks are allowed).
  • none_failed_min_one_success — no failures and at least one success (useful after branching).
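Airflow's scheduler applies these rules internally; as a plain-Python sketch of the semantics (the `evaluate_trigger_rule` helper is illustrative, not an Airflow API), the decision can be computed from the multiset of upstream terminal states:

```python
def evaluate_trigger_rule(rule, upstream_states):
    """Sketch of trigger-rule semantics, given a list of upstream
    terminal states: 'success', 'failed', or 'skipped'."""
    n_success = upstream_states.count('success')
    n_failed = upstream_states.count('failed')
    if rule == 'all_success':
        return n_success == len(upstream_states)
    if rule == 'all_failed':
        return n_failed == len(upstream_states)
    if rule == 'all_done':
        return True  # every state passed in is already terminal
    if rule == 'one_success':
        return n_success >= 1
    if rule == 'one_failed':
        return n_failed >= 1
    if rule == 'none_failed':
        return n_failed == 0
    if rule == 'none_failed_min_one_success':
        return n_failed == 0 and n_success >= 1
    raise ValueError(f'unknown rule: {rule}')

# After branching, one branch succeeded and the other was skipped:
states = ['success', 'skipped']
assert evaluate_trigger_rule('all_success', states) is False
assert evaluate_trigger_rule('none_failed_min_one_success', states) is True
```

This is why a join task downstream of a branch needs `none_failed_min_one_success`: the default `all_success` would never fire once any branch is skipped.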
Which trigger rule is most appropriate for a join task after branching?
21. What is the Airflow metadata database and what does it store?

The metadata database (metastore) is a relational database (PostgreSQL or MySQL recommended for production; SQLite for local testing) that is the central state store for Airflow. It persists:

  • DAG definitions and schedules
  • DAG Runs and their states
  • Task Instances and their states, logs, and XCom values
  • Connections and Variables
  • User accounts and RBAC roles (when auth is enabled)
  • Pool and slot usage

The database schema is managed through Alembic migrations and upgraded via airflow db upgrade.

Which database is recommended for the Airflow metadata store in production?
22. How does the CeleryExecutor work in Airflow?

The CeleryExecutor uses Celery (a distributed task queue) to distribute Airflow task execution across multiple worker nodes. The workflow is:

  1. The Scheduler submits tasks to a message broker (Redis or RabbitMQ).
  2. Celery workers pick up tasks from the queue and execute them.
  3. Results and state are written back to the metadata database.

This allows horizontal scaling of workers. You can start as many workers as needed with airflow celery worker. A Flower dashboard (airflow celery flower) provides monitoring of the Celery cluster.

What component acts as the message broker for CeleryExecutor?
23. What is the KubernetesExecutor and what are its benefits?

The KubernetesExecutor launches a dedicated Kubernetes pod for every task instance. When a task is scheduled, the executor calls the Kubernetes API to create a pod; the pod runs the task and then terminates. Benefits include:

  • Isolation — each task runs in its own container with its own dependencies.
  • Resource efficiency — no idle workers; pods start on demand.
  • Custom images per task — different Docker images can be specified per task.
  • Scalability — Kubernetes handles scheduling and resource allocation.

Configuration is managed through the pod_override parameter or KubernetesPodOperator for full control.

What is the main advantage of KubernetesExecutor over CeleryExecutor for task isolation?
24. What are Pools in Apache Airflow?

Pools are a mechanism to limit the number of concurrently running tasks that use shared resources (e.g., database connections, API rate limits). You define a pool with a maximum number of slots, and assign tasks to it. If all slots are occupied, queued tasks wait.

# Assign a task to a pool
t = PythonOperator(
    task_id='my_task',
    python_callable=my_func,
    pool='limited_db_pool',    # uses one slot from this pool
    pool_slots=1
)

Pools are created in the UI (Admin > Pools) or via the CLI: airflow pools set limited_db_pool 5 'DB connections pool'.

What problem do Airflow Pools solve?
25. What are Airflow Providers?

Providers (formerly contrib) are installable packages that extend Airflow with operators, hooks, sensors, and connections for third-party services. They are published separately from the core Airflow package, so you install only what you need.

Example providers:

  • apache-airflow-providers-amazon — AWS (S3, Redshift, EMR, etc.)
  • apache-airflow-providers-google — GCP (BigQuery, GCS, Dataflow, etc.)
  • apache-airflow-providers-databricks — Databricks jobs
  • apache-airflow-providers-apache-spark — Spark submit and Livy operators

Install via pip: pip install apache-airflow-providers-amazon.

How do you install Airflow provider packages?
26. What is dynamic task mapping in Airflow?

Dynamic task mapping, introduced in Airflow 2.3, allows you to create a variable number of task instances at runtime based on data rather than defining a fixed set of tasks at DAG parse time.

from airflow.decorators import dag, task
from datetime import datetime

@dag(schedule='@daily', start_date=datetime(2024, 1, 1), catchup=False)
def dynamic_example():
    @task
    def process(item: str):
        print(f'Processing {item}')

    # Creates one task instance per item in the list
    process.expand(item=['a', 'b', 'c', 'd'])

dynamic_example()

You can also map over the output of an upstream task with .expand(item=upstream_task()), which makes the parallelism truly data-driven.

Which Airflow version introduced dynamic task mapping?
27. What is the difference between depends_on_past and wait_for_downstream in Airflow?

Both parameters control inter-run dependencies for a task:

  • depends_on_past=True: The task will only run if the same task in the previous DAG Run succeeded. Useful for tasks that process sequential data.
  • wait_for_downstream=True: The task will wait until all immediately downstream tasks from the previous DAG Run have completed. Stricter than depends_on_past.

These are set as operator-level parameters and are applied per task, not per DAG.
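The two checks can be sketched as a plain-Python predicate over the previous DAG Run's states (the `can_run` helper and its arguments are hypothetical, not Airflow internals). Note that in Airflow, `wait_for_downstream=True` requires the previous run's immediately downstream tasks to have succeeded or been skipped:

```python
def can_run(task_id, prev_run_states, downstream_ids,
            depends_on_past=False, wait_for_downstream=False):
    """Sketch: decide whether task_id may start, given a dict of
    {task_id: state} from the PREVIOUS DAG Run (None = first run)."""
    if prev_run_states is None:
        return True  # no previous run: both flags are waived
    if depends_on_past and prev_run_states.get(task_id) != 'success':
        return False
    if wait_for_downstream:
        # previous run's immediate downstream tasks must have finished
        # successfully or been skipped
        ok_states = {'success', 'skipped'}
        if any(prev_run_states.get(d) not in ok_states for d in downstream_ids):
            return False
    return True

prev = {'extract': 'success', 'load': 'running'}
assert can_run('extract', prev, ['load'], depends_on_past=True) is True
assert can_run('extract', prev, ['load'], wait_for_downstream=True) is False
```

The second assertion shows why `wait_for_downstream` is the stricter flag: the task itself succeeded last run, but its downstream `load` is still running, so the new run must wait.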

What does depends_on_past=True enforce for a task?
28. What is the Airflow Web UI and what can you do with it?

The Airflow Web UI (powered by Flask and the FAB — Flask-AppBuilder security framework) is a browser-based dashboard for managing and monitoring pipelines. Key features include:

  • DAG list view — see all DAGs, toggle pause/unpause, check recent run states.
  • Graph view — visualize task dependencies and task-level states for a specific run.
  • Gantt view — examine task duration and parallelism within a DAG Run.
  • Task logs — read stdout/stderr output for any task instance.
  • Trigger DAGs — manually trigger a DAG Run with optional config JSON.
  • Clear tasks — reset failed or successful tasks for re-execution.
  • Admin — manage Connections, Variables, Pools, and user accounts.
Which Airflow UI view helps you analyze task duration and parallelism?
29. What are Airflow task states and what do they mean?

Each Task Instance in Airflow goes through a lifecycle represented by states:

  • none — not yet scheduled.
  • scheduled — dependencies met, waiting for the executor.
  • queued — sent to the executor, waiting for a worker slot.
  • running — currently executing.
  • success — completed without error.
  • failed — terminated with an error.
  • skipped — not run due to branching or a trigger rule.
  • up_for_retry — failed but has retries remaining.
  • up_for_reschedule — sensor released its slot and will poll again.
  • deferred — waiting for an async trigger to fire.
What state does a task enter when it fails but retries are still available?
30. What are retries and retry_delay in Airflow tasks?

Task-level retry parameters control behavior after a task failure:

  • retries — number of times Airflow will retry the task before marking it as failed. Default is 0.
  • retry_delay — how long to wait between retries (a timedelta object).
  • retry_exponential_backoff — if True, doubles the wait time after each retry.
  • max_retry_delay — caps the maximum wait time when exponential backoff is enabled.
from datetime import timedelta

t = PythonOperator(
    task_id='resilient_task',
    python_callable=my_func,
    retries=3,
    retry_delay=timedelta(minutes=5),
    retry_exponential_backoff=True
)
Which parameter causes retry wait times to increase after each failure?
31. What is a Deferrable Operator (async operator) in Airflow?

Deferrable Operators (introduced in Airflow 2.2) allow tasks to suspend themselves, release the worker slot, and wait for an external event via a lightweight Trigger component. This is more efficient than sensors in poke mode because no worker thread is held while waiting.

The flow is:

  1. Task starts, then defers itself by calling self.defer(...).
  2. A Triggerer process monitors the event (file arrival, HTTP response, etc.).
  3. When the event fires, the Triggerer resumes the task on a worker.

Use deferrable operators for long-polling scenarios (e.g., waiting hours for a cloud job to finish) to avoid tying up workers.
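Conceptually, the defer/resume cycle looks like the asyncio sketch below (plain Python, not Airflow's Trigger API; all names are illustrative). Many such awaits share one event loop in the Triggerer, which is why no worker thread is tied up during the wait:

```python
import asyncio

async def trigger(event_ready):
    """Sketch of a Trigger: await the external event without holding a
    worker thread (many such waits can share one event loop)."""
    await event_ready.wait()
    return 'event fired'

async def deferrable_task():
    event = asyncio.Event()
    # simulate the external system completing after 10 ms
    asyncio.get_running_loop().call_later(0.01, event.set)
    payload = await trigger(event)  # "deferred": no worker slot held here
    return f'resumed with: {payload}'

result = asyncio.run(deferrable_task())
assert result == 'resumed with: event fired'
```

In real Airflow, the task serializes its state and exits the worker entirely at the `self.defer(...)` call; the sketch above only models the event-driven resume.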

What Airflow process monitors events for deferrable operators?
32. What are Airflow Plugins?

Plugins are a way to extend Airflow with custom operators, hooks, sensors, macros, UI views, and Flask blueprints without forking the core codebase. You place a Python module in the plugins folder (configured as plugins_folder) and Airflow automatically discovers it at startup.

# plugins/my_plugin.py
from airflow.plugins_manager import AirflowPlugin
from my_operators import MyCustomOperator

class MyPlugin(AirflowPlugin):
    name = 'my_plugin'
    operators = [MyCustomOperator]

In modern Airflow 2.x, many extensions are better distributed as provider packages rather than plugins for easier versioning and reuse.

Where should plugin Python modules be placed for Airflow to auto-discover them?
33. How does Airflow handle templating and macros?

Airflow supports Jinja2 templating inside operator parameters that are listed in template_fields. You can inject runtime values such as execution date, data interval, and custom macros.

Common built-in macros:

  • {{ ds }} — execution date as YYYY-MM-DD.
  • {{ ds_nodash }} — execution date without dashes.
  • {{ data_interval_start }} / {{ data_interval_end }} — start/end of the data interval.
  • {{ var.value.my_var }} — an Airflow Variable value.
  • {{ conn.my_conn_id.host }} — a Connection attribute.
BashOperator(
    task_id='print_date',
    bash_command='echo Processing date: {{ ds }}'
)
Which Jinja2 macro gives you the execution date in YYYY-MM-DD format?
34. What is idempotency in the context of Airflow tasks?

An idempotent task produces the same result regardless of how many times it is executed for the same input. In Airflow, tasks are often retried or re-run (via clear), so designing them to be idempotent prevents duplicate data or inconsistent state.

Techniques to ensure idempotency:

  • Use INSERT ... ON CONFLICT DO NOTHING or upserts instead of plain inserts.
  • Partition output by {{ ds }} and overwrite the partition on each run.
  • Write to intermediate staging tables and swap atomically.
  • Delete existing data for the date range before inserting.
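The delete-before-insert pattern from the last bullet can be sketched in plain Python, using a list of dicts as a stand-in for a table partitioned by `{{ ds }}` (the `load_partition` helper is illustrative, not an Airflow API):

```python
def load_partition(table, ds, rows):
    """Idempotent load: wipe the partition for this ds, then insert.
    Re-running for the same ds always yields the same final state."""
    table[:] = [r for r in table if r['ds'] != ds]  # delete existing partition
    table.extend({'ds': ds, **r} for r in rows)

table = []
load_partition(table, '2024-01-01', [{'value': 1}, {'value': 2}])
load_partition(table, '2024-01-01', [{'value': 1}, {'value': 2}])  # retry/re-run
# The re-run did not duplicate rows:
assert table == [{'ds': '2024-01-01', 'value': 1},
                 {'ds': '2024-01-01', 'value': 2}]
```

A plain append-only load would leave four rows after the re-run; the partition wipe is what makes clearing or retrying the task safe.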
Why is idempotency important for Airflow tasks?
35. What are best practices for writing efficient Airflow DAGs?

Key best practices for production-quality DAGs:

  • Keep DAG files lightweight — avoid heavy imports or database calls at parse time; the scheduler parses DAG files continuously.
  • Use top-level constants only — don't call APIs or read files at module level; do it inside operators/callables.
  • Set catchup=False unless backfilling is intentional.
  • Prefer TaskFlow API for clarity and automatic XCom passing.
  • Use sensors in reschedule mode for long waits.
  • Keep tasks idempotent and atomic.
  • Use Pools to protect downstream systems from overload.
  • Set email_on_failure and SLAs for alerting.
  • Avoid using Variables at top-level — each call hits the DB at parse time.
Why should you avoid calling Airflow Variables at the top level of a DAG file?
36. What is the ExternalTaskSensor in Airflow?

The ExternalTaskSensor waits for a task in another DAG (or the DAG itself) to reach a target state (default: success) before proceeding. This enables cross-DAG dependencies without tight coupling.

from airflow.sensors.external_task import ExternalTaskSensor

wait = ExternalTaskSensor(
    task_id='wait_for_upstream',
    external_dag_id='upstream_dag',
    external_task_id='final_task',
    mode='reschedule',
    timeout=3600
)

The sensor matches runs by execution_date by default. Use execution_date_fn or execution_delta when the upstream DAG runs on a different schedule.

How does ExternalTaskSensor match runs in the external DAG by default?
37. What is the KubernetesPodOperator in Airflow?

The KubernetesPodOperator runs any command inside a Docker container launched as a Kubernetes pod. It is independent of the executor type — even with CeleryExecutor you can use KubernetesPodOperator to run specific tasks in isolated pods.

from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

run_etl = KubernetesPodOperator(
    task_id='run_etl',
    name='etl-pod',
    namespace='airflow',
    image='myrepo/etl:latest',
    cmds=['python', 'etl.py'],
    arguments=['--date', '{{ ds }}'],
    get_logs=True,
    is_delete_operator_pod=True
)

This is useful for tasks that need specific dependencies, GPU access, or custom environments not available in the standard worker image.

Can KubernetesPodOperator be used with a CeleryExecutor deployment?
38. What are SLAs in Apache Airflow and how are they configured?

SLAs (Service Level Agreements) in Airflow allow you to define the maximum time by which a task or DAG should complete after the scheduled execution date. If the deadline is missed, Airflow sends an email alert and logs an SLA miss event.

from datetime import timedelta

with DAG(
    'sla_example',
    sla_miss_callback=my_sla_callback,
    default_args={'sla': timedelta(hours=2)},
    ...
) as dag:
    t = PythonOperator(task_id='task', python_callable=my_func)

SLA misses appear in the Airflow UI under Browse > SLA Misses. Note: SLAs are measured from the scheduled execution date, not the actual start time.

From when is an Airflow SLA miss measured?
39. How does Airflow handle task concurrency and parallelism?

Airflow provides several levels of concurrency control:

  • parallelism (airflow.cfg) — global maximum number of tasks running across all DAGs.
  • dag_concurrency / max_active_tasks (per DAG) — max tasks running within a single DAG at once.
  • max_active_runs (per DAG) — max simultaneous DAG Runs for a single DAG.
  • task_concurrency / max_active_tis_per_dag (per task) — max instances of a specific task running at once across all DAG Runs.
  • Pools — cross-DAG slot limits for shared resources.
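These limits compose: a queued task starts only if every layer has a free slot. A minimal sketch of that accounting (the `effective_slots` helper is illustrative, not Airflow code):

```python
def effective_slots(parallelism, max_active_tasks, pool_open_slots, queued):
    """Sketch: the number of queued tasks that can actually start is
    bounded by every concurrency layer at once."""
    return min(parallelism, max_active_tasks, pool_open_slots, queued)

# 32 global slots, the DAG capped at 16 active tasks, the pool has
# 4 free slots, and 10 tasks are queued -> only 4 may start:
assert effective_slots(32, 16, 4, 10) == 4
```

In practice this is why a pipeline can look "stuck" with free workers: the tightest limit (often a small pool) is the one that governs.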
Which setting limits how many simultaneous DAG Runs can exist for a single DAG?
40. What is an Airflow Dataset and how does data-driven scheduling work?

Datasets (introduced in Airflow 2.4) are logical references to data assets identified by a URI. A DAG can produce a Dataset via an outlet, and another DAG can be scheduled to run automatically when that Dataset is updated.

from airflow.datasets import Dataset
from airflow.decorators import dag, task

my_dataset = Dataset('s3://my-bucket/output/daily.csv')

# Producer DAG
@dag(schedule='@daily', ...)
def producer():
    @task(outlets=[my_dataset])
    def write_data():
        ...  # write to S3

    write_data()

# Consumer DAG — triggered whenever my_dataset is updated
@dag(schedule=[my_dataset], ...)
def consumer():
    ...

This replaces fragile time-based scheduling with event-driven, data-dependency-aware scheduling.

How is a consumer DAG triggered in data-driven scheduling?
41. What is the difference between Airflow and Apache Spark?

Airflow and Spark serve different purposes and are frequently used together:

  • Purpose — Airflow: workflow orchestration (define, schedule, and monitor pipelines). Spark: distributed data processing (transform and analyze large datasets in memory).
  • Data — Airflow does not process data itself; it delegates to operators/hooks. Spark processes terabytes of data in parallel across a cluster.
  • Language — Airflow DAGs are written in Python. Spark supports Scala, Python (PySpark), Java, and R.
  • Execution — Airflow schedules tasks on workers. Spark runs in-memory RDD/DataFrame transformations on executors.

A common pattern: Airflow submits a Spark job via SparkSubmitOperator or LivyOperator, then monitors its completion.

What is the primary role of Apache Airflow compared to Apache Spark?
42. How do you deploy Apache Airflow using Docker Compose?

The official Airflow Docker Compose setup (docker-compose.yaml) is the quickest way to run a production-like local environment with Postgres, Redis, and CeleryExecutor. Steps:

  1. Download the official compose file: curl -LfO 'https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml'
  2. Create required directories: mkdir -p ./dags ./logs ./plugins ./config
  3. Set the Airflow UID: echo -e "AIRFLOW_UID=$(id -u)" > .env
  4. Initialise the database: docker compose up airflow-init
  5. Start all services: docker compose up -d

The UI is available at http://localhost:8080 with user airflow / password airflow.

Which executor does the official Airflow Docker Compose setup use?
43. What is Airflow on Kubernetes (KEDA) autoscaling?

When running CeleryExecutor on Kubernetes, KEDA (Kubernetes-based Event Driven Autoscaler) can automatically scale Celery worker pods based on the number of tasks in the queue. This removes the need for static worker counts and makes the cluster cost-efficient.

How it works:

  1. KEDA is deployed alongside Airflow.
  2. A ScaledObject is configured to watch the Celery broker queue length.
  3. When tasks accumulate, KEDA scales up worker Deployment replicas.
  4. When the queue drains, workers scale back down to zero.
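The scale-up decision in steps 3–4 amounts to "run just enough workers so every queued or running task has a slot." The official Helm chart expresses this as a database query over task instances; the arithmetic it performs looks roughly like the sketch below (the function name and the default worker_concurrency of 16 are illustrative assumptions):

```python
import math

def desired_worker_replicas(queued_or_running_tasks, worker_concurrency=16):
    # KEDA-style target replica count: ceil(tasks / slots-per-worker).
    # An empty queue yields zero replicas, so idle workers scale to zero.
    return math.ceil(queued_or_running_tasks / worker_concurrency)
```

With the default of 16 slots per worker, 1–16 pending tasks keep one worker alive, a 17th task triggers a second replica, and an empty queue scales the workers down to zero.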

Helm chart support: set workers.keda.enabled=true in the official Airflow Helm chart.

What does KEDA monitor to scale Airflow workers?
44. What is the SparkSubmitOperator in Airflow?

The SparkSubmitOperator submits a Spark application to a Spark cluster (Standalone, YARN, or Kubernetes). It wraps the spark-submit command and reads Spark connection details from an Airflow Connection (conn_id='spark_default').

from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

submit_job = SparkSubmitOperator(
    task_id='run_spark_etl',
    application='/opt/spark/jobs/etl.py',
    conn_id='spark_default',
    executor_cores=2,
    executor_memory='4g',
    driver_memory='2g',
    verbose=True
)

Install the provider: pip install apache-airflow-providers-apache-spark.

Which provider package contains the SparkSubmitOperator?
45. What is Managed Airflow (MWAA) on AWS?

Amazon Managed Workflows for Apache Airflow (MWAA) is a fully managed Airflow service on AWS. It handles infrastructure provisioning, scaling, patching, and HA configuration. Key features:

  • DAGs are stored in an S3 bucket and auto-synced to the environment.
  • Supports CeleryExecutor with auto-scaling workers.
  • Integrates natively with IAM for access control and with AWS Secrets Manager for storing connections and variables.
  • Logs are shipped to CloudWatch.
  • Runs inside a VPC for network isolation.

It's useful for teams that want to run Airflow without managing the underlying infrastructure.

Where are DAG files stored in an AWS MWAA environment?
46. How does Airflow handle secrets management?

Airflow supports a pluggable Secrets Backend to retrieve Connections and Variables from external secret stores rather than the metadata DB. Supported backends include AWS Secrets Manager, HashiCorp Vault, GCP Secret Manager, and Azure Key Vault. Configure via secrets.backend in airflow.cfg. Example:

[secrets]
backend = airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend
backend_kwargs = {"connections_prefix": "airflow/connections", "variables_prefix": "airflow/variables"}
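With that configuration, the backend resolves a Connection or Variable by joining the configured prefix and the requested name into a single lookup key — so conn_id my_db is fetched from the secret airflow/connections/my_db. A minimal sketch of that key construction (the function name is hypothetical; the exact separator is configurable in the real backend):

```python
def secrets_manager_key(prefix, name, sep="/"):
    # e.g. prefix 'airflow/connections' + conn_id 'my_db'
    #   -> secret name 'airflow/connections/my_db'
    return f"{prefix}{sep}{name}"
```

For example, `secrets_manager_key("airflow/variables", "api_key")` yields the secret name "airflow/variables/api_key".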
Which configuration section controls the Secrets Backend in Airflow?
47. What is the difference between PythonOperator and PythonVirtualenvOperator?

Both operators run Python callables, but differ in execution environment:

  • PythonOperator — runs the callable in the same Python environment as the Airflow worker. All packages available to the worker are accessible.
  • PythonVirtualenvOperator — creates a temporary virtual environment with specified requirements, runs the callable inside it, then destroys the venv. Useful when a task needs package versions that conflict with the worker environment.
from airflow.operators.python import PythonVirtualenvOperator

def my_func():
    # Imports must live inside the callable: it runs in the temporary venv,
    # not in the worker's environment.
    import pandas as pd
    print(pd.__version__)

run = PythonVirtualenvOperator(
    task_id='isolated_task',
    python_callable=my_func,
    requirements=['pandas==1.5.0', 'numpy'],
    system_site_packages=False
)
When should you prefer PythonVirtualenvOperator over PythonOperator?
48. What is the Grid view in Airflow 2.x?

The Grid view (introduced in Airflow 2.3 as a replacement for the Tree view) shows a matrix with DAG Runs on the X-axis and tasks on the Y-axis. Each cell represents a task instance, colored by state (green for success, red for failed, and so on).

Key interactions:

  • Click a cell to inspect that task instance's logs, duration, and XCom output.
  • Filter by run type (scheduled, manual, backfill).
  • Group tasks by TaskGroup using the expand/collapse controls.

The Grid view is the primary view for diagnosing patterns of failures across multiple DAG Runs.

What replaced the Tree view in Airflow 2.3?
49. What are common Airflow anti-patterns to avoid?

Common pitfalls that hurt reliability and performance:

  • Top-level DB calls — calling Variable.get() or Connection.get_connection_from_secrets() at parse time stresses the scheduler.
  • Non-idempotent tasks — retries create duplicate data.
  • Giant XCom payloads — overloads the metadata DB.
  • Too many small tasks — scheduling overhead grows linearly; batch micro-tasks where possible.
  • Dynamic DAG generation at parse time — if the generation is slow (API calls, DB queries), the scheduler lags.
  • Using SubDAGs — causes deadlocks; use TaskGroups instead.
  • Sensors in poke mode for long waits — blocks worker slots; use reschedule mode.
  • Hardcoded credentials — always use Connections or secrets backend.
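The non-idempotency pitfall above is usually fixed by keying every write to the run's logical date, so a retry overwrites the same partition instead of appending duplicates. A minimal sketch of that pattern (the helper name and path layout are illustrative):

```python
def run_output_path(base, logical_date):
    # Deterministic path keyed by the run's logical date: a retry of the
    # same run rewrites the same partition rather than duplicating data.
    return f"{base}/dt={logical_date}/part-00000.parquet"
```

Because the path depends only on the run, `run_output_path("s3://bucket/out", "2024-01-01")` always resolves to the same location no matter how many times the task is retried.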
Why should you avoid using SubDAGs in Airflow?
50. What is Airflow 2 vs Airflow 1 — key differences?

Airflow 2.0 (released December 2020) was a major overhaul. Key changes:

Feature           | Airflow 1.x                       | Airflow 2.x
Scheduler HA      | Single scheduler only             | Multiple schedulers supported (HA)
TaskFlow API      | Not available                     | @dag / @task decorators
Provider packages | All in monolithic airflow package | Separate installable provider packages
REST API          | Experimental                      | Stable, versioned REST API
UI                | Flask-Admin                       | Flask-AppBuilder with RBAC
DAG serialization | Optional                          | Enabled by default (faster scheduler)
What major scheduler improvement was introduced in Airflow 2.0?