
BigData / Apache Airflow Interview Questions

1. What is Apache Airflow?
2. What is a DAG in Apache Airflow?
3. What are Operators in Apache Airflow?
4. What is the Airflow Scheduler?
5. What are the main components of Apache Airflow?
6. What is an Executor in Airflow and what types are available?
7. What is an Airflow Connection?
8. What is an Airflow Variable?
9. What is XCom in Airflow and how is it used?
10. What are Hooks in Apache Airflow?
11. What is the difference between a DAG Run and a Task Instance in Airflow?
12. What are Sensors in Apache Airflow?
13. What is catchup in Airflow and how does it work?
14. What is backfilling in Apache Airflow?
15. What is the TaskFlow API in Airflow?
16. What is the difference between schedule_interval and timetable in Airflow?
17. What is a SubDAG and why is it generally discouraged?
18. What is a TaskGroup in Airflow?
19. What is branching in Airflow and how is BranchPythonOperator used?
20. What are trigger rules in Airflow?
21. What is the Airflow metadata database and what does it store?
22. How does the CeleryExecutor work in Airflow?
23. What is the KubernetesExecutor and what are its benefits?
24. What are Pools in Apache Airflow?
25. What are Airflow Providers?
26. What is dynamic task mapping in Airflow?
27. What is the difference between depends_on_past and wait_for_downstream in Airflow?
28. What is the Airflow Web UI and what can you do with it?
29. What are Airflow task states and what do they mean?
30. What are retries and retry_delay in Airflow tasks?
31. What is a Deferrable Operator (async operator) in Airflow?
32. What are Airflow Plugins?
33. How does Airflow handle templating and macros?
34. What is idempotency in the context of Airflow tasks?
35. What are best practices for writing efficient Airflow DAGs?
36. What is the ExternalTaskSensor in Airflow?
37. What is the KubernetesPodOperator in Airflow?
38. What are SLAs in Apache Airflow and how are they configured?
39. How does Airflow handle task concurrency and parallelism?
40. What is an Airflow Dataset and how does data-driven scheduling work?
41. What is the difference between Airflow and Apache Spark?
42. How do you deploy Apache Airflow using Docker Compose?
43. What is Airflow on Kubernetes (KEDA) autoscaling?
44. What is the SparkSubmitOperator in Airflow?
45. What is Managed Airflow (MWAA) on AWS?
46. How does Airflow handle secrets management?
47. What is the difference between PythonOperator and PythonVirtualenvOperator?
48. What is the Grid view in Airflow 2.x?
49. What are common Airflow anti-patterns to avoid?
50. What is Airflow 2 vs Airflow 1 — key differences?

1. What is Apache Airflow?

Apache Airflow is an open-source platform for programmatically authoring, scheduling, and monitoring workflows. Originally created at Airbnb in 2014 and donated to the Apache Software Foundation, Airflow lets you define workflows as Directed Acyclic Graphs (DAGs) written in Python. The scheduler executes tasks on an array of workers while following the specified dependencies. The rich UI makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues.

Key capabilities include dynamic pipeline generation, extensible operators, and a web-based UI for visibility into pipeline state.

What language are Apache Airflow workflows defined in?
Which organization originally created Apache Airflow?
2. What is a DAG in Apache Airflow?

A DAG (Directed Acyclic Graph) is the core concept in Airflow. It is a collection of tasks organized with dependencies and relationships that define how they should run. The "directed" part means each edge has a direction (from one task to another). "Acyclic" means there are no loops: you cannot create a cycle where task A depends on task B which depends on task A.

A simple DAG example:

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

with DAG('my_dag', start_date=datetime(2024, 1, 1), schedule='@daily') as dag:
    t1 = PythonOperator(task_id='task_1', python_callable=lambda: print('Hello'))
    t2 = PythonOperator(task_id='task_2', python_callable=lambda: print('World'))
    t1 >> t2  # t2 runs after t1

What does the 'Acyclic' property in a DAG guarantee?
3. What are Operators in Apache Airflow?

Operators are the building blocks of a DAG. Each operator represents a single, idempotent unit of work — a task. Airflow provides a wide variety of built-in operators:

  • PythonOperator — executes a Python callable.
  • BashOperator — runs a Bash command.
  • EmailOperator — sends an email notification.
  • HttpOperator — makes an HTTP request.
  • SqlOperator / PostgresOperator — executes SQL against a database.

Operators are instantiated as Task objects inside a DAG context. They define what to do; the Airflow scheduler decides when to do it.

Which Airflow operator would you use to run a shell script?
4. What is the Airflow Scheduler?

The Airflow Scheduler is a core component that monitors DAGs and triggers task instances when their dependencies are met and the schedule interval fires. It continuously parses DAG files, checks the DAG schedule and any data interval conditions, and submits tasks to the executor for running.

Key points about the scheduler:

  • Reads DAG files from the configured dags_folder.
  • Creates DagRuns and TaskInstances in the metadata database.
  • Does not execute tasks itself — it delegates that to the Executor.
  • Runs continuously as a background process.
Does the Airflow Scheduler directly execute tasks?
5. What are the main components of Apache Airflow?

Apache Airflow consists of five main components:

  • Webserver — provides the UI to monitor, trigger, and debug DAGs.
  • Scheduler — triggers task instances per schedule and dependency rules.
  • Executor — determines how and where tasks run (locally, on Celery workers, on Kubernetes, etc.).
  • Metadata DB — stores the state of DAGs, runs, task instances, connections, and variables.
  • Workers — processes that actually execute tasks (used with CeleryExecutor/KubernetesExecutor).
Which Airflow component stores DAG run state?
6. What is an Executor in Airflow and what types are available?

The Executor defines how tasks are run. Airflow supports several executor types suited for different scale and infrastructure needs:

  • SequentialExecutor — runs one task at a time; only for development/testing.
  • LocalExecutor — runs tasks in parallel on the same machine using subprocesses.
  • CeleryExecutor — distributes tasks across a pool of Celery workers; suitable for production scale-out.
  • KubernetesExecutor — spins up a new Kubernetes pod per task; great for isolation and dynamic resource allocation.
  • CeleryKubernetesExecutor — hybrid of Celery and Kubernetes executors.
Which executor is recommended only for development and testing?
7. What is an Airflow Connection?

A Connection stores credentials and endpoint information for external systems such as databases, cloud providers, and APIs. Rather than hardcoding credentials in DAG code, you store them in Airflow's metadata database (or a secrets backend) and reference them by a conn_id.

Example — reading a Postgres connection:

from airflow.providers.postgres.hooks.postgres import PostgresHook

hook = PostgresHook(postgres_conn_id='my_postgres')
records = hook.get_records('SELECT * FROM users LIMIT 10')

Connections can be managed via the UI, CLI, or environment variables prefixed with AIRFLOW_CONN_.

How can Airflow Connections be stored securely without using the metadata DB?

8. What is an Airflow Variable?

Variables are key-value pairs stored in Airflow's metadata database. They provide a way to pass configuration or runtime values into DAGs without hardcoding them. Variables can be set via the UI, CLI, environment variables, or the Python API.

from airflow.models import Variable

# Read a variable (with default fallback)
env = Variable.get('environment', default_var='production')

# Deserialize a JSON-encoded variable
config = Variable.get('pipeline_config', deserialize_json=True)

For sensitive values, prefer Connections or a secrets backend over plain Variables, since Variable values are stored unencrypted by default.

Are Airflow Variables encrypted by default?
9. What is XCom in Airflow and how is it used?

XCom (cross-communication) is a mechanism that lets tasks exchange small amounts of data. A task can push a value into XCom and a downstream task can pull it.

def push_func(**context):
    context['ti'].xcom_push(key='result', value=42)

def pull_func(**context):
    val = context['ti'].xcom_pull(task_ids='push_task', key='result')
    print(f'Received: {val}')

XCom values are stored in the metadata database, so they should be used for small payloads (strings, numbers, short dicts). Passing large dataframes through XCom is an anti-pattern — use a shared storage layer such as S3 or GCS instead.

What is the recommended maximum size of data to pass via XCom?
10. What are Hooks in Apache Airflow?

Hooks are interfaces to external platforms and databases. They abstract connection handling, authentication, and API calls so operators and tasks don't need to manage those details directly. Hooks use Connections (stored in the metadata DB) to retrieve credentials.

Example using the S3Hook:

from airflow.providers.amazon.aws.hooks.s3 import S3Hook

hook = S3Hook(aws_conn_id='my_aws')
hook.load_file(filename='/tmp/data.csv', key='uploads/data.csv', bucket_name='my-bucket')

Common built-in hooks include PostgresHook, MySqlHook, HttpHook, S3Hook, BigQueryHook, and many provider-contributed hooks.

What is the primary purpose of a Hook in Airflow?
11. What is the difference between a DAG Run and a Task Instance in Airflow?

A DAG Run is an instantiation of the whole DAG for a specific data interval (logical date). It tracks the overall execution state (running, success, failed) for that interval.

A Task Instance is a specific execution of one task within a DAG Run. Each task instance belongs to exactly one DAG Run and has its own state (queued, running, success, failed, skipped, etc.).

Relationship: one DAG Run contains many Task Instances — one per task defined in the DAG.

What is the relationship between a DAG Run and Task Instances?
12. What are Sensors in Apache Airflow?

Sensors are a special type of operator that wait for a certain condition to become true before proceeding. They poke the condition at a configurable interval (poke_interval) and either block (poke mode) or release the worker slot between polls (reschedule mode).

Common sensors:

  • FileSensor — waits until a file appears on the filesystem.
  • S3KeySensor — waits for an object key to exist in S3.
  • HttpSensor — polls an HTTP endpoint until a condition is met.
  • ExternalTaskSensor — waits for a task in another DAG to succeed.

Using mode='reschedule' is best practice for long waits to avoid holding worker slots idle.
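The poke-mode behavior can be sketched in plain Python (the `wait_for` helper below is hypothetical, not Airflow's internal API): the worker blocks, re-evaluating the condition every `poke_interval` seconds until it passes or the timeout expires.

```python
import time

def wait_for(condition, poke_interval=5.0, timeout=60.0):
    """Minimal sketch of a poke-mode sensor loop: hold the worker slot,
    re-check the condition every poke_interval seconds, fail on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True  # condition met; downstream tasks may proceed
        time.sleep(poke_interval)
    raise TimeoutError('sensor timed out')

# Example: a condition that becomes true on the third check
checks = iter([False, False, True])
assert wait_for(lambda: next(checks), poke_interval=0.01, timeout=1) is True
```

Reschedule mode differs in that, between checks, the task instance is put back into the `up_for_reschedule` state and the slot is freed instead of sleeping inside the loop.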

Which sensor mode releases the worker slot while waiting?
13. What is catchup in Airflow and how does it work?

Catchup is a DAG parameter that controls whether Airflow should backfill all DAG Runs from the start_date up to the current date when a DAG is first enabled or its schedule is changed. When catchup=True (default), Airflow will create a DAG Run for every missed schedule interval. When catchup=False, only the most recent interval is scheduled.

with DAG(
    'my_dag',
    start_date=datetime(2024, 1, 1),
    schedule='@daily',
    catchup=False   # do not backfill past runs
) as dag:
    ...

For most production pipelines, setting catchup=False is safer to avoid unexpected mass backfill on re-activation.

What happens when catchup=True and a DAG has been paused for 30 days?
14. What is backfilling in Apache Airflow?

Backfilling is the process of running a DAG for historical date ranges that were not previously executed. You can trigger a backfill from the CLI:

airflow dags backfill --start-date 2024-01-01 --end-date 2024-03-31 my_dag

This creates DAG Runs for every schedule interval between the start and end dates. Backfilling is useful when you add a new DAG with start_date in the past and want to process historical data, or when a pipeline was down and you need to reprocess missed intervals.

How do you trigger a backfill for a DAG via the Airflow CLI?
15. What is the TaskFlow API in Airflow?

Introduced in Airflow 2.0, the TaskFlow API is a decorator-based approach to writing DAGs and tasks that reduces boilerplate. Instead of instantiating operator objects explicitly, you decorate plain Python functions with @task and define the DAG with @dag.

from airflow.decorators import dag, task
from datetime import datetime

@dag(schedule='@daily', start_date=datetime(2024, 1, 1), catchup=False)
def my_pipeline():
    @task
    def extract():
        return {'rows': 100}

    @task
    def transform(data: dict):
        return data['rows'] * 2

    transform(extract())

my_pipeline()

XCom passing is handled automatically when the return value of one @task is passed as an argument to another.

Which Airflow version introduced the TaskFlow API?
16. What is the difference between schedule_interval and timetable in Airflow?

schedule_interval (deprecated from Airflow 2.4) accepted cron strings or timedelta objects to define how frequently a DAG should run. schedule (the replacement) accepts the same values but also supports Timetables.

A Timetable is a plugin interface that gives full control over how data intervals and run times are calculated. This supports use cases like business-day schedules or irregular intervals that can't be expressed with cron alone.

from airflow.timetables.interval import CronDataIntervalTimetable

with DAG(
    'example',
    schedule=CronDataIntervalTimetable('0 9 * * MON-FRI', timezone='UTC'),
    ...
) as dag:
    ...
What does a Timetable enable that a cron expression alone cannot?
17. What is a SubDAG and why is it generally discouraged?

A SubDAG is a pattern where one DAG embeds another DAG as a single task using the SubDagOperator. It was used to group related tasks and visualize them as a unit in the UI.

SubDAGs are discouraged because:

  • They have their own scheduler state, which creates deadlocks in some executor configurations.
  • They cannot be parallelized easily.
  • They add complexity without clear benefit over TaskGroups.

The recommended replacement is TaskGroups, introduced in Airflow 2.0, which groups tasks visually without the performance and deadlock issues of SubDAGs.

What is the recommended replacement for SubDAGs in Airflow 2.x?
18. What is a TaskGroup in Airflow?

TaskGroups are a UI grouping mechanism for tasks introduced in Airflow 2.0. They collapse a set of related tasks into a single expandable node in the Graph view, making complex DAGs easier to read. Unlike SubDAGs, TaskGroups don't create a separate DAG object — they are purely visual.

from airflow.utils.task_group import TaskGroup

with DAG('etl_dag', ...) as dag:
    with TaskGroup('extract_group') as extract:
        t1 = BashOperator(task_id='extract_a', bash_command='...')
        t2 = BashOperator(task_id='extract_b', bash_command='...')
    load = BashOperator(task_id='load', bash_command='...')
    extract >> load
Do TaskGroups create separate DAG objects like SubDAGs?
19. What is branching in Airflow and how is BranchPythonOperator used?

Branching lets a DAG conditionally execute one or more downstream paths based on runtime logic. The BranchPythonOperator runs a Python callable that returns the task_id (or list of task_ids) of the branch(es) to follow. All other branches are skipped.

from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator

def choose_branch():
    import random
    return 'branch_a' if random.random() > 0.5 else 'branch_b'

branch = BranchPythonOperator(
    task_id='choose',
    python_callable=choose_branch
)
task_a = EmptyOperator(task_id='branch_a')
task_b = EmptyOperator(task_id='branch_b')
branch >> [task_a, task_b]

Tasks not selected by the branch get a skipped state, so downstream join tasks often need trigger_rule='none_failed_min_one_success'.

What state do unselected branches receive in a BranchPythonOperator flow?
20. What are trigger rules in Airflow?

Trigger rules define when a task should be triggered relative to its upstream tasks. The default is all_success, but many alternatives exist:

  • all_success — all upstream tasks succeeded (default).
  • all_failed — all upstream tasks failed.
  • all_done — all upstream tasks are in a terminal state.
  • one_success — at least one upstream task succeeded.
  • one_failed — at least one upstream task failed.
  • none_failed — no upstream task failed (skipped upstream tasks are allowed).
  • none_failed_min_one_success — no failures and at least one success (useful after branching).
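Airflow's scheduler applies these rules internally; as a plain-Python sketch of the semantics (the `evaluate_trigger_rule` helper is illustrative, not an Airflow API), the decision can be computed from the multiset of upstream terminal states:

```python
def evaluate_trigger_rule(rule, upstream_states):
    """Sketch of trigger-rule semantics, given a list of upstream
    terminal states: 'success', 'failed', or 'skipped'."""
    n_success = upstream_states.count('success')
    n_failed = upstream_states.count('failed')
    if rule == 'all_success':
        return n_success == len(upstream_states)
    if rule == 'all_failed':
        return n_failed == len(upstream_states)
    if rule == 'all_done':
        return True  # every state passed in is already terminal
    if rule == 'one_success':
        return n_success >= 1
    if rule == 'one_failed':
        return n_failed >= 1
    if rule == 'none_failed':
        return n_failed == 0
    if rule == 'none_failed_min_one_success':
        return n_failed == 0 and n_success >= 1
    raise ValueError(f'unknown rule: {rule}')

# After branching, one branch succeeded and the other was skipped:
states = ['success', 'skipped']
assert evaluate_trigger_rule('all_success', states) is False
assert evaluate_trigger_rule('none_failed_min_one_success', states) is True
```

This is why a join task downstream of a branch needs `none_failed_min_one_success`: the default `all_success` would never fire once any branch is skipped.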
Which trigger rule is most appropriate for a join task after branching?
21. What is the Airflow metadata database and what does it store?

The metadata database (metastore) is a relational database (PostgreSQL or MySQL recommended for production; SQLite for local testing) that is the central state store for Airflow. It persists:

  • DAG definitions and schedules
  • DAG Runs and their states
  • Task Instances and their states, logs, and XCom values
  • Connections and Variables
  • User accounts and RBAC roles (when auth is enabled)
  • Pool and slot usage

The database schema is managed through Alembic migrations and upgraded via airflow db upgrade.

Which database is recommended for the Airflow metadata store in production?
22. How does the CeleryExecutor work in Airflow?

The CeleryExecutor uses Celery (a distributed task queue) to distribute Airflow task execution across multiple worker nodes. The workflow is:

  1. The Scheduler submits tasks to a message broker (Redis or RabbitMQ).
  2. Celery workers pick up tasks from the queue and execute them.
  3. Results and state are written back to the metadata database.

This allows horizontal scaling of workers. You can start as many workers as needed with airflow celery worker. A Flower dashboard (airflow celery flower) provides monitoring of the Celery cluster.

What component acts as the message broker for CeleryExecutor?
23. What is the KubernetesExecutor and what are its benefits?

The KubernetesExecutor launches a dedicated Kubernetes pod for every task instance. When a task is scheduled, the executor calls the Kubernetes API to create a pod; the pod runs the task and then terminates. Benefits include:

  • Isolation — each task runs in its own container with its own dependencies.
  • Resource efficiency — no idle workers; pods start on demand.
  • Custom images per task — different Docker images can be specified per task.
  • Scalability — Kubernetes handles scheduling and resource allocation.

Configuration is managed through the pod_override parameter or KubernetesPodOperator for full control.

What is the main advantage of KubernetesExecutor over CeleryExecutor for task isolation?
24. What are Pools in Apache Airflow?

Pools are a mechanism to limit the number of concurrently running tasks that use shared resources (e.g., database connections, API rate limits). You define a pool with a maximum number of slots, and assign tasks to it. If all slots are occupied, queued tasks wait.

# Assign a task to a pool
t = PythonOperator(
    task_id='my_task',
    python_callable=my_func,
    pool='limited_db_pool',    # uses one slot from this pool
    pool_slots=1
)

Pools are created in the UI (Admin > Pools) or via the CLI: airflow pools set limited_db_pool 5 'DB connections pool'.

What problem do Airflow Pools solve?
25. What are Airflow Providers?

Providers (formerly contrib) are installable packages that extend Airflow with operators, hooks, sensors, and connections for third-party services. They are published separately from the core Airflow package, so you install only what you need.

Example providers:

  • apache-airflow-providers-amazon — AWS (S3, Redshift, EMR, etc.)
  • apache-airflow-providers-google — GCP (BigQuery, GCS, Dataflow, etc.)
  • apache-airflow-providers-databricks — Databricks jobs
  • apache-airflow-providers-apache-spark — Spark submit and Livy operators

Install via pip: pip install apache-airflow-providers-amazon.

How do you install Airflow provider packages?
26. What is dynamic task mapping in Airflow?

Dynamic task mapping, introduced in Airflow 2.3, allows you to create a variable number of task instances at runtime based on data rather than defining a fixed set of tasks at DAG parse time.

from airflow.decorators import dag, task
from datetime import datetime

@dag(schedule='@daily', start_date=datetime(2024, 1, 1), catchup=False)
def dynamic_example():
    @task
    def process(item: str):
        print(f'Processing {item}')

    # Creates one task instance per item in the list
    process.expand(item=['a', 'b', 'c', 'd'])

dynamic_example()

You can also map over the output of an upstream task with .expand(item=upstream_task()), which makes the parallelism truly data-driven.

Which Airflow version introduced dynamic task mapping?
27. What is the difference between depends_on_past and wait_for_downstream in Airflow?

Both parameters control inter-run dependencies for a task:

  • depends_on_past=True: The task will only run if the same task in the previous DAG Run succeeded. Useful for tasks that process sequential data.
  • wait_for_downstream=True: The task will wait until all immediately downstream tasks from the previous DAG Run have completed. Stricter than depends_on_past.

These are set as operator-level parameters and are applied per task, not per DAG.
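The two checks can be sketched as a plain-Python predicate over the previous DAG Run's states (the `can_run` helper and its arguments are hypothetical, not Airflow internals). Note that in Airflow, `wait_for_downstream=True` requires the previous run's immediately downstream tasks to have succeeded or been skipped:

```python
def can_run(task_id, prev_run_states, downstream_ids,
            depends_on_past=False, wait_for_downstream=False):
    """Sketch: decide whether task_id may start, given a dict of
    {task_id: state} from the PREVIOUS DAG Run (None = first run)."""
    if prev_run_states is None:
        return True  # no previous run: both flags are waived
    if depends_on_past and prev_run_states.get(task_id) != 'success':
        return False
    if wait_for_downstream:
        # previous run's immediate downstream tasks must have finished
        # successfully or been skipped
        ok_states = {'success', 'skipped'}
        if any(prev_run_states.get(d) not in ok_states for d in downstream_ids):
            return False
    return True

prev = {'extract': 'success', 'load': 'running'}
assert can_run('extract', prev, ['load'], depends_on_past=True) is True
assert can_run('extract', prev, ['load'], wait_for_downstream=True) is False
```

The second assertion shows why `wait_for_downstream` is the stricter flag: the task itself succeeded last run, but its downstream `load` is still running, so the new run must wait.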

What does depends_on_past=True enforce for a task?
28. What is the Airflow Web UI and what can you do with it?

The Airflow Web UI (powered by Flask and the FAB — Flask-AppBuilder security framework) is a browser-based dashboard for managing and monitoring pipelines. Key features include:

  • DAG list view — see all DAGs, toggle pause/unpause, check recent run states.
  • Graph view — visualize task dependencies and task-level states for a specific run.
  • Gantt view — examine task duration and parallelism within a DAG Run.
  • Task logs — read stdout/stderr output for any task instance.
  • Trigger DAGs — manually trigger a DAG Run with optional config JSON.
  • Clear tasks — reset failed or successful tasks for re-execution.
  • Admin — manage Connections, Variables, Pools, and user accounts.
Which Airflow UI view helps you analyze task duration and parallelism?
29. What are Airflow task states and what do they mean?

Each Task Instance in Airflow goes through a lifecycle represented by states:

  • none — not yet scheduled.
  • scheduled — dependencies met, waiting for the executor.
  • queued — sent to the executor, waiting for a worker slot.
  • running — currently executing.
  • success — completed without error.
  • failed — terminated with an error.
  • skipped — not run due to branching or a trigger rule.
  • up_for_retry — failed but has retries remaining.
  • up_for_reschedule — sensor released its slot and will poll again.
  • deferred — waiting for an async trigger to fire.
What state does a task enter when it fails but retries are still available?
30. What are retries and retry_delay in Airflow tasks?

Task-level retry parameters control behavior after a task failure:

  • retries — number of times Airflow will retry the task before marking it as failed. Default is 0.
  • retry_delay — how long to wait between retries (a timedelta object).
  • retry_exponential_backoff — if True, doubles the wait time after each retry.
  • max_retry_delay — caps the maximum wait time when exponential backoff is enabled.
from datetime import timedelta

t = PythonOperator(
    task_id='resilient_task',
    python_callable=my_func,
    retries=3,
    retry_delay=timedelta(minutes=5),
    retry_exponential_backoff=True
)
Which parameter causes retry wait times to increase after each failure?
31. What is a Deferrable Operator (async operator) in Airflow?

Deferrable Operators (introduced in Airflow 2.2) allow tasks to suspend themselves, release the worker slot, and wait for an external event via a lightweight Trigger component. This is more efficient than sensors in poke mode because no worker thread is held while waiting.

The flow is:

  1. Task starts, then defers itself by calling self.defer(...).
  2. A Triggerer process monitors the event (file arrival, HTTP response, etc.).
  3. When the event fires, the Triggerer resumes the task on a worker.

Use deferrable operators for long-polling scenarios (e.g., waiting hours for a cloud job to finish) to avoid tying up workers.
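Conceptually, the defer/resume cycle looks like the asyncio sketch below (plain Python, not Airflow's Trigger API; all names are illustrative). Many such awaits share one event loop in the Triggerer, which is why no worker thread is tied up during the wait:

```python
import asyncio

async def trigger(event_ready):
    """Sketch of a Trigger: await the external event without holding a
    worker thread (many such waits can share one event loop)."""
    await event_ready.wait()
    return 'event fired'

async def deferrable_task():
    event = asyncio.Event()
    # simulate the external system completing after 10 ms
    asyncio.get_running_loop().call_later(0.01, event.set)
    payload = await trigger(event)  # "deferred": no worker slot held here
    return f'resumed with: {payload}'

result = asyncio.run(deferrable_task())
assert result == 'resumed with: event fired'
```

In real Airflow, the task serializes its state and exits the worker entirely at the `self.defer(...)` call; the sketch above only models the event-driven resume.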

What Airflow process monitors events for deferrable operators?
32. What are Airflow Plugins?

Plugins are a way to extend Airflow with custom operators, hooks, sensors, macros, UI views, and Flask blueprints without forking the core codebase. You place a Python module in the plugins folder (configured as plugins_folder) and Airflow automatically discovers it at startup.

# plugins/my_plugin.py
from airflow.plugins_manager import AirflowPlugin
from my_operators import MyCustomOperator

class MyPlugin(AirflowPlugin):
    name = 'my_plugin'
    operators = [MyCustomOperator]

In modern Airflow 2.x, many extensions are better distributed as provider packages rather than plugins for easier versioning and reuse.

Where should plugin Python modules be placed for Airflow to auto-discover them?
33. How does Airflow handle templating and macros?

Airflow supports Jinja2 templating inside operator parameters that are listed in template_fields. You can inject runtime values such as execution date, data interval, and custom macros.

Common built-in macros:

  • {{ ds }} — execution date as YYYY-MM-DD.
  • {{ ds_nodash }} — execution date without dashes.
  • {{ data_interval_start }} / {{ data_interval_end }} — start/end of the data interval.
  • {{ var.value.my_var }} — an Airflow Variable value.
  • {{ conn.my_conn_id.host }} — a Connection attribute.
BashOperator(
    task_id='print_date',
    bash_command='echo Processing date: {{ ds }}'
)
Which Jinja2 macro gives you the execution date in YYYY-MM-DD format?
34. What is idempotency in the context of Airflow tasks?

An idempotent task produces the same result regardless of how many times it is executed for the same input. In Airflow, tasks are often retried or re-run (via clear), so designing them to be idempotent prevents duplicate data or inconsistent state.

Techniques to ensure idempotency:

  • Use INSERT ... ON CONFLICT DO NOTHING or upserts instead of plain inserts.
  • Partition output by {{ ds }} and overwrite the partition on each run.
  • Write to intermediate staging tables and swap atomically.
  • Delete existing data for the date range before inserting.
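The delete-before-insert pattern from the last bullet can be sketched in plain Python, using a list of dicts as a stand-in for a table partitioned by `{{ ds }}` (the `load_partition` helper is illustrative, not an Airflow API):

```python
def load_partition(table, ds, rows):
    """Idempotent load: wipe the partition for this ds, then insert.
    Re-running for the same ds always yields the same final state."""
    table[:] = [r for r in table if r['ds'] != ds]  # delete existing partition
    table.extend({'ds': ds, **r} for r in rows)

table = []
load_partition(table, '2024-01-01', [{'value': 1}, {'value': 2}])
load_partition(table, '2024-01-01', [{'value': 1}, {'value': 2}])  # retry/re-run
# The re-run did not duplicate rows:
assert table == [{'ds': '2024-01-01', 'value': 1},
                 {'ds': '2024-01-01', 'value': 2}]
```

A plain append-only load would leave four rows after the re-run; the partition wipe is what makes clearing or retrying the task safe.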
Why is idempotency important for Airflow tasks?
35. What are best practices for writing efficient Airflow DAGs?

Key best practices for production-quality DAGs:

  • Keep DAG files lightweight — avoid heavy imports or database calls at parse time; the scheduler parses DAG files continuously.
  • Use top-level constants only — don't call APIs or read files at module level; do it inside operators/callables.
  • Set catchup=False unless backfilling is intentional.
  • Prefer TaskFlow API for clarity and automatic XCom passing.
  • Use sensors in reschedule mode for long waits.
  • Keep tasks idempotent and atomic.
  • Use Pools to protect downstream systems from overload.
  • Set email_on_failure and SLAs for alerting.
  • Avoid using Variables at top-level — each call hits the DB at parse time.
Why should you avoid calling Airflow Variables at the top level of a DAG file?
36. What is the ExternalTaskSensor in Airflow?

The ExternalTaskSensor waits for a task in another DAG (or the DAG itself) to reach a target state (default: success) before proceeding. This enables cross-DAG dependencies without tight coupling.

from airflow.sensors.external_task import ExternalTaskSensor

wait = ExternalTaskSensor(
    task_id='wait_for_upstream',
    external_dag_id='upstream_dag',
    external_task_id='final_task',
    mode='reschedule',
    timeout=3600
)

The sensor matches runs by execution_date by default. Use execution_date_fn or execution_delta when the upstream DAG runs on a different schedule.

How does ExternalTaskSensor match runs in the external DAG by default?
37. What is the KubernetesPodOperator in Airflow?

The KubernetesPodOperator runs any command inside a Docker container launched as a Kubernetes pod. It is independent of the executor type — even with CeleryExecutor you can use KubernetesPodOperator to run specific tasks in isolated pods.

from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

run_etl = KubernetesPodOperator(
    task_id='run_etl',
    name='etl-pod',
    namespace='airflow',
    image='myrepo/etl:latest',
    cmds=['python', 'etl.py'],
    arguments=['--date', '{{ ds }}'],
    get_logs=True,
    is_delete_operator_pod=True
)

This is useful for tasks that need specific dependencies, GPU access, or custom environments not available in the standard worker image.

Can KubernetesPodOperator be used with a CeleryExecutor deployment?
38. What are SLAs in Apache Airflow and how are they configured?

SLAs (Service Level Agreements) in Airflow allow you to define the maximum time by which a task or DAG should complete after the scheduled execution date. If the deadline is missed, Airflow sends an email alert and logs an SLA miss event.

from datetime import timedelta

with DAG(
    'sla_example',
    sla_miss_callback=my_sla_callback,
    default_args={'sla': timedelta(hours=2)},
    ...
) as dag:
    t = PythonOperator(task_id='task', python_callable=my_func)

SLA misses appear in the Airflow UI under Browse > SLA Misses. Note: SLAs are measured from the scheduled execution date, not the actual start time.

From when is an Airflow SLA miss measured?
39. How does Airflow handle task concurrency and parallelism?

Airflow provides several levels of concurrency control:

  • parallelism (airflow.cfg) — global maximum number of tasks running across all DAGs.
  • dag_concurrency / max_active_tasks (per DAG) — max tasks running within a single DAG at once.
  • max_active_runs (per DAG) — max simultaneous DAG Runs for a single DAG.
  • task_concurrency / max_active_tis_per_dag (per task) — max instances of a specific task running at once across all DAG Runs.
  • Pools — cross-DAG slot limits for shared resources.
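These limits compose: a queued task starts only if every layer has a free slot. A minimal sketch of that accounting (the `effective_slots` helper is illustrative, not Airflow code):

```python
def effective_slots(parallelism, max_active_tasks, pool_open_slots, queued):
    """Sketch: the number of queued tasks that can actually start is
    bounded by every concurrency layer at once."""
    return min(parallelism, max_active_tasks, pool_open_slots, queued)

# 32 global slots, the DAG capped at 16 active tasks, the pool has
# 4 free slots, and 10 tasks are queued -> only 4 may start:
assert effective_slots(32, 16, 4, 10) == 4
```

In practice this is why a pipeline can look "stuck" with free workers: the tightest limit (often a small pool) is the one that governs.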
Which setting limits how many simultaneous DAG Runs can exist for a single DAG?
40. What is an Airflow Dataset and how does data-driven scheduling work?

Datasets (introduced in Airflow 2.4) are logical references to data assets identified by a URI. A DAG can produce a Dataset via an outlet, and another DAG can be scheduled to run automatically when that Dataset is updated.

from airflow.datasets import Dataset
from airflow.decorators import dag, task

my_dataset = Dataset('s3://my-bucket/output/daily.csv')

# Producer DAG
@dag(schedule='@daily', ...)
def producer():
    @task(outlets=[my_dataset])
    def write_data():
        ...  # write to S3

    write_data()

# Consumer DAG — triggered whenever my_dataset is updated
@dag(schedule=[my_dataset], ...)
def consumer():
    ...

This replaces fragile time-based scheduling with event-driven, data-dependency-aware scheduling.

How is a consumer DAG triggered in data-driven scheduling?
41. What is the difference between Airflow and Apache Spark?

Airflow and Spark serve different purposes and are frequently used together:

  • Purpose — Airflow: workflow orchestration (define, schedule, and monitor pipelines). Spark: distributed data processing (transform and analyze large datasets in memory).
  • Data — Airflow does not process data itself; it delegates to operators/hooks. Spark processes terabytes of data in parallel across a cluster.
  • Language — Airflow DAGs are written in Python. Spark supports Scala, Python (PySpark), Java, and R.
  • Execution — Airflow schedules tasks on workers. Spark runs in-memory RDD/DataFrame transformations on executors.

A common pattern: Airflow submits a Spark job via SparkSubmitOperator or LivyOperator, then monitors its completion.

What is the primary role of Apache Airflow compared to Apache Spark?
42. How do you deploy Apache Airflow using Docker Compose?

The official Airflow Docker Compose setup (docker-compose.yaml) is the quickest way to run a production-like local environment with Postgres, Redis, and CeleryExecutor. Steps:

  1. Download the official compose file: curl -LfO 'https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml'
  2. Create required directories: mkdir -p ./dags ./logs ./plugins ./config
  3. Set the Airflow UID: echo -e "AIRFLOW_UID=$(id -u)" > .env
  4. Initialise the database: docker compose up airflow-init
  5. Start all services: docker compose up -d

The UI is available at http://localhost:8080 with user airflow / password airflow.

Which executor does the official Airflow Docker Compose setup use?
43. What is Airflow on Kubernetes (KEDA) autoscaling?

When running CeleryExecutor on Kubernetes, KEDA (Kubernetes-based Event Driven Autoscaler) can automatically scale Celery worker pods based on the number of tasks in the queue. This removes the need for static worker counts and makes the cluster cost-efficient.

How it works:

  1. KEDA is deployed alongside Airflow.
  2. A ScaledObject is configured to watch the Celery broker queue length.
  3. When tasks accumulate, KEDA scales up worker Deployment replicas.
  4. When the queue drains, workers scale back down to zero.
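The scale-up decision in steps 3–4 amounts to "run just enough workers so every queued or running task has a slot." The official Helm chart expresses this as a database query over task instances; the arithmetic it performs looks roughly like the sketch below (the function name and the default worker_concurrency of 16 are illustrative assumptions):

```python
import math

def desired_worker_replicas(queued_or_running_tasks, worker_concurrency=16):
    # KEDA-style target replica count: ceil(tasks / slots-per-worker).
    # An empty queue yields zero replicas, so idle workers scale to zero.
    return math.ceil(queued_or_running_tasks / worker_concurrency)
```

With the default of 16 slots per worker, 1–16 pending tasks keep one worker alive, a 17th task triggers a second replica, and an empty queue scales the workers down to zero.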

Helm chart support: set workers.keda.enabled=true in the official Airflow Helm chart.

What does KEDA monitor to scale Airflow workers?
44. What is the SparkSubmitOperator in Airflow?

The SparkSubmitOperator submits a Spark application to a Spark cluster (Standalone, YARN, or Kubernetes). It wraps the spark-submit command and reads Spark connection details from an Airflow Connection (conn_id='spark_default').

from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

submit_job = SparkSubmitOperator(
    task_id='run_spark_etl',
    application='/opt/spark/jobs/etl.py',
    conn_id='spark_default',
    executor_cores=2,
    executor_memory='4g',
    driver_memory='2g',
    verbose=True
)

Install the provider: pip install apache-airflow-providers-apache-spark.

Which provider package contains the SparkSubmitOperator?
45. What is Managed Airflow (MWAA) on AWS?

Amazon Managed Workflows for Apache Airflow (MWAA) is a fully managed Airflow service on AWS. It handles infrastructure provisioning, scaling, patching, and HA configuration. Key features:

  • DAGs are stored in an S3 bucket and auto-synced to the environment.
  • Supports CeleryExecutor with auto-scaling workers.
  • Integrates natively with IAM for access control and with AWS Secrets Manager for storing connections and variables.
  • Logs are shipped to CloudWatch.
  • Runs inside a VPC for network isolation.

It's useful for teams that want to run Airflow without managing the underlying infrastructure.

Where are DAG files stored in an AWS MWAA environment?
46. How does Airflow handle secrets management?

Airflow supports a pluggable Secrets Backend to retrieve Connections and Variables from external secret stores rather than the metadata DB. Supported backends include AWS Secrets Manager, HashiCorp Vault, GCP Secret Manager, and Azure Key Vault. Configure via secrets.backend in airflow.cfg. Example:

[secrets]
backend = airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend
backend_kwargs = {"connections_prefix": "airflow/connections", "variables_prefix": "airflow/variables"}
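With that configuration, the backend resolves a Connection or Variable by joining the configured prefix and the requested name into a single lookup key — so conn_id my_db is fetched from the secret airflow/connections/my_db. A minimal sketch of that key construction (the function name is hypothetical; the exact separator is configurable in the real backend):

```python
def secrets_manager_key(prefix, name, sep="/"):
    # e.g. prefix 'airflow/connections' + conn_id 'my_db'
    #   -> secret name 'airflow/connections/my_db'
    return f"{prefix}{sep}{name}"
```

For example, `secrets_manager_key("airflow/variables", "api_key")` yields the secret name "airflow/variables/api_key".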
Which configuration section controls the Secrets Backend in Airflow?
47. What is the difference between PythonOperator and PythonVirtualenvOperator?

Both operators run Python callables, but differ in execution environment:

  • PythonOperator — runs the callable in the same Python environment as the Airflow worker. All packages available to the worker are accessible.
  • PythonVirtualenvOperator — creates a temporary virtual environment with specified requirements, runs the callable inside it, then destroys the venv. Useful when a task needs package versions that conflict with the worker environment.
from airflow.operators.python import PythonVirtualenvOperator

def my_func():
    # Imports must live inside the callable: it runs in the temporary venv,
    # not in the worker's environment.
    import pandas as pd
    print(pd.__version__)

run = PythonVirtualenvOperator(
    task_id='isolated_task',
    python_callable=my_func,
    requirements=['pandas==1.5.0', 'numpy'],
    system_site_packages=False
)
When should you prefer PythonVirtualenvOperator over PythonOperator?
48. What is the Grid view in Airflow 2.x?

The Grid view (introduced in Airflow 2.3 as a replacement for the Tree view) shows a matrix with DAG Runs on the X-axis and tasks on the Y-axis. Each cell represents a task instance, colored by state (green for success, red for failed, and so on).

Key interactions:

  • Click a cell to inspect that task instance's logs, duration, and XCom output.
  • Filter by run type (scheduled, manual, backfill).
  • Group tasks by TaskGroup using the expand/collapse controls.

The Grid view is the primary view for diagnosing patterns of failures across multiple DAG Runs.

What replaced the Tree view in Airflow 2.3?
49. What are common Airflow anti-patterns to avoid?

Common pitfalls that hurt reliability and performance:

  • Top-level DB calls — calling Variable.get() or Connection.get_connection_from_secrets() at parse time stresses the scheduler.
  • Non-idempotent tasks — retries create duplicate data.
  • Giant XCom payloads — overloads the metadata DB.
  • Too many small tasks — scheduling overhead grows linearly; batch micro-tasks where possible.
  • Dynamic DAG generation at parse time — if the generation is slow (API calls, DB queries), the scheduler lags.
  • Using SubDAGs — causes deadlocks; use TaskGroups instead.
  • Sensors in poke mode for long waits — blocks worker slots; use reschedule mode.
  • Hardcoded credentials — always use Connections or secrets backend.
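The non-idempotency pitfall above is usually fixed by keying every write to the run's logical date, so a retry overwrites the same partition instead of appending duplicates. A minimal sketch of that pattern (the helper name and path layout are illustrative):

```python
def run_output_path(base, logical_date):
    # Deterministic path keyed by the run's logical date: a retry of the
    # same run rewrites the same partition rather than duplicating data.
    return f"{base}/dt={logical_date}/part-00000.parquet"
```

Because the path depends only on the run, `run_output_path("s3://bucket/out", "2024-01-01")` always resolves to the same location no matter how many times the task is retried.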
Why should you avoid using SubDAGs in Airflow?
50. What is Airflow 2 vs Airflow 1 — key differences?

Airflow 2.0 (released December 2020) was a major overhaul. Key changes:

Feature           | Airflow 1.x                       | Airflow 2.x
Scheduler HA      | Single scheduler only             | Multiple schedulers supported (HA)
TaskFlow API      | Not available                     | @dag / @task decorators
Provider packages | All in monolithic airflow package | Separate installable provider packages
REST API          | Experimental                      | Stable, versioned REST API
UI                | Flask-Admin                       | Flask-AppBuilder with RBAC
DAG serialization | Optional                          | Enabled by default (faster scheduler)
What major scheduler improvement was introduced in Airflow 2.0?