Data Engineering – Airflow, DBT, Spark and PySpark Training

10 Weeks

Course Overview


This 40-hour instructor-led program empowers data engineers and analytics professionals to master modern data pipeline tools. Participants learn to orchestrate workflows with Apache Airflow, an open-source platform for scheduling and monitoring complex data pipelines. They will also use dbt (Data Build Tool) to simplify and accelerate data transformations in SQL, and build Spark and PySpark applications for large-scale distributed processing. The training combines conceptual lectures with extensive hands-on practice so learners can apply their skills in real-world scenarios.

Learners should have basic Python and SQL knowledge. The course is ideal for data engineers, analytics engineers, ETL developers, and data analysts working in industries such as finance, healthcare, and retail. (In healthcare, for example, data engineering is increasingly critical for improving patient outcomes and optimizing operations.) By the end of the training, participants will be able to design end-to-end data pipelines, automate workflows, and transform and analyze big data sets.

Skills and Outcomes: Upon completion, participants will:

  • Orchestrate Data Pipelines: Design, schedule, and monitor workflows in Apache Airflow (DAG creation, task dependencies, triggers).
  • Model and Transform Data: Build dbt models to transform and test data in a SQL-based warehouse, using best practices for sources, staging, and documentation.
  • Process Big Data: Develop Apache Spark and PySpark applications (using RDDs, DataFrames, Spark SQL, MLlib) to process large-scale batch and streaming data.
  • Integrate Tools: Orchestrate Spark jobs and dbt workflows within Airflow pipelines, enabling automated ELT workflows.
  • Hands-On Experience: Build a portfolio of real-world projects (e.g. ETL pipelines and streaming applications) to demonstrate skills.
  • Certification Readiness: Gain practical knowledge aligned with industry certifications (Microsoft, AWS, Snowflake, etc.) to prepare for professional data engineering exams.

Course Syllabus

    Module 1: Data Engineering Foundations (2 hours)

    This introductory module covers the core concepts of data engineering and pipeline architecture. Learners explore ETL/ELT workflows, modern data warehouse vs data lake designs, and the role of data engineers. The session includes a simple hands-on project (e.g. ingesting public CSV data into a SQL database) to reinforce extract-transform-load concepts. This project-based approach helps build proficiency with Python/SQL for data processing.
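    The extract-transform-load flow described above can be sketched in a few lines of standard-library Python. This is an illustrative example only: the column names, the "drop rows with missing amounts" rule, and the in-memory SQLite database are assumptions, not course materials.

```python
# A minimal ETL sketch: read CSV rows, clean them, load them into SQLite.
# The dataset and cleaning rule are made up for illustration.
import csv
import io
import sqlite3

raw_csv = io.StringIO(
    "order_id,amount\n"
    "1,19.99\n"
    "2,5.00\n"
    "3,\n"  # missing amount: dropped in the transform step
)

# Extract
rows = list(csv.DictReader(raw_csv))

# Transform: drop rows with missing amounts, cast types
clean = [(int(r["order_id"]), float(r["amount"])) for r in rows if r["amount"]]

# Load
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", clean)

total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(round(total, 2))  # 24.99
```

    The same extract/transform/load split scales up directly once the CSV source and SQLite target are swapped for real systems.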

    Module 2: Apache Airflow Basics (4 hours)

    Participants dive into Apache Airflow fundamentals. Topics include DAGs (Directed Acyclic Graphs), tasks/operators, scheduling vs. triggering, and the Airflow UI. The instructor will demonstrate how to set up a local Airflow environment and create a first DAG. Learners get hands-on experience by building an example workflow (e.g. a data ingestion pipeline) to practice scheduling and monitoring tasks.

    • Topics: Airflow architecture, scheduler, metadata database, basic operators and sensors.
    • Hands-on: Build and run a simple Airflow DAG that orchestrates Python tasks or data transfers.

    Module 3: Advanced Airflow and Pipeline Management (3 hours)

    This module covers advanced Airflow features for production pipelines. Topics include DAG lifecycle (runs, retries), various executors (Celery, Kubernetes), XCom for data exchange between tasks, variables/connections, and the TaskFlow API for Pythonic DAG authoring. Security, logging, and monitoring best practices are introduced.

    • Topics: Executors and parallelism; Operators/Sensors/Hooks (built-in and custom); TaskFlow functional APIs.
    • Hands-on: Extend the earlier DAG with dynamic task mapping or parameterized tasks, and configure Airflow connections for external databases.

    Module 4: dbt Fundamentals (5 hours)

    Learners explore dbt (Data Build Tool) fundamentals, focusing on transforming data within a warehouse. The module covers setting up dbt projects, writing SQL models, and version control integration. Key concepts include sources, models, tests, and documentation. Students will run dbt commands to build a layered analytics data model.

    • Topics: dbt project structure; Jinja templating; incremental models; schema tests and documentation.
    • Hands-on: Create dbt models for a simple sales or logs dataset: connect dbt to a database, write transformations, execute tests, and generate documentation.
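    dbt models are SQL files templated with Jinja. As a rough illustration of the idea (not dbt's actual implementation), the sketch below renders a model with the jinja2 library and a hypothetical `ref` resolver; dbt's real `ref()` also records model dependencies, which is not reproduced here.

```python
# Sketch of how dbt compiles a Jinja-templated model into plain SQL.
# The ref() resolver and schema name are hypothetical stand-ins.
from jinja2 import Environment

model_sql = """
select order_id, sum(amount) as total
from {{ ref('stg_orders') }}
group by order_id
"""

def ref(model_name: str) -> str:
    # hypothetical resolver: map a model name to a schema-qualified table
    return f"analytics.{model_name}"

compiled = Environment().from_string(model_sql).render(ref=ref)
print(compiled)
```

    After rendering, the `{{ ref('stg_orders') }}` call is replaced by a concrete table name, which is the SQL dbt actually runs against the warehouse.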

    Module 5: Advanced dbt – Testing & Deployment (3 hours)

    This advanced dbt session teaches best practices for production pipelines. Topics include advanced Jinja macros, snapshots, and continuous integration/deployment workflows. Learners will practice creating source freshness checks and deploying dbt jobs in a CI pipeline. Integration with Airflow is demonstrated (e.g. triggering dbt runs from a DAG).

    • Topics: dbt tests and hooks; modular project design; using dbt Cloud or GitHub Actions for CI/CD.
    • Hands-on: Configure a dbt project to run in development and production modes, including automated tests on data quality.

    Module 6: Apache Spark Fundamentals (4 hours)

    Introduction to Apache Spark for large-scale data processing. Participants learn Spark’s architecture (driver/executor, cluster mode) and work with Spark RDDs and DataFrames. Spark SQL and the Dataset API are covered for structured data processing. Emphasis is on writing efficient transformations and actions in Spark.

    • Topics: Spark core concepts; RDD operations; Spark SQL/DataFrames; key/value RDDs; data ingestion formats (CSV, Parquet); Spark’s lazy evaluation.
    • Hands-on: Run Spark jobs that load a large dataset (e.g. logs or retail transactions), perform transformations (filtering, aggregations), and inspect results.

    Module 7: PySpark Programming (4 hours)

    This module focuses on PySpark, the Python API for Apache Spark. Learners write Spark applications in Python, working with DataFrames and Spark SQL from PySpark. Advanced topics include broadcast/join optimizations and user-defined functions (UDFs).

    • Topics: SparkSession setup; DataFrame vs RDD in Python; filtering, grouping; joining data; basic MLlib usage (e.g. linear regression).
    • Hands-on: Develop a PySpark pipeline to process a dataset (e.g. compute summary statistics on a large CSV or JSON file) and optionally train a simple ML model using Spark MLlib.

    Module 8: Spark Streaming & Machine Learning (3 hours)

    Building on Spark fundamentals, this module covers streaming data and ML in Spark. Topics include Spark Structured Streaming for real-time data pipelines, and an overview of Spark MLlib. Learners will see how to set up streaming sources (e.g. Kafka or file streams) and write continuous processing queries.

    • Topics: Spark Structured Streaming API; event time vs processing time; window operations; MLlib pipeline components (pipelines, transformers, estimators).
    • Hands-on: Create a Spark streaming job to process micro-batches of data (e.g. reading new logs every few seconds) and perform a simple streaming aggregation or prediction.

    Module 9: Orchestrating Pipelines with Airflow (3 hours)

    This capstone integration module shows how to orchestrate Spark and dbt tasks in Airflow. Learners will create Airflow DAGs that submit Spark jobs (via SparkSubmitOperator or Livy) and run dbt transformations. The concept of data lineage is introduced (e.g. using OpenLineage with Airflow, dbt, and Spark).

    • Topics: Airflow operators for Spark (YARN, Kubernetes) and Bash; triggering external jobs; managing dependencies between dbt and Spark tasks.
    • Hands-on: Develop an end-to-end Airflow workflow that ingests raw data, transforms it with dbt, and processes it with Spark, demonstrating a complete ETL pipeline.

    Module 10: Real-World Data Engineering Project (5 hours)

    In this hands-on capstone, participants apply all learned tools to a comprehensive data pipeline use case. Possible scenarios include processing financial transactions for fraud detection, analyzing healthcare records for insights, or aggregating retail sales data for analytics. Each team designs and implements an ETL/ELT pipeline using Airflow, dbt, and Spark/PySpark, deploying it end-to-end. Instructors provide guidance, and participants present their solutions.

    • Skills Applied: Workflow scheduling with Airflow, transformations with dbt, batch/stream processing with Spark, and data storage (SQL/NoSQL).
    • Outcome: A working pipeline demo, plus discussion on best practices and optimizations in real data engineering environments.

     

Key Features

  Live Online Classes: Interactive instructor-led sessions ensure real-time Q&A and discussions. Core modules are taught by industry experts.

  Hands-On Labs & Projects: Each module includes practical labs. In total, learners complete 10+ hands-on exercises and capstone projects, building real data pipelines.

  Industry Use Cases: Examples and exercises drawn from finance, healthcare, and retail illustrate how data pipelines solve real business problems.

  Certification Prep: Content aligns with major certifications; students receive guidance on exam topics and best practices.

  Ongoing Support: Access to lab environments, reference materials, and recorded sessions for post-training review.

 

Our Upcoming Batches

At Topskill.ai, we understand that today’s professionals navigate demanding schedules. To support your continuous learning, we offer fully flexible session timings across all our training programs.

Below is the schedule for this training. If these time slots don’t align with your availability, simply let us know—we’ll be happy to design a customized timetable that works for you.

Training Timetable

All batches run as Virtual Online sessions, each with three available start dates: Aug 28, 2025; Sept 4, 2025; and Sept 11, 2025.

Weekday batches (Mon-Fri, each class 1 to 1.5 hours):
  • 7:00 AM IST (View Fees)
  • 11:00 AM IST (View Fees)
  • 5:00 PM IST (View Fees)
  • 7:00 PM IST (View Fees)

Weekend batches (Sat-Sun, each class 3 hours):
  • 7:00 AM IST (View Fees)
  • 10:00 AM IST (View Fees)
  • 11:00 AM IST (View Fees)

For any adjustments or bespoke scheduling requests, reach out to our admissions team at
support@topskill.ai or call +91-8431222743.
We’re committed to ensuring your training fits seamlessly into your professional life.

Note: Clicking “View Fees” will direct you to detailed fee structures, instalment options, and available discounts.

Don’t see a batch that fits your schedule? Click here to Request a Batch to design a bespoke training timetable.


Corporate Training

“Looking to give your employees the experience of the latest trending technologies? We’re here to make it happen!”
