The Databricks Certified Data Engineer Associate is not a theory exam you can read your way through. Its own exam guide describes it as assessing whether you can “complete introductory data engineering tasks” on the Databricks Data Intelligence Platform, and the way it tests that is by putting code in front of you and asking which version actually works. Data-manipulation code is shown in SQL where possible and in Python (PySpark) otherwise, so you must be able to read both. The single most effective thing you can do is keep a free workspace open and perform every task this guide describes, because the questions reward people who have built the thing, not people who have only seen it described. This guide is a full self-study course built around the five sections of the official exam outline. It is original teaching material with no real or simulated exam questions, and you should always confirm the current outline on the Databricks certification page before you book, since Databricks revises both the exam and its product names over time.
Chapter 1: Exam overview and how to use this guide
What the exam measures and how it is shaped
The exam is 45 scored multiple-choice questions in 90 minutes, delivered online or at a test centre with a proctor, and it carries a few additional unscored items that do not affect your result. Databricks does not publish a passing score, a per-section percentage weighting, or an official exam code, so treat any third-party site quoting a fixed pass mark or a code like “DEA-C01” with suspicion. The practical consequence of unpublished weights is simple: you cannot safely skip a section, so plan to be competent across all five. The credential is valid for two years and you renew it by retaking the current exam.
The questions are drawn from five sections, and this course follows them in order: the Databricks Intelligence Platform, Development and Ingestion, Data Processing and Transformations, Productionizing Data Pipelines, and Data Governance and Quality. Databricks recommends roughly six months of hands-on experience before sitting it, and although there is no formal prerequisite, that recommendation is realistic: the exam phrasing assumes you have used the workspace.
A note on the platform’s vocabulary
Databricks renames things, and the current exam outline reflects the Lakeflow family: Lakeflow Connect for ingestion, Lakeflow Jobs for orchestration, and Lakeflow Declarative Pipelines for declarative ETL (this last one was previously called Delta Live Tables, and you will still see “DLT” everywhere online). Throughout this guide the current names are used, with the old name noted once so you recognise both. Getting the vocabulary straight is not pedantry: a question may hinge on knowing that a Job orchestrates tasks while a Declarative Pipeline maintains tables.
How to use this course
Read the chapters in order at least once, because the platform chapter builds the mental model that the ingestion, transformation, pipeline, and governance chapters then fill in. Treat the key terms in each chapter as a checklist you should be able to explain in a sentence and, more importantly, perform in a notebook. The final chapters turn the content into a schedule, a final-week routine, and a description of exam day. Where a concept is easy to misread, a short worked illustration appears, but none of these are exam questions: they exist to make the idea concrete.
Chapter 2: The Databricks Intelligence Platform
This first section is the conceptual map the rest of the exam hangs on. It covers what the Data Intelligence Platform is, how compute works, and the features that simplify data layout and speed up queries. Get it straight early and everything later becomes easier to reason about.
The lakehouse idea and why it matters
The platform is a lakehouse: it puts the reliability and structure of a data warehouse directly on top of cheap, open files in cloud storage, rather than copying data into a separate proprietary warehouse. The mechanism that makes this possible is Delta Lake, the default open table format, which adds ACID transactions, schema enforcement, and time travel to files in object storage. The reason this matters for the exam is that almost every later topic, from ingestion to governance, assumes your tables are Delta tables, so the guarantees Delta provides (a write either fully commits or does not, the schema is enforced, and you can query an earlier version) are the foundation, not a detail.
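To make the time-travel guarantee concrete, here is a minimal sketch that builds the two SQL statements you would typically run against a Delta table: one to list its committed versions and one to query an earlier version. The table name main.demo.orders is hypothetical, and the helper functions are illustrative names, not part of any Databricks library.

```python
def history_sql(table: str) -> str:
    """Build the statement that lists the committed versions of a Delta table."""
    return f"DESCRIBE HISTORY {table}"


def version_as_of_sql(table: str, version: int) -> str:
    """Build a query that reads the table as it was at an earlier version."""
    if version < 0:
        raise ValueError("Delta table versions start at 0")
    return f"SELECT * FROM {table} VERSION AS OF {version}"


# Hypothetical table name, used only for illustration.
TABLE = "main.demo.orders"

print(history_sql(TABLE))
print(version_as_of_sql(TABLE, 0))
```

In a notebook, where a spark session is already defined, you could pass either string to spark.sql(...) and compare the result of version 0 with the current table.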
Choosing the right compute for a task
A recurring exam objective is to identify the applicable compute to use for a specific use case, so you need a clear picture of the compute options. An all-purpose cluster is interactive compute you attach a notebook to for exploration and development; a job cluster is created for a scheduled job and torn down when it finishes, which is cheaper for production runs; and serverless compute is a hands-off, automatically optimised option that Databricks manages for you, removing cluster sizing decisions. The judgement the exam wants is matching these to a situation: interactive development points to all-purpose, a scheduled production run points to a job cluster or serverless, and a request to avoid managing infrastructure points to serverless. Databricks SQL warehouses are the compute behind SQL queries and dashboards, distinct from the clusters that run notebooks.
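The matching logic in the paragraph above can be written down as a small decision function. This is only a study aid that encodes the rules of thumb from this chapter; the function name and its flags are invented for illustration and do not correspond to any Databricks API.

```python
def choose_compute(interactive=False, scheduled=False,
                   avoid_infrastructure=False, sql_analytics=False):
    """Map a scenario description to the compute type this chapter suggests."""
    if sql_analytics:
        # SQL queries and dashboards run on SQL warehouses, not notebook clusters.
        return "SQL warehouse"
    if avoid_infrastructure:
        # "Do not make me size or manage clusters" points to serverless.
        return "serverless"
    if scheduled:
        # A job cluster is created for the run and torn down afterwards.
        # Serverless would also be a defensible answer for a scheduled run.
        return "job cluster"
    if interactive:
        return "all-purpose cluster"
    raise ValueError("describe the workload before choosing compute")


print(choose_compute(interactive=True))
print(choose_compute(scheduled=True))
print(choose_compute(scheduled=True, avoid_infrastructure=True))
```

The order of the checks is a design choice: a stated wish to avoid managing infrastructure is treated as the strongest signal, which mirrors how scenario questions tend to be worded.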
Features that optimise data layout
The exam asks you to enable features that simplify data layout decisions and optimize query performance, which is Databricks’ way of pointing at the automatic optimisations Delta tables can use. The instinct to build is that you do not always hand-tune file sizes and partitioning yourself: features like automatic file compaction and data-layout optimisation reduce the small-file problems that slow queries, and predictive optimisation can manage this maintenance for you. You do not need to be an expert tuner at the associate level, but you should recognise that the platform offers managed ways to keep tables fast rather than requiring manual intervention for everything.
The pitfall to avoid here
The trap is treating this section as throwaway “marketing” content and rushing to the code. Do not. If you cannot explain in one sentence why a lakehouse table is reliable (Delta’s ACID guarantees) or which compute you would pick for a nightly job (a job cluster or serverless, not a long-running all-purpose cluster), the later sections will feel like disconnected facts instead of a coherent system.
Chapter 3: Development and Ingestion
This section is about getting data into the platform and doing the early development work, and it is where a large share of hands-on questions live. The official objectives here cover using Databricks Connect, the capabilities of notebooks, valid Auto Loader sources, Auto Loader syntax, and using the built-in debugging tools.
Notebooks and Databricks Connect
Notebooks are the primary development surface: cells of SQL or Python, attached to compute, with results inline. You should know their everyday capabilities, including running cells in order, mixing languages with magic commands (a %sql or %python cell), and parameterising work. Databricks Connect is the complementary idea for developers who want to write code in their own IDE and run it against Databricks compute, rather than in the browser notebook. The exam wants you to recognise where each fits in a data-engineering workflow: notebooks for interactive, in-platform development, and Databricks Connect for local IDE development executed remotely.
Ingesting files incrementally with Auto Loader
Auto Loader is the headline ingestion tool, and the exam tests it specifically: which sources are valid and the basics of its syntax. Auto Loader incrementally and efficiently processes new files as they arrive in cloud storage, using the cloudFiles source. Its defining property, and the reason it is preferred over re-reading a whole directory, is that it tracks which files it has already ingested, so each run picks up only new data without you writing bookkeeping logic. To build the right intuition, set up an Auto Loader stream that reads a folder, then drop a new file in and watch only that file get processed. You should be comfortable reading a spark.readStream.format("cloudFiles") configuration and knowing that the cloudFiles.format option declares the incoming file type (for example CSV or JSON) and that the cloudFiles.schemaLocation option gives Auto Loader a place to track and evolve the schema.
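A minimal sketch of such a stream follows. It assumes it runs on Databricks, where the cloudFiles source is available, and it takes the spark session as a parameter so the function itself stays self-contained. The function name, the paths, and the target table are hypothetical and supplied by the caller.

```python
def ingest_new_files(spark, source_dir, schema_dir, checkpoint_dir,
                     target_table, file_format="json"):
    """Incrementally load new files from source_dir into a Delta table."""
    stream = (
        spark.readStream.format("cloudFiles")           # the Auto Loader source
        .option("cloudFiles.format", file_format)        # incoming file type
        .option("cloudFiles.schemaLocation", schema_dir) # where the schema is tracked
        .load(source_dir)
    )
    return (
        stream.writeStream
        .option("checkpointLocation", checkpoint_dir)    # remembers ingested files
        .trigger(availableNow=True)                      # process the backlog, then stop
        .toTable(target_table)
    )
```

Running it twice against the same folder is the useful experiment: the checkpoint means the second run should only pick up files that arrived after the first. The availableNow trigger is one choice among several; a continuously running stream would omit it.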
Reading, writing, and the built-in debugger
Beyond Auto Loader you should handle basic reads and writes: reading a source into a DataFrame or table, and writing out as a Delta table. The exam also lists using Databricks’ built-in debugging tools to troubleshoot a given issue, so practise reading an error in a notebook and using the Spark UI and run logs to find the cause, rather than guessing. The skill being tested is diagnosis: when a cell fails, you should know where the platform surfaces the reason.
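For the basic read and write, a sketch along these lines is enough to practise with. It assumes a CSV source with a header row and takes the spark session as a parameter; the function name, path, and table name are hypothetical.

```python
def csv_to_delta(spark, source_path, target_table):
    """Read a CSV file into a DataFrame and save it as a Delta table."""
    df = (
        spark.read.format("csv")
        .option("header", "true")       # first row holds the column names
        .option("inferSchema", "true")  # let Spark guess column types
        .load(source_path)
    )
    # "overwrite" replaces the table contents; "append" would add to them.
    df.write.format("delta").mode("overwrite").saveAsTable(target_table)
    return df
```

A good way to exercise the debugging tools is to call this with a path that does not exist and read the resulting error from the top down before changing anything.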
How to study this section
Do not read about ingestion, perform it. Take a public dataset, land it in storage, and ingest it with Auto Loader into a first table. Deliberately break something (a wrong path, a type mismatch) and use the error output and Spark UI to fix it. The questions in this section reward muscle memory with cloudFiles and with basic read/write code, which you only get by typing it.
Chapter 4: Data Processing and Transformations
This section is the analytical core: transforming data through the medallion layers, choosing cluster configuration, using Lakeflow Declarative Pipelines for ETL, knowing DDL and DML features, and computing aggregations with PySpark DataFrames.
The medallion architecture
The medallion architecture is the organising pattern for the whole section, and the exam asks you to describe its three layers and the purpose of each. Bronze holds raw ingested data, close to the source and largely untransformed, so you always have the original. Silver holds cleaned, conformed, deduplicated data, joined and validated into a usable shape. Gold holds business-level aggregates and the tables that feed reports and dashboards. The purpose of the layering is that each stage has a single, clear job, which makes pipelines easier to reason about and to debug. When a scenario describes “raw events landing from a source,” that is bronze; “a curated, deduplicated customer table” is silver; “daily revenue by region for a dashboard” is gold.
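The three layers can be illustrated without Spark at all. The toy example below uses plain Python lists in place of tables, with invented order data, purely to show what each layer is responsible for; a real pipeline would express the same steps on Delta tables.

```python
from collections import defaultdict

# Bronze: raw records exactly as they landed, including a duplicate
# and a row with a missing key.
bronze = [
    {"order_id": "A1", "region": "emea", "amount": "10.50"},
    {"order_id": "A1", "region": "emea", "amount": "10.50"},
    {"order_id": "A2", "region": "APAC", "amount": "4.00"},
    {"order_id": None, "region": "emea", "amount": "9.99"},
]


def to_silver(rows):
    """Drop rows without a key, deduplicate on order_id, and fix types."""
    seen, cleaned = set(), []
    for row in rows:
        key = row["order_id"]
        if key is None or key in seen:
            continue
        seen.add(key)
        cleaned.append({
            "order_id": key,
            "region": row["region"].upper(),  # conform inconsistent casing
            "amount": float(row["amount"]),   # cast text to a number
        })
    return cleaned


def to_gold(rows):
    """Aggregate cleaned rows into revenue per region for reporting."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(len(bronze), len(silver), gold)
```

Note that bronze is never modified: the cleaning happens on the way to silver, so the original records remain available if a rule turns out to be wrong.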
Choosing cluster type and configuration
A listed objective is to classify the type of cluster and configuration for optimal performance based on the scenario. This connects back to the compute discussion in Chapter 2 but applies it to processing work: a heavy one-off transformation, a scheduled production transformation, and an interactive exploration each suggest different compute. The exam wants you to reason from the scenario to a sensible choice rather than reciting cluster specs.
Lakeflow Declarative Pipelines for ETL
Lakeflow Declarative Pipelines (LDP), formerly Delta Live Tables, is the declarative framework for building ETL, and the exam objectives explicitly cover both its advantages and implementing pipelines with it. The defining idea is declarative: you define the tables you want and the transformations that produce them, and the framework works out the dependency order, manages execution, and can apply data-quality rules, instead of you hand-coding the orchestration. The advantages worth being able to state are automatic dependency management, built-in data-quality enforcement through expectations, and simplified maintenance of the resulting tables. To learn it properly, build a small pipeline that defines a bronze, a silver, and a gold table and let the framework resolve the order; then add a data-quality expectation and see how it handles records that violate it.
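The declarative idea can be modelled in a few lines of standard-library Python. This is a toy model, not the pipeline API: it only shows how an execution order can be derived from declared dependencies, and how an expectation that drops bad records might behave. In a real pipeline the Python interface uses decorators such as @dlt.table and @dlt.expect_or_drop, which only run inside the pipeline runtime. The table names and the helper below are invented for illustration.

```python
from graphlib import TopologicalSorter

# Each table declares only what it reads from. No run order is written down.
declared = {
    "gold_daily_revenue": {"silver_orders"},
    "silver_orders": {"bronze_orders"},
    "bronze_orders": set(),
}

# The "framework" derives the order from the declarations.
run_order = list(TopologicalSorter(declared).static_order())


def apply_expectation(rows, predicate):
    """Keep rows that satisfy the rule and count the ones that do not."""
    kept = [row for row in rows if predicate(row)]
    return kept, len(rows) - len(kept)


rows = [{"order_id": "A1"}, {"order_id": None}]
kept, dropped = apply_expectation(rows, lambda r: r["order_id"] is not None)

print(run_order)
print(kept, dropped)
```

The point to take away is that the declarations are listed here in reverse order and the derived run order is still correct, which is exactly the bookkeeping the framework removes from your code.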
DDL, DML, and PySpark aggregations
You should identify DDL and DML features: DDL (Data Definition Language) defines and changes table structure, for example CREATE TABLE or ALTER TABLE; DML (Data Manipulation Language) changes the data, for example INSERT, UPDATE, DELETE, and the very useful MERGE INTO for upserts. Finally, the exam asks you to compute complex aggregations and metrics with PySpark DataFrames, so be fluent with groupBy(...).agg(...) and the difference between aggregate functions. A distinction worth knowing, which the official sample illustrates: to count unique invoices you use a distinct count of the identifier, not a sum of it and not a plain count that includes duplicates. The teaching point is that choosing the correct aggregate function for the question asked (a total versus a count versus a distinct count) is exactly the kind of judgement these items test.
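The difference between the aggregates is easiest to see on a handful of rows. The sketch below uses SQLite from the Python standard library as a stand-in engine so it runs anywhere; the invoice data is invented. On Databricks the same SUM, COUNT, and COUNT(DISTINCT ...) expressions apply in Spark SQL, and the PySpark counterpart of the distinct count is countDistinct from pyspark.sql.functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the table structure.
conn.execute("CREATE TABLE invoice_lines (invoice_id INTEGER, amount REAL)")

# DML: change the data. Invoices 101 and 103 each have two lines.
conn.executemany(
    "INSERT INTO invoice_lines VALUES (?, ?)",
    [(101, 20.0), (101, 5.0), (102, 12.5), (103, 7.5), (103, 2.5)],
)

total_amount, line_count, unique_invoices, id_sum = conn.execute(
    """
    SELECT SUM(amount),                 -- a total
           COUNT(invoice_id),           -- a count that includes duplicates
           COUNT(DISTINCT invoice_id),  -- the number of unique invoices
           SUM(invoice_id)              -- meaningless: adds up identifiers
    FROM invoice_lines
    """
).fetchone()
conn.close()

print(total_amount, line_count, unique_invoices, id_sum)
```

Only the distinct count answers "how many invoices are there?". The plain count answers "how many lines?", and the sum of identifiers answers nothing useful, which is why it makes a tempting wrong option.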
The pitfall to avoid here
The common mistake is conflating Lakeflow Declarative Pipelines with the orchestration tool covered in the next chapter. Hold the line: LDP declares and maintains tables and their dependencies; Jobs schedule and orchestrate tasks. If you can say which one you reach for and why, you have neutralised one of the section’s reliable traps.
Chapter 5: Productionizing Data Pipelines
Once data flows through the medallion layers, the exam cares about putting that work into production reliably. This section centres on Databricks Asset Bundles, deploying and repairing workloads, serverless compute, and reading the Spark UI.
Databricks Asset Bundles and deployment
A current objective is to identify the difference between DAB and traditional deployment methods and to identify the structure of Asset Bundles. A Databricks Asset Bundle (DAB) packages your project (notebooks, pipeline and job definitions, and configuration) as code in a structured project, so you can version it and deploy it consistently across environments, rather than wiring jobs up by hand in the UI. The contrast the exam draws is between this code-defined, repeatable deployment and the manual, click-through approach. You should recognise the shape of a bundle: a configuration file that declares the resources (such as jobs and pipelines) and the targets (such as development and production) it deploys to.
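The shape of a bundle is easier to remember once you have seen it laid out. A real bundle is configured in a YAML file named databricks.yml; the sketch below mirrors that structure as a Python dictionary only to stay in one language. The project name, job name, task key, and notebook path are hypothetical.

```python
# Mirrors the top-level layout of a databricks.yml file as a Python dict.
bundle_config = {
    "bundle": {"name": "sales_pipeline"},  # hypothetical project name
    "resources": {
        "jobs": {
            "nightly_refresh": {           # hypothetical job key
                "name": "nightly_refresh",
                "tasks": [
                    {
                        "task_key": "ingest",
                        "notebook_task": {"notebook_path": "./src/ingest.py"},
                    }
                ],
            }
        }
    },
    "targets": {
        "dev": {"mode": "development", "default": True},
        "prod": {"mode": "production"},
    },
}


def summarise(config):
    """Return the declared resource types and target names of a bundle."""
    return {
        "resources": sorted(config["resources"]),
        "targets": sorted(config["targets"]),
    }


print(summarise(bundle_config))
```

With the real YAML file in place, the Databricks CLI commands databricks bundle validate and databricks bundle deploy -t dev check and deploy the same definition to a chosen target, which is the repeatability the manual UI approach lacks.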
Deploying, repairing, and rerunning
The exam lists “deploy a workflow, repair, and rerun a task in case of failure” as an objective, which is the operational heart of the section. A workflow (a Job) runs a sequence of tasks; when one task fails partway through, you do not necessarily rerun everything. The skill being tested is knowing that you can repair a run to rerun only the failed task and the tasks that depend on it, reusing the successful work already done. Practise this: build a multi-task job, make one task fail, then repair and rerun it so you have felt how a partial failure is recovered.
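The effect of a repair can be reasoned about with a small simulation. This is a model of the idea, not the Jobs API: given each task's upstream dependencies and the outcome of the last run, it works out which tasks a repair would need to run again and which results can be reused. The task names and statuses are invented.

```python
def tasks_to_repair(dependencies, results):
    """Return the tasks that did not succeed plus everything downstream of them."""
    rerun = {task for task, status in results.items() if status != "success"}
    changed = True
    while changed:
        changed = False
        for task, upstream in dependencies.items():
            # A task must run again if anything it depends on runs again.
            if task not in rerun and upstream & rerun:
                rerun.add(task)
                changed = True
    return rerun


# task -> set of tasks it depends on
dependencies = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "audit": {"ingest"},
}
# Outcome of the failed run: clean failed, so aggregate never started.
results = {
    "ingest": "success",
    "clean": "failed",
    "aggregate": "skipped",
    "audit": "success",
}

rerun = tasks_to_repair(dependencies, results)
reused = set(results) - rerun
print(sorted(rerun), sorted(reused))
```

Here ingest and audit keep their successful results, while clean and aggregate run again. A full rerun would have repeated all four tasks.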
Serverless and the Spark UI
Two further objectives round out the section. “Use serverless for a hands-off, auto-optimized compute managed by Databricks” restates the value of serverless from a production angle: it removes cluster sizing and management, which is often the right answer when a scenario emphasises operational simplicity. “Analyzing the Spark UI to optimize the query” is the diagnostic skill: when a job is slow, the Spark UI shows stages, tasks, and where time and shuffling are spent, so you can find the bottleneck. You do not need deep tuning expertise at this level, but you should know that the Spark UI is where you look to understand a query’s behaviour.
How to study this section
Production concepts are abstract until you operate something. Define a job, schedule it, force a failure, and repair the run. Open the Spark UI for a real query and locate the most expensive stage. These are small exercises, but they convert “I read about repairing a run” into “I have repaired a run,” which is the difference the exam rewards.
Chapter 6: Data Governance and Quality
The final section covers governing data with Unity Catalog and protecting and sharing it. The objectives span managed versus external tables, granting permissions, key Unity Catalog roles, audit logs, lineage, Delta Sharing, and Lakehouse Federation.
Managed versus external tables
The exam asks you to explain the difference between managed and external tables, and this is a classic point of confusion worth nailing. With a managed table, Databricks (through Unity Catalog) manages both the metadata and the underlying data files, so dropping the table deletes the data. With an external table, you point at data in a location you control, and Databricks manages only the metadata, so dropping the table removes the table definition but leaves the files in place. The exam-relevant consequence is the behaviour of DROP: know who owns the files and what survives a drop for each type.
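In SQL the visible difference between the two is usually a LOCATION clause. The sketch below builds both statements and records the DROP behaviour described above as a lookup table. The table names, columns, and storage path are hypothetical, and the helpers are illustrative rather than part of any library.

```python
def create_managed_table_sql(table):
    """No LOCATION: Unity Catalog decides where the files live and owns them."""
    return f"CREATE TABLE {table} (id INT, name STRING)"


def create_external_table_sql(table, location):
    """LOCATION points at storage you control; only metadata is managed."""
    return f"CREATE TABLE {table} (id INT, name STRING) LOCATION '{location}'"


# What DROP TABLE removes for each table type, as described in the text.
DROP_REMOVES = {
    "managed": {"metadata", "data files"},
    "external": {"metadata"},
}

# Hypothetical names and path, used only for illustration.
print(create_managed_table_sql("main.demo.customers"))
print(create_external_table_sql(
    "main.demo.customers_ext", "s3://example-bucket/customers"))
print("data files" in DROP_REMOVES["external"])
```

Creating one table of each kind, dropping both, and then listing the external location is the quickest way to confirm the behaviour for yourself.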
Unity Catalog: grants, roles, and lineage
Unity Catalog (UC) is the governance layer, organising data in a three-level namespace of catalog.schema.table with centralised access control across workspaces. The exam asks you to identify the grant of permissions to users and groups within UC, so be comfortable with GRANT statements: granting SELECT on a schema gives read access to the tables in it, and (as the official sample highlights) you grant on the object at the right level, assuming the prerequisite USE CATALOG and USE SCHEMA privileges are already in place. You should also identify key roles in UC (such as the metastore admin and object owners), know how audit logs are stored for governance and review, and use lineage features that show how data flows from source to downstream tables. Lineage matters because it lets you trace the impact of a change or the origin of a value.
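The chain of privileges needed to read a single table can be sketched as three GRANT statements, one per level of the namespace. The helper below only builds the SQL text; the catalog, schema, table, and group names are hypothetical.

```python
def read_access_grants(catalog, schema, table, principal):
    """Build the grants a principal needs to query one table."""
    return [
        # Prerequisites: permission to use the containers the table sits in.
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{principal}`",
        # The actual read privilege, granted at the table level.
        f"GRANT SELECT ON TABLE {catalog}.{schema}.{table} TO `{principal}`",
    ]


# Hypothetical object and group names, used only for illustration.
for statement in read_access_grants("main", "sales", "orders", "analysts"):
    print(statement)
```

Granting SELECT on the schema instead of the table would widen the access to every table in that schema, which is the kind of level-of-object detail a question can turn on.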
Delta Sharing and Lakehouse Federation
The section closes with collaboration features. Delta Sharing is the open protocol for sharing live data with other Unity Catalog metastores or with external systems, with no copying of the data; the exam asks you to use it, identify its advantages and limitations, distinguish Databricks-to-Databricks sharing from sharing with an external system, and reason about the cost considerations of sharing data across clouds. The clean mental model is that you grant read access to selected objects and the recipient queries live data rather than receiving a copy. Lakehouse Federation is the complementary capability for querying data that lives in external sources (other databases and warehouses) from within Databricks without first ingesting it, so you should recognise its use cases when a scenario describes reaching into an external system rather than loading from it.
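The provider side of Delta Sharing follows a short, fixed sequence: create a share, add objects to it, create a recipient, and grant the recipient read access. The sketch below builds those SQL statements and shows where Databricks-to-Databricks sharing differs, namely that the recipient is created from the other metastore's sharing identifier. The share, table, and recipient names and the identifier value are hypothetical.

```python
def share_statements(share, table, recipient, sharing_identifier=None):
    """Build the provider-side statements for sharing one table."""
    create_recipient = f"CREATE RECIPIENT IF NOT EXISTS {recipient}"
    if sharing_identifier is not None:
        # Databricks-to-Databricks: bind the recipient to a UC metastore.
        create_recipient += f" USING ID '{sharing_identifier}'"
    return [
        f"CREATE SHARE IF NOT EXISTS {share}",
        f"ALTER SHARE {share} ADD TABLE {table}",
        create_recipient,
        f"GRANT SELECT ON SHARE {share} TO RECIPIENT {recipient}",
    ]


# Sharing with an external system: no sharing identifier.
for statement in share_statements("sales_share", "main.sales.orders", "partner_co"):
    print(statement)

# Databricks-to-Databricks: hypothetical placeholder identifier.
d2d = share_statements("sales_share", "main.sales.orders",
                       "partner_dbx", "<sharing-identifier>")
print(d2d[2])
```

In both cases nothing is copied: the grant gives the recipient read access to the live table, which is also why cross-cloud reads can carry data-transfer costs for the provider.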
The pitfall to avoid here
The marks in this section are very gettable, which is exactly why people lose them by skimming. Set up a catalog, grant a privilege to a group, drop a managed and an external table to see the different outcomes, and look at lineage for a table you built. An hour of doing this turns the whole governance section from memorisation into recognition.
Chapter 7: Study plan and hands-on practice
With the five sections understood, the remaining work is pacing them so the hands-on practice that the exam rewards does not get squeezed out. The plan below assumes you treat a live workspace, not a textbook, as your main study tool.
Set up your environment first
Before studying in earnest, get a workspace you can practise in. Databricks Free Edition (the successor to the older Community Edition) or a trial workspace is enough to learn the concepts, and Databricks Academy offers a free self-paced learning path that maps to the exam. Do this on day one so that every later session can include doing, not just reading.
Choose a timeline by experience
If you already use Databricks regularly, a focused plan runs about three to five weeks alongside your daily work: spend the first weeks performing the tasks in each of the five sections in a live workspace, then move to timed review. If you are newer to the platform but know SQL and Python, plan for roughly six to ten weeks, starting with the platform basics and Delta Lake, then building up through ingestion, transformations, pipelines, and governance, with hands-on practice across all five sections before any heavy question drilling. To turn a chosen length into dated weeks for your own start date, use the free study-plan generator.
Build the full loop once
The most valuable single exercise is to build the whole workflow end to end. Ingest a public dataset incrementally with Auto Loader into a bronze Delta table, clean it into silver with Spark SQL and PySpark, aggregate it into gold, wrap the lot in a Lakeflow Declarative Pipeline with a data-quality expectation, schedule it as a Job and practise repairing a failed run, then govern it with Unity Catalog grants and look at its lineage. Doing this loop once exercises every section of the exam and is worth far more than re-reading documentation.
Chapter 8: Final preparation, exam day, and format
Final preparation
In the closing week, consolidate rather than learn new material. Revisit the things people most often blur: Delta managed-versus-external table behaviour, the Lakeflow Declarative Pipelines versus Jobs distinction, Auto Loader’s cloudFiles source and incremental behaviour, PySpark aggregation choices (total versus count versus distinct count), and Unity Catalog grants. Work through practice questions across all five sections to find weak spots, then go back to the workspace and actually perform whatever you got wrong instead of memorising the right letter. Treat each miss as a diagnosis. Avoid “exam dump” sites that recycle copied questions, because they breach Databricks certification policy and teach the wrong habits.
Exam day and format
On the day, the exam is 45 scored multiple-choice questions in 90 minutes (plus a small number of unscored items that add a little time), taken online or at a test centre with a proctor, and no test aids are allowed. Remember that Databricks does not publish a passing score, so do not chase a target percentage; aim instead for steady competence across every section. Read each code-based question carefully, because the difference between two options is often a single function or clause (a distinct count versus a sum, a SELECT grant versus ALL PRIVILEGES, a repair versus a full rerun). Confirm the current exam guide, fee, and policy on the Databricks certification page before you book, since Databricks updates both the exam and its product names over time. Having practised the tasks in a real workspace is exactly the advantage that makes the format feel familiar rather than abstract.