Databricks Certified Data Engineer Associate: Practice Questions
Original practice questions for the Databricks Certified Data Engineer Associate. Each answer is explained, including why each other option is wrong. Filter by topic area or difficulty. These are concept checks - not questions from the certification.
Which open table format gives Databricks tables ACID transactions, schema enforcement and time travel on top of files in cloud storage?
Correct answer: B. Delta Lake is the open table format that adds ACID transactions, schema enforcement and time travel over files, which is why Databricks tables are reliable. Plain CSV files have none of those guarantees; a Databricks SQL dashboard is a way to visualise query results, not a storage format; and Unity Catalog governs access and metadata but is not the table format itself.
In the medallion architecture, which layer holds raw data ingested as-is from the source?
Correct answer: A. Bronze is the raw landing layer where data is ingested as-is before any cleaning. Silver holds cleaned and conformed data; gold holds business-level aggregates ready for reporting; and 'platinum' is not a standard medallion layer at all.
A team needs an interactive cluster to develop and test notebook code together during the day. Which compute is the best fit?
Correct answer: D. An all-purpose (interactive) cluster is designed for interactive development and shared, ad-hoc work in notebooks. A job cluster is spun up for a single automated run and then terminated, which does not suit interactive collaboration; a metastore stores governance metadata, not compute; and a Delta table is data storage, not compute.
Why is the lakehouse described as combining a data lake with a data warehouse?
Correct answer: C. The lakehouse keeps the open, inexpensive file storage of a data lake while layering on the transactions, performance and governance you expect from a warehouse. It does not lock data into a proprietary-only format (it uses open formats like Delta/Parquet); it does not replace SQL with Python (both are first-class); and it certainly does not remove the need for modelling such as the medallion layers.
Databricks SQL is primarily used to:
Correct answer: A. Databricks SQL provides SQL warehouses, a query editor and dashboards for analysing lakehouse data. Scheduling multi-task jobs is the role of Jobs/Workflows; automatically resolving table dependencies is what a declarative pipeline does; and cluster auto-termination is a compute setting, not a function of Databricks SQL.
You query a Delta table 'AS OF' an earlier version to recover data that was overwritten this morning. Which Delta Lake capability is this?
Correct answer: B. Time travel lets you query a previous version of a Delta table by version number or timestamp, which is exactly how you recover overwritten data. Auto Loader is for incrementally ingesting new files; Structured Streaming processes data continuously as it arrives; and partition pruning is a query optimisation that skips irrelevant files, not a way to read history.
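As a rough illustration of the two time-travel forms mentioned above (by version number and by timestamp), the helper below builds the corresponding SQL text. The function name `time_travel_query` and the table name `main.sales.orders` are hypothetical; in a notebook the resulting string would typically be passed to something like `spark.sql(...)`.

```python
def time_travel_query(table, version=None, timestamp=None):
    """Build a Delta time-travel SELECT for a version number or a timestamp."""
    # Exactly one of the two selectors must be supplied.
    if (version is None) == (timestamp is None):
        raise ValueError("pass exactly one of version or timestamp")
    if version is not None:
        return f"SELECT * FROM {table} VERSION AS OF {int(version)}"
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"


# Hypothetical table name and values, for illustration only.
by_version = time_travel_query("main.sales.orders", version=12)
by_time = time_travel_query("main.sales.orders", timestamp="2024-05-01 06:00:00")
```

To find which version to read, `DESCRIBE HISTORY` on the table lists its past versions.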
You drop a managed table in Databricks. What happens to the underlying data files?
Correct answer: D. For a managed table, Databricks owns both the metadata and the data, so dropping it deletes the underlying files as well. Files surviving a drop describes an external (unmanaged) table, not a managed one; Delta does not selectively keep only old versions on a drop; and dropping a table does not move files to another catalog.
You need to continuously load only newly arrived files from a cloud storage folder, without reprocessing old ones. Which Databricks feature is built for this?
Correct answer: A. Auto Loader (the cloudFiles source) incrementally detects and ingests only new files as they land, tracking what it has already processed. A one-time CREATE TABLE reads the folder once and does not pick up later arrivals; Unity Catalog lineage tracks data flow but does not ingest files; and a dashboard visualises data rather than loading it.
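A minimal sketch of an Auto Loader ingest in PySpark, assuming JSON source files and an existing `spark` session passed in by the caller. The function name, its parameters and the choice of JSON are assumptions made for illustration; the paths and target table are whatever your environment uses.

```python
def start_auto_loader(spark, source_dir, schema_dir, checkpoint_dir, target_table):
    """Incrementally ingest new JSON files from source_dir into target_table."""
    incoming = (
        spark.readStream.format("cloudFiles")             # Auto Loader source
        .option("cloudFiles.format", "json")              # assumed file format
        .option("cloudFiles.schemaLocation", schema_dir)  # where the inferred schema is kept
        .load(source_dir)
    )
    return (
        incoming.writeStream
        .option("checkpointLocation", checkpoint_dir)     # records which files were processed
        .trigger(availableNow=True)                       # process the backlog, then stop
        .toTable(target_table)
    )
```

Running the same function again later should pick up only files that arrived since the previous run, because progress is tracked in the checkpoint location.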
What is the key difference between a managed table and an external (unmanaged) table?
Correct answer: B. With an external table you specify and own the storage location, so dropping the table removes only the metadata and leaves the files. Managed tables are fully queryable with SQL; external Delta tables still keep version history; and managed Delta tables do support time travel - so the other options are all false.
Which SQL command would you use to load query results into a new managed Delta table from existing data?
Correct answer: C. CREATE TABLE ... AS SELECT (CTAS) creates a new table and populates it from a query in one statement. GRANT SELECT manages permissions; REFRESH on a dashboard re-runs its queries rather than creating a table; and DESCRIBE HISTORY shows a Delta table's past versions, none of which create and load a new table.
When ingesting raw source files into the bronze layer, the usual goal is to:
Correct answer: A. Bronze is the raw landing zone, so you ingest the data essentially as-is to preserve everything for later reprocessing. Business aggregations belong in the gold layer; final report formatting is a presentation concern, not an ingestion one; and dropping columns early throws away data you may later need, which defeats the purpose of bronze.
An INSERT INTO statement fails because the incoming data has a column type that does not match the table definition. Which Delta Lake behaviour caused this?
Correct answer: D. Schema enforcement rejects writes whose structure or types do not match the table's schema, which protects the table from bad data. Time travel reads past versions and does not block writes; Auto Loader backfill is about picking up earlier files, not type checking; and cross-filter direction is a Power BI modelling term with no role in Databricks writes.
Which of these is the Python API for working with Spark DataFrames in Databricks?
Correct answer: B. PySpark is the Python API for Spark, used to read, transform and write DataFrames. M is the Power Query language in Power BI; DAX is Power BI's measure/calculation language; and HTML is a markup language - none of which is Spark's Python API.
How does Structured Streaming differ from a standard batch query?
Correct answer: C. Structured Streaming processes data incrementally and continuously as new data arrives, whereas a batch query runs once over a fixed dataset. It is not limited to CSV; it runs distributed across a cluster like other Spark work; and it can write to Delta tables (a very common pattern) - so the other statements are wrong.
You want to combine two DataFrames by matching rows on a shared key column, adding columns from both. In PySpark you would use a:
Correct answer: A. A join matches rows from two DataFrames on a key and combines their columns, exactly as in SQL. GRANT manages permissions in Unity Catalog; VACUUM removes old, unreferenced Delta files; and DESCRIBE returns metadata about an object - none of which combine two DataFrames on a key.
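For reference, a key-based join in PySpark might look like the sketch below. The wrapper name `join_on_key` is hypothetical; `DataFrame.join` with `on` and `how` is the underlying call, and an inner join is assumed here.

```python
def join_on_key(left_df, right_df, key):
    """Inner-join two DataFrames on a shared key column."""
    # Rows are matched on `key`; the result carries columns from both sides.
    return left_df.join(right_df, on=key, how="inner")
```

Passing the key as a column name (rather than an expression) keeps a single copy of the key column in the result.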
When transforming bronze data into a silver table, a typical step is to:
Correct answer: B. The silver layer is where raw bronze data is cleaned and conformed - fixing types, deduplicating and removing bad records. Granting admin rights is a governance action unrelated to transformation; shutting down the workspace is an operations task; and a table is not 'converted into' a dashboard - dashboards are built on top of data.
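The three clean-up steps named above (removing bad records, deduplicating, fixing types) could be sketched in PySpark roughly as follows. The column names `order_id`, `amount` and `order_ts` are hypothetical, as is the function name.

```python
def bronze_to_silver(bronze_df):
    """Drop bad records, deduplicate, and cast types for a silver table."""
    return (
        bronze_df
        .filter("order_id IS NOT NULL")               # remove records without a key
        .dropDuplicates(["order_id"])                 # keep one row per key
        .selectExpr(
            "order_id",
            "CAST(amount AS DOUBLE) AS amount",       # fix types
            "CAST(order_ts AS TIMESTAMP) AS order_ts",
        )
    )
```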
A streaming query that reads from a source and writes to a Delta sink uses a checkpoint location mainly to:
Correct answer: D. A checkpoint location records the stream's progress and state so it can restart without reprocessing or losing data. It does not store billing information (a platform/account concern), it does not control read access (that is Unity Catalog), and it does not format output for dashboards (a presentation step).
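The idea behind a checkpoint can be shown with a toy, pure-Python model. This is not the Structured Streaming implementation (real checkpoints also hold state and more metadata); it only saves the next offset to a file so that a restart resumes where the previous run stopped. All names here are hypothetical.

```python
import json
import os
import tempfile


def process_from_checkpoint(records, checkpoint_path, sink, max_records=None):
    """Append unprocessed records to sink, saving the next offset after each one."""
    offset = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            offset = json.load(fh)["next_offset"]   # resume from saved progress
    end = len(records) if max_records is None else min(len(records), offset + max_records)
    for i in range(offset, end):
        sink.append(records[i])
        with open(checkpoint_path, "w") as fh:
            json.dump({"next_offset": i + 1}, fh)   # record progress durably
    return sink


checkpoint = os.path.join(tempfile.mkdtemp(), "offset.json")
source = ["a", "b", "c", "d"]
sink = []
process_from_checkpoint(source, checkpoint, sink, max_records=2)  # simulated stop after two
process_from_checkpoint(source, checkpoint, sink)                 # restart resumes at "c"
```

After both calls the sink holds each record exactly once, which is the property the checkpoint is there to protect.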
In Spark, a transformation such as filter() is described as 'lazy'. This means it:
Correct answer: A. Lazy evaluation means transformations build up a plan and execute only when an action (like count or write) is called, which lets Spark optimise the whole chain. They do not run immediately (that is an action's job); they do not delete source rows (a filter just narrows the result); and they can be written in either SQL or Python.
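A toy model of lazy evaluation, written in plain Python rather than Spark: `filter()` only records the predicate in a plan, and nothing is evaluated until the `count()` action runs. The class `LazyRows` is a hypothetical stand-in for a DataFrame, not part of any Spark API.

```python
class LazyRows:
    """Toy stand-in for a DataFrame: transformations are recorded, not run."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan

    def filter(self, predicate):
        # Transformation: return a new object with a longer plan; no work yet.
        return LazyRows(self._rows, self._plan + (predicate,))

    def count(self):
        # Action: only now is every recorded predicate applied to the rows.
        return sum(1 for row in self._rows if all(p(row) for p in self._plan))


calls = []


def is_even(n):
    calls.append(n)          # track when the predicate is actually evaluated
    return n % 2 == 0


evens = LazyRows(range(10)).filter(is_even)
calls_before_action = len(calls)   # filter() alone evaluated nothing
even_count = evens.count()         # the action triggers evaluation
```

The source rows are untouched throughout; the filter only narrows what the action returns.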
Which Databricks feature lets you define target tables and transformations declaratively while the platform manages dependencies and execution?
Correct answer: C. Lakeflow Declarative Pipelines (formerly Delta Live Tables) let you declare the tables and transformations, and the platform works out dependencies, ordering and execution. An all-purpose cluster is just compute; Databricks Repos is Git-based code version control; and a SQL warehouse runs interactive SQL - none provide declarative pipeline management.
In a Lakeflow Declarative Pipeline (DLT), what do 'expectations' do?
Correct answer: B. Expectations are declarative data-quality constraints that check each row and, on a violation, either keep the record (with a warning), drop it, or fail the update, with metrics reported. They do not size the cluster (a compute setting), they do not grant permissions (Unity Catalog does that), and they do not schedule runs (that is configured on the pipeline or a job).
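The three violation policies can be modelled with a small pure-Python sketch. This is a toy, not the pipeline API: the function `apply_expectation` and its `on_violation` values are hypothetical. In a real Python pipeline the equivalent choices are, as far as I know, made with decorators such as `dlt.expect`, `dlt.expect_or_drop` and `dlt.expect_or_fail`.

```python
def apply_expectation(rows, name, predicate, on_violation="warn"):
    """Check rows against a rule; keep, drop, or fail on violations."""
    passed = [r for r in rows if predicate(r)]
    failed = [r for r in rows if not predicate(r)]
    metrics = {"expectation": name, "passed": len(passed), "failed": len(failed)}
    if failed and on_violation == "fail":
        # Fail the whole update rather than write bad rows.
        raise ValueError(f"expectation {name!r} violated by {len(failed)} row(s)")
    # "drop" keeps only passing rows; "warn" keeps everything and reports metrics.
    kept = passed if on_violation == "drop" else list(rows)
    return kept, metrics


rows = [{"id": 1}, {"id": None}, {"id": 3}]


def has_id(row):
    return row["id"] is not None


warned, warn_metrics = apply_expectation(rows, "valid_id", has_id, "warn")
dropped, drop_metrics = apply_expectation(rows, "valid_id", has_id, "drop")
```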
Which tool do you use to schedule and orchestrate several dependent tasks - notebooks, a pipeline and a script - to run in order?
Correct answer: D. Databricks Jobs/Workflows orchestrate multiple tasks with dependencies and schedules, running them in the right order. A Delta table is data storage; Unity Catalog lineage visualises how data flows but does not run tasks; and time travel reads historical table versions - none of which orchestrate work.
How does a Lakeflow Declarative Pipeline differ from a Databricks Job?
Correct answer: A. A declarative pipeline focuses on defining and continuously maintaining target tables and their dependencies, while a Job schedules and orchestrates arbitrary tasks such as notebooks and scripts. They are not interchangeable; Jobs can run Python as well as SQL; and pipelines explicitly support data-quality checks through expectations - so only the first statement is correct.
You want a production job to run on fresh, automatically terminated compute rather than a shared interactive cluster. You should configure it to use:
Correct answer: C. A job cluster is created for a specific run and terminated afterwards, which is the recommended, cost-efficient compute for scheduled production jobs. Running on the driver alone removes Spark's parallelism; a dashboard is for visualising results, not running jobs; and a schema is a governance grouping of tables, not compute.
If one task in a multi-task Databricks Job fails, a sensible production practice is to:
Correct answer: B. Setting task dependencies plus retries and alerts means downstream tasks do not consume failed output and the team is notified to act. Ignoring failures and forcing success hides data problems; deleting the pipeline destroys working logic over a single run; and moving the table to another catalog does nothing to address the failure.
Which Databricks component provides centralised governance - permissions, lineage and a unified namespace - across workspaces?
Correct answer: D. Unity Catalog is the central governance layer, providing access control, lineage and a single namespace shared across workspaces. Auto Loader ingests files, Structured Streaming processes data incrementally, and Delta time travel reads past table versions - none of which govern access across workspaces.
What is the correct three-level namespace used to reference an object in Unity Catalog?
Correct answer: A. Unity Catalog uses a three-level namespace of catalog.schema.table to identify a securable object. 'workspace.cluster.notebook' mixes unrelated platform concepts; 'bronze.silver.gold' are medallion layers, not a namespace; and 'driver.executor.task' describes Spark execution components, not object naming.
To let an analyst read a specific table but not modify it, which approach fits Unity Catalog?
Correct answer: B. Granting SELECT on the specific table (ideally to a group) gives read access without write permission, following least privilege. Workspace admin rights grant far more than read access; dropping and recreating the table per user is unworkable and destructive; and disabling their cluster removes access to everything, not just write.
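As a small sketch, the read-only grant described above can be written against the three-level name like this. The helper `grant_select`, the catalog/schema/table names and the group `analysts` are all hypothetical; the generated statement would be run by someone with the right to grant on that table.

```python
def grant_select(catalog, schema, table, principal):
    """Build a read-only grant on one table using the three-level name."""
    full_name = f"{catalog}.{schema}.{table}"   # catalog.schema.table
    return f"GRANT SELECT ON TABLE {full_name} TO `{principal}`"


# Hypothetical names, for illustration only.
statement = grant_select("main", "sales", "orders", "analysts")
```

In Unity Catalog the group would also need USE CATALOG and USE SCHEMA on the parent catalog and schema before the SELECT grant is usable.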
Within a Lakeflow Declarative Pipeline, you add a rule that fails the update if a primary-key column contains nulls. This is best described as:
Correct answer: C. A rule that validates rows (such as rejecting nulls in a key) and can drop bad records or fail the update is a data-quality expectation. An autoscaling policy governs how compute scales; a lineage diagram visualises data flow rather than enforcing rules; and a time-travel query reads historical versions - none of which validate incoming data.
A regulator asks how a particular gold table was built and which source tables fed it. Which Unity Catalog feature answers this fastest?
Correct answer: C. Unity Catalog data lineage tracks how data flows from source tables through transformations to a target, so you can trace exactly what fed the gold table. Auto Loader checkpoints track ingestion progress, not table-to-table lineage; cluster event logs record compute activity, not data flow; and Delta partitioning is a storage optimisation that says nothing about origin.
Your organisation wants the same table permissions and names to apply consistently when users work from several different Databricks workspaces. Unity Catalog supports this because it provides:
Correct answer: D. Unity Catalog uses a shared metastore, so the same catalogs, object names and access rules apply consistently across the workspaces attached to it. A separate metastore per notebook would fragment governance, not unify it; permissions scoped to a single cluster would not span workspaces; and duplicating every table into each workspace creates copies rather than one consistently governed source.
Practice questions FAQ
- Are these real Databricks DE Associate exam questions?
- No. These are original study questions written to test understanding. They are not real exam questions, exam dumps, or copied from any provider.
- How should I use these practice questions?
- Answer each one, read the explanation (including why the wrong options are wrong), and use the per-domain score below to focus your revision on weak areas. Revisit before exam day.
- How many questions should I do before the exam?
- Enough to score consistently across every domain, alongside full-length practice from official or reputable providers. Understanding why each answer is right matters more than raw volume.
- What score means I am ready?
- A good signal is consistently scoring around 80% or higher across all domains on questions you have not seen before, and being able to explain why the wrong options are wrong.
- Should I use exam dumps?
- No. Dumps (real or leaked questions) breach provider policy, can void your certification, and do not build the understanding the exam actually tests.