A BigQuery table of events is queried mostly by date ranges, and costs are high. What is the most effective first step to reduce bytes scanned?

Partition the table by the event date

Convert the table to a Bigtable instance

Export the table to Cloud Storage and query the files
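
For the partitioning question above, a toy model in Python can show why a date filter reads less data once a table is partitioned by date. This is a sketch, not the BigQuery API: the `partitions` dictionary, its byte sizes and the helper `bytes_scanned` are all invented for illustration.

```python
from datetime import date

# Toy model: partition date -> stored bytes for that day (made-up sizes).
partitions = {
    date(2024, 1, 1): 500,
    date(2024, 1, 2): 700,
    date(2024, 1, 3): 600,
    date(2024, 1, 4): 800,
}

def bytes_scanned(partitions, start=None, end=None):
    """Sum the bytes in partitions whose date falls in [start, end].

    With no date filter every partition is read, which models a
    query against an unpartitioned table.
    """
    return sum(
        size
        for day, size in partitions.items()
        if (start is None or day >= start) and (end is None or day <= end)
    )

full_scan = bytes_scanned(partitions)
pruned = bytes_scanned(partitions, date(2024, 1, 2), date(2024, 1, 3))
print(full_scan, pruned)  # 2600 1300
```

The two-day query touches only the two matching partitions, so in this model it reads half the bytes of the full scan.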

After partitioning a BigQuery table by date, queries also filter heavily on customer_id. What further optimisation reduces bytes scanned within each partition?

Cluster the table by customer_id

Add a secondary index on customer_id

Create a separate table per customer

Switch the table to on-demand pricing
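
For the clustering question above, the idea can be sketched as block pruning: when rows are stored sorted by `customer_id`, each storage block covers a narrow range of values, so a filter on one customer can skip most blocks. The code below is a simplified model with made-up data and hypothetical helpers (`make_blocks`, `blocks_read`); it does not describe BigQuery's actual storage internals.

```python
# Toy model of block pruning inside one partition (not BigQuery internals).
def make_blocks(customer_ids, block_size):
    """Split customer_ids into fixed-size blocks and record each block's min/max."""
    blocks = []
    for i in range(0, len(customer_ids), block_size):
        chunk = customer_ids[i:i + block_size]
        blocks.append((min(chunk), max(chunk), chunk))
    return blocks

def blocks_read(blocks, customer_id):
    """Count blocks whose [min, max] range could contain customer_id."""
    return sum(1 for lo, hi, _ in blocks if lo <= customer_id <= hi)

rows = [7, 2, 9, 4, 1, 8, 3, 6, 5, 7, 2, 9]        # arrival order
unclustered = make_blocks(rows, block_size=3)
clustered = make_blocks(sorted(rows), block_size=3)  # sorted by customer_id

print(blocks_read(unclustered, 7), blocks_read(clustered, 7))  # 3 1
```

In arrival order, three of the four blocks might contain customer 7 and must be read; once the rows are sorted, only one block qualifies.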

In an Apache Beam streaming pipeline on Dataflow, what is the purpose of windowing?

To group unbounded streaming data into finite time intervals for aggregation

To encrypt records before they are written to storage

To assign IAM roles to the pipeline's service account

To partition a BigQuery table by date
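
For the windowing question above, here is a minimal plain-Python sketch of fixed (tumbling) windows: each event is assigned to the window containing its timestamp, and values are summed per window. It deliberately leaves out watermarks, triggers and late data, which a real Beam pipeline also handles; the function name `fixed_windows` and the sample events are assumptions for illustration.

```python
from collections import defaultdict

def fixed_windows(events, size_seconds):
    """Group (timestamp_seconds, value) pairs into fixed windows and sum each.

    Returns {window_start: total}; each window covers
    [window_start, window_start + size_seconds).
    """
    totals = defaultdict(int)
    for ts, value in events:
        window_start = ts - (ts % size_seconds)
        totals[window_start] += value
    return dict(totals)

# Events may arrive out of order; assignment depends only on the timestamp.
events = [(3, 1), (59, 2), (60, 5), (125, 4), (61, 1)]
print(fixed_windows(events, 60))  # {0: 3, 60: 6, 120: 4}
```

In the Beam Python SDK the same grouping is typically expressed with a `WindowInto` transform and a `FixedWindows` window function rather than hand-written arithmetic.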

A pipeline's Dataflow job authenticates to BigQuery and Cloud Storage. Following least privilege, how should it be granted access?

Use a dedicated service account with only the roles it needs

Run it with a user's personal credentials

Grant the project Owner role to the pipeline

Make the BigQuery datasets and Cloud Storage buckets public

A pipeline loads a daily file from an external system, transforms it, and writes to BigQuery once per day. Which processing model is most appropriate?

Batch processing on a daily schedule

Streaming with Pub/Sub and Dataflow

An always-on Spark Structured Streaming job

Cloud Storage holds raw landing data that is rarely accessed after 90 days but must be retained for years. What reduces storage cost automatically?

A lifecycle rule that transitions objects to a colder storage class

Switching the bucket to a multi-region location

Enabling clustering on the bucket
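
For the lifecycle question above, a rule of this kind can be written as a small JSON document. The sketch below builds one in Python; the 90-day age comes from the question, while the choice of `COLDLINE` as the target class is an assumption, and the exact schema should be checked against the current Cloud Storage lifecycle documentation before use.

```python
import json

# Assumed policy: move objects to Coldline once they are 90 days old.
lifecycle_config = {
    "rule": [
        {
            "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
            "condition": {"age": 90},  # age is measured in days
        }
    ]
}

# Serialise to the JSON text that would be saved as a lifecycle file.
print(json.dumps(lifecycle_config, indent=2))
```

Because the rule lives on the bucket, no pipeline has to move objects by hand once they age past the threshold.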

A streaming pipeline must not lose messages if a downstream consumer is briefly unavailable. Which Pub/Sub behaviour supports this?

Pub/Sub retains and redelivers unacknowledged messages until they are acknowledged or the subscription's retention period expires

Messages are deleted immediately on publish

Messages are stored only in the publisher's memory

Pub/Sub guarantees exactly-once delivery with no acknowledgements
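
For the Pub/Sub question above, the acknowledge-or-redeliver behaviour can be illustrated with a toy in-memory subscription. `ToySubscription` is a hypothetical class written for this example; it is not the Pub/Sub client library, and it ignores ack deadlines and retention limits.

```python
class ToySubscription:
    """In-memory stand-in for a subscription; not the Pub/Sub client API."""

    def __init__(self):
        self._unacked = {}   # message_id -> payload
        self._next_id = 0

    def publish(self, payload):
        self._unacked[self._next_id] = payload
        self._next_id += 1

    def pull(self):
        """Return every message that has not been acknowledged yet."""
        return list(self._unacked.items())

    def ack(self, message_id):
        """Acknowledge a message so it is never delivered again."""
        self._unacked.pop(message_id, None)


sub = ToySubscription()
sub.publish("event-a")
sub.publish("event-b")

# First delivery: the consumer acks event-a, then goes down before event-b.
first = sub.pull()
sub.ack(first[0][0])

# Second delivery: event-b is still available because it was never acked.
second = sub.pull()
print(second)  # [(1, 'event-b')]
```

The message is only dropped after an explicit acknowledgement, which is why a brief consumer outage does not lose data.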

You must choose between Cloud SQL and Spanner for a relational workload that needs strong consistency and horizontal scaling across regions. Which fits, and why?

Spanner, because it offers strong consistency with horizontal, global scale

Cloud SQL, because it scales horizontally across regions by default

Bigtable, because it is relational and globally consistent

BigQuery, because it is a transactional relational database

A Dataflow streaming pipeline is falling behind and system lag is growing. Where do you first look to diagnose throughput and lag?

Cloud Monitoring and Logging for the Dataflow job's metrics

The Cloud Storage bucket's access logs

An analytics team should be able to query a specific BigQuery dataset but must not modify or delete it. Which approach grants the right access?

Grant the team the BigQuery Data Viewer role on that dataset

Grant the team the BigQuery Admin role

Share the underlying Cloud Storage bucket with the team

Add the team as project Editors

A pipeline must repeat the same multi-step ETL every night and rerun failed steps automatically. Which design best automates this?

A Cloud Composer (Airflow) DAG scheduled nightly with task retries

Manually trigger each step from the console each night

A single Dataflow job with no scheduling

A Bigtable table that stores the steps
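
For the nightly ETL question above, the two properties that matter are dependency ordering and automatic retries. The sketch below is a tiny stand-in scheduler in plain Python, not Airflow or Cloud Composer; `run_dag`, its parameters and the three sample tasks are hypothetical names used only to show the idea.

```python
def run_dag(tasks, dependencies, max_retries=2):
    """Run tasks in dependency order, retrying each failed task.

    tasks: {name: callable}; dependencies: {name: [upstream names]}.
    Returns task names in the order they succeeded. Raises RuntimeError
    if a task still fails after max_retries retries.
    """
    done = []
    pending = list(tasks)
    while pending:
        ready = [t for t in pending
                 if all(up in done for up in dependencies.get(t, []))]
        if not ready:
            raise RuntimeError("cycle or missing upstream task")
        for name in ready:
            for attempt in range(max_retries + 1):
                try:
                    tasks[name]()
                    break
                except Exception:
                    if attempt == max_retries:
                        raise RuntimeError(f"{name} failed after retries")
            done.append(name)
            pending.remove(name)
    return done


attempts = {"transform": 0}

def extract():
    pass

def transform():
    attempts["transform"] += 1
    if attempts["transform"] < 2:   # fail once, then succeed
        raise ValueError("transient error")

def load():
    pass

order = run_dag(
    {"load": load, "extract": extract, "transform": transform},
    {"transform": ["extract"], "load": ["transform"]},
)
print(order)  # ['extract', 'transform', 'load']
```

In Airflow the same intent is declared rather than coded: tasks are linked into a DAG with a schedule, and each task carries a `retries` setting.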

Which statement best captures when to choose Dataflow over Dataproc?

Choose Dataflow when you want serverless, autoscaling Beam pipelines without managing clusters

Choose Dataflow only when you must run existing Hive scripts

Choose Dataflow because Dataproc cannot process streaming data at all

Choose Dataflow because it is the only service that can write to BigQuery

A dataset contains personal data subject to regional regulations requiring it stay in the EU. Which design choice helps meet this requirement?

Store and process the data in EU regions/locations for the relevant services

Make the dataset public to simplify access

Store the data only in a US multi-region for lower cost

Disable all logging to avoid recording personal data

A BigQuery dashboard reruns the same expensive aggregation many times an hour with little change in underlying data. What reduces repeated query cost?

Create a materialised view so the result is precomputed and incrementally refreshed

Re-import the data into Bigtable before each query

Remove partitioning from the source table

Increase the dashboard refresh frequency

An IoT system sends millions of events per second that must be buffered before processing. Which architecture handles ingestion at this scale?

Publish events to Pub/Sub, then process with Dataflow into BigQuery

Write each event directly into Cloud SQL

Have devices write straight to a single BigQuery table row by row

Store each event as a separate Cloud Storage object and query them individually

Which design supports data portability so pipelines could move between environments with less rework?

Build pipelines on the open Apache Beam model and parameterise environment-specific values

Hard-code one project's resource names throughout the pipeline

Store all logic in manual console clicks

Use a proprietary format with no documented schema

A nightly batch load into BigQuery occasionally fails midway, leaving partial data. What is the most robust way to keep loads correct?

Load into a staging table and only swap into production after the load succeeds

Ignore failures and let the next night overwrite the data

Switch the destination to Bigtable

Disable retries so failures are visible
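
For the partial-load question above, the staging pattern can be sketched with SQLite standing in for the warehouse. The table names, the validation rule (reject a row with a missing amount) and the function `load_via_staging` are assumptions made for this example; the point is only that production is touched after the whole load has succeeded.

```python
import sqlite3

def load_via_staging(conn, rows):
    """Load rows into a staging table, then replace production only on success."""
    conn.execute("DROP TABLE IF EXISTS staging")
    conn.execute("CREATE TABLE staging (id INTEGER, amount REAL)")
    try:
        for row in rows:
            if row[1] is None:              # hypothetical validation rule
                raise ValueError(f"bad row: {row}")
            conn.execute("INSERT INTO staging VALUES (?, ?)", row)
    except Exception:
        conn.rollback()                     # discard the partial staging load
        return False
    with conn:                              # commits on success, rolls back on error
        conn.execute("DELETE FROM production")
        conn.execute("INSERT INTO production SELECT * FROM staging")
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE production (id INTEGER, amount REAL)")
conn.execute("INSERT INTO production VALUES (1, 10.0)")
conn.commit()

ok_bad = load_via_staging(conn, [(2, 20.0), (3, None)])   # fails midway
after_bad = conn.execute("SELECT * FROM production").fetchall()

ok_good = load_via_staging(conn, [(2, 20.0), (3, 30.0)])
after_good = conn.execute("SELECT * FROM production ORDER BY id").fetchall()

print(ok_bad, after_bad)    # False [(1, 10.0)]
print(ok_good, after_good)  # True [(2, 20.0), (3, 30.0)]
```

The failed load leaves yesterday's production rows untouched, and the successful one replaces them in a single transaction.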

Which option best describes the difference between ETL and ELT in a BigQuery context?

ETL transforms data before loading; ELT loads raw data then transforms it inside BigQuery

ETL and ELT are identical and interchangeable

ELT transforms data before loading; ETL never transforms data

ETL only works with streaming and ELT only with batch

A BigQuery cost report shows runaway on-demand spend from frequent SELECT * queries on a very wide table. Which change reduces cost most directly?

Query only the needed columns instead of SELECT *

Add more slots to the reservation

Convert the table to a view without changing the queries

Increase the table's expiration time
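
For the SELECT * question above, the saving comes from columnar storage: a query reads only the columns it references. The arithmetic can be sketched as follows, using invented per-column sizes and a hypothetical helper `column_bytes_scanned`; real tables and billing rules will differ.

```python
# Made-up per-column storage sizes in bytes for one wide table.
column_bytes = {
    "event_id": 8_000,
    "event_ts": 8_000,
    "customer_id": 8_000,
    "payload_json": 900_000,
    "user_agent": 76_000,
}

def column_bytes_scanned(column_bytes, selected=None):
    """Bytes read by a query in a columnar store: only referenced columns count.

    selected=None models SELECT *, which touches every column.
    """
    names = column_bytes if selected is None else selected
    return sum(column_bytes[name] for name in names)

select_star = column_bytes_scanned(column_bytes)
narrow = column_bytes_scanned(column_bytes, ["event_ts", "customer_id"])
print(select_star, narrow)  # 1000000 16000
```

In this model, naming two narrow columns reads 1.6% of what SELECT * reads, because the large `payload_json` column is never touched.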

A real-time fraud system needs sub-10-millisecond reads of a user's recent activity by user ID, at very high request rates. Which store and key design fit?

Bigtable with a well-designed row key on user ID

BigQuery with a clustered table

Cloud Storage with one object per user

Cloud SQL with a single large table
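
For the fraud-lookup question above, one common row key pattern is the user ID followed by a reversed timestamp, so that a user's newest rows sort first and can be fetched with a short prefix scan. The sketch below simulates that ordering with a sorted Python list; `MAX_TS`, the `#` separator and both helper functions are assumptions for illustration, not the Bigtable client API.

```python
MAX_TS = 10**13  # assumed upper bound on millisecond timestamps

def row_key(user_id, ts_millis):
    """Build 'user#reversed_ts' so a user's newest rows sort first."""
    reversed_ts = MAX_TS - ts_millis
    return f"{user_id}#{reversed_ts:013d}"

def recent_activity(sorted_keys, user_id, limit):
    """Simulate a prefix scan over lexicographically sorted row keys."""
    prefix = f"{user_id}#"
    return [k for k in sorted_keys if k.startswith(prefix)][:limit]

# Row keys are stored in sorted order, as in a sorted key-value store.
keys = sorted([
    row_key("user42", 1_700_000_000_000),
    row_key("user42", 1_700_000_005_000),
    row_key("user7", 1_700_000_001_000),
    row_key("user42", 1_700_000_009_000),
])

# The two most recent events for user42, newest first.
print(recent_activity(keys, "user42", 2))
```

Leading with the user ID keeps one user's rows contiguous, and the zero-padded reversed timestamp keeps them in newest-first order under plain string comparison.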

An orchestration needs to run a Dataflow job, wait for it to finish, then run a BigQuery transformation, and alert on failure. Which tool models these dependencies best?

Cloud Composer (Airflow), defining tasks and dependencies in a DAG

A single cron entry that fires both at the same time

Manual checking and a follow-up email

Bigtable, storing the steps as rows

Different teams must be able to find, understand and trust shared datasets across the organisation. Which capability most directly enables this?

A data catalogue and governance layer such as Dataplex with metadata and tags

Turning off audit logging to reduce noise

Giving everyone the Owner role for convenience

Storing each dataset in a private spreadsheet

Are these real PDE exam questions?

No. These are original study questions written to test understanding. They are not real exam questions, exam dumps, or copied from any provider.

How should I use these practice questions?

Answer each one, read the explanation (including why the wrong options are wrong), and use the per-domain score below to focus your revision on weak areas. Revisit before exam day.

How many questions should I do before the exam?

Enough to score consistently across every domain, alongside full-length practice from official or reputable providers. Understanding why each answer is right matters more than raw volume.

What score means I am ready?

A good signal is consistently scoring around 80% or higher across all domains on questions you have not seen before, and being able to explain why the wrong options are wrong.

Should I use exam dumps?

No. Dumps (real or leaked questions) breach provider policy, can void your certification, and do not build the understanding the exam actually tests.

Practice questions · Data & Analytics

Google Cloud Professional Data Engineer: Practice Questions

Expert · 30 questions

Original practice questions for the Google Cloud Professional Data Engineer (PDE). Each answer is explained, including why each other option is wrong. These are concept checks, not questions from the certification and not exam dumps.

By The Exam Atlas Editorial Team · Verified 2026-06-06 · ~38 min

Domain · Difficulty

Storing the data · easy

A team needs a serverless data warehouse to run SQL analytics over terabytes of data without managing infrastructure. Which Google Cloud service fits best?

Storing the data · medium

An application needs high-throughput, low-latency reads and writes by a single key for time-series sensor data at massive scale. Which store is the best fit?

Ingesting and processing the data · easy

You need to build a streaming pipeline that transforms events continuously as they arrive, with no clusters to manage. Which service is the default choice?

Ingesting and processing the data · easy

Which service is the standard way to ingest and decouple high-volume event streams before they are processed?

Storing the data · medium

A BigQuery table of events is queried mostly by date ranges, and costs are high. What is the most effective first step to reduce bytes scanned?

Storing the data · hard

After partitioning a BigQuery table by date, queries also filter heavily on customer_id. What further optimisation reduces bytes scanned within each partition?

Ingesting and processing the data · medium

A workload must run existing Apache Spark and Hadoop jobs with minimal code changes on Google Cloud. Which service is the most appropriate?

Maintaining and automating data workloads · medium

Which service provides managed Apache Airflow for orchestrating and scheduling multi-step data pipelines on Google Cloud?

Ingesting and processing the data · hard

In an Apache Beam streaming pipeline on Dataflow, what is the purpose of windowing?

Designing data processing systems · medium

A pipeline's Dataflow job authenticates to BigQuery and Cloud Storage. Following least privilege, how should it be granted access?

Preparing and using data for analysis · medium

An organisation wants to discover, catalogue and govern data quality across multiple BigQuery datasets and Cloud Storage data lakes from one place. Which service fits?

Preparing and using data for analysis · medium

A team manages many interdependent SQL transformations in BigQuery and wants version control, testing and dependency management for them (ELT). Which tool is designed for this?

Ingesting and processing the data · medium

A pipeline loads a daily file from an external system, transforms it, and writes to BigQuery once per day. Which processing model is most appropriate?

Storing the data · medium

Cloud Storage holds raw landing data that is rarely accessed after 90 days but must be retained for years. What reduces storage cost automatically?

Ingesting and processing the data · hard

A streaming pipeline must not lose messages if a downstream consumer is briefly unavailable. Which Pub/Sub behaviour supports this?

Designing data processing systems · hard

You must choose between Cloud SQL and Spanner for a relational workload that needs strong consistency and horizontal scaling across regions. Which fits, and why?

Maintaining and automating data workloads · medium

A Dataflow streaming pipeline is falling behind and system lag is growing. Where do you first look to diagnose throughput and lag?

Preparing and using data for analysis · medium

An analytics team should be able to query a specific BigQuery dataset but must not modify or delete it. Which approach grants the right access?

Maintaining and automating data workloads · medium

A pipeline must repeat the same multi-step ETL every night and rerun failed steps automatically. Which design best automates this?

Designing data processing systems · hard

Which statement best captures when to choose Dataflow over Dataproc?

Designing data processing systems · medium

A dataset contains personal data subject to regional regulations requiring it stay in the EU. Which design choice helps meet this requirement?

Storing the data · hard

A BigQuery dashboard reruns the same expensive aggregation many times an hour with little change in underlying data. What reduces repeated query cost?

Ingesting and processing the data · hard

An IoT system sends millions of events per second that must be buffered before processing. Which architecture handles ingestion at this scale?

Designing data processing systems · hard

Which design supports data portability so pipelines could move between environments with less rework?

Maintaining and automating data workloads · hard

A nightly batch load into BigQuery occasionally fails midway, leaving partial data. What is the most robust way to keep loads correct?

Preparing and using data for analysis · medium

Which option best describes the difference between ETL and ELT in a BigQuery context?

Storing the data · medium

A BigQuery cost report shows runaway on-demand spend from frequent SELECT * queries on a very wide table. Which change reduces cost most directly?

Designing data processing systems · hard

A real-time fraud system needs sub-10-millisecond reads of a user's recent activity by user ID, at very high request rates. Which store and key design fit?

Maintaining and automating data workloads · medium

An orchestration needs to run a Dataflow job, wait for it to finish, then run a BigQuery transformation, and alert on failure. Which tool models these dependencies best?

Preparing and using data for analysis · medium

Different teams must be able to find, understand and trust shared datasets across the organisation. Which capability most directly enables this?

Sources

Google Cloud - Professional Data Engineer