Google Cloud Professional Data Engineer: Practice Questions
Original practice questions for the Google Cloud Professional Data Engineer (PDE) certification. Each answer is explained, including why each of the other options is wrong. These are concept checks, not questions from the certification exam, and not exam dumps.
A team needs a serverless data warehouse to run SQL analytics over terabytes of data without managing infrastructure. Which Google Cloud service fits best?
Correct answer: A. BigQuery is the serverless, columnar data warehouse built for SQL analytics over large datasets with no infrastructure to manage. Bigtable is wide-column NoSQL for key-based access, not ad-hoc SQL analytics; Cloud SQL is a managed transactional relational database that does not scale to interactive analytics over terabytes; and Dataproc is managed Hadoop/Spark, a processing cluster rather than a serverless warehouse.
An application needs high-throughput, low-latency reads and writes by a single key for time-series sensor data at massive scale. Which store is the best fit?
Correct answer: B. Bigtable is a wide-column NoSQL store designed for high-throughput, low-latency key-based access, which suits time-series and IoT data at scale. BigQuery is for analytical SQL, not high-rate single-key lookups; Cloud Storage is object storage, not a low-latency key-value store; and Spanner is a relational database optimised for global transactions, heavier and costlier than needed for simple key-based time-series access.
You need to build a streaming pipeline that transforms events continuously as they arrive, with no clusters to manage. Which service is the default choice?
Correct answer: C. Dataflow is the serverless, fully managed service for running Apache Beam batch and streaming pipelines, so there are no clusters to manage. Dataproc runs Hadoop/Spark on clusters you size and manage; Cloud Composer orchestrates and schedules pipelines but does not itself process the data stream; and Cloud SQL is a transactional database, not a stream-processing engine.
Which service is the standard way to ingest and decouple high-volume event streams before they are processed?
Correct answer: D. Pub/Sub is Google Cloud's global, at-least-once messaging service used to ingest event streams and decouple producers from consumers. Cloud Storage is for objects and batch landing, not real-time message delivery; Bigtable is a NoSQL store, not a messaging buffer; and Dataform manages SQL-based ELT transformations in BigQuery, not stream ingestion.
A BigQuery table of events is queried mostly by date ranges, and costs are high. What is the most effective first step to reduce bytes scanned?
Correct answer: D. Partitioning the table by event date lets BigQuery prune partitions so date-range queries scan only the relevant data, directly cutting bytes scanned and on-demand cost. Moving to Bigtable abandons SQL analytics and does not address BigQuery cost; exporting to Cloud Storage adds complexity without reducing scanned data; and adding slots changes the pricing model but does not reduce how much data each query reads.
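The effect of partition pruning can be illustrated with a toy model. The sketch below is plain Python, not BigQuery: the partition sizes and the helper names are made up for illustration, and it only shows why a date filter reads less data once the table is split by date.

```python
from datetime import date

# Hypothetical table: one entry per daily partition, value = bytes stored in it.
PARTITIONS = {
    date(2024, 1, 1): 400,
    date(2024, 1, 2): 500,
    date(2024, 1, 3): 450,
    date(2024, 1, 4): 650,
}

def bytes_scanned_unpartitioned(partitions):
    """Without partitioning, a date filter still has to read every row."""
    return sum(partitions.values())

def bytes_scanned_partitioned(partitions, start, end):
    """With date partitioning, only partitions inside [start, end] are read."""
    return sum(size for day, size in partitions.items() if start <= day <= end)

full = bytes_scanned_unpartitioned(PARTITIONS)
pruned = bytes_scanned_partitioned(PARTITIONS, date(2024, 1, 2), date(2024, 1, 3))
print(full, pruned)  # 2000 950
```

In this model the two-day query reads 950 of the 2000 stored bytes; the saving grows as the queried range shrinks relative to the table's history.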
After partitioning a BigQuery table by date, queries also filter heavily on customer_id. What further optimisation reduces bytes scanned within each partition?
Correct answer: B. Clustering sorts data within each partition by the chosen columns, so filters on customer_id read fewer blocks and scan fewer bytes. BigQuery does not use traditional secondary indexes like an OLTP database; a separate table per customer creates unmanageable sprawl and breaks cross-customer analytics; and changing the pricing model does not reduce the data each query scans.
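Why sorting helps can be seen in a simplified model in which each storage block records the minimum and maximum customer_id it holds, and a filter skips blocks whose range cannot match. This is an illustrative assumption about block-level pruning, not a description of BigQuery's internals; the data and function names are hypothetical.

```python
def make_blocks(customer_ids, block_size):
    """Split rows into storage blocks and record each block's min/max customer_id."""
    blocks = []
    for i in range(0, len(customer_ids), block_size):
        chunk = customer_ids[i:i + block_size]
        blocks.append((min(chunk), max(chunk), len(chunk)))
    return blocks

def rows_read(blocks, customer_id):
    """Read only the blocks whose [min, max] range could contain the customer."""
    return sum(n for lo, hi, n in blocks if lo <= customer_id <= hi)

# Hypothetical partition: 12 rows, customer_ids in arrival order.
arrival_order = [3, 1, 4, 2, 1, 3, 4, 2, 2, 4, 1, 3]

unclustered = make_blocks(arrival_order, block_size=3)
clustered = make_blocks(sorted(arrival_order), block_size=3)

print(rows_read(unclustered, 2), rows_read(clustered, 2))  # 12 3
```

With rows in arrival order, every block's range covers customer 2, so all 12 rows are read; once the rows are sorted, only one block of 3 rows qualifies.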
A workload must run existing Apache Spark and Hadoop jobs with minimal code changes on Google Cloud. Which service is the most appropriate?
Correct answer: C. Dataproc is managed Hadoop and Spark, so existing open-source jobs run with minimal changes. Dataflow uses the Apache Beam model and would require rewriting Spark jobs; BigQuery is a SQL warehouse, not a Spark/Hadoop runtime; and Pub/Sub is a messaging service, not a batch-processing engine.
Which service provides managed Apache Airflow for orchestrating and scheduling multi-step data pipelines on Google Cloud?
Correct answer: A. Cloud Composer is managed Apache Airflow, used to author, schedule and monitor pipeline DAGs that coordinate multiple services. Dataflow runs individual data-processing pipelines but is not a general orchestrator; Dataprep is a visual data-cleaning tool; and Cloud Scheduler triggers single jobs on a cron schedule but does not manage complex task dependencies like Airflow.
In an Apache Beam streaming pipeline on Dataflow, what is the purpose of windowing?
Correct answer: B. Windowing divides an unbounded stream into finite intervals (for example, fixed or sliding time windows) so that aggregations like counts and sums can be computed over each interval. It is not an encryption mechanism; it has nothing to do with IAM role assignment; and although it sounds similar to BigQuery partitioning, windowing is a Beam streaming concept, not table partitioning.
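The idea behind fixed windows can be sketched without Beam at all. The function below is a plain-Python illustration, not the Beam API: it assigns each event timestamp to a 60-second window and counts events per window. The timestamps and the function name are hypothetical.

```python
from collections import Counter

def fixed_window_counts(event_times, window_seconds):
    """Count events per fixed window, keyed by the window's start time."""
    counts = Counter()
    for t in event_times:
        # Integer division maps every timestamp to the start of its window.
        window_start = (t // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(sorted(counts.items()))

# Hypothetical event timestamps, in seconds since the stream started.
events = [1, 5, 59, 60, 61, 119, 125]
print(fixed_window_counts(events, window_seconds=60))  # {0: 3, 60: 3, 120: 1}
```

A real streaming pipeline also has to decide when a window is complete and what to do with late data, which this batch-style sketch deliberately leaves out.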
A pipeline's Dataflow job authenticates to BigQuery and Cloud Storage. Following least privilege, how should it be granted access?
Correct answer: C. A dedicated service account scoped to only the roles the pipeline needs follows least privilege and keeps the identity independent of any person. Personal credentials break when the user leaves and over-grant access; the Owner role is far broader than required and dangerous; and making data public exposes it to everyone, the opposite of least privilege.
An organisation wants to discover, catalogue and govern data quality across multiple BigQuery datasets and Cloud Storage data lakes from one place. Which service fits?
Correct answer: A. Dataplex is the data fabric for organising, governing, discovering and managing the quality of data across lakes and warehouses, including a unified catalogue. Dataproc is a Spark/Hadoop processing service; Cloud Composer orchestrates pipelines; and Pub/Sub is messaging - none of these provide cross-source data discovery and governance.
A team manages many interdependent SQL transformations in BigQuery and wants version control, testing and dependency management for them (ELT). Which tool is designed for this?
Correct answer: B. Dataform manages SQL-based ELT workflows in BigQuery with dependencies, testing and version control, so transformations are maintainable as code. Dataprep is a visual, no-code cleaning tool without software-style dependency management; Bigtable is a NoSQL store, not a transformation framework; and Cloud SQL is a transactional database, not a BigQuery ELT tool.
A pipeline loads a daily file from an external system, transforms it, and writes to BigQuery once per day. Which processing model is most appropriate?
Correct answer: C. A predictable, once-a-day file is a classic batch workload, so a scheduled batch job is the simplest and most cost-effective choice. Streaming infrastructure is unnecessary overhead for a single daily file; continuous Bigtable writes do not match a batch file load into a warehouse; and an always-on streaming job wastes resources for data that arrives once per day.
Cloud Storage holds raw landing data that is rarely accessed after 90 days but must be retained for years. What reduces storage cost automatically?
Correct answer: D. Object lifecycle rules can automatically move objects to colder, cheaper storage classes (such as Nearline or Coldline) after a set age, cutting cost for rarely accessed data. Copying into BigQuery adds warehouse storage cost rather than reducing object-storage cost; clustering is a BigQuery feature, not a Cloud Storage one; and a multi-region bucket generally increases cost, not decreases it.
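A lifecycle rule of this kind is usually expressed as a small JSON document pairing an action with an age condition. The sketch below builds such a document in Python. The 90-day threshold comes from the question; the 365-day move to Coldline and the helper name are assumptions added for illustration, and the exact schema should be checked against the Cloud Storage documentation before use.

```python
import json

def lifecycle_config(transitions):
    """Build a lifecycle configuration from (age_in_days, storage_class) pairs."""
    return {
        "rule": [
            {
                "action": {"type": "SetStorageClass", "storageClass": storage_class},
                "condition": {"age": age_days},
            }
            for age_days, storage_class in transitions
        ]
    }

# 90 days is from the scenario; the 365-day Coldline step is an assumed example.
config = lifecycle_config([(90, "NEARLINE"), (365, "COLDLINE")])
print(json.dumps(config, indent=2))
```

Because the rules run inside the storage service, no pipeline or scheduled job is needed to move ageing objects.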
A streaming pipeline must not lose messages if a downstream consumer is briefly unavailable. Which Pub/Sub behaviour supports this?
Correct answer: D. Pub/Sub retains messages and redelivers any that are not acknowledged within the deadline, so a brief consumer outage does not lose data. Messages are not deleted on publish - they wait for acknowledgement; standard Pub/Sub provides at-least-once delivery and does require acknowledgements; and messages are durably stored by the service, not held only in publisher memory.
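The acknowledge-or-redeliver behaviour can be mimicked with a small in-memory class. This is a toy stand-in, not the Pub/Sub client library: `ToySubscription` and its methods are hypothetical, and the acknowledgement deadline is simplified to "anything unacknowledged is delivered again on the next pull".

```python
class ToySubscription:
    """In-memory stand-in for a subscription; not the Pub/Sub client API."""

    def __init__(self):
        self._unacked = {}  # message_id -> payload, kept until acknowledged
        self._next_id = 0

    def publish(self, payload):
        self._unacked[self._next_id] = payload
        self._next_id += 1

    def pull(self):
        """Deliver every message that has not been acknowledged yet."""
        return list(self._unacked.items())

    def ack(self, message_id):
        self._unacked.pop(message_id, None)

sub = ToySubscription()
sub.publish("event-1")
sub.publish("event-2")

# First attempt: the consumer crashes before acknowledging anything.
first_delivery = sub.pull()

# After recovery the same messages are delivered again and acknowledged.
second_delivery = sub.pull()
for message_id, _ in second_delivery:
    sub.ack(message_id)

print(first_delivery == second_delivery, sub.pull())  # True []
```

The same mechanism is why at-least-once consumers should tolerate duplicates: a message processed just before a crash is delivered again afterwards.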
You must choose between Cloud SQL and Spanner for a relational workload that needs strong consistency and horizontal scaling across regions. Which fits, and why?
Correct answer: C. Spanner is the relational database that combines strong consistency with horizontal scaling across regions, which is exactly the stated need. Cloud SQL is a managed single-instance relational database that does not scale horizontally across regions like Spanner; Bigtable is NoSQL, not relational; and BigQuery is an analytical warehouse, not a transactional relational database.
A Dataflow streaming pipeline is falling behind and system lag is growing. Where do you first look to diagnose throughput and lag?
Correct answer: B. Cloud Monitoring and Logging expose Dataflow job metrics such as system lag, throughput and worker utilisation, which are the first place to diagnose a backlog. BigQuery query history concerns warehouse queries, not the streaming job; Cloud Storage access logs track object access, not pipeline lag; and IAM bindings govern permissions, not throughput.
An analytics team should be able to query a specific BigQuery dataset but must not modify or delete it. Which approach grants the right access?
Correct answer: A. Granting a read-only data viewer role on the dataset lets the team query it without rights to modify or delete, matching least privilege. The BigQuery Admin role allows changes and deletion, far more than needed; sharing a Cloud Storage bucket does not grant access to the BigQuery dataset; and project Editor is a broad role that permits modifications well beyond querying.
A pipeline must repeat the same multi-step ETL every night and rerun failed steps automatically. Which design best automates this?
Correct answer: D. A Cloud Composer (Airflow) DAG scheduled nightly models the step dependencies and supports automatic retries for failed tasks, which is exactly what repeatable, resilient automation needs. Manual triggering is not automation and is error-prone; a single Dataflow job with no scheduling does not orchestrate multiple steps or schedule itself; and Bigtable is a data store, not a workflow scheduler.
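What a DAG with retries buys you can be shown with a minimal runner built on the standard library. This is not Airflow code: `run_dag`, the task functions and the retry count are hypothetical, and scheduling, logging and alerting are left out. It only demonstrates dependency ordering and automatic retry of a failed step.

```python
from graphlib import TopologicalSorter

def run_dag(tasks, dependencies, max_retries=2):
    """Run tasks in dependency order, retrying a failed task up to max_retries times."""
    attempts = {}
    for name in TopologicalSorter(dependencies).static_order():
        for attempt in range(1, max_retries + 2):
            attempts[name] = attempt
            try:
                tasks[name]()
                break
            except RuntimeError:
                if attempt == max_retries + 1:
                    raise  # retries exhausted: stop the run

    return attempts

calls = {"transform": 0}

def extract():
    pass

def transform():
    # Fails on the first call only, to simulate a transient error.
    calls["transform"] += 1
    if calls["transform"] == 1:
        raise RuntimeError("transient failure")

def load():
    pass

tasks = {"extract": extract, "transform": transform, "load": load}
# Each key maps to the set of tasks that must finish before it starts.
dependencies = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}

attempts = run_dag(tasks, dependencies)
print(attempts)  # {'extract': 1, 'transform': 2, 'load': 1}
```

The transform step succeeds on its second attempt and the load step still runs afterwards, which is the behaviour a nightly orchestrated pipeline relies on.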
Which statement best captures when to choose Dataflow over Dataproc?
Correct answer: A. Dataflow is preferred when you want serverless, autoscaling Apache Beam pipelines with no clusters to manage, for both batch and streaming. Existing Hive scripts point toward Dataproc, not Dataflow; Dataproc can process streaming workloads (for example via Spark Streaming), so the claim that it cannot is false; and BigQuery can be written to from several services, so exclusivity is not the reason.
A dataset contains personal data subject to regional regulations requiring it stay in the EU. Which design choice helps meet this requirement?
Correct answer: A. Placing storage and processing in EU regions or locations supports data-residency requirements that the data remain in the EU. Making the dataset public ignores the compliance obligation and exposes personal data; using a US location violates an EU residency requirement; and disabling logging harms auditability and security without addressing residency.
A BigQuery dashboard reruns the same expensive aggregation many times an hour with little change in underlying data. What reduces repeated query cost?
Correct answer: D. A materialised view precomputes and incrementally refreshes the aggregation, so the dashboard reads a small, ready result instead of rescanning the base table each time. Re-importing into Bigtable does not serve SQL aggregations and adds work; increasing refresh frequency raises cost rather than lowering it; and removing partitioning makes scans larger and more expensive.
An IoT system sends millions of events per second that must be buffered before processing. Which architecture handles ingestion at this scale?
Correct answer: C. Publishing to Pub/Sub absorbs and decouples millions of events per second, and Dataflow then processes and loads them into BigQuery - a standard high-scale streaming pattern. Cloud SQL cannot sustain that ingest rate; row-by-row direct writes to BigQuery do not provide buffering and are inefficient at this scale; and one object per event in Cloud Storage creates unmanageable object counts and poor query performance.
Which design supports data portability so pipelines could move between environments with less rework?
Correct answer: B. Using the open Apache Beam model (run by Dataflow) and parameterising environment-specific values keeps pipelines portable and easier to move between environments. Hard-coding resource names ties the pipeline to one project; manual console clicks are not reproducible or portable; and an undocumented proprietary format makes portability and maintenance harder.
A nightly batch load into BigQuery occasionally fails midway, leaving partial data. What is the most robust way to keep loads correct?
Correct answer: D. Loading into a staging table and swapping into production only on success prevents partial or corrupt data from being exposed to consumers - a standard idempotent-load pattern. Ignoring failures leaves bad data live until the next run; disabling retries makes transient failures worse, not safer; and switching to Bigtable changes the data model without solving the partial-load problem.
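The staging-then-swap pattern can be tried locally with SQLite standing in for the warehouse. This sketch is an analogy, not BigQuery code: the table names, the schema and `load_with_staging` are hypothetical, and a NOT NULL violation is used to simulate a load that fails midway.

```python
import sqlite3

def load_with_staging(conn, rows):
    """Load rows into a staging table, then replace production only if every row loaded."""
    conn.execute("DROP TABLE IF EXISTS staging")
    conn.execute("CREATE TABLE staging (id INTEGER NOT NULL, amount REAL NOT NULL)")
    try:
        conn.executemany("INSERT INTO staging VALUES (?, ?)", rows)
    except sqlite3.IntegrityError:
        conn.rollback()
        return False  # production is left untouched
    # Swap: replace production contents in a single transaction.
    with conn:
        conn.execute("DELETE FROM production")
        conn.execute("INSERT INTO production SELECT * FROM staging")
    return True

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE production (id INTEGER NOT NULL, amount REAL NOT NULL)")
conn.execute("INSERT INTO production VALUES (1, 10.0)")
conn.commit()

query = "SELECT id, amount FROM production ORDER BY id"

ok_first = load_with_staging(conn, [(2, 20.0), (3, None)])  # second row is invalid
after_failure = conn.execute(query).fetchall()

ok_second = load_with_staging(conn, [(2, 20.0), (3, 30.0)])
after_success = conn.execute(query).fetchall()

print(ok_first, after_failure)   # False [(1, 10.0)]
print(ok_second, after_success)  # True [(2, 20.0), (3, 30.0)]
```

Consumers reading the production table see either yesterday's complete data or today's complete data, never a half-loaded mix, and rerunning the load after a failure is safe.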
Which option best describes the difference between ETL and ELT in a BigQuery context?
Correct answer: A. ETL transforms data before loading it into the destination, while ELT loads raw data first and then transforms it inside the warehouse - common with BigQuery's scale and tools like Dataform. The two are not identical, since the order of transform and load differs; the claim that ELT transforms before loading while ETL never transforms reverses the definitions; and neither approach is tied exclusively to streaming or batch.
A BigQuery cost report shows runaway on-demand spend from frequent SELECT * queries on a very wide table. Which change reduces cost most directly?
Correct answer: B. Because BigQuery is columnar and on-demand pricing is based on bytes scanned, selecting only the needed columns reads far less data than SELECT * on a wide table, cutting cost directly. Adding slots changes capacity but does not reduce data scanned on-demand; wrapping the table in a view without changing the queries still scans all columns; and table expiration affects retention, not query cost.
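The arithmetic behind this answer is easy to reproduce. The sketch below models a columnar table as a mapping from column name to stored bytes; the column names and sizes are invented for illustration and are not taken from any real table.

```python
# Hypothetical wide table: bytes stored per column across all rows.
COLUMN_BYTES = {
    "event_id": 8_000,
    "event_date": 4_000,
    "customer_id": 8_000,
    "payload_json": 900_000,
    "user_agent": 80_000,
}

def bytes_scanned(selected_columns):
    """In a columnar layout, a query reads only the columns it references."""
    return sum(COLUMN_BYTES[name] for name in selected_columns)

select_star = bytes_scanned(COLUMN_BYTES)              # every column, as SELECT * would
narrow = bytes_scanned(["event_date", "customer_id"])  # only the two columns needed
print(select_star, narrow)  # 1000000 12000
```

In this example a single large column dominates the table, so naming the two columns actually needed cuts the scan from 1,000,000 bytes to 12,000.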
A real-time fraud system needs sub-10-millisecond reads of a user's recent activity by user ID, at very high request rates. Which store and key design fit?
Correct answer: C. Bigtable delivers very low-latency, high-throughput key-based reads, and a well-designed row key on user ID gives fast single-user lookups for a fraud system. BigQuery is built for analytical scans, not single-digit-millisecond point reads at high request rates; Cloud Storage object reads are far too slow for this; and a single Cloud SQL table will not sustain that latency and throughput at scale.
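One possible row-key layout for this scenario puts the user ID first and a reversed timestamp second, so that each user's rows are contiguous and the newest events sort first. The sketch below is an assumption about what a "well-designed" key might look like here, not the only valid design; a sorted Python list stands in for the store's lexicographic key order, and all names are hypothetical.

```python
MAX_TS_MILLIS = 10**13  # assumed upper bound, larger than any timestamp used here

def row_key(user_id, event_ts_millis):
    """Build 'user#reversed_timestamp' so a user's newest events sort first."""
    reversed_ts = MAX_TS_MILLIS - event_ts_millis
    return f"{user_id}#{reversed_ts:013d}"  # zero-padded so string order matches numeric order

def recent_activity(sorted_keys, user_id, limit):
    """Prefix scan: take the first `limit` keys that start with the user's prefix."""
    prefix = f"{user_id}#"
    return [k for k in sorted_keys if k.startswith(prefix)][:limit]

keys = sorted([
    row_key("user42", 1_700_000_000_000),
    row_key("user42", 1_700_000_005_000),
    row_key("user7", 1_700_000_001_000),
    row_key("user42", 1_700_000_009_000),
])

print(recent_activity(keys, "user42", limit=2))
# ['user42#8299999991000', 'user42#8299999995000']
```

The two keys returned correspond to the two most recent events for user42, which is the access pattern a fraud check on recent activity needs.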
An orchestration needs to run a Dataflow job, wait for it to finish, then run a BigQuery transformation, and alert on failure. Which tool models these dependencies best?
Correct answer: A. Cloud Composer (Airflow) lets you express ordered tasks and dependencies in a DAG, wait for completion, chain the next task, and alert on failure. A single cron entry firing both at once ignores the dependency that the transformation must wait for the Dataflow job; manual checking is not automation; and Bigtable is a data store, not a workflow engine.
Different teams must be able to find, understand and trust shared datasets across the organisation. Which capability most directly enables this?
Correct answer: B. A data catalogue and governance layer such as Dataplex provides searchable metadata, tags and quality signals so teams can discover and trust shared data. Turning off audit logging reduces accountability and does nothing for discovery; granting everyone Owner is a serious security risk and not a discovery feature; and private spreadsheets fragment data and prevent organisation-wide discovery.
Practice questions FAQ
- Are these real PDE exam questions?
- No. These are original study questions written to test understanding. They are not real exam questions, exam dumps, or copied from any provider.
- How should I use these practice questions?
- Answer each one, read the explanation (including why the wrong options are wrong), and use your per-domain score to focus your revision on weak areas. Revisit the questions before exam day.
- How many questions should I do before the exam?
- Enough to score consistently across every domain, alongside full-length practice from official or reputable providers. Understanding why each answer is right matters more than raw volume.
- What score means I am ready?
- A good signal is consistently scoring around 80% or higher across all domains on questions you have not seen before, and being able to explain why the wrong options are wrong.
- Should I use exam dumps?
- No. Dumps (real or leaked questions) breach provider policy, can void your certification, and do not build the understanding the exam actually tests.