Google Cloud Professional Data Engineer: Study Guide

Level: expert

A practical, step-by-step plan to take PDE from "interested" to exam-ready - the mechanics, what to study in what order, how to practise, and how to know you are ready.

By The Exam Atlas Editorial Team · Verified 2026-06-07

Study plans by timeline

6-week intensive	With GCP data experience (~14-16 hrs/week): drill the five domains around service selection, then full-length mocks.
8-week balanced	The default (~10 hrs/week): roughly a week and a half per domain, hands-on in BigQuery and Dataflow, mocks at the end.
10-week steady	For those newer to GCP (~7 hrs/week): start with storage choices and BigQuery basics, then build up pipelines, governance and operations.

What to study, in order

Weeks 1-2	Storage choices: BigQuery vs Bigtable vs Cloud Storage vs Cloud SQL/Spanner; when each fits and why
Weeks 3-4	Pipelines: Dataflow (Apache Beam), Pub/Sub, Dataproc; batch vs streaming and the trade-offs
Weeks 5-6	BigQuery in depth: partitioning, clustering, cost control; orchestration with Cloud Composer (Airflow)
Weeks 7-8	Governance and operations: Dataplex, IAM, monitoring; then full-length timed practice

The Google Cloud Professional Data Engineer (PDE) certifies that you can take data end to end on Google Cloud: design the processing system, ingest and transform batch and streaming data, store it well, make it analysis-ready, and keep the workloads running. The thing that sets this exam apart from associate-level cloud certifications is that it is service-heavy and judgement-based: it rarely asks what a service is, and almost always asks which service you would choose for a situation, and why, reasoning about cost, latency, throughput and scale. That is why the fastest preparation is to build real pipelines in a free project rather than only reading, and why Google recommends real Google Cloud experience before sitting it.

This guide is a full self-study course. It walks through each of the five exam sections in depth, centres on the service-selection decisions that decide most results (BigQuery versus Bigtable, Dataflow versus Dataproc, streaming versus batch) and on BigQuery cost, then turns the content into a week-by-week plan, a final-week routine and an exam-day description. It is original teaching material and study guidance only. It contains no real or simulated exam questions, and you should always confirm the current sections, weights and service list against Google’s own Professional Data Engineer exam guide before you book, because Google revises this exam and adds services over time.

Chapter 1: Exam overview and how to use this guide

What the PDE actually measures

The PDE measures whether you can design, build, operationalise, secure and monitor data-processing systems on Google Cloud, the way a working data engineer does. In practice that means turning raw data into reliable, governed analytics: ingesting streams and batches, choosing the right storage, building pipelines, and automating the workloads that keep them running. It is a professional-level exam, not a beginner one: there is no formal prerequisite, but Google recommends roughly three or more years of industry experience including one or more years designing and managing solutions on Google Cloud, and that experience shows in the questions. If you are newer to GCP, the honest path is to build a foundation (often via the Associate Cloud Engineer) and real BigQuery and Dataflow practice first.

The exam is organised into five sections with official weights: Designing data processing systems at about 22%, Ingesting and processing the data at about 25%, Storing the data at about 20%, Preparing and using data for analysis at about 15%, and Maintaining and automating data workloads at about 18%. Two planning facts follow. First, ingesting and storing together are the largest and most service-heavy part, so the bulk of your building belongs there. Second, the two operational and governance sections (preparing/using and maintaining/automating) add up to a third of the exam, so IAM, governance, monitoring and orchestration are not optional.
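The weights above translate directly into a study budget. The following is a minimal sketch, assuming you simply split your hours in proportion to each section's weight; the 80-hour total (eight weeks at about ten hours a week) and the function name `allocate_hours` are illustrative assumptions, not anything Google prescribes.

```python
# Section weights as quoted in this guide (percent of the exam).
WEIGHTS = {
    "Designing data processing systems": 22,
    "Ingesting and processing the data": 25,
    "Storing the data": 20,
    "Preparing and using data for analysis": 15,
    "Maintaining and automating data workloads": 18,
}


def allocate_hours(total_hours: float) -> dict[str, float]:
    """Split a study budget across sections in proportion to exam weight."""
    total_weight = sum(WEIGHTS.values())
    return {
        name: round(total_hours * weight / total_weight, 1)
        for name, weight in WEIGHTS.items()
    }


if __name__ == "__main__":
    # Assumption: 8 weeks at roughly 10 hours a week, i.e. 80 hours in total.
    for section, hours in allocate_hours(80).items():
        print(f"{hours:5.1f} h  {section}")
```

A proportional split is only a starting point: move hours towards whichever sections your practice results show to be weakest.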

Format and the “pass or fail” scoring

The exam is 40 to 50 multiple-choice and multiple-select questions in two hours (120 minutes), per Google’s official exam guide; some third-party sites quote other numbers, but the official guide is the source to trust. It is taken online-proctored or at a test centre. Google does not publish a fixed passing score, and the result is reported only as pass or fail, so any percentage you see quoted is an unofficial estimate. The practical consequence is that you should aim for broad, consistent competence across all five sections rather than chasing a target number, because you cannot game a threshold you cannot see. The credential is valid for two years, after which you recertify (the recertification exam is cheaper than the first sitting; confirm current pricing before booking).

How to use this course

Read the chapters in order. The designing chapter sets up the service-selection thinking that the ingesting and storing chapters then apply, and the operational chapters build on all of them. Throughout, the most valuable habit is to convert reading into a one-line decision rule (“if the scenario says X, choose service Y because Z”), because that is the exact skill the exam tests. The final chapters turn the content into a schedule, a final-week routine and an exam-day plan. Short worked examples appear where a choice is easy to misread, but none of these are exam questions; they are teaching illustrations. Note that Google has broadened recent versions of this exam to include AI and machine-learning data preparation (feature engineering, and preparing unstructured data for embeddings and retrieval-augmented generation), so this guide reflects that current scope.

Chapter 2: Designing data processing systems (about 22%)

This section is the architectural backbone of the exam. Most of its questions are really “which design, and why”, so it is where you build the service-selection judgement that the rest of the exam relies on. It covers selecting services, and designing for security and compliance, reliability and fidelity, flexibility and portability, and data migrations.

Designing for security and compliance

A well-designed system is secure and compliant by construction. The exam expects you to reason about Identity and Access Management (Cloud IAM) and organisation policies, data security through encryption and key management, privacy (handling personally identifiable information), and regional considerations (data sovereignty) that constrain where data may live and be processed. It also expects awareness of designing project, dataset and table architecture for proper data governance, and of multi-environment patterns (separating development from production). The instinct to carry is that security and governance are design inputs, not afterthoughts: where data sits, who can reach it, and how it is encrypted shape the architecture.

Designing for reliability, fidelity, flexibility and portability

Reliability and fidelity cover preparing and cleaning data (with tools such as Dataform, Dataflow and Cloud Data Fusion), monitoring and orchestrating pipelines, disaster recovery and fault tolerance, decisions about ACID properties (atomicity, consistency, isolation, durability) and availability, and data validation. Flexibility and portability cover mapping current and future business requirements to the architecture and designing for portability across environments, including multi-cloud and data-residency requirements, plus data staging, cataloguing, profiling and discovery as part of governance. The teaching point is that a good design anticipates change and failure: it can recover, it validates what it ingests, and it does not paint itself into a corner that a future requirement cannot fit.

Designing data migrations

Real data engineering often means moving existing systems onto Google Cloud, so the exam includes designing data migrations: analysing current stakeholder needs, users, processes and technologies and planning a path to the desired state, then planning the migration and validation using services such as the BigQuery Data Transfer Service, Database Migration Service, Transfer Appliance, Google Cloud networking and Datastream (for change-data-capture style replication). You do not need deep operational detail on each, but you should recognise which migration tool fits which situation, for example a one-off bulk physical transfer versus ongoing change capture from an operational database.

Chapter 3: Ingesting and processing the data (about 25%)

This is the largest section, and it is where pipelines are built. It covers planning the pipelines, building them, and deploying and operationalising them, across both batch and streaming. The recurring exam decision here is which processing service, so this chapter centres on that.

Planning and building pipelines

Planning a pipeline means defining the data sources and sinks, the transformation and orchestration logic, the networking fundamentals, and data encryption in transit and at rest. Building it means data cleansing, choosing the right service, and handling transformations for both batch and streaming (including windowing and late-arriving data), plus processing logic, data acquisition and import, and integrating new sources. The current exam also includes AI data enrichment as part of processing, reflecting how pipelines increasingly feed machine-learning workloads.

Choosing the processing service

The central skill is matching the workload to the service. Dataflow runs Apache Beam pipelines, the unified model for both batch and streaming, and is serverless with no cluster to manage; it is usually the default choice for new transformation pipelines, especially streaming. Dataproc is managed Hadoop and Spark, the right choice when you must run existing open-source jobs (Spark, Hadoop ecosystem) or specific tooling rather than rewriting them in Beam. Pub/Sub is the ingestion and decoupling layer: global, at-least-once messaging that receives event streams and buffers them for downstream processing. Apache Kafka appears as an alternative messaging system to recognise. The clean decision rule the exam rewards: for serverless new pipelines (batch or streaming) think Dataflow; for lifting existing Spark/Hadoop think Dataproc; for ingesting and decoupling event streams think Pub/Sub.
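The decision rule above can be written down as a tiny function, which is a useful way to check that you can state it without hedging. This is a teaching sketch only: the `Workload` fields are hypothetical names invented here, and real scenarios carry more constraints than three booleans.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    # Hypothetical scenario signals, invented for this sketch.
    needs_event_ingestion: bool = False
    needs_transformation: bool = True
    has_existing_spark_or_hadoop_jobs: bool = False


def choose_processing_services(w: Workload) -> list[str]:
    """Apply the one-line rules from the paragraph above."""
    services = []
    if w.needs_event_ingestion:
        # Ingesting and decoupling event streams: think Pub/Sub.
        services.append("Pub/Sub")
    if w.needs_transformation:
        if w.has_existing_spark_or_hadoop_jobs:
            # Lifting existing Spark/Hadoop jobs: think Dataproc.
            services.append("Dataproc")
        else:
            # New serverless pipelines, batch or streaming: think Dataflow.
            services.append("Dataflow")
    return services


if __name__ == "__main__":
    print(choose_processing_services(Workload(needs_event_ingestion=True)))
    print(choose_processing_services(Workload(has_existing_spark_or_hadoop_jobs=True)))
```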

Streaming versus batch, and windowing

A recurring judgement is streaming versus batch. Streaming processes events continuously as they arrive (typically Pub/Sub into a Dataflow streaming pipeline), used when low latency matters. Batch processes data in scheduled bulk loads, used when periodic processing is acceptable and often simpler and cheaper. Windowing is the streaming concept of grouping unbounded event data into time-based windows (for example five-minute windows) so you can aggregate it, and handling late-arriving data is part of doing this correctly. As a teaching example: aggregating clickstream events every minute for a live dashboard is a streaming, windowed problem; recomputing yesterday’s totals overnight is a batch problem. Reading which one a scenario describes points you to the right architecture.
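To make windowing concrete, here is a minimal plain-Python sketch of fixed five-minute windows with a simple lateness cut-off. It is not Apache Beam code and does not model watermarks or triggers; the `allowed_lateness` parameter and the rule "drop anything that arrives more than that many seconds after its window closes" are simplifying assumptions made for illustration.

```python
from collections import Counter

WINDOW_SECONDS = 300  # five-minute fixed windows, as in the example above


def window_start(event_time: int) -> int:
    """Start (in seconds) of the fixed window that contains event_time."""
    return event_time - (event_time % WINDOW_SECONDS)


def count_per_window(events, allowed_lateness: int = 60):
    """Count events per window, dropping those that arrive too late.

    events: iterable of (event_time, arrival_time) pairs in seconds.
    Returns (counts keyed by window start, number of dropped late events).
    """
    counts = Counter()
    dropped = 0
    for event_time, arrival_time in events:
        start = window_start(event_time)
        window_end = start + WINDOW_SECONDS
        if arrival_time > window_end + allowed_lateness:
            dropped += 1  # arrived after the window was finalised
            continue
        counts[start] += 1  # grouped by when it happened, not when it arrived
    return dict(counts), dropped


if __name__ == "__main__":
    sample = [(10, 12), (290, 295), (310, 311), (20, 400)]
    print(count_per_window(sample))
```

The point to notice is that events are grouped by event time rather than arrival time, which is why late-arriving data needs an explicit policy at all.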

Deploying and operationalising

Pipelines have to run repeatably, so this section includes job automation and orchestration with Cloud Composer (managed Apache Airflow) and Workflows, and CI/CD for pipelines. Orchestration with Composer is how multi-step pipelines are scheduled and chained reliably, a theme that returns in the maintaining-and-automating section.

Chapter 4: Storing the data (about 20%)

This section is about choosing and tuning where data lives, and it is where cost and performance are most directly won or lost. It covers selecting storage systems, planning a data warehouse, using a data lake, and designing a data platform. The signature exam decision is which store, especially BigQuery versus Bigtable.

Choosing the storage service

Google Cloud offers many storage options, and the exam tests matching them to data access patterns. The ones to know: BigQuery, the serverless, columnar data warehouse for SQL analytics over large datasets, the default when the scenario is “analyse with SQL”. Bigtable, a wide-column NoSQL store for high-throughput, low-latency, key-based access such as time-series, IoT and ad-tech, the choice for “millions of reads/writes by key with low latency”. Cloud Storage, object storage for files and as a data-lake landing zone, with storage classes and lifecycle rules. Cloud SQL, managed relational databases (MySQL, PostgreSQL, SQL Server) for transactional workloads; Spanner, a globally distributed, strongly consistent relational database that scales horizontally; and the current guide also lists BigLake, AlloyDB, Firestore and Memorystore among the managed services to recognise. The single most important distinction to drill is BigQuery versus Bigtable: analytical SQL warehouse versus key-based NoSQL. The exam does not want depth in one, it wants the right choice per scenario.
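One way to drill these distinctions is to reduce them to a lookup from access pattern to store. The sketch below does that for the five core services named above; the `AccessPattern` labels are hypothetical shorthand invented here, and a real scenario will usually mix several signals.

```python
from enum import Enum


class AccessPattern(Enum):
    # Hypothetical labels; the exam describes these patterns in prose.
    SQL_ANALYTICS = "analyse large datasets with SQL"
    KEY_BASED_LOW_LATENCY = "high-throughput reads and writes by key"
    FILES_AND_DATA_LAKE = "files, objects and data-lake landing zone"
    TRANSACTIONAL = "relational transactional workload"
    TRANSACTIONAL_GLOBAL = "relational, strongly consistent, horizontally scaled"


# One-line rules taken from the paragraph above.
STORE_FOR = {
    AccessPattern.SQL_ANALYTICS: "BigQuery",
    AccessPattern.KEY_BASED_LOW_LATENCY: "Bigtable",
    AccessPattern.FILES_AND_DATA_LAKE: "Cloud Storage",
    AccessPattern.TRANSACTIONAL: "Cloud SQL",
    AccessPattern.TRANSACTIONAL_GLOBAL: "Spanner",
}


def choose_store(pattern: AccessPattern) -> str:
    """Return the default store for a dominant access pattern."""
    return STORE_FOR[pattern]


if __name__ == "__main__":
    for pattern in AccessPattern:
        print(f"{pattern.value:55s} -> {choose_store(pattern)}")
```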

Tuning BigQuery for cost and performance

Because BigQuery is central, knowing how to make it cheap and fast is heavily rewarded, and the questions are about cost and performance rather than SQL syntax. Partitioning splits a table by date or range so a query prunes to the relevant partitions and scans fewer rows. Clustering sorts data within partitions by chosen columns, which further cuts the bytes scanned. Bytes scanned is what drives on-demand cost, so reducing it reduces the bill directly. A materialized view precomputes a frequent query for speed, and slots and reservations offer capacity-based pricing as an alternative to on-demand for steady, heavy workloads. The instinct to carry: partition and cluster to scan less, and choose on-demand versus reservations by how predictable and heavy the workload is. As a teaching example, partitioning a large events table by day means a query for last week reads only seven partitions instead of the whole table, cutting both time and cost.
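The arithmetic behind the partitioning example is worth seeing once. The sketch below assumes daily partitions of equal size and ignores clustering and column pruning; the table size and the `price_per_tib` value are placeholders chosen for illustration, not real BigQuery prices, so check current pricing before using numbers like these.

```python
def scanned_fraction(days_queried: int, days_in_table: int) -> float:
    """Fraction of a day-partitioned table a date-filtered query reads.

    Assumes equally sized daily partitions and a filter on the partition column.
    """
    return min(days_queried, days_in_table) / days_in_table


def on_demand_cost(table_tib: float, fraction: float, price_per_tib: float) -> float:
    """On-demand cost is driven by bytes scanned: size x fraction x unit price."""
    return table_tib * fraction * price_per_tib


if __name__ == "__main__":
    TABLE_TIB = 10.0      # hypothetical table size
    PRICE_PER_TIB = 5.0   # placeholder unit price, NOT a real quote

    full_scan = on_demand_cost(TABLE_TIB, 1.0, PRICE_PER_TIB)
    last_week = on_demand_cost(TABLE_TIB, scanned_fraction(7, 365), PRICE_PER_TIB)
    print(f"unpartitioned scan: {full_scan:.2f}")
    print(f"last 7 of 365 days: {last_week:.2f}")
```

With a year of data, a seven-day query touches about 2% of the table, which is the whole case for partitioning on the column you filter by.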

Data warehouse, data lake and platform

Beyond single stores, this section covers planning a data warehouse (designing the data model, deciding the degree of normalisation, mapping business requirements, and supporting the expected access patterns), using a data lake (managing discovery, access and cost controls on data in Cloud Storage), and designing a data platform using tools such as Dataplex and Dataplex Catalog alongside BigQuery and Cloud Storage, including a federated governance model across distributed data. The teaching point is that storage design is not just picking a service but shaping how the whole platform is modelled, governed and accessed.

Chapter 5: Preparing and using data for analysis (about 15%); and maintaining and automating workloads (about 18%)

These two sections together make up a third of the exam, and they are where pipelines connect to the people who use the data and to the operations that keep them running. They are easy marks once you have actually operated a pipeline, so do not skip them.

Preparing data for analysis

This section makes data analysis-ready and accessible. Preparing data for visualisation includes connecting to tools, precalculating fields, using BigQuery features for business intelligence such as BI Engine and materialized views, and troubleshooting poorly performing queries. A notable, current addition is preparing data for AI and ML: feature engineering and training or serving models (for example with BigQuery ML), and preparing unstructured data for embeddings and retrieval-augmented generation (RAG), reflecting how data engineering now feeds AI workloads. Sharing data covers defining sharing rules, publishing datasets, publishing reports and visualisations, and BigQuery sharing via Analytics Hub. Security and access run through all of it, with IAM, data masking and Cloud Data Loss Prevention (Cloud DLP) controlling who sees what. The instinct to carry is that “using” data well means making it fast, governed and shareable for whoever consumes it, whether that is an analyst, a dashboard or a model.

Maintaining and automating data workloads

This section keeps the system running. Optimising resources means minimising cost per business need, ensuring enough resources for business-critical processes, and deciding between persistent or job-based clusters (for example ephemeral Dataproc clusters that spin up for a job and shut down). Designing automation and repeatability centres on Cloud Composer (creating directed acyclic graphs, or DAGs) and scheduling jobs in a repeatable way. Organising workloads by business requirements includes capacity management (such as BigQuery editions and reservations) and choosing interactive versus batch query jobs. Monitoring and troubleshooting relies on observability through Cloud Monitoring, Cloud Logging and the BigQuery admin panel, plus monitoring planned usage. As a teaching example: an unreliable nightly pipeline is usually an orchestration and monitoring problem, solved with a well-structured Composer DAG and alerting in Cloud Monitoring, not by throwing a bigger warehouse at it.
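The idea behind a DAG is simply "run each task only after everything it depends on has finished". The sketch below shows that ordering with Python's standard-library `graphlib` (Python 3.9 or later). It is not Airflow or Cloud Composer code, and the task names are hypothetical; it only illustrates the dependency model that a Composer DAG expresses.

```python
from graphlib import TopologicalSorter

# Hypothetical nightly pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "load_to_bigquery": {"land_files"},
    "transform": {"load_to_bigquery"},
    "data_quality_check": {"transform"},
    "publish_report": {"data_quality_check"},
}


def run_order(dag: dict[str, set[str]]) -> list[str]:
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(dag).static_order())


if __name__ == "__main__":
    for step, task in enumerate(run_order(PIPELINE), start=1):
        print(step, task)
```

An orchestrator adds what this sketch leaves out: schedules, retries, and alerting when a task fails.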

Why these sections reward hands-on practice

Both sections are far easier if you have done the work once: built a Composer DAG, set an IAM role to least privilege, watched a pipeline in Cloud Monitoring, and tuned a slow BigQuery query. That is why the study plan in the next chapter insists on operating a real pipeline rather than only reading about operations.

Chapter 6: Service selection, BigQuery cost, and study planning

Service selection is where the PDE is won or lost

Stepping back from the five sections, the meta-skill of this exam is service choice under constraints. The recurring decisions are BigQuery versus Bigtable (SQL analytics warehouse versus key-based NoSQL), Dataflow versus Dataproc (serverless Beam versus managed Hadoop/Spark for existing jobs), and streaming versus batch (Pub/Sub plus Dataflow streaming versus scheduled batch loads). For each pair, learn the two or three signals that tip the decision: query or access pattern, latency, throughput, cost, and whether you must run existing open-source tooling. If you can read a scenario and name the right service with a one-line justification, you are studying the exact thing that decides your result.

Learn BigQuery cost, not just BigQuery SQL

The other meta-skill is BigQuery economics. Know how partitioning prunes by date or range, how clustering sorts within partitions, and how both reduce the bytes scanned that drive on-demand cost; know when a materialized view helps and how slots and reservations differ from on-demand. The same instinct, choosing the cheaper, faster design for a given access pattern, generalises across the platform and is precisely the judgement the exam rewards.

Set up to practise and choose a timeline

The PDE is a practical exam, so keep local installs to a minimum, but do set up a real project using the Google Cloud free tier and the free BigQuery sandbox, and build hands-on throughout. A balanced plan runs about eight weeks at roughly ten hours a week: storage choices and BigQuery basics first, then ingestion and Dataflow pipelines, then Dataproc and orchestration, then governance and operations, finishing on sample questions and timed reviews. People with solid GCP data experience can compress to a six-week intensive at fourteen to sixteen hours a week drilling the five sections around service selection; those newer to GCP should stretch to ten weeks or more at about seven hours a week and build Associate-level knowledge and BigQuery/Dataflow practice first. To turn whichever timeline you pick into dated weeks for your own start date, use the free study-plan generator. If you are comparing data-engineering credentials across clouds, the Microsoft Fabric DP-700 vs Google Cloud Professional Data Engineer comparison sets out the differences.
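Turning a timeline into dated weeks is simple date arithmetic. The sketch below does it for the eight-week order given at the top of this guide, assuming four two-week blocks; the start date and the `dated_plan` helper are hypothetical and unrelated to any particular planning tool.

```python
from datetime import date, timedelta

# The eight-week order from "What to study, in order", as two-week blocks.
PLAN = [
    "Storage choices",
    "Pipelines: Dataflow, Pub/Sub, Dataproc",
    "BigQuery in depth and Cloud Composer",
    "Governance, operations and timed practice",
]


def dated_plan(start: date) -> list[tuple[date, date, str]]:
    """Return (first day, last day, topic) for each two-week block."""
    blocks = []
    for i, topic in enumerate(PLAN):
        first = start + timedelta(weeks=2 * i)
        last = first + timedelta(days=13)  # 14 days inclusive
        blocks.append((first, last, topic))
    return blocks


if __name__ == "__main__":
    # Hypothetical start date; replace with your own.
    for first, last, topic in dated_plan(date(2026, 1, 5)):
        print(f"{first} to {last}: {topic}")
```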

Chapter 7: Final preparation and revision

Build the full loop, then calibrate

The strongest final preparation is to build the whole loop at least once. Using the free tier and the BigQuery sandbox: land a public dataset in Cloud Storage, load and query it in BigQuery with partitioning and clustering, push events through Pub/Sub into a Dataflow streaming pipeline, and orchestrate a batch job with Cloud Composer. Add IAM roles at least privilege and a Dataplex view for governance, then set up basic Cloud Monitoring. Doing this once teaches more than re-reading the documentation, and it grounds the service-selection questions in real experience. Then work Google’s official sample questions to calibrate where you stand against the real exam style.

Consolidate the deciding judgement

In the closing week, consolidate rather than learn new material. Re-walk the service-selection boundaries (BigQuery versus Bigtable, Dataflow versus Dataproc, streaming versus batch) until each is a one-line rule; refresh BigQuery partitioning, clustering and bytes-scanned cost; and re-run the streaming and orchestration steps in your project so they are fresh. Run at least one full-length timed practice, treat each miss as a diagnosis of a weak section, and, because Google does not publish a passing score, aim to be comfortably and consistently correct across all five sections on fresh questions before you book.

Use only legitimate materials

Prepare from the official exam guide, Google Cloud documentation and training, the official sample questions, and your own hands-on project. Avoid sites offering recycled live questions; they breach Google’s certification policy and teach the wrong habits, recall instead of the design judgement the exam measures. Building real pipelines is both the honest route and the one that actually prepares you.

Chapter 8: Exam day and format

What to expect on the day

On the day, the exam is 40 to 50 multiple-choice and multiple-select questions in two hours (120 minutes), taken online-proctored or at a test centre. If you test online, arrive early to clear the system and identity checks before the clock starts. Expect scenario questions: a short situation describing data, requirements and constraints, then a choice about which service or design fits, with cost, latency and reliability trade-offs built into the options. The experience you built in a real project is what makes these readable.

Pacing and question style

With 40 to 50 questions in 120 minutes you have roughly 2.4 to 3 minutes per question, so read carefully and notice whether a question wants one answer or several. The wrong options are usually plausible-but-suboptimal services, so the discipline is to identify the constraint that matters most (latency, cost, existing tooling, access pattern) and let it decide. If a question is genuinely uncertain, make your best choice and move on; the time budget rewards momentum. Remember the result is pass or fail with no published threshold, so consistent competence across the five sections is the goal rather than a target score.
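The time budget is easy to work out in advance. This small sketch computes minutes per question for the stated range, with an optional reserve held back for reviewing flagged questions; the ten-minute reserve is an arbitrary assumption, not official guidance.

```python
def minutes_per_question(total_minutes: int, questions: int, review_reserve: int = 0) -> float:
    """Average minutes available per question after holding back a review reserve."""
    return (total_minutes - review_reserve) / questions


if __name__ == "__main__":
    for questions in (40, 50):
        plain = minutes_per_question(120, questions)
        with_reserve = minutes_per_question(120, questions, review_reserve=10)
        print(f"{questions} questions: {plain:.1f} min each, "
              f"{with_reserve:.1f} min with a 10-minute reserve")
```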

After the exam, recertification and next steps

Your result is reported as pass or fail after the session; check Google’s current process for when a provisional result is shown and when the outcome is confirmed. A pass earns the Professional Data Engineer credential, valid for two years; you keep it current by passing the recertification exam before it expires, so plan for that ongoing commitment and confirm the current recertification details and pricing in advance. From here, you might compare the PDE with the Microsoft Fabric (DP-700), Databricks and SnowPro Core data-engineering paths, or step sideways to the Professional Cloud Architect for broader GCP design. Confirm the current exam guide on the Google Cloud certification page before you book, since Google updates this exam and its service list periodically.

Domain by domain: what to master

Designing data processing systems: Selecting storage and processing services · Designing for reliability, security and compliance · Designing for flexibility, portability and cost
Ingesting and processing the data: Planning and building batch and streaming pipelines · Dataflow (Apache Beam), Pub/Sub and Dataproc · Orchestration with Cloud Composer (Airflow)
Storing the data: Selecting storage systems (BigQuery, Bigtable, Cloud Storage) · Planning for data warehouses and data lakes · Partitioning, clustering and storage optimisation
Preparing and using data for analysis: Preparing data for visualisation and sharing · Enabling analytics access and governance · Data discovery and quality (Dataplex)
Maintaining and automating data workloads: Optimising resources and managing workloads · Monitoring, logging and troubleshooting pipelines · Automating and repeating workloads

Key concepts to master

BigQuery: Serverless, columnar analytics warehouse for SQL over large datasets. Default choice for analytical queries; you pay for storage and bytes scanned (or slots).
Dataflow (Apache Beam): Fully managed service for batch and streaming pipelines written in the unified Apache Beam model. The exam's default for transformation pipelines, especially streaming.
Pub/Sub: Global, at-least-once messaging for ingesting and decoupling event streams. The usual front door for streaming data into Dataflow and BigQuery.
Bigtable: Wide-column NoSQL store for high-throughput, low-latency key-based access (time-series, IoT, ad-tech). Row-key design decides performance; it is not for ad-hoc SQL analytics.
Partitioning and clustering: BigQuery cost and speed levers. Partitioning prunes by date/range; clustering sorts within partitions. Together they cut bytes scanned, which cuts cost and latency.
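Because row-key design decides Bigtable performance, it helps to see one common time-series pattern spelled out. Bigtable stores rows sorted lexicographically by key, so the sketch below leads with a device identifier (to avoid every write landing on the newest timestamp) and appends a reversed, zero-padded timestamp so a device's newest readings sort first. The key layout, the `MAX_TS_MS` bound and the function name are illustrative assumptions, not a prescribed schema.

```python
# Hypothetical upper bound on millisecond timestamps, used only to reverse the order.
MAX_TS_MS = 10**13


def row_key(device_id: str, event_ms: int) -> str:
    """Build a key like 'sensor-42#8300000000000' for a time-series reading.

    Leading with device_id spreads writes across devices; the reversed,
    zero-padded timestamp makes newer readings sort before older ones.
    """
    reversed_ts = MAX_TS_MS - event_ms
    return f"{device_id}#{reversed_ts:013d}"


if __name__ == "__main__":
    older = row_key("sensor-42", 1_700_000_000_000)
    newer = row_key("sensor-42", 1_700_000_060_000)
    # Lexicographic order is what matters, so plain string sorting shows the effect.
    print(sorted([older, newer]))
```

A key that starts with the timestamp would do the opposite: all current writes would target the same narrow key range, which is the classic hotspot.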

What you should be able to do

By exam day, you should be able to:

Choose the right store for a scenario: BigQuery, Bigtable, Cloud Storage, Cloud SQL or Spanner
Design a streaming pipeline with Pub/Sub into Dataflow, and a batch pipeline into BigQuery
Decide between Dataflow and Dataproc for a given workload and justify it
Partition and cluster a BigQuery table and explain the cost and latency impact
Orchestrate and schedule pipelines with Cloud Composer (Airflow)
Apply IAM least-privilege and govern data with Dataplex; set up monitoring for pipelines

How to practise

Practise in a real project using the Google Cloud free tier and the free BigQuery sandbox. Land a public dataset in Cloud Storage, load and query it in BigQuery (partitioned and clustered), build a streaming path with Pub/Sub into Dataflow, and orchestrate a batch job with Cloud Composer (Airflow). Add IAM roles and a Dataplex view for governance. Then sit the official sample questions and full-length timed practice, reviewing the reasoning behind every miss.

Practise actively from early on - recall and apply, don't just re-read.
Each week, review the previous week's weak spots before moving on.
Do at least one full-length, timed mock near the end, then a second after fixing weak areas.
Warm up with our original PDE practice questions (concept checks, not exam dumps).

We never publish exam dumps or "real" questions. Use official practice and reputable providers for question banks.

Are you ready? (readiness checklist)

Google does not publish a pass mark (the result is pass or fail), so aim to score comfortably well on full-length, timed mocks - consistently, not once.
No more than one or two weak domains remain, and you know exactly which.
You can explain why the wrong options are wrong, not just spot the right one.
You've completed at least one full-length mock under real time pressure.
You could pass next week, not only on the day you crammed.

On exam day

40 to 50 multiple-choice and multiple-select questions in two hours, online-proctored or at a test centre. Expect scenario questions on service selection (BigQuery, Bigtable, Dataflow, Dataproc, Pub/Sub) and on cost, latency and reliability trade-offs. Confirm the current exam guide on the Google Cloud certification page.

Arrive early, or run the online-proctoring system check well ahead; have valid ID ready.
Budget your time per question and keep moving - don't sink minutes into one item.
Where the format allows, flag hard questions and return to them rather than stalling.
Read scenario questions twice: work out what is actually asked first.
Taper in the final days - light review and rest beat an all-nighter.

Common mistakes to avoid

Memorising service names without learning when to choose each - the exam is built around trade-offs such as BigQuery vs Bigtable and Dataflow vs Dataproc.
Reaching for Dataproc out of habit; Dataflow (serverless Beam) is usually preferred unless you must run existing Hadoop/Spark jobs or open-source tooling.
Ignoring cost: not knowing how partitioning, clustering and bytes-scanned drive BigQuery cost is a common gap the scenarios punish.
Treating Bigtable like a relational warehouse - it is key-based NoSQL, and a poor row-key design destroys performance.
Skipping governance and operations (IAM, Dataplex, monitoring, Cloud Composer), which span two domains worth a third of the exam.

Resource stack

Start with the free and official resources above. Paid courses and question banks help if you want structure, but they are optional, not required to pass.

What to study next

The PDE is the flagship Google Cloud data-engineering credential. Compare it with the Microsoft Fabric (DP-700), Databricks and SnowPro Core data-engineering paths, or step sideways to the Professional Cloud Architect for broader GCP design.

FAQ

How long does it take to study for the Professional Data Engineer exam?: With Google Cloud data experience, most people need roughly 6 to 10 weeks of focused study alongside hands-on practice. Without that base, plan for considerably longer and build BigQuery and Dataflow experience first.
How many questions are on the exam, and what is the passing score?: Google's exam guide lists 40 to 50 questions in two hours. Google does not publish a fixed passing score - the result is pass or fail - so aim for broad competence across all five domains rather than a target number.
Do I need to know how to code for the PDE?: You need working SQL for BigQuery and enough familiarity with the Apache Beam model (used by Dataflow) to reason about pipelines. It is not a software-engineering exam, but it assumes you can read and design data pipelines, not just name services.
BigQuery or Bigtable - which should I focus on?: Learn when to use each, because that distinction is exactly what the exam tests. BigQuery is the serverless warehouse for SQL analytics on large datasets; Bigtable is wide-column NoSQL for high-throughput, low-latency key-based access such as time-series or IoT. Choosing correctly per scenario matters more than depth in one.
Dataflow or Dataproc - what is the difference?: Dataflow is serverless and runs Apache Beam pipelines (batch and streaming) with no cluster to manage - usually the default. Dataproc is managed Hadoop and Spark, chosen when you must run existing open-source jobs or specific tooling. Picking the right one for a scenario is a recurring exam theme.

Sources

Google Cloud - Professional Data Engineer