PDE Glossary: Core GCP Data Engineering Terms Defined Simply

Plain-English definitions of the core Google Cloud terms for Professional Data Engineer study. Simplified for learning; Google Cloud documentation is authoritative.

Term	Definition
BigQuery	Serverless, columnar data warehouse for SQL analytics over large datasets.
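BigQuery slot	A unit of compute capacity (a virtual CPU) that BigQuery uses to run queries. On-demand pricing bills by bytes scanned; capacity-based pricing bills for reserved slots.
Partitioning	Splitting a BigQuery table into segments by a date/timestamp column, ingestion time or an integer range, so queries that filter on that column scan less data.
Clustering	Sorting data within a table (or within each partition) by up to four chosen columns to cut the bytes scanned by queries that filter on them.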
Materialised view	A precomputed, auto-refreshed query result that speeds up frequent queries.
Bigtable	Wide-column NoSQL store for high-throughput, low-latency key-based access.
Row key	Bigtable’s primary access path; its design decides read/write performance.
Cloud Storage	Object storage for files, often used as a data-lake layer; offers storage classes and lifecycle rules.
Cloud SQL	Managed relational database (MySQL, PostgreSQL, SQL Server) for transactional workloads.
Spanner	Globally distributed, strongly consistent relational database that scales horizontally.
Dataflow	Serverless managed service that runs Apache Beam batch and streaming pipelines.
Apache Beam	The unified programming model for batch and streaming pipelines that Dataflow runs.
Windowing	Grouping streaming data into time-based windows for aggregation in Beam/Dataflow.
Pub/Sub	Global messaging service with at-least-once delivery by default, used for ingesting and decoupling event streams.
Dataproc	Managed Hadoop and Spark for running existing open-source big-data jobs.
Dataprep	A visual, no-code tool for exploring and cleaning data (Cloud Dataprep by Trifacta).
Dataform	A tool for managing SQL-based ELT transformations and workflows in BigQuery.
Cloud Composer	Managed Apache Airflow for orchestrating and scheduling data pipelines.
Airflow DAG	A directed graph of tasks defining a pipeline’s steps and dependencies in Composer.
Dataplex	A data fabric for organising, governing and discovering data across lakes and warehouses.
Data Catalog	The metadata and discovery service (part of Dataplex) for finding and tagging data.
Streaming vs batch	Processing events continuously as they arrive vs in scheduled bulk loads.
ETL / ELT	Extract-Transform-Load (transform before loading) vs Extract-Load-Transform (load first, then transform in the warehouse).
IAM	Identity and Access Management: roles and permissions that control who can do what.
Service account	A non-human identity that pipelines and services use to authenticate to GCP.
Cloud Monitoring / Logging	The services for metrics, dashboards, alerts and logs used to operate pipelines.
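
To make the Partitioning entry concrete, here is a minimal sketch in plain Python. It is not BigQuery code: the rows, column names and helper functions are hypothetical, and "rows scanned" stands in for the bytes BigQuery would bill. It only illustrates why filtering on the partitioning column reads less data.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows standing in for a table with a DATE column.
rows = [
    {"event_date": date(2024, 1, 1), "country": "DE", "amount": 10},
    {"event_date": date(2024, 1, 1), "country": "FR", "amount": 7},
    {"event_date": date(2024, 1, 2), "country": "DE", "amount": 3},
    {"event_date": date(2024, 1, 2), "country": "US", "amount": 8},
    {"event_date": date(2024, 1, 3), "country": "US", "amount": 5},
]

def partition_by_date(table):
    """Group rows into one bucket per event_date (a toy 'partitioned table')."""
    parts = defaultdict(list)
    for row in table:
        parts[row["event_date"]].append(row)
    return parts

def total_for_day(table, day):
    """Unpartitioned: every row is read, then filtered."""
    scanned = len(table)
    total = sum(r["amount"] for r in table if r["event_date"] == day)
    return total, scanned

def total_for_day_pruned(parts, day):
    """Partitioned: only the bucket for `day` is read."""
    partition = parts.get(day, [])
    return sum(r["amount"] for r in partition), len(partition)

partitions = partition_by_date(rows)
target = date(2024, 1, 2)
full_total, full_scanned = total_for_day(rows, target)
pruned_total, pruned_scanned = total_for_day_pruned(partitions, target)

print("full scan:", full_total, "rows scanned:", full_scanned)
print("pruned:   ", pruned_total, "rows scanned:", pruned_scanned)
```

Both queries return the same total, but the partitioned version touches only the rows for the requested day.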
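
The Windowing entry can be illustrated with a small sketch of fixed (tumbling) windows. This is plain Python rather than the Apache Beam API, and the function name, the event tuples and the 60-second window size are assumptions chosen for illustration. Real pipelines also deal with late data and triggers, which this sketch ignores.

```python
from collections import defaultdict

def fixed_windows(events, size_seconds):
    """Sum event values per fixed window.

    `events` is an iterable of (timestamp_seconds, value) pairs.
    Each event is assigned to the window starting at the largest
    multiple of `size_seconds` that is <= its timestamp.
    """
    totals = defaultdict(int)
    for ts, value in events:
        window_start = (ts // size_seconds) * size_seconds
        totals[window_start] += value
    return dict(sorted(totals.items()))

# Hypothetical events; note that arrival order differs from timestamp order.
events = [(0, 1), (30, 2), (61, 5), (125, 4), (59, 3)]
window_sums = fixed_windows(events, 60)
print(window_sums)  # {0: 6, 60: 5, 120: 4}
```

The event with timestamp 59 arrives last but still lands in the first window, because grouping is by event timestamp rather than arrival order.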
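
For the Airflow DAG entry, the sketch below shows the core idea of tasks plus dependencies resolving to a valid run order. It uses Python's standard-library `graphlib` (Python 3.9+), not Airflow itself, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "load_to_bigquery": {"extract"},
    "transform": {"load_to_bigquery"},
    "data_quality_check": {"transform"},
    "publish_report": {"transform"},
}

# static_order() yields tasks so that every task comes after its dependencies.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
```

The first three tasks always run in sequence; the last two depend only on `transform`, so a scheduler is free to run them in either order or in parallel. A cycle in the dependencies would raise `graphlib.CycleError`, which is why the graph must be acyclic.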

