Plain-English definitions of the core Google Cloud terms for Professional Data Engineer study. Simplified for learning; Google Cloud documentation is authoritative.
| Term | Definition |
|---|---|
| BigQuery | Serverless, columnar data warehouse for SQL analytics over large datasets. |
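| BigQuery slot | A unit of compute capacity that BigQuery uses to run queries. On-demand pricing bills by bytes processed; capacity pricing bills for slots held in reservations. |
| Partitioning | Splitting a BigQuery table by a date/timestamp column, ingestion time or integer range so queries that filter on the partition column scan less data. |
| Clustering | Sorting a table's data (within each partition, if the table is partitioned) by up to four chosen columns to cut bytes scanned. |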
| Materialised view | A precomputed, auto-refreshed query result that speeds up frequent queries. |
| Bigtable | Wide-column NoSQL store for high-throughput, low-latency key-based access. |
| Row key | The single indexed key for each Bigtable row and its primary access path; its design decides read/write performance and whether hotspots form. |
| Cloud Storage | Object storage for files and as a data-lake layer; has storage classes and lifecycle rules. |
| Cloud SQL | Managed relational database (MySQL, PostgreSQL, SQL Server) for transactional workloads. |
| Spanner | Globally distributed, strongly consistent relational database that scales horizontally. |
| Dataflow | Serverless managed service that runs Apache Beam batch and streaming pipelines. |
| Apache Beam | The unified programming model for batch and streaming pipelines that Dataflow runs. |
| Windowing | Grouping streaming data into time-based windows for aggregation in Beam/Dataflow. |
| Pub/Sub | Global messaging service for ingesting and decoupling event streams; delivery is at-least-once by default. |
| Dataproc | Managed Hadoop and Spark for running existing open-source big-data jobs. |
| Dataprep | A visual, no-code tool for exploring and cleaning data (Cloud Dataprep by Trifacta). |
| Dataform | A tool for managing SQL-based ELT transformations and workflows in BigQuery. |
| Cloud Composer | Managed Apache Airflow for orchestrating and scheduling data pipelines. |
| Airflow DAG | A directed acyclic graph of tasks defining a pipeline’s steps and dependencies in Composer. |
| Dataplex | A data fabric for organising, governing and discovering data across lakes and warehouses. |
| Data Catalog | The metadata and discovery service (part of Dataplex) for finding and tagging data. |
| Streaming vs batch | Processing events continuously as they arrive vs in scheduled bulk loads. |
| ETL / ELT | Extract-Transform-Load (transform before loading) vs Extract-Load-Transform (load raw data, then transform inside the warehouse). |
| IAM | Identity and Access Management: roles and permissions that control who can do what on which resources. |
| Service account | A non-human identity that pipelines and services use to authenticate to Google Cloud. |
| Cloud Monitoring / Logging | The services for metrics, dashboards, alerts and logs used to operate pipelines. |
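
The Windowing entry is easier to remember with a concrete case. The following is a minimal plain-Python sketch of fixed (tumbling) windows, not Beam or Dataflow code. The helper names `fixed_window_start` and `count_per_window`, the 60-second window size and the sample events are all assumptions made for illustration.

```python
from collections import defaultdict


def fixed_window_start(event_time_s, window_size_s):
    """Return the start (in seconds) of the fixed window containing the event."""
    return event_time_s - (event_time_s % window_size_s)


def count_per_window(events, window_size_s=60):
    """Count events per fixed window.

    `events` is an iterable of (event_time_seconds, payload) pairs.
    This is an illustrative helper, not part of any Google Cloud library.
    """
    counts = defaultdict(int)
    for event_time_s, _payload in events:
        counts[fixed_window_start(event_time_s, window_size_s)] += 1
    return dict(counts)


# Hypothetical events: (event time in seconds, payload).
events = [(5, "a"), (42, "b"), (61, "c"), (119, "d"), (120, "e")]
window_counts = count_per_window(events, window_size_s=60)
print(window_counts)  # {0: 2, 60: 2, 120: 1}
```

Each event is assigned to a window by its own event time, so an event at second 119 lands in the window that starts at 60 even if it is processed later. In the Beam Python SDK the comparable step is `beam.WindowInto(beam.window.FixedWindows(60))`; real pipelines also have to handle late data, which this sketch ignores.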