PDE Flashcards
Free flashcards for the PDE (Google Cloud Professional Data Engineer) exam: flip each card to reveal the definition. They are built from the glossary as a study aid and are concept checks, not real exam questions.
All 26 terms
- BigQuery
- Serverless, columnar data warehouse for SQL analytics over large datasets.
- BigQuery slot
- A unit of compute capacity used to run queries; on-demand pricing bills by bytes processed, while capacity-based pricing bills for slots, typically through reservations.
- Partitioning
- Splitting a BigQuery table into segments by a date/timestamp column, ingestion time, or integer range so queries that filter on the partition column scan less data.
- Clustering
- Sorting a table's data (within each partition, if the table is partitioned) by chosen columns so filtered queries scan fewer bytes.
- Materialised view
- A precomputed, auto-refreshed query result that speeds up frequent queries.
- Bigtable
- Wide-column NoSQL store for high-throughput, low-latency key-based access.
- Row key
- Bigtable's only index; its design determines read/write performance.
- Cloud Storage
- Object storage for files and as a data-lake layer; has storage classes and lifecycle rules.
- Cloud SQL
- Managed relational database (MySQL, PostgreSQL, SQL Server) for transactional workloads.
- Spanner
- Globally distributed, strongly consistent relational database that scales horizontally.
- Dataflow
- Serverless, managed service that runs Apache Beam batch and streaming pipelines.
- Apache Beam
- The unified programming model for batch and streaming pipelines that Dataflow runs.
- Windowing
- Grouping streaming data into time-based windows for aggregation in Beam/Dataflow.
- Pub/Sub
- Global, at-least-once messaging for ingesting and decoupling event streams.
- Dataproc
- Managed Hadoop and Spark for running existing open-source big-data jobs.
- Dataprep
- A visual, no-code tool for exploring and cleaning data (Cloud Dataprep by Trifacta).
- Dataform
- A tool for managing SQL-based ELT transformations and workflows in BigQuery.
- Cloud Composer
- Managed Apache Airflow for orchestrating and scheduling data pipelines.
- Airflow DAG
- A directed graph of tasks defining a pipeline's steps and dependencies in Composer.
- Dataplex
- A data fabric for organising, governing and discovering data across lakes and warehouses.
- Data Catalog
- The metadata and discovery service (part of Dataplex) for finding and tagging data.
- Streaming vs batch
- Processing events continuously as they arrive vs in scheduled bulk loads.
- ETL / ELT
- Extract-Transform-Load (transform before loading) vs Extract-Load-Transform (load raw data, then transform in the warehouse).
- IAM
- Identity and Access Management: roles and permissions that control who can do what.
- Service account
- A non-human identity that pipelines and services use to authenticate to GCP.
- Cloud Monitoring / Logging
- The services for metrics, dashboards, alerts and logs used to operate pipelines.
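
As a study aid for the Windowing card, here is a minimal sketch of grouping events into fixed, time-based windows and aggregating each one. It is plain Python, not the Apache Beam API; the 60-second window size, the function name `fixed_window_counts`, and the sample events are all assumptions made for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed window size, chosen only for this example


def fixed_window_counts(events, window_seconds=WINDOW_SECONDS):
    """Count events per fixed window, keyed by the window's start time."""
    counts = defaultdict(int)
    for timestamp, _payload in events:
        # Align the event-time timestamp down to the start of its window.
        window_start = timestamp - (timestamp % window_seconds)
        counts[window_start] += 1
    return dict(sorted(counts.items()))


# Hypothetical (event_time_in_seconds, payload) pairs, arriving out of order.
events = [(5, "a"), (61, "b"), (42, "c"), (130, "d"), (119, "e")]
window_counts = fixed_window_counts(events)
print(window_counts)
```

Each event is assigned to a window by its own timestamp rather than by arrival order, which is why the out-of-order event at 42 seconds still lands in the first window.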
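
As a study aid for the Airflow DAG card, the sketch below models a pipeline as a directed graph of tasks and derives an order that respects the dependencies. It uses Python's standard-library `graphlib` (Python 3.9+), not the Airflow API; the task names and the helper `execution_order` are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "load": {"extract"},
    "transform": {"load"},
    "report": {"transform"},
    "notify": {"transform"},
}


def execution_order(deps):
    """Return one valid run order in which every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Tasks with no dependency between them (here `report` and `notify`) may appear in either order, and a scheduler could run them in parallel. A cycle in the graph raises `graphlib.CycleError`, which mirrors the requirement that a DAG be acyclic.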