Data & Analytics glossary

163 key terms and acronyms from across Data & Analytics certifications, in plain English. Definitions are simplified for learning; the official exam outlines are authoritative.

Aggregation: Rolling values up with SUM, AVG, COUNT and similar functions.
Airflow DAG: A directed acyclic graph of tasks defining a pipeline's steps and dependencies in Composer.
Apache Beam: The unified programming model for batch and streaming pipelines that Dataflow runs.
Apache Spark: The distributed engine underneath Databricks for processing large data in parallel.
Append: Stacking the rows of two queries with the same columns.
Applied Steps: The ordered list of transformations Power Query records and replays on refresh.
Auto Loader: Incremental ingestion of new files from cloud storage (the cloudFiles source).
BigQuery: Serverless, columnar data warehouse for SQL analytics over large datasets.
BigQuery slot: A unit of compute capacity that queries consume; you pay either per bytes scanned (on-demand) or for reserved slot capacity (reservations).
Bigtable: Wide-column NoSQL store for high-throughput, low-latency key-based access.
Bookmark: A saved state of a report page (filters, selection, visibility).
Calculated column: A DAX value computed per row and stored at refresh.
Calculated field: A new field defined by a formula.
Cardinality: The relationship type: one-to-many, one-to-one or many-to-many.
Catalog: The top level of the Unity Catalog namespace, containing schemas.
Cloud Composer: Managed Apache Airflow for orchestrating and scheduling data pipelines.
Cloud Monitoring / Logging: The services for metrics, dashboards, alerts and logs used to operate pipelines.
Cloud services layer: The brain that handles authentication, metadata, query optimisation and security.
Cloud SQL: Managed relational database (MySQL, PostgreSQL, SQL Server) for transactional workloads.
Cloud Storage: Object storage for files and as a data-lake layer; has storage classes and lifecycle rules.
Cluster / compute: The Spark compute resource that runs your code; either an all-purpose (interactive) cluster or a job cluster.
Clustering: Sorting data within partitions by chosen columns to cut bytes scanned.
Clustering key: A chosen column set that co-locates related rows so pruning is more effective on large tables.
Compute layer: The virtual warehouses that run queries and data loads.
Continuous: A field shown along an axis; the pill is green.
Continuous Data Protection (CDP): The umbrella for Time Travel, Fail-safe and cloning that guards data automatically.
COPY INTO: The command that bulk-loads files from a stage into a table, or unloads data out.
Credit: The unit Snowflake uses to bill compute; a running warehouse consumes credits by size and time.
Dashboard: A single view that combines multiple sheets, filters and actions.
Data blending: Linking two separate data sources on a shared field.
Data Catalog: The metadata and discovery service (part of Dataplex) for finding and tagging data.
Data model: The connected set of tables, relationships and measures behind a report.
Data pipeline: An orchestration item that copies data and runs activities in sequence.
Databricks Data Intelligence Platform: The lakehouse platform (built on Apache Spark and Delta Lake) for data engineering, analytics and ML.
Databricks Jobs / Workflows: The orchestration tool that schedules and runs tasks (notebooks, pipelines, scripts) on a timetable or trigger.
Databricks SQL: The SQL interface and warehouses for querying lakehouse data and building dashboards.
Dataflow: Managed service that runs Apache Beam batch and streaming pipelines, serverless.
Dataflows Gen2: The low-code, Power Query-based transform item for ingesting and shaping data.
Dataform: A tool for managing SQL-based ELT transformations and workflows in BigQuery.
Dataplex: A data fabric for organising, governing and discovering data across lakes and warehouses.
Dataprep: A visual, no-code tool for exploring and cleaning data (Cloud Dataprep by Trifacta).
Dataproc: Managed Hadoop and Spark for running existing open-source big-data jobs.
Dataset (semantic model): The published data model that reports connect to.
DAX: Data Analysis Expressions - the formula language for measures and columns.
Delta (Delta Lake): The open table format Fabric uses, adding transactions and versioning over Parquet files.
Delta Lake: The open table format adding ACID transactions, schema enforcement and time travel over files.
Delta table: A table stored in Delta Lake format; the default table type on Databricks.
Deployment pipeline: A tool to promote content across development, test and production stages.
Dimension: A field that slices the data (often blue, discrete), such as Region or Category.
Dimension table: A table of descriptive attributes (who, what, when, where).
Dimensional model: Fact and dimension tables shaped for analytics and reporting.
Direct Lake: A semantic-model mode that reads OneLake Delta tables directly, at import-like speed.
Discrete: A field shown as distinct headers; the pill is blue.
Drill-through: Navigation to a detail page filtered to the selected item.
Dynamic data masking: Hiding sensitive column values from unauthorised users at query time.
Edition: A Snowflake service tier (e.g. Standard, Enterprise) with different features and limits.
ETL / ELT: Extract-Transform-Load (transform before loading) vs Extract-Load-Transform (load first, then transform in the warehouse).
Eventhouse: Real-Time Intelligence storage that holds KQL databases.
Eventstream: A no-code item for capturing, transforming and routing streaming data.
Expectations: Data-quality rules in a declarative pipeline that warn on, drop, or fail the update for rows that break them.
External (unmanaged) table: A table pointing at data in a location you manage; dropping it leaves the files in place.
Extract: A saved snapshot (.hyper) of the data for faster, offline use.
Fact table: A table of events or transactions (the numbers you measure).
Fail-safe: A separate, Snowflake-managed 7-day recovery period after Time Travel ends.
File format: A named set of options (e.g. CSV, JSON) describing how files in a stage are parsed.
Filter context: The set of filters acting on a measure when it is evaluated.
Filters shelf: Where fields are placed to limit what the view shows.
FLATTEN: A function that expands nested arrays or objects into separate rows.
Full vs incremental load: Reloading all data versus loading only new or changed rows.
Gateway: A bridge that lets the Power BI service reach on-premises data.
Group: Combining selected members of a field into one category.
IAM: Identity and Access Management - roles and permissions that control who can do what.
Join: Combining tables at the row level on a shared key.
KQL: Kusto Query Language, used to query high-volume event and telemetry data.
Lakeflow Declarative Pipelines (DLT): The declarative pipeline framework (formerly Delta Live Tables) that builds and maintains tables for you.
Lakehouse: A Fabric item storing files and Delta tables; loaded with notebooks, read via a SQL endpoint.
Level of Detail (LOD): An expression (FIXED / INCLUDE / EXCLUDE) that sets the granularity of a calculation.
Lineage: The tracked flow of data from source to table to dashboard, surfaced by Unity Catalog.
Live connection: A connection that queries the source data directly each time, rather than reading from an extract.
M: The language behind Power Query transformations (mostly generated for you).
Managed table: A table whose data and metadata Databricks manages; dropping it deletes the underlying files.
Marks card: The panel that controls colour, size, label, detail, shape and tooltip.
Materialised view: A precomputed, auto-refreshed query result that speeds up frequent queries.
Materialized view: A precomputed, automatically maintained result set for faster repeated queries.
Measure: A DAX calculation evaluated at query time, used for aggregations.
Medallion architecture: Bronze (raw), silver (cleaned), gold (business-ready) data layering.
Merge: Joining two queries on a matching key, like a SQL join.
Metadata cache: Statistics in the services layer that answer some queries without scanning data.
Micro-partition: A small, immutable columnar storage unit Snowflake creates automatically for table data.
Microsoft Fabric: The unified analytics platform whose items (lakehouses, warehouses, notebooks and more) all sit over one data lake, OneLake.
Mirroring: Continuously replicating an external database into OneLake as Delta tables.
Multi-cluster shared-data architecture: The design that separates storage, compute and cloud services into independent layers.
Network policy: A rule that allows or blocks account access by IP address range.
Notebook: A code-first item (PySpark, Spark SQL) for transforming data at scale.
OneLake: The single, tenant-wide data lake every Fabric workspace and item shares.
OneLake shortcut: A pointer to data in another location, reused without copying it.
Pages shelf: Splits a view into a sequence you can step through by a field.
Parameter: A user-controllable value that can feed calculations, filters or reference lines.
Partitioning: Splitting a table's files by a column's values to speed up some queries.
Pearson VUE: The testing provider that delivers the SnowPro Core exam, online or at a test centre.
Power BI Desktop: The free authoring app where you prepare, model and build reports.
Power Query: The data-preparation engine for cleaning, transforming and combining data.
Privilege: A specific permission (e.g. SELECT, INSERT) granted on an object to a role.
Pruning: Skipping micro-partitions that cannot match a query, using their metadata, to speed it up.
Pub/Sub: Global, at-least-once messaging for ingesting and decoupling event streams.
PySpark: The Python API for Spark, used to read, transform and write data in code.
Reader account: A Snowflake-managed account a provider creates so a non-customer can read shared data.
Relationship: A link between two tables, defining how filters flow.
Results cache: A cache that returns the stored result of an identical query for 24 hours without using a warehouse, provided the underlying data has not changed.
Role: A container of privileges; RBAC grants privileges to roles, and roles to users.
Role-based access control (RBAC): Snowflake's security model: privileges flow through a hierarchy of roles.
Row key: Bigtable's primary access path; its design decides read/write performance.
Row-level security (RLS): Restricting which rows a user can see, by role.
Rows / Columns shelves: Where fields are placed to define the structure of the view.
Scaling out: Adding clusters (multi-cluster warehouse) to handle more concurrent queries.
Scaling up: Increasing a warehouse size (e.g. XS to L) for more power on a single, larger query.
Scheduled refresh: Automatic updates of a dataset from its source on a timetable.
Schema (database): A grouping of tables and views within a catalog.
Secure data sharing: Giving another account live, read-only access to objects with no data copied.
Secure view: A view that hides its definition and underlying detail, used for sharing sensitive data.
Semantic model: The published data model (dataset) that Power BI reports connect to.
Semi-structured data: Flexible data (JSON, Avro, Parquet) with no fixed table schema, queryable in Snowflake.
Sensitivity label: A governance tag that classifies and protects an item's data.
Sequence: An object that generates unique, increasing numbers, often for surrogate keys.
Service account: A non-human identity that pipelines and services use to authenticate to GCP.
Set: A custom subset of data, which can be dynamic and used in calculations.
Share: The object that defines what is shared and with which accounts.
Show Me: A helper that suggests chart types for the fields you have selected.
Slicer: An on-canvas control that filters the report for the viewer.
Snowflake AI Data Cloud: Snowflake's cloud platform for data storage, processing and sharing across clouds.
Snowflake Marketplace: A catalogue where providers publish data and services for others to access via sharing.
Snowpipe: Continuous, automated loading of files as they arrive, rather than in scheduled batches.
Spanner: Globally distributed, strongly consistent relational database that scales horizontally.
Spark: The distributed engine behind Fabric notebooks for large-scale transformation.
Spark SQL: Running SQL queries over data in the Spark engine and the lakehouse.
Spark structured streaming: Spark's API for processing streaming data in notebooks.
Stage: A location for data files; internal (in Snowflake) or external (e.g. an S3 bucket).
Star schema: The recommended model shape: fact tables surrounded by dimension tables.
Storage layer: Where table data is held as compressed, columnar micro-partitions in cloud storage.
Stored procedure: Procedural code that runs operations and logic on the server side.
Story: An ordered sequence of sheets or dashboards that tells a narrative.
Streaming vs batch: Processing events continuously as they arrive vs in scheduled bulk loads.
Structured Streaming: Spark's engine for incremental, continuous processing of data as it arrives.
System-defined role: A built-in role such as ACCOUNTADMIN, SYSADMIN, SECURITYADMIN or PUBLIC.
Table calculation: A calculation across the values already in the view (e.g. % of total).
Tableau Public: The free version of Tableau Desktop that publishes to the public web.
Time intelligence: DAX patterns for periods (year-to-date, prior year, and similar).
Time travel: Querying a previous version of a Delta table by version number or timestamp.
Tooltip: The pop-up detail shown when hovering over a mark.
Union: Stacking rows from tables that share the same columns.
Unity Catalog: The governance layer: a catalog.schema.table namespace with central permissions and lineage.
Unloading: Exporting data from Snowflake to files in a stage with COPY INTO.
User-defined function (UDF): A custom function (SQL, JavaScript, Python and more) that returns a value.
VARIANT: A data type that stores semi-structured data such as JSON within a column.
View: A saved query that presents data as a virtual table.
Virtual warehouse: An independent compute cluster that runs queries and loads, billed in credits while running.
Visual: A chart, table, card or map placed on a report page.
Warehouse: A Fabric item with a full T-SQL engine for set-based transformation and serving.
Warehouse cache: Local cached data on a running warehouse that speeds up repeated queries.
Windowing: Grouping streaming data into time-based windows for aggregation in Beam/Dataflow.
Windowing function: A time-based grouping (such as tumbling or hopping) over a stream.
Workspace: A container for Fabric items where a team collaborates and sets permissions.
Zero-copy clone: An instant copy of a table, schema or database that shares storage until data changes.
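
The Append / Union and Merge / Join entries above are easy to confuse: appending adds rows, while merging adds columns by matching on a key. The following is a minimal sketch in plain Python, not how Power Query, Tableau or SQL implement these operations. The sample tables, column names and the helper functions `append_rows` and `merge_rows` are all hypothetical, chosen only for illustration, and the merge assumes the key is unique in the right-hand table.

```python
# Hypothetical sample data; table and column names are illustrative only.
orders_2023 = [
    {"order_id": 1, "customer_id": "C1", "amount": 50},
    {"order_id": 2, "customer_id": "C2", "amount": 75},
]
orders_2024 = [
    {"order_id": 3, "customer_id": "C1", "amount": 20},
]
customers = [
    {"customer_id": "C1", "region": "North"},
    {"customer_id": "C2", "region": "South"},
]


def append_rows(first, second):
    """Append / Union: stack the rows of two tables that share the same columns."""
    return first + second


def merge_rows(left, right, key):
    """Merge / Join (inner): match rows on a shared key and widen each row."""
    # Build a lookup from key value to row; assumes `key` is unique in `right`.
    lookup = {row[key]: row for row in right}
    # Keep only left rows that have a match, combining the columns of both rows.
    return [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]


all_orders = append_rows(orders_2023, orders_2024)
orders_with_region = merge_rows(all_orders, customers, "customer_id")

print(len(all_orders))        # more rows, same columns
print(orders_with_region[0])  # same rows, extra "region" column
```

Appending leaves the column set unchanged and increases the row count; merging keeps one output row per matched input row and adds the columns from the second table.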