Data & Analytics glossary
163 key terms and acronyms from across Data & Analytics certifications, in plain English. Definitions are simplified for learning; the official exam outlines are authoritative.
- Aggregation
- Rolling values up with SUM, AVG, COUNT and similar functions.
- Airflow DAG
- A directed acyclic graph of tasks defining a pipeline's steps and dependencies in Composer.
- Apache Beam
- The unified programming model for batch and streaming pipelines that Dataflow runs.
- Apache Spark
- The distributed engine underneath Databricks for processing large data in parallel.
- Append
- Stacking the rows of two queries with the same columns.
- Applied Steps
- The ordered list of transformations Power Query records and replays on refresh.
- Auto Loader
- Incremental ingestion of new files from cloud storage (the cloudFiles source).
- BigQuery
- Serverless, columnar data warehouse for SQL analytics over large datasets.
- BigQuery slot
- A unit of compute capacity; queries use slots, billed on-demand or via reservations.
- Bigtable
- Wide-column NoSQL store for high-throughput, low-latency key-based access.
- Bookmark
- A saved state of a report page (filters, selection, visibility).
- Calculated column
- A DAX value computed per row and stored at refresh.
- Calculated field
- A new field defined by a formula.
- Cardinality
- The relationship type: one-to-many, one-to-one or many-to-many.
- Catalog
- The top level of the Unity Catalog namespace, containing schemas.
- Cloud Composer
- Managed Apache Airflow for orchestrating and scheduling data pipelines.
- Cloud Monitoring / Logging
- The services for metrics, dashboards, alerts and logs used to operate pipelines.
- Cloud services layer
- The brain that handles authentication, metadata, query optimisation and security.
- Cloud SQL
- Managed relational database (MySQL, PostgreSQL, SQL Server) for transactional workloads.
- Cloud Storage
- Object storage for files and as a data-lake layer; has storage classes and lifecycle rules.
- Cluster / compute
- The Spark compute resource that runs your code; can be all-purpose (interactive) or job clusters.
- Clustering
- Sorting data within partitions by chosen columns to cut bytes scanned.
- Clustering key
- A chosen column set that co-locates related rows so pruning is more effective on large tables.
- Compute layer
- The virtual warehouses that run queries and data loads.
- Continuous
- A field shown along an axis; the pill is green.
- Continuous Data Protection (CDP)
- The umbrella for Time Travel, Fail-safe and cloning that guards data automatically.
- COPY INTO
- The command that bulk-loads files from a stage into a table, or unloads data out.
- Credit
- The unit Snowflake uses to bill compute; a running warehouse consumes credits by size and time.
- Dashboard
- A single view that combines multiple sheets, filters and actions.
- Data blending
- Linking two separate data sources on a shared field.
- Data Catalog
- The metadata and discovery service (part of Dataplex) for finding and tagging data.
- Data model
- The connected set of tables, relationships and measures behind a report.
- Data pipeline
- An orchestration item that copies data and runs activities in sequence.
- Databricks Data Intelligence Platform
- The lakehouse platform (built on Apache Spark and Delta Lake) for data engineering, analytics and ML.
- Databricks Jobs / Workflows
- The orchestration tool that schedules and runs tasks (notebooks, pipelines, scripts) on a timetable or trigger.
- Databricks SQL
- The SQL interface and warehouses for querying lakehouse data and building dashboards.
- Dataflow
- Managed service that runs Apache Beam batch and streaming pipelines, serverless.
- Dataflows Gen2
- The low-code, Power Query-based transform item for ingesting and shaping data.
- Dataform
- A tool for managing SQL-based ELT transformations and workflows in BigQuery.
- Dataplex
- A data fabric for organising, governing and discovering data across lakes and warehouses.
- Dataprep
- A visual, no-code tool for exploring and cleaning data (Cloud Dataprep by Trifacta).
- Dataproc
- Managed Hadoop and Spark for running existing open-source big-data jobs.
- Dataset (semantic model)
- The published data model that reports connect to.
- DAX
- Data Analysis Expressions - the formula language for measures and columns.
- Delta (Delta Lake)
- The open table format Fabric uses, adding transactions and versioning over Parquet files.
- Delta Lake
- The open table format adding ACID transactions, schema enforcement and time travel over files.
- Delta table
- A table stored in Delta Lake format; the default table type on Databricks.
- Deployment pipeline
- A tool to promote content across development, test and production stages.
- Dimension
- A field that slices the data (often blue, discrete), such as Region or Category.
- Dimension table
- A table of descriptive attributes (who, what, when, where).
- Dimensional model
- Fact and dimension tables shaped for analytics and reporting.
- Direct Lake
- A semantic-model mode that reads OneLake Delta tables directly, at import-like speed.
- Discrete
- A field shown as distinct headers; the pill is blue.
- Drill-through
- Navigation to a detail page filtered to the selected item.
- Dynamic data masking
- Hiding sensitive column values from unauthorised users at query time.
- Edition
- A Snowflake service tier (e.g. Standard, Enterprise) with different features and limits.
- ETL / ELT
- Extract-Transform-Load (transform before loading) vs Extract-Load-Transform (load first, then transform in the warehouse).
- Eventhouse
- Real-Time Intelligence storage that holds KQL databases.
- Eventstream
- A no-code item for capturing, transforming and routing streaming data.
- Expectations
- Data-quality rules in a declarative pipeline; rows that break them can be kept and logged, dropped, or made to fail the update.
- External (unmanaged) table
- A table pointing at data in a location you manage; dropping it leaves the files in place.
- Extract
- A saved snapshot (.hyper) of the data for faster, offline use.
- Fact table
- A table of events or transactions (the numbers you measure).
- Fail-safe
- A separate, Snowflake-managed 7-day recovery period after Time Travel ends.
- File format
- A named set of options (e.g. CSV, JSON) describing how files in a stage are parsed.
- Filter context
- The set of filters acting on a measure when it is evaluated.
- Filters shelf
- Where fields are placed to limit what the view shows.
- FLATTEN
- A function that expands nested arrays or objects into separate rows.
- Full vs incremental load
- Reloading all data versus loading only new or changed rows.
- Gateway
- A bridge that lets the Power BI service reach on-premises data.
- Group
- Combining selected members of a field into one category.
- IAM
- Identity and Access Management - roles and permissions that control who can do what.
- Join
- Combining tables at the row level on a shared key.
- KQL
- Kusto Query Language, used to query high-volume event and telemetry data.
- Lakeflow Declarative Pipelines (DLT)
- The declarative pipeline framework (formerly Delta Live Tables) that builds and maintains tables for you.
- Lakehouse
- A Fabric item storing files and Delta tables; loaded with notebooks, read via a SQL endpoint.
- Level of Detail (LOD)
- An expression (FIXED / INCLUDE / EXCLUDE) that sets the granularity of a calculation.
- Lineage
- The tracked flow of data from source to table to dashboard, surfaced by Unity Catalog.
- Live connection
- A connection that queries the source data directly each time, rather than reading from an extract.
- M
- The language behind Power Query transformations (mostly generated for you).
- Managed table
- A table whose data and metadata Databricks manages; dropping it deletes the underlying files.
- Marks card
- The panel that controls colour, size, label, detail, shape and tooltip.
- Materialised view
- A precomputed, auto-refreshed query result that speeds up frequent queries.
- Materialized view
- A precomputed, automatically maintained result set for faster repeated queries.
- Measure
- A DAX calculation evaluated at query time, used for aggregations.
- Medallion architecture
- Bronze (raw), silver (cleaned), gold (business-ready) data layering.
- Merge
- Joining two queries on a matching key, like a SQL join.
- Metadata cache
- Statistics in the services layer that answer some queries without scanning data.
- Micro-partition
- A small, immutable columnar storage unit Snowflake creates automatically for table data.
- Microsoft Fabric
- The unified analytics platform that holds all Fabric items (lakehouses, warehouses, notebooks and more) over one data lake.
- Mirroring
- Continuously replicating an external database into OneLake as Delta tables.
- Multi-cluster shared-data architecture
- The design that separates storage, compute and cloud services into independent layers.
- Network policy
- A rule that allows or blocks account access by IP address range.
- Notebook
- A code-first item (PySpark, Spark SQL) for transforming data at scale.
- OneLake
- The single, tenant-wide data lake every Fabric workspace and item shares.
- OneLake shortcut
- A pointer to data in another location, reused without copying it.
- Pages shelf
- A shelf that splits a view into a sequence of pages, one per value of a field, that you can step through.
- Parameter
- A user-controllable value that can feed calculations, filters or reference lines.
- Partitioning
- Splitting a table's files by a column's values to speed up some queries.
- Pearson VUE
- The testing provider that delivers the SnowPro Core exam, online or at a test centre.
- Power BI Desktop
- The free authoring app where you prepare, model and build reports.
- Power Query
- The data-preparation engine for cleaning, transforming and combining data.
- Privilege
- A specific permission (e.g. SELECT, INSERT) granted on an object to a role.
- Pruning
- Skipping micro-partitions that cannot match a query, using their metadata, to speed it up.
- Pub/Sub
- Global, at-least-once messaging for ingesting and decoupling event streams.
- PySpark
- The Python API for Spark, used to read, transform and write data in code.
- Reader account
- A Snowflake-managed account a provider creates so a non-customer can read shared data.
- Relationship
- A link between two tables, defining how filters flow.
- Results cache
- A cache that returns the stored result of an identical query without recomputing it, for 24 hours, using no warehouse.
- Role
- A container of privileges; RBAC grants privileges to roles, and roles to users.
- Role-based access control (RBAC)
- Snowflake's security model: privileges flow through a hierarchy of roles.
- Row key
- Bigtable's primary access path; its design decides read/write performance.
- Row-level security (RLS)
- Restricting which rows a user can see, by role.
- Rows / Columns shelves
- Where fields are placed to define the structure of the view.
- Scaling out
- Adding clusters (multi-cluster warehouse) to handle more concurrent queries.
- Scaling up
- Increasing a warehouse size (e.g. XS to L) for more power on a single, larger query.
- Scheduled refresh
- Automatic updates of a dataset from its source on a timetable.
- Schema (database)
- A grouping of tables and views within a catalog.
- Secure data sharing
- Giving another account live, read-only access to objects with no data copied.
- Secure view
- A view that hides its definition and underlying detail, used for sharing sensitive data.
- Semantic model
- The published data model (dataset) that Power BI reports connect to.
- Semi-structured data
- Flexible data (JSON, Avro, Parquet) with no fixed table schema, queryable in Snowflake.
- Sensitivity label
- A governance tag that classifies and protects an item's data.
- Sequence
- An object that generates unique, increasing numbers, often for surrogate keys.
- Service account
- A non-human identity that pipelines and services use to authenticate to GCP.
- Set
- A custom subset of data, which can be dynamic and used in calculations.
- Share
- The object that defines what is shared and with which accounts.
- Show Me
- A helper that suggests chart types for the fields you have selected.
- Slicer
- An on-canvas control that filters the report for the viewer.
- Snowflake AI Data Cloud
- Snowflake's cloud platform for data storage, processing and sharing across clouds.
- Snowflake Marketplace
- A catalogue where providers publish data and services for others to access via sharing.
- Snowpipe
- Continuous, automated loading of files as they arrive, rather than in scheduled batches.
- Spanner
- Globally distributed, strongly consistent relational database that scales horizontally.
- Spark
- The distributed engine behind Fabric notebooks for large-scale transformation.
- Spark SQL
- Running SQL queries over data in the Spark engine and the lakehouse.
- Spark structured streaming
- Spark's API for processing streaming data in notebooks.
- Stage
- A location for data files; internal (in Snowflake) or external (e.g. an S3 bucket).
- Star schema
- The recommended model shape: fact tables surrounded by dimension tables.
- Storage layer
- Where table data is held as compressed, columnar micro-partitions in cloud storage.
- Stored procedure
- Procedural code that runs operations and logic on the server side.
- Story
- An ordered sequence of sheets or dashboards that tells a narrative.
- Streaming vs batch
- Processing events continuously as they arrive vs in scheduled bulk loads.
- Structured Streaming
- Spark's engine for incremental, continuous processing of data as it arrives.
- System-defined role
- A built-in role such as ACCOUNTADMIN, SYSADMIN, SECURITYADMIN or PUBLIC.
- Table calculation
- A calculation across the values already in the view (e.g. % of total).
- Tableau Public
- The free version of Tableau Desktop that publishes to the public web.
- Time intelligence
- DAX patterns for periods (year-to-date, prior year, and similar).
- Time travel
- Querying a previous version of a Delta table by version number or timestamp.
- Tooltip
- The pop-up detail shown when hovering over a mark.
- Union
- Stacking rows from tables that share the same columns.
- Unity Catalog
- The governance layer: a catalog.schema.table namespace with central permissions and lineage.
- Unloading
- Exporting data from Snowflake to files in a stage with COPY INTO.
- User-defined function (UDF)
- A custom function (SQL, JavaScript, Python and more) that returns a value.
- VARIANT
- A data type that stores semi-structured data such as JSON within a column.
- View
- A saved query that presents data as a virtual table.
- Virtual warehouse
- An independent compute cluster that runs queries and loads, billed in credits while running.
- Visual
- A chart, table, card or map placed on a report page.
- Warehouse
- A Fabric item with a full T-SQL engine for set-based transformation and serving.
- Warehouse cache
- Local cached data on a running warehouse that speeds up repeated queries.
- Windowing
- Grouping streaming data into time-based windows for aggregation in Beam/Dataflow.
- Windowing function
- A time-based grouping (such as tumbling or hopping) over a stream.
- Workspace
- A container for Fabric items where a team collaborates and sets permissions.
- Zero-copy clone
- An instant copy of a table, schema or database that shares storage until data changes.
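
The "Windowing" and "Windowing function" entries describe grouping a stream into time-based windows. As a rough illustration only, the sketch below contrasts tumbling (non-overlapping) and hopping (overlapping) windows in plain Python. It does not use the Beam, Dataflow or Fabric APIs; the function names `tumbling_windows` and `hopping_windows`, the sample events, and the choice to ignore windows that would start before time 0 are all assumptions made for this example.

```python
from collections import defaultdict


def tumbling_windows(events, size):
    """Sum values per fixed-size, non-overlapping window.

    events: iterable of (timestamp, value) pairs with integer timestamps.
    Returns {window_start: sum of values in [window_start, window_start + size)}.
    """
    sums = defaultdict(int)
    for ts, value in events:
        start = (ts // size) * size  # the single window that contains ts
        sums[start] += value
    return dict(sorted(sums.items()))


def hopping_windows(events, size, hop):
    """Sum values per overlapping window of length `size` starting every `hop`.

    An event is counted in every window [start, start + size) that covers it.
    Assumption for this sketch: windows starting before time 0 are ignored.
    """
    sums = defaultdict(int)
    for ts, value in events:
        start = (ts // hop) * hop  # latest window start at or before ts
        # Walk back through earlier window starts that still cover ts.
        while start >= 0 and start + size > ts:
            sums[start] += value
            start -= hop
    return dict(sorted(sums.items()))


# Hypothetical sample stream: (timestamp in seconds, value)
events = [(1, 10), (4, 20), (6, 30), (11, 40)]

tumbling = tumbling_windows(events, size=5)
hopping = hopping_windows(events, size=10, hop=5)

print(tumbling)  # {0: 30, 5: 30, 10: 40}
print(hopping)   # {0: 60, 5: 70, 10: 40}
```

Each event falls into exactly one tumbling window, while an event can be counted in more than one hopping window because those windows overlap. Real streaming engines also handle late data and event-time watermarks, which this sketch leaves out.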