Big Data Glossary

This glossary collects abbreviations and professional terms that recur across the big data materials. Most entries are intentionally short. Topic-specific pages may provide fuller explanations or examples.

General Concepts

Big Data

Datasets and data workloads that are too large, too fast, or too complex for traditional single-machine processing methods.

Cluster

A group of connected computers that work together as one larger system.

Data pipeline

A sequence of steps that moves data from acquisition through storage, processing, analysis, and delivery.

Distributed system

A system that stores data or performs computation across multiple machines rather than on one computer.

ETL (Extract, Transform, Load)

A common data-engineering workflow that collects data, cleans or reshapes it, and loads it into a target system.
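
The three stages can be sketched in a few lines of plain Python. This is a minimal illustration, not a production pipeline: the sample CSV text, the function names, and the list that stands in for a target table are all assumptions made for the example.

```python
import csv
import io

# Hypothetical raw input; a real pipeline would read from a file, API, or database.
RAW_CSV = "name,amount\n alice ,10\nBob,\ncarol,7.5\n"

def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: trim and title-case names, drop rows with no amount, convert types."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        cleaned.append({"name": row["name"].strip().title(), "amount": float(amount)})
    return cleaned

def load(rows, target):
    """Load: append cleaned rows to the target (a list standing in for a table)."""
    target.extend(rows)
    return len(rows)

warehouse = []
loaded = load(transform(extract(RAW_CSV)), warehouse)
print(loaded, warehouse)
```

Keeping each stage in its own function makes it easier to test or replace one stage without touching the others.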

Fault tolerance

The ability of a system to continue working even when some components fail.

High availability

A design goal that keeps a system accessible and usable with minimal downtime.

Horizontal scaling

Increasing capacity by adding more machines and distributing the work.

Latency

The time required for one operation, such as a read or query, to finish.

Node

One machine or compute instance inside a distributed system or cluster.

Parallel processing

Running multiple tasks at the same time across different cores or machines.

Real-time processing

Processing data with very low delay so the result is useful immediately or within seconds.

Scalability

The ability of a system to handle growth in data volume, users, or work without failing or becoming impractical.

Schema

A definition of how data is organized, including fields, types, and relationships.

Semi-structured data

Data that has some organization, such as JSON or XML, but does not fit a rigid relational table format.

Stream processing

Processing data continuously as it arrives instead of waiting for large batches.

Structured data

Data organized into a well-defined format, such as rows and columns in a relational table.

Throughput

The amount of data or number of operations a system can process per unit of time.

Unstructured data

Data without a fixed tabular structure, such as free text, images, audio, or video.

Vertical scaling

Increasing capacity by making one machine more powerful, for example by adding more CPU, RAM, or faster storage.

Storage, Databases, and Cloud

ACID

A set of transaction properties: atomicity, consistency, isolation, and durability.

BASE

A looser distributed-systems model often summarized as basically available, soft state, and eventual consistency.

Consistency

The degree to which all clients see the same, up-to-date data after an update.

Data lake

A large repository that stores raw data in many formats for later use.

Data warehouse

A system optimized for structured analytics, reporting, and large-scale queries.

Eventual consistency

A model in which replicas may temporarily disagree after an update but converge over time.

HDFS (Hadoop Distributed File System)

Hadoop’s distributed file system for storing large datasets across many machines.

Lakehouse

An architecture that combines low-cost data lake storage with warehouse-like management and query features.

NoSQL

A broad category of non-relational databases designed for flexible data models, horizontal scaling, or both. The name is commonly read as "not only SQL."

Object storage

A storage model that keeps data as objects identified by keys rather than as files in a traditional hierarchical file system.

Partitioning

Dividing data into smaller pieces so it can be stored or processed more efficiently.

RDBMS (Relational Database Management System)

A database system based on tables, rows, columns, and SQL.

Replication

Keeping multiple copies of data on different machines for durability, availability, or faster reads.

Schema evolution

Controlled change to a schema over time as fields are added, removed, renamed, or retyped.

Sharding

A form of partitioning in which data is split across multiple machines so no single node stores everything.
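
One common way to decide which machine holds a record is to hash its key. The sketch below assumes a fixed number of shards and a hypothetical helper named shard_for; real systems differ in how they choose and rebalance shards.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash (illustrative helper)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer, then reduce it to a shard index.
    return int.from_bytes(digest[:8], "big") % num_shards

# Three in-memory lists stand in for three machines.
NUM_SHARDS = 3
shards = {i: [] for i in range(NUM_SHARDS)}
for user_id in ["u1", "u2", "u3", "u4", "u5", "u6"]:
    shards[shard_for(user_id, NUM_SHARDS)].append(user_id)

print(shards)
```

With simple modulo hashing, changing the number of shards remaps most keys. Schemes such as consistent hashing exist to reduce that movement.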

SQL (Structured Query Language)

The standard language used to query and manage relational data.

Processing and Spark Terms

API (Application Programming Interface)

A defined way for software components to communicate, often through functions, classes, or services.

Batch processing

Processing data in accumulated groups instead of continuously as records arrive.

DAG (Directed Acyclic Graph)

A graph with no cycles. In Spark, it represents the dependency plan of a job’s transformations.
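
As a small illustration of why the absence of cycles matters, the sketch below orders a set of dependent steps with Python's standard graphlib module (Python 3.9 or later). The step names are hypothetical and this is not how Spark builds its plan internally; it only shows that a DAG always admits a valid execution order.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on (hypothetical job).
dependencies = {
    "read": set(),
    "lookup": set(),
    "filter": {"read"},
    "join": {"filter", "lookup"},
    "write": {"join"},
}

# static_order() yields every step after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

If the graph contained a cycle, static_order() would raise graphlib.CycleError, because no valid order would exist.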

DataFrame

A distributed data abstraction organized into named columns, commonly used in Spark for structured data processing.

Hadoop

An open-source ecosystem for distributed storage and large-scale data processing.

Lazy evaluation

A strategy in which computations are not executed until their results are actually needed.
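
A Python generator gives a small-scale picture of the idea. In the sketch below, the names squares and evaluated are made up for the example; the list records which inputs have actually been computed. Spark applies the same principle at cluster scale: transformations are only recorded until an action requests a result.

```python
evaluated = []  # records which inputs have actually been processed

def squares(numbers):
    """Yield squares one at a time; nothing runs until a value is requested."""
    for n in numbers:
        evaluated.append(n)
        yield n * n

# Building the pipeline does no work, even over a large range.
pipeline = squares(range(1, 1_000_000))
before = list(evaluated)

# Only the three requested values are computed.
first_three = [next(pipeline) for _ in range(3)]

print(before, first_three, evaluated)
```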

MapReduce

A distributed processing model that applies map and reduce steps to large datasets in parallel.
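
The classic word-count example shows the shape of the model. The sketch below runs on one machine in plain Python, so it illustrates the data flow only, not distributed execution; the sample documents and function names are assumptions. The grouping step in the middle plays the role that a shuffle plays on a real cluster.

```python
from collections import defaultdict
from functools import reduce

documents = ["big data big cluster", "data pipeline", "big pipeline"]

def map_phase(doc):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in doc.split()]

def group_by_key(pairs):
    """Group: collect all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine each key's values into a single count."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in grouped.items()}

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(group_by_key(mapped))
print(counts)
```

Because each document is mapped independently and each key is reduced independently, both phases can be spread across many machines.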

PySpark

The Python interface to Apache Spark.

RDD (Resilient Distributed Dataset)

Spark’s low-level distributed data abstraction. RDDs are immutable and fault tolerant.

Resource manager

A service that allocates cluster resources to jobs and applications.

Shuffle

A data movement step in distributed processing where records are redistributed across machines, usually by key.

Spark

An open-source distributed computing engine widely used for data processing, analytics, and machine learning.

Spark SQL

Spark’s structured data module built around SQL, DataFrames, and related query features.

SparkContext

The lower-level Spark entry point used mainly for RDD-based operations.

SparkSession

The main entry point for modern Spark applications, especially when using DataFrames and Spark SQL.

Structured Streaming

Spark’s current streaming engine, which treats streaming data as an unbounded table and uses the DataFrame and SQL APIs.

YARN (Yet Another Resource Negotiator)

Hadoop’s cluster resource-management layer.

Analytics and Machine Learning

AI (Artificial Intelligence)

A broad field focused on building systems that perform tasks associated with human intelligence.

Classification

A predictive task where the output is a discrete label, such as spam or not spam.

Clustering

An unsupervised learning task that groups similar data points together.

CNN (Convolutional Neural Network)

A deep learning architecture commonly used for image and grid-like data.

Deep learning

A subset of machine learning based on multi-layer neural networks.

Feature

An input variable or measurable property used by a model.

GAN (Generative Adversarial Network)

A deep learning model that learns to generate realistic synthetic data by training a generator and a discriminator together.

GPU (Graphics Processing Unit)

A processor that is highly effective for the parallel numerical computation used in many machine learning workloads.

LLM (Large Language Model)

A machine learning model trained on large amounts of text to understand and generate language.

Machine learning

Methods that allow systems to learn patterns from data and make predictions or decisions.

Model

A mathematical or computational representation learned from data and used for prediction, classification, generation, or explanation.

PCA (Principal Component Analysis)

A dimensionality-reduction technique that transforms data into a smaller set of informative components.

Regression

A predictive task where the output is a continuous value, such as price or temperature.

Reinforcement learning

A learning approach in which an agent improves by taking actions and receiving rewards or penalties.

RNN (Recurrent Neural Network)

A neural network architecture designed for sequential data such as text or time series.

Supervised learning

Learning from labeled examples where the correct output is known during training.

TPU (Tensor Processing Unit)

A specialized processor designed for certain machine learning workloads.

Unsupervised learning

Learning from data without labeled target outputs.

Security and Privacy

CCPA (California Consumer Privacy Act)

A California privacy law that gives residents rights related to personal data collection, deletion, and disclosure.

Data masking

Protecting sensitive values by replacing them with altered or obscured versions.
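
One simple form of masking hides all but the last few characters of a value. The helper below is a hypothetical sketch, not a compliance-grade implementation; the function name, the default of four visible characters, and the sample number are assumptions.

```python
def mask_value(value: str, visible: int = 4, mask_char: str = "*") -> str:
    """Replace all but the last `visible` characters with `mask_char`."""
    if visible <= 0 or len(value) <= visible:
        # Too short to reveal anything safely: mask the whole value.
        return mask_char * len(value)
    return mask_char * (len(value) - visible) + value[-visible:]

print(mask_value("4111111111111111"))
print(mask_value("abc"))
```

Masking changes what is displayed or shared. Unlike encryption, the original value cannot be recovered from the masked form.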

Encryption

Protecting data by converting it into a form that requires a key or other authorized method to read.

FERPA (Family Educational Rights and Privacy Act)

A United States federal law that protects student education records.

GDPR (General Data Protection Regulation)

A European Union regulation governing the collection and use of personal data.

HIPAA (Health Insurance Portability and Accountability Act)

A United States federal law that sets privacy and security requirements for certain health information.

MFA (Multi-Factor Authentication)

An authentication method that requires more than one proof of identity, such as a password plus a one-time code.