Big Data Glossary¶

This glossary collects abbreviations and professional terms that recur across the big data materials. Most entries are intentionally short. Topic-specific pages may provide fuller explanations or examples.

General Concepts¶

Big Data¶: Datasets and data workloads that are too large, too fast, or too complex for traditional single-machine processing methods.
Cluster¶: A group of connected computers that work together as one larger system.
Data pipeline¶: A sequence of steps that moves data from acquisition through storage, processing, analysis, and delivery.
Distributed system¶: A system that stores data or performs computation across multiple machines rather than on one computer.
ETL (Extract, Transform, Load)¶: A common data-engineering workflow that collects data, cleans or reshapes it, and loads it into a target system.
Fault tolerance¶: The ability of a system to continue working even when some components fail.
High availability¶: A design goal that keeps a system accessible and usable with minimal downtime.
Horizontal scaling¶: Increasing capacity by adding more machines and distributing the work.
Latency¶: The time required for one operation, such as a read or query, to finish.
Node¶: One machine or compute instance inside a distributed system or cluster.
Parallel processing¶: Running multiple tasks at the same time across different cores or machines.
Real-time processing¶: Processing data with very low delay so the result is useful immediately or within seconds.
Scalability¶: The ability of a system to handle growth in data volume, users, or work without failing or becoming impractical.
Schema¶: A definition of how data is organized, including fields, types, and relationships.
Semi-structured data¶: Data that has some organization, such as JSON or XML, but does not fit a rigid relational table format.
Stream processing¶: Processing data continuously as it arrives instead of waiting for large batches.
Structured data¶: Data organized into a well-defined format, such as rows and columns in a relational table.
Throughput¶: The amount of data or number of operations a system can process per unit of time.
Unstructured data¶: Data without a fixed tabular structure, such as free text, images, audio, or video.
Vertical scaling¶: Increasing capacity by making one machine more powerful, for example by adding more CPU, RAM, or faster storage.
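
The ETL, data pipeline, and schema entries above can be made concrete with a small sketch. The example below is illustrative only and uses nothing beyond the Python standard library: the sample CSV text, the `readings` table, and the function names are assumptions made for this illustration, not part of any particular tool.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file or an API.
RAW_CSV = """city,temp_c
Oslo,4.5
Cairo,
Lima,19.0
"""

def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with a missing reading and convert types."""
    return [(r["city"], float(r["temp_c"])) for r in rows if r["temp_c"]]

def load(records, conn):
    """Load: write the cleaned records into a table with an explicit schema."""
    conn.execute("CREATE TABLE readings (city TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", records)
    conn.commit()

def run_pipeline():
    """Run extract, transform, and load against a fresh in-memory database."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract(RAW_CSV)), conn)
    return conn.execute(
        "SELECT city, temp_c FROM readings ORDER BY city"
    ).fetchall()

print(run_pipeline())
```

The row for Cairo has no reading, so the transform step drops it and only two rows reach the target table. A production pipeline would typically log or quarantine such rows rather than discard them silently.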

Storage, Databases, and Cloud¶

ACID¶: A set of transaction properties: atomicity, consistency, isolation, and durability.
BASE¶: A looser distributed-systems model often summarized as basically available, soft state, and eventual consistency.
Consistency¶: The degree to which all clients see the same, most recent data after an update.
Data lake¶: A large repository that stores raw data in many formats for later use.
Data warehouse¶: A system optimized for structured analytics, reporting, and large-scale queries.
Eventual consistency¶: A model in which replicas may temporarily disagree after an update but converge over time.
HDFS (Hadoop Distributed File System)¶: Hadoop’s distributed file system for storing large datasets across many machines.
Lakehouse¶: An architecture that combines low-cost data lake storage with warehouse-like management and query features.
NoSQL¶: A broad category of non-relational databases designed for flexible data models, horizontal scaling, or both. The name is commonly read as "not only SQL".
Object storage¶: A storage model that keeps data as objects identified by keys rather than as files in a traditional hierarchical file system.
Partitioning¶: Dividing data into smaller pieces so it can be stored or processed more efficiently.
RDBMS (Relational Database Management System)¶: A database system based on tables, rows, columns, and SQL.
Replication¶: Keeping multiple copies of data on different machines for durability, availability, or faster reads.
Schema evolution¶: Controlled change to a schema over time as fields are added, removed, renamed, or retyped.
Sharding¶: A form of partitioning in which data is split across multiple machines so no single node stores everything.
SQL (Structured Query Language)¶: The standard language used to query and manage relational data.
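
As a rough illustration of the partitioning and sharding entries above, the sketch below assigns records to shards by hashing their keys. It is one simple scheme among several, not the method of any specific database. The helper names `shard_for` and `shard_records` and the sample records are hypothetical.

```python
import hashlib

def shard_for(key, num_shards):
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def shard_records(records, num_shards):
    """Split (key, value) records into one dict per shard."""
    shards = [{} for _ in range(num_shards)]
    for key, value in records:
        shards[shard_for(key, num_shards)][key] = value
    return shards

# Hypothetical sample data.
records = [
    ("user:1", "Ada"),
    ("user:2", "Grace"),
    ("user:3", "Edsger"),
    ("user:4", "Barbara"),
]
shards = shard_records(records, num_shards=3)
for index, shard in enumerate(shards):
    print(index, sorted(shard))
```

The sketch uses `hashlib` rather than the built-in `hash()` because string hashes from `hash()` are randomized per process by default, so the same key could land on different shards across runs. Simple modulo sharding also moves most keys when the number of shards changes, which is why real systems often use more elaborate schemes.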

Processing and Spark Terms¶

API (Application Programming Interface)¶: A defined way for software components to communicate, often through functions, classes, or services.
Batch processing¶: Processing data in accumulated groups instead of continuously as records arrive.
DAG (Directed Acyclic Graph)¶: A graph with no cycles. In Spark, it represents the dependency plan of a job’s transformations.
DataFrame¶: A distributed data abstraction organized into named columns, commonly used in Spark for structured data processing.
Hadoop¶: An open-source ecosystem for distributed storage and large-scale data processing.
Lazy evaluation¶: A strategy in which computations are not executed until their results are actually needed.
MapReduce¶: A distributed processing model that applies map and reduce steps to large datasets in parallel.
PySpark¶: The Python interface to Apache Spark.
RDD (Resilient Distributed Dataset)¶: Spark’s low-level distributed data abstraction. RDDs are immutable and fault tolerant.
Resource manager¶: A service that allocates cluster resources to jobs and applications.
Shuffle¶: A data movement step in distributed processing where records are redistributed across machines, usually by key.
Spark¶: An open-source distributed computing engine widely used for data processing, analytics, and machine learning.
Spark SQL¶: Spark’s structured data module built around SQL, DataFrames, and related query features.
SparkContext¶: The lower-level Spark entry point used mainly for RDD-based operations.
SparkSession¶: The main entry point for modern Spark applications, especially when using DataFrames and Spark SQL.
Structured Streaming¶: Spark’s current streaming engine, which treats streaming data as an unbounded table and uses the DataFrame and SQL APIs.
YARN (Yet Another Resource Negotiator)¶: Hadoop’s cluster resource-management layer.
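
The MapReduce and shuffle entries above can be sketched on a single machine. The example below is a teaching sketch in plain Python, not Hadoop or Spark code: it runs the map, shuffle, and reduce steps in sequence in one process, whereas a real cluster would run them in parallel across nodes. All function names are chosen for this illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values by key, as a cluster would across nodes."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = word_count(["big data big cluster", "data pipeline"])
print(counts)  # {'big': 2, 'data': 2, 'cluster': 1, 'pipeline': 1}
```

The shuffle step is the only one that needs to see pairs from every input line at once, which is why shuffles tend to be the expensive, network-heavy part of a distributed job.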

Analytics and Machine Learning¶

AI (Artificial Intelligence)¶: A broad field focused on building systems that perform tasks associated with human intelligence.
Classification¶: A predictive task where the output is a discrete label, such as spam or not spam.
Clustering¶: An unsupervised learning task that groups similar data points together.
CNN (Convolutional Neural Network)¶: A deep learning architecture commonly used for image and grid-like data.
Deep learning¶: A subset of machine learning based on multi-layer neural networks.
Feature¶: An input variable or measurable property used by a model.
GAN (Generative Adversarial Network)¶: A deep learning model that learns to generate realistic synthetic data by training a generator and a discriminator together.
GPU (Graphics Processing Unit)¶: A processor that is highly effective for the parallel numerical computation used in many machine learning workloads.
LLM (Large Language Model)¶: A machine learning model trained on large amounts of text to understand and generate language.
Machine learning¶: Methods that allow systems to learn patterns from data and make predictions or decisions.
Model¶: A mathematical or computational representation learned from data and used for prediction, classification, generation, or explanation.
PCA (Principal Component Analysis)¶: A dimensionality-reduction technique that transforms data into a smaller set of informative components.
Regression¶: A predictive task where the output is a continuous value, such as price or temperature.
Reinforcement learning¶: A learning approach in which an agent improves by taking actions and receiving rewards or penalties.
RNN (Recurrent Neural Network)¶: A neural network architecture designed for sequential data such as text or time series.
Supervised learning¶: Learning from labeled examples where the correct output is known during training.
TPU (Tensor Processing Unit)¶: A specialized processor designed for certain machine learning workloads.
Unsupervised learning¶: Learning from data without labeled target outputs.
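
To tie together the feature, model, regression, and supervised learning entries above, here is a minimal sketch of fitting a straight line to labeled examples with ordinary least squares. It assumes a single feature and uses made-up training data. Real projects would normally rely on a library rather than hand-written formulas.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new feature value."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical training data: feature = floor area, label = price.
areas = [50.0, 60.0, 80.0, 100.0]
prices = [150.0, 180.0, 240.0, 300.0]

model = fit_line(areas, prices)
print(predict(model, 70.0))  # 210.0
```

Here the labels are the known prices, the feature is the floor area, and the learned model is the pair `(slope, intercept)`. Because the output is a continuous value, this is a regression task. Predicting a discrete label instead would make it classification.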

Security and Privacy¶

CCPA (California Consumer Privacy Act)¶: A California privacy law that gives residents rights related to personal data collection, deletion, and disclosure.
Data masking¶: Protecting sensitive values by replacing them with altered or obscured versions.
Encryption¶: Protecting data by converting it into a form that requires a key or other authorized method to read.
FERPA (Family Educational Rights and Privacy Act)¶: A United States federal law that protects student education records.
GDPR (General Data Protection Regulation)¶: A European Union regulation governing the collection and use of personal data.
HIPAA (Health Insurance Portability and Accountability Act)¶: A United States federal law that sets privacy and security requirements for certain health information.
MFA (Multi-Factor Authentication)¶: An authentication method that requires more than one proof of identity, such as a password plus a one-time code.
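
The data masking entry above can be illustrated with a short sketch. The two helpers below, `mask_value` and `mask_email`, are hypothetical names and show only one simple masking style (replacing characters with `*`). Actual masking rules depend on the data and on the applicable policy or regulation.

```python
def mask_value(value, visible=4, mask_char="*"):
    """Replace all but the last `visible` characters with `mask_char`."""
    if visible <= 0 or len(value) <= visible:
        # Nothing may be revealed, or the value is too short to reveal any part.
        return mask_char * len(value)
    return mask_char * (len(value) - visible) + value[-visible:]

def mask_email(email):
    """Keep the first character of the local part and the full domain."""
    local, sep, domain = email.partition("@")
    if not sep or not local:
        # Not a recognizable address: mask everything.
        return mask_value(email, visible=0)
    return local[0] + "*" * (len(local) - 1) + "@" + domain

print(mask_value("4111111111111111"))  # ************1111
print(mask_email("ada@example.com"))   # a**@example.com
```

Masking differs from encryption in that the original value cannot be recovered from the masked output alone, so it suits displays and test datasets rather than storage that must later be read back.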
