Big Data Glossary¶
This glossary collects abbreviations and professional terms that recur across the big data materials. Most entries are intentionally short. Topic-specific pages may provide fuller explanations or examples.
General Concepts¶
- Big Data¶
Datasets and data workloads that are too large, too fast, or too complex for traditional single-machine processing methods.
- Cluster¶
A group of connected computers that work together as one larger system.
- Data pipeline¶
A sequence of steps that moves data from acquisition through storage, processing, analysis, and delivery.
- Distributed system¶
A system that stores data or performs computation across multiple machines rather than on one computer.
- ETL (Extract, Transform, Load)¶
A common data-engineering workflow that extracts data from source systems, cleans or reshapes it, and loads it into a target system.
- Fault tolerance¶
The ability of a system to continue working even when some components fail.
- High availability¶
A design goal that keeps a system accessible and usable with minimal downtime.
- Horizontal scaling¶
Increasing capacity by adding more machines and distributing the work.
- Latency¶
The time required for one operation, such as a read or query, to finish.
- Node¶
One machine or compute instance inside a distributed system or cluster.
- Parallel processing¶
Running multiple tasks at the same time across different cores or machines.
- Real-time processing¶
Processing data with very low delay so the result is useful immediately or within seconds.
- Scalability¶
The ability of a system to handle growth in data volume, users, or work without failing or becoming impractical.
- Schema¶
A definition of how data is organized, including fields, types, and relationships.
- Semi-structured data¶
Data that has some organization, such as JSON or XML, but does not fit a rigid relational table format.
- Stream processing¶
Processing data continuously as it arrives instead of waiting for large batches.
- Structured data¶
Data organized into a well-defined format, such as rows and columns in a relational table.
- Throughput¶
The amount of data or number of operations a system can process per unit of time.
- Unstructured data¶
Data without a fixed tabular structure, such as free text, images, audio, or video.
- Vertical scaling¶
Increasing capacity by making one machine more powerful, for example by adding more CPU, RAM, or faster storage.
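The ETL and data pipeline entries above can be illustrated with a small sketch. It is only an illustration under stated assumptions: the CSV text, the `readings` table, and the function names are hypothetical, and an in-memory SQLite database stands in for a real target system.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file or an API.
RAW_CSV = """city,temp_f
Boston,50
Austin,86
Denver,
"""


def extract(text):
    """Extract: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with missing readings and convert F to C."""
    cleaned = []
    for row in rows:
        if not row["temp_f"]:
            continue  # skip incomplete records
        temp_c = (float(row["temp_f"]) - 32) * 5 / 9
        cleaned.append((row["city"], round(temp_c, 1)))
    return cleaned


def load(records, conn):
    """Load: write the cleaned records into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS readings (city TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", records)
    conn.commit()


# Assumption: an in-memory SQLite database plays the role of the target system.
conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
result = conn.execute("SELECT city, temp_c FROM readings ORDER BY city").fetchall()
print(result)
```

Keeping each stage in its own function makes it possible to test or replace one step, such as the source or the target, without touching the others.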
Storage, Databases, and Cloud¶
- ACID¶
A set of transaction properties: atomicity, consistency, isolation, and durability.
- BASE¶
A looser consistency model for distributed systems, usually expanded as basically available, soft state, and eventually consistent.
- Consistency¶
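The degree to which all clients see the same data after an update. In ACID, the same word instead means that a transaction leaves the database in a valid state.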
- Data lake¶
A large repository that stores raw data in many formats for later use.
- Data warehouse¶
A system optimized for structured analytics, reporting, and large-scale queries.
- Eventual consistency¶
A model in which replicas may temporarily disagree after an update but converge over time.
- HDFS (Hadoop Distributed File System)¶
Hadoop’s distributed file system for storing large datasets across many machines.
- Lakehouse¶
An architecture that combines low-cost data lake storage with warehouse-like management and query features.
- NoSQL¶
A broad category of non-relational databases designed for flexible data models, horizontal scaling, or both. The name is commonly read as "not only SQL."
- Object storage¶
A storage model that keeps data as objects identified by keys rather than as files in a traditional hierarchical file system.
- Partitioning¶
Dividing data into smaller pieces so it can be stored or processed more efficiently.
- RDBMS (Relational Database Management System)¶
A database system based on tables, rows, columns, and SQL.
- Replication¶
Keeping multiple copies of data on different machines for durability, availability, or faster reads.
- Schema evolution¶
Controlled change to a schema over time as fields are added, removed, renamed, or retyped.
- Sharding¶
A form of partitioning in which data is split across multiple machines so no single node stores everything.
- SQL (Structured Query Language)¶
The standard language used to query and manage relational data.
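The sharding and replication entries above can be sketched in a few lines. This is a simplified illustration, not how any particular database places data: the shard count, the function names `shard_for` and `replicas_for`, and the "next shard" replica rule are all assumptions made for the example.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size


def shard_for(key, num_shards=NUM_SHARDS):
    """Map a record key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def replicas_for(key, num_shards=NUM_SHARDS, copies=2):
    """Place copies on consecutive shards, starting at the primary shard."""
    primary = shard_for(key, num_shards)
    return [(primary + i) % num_shards for i in range(copies)]


# Distribute a few hypothetical user IDs across the shards.
shards = {i: [] for i in range(NUM_SHARDS)}
for user_id in ["alice", "bob", "carol", "dave", "erin"]:
    shards[shard_for(user_id)].append(user_id)

print(shards)
print(replicas_for("alice"))
```

The sketch uses `hashlib` rather than Python's built-in `hash()` because string hashes from `hash()` are randomized per process by default, so the same key could land on a different shard after a restart.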
Processing and Spark Terms¶
- API (Application Programming Interface)¶
A defined way for software components to communicate, often through functions, classes, or services.
- Batch processing¶
Processing data in accumulated groups instead of continuously as records arrive.
- DAG (Directed Acyclic Graph)¶
A directed graph with no cycles. In Spark, it represents the dependencies among a job’s transformations.
- DataFrame¶
A distributed data abstraction organized into named columns, commonly used in Spark for structured data processing.
- Hadoop¶
An open-source ecosystem for distributed storage and large-scale data processing.
- Lazy evaluation¶
A strategy in which computations are not executed until their results are actually needed. In Spark, transformations are lazy and run only when an action requests a result.
- MapReduce¶
A distributed processing model that applies map and reduce steps to large datasets in parallel.
- PySpark¶
The Python interface to Apache Spark.
- RDD (Resilient Distributed Dataset)¶
Spark’s low-level distributed data abstraction. RDDs are immutable and fault tolerant.
- Resource manager¶
A service that allocates cluster resources to jobs and applications.
- Shuffle¶
A data movement step in distributed processing where records are redistributed across machines, usually by key.
- Spark¶
An open-source distributed computing engine widely used for data processing, analytics, and machine learning.
- Spark SQL¶
Spark’s structured data module built around SQL, DataFrames, and related query features.
- SparkContext¶
The lower-level Spark entry point used mainly for RDD-based operations.
- SparkSession¶
The main entry point for modern Spark applications, especially when using DataFrames and Spark SQL.
- Structured Streaming¶
Spark’s current streaming engine, which treats streaming data as an unbounded table and uses the DataFrame and SQL APIs.
- YARN (Yet Another Resource Negotiator)¶
Hadoop’s cluster resource-management layer.
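The MapReduce and shuffle entries above can be illustrated with a word count written in plain Python. This is a single-process sketch that does not use the Hadoop or Spark APIs; the function names and the sample lines are hypothetical, and a real engine would run each phase in parallel across nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all values by key, as if routing them to one reducer per key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


lines = ["big data big clusters", "data pipelines move data"]
mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle_phase(mapped)
word_counts = dict(reduce_phase(key, values) for key, values in grouped.items())
print(word_counts)
```

In a distributed setting the shuffle is usually the expensive step, because records with the same key have to move across the network to reach the same machine.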
Analytics and Machine Learning¶
- AI (Artificial Intelligence)¶
A broad field focused on building systems that perform tasks associated with human intelligence.
- Classification¶
A predictive task where the output is a discrete label, such as spam or not spam.
- Clustering¶
An unsupervised learning task that groups similar data points together.
- CNN (Convolutional Neural Network)¶
A deep learning architecture commonly used for image and grid-like data.
- Deep learning¶
A subset of machine learning based on multi-layer neural networks.
- Feature¶
An input variable or measurable property used by a model.
- GAN (Generative Adversarial Network)¶
A deep learning model that learns to generate realistic synthetic data by training a generator and a discriminator together.
- GPU (Graphics Processing Unit)¶
A processor that is highly effective for the parallel numerical computation used in many machine learning workloads.
- LLM (Large Language Model)¶
A machine learning model trained on large amounts of text to understand and generate language.
- Machine learning¶
Methods that allow systems to learn patterns from data and make predictions or decisions.
- Model¶
A mathematical or computational representation learned from data and used for prediction, classification, generation, or explanation.
- PCA (Principal Component Analysis)¶
A dimensionality-reduction technique that projects data onto a smaller set of uncorrelated components chosen to retain as much of the variance as possible.
- Regression¶
A predictive task where the output is a continuous value, such as price or temperature.
- Reinforcement learning¶
A learning approach in which an agent improves by taking actions and receiving rewards or penalties.
- RNN (Recurrent Neural Network)¶
A neural network architecture designed for sequential data such as text or time series.
- Supervised learning¶
Learning from labeled examples where the correct output is known during training.
- TPU (Tensor Processing Unit)¶
A specialized processor designed for certain machine learning workloads.
- Unsupervised learning¶
Learning from data without labeled target outputs.
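The feature, model, regression, and supervised learning entries above fit together in a very small example. The sketch below fits a straight line with ordinary least squares in plain Python. The data values and the names `fit_line`, `sizes`, and `prices` are made up for illustration; real projects would normally use a library instead.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Labeled training examples: one feature (size) and a known target (price).
# The values are hypothetical and follow price = 2 * size + 1 exactly.
sizes = [1.0, 2.0, 3.0, 4.0]
prices = [3.0, 5.0, 7.0, 9.0]

# The learned "model" here is just the pair (slope, intercept).
slope, intercept = fit_line(sizes, prices)

# Regression: predict a continuous value for an unseen input.
predicted = slope * 5.0 + intercept
print(slope, intercept, predicted)
```

Because the correct prices are supplied during fitting, this is supervised learning; a classification task would look similar but predict a discrete label instead of a number.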
Security and Privacy¶
- CCPA (California Consumer Privacy Act)¶
A California privacy law that gives residents rights related to personal data collection, deletion, and disclosure.
- Data masking¶
Protecting sensitive values by replacing them with altered or obscured versions.
- Encryption¶
Protecting data by converting it into a form that requires a key or other authorized method to read.
- FERPA (Family Educational Rights and Privacy Act)¶
A United States federal law that protects student education records.
- GDPR (General Data Protection Regulation)¶
A European Union regulation governing the collection and use of personal data.
- HIPAA (Health Insurance Portability and Accountability Act)¶
A United States federal law that sets privacy and security requirements for certain health information.
- MFA (Multi-Factor Authentication)¶
An authentication method that requires more than one proof of identity, such as a password plus a one-time code.
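The data masking entry above can be sketched as follows. The helper names `mask_value` and `mask_email`, the choice of `*` as the mask character, and the sample values are assumptions for illustration. Masking of this kind only hides values in displays or exports; it is not a substitute for encryption.

```python
def mask_value(value, visible=4, mask_char="*"):
    """Hide all but the last `visible` characters of a sensitive string."""
    if visible <= 0 or len(value) <= visible:
        # Nothing can be shown safely, so mask the whole value.
        return mask_char * len(value)
    return mask_char * (len(value) - visible) + value[-visible:]


def mask_email(address, mask_char="*"):
    """Keep the first character of the local part and the full domain."""
    local, _, domain = address.partition("@")
    if not domain:
        # Not a recognizable email address; mask everything.
        return mask_value(address, visible=0, mask_char=mask_char)
    return local[:1] + mask_char * (len(local) - 1) + "@" + domain


# Hypothetical sample values, not real account data.
masked_card = mask_value("4111111111111111")
masked_email = mask_email("jane.doe@example.com")
print(masked_card)
print(masked_email)
```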