*****************
Big Data Glossary
*****************

This glossary collects abbreviations and professional terms that recur across
the big data materials. Most entries are intentionally short. Topic-specific
pages may provide fuller explanations or examples.

General Concepts
================

.. glossary::
   :sorted:

   Big Data
      Datasets and data workloads that are too large, too fast, or too
      complex for traditional single-machine processing methods.

   Cluster
      A group of connected computers that work together as one larger system.

   Data pipeline
      A sequence of steps that moves data from acquisition through storage,
      processing, analysis, and delivery.

   Distributed system
      A system that stores data or performs computation across multiple
      machines rather than on one computer.

   ETL (Extract, Transform, Load)
      A common data-engineering workflow that collects data, cleans or reshapes
      it, and loads it into a target system.

   Fault tolerance
      The ability of a system to continue working even when some components
      fail.

   High availability
      A design goal that keeps a system accessible and usable with minimal
      downtime.

   Horizontal scaling
      Increasing capacity by adding more machines and distributing the work.

   Latency
      The time required for one operation, such as a read or query, to finish.

   Node
      One machine or compute instance inside a distributed system or cluster.

   Parallel processing
      Running multiple tasks at the same time across different cores or
      machines.

   Real-time processing
      Processing data with very low delay so the result is useful immediately
      or within seconds.

   Scalability
      The ability of a system to handle growth in data volume, users, or work
      without failing or becoming impractical.

   Schema
      A definition of how data is organized, including fields, types, and
      relationships.

   Semi-structured data
      Data that has some organization, such as JSON or XML, but does not fit a
      rigid relational table format.

   Stream processing
      Processing data continuously as it arrives instead of waiting for large
      batches.

   Structured data
      Data organized into a well-defined format, such as rows and columns in a
      relational table.

   Throughput
      The amount of data or number of operations a system can process per unit
      of time.

   Unstructured data
      Data without a fixed tabular structure, such as free text, images,
      audio, or video.

   Vertical scaling
      Increasing capacity by making one machine more powerful, for example by
      adding more CPU, RAM, or faster storage.
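
As a rough illustration of the ETL entry above, the following sketch runs the
three stages on a tiny in-memory CSV sample. The sample data, the function
names, and the list standing in for a target system are all assumptions made
for this example; a real pipeline would read from and write to external
systems.

.. code-block:: python

   import csv
   import io

   # Hypothetical raw input; the last row is missing its city on purpose.
   RAW_CSV = """city,temp_f
   Oslo,41
   Cairo,95
   ,70
   """.replace("   ", "")

   def extract(text):
       """Extract: parse CSV text into a list of dict rows."""
       return list(csv.DictReader(io.StringIO(text)))

   def transform(rows):
       """Transform: drop rows without a city, convert Fahrenheit to Celsius."""
       cleaned = []
       for row in rows:
           if not row["city"]:
               continue
           temp_c = round((float(row["temp_f"]) - 32) * 5 / 9, 1)
           cleaned.append({"city": row["city"], "temp_c": temp_c})
       return cleaned

   def load(rows, target):
       """Load: append rows to the target (a list standing in for a table)."""
       target.extend(rows)
       return len(rows)

   warehouse = []
   loaded = load(transform(extract(RAW_CSV)), warehouse)

Keeping each stage as a separate function makes it possible to test or replace
one stage without touching the others.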

Storage, Databases, and Cloud
=============================

.. glossary::
   :sorted:

   ACID
      A set of transaction properties: atomicity, consistency, isolation, and
      durability.

   BASE
      A looser alternative to ACID for distributed systems, usually expanded
      as basically available, soft state, and eventually consistent.

   Consistency
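      How closely the data seen by different clients agrees after an update.
      Strong consistency means every read returns the latest write; weaker
      models allow temporary disagreement.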

   Data lake
      A large repository that stores raw data in many formats for later use.

   Data warehouse
      A system optimized for structured analytics, reporting, and large-scale
      queries.

   Eventual consistency
      A model in which replicas may temporarily disagree after an update but
      converge over time.

   HDFS (Hadoop Distributed File System)
      Hadoop's distributed file system for storing large datasets across many
      machines.

   Lakehouse
      An architecture that combines low-cost data lake storage with
      warehouse-like management and query features.

   NoSQL
      A broad category of non-relational databases designed for flexible data
      models, horizontal scaling, or both. The name is commonly read as "not
      only SQL".

   Object storage
      A storage model that keeps data as objects identified by keys rather than
      as files in a traditional hierarchical file system.

   Partitioning
      Dividing data into smaller pieces so it can be stored or processed more
      efficiently.

   RDBMS (Relational Database Management System)
      A database system based on tables, rows, columns, and SQL.

   Replication
      Keeping multiple copies of data on different machines for durability,
      availability, or faster reads.

   Schema evolution
      Controlled change to a schema over time as fields are added, removed,
      renamed, or retyped.

   Sharding
      A form of partitioning in which data is split across multiple machines so
      no single node stores everything.

   SQL (Structured Query Language)
      The standard language used to query and manage relational data.
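
The sharding and partitioning entries above can be illustrated with a minimal
hash-based sketch. The key format, the shard count, and the function names are
assumptions for this example; production systems often use more elaborate
schemes such as range partitioning or consistent hashing.

.. code-block:: python

   import hashlib

   def shard_for(key, num_shards):
       """Map a record key to a shard index using a stable hash."""
       digest = hashlib.sha256(key.encode("utf-8")).digest()
       # Use the first 8 bytes as an integer, then wrap into the shard range.
       return int.from_bytes(digest[:8], "big") % num_shards

   def distribute(records, num_shards):
       """Place each (key, value) record into the shard chosen by its key."""
       shards = [[] for _ in range(num_shards)]
       for key, value in records:
           shards[shard_for(key, num_shards)].append((key, value))
       return shards

   # Hypothetical records keyed by user id.
   records = [
       ("user:1", "Ada"),
       ("user:2", "Grace"),
       ("user:3", "Edsger"),
       ("user:4", "Barbara"),
   ]
   shards = distribute(records, num_shards=3)

The sketch uses ``hashlib`` rather than Python's built-in ``hash()`` because
string hashes from ``hash()`` are randomized per process by default, so the
same key could land on a different shard after a restart.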

Processing and Spark Terms
==========================

.. glossary::
   :sorted:

   API (Application Programming Interface)
      A defined way for software components to communicate, often through
      functions, classes, or services.

   Batch processing
      Processing data in accumulated groups instead of continuously as records
      arrive.

   DAG (Directed Acyclic Graph)
      A directed graph with no cycles. In Spark, it represents the dependency
      plan of a job's transformations.

   DataFrame
      A distributed data abstraction organized into named columns, commonly
      used in Spark for structured data processing.

   Hadoop
      An open-source ecosystem for distributed storage and large-scale data
      processing.

   Lazy evaluation
      A strategy in which computations are not executed until their results are
      actually needed.

   MapReduce
      A distributed processing model that applies map and reduce steps to large
      datasets in parallel.

   PySpark
      The Python interface to Apache Spark.

   RDD (Resilient Distributed Dataset)
      Spark's low-level distributed data abstraction. RDDs are immutable and
      fault tolerant.

   Resource manager
      A service that allocates cluster resources to jobs and applications.

   Shuffle
      A data movement step in distributed processing where records are
      redistributed across machines, usually by key.

   Spark
      An open-source distributed computing engine widely used for data
      processing, analytics, and machine learning.

   Spark SQL
      Spark's structured data module built around SQL, DataFrames, and related
      query features.

   SparkContext
      The lower-level Spark entry point used mainly for RDD-based operations.

   SparkSession
      The main entry point for modern Spark applications, especially when using
      DataFrames and Spark SQL.

   Structured Streaming
      Spark's current streaming engine, which treats streaming data as an
      unbounded table and uses the DataFrame and SQL APIs.

   YARN (Yet Another Resource Negotiator)
      Hadoop's cluster resource-management layer.
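
The MapReduce and shuffle entries above can be sketched as a word count in
plain Python. This runs on one machine and only mimics the three phases; the
function names and input lines are assumptions for this example, not Hadoop or
Spark APIs.

.. code-block:: python

   from collections import defaultdict

   def map_phase(line):
       """Map: emit a (word, 1) pair for every word in one input line."""
       return [(word.lower(), 1) for word in line.split()]

   def shuffle_phase(pairs):
       """Shuffle: group values by key, as a cluster would across machines."""
       grouped = defaultdict(list)
       for key, value in pairs:
           grouped[key].append(value)
       return grouped

   def reduce_phase(key, values):
       """Reduce: combine all values for one key into a single result."""
       return key, sum(values)

   # Hypothetical input; each string stands in for one record of a dataset.
   lines = ["big data big clusters", "data pipelines move data"]
   mapped = [pair for line in lines for pair in map_phase(line)]
   counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(mapped).items())

In a real cluster the map and reduce calls run in parallel on different nodes,
and the shuffle is the step that moves data over the network.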

Analytics and Machine Learning
==============================

.. glossary::
   :sorted:

   AI (Artificial Intelligence)
      A broad field focused on building systems that perform tasks associated
      with human intelligence.

   Classification
      A predictive task where the output is a discrete label, such as spam or
      not spam.

   Clustering
      An unsupervised learning task that groups similar data points together.

   CNN (Convolutional Neural Network)
      A deep learning architecture commonly used for images and other
      grid-like data.

   Deep learning
      A subset of machine learning based on multi-layer neural networks.

   Feature
      An input variable or measurable property used by a model.

   GAN (Generative Adversarial Network)
      A deep learning model that learns to generate realistic synthetic data by
      training a generator and a discriminator together.

   GPU (Graphics Processing Unit)
      A processor originally designed for graphics rendering that is highly
      effective for the parallel numerical computation used in many machine
      learning workloads.

   LLM (Large Language Model)
      A machine learning model trained on large amounts of text to understand
      and generate language.

   Machine learning
      Methods that allow systems to learn patterns from data and make
      predictions or decisions.

   Model
      A mathematical or computational representation learned from data and used
      for prediction, classification, generation, or explanation.

   PCA (Principal Component Analysis)
      A dimensionality-reduction technique that projects data onto a smaller
      set of uncorrelated components that capture most of its variance.

   Regression
      A predictive task where the output is a continuous value, such as price
      or temperature.

   Reinforcement learning
      A learning approach in which an agent improves by taking actions and
      receiving rewards or penalties.

   RNN (Recurrent Neural Network)
      A neural network architecture designed for sequential data such as text or
      time series.

   Supervised learning
      Learning from labeled examples where the correct output is known during
      training.

   TPU (Tensor Processing Unit)
      A specialized processor developed by Google to accelerate machine
      learning workloads, particularly the tensor operations in neural
      networks.

   Unsupervised learning
      Learning from data without labeled target outputs.

Security and Privacy
====================

.. glossary::
   :sorted:

   CCPA (California Consumer Privacy Act)
      A California privacy law that gives residents rights related to personal
      data collection, deletion, and disclosure.

   Data masking
      Protecting sensitive values by replacing them with altered or obscured
      versions.

   Encryption
      Protecting data by converting it into a form that requires a key or other
      authorized method to read.

   FERPA (Family Educational Rights and Privacy Act)
      A United States federal law that protects student education records.

   GDPR (General Data Protection Regulation)
      A European Union regulation governing the collection and use of personal
      data.

   HIPAA (Health Insurance Portability and Accountability Act)
      A United States federal law that sets privacy and security requirements
      for certain health information.

   MFA (Multi-Factor Authentication)
      An authentication method that requires more than one proof of identity,
      such as a password plus a one-time code.
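
As one possible illustration of the data masking entry above, the sketch below
hides all but the last few characters of a value. The function name, its
parameters, and the sample value are assumptions for this example; real
masking policies depend on the data type and the applicable regulation.

.. code-block:: python

   def mask_value(value, visible=4, mask_char="*"):
       """Replace all but the last `visible` characters with `mask_char`."""
       # Values too short to show a suffix safely are masked completely.
       if visible <= 0 or len(value) <= visible:
           return mask_char * len(value)
       return mask_char * (len(value) - visible) + value[-visible:]

   # Hypothetical 16-digit test number, not a real account.
   masked = mask_value("4111111111111111")

Unlike encryption, masking of this kind is not reversible: the original value
cannot be recovered from the masked output.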