************************
Introduction to Big Data
************************

Definition of Big Data
======================
Big Data is a term that refers to datasets that are so large and complex that
traditional data processing tools and methods are no longer sufficient to
handle them. In other words, Big Data is characterized by its size, complexity,
and speed of creation.

Examples of Big Data in various domains
---------------------------------------
+ Social Media: Social media platforms generate enormous amounts of data every
  day, including user profiles, posts, comments, likes, shares, and more. This
  data can be analyzed to understand consumer behavior, perform sentiment
  analysis, and identify influencers. Some examples include:

  * Facebook uses Big Data to understand user behavior and preferences to
    personalize user experiences, improve ad targeting, and identify trends.
  * Twitter uses Big Data to analyze tweets and hashtags, identify influential
    users, and track sentiment around brands, products, and events.

+ Content Creation: Big Data is used to create and distribute content in
  various formats, including text, images, videos, and more. Some examples
  include:

  * Netflix uses Big Data to personalize content recommendations for its users
    based on their viewing history and preferences.
  * YouTube uses Big Data to analyze user behavior, identify popular videos,
    and recommend content to its users.

+ Healthcare: Big Data is used in healthcare to improve patient outcomes,
  reduce costs, and identify new treatments. Some examples include:

  * IBM Watson Health uses Big Data to analyze medical records, clinical
    trials, and research papers to help doctors diagnose and treat diseases
    more accurately.
  * The Centers for Disease Control and Prevention (CDC) uses Big Data to track
    disease outbreaks and prevent the spread of infectious diseases.

+ Finance: Big Data is used in finance to manage risk, identify new investment
  opportunities, and improve customer experiences. Some examples include:

  * Goldman Sachs uses Big Data to analyze market trends, manage risk, and
    identify new investment opportunities.
  * American Express uses Big Data to analyze transaction data and personalize
    its customer experiences.

+ Retail: Big Data is used in retail to understand customer behavior, optimize
  pricing and inventory, and improve customer experiences. Some examples
  include:

  * Amazon uses Big Data to analyze customer behavior, personalize
    recommendations, and optimize pricing and inventory.
  * Walmart uses Big Data to optimize its supply chain, manage inventory, and
    improve customer experiences.

These are just a few examples of how Big Data is used in various domains, but
the applications are almost limitless, and Big Data is being used in many other
areas such as transportation, manufacturing, energy, and more.

Comparison of Big Data to traditional data
==========================================

Big Data is often contrasted with traditional data. Big Data is characterized
by its size, complexity, and speed of creation, while traditional data is
typically small, structured, and easy to process with conventional tools and
methods. Big Data requires specialized tools and technologies to store,
manage, and process it efficiently and effectively. While Big Data provides
many opportunities for businesses and organizations to gain insights and make
better decisions, it also presents challenges, including data privacy,
security, and ethical concerns. It is therefore essential to have a
comprehensive understanding of Big Data and its implications when working with
large and complex datasets.

More details are discussed in the later section on the
:ref:`characteristics of Big Data <five-vs>`.

.. list-table:: Big vs Traditional Data
  :header-rows: 1

  * -
    - Traditional data
    - Big data
  * - Size
    - Small
    - Large
  * - Organization
    - Structured
    - Unstructured, semi-structured
  * - Types
    - Numbers, text
    - Images, videos, sounds, sensor data
  * - Generation speed
    - Slow
    - Fast
  * - Integrity
    - Clean and complete
    - Messy, broken, and missing

.. _five-vs:

Characteristics of Big Data (Five V's)
======================================
There are five main characteristics that define Big Data, known as the five
V's: Value, Volume, Velocity, Variety, and Veracity.

**Value**: Value refers to the useful insights that can be derived from Big
Data. The ability to extract valuable insights from Big Data is becoming
increasingly important for businesses to stay competitive. According to a
recent study, data-driven organizations are 23 times more likely to acquire
customers, 6 times more likely to retain customers, and 19 times more likely to
be profitable than non-data-driven organizations [#]_.

**Volume**: Volume refers to the large amount of data that is generated and
collected every day. The growth of data volume has been exponential over the
past decade, with the total amount of data expected to reach 175 zettabytes by
2025, up from 33 zettabytes in 2018 [#]_. This massive increase in data
volume is due to the proliferation of digital devices, the rise of the Internet
of Things (IoT), and the explosion of social media and online content.
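
As a rough check on these figures, the implied annual growth rate works out to
about 27 percent; the short Python snippet below reproduces the arithmetic.

.. code-block:: python

  # Implied compound annual growth rate (CAGR) of global data volume,
  # based on the 33 ZB (2018) and 175 ZB (2025) estimates cited above.
  start, end = 33, 175          # zettabytes
  years = 2025 - 2018

  cagr = (end / start) ** (1 / years) - 1
  print(f"Implied annual growth rate: {cagr:.1%}")  # ~26.9%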

**Velocity**: Velocity refers to the speed at which data is generated and
processed. The speed of data creation has also increased dramatically in
recent years, with some sources generating data in real time. For example, as
of August 2020, Facebook users sent 150,000 messages and uploaded 147,000
photos per minute, and Netflix customers streamed 404,444 hours of video per
minute [#]_. Velocity also covers the speed at which data must be processed,
analyzed, and acted upon, which is becoming increasingly important for
businesses to stay competitive.

**Variety**: Variety refers to the different types of data that are generated
and collected, including structured, semi-structured, and unstructured data.
Structured data is organized and easily searchable, while semi-structured and
unstructured data, such as text, images, and videos, are more complex and
require advanced analytics tools to extract insights. The variety of data is
also expanding, with new data sources such as IoT devices and social media
platforms generating different types of data that require new processing and
analysis techniques.

Examples:

+-----------------------+--------------------------------------------+
| Data Type             | Example                                    |
+=======================+============================================+
| Structured Data       | Relational Database                        |
+-----------------------+--------------------------------------------+
| Semi-Structured Data  | JSON, XML, email                           |
+-----------------------+--------------------------------------------+
| Unstructured Data     | Text documents, Images, Audio/Video files  |
+-----------------------+--------------------------------------------+
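
To make the distinction concrete, the sketch below (plain Python, standard
library only, with made-up field names) parses a structured CSV record and a
semi-structured JSON record.

.. code-block:: python

  import csv
  import io
  import json

  # Structured data: fixed columns, every row has the same fields.
  csv_text = "name,age,city\nJohn,32,Boston\n"
  for row in csv.DictReader(io.StringIO(csv_text)):
      print(row["age"])            # fields are always present

  # Semi-structured data: self-describing, but fields may vary per record.
  json_text = '{"name": "Sarah", "city": "Denver", "tags": ["vip"]}'
  record = json.loads(json_text)
  print(record.get("age"))         # missing field -> None, must be handled

  # Unstructured data (free text, images, audio/video) has no schema at all
  # and needs specialized tools (e.g. NLP, computer vision) to extract fields.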

**Veracity**: Veracity refers to the accuracy and reliability of data. With the
large volume, velocity, and variety of data, it is becoming increasingly
difficult to ensure the accuracy and reliability of data.

According to a recent study, poor data quality costs the US economy an
estimated $3 trillion per year [#]_. Veracity is essential to ensure that
data-driven decisions are accurate and reliable.

Veracity is more of a concern than a property. **The veracity of data is only
relevant when the data is messy, broken, or missing**, which is very common in
Big Data.

When a dataset is naturally complete and clean, veracity is not a problem. For
example, datasets exported from a traditional relational database, or the core
datasets of a financial institution, are typically so well maintained, with
strict formats and integrity guarantees, that veracity is not a concern.

Example table with missing data:

+----------+-----+--------+----------+
| Name     | Age | Gender | Income   |
+==========+=====+========+==========+
| John     | 32  | Male   | $50,000  |
+----------+-----+--------+----------+
| Sarah    | 45  | Female |    ?     |
+----------+-----+--------+----------+
| Michael  |  ?  | Male   | $70,000  |
+----------+-----+--------+----------+
| Samantha | 28  |   ?    | $35,000  |
+----------+-----+--------+----------+
| David    |  ?  | Male   |    ?     |
+----------+-----+--------+----------+
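
As a sketch of how such gaps are surfaced in practice, the snippet below
rebuilds the table above with pandas (assuming the library is installed),
counts the missing values, and shows two common ways of handling them.

.. code-block:: python

  import numpy as np
  import pandas as pd

  # The example table above, with "?" represented as NaN (missing).
  df = pd.DataFrame({
      "Name":   ["John", "Sarah", "Michael", "Samantha", "David"],
      "Age":    [32, 45, np.nan, 28, np.nan],
      "Gender": ["Male", "Female", "Male", np.nan, "Male"],
      "Income": [50000, np.nan, 70000, 35000, np.nan],
  })

  print(df.isna().sum())                     # missing values per column
  print(df.dropna())                         # keep only complete rows
  print(df["Age"].fillna(df["Age"].mean()))  # or impute, e.g. with the mean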

.. container:: footnote

  .. rubric:: References

  .. [#] "Five facts: How customer analytics boosts corporate performance."
      McKinsey.com `link <mckinsey_>`_
  .. [#] "The Digitization of the World from Edge to Core." `IDC White Paper`_.
  .. [#] "53 Important Statistics About How Much Data Is Created Every Day"
      Financesonline.com `link <data growth_>`_.
  .. [#] "Why your bad data is creating a 3 trillion problem." Inc.com `link
      <bad data_>`_

.. _IDC White Paper: https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf

.. _data growth: https://financesonline.com/how-much-data-is-created-every-day/

.. _bad data: https://www.inc.com/anne-gherini/why-your-bad-data-is-creating-a-3-trillion-problem.html

.. _mckinsey: https://www.mckinsey.com/capabilities/growth-marketing-and-sales/our-insights/five-facts-how-customer-analytics-boosts-corporate-performance

Challenges
==========
+ Data Quality - The fifth V, Veracity, poses a significant challenge, as the
  data is often messy, inconsistent, and incomplete. In the United States
  alone, dirty data is estimated to cost companies $600 billion each year.
+ Discovery - Finding insights within Big Data is akin to searching for a
  needle in a haystack. Analyzing petabytes of data requires powerful
  algorithms and can be extremely difficult, making it challenging to uncover
  patterns and insights.
+ Storage - As organizations accumulate more data, managing it can become
  increasingly complex. The key question becomes: "Where do we store it?" A
  scalable storage system that can expand or contract as needed is crucial.
+ Analytics - Analyzing Big Data is often complicated by the sheer volume and
  variety of data. Understanding the nature of the data can be challenging,
  making analysis even more difficult.
+ Security - Given the vast size of Big Data, keeping it secure is a
  significant challenge. This includes measures such as user authentication,
  access restrictions, data access logging, and proper data encryption.
+ Lack of Talent - Despite the prevalence of Big Data projects in major
  organizations, finding a skilled team of developers, data scientists, and
  analysts with sufficient domain knowledge remains a challenge.

Technologies and tools used in Big Data processing
==================================================
As we know, Big Data refers to large, complex datasets that cannot be processed
using traditional data processing tools. To deal with Big Data, we use
distributed computing and parallel processing. This means we divide the data
into smaller chunks and process them in parallel to achieve high performance.
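
As a minimal illustration of this divide-and-process idea, the sketch below
(standard-library Python, not a real Big Data framework) splits a word count
across worker processes and merges the partial results.

.. code-block:: python

  from collections import Counter
  from multiprocessing import Pool

  def count_words(chunk):
      """Map step: count the words in one chunk of the data."""
      return Counter(chunk.split())

  if __name__ == "__main__":
      chunks = ["big data is big", "data moves fast", "big data everywhere"]

      # Divide the data into chunks and process them in parallel...
      with Pool(processes=3) as pool:
          partial_counts = pool.map(count_words, chunks)

      # ...then merge (reduce) the partial results into one total.
      total = sum(partial_counts, Counter())
      print(total.most_common(3))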

There are many frameworks used for Big Data processing, but the most commonly
used are Apache Hadoop and Apache Spark. Hadoop is a collection of open-source
software utilities that facilitate the distributed processing of large data
sets across clusters of computers. Spark is another powerful open-source data
processing engine that provides high-level APIs in Java, Scala, Python, and R.
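
As a small taste of Spark's high-level API, here is a minimal PySpark word
count; it assumes ``pyspark`` is installed and runs Spark in local mode.

.. code-block:: python

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("wordcount").getOrCreate()

  lines = spark.sparkContext.parallelize(
      ["big data is big", "data moves fast"])

  # Classic map/reduce word count on a distributed dataset (RDD).
  counts = (lines.flatMap(lambda line: line.split())
                 .map(lambda word: (word, 1))
                 .reduceByKey(lambda a, b: a + b))

  print(counts.collect())   # e.g. [('big', 2), ('data', 2), ...]
  spark.stop()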

To handle the ingestion and integration of Big Data, we use tools such as
Apache NiFi, Kafka, and Flume. These tools help us move data from different
sources into our Big Data platform, where it can be processed and analyzed.
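
For instance, pushing an event into Kafka takes only a few lines; the sketch
below uses the third-party ``kafka-python`` client and assumes a broker on
``localhost:9092`` and a topic named ``events`` (both placeholders).

.. code-block:: python

  import json

  from kafka import KafkaProducer

  # Assumes a Kafka broker running on localhost:9092.
  producer = KafkaProducer(
      bootstrap_servers="localhost:9092",
      value_serializer=lambda v: json.dumps(v).encode("utf-8"),
  )

  producer.send("events", {"user": "john", "action": "click"})
  producer.flush()   # block until the message is actually delivered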

We also need to store the data that we are processing. In the Hadoop
ecosystem, data is stored in a distributed file system called the Hadoop
Distributed File System (HDFS). Other storage technologies include Amazon S3
and Azure Data Lake Storage.
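
As one small example on the storage side, the sketch below uploads a file to
Amazon S3 with ``boto3``, assuming the library is installed, AWS credentials
are configured, and the bucket name (a placeholder here) already exists.

.. code-block:: python

  import boto3

  s3 = boto3.client("s3")
  s3.upload_file("local_data.csv", "my-data-lake-bucket",
                 "raw/local_data.csv")

  # Frameworks like Spark can then read the object directly, e.g. via
  # "s3a://my-data-lake-bucket/raw/local_data.csv".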

When it comes to databases, we use NoSQL databases like MongoDB, Cassandra, and
HBase. NoSQL databases are designed to handle large and unstructured data,
making them ideal for Big Data processing.
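
As a quick sketch, the snippet below stores and queries schemaless records in
MongoDB with ``pymongo``, assuming the library is installed and a local
MongoDB server is running; the database and collection names are placeholders.

.. code-block:: python

  from pymongo import MongoClient

  # Assumes a local MongoDB server on the default port 27017.
  client = MongoClient("mongodb://localhost:27017")
  events = client["demo_db"]["events"]

  # Documents in the same collection need not share a schema.
  events.insert_one({"user": "sarah", "action": "login"})
  events.insert_one({"user": "john", "action": "purchase", "amount": 42.5})

  for doc in events.find({"user": "john"}):
      print(doc)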

Finally, we'll touch on cloud-based Big Data processing platforms such as AWS,
Azure, and GCP. These platforms provide an easy way to set up Big Data
processing infrastructure in the cloud, making it easier to scale and manage.

This concludes our brief overview of Big Data processing technologies and
tools; they are only introduced here and will be covered in detail in later
modules.