Introduction To Big Data

Definition of Big Data:

Big Data refers to datasets that are so large and complex that traditional data processing tools and methods are no longer sufficient to handle them. In other words, Big Data is characterized by its size, its complexity, and the speed at which it is created.

Examples of Big Data in various domains:

  • Social Media: Social media platforms generate enormous amounts of data every day, including user profiles, posts, comments, likes, shares, and more. This data can be analyzed to understand consumer behavior, gauge sentiment, and identify influencers. Some examples include:

    • Facebook uses Big Data to understand user behavior and preferences to personalize user experiences, improve ad targeting, and identify trends.

    • Twitter uses Big Data to analyze tweets and hashtags, identify influential users, and track sentiment around brands, products, and events.

  • Content Creation: Big Data is used to create and distribute content in various formats, including text, images, videos, and more. Some examples include:

    • Netflix uses Big Data to personalize content recommendations for its users based on their viewing history and preferences.

    • YouTube uses Big Data to analyze user behavior, identify popular videos, and recommend content to its users.

  • Healthcare: Big Data is used in healthcare to improve patient outcomes, reduce costs, and identify new treatments. Some examples include:

    • IBM Watson Health uses Big Data to analyze medical records, clinical trials, and research papers to help doctors diagnose and treat diseases more accurately.

    • The Centers for Disease Control and Prevention (CDC) uses Big Data to track disease outbreaks and prevent the spread of infectious diseases.

  • Finance: Big Data is used in finance to manage risk, identify new investment opportunities, and improve customer experiences. Some examples include:

    • Goldman Sachs uses Big Data to analyze market trends, manage risk, and identify new investment opportunities.

    • American Express uses Big Data to analyze transaction data and personalize its customer experiences.

  • Retail: Big Data is used in retail to understand customer behavior, optimize pricing and inventory, and improve customer experiences. Some examples include:

    • Amazon uses Big Data to analyze customer behavior, personalize recommendations, and optimize pricing and inventory.

    • Walmart uses Big Data to optimize its supply chain, manage inventory, and improve customer experiences.

These are just a few examples of how Big Data is used in various domains, but the applications are almost limitless, and Big Data is being used in many other areas such as transportation, manufacturing, energy, and more.

Comparison of Big Data to traditional data:

Big Data is often contrasted with traditional data. Big Data is characterized by its size, complexity, and speed of creation, while traditional data is typically small, structured, and easy to process with conventional tools and methods. Big Data requires specialized tools and technologies to store, manage, and process it efficiently and effectively. While Big Data provides many opportunities for businesses and organizations to gain insights and make better decisions, it also presents challenges, including data privacy, security, and ethical concerns. It is therefore essential to have a comprehensive understanding of Big Data and its implications when working with large and complex datasets.

More details are discussed in a later section on the characteristics of Big Data.

Big vs Traditional Data

                 | Traditional data   | Big data
-----------------|--------------------|-------------------------------------
Size             | small              | large
Organization     | structured         | unstructured, semi-structured
Types            | numbers, text      | images, videos, sounds, sensor data
Generation speed | slow               | fast
Integrity        | clean and complete | messy, broken, and missing

Characteristics of Big Data (Five V’s)

There are five main characteristics that define Big Data, known as the five V’s: Value, Volume, Velocity, Variety, and Veracity.

Value: Value refers to the insights and value that can be derived from Big Data. The ability to extract valuable insights from Big Data is becoming increasingly important for businesses to stay competitive. According to a recent study, data-driven organizations are 23 times more likely to acquire customers, 6 times more likely to retain customers, and 19 times more likely to be profitable than non-data-driven organizations [1].

Volume: Volume refers to the large amount of data that is generated and collected every day. The growth of data volume has been exponential over the past decade, with the total amount of data expected to reach 175 zettabytes by 2025, up from 33 zettabytes in 2018 [2]. This massive increase in data volume is due to the proliferation of digital devices, the rise of the Internet of Things (IoT), and the explosion of social media and online content.
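As a rough sanity check on these figures, the implied annual growth rate works out to roughly 27% per year; the short sketch below shows the arithmetic (only the 33 ZB and 175 ZB figures come from the cited forecast, the rest is simple math).

```python
# Rough estimate of the implied compound annual growth rate (CAGR)
# between the 2018 (33 ZB) and 2025 (175 ZB) figures cited above.
start_zb, end_zb, years = 33, 175, 2025 - 2018

cagr = (end_zb / start_zb) ** (1 / years) - 1
print(f"Implied growth rate: {cagr:.1%} per year")  # roughly 27% per year
```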

Velocity: Velocity refers to the speed at which data is generated and processed. The speed of data creation has increased dramatically in recent years, with some data sources generating data in real time. For example, as of August 2020, Facebook users sent roughly 150,000 messages and uploaded 147,000 photos per minute, and Netflix customers streamed 404,444 hours of video per minute [3]. Velocity also includes the speed at which data needs to be processed, analyzed, and acted upon in real time, which is becoming increasingly important for businesses to stay competitive.

Variety: Variety refers to the different types of data that are generated and collected, including structured, semi-structured, and unstructured data. Structured data is organized and easily searchable, while semi-structured and unstructured data, such as text, images, and videos, are more complex and require advanced analytics tools to extract insights. The variety of data is also expanding, with new data sources such as IoT devices and social media platforms generating different types of data that require new processing and analysis techniques.

Examples:

Data Type            | Example
---------------------|-------------------------------------------
Structured Data      | Relational databases
Semi-Structured Data | JSON, XML, email
Unstructured Data    | Text documents, images, audio/video files
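To make the distinction concrete, the sketch below reads the same kind of record in structured (CSV) and semi-structured (JSON) form using only Python’s standard library; the field names and values are made up for illustration.

```python
import csv
import io
import json

# Structured data: fixed columns, easy to query (e.g., a table export).
csv_text = "name,age,city\nJohn,32,Boston\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["city"])  # Boston

# Semi-structured data: self-describing but flexible (fields may vary per record).
json_text = '{"name": "Sarah", "age": 45, "tags": ["premium", "newsletter"]}'
record = json.loads(json_text)
print(record.get("tags", []))  # ['premium', 'newsletter']

# Unstructured data (free text, images, audio) has no schema at all
# and typically needs specialized processing (NLP, computer vision, etc.).
raw_text = "Great product, but shipping was slow."
```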

Veracity: Veracity refers to the accuracy and reliability of data. With the large volume, velocity, and variety of data, it is becoming increasingly difficult to ensure that data is accurate and reliable.

According to a recent study, poor data quality costs the US economy about $3 trillion per year [4]. Veracity is essential to ensure that data-driven decisions are accurate and reliable.

Veracity is more of a concern than a property. It only becomes relevant when data is messy, broken, or missing, which is very common in Big Data.

When a dataset is naturally complete and clean, veracity is not a problem. This is typically the case for data exported from a traditional relational database, or for the core datasets of financial institutions, which are well maintained with consistent formats and strong integrity. We do not worry about veracity in these cases.

Example table with missing data:

Name     | Age | Gender | Income
---------|-----|--------|--------
John     | 32  | Male   | $50,000
Sarah    | 45  | Female | ?
Michael  | ?   | Male   | $70,000
Samantha | 28  | ?      | $35,000
David    | ?   | Male   | ?
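One way to work with a table like this is to load it into a DataFrame and flag the gaps. The sketch below uses pandas (our choice here, not something prescribed by the text) with the “?” cells represented as missing values.

```python
import numpy as np
import pandas as pd

# The example table above, with "?" represented as missing values (NaN).
df = pd.DataFrame({
    "Name":   ["John", "Sarah", "Michael", "Samantha", "David"],
    "Age":    [32, 45, np.nan, 28, np.nan],
    "Gender": ["Male", "Female", "Male", np.nan, "Male"],
    "Income": [50000, np.nan, 70000, 35000, np.nan],
})

# How many values are missing in each column?
print(df.isna().sum())

# One simple (and lossy) fix: fill numeric gaps with the column median.
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Income"] = df["Income"].fillna(df["Income"].median())
print(df)
```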

References

Challenges

  • Data Quality - The fifth V, Veracity, poses a significant challenge as the data is often messy, inconsistent, and incomplete. In the United States alone, such dirty data costs companies $600 billion each year.

  • Discovery - Finding insights within Big Data is akin to searching for a needle in a haystack. Analyzing petabytes of data requires powerful algorithms and can be extremely difficult, making it challenging to uncover patterns and insights.

  • Storage - As organizations accumulate more data, managing it can become increasingly complex. The key question becomes: “Where do we store it?” A scalable storage system that can expand or contract as needed is crucial.

  • Analytics - Analyzing Big Data is often complicated by the sheer volume and variety of data. Understanding the nature of the data can be challenging, making analysis even more difficult.

  • Security - Given the vast size of Big Data, keeping it secure is a significant challenge. This includes measures such as user authentication, access restrictions, data access logging, and proper data encryption.

  • Lack of Talent - Despite the prevalence of Big Data projects in major organizations, finding a skilled team of developers, data scientists, and analysts with sufficient domain knowledge remains a challenge.

Technologies and tools used in Big Data processing

As we know, Big Data refers to large, complex datasets that cannot be processed using traditional data processing tools. To deal with Big Data, we use distributed computing and parallel processing. This means we divide the data into smaller chunks and process them in parallel to achieve high performance.
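As a toy illustration of this divide-and-process-in-parallel idea (not tied to any particular framework), the sketch below splits a few lines of text into chunks, counts words in each chunk in separate processes, and merges the partial counts.

```python
from collections import Counter
from multiprocessing import Pool

def count_words(chunk):
    """Count word occurrences in one chunk of lines (the 'map' step)."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts

if __name__ == "__main__":
    lines = ["big data is big", "data moves fast", "big data needs big tools"]

    # Split the data into smaller chunks...
    chunks = [lines[i::2] for i in range(2)]

    # ...process the chunks in parallel...
    with Pool(processes=2) as pool:
        partial_counts = pool.map(count_words, chunks)

    # ...and merge the partial results (the 'reduce' step).
    total = sum(partial_counts, Counter())
    print(total.most_common(3))
```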

There are many frameworks used for Big Data processing, but the most commonly used are Apache Hadoop and Apache Spark. Hadoop is a collection of open-source software utilities that facilitate the distributed processing of large data sets across clusters of computers. Spark is another powerful open-source data processing engine that provides high-level APIs in Java, Scala, Python, and R.
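For a first taste of what Spark code looks like, here is the classic word-count example in PySpark; it assumes the pyspark package is installed and that a local file named data.txt exists (both assumptions for illustration).

```python
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session.
spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Read a text file into an RDD of lines, then count words in parallel.
lines = spark.sparkContext.textFile("data.txt")  # path is an assumption
counts = (
    lines.flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
)

print(counts.take(10))
spark.stop()
```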

To handle the ingestion and integration of Big Data, we use tools such as Apache NiFi, Kafka, and Flume. These tools help us move data from different sources into our Big Data platform, where it can be processed and analyzed.
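For instance, a minimal producer/consumer round trip with Kafka might look like the sketch below; it assumes the kafka-python package, a broker on localhost:9092, and a topic named "events" (all assumptions for illustration).

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer: push raw events from a source system into a Kafka topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b'{"user": "john", "action": "click"}')
producer.flush()

# Consumer: a downstream job reads the same topic for processing.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating if no new messages arrive
)
for message in consumer:
    print(message.value)
```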

We also need to store the data that we are processing. In Big Data, we use a distributed file system called Hadoop Distributed File System (HDFS). Other storage technologies include S3 and Azure Data Lake Storage.
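As one concrete storage example, the sketch below uploads and lists files in S3 with boto3; the bucket name and file paths are placeholders, and AWS credentials are assumed to be configured.

```python
import boto3

# Assumes AWS credentials are configured and the bucket already exists.
s3 = boto3.client("s3")

# Upload a local file into the data lake.
s3.upload_file("daily_sales.csv", "my-data-lake-bucket", "raw/daily_sales.csv")

# List what has landed under the raw/ prefix.
response = s3.list_objects_v2(Bucket="my-data-lake-bucket", Prefix="raw/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```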

When it comes to databases, we use NoSQL databases like MongoDB, Cassandra, and HBase. NoSQL databases are designed to handle large and unstructured data, making them ideal for Big Data processing.
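To illustrate why NoSQL suits loosely structured data, the sketch below stores two differently shaped documents in MongoDB with pymongo; the connection string, database, and collection names are assumptions for illustration.

```python
from pymongo import MongoClient

# Connect to a local MongoDB instance (connection string is an assumption).
client = MongoClient("mongodb://localhost:27017")
collection = client["demo_db"]["events"]

# NoSQL documents do not need a fixed schema: these two records differ in shape.
collection.insert_one({"user": "john", "action": "click", "ts": "2024-01-01T10:00:00"})
collection.insert_one({"user": "sarah", "action": "purchase", "amount": 42.5})

# Query by field, even though not every document has every field.
for doc in collection.find({"user": "sarah"}):
    print(doc)
```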

Finally, we’ll touch on cloud-based Big Data processing platforms such as AWS, Azure, and GCP. These platforms provide an easy way to set up Big Data processing infrastructure in the cloud, making it easier to scale and manage.

This concludes our overview of Big Data processing technologies and tools. They are only briefly introduced here, as they will be covered in detail in later modules.