Course Content Summary

Misunderstanding on Characteristics

  • Do not forget Value; it is a characteristic too!

  • Velocity is about processing

    • not the speed of generation, but the speed of processing

    • determined by the demand (how quickly results must be produced)

  • Veracity is only relevant to low-quality data

    • More of a concern than a characteristic

    • The focus is on low-quality data

    • It sounds odd to say “high-veracity data”

Misunderstandings about Data Volume

  • Consider the effective data volume (see the sketch after this list)

    • not all data needs to be processed

      • irrelevant data can be skipped

      • the job can be done with only a subset of the data

    • not all data needs to be processed together

      • when the results do not require a global view of the data
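
    A minimal PySpark sketch of the idea above: filter out irrelevant records and work on a subset before any expensive step, so the effective data volume is much smaller than the raw volume. The dataset path, column names, and filter conditions are hypothetical.

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("effective-volume").getOrCreate()

      # Big raw dataset; the path and columns are hypothetical.
      events = spark.read.parquet("s3://bucket/events/")

      # Skip irrelevant data early: only recent events from one region.
      relevant = events.where(
          (events.event_date >= "2024-01-01") & (events.region == "EU")
      )

      # The job can be done with a subset: a 1% sample is enough here.
      sample = relevant.sample(fraction=0.01, seed=42)

      # Only the small, relevant subset reaches the expensive aggregation.
      sample.groupBy("event_type").count().show()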

  • Data volume changes after each stage (see the sketch after this list)

    • big raw -> big processed -> big model (deep learning)

      • every stage needs a big data tool

    • big raw -> small processed -> small model (traditional ML, statistical models, visualization)

      • only the first stage needs a big data tool

    • small raw -> small processed -> small model (e.g., when the result is needed in a short time, so only a small, recent slice of the data is used)

      • no big data tool needed
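
    A sketch of the "big raw -> small processed -> small model" case, assuming hypothetical click logs with user_id and purchased columns: only the first aggregation stage needs a big data tool (Spark); the small aggregate then fits on a single machine for a traditional scikit-learn model.

      from pyspark.sql import SparkSession, functions as F
      from sklearn.linear_model import LogisticRegression

      spark = SparkSession.builder.appName("big-to-small").getOrCreate()

      # Stage 1: big raw logs, aggregated with a big data tool (Spark).
      logs = spark.read.json("hdfs:///logs/clicks/")
      per_user = logs.groupBy("user_id").agg(
          F.count("*").alias("n_clicks"),
          F.max("purchased").alias("purchased"),
      )

      # Stage 2: the processed data is small; pull it onto one machine.
      df = per_user.toPandas()

      # Stage 3: a small traditional ML model, no big data tool needed.
      model = LogisticRegression().fit(df[["n_clicks"]], df["purchased"])
      print(model.coef_)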

  • Data size pitfalls

    • Many companies have a huge amount of historical data, but most of their applications only need to access a small portion of it. Those are not big data applications.

    • Using a model pre-trained on a huge dataset does not mean you are dealing with big data. You may only be using the model to process a small dataset in your application, which may only need a single computer or even a cell phone (see the sketch after this list).

    • A car company may have a huge amount of data generated by sensors in its cars. Most of the sensor data is processed locally on each car to make decisions, so it is not big data. However, if your application must aggregate the sensor data from all the cars to make a global decision, then it is big data.
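
    A sketch of the second pitfall, assuming the pre-trained model is a scikit-learn estimator saved with joblib and the file and column names are hypothetical: the model was trained elsewhere on a huge dataset, but this application only scores a small local file and runs comfortably on one machine.

      import joblib
      import pandas as pd

      # The model was pre-trained elsewhere, possibly on a huge dataset.
      model = joblib.load("pretrained_model.joblib")

      # The application itself only handles a small batch of records.
      batch = pd.read_csv("todays_requests.csv")
      batch["score"] = model.predict(batch[["feature_a", "feature_b"]])
      batch.to_csv("scored_requests.csv", index=False)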

Dataset Type Determines the Choice of Tools

  • Structured, semi-structured, and unstructured types (see the sketch after the table)

    Dataset Type    | Storage                  | Processing           | Example
    ----------------|--------------------------|----------------------|---------------------------
    Structured      | NoSQL                    | Spark SQL            | RDBMS exports, Excel
    Semi-structured | NoSQL, DFS, Object Store | Spark RDD, etc.      | JSON, XML
    Unstructured    | DFS, Object Store        | MapReduce, Spark RDD | Text corpus, binary files
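
    A sketch of matching the processing tool to the dataset type in the table above, with hypothetical paths and column names: Spark SQL for (semi-)structured records loaded into a DataFrame, and the lower-level RDD API for an unstructured text corpus.

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("match-tools").getOrCreate()

      # Semi-structured JSON: load into a DataFrame and query with Spark SQL.
      orders = spark.read.json("s3://bucket/orders/")
      orders.createOrReplaceTempView("orders")
      spark.sql("SELECT country, SUM(amount) AS total FROM orders GROUP BY country").show()

      # Unstructured text corpus: word count with the lower-level RDD API.
      lines = spark.sparkContext.textFile("s3://bucket/corpus/*.txt")
      counts = (lines.flatMap(lambda line: line.split())
                     .map(lambda word: (word, 1))
                     .reduceByKey(lambda a, b: a + b))
      print(counts.take(5))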

  • Binary files are usually unstructured

    • Multimedia: Image, video, audio

    • Deep learning model, Machine learning model

    • Exception

      • some binary files have an intrinsic structure

      • some binary files are stored as a BLOB field in a structured dataset (see the sketch after this list)
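
    A sketch of the second exception, with hypothetical file names: binary image data stored as a BLOB field inside an otherwise structured SQLite table.

      import sqlite3

      conn = sqlite3.connect("products.db")
      conn.execute(
          "CREATE TABLE IF NOT EXISTS products (id INTEGER PRIMARY KEY, name TEXT, photo BLOB)"
      )

      # The image bytes go into the BLOB column of a structured table.
      with open("photo_001.jpg", "rb") as f:
          conn.execute(
              "INSERT INTO products (name, photo) VALUES (?, ?)", ("widget", f.read())
          )

      conn.commit()
      conn.close()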

  • Decide which tool to use

    • Match data volume in each stage!

    • Match data types

    • Match special demands

      • Real-time

      • Data quality control

    • Match budget

      • Nothing is free! Always consider cost.

      • Paying for capabilities that nothing demands is a waste of money.

      • Never imagine demands that do not actually exist.