Course Content Summary
Misunderstanding on Characteristics
Do not forget value!
Velocity is about processing
not the speed of generation, but the speed of processing
determined by the demand
Veracity is only relevant to low-quality data
More of a concern than a characteristic
Focus on low-quality data
Weird to say “high veracity data”
Misunderstanding on Data Volume
Consider effective data volume
not all data needs to be processed
irrelevant data can be skipped
job can be done with only a subset of data
not all data needs to be processed together
results do not require a global view of the data
Data volume changes after each stage
big raw -> big processed -> big model (deep learning)
every stage needs a big data tool
big raw -> small processed -> small model (traditional ML, statistical models, visualization)
only the first stage needs a big data tool
small raw -> small processed -> small model (when the result is needed in a short time)
no big data tool needed
Data size pitfalls
Many companies have huge amounts of historical data, but most of their applications only need to access a small portion of it. Those are not big data applications.
Using a model pre-trained on a huge dataset does not mean you are dealing with big data. You may only be using the model to process a small dataset in your application, which only needs a single computer or even a cell phone.
A car company may have a huge amount of data generated by the sensors in its many cars. Most of the sensor data is only processed locally on the car itself to make decisions. That is not big data. However, if your application must aggregate the sensor data from all the cars to make a global decision, then it is big data.
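The stage-by-stage reasoning above can be sketched in a few lines of Python. This is only an illustration: the `Stage` class, the function name, the 1 TB single-machine limit, and all volumes are hypothetical values chosen for the example, not figures from the course.

```python
from dataclasses import dataclass

# Hypothetical cutoff for "fits on a single machine" (assumption, not a course rule).
SINGLE_MACHINE_LIMIT_GB = 1_000


@dataclass
class Stage:
    name: str                   # the data the stage works on: raw, processed, model
    effective_volume_gb: float  # volume that really has to be processed together


def stages_needing_big_data_tool(stages, limit_gb=SINGLE_MACHINE_LIMIT_GB):
    """Return the names of stages whose effective volume exceeds the limit."""
    return [s.name for s in stages if s.effective_volume_gb > limit_gb]


# big raw -> small processed -> small model (traditional ML, statistics, visualization)
traditional_ml = [
    Stage("raw", 50_000),
    Stage("processed", 20),
    Stage("model", 0.1),
]

# big raw -> big processed -> big model (deep learning)
deep_learning = [
    Stage("raw", 50_000),
    Stage("processed", 30_000),
    Stage("model", 2_000),
]

print(stages_needing_big_data_tool(traditional_ml))  # ['raw']
print(stages_needing_big_data_tool(deep_learning))   # ['raw', 'processed', 'model']
```

The point of the sketch is that the check uses the effective volume of each stage, not the size of the raw archive, so a pipeline with a huge raw input may still need a big data tool in only its first stage.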
Dataset Type Determines Choice of Tools
Structured, semi-structured, unstructured types
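| Dataset Type    | Storage                  | Processing           | Example                  |
|-----------------|--------------------------|----------------------|--------------------------|
| Structured      | NoSQL                    | Spark SQL            | RDBMS exports, Excel     |
| Semi-structured | NoSQL, DFS, Object Store | Spark RDD, etc.      | JSON, XML                |
| Unstructured    | DFS, Object Store        | MapReduce, Spark RDD | Text corpus, binary files |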
Binary files are usually unstructured
Multimedia: Image, video, audio
Deep learning model, Machine learning model
Exception
some binary files have an intrinsic structure
some binary files are used as a blob field in a structured dataset
Decide which tool to use
Match data volume in each stage!
Match data types
Match special demands
Real-time
Data quality control
Match budget
Nothing is free! Always consider cost.
A tool bought for a demand that does not exist is a waste of money.
Never design for imagined demands.
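As a rough illustration of matching tools to data type and volume, the table in this section can be turned into a lookup. The dictionary layout and the `suggest_tools` function are assumptions made for this sketch; only the tool names come from the table, and the "etc." in the semi-structured row is not expanded.

```python
# Tool names copied from the table in this section; the structure is illustrative.
TOOLS_BY_TYPE = {
    "structured": {
        "storage": ["NoSQL"],
        "processing": ["Spark SQL"],
    },
    "semi-structured": {
        "storage": ["NoSQL", "DFS", "Object Store"],
        "processing": ["Spark RDD"],  # the table also says "etc."
    },
    "unstructured": {
        "storage": ["DFS", "Object Store"],
        "processing": ["MapReduce", "Spark RDD"],
    },
}


def suggest_tools(dataset_type, needs_big_data_tool):
    """Suggest storage and processing options for one stage.

    Returns None when the stage does not need a big data tool,
    because paying for one without a demand is a waste of money.
    """
    if not needs_big_data_tool:
        return None
    key = dataset_type.strip().lower()
    if key not in TOOLS_BY_TYPE:
        raise ValueError(f"unknown dataset type: {dataset_type!r}")
    return TOOLS_BY_TYPE[key]


print(suggest_tools("Structured", True)["processing"])  # ['Spark SQL']
print(suggest_tools("unstructured", False))             # None
```

Special demands such as real-time processing, data quality control, and budget are deliberately left out; they would narrow the options further.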