Course Content Summary

Misunderstanding on Characteristics

  • Do not forget Value; it is a characteristic too!

  • Velocity is about processing

    • not the speed of generation, but the speed of processing

    • determined by the demand (how quickly results must be produced)

  • Veracity is only relevant to low-quality data

    • More of a concern than a characteristic

    • The focus is on low-quality data

    • It sounds odd to say “high-veracity data”

Misunderstandings about Data Volume

  • Consider the effective data volume (see the sketch after this list)

    • not all data needs to be processed

      • irrelevant data can be skipped

      • the job can be done with only a subset of the data

    • not all data needs to be processed together

      • when the results do not require a global view of the data
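
    A minimal PySpark sketch of the idea above: filter out irrelevant records and work on a subset before any expensive step, so the effective data volume is much smaller than the raw volume. The dataset path, column names, and filter conditions are hypothetical.

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("effective-volume").getOrCreate()

      # Big raw dataset; the path and columns are hypothetical.
      events = spark.read.parquet("s3://bucket/events/")

      # Skip irrelevant data early: only recent events from one region.
      relevant = events.where(
          (events.event_date >= "2024-01-01") & (events.region == "EU")
      )

      # The job can be done with a subset: a 1% sample is enough here.
      sample = relevant.sample(fraction=0.01, seed=42)

      # Only the small, relevant subset reaches the expensive aggregation.
      sample.groupBy("event_type").count().show()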

  • Data volume changes after each stage (see the sketch after this list)

    • big raw -> big processed -> big model (deep learning)

      • every stage needs a big data tool

    • big raw -> small processed -> small model (traditional ML, statistical models, visualization)

      • only the first stage needs a big data tool

    • small raw -> small processed -> small model (e.g., when the result is needed in a short time, so only a small, recent slice of the data is used)

      • no big data tool needed
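
    A sketch of the "big raw -> small processed -> small model" case, assuming hypothetical click logs with user_id and purchased columns: only the first aggregation stage needs a big data tool (Spark); the small aggregate then fits on a single machine for a traditional scikit-learn model.

      from pyspark.sql import SparkSession, functions as F
      from sklearn.linear_model import LogisticRegression

      spark = SparkSession.builder.appName("big-to-small").getOrCreate()

      # Stage 1: big raw logs, aggregated with a big data tool (Spark).
      logs = spark.read.json("hdfs:///logs/clicks/")
      per_user = logs.groupBy("user_id").agg(
          F.count("*").alias("n_clicks"),
          F.max("purchased").alias("purchased"),
      )

      # Stage 2: the processed data is small; pull it onto one machine.
      df = per_user.toPandas()

      # Stage 3: a small traditional ML model, no big data tool needed.
      model = LogisticRegression().fit(df[["n_clicks"]], df["purchased"])
      print(model.coef_)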

  • Data size pitfalls

    • Many companies have a huge amount of historical data, but most of their applications only need to access a small portion of it. Those are not big data applications.

    • Using a model pre-trained on a huge dataset does not mean you are dealing with big data. You may only be using the model to process a small dataset in your application, which may only need a single computer or even a cell phone (see the sketch after this list).

    • A car company may have a huge amount of data generated by sensors in its cars. Most of the sensor data is processed locally on each car to make decisions, so it is not big data. However, if your application must aggregate the sensor data from all the cars to make a global decision, then it is big data.
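
    A sketch of the second pitfall, assuming the pre-trained model is a scikit-learn estimator saved with joblib and the file and column names are hypothetical: the model was trained elsewhere on a huge dataset, but this application only scores a small local file and runs comfortably on one machine.

      import joblib
      import pandas as pd

      # The model was pre-trained elsewhere, possibly on a huge dataset.
      model = joblib.load("pretrained_model.joblib")

      # The application itself only handles a small batch of records.
      batch = pd.read_csv("todays_requests.csv")
      batch["score"] = model.predict(batch[["feature_a", "feature_b"]])
      batch.to_csv("scored_requests.csv", index=False)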

Dataset Type Determines the Choice of Tools

  • Structured, semi-structured, and unstructured types (see the sketch after the table)

    Dataset Type    | Storage                  | Processing           | Example
    ----------------|--------------------------|----------------------|---------------------------
    Structured      | NoSQL                    | Spark SQL            | RDBMS exports, Excel
    Semi-structured | NoSQL, DFS, Object Store | Spark RDD, etc.      | JSON, XML
    Unstructured    | DFS, Object Store        | MapReduce, Spark RDD | Text corpus, binary files
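
    A sketch of matching the processing tool to the dataset type in the table above, with hypothetical paths and column names: Spark SQL for (semi-)structured records loaded into a DataFrame, and the lower-level RDD API for an unstructured text corpus.

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("match-tools").getOrCreate()

      # Semi-structured JSON: load into a DataFrame and query with Spark SQL.
      orders = spark.read.json("s3://bucket/orders/")
      orders.createOrReplaceTempView("orders")
      spark.sql("SELECT country, SUM(amount) AS total FROM orders GROUP BY country").show()

      # Unstructured text corpus: word count with the lower-level RDD API.
      lines = spark.sparkContext.textFile("s3://bucket/corpus/*.txt")
      counts = (lines.flatMap(lambda line: line.split())
                     .map(lambda word: (word, 1))
                     .reduceByKey(lambda a, b: a + b))
      print(counts.take(5))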

  • Binary files are usually unstructured

    • Multimedia: Image, video, audio

    • Deep learning model, Machine learning model

    • Exception

      • some binary files have an intrinsic structure

      • some binary files are stored as a BLOB field in a structured dataset (see the sketch after this list)
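
    A sketch of the second exception, with hypothetical file names: binary image data stored as a BLOB field inside an otherwise structured SQLite table.

      import sqlite3

      conn = sqlite3.connect("products.db")
      conn.execute(
          "CREATE TABLE IF NOT EXISTS products (id INTEGER PRIMARY KEY, name TEXT, photo BLOB)"
      )

      # The image bytes go into the BLOB column of a structured table.
      with open("photo_001.jpg", "rb") as f:
          conn.execute(
              "INSERT INTO products (name, photo) VALUES (?, ?)", ("widget", f.read())
          )

      conn.commit()
      conn.close()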

  • Decide which tool to use

    • Match data volume in each stage!

    • Match data types

    • Match special demands

      • Real-time

      • Data quality control

    • Match budget

      • Nothing is free! Always consider cost.

      • Paying for capabilities that nothing demands is a waste of money.

      • Never imagine demands that do not actually exist.