Final Review
Introduction
The final exam is a take-home, open-book exam. You need to finish it within a 2-day time window. Please refer to the Canvas quiz “Final Exam” for the exact time. All questions relate to topics we have covered in this course. Below are the requirements and restrictions:
You may use any resources you can find to finish the exam.
This is an individual exam: you need to finish it independently, and you are not allowed to discuss the exam with anyone else.
You are not allowed to copy from any resources directly. Questions should be answered in your own words, based on the content you learned in class. Quotation is allowed, but you need to cite the source.
You are not allowed to discuss the exam with other students.
You are not allowed to copy from other students.
You are not allowed to post any questions about the exam on any online forums.
You are not allowed to generate answers using any online tools, including generative AI models. Such tools produce answers that look correct but are not related to the course materials.
Content that is entirely unrelated to the course materials will be considered cheating, and you will get 0 points for the exam.
I will use plagiarism detection tools to check the similarity of your answers with other students’ answers and online resources. If you are found to have violated the above rules, you will get 0 points for the exam.
Summary and Relationships
Big vs small
Consider effective data volume in the context of an application
not all data needs to be processed
irrelevant data can be skipped
job can be done with a subset of data
not all data needs to be processed together
results do not require a global view of the data
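The idea of effective data volume can be illustrated with a minimal sketch in plain Python. The records, field names, and helper functions below are hypothetical and only serve the illustration: irrelevant rows are skipped first, and the remaining rows are counted per region, a result that does not need a global view of the data.

```python
from collections import Counter

# Hypothetical log records; assume only "error" events matter to this application.
records = [
    {"region": "east", "level": "error"},
    {"region": "east", "level": "info"},
    {"region": "west", "level": "error"},
    {"region": "west", "level": "debug"},
    {"region": "east", "level": "error"},
]

def effective_records(rows):
    """Skip irrelevant rows so later stages only see the effective volume."""
    return [r for r in rows if r["level"] == "error"]

def count_by_region(rows):
    """Count rows per region; each region can be counted on its own."""
    return Counter(r["region"] for r in rows)

relevant = effective_records(records)
counts = count_by_region(relevant)

# Processing each region's chunk separately and merging gives the same result,
# so the data never has to be processed together.
east = count_by_region([r for r in relevant if r["region"] == "east"])
west = count_by_region([r for r in relevant if r["region"] == "west"])
merged = east + west
```

Here the application only ever touches three of the five records, and the per-region chunks could be handled by separate workers before their counts are merged.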
Characteristics
Velocity
not the speed of generation
but the speed of saving/consumption/processing
determined by the demand
Veracity
More of a concern than a characteristic
Focus on low-quality data
Weird to say “high veracity data”
Data volume after each stage
big raw -> big processed -> big model (deep learning)
big raw -> small processed -> small model (traditional ML, statistical models, visualization)
small raw -> small processed -> small model (when the result is needed in a short time)
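The "big raw -> small processed" path can be sketched as follows. This is only an illustration under stated assumptions: the sensor readings are simulated, and `raw_readings` and `summarize` are hypothetical helpers. The raw stream is consumed one reading at a time and reduced to one summary row per sensor, which is small enough for traditional ML, statistics, or visualization.

```python
import random
from collections import defaultdict

def raw_readings(n, seed=0):
    """Yield n simulated (sensor_id, value) readings one at a time."""
    rng = random.Random(seed)
    for _ in range(n):
        yield rng.randrange(5), rng.uniform(0.0, 100.0)

def summarize(readings):
    """Reduce the raw stream to one (count, mean) pair per sensor."""
    totals = defaultdict(lambda: [0, 0.0])  # sensor_id -> [count, sum]
    for sensor_id, value in readings:
        totals[sensor_id][0] += 1
        totals[sensor_id][1] += value
    return {sid: (count, total / count) for sid, (count, total) in totals.items()}

n_raw = 100_000
summary = summarize(raw_readings(n_raw))
print(f"raw rows: {n_raw}, processed rows: {len(summary)}")
```

Because the generator never materializes the raw data in memory, only the processed stage determines how much storage the later steps need.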
Dataset types in big data
Dataset Type    | Storage                  | Processing           | Example
Structured      | NoSQL                    | Spark SQL            | RDBMS exports, Excel
Semi-structured | NoSQL, DFS, Object Store | Spark RDD, etc.      | JSON, XML
Unstructured    | DFS, Object Store        | MapReduce, Spark RDD | Text corpus, binary files
Binary files are usually unstructured
Multimedia: Image, video, audio
Deep learning models, machine learning models
Exceptions
some binary files have an intrinsic structure
some binary files are used as a blob field in a structured dataset
Decide which tool to use
Match data volume in each stage!
Match data type
Match special demands
real-time
data quality control
Match budget
always consider cost
never imagine demands
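The matching rules above could be written down as a small lookup, sketched here under stated assumptions. The `Job` class and `suggest_tools` function are hypothetical. The storage and processing options come from the dataset type table in these notes, while the "single machine" fallback for small raw data is an assumption added for illustration, in the spirit of considering cost.

```python
from dataclasses import dataclass

# Storage and processing options per dataset type, taken from the table in these notes.
TOOLS_BY_TYPE = {
    "structured": {"storage": ["NoSQL"], "processing": ["Spark SQL"]},
    "semi-structured": {"storage": ["NoSQL", "DFS", "Object Store"],
                        "processing": ["Spark RDD"]},
    "unstructured": {"storage": ["DFS", "Object Store"],
                     "processing": ["MapReduce", "Spark RDD"]},
}

@dataclass
class Job:
    data_type: str          # one of the keys in TOOLS_BY_TYPE
    raw_is_big: bool        # volume of the raw data
    processed_is_big: bool  # volume after processing

def suggest_tools(job: Job) -> dict:
    """Match data volume first, then data type, as a rough starting point."""
    if not job.raw_is_big:
        # Assumption: small raw data does not justify the cost of a cluster.
        plan = {"storage": ["single machine"], "processing": ["single machine"]}
    else:
        choice = TOOLS_BY_TYPE[job.data_type]
        plan = {"storage": list(choice["storage"]),
                "processing": list(choice["processing"])}
    # The model stage follows the volume of the processed data.
    if job.raw_is_big and job.processed_is_big:
        plan["model"] = "big model (deep learning)"
    else:
        plan["model"] = "small model (traditional ML, statistical models, visualization)"
    return plan

plan = suggest_tools(Job(data_type="structured", raw_is_big=True, processed_is_big=False))
```

Special demands such as real-time processing or data quality control, and the budget, are deliberately left out of this sketch; they would narrow the suggested options further only when the application actually requires them.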