Final Review

Introduction

The final exam is a take-home, open-book exam. You need to finish it within a 2-day time window; please refer to the Canvas quiz “Final Exam” for the exact time. All questions relate to topics covered in this course. Below are the requirements and restrictions:

  • You may use any resources you can find to finish the exam.

  • As an individual exam, you must complete it independently; you are not allowed to discuss the exam with anyone else.

  • You are not allowed to copy directly from any resources. Answer questions in your own words, based on the content you learned in class. Quotation is allowed, but you must cite the source.

  • You are not allowed to discuss the exam with other students.

  • You are not allowed to copy from other students.

  • You are not allowed to post any questions about the exam on any online forums.

  • You are not allowed to generate answers using any online tools, including generative AI models. Such tools may produce answers that look correct but are not grounded in the course materials.

  • Content that is entirely unrelated to the course materials will be considered cheating, and you will get 0 points for the exam.

  • I will use plagiarism-detection tools to check your answers for similarity against other students’ answers and online resources. If you are found to have violated the above rules, you will get 0 points for the exam.

Summary and Relationships

  • Big vs. small data

    • Consider effective data volume in the context of an application (see the sketch after this list)

      • not all data needs to be processed

        • irrelevant data can be skipped

        • job can be done with a subset of data

      • not all data needs to be processed together

        • results do not require a global view of the data
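
To make “effective data volume” concrete, here is a minimal PySpark sketch. The file name and column names (events.parquet, country) are assumptions for illustration: irrelevant rows are dropped early, and the remaining job runs on a small sample.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("effective-volume").getOrCreate()

    # Hypothetical input: the raw dataset may be big...
    events = spark.read.parquet("events.parquet")

    # ...but irrelevant data can be skipped before any heavy work,
    relevant = events.filter(events.country == "US")

    # ...and the job may be doable with a subset of what remains.
    subset = relevant.sample(fraction=0.01, seed=42)
    print(subset.count())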

  • Characteristics

    • Velocity

      • not the speed at which data is generated

      • but the speed at which it must be saved/consumed/processed

      • determined by the demand: e.g., sensors may emit readings every millisecond, but if the application only needs hourly averages, the required processing velocity is low

    • Veracity

      • More of a concern than a characteristic

      • Focus on spotting low-quality data (see the sketch below)

      • It sounds odd to say “high veracity data”
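
Since veracity is a concern about low-quality data rather than a property to maximize, a quality check is the natural way to act on it. A minimal PySpark sketch with toy records (all names are illustrative):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("veracity-check").getOrCreate()

    # Toy records with the defects veracity worries about:
    # a duplicate id and a missing value.
    df = spark.createDataFrame([(1, "ok"), (1, "ok"), (2, None)], ["id", "value"])

    # Count nulls per column.
    df.select(
        [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]
    ).show()

    # Count duplicate ids.
    print(df.count() - df.dropDuplicates(["id"]).count())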

  • Data volume after each stage (see the sketch after this list)

    • big raw -> big processed -> big model (deep learning)

    • big raw -> small processed -> small model (traditional ML, statistical models, visualization)

    • small raw -> small processed -> small model (when the result is needed in a short time)
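
The middle pipeline (“big raw -> small processed -> small model”) is the classic reduction pattern. A minimal PySpark sketch, assuming a hypothetical log file with day and latency_ms columns:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("stage-volumes").getOrCreate()

    raw = spark.read.parquet("big_raw_logs.parquet")  # big raw (hypothetical)

    # Big raw -> small processed: one row per day instead of one per event.
    daily = raw.groupBy("day").agg(F.avg("latency_ms").alias("avg_latency"))

    # Small processed -> small model: the aggregate now fits in local
    # memory, so traditional statistics are enough.
    small = daily.toPandas()
    print(small.describe())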

  • Dataset types in big data (usage sketched after the table)

    Dataset Type      Storage                    Processing            Example
    ----------------  -------------------------  --------------------  -------------------------
    Structured        NoSQL                      Spark SQL             RDBMS exports, Excel
    Semi-structured   NoSQL, DFS, Object Store   Spark RDD, etc.       JSON, XML
    Unstructured      DFS, Object Store          MapReduce, Spark RDD  Text corpus, binary files
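
A minimal sketch of how the table’s processing column plays out in code; the file paths are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dataset-types").getOrCreate()
    sc = spark.sparkContext

    # Structured: columns are known, so Spark SQL applies directly.
    structured = spark.read.csv("export.csv", header=True, inferSchema=True)
    structured.createOrReplaceTempView("t")
    spark.sql("SELECT COUNT(*) FROM t").show()

    # Semi-structured: JSON carries its own (flexible) schema.
    spark.read.json("records.json").printSchema()

    # Unstructured: a text corpus starts as raw lines in an RDD.
    words = sc.textFile("corpus.txt").flatMap(lambda line: line.split())
    print(words.take(5))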

  • Binary files are usually unstructured

    • Multimedia: Image, video, audio

    • Deep learning model, Machine learning model

    • Exceptions

      • some binary files have an intrinsic structure

      • some binary files are used as a BLOB field in a structured dataset (see the sketch below)
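
The BLOB exception can be shown with Python’s built-in sqlite3: the bytes themselves stay opaque (unstructured), but they occupy one field of a structured table. The file name is a placeholder:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE photos (id INTEGER PRIMARY KEY, data BLOB)")

    # The image bytes are unstructured on their own...
    with open("photo.jpg", "rb") as f:  # placeholder binary file
        conn.execute("INSERT INTO photos (id, data) VALUES (?, ?)", (1, f.read()))

    # ...but live inside a structured row like any other value.
    (blob,) = conn.execute("SELECT data FROM photos WHERE id = 1").fetchone()
    print(len(blob), "bytes stored as a blob field")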

  • Decide which tool to use

    • Match data volume in each stage!

    • Match data type

    • Match special demands

      • real-time (see the streaming sketch after this list)

      • data quality control

    • Match budget

      • always consider cost

      • never invent demands the application does not actually have
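
For the real-time demand mentioned above, a minimal Spark Structured Streaming sketch using the built-in rate source (which generates timestamped rows continuously, so nothing external is needed):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("real-time-demand").getOrCreate()

    # The rate source emits rows continuously; a real job would read
    # from Kafka or another live feed instead.
    stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

    # Print each micro-batch as it arrives.
    query = stream.writeStream.format("console").outputMode("append").start()
    query.awaitTermination(10)  # run for ~10 seconds
    query.stop()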