Project

Learning outcome

  • Practice scientific background research

  • Practice how to design a big data application

  • Practice scientific background review writing

  • Practice scientific proposal writing

Objective

Write a proposal like article on a big data problem you want to solve. In this report, you must:

  1. Identify the problem (<1 page)

    The data source must be big data. The problem must be big data problem. The problem must be interesting and challenging. The data should be extremely big in volume, or relative big but very complex for traditional machine learning model (text, image, sound, etc). You must have a good understanding of the dataset you plan to use. For data with big volume, you can use the processing tools to make it small enough for analysis. For complex data, you can either use some special tools to convert them or find the right deep learning models to directly consume them.

  2. Review the background (2-3 pages)

    You must review the background of the problem. Carefully review the related articles and justify why your problem is big data and why it is interesting and challenging. The review should be systematic and comprehensive. Employ what you have learning in the course to organize the information you found.

  3. Propose your method (1-2 pages)

    You must propose a solution to the problem. The solution must be feasible and interesting. The solution should be based on the knowledge you have learned in the course. The solution must be well justified. You do not need to provide a lot of details. You can provide a high level overview of the solution. List the steps of how you will go though the big data life cycle. List the tools you plan to use in each stage of the big data application. Also discuss how you are going to address all of the aspects in the big data life cycle. Details on how to used the tools are not required. You can provide a high level overview of how you will use the tools. You can also discuss the pros and cons of the tools you plan to use.

  4. References (1 page)

Report Requirement

  • A single page Topic Proposal due before the project to get approval from the instructor. You should focus on the justification on the topic (problem) as a valid big data project and also discuss your planned approach.

  • 4-5 pages letter size paper, 1.5 spacing, 12pt normal text font

  • Be organized and neat with headings, styles to separate sections

  • You can either use LaTeX or Microsoft Word

  • Provide image, figure, table, as needed; Do not forget to provide credit to the source of the image, figure, table

  • Use MLA style or other professional format for citation and reference

Report Structure

Your report is required to have the following sections:

Note

The bullet points listed under each section is the rubric that will be used. Do not use them as subsections! Please show section titles explicitly!

  • Abstract

    • Summarize the proposal in a concise manner

    • Clearly state the problem statement, research question, and objectives

    • Highlight the importance and relevance of the proposed research

  • Introduction

    • Introduce the research topic and the problem being addressed

    • Explain why the problem is important to study

    • State the research question and objectives

  • Background review

    • Review relevant literature and theoretical frameworks related to big data applications

    • Identify gaps or shortcomings in existing research

    • Provide a clear rationale for the proposed research based on the prior researches

  • Method

    • Clearly outline the research design and methodology, including data collection and analysis methods

    • Justify the choice of methodology and explain why it is appropriate for the research question and objectives

    • Discuss potential limitations or challenges of the proposed methodology

  • References

    • List all references cited in the proposal

    • Use a consistent MLA style

      • In-text citation: parenthesis with numbers like in the article (1)

Warning

This section serves as a rubric that will be used to grade your project.

Warning

Be careful with the difference between the introduction and background sections.

Pitfalls

  • Do not propose a problem that is not a big data problem. The key is that throughout the life cycle of the proposed application. Some sorts of big data tools must be required.

  • Good examples of big data projects

    • Start with large volume of data and the application must see all data to find the patterns.

    • Start with medium volume but data must be processed in real time or in a short period of time, so tools like Spark must be used

    • Start with medium or small volume to develop the model and plan to deploy the model to a large volume of data. The model must be able to handle the large volume of data.

  • Example of projects that are not big data projects

    • Involves large volume of data but there is

      • no need to process all at once

      • no need to work on the whole dataset

      • no need to aggregate the data distributed in different locations

    • Developing a deep learning model requires big volume of data. However, using that as an existing model in your application is not a big data project.

Hints

Data sources

You do not need to download or own the dateset. You only need to identify it and justify the problem is big data problem. Below are some example datasets.

  • Image datasets: Image datasets are widely available online and can be used for a variety of projects, such as object recognition, image classification, and facial recognition. Avoid small datasets, such as the MNIST dataset.

  • Text datasets: Text datasets can be used for natural language processing (NLP) projects, such as sentiment analysis, language modeling, and text classification. Some popular text datasets include the Reuters Corpus, the IMDB movie review dataset, and the Gutenberg Project.

  • Audio datasets: Audio datasets can be used for speech recognition and music analysis projects. Some popular audio datasets include the Speech Commands dataset, the UrbanSound8K dataset, and the Million Song Dataset.

  • Social media datasets: Social media datasets can be used for sentiment analysis, network analysis, and other social media-related projects. Some popular social media datasets include the Twitter Sentiment Analysis dataset, the Reddit Comment dataset, and the Facebook100 dataset.

  • Health-related datasets: Health-related datasets can be used for projects related to healthcare, such as disease prediction and drug discovery. Some popular health-related datasets include the MIMIC-III dataset, the Breast Cancer Wisconsin (Diagnostic) dataset, and the National Health and Nutrition Examination Survey (NHANES) dataset.

Methods

Consider the following aspects when proposing the method.

  • Acquisition

    • Direct download of prepared datasets (public domain data source)

    • Web scraping to collect data from websites

    • Application programming interface (API) to collect data from social media platforms

    • Sensor data collection

  • Processing

    • Hadoop

    • Spark

  • Analysis

    • Traditional statistical analysis

    • Traditional machine learning

    • Deep learning

  • Storage (discuss options for all stages)

    • HDFS

    • Cloud services

    • NoSQL

  • Security and management