Project

Learning outcome

  • Practice scientific background research

  • Practice how to design a big data application

  • Practice scientific background review writing

  • Practice scientific proposal writing

Objective

Write a proposal-style article on a big data problem you want to solve.

What is a Big Data Problem?

For this project, a “big data problem” is one that is beyond the capability of traditional data tools and requires the use of big data tools and technologies to address big data challenges. The key is that throughout the life cycle of your proposed application, some big data tools must be employed.
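One rough way to check whether a topic might meet this bar is a back-of-the-envelope size estimate. The sketch below is only an illustration: the function names, the 16 GiB memory figure, and the example record counts are assumptions, not course requirements, and data volume is only one of several possible justifications.

```python
def estimated_size_gib(num_records: int, bytes_per_record: int) -> float:
    """Approximate raw dataset size in GiB (record count times record size)."""
    return num_records * bytes_per_record / 2**30


def exceeds_single_machine(num_records: int, bytes_per_record: int,
                           ram_gib: float = 16.0) -> bool:
    """Return True if the raw data alone would not fit in the assumed RAM.

    ram_gib is an assumed figure for a typical laptop, not a fixed threshold.
    """
    return estimated_size_gib(num_records, bytes_per_record) > ram_gib


# Hypothetical example: 5 billion sensor readings at about 200 bytes each.
size = estimated_size_gib(5_000_000_000, 200)
print(f"Estimated raw size: {size:.0f} GiB")
print("Exceeds assumed RAM:", exceeds_single_machine(5_000_000_000, 200))
```

An estimate like this can support the justification in a Topic Proposal, but it does not replace an argument about why the whole dataset must be processed together.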

Deadlines

  • Topic Proposal: End of Week 5

  • Final Report: End of semester

Report Requirement

  • A single-page Topic Proposal is due before you begin the project, to get approval from the instructor. Focus on justifying the topic (problem) as a valid big data project, and also discuss your planned approach.

  • Final Report: 4-5 pages on letter-size paper, 1.5 line spacing, 12 pt font for normal text

  • Be organized and neat; use headings and styles to separate sections

  • You can either use LaTeX or Microsoft Word

  • Provide images, figures, and tables as needed; do not forget to credit the source of each image, figure, or table

  • Use MLA style or another professional format for citations and references

Report Structure and Grading Rubric

Your report is required to have the following sections. The bullet points under each section outline the grading criteria and should not be used as subsections in your report.

  • Abstract

    • Summarize the proposal in a concise manner (approx. 250 words).

    • Clearly state the problem statement, research question, and objectives.

    • Highlight the importance and relevance of the proposed research.

  • Introduction

    • Introduce the research topic and the problem being addressed in detail.

    • Explain why the problem is important to study and its real-world impact.

    • State the research question and a clear, measurable set of objectives.

  • Background review

    • Review relevant literature and theoretical frameworks related to big data applications. This should include academic papers, articles, and books.

    • Identify gaps or shortcomings in existing research that your project aims to address.

    • Provide a clear rationale for the proposed research based on prior research.

  • Method

    • Clearly outline the research design and methodology, including data collection, processing, and analysis methods.

    • Justify the choice of methodology and explain why it is appropriate for the research question and objectives.

    • Discuss potential limitations or challenges of the proposed methodology and how you plan to mitigate them.

  • References

    • List all references cited in the proposal.

    • Use a consistent style, such as MLA style.

      • In-text citation: a number in parentheses, for example "as shown in the article (1)"

Academic Integrity

All work submitted for this project must be your own. Plagiarism, including copying from online sources, other students, or generative AI tools without proper citation, will not be tolerated. All sources must be properly cited. Any violation of academic integrity will result in a grade of zero for the project.

Warning

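Be careful to distinguish the Introduction from the Background review. The Introduction presents your problem, its importance, and your objectives; the Background review surveys prior work and identifies the gaps your project aims to address.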

Pitfalls

  • Do not propose a problem that is not a big data problem. The key is that, throughout the life cycle of the proposed application, some big data tools must be involved.

  • Good examples of big data projects

    • Start with a large volume of data, where the application must see all of the data to find the patterns.

    • Start with a medium volume of data that must be processed in real time or within a short period, so tools like Spark must be used.

    • Start with a medium or small volume of data to develop the model, with a plan to deploy the model on a large volume of data. The model must be able to handle the large volume of data.

  • Examples of projects that are not big data projects

    • The project involves a large volume of data, but there is

      • no need to process all at once

      • no need to work on the whole dataset

      • no need to aggregate the data distributed in different locations

    • Developing a deep learning model requires a large volume of data. However, simply using an existing, already trained model in your application is not a big data project.

Hints

Data sources

You do not need to download or own the dataset. You only need to identify it and justify that the problem is a big data problem. Below are some example datasets. These datasets are not big data datasets. They can serve as a good starting point, but you need to learn from them and propose a plan to extend them to big data scale if you wish to use them. Some of these datasets are hard to extend, so they can only be used for learning and inspiration.

  • Image datasets: Image datasets are widely available online and can be used for a variety of projects, such as object recognition, image classification, and facial recognition. Avoid small datasets, such as the MNIST dataset.

  • Text datasets: Text datasets can be used for natural language processing (NLP) projects, such as sentiment analysis, language modeling, and text classification. Some popular text datasets include the Reuters Corpus, the IMDB movie review dataset, and Project Gutenberg.

  • Audio datasets: Audio datasets can be used for speech recognition and music analysis projects. Some popular audio datasets include the Speech Commands dataset, the UrbanSound8K dataset, and the Million Song Dataset.

  • Social media datasets: Social media datasets can be used for sentiment analysis, network analysis, and other social media-related projects. Some popular social media datasets include the Twitter Sentiment Analysis dataset, the Reddit Comment dataset, and the Facebook100 dataset.

  • Health-related datasets: Health-related datasets can be used for projects related to healthcare, such as disease prediction and drug discovery. Some popular health-related datasets include the MIMIC-III dataset, the Breast Cancer Wisconsin (Diagnostic) dataset, and the National Health and Nutrition Examination Survey (NHANES) dataset.

Methods

Consider the following aspects when proposing the method.

  • Acquisition

    • Direct download of prepared datasets (public domain data source)

    • Web scraping to collect data from websites

    • Application programming interface (API) to collect data from social media platforms

    • Sensor data collection

  • Processing

    • Hadoop

    • Spark

  • Analysis

    • Traditional statistical analysis

    • Traditional machine learning

    • Deep learning

  • Storage (discuss options for all stages)

    • HDFS

    • Cloud services

    • NoSQL

  • Security and management
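To make the Processing options above more concrete, the following sketch imitates, in plain single-machine Python, the map, shuffle, and reduce pattern that Hadoop MapReduce popularized and that Spark also supports. It is a teaching illustration under stated assumptions, not Hadoop or Spark code: the function names and the sample partitions are hypothetical, and a real cluster would run each phase in parallel across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(partition: Iterable[str]) -> List[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one partition of the input."""
    return [(word.lower(), 1) for line in partition for word in line.split()]


def shuffle_phase(mapped: Iterable[List[Tuple[str, int]]]) -> Dict[str, List[int]]:
    """Group the values from all partitions by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Combine the grouped values for each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical input, already split into two partitions
# (standing in for blocks stored on different nodes).
partitions = [
    ["big data tools", "Spark and Hadoop"],
    ["big data problems need big tools"],
]

counts = reduce_phase(shuffle_phase(map_phase(p) for p in partitions))
print(counts)
```

When proposing a method, it helps to state which steps of your pipeline fit this pattern and which need something else, such as streaming or iterative model training.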