Project¶
Learning outcome¶
Practice scientific background research
Practice how to design a big data application
Practice scientific background review writing
Practice scientific proposal writing
Objective¶
Write a proposal-style article on a big data problem you want to solve.
What is a Big Data Problem?¶
For this project, a “big data problem” is one that is beyond the capability of traditional data tools and requires the use of big data tools and technologies to address big data challenges. The key is that throughout the life cycle of your proposed application, some big data tools must be employed.
Deadlines¶
Topic Proposal: End of Week 5
Final Report: End of semester
Report Requirement¶
Submit a single-page Topic Proposal before starting the project to get approval from the instructor. Focus on justifying the topic (problem) as a valid big data project, and discuss your planned approach.
4-5 pages on letter-size paper, 1.5 line spacing, 12 pt font for normal text
Keep the report organized and neat, using headings and styles to separate sections
You can use either LaTeX or Microsoft Word
Provide images, figures, and tables as needed, and credit the source of each one
Use MLA style or other professional format for citation and reference
Report Structure and Grading Rubric¶
Your report is required to have the following sections. The bullet points under each section outline the grading criteria and should not be used as subsections in your report.
Abstract
Summarize the proposal in a concise manner (approx. 250 words).
Clearly state the problem statement, research question, and objectives.
Highlight the importance and relevance of the proposed research.
Introduction
Introduce the research topic and the problem being addressed in detail.
Explain why the problem is important to study and its real-world impact.
State the research question and a clear, measurable set of objectives.
Background review
Review relevant literature and theoretical frameworks related to big data applications. This should include academic papers, articles, and books.
Identify gaps or shortcomings in existing research that your project aims to address.
Provide a clear rationale for the proposed research based on prior research.
Method
Clearly outline the research design and methodology, including data collection, processing, and analysis methods.
Justify the choice of methodology and explain why it is appropriate for the research question and objectives.
Discuss potential limitations or challenges of the proposed methodology and how you plan to mitigate them.
References
List all references cited in the proposal.
Use a consistent style, such as MLA style.
In-text citation: numbers in parentheses, as in this example (1)
Academic Integrity¶
All work submitted for this project must be your own. Plagiarism, including copying from online sources, other students, or generative AI tools without proper citation, will not be tolerated. All sources must be properly cited. Any violation of academic integrity will result in a grade of zero for the project.
Warning
Be careful with the difference between the introduction and background sections.
Pitfalls¶
Do not propose a problem that is not a big data problem. The key is that, throughout the life cycle of the proposed application, some big data tools must be involved.
Good examples of big data projects
Start with a large volume of data, where the application must see all of the data to find the patterns.
Start with a medium volume of data that must be processed in real time or within a short period, so tools like Spark must be used.
Start with a medium or small volume of data to develop the model, with a plan to deploy the model on a large volume of data. The model must be able to handle the large volume of data.
Examples of projects that are not big data projects
Involves a large volume of data, but there is
no need to process all at once
no need to work on the whole dataset
no need to aggregate the data distributed in different locations
Developing a deep learning model requires a large volume of data. However, using an existing, already trained model in your application is not a big data project.
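A rough back-of-the-envelope estimate can help when judging which side of this line a topic falls on. The sketch below is only an illustration: the function names, the record count, the 200 MB/s read rate, and the 10-minute deadline are hypothetical assumptions, not course requirements. It assumes the application must scan the whole dataset and asks whether a single machine could do that before the deadline.

```python
def estimate_volume_bytes(num_records: int, avg_record_bytes: int) -> int:
    """Raw dataset size, ignoring compression, indexes, and replication."""
    return num_records * avg_record_bytes


def single_machine_scan_seconds(total_bytes: int, read_bytes_per_sec: float) -> float:
    """Time for one machine to read every byte once at a steady rate."""
    return total_bytes / read_bytes_per_sec


def needs_parallel_processing(total_bytes: int,
                              read_bytes_per_sec: float,
                              deadline_seconds: float) -> bool:
    """True when a single sequential scan cannot finish before the deadline."""
    return single_machine_scan_seconds(total_bytes, read_bytes_per_sec) > deadline_seconds


# Hypothetical workload: 2 billion records of about 500 bytes each (1 TB raw).
total = estimate_volume_bytes(2_000_000_000, 500)

# Assumed hardware: 200 MB/s sequential read; assumed deadline: 10 minutes.
scan_time = single_machine_scan_seconds(total, 200_000_000)
print(f"raw size: {total / 1e12:.1f} TB, single-machine scan: {scan_time / 60:.0f} min")
print("needs parallel processing:", needs_parallel_processing(total, 200_000_000, 600))
```

An estimate like this does not prove that a topic is a big data problem, but stating the assumed volume, processing deadline, and hardware limits makes the justification in the Topic Proposal easier to evaluate.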
Hints¶
Data sources¶
You do not need to download or own the dataset. You only need to identify it and justify that the problem is a big data problem. Below are some example datasets. These datasets are not big data datasets. They can serve as a good starting point, but you need to learn from them and propose a plan to extend them to big data scale if you wish to use them. Some of these datasets are hard to extend, so they can only be used for learning and inspiration.
Image datasets: Image datasets are widely available online and can be used for a variety of projects, such as object recognition, image classification, and facial recognition. Avoid small datasets, such as the MNIST dataset.
Text datasets: Text datasets can be used for natural language processing (NLP) projects, such as sentiment analysis, language modeling, and text classification. Some popular text datasets include the Reuters Corpus, the IMDB movie review dataset, and Project Gutenberg.
Audio datasets: Audio datasets can be used for speech recognition and music analysis projects. Some popular audio datasets include the Speech Commands dataset, the UrbanSound8K dataset, and the Million Song Dataset.
Social media datasets: Social media datasets can be used for sentiment analysis, network analysis, and other social media-related projects. Some popular social media datasets include the Twitter Sentiment Analysis dataset, the Reddit Comment dataset, and the Facebook100 dataset.
Health-related datasets: Health-related datasets can be used for projects related to healthcare, such as disease prediction and drug discovery. Some popular health-related datasets include the MIMIC-III dataset, the Breast Cancer Wisconsin (Diagnostic) dataset, and the National Health and Nutrition Examination Survey (NHANES) dataset.
Methods¶
Consider the following aspects when proposing the method.
Acquisition
Direct download of prepared datasets (public domain data source)
Web scraping to collect data from websites
Application programming interface (API) to collect data from social media platforms
Sensor data collection
Processing
Hadoop
Spark
Analysis
Traditional statistical analysis
Traditional machine learning
Deep learning
Storage (discuss options for all stages)
HDFS
Cloud services
NoSQL
Security and management
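To make the processing stage more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps that frameworks such as Hadoop and Spark distribute across many machines. It is plain Python, not the Hadoop or Spark API, and the function names and the tiny in-memory "partitions" are hypothetical stand-ins for data blocks stored on different nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(partition: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one partition of the data."""
    for line in partition:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Combine the values collected for each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(partitions: Iterable[Iterable[str]]) -> Dict[str, int]:
    """Run map on each partition, then shuffle and reduce the combined output."""
    mapped = (pair for partition in partitions for pair in map_phase(partition))
    return reduce_phase(shuffle(mapped))


# Two hypothetical partitions, standing in for blocks held on different nodes.
partitions = [["big data tools", "big data"], ["data tools"]]
print(word_count(partitions))  # {'big': 2, 'data': 3, 'tools': 2}
```

In a real deployment the map calls would run in parallel next to the stored data and the shuffle would move pairs over the network. Your method section should explain why the proposed workload needs that kind of distribution.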