Project
Learning outcome
Practice scientific background research
Practice how to design a big data application
Practice scientific background review writing
Practice scientific proposal writing
Objective
Write a proposal-style article on a big data problem you want to solve. In this report, you must:
Identify the problem (<1 page)
The data source must be big data, and the problem must be a big data problem that is interesting and challenging. The data should be either extremely large in volume, or relatively large but too complex for traditional machine learning models (text, image, sound, etc.). You must have a good understanding of the dataset you plan to use. For high-volume data, you can use processing tools to reduce it to a size small enough for analysis. For complex data, you can either use specialized tools to convert it or find suitable deep learning models that consume it directly.
Review the background (2-3 pages)
You must review the background of the problem. Carefully review the related articles and justify why your problem is a big data problem and why it is interesting and challenging. The review should be systematic and comprehensive. Use what you have learned in the course to organize the information you find.
Propose your method (1-2 pages)
You must propose a solution to the problem. The solution must be feasible, interesting, well justified, and based on what you have learned in the course. A high-level overview is enough; you do not need to provide a lot of detail. List the steps you will take through the big data life cycle and the tools you plan to use at each stage of the big data application. Also discuss how you will address every aspect of the big data life cycle. Details on how to use the tools are not required; a high-level overview of how you will use them is sufficient. You can also discuss the pros and cons of the tools you plan to use.
References (1 page)
Report Requirement
A single-page Topic Proposal is due before the project so that the instructor can approve the topic. Focus on justifying the topic (problem) as a valid big data project, and discuss your planned approach.
4-5 pages on letter-size paper, 1.5 line spacing, 12 pt font for normal text
Be organized and neat; use headings and styles to separate sections
You can either use LaTeX or Microsoft Word
Provide images, figures, and tables as needed; do not forget to credit the source of each image, figure, or table
Use MLA style or other professional format for citation and reference
Report Structure
Your report is required to have the following sections:
Note
The bullet points listed under each section are the rubric that will be used. Do not use them as subsections! Please show section titles explicitly!
Abstract
Summarize the proposal in a concise manner
Clearly state the problem statement, research question, and objectives
Highlight the importance and relevance of the proposed research
Introduction
Introduce the research topic and the problem being addressed
Explain why the problem is important to study
State the research question and objectives
Background review
Review relevant literature and theoretical frameworks related to big data applications
Identify gaps or shortcomings in existing research
Provide a clear rationale for the proposed research based on prior research
Method
Clearly outline the research design and methodology, including data collection and analysis methods
Justify the choice of methodology and explain why it is appropriate for the research question and objectives
Discuss potential limitations or challenges of the proposed methodology
References
List all references cited in the proposal
Use a consistent MLA style
In-text citation: use numbers in parentheses, as in the article, e.g. (1)
Warning
This section serves as a rubric that will be used to grade your project.
Warning
Be careful with the difference between the introduction and background sections.
Pitfalls
Do not propose a problem that is not a big data problem. The key is that some sort of big data tool must be required at some point in the life cycle of the proposed application.
Good examples of big data projects
Start with a large volume of data, where the application must see all of the data to find the patterns.
Start with a medium volume of data that must be processed in real time or within a short period, so tools like Spark must be used.
Start with a medium or small volume of data to develop the model, with a plan to deploy the model on a large volume of data. The model must be able to handle the large volume of data.
Examples of projects that are not big data projects
Involves a large volume of data, but there is
no need to process all at once
no need to work on the whole dataset
no need to aggregate data distributed across different locations
Developing a deep learning model requires a large volume of data. However, using an existing, already trained model in your application does not make it a big data project.
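As a rough first check when justifying a topic, you can estimate the raw size of the dataset and compare it with the memory of a single machine. The sketch below is only illustrative: the function names, the 50% headroom factor, and the example numbers are assumptions, not course requirements. Note that volume alone is not sufficient; as the examples above show, the application must also need to process the data as a whole.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def estimate_dataset_bytes(num_records: int, avg_record_bytes: int) -> int:
    """Rough raw size of a dataset, ignoring compression and indexes."""
    return num_records * avg_record_bytes


def exceeds_single_machine(dataset_bytes: int, ram_bytes: int,
                           headroom: float = 0.5) -> bool:
    """Return True if the dataset is larger than the usable share of RAM.

    `headroom` is an assumed fraction of RAM that is realistically
    available for the data itself (the rest goes to the OS and the program).
    """
    return dataset_bytes > ram_bytes * headroom


# Hypothetical example: 2 billion records of about 500 bytes each,
# compared against a machine with 64 GiB of RAM.
size = estimate_dataset_bytes(2_000_000_000, 500)
print(f"Estimated size: {size / GIB:.1f} GiB")
print("Exceeds one machine:", exceeds_single_machine(size, 64 * GIB))
```

A result of `True` only suggests that distributed storage or processing may be needed; you still have to argue that the analysis cannot be done on a small sample or one partition at a time.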
Hints
Data sources
You do not need to download or own the dataset. You only need to identify it and justify that the problem is a big data problem. Below are some example datasets.
Image datasets: Image datasets are widely available online and can be used for a variety of projects, such as object recognition, image classification, and facial recognition. Avoid small datasets, such as the MNIST dataset.
Text datasets: Text datasets can be used for natural language processing (NLP) projects, such as sentiment analysis, language modeling, and text classification. Some popular text datasets include the Reuters Corpus, the IMDB movie review dataset, and Project Gutenberg.
Audio datasets: Audio datasets can be used for speech recognition and music analysis projects. Some popular audio datasets include the Speech Commands dataset, the UrbanSound8K dataset, and the Million Song Dataset.
Social media datasets: Social media datasets can be used for sentiment analysis, network analysis, and other social media-related projects. Some popular social media datasets include the Twitter Sentiment Analysis dataset, the Reddit Comment dataset, and the Facebook100 dataset.
Health-related datasets: Health-related datasets can be used for projects related to healthcare, such as disease prediction and drug discovery. Some popular health-related datasets include the MIMIC-III dataset, the Breast Cancer Wisconsin (Diagnostic) dataset, and the National Health and Nutrition Examination Survey (NHANES) dataset.
Methods
Consider the following aspects when proposing the method.
Acquisition
Direct download of prepared datasets (public domain data source)
Web scraping to collect data from websites
Application programming interface (API) to collect data from social media platforms
Sensor data collection
Processing
Hadoop
Spark
Analysis
Traditional statistical analysis
Traditional machine learning
Deep learning
Storage (discuss options for all stages)
HDFS
Cloud services
NoSQL
Security and management
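To make the processing stage more concrete, the sketch below imitates the map, shuffle, and reduce steps of the MapReduce model used by Hadoop, written in plain Python on a tiny in-memory sample. It is a teaching illustration under stated assumptions, not Hadoop or Spark code: the function names and the sample lines are hypothetical, and a real framework would run these phases in parallel across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the counts collected for each word."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Run the three phases in sequence on an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return reduce_phase(shuffle_phase(mapped))


# Hypothetical sample input standing in for a much larger text collection.
sample_lines = ["big data tools", "big data life cycle", "data storage"]
counts = word_count(sample_lines)
print(counts)
```

In a proposal you do not need code like this; it is enough to name the tool for each stage and explain at a high level why its processing model suits your data.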