Project¶
Learning outcome¶
Practice scientific background research
Practice how to design a big data application
Practice scientific background review writing
Practice scientific proposal writing
Objective¶
Write a proposal-style article on a big data problem you want to solve.
What is a Big Data Problem?¶
For this project, a “big data problem” is one that is beyond the capability of traditional data tools and requires the use of big data tools and technologies to address big data challenges. The key is that throughout the life cycle of your proposed application, some big data tools must be employed.
Deadlines¶
Topic Proposal: Middle of the semester
Final Report: End of semester
Report Requirement¶
A single-page Topic Proposal is due before you start the project, so that the instructor can approve the topic. Focus on justifying the topic (problem) as a valid big data project, and discuss your planned approach.
4-5 pages on letter-size paper, 1.5 line spacing, 12 pt font for normal text
Be organized and neat; use headings and styles to separate sections
You can either use LaTeX or Microsoft Word
Provide images, figures, and tables as needed; do not forget to credit the source of each image, figure, and table
Use MLA style or other professional format for citation and reference
Report Structure and Grading Rubric¶
Your report is required to have the following sections. The bullet points under each section outline the grading criteria and should not be used as subsections in your report.
Abstract
Summarize the proposal in a concise manner (approx. 250 words).
Clearly state the problem statement, research question, and objectives.
Highlight the importance and relevance of the proposed research.
Introduction
Introduce the research topic and the problem being addressed in detail.
Explain why the problem is important to study and its real-world impact.
State the research question and a clear, measurable set of objectives.
Background review
Review relevant literature and theoretical frameworks related to big data applications. This should include academic papers, articles, and books.
Identify gaps or shortcomings in existing research that your project aims to address.
Provide a clear rationale for the proposed research based on prior research.
Method
Clearly outline the research design and methodology, including data collection, processing, and analysis methods.
Justify the choice of methodology and explain why it is appropriate for the research question and objectives.
Discuss potential limitations or challenges of the proposed methodology and how you plan to mitigate them.
References
List all references cited in the proposal.
Use a consistent style, such as MLA style.
In-text citation: use numbers in parentheses, for example (1)
Academic Integrity¶
All work submitted for this project must be your own. Plagiarism, including copying from online sources, other students, or generative AI tools without proper citation, will not be tolerated. All sources must be properly cited. Any violation of academic integrity will result in a grade of zero for the project.
Warning
Be careful with the difference between the introduction and background sections.
Pitfalls¶
Do not propose a problem that is not a big data problem. The key is that, throughout the life cycle of the proposed application, some big data tools must be involved.
Good examples of big data projects
Start with a large volume of data, where the application must see all of the data to find the patterns.
Start with a medium volume of data that must be processed in real time or within a short period, so that tools like Spark must be used.
Start with a medium or small volume of data to develop the model, and plan to deploy the model on a large volume of data. The model must be able to handle the large volume of data.
Examples of projects that are not big data projects
Involves a large volume of data, but there is
no need to process all at once
no need to work on the whole dataset
no need to aggregate the data distributed in different locations
Developing a deep learning model requires a large volume of data. However, using an existing, already trained model in your application does not make it a big data project.
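To illustrate the "no need to process all at once" case above: a statistic such as a mean can be computed in one streaming pass with constant memory, so a large file by itself does not call for big data tools. The following is a minimal sketch; `streaming_mean`, `read_in_chunks`, and the sample data are hypothetical names and values chosen only for illustration.

```python
from typing import Iterable, Iterator, List


def streaming_mean(chunks: Iterable[List[float]]) -> float:
    """Compute the mean over all chunks while holding one chunk in memory at a time."""
    total = 0.0
    count = 0
    for chunk in chunks:
        total += sum(chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("no data")
    return total / count


def read_in_chunks(values: List[float], size: int) -> Iterator[List[float]]:
    """Stand-in for reading a large file piece by piece (hypothetical helper)."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


# Illustrative data only; a real file would be read from disk chunk by chunk.
data = [float(x) for x in range(1, 101)]
result = streaming_mean(read_in_chunks(data, size=10))
print(result)
```

If your core computation can be written this way on a single machine, the project likely falls into the "not a big data project" category, however large the raw files are.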
Hints¶
Data sources¶
You do not need to download or own the dataset. You only need to identify it and justify that the problem is a big data problem. Below are some example datasets. These datasets are not big data datasets. Make sure that the effective data volume in your proposed project is big data. Effective data volume is the amount of data that needs to be processed together to solve the problem. Unnecessary data volume does not count as big data.
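One rough way to reason about effective data volume is a back-of-the-envelope estimate. The sketch below is only an illustration under stated assumptions: the record count, record size, fraction that must be processed together, and the 64 GB single-machine memory figure are all hypothetical numbers, and comparing against one machine's memory is just one possible yardstick, not a rule of this course.

```python
def effective_volume_bytes(num_records: int, bytes_per_record: int,
                           fraction_needed: float) -> float:
    """Estimate the bytes that must be processed together to solve the problem."""
    if not 0.0 <= fraction_needed <= 1.0:
        raise ValueError("fraction_needed must be between 0 and 1")
    return num_records * bytes_per_record * fraction_needed


def exceeds_single_machine(volume_bytes: float, memory_bytes: float) -> bool:
    """Return True if the estimate is larger than the assumed memory of one machine."""
    return volume_bytes > memory_bytes


# Hypothetical project: 2 billion records of about 500 bytes each,
# of which half must be analyzed together.
volume = effective_volume_bytes(2_000_000_000, 500, 0.5)
assumed_memory = 64 * 10**9  # assumed 64 GB workstation (illustrative)

print(f"Effective volume: {volume / 10**9:.0f} GB")
print("Exceeds one machine:", exceeds_single_machine(volume, assumed_memory))
```

An estimate like this belongs in the Topic Proposal together with the reasoning behind each number, since the justification matters more than the arithmetic.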
Methods¶
Consider the following aspects when proposing the method.
Acquisition
Direct download of prepared datasets (public domain data source). You must state the exact data volume and justify that it is big data.
Web scraping to collect data from websites. Scraping must be legally allowed, and you must state the exact data volume and justify that it is big data.
Application programming interface (API) to collect data from social media platforms.
Sensor data collection.
Processing
Hadoop
Spark
Cloud computing services
Analysis
Traditional statistical analysis
Traditional machine learning
Deep learning
Storage (discuss options for all stages)
HDFS
NoSQL
Cloud storage services
Security and management
Data governance and ethical considerations
Compliance with legal and regulatory requirements
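For the web scraping option under Acquisition, one partial signal of whether a site permits automated access is its `robots.txt` file. The sketch below uses Python's standard `urllib.robotparser` on an inline, made-up rule set; the user agent name `course-project-bot` and the `example.com` URLs are hypothetical. Note that `robots.txt` is not legal permission: a site's terms of service and applicable law still need to be checked separately.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration; a real check would load
# the file from the target site instead of using inline text.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_path_allowed(robots_text: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


public_ok = is_path_allowed(
    ROBOTS_TXT, "course-project-bot", "https://example.com/public/data.csv")
private_ok = is_path_allowed(
    ROBOTS_TXT, "course-project-bot", "https://example.com/private/data.csv")

print("public allowed:", public_ok)
print("private allowed:", private_ok)
```

In a proposal, it is enough to report what the site's rules and terms say and how your planned collection stays within them.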