Project Instructions

Learning outcome

Process COVID-19 data using Spark SQL.

This is an open-ended project with no notebook template provided. Create your own notebook to complete the tasks.

The data comes from Kaggle: https://www.kaggle.com/datasets/sudalairajkumar/novel-corona-virus-2019-dataset
We will use an older version hosted in our GitHub dataset repository: https://github.com/uwf-fang-teaching/cap4786-datasets/tree/main/covid
To get a direct URL to a file, open the file on GitHub and click “Raw”.
Downloading from the URL in your DCE (Databricks Community Edition) notebook is recommended. Refer to the file operations instructions on Canvas for details.
The data files involved are:
- covid_19_data.csv - General data, including confirmed, deaths, and recovered counts across the world
- time_series_covid_19_recovered.csv - Recovered data, including longitude and latitude of regions and countries
- COVID19_open_line_list.csv - Contains the symptoms data; only part of the data is listed
- COVID19_line_list_data.csv - Contains the age data
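
The download step can be sketched as follows. This is a minimal sketch, not the required approach: the helper names `raw_url` and `download`, the `/tmp` destination, and the assumption that the files sit directly in the `covid` folder of the `main` branch are illustrative and should be checked against the URL shown after clicking “Raw”.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Assumed location of the files, derived from the repository tree URL above.
REPO_PAGE = "https://github.com/uwf-fang-teaching/cap4786-datasets"
BRANCH = "main"
FOLDER = "covid"


def raw_url(file_name):
    """Build a raw-content URL for one file in the dataset folder."""
    return (
        REPO_PAGE.replace("github.com", "raw.githubusercontent.com")
        + f"/{BRANCH}/{FOLDER}/{file_name}"
    )


def download(file_name, dest_dir="/tmp"):
    """Download one data file to local disk and return the local path."""
    dest = Path(dest_dir) / file_name
    urlretrieve(raw_url(file_name), str(dest))
    return str(dest)
```

In a notebook, `download("covid_19_data.csv")` would return a local path that can then be read with Spark. Whether that path needs a prefix such as `file:` depends on the environment, so follow the Canvas instructions on file operations.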

1. Load the data into Spark DataFrames. Load only the CSV files needed for the tasks below. Show the schema of each DataFrame.
2. Show the top 10 countries with the most recoveries. (general dataset)
3. Show the longitude and latitude of the top 10 countries with the most recoveries. Join the result of task 2 with the longitude and latitude data from the recovered dataset.
4. Show the symptoms by country. (open line list dataset)
5. Show the average age of death by country, ordered by age. (line list dataset)
6. Rank the countries by the number of deaths. (general dataset)
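
Several of the tasks above follow the same Spark SQL pattern: read a CSV into a DataFrame, print its schema, register it as a temporary view, and run an aggregate query. The following is a minimal sketch of that pattern, not a solution. It assumes a notebook where `spark` is already defined. The helper names, the view name `covid_general`, and the column names `Country/Region` and `Recovered` are assumptions to verify against the printed schema.

```python
def load_view(spark, path, view_name):
    """Read a CSV with a header row, print its schema, register a temp view."""
    df = spark.read.csv(path, header=True, inferSchema=True)
    df.printSchema()
    df.createOrReplaceTempView(view_name)
    return df


def top_n_query(view_name, group_col, value_col, agg="SUM", n=10):
    """Build a Spark SQL query that ranks groups by an aggregated column."""
    # Backticks let Spark SQL accept column names containing characters like "/".
    return (
        f"SELECT `{group_col}` AS grp, {agg}(`{value_col}`) AS total "
        f"FROM {view_name} "
        f"GROUP BY `{group_col}` "
        f"ORDER BY total DESC "
        f"LIMIT {n}"
    )


# In a notebook where `spark` exists (the file path is a placeholder):
# load_view(spark, "file:/tmp/covid_19_data.csv", "covid_general")
# spark.sql(top_n_query("covid_general", "Country/Region", "Recovered")).show()
```

Whether `SUM`, `MAX`, or a filter on the latest date gives the intended count depends on how the values are recorded in the file, so inspect a few rows before choosing the aggregate.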

10% bonus points
Do you notice any data quality issues? If so, how would you address them?
- Finding one dataset with data quality issues is sufficient
- Describe the issue
- Suggest how to fix it
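
One possible starting point for spotting quality issues is to count missing values per column. The sketch below builds such a Spark SQL query as a string. The helper name `null_count_query`, the view name, and the column names in the usage comment are hypothetical, and missing values are only one of many kinds of issue worth checking.

```python
import re


def null_count_query(view_name, columns):
    """Build a Spark SQL query counting rows and NULLs in the given columns."""
    parts = []
    for col in columns:
        # Replace characters such as "/" or " " so the alias is a plain name.
        alias = re.sub(r"\W+", "_", col) + "_nulls"
        parts.append(f"SUM(CASE WHEN `{col}` IS NULL THEN 1 ELSE 0 END) AS {alias}")
    return f"SELECT COUNT(*) AS n_rows, {', '.join(parts)} FROM {view_name}"


# In a notebook, after registering a temp view named "line_list" (hypothetical):
# spark.sql(null_count_query("line_list", ["country", "age"])).show()
```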