Project Instructions

Learning outcome

  • Learn how to employ Spark SQL to process data.

Objective

Process COVID-19 data using Spark SQL.

COVID-19 Data Processing

This is a pretty open project without provided notebook template. Please make your own notebook to complete the tasks.

Data Source

  • The data is from the Kaggle: https://www.kaggle.com/datasets/sudalairajkumar/novel-corona-virus-2019-dataset

  • We will be using an old version hosted in our GitHub dataset repository: https://github.com/uwf-fang-teaching/cap4786-datasets/tree/main/covid

  • You can always click the file and click “Raw” to get the URL to access the file.

  • Downloading from URL in your DCE (DataBricks Community Edition) notebook is recommended. You can refer to the instruction on file operations on Canvas for details.

  • The involved data files are:

    • covid_19_data.csv - General data, including confirmed, deaths, and recovered counts across the world

    • time_series_covid_19_recovered.csv - Recovered data, including longitude and latitude of regions and countries

    • COVID19_open_line_list.csv - Contains the symptoms data, only part of the data is listed

    • COVID19_line_list_data.csv - Contains the age data

Tasks

  1. Load the data into a Spark DataFrames. Only load the csv files needed for the future tasks. Show the schema of each DataFrame.

  2. Show the top ten countries with the most recoveries. (general dataset)

  3. Show the longitude and latitude of the top 10 countries with the most recoveries. Join #2 result and the longitude and latitude data from the recovered dataset.

  4. Show the symptoms by country. (open line list dataset)

  5. Show the average age of death by country ordered by age. (line list dataset)

  6. Rank the countries by the number of deaths. (general dataset)

Requirement

  • Provide a report in DCE notebook format.

  • Make your report organized and easy to read!

  • The report should contain both code and text.

  • The report should show the output of the code.

  • Write markdown cells for your descriptive contents.

  • Write code cells for your code and keep the output.

  • Split cells by the logical steps of your analysis.

Bonus

  • 10% bonus points

  • Do you notice any data quality issues? If so, how would you address them?

    • Good enough to find one dataset with data quality issues

    • Describe the issue

    • Suggest how to fix it