PySpark Environment For Learning

PySpark Environment

Spark can run in local mode or in cluster mode. In cluster mode it supports several resource managers, including its own standalone manager, Mesos, and YARN. Installing Spark in cluster mode takes considerable experience and effort, and it is not necessary for this course.

For teaching purposes, we will install Spark with its Python language binding, PySpark, in a minimal local mode on a remote computer. We will make use of the free service Google Colaboratory, which provides each user with a virtual machine and access to it through an interactive, notebook-like interface.

Google Colab

Google Colaboratory (also known as Google Colab or simply Colab) is a cloud-based platform for writing, running, and sharing Python code in Jupyter notebooks. It is a free service from Google that gives users access to computing resources, including GPUs and TPUs, without the need for expensive hardware.

With Colab, you can easily create and share documents that contain live code, equations, visualizations, and text. You can also collaborate with others in real time, which makes it a good fit for data science teams, researchers, educators, and students.

Google Colab website

Tutorials for Google Colab:

Install PySpark on Colab

Enter !pip install pyspark in a code cell and run it.
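The install step can also be made repeatable, so that re-running the notebook does not reinstall a package that is already present. The following is a minimal sketch, not the course's official setup: the helper names is_installed and ensure_installed are hypothetical, and the sketch assumes pip is available to the notebook's Python interpreter.

```python
import importlib.util
import subprocess
import sys


def is_installed(module_name):
    """Return True if a top-level module can be found by the import system."""
    return importlib.util.find_spec(module_name) is not None


def ensure_installed(package, module_name=None):
    """Install `package` with pip only if its module is missing.

    Returns True if an install was performed, False if nothing was needed.
    (Hypothetical helper for illustration.)
    """
    if is_installed(module_name or package):
        return False
    # Same effect as running `!pip install <package>` in a notebook cell.
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])
    return True


# In a Colab code cell one might call:
# ensure_installed("pyspark")
```

Calling ensure_installed("pyspark") a second time in the same virtual machine should return False without contacting the package index.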

Access PySpark functionalities

  • Import PySpark components in a code cell

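  • Use a SparkContext object, conventionally named sc, to access PySpark functionality. Unlike the pyspark shell, a plain Colab notebook does not define sc automatically, so it has to be created in a code cell unless the course notebook already does so.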
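The two steps above can be sketched as follows. This is an illustration under stated assumptions rather than the course's own setup code: the helper names local_master and create_context and the application name "colab-demo" are made up for this example, and PySpark is imported inside the function so the cell can be loaded before PySpark is installed.

```python
def local_master(cores=None):
    """Build a Spark master URL for local mode.

    "local[*]" uses all available cores; "local[2]" uses two.
    """
    return "local[*]" if cores is None else f"local[{cores}]"


def create_context(app_name="colab-demo", cores=None):
    """Create (or reuse) a SparkContext running in local mode.

    Hypothetical helper; requires PySpark and a Java runtime on the machine.
    """
    # Imported lazily so that defining this function does not need PySpark.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName(app_name).setMaster(local_master(cores))
    # getOrCreate avoids an error when the cell is run more than once.
    return SparkContext.getOrCreate(conf)


# Typical use in a notebook cell:
# sc = create_context()
# rdd = sc.parallelize(range(10))
# print(rdd.sum())
```

Using getOrCreate rather than calling the SparkContext constructor directly is a design choice for notebooks, where cells are often re-run and only one active context is allowed per process.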

File Operations

  • Each notebook runs in a separate virtual machine with its own file system.

  • This file system is temporary and will be deleted when the virtual machine is recycled.

  • To load data (three options)

    1. write code to download data from a URL

    2. upload data from your local machine

    3. mount Google Drive to the virtual machine

  • To persist data (two options)

    1. save data to Google Drive

    2. download data to your local machine using the file explorer sidebar
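The three options for loading data can be sketched as below. This is a minimal sketch: the function names are hypothetical, the google.colab module is only available inside a Colab virtual machine (so it is imported inside the functions that need it), and /content/drive is the mount point commonly used in Colab examples rather than a requirement.

```python
import urllib.request
from pathlib import Path


def download_file(url, dest):
    """Option 1: download data from a URL into the VM's file system."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, dest)
    return dest


def upload_from_local():
    """Option 2: open Colab's upload dialog (works only inside Colab)."""
    from google.colab import files

    # Returns a dict that maps each uploaded file name to its content.
    return files.upload()


def mount_drive(mount_point="/content/drive"):
    """Option 3: mount Google Drive (works only inside Colab)."""
    from google.colab import drive

    drive.mount(mount_point)  # asks for authorization on first use
    return Path(mount_point) / "MyDrive"


# Example (the URL is a placeholder, not a real data set):
# download_file("https://example.com/data.csv", "data/data.csv")
```

Because the file system is temporary, a download cell like this usually has to be re-run whenever the virtual machine has been recycled.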

Export the notebook

Use the menu “File -> Print” to save the notebook as a PDF file or send it to a printer.

Save Data

You can upload and download files in the file explorer sidebar, or mount your Google Drive to the Colab virtual machine.
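Saving can also be done from code instead of the sidebar. The sketch below assumes Google Drive has already been mounted (commonly at /content/drive); the helper names and the folder name spark_course are hypothetical, and google.colab is only importable inside Colab.

```python
import shutil
from pathlib import Path


def save_to_dir(src, dest_dir):
    """Copy a file into a directory, creating the directory if needed.

    With Drive mounted, dest_dir can be a folder under /content/drive/MyDrive.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # shutil.copy2 returns the path of the newly created file.
    return Path(shutil.copy2(src, dest_dir))


def download_to_local(path):
    """Trigger a browser download of a file (works only inside Colab)."""
    from google.colab import files

    files.download(str(path))


# Example, assuming Drive is mounted and result.csv exists:
# save_to_dir("result.csv", "/content/drive/MyDrive/spark_course")
# download_to_local("result.csv")
```

Copying results to Drive before the session ends keeps them available after the virtual machine and its temporary file system are gone.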