PySpark Environment For Learning¶
PySpark Environment¶
Spark can be configured to run in local or cluster mode. It also supports standalone, Mesos, and YARN as resource managers. Installing Spark in cluster mode requires considerable experience and effort, and it is not necessary in this course.
For teaching purposes, we will install Spark with its Python language binding, PySpark, in a minimal local standalone mode on a remote computer. We will make use of the free service Google Colaboratory, which provides each user with a virtual machine and allows access to it through an interactive, notebook-like interface.
Google Colab¶
Google Colaboratory (also known as Google Colab or simply Colab) is a cloud-based platform for writing, running, and sharing Python code in Jupyter notebooks. It is a free service provided by Google that gives users access to powerful computing resources, including GPUs and TPUs, without the need for expensive hardware.
With Colab, you can easily create and share documents that contain live code, equations, visualizations, and text. You can also collaborate with others in real time, which makes it a good fit for data science teams, researchers, educators, and students.
Tutorials for Google Colab:
Install PySpark on Colab¶
In a code cell, enter the following command and run it:
!pip install pyspark
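Once the install cell has finished, a quick check can confirm that the package is importable. The following is a minimal sketch, not part of the original notes; the helper name is_installed is hypothetical and chosen only for illustration.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found by the import system."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("pyspark"):
    import pyspark

    # Report the version that pip installed on this virtual machine.
    print("PySpark version:", pyspark.__version__)
else:
    print("PySpark is not installed yet; run the pip install cell first.")
```

Because the virtual machine is recycled between sessions, the install cell usually has to be run again whenever the notebook is reopened.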
Access PySpark functionalities¶
Import PySpark components in a code cell.
Use the SparkContext object sc to access PySpark functionality. If sc is not already defined in your notebook, create it first with SparkContext.getOrCreate().
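As an illustration of these two steps, the sketch below obtains a SparkContext and runs a small job through it. It assumes PySpark has already been installed in the current session. The helper names get_spark_context and sum_of_squares, the application name "colab-demo", and the local[*] master setting are assumptions made for this example, not part of the course material.

```python
def get_spark_context(app_name="colab-demo"):
    """Return the active SparkContext, creating a local one if none exists."""
    # Imported inside the function so that only this call fails
    # if PySpark has not been installed yet.
    from pyspark import SparkConf, SparkContext

    # local[*] runs Spark inside this single machine using all available cores.
    conf = SparkConf().setMaster("local[*]").setAppName(app_name)
    return SparkContext.getOrCreate(conf)


def sum_of_squares(sc, numbers):
    """Distribute the numbers as an RDD, square each one, and add the results."""
    return sc.parallelize(numbers).map(lambda x: x * x).sum()


# Typical use in a notebook cell (left commented out so that defining the
# helpers does not start Spark by itself):
# sc = get_spark_context()
# print(sum_of_squares(sc, [1, 2, 3, 4]))
```

With a working context, sum_of_squares(sc, [1, 2, 3, 4]) returns 30. Using getOrCreate rather than constructing SparkContext directly avoids an error when the cell is run a second time in the same session.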
File Operations¶
Each notebook runs in a separate virtual machine with its own file system.
This file system is temporary and will be deleted when the virtual machine is recycled.
To load data (three options):
write code to download data from a URL
upload data from your local machine
mount Google Drive to the virtual machine
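The three loading options could be sketched as follows. This is one possible approach rather than the course's own code: the helper names and the default file name "download.dat" are hypothetical, and the google.colab module is only available inside a Colab virtual machine, which is why it is imported inside the functions that need it.

```python
import os
import urllib.request
from urllib.parse import urlparse


def filename_from_url(url, default="download.dat"):
    """Derive a local file name from the last path segment of a URL."""
    name = os.path.basename(urlparse(url).path)
    return name or default


def download_from_url(url, target_dir="."):
    """Option 1: download a file from a URL into the VM's file system."""
    path = os.path.join(target_dir, filename_from_url(url))
    urllib.request.urlretrieve(url, path)
    return path


def upload_from_local_machine():
    """Option 2: open Colab's file picker and upload files from your computer."""
    from google.colab import files  # only importable inside Colab

    # Returns a dict mapping each uploaded file name to its content as bytes.
    return files.upload()


def mount_google_drive(mount_point="/content/drive"):
    """Option 3: make Google Drive visible as a folder on the VM."""
    from google.colab import drive  # only importable inside Colab

    drive.mount(mount_point)  # asks for authorization the first time
    return mount_point
```

After mounting at the default location, the contents of your Drive typically appear under /content/drive/MyDrive. A file obtained by any of the three routes can then be read by Spark through its path, for example with sc.textFile(path).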
To persist data (two options):
save data to Google Drive
download data to your local machine using the file explorer sidebar
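Both persistence options can also be driven from code. The sketch below is an assumption-laden illustration: the helper names are hypothetical, the Drive folder is only an example path that exists after Drive has been mounted, and files.download is a programmatic alternative to the file explorer sidebar mentioned above.

```python
import os
import shutil


def save_copy(src_path, dest_dir):
    """Copy a file into dest_dir, creating the folder if needed.

    Pointing dest_dir at a folder under a mounted Google Drive keeps the
    file after the virtual machine is recycled.
    """
    os.makedirs(dest_dir, exist_ok=True)
    return shutil.copy(src_path, dest_dir)  # returns the path of the new copy


def download_to_local_machine(path):
    """Ask the browser to download a file from the VM to your computer."""
    from google.colab import files  # only importable inside Colab

    files.download(path)


# Typical use in a notebook cell after mounting Drive (example paths only):
# save_copy("result.csv", "/content/drive/MyDrive/pyspark-course")
# download_to_local_machine("result.csv")
```

Copying to Drive is usually the more convenient choice for larger outputs, since the files are then available again the next time Drive is mounted.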
Export the notebook¶
Use the menu “File -> Print” to save the notebook as a PDF file or send it to a printer.
Save Data¶
You can upload and download files in the file explorer sidebar, or mount your Google Drive on the Colab virtual machine.