******************************** PySpark Environment For Learning ******************************** PySpark Environment =================== Spark can be configured in **local** or **cluster** mode. It also support **standalone**, **Mesos**, **YARN** as resource managers. Installation of Spark in cluster mode requires a lot of experience and effort. It is not necessary in this course. For teaching purpose, we will install Spark with its Python language binding PySpark in a minimal **local standalone** mode on a remote computer. We will make used of the free service **Google Colaboratory**. It provides a virtual machine to users and allow user to access it through an interactive notebook like interface. Google Colab ============ Google Colaboratory (also known as Google Colab or simply Colab) is a cloud-based platform that allows users to write, run, and share Python code using Jupyter notebooks. It is a free service provided by Google that allows users to access powerful computing resources, including GPUs and TPUs, without the need for expensive hardware. With Colab, you can easily create and share documents that contain live code, equations, visualizations, and text. You can also collaborate with others in real-time, making it an ideal tool for data science teams, researchers, educators, and students. `Google Colab website `_ .. _colab: https://colab.research.google.com/ Tutorials for Google Colab: + `Google Colab Tutorial for Beginners (towardai.net) `_ + `Google Colab Tutorial (tutorialspoint.com) `_ .. _ta: https://pub.towardsai.net/google-colab-tutorial-for-beginners-834595494d44 .. _tp: https://www.tutorialspoint.com/google_colab/index.htm Install PySpark on Colab ------------------------ Input ``!pip install pyspark`` in a code cell and run. Access PySpark functionalities ------------------------------ + Import PySpark components in a code cell + Use the provided SparkContext object :code:`sc` to access PySpark functionalities File Operations --------------- + Each notebook runs in a separate virtual machine with its own file system. + This file system is temporary and will be deleted when the virtual machine is recycled. + To load data (three options) 1. write code to download data from a URL 2. upload data from your local machine 3. mount Google Drive to the virtual machine + To persist data (two options) 1. save data to Google Drive 2. download data to your local machine using the file explorer sidebar Export the notebook ------------------- Use the menu "File -> print" to generate a PDF file or print it to a printer. Save Data --------- You can upload and download in the file explorer sidebar or mount your Google Drive to the Colab virtual machine.