PySpark Programming for Big Data

Spark supports multiple programming languages, including Java, Scala, Python, and R. In this course we will use Python to work with Spark. The PySpark components are built on top of Spark Core and provide Python APIs for interacting with Spark.

We will only briefly introduce how to program in PySpark to manipulate datasets and accomplish common big data processing tasks.

Why PySpark

  • Other languages

    • Scala

      • native language for Spark

      • best performance

      • concise and powerful language

      • learning it as a new language is hard to justify compared to Python

      • complicated syntax

    • Java: too verbose

    • R: not a general-purpose language

  • Advantages

    • Easier to install than Scala (pip install pyspark suffices)

    • You may have learned Python already

    • Python is worth the time to learn

PySpark Basics

  • Concepts

    SparkContext

    The entry point to most PySpark functionality; a minimal sketch appears at the end of this section.

    RDD (Resilient Distributed Dataset)

    The fundamental building block of PySpark: a fault-tolerant, immutable, distributed collection of objects.

    DataFrame

    A distributed collection of data organized into named columns.

    DataSet

    Similar to DataFrame but statically typed; available only in Scala and Java, and therefore optional in this course.

  • RDD

    • from the Spark Core module

    • lower-level API

    • for unstructured and semi-structured data (see the RDD sketch at the end of this section)

  • DataFrame and DataSet

    • from the Spark SQL module

    • higher-level API

    • for structured data (see the DataFrame sketch at the end of this section)
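
To make the entry point concrete, here is a minimal sketch of creating a SparkContext by hand; the application name "demo" and the local[*] master are arbitrary choices for running Spark on a single machine.

  from pyspark import SparkConf, SparkContext

  # Configure and create the entry point; local[*] runs Spark locally
  # using all available cores.
  conf = SparkConf().setAppName("demo").setMaster("local[*]")
  sc = SparkContext(conf=conf)
  print(sc.version)  # version of the running Spark context
  sc.stop()          # release resources when finished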
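
The RDD sketch below illustrates the lower-level API: it distributes a small made-up list, chains two lazy transformations, and triggers computation with an action.

  from pyspark import SparkContext

  sc = SparkContext.getOrCreate()           # reuse a running context, or create one
  rdd = sc.parallelize([1, 2, 3, 4, 5])     # distribute a local collection
  squares = rdd.map(lambda x: x * x)        # transformation: lazy
  evens = squares.filter(lambda x: x % 2 == 0)
  print(evens.collect())                    # action: runs the job -> [4, 16]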
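
Finally, a minimal DataFrame sketch using the higher-level Spark SQL module; SparkSession is the entry point here, and the column names and rows are invented for illustration.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("demo").getOrCreate()
  df = spark.createDataFrame([("alice", 23), ("bob", 31)], ["name", "age"])
  df.filter(df.age > 25).select("name").show()
  # +----+
  # |name|
  # +----+
  # | bob|
  # +----+
  spark.stop()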