PySpark Programming for Big Data

Spark supports multiple programming languages, including Java, Scala, Python, and R. In this course we will use Python to work with Spark. The PySpark components are built on top of Spark Core and provide Python APIs for interacting with Spark.

We will only briefly introduce how to program in PySpark to manipulate datasets and accomplish common big data processing tasks.

Why PySpark

  • Other languages

    • Scala

      • native language for Spark

      • best performance

      • concise and powerful language

      • learning it as a new language is hard to justify compared to Python

      • complicated syntax

    • Java: too verbose

    • R: not a general-purpose language

  • Advantages

    • Easier to install than Scala (pip install pyspark suffices)

    • You may have learned Python already

    • Python is worth the time to learn

PySpark Basics

  • Concepts

    SparkContext

    The entry point to most PySpark functionality; a minimal sketch appears at the end of this section.

    RDD (Resilient Distributed Dataset)

    The fundamental building block of PySpark: a fault-tolerant, immutable, distributed collection of objects.

    DataFrame

    A distributed collection of data organized into named columns.

    DataSet

    Similar to DataFrame but statically typed; available only in Scala and Java, and therefore optional in this course.

  • RDD

    • from the Spark Core module

    • lower-level API

    • for unstructured and semi-structured data (see the RDD sketch at the end of this section)

  • DataFrame and DataSet

    • from the Spark SQL module

    • higher-level API

    • for structured data (see the DataFrame sketch at the end of this section)
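
To make the entry point concrete, here is a minimal sketch of creating a SparkContext by hand; the application name "demo" and the local[*] master are arbitrary choices for running Spark on a single machine.

  from pyspark import SparkConf, SparkContext

  # Configure and create the entry point; local[*] runs Spark locally
  # using all available cores.
  conf = SparkConf().setAppName("demo").setMaster("local[*]")
  sc = SparkContext(conf=conf)
  print(sc.version)  # version of the running Spark context
  sc.stop()          # release resources when finished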
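
The RDD sketch below illustrates the lower-level API: it distributes a small made-up list, chains two lazy transformations, and triggers computation with an action.

  from pyspark import SparkContext

  sc = SparkContext.getOrCreate()           # reuse a running context, or create one
  rdd = sc.parallelize([1, 2, 3, 4, 5])     # distribute a local collection
  squares = rdd.map(lambda x: x * x)        # transformation: lazy
  evens = squares.filter(lambda x: x % 2 == 0)
  print(evens.collect())                    # action: runs the job -> [4, 16]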
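
Finally, a minimal DataFrame sketch using the higher-level Spark SQL module; SparkSession is the entry point here, and the column names and rows are invented for illustration.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("demo").getOrCreate()
  df = spark.createDataFrame([("alice", 23), ("bob", 31)], ["name", "age"])
  df.filter(df.age > 25).select("name").show()
  # +----+
  # |name|
  # +----+
  # | bob|
  # +----+
  spark.stop()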