PySpark Programming for Big Data¶
Spark supports multiple programming languages, including Java, Scala, Python, and R. In this course, we will use Python as the programming language to work with Spark. The PySpark components are built on top of the Spark core and provide Python APIs for interacting with Spark.
We will only briefly introduce how to program in PySpark to manipulate datasets and accomplish common big data processing tasks.
Why PySpark¶
Other languages
- Scala
  - the native language of Spark
  - best performance
  - a concise and powerful language
  - however, its syntax is complicated, and it is arguably not worth learning as a new language when Python is available
- Java: too verbose
- R: not a general-purpose language
Advantages
- easier to install (via pip) than Scala
- you may have learned Python already
- if not, Python is worth the time to learn
PySpark Basics¶
Concepts
- SparkContext¶
The entry point to most PySpark functionality.
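A minimal sketch of creating a SparkContext directly; the app name and the local master URL are placeholders chosen for illustration:

```python
from pyspark import SparkConf, SparkContext

# Configure and create a SparkContext; "local[*]" runs Spark
# locally on all available cores (an assumption for this sketch).
conf = SparkConf().setAppName("intro").setMaster("local[*]")
sc = SparkContext(conf=conf)

print(sc.version)  # version of the running Spark context
sc.stop()          # release the context's resources
```

In newer Spark programs you typically obtain a SparkContext from a SparkSession via `spark.sparkContext` rather than constructing one yourself.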
- RDD (Resilient Distributed Dataset)¶
A fundamental building block of PySpark: a fault-tolerant, immutable, distributed collection of objects.
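As a quick illustration (the numbers, app name, and master URL are arbitrary), an RDD can be created from a local list and manipulated with lazy transformations:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")  # master and app name are placeholders

nums = sc.parallelize([1, 2, 3, 4, 5])      # distribute a local list as an RDD
squares = nums.map(lambda x: x * x)         # transformation: lazy, returns a new RDD
total = squares.reduce(lambda a, b: a + b)  # action: triggers the actual computation

print(total)  # 55
sc.stop()
```

Note that transformations like `map` return new RDDs (reflecting immutability); nothing executes until an action such as `reduce` is called.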
- DataFrame¶
A distributed collection of data organized into named columns.
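A minimal sketch of building a DataFrame from local rows; the names, ages, and app name are invented for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-demo").getOrCreate()

# Create a DataFrame with two named columns from local data.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 29)],
    ["name", "age"],
)

df.filter(df.age > 30).show()  # column-based, SQL-like operations
spark.stop()
```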
- DataSet¶
Similar to a DataFrame but with strict compile-time types. The typed Dataset API is available only in Scala and Java, not in Python, so it is optional in this course.
RDD
- from the Spark Core module
- lower-level API
- for unstructured and semi-structured data

DataFrame and DataSet
- from the Spark SQL module
- higher-level API
- for structured data
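To make the contrast concrete, here is a sketch computing the same per-key sums with both APIs; the key/value pairs and app name are invented for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rdd-vs-df").getOrCreate()
sc = spark.sparkContext

pairs = [("a", 1), ("b", 2), ("a", 3)]

# Lower-level RDD API: explicit key-value operations on raw objects.
rdd_sums = sc.parallelize(pairs).reduceByKey(lambda x, y: x + y).collect()

# Higher-level DataFrame API: declarative queries on named columns,
# optimized by the Spark SQL engine.
df = spark.createDataFrame(pairs, ["key", "value"])
df_sums = df.groupBy("key").agg(F.sum("value").alias("total")).collect()

print(rdd_sums)  # e.g. [('a', 4), ('b', 2)]
print(df_sums)   # [Row(key='a', total=4), Row(key='b', total=2)] (order may vary)
spark.stop()
```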