PySpark Programming: RDD

RDD (Resilient Distributed Dataset)

  • The low-level abstraction in PySpark

  • You can do anything with RDDs, but the DataFrame and Dataset APIs are more user-friendly and efficient for structured data

  • RDDs are more flexible and can be used for unstructured and semi-structured data

  • Characteristics

    • Distributed (data is distributed across multiple machines)

    • Immutable (once created, it cannot be changed)

    • Lazy evaluation (operations are not executed until an action is called)
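
    A minimal sketch of these characteristics, assuming a local SparkContext named sc and some made-up numbers:

      from pyspark import SparkContext

      sc = SparkContext("local[*]", "rdd-notes")   # assumed local setup

      nums = sc.parallelize([1, 2, 3, 4, 5])       # data is split into partitions across the workers
      doubled = nums.map(lambda x: x * 2)          # returns immediately, nothing is computed yet (lazy)
      print(doubled.collect())                     # [2, 4, 6, 8, 10] -- the action triggers the computation
      print(nums.collect())                        # [1, 2, 3, 4, 5]  -- the original RDD is unchanged (immutable)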

  • Two types of operations

    • As extensions to the traditional map/reduce paradigm

    • Transformations

      • Take an RDD and return another RDD

      • Distributed RDD to distributed RDD

      • Can be chained

      • Include mapping, but are not limited to it

      • Most transformations are lazy

      • Some transformations may trigger shuffling (e.g. groupByKey(), sortByKey())

    • Actions

      • Take an RDD and return a single object

      • Distributed RDD to local object

      • Include reducing, but are not limited to it

      • All actions will trigger the execution of the DAG
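
    A small sketch of the two kinds of operations (sc and made-up data as above): the transformations are chained lazily, and the single action at the end triggers execution of the whole DAG.

      rdd = sc.parallelize(range(10))                                   # hypothetical input

      # transformations: distributed RDD -> distributed RDD, recorded lazily, chainable
      pipeline = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

      # action: distributed RDD -> local object, triggers execution of the DAG
      print(pipeline.reduce(lambda a, b: a + b))                        # 120 = 0 + 4 + 16 + 36 + 64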

  • Useful transformations

    • map(func), apply func to every element of the RDD, func should transform one item into another

    • flatMap(func), apply func to every element of the RDD and flatten the resulting lists, func should take one item and return a list

    • filter(func), keep only the elements for which func returns True, func should take one element and return True or False
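
    For example, with hypothetical text data (sc as above):

      lines = sc.parallelize(["to be or", "not to be"])

      print(lines.map(lambda line: len(line)).collect())    # one output per input: [8, 9]
      words = lines.flatMap(lambda line: line.split(" "))   # flattened: ['to', 'be', 'or', 'not', 'to', 'be']
      print(words.filter(lambda w: w != "not").collect())   # ['to', 'be', 'or', 'to', 'be']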

  • Useful actions

    • collect(), collect all the data and return a local list, never use with a large dataset

    • count()

    • first()

    • take(num), take the first num values and return a local list

    • reduce(func), func should take two parameters and give one result

    • sum()

    • max()

    • min()

    • mean()

    • countByValue()

      • output is a Python-dict-like object {value1: count1, value2: count2, ...}

      • models key-value pairs

      • keys are unique
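
    A quick sketch of these actions on a tiny made-up RDD (sc as above); each call below returns a local Python value, not an RDD:

      nums = sc.parallelize([3, 1, 2, 3, 1, 3])

      print(nums.count())                                       # 6
      print(nums.first())                                       # 3
      print(nums.take(2))                                       # [3, 1]
      print(nums.reduce(lambda a, b: a + b))                    # 13
      print(nums.sum(), nums.max(), nums.min(), nums.mean())    # 13 3 1 2.1666...
      print(nums.countByValue())                                # counts per distinct value, e.g. {3: 3, 1: 2, 2: 1}
      print(nums.collect())                                     # [3, 1, 2, 3, 1, 3], only safe because the data is tiny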

  • Pair RDD

    A pair RDD is a specialized subtype of the RDD data structure whose elements take the form of key-value pairs. PySpark provides a dedicated group of operations for working with them.

    • example [(key1, value1), (key2, value2), ...]
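
    A pair RDD is usually produced by an ordinary transformation that emits 2-tuples, e.g. (made-up data, sc as above):

      words = sc.parallelize(["spark", "rdd", "spark", "pyspark"])
      pairs = words.map(lambda w: (w, 1))   # an RDD of (key, value) tuples is a pair RDD
      print(pairs.collect())                # [('spark', 1), ('rdd', 1), ('spark', 1), ('pyspark', 1)]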

    • Useful transformations for pair RDDs

      • sortByKey(ascending=True)

      • sortBy(func, ascending=True)

        • func should take an item and return the value used to perform sorting

        • e.g. pairs.sortBy(lambda p: p[1]) uses the value (second item in the pair) to sort the pairs

      • groupByKey()

        • create a new RDD in which each item is a key paired with all the values that share that key

      • reduceByKey(func)

        • func should take two values (the second item in each pair) and return one

        • used to combine the values that share the same key

        • returns an RDD, so it is a transformation rather than an action

    • Useful actions for pair RDDs

      • countByKey()
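
    Putting the pair RDD operations together on a small made-up dataset (sc as above); the ordering of keys in the outputs may vary:

      pairs = sc.parallelize([("a", 3), ("b", 1), ("a", 2), ("c", 5)])

      print(pairs.sortByKey().collect())                        # keys ascending, e.g. [('a', 3), ('a', 2), ('b', 1), ('c', 5)]
      print(pairs.sortBy(lambda p: p[1]).collect())             # by value: [('b', 1), ('a', 2), ('a', 3), ('c', 5)]

      grouped = pairs.groupByKey()                              # transformation, values collected per key
      print([(k, list(v)) for k, v in grouped.collect()])       # e.g. [('a', [3, 2]), ('b', [1]), ('c', [5])]

      print(pairs.reduceByKey(lambda a, b: a + b).collect())    # transformation, e.g. [('a', 5), ('b', 1), ('c', 5)]
      print(pairs.countByKey())                                 # action, dict-like counts: {'a': 2, 'b': 1, 'c': 1}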