******************************** PySpark Programming for Big Data ******************************** Spark support multiple programming languages including Java, Scala, Python, and R. In this course, we will use Python as the programming language to work with Spark. The PySpark components are built on top of the Spark core and provide Python APIs to interact with Spark. We will only briefly introduce how to program in PySpark to manipulate datasets and achieve certain big data processing tasks. Why PySpark =========== + Other languages * Scala - native language for Spark - best performance - concise and powerful language - the language itself is not worth learning as a new language compared to Python - complicated syntax * Java: too verbose * R: not a general-purpose language + Advantages * Easier to install using pip than Scala * You may have learned Python already * Python worth the time to learn PySpark Basics ============== + Concepts .. glossary:: SparkContext Entry point to most of the PySpark functionalities. RDD (Resilient Distributed Dataset) A fundamental building block of PySpark which is fault-tolerant, immutable distributed collection of objects DataFrame A distributed collection of data organized into named columns. DataSet Similar to DataFrame but with strict types. Optional in this course. + RDD * from Spark Core module * lower-level API * for unstructured data and semi-structured data + DataFrame and DataSet * from Spark SQL module * higher-level API * for structured data Related Python Concepts ======================= As the programming language used in PySpark is Python, we will need to know some Python concepts to work with PySpark. Please refer to the Python programming resources online if you are not familiar with Python. We will learn some useful python concepts to be used in data manipulation using PySpark APIs. For details refer to any Python programming resources online. Suggested learning resources ---------------------------- + `G4G Python Tutorial `_ + `Official Python Reference `_ + Ask Chatbot on specific questions: ChatGPT, Google Gemini, Claude AI, etc. Python Basics ------------- + Python is a high-level, interpreted, interactive, and object-oriented scripting language. + Data types * int * float * str * tuple - immutable list of objects of any type * list - mutable list of objects of any type * dict - key-value pair of objects of any type + Control structures * if-elif-else * for loop * while loop + Functions * def keyword * return keyword * positional arguments * keyword arguments * default arguments * lambda functions Lambda Functions ---------------- + A simple and disposable way to write a python function. + Many RDD transformations require a function as its parameter + Syntax * ``lambda : `` * start with ``lambda`` keyword * followed by a list of parameters * after the colon, provide the return value calculated based on the parameters + Examples * ``lambda x: x*x``, square the input * ``lambda line: line.split()``, split a text to a list of text using spaces as delimiters * ``lambda paragraph: paragraph.split('\n')``, split a text to a list of lines using new line character as delimiters * ``lambda word: (word, 1)``, convert a work to a key-value pair of word and 1 * ``lambda word: len(word)``, convert a string to its length * ``lambda pair: pair[0]``, return the first element of the pair * ``lambda pair: pair[1]``, return the second element of the pair * ``lambda x, y: x+y``, sum up two value * ``lambda p1, p2: p1[1] + p2[1]``, sum up the second values in two pairs