Apache Spark using Scala
This course gives you hands-on experience building Apache Spark applications in the Scala programming language, using an entirely case-study-based approach. Apache Spark is a fast, general-purpose cluster computing system. It provides high-level APIs in Scala, Java, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
Note: This is not an introductory or theory-based course, and it does not compare Spark with other big data tools available in the market.
Learning Outcomes. By the end of this course,
- You will be able to comfortably set up a Spark development environment on your local system and start working on any given big data application. Later, if you want, you can productionize it on the AWS cloud using an AWS EMR cluster and AWS Lambda scripts.
- You’ll have enough hands-on experience with Spark to feel as though you have more than 2.6 years of experience in “Developing Data Pipelines using Big Data Technologies”.
- You will be able to identify the type of data you are given (structured, semi-structured or unstructured) and choose which Spark data abstraction to use for it (RDD, Dataframe or Dataset).
- Based on the complexity of the data pipeline you are working with, you’ll be able to assess which technique to use: Dataset, Spark SQL or RDD.
- Based on the nature of the data (confidential/PII or not), you’ll be able to decide whether to go with an on-premise or a cloud-based solution. You will also be able to estimate the computational resources required for a given data volume.
You should have at least some programming experience in any language. A basic level should suffice, i.e., variable declarations, control statements (if-else, looping), collections, etc.
PART – 1: Getting Started with Spark – Programming RDD using Databricks Notebook
Setting up your first Spark cluster on Databricks Community Edition, a free cloud platform. Introduction to Spark RDDs. Transformations and actions. Distributed key-value pairs (pair RDDs).
- Writing your first Spark program using a Databricks notebook – Word Count example
- Creating RDDs from text files, Parquet files, JSON files and Scala collections
- Transformations and Actions
- Why is Spark Good for Data Science? Iteration, Caching and Persistence
- Understanding Cluster Topology
- Pair RDDs. Transformations and Actions on Pair RDDs. Joins
- Optimization using Partitioning and Partitioners
- Wide vs Narrow Dependencies
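As a rough illustration of the Word Count example and the pair RDD topics listed above, here is a minimal sketch. It assumes Spark 3.x is on the classpath (as it is in spark-shell or a Databricks notebook); the object name WordCountSketch, the method countWords and the sample sentences are made up for this illustration and are not course material.

```scala
import org.apache.spark.sql.SparkSession

object WordCountSketch {
  // Counts words with the classic flatMap -> map -> reduceByKey pipeline.
  def countWords(spark: SparkSession, lines: Seq[String]): Map[String, Int] = {
    // Create an RDD from a Scala collection.
    val rdd = spark.sparkContext.parallelize(lines)

    val counts = rdd
      .flatMap(_.toLowerCase.split("\\s+")) // transformation (narrow dependency)
      .filter(_.nonEmpty)                   // transformation (narrow dependency)
      .map(word => (word, 1))               // builds a pair RDD of (word, 1)
      .reduceByKey(_ + _)                   // transformation (wide dependency, needs a shuffle)

    // collect() is an action: nothing runs until this point.
    counts.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for running on a laptop; a cluster sets its own master.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("word-count-sketch")
      .getOrCreate()
    try {
      countWords(spark, Seq("spark makes big data simple", "big data with spark"))
        .foreach(println)
    } finally {
      spark.stop()
    }
  }
}
```

The split between lazy transformations and the final action, and between the narrow steps and the shuffle in reduceByKey, is what the topics on transformations, actions and dependencies in this part refer to.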
PART – 2: Setting up your local environment for Spark, Programming Scala and Completing RDD
Scala basics – conditional statements, iteration, user-defined functions and higher-order functions. OOP concepts – class, object and trait. Scala collections.
- Introduction to Scala – variable declaration, control statements, loops, pattern matching, higher-order functions and handling null values in Scala
- Object Orientation – class, object and trait. Inheritance
- Companion class/object and case class/object
- Scala Collections
- Exception Handling
- Interpreting DAG on YARN UI and optimizing your Spark job
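To give a flavour of the Scala topics in this part, the sketch below touches traits, case classes, a companion object, pattern matching, a higher-order function, null handling with Option and exception handling with Try. It uses only the Scala standard library; every name in it (Shape, Temperature, applyTwice and so on) is invented for illustration.

```scala
import scala.util.{Failure, Success, Try}

object ScalaBasicsSketch {
  // A trait with two case classes implementing it.
  sealed trait Shape { def area: Double }
  case class Circle(radius: Double) extends Shape {
    def area: Double = math.Pi * radius * radius
  }
  case class Rectangle(width: Double, height: Double) extends Shape {
    def area: Double = width * height
  }

  // A class with a private constructor and a companion object acting as a factory.
  class Temperature private (val celsius: Double)
  object Temperature {
    def fromFahrenheit(f: Double): Temperature = new Temperature((f - 32) * 5 / 9)
  }

  // Pattern matching on case classes, including a guard.
  def describe(shape: Shape): String = shape match {
    case Circle(r)                 => s"circle of radius $r"
    case Rectangle(w, h) if w == h => s"square of side $w"
    case Rectangle(w, h)           => s"rectangle of $w by $h"
  }

  // A higher-order function: it takes another function as an argument.
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  // Handling a possibly-null value with Option instead of a null check.
  def safeLength(s: String): Int = Option(s).map(_.length).getOrElse(0)

  // Exception handling with Try: a failed parse becomes None instead of throwing.
  def parseInt(s: String): Option[Int] = Try(s.trim.toInt) match {
    case Success(n) => Some(n)
    case Failure(_) => None
  }
}
```

For example, ScalaBasicsSketch.applyTwice(_ + 3, 1) evaluates to 7, and ScalaBasicsSketch.parseInt("abc") evaluates to None rather than throwing a NumberFormatException.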
PART – 3: Structured Data: SQL, Dataframes and Datasets
With our newfound understanding of the cost of data movement in a Spark job, and some experience in optimizing jobs for data locality, we’ll focus on how we can achieve similar optimizations more easily. Can structured data help us? We’ll look at Spark SQL and its powerful optimizer, which uses structure to apply impressive optimizations. We’ll then move on to DataFrames and Datasets, which give us a way to mix RDDs with the powerful automatic optimizations behind Spark SQL.
- Structured vs Unstructured Data
- Converting RDDs to Dataframes by adding the underlying schema
- Reading structured data using the SparkSession.read family of functions, e.g. spark.read.csv(), spark.read.json() and spark.read.parquet()
- Applying transformations on dataframes using SQL-like operations, e.g. df.select(), df.groupBy($"col1").sum("col2"), etc.
- Spark SQL:
- Creating a temporary view on top of a dataframe and writing ANSI-standard SQL queries to process your data
- Working on multiple data sets and implementing complex business transformations using window/analytics functions
- Interoperability – Converting RDD to Dataset and vice versa, Dataframe to Dataset and vice versa
- Using both RDD-like functional transformations and Dataframe’s SQL-like operations with Dataset
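The sketch below shows how the three styles covered in this part can express similar work: DataFrame operations, a SQL query over a temporary view, and typed Dataset transformations, plus conversions between RDD, Dataframe and Dataset. It assumes Spark 3.x on the classpath; the Sale case class, the view name "sales" and all method names are hypothetical and chosen only for this example.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object StructuredApiSketch {
  // Hypothetical record type; the case class supplies the schema.
  case class Sale(region: String, amount: Double)

  // DataFrame style: untyped, SQL-like operations.
  def totalsWithDataFrame(spark: SparkSession, sales: Seq[Sale]): Map[String, Double] = {
    import spark.implicits._
    sales.toDF()
      .groupBy($"region")
      .sum("amount") // sum takes column names as strings
      .as[(String, Double)]
      .collect()
      .toMap
  }

  // Spark SQL style: register a temporary view, then query it.
  def totalsWithSql(spark: SparkSession, sales: Seq[Sale]): Map[String, Double] = {
    import spark.implicits._
    sales.toDS().createOrReplaceTempView("sales")
    spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
      .as[(String, Double)]
      .collect()
      .toMap
  }

  // Dataset style: RDD-like lambdas that are checked at compile time.
  def largeSaleRegions(spark: SparkSession, sales: Seq[Sale], minAmount: Double): Set[String] = {
    import spark.implicits._
    val ds: Dataset[Sale] = sales.toDS()
    ds.filter(_.amount >= minAmount).map(_.region).distinct().collect().toSet
  }

  // Interoperability: Seq -> DataFrame -> Dataset -> RDD -> Dataset.
  def roundTrip(spark: SparkSession, sales: Seq[Sale]): Seq[Sale] = {
    import spark.implicits._
    val df = sales.toDF()  // DataFrame (a Dataset of Row)
    val ds = df.as[Sale]   // DataFrame -> typed Dataset
    val rdd = ds.rdd       // Dataset -> RDD[Sale]
    rdd.toDS().collect().toSeq // RDD -> Dataset again
  }
}
```

With a local session, StructuredApiSketch.totalsWithDataFrame and StructuredApiSketch.totalsWithSql should return the same totals for the same input, which is one way to see that the DataFrame operations and the SQL query describe the same computation.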
PART – 4: Structured Streaming
Stream processing applications work with continuously updated data and react to changes in real time. DataFrames in Spark 2.x support infinite data, effectively unifying batch and streaming applications. In this part, you’ll focus on using the tabular DataFrame API to work with streaming, unbounded datasets using the same APIs that work with bounded batch data.
- Understanding the High Level Streaming API in Spark 2.x
- Triggers and Output modes
- Unified APIs for Batch and Streaming
- Building Advanced Streaming Pipelines Using Structured Streaming
- Stateful window operations
- Tumbling and Sliding windows
- Watermarks and late data
- Windowed joins
- Integrating Apache Kafka with Structured Streaming
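The windowing topics above can be pictured with a small plain-Scala model. This is not Spark code: in Structured Streaming the equivalent work is done by the window function in org.apache.spark.sql.functions and by withWatermark on a Dataset. The sketch only shows, under simplified assumptions, how an event timestamp maps to tumbling and sliding windows and when an event counts as too late. All names here are hypothetical, and times are plain milliseconds.

```scala
object WindowSketch {
  // A half-open time window [start, end), in milliseconds.
  final case class Window(start: Long, end: Long)

  // Tumbling windows are fixed-size and non-overlapping,
  // so every event falls into exactly one window.
  def tumbling(eventTime: Long, size: Long): Window = {
    val start = eventTime - Math.floorMod(eventTime, size)
    Window(start, start + size)
  }

  // Sliding windows have a size and a slide interval. When slide < size
  // the windows overlap, so one event belongs to several windows.
  def sliding(eventTime: Long, size: Long, slide: Long): List[Window] = {
    val lastStart = eventTime - Math.floorMod(eventTime, slide)
    Iterator.iterate(lastStart)(_ - slide)
      .takeWhile(start => start + size > eventTime) // window must still cover the event
      .map(start => Window(start, start + size))
      .toList
      .reverse // earliest window first
  }

  // Simplified watermark rule: the watermark trails the largest event time
  // seen so far by an allowed delay. An event is treated as too late once
  // its (tumbling) window has ended at or before the watermark.
  def isTooLate(eventTime: Long, size: Long, maxEventTimeSeen: Long, allowedDelay: Long): Boolean = {
    val watermark = maxEventTimeSeen - allowedDelay
    tumbling(eventTime, size).end <= watermark
  }
}
```

With a window size of 10 and a slide of 5, an event at time 12 lands in one tumbling window, [10, 20), but in two sliding windows, [5, 15) and [10, 20).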
PART – 5: Real time case study
To make things fun and interesting, we will introduce multiple datasets coming from disparate data sources (SFTP, MS SQL Server, Amazon S3 and Google Analytics) and create an industry-standard ETL pipeline to populate a data mart implemented on Amazon Redshift.
- Set up an Amazon EMR (Elastic MapReduce) cluster and start a Zeppelin notebook
- Set up Amazon Redshift database and its client
- Create Spark dataframe out of files from remote SFTP
- Read data from Amazon S3 in Spark
- Create Spark dataframe out of data from MS SQL Server tables
- Read Google Analytics data and process it in Spark
- Schedule the Spark job in AWS Data Pipeline