Apache Spark using Scala
This course gives you hands-on experience building data pipelines using Apache Spark with Scala and AWS, in a completely case-study-based approach. Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Scala, Java, Python and R, and an optimised engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
Note: This is not an introductory or theory-based course, and we don’t spend much time comparing Spark with other big data tools such as Apache Hadoop and Apache Flink. [Hive, Pig & Sqoop are optional for participants to go through!]
Learning Outcomes. By the end of this course:
- You will be able to comfortably set up a Spark development environment on your local computer and start working on any given big data pipeline on your own. Later, if you want, you can productionize the same pipeline on the AWS cloud using an AWS EMR cluster and AWS Lambda scripts.
- You’ll gain enough hands-on experience with Spark that it will feel as if you have 2.5 – 3 years of experience in “Developing Data Pipelines using Big Data Technologies”.
- You will be able to identify the type of data given to you (structured, semi-structured or unstructured) and choose which Spark data abstraction to use: RDD, Dataframe or Dataset.
- Based on the complexity of the data pipeline you are working with, you’ll be able to assess which technique to use: Dataset, Spark SQL or RDD.
- Based on the nature of the data (confidential/PII or not), you’ll be able to decide whether to go with an on-premise or a cloud-based solution. You will also be able to estimate the computational resources required for a given data volume.
You should have at least some programming experience in any language; a basic level should suffice, i.e., variable declaration, control statements (if-else), looping, collections, etc. And, “lots of desire to learn new exciting things!”
PART – 1: Getting Started with Spark – Programming RDD using Databricks Notebook
Setting up your first Spark cluster on Databricks Community Edition, a free cloud platform. Introduction to Spark RDDs. Transformations and actions. Distributed key-value pairs (pair RDDs).
- Writing your first Spark program using a Databricks notebook – the word count example
- Creating RDDs from different file formats (text file, object file, sequence file, New API Hadoop file, etc.) as well as from stand-alone Scala collections
- Transformations and Actions
- Why is Spark Good for Data Science? Iteration, Caching and Persistence
- Understanding Cluster Topology
- Pair RDDs. Transformations and Actions on Pair RDDs – groupByKey, cogroup, reduceByKey, aggregateByKey, foldByKey and join
- Optimization using Partitioning and Partitioners
- Narrow transformations vs wide transformations
- Spark application execution model, application_id >> jobs >> stages >> tasks
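The Part 1 topics above can be sketched in a single minimal program: creating an RDD from a stand-alone Scala collection, chaining transformations into a pair RDD, and triggering execution with an action. The object and app names are illustrative; on a Databricks notebook the `spark` session is predefined, so you would skip the builder and `master` lines.

```scala
import org.apache.spark.sql.SparkSession

// Minimal word-count sketch: RDD creation, transformations
// (flatMap, map, reduceByKey) and an action (collect).
object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WordCount")
      .master("local[*]")        // local mode; not needed on a real cluster
      .getOrCreate()
    val sc = spark.sparkContext

    // Stand-alone Scala collection turned into an RDD
    val lines = sc.parallelize(Seq(
      "spark makes big data simple",
      "spark runs on scala"))

    // Pair RDD of (word, 1), reduced by key.
    // reduceByKey is a wide (shuffle) transformation; collect is an action.
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.collect().foreach(println)
    spark.stop()
  }
}
```

Until `collect()` runs, nothing is computed: the transformations only build up the lineage that becomes jobs, stages and tasks in the execution model described above.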
PART – 2: Setting up your local environment for Spark, Programming Scala and Completing RDD
Scala basics – conditional statements, iteration, user-defined functions and higher-order functions. OOP concepts – class, object and trait. Scala collections.
- Introduction to Scala – variable declaration, control statements, loops, pattern matching, higher order functions, function currying, implicit variable/function/class and handling null values using Option/Some/None
- Object Orientation – class, object and trait.
- Companion class/object and case class
- Scala collections and the rich set of RDD-like higher-order functions on them
- Exception Handling
- Interpreting the DAG on the YARN UI and optimizing your Spark job
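A short sketch of several of the Scala features listed above in one place: a case class, pattern matching, `Option` for null-safety, higher-order functions on collections, and exception handling with `Try`. The `Employee` type and sample data are invented for illustration.

```scala
import scala.util.{Try, Success, Failure}

// Case class: immutable data with pattern-matching support built in
case class Employee(name: String, dept: String, bonus: Option[Double])

object ScalaBasics {
  def main(args: Array[String]): Unit = {
    val employees = Seq(
      Employee("Asha", "Data", Some(1200.0)),
      Employee("Ravi", "Data", None),
      Employee("Meena", "ML", Some(800.0)))

    // Higher-order functions on a collection (same shape as the RDD API)
    val totalBonus = employees.flatMap(_.bonus).sum   // flattens the Options

    // Pattern matching, including on Option (Some / None)
    employees.foreach {
      case Employee(n, _, Some(b)) => println(s"$n gets $b")
      case Employee(n, _, None)    => println(s"$n gets no bonus")
    }

    // Exception handling with Try instead of try/catch blocks
    Try("42x".toInt) match {
      case Success(v) => println(s"parsed $v")
      case Failure(e) => println(s"could not parse: ${e.getMessage}")
    }

    println(s"total bonus: $totalBonus")
  }
}
```

These collection methods (`map`, `flatMap`, `filter`, `reduce`, …) carry over almost unchanged to RDDs, which is why Part 2 treats Scala collections as a rehearsal for Spark itself.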
PART – 3: Structured Data: Dataframes, Datasets and Spark SQL
With our newfound understanding of the cost of data movement in a Spark job, and some experience in optimizing jobs for data locality, we’ll focus on how we can achieve similar optimizations more easily. Can structured data help us? We’ll look at Spark SQL and its powerful optimizer, which uses structure to apply impressive optimizations. We’ll move on to cover DataFrames and Datasets, which give us a way to mix RDDs with the powerful automatic optimizations behind Spark SQL.
- Structured vs Unstructured Data
- Converting RDDs to Dataframes by adding the underlying schema
- Reading structured data using the SparkSession.read family of functions (csv, json, parquet, jdbc, etc.) to load different file formats and RDBMS tables/queries, creating Dataframes out of them
- Reading data from third-party systems such as SFTP servers, S3, an MPP database (AWS Redshift) and a NoSQL database (MongoDB)
- Applying transformations on Dataframes using the DSL (Domain Specific Language), e.g. df.select(), df.groupBy($"col1").agg("col2" -> "sum", ...), etc.
- Applying window/analytics functions like lead(), lag(), rank(), dense_rank() to perform complex data analysis
- Spark SQL:
- Creating a temporary view on top of a dataframe and writing ANSI-standard SQL queries to process your data
- Applying the same transformations and window/analytics functions through SQL queries
- Datasets – Typed Dataframes
- Interoperability – Converting RDD to Dataset and vice versa, Dataframe to Dataset and vice versa
- Using both RDD like functional transformations and Dataframe’s DSL & SQL like operations on Datasets
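The three structured APIs above can be shown over the same data: the DataFrame DSL, Spark SQL on a temporary view, a window function, and a typed Dataset. The `Sale` case class and column names are illustrative only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

case class Sale(region: String, amount: Double)

object StructuredDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StructuredDemo").master("local[*]").getOrCreate()
    import spark.implicits._   // enables toDF, as[T] and the $ syntax

    val df = Seq(
      Sale("north", 100.0), Sale("north", 250.0), Sale("south", 80.0)).toDF()

    // DataFrame DSL: groupBy + agg
    df.groupBy($"region").agg(sum($"amount").as("total")).show()

    // Spark SQL over a temporary view — same logical plan, same optimizer
    df.createOrReplaceTempView("sales")
    spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()

    // Window/analytics function: rank sales within each region
    val byRegion = Window.partitionBy($"region").orderBy($"amount".desc)
    df.withColumn("rnk", rank().over(byRegion)).show()

    // Dataset: a typed Dataframe, interoperable with RDD-style lambdas
    val ds = df.as[Sale]
    ds.filter(_.amount > 90.0).show()   // compile-time-checked field access

    spark.stop()
  }
}
```

Whichever surface you use, the Catalyst optimizer sees the same logical plan, which is the point of Part 3: structure lets Spark optimize for you.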
PART – 4: Structured Streaming
Stream processing applications work with continuously updated data and react to changes in real time. Dataframes in Spark 2.x support infinite data, thus effectively unifying batch and streaming applications. In this part, you’ll focus on using the tabular Dataframe API to work with streaming, unbounded datasets using the same APIs that work with bounded batch data.
- Understanding the High Level Streaming API in Spark 2.x
- Triggers and Output modes
- Unified APIs for Batch and Streaming
- Building Advanced Streaming Pipelines Using Structured Streaming
- Stateful window operations
- Tumbling and Sliding windows
- Watermarks and late data
- Windowed joins
- Integrating Apache Kafka with Structured Streaming
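The streaming topics above fit together in one short sketch: a Kafka source, a tumbling window with a watermark for late data, and an output mode. The broker address and topic name (`events`) are assumptions, and the `spark-sql-kafka` connector package must be on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object StreamingDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("StreamingDemo").getOrCreate()
    import spark.implicits._

    // Kafka source — broker and topic are placeholders
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load()

    // 10-minute tumbling windows; events arriving more than 15 minutes
    // late are dropped by the watermark. Adding a slide duration, e.g.
    // window($"timestamp", "10 minutes", "5 minutes"), makes it sliding.
    val counts = raw
      .selectExpr("CAST(value AS STRING) AS word", "timestamp")
      .withWatermark("timestamp", "15 minutes")
      .groupBy(window($"timestamp", "10 minutes"), $"word")
      .count()

    // "update" output mode emits only windows changed in each trigger
    val query = counts.writeStream
      .outputMode("update")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```

The aggregation code is the same DSL used for batch Dataframes in Part 3 — the unified API the bullet points above refer to.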
PART – 5: Real time case study
To make things fun and interesting, we will introduce multiple datasets coming from disparate data sources – SFTP, MS SQL Server, Amazon S3 and Google Analytics – and create an industry-standard ETL pipeline to populate a data mart implemented on Amazon Redshift.
- Set up an Amazon EMR (Elastic MapReduce) cluster and start a Zeppelin notebook
- Set up Amazon Redshift database and its client
- Create Spark dataframe out of files from remote SFTP
- Read data from Amazon S3 in Spark
- Create Spark dataframe out of data from MS SQL Server tables
- Read Google Analytics data and process it in Spark
- Schedule Spark job in Amazon Data Pipeline
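One step of this pipeline — reading raw files from S3 and loading the result into Redshift over JDBC — can be sketched as below. The bucket, table name, endpoint and credentials are all placeholders; on EMR, S3 access typically comes from the instance role, and the Redshift JDBC driver jar must be on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object S3ToRedshift {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("S3ToRedshift").getOrCreate()

    // Read CSV files landed in S3 (schema inference kept for brevity;
    // a production job would declare an explicit schema)
    val orders = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("s3://my-bucket/raw/orders/")   // placeholder bucket/prefix

    // Example cleanup before loading the mart
    val clean = orders.dropDuplicates("order_id").na.fill(0)

    // Write to a Redshift table via JDBC — connection details are dummies
    clean.write
      .format("jdbc")
      .option("url", "jdbc:redshift://example.redshift.amazonaws.com:5439/dev")
      .option("dbtable", "mart.orders")
      .option("user", "etl_user")
      .option("password", "****")
      .mode("append")
      .save()

    spark.stop()
  }
}
```

In the case study this step is one node in a larger flow; scheduling it from AWS Data Pipeline (the last bullet above) just means invoking the same job as an EMR activity on a schedule.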