Hands-on Apache Spark
Learn Apache Spark from basics to advanced topics in this hands-on course
3 days
20
Instructor-led, hands-on exercises
Hebrew
Bring your own (installation instructions will be sent prior to course start)
Included
Apache Spark is a leading platform for deep BigData analysis. Companies across industries - Finance, AdTech, Cyber, Commerce and Internet - use Spark in different modes for ETL, BI, machine learning and stream processing. This developers' course gives you hands-on experience with Spark's basic and advanced modules, and focuses on Spark DataFrames - Spark's data-optimized API. Teaching and exercises are done in a cloud environment (AWS EMR, S3 and Zeppelin).
Objectives
Participants will gain end-to-end familiarity with Apache Spark, and know how to:
- Install and deploy Apache Spark.
- Design Spark computations, using transformations and actions and the DataFrame API.
- Use the Spark eco-system including SparkSQL, Spark Streaming, Spark.ML, and more.
- Use best practices and debugging and monitoring tools to produce production-ready deployments.
- Make use of Spark in real-life scenarios, and trade off accuracy and performance where needed.
Prerequisites
At least 3 years of programming experience, and experience with either Python, Java or Scala.
Syllabus
Module 1 - Preliminaries and Introduction to BigData
- Short Scala introduction for Java and Python programmers.
- Functional Programming.
- Getting to know the BigData ecosystem.
- Apache Hadoop (HDFS, MapReduce) and Apache Spark.
- Principles of MapReduce.
- The foundations of BigData - data locality, partitioning, shuffling.
- BigData tools and applications.
- Hands-on AWS: EC2, connecting via SSH, S3, AWS CLI, EMR and HDFS.
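The MapReduce principle covered above can be sketched in plain Python, with no cluster involved: a map phase emits key-value pairs, a shuffle phase groups them by key, and a reduce phase aggregates each group. This is a conceptual sketch of the model, not Hadoop's actual API:

```python
from collections import defaultdict
from functools import reduce

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every line.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key. On a real cluster this step
    # moves data between nodes - the expensive part that data locality
    # and partitioning try to minimize.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values of each key independently.
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}

lines = ["big data big compute", "data locality matters"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
```

Because each key's reduce is independent, the reduce phase parallelizes naturally across partitions - the core idea behind both Hadoop MapReduce and Spark's wide transformations.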
Module 2 - Spark RDD
- Spark low level API - RDD.
- SparkSession and SparkContext.
- Transformations and actions.
- Functional programming and distributed execution.
- Working with files.
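The key RDD idea in this module - transformations are lazy, actions trigger execution - can be illustrated with a toy single-machine class. `MiniRDD` below is a hypothetical stand-in written for this sketch; it is not Spark's RDD, but it shows why nothing computes until an action such as `collect()` is called:

```python
class MiniRDD:
    """Toy single-machine stand-in for Spark's RDD (hypothetical class,
    for illustrating laziness only - not part of Spark)."""

    def __init__(self, data):
        self._compute = lambda: iter(data)

    def map(self, f):
        # Transformation: returns a new MiniRDD; nothing runs yet.
        prev = self._compute
        child = MiniRDD([])
        child._compute = lambda: (f(x) for x in prev())
        return child

    def filter(self, pred):
        # Transformation: also lazy - just extends the lineage.
        prev = self._compute
        child = MiniRDD([])
        child._compute = lambda: (x for x in prev() if pred(x))
        return child

    def collect(self):
        # Action: only now does the whole chain of transformations run.
        return list(self._compute())

result = (MiniRDD(range(10))
          .map(lambda x: x * x)
          .filter(lambda x: x % 2 == 0)
          .collect())
# result holds the even squares of 0..9: [0, 4, 16, 36, 64]
```

In real Spark the same pattern holds, except the lineage is distributed across executors and can be recomputed per partition on failure.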
Module 3 - Spark DataFrames
- Distributed computation with DataFrames.
- Reading files with DataFrames.
- DataFrames API principles.
- Data partitioning - hash-mod and full-order (range) partitioning.
- Grouping, sorting, joining in distributed system.
- Query plan and explain.
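The two partitioning schemes named above can be contrasted in a few lines of plain Python (a conceptual sketch, not Spark's partitioner classes): hash-mod routes equal keys to the same partition, which is what grouping and joining need, while full-order (range) partitioning routes keys by sorted boundaries, which is what distributed sorting needs:

```python
def hash_mod_partition(keys, num_partitions):
    # Hash-mod: partition = hash(key) % N.
    # All equal keys land in the same partition (good for groupBy/join),
    # but global order is destroyed.
    parts = [[] for _ in range(num_partitions)]
    for k in keys:
        parts[hash(k) % num_partitions].append(k)
    return parts

def range_partition(keys, boundaries):
    # Full-order (range): keys are routed by sorted boundary values, so
    # concatenating the sorted partitions yields a globally sorted result.
    parts = [[] for _ in range(len(boundaries) + 1)]
    for k in keys:
        index = sum(1 for b in boundaries if k > b)
        parts[index].append(k)
    return parts

keys = [17, 3, 42, 8, 25, 1]
hashed = hash_mod_partition(keys, 3)
ranged = range_partition(keys, boundaries=[10, 30])
```

Spark picks between these automatically: `groupBy` and `join` hash-partition, while `orderBy` samples the data to choose range boundaries.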
Module 4 - Running Spark on a cluster
- Spark cluster components.
- Scheduling - jobs, stages, tasks.
- Writing Spark applications in Scala, Java and Python.
- Using spark-submit - local and cluster mode.
- Monitoring execution via Spark UI.
- Logging - writing and collecting logs.
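Typical `spark-submit` invocations for the local and cluster modes covered in this module look like the following (application file names, class name and sizing values are placeholders for your own application):

```shell
# Run a Python application locally, using all available cores:
spark-submit --master "local[*]" app.py

# Submit a Scala/Java application to a YARN cluster in cluster mode
# (the driver runs on the cluster, not on the submitting machine):
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  --num-executors 4 \
  --executor-memory 4g \
  my-app.jar
```

In client mode (the default) the driver runs where you invoked `spark-submit`, which is convenient for interactive debugging; cluster mode is the usual choice for production jobs.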
Module 5 - Spark SQL - Querying with SQL on Spark
- SparkSQL components: HiveQL, MetaStore, Storage.
- File formats - Parquet, csv, json.
- Analytical (window) functions.
- IMDB example SQL walk-through.
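Analytical (window) functions compute a value per row over a window of related rows, rather than collapsing the rows like `GROUP BY`. The plain-Python sketch below mimics a running total per user - the spirit of `SUM(amount) OVER (PARTITION BY user ORDER BY day)` - and is illustrative only, not the SparkSQL engine:

```python
from itertools import groupby
from operator import itemgetter

rows = [
    {"user": "a", "day": 1, "amount": 10},
    {"user": "a", "day": 2, "amount": 5},
    {"user": "b", "day": 1, "amount": 7},
    {"user": "a", "day": 3, "amount": 1},
]

# Equivalent in spirit to:
#   SELECT user, day, amount,
#          SUM(amount) OVER (PARTITION BY user ORDER BY day) AS running_total
rows.sort(key=itemgetter("user", "day"))   # PARTITION BY user ORDER BY day
result = []
for user, group in groupby(rows, key=itemgetter("user")):
    running_total = 0
    for row in group:
        running_total += row["amount"]
        result.append({**row, "running_total": running_total})
```

Note that every input row survives in the output, each annotated with its window value - the defining difference between window functions and plain aggregation.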
Module 6 - DataFrames API dive-in
- IMDB example - hands on using DataFrame syntax.
- Creating Dynamic schema - hands on exercise with legacy data.
- Defining User Defined Functions (UDF).
Module 7 - Spark Structured Streaming
- Streaming principles - event time, watermark, unbounded table.
- DataFrame API for streaming.
- Hands-on - reading web logs from Kafka and detecting bots.
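The watermark idea from this module - the engine tracks the maximum event time seen and discards events older than that maximum minus an allowed lateness - can be sketched in a few lines of plain Python. This is a toy model of the concept, not the Structured Streaming engine:

```python
def process_stream(events, lateness):
    """Toy watermark logic: the watermark trails the maximum event time
    seen so far by `lateness`. Events with an event time at or below the
    watermark are considered too late and are dropped."""
    max_event_time = float("-inf")
    accepted, dropped = [], []
    for event_time, payload in events:
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - lateness
        if event_time <= watermark:
            dropped.append(payload)
        else:
            accepted.append(payload)
    return accepted, dropped

# Event times in seconds; allow events to arrive up to 15 seconds late.
events = [(100, "a"), (105, "b"), (92, "late-but-ok"), (90, "too-late")]
accepted, dropped = process_stream(events, lateness=15)
```

The trade-off is the course's recurring theme: a larger lateness keeps more late data (accuracy) but forces the engine to hold state longer (memory and latency).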
Module 8 - BigData sketching and approximation techniques
- Smart sampling.
- Bloom filter, linear counting, min-count.
- Spark approximation functions.
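A minimal Bloom filter, one of the sketches listed above, fits in a few lines: a bit array plus k hash functions, trading a small false-positive rate for constant memory. This standalone sketch (sizes chosen arbitrarily for the example) shows the structure:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a bit array plus k hash functions.
    Membership tests may yield false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, item):
        # Derive k independent positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for word in ["spark", "hadoop", "kafka"]:
    bf.add(word)
```

The same "bounded memory, bounded error" trade-off underlies linear counting, min-count and Spark's built-in approximation functions such as `approx_count_distinct`.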
Module 9 - Machine Learning with Apache Spark and intro to Spark.ml
- Machine learning terms - train/test split, overfitting, regularisation.
- The Spark.ML data-frames API.
- Recommendation system algorithm - ALS.
- Hands-on example - Movie recommendation.
- Hyper-parameter tuning.
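The train/test-split and overfitting terms from this module can be seen in a toy example (plain Python, not Spark.ML; the dataset and models are invented for illustration): a model that memorizes the training set scores perfectly on training data yet fails on held-out data, while a simpler model generalizes.

```python
import statistics

# Toy noise-free dataset y = 2*x, split into train and held-out test sets.
data = [(x, 2 * x) for x in range(10)]
train, test = data[:7], data[7:]

# "Memorizing" model: a lookup table over training pairs - extreme overfitting.
lookup = dict(train)
def memorizer(x):
    return lookup.get(x, 0.0)  # knows nothing outside the training set

# Simple model: fit the slope a in y = a*x by least squares on the train set.
a = sum(x * y for x, y in train) / sum(x * x for x, _ in train)
def linear(x):
    return a * x

def mse(model, points):
    # Mean squared error of a model over a set of (x, y) points.
    return statistics.fmean((model(x) - y) ** 2 for x, y in points)

train_err_memo, test_err_memo = mse(memorizer, train), mse(memorizer, test)
train_err_lin,  test_err_lin  = mse(linear, train),    mse(linear, test)
```

The memorizer's train error is zero while its test error is large - exactly the gap a train/test split exposes, and the gap regularisation and hyper-parameter tuning aim to close.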