Apache Spark is a leading platform for deep BigData analysis. Companies across industries, including Finance, AdTech, Cyber, Commerce and Internet, use Spark in different modes for ETL, BI, machine learning and stream processing. This developer course gives you hands-on experience with Spark's basic and advanced modules, with a focus on Spark DataFrames, the data-optimized API of Spark. Teaching and exercises take place in a cloud environment (AWS EMR, S3 and Zeppelin).
Participants will gain end-to-end familiarity with Apache Spark, and know how to:
- Install and deploy Apache Spark.
- Design Spark computations, using transformations and actions and the DataFrame API.
- Use the Spark ecosystem including SparkSQL, Spark Streaming, Spark.ML, and more.
- Use best practices and debugging and monitoring tools to produce production-ready deployments.
- Make use of Spark in real-life scenarios, and trade off accuracy and performance where needed.
Participants need at least 3 years of programming experience, including experience with Python, Java or Scala.
- Short Scala introduction for Java and Python programmers.
- Functional Programming.
- Getting to know the BigData ecosystem.
- Apache Hadoop (HDFS, MapReduce) and Apache Spark.
- Principles of MapReduce.
- The foundation for BigData - data locality, partitioning, shuffling.
- BigData tools and applications.
- Hands-on AWS: EC2, connecting via SSH, S3, AWS CLI, EMR and HDFS.
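To make the MapReduce, partitioning and shuffling topics above concrete, here is a minimal single-process sketch in plain Python. It is not Spark or Hadoop code: the function names (`partition_for`, `map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and the use of CRC32 as the hash function is an assumption made only to get a stable hash across runs.

```python
from collections import defaultdict
from zlib import crc32


def partition_for(key: str, num_partitions: int) -> int:
    """Hash-mod partitioning: stable hash of the key, modulo the partition count."""
    return crc32(key.encode("utf-8")) % num_partitions


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs, num_partitions):
    """Shuffle: route each pair to the partition that owns its key."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition_for(key, num_partitions)][key].append(value)
    return partitions


def reduce_phase(partitions):
    """Reduce: each partition sums the values of its own keys independently."""
    return [
        {key: sum(values) for key, values in part.items()}
        for part in partitions
    ]


def word_count(lines, num_partitions=3):
    """Run map, shuffle and reduce, then merge the per-partition results."""
    merged = {}
    for part in reduce_phase(shuffle(map_phase(lines), num_partitions)):
        merged.update(part)  # a key lives in exactly one partition
    return merged


if __name__ == "__main__":
    print(word_count(["spark makes big data simple", "big data needs spark"]))
```

Because every occurrence of a key lands in the same partition, each partition can be reduced without talking to the others. In a real cluster the shuffle step is the one that moves data over the network, which is why it dominates the cost of grouping and joining.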
- Spark low level API - RDD.
- SparkSession and SparkContext.
- Transformations and actions.
- Functional programming and distributed execution.
- Working with files.
- Distributed computation with DataFrames.
- Reading files with DataFrames.
- DataFrames API principles.
- Data Partitioning - hash-mod, full-order.
- Grouping, sorting, joining in a distributed system.
- Query plan and explain.
- Spark cluster components.
- Scheduling - jobs, stages, tasks.
- Writing Spark applications in Scala, Java and Python.
- Using spark-submit - local and cluster mode.
- Monitoring execution via Spark UI.
- Logging - writing and collecting logs.
- SparkSQL components: HiveQL, MetaStore, Storage.
- File formats - Parquet, csv, json.
- Analytical functions.
- IMDB example SQL walk-through.
- IMDB example - hands on using DataFrame syntax.
- Creating Dynamic schema - hands on exercise with legacy data.
- Defining User Defined Functions (UDF).
- Streaming principles - event time, watermark, unbounded table.
- DataFrame API for streaming.
- Hands-on - reading web logs from Kafka and detecting bots.
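The event-time and watermark ideas listed above can be illustrated with a small plain-Python simulation. This is a deliberately simplified sketch, not Spark's implementation: it updates the watermark after every event rather than per micro-batch, and the function name `windowed_counts` and its parameters are hypothetical.

```python
from collections import defaultdict


def windowed_counts(events, window_seconds=60, allowed_lateness=30):
    """Count events per (tumbling window, key), dropping events behind the watermark.

    events: iterable of (event_time_seconds, key) pairs in ARRIVAL order.
    The watermark is the largest event time seen so far minus allowed_lateness.
    """
    counts = defaultdict(int)
    dropped = []
    max_event_time = float("-inf")
    for event_time, key in events:
        watermark = max_event_time - allowed_lateness
        if event_time < watermark:
            # Too late: the state for this time range is considered closed.
            dropped.append((event_time, key))
            continue
        # Assign the event to the tumbling window that contains its event time.
        window_start = (event_time // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
        max_event_time = max(max_event_time, event_time)
    return dict(counts), dropped


if __name__ == "__main__":
    arrivals = [(100, "a"), (130, "b"), (95, "a"), (105, "a"), (200, "c"), (150, "b")]
    counts, dropped = windowed_counts(arrivals)
    print(counts)
    print(dropped)
```

In this run the events at times 95 and 150 arrive after the watermark has passed them and are dropped, while the out-of-order event at time 105 is still accepted because it falls within the allowed lateness.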
- Smart sampling.
- Bloom filter, linear counting, count-min.
- Spark approximation functions.
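As an illustration of how the approximation structures above trade accuracy for memory, here is a minimal plain-Python sketch of a Bloom filter and of linear counting. It is not the code Spark uses (Spark's `approx_count_distinct`, for example, is based on HyperLogLog++); the class and function names, the bit-array sizes and the choice of SHA-256 as the hash are all assumptions made for the example.

```python
import hashlib
import math


def _hash(value: str, seed: int = 0) -> int:
    """Stable 64-bit hash derived from SHA-256 (an arbitrary choice for this sketch)."""
    digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class BloomFilter:
    """Set membership with no false negatives and a tunable false-positive rate."""

    def __init__(self, num_bits=2048, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        return [_hash(str(item), seed) % self.num_bits for seed in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))


def linear_count(items, num_bits=4096):
    """Linear counting: estimate the distinct count from the fraction of unset bits."""
    bits = [False] * num_bits
    for item in items:
        bits[_hash(str(item)) % num_bits] = True
    zeros = bits.count(False)
    if zeros == 0:
        raise ValueError("bitmap saturated; use more bits")
    return -num_bits * math.log(zeros / num_bits)


if __name__ == "__main__":
    seen = BloomFilter()
    for i in range(100):
        seen.add(f"user-{i}")
    print(seen.might_contain("user-7"))
    print(round(linear_count(f"user-{i % 1000}" for i in range(5000))))
```

Both structures use a fixed amount of memory regardless of how many items pass through them. The Bloom filter can report an item as present when it is not, but never the reverse, and the linear-counting estimate degrades as the bitmap fills up.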
- Machine learning terms - train, test, overfit, regularisation.
- The Spark.ML DataFrame API.
- Recommendation system algorithm - ALS.
- Hands-on example - Movie recommendation.