Join us for Hands-on Apache Spark
Learn Apache Spark from basics to advanced topics in this hands-on course. Leave your details below and our representatives will be in touch soon.
Next course: January 16, 19, and 23, 2020
We will store the details you submit so our representatives can reach out to you to complete the registration process, and we will also use them to notify you of similar courses in the future via our newsletter.

Hands-on Apache Spark

Learn Apache Spark from basics to advanced topics in this hands-on course

Apache Spark is the leading platform for large-scale BigData analysis. Companies across industries - finance, AdTech, cyber security, commerce, and internet - use Spark for ETL, BI, machine learning, and stream processing. This developer course gives you hands-on experience with Spark's basic and advanced modules, with a focus on Spark DataFrames - Spark's optimized API for structured data. Teaching and exercises take place in a cloud environment (AWS EMR, S3, and Zeppelin).
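
To give a taste of the style of code you will practice in class, here is a minimal DataFrame sketch in Scala; the input file, its columns, and the aggregation are illustrative examples, not course material:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object DataFrameTaste {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("dataframe-taste")
          .master("local[*]")
          .getOrCreate()

        // Illustrative input: a local CSV with a header row and columns
        // "user" and "action" (in the course, data typically lives on S3)
        val events = spark.read
          .option("header", "true")
          .csv("events.csv")

        // Transformations (groupBy, agg, orderBy) only build a query plan;
        // show() is the action that triggers distributed execution
        events
          .groupBy("action")
          .agg(count("*").as("n"))
          .orderBy(desc("n"))
          .show()

        spark.stop()
      }
    }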

Objectives

Participants will gain end-to-end familiarity with Apache Spark and learn how to:

  • Install and deploy Apache Spark.
  • Design Spark computations using transformations, actions, and the DataFrame API.
  • Use the Spark ecosystem, including SparkSQL, Spark Streaming, Spark.ML, and more.
  • Apply best practices and use debugging and monitoring tools to produce production-ready deployments.
  • Make use of Spark in real-life scenarios, trading off accuracy and performance where needed.

Prerequisites

At least 3 years of programming experience, and experience with Python, Java, or Scala.

Syllabus
  • Short Scala introduction for Java and Python programmers.
  • Functional Programming.
  • Getting to know the BigData ecosystem.
  • Apache Hadoop (HDFS, MapReduce) and Apache Spark.
  • Principles of MapReduce.
  • The foundation for BigData - data locality, partitioning, shuffling.
  • BigData tools and applications.
  • Hands-on AWS: EC2, connecting via SSH, S3, AWS CLI, EMR and HDFS.
  • Spark low level API - RDD.
  • SparkSession and SparkContext.
  • Transformations and actions.
  • Functional programming and distributed execution.
  • Working with files.
  • Distributed computation with DataFrames.
  • Reading files with DataFrames.
  • DataFrames API principles.
  • Data partitioning - hash (modulo) and full-order (range).
  • Grouping, sorting, and joining in a distributed system.
  • Query plan and explain.
  • Spark cluster components.
  • Scheduling - jobs, stages, tasks.
  • Writing Spark applications in Scala, Java, and Python.
  • Using spark-submit - local and cluster mode.
  • Monitoring execution via Spark UI.
  • Logging - writing and collecting logs.
  • SparkSQL components: HiveQL, MetaStore, Storage.
  • File formats - Parquet, CSV, JSON.
  • Analytical functions.
  • IMDB example SQL walk-through.
  • IMDB example - hands on using DataFrame syntax.
  • Creating dynamic schemas - hands-on exercise with legacy data.
  • Defining User Defined Functions (UDFs) - see the short sketch after this syllabus.
  • Streaming principles - event time, watermark, unbounded table.
  • DataFrame API for streaming.
  • Hands-on - reading web logs from Kafka and detecting bots.
  • Smart sampling.
  • Bloom filters, linear counting, count-min sketch.
  • Spark approximation functions.
  • Machine learning terms - train, test, overfit, regularisation.
  • The Spark.ML DataFrame API.
  • Recommendation system algorithm - ALS.
  • Hands-on example - Movie recommendation.
  • Hyper-parameters.
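
As a preview of the UDF topic above, here is a minimal, self-contained sketch of defining and applying a user-defined function with the DataFrame API; the sample data and the domain-extraction logic are illustrative:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.udf

    object UdfSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("udf-sketch")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        // Illustrative data: two email addresses in a single-column DataFrame
        val df = Seq("alice@example.com", "bob@example.org").toDF("email")

        // A UDF that extracts the domain part of an email address
        val domain = udf((email: String) => email.split("@").last)

        // Apply the UDF as a new column and trigger execution with show()
        df.withColumn("domain", domain($"email")).show()

        spark.stop()
      }
    }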

Ready to get started?

Enroll Now