In the upcoming version of Spark (3.1), the Spark on Kubernetes integration will officially be declared production ready. A lot of companies have already adopted Spark on Kubernetes to benefit from containerization, reduce their costs, and make their architecture more portable and flexible. In this talk we'll go over the main pros & cons of running Spark on Kubernetes (as opposed to Hadoop YARN or proprietary platforms). The speaker, an ex-Databricks engineer now co-founder of Data Mechanics, a commercial Spark platform deployed on Kubernetes, will give practical tips to make this migration successful.
Ask me anything
Monthly virtual office hours on BigData technologies and architecture
We know how challenging the BigData landscape can be, and now that we are all working remotely because of COVID-19, we believe it's the perfect time to move our office hours to a monthly virtual event.
January 27th
Presto and Elasticsearch - better together
More often than not, we find ourselves implementing BigData architectures that include both Presto and Elasticsearch. Presto is usually deployed for what we call the cold layer and Elasticsearch for the hot layer. In most systems, real-time access isn’t required for the lion’s share of the data; there the main concern is keeping costs low, which makes S3 and Presto a great fit. Ultra-low latency queries are usually required for only a portion of the data, and that is where Elasticsearch, which is more hardware demanding and hence costlier, really shines. In this session we will show how to interconnect them seamlessly, and demo a couple of features that finally make some interesting use cases possible.
- BigData Q&A open to all - ask our experts anything!
Monthly on a Wednesday
Our team of experts will be here to answer all your BigData questions, live. Once a month we will host a live event, beginning with a short presentation on a bleeding-edge topic, followed by a Q&A session that is open to all.
January 6th, 2021
October 28th, 2020
The field of information retrieval and text search has come a long way since its inception several decades ago. Join us for this session, where we will discuss modern text search practice with Elasticsearch, the Lucene-based search engine server and today's de facto standard for full-text search applications. We will start from basic keyword search: analyzers, term normalization, stemming and morphological properties. We will, of course, discuss its common challenges, such as boosts, synonyms, ontologies and phrases, and how to deal with them. Continuing from there, we will review modern and future approaches to full-text search, from vector search to embedding methods like BERT, and how those come into play. We will also discuss how to improve precision and recall by using judgment lists, click-streams and search logs.
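To make the keyword search terms above concrete, here is a minimal sketch of an analysis chain (tokenize, lowercase, drop stopwords, stem) feeding a tiny inverted index. It is a toy illustration only: the stopword list, the suffix rules and the names `analyze`, `build_index` and `search` are assumptions made for this example, and it is not how Lucene or Elasticsearch analyzers are implemented.

```python
import re

# Hypothetical, deliberately tiny stopword list and suffix rules (illustration only).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in"}
SUFFIXES = ("ing", "ed", "es", "s")


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Tokenize, lowercase (term normalization), drop stopwords, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]


def build_index(docs):
    """Map each analyzed term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every analyzed query term."""
    terms = analyze(query)
    if not terms:
        return set()
    return set.intersection(*[index.get(t, set()) for t in terms])
```

Because the same `analyze` function runs at index time and at query time, a query such as "jumping cats" can match a document that says "The cat jumped". A naive suffix stripper like this one over-stems and under-stems many words, which is part of why real analyzers are more involved.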
September 23rd, 2020
Let's talk about the latest in Delta Lake 0.7.0 and how much you can do with it using SQL only. We will begin with a short Delta Lake intro and then dive into all the goodies of the DDL and DML commands (like CREATE, ALTER, DROP, SELECT, UPDATE, DELETE, MERGE, EXPLAIN) that Delta Lake supports. We will review and demo everything live, and see where it goes from there with your questions!
July 29th, 2020
Build an Open Source Data Access Layer to federate Kafka, Data Lakes, and more. Starburst Presto provides a federated "Single Source of Access" to create a multi-node, elastically scaling cluster to pull data from data warehouses, data lakes, relational data, NoSQL, and Kafka queues. Users run a single SQL query that joins data from all of them merging the data on the fly into a single result-set. Join in on this AMA to learn how to implement this in your environment.
July 22nd, 2020
Kafka is a key component in data architectures because it enables easy decoupling between systems and performance improvements such as adding backpressure management to existing components. In this Ask Me Anything session we'll review some messaging patterns, such as Pub-Sub and Observer, and how they relate to architectural patterns such as CQRS, Event Sourcing and Event Collaboration. We'll cover the advantages and disadvantages of each pattern and learn to identify the best opportunities to use them.
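As a rough illustration of the decoupling that Pub-Sub provides, the sketch below shows an in-memory broker where publishers and subscribers only share a topic name. The `Broker` class and its methods are hypothetical names for this example; it is not Kafka's API, and Kafka itself works differently (messages are persisted in a log and consumers pull them at their own pace).

```python
from collections import defaultdict


class Broker:
    """Hypothetical in-memory broker: routes each message to a topic's subscribers."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        """Register a callable to be invoked for every message on the topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        """Deliver the message to all current subscribers; return how many received it."""
        handlers = list(self._subscribers.get(topic, []))
        for handler in handlers:
            handler(message)
        return len(handlers)


# The publisher knows only the topic name, not who consumes the message.
broker = Broker()
billing_inbox, shipping_inbox = [], []
broker.subscribe("orders", billing_inbox.append)
broker.subscribe("orders", shipping_inbox.append)
delivered = broker.publish("orders", {"order_id": 1})
```

Adding a third consumer requires no change to the publishing code, which is the property the pattern is chosen for. This push-based sketch has no buffering, so it does not demonstrate backpressure.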
July 1st, 2020
The Machine Learning ecosystem has been booming in recent years, and with new product and technology announcements coming every week it’s easy to get lost. We invite you to join us as Gad, Director of Machine Learning, explains what’s worth looking at and arms you with the knowledge to choose the right tool for the task.
June 17th, 2020
For any Big Data architecture, the main goal is to give users access to data that would be very hard to use in traditional architectures because of its size or latency requirements. In this AMA, we'll cover how to expose data efficiently in terms of performance and governance. We'll review some interesting patterns and technologies that make it easier for your users to consume the data previously processed in your pipelines.
June 10th, 2020
This AMA session will begin with a very short introduction to the Storage System and BlockManager. During this session we are going to show you when and how Spark saves data to disk using the storage system. It's going to be fairly low-level, but there will be enough high-level info that anybody should benefit. This session can get interactive, so expect questions to drive how low- or high-level the discussion ends up.
June 3rd, 2020
In this session, we'll review the differences between the most important Big Data file formats for Event Streaming, their pros and cons, and how to choose the best fit for a specific use case. We'll also take a look at the proper architecture to provide greater control over data quality using Schema Management. Need to add a new column to a downstream database? You don’t need an involved change process and at least 4 meetings to coordinate 15 teams. Join us to learn how it's possible to reduce operational complexity in the application development cycle.
May 27th, 2020
There are so many Elasticsearch clusters out there, and many of them suffer from performance and stability issues because of misconfiguration or incorrect capacity planning. In this session we will look at the common errors people make when deploying Elasticsearch clusters, and offer best practices, do's and don'ts so it doesn't happen to you as well.
May 20th, 2020
This session will showcase typical Big Data architectures on AWS and show you how to build them yourself: from building Data Warehouses and Data Lakes that make huge amounts of data queryable, through orchestrating data pipelines and ETL processes and ingesting data at scale, to handling and computing on high-velocity data streams. These are huge tasks, but they are relatively easy to get done with AWS, and this session will show you where to begin.
May 13th, 2020
In this session we will introduce Kafka Streams, a client library for building real-time processing applications, where the input and output data are stored in Kafka clusters. We will compare it with other popular real-time frameworks such as Flink and Spark Structured Streaming and talk about when to use which one.
May 6th, 2020
The Elastic Stack is being used almost everywhere today for application and system monitoring. In this session we will show you how to add alerting to any Elastic-based monitoring system, so you can also get alerted via Email, Slack, PagerDuty and more when any of the alerting rules you defined gets triggered.