Ask me anything

Monthly virtual office hours on BigData technologies and architecture

The next session will be announced soon!

We know how challenging the BigData landscape can be. Now that we are all working remotely because of COVID-19, we believe it's the perfect time to move our office hours to a monthly virtual event.

Past Event

Watch on YouTube: Apache Kafka vs Apache Pulsar

February 24th, 2021

Apache Kafka vs Apache Pulsar

Apache Kafka has been the go-to publish-subscribe (pub-sub) messaging system for a while. It offers functionality for a wide range of enterprise use cases, along with a large ecosystem of tools and a dedicated community. But lately, upstart Apache Pulsar has been gaining ground. Pulsar takes the best parts of Kafka and expands on them to solve problems that were outside the scope of Kafka’s original design, though these capabilities have lately been coming to Kafka as well. In this session we'll cover the differences between the two, how to choose between them depending on your use case, and the current state of both.

Our team of experts will be here to answer all your BigData questions, live. Once a month we will host a live event, beginning with a short presentation on a bleeding-edge topic, followed by a Q&A session that is open to all.

Previous sessions:

Watch on YouTube: Presto and Elasticsearch - better together

January 27th, 2021
Presto and Elasticsearch - better together

More often than not, we find ourselves implementing BigData architectures that include both Presto and Elasticsearch. Presto is usually deployed for what we call the cold layer and Elasticsearch for the hot layer. In most systems, real-time access isn’t required for the lion’s share of the data, where the main concern is keeping costs low, making S3 and Presto a great fit. Usually, ultra-low latency queries are only required for a portion of the data, and that is where Elasticsearch, which is more hardware demanding and hence costlier, really shines. In this session we will show how to interconnect them seamlessly, and demo a couple of features that finally make some interesting use cases possible.
Watch on YouTube: Spark on Kubernetes: why and how to migrate your Spark pipelines to Cloud-Native Apache Spark

January 6th, 2021
Spark on Kubernetes: why and how to migrate your Spark pipelines to Cloud-Native Apache Spark

In the upcoming version of Spark (3.1), the Spark on Kubernetes integration will officially be declared production ready. A lot of companies have already adopted Spark on Kubernetes to benefit from containerization, reduce their costs, and make their architecture more portable and flexible. In this talk we'll go over the main pros & cons of running Spark on Kubernetes (as opposed to Hadoop YARN or proprietary platforms). The speaker, an ex-Databricks engineer now co-founder of Data Mechanics, a commercial Spark platform deployed on Kubernetes, will give practical tips to make this migration successful.
Watch on YouTube: Modern full-text search with Elasticsearch

October 28th, 2020
Modern full-text search with Elasticsearch

The field of information retrieval and text search has come a long way since its inception several decades ago. Join us for this session, where we will discuss modern text search practice with Elasticsearch, the Lucene-based search engine server and today's de facto standard for full-text search applications. We will start from basic keyword search: analyzers, term normalization, stemming and morphological properties. We will, of course, discuss its common challenges, such as boosts, synonyms, ontologies and phrases, and how to deal with them. Continuing from there, we will review modern and future approaches to full-text search, from vector search to word embedding methods like BERT, and how those come into play. We will also discuss how to improve precision and recall by using judgment lists, click-streams and search logs.
Watch on YouTube: Introduction to Delta Lake SQL

September 23rd, 2020
Introduction to Delta Lake SQL

Let's talk about what's new in Delta Lake 0.7.0 and how much you can do with SQL alone. We will begin with a short Delta Lake intro and then dive into all the goodies of the DDL and DML commands (like CREATE, ALTER, DROP, SELECT, UPDATE, DELETE, MERGE, EXPLAIN) that Delta Lake supports. We will review and demo everything live, and then see where your questions take us!
Watch on YouTube: SQL Query Anything, Anywhere with Starburst Presto

July 29th, 2020
SQL Query Anything, Anywhere with Starburst Presto

Build an Open Source Data Access Layer to federate Kafka, Data Lakes, and more. Starburst Presto provides a federated "Single Source of Access": a multi-node, elastically scaling cluster that pulls data from data warehouses, data lakes, relational data, NoSQL, and Kafka queues. Users run a single SQL query that joins data from all of them, merging the data on the fly into a single result set. Join in on this AMA to learn how to implement this in your environment.
Watch on YouTube: Usage patterns for Kafka

July 22nd, 2020
Usage patterns for Kafka

Kafka is a key component in data architectures because it enables easy decoupling between systems and performance improvements such as adding backpressure management to existing components. In this Ask Me Anything session we'll review some messaging patterns, such as Pub-Sub and Observer, and how they relate to architectural patterns such as CQRS, Event Sourcing and Event Collaboration. We'll cover the advantages and disadvantages of each pattern and learn to identify the best opportunities to use them.
Watch on YouTube: The State of Cloud Machine Learning

July 1st, 2020
The State of Cloud Machine Learning

The Machine Learning ecosystem has been booming in recent years, and with new product and technology announcements coming every week, it’s easy to get lost. We invite you to join us as Gad, Director of Machine Learning, explains what’s worth looking at and arms you with the knowledge to choose the right tool for the task.
Watch on YouTube: How to expose Big Data efficiently

June 17th, 2020
How to expose Big Data efficiently

For any Big Data architecture, the main goal is to make available to users data that would be very hard to use in traditional architectures because of size or latency requirements. In this AMA, we'll cover how to expose data efficiently in terms of performance and governance. We'll review some interesting patterns and technologies that make it easier for your users to consume the data previously processed in your pipelines.
Watch on YouTube: On storage system in Apache Spark

June 10th, 2020
On storage system in Apache Spark

This AMA session will begin with a very short introduction to Spark's storage system and the BlockManager. During this session we are going to show you when and how Spark saves data to disk using the storage system. It's going to be fairly low-level, but there will be enough high-level info that anybody should benefit. This session can get interactive, so expect questions to drive how low-level or high-level the discussion ends up.
Watch on YouTube: Avro, Parquet or JSON? What to use and, more importantly, how to manage schemas

June 3rd, 2020
Avro, Parquet or JSON? What to use and, more importantly, how to manage schemas

In this session, we'll review the differences between the most important Big Data file formats for Event Streaming, their pros and cons, and how to choose the best fit for a specific use case. We'll also take a look at the proper architecture to provide greater control over data quality using Schema Management. Need to add a new column to a downstream database? You don’t need an involved change process and at least 4 meetings to coordinate 15 teams. Join us to learn how it's possible to reduce operational complexity in the application development cycle.
Watch on YouTube: Elasticsearch: Performance and Stability in Production

May 27th, 2020
Elasticsearch: Performance and Stability in Production

There are so many Elasticsearch clusters out there, and many of them suffer from performance and stability issues because of misconfiguration or incorrect capacity planning. In this session we will look at the common errors people make when deploying Elasticsearch clusters, and offer best practices, do's and don'ts so the same doesn't happen to you.
Watch on YouTube: Big Data Architectures on Amazon Web Services (AWS)

May 20th, 2020
Big Data Architectures on Amazon Web Services (AWS)

This session will showcase typical Big Data architectures on AWS and show you how to build them yourself. We will cover building Data Warehouses and Data Lakes to make huge amounts of data queryable, orchestrating data pipelines and ETL processes, ingesting data at scale, and handling and computing on high-velocity data streams. These are huge tasks but are relatively easy to get done with AWS, and this session will show you where to begin.
Watch on YouTube: Kafka Streams: a gentle comparison with other real-time frameworks

May 13th, 2020
Kafka Streams: a gentle comparison with other real-time frameworks

In this session we will introduce Kafka Streams, a client library for building real-time processing applications, where the input and output data are stored in Kafka clusters. We will compare it with other popular real-time frameworks such as Flink and Spark Structured Streaming and talk about when to use which one.
Watch on YouTube: Alerting with Elasticsearch and the Elastic Stack

May 6th, 2020
Alerting with Elasticsearch and the Elastic Stack

The Elastic Stack is being used almost everywhere today for application and system monitoring. In this session we will show you how to add alerting to any Elastic-based monitoring system, so you can also get alerted via Email, Slack, PagerDuty and more when any of the alerting rules you defined gets triggered.

Monthly on a Wednesday

Propose a topic for a future session