Ask me anything

Monthly virtual office hours on BigData technologies and architecture

The next session will be announced soon!

We know how challenging the BigData landscape can be. Now that we are all working remotely because of COVID-19, we believe it's the perfect time to move our office hours to a monthly virtual event.

Past Event

Monthly event

08:00 PST 11:00 EST 16:00 GMT 17:00 CET
February 24th, 2021

Apache Kafka vs Apache Pulsar

Apache Kafka has been the go-to publish-subscribe (pub-sub) messaging system for a while. It offers functionality for a wide range of enterprise use cases, along with a large ecosystem of tools and a dedicated community. But lately, upstart Apache Pulsar has been gaining ground. Pulsar takes the best parts of Kafka and expands on them to solve problems that were out of scope of Kafka’s original design, some of which Kafka itself has recently started to address. In this session we'll cover the differences between the two, how to choose between them for your use case, and the current state of both.


Monthly on a Wednesday

08:00 PST 11:00 EST 16:00 GMT 17:00 CET

Our team of experts will be here to answer all your BigData questions, live. Once a month we will host a live event, beginning with a short presentation on a bleeding-edge topic, followed by a Q&A session that is open to all.

Previous sessions:

  • January 27th, 2021

    Presto and Elasticsearch - better together

    More often than not, we find ourselves implementing BigData architectures that include both Presto and Elasticsearch. Presto is usually deployed for what we call the cold layer and Elasticsearch for the hot layer. In most systems, real-time access isn’t required for the lion’s share of the data, where the main concern is keeping costs low, making S3 and Presto a great fit. Usually, ultra-low-latency queries are only required for a portion of the data, and that is where Elasticsearch, which is more hardware-demanding and hence costlier, really shines. In this session we will show how to interconnect them seamlessly and demo a couple of features that finally make some really interesting use cases possible.

  • January 6th, 2021

    Spark on Kubernetes: why and how to migrate your Spark pipelines to Cloud-Native Apache Spark

    In the upcoming version of Spark (3.1), the Spark on Kubernetes integration will officially be declared production ready. A lot of companies have already adopted Spark on Kubernetes to benefit from containerization, reduce their costs, and make their architecture more portable and flexible. In this talk we'll go over the main pros & cons of running Spark on Kubernetes (as opposed to Hadoop YARN or proprietary platforms). The speaker, an ex-Databricks engineer now co-founder of Data Mechanics, a commercial Spark platform deployed on Kubernetes, will give practical tips to make this migration successful.

  • October 28th, 2020

    Modern full-text search with Elasticsearch

    The field of information retrieval and text search has come a long way since its inception several decades ago. Join us for this session, where we will discuss modern text search practice with Elasticsearch, the Lucene-based search engine server and today's de facto standard for full-text search applications. We will start from basic keyword search: analyzers, term normalization, stemming and morphological properties. We will, of course, discuss its common challenges, such as boosts, synonyms, ontologies and phrases, and how to deal with them. Continuing from there, we will review modern and future approaches to full-text search, from vector search to word embedding methods like BERT, and how those come into play. We will also discuss how to improve precision and recall by using judgment lists, click-streams and search logs.

  • September 23rd, 2020

    Introduction to Delta Lake SQL

    Let's talk about what's new in Delta Lake 0.7.0 and how much you can do with it using SQL alone. We will begin with a short Delta Lake intro and then dive into all the goodies of the DDL and DML commands (like CREATE, ALTER, DROP, SELECT, UPDATE, DELETE, MERGE, EXPLAIN) that Delta Lake supports. We will review and demo them all live, and then see where your questions take us from there!

  • July 29th, 2020

    SQL Query Anything, Anywhere with Starburst Presto

    Build an Open Source Data Access Layer to federate Kafka, Data Lakes, and more. Starburst Presto provides a federated "Single Source of Access" to create a multi-node, elastically scaling cluster to pull data from data warehouses, data lakes, relational data, NoSQL, and Kafka queues. Users run a single SQL query that joins data from all of them merging the data on the fly into a single result-set. Join in on this AMA to learn how to implement this in your environment.

  • July 22nd, 2020

    Usage patterns for Kafka

    Kafka is a key component in data architectures because it enables easy decoupling between systems and performance improvements such as adding backpressure management to existing components. In this Ask Me Anything session we'll review some messaging patterns, such as Pub-Sub and Observer, and how they relate to architectural patterns such as CQRS, Event Sourcing and Event Collaboration. We'll cover the advantages and disadvantages of each pattern and learn to identify the best opportunities to use them.

  • July 1st, 2020

    The State of Cloud Machine Learning

    The Machine Learning ecosystem has been booming in recent years, and with new product and technology announcements coming every week it’s easy to get lost. We invite you to join us as Gad, Director of Machine Learning, explains what’s worth looking at and arms you with the knowledge to choose the right tool for the task.

  • June 17th, 2020

    How to expose Big Data efficiently

    The main goal of any Big Data architecture is to make data available to its users that would be very hard to use in traditional architectures because of size or latency requirements. In this AMA, we'll cover how to expose data efficiently in terms of performance and governance. We'll review some interesting patterns and technologies that make it easier for your users to consume the data already processed in your pipelines.

  • June 10th, 2020

    On the storage system in Apache Spark

    This AMA session will begin with a very short introduction to the Storage System and BlockManager. During this session we are going to show you when and how Spark saves data to disk using the storage system. It's going to be fairly low-level, but there will be enough high-level info that anybody should benefit. This session can get interactive, so expect questions to drive how low- or high-level the discussion ends up.

  • June 3rd, 2020

    Avro, Parquet or JSON? What to use and, more importantly, how to manage schemas

    In this session, we'll review the differences between the most important Big Data file formats for Event Streaming, their pros and cons, and how to choose the best fit for a specific use case. We'll also take a look at the proper architecture to provide greater control over data quality using Schema Management. Need to add a new column to a downstream database? You don’t need an involved change process and at least 4 meetings to coordinate 15 teams. Join us to learn how to reduce operational complexity in the application development cycle.

  • May 27th, 2020

    Elasticsearch: Performance and Stability in Production

    There are so many Elasticsearch clusters out there, and many of them suffer from performance and stability issues because of misconfiguration or incorrect capacity planning. In this session we will look at the common errors people make when deploying Elasticsearch clusters, and offer best practices and do's and don'ts so it doesn't happen to you as well.

  • May 20th, 2020

    Big Data Architectures on Amazon Web Services (AWS)

    This session will showcase typical Big Data architectures on AWS and show you how to build them yourself. It covers building Data Warehouses and Data Lakes to make huge amounts of data queryable, orchestrating data pipelines and ETL processes, ingesting data at scale, and handling and computing on high-velocity data streams. These are huge tasks, but they are relatively easy to get done with AWS, and this session will show you where to begin.

  • May 13th, 2020

    Kafka Streams: a gentle comparison with other real-time frameworks

    In this session we will introduce Kafka Streams, a client library for building real-time processing applications, where the input and output data are stored in Kafka clusters. We will compare it with other popular real-time frameworks such as Flink and Spark Structured Streaming and talk about when to use which one.

  • May 6th, 2020

    Alerting with Elasticsearch and the Elastic Stack

    The Elastic Stack is being used almost everywhere today for application and system monitoring. In this session we will show you how to add alerting to any Elastic-based monitoring system, so you can also get alerted via Email, Slack, PagerDuty and more when any of the alerting rules you defined gets triggered.

Propose a topic for a future session
