Edge AI for Real-Time Analytics: Architecture Patterns for Sub-Second Decisions

A data engineering view of edge AI for real-time analytics: where inference runs, how edge events stream back to a central OLAP store like ClickHouse, and the latency and consistency trade-offs that decide the design.

Most writing on edge AI stops at "run the model on the device." That framing misses the harder half of the problem. The model is the easy part. The real engineering work is the data plane around it: where features come from, where inference results go, how thousands of remote sites stay consistent with a central store, and how you query the whole fleet fast enough to act on it. Edge AI for real-time analytics is a data engineering problem first and a model problem second.

This post treats it that way. We will walk a reference architecture connecting edge inference to a central real-time OLAP backend such as ClickHouse, Apache Pinot, or Apache Druid, look at the latency budgets that justify pushing compute to the edge, and cover the data-plane and lifecycle patterns that decide whether the system holds up at scale.

Two Flavors of "Real-Time at the Edge"

There are two distinct things people mean by edge AI, and most production systems need both.

Inference-at-edge runs a model on or near the device that generates the data: a camera, a PLC on a factory line, a vehicle gateway, a point-of-sale terminal. The decision is made locally, in single-digit to low-double-digit milliseconds, without a network hop. This is what keeps a control loop safe or a checkout flowing when the uplink is congested or down.

Streaming-back-to-center treats every edge event - sensor readings, inference outputs, confidence scores - as a record to ship into a central analytics backend. That backend is where you ask cross-fleet questions: which sites are drifting, what changed after the last model rollout, how does this hour compare to the same hour last week.

Inference-at-edge optimizes for the latency of a single local decision. Streaming-back-to-center optimizes for queryable history across the whole fleet. They solve different problems, and a serious deployment runs both at once: the edge decides, the center learns.

The split matters because the two paths have opposite design pressures. The edge path wants minimal dependencies, deterministic timing, and graceful degradation under network loss. The central path wants completeness, schema discipline, and fast ad-hoc query. Routing every decision through the cloud so it lands in your warehouse conflates the two, and produces a system that is neither fast nor reliable.

A Reference Architecture, End to End

A workable pattern moves data through six stages. Written as text:

Edge devices (sensors, cameras, PLCs)
     -> Local inference (ONNX Runtime / TensorRT / TFLite, on the device or a gateway)
     -> Edge gateway: normalize protocols (OPC-UA, Modbus), pre-aggregate, buffer
     -> Event transport (MQTT broker -> Kafka)
     -> Central real-time OLAP (ClickHouse / Pinot / Druid)
     -> Dashboards, alerting, anomaly detection, retraining triggers

The gateway is the load-bearing component. Industrial protocols are messy and per-vendor; the gateway normalizes them into a common event schema before anything leaves the site. EMQ's production reference for this exact shape - edge gateway normalizing OPC-UA and Modbus, MQTT into EMQX, through Kafka, into ClickHouse Cloud - delivers sub-second analytics even across public networks, with the broker layer benchmarked at around 2 million messages per second (ClickHouse blog, 2025).
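As a rough illustration of what "normalize into a common event schema" can look like on the gateway, here is a minimal Python sketch. The input shapes (`unit_id`, `register`, `raw`, `scale`, `node_id`, `label`) and the function names are hypothetical, not taken from EMQ's reference or any protocol library; the output field names follow the columns of the ingestion tables in this post.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class EdgeEvent:
    """Common event shape emitted by the gateway (one JSON object per event)."""
    site_id: str
    device_id: str
    ts: str          # UTC with millisecond precision, "YYYY-MM-DD HH:MM:SS.mmm"
    metric: float
    inference: str
    confidence: float


def format_ts(epoch_ms: int) -> str:
    """Render epoch milliseconds as a UTC string with three fractional digits."""
    dt = datetime.fromtimestamp(epoch_ms // 1000, tz=timezone.utc)
    return dt.strftime("%Y-%m-%d %H:%M:%S.") + f"{epoch_ms % 1000:03d}"


def from_modbus(site_id: str, reading: dict) -> EdgeEvent:
    # Hypothetical Modbus shape: an integer register value plus a scale factor.
    return EdgeEvent(
        site_id=site_id,
        device_id=f"modbus-{reading['unit_id']}-{reading['register']}",
        ts=format_ts(reading["epoch_ms"]),
        metric=reading["raw"] * reading["scale"],
        inference=reading["label"],
        confidence=reading["confidence"],
    )


def from_opcua(site_id: str, reading: dict) -> EdgeEvent:
    # Hypothetical OPC-UA shape: a node id plus an already-scaled numeric value.
    return EdgeEvent(
        site_id=site_id,
        device_id=reading["node_id"],
        ts=format_ts(reading["epoch_ms"]),
        metric=float(reading["value"]),
        inference=reading["label"],
        confidence=reading["confidence"],
    )


def to_json_line(event: EdgeEvent) -> str:
    """Serialize one event as a single JSON object, ready to publish."""
    return json.dumps(asdict(event))
```

Whatever the real per-vendor parsing looks like, the point of the sketch is that everything downstream of the gateway sees one event shape.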

On the ClickHouse side, the canonical ingestion shape is the three-layer pattern: a Kafka engine table consumes the stream, a materialized view transforms each batch, and a ReplicatedMergeTree table stores it for query (ClickHouse + Kafka guide). Roughly:

-- 1. Source: read raw edge events off Kafka
CREATE TABLE edge_events_queue (
    site_id       LowCardinality(String),
    device_id     String,
    ts            DateTime64(3),
    metric        Float64,
    inference     String,
    confidence    Float32,
    model_version LowCardinality(String)  -- stamped by the edge on every event
) ENGINE = Kafka SETTINGS
    kafka_broker_list = 'kafka:9092',
    kafka_topic_list  = 'edge.events',
    kafka_group_name  = 'clickhouse_edge_events',  -- consumer group; name is illustrative
    kafka_format      = 'JSONEachRow';

-- 2. Destination: partitioned, sorted for fleet queries
CREATE TABLE edge_events (
    site_id       LowCardinality(String),
    device_id     String,
    ts            DateTime64(3),
    metric        Float64,
    inference     String,
    confidence    Float32,
    model_version LowCardinality(String)
) ENGINE = ReplicatedMergeTree
ORDER BY (site_id, device_id, ts)
PARTITION BY toYYYYMMDD(ts);

-- 3. Materialized view wires the queue to the table
CREATE MATERIALIZED VIEW edge_events_mv TO edge_events
AS SELECT * FROM edge_events_queue;

Why a columnar OLAP store and not a transactional database or a search engine for this tier? Because the central question is analytical: scans over hundreds of millions of edge events, grouped by site and time, with sub-second response. That is the workload ClickHouse, Pinot, and Druid are built for. For a broader walk through tool selection at this layer, see our engineer's guide to real-time data analysis tools, and for the streaming side specifically, the Kafka, Flink, and ClickHouse architecture blueprint.

Latency Budgets: When the Edge Actually Pays Off

Edge compute is not free. It adds fleet management, hardware heterogeneity, and a second deployment target. You only take that on when the latency math demands it, so do the math before committing.

The gap is large. Local inference typically lands in the 1-20 ms range, while a full cloud round-trip - sensor data up, queue, inference, command back down - routinely runs 50-500 ms and varies widely under network congestion (Edge AI inference guide, 2025). For a user or device on a different continent from a single central region, 300-500 ms is normal. The edge path is not just faster on average; it is deterministic, because it removes the network from the critical decision.

Dimension	Inference at the edge	Inference in the cloud
Decision latency	1-20 ms, deterministic	50-500 ms, network-dependent
Behavior under network loss	Keeps working locally	Stalls or fails
Compute available	Constrained (CPU/NPU, limited RAM)	Effectively unlimited
Model size ceiling	Tight; needs quantization/pruning	Large models run fine
Cross-fleet visibility	None on its own	Native
Operational cost	Fleet mgmt, OTA, heterogeneity	Higher per-inference and egress

The decision rule is straightforward. If a missed deadline causes physical harm or lost transactions - a safety interlock, a collision-avoidance loop, a checkout that must not block - inference belongs at the edge. If the decision tolerates a few hundred milliseconds and benefits from a larger model or richer context, keep it central and stream the events back. A retail loss-prevention camera flags a suspect event locally in milliseconds; the same event still streams to ClickHouse so the fleet can be queried for patterns. Most real deployments draw this line per decision, not per system.
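The rule above can be sketched as a small per-decision check. This is a minimal illustration, not a sizing tool: the `Decision` fields, the use of p99 latency as the comparison point, and the example numbers are all assumptions layered on top of the ranges quoted in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    name: str
    deadline_ms: float          # latest acceptable time to act on the input
    must_survive_outage: bool   # True if a stalled decision causes harm or lost transactions


def place_inference(decision: Decision, cloud_p99_ms: float, edge_p99_ms: float) -> str:
    """Return 'edge', 'cloud', or 'infeasible' for a single decision."""
    if decision.must_survive_outage:
        # Safety interlocks, checkouts: cannot depend on the uplink at all.
        return "edge"
    if cloud_p99_ms <= decision.deadline_ms:
        # The deadline tolerates the round-trip; prefer the larger central model.
        return "cloud"
    if edge_p99_ms <= decision.deadline_ms:
        return "edge"
    # Neither path meets the deadline; the model or hardware has to change.
    return "infeasible"


decisions = [
    Decision("safety_interlock", deadline_ms=20, must_survive_outage=True),
    Decision("shelf_restock_hint", deadline_ms=800, must_survive_outage=False),
    Decision("lane_departure_warning", deadline_ms=100, must_survive_outage=False),
]
placements = {d.name: place_inference(d, cloud_p99_ms=500, edge_p99_ms=20) for d in decisions}
```

Running the check per decision rather than per system is what lets one deployment keep the interlock local while the restock hint stays central.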

Data-Plane Patterns: What to Send and What to Keep

The link between thousands of sites and the center is the scarce resource. How you treat it shapes both cost and forensic capability. Four patterns cover most cases, and they combine.

Pre-aggregation at the edge. Roll up high-frequency telemetry into per-minute or per-event summaries before transmitting. A vibration sensor sampling at kilohertz rates does not need every raw point in the center; p95, peak, and a drift indicator usually carry the signal. This is the biggest single lever on bandwidth and central storage.
Raw passthrough for forensics. Keep full-fidelity raw data when you need to reconstruct exactly what the model saw - incident review, audit, retraining. Buffer it at the edge and ship on demand rather than continuously.
Selective sampling. Always send anomalies and low-confidence inferences at full detail; sample the high-confidence majority. Inference confidence is a natural sampling key: the cases the model is unsure about are the ones worth keeping.
Tiered hot/cold storage. In the central store, keep recent edge events on fast storage for sub-second query, and age older partitions to object storage. ClickHouse tiered storage and TTL moves, or an Iceberg lakehouse for the cold tier, give you queryable history without paying hot-tier prices for last quarter's telemetry.
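To make the pre-aggregation pattern concrete, here is a minimal sketch that collapses one window of raw samples into the p95, peak, and drift summary mentioned above. The nearest-rank percentile and the definition of drift as "mean of the second half minus mean of the first half" are assumptions chosen for simplicity; a real deployment would pick whatever indicator suits the sensor.

```python
from statistics import fmean


def summarize_window(samples: list[float]) -> dict:
    """Reduce one window of raw readings to a small summary record."""
    if not samples:
        raise ValueError("cannot summarize an empty window")
    ordered = sorted(samples)
    n = len(ordered)
    # Nearest-rank p95: the smallest value with at least 95% of samples at or below it.
    rank = (95 * n + 99) // 100          # integer form of ceil(0.95 * n)
    half = n // 2
    # Drift indicator (assumed definition): shift in mean between the two
    # halves of the window, taken in arrival order.
    drift = fmean(samples[half:]) - fmean(samples[:half]) if half else 0.0
    return {
        "count": n,
        "p95": ordered[rank - 1],
        "peak": ordered[-1],
        "drift": drift,
    }


# One window of 100 evenly rising readings becomes a four-field record.
summary = summarize_window([float(v) for v in range(1, 101)])
```

A minute of kilohertz data shrinks to one such record, which is where the bandwidth and storage savings come from.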

A common mistake is applying one pattern globally. Pre-aggregating everything destroys the forensic trail; passing everything through raw blows the budget. Pick per data stream, driven by how that stream gets used.
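A per-stream choice can be as simple as a policy table plus a sampling function for the streams that use selective sampling. The sketch below is illustrative: the stream names, the 0.6 confidence floor, the 5% sample rate, and the use of a CRC32 hash for deterministic sampling are all assumptions.

```python
import zlib

# Hypothetical per-stream policy: each stream gets the pattern that fits its use.
STREAM_POLICY = {
    "vibration": "pre_aggregate",
    "camera_inference": "selective_sample",
    "safety_interlock": "raw_buffered",
}


def should_send(event_id: str, confidence: float, is_anomaly: bool,
                confidence_floor: float = 0.6, sample_rate: float = 0.05) -> bool:
    """Selective sampling keyed on inference confidence."""
    # Anomalies and uncertain predictions always go to the center in full detail.
    if is_anomaly or confidence < confidence_floor:
        return True
    # High-confidence majority: keep a deterministic fraction. Hashing the
    # event id into [0, 1) makes the decision repeatable across retries.
    bucket = zlib.crc32(event_id.encode("utf-8")) / 2**32
    return bucket < sample_rate
```

Deterministic sampling matters once events are retried after a network partition: the same event either always ships or never does, so retries do not skew the sample.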

Model Lifecycle and Feedback Across Thousands of Sites

A model trained once and forgotten is a liability. Edge environments drift - lighting changes, machines wear, traffic patterns shift - and the model degrades silently unless you instrument it. Managing this across a large fleet is where edge AI projects most often fail operationally.

Treat rollout like any other risky deploy. Push new model versions over the air with delta packaging to limit bandwidth, roll out to a canary subset first, watch accuracy and resource metrics, and keep a tested rollback path before going fleet-wide (Edge AI lifecycle management). Version every artifact and stamp the model version onto every inference event. That last detail is what makes the central store useful: when you can GROUP BY model_version over millions of edge events, a regression after a rollout shows up as a measurable shift, not a support ticket three weeks later.
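One way to pick the canary subset is to hash each device into a stable bucket, so the cohort is reproducible and grows monotonically as the percentage is raised. This is a sketch under assumptions: the function names, the version strings, and the choice of SHA-256 are illustrative and not tied to any particular OTA system.

```python
import hashlib


def in_canary(device_id: str, model_version: str, canary_percent: int) -> bool:
    """Deterministically assign a device to the canary cohort for one rollout."""
    # Salting with the model version gives each rollout a different cohort,
    # so the same devices do not absorb the risk every time.
    digest = hashlib.sha256(f"{model_version}:{device_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100   # stable bucket in 0..99
    return bucket < canary_percent


def target_version(device_id: str, stable: str, candidate: str, canary_percent: int) -> str:
    """Version a device should run; setting canary_percent to 0 is the rollback."""
    return candidate if in_canary(device_id, candidate, canary_percent) else stable


fleet = [f"device-{i}" for i in range(10_000)]
canary = [d for d in fleet if target_version(d, "v1.4.0", "v1.5.0", 10) == "v1.5.0"]
```

Because a device's bucket never changes within a rollout, widening from 10% to 50% only adds devices, and dropping the percentage to zero returns every device to the stable version.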

The feedback loop closes through the same pipe. Edge inference results - prediction, confidence, and where available the eventual ground-truth outcome - stream back into the central OLAP store. There you compute drift: compare the live distribution of confidence and predictions against the training baseline, per site and per model version, and trigger retraining when the gap crosses a threshold. The central store does double duty here. It holds the queryable history of edge events and it is the observability plane for the fleet itself - which devices are offline, which are slow, where inference quality is sliding. That second role is why databases, not models, are the foundation of any serious AI system, a point we make at length in why databases are critical in the agentic AI era.
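The post does not prescribe a drift metric. As one possible instance, the sketch below compares a live confidence distribution against a training baseline using the Population Stability Index (PSI), a common choice for this kind of check. The ten-bin histogram, the epsilon floor for empty bins, and the 0.2 alert threshold (a widely used rule of thumb) are assumptions; in practice the inputs would come from a query over the central store, per site and per model version.

```python
import math


def histogram(values: list[float], bins: int = 10) -> list[int]:
    """Count confidence scores (expected in [0, 1]) into equal-width bins."""
    counts = [0] * bins
    for v in values:
        idx = min(max(int(v * bins), 0), bins - 1)   # clamp into a valid bin
        counts[idx] += 1
    return counts


def psi(baseline: list[float], live: list[float], bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between two samples of confidence scores."""
    if not baseline or not live:
        raise ValueError("both samples must be non-empty")
    b_counts, l_counts = histogram(baseline, bins), histogram(live, bins)
    score = 0.0
    for cb, cl in zip(b_counts, l_counts):
        pb = max(cb / len(baseline), eps)   # floor avoids log(0) on empty bins
        pl = max(cl / len(live), eps)
        score += (pl - pb) * math.log(pl / pb)
    return score


def needs_retraining(baseline: list[float], live: list[float], threshold: float = 0.2) -> bool:
    return psi(baseline, live) > threshold
```

A score of zero means the live distribution matches the baseline bin for bin; the further the live confidences slide away, the larger the score.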

A query to surface drift candidates is unremarkable, which is the point - this is ordinary OLAP work:

SELECT site_id, model_version,
       avg(confidence)                     AS mean_conf,
       countIf(confidence < 0.6) / count() AS low_conf_ratio
FROM edge_events
WHERE ts >= now() - INTERVAL 1 HOUR
GROUP BY site_id, model_version
HAVING low_conf_ratio > 0.2
ORDER BY low_conf_ratio DESC;

Pitfalls That Sink Edge Deployments

A handful of failure modes recur across projects.

Model size versus hardware. Edge accelerators have hard memory and compute ceilings. Plan for quantization and pruning from the start; do not design around a model the target hardware cannot run.
Hardware heterogeneity. A fleet built over years spans CPU generations, NPUs, and OS versions. The deployment system has to handle multiple target runtimes (ONNX Runtime, TensorRT, TFLite) rather than assuming one.
Network partitions. The edge will lose its uplink. Local inference must keep deciding, and the gateway must buffer events durably and reconcile when the link returns. Design for the partition; it is not an edge case, it is Tuesday.
Consistency between edge and center. The edge holds the recent, locally-true state; the center holds the complete, eventually-consistent picture. You cannot have both strong consistency and edge autonomy under partition. Pick eventual consistency deliberately, stamp events with edge-generated timestamps and idempotency keys, and let the central store deduplicate.

That last trade-off is the heart of the architecture. Choosing edge autonomy means accepting that the central view lags reality by the buffer-and-ship interval, and designing every downstream query and alert to tolerate it.
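A minimal sketch of the buffer-and-reconcile side, assuming a local SQLite outbox on the gateway. The class name, table layout, and the idempotency key built from site, device, and edge timestamp are hypothetical. The `send` callback stands in for whatever publishes to the broker, and is assumed to raise `ConnectionError` while the uplink is down.

```python
import hashlib
import json
import sqlite3


class EdgeBuffer:
    """Durable store-and-forward outbox with idempotency keys."""

    def __init__(self, path: str = ":memory:"):
        # Use a file path on a real gateway so events survive a restart.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS outbox ("
            "event_key TEXT PRIMARY KEY, payload TEXT NOT NULL)"
        )
        self.db.commit()

    @staticmethod
    def event_key(event: dict) -> str:
        # Assumed key: one event per device per edge-generated timestamp.
        raw = f"{event['site_id']}|{event['device_id']}|{event['ts']}"
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def append(self, event: dict) -> str:
        key = self.event_key(event)
        # INSERT OR IGNORE makes a repeated append of the same event a no-op.
        self.db.execute(
            "INSERT OR IGNORE INTO outbox VALUES (?, ?)", (key, json.dumps(event))
        )
        self.db.commit()
        return key

    def pending(self) -> int:
        return self.db.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]

    def flush(self, send) -> int:
        """Ship buffered events oldest first; stop at the first failure."""
        rows = self.db.execute(
            "SELECT event_key, payload FROM outbox ORDER BY rowid"
        ).fetchall()
        sent = 0
        for key, payload in rows:
            try:
                send(key, payload)
            except ConnectionError:
                break   # link is down; keep the rest for the next attempt
            # Delete only after a successful send (at-least-once delivery).
            self.db.execute("DELETE FROM outbox WHERE event_key = ?", (key,))
            self.db.commit()
            sent += 1
        return sent
```

Because an event is deleted only after a successful send, a crash between send and delete produces a duplicate. That is why the key travels with the event: the central store can then collapse repeats, for example with a ReplacingMergeTree table sorted on that key.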

Key Takeaways

Edge AI for real-time analytics is a data engineering problem first: the data plane around the model matters more than the model.
Production systems need both inference-at-edge (deterministic local decisions) and streaming-back-to-center (queryable fleet history). They pull in opposite directions - keep them separate.
A proven reference shape is edge inference -> gateway normalization and pre-aggregation -> MQTT/Kafka -> central columnar OLAP (ClickHouse, Pinot, Druid), using the Kafka-table -> materialized-view -> ReplicatedMergeTree pattern for ingestion.
Push inference to the edge when a missed deadline causes harm or lost transactions (1-20 ms local vs 50-500 ms cloud round-trip); keep it central when a few hundred milliseconds is fine.
Manage the model fleet like a risky deploy: canary rollouts, OTA delta updates, rollback, and a model version stamped on every event so drift is queryable.
Design for network partitions and eventual consistency from day one; you cannot have both edge autonomy and a strongly consistent central view while a site is partitioned.

If you are designing an edge-to-center analytics platform and want a second pair of eyes on the architecture, talk to us - this is the kind of system we build and operate.