Stream processing

Why do we need stream processing

Transmitting event streams

Event

Messaging system

A producer sends a message containing the event, which is then pushed to one or more consumers.

Direct messaging from producer to consumer

It is the application's responsibility to handle message loss and failures.

Messaging through a broker (mindset: transient messaging)

A broker is essentially a data store (which may be in-memory, on-disk, or a hybrid): the producer sends a message to the broker, the broker pushes the message to consumers, and once the message has been delivered and processed it is deleted from the broker. The producer does not wait for the consumer to process the message.

Well-known products: RabbitMQ, ActiveMQ, Azure Service Bus, Google Cloud Pub/Sub.
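
For instance, a minimal sketch with RabbitMQ and the pika client (the queue name, message body, and print placeholder are assumptions made for this example; producer and consumer would normally be separate processes): the producer publishes and moves on without waiting, and the broker deletes a message only after the consumer acknowledges it.

    import pika

    # Connect to a local RabbitMQ broker (host and queue name assumed for this sketch).
    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="events")

    # Producer side: publish and return immediately; no waiting for the consumer.
    channel.basic_publish(exchange="", routing_key="events", body=b'{"type": "page_view"}')

    # Consumer side: the broker removes the message only after this explicit ack.
    def handle(ch, method, properties, body):
        print("processing", body)                        # placeholder for real processing
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue="events", on_message_callback=handle, auto_ack=False)
    channel.start_consuming()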

Multiple consumers

Delivery guarantees

Message brokers compared to databases

Partitioned logs (mindset: permanent, traceable/replayable messaging)

The broker behind this approach is also called a log-based message broker.

Well-known products: Apache Kafka, Amazon Kinesis Streams, Twitter's DistributedLog.
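
A minimal sketch of reading from a partitioned log with the kafka-python client (topic name, consumer group, and broker address are assumptions): messages are not deleted when they are consumed; the consumer only advances and commits its offset, so it can resume from that position after a restart.

    from kafka import KafkaConsumer

    # Topic name, group id, and broker address are made up for this sketch.
    consumer = KafkaConsumer(
        "activity-events",
        bootstrap_servers="localhost:9092",
        group_id="analytics",
        enable_auto_commit=False,      # commit offsets explicitly after processing
        auto_offset_reset="earliest",
    )

    for record in consumer:
        # The log is append-only; consuming only moves this group's position forward.
        print(record.partition, record.offset, record.value)
        consumer.commit()              # record the position so processing can resume here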

Offsets

Disk space usage

Failure handling

Databases and streams

Change data capture

Databases usually write data changes to their write-ahead log (WAL), and a log-based message broker is well suited for transporting the change log from the source database to derived systems. Facebook's Wormhole is implemented exactly for this purpose: it captures the database log and replicates it to remote systems.
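
A sketch of the consuming side, assuming change events arrive in log order in a simple made-up upsert/delete format: applying them one by one keeps a derived key-value copy consistent with the source table.

    # Hypothetical change-event format: {"op": "upsert"|"delete", "key": ..., "value": ...}
    derived_store = {}                  # e.g., a cache or search index being kept in sync

    def apply_change(event):
        if event["op"] == "upsert":
            derived_store[event["key"]] = event["value"]
        elif event["op"] == "delete":
            derived_store.pop(event["key"], None)

    # Replaying the change log in order reproduces the state of the source table.
    change_log = [
        {"op": "upsert", "key": "user:1", "value": {"name": "Alice"}},
        {"op": "upsert", "key": "user:1", "value": {"name": "Alice B."}},
        {"op": "delete", "key": "user:1", "value": None},
    ]
    for event in change_log:
        apply_change(event)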

Event sourcing

TBA

Processing streams

Event time and processing time

The time at which an event occurred is usually not the same as the time at which it is processed: network slowness or a consumer failure can cause an event to arrive with a timestamp older than events the backend server has already received. Such out-of-order events can distort the results of stream analysis, e.g., Google search query trend analysis.

Which time to use

Three timestamps can be logged: the time the event occurred according to the device clock, the time it was sent to the server according to the device clock, and the time it was received by the server according to the server clock. Subtracting the second from the third gives the offset between the device clock and the server clock (ignoring network delay); applying that offset to the first estimates the time at which the event actually occurred.
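
A small sketch of that adjustment (variable names and the epoch-second values are illustrative):

    def estimate_true_event_time(event_time_device, send_time_device, receive_time_server):
        # Offset between the device clock and the server clock (network delay ignored).
        clock_offset = receive_time_server - send_time_device
        return event_time_device + clock_offset

    # Device clock is 120 s slow: event at 1000, sent at 1005, received at 1125 (server clock).
    print(estimate_true_event_time(1000, 1005, 1125))   # -> 1120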

Defining the time window

Types of time windows

Stream joins

Stream-stream join (window join)

Example: calculating the click-through rate of a URL in search results. There are two types of events: (a) a user types in a search query, and (b) the user clicks on a search result. We want to join the two types of events for the calculation.
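
A rough in-memory sketch of such a window join, assuming made-up event shapes keyed by a search_id and a one-hour join window:

    WINDOW_SECONDS = 3600                  # join clicks to searches within one hour
    searches = {}                          # search_id -> (timestamp, query); per-key join state

    def on_search(event):
        # event: {"search_id": ..., "query": ..., "timestamp": ...}  (assumed shape)
        searches[event["search_id"]] = (event["timestamp"], event["query"])

    def on_click(event):
        # event: {"search_id": ..., "url": ..., "timestamp": ...}  (assumed shape)
        state = searches.get(event["search_id"])
        if state and event["timestamp"] - state[0] <= WINDOW_SECONDS:
            emit_joined(query=state[1], url=event["url"])   # joined event for CTR calculation

    def emit_joined(query, url):
        print("click-through:", query, "->", url)

    def expire_old_searches(now):
        # Searches older than the window can never be joined and are dropped from state.
        for sid, (ts, _) in list(searches.items()):
            if now - ts > WINDOW_SECONDS:
                del searches[sid]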

Challenges:

Solutions:

Stream-table join (stream enrichment)

Example: The input is a stream of activity events containing a user ID, and the output is a stream of activity events in which the user ID has been augmented with profile information about the user.
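A minimal sketch of stream enrichment, assuming the stream processor keeps a local copy of the user-profile table (kept fresh via change data capture) and using made-up event shapes:

    # Local copy of the users table; in practice populated and updated via CDC.
    user_profiles = {
        "u1": {"name": "Alice", "country": "NL"},
    }

    def on_profile_change(user_id, profile):
        # Change events from the database keep the local table copy up to date.
        user_profiles[user_id] = profile

    def enrich(activity_event):
        # activity_event: {"user_id": ..., "action": ...}  (assumed shape)
        profile = user_profiles.get(activity_event["user_id"], {})
        return {**activity_event, "profile": profile}

    print(enrich({"user_id": "u1", "action": "page_view"}))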

Challenges:

Solutions:

Table-table join (materialized view maintenance)

Example: a Twitter user posting/deleting a tweet or following/unfollowing another user involves two tables (the tweets table and the follows table), and a materialized view such as each user's home timeline must be kept up to date as both tables change.
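
A toy sketch of maintaining such a materialized view in memory (the data structures and function names are made up; a real system would partition and persist this state):

    from collections import defaultdict

    tweets_by_user = defaultdict(list)   # tweets "table"
    followers = defaultdict(set)         # follows "table": author -> set of followers
    timelines = defaultdict(list)        # materialized view: follower -> home timeline

    def on_tweet(author, tweet):
        tweets_by_user[author].append(tweet)
        for follower in followers[author]:                    # fan out the new tweet
            timelines[follower].append(tweet)

    def on_follow(follower, author):
        followers[author].add(follower)
        timelines[follower].extend(tweets_by_user[author])    # backfill existing tweets

    def on_unfollow(follower, author):
        followers[author].discard(follower)
        timelines[follower] = [t for t in timelines[follower]
                               if t not in tweets_by_user[author]]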

Challenges and Solutions: TBA

Time dependence of joins

Challenges:

Solutions:

Fault tolerance in stream processing

Micro-batching or checkpointing, combined with atomic distributed transactions or idempotent writes, ensures the exactly-once semantics of stream processing.
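
As one example of the idempotence approach (a sketch; the state layout and names are assumptions): storing the offset of the last applied message next to the derived state lets a restarted processor skip messages it has already applied, so replay after a crash does not double-count.

    state = {"count": 0, "last_applied_offset": -1}

    def apply(message, offset):
        # Replayed messages whose offset was already applied are skipped, so a crash
        # and restart from an older position cannot double-count an event.
        if offset <= state["last_applied_offset"]:
            return
        state["count"] += 1
        state["last_applied_offset"] = offset   # ideally written atomically with the count

    # Simulate a replay after a crash: offsets 0 and 1 are delivered, then 1 again.
    for offset in [0, 1, 1]:
        apply({"type": "click"}, offset)
    print(state["count"])   # -> 2, not 3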