Watermark sense for windows

7/14/2023

So, if we are keeping a count of the number of customers in a time window, the count we are outputting always represents the “count so far” - future later data could arrive and revise this count. Instead of thinking of getting a single answer for our windowed computation we need to think of getting a stream of revisions representing “the result so far”. This table will get updated as new events arrive. That is, it is a table where the key is the aggregation ID and the window time. Rather than having a lot of special constructs for windowing, we can just say that a windowed computation is going to be a table. Now that we’ve understood the streaming representation of a table we can understand how to generalize the dataflow model for handling windowed computation. This means that to maintain the current count per region I need to subtract one from the count for the old region and add one to the count for the new region.Ĭlearly being able to compute both on pure streams (like clicks) as well as streams of revisions (like customer location) is important.

On the other hand, an update event (e.g., “Alice moved to Europe”) is revising the location information for the customer. The reason is that the count of clicks is never revised down - clicks arrive but old clicks never go away…so we only ever add to my count of clicks. In other words, another representation for the evolution of a table over time is a stream of the updates to the table:īut note that the mechanics of this computation are quite different than computing, say, the count of clicks by each customer. I can take the stream of updates to accounts and use it to keep a running count of the current number of customers in each region. We wrote about it in the first blog on Kafka Streams API. The answer is yes, there is a stream representation for a table like this. One approach to doing this would be to just run this count once a day, but can I do it in a streaming fashion? Let’s say that I have a table of customer accounts and I want to compute the number of customers in each geographical region. But most streaming processing systems don’t really represent tables of data. The task at hand for streaming apps isn’t processing only pure events, but combining it with data coming from these data stores. Most organizations have at their core a set of entities maintained in mutable databases - this might hold their customer account information, their sales, their inventory, etc. However, much of the data in an organization is not in this form. When people think of streams of events they mostly think about immutable entities.

By way of doing that, let me introduce the concept of Tables in Kafka Streams API. The third critique requires outlining the more general thing this is a special case of. The second critique is that if we get specific about what we are trading off it is mostly non-functional characteristics and don’t really belong in your code at all but are more like tuning knobs. If there was no better solution we’d accept that this was unavoidable complexity, but I think there is a better approach, which I’ll outline. There are at least 8 varieties of trigger and several types of watermark. All this watermark business is complex, no denying that. The first critique is fairly straight-forward.

It is a special case of a more general treatment of updatable data.
It mistakes an operational tuning parameter for business logic.
The meaning of a watermark is that it forces a final answer and discards any answer after that final answer. Here is what a typical computation using Watermarks looks like in the Beam API: Item (1) is exactly correct, but I think that while (2) is better than what came before, is still suboptimal and overly complex.
A particular approach to resolving this based on triggers and watermarks.
The observation that data can come out of order, and that the ability to revise results as new data arrives is important.
I think it’s important to differentiate two things: This has been widely copied by different stream processing systems. The Dataflow model of computation has integrated a system for coping with this into the Beam API. Their key observation is that in most cases you can’t globally order data arrival, which means that stream processing must handle out-of-order data. The Google Dataflow team has done a fantastic job in evangelizing their model of handling time for stream processing.

0 Comments

Watermark sense for windows

Leave a Reply.

Author

Archives

Categories