Use case: I need to join two stream sources, e.g. orders(order_id, order_val) and shipments(shipment_id, shipment_val), on an ID (order_id = shipment_id) and emit a new event shipment_order(id, order_val, shipment_val).
Note that the gap between the two matching events can be very large (1-2 years): order(order_id=1, value=1) could arrive today, but shipment(shipment_id=1, value=2) might not arrive until a year later.
I am considering the following patterns to achieve window-efficient stream joins:
1. Save events to DynamoDB (or another datastore) and re-emit them via DynamoDB Streams (change data capture in general). When either an order or a shipment event arrives on the DDB stream, look up both sides in DynamoDB by the event's ID; if both are present, perform the join and emit a new shipment_order event.
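To make the first pattern concrete, here is a minimal sketch of the lookup-and-join logic. It uses an in-memory dict as a stand-in for the DynamoDB table, and the function name handle_event is hypothetical; in a real deployment the lookup and write would be DynamoDB calls triggered by the DDB stream.

```python
from typing import Optional

# Stand-in for the DynamoDB table, keyed by order_id / shipment_id.
store: dict[int, dict] = {}

def handle_event(event_id: int, kind: str, value: int) -> Optional[dict]:
    """Persist the incoming event, then probe for its counterpart.

    kind is "order" or "shipment". Returns the joined shipment_order
    event if both sides have now arrived, else None.
    """
    row = store.setdefault(event_id, {})
    row[kind] = value
    if "order" in row and "shipment" in row:
        return {"id": event_id,
                "order_val": row["order"],
                "shipment_val": row["shipment"]}
    return None
```

Either side can arrive first, even a year apart, because the "window" is simply the table's retention: handle_event(1, "order", 1) returns None today, and handle_event(1, "shipment", 2) a year later returns the joined event.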
2. Use Kafka Streams with a very large join window (2 years).
What are good patterns for implementing such large-window joins?
Is a 2-year window join advisable in a stateful system such as Kafka Streams? (What are the implications?)
Note: assume the system handles about 200 million events per day, with spikes. By efficiency I mean overall time/cost efficiency (and the tradeoffs involved).
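One implication worth quantifying: a stateful 2-year windowed join must buffer both streams for the full window. A back-of-the-envelope estimate using the 200 million events/day figure above (the 100 bytes/event size is an assumption for illustration):

```python
# Rough state-size estimate for a 2-year windowed stream-stream join.
events_per_day = 200_000_000        # stated throughput
window_days = 2 * 365               # 2-year join window
bytes_per_event = 100               # assumed: key + value + timestamp

retained_events = events_per_day * window_days        # per stream side
state_bytes = 2 * retained_events * bytes_per_event   # both sides buffered

print(f"~{state_bytes / 1e12:.0f} TB of join state")
```

Under these assumptions the join state alone is on the order of tens of terabytes, which is far beyond what a single Kafka Streams instance's local state store is typically sized for and would have to be spread across many partitions and instances.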