Apache Kafka has become a standard for event-driven architectures, particularly stream processing, which is usually contrasted with batch processing, whether that's traditional ETL or something like ML training.
Many architectures end up as hybrids that support both batch and streaming. But wouldn't it make sense to just adopt a technology like Kafka for all data ingestion needs? Everything could be streamed in, but that doesn't mean everything has to be processed in real time. Kafka can hold onto data for as long as needed, and it's really a type of distributed database, not just a message queue.
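To make the retention point concrete: Kafka retention is configurable per topic, and setting `retention.ms` to `-1` disables time-based deletion entirely, so records stay until you delete them (or run out of disk). Here's a minimal sketch using the Java AdminClient; the topic name, partition/replica counts, and broker address are placeholders:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Map;
import java.util.Properties;
import java.util.Set;

public class CreateRetainedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder broker address.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // retention.ms = -1 turns off time-based deletion, so this
            // topic keeps its full history indefinitely.
            NewTopic topic = new NewTopic("orders", 6, (short) 3)
                    .configs(Map.of("retention.ms", "-1"));
            admin.createTopics(Set.of(topic)).all().get();
        }
    }
}
```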
Why not simply use Kafka as a "central nervous system" for the entire architecture, with all data sources publishing to Kafka and all consuming applications subscribing to Kafka topics? Any batch processing can just be a separate service that pulls data from Kafka when it runs (see the sketch below).
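A batch job over Kafka doesn't need anything real-time: it can assign itself a topic's partitions, snapshot the current end offsets, read everything up to that point, and exit. A rough sketch with the plain Java consumer, assuming a hypothetical `orders` topic with string-serialized records and a placeholder broker address:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class BatchReadFromKafka {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Assign partitions explicitly so the job manages its own offsets
            // instead of joining a consumer group.
            List<TopicPartition> partitions = consumer.partitionsFor("orders").stream()
                    .map(p -> new TopicPartition(p.topic(), p.partition()))
                    .toList();
            consumer.assign(partitions);
            consumer.seekToBeginning(partitions);

            // Snapshot the end offsets: the "batch" is everything up to now.
            Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);

            boolean done = false;
            while (!done) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // hypothetical per-record batch logic
                }
                // Finished once every partition has reached its snapshotted end.
                done = partitions.stream()
                        .allMatch(tp -> consumer.position(tp) >= endOffsets.get(tp));
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("%s -> %s%n", record.key(), record.value());
    }
}
```

Snapshotting the end offsets before the loop is what makes this a bounded batch rather than an open-ended stream: records that arrive while the job is running are simply left for the next run.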
Does anyone actually do this, or is streaming always bolted on as the second half of a hybrid architecture?