java – Architectural design for sending large amounts of analytics data from production servers to S3 without impacting request performance

Let's say we have a server receiving up to 1000 requests per second and serving them at a p99 of 20 ms (there is a strong business case for not increasing this latency). The server's GC parameters have been carefully tuned for this performance, and the current latency is already bottlenecked by GC. We want to log structured data about requests and responses, ideally 100% of it without dropping anything, to S3 in, for example, gzipped JSON Lines format (analytics will be run on this data, and each file should ideally be 100 MB to 500 MB in size). Analytics does not have to be real-time; a delay of a few hours, for example, is fine. Also, I/O utilization already approaches 100%, so writing this data to disk at any point is likely not an option. All code is in Java.

Solution 1:
Use the threads that receive and serve requests as producers and have them enqueue each request/response into blocking buffer(s), with error/edge-case handling for a full buffer, exceptions, etc., so that the producer threads never get blocked no matter what. Then have a consumer thread pool consume from these buffer(s) in batches, compress the data, and send it to S3. The upside is that it is a simple(ish) solution. The main downside is that all of this happens in the same JVM and might increase the allocation rate and degrade performance for the main requests. I suspect the main source of new object creation would be serialization to strings (is this true?). Putting objects into a fixed-size queue, or draining them into an existing collection (using the drainTo method on BlockingQueue), should not allocate anything new, I think.
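To make Solution 1 concrete, here is a minimal sketch of what I have in mind. The class name AnalyticsBuffer, the method names, and the choice of ArrayBlockingQueue are all assumptions of mine for illustration. It also assumes each record has already been serialized to a JSON line before it is enqueued; one could instead enqueue the raw object and serialize on the consumer thread. The S3 upload itself is left out; the consumer would pass the returned bytes to whatever S3 client is in use.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch of Solution 1: bounded queue, non-waiting producers, batching consumer. */
public class AnalyticsBuffer {
    private final BlockingQueue<String> queue;
    private final AtomicLong dropped = new AtomicLong();
    // Reused by the single consumer thread so drainTo does not need a new list per batch.
    private final List<String> batch;
    private final int maxBatch;

    public AnalyticsBuffer(int capacity, int maxBatch) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.batch = new ArrayList<>(maxBatch);
        this.maxBatch = maxBatch;
    }

    /**
     * Called on request threads. offer() never waits for free space:
     * if the queue is full, the record is dropped and counted instead.
     */
    public boolean record(String jsonLine) {
        boolean accepted = queue.offer(jsonLine);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    public long droppedCount() {
        return dropped.get();
    }

    /**
     * Called on the consumer thread only. Drains up to maxBatch records and
     * returns them as gzipped JSON Lines, or an empty array if nothing was queued.
     */
    public byte[] drainAndCompress() {
        batch.clear();
        queue.drainTo(batch, maxBatch);
        if (batch.isEmpty()) {
            return new byte[0];
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            for (String line : batch) {
                gzip.write(line.getBytes(StandardCharsets.UTF_8));
                gzip.write('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

Note that offer() on ArrayBlockingQueue still takes the queue's lock briefly, so producers can contend with each other and with the consumer even though they never wait for space. The compression step here allocates (the byte arrays from getBytes and the output buffer), which is exactly the kind of extra allocation in the main JVM I am worried about.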

Solution 2:
Set up a separate service running on the same host (so a separate JVM with its own tuned GC if necessary) that exposes endpoints such as localhost:8080/request. Producers send data to these endpoints, and all consumer logic lives in this service (mostly the same as before). The downside is that this might be more complex. Also, sending data, even to localhost, might block the producer thread (whose main job is to serve requests) and decrease throughput per host?
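For the producer side of Solution 2, a rough sketch of a send that does not wait on the request thread could look like the following. It assumes Java 11+ and uses the JDK's built-in java.net.http.HttpClient; the class name SidecarClient, the timeouts, and the Content-Type header are hypothetical choices of mine, not something the separate service requires.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch of the producer side of Solution 2 (assumes Java 11+). */
public class SidecarClient {
    // One shared client; the connect timeout is an arbitrary illustrative value.
    private final HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofMillis(200))
            .build();
    private final URI endpoint;

    public SidecarClient(URI endpoint) {
        this.endpoint = endpoint;
    }

    /**
     * Hands the record to the HTTP client and returns a future right away;
     * the exchange itself runs on the client's own threads, not the caller's.
     * The future completes with the HTTP status code, or exceptionally on failure.
     */
    public CompletableFuture<Integer> send(String jsonLine) {
        HttpRequest request = HttpRequest.newBuilder(endpoint)
                .timeout(Duration.ofSeconds(1))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonLine))
                .build();
        return client.sendAsync(request, HttpResponse.BodyHandlers.discarding())
                .thenApply(HttpResponse::statusCode);
    }
}
```

A request thread would call something like `client.send(line)` and move on without calling get() on the result. This only moves the waiting off the request thread: building the request and the body still allocates in the main JVM, and one request per record at 1000 requests per second may be too chatty, so batching before sending is probably still needed.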

For Solution 1 or 2, are there any Java-compatible libraries (producer/consumer libraries or high-performance TCP-based messaging libraries) that would be appropriate to use instead of rolling my own?

I know these questions can be answered by benchmarking and building a PoC, but I'm looking for some direction in case someone has suggestions, or maybe a third way I haven't thought of.