Optimizing AppendToStreamAsync Performance for High-Volume Stream Writes in EventStoreDb (Docker on Mac M3)

Hey everyone,

I’m currently developing a larger business application with a strong reliance on Event Sourcing, using EventStoreDb as the backbone. So far, the experience has been surprisingly smooth — which is awesome.

Setup:

For development, I run a single-node EventStoreDb instance via Docker on my MacBook M3 (64 GB RAM, 16 CPU cores). The Docker container is configured with:

EVENTSTORE_INSECURE=true
EVENTSTORE_INT_IP=172.30.240.11
EVENTSTORE_TELEMETRY_OPTOUT=true
EVENTSTORE_ENABLE_ATOM_PUB_OVER_HTTP=true
EVENTSTORE_RUN_PROJECTIONS=All
EVENTSTORE_START_STANDARD_PROJECTIONS=true

Use Case:

We’re following a transactional outbox pattern to emit domain events. Every event is tied to a dedicated stream (aggregate stream pattern), for example:
price-8DA9A782-5B91-434A-B233-9177E1CDC13C

Let’s say there’s a mass update in our product price range — around 1.5 million prices need to be adjusted. This means:

  • 1.5 million events
  • 1.5 million individual streams
  • Each stream gets a single AppendToStreamAsync call

In my current tests, I’m observing a throughput of 3,000 individual AppendToStreamAsync calls in ~5.5 seconds (roughly 545 appends per second). While this doesn’t sound terrible, it also doesn’t seem particularly fast — especially given the scale I need to support.
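
For context, the test loop looks roughly like this (a sketch, not the exact benchmark code; priceChanges and its Id property stand in for our real outbox data, and the connection string assumes the insecure dev setup above):

using EventStore.Client;
using System.Text.Json;

var settings = EventStoreClientSettings.Create("esdb://localhost:2113?tls=false");
await using var client = new EventStoreClient(settings);

foreach (var change in priceChanges)
{
    var eventData = new EventData(
        Uuid.NewUuid(),
        "PriceChanged",
        JsonSerializer.SerializeToUtf8Bytes(change));

    // One network round trip per stream; 1.5 million of these in the full run.
    await client.AppendToStreamAsync(
        $"price-{change.Id}",
        StreamState.NoStream,
        new[] { eventData });
}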

My Questions:

  1. Is this throughput (3k appends in 5.5s) expected for Dockerized EventStoreDb on macOS with ARM (M3)?
  2. What are best practices to improve append performance for a high number of distinct streams?
  3. I’ve heard virtualized environments (e.g., Docker on Mac) can introduce performance bottlenecks. Are there specific optimizations for that setup?
  4. Would it be better to queue events in larger batches per stream where possible, or use another strategy altogether?

Any insights or real-world numbers from your own setups would be super helpful — especially from those running EventStore in containerized development environments.

Thanks in advance!

Update:
I’ve realized that during large-scale price updates, instead of writing each change to its own dedicated stream, I could direct all events into a consolidated stream — for example, something like price-change-8DA9A782-5B91-434A-B233-9177E1CDC13C.

By batching the 1.5 million price change events into a single stream, I can take advantage of AppendToStreamAsync’s ability to handle multiple events per call, significantly improving throughput and reducing overall write time. I’ve heard that streams should be kept small, but in this case I don’t really have a choice, do I?
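
Roughly what I have in mind (a sketch; bulkUpdateId and priceChangeEvents are placeholders, and client is the same EventStoreClient as above):

const int chunkSize = 1000;

foreach (var chunk in priceChangeEvents.Chunk(chunkSize)) // Enumerable.Chunk, .NET 6+
{
    var batch = chunk.Select(e => new EventData(
        Uuid.NewUuid(),
        "PriceChanged",
        JsonSerializer.SerializeToUtf8Bytes(e)));

    // One call now carries up to 1,000 events instead of a single one.
    await client.AppendToStreamAsync(
        $"price-change-{bulkUpdateId}",
        StreamState.Any,
        batch);
}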

Throughput / benchmark questions are notoriously hard to answer because of variations in the environment being tested on.

I’ve heard virtualized environments (e.g., Docker on Mac) can introduce performance bottlenecks. Are there specific optimizations for that setup?

  • adjusting the number of CPU cores and the amount of memory given to the Docker instance

Yes, one append call with multiple events will be faster.

Another alternative, if you want to keep 1 stream per price change:

  • parallelise the appends inside that one process (multiple concurrent AppendToStreamAsync() calls, using a semaphore to control the concurrency)
  • multiple processes
  • multiple clients appending in parallel

(there’s a bunch of work the server can do in parallel when using multiple clients, compared to one client doing all the work; rough sketch below)
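
a rough sketch of the multiple-clients idea (partitions and AppendPartitionAsync are placeholders you would define yourself):

// Each EventStoreClient instance gets its own gRPC channel, so several of
// them let the server spread the append work across connections.
var clients = Enumerable.Range(0, 4)
    .Select(_ => new EventStoreClient(
        EventStoreClientSettings.Create("esdb://localhost:2113?tls=false")))
    .ToArray();

// Partition the events up front, then fan out across the clients.
await Task.WhenAll(partitions.Select((partition, i) =>
    AppendPartitionAsync(clients[i % clients.Length], partition)));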

Hey Yves,

Thanks a lot for your quick reply!

I’ve also been thinking about introducing parallelism, but I’m still a bit hesitant due to potential side effects on the guaranteed ordering in our system. In most cases (I’d say around 90%), we’re working with aggregate streams, so ordering is only critical within each stream. In those scenarios, even with parallel processing, the event order should remain intact, as we’re appending a batch of events sequentially into a single stream.

It’s that remaining 10% that makes me nervous, especially where cross-stream ordering or subtle race conditions could sneak in. That’s why I’ve held off on implementing parallel writes for now and consider it more of a last-resort option.

For now, I’ve assigned Docker the full 16 cores and at least 16 GB of RAM. Looking at the Docker Dashboard, I can see EventStoreDB occasionally spikes to around 250% CPU usage, so it’s definitely working hard under load.

EDIT

Just to confirm — when you mentioned “multiple concurrent AppendToStreamAsync() and using a semaphore to control the concurrency,” were you referring to something like using a SemaphoreSlim to limit the number of simultaneous append operations?

var semaphore = new SemaphoreSlim(8); // Limit to 8 concurrent appends
var tasks = priceChangeEvents.Select(async e =>
{
    await semaphore.WaitAsync();
    try
    {
        await client.AppendToStreamAsync($"price-change-{e.Id}", ...);
    }
    finally
    {
        semaphore.Release();
    }
});
await Task.WhenAll(tasks);

Yes, I usually use something similar to this piece of code.

but I’m still a bit hesitant due to potential side effects on the guaranteed ordering in our system.

Are price updates order-dependent, even in a bulk scenario?
I would guess pricing updates happen at any time and on any item in the source system,
so your bulk import probably needs to make sure all updates are processed, in no particular order.

If your bulk update does contain information that needs to be in the same stream, you can first pre-process it to gather the data for a specific “item” in one go and send it all at once,
so your import becomes a list(A) of list(B),
where the order of processing of list(A) does not matter, while internally you make sure the list(B) information is processed in order in the target stream, as in the sketch below.
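
something like this (a sketch; PriceId and Sequence stand for whatever identifies the item and its order in your data):

// list(A): one group per target stream, processed in any order.
var groups = priceChangeEvents.GroupBy(e => $"price-{e.PriceId}");

await Parallel.ForEachAsync(groups, async (group, ct) =>
{
    // list(B): the events of one item, sorted and appended in order
    // with a single call, so per-stream ordering is preserved.
    var ordered = group
        .OrderBy(e => e.Sequence)
        .Select(e => new EventData(
            Uuid.NewUuid(),
            "PriceChanged",
            JsonSerializer.SerializeToUtf8Bytes(e)));

    await client.AppendToStreamAsync(
        group.Key, StreamState.Any, ordered, cancellationToken: ct);
});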

I’ve replaced the SemaphoreSlim approach with Parallel.ForEachAsync. I haven’t quite understood why the concurrency needs to be controlled. Can you share some insights?

    // Outbox letters arrive grouped by target stream: groups run in
    // parallel, while events within a group are handled in order.
    async Task IterateQueueAsync(
        IEnumerable<IGrouping<EventStream, EventStore.OutgoingLetter>> letters,
        CancellationToken cancellationToken
    )
    {
        using CancellationTokenSource cts = CancellationTokenSource.CreateLinkedTokenSource(
            cancellationToken
        );

        // MaxDegreeOfParallelism is not set, so Parallel.ForEachAsync defaults
        // to Environment.ProcessorCount concurrent groups.
        await Parallel.ForEachAsync(
            letters,
            new ParallelOptions() { CancellationToken = cts.Token },
            async (grouping, token) =>
            {
                try
                {
                    if (token.IsCancellationRequested)
                        return;

                    // Each group gets its own DI scope and document session.
                    using IServiceScope innerScope = scopeFactory.CreateScope();
                    await using IDocumentSession innerSession =
                        innerScope.ServiceProvider.GetRequiredService<IDocumentSession>();

                    await HandleLetterGroupAsync(innerSession, grouping, token);
                }
                catch (Exception ex)
                {
                    // Cancel the remaining groups on the first failure.
                    HandleException(ex);
                    await cts.CancelAsync();
                    throw;
                }
            }
        );
    }

In our architecture, we use a transactional outbox and consolidate events by their target EventStream. This ensures correct event ordering within a stream, which is the only constraint we enforce. As I mentioned, most of our scenarios tolerate unordered processing across streams, which allows us to leverage parallelism effectively. But maybe I’m overly worried about unwanted side effects, since we’re developing a large-scale application with many different types of aggregates and bounded contexts. From a logical point of view, it should work just fine.
The result with Parallel.ForEachAsync is 50,000 events processed in under 9 seconds.
This is an acceptable outcome for our situation.

If you allow, I’d like to ask a quick side question…

For cross-service communication, we subscribe to the $all stream using EventTypeFilter.Prefix(…), enqueue the filtered events into a ConcurrentQueue, and persist them via a transactional inbox with custom checkpointing:

var options = new SubscriptionFilterOptions(EventTypeFilter.Prefix(eventTypes));
await client.SubscribeToAllAsync(
    checkpoint.WithFromAll,
    async (_, @event, ct) => await HandleEventAsync(Constants.AllStream, @event, ct),
    resolveLinkTos: true,
    filterOptions: options,
    cancellationToken: cancellationToken
);

Would you consider this filtering approach on $all sufficient for a scalable microservice environment, or would you recommend a more specialized strategy per service?

I haven’t quite understood why the concurrency needs to be controlled. Can you share some insights?

you can’t let that piece of code eventually exhaust all the resources; you need a way to control that.
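
e.g. with your Parallel.ForEachAsync version, set the limit explicitly instead of relying on the default (a sketch reusing your names; plug your scope and session handling back into the body):

var options = new ParallelOptions
{
    // Without this, Parallel.ForEachAsync defaults to Environment.ProcessorCount,
    // which may still be more concurrent appends than the server handles well.
    MaxDegreeOfParallelism = 8,
    CancellationToken = cts.Token
};

await Parallel.ForEachAsync(letters, options, async (grouping, token) =>
{
    // create the scope / session and call HandleLetterGroupAsync as before
    await HandleLetterGroupAsync(grouping, token);
});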

Would you consider this filtering approach on $all sufficient for a scalable microservice environment, or would you recommend a more specialized strategy per service?

Yes, that’s what I would start with, together with the necessary telemetry to check that the processing is within tolerance.
Once processing is consistently not within tolerance, I’d find out what is having trouble and concentrate on solving that one thing only.

Typical metrics you need:

  • lag of the subscriber, if on $all:

    • the timestamp of the last event in the DB vs. the one you’re processing => the general trend should be towards 0
    • the position of the last event in the DB vs. the one you’re processing => the general trend should be towards 0

    Both will increase from time to time, as you’re appending new events while processing them; the rate of decrease should be higher than the rate of increase, otherwise you’ll never catch up.

  • how long HandleEventAsync is taking (usual histogram type) for each service

  • how many calls to HandleEventAsync per unit of time
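
a rough sketch of the position variant (FirstAsync comes from the System.Linq.Async package; checkpoint is wherever you store your last processed position):

// Head of $all: read a single event backwards from the end.
var head = client.ReadAllAsync(Direction.Backwards, Position.End, maxCount: 1);
ResolvedEvent last = await head.FirstAsync();

ulong headPosition = last.OriginalPosition!.Value.CommitPosition;
ulong processedPosition = checkpoint.CommitPosition;

// Export this as a gauge; the trend should go towards 0 while catching up.
ulong positionLag = headPosition - processedPosition;

The timestamp variant works the same way, comparing last.OriginalEvent.Created with the Created of the event you’re currently handling.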