Stephen, please excuse my slow response. These are awesome questions; let me see if I can pull this all together without infodumping. Hopefully this makes sense to you.
Terms/Mapping:
timeline = the domain’s canonical, append-only log of signals. Effectively a global stream of commands and resulting domain events.
Topic = decision-making observer; in orthodox ES terms, closest to a policy handler / process manager, and sometimes saga-like.
Query = projection / read model.
Flow = an instance of a Topic or Query.
Route = the routing rule that decides which Flow instances should observe an event.
{FlowKey}-checkpoint = the checkpoint stream for a Flow instance, storing the serialized flow state plus checkpoint metadata such as position, error position/message, and done status.
{FlowKey}-routes = a per-flow linkTo stream which acts like the observer’s inbox.
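To make the naming concrete, here is a minimal sketch of how a flow key maps to its two streams. It is written in Python purely for illustration; the FlowKey class and its field names are hypothetical, and only the resulting stream names (such as ScanActivityQuery|ajohnston-routes) come from the real system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FlowKey:
    """Identifies one Flow: a Topic or Query name plus an optional instance key."""
    name: str                       # e.g. "ScanActivityQuery" or "ShiftScheduleTopic"
    instance: Optional[str] = None  # e.g. "ajohnston"; None for single-instance flows

    def __str__(self) -> str:
        # Multi-instance flows are written as "Name|instance".
        return self.name if self.instance is None else f"{self.name}|{self.instance}"

    @property
    def routes_stream(self) -> str:
        # The observer's inbox of links to timeline events.
        return f"{self}-routes"

    @property
    def checkpoint_stream(self) -> str:
        # Serialized state plus checkpoint metadata.
        return f"{self}-checkpoint"


print(FlowKey("ScanActivityQuery", "ajohnston").routes_stream)
print(FlowKey("ShiftScheduleTopic").checkpoint_stream)
```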
The ESDB JS projection named resume is specifically a coordination projection. It watches timeline plus *-checkpoint, fans events out into *-routes, tracks scheduled work, and outputs something like { checkpoint, routes, schedule }.
When I boot up my framework, I kick off a “resume” process for each Flow. I ask the resume projection which flows still have pending work, and then each of those flows resumes from its own {FlowKey}-routes stream after its checkpoint. So the normal steady-state recovery story is actually fine.
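As a rough sketch of that boot-time check (Python for illustration only): the top-level keys checkpoint, routes, and schedule match the projection output, but the inner shape is an assumption. Here each of checkpoint and routes is assumed to map a flow key to a timeline position, namely the last checkpointed position and the last routed position.

```python
from typing import Dict, List


def flows_with_pending_work(resume_state: Dict[str, Dict[str, int]]) -> List[str]:
    """Return flow keys whose last routed event lies beyond their checkpoint."""
    checkpoints = resume_state.get("checkpoint", {})
    routes = resume_state.get("routes", {})
    pending = []
    for flow_key, last_routed in routes.items():
        # A flow with no checkpoint yet has never processed anything.
        last_done = checkpoints.get(flow_key, -1)
        if last_routed > last_done:
            pending.append(flow_key)
    return sorted(pending)


# Hypothetical state: one flow is caught up, the other has unprocessed links.
state = {
    "checkpoint": {"ShiftScheduleTopic": 1914647, "ScanActivityQuery|ajohnston": 1914644},
    "routes": {"ShiftScheduleTopic": 1914647, "ScanActivityQuery|ajohnston": 1914648},
    "schedule": {},
}
print(flows_with_pending_work(state))
```

Only the flows returned here need to be resumed; everything else is already caught up.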
The expensive part is only when I have to reset/rebuild the resume projection itself, because that one has to re-read the full global timeline to reconstruct all route streams and pending checkpoints.
To answer your second question directly: here is a representative slice of the business workflow. A recent chain looked like this:
Position Type
#1914631 UpdateScanRegistry (<-- this is a command stored in the event timeline, which I believe is non-orthodox)
#1914632 ScanRegistered
#1914647 NewImageDetected
#1914666 ClientIdentified
#1914667 ScanCompleted
Routing:
Here is the same chain with actual routing behavior.
#1914631 UpdateScanRegistry
This is the external input. In ES terms, this is the command-ish ingress point from outside the bounded context. It contains 18 potential new images.
#1914632 ScanRegistered
UpdateScanRegistry was routed to the SmartScanner Topic, which emits one ScanRegistered per valid scan in the command.
#1914647 NewImageDetected
This one was for user ajohnston.
This event routed to three observers:
- ScanActivityQuery|ajohnston (multi instance)
- ShiftScheduleTopic (single instance)
- ClientClassifierTopic (single instance)
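The routing decision for this event type can be sketched as a plain function. This is an illustrative Python model, not the projection code: in the real system the decision is made inside the resume projection, which writes links into the route streams. The event field names type and user are hypothetical.

```python
from typing import Dict, List


def route(event: Dict[str, str]) -> List[str]:
    """Return the {FlowKey}-routes streams that should receive a link to this event."""
    targets: List[str] = []
    if event["type"] == "NewImageDetected":
        # Multi-instance Query: one flow per user.
        targets.append(f"ScanActivityQuery|{event['user']}-routes")
        # Single-instance Topics: one shared flow each.
        targets.append("ShiftScheduleTopic-routes")
        targets.append("ClientClassifierTopic-routes")
    return targets


print(route({"type": "NewImageDetected", "user": "ajohnston"}))
```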
For this one event, the downstream processing looks like this:
Stream Name: ScanActivityQuery|ajohnston-routes
The route stream contains links like:
-> #1914648 NewImageDetected
-> #1914647 NewImageDetected
-> #1914644 NewImageDetected
So that stream is basically the observer-specific inbox/history for that one projection instance.
Stream Name: ShiftScheduleTopic-routes
This is a process / policy branch, not a read model. It watches the sequence of scans across operators to infer when they take breaks and lunch. In code, if the gap between scans is large enough, it emits TimeOffTask events, which then get routed downstream to TimeOnTaskQuery|ajohnston.
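A minimal sketch of that gap rule, in Python for illustration. The 10-minute threshold, the class name, and the TimeOffTask fields are all assumptions made for the example; the real Topic may track more state and use different limits.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional

BREAK_THRESHOLD = timedelta(minutes=10)  # hypothetical cutoff for "large enough"


@dataclass
class TimeOffTask:
    user: str
    start: datetime  # time of the last scan before the gap
    end: datetime    # time of the first scan after the gap


class ShiftScheduleSketch:
    """Tracks the last scan time per operator and flags large gaps."""

    def __init__(self) -> None:
        self.last_scan: Dict[str, datetime] = {}

    def on_new_image_detected(self, user: str, at: datetime) -> Optional[TimeOffTask]:
        previous = self.last_scan.get(user)
        self.last_scan[user] = at
        # Emit only when there is an earlier scan and the gap reaches the threshold.
        if previous is not None and at - previous >= BREAK_THRESHOLD:
            return TimeOffTask(user, previous, at)
        return None


topic = ShiftScheduleSketch()
t0 = datetime(2024, 1, 1, 9, 0)
topic.on_new_image_detected("ajohnston", t0)
topic.on_new_image_detected("ajohnston", t0 + timedelta(minutes=2))
print(topic.on_new_image_detected("ajohnston", t0 + timedelta(minutes=47)))
```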
Checkpointing:
The checkpoint stream is the consumer offset + serialized state for a given Flow (e.g. ShiftScheduleTopic-checkpoint, ScanActivityQuery|{instance}-checkpoint).
The lifecycle is:
An event is appended to timeline.
The JS resume projection links that event to every relevant {FlowKey}-routes stream.
A Topic or Query on the client consumes its routed events.
After it successfully processes them, it writes a checkpoint event to {FlowKey}-checkpoint (maxCount = 1).
So the checkpoint is produced by the flow host after successful handling, not by the projection itself.
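The flow-host side of that lifecycle can be sketched as follows. This is an in-memory Python model, not the framework's code: the Checkpoint fields mirror the metadata listed earlier (position, error position/message, done status), the routed list stands in for the {FlowKey}-routes stream and is assumed to be in ascending position order, and returning a single Checkpoint stands in for maxCount = 1, where only the latest checkpoint is kept.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Checkpoint:
    state: Dict[str, int]
    position: int = -1
    error_position: Optional[int] = None
    error_message: Optional[str] = None
    done: bool = False


def resume_flow(
    routed: List[Tuple[int, dict]],
    checkpoint: Checkpoint,
    handle: Callable[[Dict[str, int], dict], Dict[str, int]],
) -> Checkpoint:
    """Process routed events after the checkpoint and return the new checkpoint."""
    for position, event in routed:
        if position <= checkpoint.position:
            continue  # already handled before the last checkpoint
        try:
            new_state = handle(checkpoint.state, event)
        except Exception as exc:
            # Record where and why handling stopped; the position is not advanced.
            return Checkpoint(checkpoint.state, checkpoint.position, position, str(exc))
        # Only a successful handle advances the checkpoint.
        checkpoint = Checkpoint(new_state, position)
    return checkpoint


def count_images(state: Dict[str, int], event: dict) -> Dict[str, int]:
    return {**state, "images": state.get("images", 0) + 1}


inbox = [
    (1914644, {"type": "NewImageDetected"}),
    (1914647, {"type": "NewImageDetected"}),
    (1914648, {"type": "NewImageDetected"}),
]
cp = resume_flow(inbox, Checkpoint(state={"images": 1}, position=1914644), count_images)
print(cp.position, cp.state)
```

Because the checkpoint only advances after a successful handle, a crash mid-batch means the flow re-reads from the last good position on the next boot.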
As a simple example:
#1914667 NewImageDetected gets linked into:
ClientImageList|{ClientKey}-routes
Then that observer runs independently.
For example, ClientImageList:
- reads the routed NewImageDetected
- updates its in-memory/query state
- then writes a checkpoint to ClientImageList|{ClientKey}-checkpoint
Let me know if more information would help here, as I don’t want to fill the page up with too much of my ramblings.
Where I’m at conceptually right now:
Your first question is the same one I’m asking myself: if this projection is only about restart/resume, then old completed work should not matter conceptually. Resetting the resume projection is expensive because it replays my entire global timeline, and that replay is the “one-time” cost of making the checkpointing work as described.
The thing holding me back from simply relying on checkpoints and scavenging timeline is that I need to retain the events for at least 6 months for auditing, historical analysis, etc. Disk space is not a constraint, so storing them is not a problem; it may be that I need to periodically migrate events out to a “cold storage” timeline. I just don’t know what other people normally do in this situation.