Architecture

Replay last month before you change next week’s prompt.

The most underrated property the event log buys you isn’t audit. It’s the ability to ask, of any prompt change, ‘what would this have done last month?’ — before a single user sees it.

It’s not just audit.

When people first hear the phrase ‘event log’, they think audit. Compliance. Forensics. Things you reach for after something has gone wrong. That isn’t wrong, but it badly undersells the move.

The point of the event log isn’t to know what happened. The point is to be able to ask, of any change you’re about to ship, what it would have done if it had been live a month ago.

Production history as the test suite.

If every command is typed, every command appends a typed row to the log in the same transaction, and every workflow is a recipe of those commands, then your production history is, mechanically, a test suite. You don’t synthesise inputs. You don’t imagine what your users might do. You replay last month’s logged commands against next week’s prompt and read the diff between the recorded outputs and the new ones.
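A minimal sketch of that mechanism, under stated assumptions. Every name here is hypothetical: the `ClassifyTicket` command, the `events` and `tickets` tables, and the two prompt functions, which are plain Python stand-ins for model calls. SQLite is used only to keep the example self-contained. The part that matters is that the state change and the log row commit in one transaction, and that replay reads the log rather than invented inputs.

```python
import json
import sqlite3
from dataclasses import asdict, dataclass
from typing import Callable


# Hypothetical typed command; a real system would have one type per command.
@dataclass(frozen=True)
class ClassifyTicket:
    ticket_id: str
    text: str


# Stand-in for "a prompt": any function from a command to an output string.
Prompt = Callable[[ClassifyTicket], str]


def open_log() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events "
        "(id INTEGER PRIMARY KEY, kind TEXT, payload TEXT, output TEXT)"
    )
    conn.execute("CREATE TABLE tickets (ticket_id TEXT PRIMARY KEY, label TEXT)")
    return conn


def run_command(conn: sqlite3.Connection, cmd: ClassifyTicket, prompt: Prompt) -> str:
    output = prompt(cmd)
    with conn:  # one transaction: the state change and the log row commit together
        conn.execute(
            "INSERT OR REPLACE INTO tickets VALUES (?, ?)", (cmd.ticket_id, output)
        )
        conn.execute(
            "INSERT INTO events (kind, payload, output) VALUES (?, ?, ?)",
            (type(cmd).__name__, json.dumps(asdict(cmd)), output),
        )
    return output


def replay(conn: sqlite3.Connection, candidate: Prompt) -> list[tuple[str, str, str]]:
    """Re-run every logged command against a candidate prompt.

    Returns (ticket_id, recorded_output, candidate_output) for each row
    where the candidate disagrees with what production recorded.
    """
    diffs = []
    rows = conn.execute(
        "SELECT payload, output FROM events WHERE kind = 'ClassifyTicket' ORDER BY id"
    )
    for payload, recorded in rows:
        cmd = ClassifyTicket(**json.loads(payload))
        new = candidate(cmd)
        if new != recorded:
            diffs.append((cmd.ticket_id, recorded, new))
    return diffs


# Illustrative stand-ins for "this week's prompt" and "next week's prompt".
def current_prompt(cmd: ClassifyTicket) -> str:
    return "billing" if "invoice" in cmd.text.lower() else "other"


def candidate_prompt(cmd: ClassifyTicket) -> str:
    text = cmd.text.lower()
    return "billing" if "invoice" in text or "refund" in text else "other"


conn = open_log()
for tid, text in [("t1", "Invoice is wrong"), ("t2", "Refund please"), ("t3", "App crashes")]:
    run_command(conn, ClassifyTicket(tid, text), current_prompt)

diff = replay(conn, candidate_prompt)
print(diff)  # [('t2', 'other', 'billing')]
```

Because `replay` only reads the log and calls the candidate, it never touches live state. That is what makes it safe to run before a single user sees the change.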

What this catches that nothing else does.

Replay catches the entire class of failures that ‘we tested it on examples we made up’ never does. The case where a prompt change made one type of customer better and another type silently worse. The case where a recipe started producing the right output and the wrong audit trail. The case where the model started agreeing with itself in a loop that no one wrote a test for, because no one knew to.
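The first of those cases can be made concrete by breaking the replay diff down by customer type instead of reading one aggregate number. The sketch below assumes each replayed row already carries a `segment` label and a score for the recorded and candidate outputs. How those scores are produced is out of scope here, and the `ReplayRow` type, the segment names, and the numbers are all made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


# Hypothetical shape of one replayed log row after scoring.
@dataclass(frozen=True)
class ReplayRow:
    segment: str
    old_score: float  # score of the output production actually recorded
    new_score: float  # score of the candidate prompt's output on the same input


def delta_by_segment(rows: list[ReplayRow]) -> dict[str, float]:
    """Mean score change per segment (positive means the candidate is better)."""
    deltas: dict[str, list[float]] = defaultdict(list)
    for r in rows:
        deltas[r.segment].append(r.new_score - r.old_score)
    return {seg: mean(vals) for seg, vals in deltas.items()}


def regressions(rows: list[ReplayRow], tolerance: float = 0.0) -> list[str]:
    """Segments whose mean score dropped by more than `tolerance`."""
    return sorted(
        seg for seg, d in delta_by_segment(rows).items() if d < -tolerance
    )


# Made-up data: the aggregate improves while one segment gets worse.
rows = [
    ReplayRow("enterprise", 0.5, 1.0),
    ReplayRow("enterprise", 0.5, 1.0),
    ReplayRow("enterprise", 0.5, 0.75),
    ReplayRow("self_serve", 1.0, 0.75),
]

overall = mean(r.new_score - r.old_score for r in rows)
print(overall)            # 0.25
print(regressions(rows))  # ['self_serve']
```

The aggregate says the change is an improvement. The per-segment view shows who paid for it.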

Evaluation is a property of the architecture.

This is what we mean when we say evaluation should be a property of the architecture, not a project. You don’t schedule an eval. The eval is what running the replay does.
