Anton, chapter 4: Plumbing matures

The week opens with one line of config and a quiet decision: the local inference path moves off Ollama onto vLLM. One environment variable changes, and the serving stack underneath changes with it. Real concurrency, proper batching, something that can actually take production load. Ollama got me through the first two weeks; vLLM is what I want sitting under a system that several people are going to lean on every day. The kind of swap that looks trivial in a diff and reshapes everything that runs on top of it.
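To make the one-variable swap concrete, here is a minimal sketch. It assumes both servers are reached through their OpenAI-compatible endpoints on their default ports. The variable name INFERENCE_BACKEND and the helper resolveInferenceBaseUrl are hypothetical, not Anton's actual config.

```typescript
// Hypothetical sketch: one environment variable picks the local inference backend.
// Assumes both servers expose an OpenAI-compatible API on their default ports.
type Backend = "ollama" | "vllm";

const BASE_URLS: Record<Backend, string> = {
  ollama: "http://localhost:11434/v1", // Ollama's default port
  vllm: "http://localhost:8000/v1", // default port of vLLM's OpenAI-compatible server
};

function isBackend(value: string): value is Backend {
  return value === "ollama" || value === "vllm";
}

// Takes the environment as a plain object so the sketch stays runtime-agnostic.
function resolveInferenceBaseUrl(env: Record<string, string | undefined>): string {
  const raw = (env["INFERENCE_BACKEND"] ?? "ollama").toLowerCase();
  if (!isBackend(raw)) {
    throw new Error(`Unknown INFERENCE_BACKEND: ${raw}`);
  }
  return BASE_URLS[raw];
}
```

Everything above this function keeps talking to "the inference URL"; only the value it resolves to changes.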

Memory rebuild

Then memory. The simple Postgres-rows-tagged-by-user store from week one is fine until it isn't, and around now it isn't. I rebuild it around proper retrieval semantics. pg_trgm scored matching instead of vector embeddings, because at personal-knowledge-base scale trigrams are enough and an embedding pipeline is a tax I don't want to pay. Provenance tags so every fact can be traced to its source. Domain-biased retrieval: a calendar query weights calendar facts, a media query weights media facts. An adaptive recall window, short for chit-chat and long for things that look like research. A working-memory scratchpad for the live conversation. An LLM pass that paraphrases the query before searching, because the way a question is asked is rarely the way the answer is stored. A few days later I add an episodic layer, summaries of past conversations, but only injected when the query has temporal intent. Always-on episodic memory is context bloat dressed up as helpfulness. The principle that crystallises out of all this is one I keep coming back to: filter at load time, not write time. Store generously, retrieve narrowly.
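The scoring and domain bias can be sketched in a few lines. This is an illustration only: the trigram logic follows the idea behind pg_trgm (padded word trigrams, shared over total), but in a real setup the scoring would run inside Postgres. The Fact shape, the retrieve function, and the 1.5 boost are all assumptions made for the example.

```typescript
// Sketch of trigram-scored, domain-biased retrieval (all names and weights hypothetical).
interface Fact {
  text: string;
  domain: string; // e.g. "calendar" or "media"
  source: string; // provenance tag: where the fact came from
}

// Build the set of trigrams for a string: each word is lowercased and padded
// with two leading spaces and one trailing space before slicing.
function trigrams(input: string): Set<string> {
  const out = new Set<string>();
  const words = input.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 0);
  for (const word of words) {
    const padded = `  ${word} `;
    for (let i = 0; i + 3 <= padded.length; i++) {
      out.add(padded.slice(i, i + 3));
    }
  }
  return out;
}

// Similarity = shared trigrams / distinct trigrams across both strings.
function similarity(a: string, b: string): number {
  const ta = trigrams(a);
  const tb = trigrams(b);
  if (ta.size === 0 || tb.size === 0) return 0;
  let shared = 0;
  ta.forEach((t) => {
    if (tb.has(t)) shared++;
  });
  return shared / (ta.size + tb.size - shared);
}

// Facts from the query's own domain get their score multiplied by domainBoost.
function retrieve(
  query: string,
  queryDomain: string,
  facts: Fact[],
  limit: number,
  domainBoost = 1.5,
): Fact[] {
  return facts
    .map((fact) => ({
      fact,
      score: similarity(query, fact.text) * (fact.domain === queryDomain ? domainBoost : 1),
    }))
    .filter((scored) => scored.score > 0)
    .sort((x, y) => y.score - x.score)
    .slice(0, limit)
    .map((scored) => scored.fact);
}
```

Nothing is filtered when facts are written. The narrowing happens entirely in retrieve, which is the "filter at load time" principle in miniature.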

Schedules and codenames

Schedules go from a thin BullMQ wrapper to an actual domain. Timezone support, daily-duplicate merging, an execution log, on-demand run, silent mode, a UI that doesn't make me wince. Scheduling deserves to be a first-class citizen because half of what I want Anton to do is recurring: the morning briefing, the weekly review, the reminder to call my godfather. In the same commit LiteLLM lands as the gateway in front of every model provider, because the schedule plus agent combination needs one place to route LLM calls, not many. With LiteLLM comes the codename roster: sunny, haiku, oscar, gandalf, gizmo, gatsby, merlin, gustav. Names instead of provider IDs. It feels mildly silly the first time I type "ask gandalf" in a config and it sticks immediately. Models get swapped, deprecated, repriced; the codename is stable. The indirection costs nothing and pays back every time a provider changes something.
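The codename indirection amounts to a lookup table. The sketch below is hypothetical throughout: the provider strings are placeholders rather than the real models behind the roster, and in practice the mapping would live in the gateway's own configuration, not in application code.

```typescript
// Hypothetical codename registry. The provider/model strings are placeholders,
// not the actual mapping behind the roster.
const CODENAMES: Record<string, string> = {
  gandalf: "provider-a/large-model-v2",
  haiku: "provider-b/small-model-v1",
};

// Callers only ever mention the codename; the registry decides what it points at.
function resolveModel(
  codename: string,
  registry: Record<string, string> = CODENAMES,
): string {
  const target = registry[codename];
  if (target === undefined) {
    throw new Error(`Unknown codename: ${codename}`);
  }
  return target;
}
```

When a provider deprecates or reprices a model, the change is one line in the registry. Every config that says "ask gandalf" stays as it is.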

A naming sin from last week catches up with me. I'd called two different things "skills": a typed function in code and a reusable prompt template in the database. Two commits clean it up. First, the database-side "skills" become "saved prompts" everywhere. Then I drop the "saved" prefix because it adds nothing. Now skill means a typed function with a runtime contract, and prompt means a template stored in the database. The cost of the cleanup is a couple of hours; the cost of letting the collision live another month would have been far worse. Naming is one of those things where the bill compounds.

Skill contracts

The runtime contract for skills lands the same day as defineSkill(). Every skill declares its inputs, outputs, scopes, and handler in one shape, and the skill-runner becomes its own service: a separate container hosting skills as individually deployable units, callable over HTTP. I split it out for four reasons. Hot reload, so a single skill updates without restarting the world. Per-skill metrics: invocation counts, error rates, p50 and p95. Scope isolation, because a leaf skill should not need the parent agent's whole context surface to do one job. And the option of sandboxing later, which is much cheaper to add to a service that's already separate than to one tangled into the worker. A day later every domain unifies onto callSkill() and createSkillTool(). One contract, one entry point, one way to add a new capability.
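A contract of this kind might look like the sketch below. It is an assumption-heavy illustration: the field names, the in-process registry (standing in for the separate skill-runner service), the synchronous handler, and the example skill are all invented here, and the real defineSkill() and callSkill() may differ.

```typescript
// Hypothetical shape for a skill contract; the real defineSkill() may differ.
// The in-process registry stands in for the separate skill-runner service.
interface SkillDefinition<I, O> {
  name: string;
  scopes: string[]; // permissions the skill needs to run
  parseInput: (raw: unknown) => I; // validates raw input, throws if malformed
  handler: (input: I) => O; // the actual work (synchronous in this sketch)
}

const skillRegistry = new Map<string, SkillDefinition<any, any>>();

function defineSkill<I, O>(def: SkillDefinition<I, O>): SkillDefinition<I, O> {
  if (skillRegistry.has(def.name)) {
    throw new Error(`Duplicate skill: ${def.name}`);
  }
  skillRegistry.set(def.name, def);
  return def;
}

// Single entry point: look up, check scopes, validate input, run the handler.
function callSkill(name: string, rawInput: unknown, grantedScopes: string[]): unknown {
  const skill = skillRegistry.get(name);
  if (!skill) throw new Error(`Unknown skill: ${name}`);
  const missing = skill.scopes.filter((s) => grantedScopes.indexOf(s) === -1);
  if (missing.length > 0) {
    throw new Error(`Missing scopes: ${missing.join(", ")}`);
  }
  return skill.handler(skill.parseInput(rawInput));
}

// Example skill with an invented name and scope.
defineSkill<{ city: string }, { summary: string }>({
  name: "weather.today",
  scopes: ["weather:read"],
  parseInput: (raw) => {
    const city = (raw as { city?: unknown } | null)?.city;
    if (typeof city !== "string" || city.length === 0) {
      throw new Error("city must be a non-empty string");
    }
    return { city };
  },
  handler: ({ city }) => ({ summary: `Forecast requested for ${city}` }),
});
```

Because the caller passes only the scopes it holds, a leaf skill never sees the parent agent's wider context. That is the scope-isolation argument expressed as a function signature.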

Traces and permissions

Traces become first-class. The trace viewer from week one was reading checkpoints out of Redis, which is fine for live debugging and useless for anything historical. I move execution traces into Postgres: every agent run is a row in agent_traces, queryable from the UI, surviving failures and retries. The next day a small follow-up locks down the invariant: one trace per request, no matter what. Now I can ask questions of the system's own behaviour. What happened in this run, last night, last week. The dev loop tightens again.
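The one-trace-per-request invariant can be illustrated with an in-memory stand-in for the table. This is a sketch under assumptions: the column names and the TraceStore class are hypothetical, and in Postgres the same guarantee would more likely come from a unique constraint on the request ID combined with an upsert.

```typescript
// In-memory stand-in for an agent_traces table keyed by request ID.
// Field names are hypothetical.
interface TraceRow {
  requestId: string;
  agent: string;
  status: "running" | "succeeded" | "failed";
  attempts: number;
  steps: string[];
}

class TraceStore {
  private rows = new Map<string, TraceRow>();

  // Called at the start of every attempt; a retry reuses the existing row.
  begin(requestId: string, agent: string): TraceRow {
    const existing = this.rows.get(requestId);
    if (existing) {
      existing.attempts += 1;
      existing.status = "running";
      return existing;
    }
    const row: TraceRow = { requestId, agent, status: "running", attempts: 1, steps: [] };
    this.rows.set(requestId, row);
    return row;
  }

  // Append a step to the request's trace.
  record(requestId: string, step: string): void {
    const row = this.rows.get(requestId);
    if (!row) throw new Error(`No trace for request ${requestId}`);
    row.steps.push(step);
  }

  // Mark the final outcome; the row survives either way.
  finish(requestId: string, status: "succeeded" | "failed"): void {
    const row = this.rows.get(requestId);
    if (!row) throw new Error(`No trace for request ${requestId}`);
    row.status = status;
  }

  count(): number {
    return this.rows.size;
  }
}
```

A failed attempt followed by a retry leaves one row with attempts set to 2, so the history of a request reads as a single story rather than a scatter of partial traces.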

Permission filtering moves into runAgent() itself, which closes a class of bugs I keep almost shipping: a scheduled job inheriting wildcard permissions because the filter lived in the wrong place. Every caller (parent agent, scheduled job, mesh probe, the /invoke endpoint) gets the same filter applied at the same point. The right place to enforce a rule is the place every path has to go through.
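A minimal sketch of the chokepoint, with invented types: the Tool and Caller shapes and the scope strings are assumptions, and the real runAgent() does far more than return tool names. The point is only that the filter sits inside the function every path calls.

```typescript
// Hypothetical types: each tool names the scope it requires, each caller
// carries the scopes it was explicitly granted.
interface Tool {
  name: string;
  requiredScope: string;
}

interface Caller {
  kind: "parent" | "schedule" | "probe" | "invoke";
  scopes: string[];
}

// No wildcard handling on purpose: a caller gets a tool only if it holds
// that tool's scope explicitly.
function filterTools(tools: Tool[], caller: Caller): Tool[] {
  return tools.filter((tool) => caller.scopes.indexOf(tool.requiredScope) !== -1);
}

// Every entry path goes through runAgent, so the filter cannot be skipped.
// The real agent loop is elided; this sketch returns the tools the model
// would be allowed to see.
function runAgent(agentTools: Tool[], caller: Caller): string[] {
  const allowed = filterTools(agentTools, caller);
  return allowed.map((tool) => tool.name);
}
```

A scheduled job created with only a calendar scope then sees only calendar tools, regardless of what the agent it targets has attached.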

The Invoke tab grows a VueFlow graph that draws the live agent architecture from the runtime configs. It is the first time I can look at Anton and see his shape, not just his logs. Agents on the canvas, tools as edges, the whole thing redrawing as I add or remove a domain. The visualisation pushes a mental model into my head before it's anywhere in the code: agents are the organising principle. Everything else hangs off them.

Then a refactor I've been wanting for a while. The parent agent had 63 tools attached to it and had started making the kind of mistakes you make when there's too much choice on the table. I restructure it into 10 subsystem delegates, each a small agent of its own. The parent's job becomes routing and synthesis, not calling everything directly. Roughly six times fewer tools at the top level, and the answers get sharper immediately. It's the same lesson as last weekend's calendar saga, scaled up: the LLM is good at picking the right thing from a small menu and bad at picking the right thing from a long one. Give it a small menu.
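The restructuring can be pictured as a grouping step. The sketch assumes, purely for illustration, that tool names carry their subsystem as a dotted prefix such as "calendar.create"; the Delegate shape and buildDelegates are hypothetical.

```typescript
// Hypothetical sketch: group a flat tool list into subsystem delegates.
// Assumes tool names are prefixed with their subsystem, e.g. "calendar.create".
interface Delegate {
  name: string; // what the parent agent sees on its menu
  tools: string[]; // what only the delegate sees
}

function buildDelegates(toolNames: string[]): Delegate[] {
  const bySubsystem = new Map<string, string[]>();
  for (const tool of toolNames) {
    const dot = tool.indexOf(".");
    const subsystem = dot === -1 ? tool : tool.slice(0, dot);
    const bucket = bySubsystem.get(subsystem) ?? [];
    bucket.push(tool);
    bySubsystem.set(subsystem, bucket);
  }
  const delegates: Delegate[] = [];
  bySubsystem.forEach((tools, name) => delegates.push({ name, tools }));
  return delegates;
}
```

The parent's menu is the list of delegate names, and each delegate picks from its own short tool list.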

Output validation lands in the agent loop, with intermediate messages and deterministic tool confirmations. It kills an entire family of "tool succeeded, response is empty" bugs that used to surface as a silent Anton, which is the worst kind of failure mode in a chat interface. Then prime directives: a small set of immutable rules the agent loop enforces above any individual prompt. The first version is verbose; I rewrite each one to a single line. Directives sit at the top of the prompt-precedence stack: directives, agent prompt, prompt template, user message. The non-negotiable rules live in code; everything else is editable.
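The precedence stack reads naturally as an assembly function. The sketch below assumes a chat-style message list and invents the directive texts and all names; it shows only the ordering described above, with the directives held in code and the remaining layers passed in as data.

```typescript
// Hypothetical directive texts; the point is that they live in code.
const PRIME_DIRECTIVES: string[] = [
  "Never claim a tool ran if it did not.",
  "Never reveal stored credentials.",
];

interface PromptLayers {
  agentPrompt: string; // editable, per agent
  promptTemplate?: string; // optional reusable template
  userMessage: string;
}

interface Message {
  role: "system" | "user";
  content: string;
}

// Order is fixed: directives, agent prompt, prompt template, user message.
function buildMessages(
  layers: PromptLayers,
  directives: string[] = PRIME_DIRECTIVES,
): Message[] {
  const numbered = directives.map((d, i) => `${i + 1}. ${d}`).join("\n");
  const system = [numbered, layers.agentPrompt, layers.promptTemplate ?? ""]
    .filter((part) => part.length > 0)
    .join("\n\n");
  return [
    { role: "system", content: system },
    { role: "user", content: layers.userMessage },
  ];
}
```

Nothing in the editable layers can reorder the stack, because the order is a property of the function and not of any prompt.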

Prompts as data

Late in the week, the move that ties the rest together. All agent prompts go into the database. No hardcoded fallbacks. Editable from the UI, versioned, one row per agent. Anton's behaviour stops being something I deploy and starts being something I configure. Want a different tone for the calendar agent? Edit the row. Want to test a new prompt for research? Save a version, run the suite, keep it or roll back. The principle is the same one as memory: keep behaviour as data, not code, and you keep the option to change your mind cheaply.
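The save, test, keep-or-roll-back loop can be sketched with an in-memory store. This is an illustration under assumptions: the PromptStore class and its version numbering are invented, and the real storage is a database table rather than a Map.

```typescript
// In-memory stand-in for database-backed, versioned agent prompts.
// Class and method names are hypothetical.
interface PromptVersion {
  version: number;
  text: string;
}

class PromptStore {
  private history = new Map<string, PromptVersion[]>();
  private active = new Map<string, number>();

  // Saving creates a new version and makes it the active one.
  save(agent: string, text: string): number {
    const versions = this.history.get(agent) ?? [];
    const version = versions.length + 1;
    versions.push({ version, text });
    this.history.set(agent, versions);
    this.active.set(agent, version);
    return version;
  }

  // No hardcoded fallback: a missing prompt is an error, not a default.
  get(agent: string): string {
    const version = this.active.get(agent);
    const versions = this.history.get(agent);
    if (version === undefined || versions === undefined) {
      throw new Error(`No prompt stored for agent: ${agent}`);
    }
    return versions[version - 1].text;
  }

  // Rolling back only moves the active pointer; history is kept.
  rollback(agent: string, version: number): void {
    const versions = this.history.get(agent) ?? [];
    if (version < 1 || version > versions.length) {
      throw new Error(`Unknown version ${version} for agent: ${agent}`);
    }
    this.active.set(agent, version);
  }
}
```

Trying a new tone for one agent becomes save, run the suite, then either leave the pointer where it is or call rollback.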

By Sunday night Anton is a different shape than he was on Monday. Inference is on vLLM. Models route through LiteLLM under codenames I picked in an evening. Memory has retrieval that respects domain and intent. Schedules are a real domain. Skills are a typed contract running in their own service with hot reload and metrics. Traces are queryable rows. Permissions are enforced at one chokepoint. Prompts live in the database. The parent has ten delegates instead of sixty-three tools. The week's lesson, the one I'm taking forward: the work that pays the most is the work that turns implicit conventions into explicit contracts. Once a thing has a shape on disk and an entry point in code, you can change anything around it without fear. Plumbing is invisible until it isn't, and this week was almost entirely plumbing.