The ops playbook for teams running AI agents in production.
Most teams treat "production" as "it works on my laptop." Here is what real production operations look like for agent systems.
Token spend is just the start. Tool calls, retries, infrastructure, and failure handling all have real costs most teams never see.
What does incident response look like when the system that broke is an AI? The answer is not as different from SRE as you might think.
Shipping a new prompt version straight to 100% of traffic is the equivalent of deploying untested code to production with no rollback plan.
Every SRE principle (SLOs, error budgets, on-call rotations, runbooks) applies directly to AI agent systems. Here is how to map them.
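To make one of those mappings concrete, here is a minimal sketch of an error budget computed from a success-rate SLO for agent runs. The class name `ErrorBudget`, its fields, and the 99% target are illustrative assumptions, not taken from any particular tool or team.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Error budget for agent runs over one SLO window (hypothetical model)."""

    slo_target: float  # required success rate, e.g. 0.99 (assumed value)
    total_runs: int    # agent runs observed in the window
    failed_runs: int   # runs that failed or produced an unusable result

    def __post_init__(self) -> None:
        if not 0.0 < self.slo_target < 1.0:
            raise ValueError("slo_target must be strictly between 0 and 1")
        if self.total_runs < 0 or not 0 <= self.failed_runs <= self.total_runs:
            raise ValueError("failed_runs must be between 0 and total_runs")

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of runs the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_runs

    @property
    def remaining(self) -> float:
        # Negative once the window has burned more than its budget.
        return self.allowed_failures - self.failed_runs

    @property
    def exhausted(self) -> bool:
        return self.remaining < 0


# Example window: 1,000 runs against a 99% SLO, with 7 failures so far.
budget = ErrorBudget(slo_target=0.99, total_runs=1000, failed_runs=7)
print(f"allowed={budget.allowed_failures:.1f} remaining={budget.remaining:.1f}")
print("freeze prompt rollouts" if budget.exhausted else "budget available")
```

A team might use a check like `exhausted` as the gate for risky changes, such as pausing new prompt versions until the window recovers. What counts as a "failed run" is the hard part and depends on the agent.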