
πŸ€– The AI SaaS Playbook (Practical Edition)πŸ“˜

Companion to πŸš€ The SaaS Template Playbook πŸ“–. That file covers everything every SaaS needs. This file covers what changes β€” and what's new β€” when AI is core to the product.

Practical-first. Code snippets, decision tables, real defaults, no buzzwords. If a section doesn't help you ship next week, it doesn't belong here.


πŸ“‹ Table of Contents

  1. ⚑ The Shift in 60 Seconds
  2. 🎯 Pick One: AI-Native vs AI-Augmented
  3. πŸ—οΈ Reference Architecture
  4. πŸ€– Agents as First-Class Actors
  5. πŸ”Œ The LLM Gateway (Provider Abstraction)
  6. πŸ“ Prompts as Code
  7. πŸ› οΈ Tools, Function Calling & MCP
  8. 🧠 Memory & RAG (the practical version)
  9. πŸ“ Structured Outputs
  10. πŸ’§ Streaming UX
  11. πŸ’΅ Cost Control, Budgets & Model Routing
  12. 🧾 Outcome-Based & Metered Pricing β€” the implementation
  13. βœ… Evals β€” how to actually test agents
  14. πŸ”­ Observability for Agents
  15. ⚑ Caching (Prompt + Semantic)
  16. πŸ›‘οΈ Safety, Abuse & PII
  17. πŸ™‹ Human-in-the-Loop & Autonomy Levels
  18. ⏳ Long-Running Agent Jobs
  19. 🏒 AI-Specific Multi-Tenancy Concerns
  20. πŸ—ΊοΈ The 10-Phase Build Plan
  21. ⚠️ Pitfalls
  22. πŸ“‹ Cheat Sheet

1. ⚑ The Shift in 60 Seconds

What practically changes when AI becomes core:

| Dimension | Classic SaaS | AI SaaS |
| --- | --- | --- |
| Primary actor | Human user clicking UI | Agent making LLM calls + tool calls |
| Pricing | Per-seat / per-feature | Per-outcome / per-token / credit-based |
| Latency budget | < 500 ms p95 | Streaming partials in < 1 s; full response variable |
| Cost driver | Compute + DB | Token spend (often > infra cost) |
| Failure mode | 5xx, 4xx | "Wrong answer," hallucination, prompt injection |
| Testing | Unit + integration + E2E | + evals against ground-truth datasets |
| Observability | Logs + traces + errors | + prompt/response capture, replay, scoring |
| Auth boundary | User | + agent identity, scoped tokens, tool permissions |
| Audit | "Who did X" | + "Which prompt + model + tools produced X" |

The single biggest practical change: your largest variable cost is now tokens, not servers. Every architectural decision in this playbook is downstream of that fact.


2. 🎯 Pick One: AI-Native vs AI-Augmented

These are different products. Don't try to be both.

| Type | Definition | Examples | Pricing |
| --- | --- | --- | --- |
| AI-Native | Product is the AI. Without the model, there's nothing. | Cursor, Perplexity, ElevenLabs, Lovable | Usage / credit-based |
| AI-Augmented | Existing SaaS surface where AI is one feature among many. | Notion AI, Linear AI, Slack AI | Add-on or premium tier |

Decisions that flip:

| Question | AI-Native | AI-Augmented |
| --- | --- | --- |
| Where does AI failure show? | Whole product fails | Feature degrades; rest works |
| Eval coverage | Mandatory before launch | Per-feature; ship incrementally |
| Cost model | Pass-through with margin | Bundle into plan + soft caps |
| BYO API key | Often supported | Rare |
| Model picker | Often user-visible | Hidden behind feature |

For the rest of this playbook, patterns work for both β€” but if you're AI-native, treat Β§11 (cost), Β§13 (evals), and Β§16 (safety) as launch blockers, not nice-to-haves.


2.1. πŸšͺ Two Starting Points: Greenfield vs Retrofit

The rest of this playbook describes the patterns. This section is about the sequence β€” what you build first depends on whether you're starting clean or layering AI onto a product that already has paying customers. Both paths converge on the same target architecture (Β§3); they differ in what you build first and what you can defer.

🌱 Greenfield: building a new AI SaaS

You have no legacy code, no existing tenants, no in-flight migrations. The temptation is to build Β§3 in parallel. Don't β€” primitives have an order.

  • Decide AI-Native vs AI-Augmented (Β§2) before anything else. It changes pricing, eval scope, and whether AI failure breaks the product. Skipping the decision is how products end up neither.
  • Build the Gateway (Β§5) in week one β€” even if it wraps a single provider with a single model. Every primitive in this playbook assumes calls flow through one chokepoint. Adding it first is ~300 lines; adding it later is a refactor across every feature.
  • Model aliases (smart / fast / reasoning) from day one. Never let raw provider model IDs leak into business code, even in the prototype. Model deprecations are constant.
  • One feature deep before going wide. Take your most differentiated AI surface end-to-end through Gateway β†’ prompts-as-code β†’ trace β†’ eval β†’ cost cap before starting a second. Five shallow surfaces produce five things you can't trust.
  • Cost caps in Phase 1, not Phase 6. Trivial to add when there's no usage; painful when real customers depend on the limits.
  • Evals from day one β€” even with five examples. The muscle matters more than the coverage. Teams that defer evals never start them.
  • Defer until you have evidence: agent runtime (Β§4), MCP servers (Β§7.4), semantic cache (Β§15.2), credit ledger (Β§12.2), outcome-based billing (Β§12.5). Real patterns, but most products ship without them for the first six months.

The shortest viable path: Β§20 phases 1, 2, 5, 6, 8 in the first two weeks. Add the rest when a feature actually demands them.

πŸ”§ Retrofit: adding AI to an existing SaaS

You already have auth, tenancy, billing, audit, and an observability stack. Most of Β§3 exists in non-AI form β€” you're adding the AI primitives, not rebuilding the platform. The risk isn't under-building; it's over-building and destabilizing what already works.

  • Pick the smallest user-visible AI surface first. "Summarize this," "draft a reply," "classify this ticket." Not "rebuild our core flow as an agent." Small surfaces are reversible.
  • Gateway as sidecar, not refactor. Land pkg/llm/ (or a new service) alongside the existing code, behind a feature flag. Don't touch parts of the codebase the AI feature doesn't need.
  • Reuse, don't replace, the boring infrastructure. Existing tenancy, RBAC, billing, audit, and rate-limit middleware should wrap AI calls the same way they wrap any other request. Re-implementing them "AI-aware" is how you introduce inconsistencies that take 18 months to find.
  • Minimum new tables: llm_trace + llm_call_log. Defer agent, agent_run, credit_ledger, pending_action, semantic_cache until a feature actually needs them.
  • Cost cap on day one, even if the feature is free. A workspace-level token ceiling protects you from runaway loops in the prototype. Easier now than after a $10k week.
  • Capture traces before you build evals. Every AI call writes to llm_trace from the first deploy. By the time feature two ships, you have real production examples to seed an eval set β€” no synthetic data needed.
  • Update support and ops workflows before launch. CS needs read access to llm_trace before the first "the AI said something weird" ticket. Oncall needs the cost dashboard before the first runaway-bill alert.
  • Two common traps: AI-ifying too many surfaces at once (ship one well, then expand), and treating AI as a pure-engineering project (pricing, support, and legal need to ship alongside the feature).

The shortest viable path: Β§20 phases 1, 5, 6, 8 β€” Gateway, streaming UX on one surface, cost caps, trace capture. Skip prompts-as-code and evals until you have a second prompt to compare against; the first one is just learning.


3. πŸ—οΈ Reference Architecture

[Client]
   │  prompt + context
   ▼
[App API]  ───►  [LLM Gateway]  ───►  [LLM provider(s)]
   │                  │
   │             prompt cache │ semantic cache
   │             rate limit   │ fallback
   │             cost meter   │ provider routing
   ▼
[Tool registry] ◄────┐
   │                 │
   ▼                 │ tool calls
[App services / DB / external APIs]
   │
   ├──► [Vector DB] ──── embeddings worker
   ├──► [Eval store]
   └──► [Trace store] ── prompt+response capture

The LLM Gateway is the keystone. Every model call goes through it β€” no direct SDK calls scattered through your codebase. It's where you implement caching, cost metering, fallback, and provider abstraction.

You can build it in ~300 lines (see Β§5) or use one off the shelf:

Option When to use
Build it (300–800 LoC) You want full control, native to your stack
LiteLLM (Python, OSS) You want OpenAI-compatible proxy across 100+ providers, fast
Portkey / Helicone / OpenRouter You want managed gateway with dashboards
Vercel AI SDK You're TS-only and want streaming primitives

Recommendation: build a thin one if you're Go-native (pkg/llm/), use LiteLLM if you're Python-heavy.


4. πŸ€– Agents as First-Class Actors

If your platform deploys agents (autonomous or user-launched), treat them like users in your data model. The Multica deep-dive captures the canonical pattern β€” polymorphic actor fields.

4.1 Schema

-- Every "who did this" column gets a type + id pair
CREATE TABLE comment (
  id UUID PRIMARY KEY,
  workspace_id UUID NOT NULL,
  author_type TEXT NOT NULL CHECK (author_type IN ('user','agent','system','api_key')),
  author_id   UUID NOT NULL,
  content TEXT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE agent (
  id UUID PRIMARY KEY,
  workspace_id UUID NOT NULL,
  name TEXT NOT NULL,
  model TEXT NOT NULL,           -- "claude-sonnet-4-6", "gpt-5", ...
  system_prompt TEXT,
  tool_allowlist TEXT[],          -- which tools it can call
  daily_token_budget BIGINT,
  created_by UUID NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

4.2 Agent tokens (auth)

Agents authenticate with their own short-lived tokens, not the user's session.

// When a user kicks off an agent run:
agentToken := signJWT(jwt.Claims{
    Subject:    agent.ID,
    Issuer:     "your-app",
    Audience:   []string{"agent-runtime"},
    ExpiresAt:  time.Now().Add(2 * time.Hour),
    NotBefore:  time.Now(),
    CustomClaims: map[string]any{
        "workspace_id": workspaceID,
        "actor_type":   "agent",
        "kicked_off_by": userID,
        "tool_scope":   agent.ToolAllowlist,
    },
})

Why short-lived: an agent token is a bearer credential running on someone's machine. Soon after the run finishes, that token should be useless, not valid for days.

4.3 Audit log

Every audit row records both the agent and the human who kicked it off:

audit_log:
  actor_type = "agent"
  actor_id = <agent_uuid>
  on_behalf_of_user_id = <user_uuid>   -- the human who launched this run
  action = "issue.update"
  metadata = { model: "...", run_id: "...", trace_id: "..." }

This is what makes "the AI did X to my data" auditable months later.

4.4 Build vs. use an agent framework

Sooner or later you'll ask whether to write the agent loop yourself or pull in a framework. Decide on the criteria, not the feature list β€” frameworks rebrand quarterly.

Three real questions:

  1. Are you prototyping or productionizing? Frameworks excel at the first 80% (loop, tool calls, retries, basic memory). The last 20% β€” tenant-scoped budgets, cancellation, audit logs, replay, your domain's exact tool semantics β€” is where most teams hit framework walls and rip them out.
  2. How vendor-locked are you willing to be? Every framework has an opinion (OpenAI's Responses API, LangChain's runnables, Google's Vertex contract). Once your prompts and tools are shaped by that opinion, switching costs are real.
  3. What language is your backend? Most agent frameworks are Python-first. If you're a Go/TS shop, the calculus changes β€” a thin custom orchestrator on top of the LLM Gateway (Β§5) is often less code than a Python sidecar.

The landscape (as of 2026 β€” verify before adopting; this space churns):

| Framework | Language | Sweet spot | When to skip |
| --- | --- | --- | --- |
| OpenAI Agents SDK | Python (TS preview) | You're OpenAI-first, want handoffs/guardrails baked in, and the Responses API model fits your shape. | You need provider-agnostic routing or strict structured outputs from non-OpenAI models. |
| LangGraph (LangChain) | Python, TS | Stateful, graph-shaped agent flows with explicit nodes + checkpoints. Good for "agent that pauses for human approval, resumes later." | Simple linear tool-loop agents; LangGraph is overkill and the LangChain abstractions leak. |
| CrewAI | Python | Multi-agent role-play scenarios ("researcher hands to writer hands to editor"). Easy to demo. | Production single-agent workflows; its abstractions optimize for the demo, not the long tail. |
| Google ADK / Vertex AI Agent Builder | Python (Java/Go SDKs) | You're already on GCP, want managed deployment + Gemini-native, and need enterprise IAM/audit out of the box. | You're not on GCP; lock-in is high. |
| Pydantic AI | Python | Type-first, FastAPI-style ergonomics, model-agnostic. Closest thing to "if I'd written it myself." | TS/Go shops. |
| Mastra | TypeScript | First-class TS agent framework with workflows, evals, and memory baked in. | Python-only shops; smaller ecosystem than LangChain/LangGraph. |
| Vercel AI SDK | TypeScript | Streaming-first UX primitives (useChat, streamText) for Next.js apps. Not really an "agent framework"; it's the rendering layer. | Backend agent orchestration. |
| Custom on top of the LLM Gateway | Any | You have an opinion about tool shape, memory, budgeting, and want to own them. ~500-1500 LoC. | Greenfield prototyping where time-to-first-demo matters more than the final architecture. |

Template recommendation: start with a custom orchestrator on top of pkg/llm/ (Β§5) β€” the agent loop is ~200 lines of Go and gives you exact control over multi-tenancy, budgets, and audit. Reach for a framework only when you hit a specific pattern it solves better (LangGraph for graph-shaped pause/resume flows, OpenAI Agents SDK if you've fully committed to Responses API + handoffs).

Whatever you pick, the framework is an implementation detail of the worker β€” your API surface, DB schema (Β§4.1), audit log (Β§4.3), and observability (Β§14) stay framework-agnostic. Swapping LangGraph for OpenAI Agents SDK should be a worker-side rewrite, not a platform rewrite.


5. πŸ”Œ The LLM Gateway (Provider Abstraction)

5.1 The interface (Go)

package llm

type ChatRequest struct {
    Messages    []Message
    Model       string         // "claude-sonnet-4-6", "gpt-5", "gemini-2-pro", "auto"
    Tools       []Tool
    Stream      bool
    JSONSchema  json.RawMessage // for structured outputs
    MaxTokens   int
    Temperature float64
    
    // Tracking
    WorkspaceID string
    UserID      string
    Feature     string  // e.g. "summarize", "agent.codegen"
    IdemKey     string
}

type ChatResponse struct {
    ID       string
    Model    string
    Choices  []Choice
    Usage    TokenUsage
    Provider string
    Cached   bool
    DurationMs int64
}

type Gateway interface {
    Chat(ctx context.Context, req ChatRequest) (ChatResponse, error)
    ChatStream(ctx context.Context, req ChatRequest) (<-chan StreamEvent, error)
    Embed(ctx context.Context, model string, texts []string) ([][]float32, error)
}

5.2 What goes inside Chat() β€” the layered pipeline

1. Validate + normalize (model alias resolution)
2. Check budget        ─► reject if over cap
3. Check prompt cache  ─► return cached response if hit
4. Check semantic cache─► return semantic match if cosine > 0.97
5. Pick provider       ─► routing rules (model name β†’ provider)
6. Call provider with timeout + retry
7. On failure: fallback to secondary provider
8. Capture trace       ─► async write to trace store
9. Meter usage         ─► async increment in Redis + Stripe
10. Return response
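
A minimal Go sketch of that pipeline, as a fragment of the gateway implementation built on the §5.1 types. The helper names (resolveAlias, checkBudget, route, promptCache, traces, meter) are illustrative, not a library API, and the semantic-cache step is elided:

func (g *gateway) Chat(ctx context.Context, req ChatRequest) (ChatResponse, error) {
    req.Model = g.resolveAlias(req.Model) // 1. "smart" -> concrete model ID

    if err := g.checkBudget(ctx, req.WorkspaceID); err != nil { // 2. reject if over cap
        return ChatResponse{}, err
    }
    if resp, ok := g.promptCache.Get(req.IdemKey); ok { // 3. exact-match cache
        resp.Cached = true
        return resp, nil
    }
    primary, fallback := g.route(req.Model) // 5. routing rules -> provider clients

    callCtx, cancel := context.WithTimeout(ctx, 60*time.Second)
    defer cancel()
    resp, err := primary.Chat(callCtx, req) // 6. call provider with timeout
    if err != nil && fallback != nil {
        resp, err = fallback.Chat(callCtx, req) // 7. exactly one fallback
    }
    if err != nil {
        return ChatResponse{}, err
    }
    go g.traces.Write(req, resp)                // 8. async trace capture
    go g.meter.Add(req.WorkspaceID, resp.Usage) // 9. async metering
    return resp, nil
}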

5.3 Provider routing

# llm-routing.yaml
models:
  fast:
    primary: { provider: anthropic, model: claude-haiku-4-5 }
    fallback: { provider: openai, model: gpt-5-mini }
  smart:
    primary: { provider: anthropic, model: claude-sonnet-4-6 }
    fallback: { provider: openai, model: gpt-5 }
  reasoning:
    primary: { provider: anthropic, model: claude-opus-4-7 }
    fallback: { provider: openai, model: o3 }
  cheap:
    primary: { provider: google, model: gemini-2-flash }

Code calls gateway.Chat(ctx, ChatRequest{Model: "smart", ...}). The gateway resolves the alias to the actual model. Never hardcode a provider's exact model name in business logic; you'll regret it the day prices change or a model is deprecated.

5.4 Fallback rules

  • Fall back on timeout / 5xx / rate limit β€” not on bad output (that's an eval problem).
  • Cap retries at 1 fallback to avoid stacking latency.
  • Log every fallback as a metric (llm.fallback.count) so you can detect provider issues.

5.5 Idempotency for LLM calls

Two LLM calls with identical input shouldn't get charged twice. Hash (workspaceID, model, messages, tools, jsonSchema) β†’ cache key. TTL 24h. Saves real money during retries and frontend double-clicks.
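
A sketch of that hash on top of the §5.1 request type (the function name is illustrative). Everything that can change the output goes into the key, so retries and double-clicks collapse onto one cached response:

import (
    "crypto/sha256"
    "encoding/hex"
    "encoding/json"
)

// idemKey derives a cache key from every field that affects the response.
func idemKey(req ChatRequest) string {
    h := sha256.New()
    for _, part := range []any{req.WorkspaceID, req.Model, req.Messages, req.Tools, req.JSONSchema} {
        b, _ := json.Marshal(part) // deterministic for structs/slices in fixed field order
        h.Write(b)
        h.Write([]byte{0}) // separator so adjacent fields cannot collide
    }
    return hex.EncodeToString(h.Sum(nil))
}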


6. πŸ“ Prompts as Code

Treat prompts like SQL queries: version-controlled, testable, parameterized β€” never inline strings.

6.1 Filesystem layout

prompts/
  summarize/
    v1.md
    v2.md
    eval.jsonl       # ground-truth examples
    schema.json      # input variables
  agent/codegen/
    system.v3.md
    eval.jsonl

6.2 Loader with variable substitution

// prompts/loader.go
type Prompt struct {
    Name    string
    Version string
    Body    string  // with {{.var}} placeholders (text/template syntax)
}

func (p Prompt) Render(vars map[string]any) (string, error) {
    tmpl, err := template.New(p.Name).Parse(p.Body)
    if err != nil {
        return "", err
    }
    var buf bytes.Buffer
    if err := tmpl.Execute(&buf, vars); err != nil {
        return "", err
    }
    return buf.String(), nil
}

6.3 Versioning rule

  • Every prompt has a version (v1, v2, summarize.v3).
  • Old versions stay in the repo β€” you'll need them to reproduce historical outputs and run regression evals.
  • The active version is selected by config or feature flag, not by replacing the file.
# config.yaml
prompts:
  summarize: "summarize/v3"
  codegen:   "agent/codegen/system.v2"

6.4 What goes in a prompt vs in a tool

| Belongs in prompt | Belongs in a tool |
| --- | --- |
| Persona, format rules, examples | Anything that needs current data |
| Stable how-to instructions | Anything that mutates state |
| Output schema | Anything that should be auditable |

If the prompt embeds data that changes hourly, you have a stale-context bug waiting to happen. Push it to a tool call.

6.5 Don't ship prompts longer than they need to be

  • Every extra token costs money + adds latency.
  • Move stable instructions to system prompt; ship per-call deltas only.
  • Use prompt caching (Β§15) for the stable prefix.

7. πŸ› οΈ Tools, Function Calling & MCP

7.1 Tool registry pattern

type Tool struct {
    Name        string
    Description string
    Schema      json.RawMessage  // JSON Schema for input
    Handler     func(ctx context.Context, input json.RawMessage) (string, error)
    Permissions []string          // RBAC permissions required
    Audited     bool              // log every call to audit_log
}

var Registry = map[string]Tool{}

func Register(t Tool) { Registry[t.Name] = t }

7.2 The execution loop

agent calls tool β†’ gateway dispatches β†’ handler runs with the agent's permissions β†’
  result back to model β†’ next round

Critical: the tool runs as the agent's identity, not the user's. Use the agent token's claims for authz checks.

7.3 Tool authorization

Two layers:

  1. Allowlist on the agent: agent.tool_allowlist = ["search", "read_issue", "comment"]. Agent can only call tools on its list.
  2. Per-call permission check: Can(actorAgent, "issue.update", issue). Same Can() helper from your generic SaaS playbook (Β§6.3).

Don't skip layer 2 even if the agent passes layer 1 β€” multi-tenancy bugs hide here.
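
A sketch of both layers in one helper. The Agent shape and the Can() signature are assumptions (Can comes from the generic playbook); Registry is the tool registry from §7.1:

import (
    "context"
    "fmt"
    "slices"
)

func authorizeToolCall(ctx context.Context, agent Agent, toolName string, target any) error {
    // Layer 1: the agent-level allowlist.
    if !slices.Contains(agent.ToolAllowlist, toolName) {
        return fmt.Errorf("tool %q is not on the agent's allowlist", toolName)
    }
    // Layer 2: per-call permission check against the concrete resource,
    // with the agent (not the launching user) as the actor.
    for _, perm := range Registry[toolName].Permissions {
        if !Can(ctx, agent, perm, target) {
            return fmt.Errorf("agent lacks permission %q", perm)
        }
    }
    return nil
}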

7.4 MCP servers

Model Context Protocol is the emerging standard for exposing tools to LLM clients (Claude Desktop, Cursor, IDEs). For an AI SaaS, expose two MCP surfaces:

| Surface | Audience | Auth |
| --- | --- | --- |
| Public MCP server | External clients (Claude Desktop, Cursor, ChatGPT integrations) | OAuth or API key |
| Internal MCP server | Your own agent runtimes | Workspace-scoped agent token |

Implementing MCP is ~200 LoC of JSON-RPC over stdio or HTTP. SDKs exist for Python, TS, Go.

7.5 Dangerous tools need confirmation

For destructive tools (delete, send email, post to Slack, run code, charge a card):

agent: "I'd like to call delete_issue with id=123"
runtime: pause + emit confirmation_required event
user: clicks "approve"
runtime: resume + execute

Implement this with a pending_tool_call table and a WebSocket push. Default destructive tools to require confirmation. See Β§17 (Human-in-the-Loop).

7.6 Tool output budget

Don't dump 100 KB of search results into the model. Tools should:

  • Cap output at a sensible token budget (e.g., 4 KB).
  • Provide pagination + summarization.
  • Return IDs the model can re-query for detail.

Otherwise you'll burn context and money.

7.7 Code execution: never on your infra, always sandboxed

If your agent runs LLM-generated code (python_exec, run_sql, execute_shell), it executes in an ephemeral, network-isolated, secret-free sandbox. Don't roll your own β€” the failure mode is "agent root-shells your prod box."

| Sandbox | Type | Sweet spot |
| --- | --- | --- |
| E2B | Managed (also self-hostable) | Default. Per-request micro-VMs in ~150 ms cold-start, Python/Node/Bash/filesystem, file mount, language-native SDKs. Drop-in for "ChatGPT Code Interpreter-style" tools. |
| Modal / Daytona | Managed | Heavier, longer-lived sandboxes for jobs that need a real workspace (data analysis, repo modifications). |
| Cloudflare Workers / Sandboxed iframes | Self-host | Pure-JS evaluation when the workload is small and trusted. |
| Firecracker microVMs | DIY | You have an infra team and want full control. Most teams should not pick this. |

E2B is the recommended template default β€” it maps cleanly to the tool registry pattern (Β§7.1): one tool, one sandbox per call, output capped via Β§7.6, all wrapped in the usual audit log.


8. 🧠 Memory & RAG (the practical version)

8.1 Three kinds of memory, three different solutions

| Kind | TTL | Storage | Example |
| --- | --- | --- | --- |
| Conversational | This session | In-memory + Postgres | Chat history within a thread |
| Episodic | Per workspace, long-lived | Postgres | "User said their team is on PG 16" |
| Semantic / RAG | Knowledge base | Vector DB | Company docs, past tickets |

Don't conflate them. They have different access patterns and different invalidation rules.

Memory frameworks (when DIY gets tedious):

| Tool | Type | Sweet spot | Watch out for |
| --- | --- | --- | --- |
| Mem0 | OSS + managed (Apache 2.0) | Drop-in user/agent memory layer with add() / search() / update(). Auto-extracts and dedupes facts. Best when you want "give the agent a memory" without building the schema yourself. | Opinionated about extraction prompts; works best on chat-shaped data. |
| Letta (formerly MemGPT) | OSS, self-host (Apache 2.0) | Stateful agents with hierarchical memory (core memory, archival memory, recall) and OS-style page-in/page-out. Strong for long-lived persistent agents. | Heavier abstraction; the agents are the memory, so it's harder to bolt onto an existing app. |
| OpenViking (Volcengine / ByteDance) | OSS, Python-first | Unifies memories + resources + skills under a filesystem paradigm (viking:// URIs) with three-tier context loading (L0/L1/L2) to cut tokens, plus directory-recursive retrieval that combines vector search with hierarchical navigation. Interesting fit when you have structured knowledge (multi-doc workspaces, skill libraries) where flat RAG loses information. | License: AGPLv3 on the main project (CLI/examples are Apache 2.0), a hard blocker for many closed-source SaaS legal teams; verify with counsel before adopting. Younger project, smaller community than Letta/Mem0. |
| DIY on Postgres + pgvector | - | You already have the multi-tenancy/audit/RLS plumbing and your "memory" is mostly extracted facts (a memory table with kind, payload, embedding, workspace_id). Most templates land here. | Accept that you're building extraction + dedupe yourself. |

Recommendation: start DIY (one memory table next to chunk), add Mem0 if extraction/dedupe becomes the bottleneck, reach for Letta if you're building agent-as-product where the agent has its own persistent identity across months. Consider OpenViking when your context is hierarchically structured (e.g., per-project knowledge bases with skills + resources) and AGPLv3 is acceptable for your distribution model.

8.2 RAG, the boring version that works

Most AI SaaS RAG pipelines are over-engineered. Start here:

1. Chunk documents at semantic boundaries (paragraphs / sections; ~500 tokens)
2. Generate embeddings via cheap model (text-embedding-3-small, voyage-3-lite)
3. Store in Postgres + pgvector with metadata (workspace_id, doc_id, chunk_index)
4. Hybrid retrieval: BM25 (pg_trgm/FTS) + vector (cosine) β†’ reciprocal rank fusion
5. Re-rank top 50 with a cross-encoder (Cohere Rerank, Voyage rerank-2) β†’ top 8
6. Stuff into prompt with citation tokens

You don't need a dedicated vector DB until ~5M chunks. pgvector + HNSW handles that comfortably and saves you a service.
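
Step 4's fusion is small enough to own outright. A sketch of reciprocal rank fusion over the two result lists (k = 60 is the conventional constant; inputs are chunk IDs ordered best-first):

import "sort"

// rrf merges ranked lists: each appearance contributes 1/(k + rank).
func rrf(k float64, lists ...[]string) []string {
    scores := map[string]float64{}
    for _, list := range lists {
        for i, id := range list {
            scores[id] += 1.0 / (k + float64(i+1))
        }
    }
    merged := make([]string, 0, len(scores))
    for id := range scores {
        merged = append(merged, id)
    }
    sort.Slice(merged, func(a, b int) bool { return scores[merged[a]] > scores[merged[b]] })
    return merged
}

// usage: fused := rrf(60, bm25IDs, vectorIDs)  // then re-rank the top 50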

8.3 Chunking that doesn't suck

  • Don't split mid-sentence.
  • Keep section headings with the chunk.
  • For code: split by symbol (function/class), not by line count.
  • Add a chunk header: [Doc: X / Section: Y] so the model has context even out of order.

8.4 Embeddings worker

Embeddings are async. Never block a write on embedding generation.

1. User saves doc β†’ INSERT into doc + INSERT into outbox
2. Embeddings worker reads outbox β†’ calls embedding API in batches β†’ UPSERT into chunk
3. Mark outbox row done

Batch sizes of 100 are usually optimal across providers.
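
A sketch of one worker pass over the outbox. The outbox/chunk column names are illustrative, the embedding call goes through the §5 gateway, and the vector is written as a pgvector text literal to avoid a driver dependency:

import (
    "context"
    "database/sql"
    "strconv"
    "strings"
)

func embedOutboxBatch(ctx context.Context, db *sql.DB, gw Gateway) error {
    rows, err := db.QueryContext(ctx, `
        SELECT id, chunk_id, content FROM outbox
        WHERE kind = 'embed' AND done = false
        ORDER BY created_at LIMIT 100`)
    if err != nil {
        return err
    }
    defer rows.Close()

    var outboxIDs, chunkIDs, texts []string
    for rows.Next() {
        var id, chunkID, content string
        if err := rows.Scan(&id, &chunkID, &content); err != nil {
            return err
        }
        outboxIDs = append(outboxIDs, id)
        chunkIDs = append(chunkIDs, chunkID)
        texts = append(texts, content)
    }
    if len(texts) == 0 {
        return nil // nothing pending; caller sleeps and polls again
    }

    vectors, err := gw.Embed(ctx, "embed", texts) // one batched call
    if err != nil {
        return err // outbox rows stay undone and are retried next pass
    }
    for i, vec := range vectors {
        if _, err := db.ExecContext(ctx,
            `UPDATE chunk SET embedding = $1::vector WHERE id = $2`,
            vectorLiteral(vec), chunkIDs[i]); err != nil {
            return err
        }
        if _, err := db.ExecContext(ctx,
            `UPDATE outbox SET done = true WHERE id = $1`, outboxIDs[i]); err != nil {
            return err
        }
    }
    return nil
}

// vectorLiteral renders []float32 as pgvector's text form: "[0.1,0.2,...]".
func vectorLiteral(v []float32) string {
    parts := make([]string, len(v))
    for i, f := range v {
        parts[i] = strconv.FormatFloat(float64(f), 'f', -1, 32)
    }
    return "[" + strings.Join(parts, ",") + "]"
}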

8.5 Multi-tenancy in vectors

Every chunk row has workspace_id. Every query filters by it. It's tempting to skip this for "shared knowledge" β€” don't. Mistakes here become headlines.

For pgvector:

CREATE INDEX ON chunk USING hnsw (embedding vector_cosine_ops);
-- queries always include WHERE workspace_id = $1

8.6 When to invalidate

  • Source doc changed β†’ re-chunk, re-embed (delete old chunks first).
  • Source doc deleted β†’ cascade delete chunks.
  • Embedding model changed β†’ full re-embed (don't mix model versions in one index).

8.7 RAG is a search problem first

The single biggest improvement in any RAG system is better retrieval β€” not bigger context windows, not cleverer prompts. Run search-quality evals (recall@k, MRR) before tuning prompts.

8.8 Ingestion: don't write your own scraper

For any RAG that pulls from the open web or customer-hosted docs, the ingestion step is where most engineering time disappears (rendering JS, dealing with PDFs, deduping, cleaning boilerplate).

| Tool | Type | Sweet spot |
| --- | --- | --- |
| Crawl4AI | OSS, Python | LLM-shaped output by default: Markdown + structured chunks, JS rendering via Playwright, sitemap + multi-page crawl, async. Default pick for "give me clean docs from a URL list." |
| Firecrawl | Managed (OSS option) | Same shape, hosted. Pay per page; saves you running headless browsers. |
| Unstructured.io | OSS + managed | Best for PDFs, Office docs, emails; strong layout-aware parsing. Pair with Crawl4AI for the web side. |
| LlamaParse | Managed | High-quality PDF/table extraction; expensive but accurate on hard documents. |

Whatever ingestor you pick, it runs in a worker (Β§18) that emits to the same outbox + embeddings pipeline (Β§8.4) β€” your RAG indexing path stays one shape.


9. πŸ“ Structured Outputs

When you need machine-readable output (extracting fields, generating UI, calling code), use JSON mode + JSON Schema β€” not regex on free text.

9.1 The pattern

schema := `{
  "type": "object",
  "properties": {
    "title": { "type": "string", "maxLength": 120 },
    "priority": { "enum": ["low","med","high"] },
    "due_date": { "type": "string", "format": "date" }
  },
  "required": ["title", "priority"]
}`

resp, _ := gateway.Chat(ctx, ChatRequest{
    Model: "smart",
    JSONSchema: json.RawMessage(schema),
    Messages: []Message{ ... },
})

var issue IssueDraft
json.Unmarshal([]byte(resp.Choices[0].Content), &issue)

9.2 Validation belt-and-suspenders

Even with JSON mode, validate server-side. Models occasionally produce schema-shaped-but-invalid output (wrong enum, out-of-range number). Use the same Zod / pydantic schema you'd use for human-submitted data.

9.3 When JSON mode isn't enough

  • Cross-field constraints ("if A then B"): validate, reject, retry once with the validation error in the prompt.
  • Generated data that needs DB references (foreign keys): post-process to resolve names β†’ IDs, fail loudly if unresolved.

9.4 Higher-level structured-output libraries

If you find yourself writing the same "schema β†’ prompt β†’ parse β†’ validate β†’ retry" loop in multiple places, lift it.

| Tool | Language | Sweet spot | Watch out for |
| --- | --- | --- | --- |
| Instructor | Python (also JS, Go, Elixir ports) | Pydantic-first wrapper around OpenAI/Anthropic/etc. Define a BaseModel, get type-safe outputs with automatic retries on validation failure. The default for Python AI SaaS. | Couples your code to the Instructor abstraction; bare SDK calls remain available so the lock-in is shallow. |
| BAML | Cross-language (TS, Python, Ruby, Go via codegen) | A small DSL for prompts + schemas that compiles to typed clients. Great for teams with many prompts and a strong typing culture; treats prompts like API definitions. | New tool to learn, separate .baml files in your repo, codegen step in CI. |
| TypeChat (Microsoft) | TypeScript | Small, focused on TS-first apps; schema is a TS type, validator regenerates on parse failure. | Less active than Instructor/BAML; fewer providers wrapped. |
| Outlines / LMQL | Python | Constrained decoding (the model literally cannot emit invalid JSON/regex). Useful for local/self-hosted models without native JSON mode. | Provider-side JSON mode is now table stakes; this matters mainly for OSS model deployments. |

Template recommendation: Python services β†’ Instructor. Multi-language teams or strong "prompts-as-API" culture β†’ BAML. Otherwise: bare JSON Schema (Β§9.1) + the same Zod/pydantic schema you already use for HTTP validation (Β§22.5 in the main playbook) is enough.


10. πŸ’§ Streaming UX

Users tolerate 30-second LLM responses only if they see progress. Streaming is non-negotiable for any chat-like surface.

10.1 The transport

Direction Use
Server β†’ client SSE (text/event-stream) β€” simpler, plays nicer with HTTP/2 + edges
Bidirectional needed (cancel, mid-stream input) WebSocket

Default to SSE.

10.2 The event taxonomy (steal this)

event: token            data: { content: "Hello" }           // text delta
event: thinking         data: { content: "Considering..." } // reasoning delta
event: tool_use         data: { name: "search", input: {...} }
event: tool_result      data: { name: "search", output: "..." }
event: status           data: { stage: "retrieving" }
event: error            data: { code: "rate_limited", message: "..." }
event: done             data: { usage: { input: 100, output: 250 } }

Mirror the structure across providers. The frontend should render the same components regardless of backend.

10.3 Cancellation

Streaming MUST be cancellable. When user closes the tab, navigates away, or clicks "stop":

const ctrl = new AbortController()
fetch("/api/chat", { signal: ctrl.signal })
// later
ctrl.abort()

Server-side: detect ctx.Done() and abort the upstream LLM call. Don't keep paying for tokens the user no longer wants.
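
A server-side sketch in Go on top of the §5 interface (the StreamEvent fields and buildRequest are assumptions): the request context is cancelled when the client aborts or disconnects, and passing that same context upstream is what stops the token spend.

import (
    "fmt"
    "net/http"
)

func handleChatStream(gw Gateway) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        ctx := r.Context() // cancelled on client abort, tab close, or disconnect

        events, err := gw.ChatStream(ctx, buildRequest(r)) // upstream call shares ctx
        if err != nil {
            http.Error(w, err.Error(), http.StatusBadGateway)
            return
        }
        w.Header().Set("Content-Type", "text/event-stream")
        w.Header().Set("Cache-Control", "no-cache")
        flusher, _ := w.(http.Flusher)

        for {
            select {
            case <-ctx.Done():
                return // client went away; the provider call aborts via the same ctx
            case ev, ok := <-events:
                if !ok {
                    return // stream finished normally
                }
                fmt.Fprintf(w, "event: %s\ndata: %s\n\n", ev.Type, ev.Data)
                if flusher != nil {
                    flusher.Flush()
                }
            }
        }
    }
}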

10.4 Token-by-token UI

  • Render incrementally with no animation delay.
  • Markdown rendering: parse-as-you-go (libraries: marked-react, streaming-markdown).
  • Code blocks: syntax-highlight progressively or buffer until ``` closes.
  • Show a "stop" button while streaming, "regenerate" button after.

11. πŸ’΅ Cost Control, Budgets & Model Routing

The single biggest operational mistake in AI SaaS: deploying without budget caps and waking up to a $40,000 bill.

11.1 Three layers of caps

[Tenant cap]   workspace.daily_token_budget          → 402 if exceeded
[User cap]     user.daily_request_budget             → 429 if exceeded
[Per-call cap] max_tokens on the request             → enforced by provider

All three. Always.

11.2 Real-time budget check

Hot path can't query Stripe or sum a Postgres table. Use Redis:

key: budget:{workspace_id}:{YYYY-MM-DD}
op:  INCRBY <tokens>
ttl: 36h

After every call, increment by usage.input_tokens + usage.output_tokens. Before every call, check GET against the workspace's daily limit.
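
A sketch using go-redis (the key shape matches the layout above; failing open vs. closed on a Redis error is a policy choice):

import (
    "context"
    "errors"
    "fmt"
    "time"

    "github.com/redis/go-redis/v9"
)

var ErrBudgetExceeded = errors.New("workspace daily token budget exceeded")

func budgetKey(workspaceID string) string {
    return fmt.Sprintf("budget:%s:%s", workspaceID, time.Now().UTC().Format("2006-01-02"))
}

// Before the call: compare today's counter against the workspace limit.
func checkBudget(ctx context.Context, rdb *redis.Client, workspaceID string, dailyLimit int64) error {
    used, err := rdb.Get(ctx, budgetKey(workspaceID)).Int64()
    if err != nil && !errors.Is(err, redis.Nil) {
        return err
    }
    if used >= dailyLimit {
        return ErrBudgetExceeded
    }
    return nil
}

// After the call: add the actual usage and keep the 36h TTL.
func meterUsage(ctx context.Context, rdb *redis.Client, workspaceID string, tokens int64) error {
    pipe := rdb.Pipeline()
    pipe.IncrBy(ctx, budgetKey(workspaceID), tokens)
    pipe.Expire(ctx, budgetKey(workspaceID), 36*time.Hour)
    _, err := pipe.Exec(ctx)
    return err
}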

11.3 Soft-fail UX

Don't just reject the request with a bare error code. When near the cap:

  • Banner: "You're at 80% of your daily AI budget."
  • At 100%: inline upgrade prompt β€” "Upgrade to Pro for 10x credits."
  • Reset hourly/daily based on plan.

11.4 Model routing for cost

Cheapest model that meets the bar. Real heuristic:

func routeModel(taskKind string) string {
    switch taskKind {
    case "classify", "extract", "rewrite":
        return "fast"      // Haiku / mini
    case "summarize", "answer", "draft":
        return "smart"     // Sonnet / GPT-5
    case "agent", "code", "reasoning":
        return "reasoning" // Opus / o3
    default:
        return "smart"
    }
}

Then run evals (§13) per task kind to verify the cheap model holds quality. Most calls land on fast; 90% of the cost lives in 10% of the tasks.

11.5 Cost dashboard (build this)

Per-workspace daily spend, per-feature breakdown, per-model breakdown. Without this you can't price your product.

CREATE TABLE llm_call_log (
    id UUID PK,
    workspace_id UUID,
    user_id UUID,
    feature TEXT,
    model TEXT,
    provider TEXT,
    input_tokens INT,
    output_tokens INT,
    cached_tokens INT,
    cost_usd_micros BIGINT,  -- store in micros to avoid float
    cache_hit BOOL,
    duration_ms INT,
    created_at TIMESTAMPTZ
);
-- Partition by month if volume is high

Materialized views (refreshed hourly) for the dashboard.

11.6 BYO key

For power users, support "bring your own API key." Stored encrypted, used as a passthrough.

workspace.byok = { provider: "anthropic", key_encrypted: "..." }

Two benefits: no margin pressure on heavy users, lets enterprises use their existing AI vendor relationship.


12. 🧾 Outcome-Based & Metered Pricing β€” the implementation

The "per-outcome" pricing trend is real but often misunderstood. You still bill per unit of work β€” the unit is just bigger than a seat.

12.1 Three patterns that actually work

| Pattern | Example | Best for |
| --- | --- | --- |
| Credits | "1,000 AI credits/mo, top-up $5 = 500 more" | Mixed-feature products |
| Per-call | "$0.05 per generation" | Single high-value output |
| Per-task / per-outcome | "$2 per resolved ticket" | Agentic / replacement-of-labor |

12.2 The credit ledger

Keep a single ledger. Every consuming feature debits; every plan/topup credits.

CREATE TABLE credit_ledger (
    id UUID PRIMARY KEY,
    workspace_id UUID NOT NULL,
    delta BIGINT NOT NULL,        -- +N for grant, -N for usage
    reason TEXT NOT NULL,         -- "plan_grant" | "topup" | "feature.summarize" | "agent.run"
    metadata JSONB,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Materialized view for current balance
CREATE MATERIALIZED VIEW credit_balance AS
SELECT workspace_id, SUM(delta) AS balance
FROM credit_ledger GROUP BY workspace_id;

Refresh credit_balance after every write. Or use a running_total column with row-level locking on the latest entry.
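
A sketch of the running-total variant: an advisory lock serializes debits per workspace, and running_total is the extra column this variant adds on top of the §12.2 schema (both the column and the helper name are illustrative):

import (
    "context"
    "database/sql"
    "errors"
)

var ErrInsufficientCredits = errors.New("insufficient credits")

func debitCredits(ctx context.Context, db *sql.DB, workspaceID string, amount int64, reason string) error {
    tx, err := db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    defer tx.Rollback()

    // Serialize concurrent debits for this workspace.
    if _, err := tx.ExecContext(ctx,
        `SELECT pg_advisory_xact_lock(hashtext($1))`, workspaceID); err != nil {
        return err
    }

    var current int64
    err = tx.QueryRowContext(ctx, `
        SELECT running_total FROM credit_ledger
        WHERE workspace_id = $1
        ORDER BY created_at DESC LIMIT 1`, workspaceID).Scan(&current)
    if errors.Is(err, sql.ErrNoRows) {
        current = 0
    } else if err != nil {
        return err
    }
    if current < amount {
        return ErrInsufficientCredits
    }

    if _, err := tx.ExecContext(ctx, `
        INSERT INTO credit_ledger (id, workspace_id, delta, reason, running_total)
        VALUES (gen_random_uuid(), $1, $2, $3, $4)`,
        workspaceID, -amount, reason, current-amount); err != nil {
        return err
    }
    return tx.Commit()
}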

12.3 Mapping tokens to credits

Don't expose tokens to users β€” they don't care and pricing changes break their mental model. Convert internally:

func tokensToCredits(model string, in, out int) int64 {
    cost := costUSDMicros(model, in, out)
    return cost / pricePerCreditMicros // e.g., 1 credit = $0.001
}

Show users credits. Track tokens internally for cost analysis.

12.4 Stripe metered billing

For usage-based, push usage to Stripe daily (not per call):

// nightly cron
for _, ws := range workspaces {
    usage := sumYesterdaysUsage(ws.ID)
    stripe.UsageRecords.New(&stripe.UsageRecordParams{
        SubscriptionItem: &ws.UsageItemID,
        Quantity:         &usage,
        Timestamp:        &yesterday,
        Action:           stripe.UsageRecordActionSet,
    })
}

12.5 Outcome-based billing (the hard one)

For "$2 per resolved ticket," you need:

  • A definition of "resolved" the customer agrees to.
  • An immutable record of each outcome (outcome table).
  • A dispute window (5–7 days).
  • A finalize-and-bill cron after the window.

Don't sell outcome-based until you have eval coverage on what counts as "outcome." Disputes will eat you alive otherwise.


13. βœ… Evals β€” how to actually test agents

This is where most AI SaaS quality dies. Implement evals before launch, not after.

13.1 The simplest useful eval

# evals/summarize.jsonl
{"input": "...long article...", "expected_must_contain": ["climate", "policy"]}
{"input": "...", "expected_must_contain": ["..."]}
# evals/run.py
def score(output, expected):
    return all(term.lower() in output.lower() for term in expected["expected_must_contain"])

# Run nightly + on every PR that touches prompts/

Start with 20 hand-written examples. Add 1 more every time a user reports a bad output. In 3 months you have 100 β€” enough to catch real regressions.

13.2 Eval categories

| Type | Method | When |
| --- | --- | --- |
| Exact match / contains | String compare | Extraction, classification |
| Schema validity | JSON Schema validate | Structured output |
| Reference comparison | BLEU / ROUGE / embedding similarity | Translation, summarization |
| LLM-as-judge | Stronger model scores output | Open-ended quality |
| Human review | Manual labels on samples | Subjective quality, safety |
| A/B in production | Compare metrics across variants | Final word |

LLM-as-judge is fast and useful but biased. Cross-check with human labels on a sample. Don't ship a judge prompt without validating it.

13.3 Regression evals on every prompt change

# .github/workflows/evals.yml
on:
  pull_request:
    paths:
      - "prompts/**"
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python evals/run.py
      - run: python evals/compare.py --base main --head HEAD

Block merges if quality drops by N% on the eval set. This is the closest thing to unit testing for LLMs.

13.4 Capture production outputs as eval data

Sample 1% of production calls (with PII scrubbed) into your eval store. Periodically promote interesting ones to ground-truth labeled examples. The longer you run, the better your eval set gets.

13.5 Tools

| Tool | Type | Sweet spot |
| --- | --- | --- |
| Promptfoo | OSS, YAML-driven, fast | Great default. Run from CI, diff prompts side-by-side, web UI for inspection. The "Jest for prompts." |
| DeepEval | OSS, Python (pytest-native) | If your team writes pytest already. Bundles 14+ metrics (faithfulness, hallucination, answer-relevancy, G-Eval), runs as @pytest.mark.eval decorators. |
| Ragas | OSS, Python | The standard for RAG-specific evals: context precision/recall, faithfulness, answer correctness. Pair with Promptfoo/DeepEval for end-to-end coverage. |
| Braintrust | Hosted | Dashboards, team workflows, dataset versioning, prompt-iteration UX. Best when you have 3+ engineers iterating on prompts. |
| Langfuse | OSS + hosted | Evals + observability in one tool: re-run a production trace as an eval, score it, version the prompt. Pairs perfectly with §14.5. |
| LangSmith | Hosted | If you're using LangChain anyway. |
| OpenAI Evals | OSS framework, Python | Reference framework if you want to stay close to OpenAI's eval philosophy. |
| DIY | 200 LoC + a JSONL file | Often best for the first 6 months. |

Recommendation: start with a JSONL file + a make eval script (Β§13.1). Add Promptfoo the day you have >20 cases. Add Ragas the day you ship RAG. Add Langfuse the day you want production traces and evals to live in the same database.


14. πŸ”­ Observability for Agents

Standard observability (logs/metrics/traces) plus LLM-specific signals.

14.1 Capture every LLM call

CREATE TABLE llm_trace (
    id UUID PK,
    request_id TEXT,         -- correlates to your APM trace
    workspace_id UUID,
    feature TEXT,
    model TEXT,
    messages_hash TEXT,
    messages JSONB,          -- full prompt for replay
    response JSONB,          -- full response
    tools JSONB,             -- tool calls + results
    usage JSONB,
    latency_ms INT,
    cost_usd_micros BIGINT,
    cache_hit BOOL,
    score FLOAT,             -- user thumbs up/down or eval score
    created_at TIMESTAMPTZ
);
-- Heavy table; partition by day, drop after 30–90 days

Make this searchable in your admin tool. "Show me the last 10 chat completions for workspace X" should be one click β€” that's how you debug "why did the AI say something weird?"

14.2 Signals to plot on Grafana

  • p50 / p95 / p99 latency per model
  • Token throughput per minute
  • Cost per minute (broken down by feature + workspace)
  • Cache hit rate (prompt cache + semantic cache)
  • Error rate per provider
  • Fallback rate
  • Eval score over time (if you score in production)

14.3 Trace IDs across the stack

Every LLM call gets a trace ID that flows: API β†’ gateway β†’ provider β†’ tool calls β†’ DB. When a customer says "this answer was wrong," you find that trace ID and see exactly what happened.

14.4 User feedback signal

Thumbs up/down on every AI-generated output. Persist in llm_trace.score. Aggregate weekly. The directional signal is gold even with 1% response rate.

14.5 Don't build the trace UI yourself β€” pick an LLM observability tool

The llm_trace schema in Β§14.1 is what you need; the UI to search/replay/diff/score it is what you don't want to build. Wire one of these as the destination for trace exports (most have OTel-compatible ingestion, so the LLM Gateway emits once and you swap dashboards by config).

| Tool | Type | Sweet spot | Watch out for |
| --- | --- | --- | --- |
| Langfuse | OSS, self-host or cloud | Default recommendation. Open-source, generous free cloud tier, drop-in for the llm_trace schema, evals + prompt management + datasets in one tool. SDKs for Python/TS/Go. | Self-hosting Postgres + ClickHouse adds ops burden; use cloud until trace volume justifies it. |
| LangSmith | Managed (LangChain) | You're already deep in LangChain/LangGraph: tightest integration, best replay UX for graph agents. | Lock-in to LangChain abstractions; pricing scales with trace volume. |
| Helicone | OSS, self-host or cloud | Lightest-touch: works as an HTTP proxy in front of OpenAI/Anthropic, so zero SDK changes. Great for getting to "I can see my LLM calls" in 10 minutes. | Proxy model means it sits on the request path; budget for the latency hop. |
| Arize Phoenix | OSS, self-host | Strong eval + drift detection, OTel-native. Good for ML-heavy teams that already speak Arize. | Less polished trace replay UX than Langfuse/LangSmith. |
| Braintrust | Managed | Eval-first workflow with great prompt-iteration UX (diff prompts, run on dataset, compare). | Smaller community than Langfuse. |
| Logfire (Pydantic) | Managed | If you're already on Pydantic AI, it Just Works: OTel-native, great Python ergonomics. | Python-shaped. |

Template recommendation: start with Langfuse cloud β€” free tier covers prototype volume, matches the llm_trace schema almost 1-for-1, and self-hosting later is a config flip, not a migration. Add Helicone in front of providers if you want zero-code-change observability before you've wired the gateway.

The LLM Gateway (Β§5) is where this integration lives β€” one writer, many destinations. Your handler code stays unchanged.


15. ⚑ Caching (Prompt + Semantic)

Two distinct caches with different rules.

15.1 Prompt cache (provider-managed)

Anthropic, OpenAI, and Google all support prompt caching now. Use it always for stable prefixes.

# Anthropic example
client.messages.create(
    model="claude-sonnet-4-6",
    system=[
        {"type": "text", "text": large_system_prompt, "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{"role": "user", "content": user_query}],
)

Rule of thumb: anything over 1024 tokens that you reuse should be cached. System prompts, tool schemas, few-shot examples, RAG context that doesn't change β€” all cacheable.

Cache hit ratio of 80%+ on a chat product is normal and a 10x cost reduction.

15.2 Semantic cache (your responsibility)

For high-volume, low-novelty queries (FAQ-style chatbots), cache by meaning, not exact match:

1. Embed query
2. Vector search recent cached responses for this workspace
3. If cosine > 0.97 AND same model AND same tools: return cached response
4. Else: call model, cache result with embedding
CREATE TABLE semantic_cache (
    id UUID PK,
    workspace_id UUID,
    feature TEXT,
    model TEXT,
    query_embedding vector(1536),
    response TEXT,
    hits INT DEFAULT 0,
    created_at TIMESTAMPTZ,
    expires_at TIMESTAMPTZ
);
CREATE INDEX ON semantic_cache USING hnsw (query_embedding vector_cosine_ops);
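
A lookup sketch in Go (pgvector's <=> operator is cosine distance, so similarity > 0.97 means distance < 0.03; vectorLiteral is the same tiny helper as in the §8.4 sketch):

import (
    "context"
    "database/sql"
)

func semanticCacheGet(ctx context.Context, db *sql.DB, workspaceID, feature, model string, queryEmb []float32) (string, bool) {
    var id, response string
    err := db.QueryRowContext(ctx, `
        SELECT id, response FROM semantic_cache
        WHERE workspace_id = $1 AND feature = $2 AND model = $3
          AND expires_at > now()
          AND query_embedding <=> $4::vector < 0.03
        ORDER BY query_embedding <=> $4::vector
        LIMIT 1`,
        workspaceID, feature, model, vectorLiteral(queryEmb)).Scan(&id, &response)
    if err != nil {
        return "", false // miss (or error): fall through to the model call
    }
    // Best-effort hit counter; not worth blocking the request path for.
    go db.ExecContext(context.Background(),
        `UPDATE semantic_cache SET hits = hits + 1 WHERE id = $1`, id)
    return response, true
}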

Caveats: semantic cache is dangerous for personalized output. Scope by (workspace_id, user_id) if responses include user-specific data.

15.3 What NOT to cache

  • Anything with current time / "today" semantics.
  • Anything with user-specific data unless scoped.
  • Tool-using calls where tool results vary.
  • Anything regulated (healthcare, legal, financial advice).

16. πŸ›‘οΈ Safety, Abuse & PII

16.1 Input filtering

Cheap, fast classifier on every user input:

  • Off-topic / spam
  • Prompt injection attempts ("ignore previous instructions...")
  • Disallowed content per your policy

OpenAI's moderation endpoint and Llama Guard are both cheap or free.

16.2 Prompt injection β€” the actual mitigations

Prompt injection isn't fully solved. Your best defenses:

  1. Treat tool outputs as untrusted. Never let a tool result execute another tool without re-validating against the user's intent.
  2. Strict tool allowlists per agent. A summarizer doesn't need a delete_data tool.
  3. Confirm destructive actions. Β§17.
  4. Don't reflect tool output verbatim into another LLM call as instructions. Use clear delimiters and instruct the model to treat tool output as data.
  5. Audit all tool calls. When an injection succeeds, you'll need the trace.
  6. Sandbox code execution. If your agent runs arbitrary code, it runs in an ephemeral container with no network egress and no secrets. Use E2B or equivalent (Β§7.7) β€” never your own infra.

16.2a Red-team your prompts before users do

You can't reason your way to "injection-proof." You have to attack it.

| Tool | Type | Sweet spot |
| --- | --- | --- |
| NVIDIA garak | OSS, Python | The "nmap for LLMs." Probes for prompt injection, jailbreaks, encoding attacks, training-data leakage, malware generation, hallucinated package names. Runs against any provider via a plugin model. Run on every model upgrade and every system-prompt change. |
| PyRIT (Microsoft) | OSS, Python | Microsoft's automated red-teaming framework: multi-turn attacks, chained prompts, scenario-based testing. Heavier than garak; better for structured engagements. |
| promptfoo redteam | OSS | Adversarial test generation built into your existing eval suite. Lower setup cost if you already use Promptfoo. |
| Lakera Guard / Prompt Armor | Managed | Runtime injection detection as a sidecar; pair with your input filter. |

Bake garak into CI β€” run a curated probe set on every PR that touches prompts or agent tools. Treat findings the way you'd treat OWASP ZAP results: known accepted risks documented, regressions block the merge.

16.3 Output filtering

Before showing AI output to a user (especially in customer-facing AI), filter for:

  • PII leakage (the model regurgitating training data)
  • Toxicity
  • Hallucinated URLs (validate links resolve before rendering)
  • Hallucinated function calls / API names that don't exist

16.4 PII scrubbing for telemetry

You will store prompts in llm_trace. Some prompts contain PII. Either:

  • Don't store the raw prompt β€” store a hash + a redacted version.
  • Store but encrypt β€” the production team can't read it without a break-glass procedure.
  • Tiered retention β€” raw 7 days, hashed 30 days.

16.5 Abuse: rate limits + cost limits + content limits

Beyond per-call rate limits:

  • Cumulative cost cap per IP / per signup-day (catch credit-card-stuffing attacks).
  • Block / ratelimit based on signup recency (account age < 24h gets stricter limits).
  • Cloudflare Turnstile / hCaptcha on signup.

The most common attack pattern in 2025–2026: trial accounts mass-created to scrape free LLM credits. Defend at signup.


17. πŸ™‹ Human-in-the-Loop & Autonomy Levels

Define autonomy levels per tool/action and let workspace admins set policy.

17.1 Five levels

| Level | Behavior | Example |
| --- | --- | --- |
| L1 - Suggest | Agent suggests; human executes | "Draft this email for me" |
| L2 - Auto-with-undo | Agent acts; user can undo | "Apply formatting" |
| L3 - Confirm-each | Agent proposes; human approves each step | "Refactor across files" |
| L4 - Confirm-once | Human approves a plan; agent executes | "Process this batch of tickets" |
| L5 - Fully autonomous | Agent runs; audit log only | "Reply to FAQ tickets matching pattern X" |

17.2 Implementation

CREATE TABLE pending_action (
    id UUID PK,
    workspace_id UUID,
    agent_id UUID,
    user_id UUID,            -- who must approve
    tool TEXT,
    input JSONB,
    rationale TEXT,
    status TEXT,             -- pending | approved | rejected | expired
    expires_at TIMESTAMPTZ,
    created_at TIMESTAMPTZ
);

Agent calls "execute_with_approval" β†’ row inserted β†’ WS push to user β†’ user clicks approve β†’ row updates β†’ agent resumes via wakeup.
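
The approval side of that flow is one conditional UPDATE plus a wakeup. A sketch (the pg_notify channel name is illustrative; any queue or pub/sub works as the wakeup mechanism):

import (
    "context"
    "database/sql"
    "errors"
)

func approvePendingAction(ctx context.Context, db *sql.DB, actionID, approverID string) error {
    res, err := db.ExecContext(ctx, `
        UPDATE pending_action
        SET status = 'approved'
        WHERE id = $1 AND user_id = $2 AND status = 'pending' AND expires_at > now()`,
        actionID, approverID)
    if err != nil {
        return err
    }
    if n, _ := res.RowsAffected(); n == 0 {
        return errors.New("action is not pending, has expired, or is not yours to approve")
    }
    // Wake the paused run so the worker re-reads pending_action and executes the tool.
    _, err = db.ExecContext(ctx, `SELECT pg_notify('agent_wakeup', $1)`, actionID)
    return err
}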

17.3 Defaults that won't get you sued

  • All destructive tools default to L3.
  • All tools that send external messages (email, Slack, social) default to L3 for the first 100 uses per agent, then L4 (per-batch approval).
  • All tools that spend money default to L3 with a confirmation modal showing the amount.
  • Workspace admins can override defaults; users on the workspace cannot.

18. ⏳ Long-Running Agent Jobs

LLM-based jobs can run for minutes or hours. Don't try to do this in the request path.

18.1 The pattern

1. POST /api/agents/run β†’ 202 Accepted, returns run_id
2. Worker picks up the job, runs the agent loop
3. Worker streams progress events to a per-run channel
4. Client subscribes via WS or SSE: GET /api/agents/runs/{run_id}/events
5. On completion, worker writes result + emits completion event
6. Client can fetch full result via GET /api/agents/runs/{run_id}

18.2 Resumable runs

Agents can run for hours and survive worker restarts. Store enough state to resume:

CREATE TABLE agent_run (
    id UUID PK,
    workspace_id UUID,
    agent_id UUID,
    status TEXT,             -- queued | running | paused | completed | failed | cancelled
    current_step INT,
    state JSONB,             -- agent's working memory, last LLM session ID
    result JSONB,
    error TEXT,
    started_at TIMESTAMPTZ,
    completed_at TIMESTAMPTZ,
    last_heartbeat_at TIMESTAMPTZ
);

Worker writes last_heartbeat_at every 10 s. Janitor cron picks up rows with stale heartbeats and re-queues.
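
The janitor itself is one UPDATE. A sketch (the 60-second threshold assumes the 10-second heartbeat above):

import (
    "context"
    "database/sql"
)

// requeueStaleRuns moves runs whose worker stopped heart-beating back to the queue.
func requeueStaleRuns(ctx context.Context, db *sql.DB) (int64, error) {
    res, err := db.ExecContext(ctx, `
        UPDATE agent_run
        SET status = 'queued'
        WHERE status = 'running'
          AND last_heartbeat_at < now() - interval '60 seconds'`)
    if err != nil {
        return 0, err
    }
    return res.RowsAffected()
}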

18.3 Cancellation

User clicks "cancel" β†’ row status becomes cancelling β†’ worker checks the status every iteration β†’ sees cancelling β†’ cleans up + sets cancelled. The Multica pattern (Β§6.3) is the canonical example.

18.4 Cost guardrails on long runs

Every long run has a hard cost cap. When exceeded, the worker stops the agent loop, marks the run failed-budget-exceeded, refunds nothing, and emails the user.


19. 🏒 AI-Specific Multi-Tenancy Concerns

Building on Β§5 of the main playbook. Things you must handle that don't apply to non-AI SaaS:

19.1 Tenant context contamination

If you cache prompts or embeddings, scope every cache key by workspace_id. A cross-tenant cache hit is a customer-data leak.

19.2 Provider-side isolation

OpenAI, Anthropic, etc. don't see your tenants. They see you. So:

  • Track per-tenant usage yourself (the provider's usage dashboard is for you, not a per-customer audit trail).
  • Pass an opaque user_id field per call (most providers support it) to help abuse triage.
  • Don't pass real customer emails to providers.

19.3 Per-tenant model overrides

Some tenants want a specific model (compliance, regional latency, BYO API key). Your abstraction must support this:

workspace:
  ai_settings:
    model_override: "claude-sonnet-4-6"   # null β†’ use platform default
    byok: { provider: "openai", key_id: "..." }
    region: "eu"

19.4 Data residency

Enterprise tenants will ask "is my data sent to the US?" Have answers ready:

  • List which model providers / regions are used.
  • Support EU-only deployments by routing to EU endpoints (Anthropic Bedrock EU, OpenAI Azure EU, etc.).
  • Note any retention by the provider (most are zero-retention now, but check per-provider).

19.5 No-train guarantees

Default to opt-out of provider training. Every major provider now has zero-retention API tiers β€” use them. Document this in your DPA.


20. πŸ—ΊοΈ The 10-Phase Build Plan

Layered on top of the 14-phase plan in the main playbook. Run these phases after you have core auth + tenancy + billing in place β€” don't try to build AI-native without those foundations.

🌱 Phase 1 β€” LLM Gateway (2 days)

  • pkg/llm/ (or equivalent) β€” interface, provider adapters for one provider.
  • Basic call/stream/embed methods.
  • Token + cost metering writes to llm_call_log.
  • Idempotency by request hash.

Done when: you can call gateway.Chat(...) and see the call logged with cost.

πŸ“ Phase 2 β€” Prompts as Code (1 day)

  • prompts/ directory with versioned templates.
  • Loader + variable substitution.
  • Config-driven version selection.
  • One eval file per prompt with 20 examples.

Done when: changing a prompt requires a new file, the old one stays, and CI runs evals.

πŸ› οΈ Phase 3 β€” Tool Registry + One Real Tool (1 day)

  • Tool struct + registry.
  • One tool wired end-to-end (e.g., "search workspace docs").
  • Permission check enforced.
  • Tool calls audited.

Done when: an LLM call can request the tool, your code dispatches, and the audit log captures it.

🧠 Phase 4 β€” RAG (2 days)

  • pgvector enabled.
  • Chunking + embeddings worker.
  • Hybrid retrieval (BM25 + cosine + RRF).
  • Citation rendering in UI.

Done when: uploading a doc and asking a question returns an answer with cited chunks.

πŸ’§ Phase 5 β€” Streaming UX (1 day)

  • SSE endpoint.
  • Frontend hook that renders tokens as they arrive.
  • Cancel button propagates to upstream LLM call.
  • Markdown rendered progressively.

Done when: a 30-second response feels fast because tokens are flowing.

πŸ’΅ Phase 6 β€” Cost Caps + Credits (2 days)

  • Credit ledger table + balance materialized view.
  • Per-workspace daily budget check (Redis).
  • Stripe metered billing wired (daily push).
  • Cost dashboard in admin panel.

Done when: a workspace at quota gets a paywall instead of a runaway bill.

βœ… Phase 7 β€” Evals in CI (1 day)

  • Promptfoo or DIY runner.
  • Block PR merges that drop scores by > 5%.
  • Sample 1% of production calls into eval candidates table.

Done when: changing a prompt requires passing evals.

πŸ”­ Phase 8 β€” LLM Trace + Admin Replay (1 day)

  • llm_trace table populated for every call.
  • Admin panel page: search by workspace + user + feature.
  • One-click "rerun this prompt" for debug.
  • Thumbs up/down captured.

Done when: support can resolve "the AI said something wrong" tickets in < 5 min.

πŸ›‘οΈ Phase 9 β€” Safety Layer (1 day)

  • Moderation pre-check on user input.
  • PII scrubbing on stored traces.
  • Tool-allowlist per agent.
  • Destructive tools default to confirmation.

Done when: the obvious abuse vectors (prompt injection demos, NSFW input, free-credit scraping) all fail.

⏳ Phase 10 β€” Long-Running Agent Runs (2 days)

  • agent_run table + worker pool.
  • Resume on worker restart.
  • Cancellation propagation.
  • Per-run cost cap.
  • WS streaming of progress to UI.

Done when: a 5-minute agent task survives a worker restart and shows live progress.

Total: ~14 days for a single experienced engineer to layer AI-native primitives onto a working SaaS template.


21. ⚠️ Pitfalls

| Pitfall | Guardrail |
| --- | --- |
| Hardcoded provider model name in business logic | Always go through model: "smart" aliases via the gateway. |
| No daily token cap → runaway bill | Per-workspace Redis counter checked on every call. |
| Provider outage takes whole product down | Fallback provider configured per model alias. |
| Prompt change ships without testing | CI runs evals on prompts/ changes; block on regression. |
| Tool runs as user, not agent | Agent token's claims drive permission checks. |
| Tool output piped back into next prompt as instructions | Treat tool output as data; use clear delimiters. |
| RAG returns chunks from wrong tenant | workspace_id filter on every vector query. |
| Embeddings model upgraded mid-fleet → scoring chaos | Re-embed everything; don't mix model versions in one index. |
| Streaming endpoint can't be cancelled | Plumb client AbortController through to upstream LLM call. |
| LLM trace contains raw PII forever | Tiered retention: raw 7 days, redacted 30 days. |
| Semantic cache returns cross-user response | Scope cache key by (workspace_id, user_id). |
| Long-running agent dies on worker restart | Heartbeat + resumable state; janitor re-queues. |
| Free trial accounts farm AI credits | Cumulative cost cap per IP + Turnstile + low budget on new accounts. |
| Credits balance computed by SUM on every check | Materialized view or running-total column. |
| Outcome billing without dispute window | 5-7 day dispute window before finalizing invoice. |
| Destructive tool runs without confirmation | All destructive tools default to L3 (confirm-each). |
| User retries → double charge | Idempotency key on the LLM call hashed by content. |
| Cache invalidates correctly except for one path | Tag cached entries with version; bump version on writes. |
| Provider rate-limited → cascading timeout | Circuit breaker + fast fallback + user-visible "system busy" banner. |
| Eval score looks great but production quality bad | Production sampling → real user feedback → keep the eval set honest. |

22. πŸ“‹ Cheat Sheet

Architecture rules

  • Every LLM call goes through the Gateway. No direct provider SDK calls in business code.
  • Every call carries workspace_id, user_id, feature, and request_id.
  • Every call is hashed for idempotency.
  • Every call is captured in llm_trace.
  • Every call is metered into the credit ledger.
  • Every prompt is in a file, versioned, with at least one eval example.
  • Every tool has a JSON Schema + permission check + audit flag.
  • Every cache key includes workspace_id (and user_id for personalized output).
  • Every long-running agent has a heartbeat + resumable state + cost cap.

Defaults

Setting Default
Per-call timeout 60 s (chat), 30 s (extraction), 5 min (agent)
Max tokens per response 4096
Provider retry 1 attempt + 1 fallback
Daily token budget (free) 50,000 tokens
Daily token budget (pro) 2,000,000 tokens
Eval set minimum 20 examples to ship; 100 to deprecate
Trace retention 7 days raw, 30 days redacted
Semantic cache cosine threshold 0.97
Embedding model text-embedding-3-small or voyage-3-lite (cheap, fast)
Default chat model "smart" alias β†’ mid-tier (Sonnet / GPT-5)
Confirmation required All destructive tools, all spend > $1, all external sends

The model alias table (review every quarter)

fast:      claude-haiku-4-5      | gpt-5-mini       | gemini-2-flash
smart:     claude-sonnet-4-6     | gpt-5            | gemini-2-pro
reasoning: claude-opus-4-7       | o3               | gemini-2-pro-thinking
embed:     voyage-3-lite         | text-embedding-3-small
rerank:    voyage-rerank-2       | cohere-rerank-3

Update model IDs as new versions ship. The alias names stay stable; the mapping moves.

Schema additions on top of base SaaS template

agent
agent_run
llm_call_log     -- partitioned by month
llm_trace        -- partitioned by day
credit_ledger
credit_balance   -- materialized view
prompt_version   -- if you go DB-driven instead of file-driven
tool_call        -- audited tool invocations
pending_action   -- human-in-the-loop queue
chunk            -- RAG chunks with embeddings
semantic_cache
eval_example
eval_run

KPIs to track from day one

  • AI feature DAU / WAU
  • Cost per active workspace (per day, per month)
  • Cache hit rate (prompt cache + semantic cache)
  • p95 streaming time-to-first-token
  • p95 full response time
  • Eval score per prompt over time
  • Thumbs up / thumbs down ratio
  • Provider availability / fallback rate
  • Cost-to-revenue ratio per workspace (red flag if > 30%)

Hard rules (non-negotiable)

  • No LLM call without a budget check.
  • No prompt change without an eval run.
  • No tool call without a permission check.
  • No cached response across tenants.
  • No destructive action without a confirmation policy.
  • No long-running run without a heartbeat + cost cap.
  • No raw PII in long-term trace storage.
  • No hardcoded provider model names in business logic.
  • No streaming endpoint that can't be cancelled.
  • No AI feature without observability (llm_trace + cost dashboard).

πŸ’­ Closing Thought

The "SaaSpocalypse" framing misses the practical truth: AI doesn't kill SaaS β€” it adds a new, expensive, non-deterministic dependency to it. Everything in your generic SaaS template still applies. This file is just the additional discipline you need when one component of your stack has variable cost, variable quality, and variable failure modes.

If you internalize four things:

  1. The Gateway is the keystone β€” every call goes through it.
  2. Prompts are code β€” versioned, tested, reviewed.
  3. Cost caps before launch β€” never optional.
  4. Evals before prompt changes β€” your only defense against silent quality drift.

…you can build an AI SaaS that doesn't surprise you with bills, doesn't degrade silently, and doesn't leak across tenants. The rest is detail.

Now go ship.

