
πŸ€– The AI SaaS Playbook (Practical Edition)πŸ“˜

Companion to πŸš€ The SaaS Template Playbook πŸ“–. That file covers everything every SaaS needs. This file covers what changes β€” and what's new β€” when AI is core to the product.

Practical-first. Code snippets, decision tables, real defaults, no buzzwords. If a section doesn't help you ship next week, it doesn't belong here.


πŸ“‹ Table of Contents

  1. ⚑ The Shift in 60 Seconds
  2. 🎯 Pick One: AI-Native vs AI-Augmented
  3. πŸ—οΈ Reference Architecture
  4. πŸ€– Agents as First-Class Actors
  5. πŸ”Œ The LLM Gateway (Provider Abstraction)
  6. πŸ“ Prompts as Code
  7. πŸ› οΈ Tools, Function Calling & MCP
  8. 🧠 Memory & RAG (the practical version)
  9. πŸ“ Structured Outputs
  10. πŸ’§ Streaming UX
  11. πŸ’΅ Cost Control, Budgets & Model Routing
  12. 🧾 Outcome-Based & Metered Pricing β€” the implementation
  13. βœ… Evals β€” how to actually test agents
  14. πŸ”­ Observability for Agents
  15. ⚑ Caching (Prompt + Semantic)
  16. πŸ›‘οΈ Safety, Abuse & PII
  17. πŸ™‹ Human-in-the-Loop & Autonomy Levels
  18. ⏳ Long-Running Agent Jobs
  19. 🏒 AI-Specific Multi-Tenancy Concerns
  20. πŸ—ΊοΈ The 10-Phase Build Plan
  21. ⚠️ Pitfalls
  22. πŸ“‹ Cheat Sheet

1. ⚑ The Shift in 60 Seconds

What practically changes when AI becomes core:

| Dimension | Classic SaaS | AI SaaS |
| --- | --- | --- |
| Primary actor | Human user clicking UI | Agent making LLM calls + tool calls |
| Pricing | Per-seat / per-feature | Per-outcome / per-token / credit-based |
| Latency budget | < 500 ms p95 | Streaming partials in < 1 s; full response variable |
| Cost driver | Compute + DB | Token spend (often > infra cost) |
| Failure mode | 5xx, 4xx | "Wrong answer," hallucination, prompt injection |
| Testing | Unit + integration + E2E | + evals against ground-truth datasets |
| Observability | Logs + traces + errors | + prompt/response capture, replay, scoring |
| Auth boundary | User | + agent identity, scoped tokens, tool permissions |
| Audit | "Who did X" | + "Which prompt + model + tools produced X" |

The single biggest practical change: your largest variable cost is now tokens, not servers. Every architectural decision in this playbook is downstream of that fact.


2. 🎯 Pick One: AI-Native vs AI-Augmented

These are different products. Don't try to be both.

| Type | Definition | Examples | Pricing |
| --- | --- | --- | --- |
| AI-Native | Product is the AI. Without the model, there's nothing. | Cursor, Perplexity, ElevenLabs, Lovable | Usage / credit-based |
| AI-Augmented | Existing SaaS surface where AI is one feature among many. | Notion AI, Linear AI, Slack AI | Add-on or premium tier |

Decisions that flip:

| Question | AI-Native | AI-Augmented |
| --- | --- | --- |
| Where does AI failure show? | Whole product fails | Feature degrades; rest works |
| Eval coverage | Mandatory before launch | Per-feature; ship incrementally |
| Cost model | Pass-through with margin | Bundle into plan + soft caps |
| BYO API key | Often supported | Rare |
| Model picker | Often user-visible | Hidden behind feature |

For the rest of this playbook, patterns work for both β€” but if you're AI-native, treat Β§11 (cost), Β§13 (evals), and Β§16 (safety) as launch blockers, not nice-to-haves.


2.1. πŸšͺ Two Starting Points: Greenfield vs Retrofit

The rest of this playbook describes the patterns. This section is about the sequence β€” what you build first depends on whether you're starting clean or layering AI onto a product that already has paying customers. Both paths converge on the same target architecture (Β§3); they differ in what you build first and what you can defer.

🌱 Greenfield: building a new AI SaaS

You have no legacy code, no existing tenants, no in-flight migrations. The temptation is to build Β§3 in parallel. Don't β€” primitives have an order.

  • Decide AI-Native vs AI-Augmented (Β§2) before anything else. It changes pricing, eval scope, and whether AI failure breaks the product. Skipping the decision is how products end up neither.
  • Build the Gateway (Β§5) in week one β€” even if it wraps a single provider with a single model. Every primitive in this playbook assumes calls flow through one chokepoint. Adding it first is ~300 lines; adding it later is a refactor across every feature.
  • Model aliases (smart / fast / reasoning) from day one. Never let raw provider model IDs leak into business code, even in the prototype. Model deprecations are constant.
  • One feature deep before going wide. Take your most differentiated AI surface end-to-end through Gateway β†’ prompts-as-code β†’ trace β†’ eval β†’ cost cap before starting a second. Five shallow surfaces produce five things you can't trust.
  • Cost caps in Phase 1, not Phase 6. Trivial to add when there's no usage; painful when real customers depend on the limits.
  • Evals from day one β€” even with five examples. The muscle matters more than the coverage. Teams that defer evals never start them.
  • Defer until you have evidence: agent runtime (Β§4), MCP servers (Β§7.4), semantic cache (Β§15.2), credit ledger (Β§12.2), outcome-based billing (Β§12.5). Real patterns, but most products ship without them for the first six months.

The shortest viable path: Β§20 phases 1, 2, 5, 6, 8 in the first two weeks. Add the rest when a feature actually demands them.

πŸ”§ Retrofit: adding AI to an existing SaaS

You already have auth, tenancy, billing, audit, and an observability stack. Most of Β§3 exists in non-AI form β€” you're adding the AI primitives, not rebuilding the platform. The risk isn't under-building; it's over-building and destabilizing what already works.

  • Pick the smallest user-visible AI surface first. "Summarize this," "draft a reply," "classify this ticket." Not "rebuild our core flow as an agent." Small surfaces are reversible.
  • Gateway as sidecar, not refactor. Land pkg/llm/ (or a new service) alongside the existing code, behind a feature flag. Don't touch parts of the codebase the AI feature doesn't need.
  • Reuse, don't replace, the boring infrastructure. Existing tenancy, RBAC, billing, audit, and rate-limit middleware should wrap AI calls the same way they wrap any other request. Re-implementing them "AI-aware" is how you introduce inconsistencies that take 18 months to find.
  • Minimum new tables: llm_trace + llm_call_log. Defer agent, agent_run, credit_ledger, pending_action, semantic_cache until a feature actually needs them.
  • Cost cap on day one, even if the feature is free. A workspace-level token ceiling protects you from runaway loops in the prototype. Easier now than after a $10k week.
  • Capture traces before you build evals. Every AI call writes to llm_trace from the first deploy. By the time feature two ships, you have real production examples to seed an eval set β€” no synthetic data needed.
  • Update support and ops workflows before launch. CS needs read access to llm_trace before the first "the AI said something weird" ticket. Oncall needs the cost dashboard before the first runaway-bill alert.
  • Two common traps: AI-ifying too many surfaces at once (ship one well, then expand), and treating AI as a pure-engineering project (pricing, support, and legal need to ship alongside the feature).

The shortest viable path: Β§20 phases 1, 5, 6, 8 β€” Gateway, streaming UX on one surface, cost caps, trace capture. Skip prompts-as-code and evals until you have a second prompt to compare against; the first one is just learning.


3. πŸ—οΈ Reference Architecture

[Client]
   │  prompt + context
   ▼
[App API]  ───►  [LLM Gateway]  ───►  [LLM provider(s)]
   │                  │
   │             prompt cache │ semantic cache
   │             rate limit   │ fallback
   │             cost meter   │ provider routing
   ▼
[Tool registry] ◄────┐
   │                 │
   ▼                 │ tool calls
[App services / DB / external APIs]
   │
   ├──► [Vector DB] ──── embeddings worker
   ├──► [Eval store]
   └──► [Trace store] ── prompt+response capture

The LLM Gateway is the keystone. Every model call goes through it β€” no direct SDK calls scattered through your codebase. It's where you implement caching, cost metering, fallback, and provider abstraction.

You can build it in ~300 lines (see Β§5) or use one off the shelf:

Option When to use
Build it (300–800 LoC) You want full control, native to your stack
LiteLLM (Python, OSS) You want OpenAI-compatible proxy across 100+ providers, fast
Portkey / Helicone / OpenRouter You want managed gateway with dashboards
Vercel AI SDK You're TS-only and want streaming primitives

Recommendation: build a thin one if you're Go-native (pkg/llm/), use LiteLLM if you're Python-heavy.


4. πŸ€– Agents as First-Class Actors

If your platform deploys agents (autonomous or user-launched), treat them like users in your data model. The Multica deep-dive captures the canonical pattern β€” polymorphic actor fields.

4.1 Schema

-- Every "who did this" column gets a type + id pair
CREATE TABLE comment (
  id UUID PRIMARY KEY,
  workspace_id UUID NOT NULL,
  author_type TEXT NOT NULL CHECK (author_type IN ('user','agent','system','api_key')),
  author_id   UUID NOT NULL,
  content TEXT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE agent (
  id UUID PRIMARY KEY,
  workspace_id UUID NOT NULL,
  name TEXT NOT NULL,
  model TEXT NOT NULL,           -- "claude-sonnet-4-6", "gpt-5", ...
  system_prompt TEXT,
  tool_allowlist TEXT[],          -- which tools it can call
  daily_token_budget BIGINT,
  created_by UUID NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

4.2 Agent tokens (auth)

Agents authenticate with their own short-lived tokens, not the user's session.

// When a user kicks off an agent run:
agentToken := signJWT(jwt.Claims{
    Subject:    agent.ID,
    Issuer:     "your-app",
    Audience:   []string{"agent-runtime"},
    ExpiresAt:  time.Now().Add(2 * time.Hour),
    NotBefore:  time.Now(),
    CustomClaims: map[string]any{
        "workspace_id": workspaceID,
        "actor_type":   "agent",
        "kicked_off_by": userID,
        "tool_scope":   agent.ToolAllowlist,
    },
})

Why short-lived: an agent token is a bearer credential running on someone's machine. Soon after the run finishes, that token should be useless, not valid for days.

4.3 Audit log

Every audit row records both the agent and the human who kicked it off:

audit_log:
  actor_type = "agent"
  actor_id = <agent_uuid>
  on_behalf_of_user_id = <user_uuid>   -- the human who launched this run
  action = "issue.update"
  metadata = { model: "...", run_id: "...", trace_id: "..." }

This is what makes "the AI did X to my data" auditable months later.

4.4 Build vs. use an agent framework

Sooner or later you'll ask whether to write the agent loop yourself or pull in a framework. Decide on the criteria, not the feature list β€” frameworks rebrand quarterly.

Three real questions:

  1. Are you prototyping or productionizing? Frameworks excel at the first 80% (loop, tool calls, retries, basic memory). The last 20% β€” tenant-scoped budgets, cancellation, audit logs, replay, your domain's exact tool semantics β€” is where most teams hit framework walls and rip them out.
  2. How vendor-locked are you willing to be? Every framework has an opinion (OpenAI's Responses API, LangChain's runnables, Google's Vertex contract). Once your prompts and tools are shaped by that opinion, switching costs are real.
  3. What language is your backend? Most agent frameworks are Python-first. If you're a Go/TS shop, the calculus changes β€” a thin custom orchestrator on top of the LLM Gateway (Β§5) is often less code than a Python sidecar.

The landscape (as of 2026 β€” verify before adopting; this space churns):

| Framework | Language | Sweet spot | When to skip |
| --- | --- | --- | --- |
| OpenAI Agents SDK | Python (TS preview) | You're OpenAI-first, want handoffs/guardrails baked in, and the Responses API model fits your shape. | You need provider-agnostic routing or strict structured outputs from non-OpenAI models. |
| LangGraph (LangChain) | Python, TS | Stateful, graph-shaped agent flows with explicit nodes + checkpoints. Good for "agent that pauses for human approval, resumes later." | Simple linear tool-loop agents; LangGraph is overkill and the LangChain abstractions leak. |
| CrewAI | Python | Multi-agent role-play scenarios ("researcher hands to writer hands to editor"). Easy to demo. | Production single-agent workflows; its abstractions optimize for the demo, not the long tail. |
| Google ADK / Vertex AI Agent Builder | Python (Java/Go SDKs) | You're already on GCP, want managed deployment + Gemini-native, and need enterprise IAM/audit out of the box. | You're not on GCP; lock-in is high. |
| Pydantic AI | Python | Type-first, FastAPI-style ergonomics, model-agnostic. Closest thing to "if I'd written it myself." | TS/Go shops. |
| Mastra | TypeScript | First-class TS agent framework with workflows, evals, and memory baked in. | Python-only shops; smaller ecosystem than LangChain/LangGraph. |
| Vercel AI SDK | TypeScript | Streaming-first UX primitives (useChat, streamText) for Next.js apps. Not really an "agent framework"; it's the rendering layer. | Backend agent orchestration. |
| Custom on top of the LLM Gateway | Any | You have an opinion about tool shape, memory, budgeting, and want to own them. ~500-1500 LoC. | Greenfield prototyping where time-to-first-demo matters more than the final architecture. |

Template recommendation: start with a custom orchestrator on top of pkg/llm/ (Β§5) β€” the agent loop is ~200 lines of Go and gives you exact control over multi-tenancy, budgets, and audit. Reach for a framework only when you hit a specific pattern it solves better (LangGraph for graph-shaped pause/resume flows, OpenAI Agents SDK if you've fully committed to Responses API + handoffs).

Whatever you pick, the framework is an implementation detail of the worker β€” your API surface, DB schema (Β§4.1), audit log (Β§4.3), and observability (Β§14) stay framework-agnostic. Swapping LangGraph for OpenAI Agents SDK should be a worker-side rewrite, not a platform rewrite.


5. πŸ”Œ The LLM Gateway (Provider Abstraction)

5.1 The interface (Go)

package llm

type ChatRequest struct {
    Messages    []Message
    Model       string         // "claude-sonnet-4-6", "gpt-5", "gemini-2-pro", "auto"
    Tools       []Tool
    Stream      bool
    JSONSchema  json.RawMessage // for structured outputs
    MaxTokens   int
    Temperature float64
    
    // Tracking
    WorkspaceID string
    UserID      string
    Feature     string  // e.g. "summarize", "agent.codegen"
    IdemKey     string
}

type ChatResponse struct {
    ID       string
    Model    string
    Choices  []Choice
    Usage    TokenUsage
    Provider string
    Cached   bool
    DurationMs int64
}

type Gateway interface {
    Chat(ctx context.Context, req ChatRequest) (ChatResponse, error)
    ChatStream(ctx context.Context, req ChatRequest) (<-chan StreamEvent, error)
    Embed(ctx context.Context, model string, texts []string) ([][]float32, error)
}

5.2 What goes inside Chat() β€” the layered pipeline

1. Validate + normalize (model alias resolution)
2. Check budget        ─► reject if over cap
3. Check prompt cache  ─► return cached response if hit
4. Check semantic cache─► return semantic match if cosine > 0.97
5. Pick provider       ─► routing rules (model name β†’ provider)
6. Call provider with timeout + retry
7. On failure: fallback to secondary provider
8. Capture trace       ─► async write to trace store
9. Meter usage         ─► async increment in Redis + Stripe
10. Return response
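
A minimal Go sketch of that pipeline, as a fragment of the gateway implementation built on the §5.1 types. The helper names (resolveAlias, checkBudget, route, promptCache, traces, meter) are illustrative, not a library API, and the semantic-cache step is elided:

func (g *gateway) Chat(ctx context.Context, req ChatRequest) (ChatResponse, error) {
    req.Model = g.resolveAlias(req.Model) // 1. "smart" -> concrete model ID

    if err := g.checkBudget(ctx, req.WorkspaceID); err != nil { // 2. reject if over cap
        return ChatResponse{}, err
    }
    if resp, ok := g.promptCache.Get(req.IdemKey); ok { // 3. exact-match cache
        resp.Cached = true
        return resp, nil
    }
    primary, fallback := g.route(req.Model) // 5. routing rules -> provider clients

    callCtx, cancel := context.WithTimeout(ctx, 60*time.Second)
    defer cancel()
    resp, err := primary.Chat(callCtx, req) // 6. call provider with timeout
    if err != nil && fallback != nil {
        resp, err = fallback.Chat(callCtx, req) // 7. exactly one fallback
    }
    if err != nil {
        return ChatResponse{}, err
    }
    go g.traces.Write(req, resp)                // 8. async trace capture
    go g.meter.Add(req.WorkspaceID, resp.Usage) // 9. async metering
    return resp, nil
}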

5.3 Provider routing

# llm-routing.yaml
models:
  fast:
    primary: { provider: anthropic, model: claude-haiku-4-5 }
    fallback: { provider: openai, model: gpt-5-mini }
  smart:
    primary: { provider: anthropic, model: claude-sonnet-4-6 }
    fallback: { provider: openai, model: gpt-5 }
  reasoning:
    primary: { provider: anthropic, model: claude-opus-4-7 }
    fallback: { provider: openai, model: o3 }
  cheap:
    primary: { provider: google, model: gemini-2-flash }

Code calls gateway.Chat(ctx, ChatRequest{Model: "smart", ...}). The gateway resolves the alias to the actual model. Never hardcode a provider's exact model name in business logic; you'll regret it the day prices change or a model is deprecated.

5.4 Fallback rules

  • Fall back on timeout / 5xx / rate limit β€” not on bad output (that's an eval problem).
  • Cap retries at 1 fallback to avoid stacking latency.
  • Log every fallback as a metric (llm.fallback.count) so you can detect provider issues.

5.5 Idempotency for LLM calls

Two LLM calls with identical input shouldn't get charged twice. Hash (workspaceID, model, messages, tools, jsonSchema) β†’ cache key. TTL 24h. Saves real money during retries and frontend double-clicks.
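
A sketch of that hash on top of the §5.1 request type (the function name is illustrative). Everything that can change the output goes into the key, so retries and double-clicks collapse onto one cached response:

import (
    "crypto/sha256"
    "encoding/hex"
    "encoding/json"
)

// idemKey derives a cache key from every field that affects the response.
func idemKey(req ChatRequest) string {
    h := sha256.New()
    for _, part := range []any{req.WorkspaceID, req.Model, req.Messages, req.Tools, req.JSONSchema} {
        b, _ := json.Marshal(part) // deterministic for structs/slices in fixed field order
        h.Write(b)
        h.Write([]byte{0}) // separator so adjacent fields cannot collide
    }
    return hex.EncodeToString(h.Sum(nil))
}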


6. πŸ“ Prompts as Code

Treat prompts like SQL queries: version-controlled, testable, parameterized β€” never inline strings.

6.1 Filesystem layout

prompts/
  summarize/
    v1.md
    v2.md
    eval.jsonl       # ground-truth examples
    schema.json      # input variables
  agent/codegen/
    system.v3.md
    eval.jsonl

6.2 Loader with variable substitution

// prompts/loader.go
type Prompt struct {
    Name    string
    Version string
    Body    string  // with {{.var}} placeholders (text/template syntax)
}

func (p Prompt) Render(vars map[string]any) (string, error) {
    tmpl, err := template.New(p.Name).Parse(p.Body)
    if err != nil {
        return "", err
    }
    var buf bytes.Buffer
    if err := tmpl.Execute(&buf, vars); err != nil {
        return "", err
    }
    return buf.String(), nil
}

6.3 Versioning rule

  • Every prompt has a version (v1, v2, summarize.v3).
  • Old versions stay in the repo β€” you'll need them to reproduce historical outputs and run regression evals.
  • The active version is selected by config or feature flag, not by replacing the file.
# config.yaml
prompts:
  summarize: "summarize/v3"
  codegen:   "agent/codegen/system.v2"

6.4 What goes in a prompt vs in a tool

| Belongs in prompt | Belongs in a tool |
| --- | --- |
| Persona, format rules, examples | Anything that needs current data |
| Stable how-to instructions | Anything that mutates state |
| Output schema | Anything that should be auditable |

If the prompt embeds data that changes hourly, you have a stale-context bug waiting to happen. Push it to a tool call.

6.5 Don't ship prompts longer than they need to be

  • Every extra token costs money + adds latency.
  • Move stable instructions to system prompt; ship per-call deltas only.
  • Use prompt caching (Β§15) for the stable prefix.

7. πŸ› οΈ Tools, Function Calling & MCP

7.1 Tool registry pattern

type Tool struct {
    Name        string
    Description string
    Schema      json.RawMessage  // JSON Schema for input
    Handler     func(ctx context.Context, input json.RawMessage) (string, error)
    Permissions []string          // RBAC permissions required
    Audited     bool              // log every call to audit_log
}

var Registry = map[string]Tool{}

func Register(t Tool) { Registry[t.Name] = t }

7.2 The execution loop

agent calls tool β†’ gateway dispatches β†’ handler runs with the agent's permissions β†’
  result back to model β†’ next round

Critical: the tool runs as the agent's identity, not the user's. Use the agent token's claims for authz checks.

7.3 Tool authorization

Two layers:

  1. Allowlist on the agent: agent.tool_allowlist = ["search", "read_issue", "comment"]. Agent can only call tools on its list.
  2. Per-call permission check: Can(actorAgent, "issue.update", issue). Same Can() helper from your generic SaaS playbook (Β§6.3).

Don't skip layer 2 even if the agent passes layer 1 β€” multi-tenancy bugs hide here.
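
A sketch of both layers in one helper. The Agent shape and the Can() signature are assumptions (Can comes from the generic playbook); Registry is the tool registry from §7.1:

import (
    "context"
    "fmt"
    "slices"
)

func authorizeToolCall(ctx context.Context, agent Agent, toolName string, target any) error {
    // Layer 1: the agent-level allowlist.
    if !slices.Contains(agent.ToolAllowlist, toolName) {
        return fmt.Errorf("tool %q is not on the agent's allowlist", toolName)
    }
    // Layer 2: per-call permission check against the concrete resource,
    // with the agent (not the launching user) as the actor.
    for _, perm := range Registry[toolName].Permissions {
        if !Can(ctx, agent, perm, target) {
            return fmt.Errorf("agent lacks permission %q", perm)
        }
    }
    return nil
}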

7.4 MCP servers

Model Context Protocol is the emerging standard for exposing tools to LLM clients (Claude Desktop, Cursor, IDEs). For an AI SaaS, expose two MCP surfaces:

| Surface | Audience | Auth |
| --- | --- | --- |
| Public MCP server | External clients (Claude Desktop, Cursor, ChatGPT integrations) | OAuth or API key |
| Internal MCP server | Your own agent runtimes | Workspace-scoped agent token |

Implementing MCP is ~200 LoC of JSON-RPC over stdio or HTTP. SDKs exist for Python, TS, Go.

7.5 Dangerous tools need confirmation

For destructive tools (delete, send email, post to Slack, run code, charge a card):

agent: "I'd like to call delete_issue with id=123"
runtime: pause + emit confirmation_required event
user: clicks "approve"
runtime: resume + execute

Implement this with a pending_tool_call table and a WebSocket push. Default destructive tools to require confirmation. See Β§17 (Human-in-the-Loop).

7.6 Tool output budget

Don't dump 100 KB of search results into the model. Tools should:

  • Cap output at a sensible token budget (e.g., 4 KB).
  • Provide pagination + summarization.
  • Return IDs the model can re-query for detail.

Otherwise you'll burn context and money.

7.7 Code execution: never on your infra, always sandboxed

If your agent runs LLM-generated code (python_exec, run_sql, execute_shell), it executes in an ephemeral, network-isolated, secret-free sandbox. Don't roll your own β€” the failure mode is "agent root-shells your prod box."

| Sandbox | Type | Sweet spot |
| --- | --- | --- |
| E2B | Managed (also self-hostable) | Default. Per-request micro-VMs in ~150 ms cold-start, Python/Node/Bash/filesystem, file mount, language-native SDKs. Drop-in for "ChatGPT Code Interpreter-style" tools. |
| Modal / Daytona | Managed | Heavier, longer-lived sandboxes for jobs that need a real workspace (data analysis, repo modifications). |
| Cloudflare Workers / Sandboxed iframes | Self-host | Pure-JS evaluation when the workload is small and trusted. |
| Firecracker microVMs | DIY | You have an infra team and want full control. Most teams should not pick this. |

E2B is the recommended template default β€” it maps cleanly to the tool registry pattern (Β§7.1): one tool, one sandbox per call, output capped via Β§7.6, all wrapped in the usual audit log.


8. 🧠 Memory & RAG (the practical version)

8.1 Three kinds of memory, three different solutions

| Kind | TTL | Storage | Example |
| --- | --- | --- | --- |
| Conversational | This session | In-memory + Postgres | Chat history within a thread |
| Episodic | Per workspace, long-lived | Postgres | "User said their team is on PG 16" |
| Semantic / RAG | Knowledge base | Vector DB | Company docs, past tickets |

Don't conflate them. They have different access patterns and different invalidation rules.

Memory frameworks (when DIY gets tedious):

| Tool | Type | Sweet spot | Watch out for |
| --- | --- | --- | --- |
| Mem0 | OSS + managed (Apache 2.0) | Drop-in user/agent memory layer with add() / search() / update(). Auto-extracts and dedupes facts. Best when you want "give the agent a memory" without building the schema yourself. | Opinionated about extraction prompts; works best on chat-shaped data. |
| Letta (formerly MemGPT) | OSS, self-host (Apache 2.0) | Stateful agents with hierarchical memory (core memory, archival memory, recall) and OS-style page-in/page-out. Strong for long-lived persistent agents. | Heavier abstraction; the agents are the memory, so it's harder to bolt onto an existing app. |
| OpenViking (Volcengine / ByteDance) | OSS, Python-first | Unifies memories + resources + skills under a filesystem paradigm (viking:// URIs) with three-tier context loading (L0/L1/L2) to cut tokens, plus directory-recursive retrieval that combines vector search with hierarchical navigation. Interesting fit when you have structured knowledge (multi-doc workspaces, skill libraries) where flat RAG loses information. | License: AGPLv3 on the main project (CLI/examples are Apache 2.0), a hard blocker for many closed-source SaaS legal teams; verify with counsel before adopting. Younger project, smaller community than Letta/Mem0. |
| DIY on Postgres + pgvector | - | You already have the multi-tenancy/audit/RLS plumbing and your "memory" is mostly extracted facts (a memory table with kind, payload, embedding, workspace_id). Most templates land here. | Accept that you're building extraction + dedupe yourself. |

Recommendation: start DIY (one memory table next to chunk), add Mem0 if extraction/dedupe becomes the bottleneck, reach for Letta if you're building agent-as-product where the agent has its own persistent identity across months. Consider OpenViking when your context is hierarchically structured (e.g., per-project knowledge bases with skills + resources) and AGPLv3 is acceptable for your distribution model.

8.2 RAG, the boring version that works

Most AI SaaS RAG pipelines are over-engineered. Start here:

1. Chunk documents at semantic boundaries (paragraphs / sections; ~500 tokens)
2. Generate embeddings via cheap model (text-embedding-3-small, voyage-3-lite)
3. Store in Postgres + pgvector with metadata (workspace_id, doc_id, chunk_index)
4. Hybrid retrieval: BM25 (pg_trgm/FTS) + vector (cosine) β†’ reciprocal rank fusion
5. Re-rank top 50 with a cross-encoder (Cohere Rerank, Voyage rerank-2) β†’ top 8
6. Stuff into prompt with citation tokens

You don't need a dedicated vector DB until ~5M chunks. pgvector + HNSW handles that comfortably and saves you a service.
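
Step 4's fusion is small enough to own outright. A sketch of reciprocal rank fusion over the two result lists (k = 60 is the conventional constant; inputs are chunk IDs ordered best-first):

import "sort"

// rrf merges ranked lists: each appearance contributes 1/(k + rank).
func rrf(k float64, lists ...[]string) []string {
    scores := map[string]float64{}
    for _, list := range lists {
        for i, id := range list {
            scores[id] += 1.0 / (k + float64(i+1))
        }
    }
    merged := make([]string, 0, len(scores))
    for id := range scores {
        merged = append(merged, id)
    }
    sort.Slice(merged, func(a, b int) bool { return scores[merged[a]] > scores[merged[b]] })
    return merged
}

// usage: fused := rrf(60, bm25IDs, vectorIDs)  // then re-rank the top 50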

8.3 Chunking that doesn't suck

  • Don't split mid-sentence.
  • Keep section headings with the chunk.
  • For code: split by symbol (function/class), not by line count.
  • Add a chunk header: [Doc: X / Section: Y] so the model has context even out of order.

8.4 Embeddings worker

Embeddings are async. Never block a write on embedding generation.

1. User saves doc β†’ INSERT into doc + INSERT into outbox
2. Embeddings worker reads outbox β†’ calls embedding API in batches β†’ UPSERT into chunk
3. Mark outbox row done

Batch sizes of 100 are usually optimal across providers.
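
A sketch of one worker pass over the outbox. The outbox/chunk column names are illustrative, the embedding call goes through the §5 gateway, and the vector is written as a pgvector text literal to avoid a driver dependency:

import (
    "context"
    "database/sql"
    "strconv"
    "strings"
)

func embedOutboxBatch(ctx context.Context, db *sql.DB, gw Gateway) error {
    rows, err := db.QueryContext(ctx, `
        SELECT id, chunk_id, content FROM outbox
        WHERE kind = 'embed' AND done = false
        ORDER BY created_at LIMIT 100`)
    if err != nil {
        return err
    }
    defer rows.Close()

    var outboxIDs, chunkIDs, texts []string
    for rows.Next() {
        var id, chunkID, content string
        if err := rows.Scan(&id, &chunkID, &content); err != nil {
            return err
        }
        outboxIDs = append(outboxIDs, id)
        chunkIDs = append(chunkIDs, chunkID)
        texts = append(texts, content)
    }
    if len(texts) == 0 {
        return nil // nothing pending; caller sleeps and polls again
    }

    vectors, err := gw.Embed(ctx, "embed", texts) // one batched call
    if err != nil {
        return err // outbox rows stay undone and are retried next pass
    }
    for i, vec := range vectors {
        if _, err := db.ExecContext(ctx,
            `UPDATE chunk SET embedding = $1::vector WHERE id = $2`,
            vectorLiteral(vec), chunkIDs[i]); err != nil {
            return err
        }
        if _, err := db.ExecContext(ctx,
            `UPDATE outbox SET done = true WHERE id = $1`, outboxIDs[i]); err != nil {
            return err
        }
    }
    return nil
}

// vectorLiteral renders []float32 as pgvector's text form: "[0.1,0.2,...]".
func vectorLiteral(v []float32) string {
    parts := make([]string, len(v))
    for i, f := range v {
        parts[i] = strconv.FormatFloat(float64(f), 'f', -1, 32)
    }
    return "[" + strings.Join(parts, ",") + "]"
}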

8.5 Multi-tenancy in vectors

Every chunk row has workspace_id. Every query filters by it. It's tempting to skip this for "shared knowledge" β€” don't. Mistakes here become headlines.

For pgvector:

CREATE INDEX ON chunk USING hnsw (embedding vector_cosine_ops);
-- queries always include WHERE workspace_id = $1

8.6 When to invalidate

  • Source doc changed β†’ re-chunk, re-embed (delete old chunks first).
  • Source doc deleted β†’ cascade delete chunks.
  • Embedding model changed β†’ full re-embed (don't mix model versions in one index).

8.7 RAG is a search problem first

The single biggest improvement in any RAG system is better retrieval β€” not bigger context windows, not cleverer prompts. Run search-quality evals (recall@k, MRR) before tuning prompts.

8.8 Ingestion: don't write your own scraper

For any RAG that pulls from the open web or customer-hosted docs, the ingestion step is where most engineering time disappears (rendering JS, dealing with PDFs, deduping, cleaning boilerplate).

| Tool | Type | Sweet spot |
| --- | --- | --- |
| Crawl4AI | OSS, Python | LLM-shaped output by default: Markdown + structured chunks, JS rendering via Playwright, sitemap + multi-page crawl, async. Default pick for "give me clean docs from a URL list." |
| Firecrawl | Managed (OSS option) | Same shape, hosted. Pay per page; saves you running headless browsers. |
| Unstructured.io | OSS + managed | Best for PDFs, Office docs, emails; strong layout-aware parsing. Pair with Crawl4AI for the web side. |
| LlamaParse | Managed | High-quality PDF/table extraction; expensive but accurate on hard documents. |

Whatever ingestor you pick, it runs in a worker (Β§18) that emits to the same outbox + embeddings pipeline (Β§8.4) β€” your RAG indexing path stays one shape.


9. πŸ“ Structured Outputs

When you need machine-readable output (extracting fields, generating UI, calling code), use JSON mode + JSON Schema β€” not regex on free text.

9.1 The pattern

schema := `{
  "type": "object",
  "properties": {
    "title": { "type": "string", "maxLength": 120 },
    "priority": { "enum": ["low","med","high"] },
    "due_date": { "type": "string", "format": "date" }
  },
  "required": ["title", "priority"]
}`

resp, _ := gateway.Chat(ctx, ChatRequest{
    Model: "smart",
    JSONSchema: json.RawMessage(schema),
    Messages: []Message{ ... },
})

var issue IssueDraft
json.Unmarshal([]byte(resp.Choices[0].Content), &issue)

9.2 Validation belt-and-suspenders

Even with JSON mode, validate server-side. Models occasionally produce schema-shaped-but-invalid output (wrong enum, out-of-range number). Use the same Zod / pydantic schema you'd use for human-submitted data.

9.3 When JSON mode isn't enough

  • Cross-field constraints ("if A then B"): validate, reject, retry once with the validation error in the prompt.
  • Generated data that needs DB references (foreign keys): post-process to resolve names β†’ IDs, fail loudly if unresolved.

9.4 Higher-level structured-output libraries

If you find yourself writing the same "schema β†’ prompt β†’ parse β†’ validate β†’ retry" loop in multiple places, lift it.

| Tool | Language | Sweet spot | Watch out for |
| --- | --- | --- | --- |
| Instructor | Python (also JS, Go, Elixir ports) | Pydantic-first wrapper around OpenAI/Anthropic/etc. Define a BaseModel, get type-safe outputs with automatic retries on validation failure. The default for Python AI SaaS. | Couples your code to the Instructor abstraction; bare SDK calls remain available so the lock-in is shallow. |
| BAML | Cross-language (TS, Python, Ruby, Go via codegen) | A small DSL for prompts + schemas that compiles to typed clients. Great for teams with many prompts and a strong typing culture; treats prompts like API definitions. | New tool to learn, separate .baml files in your repo, codegen step in CI. |
| TypeChat (Microsoft) | TypeScript | Small, focused on TS-first apps; schema is a TS type, validator regenerates on parse failure. | Less active than Instructor/BAML; fewer providers wrapped. |
| Outlines / LMQL | Python | Constrained decoding (the model literally cannot emit invalid JSON/regex). Useful for local/self-hosted models without native JSON mode. | Provider-side JSON mode is now table stakes; this matters mainly for OSS model deployments. |

Template recommendation: Python services β†’ Instructor. Multi-language teams or strong "prompts-as-API" culture β†’ BAML. Otherwise: bare JSON Schema (Β§9.1) + the same Zod/pydantic schema you already use for HTTP validation (Β§22.5 in the main playbook) is enough.


10. πŸ’§ Streaming UX

Users tolerate 30-second LLM responses only if they see progress. Streaming is non-negotiable for any chat-like surface.

10.1 The transport

Direction Use
Server β†’ client SSE (text/event-stream) β€” simpler, plays nicer with HTTP/2 + edges
Bidirectional needed (cancel, mid-stream input) WebSocket

Default to SSE.

10.2 The event taxonomy (steal this)

event: token            data: { content: "Hello" }           // text delta
event: thinking         data: { content: "Considering..." } // reasoning delta
event: tool_use         data: { name: "search", input: {...} }
event: tool_result      data: { name: "search", output: "..." }
event: status           data: { stage: "retrieving" }
event: error            data: { code: "rate_limited", message: "..." }
event: done             data: { usage: { input: 100, output: 250 } }

Mirror the structure across providers. The frontend should render the same components regardless of backend.

10.3 Cancellation

Streaming MUST be cancellable. When user closes the tab, navigates away, or clicks "stop":

const ctrl = new AbortController()
fetch("/api/chat", { signal: ctrl.signal })
// later
ctrl.abort()

Server-side: detect ctx.Done() and abort the upstream LLM call. Don't keep paying for tokens the user no longer wants.
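
A server-side sketch in Go on top of the §5 interface (the StreamEvent fields and buildRequest are assumptions): the request context is cancelled when the client aborts or disconnects, and passing that same context upstream is what stops the token spend.

import (
    "fmt"
    "net/http"
)

func handleChatStream(gw Gateway) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        ctx := r.Context() // cancelled on client abort, tab close, or disconnect

        events, err := gw.ChatStream(ctx, buildRequest(r)) // upstream call shares ctx
        if err != nil {
            http.Error(w, err.Error(), http.StatusBadGateway)
            return
        }
        w.Header().Set("Content-Type", "text/event-stream")
        w.Header().Set("Cache-Control", "no-cache")
        flusher, _ := w.(http.Flusher)

        for {
            select {
            case <-ctx.Done():
                return // client went away; the provider call aborts via the same ctx
            case ev, ok := <-events:
                if !ok {
                    return // stream finished normally
                }
                fmt.Fprintf(w, "event: %s\ndata: %s\n\n", ev.Type, ev.Data)
                if flusher != nil {
                    flusher.Flush()
                }
            }
        }
    }
}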

10.4 Token-by-token UI

  • Render incrementally with no animation delay.
  • Markdown rendering: parse-as-you-go (libraries: marked-react, streaming-markdown).
  • Code blocks: syntax-highlight progressively or buffer until ``` closes.
  • Show a "stop" button while streaming, "regenerate" button after.

11. πŸ’΅ Cost Control, Budgets & Model Routing

The single biggest operational mistake in AI SaaS: deploying without budget caps and waking up to a $40,000 bill.

11.1 Three layers of caps

[Tenant cap]   workspace.daily_token_budget          → 402 if exceeded
[User cap]     user.daily_request_budget             → 429 if exceeded
[Per-call cap] max_tokens on the request             → enforced by provider

All three. Always.

11.2 Real-time budget check

Hot path can't query Stripe or sum a Postgres table. Use Redis:

key: budget:{workspace_id}:{YYYY-MM-DD}
op:  INCRBY <tokens>
ttl: 36h

After every call, increment by usage.input_tokens + usage.output_tokens. Before every call, check GET against the workspace's daily limit.
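
A sketch using go-redis (the key shape matches the layout above; failing open vs. closed on a Redis error is a policy choice):

import (
    "context"
    "errors"
    "fmt"
    "time"

    "github.com/redis/go-redis/v9"
)

var ErrBudgetExceeded = errors.New("workspace daily token budget exceeded")

func budgetKey(workspaceID string) string {
    return fmt.Sprintf("budget:%s:%s", workspaceID, time.Now().UTC().Format("2006-01-02"))
}

// Before the call: compare today's counter against the workspace limit.
func checkBudget(ctx context.Context, rdb *redis.Client, workspaceID string, dailyLimit int64) error {
    used, err := rdb.Get(ctx, budgetKey(workspaceID)).Int64()
    if err != nil && !errors.Is(err, redis.Nil) {
        return err
    }
    if used >= dailyLimit {
        return ErrBudgetExceeded
    }
    return nil
}

// After the call: add the actual usage and keep the 36h TTL.
func meterUsage(ctx context.Context, rdb *redis.Client, workspaceID string, tokens int64) error {
    pipe := rdb.Pipeline()
    pipe.IncrBy(ctx, budgetKey(workspaceID), tokens)
    pipe.Expire(ctx, budgetKey(workspaceID), 36*time.Hour)
    _, err := pipe.Exec(ctx)
    return err
}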

11.3 Soft-fail UX

Don't just reject the request with a bare error code. When near the cap:

  • Banner: "You're at 80% of your daily AI budget."
  • At 100%: inline upgrade prompt β€” "Upgrade to Pro for 10x credits."
  • Reset hourly/daily based on plan.

11.4 Model routing for cost

Cheapest model that meets the bar. Real heuristic:

func routeModel(taskKind string) string {
    switch taskKind {
    case "classify", "extract", "rewrite":
        return "fast"      // Haiku / mini
    case "summarize", "answer", "draft":
        return "smart"     // Sonnet / GPT-5
    case "agent", "code", "reasoning":
        return "reasoning" // Opus / o3
    default:
        return "smart"
    }
}

Then run evals (§13) per task kind to verify the cheap model holds quality. Most calls land on fast; 90% of the cost lives in 10% of the tasks.

11.5 Cost dashboard (build this)

Per-workspace daily spend, per-feature breakdown, per-model breakdown. Without this you can't price your product.

CREATE TABLE llm_call_log (
    id UUID PK,
    workspace_id UUID,
    user_id UUID,
    feature TEXT,
    model TEXT,
    provider TEXT,
    input_tokens INT,
    output_tokens INT,
    cached_tokens INT,
    cost_usd_micros BIGINT,  -- store in micros to avoid float
    cache_hit BOOL,
    duration_ms INT,
    created_at TIMESTAMPTZ
);
-- Partition by month if volume is high

Materialized views (refreshed hourly) for the dashboard.

11.6 BYO key

For power users, support "bring your own API key." Stored encrypted, used as a passthrough.

workspace.byok = { provider: "anthropic", key_encrypted: "..." }

Two benefits: no margin pressure on heavy users, lets enterprises use their existing AI vendor relationship.


12. 🧾 Outcome-Based & Metered Pricing β€” the implementation

The "per-outcome" pricing trend is real but often misunderstood. You still bill per unit of work β€” the unit is just bigger than a seat.

12.1 Three patterns that actually work

| Pattern | Example | Best for |
| --- | --- | --- |
| Credits | "1,000 AI credits/mo, top-up $5 = 500 more" | Mixed-feature products |
| Per-call | "$0.05 per generation" | Single high-value output |
| Per-task / per-outcome | "$2 per resolved ticket" | Agentic / replacement-of-labor |

12.2 The credit ledger

Keep a single ledger. Every consuming feature debits; every plan/topup credits.

CREATE TABLE credit_ledger (
    id UUID PRIMARY KEY,
    workspace_id UUID NOT NULL,
    delta BIGINT NOT NULL,        -- +N for grant, -N for usage
    reason TEXT NOT NULL,         -- "plan_grant" | "topup" | "feature.summarize" | "agent.run"
    metadata JSONB,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Materialized view for current balance
CREATE MATERIALIZED VIEW credit_balance AS
SELECT workspace_id, SUM(delta) AS balance
FROM credit_ledger GROUP BY workspace_id;

Refresh credit_balance after every write. Or use a running_total column with row-level locking on the latest entry.
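
A sketch of the running-total variant: an advisory lock serializes debits per workspace, and running_total is the extra column this variant adds on top of the §12.2 schema (both the column and the helper name are illustrative):

import (
    "context"
    "database/sql"
    "errors"
)

var ErrInsufficientCredits = errors.New("insufficient credits")

func debitCredits(ctx context.Context, db *sql.DB, workspaceID string, amount int64, reason string) error {
    tx, err := db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    defer tx.Rollback()

    // Serialize concurrent debits for this workspace.
    if _, err := tx.ExecContext(ctx,
        `SELECT pg_advisory_xact_lock(hashtext($1))`, workspaceID); err != nil {
        return err
    }

    var current int64
    err = tx.QueryRowContext(ctx, `
        SELECT running_total FROM credit_ledger
        WHERE workspace_id = $1
        ORDER BY created_at DESC LIMIT 1`, workspaceID).Scan(&current)
    if errors.Is(err, sql.ErrNoRows) {
        current = 0
    } else if err != nil {
        return err
    }
    if current < amount {
        return ErrInsufficientCredits
    }

    if _, err := tx.ExecContext(ctx, `
        INSERT INTO credit_ledger (id, workspace_id, delta, reason, running_total)
        VALUES (gen_random_uuid(), $1, $2, $3, $4)`,
        workspaceID, -amount, reason, current-amount); err != nil {
        return err
    }
    return tx.Commit()
}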

12.3 Mapping tokens to credits

Don't expose tokens to users β€” they don't care and pricing changes break their mental model. Convert internally:

func tokensToCredits(model string, in, out int) int64 {
    cost := costUSDMicros(model, in, out)
    return cost / pricePerCreditMicros // e.g., 1 credit = $0.001
}

Show users credits. Track tokens internally for cost analysis.

12.4 Stripe metered billing

For usage-based, push usage to Stripe daily (not per call):

// nightly cron
for _, ws := range workspaces {
    usage := sumYesterdaysUsage(ws.ID)
    stripe.UsageRecords.New(&stripe.UsageRecordParams{
        SubscriptionItem: &ws.UsageItemID,
        Quantity:         &usage,
        Timestamp:        &yesterday,
        Action:           stripe.UsageRecordActionSet,
    })
}

12.5 Outcome-based billing (the hard one)

For "$2 per resolved ticket," you need:

  • A definition of "resolved" the customer agrees to.
  • An immutable record of each outcome (outcome table).
  • A dispute window (5–7 days).
  • A finalize-and-bill cron after the window.

Don't sell outcome-based until you have eval coverage on what counts as "outcome." Disputes will eat you alive otherwise.


13. βœ… Evals β€” how to actually test agents

This is where most AI SaaS quality dies. Implement evals before launch, not after.

13.1 The simplest useful eval

# evals/summarize.jsonl
{"input": "...long article...", "expected_must_contain": ["climate", "policy"]}
{"input": "...", "expected_must_contain": ["..."]}
# evals/run.py
def score(output, expected):
    return all(term.lower() in output.lower() for term in expected["expected_must_contain"])

# Run nightly + on every PR that touches prompts/

Start with 20 hand-written examples. Add 1 more every time a user reports a bad output. In 3 months you have 100 β€” enough to catch real regressions.

13.2 Eval categories

| Type | Method | When |
| --- | --- | --- |
| Exact match / contains | String compare | Extraction, classification |
| Schema validity | JSON Schema validate | Structured output |
| Reference comparison | BLEU / ROUGE / embedding similarity | Translation, summarization |
| LLM-as-judge | Stronger model scores output | Open-ended quality |
| Human review | Manual labels on samples | Subjective quality, safety |
| A/B in production | Compare metrics across variants | Final word |

LLM-as-judge is fast and useful but biased. Cross-check with human labels on a sample. Don't ship a judge prompt without validating it.

13.3 Regression evals on every prompt change

# .github/workflows/evals.yml
on:
  pull_request:
    paths:
      - "prompts/**"
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python evals/run.py
      - run: python evals/compare.py --base main --head HEAD

Block merges if quality drops by N% on the eval set. This is the closest thing to unit testing for LLMs.

13.4 Capture production outputs as eval data

Sample 1% of production calls (with PII scrubbed) into your eval store. Periodically promote interesting ones to ground-truth labeled examples. The longer you run, the better your eval set gets.

13.5 Tools

| Tool | Type | Sweet spot |
| --- | --- | --- |
| Promptfoo | OSS, YAML-driven, fast | Great default. Run from CI, diff prompts side-by-side, web UI for inspection. The "Jest for prompts." |
| DeepEval | OSS, Python (pytest-native) | If your team writes pytest already. Bundles 14+ metrics (faithfulness, hallucination, answer-relevancy, G-Eval), runs as @pytest.mark.eval decorators. |
| Ragas | OSS, Python | The standard for RAG-specific evals: context precision/recall, faithfulness, answer correctness. Pair with Promptfoo/DeepEval for end-to-end coverage. |
| Braintrust | Hosted | Dashboards, team workflows, dataset versioning, prompt-iteration UX. Best when you have 3+ engineers iterating on prompts. |
| Langfuse | OSS + hosted | Evals + observability in one tool: re-run a production trace as an eval, score it, version the prompt. Pairs perfectly with §14.5. |
| LangSmith | Hosted | If you're using LangChain anyway. |
| OpenAI Evals | OSS framework, Python | Reference framework if you want to stay close to OpenAI's eval philosophy. |
| DIY | 200 LoC + a JSONL file | Often best for the first 6 months. |

Recommendation: start with a JSONL file + a make eval script (Β§13.1). Add Promptfoo the day you have >20 cases. Add Ragas the day you ship RAG. Add Langfuse the day you want production traces and evals to live in the same database.


14. πŸ”­ Observability for Agents

Standard observability (logs/metrics/traces) plus LLM-specific signals.

14.1 Capture every LLM call

CREATE TABLE llm_trace (
    id UUID PK,
    request_id TEXT,         -- correlates to your APM trace
    workspace_id UUID,
    feature TEXT,
    model TEXT,
    messages_hash TEXT,
    messages JSONB,          -- full prompt for replay
    response JSONB,          -- full response
    tools JSONB,             -- tool calls + results
    usage JSONB,
    latency_ms INT,
    cost_usd_micros BIGINT,
    cache_hit BOOL,
    score FLOAT,             -- user thumbs up/down or eval score
    created_at TIMESTAMPTZ
);
-- Heavy table; partition by day, drop after 30–90 days

Make this searchable in your admin tool. "Show me the last 10 chat completions for workspace X" should be one click β€” that's how you debug "why did the AI say something weird?"

14.2 Signals to plot on Grafana

  • p50 / p95 / p99 latency per model
  • Token throughput per minute
  • Cost per minute (broken down by feature + workspace)
  • Cache hit rate (prompt cache + semantic cache)
  • Error rate per provider
  • Fallback rate
  • Eval score over time (if you score in production)

14.3 Trace IDs across the stack

Every LLM call gets a trace ID that flows: API β†’ gateway β†’ provider β†’ tool calls β†’ DB. When a customer says "this answer was wrong," you find that trace ID and see exactly what happened.

14.4 User feedback signal

Thumbs up/down on every AI-generated output. Persist in llm_trace.score. Aggregate weekly. The directional signal is gold even with 1% response rate.

14.5 Don't build the trace UI yourself β€” pick an LLM observability tool

The llm_trace schema in Β§14.1 is what you need; the UI to search/replay/diff/score it is what you don't want to build. Wire one of these as the destination for trace exports (most have OTel-compatible ingestion, so the LLM Gateway emits once and you swap dashboards by config).

| Tool | Type | Sweet spot | Watch out for |
| --- | --- | --- | --- |
| Langfuse | OSS, self-host or cloud | Default recommendation. Open-source, generous free cloud tier, drop-in for the llm_trace schema, evals + prompt management + datasets in one tool. SDKs for Python/TS/Go. | Self-hosting Postgres + ClickHouse adds ops burden; use cloud until trace volume justifies it. |
| LangSmith | Managed (LangChain) | You're already deep in LangChain/LangGraph: tightest integration, best replay UX for graph agents. | Lock-in to LangChain abstractions; pricing scales with trace volume. |
| Helicone | OSS, self-host or cloud | Lightest-touch: works as an HTTP proxy in front of OpenAI/Anthropic, so zero SDK changes. Great for getting to "I can see my LLM calls" in 10 minutes. | Proxy model means it sits on the request path; budget for the latency hop. |
| Arize Phoenix | OSS, self-host | Strong eval + drift detection, OTel-native. Good for ML-heavy teams that already speak Arize. | Less polished trace replay UX than Langfuse/LangSmith. |
| Braintrust | Managed | Eval-first workflow with great prompt-iteration UX (diff prompts, run on dataset, compare). | Smaller community than Langfuse. |
| Logfire (Pydantic) | Managed | If you're already on Pydantic AI, it Just Works: OTel-native, great Python ergonomics. | Python-shaped. |

Template recommendation: start with Langfuse cloud β€” free tier covers prototype volume, matches the llm_trace schema almost 1-for-1, and self-hosting later is a config flip, not a migration. Add Helicone in front of providers if you want zero-code-change observability before you've wired the gateway.

The LLM Gateway (Β§5) is where this integration lives β€” one writer, many destinations. Your handler code stays unchanged.


15. ⚑ Caching (Prompt + Semantic)

Two distinct caches with different rules.

15.1 Prompt cache (provider-managed)

Anthropic, OpenAI, and Google all support prompt caching now. Use it always for stable prefixes.

# Anthropic example
client.messages.create(
    model="claude-sonnet-4-6",
    system=[
        {"type": "text", "text": large_system_prompt, "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{"role": "user", "content": user_query}],
)

Rule of thumb: anything over 1024 tokens that you reuse should be cached. System prompts, tool schemas, few-shot examples, RAG context that doesn't change β€” all cacheable.

Cache hit ratio of 80%+ on a chat product is normal and a 10x cost reduction.

15.2 Semantic cache (your responsibility)

For high-volume, low-novelty queries (FAQ-style chatbots), cache by meaning, not exact match:

1. Embed query
2. Vector search recent cached responses for this workspace
3. If cosine > 0.97 AND same model AND same tools: return cached response
4. Else: call model, cache result with embedding
CREATE TABLE semantic_cache (
    id UUID PK,
    workspace_id UUID,
    feature TEXT,
    model TEXT,
    query_embedding vector(1536),
    response TEXT,
    hits INT DEFAULT 0,
    created_at TIMESTAMPTZ,
    expires_at TIMESTAMPTZ
);
CREATE INDEX ON semantic_cache USING hnsw (query_embedding vector_cosine_ops);
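
A lookup sketch in Go (pgvector's <=> operator is cosine distance, so similarity > 0.97 means distance < 0.03; vectorLiteral is the same tiny helper as in the §8.4 sketch):

import (
    "context"
    "database/sql"
)

func semanticCacheGet(ctx context.Context, db *sql.DB, workspaceID, feature, model string, queryEmb []float32) (string, bool) {
    var id, response string
    err := db.QueryRowContext(ctx, `
        SELECT id, response FROM semantic_cache
        WHERE workspace_id = $1 AND feature = $2 AND model = $3
          AND expires_at > now()
          AND query_embedding <=> $4::vector < 0.03
        ORDER BY query_embedding <=> $4::vector
        LIMIT 1`,
        workspaceID, feature, model, vectorLiteral(queryEmb)).Scan(&id, &response)
    if err != nil {
        return "", false // miss (or error): fall through to the model call
    }
    // Best-effort hit counter; not worth blocking the request path for.
    go db.ExecContext(context.Background(),
        `UPDATE semantic_cache SET hits = hits + 1 WHERE id = $1`, id)
    return response, true
}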

Caveats: semantic cache is dangerous for personalized output. Scope by (workspace_id, user_id) if responses include user-specific data.

15.3 What NOT to cache

  • Anything with current time / "today" semantics.
  • Anything with user-specific data unless scoped.
  • Tool-using calls where tool results vary.
  • Anything regulated (healthcare, legal, financial advice).

16. πŸ›‘οΈ Safety, Abuse & PII

16.1 Input filtering

Cheap, fast classifier on every user input:

  • Off-topic / spam
  • Prompt injection attempts ("ignore previous instructions...")
  • Disallowed content per your policy

OpenAI's moderation endpoint and Llama Guard are both cheap or free.

16.2 Prompt injection β€” the actual mitigations

Prompt injection isn't fully solved. Your best defenses:

  1. Treat tool outputs as untrusted. Never let a tool result execute another tool without re-validating against the user's intent.
  2. Strict tool allowlists per agent. A summarizer doesn't need a delete_data tool.
  3. Confirm destructive actions. Β§17.
  4. Don't reflect tool output verbatim into another LLM call as instructions. Use clear delimiters and instruct the model to treat tool output as data.
  5. Audit all tool calls. When an injection succeeds, you'll need the trace.
  6. Sandbox code execution. If your agent runs arbitrary code, it runs in an ephemeral container with no network egress and no secrets. Use E2B or equivalent (Β§7.7) β€” never your own infra.

16.2a Red-team your prompts before users do

You can't reason your way to "injection-proof." You have to attack it.

| Tool | Type | Sweet spot |
| --- | --- | --- |
| NVIDIA garak | OSS, Python | The "nmap for LLMs." Probes for prompt injection, jailbreaks, encoding attacks, training-data leakage, malware generation, hallucinated package names. Runs against any provider via a plugin model. Run on every model upgrade and every system-prompt change. |
| PyRIT (Microsoft) | OSS, Python | Microsoft's automated red-teaming framework: multi-turn attacks, chained prompts, scenario-based testing. Heavier than garak; better for structured engagements. |
| promptfoo redteam | OSS | Adversarial test generation built into your existing eval suite. Lower setup cost if you already use Promptfoo. |
| Lakera Guard / Prompt Armor | Managed | Runtime injection detection as a sidecar; pair with your input filter. |

Bake garak into CI β€” run a curated probe set on every PR that touches prompts or agent tools. Treat findings the way you'd treat OWASP ZAP results: known accepted risks documented, regressions block the merge.

16.3 Output filtering

Before showing AI output to a user (especially in customer-facing AI), filter for:

  • PII leakage (the model regurgitating training data)
  • Toxicity
  • Hallucinated URLs (validate links resolve before rendering)
  • Hallucinated function calls / API names that don't exist

16.4 PII scrubbing for telemetry

You will store prompts in llm_trace. Some prompts contain PII. Either:

  • Don't store the raw prompt β€” store a hash + a redacted version.
  • Store but encrypt β€” the production team can't read it without a break-glass procedure.
  • Tiered retention β€” raw 7 days, hashed 30 days.

16.5 Abuse: rate limits + cost limits + content limits

Beyond per-call rate limits:

  • Cumulative cost cap per IP / per signup-day (catch credit-card-stuffing attacks).
  • Block / ratelimit based on signup recency (account age < 24h gets stricter limits).
  • Cloudflare Turnstile / hCaptcha on signup.

The most common attack pattern in 2025–2026: trial accounts mass-created to scrape free LLM credits. Defend at signup.


17. πŸ™‹ Human-in-the-Loop & Autonomy Levels

Define autonomy levels per tool/action and let workspace admins set policy.

17.1 Five levels

| Level | Behavior | Example |
| --- | --- | --- |
| L1 - Suggest | Agent suggests; human executes | "Draft this email for me" |
| L2 - Auto-with-undo | Agent acts; user can undo | "Apply formatting" |
| L3 - Confirm-each | Agent proposes; human approves each step | "Refactor across files" |
| L4 - Confirm-once | Human approves a plan; agent executes | "Process this batch of tickets" |
| L5 - Fully autonomous | Agent runs; audit log only | "Reply to FAQ tickets matching pattern X" |

17.2 Implementation

CREATE TABLE pending_action (
    id UUID PK,
    workspace_id UUID,
    agent_id UUID,
    user_id UUID,            -- who must approve
    tool TEXT,
    input JSONB,
    rationale TEXT,
    status TEXT,             -- pending | approved | rejected | expired
    expires_at TIMESTAMPTZ,
    created_at TIMESTAMPTZ
);

Agent calls "execute_with_approval" β†’ row inserted β†’ WS push to user β†’ user clicks approve β†’ row updates β†’ agent resumes via wakeup.
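
The approval side of that flow is one conditional UPDATE plus a wakeup. A sketch (the pg_notify channel name is illustrative; any queue or pub/sub works as the wakeup mechanism):

import (
    "context"
    "database/sql"
    "errors"
)

func approvePendingAction(ctx context.Context, db *sql.DB, actionID, approverID string) error {
    res, err := db.ExecContext(ctx, `
        UPDATE pending_action
        SET status = 'approved'
        WHERE id = $1 AND user_id = $2 AND status = 'pending' AND expires_at > now()`,
        actionID, approverID)
    if err != nil {
        return err
    }
    if n, _ := res.RowsAffected(); n == 0 {
        return errors.New("action is not pending, has expired, or is not yours to approve")
    }
    // Wake the paused run so the worker re-reads pending_action and executes the tool.
    _, err = db.ExecContext(ctx, `SELECT pg_notify('agent_wakeup', $1)`, actionID)
    return err
}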

17.3 Defaults that won't get you sued

  • All destructive tools default to L3.
  • All tools that send external messages (email, Slack, social) default to L3 for the first 100 uses per agent, then L4 (per-batch approval).
  • All tools that spend money default to L3 with a confirmation modal showing the amount.
  • Workspace admins can override defaults; users on the workspace cannot.

18. ⏳ Long-Running Agent Jobs

LLM-based jobs can run for minutes or hours. Don't try to do this in the request path.

18.1 The pattern

1. POST /api/agents/run β†’ 202 Accepted, returns run_id
2. Worker picks up the job, runs the agent loop
3. Worker streams progress events to a per-run channel
4. Client subscribes via WS or SSE: GET /api/agents/runs/{run_id}/events
5. On completion, worker writes result + emits completion event
6. Client can fetch full result via GET /api/agents/runs/{run_id}

18.2 Resumable runs

Agents can run for hours and survive worker restarts. Store enough state to resume:

CREATE TABLE agent_run (
    id UUID PK,
    workspace_id UUID,
    agent_id UUID,
    status TEXT,             -- queued | running | paused | completed | failed | cancelled
    current_step INT,
    state JSONB,             -- agent's working memory, last LLM session ID
    result JSONB,
    error TEXT,
    started_at TIMESTAMPTZ,
    completed_at TIMESTAMPTZ,
    last_heartbeat_at TIMESTAMPTZ
);

Worker writes last_heartbeat_at every 10 s. Janitor cron picks up rows with stale heartbeats and re-queues.
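
The janitor itself is one UPDATE. A sketch (the 60-second threshold assumes the 10-second heartbeat above):

import (
    "context"
    "database/sql"
)

// requeueStaleRuns moves runs whose worker stopped heart-beating back to the queue.
func requeueStaleRuns(ctx context.Context, db *sql.DB) (int64, error) {
    res, err := db.ExecContext(ctx, `
        UPDATE agent_run
        SET status = 'queued'
        WHERE status = 'running'
          AND last_heartbeat_at < now() - interval '60 seconds'`)
    if err != nil {
        return 0, err
    }
    return res.RowsAffected()
}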

18.3 Cancellation

User clicks "cancel" β†’ row status becomes cancelling β†’ worker checks the status every iteration β†’ sees cancelling β†’ cleans up + sets cancelled. The Multica pattern (Β§6.3) is the canonical example.

18.4 Cost guardrails on long runs

Every long run has a hard cost cap. When exceeded, the worker stops the agent loop, marks the run failed-budget-exceeded, refunds nothing, and emails the user.


19. 🏒 AI-Specific Multi-Tenancy Concerns

Building on Β§5 of the main playbook. Things you must handle that don't apply to non-AI SaaS:

19.1 Tenant context contamination

If you cache prompts or embeddings, scope every cache key by workspace_id. A cross-tenant cache hit is a customer-data leak.

19.2 Provider-side isolation

OpenAI, Anthropic, etc. don't see your tenants. They see you. So:

  • Track per-tenant usage yourself (the provider's usage dashboard is for you, not a per-customer audit trail).
  • Pass an opaque user_id field per call (most providers support it) to help abuse triage.
  • Don't pass real customer emails to providers.

19.3 Per-tenant model overrides

Some tenants want a specific model (compliance, regional latency, BYO API key). Your abstraction must support this:

workspace:
  ai_settings:
    model_override: "claude-sonnet-4-6"   # null β†’ use platform default
    byok: { provider: "openai", key_id: "..." }
    region: "eu"

19.4 Data residency

Enterprise tenants will ask "is my data sent to the US?" Have answers ready:

  • List which model providers / regions are used.
  • Support EU-only deployments by routing to EU endpoints (Anthropic Bedrock EU, OpenAI Azure EU, etc.).
  • Note any retention by the provider (most are zero-retention now, but check per-provider).

19.5 No-train guarantees

Default to opt-out of provider training. Every major provider now has zero-retention API tiers β€” use them. Document this in your DPA.


20. πŸ—ΊοΈ The 10-Phase Build Plan

Layered on top of the 14-phase plan in the main playbook. Run these phases after you have core auth + tenancy + billing in place β€” don't try to build AI-native without those foundations.

🌱 Phase 1 β€” LLM Gateway (2 days)

  • pkg/llm/ (or equivalent) β€” interface, provider adapters for one provider.
  • Basic call/stream/embed methods.
  • Token + cost metering writes to llm_call_log.
  • Idempotency by request hash.

Done when: you can call gateway.Chat(...) and see the call logged with cost.

πŸ“ Phase 2 β€” Prompts as Code (1 day)

  • prompts/ directory with versioned templates.
  • Loader + variable substitution.
  • Config-driven version selection.
  • One eval file per prompt with 20 examples.

Done when: changing a prompt requires a new file, the old one stays, and CI runs evals.

πŸ› οΈ Phase 3 β€” Tool Registry + One Real Tool (1 day)

  • Tool struct + registry.
  • One tool wired end-to-end (e.g., "search workspace docs").
  • Permission check enforced.
  • Tool calls audited.

Done when: an LLM call can request the tool, your code dispatches, and the audit log captures it.

🧠 Phase 4 β€” RAG (2 days)

  • pgvector enabled.
  • Chunking + embeddings worker.
  • Hybrid retrieval (BM25 + cosine + RRF).
  • Citation rendering in UI.

Done when: uploading a doc and asking a question returns an answer with cited chunks.

πŸ’§ Phase 5 β€” Streaming UX (1 day)

  • SSE endpoint.
  • Frontend hook that renders tokens as they arrive.
  • Cancel button propagates to upstream LLM call.
  • Markdown rendered progressively.

Done when: a 30-second response feels fast because tokens are flowing.

πŸ’΅ Phase 6 β€” Cost Caps + Credits (2 days)

  • Credit ledger table + balance materialized view.
  • Per-workspace daily budget check (Redis).
  • Stripe metered billing wired (daily push).
  • Cost dashboard in admin panel.

Done when: a workspace at quota gets a paywall instead of a runaway bill.

βœ… Phase 7 β€” Evals in CI (1 day)

  • Promptfoo or DIY runner.
  • Block PR merges that drop scores by > 5%.
  • Sample 1% of production calls into eval candidates table.

Done when: changing a prompt requires passing evals.

πŸ”­ Phase 8 β€” LLM Trace + Admin Replay (1 day)

  • llm_trace table populated for every call.
  • Admin panel page: search by workspace + user + feature.
  • One-click "rerun this prompt" for debug.
  • Thumbs up/down captured.

Done when: support can resolve "the AI said something wrong" tickets in < 5 min.

πŸ›‘οΈ Phase 9 β€” Safety Layer (1 day)

  • Moderation pre-check on user input.
  • PII scrubbing on stored traces.
  • Tool-allowlist per agent.
  • Destructive tools default to confirmation.

Done when: the obvious abuse vectors (prompt injection demos, NSFW input, free-credit scraping) all fail.

⏳ Phase 10 β€” Long-Running Agent Runs (2 days)

  • agent_run table + worker pool.
  • Resume on worker restart.
  • Cancellation propagation.
  • Per-run cost cap.
  • WS streaming of progress to UI.

Done when: a 5-minute agent task survives a worker restart and shows live progress.

Total: ~14 days for a single experienced engineer to layer AI-native primitives onto a working SaaS template.


21. ⚠️ Pitfalls

| Pitfall | Guardrail |
| --- | --- |
| Hardcoded provider model name in business logic | Always go through model: "smart" aliases via the gateway. |
| No daily token cap → runaway bill | Per-workspace Redis counter checked on every call. |
| Provider outage takes whole product down | Fallback provider configured per model alias. |
| Prompt change ships without testing | CI runs evals on prompts/ changes; block on regression. |
| Tool runs as user, not agent | Agent token's claims drive permission checks. |
| Tool output piped back into next prompt as instructions | Treat tool output as data; use clear delimiters. |
| RAG returns chunks from wrong tenant | workspace_id filter on every vector query. |
| Embeddings model upgraded mid-fleet → scoring chaos | Re-embed everything; don't mix model versions in one index. |
| Streaming endpoint can't be cancelled | Plumb client AbortController through to upstream LLM call. |
| LLM trace contains raw PII forever | Tiered retention: raw 7 days, redacted 30 days. |
| Semantic cache returns cross-user response | Scope cache key by (workspace_id, user_id). |
| Long-running agent dies on worker restart | Heartbeat + resumable state; janitor re-queues. |
| Free trial accounts farm AI credits | Cumulative cost cap per IP + Turnstile + low budget on new accounts. |
| Credits balance computed by SUM on every check | Materialized view or running-total column. |
| Outcome billing without dispute window | 5-7 day dispute window before finalizing invoice. |
| Destructive tool runs without confirmation | All destructive tools default to L3 (confirm-each). |
| User retries → double charge | Idempotency key on the LLM call hashed by content. |
| Cache invalidates correctly except for one path | Tag cached entries with version; bump version on writes. |
| Provider rate-limited → cascading timeout | Circuit breaker + fast fallback + user-visible "system busy" banner. |
| Eval score looks great but production quality bad | Production sampling → real user feedback → keep the eval set honest. |

22. πŸ“‹ Cheat Sheet

Architecture rules

  • Every LLM call goes through the Gateway. No direct provider SDK calls in business code.
  • Every call carries workspace_id, user_id, feature, and request_id.
  • Every call is hashed for idempotency.
  • Every call is captured in llm_trace.
  • Every call is metered into the credit ledger.
  • Every prompt is in a file, versioned, with at least one eval example.
  • Every tool has a JSON Schema + permission check + audit flag.
  • Every cache key includes workspace_id (and user_id for personalized output).
  • Every long-running agent has a heartbeat + resumable state + cost cap.

Defaults

Setting Default
Per-call timeout 60 s (chat), 30 s (extraction), 5 min (agent)
Max tokens per response 4096
Provider retry 1 attempt + 1 fallback
Daily token budget (free) 50,000 tokens
Daily token budget (pro) 2,000,000 tokens
Eval set minimum 20 examples to ship; 100 to deprecate
Trace retention 7 days raw, 30 days redacted
Semantic cache cosine threshold 0.97
Embedding model text-embedding-3-small or voyage-3-lite (cheap, fast)
Default chat model "smart" alias β†’ mid-tier (Sonnet / GPT-5)
Confirmation required All destructive tools, all spend > $1, all external sends

The model alias table (review every quarter)

fast:      claude-haiku-4-5      | gpt-5-mini       | gemini-2-flash
smart:     claude-sonnet-4-6     | gpt-5            | gemini-2-pro
reasoning: claude-opus-4-7       | o3               | gemini-2-pro-thinking
embed:     voyage-3-lite         | text-embedding-3-small
rerank:    voyage-rerank-2       | cohere-rerank-3

Update model IDs as new versions ship. The alias names stay stable; the mapping moves.

Schema additions on top of base SaaS template

agent
agent_run
llm_call_log     -- partitioned by month
llm_trace        -- partitioned by day
credit_ledger
credit_balance   -- materialized view
prompt_version   -- if you go DB-driven instead of file-driven
tool_call        -- audited tool invocations
pending_action   -- human-in-the-loop queue
chunk            -- RAG chunks with embeddings
semantic_cache
eval_example
eval_run

KPIs to track from day one

  • AI feature DAU / WAU
  • Cost per active workspace (per day, per month)
  • Cache hit rate (prompt cache + semantic cache)
  • p95 streaming time-to-first-token
  • p95 full response time
  • Eval score per prompt over time
  • Thumbs up / thumbs down ratio
  • Provider availability / fallback rate
  • Cost-to-revenue ratio per workspace (red flag if > 30%)

Hard rules (non-negotiable)

  • No LLM call without a budget check.
  • No prompt change without an eval run.
  • No tool call without a permission check.
  • No cached response across tenants.
  • No destructive action without a confirmation policy.
  • No long-running run without a heartbeat + cost cap.
  • No raw PII in long-term trace storage.
  • No hardcoded provider model names in business logic.
  • No streaming endpoint that can't be cancelled.
  • No AI feature without observability (llm_trace + cost dashboard).

πŸ’­ Closing Thought

The "SaaSpocalypse" framing misses the practical truth: AI doesn't kill SaaS β€” it adds a new, expensive, non-deterministic dependency to it. Everything in your generic SaaS template still applies. This file is just the additional discipline you need when one component of your stack has variable cost, variable quality, and variable failure modes.

If you internalize four things:

  1. The Gateway is the keystone β€” every call goes through it.
  2. Prompts are code β€” versioned, tested, reviewed.
  3. Cost caps before launch β€” never optional.
  4. Evals before prompt changes β€” your only defense against silent quality drift.

…you can build an AI SaaS that doesn't surprise you with bills, doesn't degrade silently, and doesn't leak across tenants. The rest is detail.

Now go ship.

