The Missing Layer: An AI Gateway Build-vs-Buy Playbook for 2026
Read Time 17 mins | Written by: Vinayak Bhagat
The AI gateway is the single layer that separates enterprises that survive AI cost shocks from the ones that don't — and it's the layer most still don't have. It sits between your applications and every LLM provider, enforcing token budgets, routing routine inference to cheaper models, halting compromised keys in under a minute, and emitting structured spend telemetry your provider dashboard can't.
The build-vs-buy decision comes down to four variables: per-team isolation, model-routing sophistication, audit-trail rigor, and time-to-production. Most enterprises should stand up an open-source gateway in 4–8 weeks and consider commercial replacement at the 12-month mark — not before.
Continuing the playbook. Our previous post documented why hyperscaler controls fail for AI workloads and introduced the 5-layer FinOps defense. This post is the deep-dive on Layer 3 — the layer most enterprises are still missing.
You Can't Fix AI Cost Runaway From Inside the Provider
Provider budgets reconcile on 28-day cycles. Anomaly detection has documented blind spots. Tier auto-upgrade overrides customer-set caps. A misconfigured loop or a leaked key can empty a budget in minutes — and the first signal arrives on the invoice.
You fix it with a control plane that lives between your applications and the model providers — an AI gateway. Every inference call from every app, agent, or pipeline routes through it. Token-level events flow out. Budgets enforce at request time. Compromised keys get killed in seconds, not days.
What an AI Gateway Actually Does
Strip away the marketing and there are six things a gateway must do. If a vendor or in-house build skips any of these, it's a wrapper, not a gateway.
Authentication & Key Vaulting
Every upstream API key (OpenAI, Anthropic, Google, Bedrock, Azure OpenAI) lives in the gateway, never in application code. Apps authenticate to the gateway with rotatable internal credentials.
Request Routing
Per-request decisions about which model handles which workload — based on tags, team identity, prompt characteristics, or fallback chains. Routine inference goes to Haiku-class / Gemini Flash. Frontier models reserved for tagged requests only.
Token Budget Enforcement
Hard caps per team, app, and environment, sliced by hour / day / month. Burst limits. Soft alerts before hard cutoffs. Aggregate headroom across providers — because a $5K/day budget split across AWS, Anthropic direct, and OpenAI can't be tracked anywhere else.
Spend Telemetry
Structured events for every call: model, input tokens, output tokens, latency, cost in provider currency and normalized to USD, caller identity, request tags. Streamed to your warehouse in seconds, not waiting on 28-day provider reconciliation.
Kill Switch
A single API call (or single button) halts all traffic from a specific key, team, or globally. Sub-60-second propagation. Independent of provider response times.
Audit Trail
Every prompt and response, retained per your governance policy. Hash-and-index for sensitive payloads. Searchable for incident response, prompt-injection forensics, and compliance review.
4 Variables That Settle Build vs. Buy
Skip vendor demos. Decide on these four first. If 3 of 4 variables point one direction, that's your answer. If the split is 2/2, build first and replace later — replacement is straightforward when the application interface is standardized.
Variable 1: Per-Team Isolation
| Your need | Decision signal |
|---|---|
| Single team, single use case | Build |
| 3+ teams sharing a budget, no chargeback yet | Build with team tags |
| 5+ teams with formal chargeback / showback | Buy or build with strong tagging discipline |
| Regulated business-unit isolation | Buy enterprise-grade with multi-tenancy primitives |
Variable 2: Model-Routing Sophistication
| Your need | Decision signal |
|---|---|
| One model, occasional fallback | Build — a 50-line FastAPI proxy works |
| 2-tier routing (cheap default / premium tagged) | Build with LiteLLM or similar |
| Semantic routing (by prompt content) | Buy or invest 4–6 engineer-weeks |
| Adaptive routing (accuracy/cost feedback) | Buy — 12+ months to in-house parity |
Variable 3: Audit-Trail Rigor
| Your need | Decision signal |
|---|---|
| Engineering observability only | Build — write events to S3 or BigQuery |
| Internal audit / SOX-adjacent | Build with append-only storage + integrity hash |
| External regulator (FINRA, MAS, FCA) or SOC 2 II | Buy with retention, RBAC, certified data residency |
| EU AI Act high-risk classification | Buy and procure compliance attestations upfront |
Variable 4: Time-to-Production
| Your pressure | Decision signal |
|---|---|
| 6+ months runway, strong platform team | Build |
| 8 weeks to first-pass control | Open-source gateway (LiteLLM, Helicone OS) |
| Production-grade in 4 weeks, no platform team | Commercial gateway |
| Yesterday | Commercial — and renegotiate the AI cost ceiling with finance in parallel |
4 Weeks to a Working Open-Source Gateway
If you're building, the open-source ecosystem is mature enough that you're integrating, not inventing.
Foundation
- Stand up LiteLLM or Helicone (open-source) behind your load balancer.
- Move all upstream API keys into the gateway's secret store.
- Issue internal API keys per team / per environment.
- Update one pilot application to call the gateway instead of the provider directly.
Budgets & Routing
- Define team budgets (start with monthly, layer in daily later).
- Wire model-routing rules: default to a cheap model, escalate on premium tag.
- Add a kill-switch endpoint protected by ops-only IAM.
Telemetry
- Stream every gateway event to your warehouse (Snowflake, BigQuery, Redshift).
- Build dashboards: spend per team, per model, per environment, anomaly bands.
- Pipe alerts to Slack / PagerDuty when a team exceeds 80% of monthly budget.
Migration
- Cut over remaining applications. OpenAI-API-compatible interface keeps the swap one-line per app.
- Tabletop: simulate a compromised key, trigger the kill switch, measure end-to-end response time.
- By end of week 4 you should hit ±15% spend-forecast accuracy. If not, the gap is tagging discipline — not the gateway.
Procurement Criteria That Actually Matter
- Provider coverage. OpenAI, Anthropic (direct + via Bedrock), Google (direct + Vertex), Azure OpenAI, AWS Bedrock, and at least one open-source path. OpenAI-only vendors will force you into a second gateway in 12 months.
- Self-host option. Hosted-only is acceptable early; regulated industries need self-host. Confirm no feature gap between the two modes.
- Latency overhead. Under 50ms p99 added for non-streaming, under 200ms for streaming first-token. Anything higher breaks real-time apps.
- Pricing model. Per-request pricing scales nastily with AI volume. Prefer flat-rate or per-team licensing.
- Standards compliance. OpenAI API on the application side. OpenTelemetry on the telemetry side. Proprietary interfaces trade hyperscaler lock-in for startup lock-in.
- SOC 2 Type II + ISO 27001. Non-negotiable for enterprise.
- Lock-in cost. If swapping the gateway is "weeks of re-instrumentation," walk.
The Side-by-Side
| Dimension | Build (Open-Source) | Buy (Commercial) |
|---|---|---|
| Time to first production traffic | 4–8 weeks | 1–3 weeks |
| Year-1 cost (50M req/mo) | ~$80K engineering + $20K infra | $120K–$400K license + infra |
| Latency overhead (p99) | 10–30 ms | 20–80 ms |
| Provider coverage | All (write the adapter if missing) | Vendor-dependent — verify before signing |
| Compliance (SOC 2, EU AI Act) | You own the audit work | Vendor-provided (verify scope) |
| Routing sophistication ceiling | Whatever you can implement | Higher (mature commercial products) |
| Switching cost | Low — code is yours | Medium — re-instrumentation required |
| Risk of vendor disappearing | Zero | Real — vet the cap table |
4 Mistakes Enterprises Make at This Layer
Mistake #1: Letting each team run its own proxy
Per-team proxies kill the only value the gateway provides: cross-team observability and aggregate budget control. One gateway, one source of truth.
Mistake #2: Treating the gateway as a developer convenience
It's a financial control plane. Finance, security, and engineering co-own it. If finance can't pull the kill switch, you don't have a kill switch — you have a hopeful API endpoint.
Mistake #3: Skipping application interface standardization
If apps call the gateway with a proprietary contract, you've recreated lock-in one layer up. Force OpenAI-API-compatible on every application — it makes migration between gateways (or models) a one-line change.
Mistake #4: Believing the gateway is a security boundary
A gateway is a cost control plane. Prompt injection, jailbreaks, and data exfiltration still happen through it. Pair with a prompt-shielding layer (Lakera, Robust Intelligence, in-house) before treating it as security infrastructure.
Cloud, FinOps, and GenAI — Delivered as One Stack
Cloud Solutions
Gateway architecture, deployment (open-source or commercial), multi-cloud integration, and the IAM & key-rotation pipelines underneath.
FinOps & Financial Intelligence
Token-budget design, chargeback model, anomaly modeling on gateway telemetry. The spend warehouse your CFO actually trusts.
Generative AI Consulting
Model-routing strategy, prompt-shielding integration, governance framework, and the kill-switch tabletop with your finance, security, and engineering leads.
Data & Analytics
The Snowflake / BigQuery / Redshift pipelines that turn gateway events into the spend dashboard finance signs off on.
Is an AI Gateway Right for Your Stack?
Two-hour engagement. We walk your AI traffic patterns, run the 4-variable worksheet against your constraints, and deliver a build-vs-buy recommendation with an 8-week implementation plan within five business days. No obligation.
Book a 30-Minute Discovery Call →- Ontrac's prior post: Stopping Runaway AI Cloud Bills — the 5-layer FinOps defense this gateway sits inside.
- LiteLLM open-source proxy — the most common starting point for a build path.
- Helicone open-source observability proxy — pairs well with LiteLLM or replaces it.
- OpenTelemetry specification — the telemetry standard your gateway should emit.