A production AI feature that does what it promises and does not balloon your inference bill. Architecture chosen against the actual problem (retrieval-augmented generation for grounded answers, an agent loop for multi-step tools, a fine-tuned classifier for narrow tasks); evaluation harness running before any prompt ships, so quality regressions get caught at CI time rather than at customer-complaint time; cost and latency budgets agreed at scope and enforced at every model call.
Architecture decision record (RAG vs agent vs classifier vs hybrid)
Evaluation harness with golden-dataset + LLM-as-judge passes in CI
Vector store and embedding pipeline (when RAG is the architecture)
Tool layer with structured-output schema validation (when agent is the architecture; a validation sketch follows this list)
Prompt layer with version control, A/B testing, and rollout flags
Model router: cheap default, expensive fallback, by-feature overrides
Cost dashboard tracking spend per feature, per user cohort, per query type
Latency budget enforced via streaming responses and parallelised tool calls
Observability: trace every call, log every prompt + completion, alert on regression
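For illustration, here is a minimal sketch of the tool-layer validation step, assuming Pydantic for the schemas. The tool name, argument model, and registry below are placeholders, not part of any deliverable; the point is that a model-proposed tool call never executes until its arguments pass a typed schema check.

```python
# Sketch of structured-output validation for tool calls, assuming Pydantic.
# The tool name and argument model are illustrative placeholders.
from pydantic import BaseModel, ValidationError

class SearchOrdersArgs(BaseModel):
    customer_id: str
    status: str = "any"
    limit: int = 10

TOOL_SCHEMAS = {"search_orders": SearchOrdersArgs}

def validate_tool_call(name: str, raw_args: dict):
    """Reject a model-proposed tool call unless its arguments match the schema."""
    schema = TOOL_SCHEMAS.get(name)
    if schema is None:
        raise ValueError(f"Unknown tool: {name}")
    try:
        return schema(**raw_args)  # typed, validated arguments, safe to execute
    except ValidationError as err:
        # Return the error to the model for a retry instead of executing the call.
        raise ValueError(f"Invalid arguments for {name}: {err}") from err
```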
Foundations
Every AI feature integration engagement inherits the four UX Studio foundations.
Schema graph wired at every URL. Core Web Vitals budget agreed at scope. Crawler-access policy across 18 named AI crawlers. Schema-per-page rather than templated copies. The full foundations grid lives on the UX Studio overview.
Model choice is an architecture decision, not a preference. Anthropic Claude leads on long-context reasoning and tool use; OpenAI on multi-modal work and agent ecosystems; Google Gemini on cost-per-token at scale and on tasks where Google Search grounding matters; open-weight models (Llama, Mistral) on data residency, or where the cost floor needs to sit below what hosted providers offer. Most production builds end up routing across two or three providers depending on the request shape, with a model-router layer hiding the choice from product code.
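As a rough illustration of that routing layer: the sketch below dispatches on request shape so product code only ever talks to one interface. The provider slots and the shape heuristic are hypothetical, and a real build would wrap each vendor SDK behind the same `complete()` method.

```python
# Illustrative model-router sketch. The provider slots and the request-shape
# heuristic are assumptions; each vendor SDK sits behind the same interface.
from dataclasses import dataclass

@dataclass
class Request:
    prompt: str
    needs_tools: bool = False
    needs_vision: bool = False
    context_tokens: int = 0

class Provider:
    def complete(self, req: Request) -> str:
        raise NotImplementedError

def route(req: Request, providers: dict[str, Provider]) -> Provider:
    """Pick a provider by request shape; product code never sees the vendor."""
    if req.needs_vision:
        return providers["multimodal"]           # multi-modal-strong model
    if req.needs_tools or req.context_tokens > 100_000:
        return providers["long_context_tools"]   # long-context, tool-use model
    return providers["cheap_default"]            # high-volume default
```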
Three signals tracked from day one. Quality: the eval harness runs against a frozen golden dataset on every deploy, plus a sampled LLM-as-judge pass on real production traffic. Cost: spend per feature, per cohort, per query type, with alerts on per-query cost regression. Latency: P50 and P95 response time, with streaming response budgets where the user expects sub-second feedback. Vibe-checking AI features in production is the failure mode this engagement is designed to avoid.
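A stripped-down version of the golden-dataset gate that runs on every deploy might look like this. The dataset path, scoring function, and 0.90 threshold are placeholders for whatever the engagement agrees; the sampled LLM-as-judge pass on production traffic would sit alongside it, not inside it.

```python
# Sketch of a CI gate over a frozen golden dataset. The dataset path,
# answer_fn, scoring rule, and threshold are illustrative placeholders.
import json

def score(expected: str, actual: str) -> float:
    """Cheap containment check; a sampled LLM-as-judge pass refines the rest."""
    return 1.0 if expected.strip().lower() in actual.lower() else 0.0

def run_eval(answer_fn, dataset_path="evals/golden.jsonl", threshold=0.90) -> bool:
    with open(dataset_path) as f:
        cases = [json.loads(line) for line in f]
    scores = [score(c["expected"], answer_fn(c["query"])) for c in cases]
    mean = sum(scores) / len(scores)
    print(f"golden-set score: {mean:.2f} over {len(cases)} cases")
    return mean >= threshold  # CI fails the deploy when this returns False
```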
Three levers keep the cost line in check. The model router is the first: a cheap default (smaller, faster model) for the 80% of queries that don't need the flagship model, an expensive fallback only when the cheap model declines or returns a low-confidence answer. Caching is the second: deterministic retrieval results cache on the embedding hash, deterministic completions cache on the prompt hash, with TTLs tuned to the freshness budget. The third lever, fine-tuning a smaller model on your own data, is reserved for cases where cost-per-query structurally dominates the feature's economics.
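To make the caching lever concrete, here is a minimal in-process sketch keyed on a hash of the prompt and model. The one-hour TTL is only an example freshness budget, the in-memory dict stands in for Redis or similar, and it assumes deterministic (temperature-zero) completions, since anything sampled shouldn't be cached this way.

```python
# Minimal completion cache keyed on a prompt hash, for deterministic completions.
# In-process dict for illustration; production would sit this behind Redis or similar.
import hashlib
import time

class CompletionCache:
    def __init__(self, ttl_seconds: int = 3600):  # TTL = freshness budget (example value)
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, str]] = {}

    @staticmethod
    def _key(prompt: str, model: str) -> str:
        return hashlib.sha256(f"{model}\n{prompt}".encode()).hexdigest()

    def get(self, prompt: str, model: str) -> str | None:
        hit = self._store.get(self._key(prompt, model))
        if hit and time.time() - hit[0] < self.ttl:
            return hit[1]
        return None

    def set(self, prompt: str, model: str, completion: str) -> None:
        self._store[self._key(prompt, model)] = (time.time(), completion)
```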
Work with valUX
Start where it hurts.
If your organic traffic is sliding, start with a Pulse audit. If you want a programme rather than a one-off, ask about a retainer. Either way, every enquiry is read by a senior architect, and you hear back within one working day.