Introduction
An agent that calls a few tools inside a Jupyter notebook can look convincing in under an hour of work. The gap between that demo and a system that runs unattended against live customer data is wide, and most of the difficulty lives in the parts that demos never exercise. Production agents face concurrent users, partial tool failures, ambiguous instructions, and adversarial inputs that try to push the model off its intended task. Devyst treats an agent as a distributed system with a probabilistic component at its center, not as a single prompt. That framing forces you to answer concrete questions early: how state is stored, how tools fail safely, and how an operator inspects a run after the fact. The sections below walk through the architecture decisions that have the largest effect on reliability and cost.
Choosing an Orchestration Pattern
The orchestration layer decides how control moves between the model and the rest of the system. The simplest pattern is a single reasoning loop where the model reads context, picks one tool, and repeats until it produces a final answer. This works well for tasks with a bounded number of steps and a clear stopping condition. More structured workflows benefit from a planner that decomposes a goal into named subtasks, each handled by a focused sub-agent with a narrow tool set. Devyst recommends starting with the simplest loop that satisfies the task and adding structure only when a measurable failure pushes you toward it, because every extra layer of indirection increases latency and the surface area for bugs. The key discipline is to keep control flow explicit in code rather than asking the model to manage long-horizon plans on its own, since models tend to lose track of earlier goals and constraints as the step count grows. Reserve multi-agent topologies for cases where subtasks genuinely have different tool requirements or different trust boundaries.
Cap the agent loop with a hard maximum step count and a wall-clock timeout. An unbounded loop is the most common way a runaway agent burns budget in production.
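The step cap and wall-clock timeout described above can be sketched as a small loop driver. This is a minimal illustration, not Devyst's implementation: `runAgentLoop`, `StepResult`, `LoopOutcome`, and `LoopLimits` are hypothetical names, and the `step` callback stands in for whatever performs one model call plus tool execution.

```typescript
// Hypothetical types for illustration; none of these names come from a real SDK.
type StepResult =
  | { kind: 'final'; answer: string }
  | { kind: 'continue' }

type LoopOutcome =
  | { status: 'completed'; answer: string; steps: number }
  | { status: 'step_limit'; steps: number }
  | { status: 'timeout'; steps: number }

interface LoopLimits {
  maxSteps: number
  timeoutMs: number
}

// Runs `step` until it returns a final answer or one of the limits trips.
// `now` is injectable so the deadline logic can be exercised without waiting.
export async function runAgentLoop(
  step: (stepIndex: number) => Promise<StepResult>,
  limits: LoopLimits,
  now: () => number = Date.now,
): Promise<LoopOutcome> {
  const deadline = now() + limits.timeoutMs
  for (let i = 0; i < limits.maxSteps; i++) {
    // The clock is checked before each step starts; a step that is already
    // in flight is not interrupted by this sketch.
    if (now() >= deadline) {
      return { status: 'timeout', steps: i }
    }
    const result = await step(i)
    if (result.kind === 'final') {
      return { status: 'completed', answer: result.answer, steps: i + 1 }
    }
  }
  return { status: 'step_limit', steps: limits.maxSteps }
}
```

Returning a tagged outcome instead of throwing keeps the limit cases explicit for the caller, which can then decide whether to retry, escalate, or report a partial result. Interrupting a slow step mid-flight would need an additional mechanism such as an abort signal passed into the step.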
Memory Architecture
Memory in an agent splits into three concerns that deserve separate treatment. Working memory is the active conversation and tool output that the model needs within a single run, and it competes directly for the context window, so it has to be summarized or truncated under pressure. Episodic memory records what happened across past runs for a given user or task, and it usually lives in a database keyed by tenant and session rather than in the prompt. Semantic memory holds durable facts and documents that the agent retrieves on demand through a vector store or a structured query. Conflating these three into one growing prompt is the most frequent cause of context overflow and rising token bills. Devyst stores episodic and semantic memory outside the model and injects only the slices relevant to the current step, which keeps each request cheap and predictable. A clear retention policy matters too, since stale memory can mislead an agent as badly as missing memory.
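One way to picture "inject only the slices relevant to the current step" is a context builder with separate budgets for retrieved memory and working memory. The sketch below is an assumption-laden illustration: `buildContext`, `ContextBudget`, and the four-characters-per-token estimate are all hypothetical, and a real system would use the model's tokenizer and a proper retrieval step.

```typescript
// Hypothetical budget split between retrieved memory and the live conversation.
interface ContextBudget {
  retrievedTokens: number
  workingTokens: number
}

// Crude estimate (about 4 characters per token). This is an assumption for
// the sketch, not a real tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4)
}

// Takes items in order until the next one would exceed the budget.
function takeWithinBudget(texts: string[], budget: number): string[] {
  const taken: string[] = []
  let used = 0
  for (const text of texts) {
    const cost = estimateTokens(text)
    if (used + cost > budget) break
    taken.push(text)
    used += cost
  }
  return taken
}

// Builds the context for one step from memory that lives outside the prompt.
export function buildContext(
  workingTurns: string[], // oldest first
  retrievedSlices: string[], // episodic and semantic slices, most relevant first
  budget: ContextBudget,
): string[] {
  const facts = takeWithinBudget(retrievedSlices, budget.retrievedTokens)
  // Select working memory newest-first so the oldest turns are dropped under
  // pressure, then restore chronological order for the prompt.
  const recent = takeWithinBudget([...workingTurns].reverse(), budget.workingTokens).reverse()
  return [...facts, ...recent]
}
```

Dropping the oldest turns is the simplest truncation policy; summarizing them into a single shorter turn is a common alternative when earlier context still matters.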
Tool Design and Validation
Tools are the only way an agent affects the outside world, so their contracts deserve the same rigor as any public API. Each tool needs a precise input schema, documented side effects, and a validated output that the model can rely on. Never trust the shape of data returned from an external call, and never pass an unvalidated model argument straight into a database write or a shell command. Validating tool output with a schema turns a vague runtime failure into a clear, catchable error that the agent can recover from or surface to an operator. Devyst validates every tool boundary with Zod so that malformed responses are rejected before they reach the model, which prevents the agent from reasoning over garbage. The example below shows a tool output schema that parses an external response so that downstream code receives a typed object whose shape has already been checked.
import { z } from 'zod'

// Schema for the payload the weather API is expected to return.
const WeatherToolOutput = z.object({
  location: z.string().min(1),
  temperatureC: z.number().finite(),
  condition: z.enum(['clear', 'cloudy', 'rain', 'snow']),
  retrievedAt: z.string().datetime(),
})

type WeatherToolOutput = z.infer<typeof WeatherToolOutput>

export async function getWeather(city: string): Promise<WeatherToolOutput> {
  const res = await fetch(`https://api.example.com/weather?city=${encodeURIComponent(city)}`)
  // Treat the body as unknown until the schema has checked it.
  const raw: unknown = await res.json()
  const parsed = WeatherToolOutput.safeParse(raw)
  if (!parsed.success) {
    throw new Error(`Weather tool returned an invalid payload: ${parsed.error.message}`)
  }
  return parsed.data
}
Treat tool arguments produced by the model as untrusted user input. A prompt injection in a retrieved document can steer those arguments toward destructive calls.
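The same caution applies on the input side. The sketch below checks a model-produced tool call against an allowlist and validates its arguments before anything runs. It is deliberately dependency-free so it stands alone. In a codebase that already uses Zod, a schema's `safeParse` would replace the hand-written checks. The `refund` tool, its argument limits, and the names `ToolCall`, `Validated`, and `checkToolCall` are hypothetical.

```typescript
// Hypothetical shape of a tool call as it might arrive from a model response.
interface ToolCall {
  name: string
  args: unknown
}

type Validated<T> = { ok: true; value: T } | { ok: false; reason: string }

interface RefundArgs {
  orderId: string
  amountCents: number
}

// Hand-written validation to keep the sketch self-contained.
function validateRefundArgs(args: unknown): Validated<RefundArgs> {
  if (typeof args !== 'object' || args === null) {
    return { ok: false, reason: 'arguments must be an object' }
  }
  const { orderId, amountCents } = args as Record<string, unknown>
  if (typeof orderId !== 'string' || !/^[A-Z0-9-]{1,32}$/.test(orderId)) {
    return { ok: false, reason: 'orderId has an unexpected format' }
  }
  if (
    typeof amountCents !== 'number' ||
    !Number.isInteger(amountCents) ||
    amountCents <= 0 ||
    amountCents > 50000 // illustrative cap; larger refunds would go to a human
  ) {
    return { ok: false, reason: 'amountCents is outside the allowed range' }
  }
  return { ok: true, value: { orderId, amountCents } }
}

// Only tools registered here can run; any other name the model produces,
// including a hallucinated one, is rejected before execution.
const validators = new Map<string, (args: unknown) => Validated<unknown>>([
  ['refund', validateRefundArgs],
])

export function checkToolCall(call: ToolCall): Validated<unknown> {
  const validate = validators.get(call.name)
  if (validate === undefined) {
    return { ok: false, reason: `unknown tool: ${call.name}` }
  }
  return validate(call.args)
}
```

A rejected call can be fed back to the model as an error message so it can correct itself, or logged and escalated if the same rejection repeats.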
Observability
You cannot debug what you cannot see, and agent runs are nonlinear enough that ad hoc logging falls apart quickly. Every run should emit a structured trace that captures each model call, the prompt that produced it, the tool invocations with their arguments and results, token counts, and latency per step. These traces serve three audiences: engineers diagnosing a specific failed run, product teams measuring task success rates, and finance tracking cost per outcome. Devyst attaches a stable correlation ID to every run and propagates it through tool calls so a single trace tells the complete story end to end. Sampling full prompts in production raises privacy questions, so redact sensitive fields at the trace boundary rather than after the fact. Good observability also makes evaluation possible, because a stored trace can be replayed against a new model version to detect regressions before they ship.
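A structured trace with redaction at the boundary might look like the following sketch. The event fields, the `RunTrace` class, and the list of sensitive keys are assumptions made for illustration. A production system would more likely emit to a tracing backend than hold events in memory.

```typescript
// Hypothetical trace event shape; field names are illustrative.
interface TraceEvent {
  correlationId: string
  step: number
  kind: 'model_call' | 'tool_call'
  name: string
  payload: Record<string, unknown>
  latencyMs: number
  tokens?: number
}

// Keys whose values are never written to the trace (an assumed list).
const SENSITIVE_KEYS = new Set(['email', 'password', 'apiKey', 'ssn'])

// Recursively replaces values stored under sensitive keys.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redact)
  if (typeof value === 'object' && value !== null) {
    const out: Record<string, unknown> = {}
    for (const [key, inner] of Object.entries(value)) {
      out[key] = SENSITIVE_KEYS.has(key) ? '[REDACTED]' : redact(inner)
    }
    return out
  }
  return value
}

export class RunTrace {
  private readonly events: TraceEvent[] = []
  private readonly correlationId: string

  constructor(correlationId: string) {
    this.correlationId = correlationId
  }

  // Redaction happens here, before the event is stored anywhere.
  record(event: Omit<TraceEvent, 'correlationId' | 'step'>): void {
    this.events.push({
      ...event,
      correlationId: this.correlationId,
      step: this.events.length,
      payload: redact(event.payload) as Record<string, unknown>,
    })
  }

  totalTokens(): number {
    return this.events.reduce((sum, e) => sum + (e.tokens ?? 0), 0)
  }

  // One JSON object per line, a format most log pipelines can ingest.
  toJSONLines(): string {
    return this.events.map((e) => JSON.stringify(e)).join('\n')
  }
}
```

Key-based redaction only catches fields with known names; free-text prompts that embed personal data need a separate scrubbing pass.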
Failure Modes
Agentic systems fail in ways that traditional services do not, and planning for those modes is part of the design. The model can hallucinate a tool that does not exist, loop indefinitely on a task it cannot complete, or confidently produce a wrong answer with no error to catch. External tools add ordinary distributed-systems failures on top: timeouts, rate limits, and partial writes that leave state inconsistent. Devyst defends against these with strict step limits, idempotent tool implementations so a retried call does not duplicate an effect, and a fallback path that escalates to a human when confidence is low or a guardrail trips. Cost is also a failure mode, since a misconfigured loop can spend a month of budget in an afternoon, so per-run and per-tenant spend limits belong in the orchestration layer. Treating low confidence as a first-class signal, rather than forcing an answer, often produces the most trustworthy product behavior.
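Two of the defenses named above, idempotent tool execution and per-tenant spend limits, can be sketched in a few lines. Everything here is hypothetical and held in memory for brevity: `IdempotentExecutor` and `SpendGuard` are illustrative names, the effect is synchronous to keep the example short, and a real deployment would need a shared, durable store so that retries from another process see the same state.

```typescript
// Caches results by idempotency key so a retried call returns the first
// result instead of repeating the side effect.
export class IdempotentExecutor<T> {
  private readonly results = new Map<string, T>()

  run(key: string, effect: () => T): T {
    if (this.results.has(key)) {
      return this.results.get(key) as T
    }
    const result = effect()
    this.results.set(key, result)
    return result
  }
}

// Tracks cumulative spend per tenant and refuses charges past the cap.
export class SpendGuard {
  private readonly spentCents = new Map<string, number>()
  private readonly limitCents: number

  constructor(limitCents: number) {
    this.limitCents = limitCents
  }

  // Returns false, and records nothing, when the charge would exceed the limit.
  tryCharge(tenantId: string, costCents: number): boolean {
    const next = (this.spentCents.get(tenantId) ?? 0) + costCents
    if (next > this.limitCents) return false
    this.spentCents.set(tenantId, next)
    return true
  }
}
```

The orchestration layer would call `tryCharge` before each model or tool call and stop the run, or escalate to a human, when it returns false. The idempotency key should be derived from the intent of the call (for example the tool name plus the order it acts on) so that a retry maps to the same key.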
Conclusion
A production agent is mostly conventional engineering wrapped around a probabilistic core. The model gets attention because it is novel, but reliability comes from the parts around it: explicit orchestration, separated memory, validated tools, thorough observability, and deliberate handling of failure. Teams that invest in those foundations ship agents customers can depend on, while teams that skip them tend to relaunch the same demo repeatedly without ever reaching production. Start simple, measure relentlessly, and add structure only where data shows it is needed. The result is a system that behaves predictably under the conditions a demo never reveals.