Evaluations

Evaluations (evals) are the only way to know how your agent actually performs. We recommend starting voice-agent development by writing evals first (Evaluation-Driven Development). In our system, evals are simply example conversations your agent might encounter (the inputs), each paired with the outcome you expect.
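As a sketch, an eval can be as simple as a transcript plus an expected outcome. The field names below are illustrative placeholders, not a specific product schema:

```python
# A minimal, illustrative eval: an input conversation paired with the
# outcome you expect. Field names here are hypothetical.
eval_case = {
    "name": "caller_asks_for_refund",
    "conversation": [
        {"role": "user", "content": "Hi, I'd like a refund for my last order."},
    ],
    "expected": {
        "tool_called": "start_refund",  # the agent should invoke the refund tool
        "must_not_say": ["I can't help with that"],
    },
}
```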

The Golden Rule

A good rule of thumb is to have at least 2 evals per distinct instruction in your system prompt.
  • Have an “out-of-scope rule” in your system prompt? Write 2+ evals for it.
  • Have specific tool-calling logic? Write 2+ evals for it.
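For instance, suppose your system prompt has an out-of-scope rule saying the agent must decline medical questions. Following the golden rule, you would write at least two evals that probe that rule from different angles (these cases and field names are hypothetical):

```python
# Two hypothetical evals for one out-of-scope rule: "decline medical advice."
# One probes the rule directly; the other probes it mid-task.
out_of_scope_evals = [
    {
        "name": "direct_medical_question",
        "conversation": [
            {"role": "user", "content": "What dosage of ibuprofen should I take?"},
        ],
        "expected": {"declines": True, "offers_handoff": True},
    },
    {
        "name": "medical_question_mid_call",
        "conversation": [
            {"role": "user", "content": "I'd like to reschedule my appointment."},
            {"role": "assistant", "content": "Sure, what day works for you?"},
            {"role": "user", "content": "Thursday. Also, is this rash serious?"},
        ],
        "expected": {"declines": True, "stays_on_task": True},
    },
]
```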

Common Question: “How do I write the best prompt?”

There is no single answer. Your agent’s performance is affected by many variables:
  • The underlying model (LLM) you are using
  • The definitions and structure of your tools
  • The descriptions of those tools
  • The structure of the system prompt
  • The specific wording and phrasing used
The only reliable way to optimize your prompt is to write evals first, then iterate on the prompt wording until your evals' pass rate improves.
Don't take our word for it. Take it from OpenAI: "[We] do not know how to prompt, we write evals and iterate on the prompt until it passes the evals." (Building Resilient Prompts Using an Evaluation Flywheel, OpenAI Cookbook)
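The flywheel above boils down to a single measurable loop: run every eval against the current prompt, compute the pass rate, and only keep a prompt edit if the rate goes up. A minimal sketch, where `run_agent` and `judge` stand in for your own harness (they are placeholders, not a real API):

```python
# Sketch of the eval flywheel's scoring step. `run_agent` produces the
# agent's output for a conversation; `judge` decides whether that output
# matches the eval's expected outcome. Both are supplied by your harness.
def pass_rate(prompt, evals, run_agent, judge):
    passed = sum(
        judge(run_agent(prompt, case["conversation"]), case["expected"])
        for case in evals
    )
    return passed / len(evals)
```

In practice you would rerun `pass_rate` after each prompt revision and keep the wording that scores highest.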