Evaluations

Evaluations (evals) are the only way to know how your agent actually performs. We recommend starting voice-agent development by writing evals first (Evaluation-Driven Development). In our system, evals are simply example conversations your agent might encounter (the inputs), each paired with the outcome you expect.
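As a sketch, an eval can be as simple as a transcript plus an expected outcome. The field names below are illustrative placeholders, not a specific product schema:

```python
# A minimal, illustrative eval: an input conversation paired with the
# outcome you expect. Field names here are hypothetical.
eval_case = {
    "name": "caller_asks_for_refund",
    "conversation": [
        {"role": "user", "content": "Hi, I'd like a refund for my last order."},
    ],
    "expected": {
        "tool_called": "start_refund",  # the agent should invoke the refund tool
        "must_not_say": ["I can't help with that"],
    },
}
```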

The Golden Rule

A good rule of thumb is to have at least 2 evals per distinct instruction in your system prompt.
  • Have an “out-of-scope rule” in your system prompt? Write 2+ evals for it.
  • Have specific tool-calling logic? Write 2+ evals for it.
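For instance, suppose your system prompt has an out-of-scope rule saying the agent must decline medical questions. Following the golden rule, you would write at least two evals that probe that rule from different angles (these cases and field names are hypothetical):

```python
# Two hypothetical evals for one out-of-scope rule: "decline medical advice."
# One probes the rule directly; the other probes it mid-task.
out_of_scope_evals = [
    {
        "name": "direct_medical_question",
        "conversation": [
            {"role": "user", "content": "What dosage of ibuprofen should I take?"},
        ],
        "expected": {"declines": True, "offers_handoff": True},
    },
    {
        "name": "medical_question_mid_call",
        "conversation": [
            {"role": "user", "content": "I'd like to reschedule my appointment."},
            {"role": "assistant", "content": "Sure, what day works for you?"},
            {"role": "user", "content": "Thursday. Also, is this rash serious?"},
        ],
        "expected": {"declines": True, "stays_on_task": True},
    },
]
```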

Common Question: “How do I write the best prompt?”

There is no single answer. Your agent’s performance is affected by many variables:
  • The underlying model (LLM) you are using
  • The definitions and structure of your tools
  • The descriptions of those tools
  • The structure of the system prompt
  • The specific wording and phrasing used
The only reliable way to optimize your prompt is to write evals first, then iterate on the prompt wording until your evals' pass rate improves.
Don't take our word for it. Take it from OpenAI: "[We] do not know how to prompt, we write evals and iterate on the prompt until it passes the evals." (Building Resilient Prompts Using an Evaluation Flywheel, OpenAI Cookbook)
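The flywheel above boils down to a single measurable loop: run every eval against the current prompt, compute the pass rate, and only keep a prompt edit if the rate goes up. A minimal sketch, where `run_agent` and `judge` stand in for your own harness (they are placeholders, not a real API):

```python
# Sketch of the eval flywheel's scoring step. `run_agent` produces the
# agent's output for a conversation; `judge` decides whether that output
# matches the eval's expected outcome. Both are supplied by your harness.
def pass_rate(prompt, evals, run_agent, judge):
    passed = sum(
        judge(run_agent(prompt, case["conversation"]), case["expected"])
        for case in evals
    )
    return passed / len(evals)
```

In practice you would rerun `pass_rate` after each prompt revision and keep the wording that scores highest.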