Evaluations
Evaluations (evals) are the only way to know how your agent actually performs. We recommend starting voice-agent development by writing evals first (Evaluation-Driven Development). In our system, evals are simply “example conversations” (inputs) your agent might encounter, paired with the outcomes you expect.

The Golden Rule
A good rule of thumb is to have at least 2 evals per distinct instruction in your system prompt.

- Have an “out-of-scope rule” in your system prompt? Write 2+ evals for it.
- Have a specific tool calling logic? Write 2+ evals for it.
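The rule above can be sketched in code. This is a minimal, hypothetical illustration — the `EvalCase` structure, the outcome labels, and the `lookup_account` tool name are all assumptions, not part of any real framework — showing 2+ evals each for an imagined out-of-scope rule and an imagined tool-calling rule:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    name: str
    conversation: list[str]  # user turns fed to the agent
    expected: str            # outcome label we expect back

# Two evals for a hypothetical out-of-scope rule ("decline questions
# unrelated to billing"), two for a hypothetical tool-calling rule
# ("call lookup_account before quoting a balance").
EVALS = [
    EvalCase("out_of_scope_weather",
             ["What's the weather like today?"],
             "decline_out_of_scope"),
    EvalCase("out_of_scope_sports",
             ["Who won the game last night?"],
             "decline_out_of_scope"),
    EvalCase("tool_call_balance",
             ["How much do I owe on my account?"],
             "call:lookup_account"),
    EvalCase("tool_call_balance_followup",
             ["Hi!", "Can you check my balance?"],
             "call:lookup_account"),
]

def run_evals(agent, evals):
    """Run each eval case through `agent` (any callable mapping a
    conversation to an outcome label) and report pass/fail per case."""
    return {case.name: agent(case.conversation) == case.expected
            for case in evals}
```

The `agent` callable here is a stand-in for however you invoke your real voice agent; the point is only that each instruction in the prompt maps to concrete conversations with checkable outcomes.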
Common Question: “How do I write the best prompt?”
There is no single answer. Your agent’s performance is affected by many variables:

- The underlying model (LLM) you are using
- The definitions and structure of your tools
- The descriptions of those tools
- The structure of the system prompt
- The specific wording and phrasing used
Don’t take our word for it. Take it from OpenAI: “[We] do not know how to prompt, we write evals and iterate on the prompt until it passes the evals.” — Building Resilient Prompts Using an Evaluation Flywheel (OpenAI Cookbook)