Evaluations provide an automated testing framework for your prompts. Define scoring criteria and let Traceport grade every prompt output against your quality standards.

What Are Evaluations?

Evaluations are rules that automatically score the output of your prompts. Instead of manually reviewing every response, define criteria once and Traceport evaluates every test run.

Creating Evaluation Rules

  1. Open Evaluations: Click the Evaluations icon in the Prompt Studio sidebar.
  2. Add Rule: Click Add Evaluation Rule and define a scoring criterion.
  3. Configure: Set the evaluation type, scoring method, and pass/fail threshold.
  4. Run: Execute your prompt — evaluation scores appear alongside the output.
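The three settings from the Configure step can be sketched as a small data structure. This is an illustrative sketch only: the field names and values below are assumptions, not Traceport's actual rule schema.

```python
from dataclasses import dataclass

# Hypothetical sketch of an evaluation rule's three settings
# (evaluation type, scoring method, pass/fail threshold).
# Field names and values are illustrative, not Traceport's schema.
@dataclass
class EvaluationRule:
    name: str
    eval_type: str        # e.g. "quality", "safety", "format", "custom"
    scoring_method: str   # e.g. "pass_fail" or "numeric_0_to_1"
    threshold: float      # minimum score that counts as a pass

    def passes(self, score: float) -> bool:
        return score >= self.threshold

rule = EvaluationRule("on_topic", "quality", "numeric_0_to_1", 0.8)
print(rule.passes(0.9))  # True
print(rule.passes(0.5))  # False
```

The threshold is what turns a numeric score into the pass/fail signal shown alongside the output.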

Evaluation Types

  • Quality: Score the response’s relevance, coherence, and completeness relative to the user’s request. Catches off‑topic or low‑quality responses.
  • Safety: Check for harmful content, PII leakage, or policy violations. Essential for customer-facing applications.
  • Format: Verify that the response follows a required format — JSON schema, specific structure, or required fields.
  • Custom: Define your own scoring logic using natural language descriptions. Traceport uses an evaluator model to grade responses against your criteria.
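A format evaluation is the easiest type to reason about concretely: does the response parse and contain the required fields? Here is a minimal self-contained sketch of that check; the required field names are made up for illustration.

```python
import json

# Minimal format evaluation: check that a model response parses as
# JSON and contains the required fields. Field names are illustrative.
REQUIRED_FIELDS = {"summary", "sentiment"}

def format_score(response_text: str) -> float:
    """Return 1.0 if the response is a JSON object with all required
    fields, 0.0 otherwise (a pass/fail scoring method)."""
    try:
        data = json.loads(response_text)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(data, dict):
        return 0.0
    return 1.0 if REQUIRED_FIELDS <= data.keys() else 0.0

print(format_score('{"summary": "ok", "sentiment": "positive"}'))  # 1.0
print(format_score('not json at all'))                             # 0.0
```

Unlike quality or custom evaluations, a format check like this needs no evaluator model call, which is why it is cheap to run on every output.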

Evaluations + Datasets

The most powerful workflow combines Evaluations with Datasets:
  1. Create a Dataset with diverse test inputs
  2. Define Evaluation Rules for quality, safety, and format
  3. Run the batch — Traceport evaluates every response against every rule
  4. Review the scorecard — identify which inputs produce failing outputs
This workflow is ideal for prompt optimization cycles: make a change, run the dataset, and compare evaluation scores before and after.
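The four-step workflow above can be sketched in miniature: every dataset input is run through the prompt, every rule scores every response, and the scorecard reveals which inputs fail. The `run_prompt` stand-in and both rules are hypothetical stand-ins, not Traceport functions.

```python
# Sketch of the Datasets x Evaluations workflow. run_prompt and the
# rules are illustrative stand-ins, not Traceport APIs.
def run_prompt(user_input: str) -> str:
    # Stand-in for the real prompt execution
    return f"Summary of: {user_input}" if user_input else ""

# Each rule scores one (input, response) pair: 1.0 = pass, 0.0 = fail.
rules = {
    "non_empty": lambda inp, resp: 1.0 if resp.strip() else 0.0,
    "echoes_topic": lambda inp, resp: 1.0 if inp in resp else 0.0,
}

dataset = ["refund policy", "shipping times", ""]

# Evaluate every response against every rule.
scorecard = {
    inp: {name: rule(inp, run_prompt(inp)) for name, rule in rules.items()}
    for inp in dataset
}

# Review step: identify which inputs produce failing outputs.
failing = [inp for inp, scores in scorecard.items()
           if min(scores.values()) < 1.0]
print(failing)  # ['']
```

Running the same dataset before and after a prompt change and diffing the two scorecards is the optimization cycle the section describes.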

Continuous Evaluation

As your prompts evolve through new versions, evaluations serve as a quality gate:
  • Run evaluations before publishing a new version
  • Compare scores across versions to detect regressions
  • Use evaluation pass rates as confidence signals for deployment
Evaluations use an additional model call to score responses. This adds a small cost per evaluation. Use them strategically on important prompts rather than on every prompt run.
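The version-comparison quality gate can be sketched as a simple diff of per-rule scores between a baseline version and a candidate. The scores below are made-up illustrative numbers, not real Traceport output.

```python
# Sketch of a version-over-version regression check: flag any rule
# whose score dropped between two prompt versions. Scores are
# illustrative, not real evaluation output.
def detect_regressions(baseline: dict, candidate: dict,
                       tolerance: float = 0.0) -> list:
    """Return the rules whose score fell by more than `tolerance`."""
    return [
        rule for rule in baseline
        if candidate.get(rule, 0.0) < baseline[rule] - tolerance
    ]

v1_scores = {"quality": 0.92, "safety": 1.00, "format": 0.88}
v2_scores = {"quality": 0.95, "safety": 1.00, "format": 0.71}

print(detect_regressions(v1_scores, v2_scores))  # ['format']
```

An empty result is the "confidence signal for deployment" the list above mentions; a non-empty one is a reason to hold the new version back.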