Tags: evaluation, scoring, quality, validation, batch
AI Evaluator Tool
Overview
An AI-as-judge evaluation engine that scores agent outputs on three dimensions, each rated 1-5: quality, accuracy, and efficiency. It supports single evaluation, batch processing, output validation, human feedback, and aggregate reporting.
Available Operations
Single Evaluation
- Score - Evaluate a single agent output against the judge prompt
- Returns quality, accuracy, efficiency scores with reasoning
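As a rough illustration of a single evaluation, the sketch below builds a score request from the documented parameter names and models a result with the three scores plus reasoning. The `ScoreResult` class, the `build_score_request` helper, and the integer score type are assumptions made for this example; the tool's actual response format is not specified in this document.

```python
from dataclasses import dataclass

DIMENSIONS = ("quality", "accuracy", "efficiency")


@dataclass
class ScoreResult:
    """Hypothetical result shape: three scores from 1 to 5 plus reasoning."""
    quality: int
    accuracy: int
    efficiency: int
    reasoning: str

    def __post_init__(self) -> None:
        # Reject any dimension outside the documented 1-5 range.
        for name in DIMENSIONS:
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")


def build_score_request(output: str, user_input: str = "", session_id: str = "") -> dict:
    """Assemble a request dict using the documented parameter names."""
    request = {"operation": "score", "output": output}
    if user_input:
        request["input"] = user_input
    if session_id:
        request["sessionId"] = session_id
    return request
```

Optional fields are left out of the request when empty, which mirrors the table's distinction between the required `operation` and the optional parameters.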
Batch Evaluation
- Batch - Score all unscored traces within a time window
- Configurable hours lookback (default: 24h)
- Skips already-scored traces automatically
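The selection step described above can be sketched as a filter over traces. This is only an assumption about how the lookback and skip logic might work: the trace dicts, with a timezone-aware `timestamp` and a `scores` mapping, are hypothetical, and the real trace schema may differ.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def select_unscored(traces: list, hours: float = 24, now: Optional[datetime] = None) -> list:
    """Return traces inside the lookback window that have no scores yet."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=hours)
    return [
        trace for trace in traces
        # Keep recent traces only, and skip any that already carry scores.
        if trace["timestamp"] >= cutoff and not trace.get("scores")
    ]
```

Passing `now` explicitly keeps the function deterministic, which makes the window boundary easy to test.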
Output Validation
- Validate - Check output against quality criteria (format, content, safety, completeness)
- Uses the gene-output-validator prompt
- Returns pass/fail with specific issues and passed checks
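One way to picture the pass/fail result is as a fold over per-criterion outcomes. The sketch below is illustrative only: `ValidationResult` and `summarize` are hypothetical names, and the actual checks are performed by the gene-output-validator prompt rather than by code like this.

```python
from dataclasses import dataclass, field

CRITERIA = ("format", "content", "safety", "completeness")


@dataclass
class ValidationResult:
    """Hypothetical result shape: issues found and checks passed."""
    issues: list = field(default_factory=list)
    passed_checks: list = field(default_factory=list)

    @property
    def passed(self) -> bool:
        # The output passes only when no issue was recorded.
        return not self.issues


def summarize(check_outcomes: dict) -> ValidationResult:
    """Fold per-criterion outcomes into a pass/fail result.

    check_outcomes maps a criterion name to an issue message, or to None
    when the check passed. A criterion missing from the mapping is treated
    as passed (an assumption of this sketch).
    """
    result = ValidationResult()
    for criterion in CRITERIA:
        issue = check_outcomes.get(criterion)
        if issue is None:
            result.passed_checks.append(criterion)
        else:
            result.issues.append(f"{criterion}: {issue}")
    return result
```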
Human Feedback
- Feedback - Submit human feedback score (1-5) for a specific trace
- Persists to Langfuse trace metadata
Reporting
- Report - Aggregate scoring report by template, agent type, and cost tier
- Configurable period (default: 7 days)
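The aggregation can be pictured as a group-by followed by a per-dimension average. The sketch below assumes scored records are plain dicts; the grouping keys used in the example (`agentType` here, with `template` and `costTier` as other candidates) are hypothetical field names, not confirmed parts of the tool's data model.

```python
from collections import defaultdict
from statistics import mean

DIMENSIONS = ("quality", "accuracy", "efficiency")


def aggregate(scored: list, group_by: str) -> dict:
    """Average each score dimension per group and count the records."""
    groups = defaultdict(list)
    for record in scored:
        groups[record[group_by]].append(record)

    report = {}
    for key, records in groups.items():
        summary = {dim: mean(r[dim] for r in records) for dim in DIMENSIONS}
        summary["count"] = len(records)
        report[key] = summary
    return report
```

Calling `aggregate` once per grouping key would give the three views the report describes.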
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| operation | string | Yes | One of: score, batch, validate, feedback, report |
| traceId | string | No | Trace ID (required for validate, feedback) |
| sessionId | string | No | Session ID (for single evaluation) |
| output | string | No | Agent output to evaluate |
| input | string | No | User input context |
| score | number | No | Human feedback score 1-5 (for feedback) |
| hours | number | No | Lookback window in hours (for batch, report) |
| expectedFormat | string | No | Expected output format (for validate) |
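The constraints in the table can be checked before a request is sent. The following is a minimal sketch under stated assumptions: `validate_request` is a hypothetical helper, and the rule that `hours` must be positive is an assumption of this example rather than something the table states.

```python
OPERATIONS = {"score", "batch", "validate", "feedback", "report"}
# Per the table, traceId is required for validate and feedback.
NEEDS_TRACE_ID = {"validate", "feedback"}


def validate_request(params: dict) -> list:
    """Return a list of problems; an empty list means the request looks valid."""
    errors = []
    operation = params.get("operation")
    if operation not in OPERATIONS:
        errors.append(f"operation must be one of {sorted(OPERATIONS)}")
        return errors

    if operation in NEEDS_TRACE_ID and not params.get("traceId"):
        errors.append(f"traceId is required for {operation}")

    if operation == "feedback":
        score = params.get("score")
        is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
        if not is_number or not 1 <= score <= 5:
            errors.append("score must be a number from 1 to 5")

    # Assumption: a lookback window only makes sense when it is positive.
    hours = params.get("hours")
    if hours is not None and (not isinstance(hours, (int, float)) or hours <= 0):
        errors.append("hours must be a positive number")

    return errors
```

For example, `validate_request({"operation": "feedback", "score": 4})` reports the missing `traceId` instead of raising, so a caller can surface every problem at once.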
Use Cases
- Automated quality assurance for agent outputs
- Batch evaluation of recent agent sessions
- Output format and safety validation
- Human-in-the-loop feedback collection
- Performance reporting and trend analysis