# AI Evaluator Tool

## Overview
An AI-as-judge evaluation engine that scores agent outputs on three dimensions, Quality, Accuracy, and Efficiency, each on a 1-5 scale. It supports single evaluation, batch processing, output validation, human feedback, and aggregate reporting.

## Available Operations

### Single Evaluation
- **Score** - Evaluate a single agent output against the judge prompt
- Returns quality, accuracy, efficiency scores with reasoning
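A minimal sketch of what a single-evaluation result could look like, assuming the three scores are integers and the reasoning is free text. The class name `EvaluationResult` and its field names are hypothetical and only illustrate the 1-5 range check; the tool's actual response shape may differ.

```python
from dataclasses import dataclass

# The three dimensions named in the overview.
DIMENSIONS = ("quality", "accuracy", "efficiency")


@dataclass(frozen=True)
class EvaluationResult:
    """Hypothetical result shape for the `score` operation."""

    quality: int
    accuracy: int
    efficiency: int
    reasoning: str

    def __post_init__(self) -> None:
        # Reject any dimension outside the documented 1-5 scale.
        for name in DIMENSIONS:
            value = getattr(self, name)
            if not isinstance(value, int) or not 1 <= value <= 5:
                raise ValueError(f"{name} must be an integer from 1 to 5, got {value!r}")
```

Validating at construction time means a malformed judge response fails loudly instead of reaching a report.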

### Batch Evaluation
- **Batch** - Score all unscored traces within a time window
- Configurable lookback window in hours (default: 24)
- Skips already-scored traces automatically
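The selection step described above can be sketched as a simple filter. This assumes each trace is a dict with `id`, `timestamp` (a timezone-aware datetime), and `scored` (a bool); those field names and the function `select_unscored` are hypothetical, not the tool's real data model.

```python
from datetime import datetime, timedelta, timezone


def select_unscored(traces, hours=24, now=None):
    """Return traces inside the lookback window that have no score yet.

    `traces` is an iterable of dicts with hypothetical keys
    'id', 'timestamp' (aware datetime), and 'scored' (bool).
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=hours)
    # Keep only recent traces, and skip any that were already scored.
    return [t for t in traces if t["timestamp"] >= cutoff and not t["scored"]]
```

Passing `now` explicitly keeps the function deterministic, which makes the window logic easy to test.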

### Output Validation
- **Validate** - Check output against quality criteria (format, content, safety, completeness)
- Uses the gene-output-validator prompt
- Returns pass/fail with specific issues and passed checks
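As a rough illustration of the pass/fail result, the sketch below collects per-criterion outcomes into a list of issues and a list of passed checks. `ValidationResult` and `summarize` are hypothetical names, and the rule that any single issue fails the whole output is an assumption, not documented behavior.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ValidationResult:
    """Hypothetical result shape for the `validate` operation."""

    issues: List[str] = field(default_factory=list)
    passed_checks: List[str] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        # Assumption: the output passes only if no criterion reported an issue.
        return not self.issues


def summarize(check_outcomes: Dict[str, Optional[str]]) -> ValidationResult:
    """Map criterion -> None (passed) or an issue description (failed)."""
    result = ValidationResult()
    for criterion, problem in check_outcomes.items():
        if problem is None:
            result.passed_checks.append(criterion)
        else:
            result.issues.append(f"{criterion}: {problem}")
    return result
```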

### Human Feedback
- **Feedback** - Submit human feedback score (1-5) for a specific trace
- Persists to Langfuse trace metadata

### Reporting
- **Report** - Aggregate scoring report by template, agent type, and cost tier
- Configurable period, set through `hours` (default: 7 days, i.e. 168 hours)
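One way to picture the aggregation is a group-by over scored traces. The sketch below assumes each scored trace is a dict carrying the three dimension scores plus grouping keys such as `template`; averaging per group is an assumption, since the source does not say which statistic the report uses, and `aggregate` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean

DIMENSIONS = ("quality", "accuracy", "efficiency")


def aggregate(scored_traces, group_by):
    """Average each dimension per group (e.g. group_by='template')."""
    groups = defaultdict(list)
    for trace in scored_traces:
        groups[trace[group_by]].append(trace)

    report = {}
    for key, items in groups.items():
        row = {"count": len(items)}
        for dim in DIMENSIONS:
            # Mean score for this dimension within the group.
            row[dim] = mean(item[dim] for item in items)
        report[key] = row
    return report
```

Calling the same function with a different `group_by` key (for example an agent-type or cost-tier field) would produce the other breakdowns listed above.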

## Parameters
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `operation` | string | Yes | One of: score, batch, validate, feedback, report |
| `traceId` | string | No | Trace ID (required for validate and feedback) |
| `sessionId` | string | No | Session ID (for single evaluation) |
| `output` | string | No | Agent output to evaluate |
| `input` | string | No | User input context |
| `score` | number | No | Human feedback score 1-5 (for feedback) |
| `hours` | number | No | Lookback window in hours (for batch, report) |
| `expectedFormat` | string | No | Expected output format (for validate) |
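The conditional requirements in the table can be checked before a call is made. This is a sketch only: `check_params` is a hypothetical helper, and treating `score` as mandatory for `feedback` is an assumption read from the table, not confirmed behavior.

```python
OPERATIONS = {"score", "batch", "validate", "feedback", "report"}
# Per the table, traceId is required for these operations.
TRACE_ID_REQUIRED = {"validate", "feedback"}


def check_params(params: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    op = params.get("operation")
    if op not in OPERATIONS:
        errors.append(f"operation must be one of {sorted(OPERATIONS)}")
        return errors  # Nothing else can be checked without a valid operation.

    if op in TRACE_ID_REQUIRED and not params.get("traceId"):
        errors.append(f"traceId is required for {op}")

    if op == "feedback":
        score = params.get("score")
        # Assumption: feedback needs a numeric score in the 1-5 range.
        if not isinstance(score, (int, float)) or not 1 <= score <= 5:
            errors.append("score must be a number from 1 to 5")
    return errors
```

For example, `check_params({"operation": "feedback", "score": 4})` reports the missing `traceId`, while `check_params({"operation": "batch", "hours": 12})` returns an empty list.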

## Use Cases
- Automated quality assurance for agent outputs
- Batch evaluation of recent agent sessions
- Output format and safety validation
- Human-in-the-loop feedback collection
- Performance reporting and trend analysis
