Evaluation Guides

Learn how to evaluate agent quality, benchmark performance, and compare models using vibe-check’s evaluation features.


Systematic agent evaluation requires structured assessment criteria and reproducible benchmarking. These guides cover:

  • LLM judges - Automated quality assessment using AI
  • Rubric design - Creating effective evaluation criteria
  • Benchmarking - Comparing models and analyzing performance
  • Quality gates - Enforcing quality standards

LLM Judges: Leverage LLM-based evaluation to assess agent output quality automatically. A minimal judge-plus-gate sketch follows the lists below.

You’ll learn:

  • Configuring judges with rubrics
  • Understanding judge scoring systems
  • Implementing quality gates
  • Debugging judge decisions

Use cases:

  • Automated code review
  • Output quality assessment
  • Regression detection
  • Acceptance criteria validation

Key concepts:

  • Rubric application
  • Judge model selection
  • Scoring and thresholds
  • Error handling
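The sketch below shows how a rubric, a judge, and a pass threshold fit together. It is a minimal, library-agnostic illustration under stated assumptions: the `Rubric`, `Judge`, and `runQualityGate` names are hypothetical and are not vibe-check's actual API; in practice the judge function would call an LLM with the rubric and the agent's output.

```ts
// Hypothetical sketch: an LLM judge scored against a rubric, feeding a
// quality gate. None of these names come from vibe-check's actual API.

interface Rubric {
  name: string;
  criteria: { id: string; description: string; weight: number }[];
  passThreshold: number; // weighted score in [0, 1] required to pass the gate
}

// A judge scores the agent's output against each criterion in [0, 1].
// In practice this would call an LLM; here it is a pluggable function.
type Judge = (output: string, rubric: Rubric) => Promise<Record<string, number>>;

async function runQualityGate(output: string, rubric: Rubric, judge: Judge) {
  const scores = await judge(output, rubric);

  // Weighted average across criteria; a criterion the judge did not score
  // counts as 0 so silent failures lower the result instead of hiding it.
  const totalWeight = rubric.criteria.reduce((sum, c) => sum + c.weight, 0);
  const weighted = rubric.criteria.reduce(
    (sum, c) => sum + (scores[c.id] ?? 0) * c.weight,
    0,
  );
  const finalScore = totalWeight > 0 ? weighted / totalWeight : 0;

  return {
    finalScore,
    passed: finalScore >= rubric.passThreshold,
    scores, // keep per-criterion scores for debugging judge decisions
  };
}
```

Keeping the per-criterion scores in the result is what makes judge decisions debuggable: when a gate fails, you can see which criterion dragged the weighted score below the threshold rather than only seeing a pass/fail flag.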

Rubric Design: Design effective rubrics for consistent, reliable agent evaluation. A schema sketch illustrating these concepts follows the lists below.

You’ll learn:

  • Rubric structure and schema
  • Writing clear evaluation criteria
  • Designing effective scoring scales
  • Best practices for rubric design

Use cases:

  • Code quality assessment
  • Documentation completeness checks
  • Behavior validation
  • Multi-criteria evaluation

Key concepts:

  • Criterion design
  • Scoring scales (binary, numeric, categorical)
  • Weight assignment
  • Criterion independence
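The following sketch illustrates one way to model the concepts above: independent criteria, a scoring scale per criterion (binary, numeric, or categorical), and a weight for each. It is an assumed schema for illustration only; vibe-check's actual rubric format may differ.

```ts
// Hypothetical rubric schema illustrating criterion design, scoring scales,
// and weights. Not vibe-check's actual schema.

type Scale =
  | { kind: "binary" }                             // pass / fail
  | { kind: "numeric"; min: number; max: number }  // e.g. 1-5
  | { kind: "categorical"; labels: string[] };     // e.g. missing/partial/complete

interface Criterion {
  id: string;
  description: string; // one clear, independent question for the judge
  scale: Scale;
  weight: number;      // relative importance within the rubric
}

interface Rubric {
  name: string;
  criteria: Criterion[];
}

// Example: a multi-criteria code quality rubric.
const codeQualityRubric: Rubric = {
  name: "code-quality",
  criteria: [
    {
      id: "correctness",
      description: "Does the change do what the task asked, without regressions?",
      scale: { kind: "binary" },
      weight: 3,
    },
    {
      id: "readability",
      description: "Is the code easy to follow (naming, structure, comments)?",
      scale: { kind: "numeric", min: 1, max: 5 },
      weight: 2,
    },
    {
      id: "documentation",
      description: "Are user-facing changes documented?",
      scale: { kind: "categorical", labels: ["missing", "partial", "complete"] },
      weight: 1,
    },
  ],
};
```

Note that each criterion asks a single question the judge can answer without reference to the others; that independence is what keeps scores consistent across runs.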

Benchmarking: Compare models, configurations, and prompts systematically. A matrix-benchmark sketch follows the lists below.

You’ll learn:

  • Setting up benchmark suites
  • Comparing model performance
  • Analyzing cost vs quality tradeoffs
  • Detecting performance regressions

Use cases:

  • Model selection (Sonnet vs Opus vs Haiku)
  • Prompt optimization
  • Configuration tuning
  • Performance monitoring

Key concepts:

  • Matrix testing
  • Performance metrics
  • Cost analysis
  • Statistical significance
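As a rough illustration of matrix testing and cost-versus-quality analysis, the sketch below runs every (model, prompt) combination over a task suite and aggregates the judge scores and API costs. The `TaskRunner` signature and result shape are assumptions for this sketch, not vibe-check's actual benchmarking API.

```ts
// Hypothetical matrix-benchmark sketch: run every (model, prompt) combination
// over a task suite and aggregate quality and cost.

interface BenchmarkResult {
  model: string;
  prompt: string;
  meanScore: number;    // average judge score across tasks, in [0, 1]
  totalCostUsd: number; // summed API cost for this configuration
}

// Pluggable runner: executes one task with one configuration and reports
// the judge score plus the cost of the run.
type TaskRunner = (
  model: string,
  prompt: string,
  task: string,
) => Promise<{ score: number; costUsd: number }>;

async function runMatrix(
  models: string[],
  prompts: string[],
  tasks: string[],
  run: TaskRunner,
): Promise<BenchmarkResult[]> {
  const results: BenchmarkResult[] = [];
  for (const model of models) {
    for (const prompt of prompts) {
      let scoreSum = 0;
      let costSum = 0;
      for (const task of tasks) {
        const { score, costUsd } = await run(model, prompt, task);
        scoreSum += score;
        costSum += costUsd;
      }
      results.push({
        model,
        prompt,
        meanScore: scoreSum / tasks.length,
        totalCostUsd: costSum,
      });
    }
  }
  // Rank by quality first, then by cost, to surface cost/quality tradeoffs.
  return results.sort(
    (a, b) => b.meanScore - a.meanScore || a.totalCostUsd - b.totalCostUsd,
  );
}
```

For regression detection, the same matrix can be re-run on a schedule and compared against a stored baseline; a drop in `meanScore` for an unchanged configuration is the signal to investigate, ideally with enough task repetitions to judge whether the difference is statistically meaningful.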