Your First Evaluation
In this tutorial, you’ll learn how to evaluate and compare different models, prompts, and configurations using matrix testing and LLM-based judges. This is essential for finding the best cost-quality tradeoff for your use case.
What You’ll Build
An evaluation suite that:
- Compares models (Sonnet vs Haiku) on the same task
- Tests multiple configurations (different maxTurns settings)
- Evaluates quality with an LLM judge
- Analyzes cost vs quality tradeoffs
- Generates comparison reports
We’ll use defineTestSuite for matrix testing (the Cartesian product of configurations).
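To see what “Cartesian product” means here, the sketch below expands a two-field matrix by hand. The matrix values mirror the examples later in this tutorial; the expansion code itself is purely illustrative (defineTestSuite does this for you):

```ts
// Illustrative only: expand a config matrix into every model × maxTurns combination.
// defineTestSuite generates one test per combination; this just shows the shape.
const matrix = {
  model: ['claude-3-5-sonnet-latest', 'claude-3-5-haiku-latest'],
  maxTurns: [5, 10, 20]
};

const combinations = matrix.model.flatMap(model =>
  matrix.maxTurns.map(maxTurns => ({ model, maxTurns }))
);

console.log(combinations.length); // 6 — one generated test per combination
```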
Why Evaluation Matters
Before deploying an agent to production, you need to answer:
- 💰 Which model gives the best cost-quality tradeoff?
- ⚡ How many turns should I allow?
- 📝 Which prompt works best?
- 🎯 What configurations meet my quality bar?
Evaluation tests help you make data-driven decisions.
Basic Model Comparison
Let’s start by comparing two models on the same task:
1. Create the evaluation file

   Create `tests/eval/model-comparison.test.ts`:

   ```ts
   import { vibeTest } from '@dao/vibe-check';

   // Test with Sonnet
   vibeTest('sonnet - refactor auth', async ({ runAgent, judge, expect }) => {
     const result = await runAgent({
       model: 'claude-3-5-sonnet-latest',
       prompt: 'Refactor src/auth.ts --add-tests'
     });

     const judgment = await judge(result, {
       rubric: {
         name: 'Code Quality',
         criteria: [
           { name: 'correct', description: 'Code works correctly', weight: 0.5 },
           { name: 'tested', description: 'Has comprehensive tests', weight: 0.3 },
           { name: 'clean', description: 'Code is clean', weight: 0.2 }
         ],
         passThreshold: 0.7
       }
     });

     // Log metrics for comparison
     console.log('Sonnet - Cost:', result.metrics.totalCostUsd);
     console.log('Sonnet - Quality:', judgment.overallScore);

     expect(judgment.passed).toBe(true);
   });

   // Test with Haiku
   vibeTest('haiku - refactor auth', async ({ runAgent, judge, expect }) => {
     const result = await runAgent({
       model: 'claude-3-5-haiku-latest',
       prompt: 'Refactor src/auth.ts --add-tests'
     });

     const judgment = await judge(result, {
       rubric: {
         name: 'Code Quality',
         criteria: [
           { name: 'correct', description: 'Code works correctly', weight: 0.5 },
           { name: 'tested', description: 'Has comprehensive tests', weight: 0.3 },
           { name: 'clean', description: 'Code is clean', weight: 0.2 }
         ],
         passThreshold: 0.7
       }
     });

     console.log('Haiku - Cost:', result.metrics.totalCostUsd);
     console.log('Haiku - Quality:', judgment.overallScore);

     expect(judgment.passed).toBe(true);
   });
   ```
2. Run the evaluation

   ```bash
   vitest tests/eval/model-comparison.test.ts
   ```
3. Compare results

   ```
   ✓ sonnet - refactor auth (18.2s)
     Sonnet - Cost: $0.42
     Sonnet - Quality: 0.85
   ✓ haiku - refactor auth (12.1s)
     Haiku - Cost: $0.08
     Haiku - Quality: 0.72

   Vibe Check Cost Summary
   ─────────────────────────────
   Total Cost: $0.50
   ```
Analysis: Sonnet costs roughly 5× more but scores about 18% higher on quality. Choose based on whether cost or quality matters more for your use case.
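To see where those ratios come from, here is a quick sketch using the numbers logged above (they come from this example run and will vary run to run):

```ts
// Derive the "roughly 5× / about 18%" comparison from the logged metrics above.
const sonnet = { cost: 0.42, quality: 0.85 };
const haiku = { cost: 0.08, quality: 0.72 };

console.log(`cost ratio: ${(sonnet.cost / haiku.cost).toFixed(1)}×`);                      // ≈ 5.3×
console.log(`quality gain: ${((sonnet.quality / haiku.quality - 1) * 100).toFixed(0)}%`);  // ≈ 18%
```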
Matrix Testing with defineTestSuite
Manually writing tests for every combination is tedious. Use defineTestSuite to generate the Cartesian product of configurations:
Basic Matrix
```ts
import { defineTestSuite, vibeTest } from '@dao/vibe-check';

defineTestSuite({
  matrix: {
    model: ['claude-3-5-sonnet-latest', 'claude-3-5-haiku-latest'],
    maxTurns: [5, 10, 20]
  },
  test: ({ model, maxTurns }) => {
    vibeTest(`${model} - ${maxTurns} turns`, async ({ runAgent, expect }) => {
      const result = await runAgent({
        model,
        maxTurns,
        prompt: 'Refactor src/auth.ts'
      });

      console.log(`[${model}/${maxTurns}] Cost: $${result.metrics.totalCostUsd}`);
      console.log(`[${model}/${maxTurns}] Duration: ${result.metrics.durationMs}ms`);

      expect(result).toStayUnderCost(5.0);
    });
  }
});
```
This generates 6 tests:
- claude-3-5-sonnet-latest - 5 turns
- claude-3-5-sonnet-latest - 10 turns
- claude-3-5-sonnet-latest - 20 turns
- claude-3-5-haiku-latest - 5 turns
- claude-3-5-haiku-latest - 10 turns
- claude-3-5-haiku-latest - 20 turns
Complete Evaluation Suite
Here’s a production-ready evaluation with quality assessment:
```ts
import { defineTestSuite, vibeTest } from '@dao/vibe-check';
import { writeFileSync } from 'fs';

// Define shared rubric
const QUALITY_RUBRIC = {
  name: 'Refactor Quality',
  criteria: [
    { name: 'correctness', description: 'Code works correctly and maintains functionality', weight: 0.4 },
    { name: 'test_coverage', description: 'Has comprehensive unit tests', weight: 0.3 },
    { name: 'code_quality', description: 'Code is clean, readable, and maintainable', weight: 0.2 },
    { name: 'type_safety', description: 'Uses TypeScript types properly', weight: 0.1 }
  ],
  passThreshold: 0.7
};

// Results tracking
const evalResults: any[] = [];

defineTestSuite({
  name: 'Model Evaluation Suite',
  matrix: {
    model: ['claude-3-5-sonnet-latest', 'claude-3-5-haiku-latest'],
    maxTurns: [5, 10, 15]
  },
  test: ({ model, maxTurns }) => {
    vibeTest(
      `eval: ${model.split('-').slice(-2, -1)[0]} / ${maxTurns}t`,
      async ({ runAgent, judge, expect }) => {
        const startTime = Date.now();

        // Run agent with configuration
        const result = await runAgent({
          model,
          maxTurns,
          prompt: `
            Refactor src/auth.ts:
            - Add TypeScript strict types
            - Add comprehensive unit tests
            - Remove TODO comments
            - Improve error handling
          `
        });

        const duration = Date.now() - startTime;

        // Evaluate quality with judge
        const judgment = await judge(result, { rubric: QUALITY_RUBRIC });

        // Collect metrics
        const metrics = {
          model: model.split('-').slice(-2, -1)[0], // 'sonnet' or 'haiku'
          maxTurns,
          cost: result.metrics.totalCostUsd ?? 0,
          tokens: result.metrics.totalTokens ?? 0,
          duration,
          filesChanged: result.files.stats().total,
          qualityScore: judgment.overallScore ?? 0,
          passed: judgment.passed,
          criteriaScores: judgment.criteria
        };

        evalResults.push(metrics);

        // Log results
        console.log(`\n📊 ${metrics.model} (${maxTurns} turns):`);
        console.log(`   💰 Cost: $${metrics.cost.toFixed(4)}`);
        console.log(`   ⭐ Quality: ${(metrics.qualityScore * 100).toFixed(1)}%`);
        console.log(`   ⏱️ Duration: ${(metrics.duration / 1000).toFixed(1)}s`);
        console.log(`   📁 Files: ${metrics.filesChanged}`);

        // Quality gates
        expect(judgment.passed).toBe(true);
        expect(metrics.cost).toBeLessThan(3.0);
      }
    );
  }
});

// Generate report after all tests (in a separate test)
vibeTest('generate evaluation report', async () => {
  // Calculate statistics
  const report = {
    timestamp: new Date().toISOString(),
    summary: {
      totalTests: evalResults.length,
      avgCost: evalResults.reduce((sum, r) => sum + r.cost, 0) / evalResults.length,
      avgQuality: evalResults.reduce((sum, r) => sum + r.qualityScore, 0) / evalResults.length
    },
    byModel: {} as Record<string, any>,
    results: evalResults
  };

  // Group by model
  for (const result of evalResults) {
    if (!report.byModel[result.model]) {
      report.byModel[result.model] = { avgCost: 0, avgQuality: 0, results: [] };
    }
    report.byModel[result.model].results.push(result);
  }

  // Calculate per-model averages
  for (const model of Object.keys(report.byModel)) {
    const results = report.byModel[model].results;
    report.byModel[model].avgCost =
      results.reduce((sum: number, r: any) => sum + r.cost, 0) / results.length;
    report.byModel[model].avgQuality =
      results.reduce((sum: number, r: any) => sum + r.qualityScore, 0) / results.length;
  }

  // Save report
  writeFileSync('eval-report.json', JSON.stringify(report, null, 2));

  console.log('\n📊 Evaluation Report:');
  console.log('─────────────────────────────');
  console.log(`Total Tests: ${report.summary.totalTests}`);
  console.log(`Avg Cost: $${report.summary.avgCost.toFixed(4)}`);
  console.log(`Avg Quality: ${(report.summary.avgQuality * 100).toFixed(1)}%`);
  console.log('\nBy Model:');

  for (const [model, stats] of Object.entries(report.byModel)) {
    console.log(`  ${model}:`);
    console.log(`    Avg Cost: $${stats.avgCost.toFixed(4)}`);
    console.log(`    Avg Quality: ${(stats.avgQuality * 100).toFixed(1)}%`);
  }

  console.log(`\n✅ Report saved to eval-report.json`);
}, { timeout: 600000 });
```
Analyzing Results
After running evaluations, analyze the tradeoffs:
Cost vs Quality
| Model  | Avg Cost | Avg Quality | Cost per Quality Point |
|--------|----------|-------------|------------------------|
| sonnet | $0.42    | 0.85        | $0.49                  |
| haiku  | $0.08    | 0.72        | $0.11                  |
Winner: Haiku for budget-conscious use cases, Sonnet for quality-critical tasks.
Speed vs Quality
| Model  | Avg Duration | Avg Quality | Seconds per Quality Point |
|--------|--------------|-------------|---------------------------|
| sonnet | 18.2s        | 0.85        | 21.4s                     |
| haiku  | 12.1s        | 0.72        | 16.8s                     |
Winner: Haiku delivers more quality per second, though Sonnet’s absolute quality is higher.
Turn Limit Impact
| maxTurns | Success Rate | Avg Cost | Avg Quality |
|----------|--------------|----------|-------------|
| 5        | 60%          | $0.15    | 0.68        |
| 10       | 85%          | $0.28    | 0.78        |
| 15       | 95%          | $0.42    | 0.82        |
Sweet spot: 10 turns balances success rate and cost.
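Tables like these can be derived from the eval-report.json written by the evaluation suite above. Here is a minimal sketch, assuming the report shape shown earlier (the success-rate column would additionally use the passed flag recorded per result):

```ts
import { readFileSync } from 'fs';

// Summarize eval-report.json (shape produced by the report test above) into
// cost-per-quality-point and seconds-per-quality-point views per model.
const report = JSON.parse(readFileSync('eval-report.json', 'utf-8'));

for (const [model, stats] of Object.entries(report.byModel) as [string, any][]) {
  const avgDurationMs =
    stats.results.reduce((sum: number, r: any) => sum + r.duration, 0) / stats.results.length;

  console.log(
    `${model}: ` +
    `$${(stats.avgCost / stats.avgQuality).toFixed(2)} per quality point, ` +
    `${(avgDurationMs / 1000 / stats.avgQuality).toFixed(1)}s per quality point`
  );
}
```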
Evaluating Different Prompts
Compare prompt variations:
```ts
defineTestSuite({
  matrix: {
    prompt: [
      'Refactor src/auth.ts',
      'Refactor src/auth.ts --add-tests',
      'Refactor src/auth.ts --add-tests --strict-types',
      'Refactor src/auth.ts with comprehensive tests and strict types'
    ]
  },
  test: ({ prompt }) => {
    vibeTest(`prompt: ${prompt.slice(0, 30)}...`, async ({ runAgent, judge, expect }) => {
      const result = await runAgent({
        model: 'claude-3-5-sonnet-latest',
        prompt
      });

      const judgment = await judge(result, { rubric: QUALITY_RUBRIC });

      console.log(`Prompt: "${prompt}"`);
      console.log(`Quality: ${(judgment.overallScore * 100).toFixed(1)}%`);
      console.log(`Cost: $${result.metrics.totalCostUsd}\n`);

      expect(judgment.passed).toBe(true);
    });
  }
});
```
Custom Evaluation Metrics
Define your own success criteria:
```ts
vibeTest('custom metrics', async ({ runAgent, judge, expect }) => {
  const result = await runAgent({ prompt: 'Refactor auth.ts' });

  const judgment = await judge(result, {
    rubric: {
      name: 'Custom Evaluation',
      criteria: [
        { name: 'test_quality', description: 'Tests cover edge cases and error paths', weight: 0.4 },
        { name: 'performance', description: 'Code is optimized for performance', weight: 0.3 },
        { name: 'security', description: 'Follows security best practices', weight: 0.3 }
      ]
    }
  });

  // Custom quality gates
  const hasTests = result.files.changed().some(f =>
    f.path.includes('.test.') || f.path.includes('.spec.')
  );

  const testCoverage = hasTests ? 1.0 : 0.0;
  const overallScore = (judgment.overallScore ?? 0) * 0.7 + testCoverage * 0.3;

  console.log('Judge Score:', judgment.overallScore);
  console.log('Test Coverage:', testCoverage);
  console.log('Overall Score:', overallScore);

  expect(overallScore).toBeGreaterThan(0.8);
});
```
Baseline Testing
Track improvements over time:
```ts
import { readFileSync, writeFileSync, existsSync } from 'fs';

vibeTest('regression test against baseline', async ({ runAgent, judge, expect }) => {
  const result = await runAgent({
    model: 'claude-3-5-sonnet-latest',
    prompt: 'Refactor src/auth.ts'
  });

  const judgment = await judge(result, { rubric: QUALITY_RUBRIC });

  const currentMetrics = {
    cost: result.metrics.totalCostUsd,
    quality: judgment.overallScore,
    timestamp: new Date().toISOString()
  };

  // Load baseline
  const baselinePath = 'baseline-metrics.json';
  let baseline = null;

  if (existsSync(baselinePath)) {
    baseline = JSON.parse(readFileSync(baselinePath, 'utf-8'));

    // Compare against baseline
    console.log('\n📊 Baseline Comparison:');
    console.log(`Quality: ${(currentMetrics.quality * 100).toFixed(1)}% (baseline: ${(baseline.quality * 100).toFixed(1)}%)`);
    console.log(`Cost: $${currentMetrics.cost?.toFixed(4)} (baseline: $${baseline.cost?.toFixed(4)})`);

    // Ensure we didn't regress (allow 5% regression)
    expect(currentMetrics.quality).toBeGreaterThanOrEqual(baseline.quality * 0.95);
  } else {
    // Save as new baseline
    writeFileSync(baselinePath, JSON.stringify(currentMetrics, null, 2));
    console.log('✅ Baseline saved');
  }
});
```
Best Practices
✅ DO: Use Consistent Rubrics
```ts
// Define once, reuse everywhere
const STANDARD_RUBRIC = {
  name: 'Code Quality',
  criteria: [/* ... */]
};

// Use in all evaluations
const judgment = await judge(result, { rubric: STANDARD_RUBRIC });
```
✅ DO: Track All Metrics
```ts
const metrics = {
  cost: result.metrics.totalCostUsd,
  quality: judgment.overallScore,
  duration: result.metrics.durationMs,
  tokens: result.metrics.totalTokens,
  filesChanged: result.files.stats().total,
  toolCalls: result.metrics.toolCalls
};
```
✅ DO: Save Evaluation Data
```ts
// Save for later analysis
writeFileSync('eval-results.json', JSON.stringify(results, null, 2));
```
❌ DON’T: Compare Different Tasks
```ts
// ❌ Bad: Comparing apples to oranges
vibeTest('sonnet - refactor', async ({ runAgent }) => { /* ... */ });
vibeTest('haiku - new feature', async ({ runAgent }) => { /* ... */ });

// ✅ Good: Same task, different configs
defineTestSuite({
  matrix: { model: ['sonnet', 'haiku'] },
  test: ({ model }) => {
    vibeTest(model, async ({ runAgent }) => {
      // Same prompt for fair comparison
    });
  }
});
```
❌ DON’T: Forget Cost Budgets
```ts
// ✅ Always set cost limits
expect(result).toStayUnderCost(5.0);
```
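Cost isn’t the only budget worth enforcing. Here is a small sketch of additional gates, assuming it runs inside a vibeTest body where result and expect are in scope (the thresholds are illustrative, not recommendations):

```ts
// Illustrative budget gates built on the metrics fields used throughout this tutorial.
expect(result).toStayUnderCost(5.0);
expect(result.metrics.durationMs ?? 0).toBeLessThan(120_000);  // finish within 2 minutes
expect(result.metrics.totalTokens ?? 0).toBeLessThan(200_000); // token ceiling
```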
Next Steps
You’ve learned how to evaluate agents! Now explore:
- Using Judge → Deep dive into LLM evaluation
- Creating Rubrics → Design effective evaluation criteria
- Benchmarking Guide → Advanced comparison techniques
- Cost Optimization → Reduce costs while maintaining quality
- Matrix Testing → Advanced matrix patterns
Reference
- defineTestSuite → Matrix testing API
- judge() → LLM evaluation API
- Rubric Interface → Rubric structure
- Custom Matchers → All available matchers