Benchmarking
This guide covers how to benchmark different models, prompts, and configurations. You’ll learn how to use matrix testing with LLM judges to systematically compare performance, quality, and cost.
What is Benchmarking?
Benchmarking is the process of systematically comparing different configurations to find the best option. In vibe-check, you can benchmark:
- Models - Compare Claude Sonnet, Haiku, and Opus
- Prompts - Test different prompt variations
- Parameters - Find optimal maxTurns, temperature, etc.
- Agent Configurations - Compare different tool setups
Basic Model Benchmarking
Compare models on the same task:
```ts
import { defineTestSuite, defineAgent, vibeTest } from '@dao/vibe-check';

// Define agents with different models
const sonnetAgent = defineAgent({
  name: 'sonnet',
  model: 'claude-sonnet-4-5-20250929'
});

const haikuAgent = defineAgent({
  name: 'haiku',
  model: 'claude-3-5-haiku-20241022'
});

const opusAgent = defineAgent({
  name: 'opus',
  model: 'claude-opus-4-20250514'
});

// Benchmark all models
defineTestSuite({
  matrix: {
    agent: [sonnetAgent, haikuAgent, opusAgent]
  },
  test: ({ agent }) => {
    vibeTest(`${agent.name} - refactoring task`, async ({ runAgent, judge, expect }) => {
      const result = await runAgent({
        agent,
        prompt: '/refactor src/utils.ts --improve-readability'
      });

      // Evaluate with a judge
      const judgment = await judge(result, {
        rubric: {
          name: 'Code Quality',
          criteria: [
            { name: 'readability', description: 'Code is easy to read' },
            { name: 'correctness', description: 'Refactoring preserves behavior' }
          ]
        }
      });

      expect(judgment.passed).toBe(true);

      // Log metrics for comparison
      console.log(`${agent.name} results:`, {
        passed: judgment.passed,
        cost: result.metrics.cost.total,
        tokens: result.metrics.tokens.total,
        duration: result.metrics.timing.total
      });
    });
  }
});
```
Output:
```
✓ sonnet - refactoring task (12.3s)
sonnet results: { passed: true, cost: 0.0234, tokens: 5420, duration: 12300 }
✓ haiku - refactoring task (8.1s)
haiku results: { passed: true, cost: 0.0089, tokens: 4210, duration: 8100 }
✓ opus - refactoring task (18.7s)
opus results: { passed: true, cost: 0.0512, tokens: 6890, duration: 18700 }
```
Multi-Dimensional Benchmarking
Test multiple dimensions simultaneously:
```ts
defineTestSuite({
  matrix: {
    agent: [sonnetAgent, haikuAgent],
    maxTurns: [8, 16, 32],
    prompt: [
      '/refactor --style=functional',
      '/refactor --style=object-oriented'
    ]
  },
  test: ({ agent, maxTurns, prompt }) => {
    vibeTest(
      `${agent.name} with ${maxTurns} turns: ${prompt}`,
      async ({ runAgent, judge, expect }) => {
        const result = await runAgent({ agent, prompt, maxTurns });

        const judgment = await judge(result, { rubric: codeQualityRubric });

        expect(judgment.passed).toBe(true);
      }
    );
  }
});

// Generates 12 tests (2 agents × 3 maxTurns × 2 prompts)
```
Quality Benchmarking
Use judges to evaluate quality across models:
```ts
const qualityRubric = {
  name: 'Overall Quality',
  criteria: [
    { name: 'correctness', description: 'Implementation is functionally correct', weight: 0.4 },
    { name: 'code_quality', description: 'Code is readable and well-structured', weight: 0.3 },
    { name: 'testing', description: 'Has adequate test coverage', weight: 0.3 }
  ],
  passThreshold: 0.75
};
```
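The weights above sum to 1.0 and `passThreshold` sets the bar for the overall score. As a rough sketch of how a weighted rubric plays out, assuming the judge scores each criterion in [0, 1] and combines them as a weighted average (the exact aggregation is up to the judge; the numbers below are hypothetical):

```ts
// Hypothetical per-criterion scores, combined as a weighted average.
const scores = [
  { name: 'correctness', weight: 0.4, score: 0.9 },
  { name: 'code_quality', weight: 0.3, score: 0.8 },
  { name: 'testing', weight: 0.3, score: 0.85 }
];

const overall = scores.reduce((sum, c) => sum + c.weight * c.score, 0);
// 0.4 * 0.9 + 0.3 * 0.8 + 0.3 * 0.85 = 0.855

const passed = overall >= 0.75; // compared against passThreshold
```

With these hypothetical scores the run clears the 0.75 threshold; a weak result on a heavily weighted criterion like `correctness` can tip it under.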
```ts
defineTestSuite({
  matrix: {
    agent: [sonnetAgent, haikuAgent, opusAgent]
  },
  test: ({ agent }) => {
    vibeTest(`${agent.name} quality benchmark`, async ({ runAgent, judge }) => {
      const result = await runAgent({
        agent,
        prompt: '/implement user authentication with JWT'
      });

      const judgment = await judge(result, { rubric: qualityRubric });

      // Store results for comparison
      console.log(`\n=== ${agent.name} Quality Report ===`);
      console.log('Overall passed:', judgment.passed);
      console.log('Overall score:', judgment.score?.toFixed(2));
      console.log('\nCriteria scores:');

      for (const [name, criterionResult] of Object.entries(judgment.criteria)) {
        console.log(`  ${name}: ${criterionResult.passed ? '✓' : '✗'} - ${criterionResult.reason}`);
      }

      console.log('\nMetrics:');
      console.log(`  Cost: $${result.metrics.cost.total.toFixed(4)}`);
      console.log(`  Tokens: ${result.metrics.tokens.total}`);
      console.log(`  Duration: ${(result.metrics.timing.total / 1000).toFixed(1)}s`);
    });
  }
});
```
Output:
```
=== sonnet Quality Report ===
Overall passed: true
Overall score: 0.85

Criteria scores:
  correctness: ✓ - Implementation works correctly for all test cases
  code_quality: ✓ - Well-structured with clear naming
  testing: ✓ - Good test coverage with edge cases

Metrics:
  Cost: $0.0456
  Tokens: 8230
  Duration: 15.2s
```
Cost-Performance Trade-offs
Compare cost vs. quality across models:
```ts
interface BenchmarkResult {
  agent: string;
  passed: boolean;
  qualityScore: number;
  cost: number;
  tokens: number;
  duration: number;
  costPerPoint: number; // Cost per quality point
}

const results: BenchmarkResult[] = [];
```
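The `costPerPoint` field normalizes spend by quality, so a cheap run with a weak score doesn't automatically look better than a pricier run with a strong score. A quick illustration with made-up numbers (not taken from the runs in this guide):

```ts
// Hypothetical numbers, for illustration only:
const cheapButWeak = 0.02 / 0.50;     // $0.0400 per quality point
const pricierButStrong = 0.03 / 0.90; // ≈ $0.0333 per quality point
// The cheaper run actually costs more per unit of quality delivered.
```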
```ts
defineTestSuite({
  matrix: {
    agent: [sonnetAgent, haikuAgent, opusAgent]
  },
  test: ({ agent }) => {
    vibeTest(`${agent.name} cost benchmark`, async ({ runAgent, judge }) => {
      const result = await runAgent({
        agent,
        prompt: '/implement feature X'
      });

      const judgment = await judge(result, { rubric: qualityRubric });

      const benchmarkResult: BenchmarkResult = {
        agent: agent.name,
        passed: judgment.passed,
        qualityScore: judgment.score || 0,
        cost: result.metrics.cost.total,
        tokens: result.metrics.tokens.total,
        duration: result.metrics.timing.total,
        costPerPoint: result.metrics.cost.total / (judgment.score || 1)
      };

      results.push(benchmarkResult);

      // Log individual result
      console.log(`\n${agent.name}:`, {
        quality: benchmarkResult.qualityScore.toFixed(2),
        cost: `$${benchmarkResult.cost.toFixed(4)}`,
        costPerPoint: `$${benchmarkResult.costPerPoint.toFixed(4)}`
      });
    });
  }
});

// After all tests, compare results (afterAll is imported from 'vitest')
afterAll(() => {
  console.log('\n=== Benchmark Summary ===\n');

  // Sort by cost efficiency
  results.sort((a, b) => a.costPerPoint - b.costPerPoint);

  console.log('Cost Efficiency (lower is better):');
  for (const result of results) {
    console.log(`  ${result.agent}: $${result.costPerPoint.toFixed(4)} per quality point`);
  }

  // Find best quality
  const bestQuality = results.reduce((best, current) =>
    current.qualityScore > best.qualityScore ? current : best
  );

  // Find cheapest
  const cheapest = results.reduce((best, current) =>
    current.cost < best.cost ? current : best
  );

  console.log('\nBest Quality:', bestQuality.agent, `(${bestQuality.qualityScore.toFixed(2)})`);
  console.log('Cheapest:', cheapest.agent, `($${cheapest.cost.toFixed(4)})`);
});
```
Output:
```
=== Benchmark Summary ===

Cost Efficiency (lower is better):
  haiku: $0.0112 per quality point
  sonnet: $0.0538 per quality point
  opus: $0.0654 per quality point

Best Quality: opus (0.92)
Cheapest: haiku ($0.0089)
```
Prompt Engineering Benchmarking
Test different prompt variations:
```ts
const prompts = [
  { name: 'direct', text: '/implement shopping cart' },
  {
    name: 'detailed',
    text: '/implement shopping cart with add/remove/update quantity features. Include validation and tests.'
  },
  {
    name: 'step-by-step',
    text: '/implement shopping cart. First, create the data model. Then add CRUD operations. Finally, add validation and tests.'
  }
];

defineTestSuite({
  matrix: {
    agent: [sonnetAgent],
    promptConfig: prompts
  },
  test: ({ agent, promptConfig }) => {
    vibeTest(
      `${promptConfig.name} prompt`,
      async ({ runAgent, judge, expect }) => {
        const result = await runAgent({
          agent,
          prompt: promptConfig.text
        });

        const judgment = await judge(result, {
          rubric: {
            name: 'Feature Completeness',
            criteria: [
              { name: 'features', description: 'All features implemented' },
              { name: 'validation', description: 'Input validation included' },
              { name: 'testing', description: 'Tests included' }
            ]
          }
        });

        console.log(`${promptConfig.name} prompt:`, {
          passed: judgment.passed,
          score: judgment.score?.toFixed(2),
          cost: `$${result.metrics.cost.total.toFixed(4)}`
        });

        expect(judgment.passed).toBe(true);
      }
    );
  }
});
```
Specialized Benchmarks
Speed Benchmark
Focus on execution time:
```ts
defineTestSuite({
  matrix: {
    agent: [sonnetAgent, haikuAgent, opusAgent]
  },
  test: ({ agent }) => {
    vibeTest(`${agent.name} speed`, async ({ runAgent, expect }) => {
      const startTime = Date.now();

      const result = await runAgent({
        agent,
        prompt: '/format all files in src/'
      });

      const duration = Date.now() - startTime;

      expect(result.files.changed().length).toBeGreaterThan(0);

      console.log(`${agent.name}: ${(duration / 1000).toFixed(1)}s`);
    });
  }
});
```
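If you prefer not to time runs by hand, the agent-reported duration that the other examples in this guide log is also available on the result. This fragment is meant to sit inside the same `vibeTest` body, alongside the `console.log` above:

```ts
// Agent-reported duration from the run result (milliseconds),
// the same field the other examples in this guide use.
const reportedSeconds = result.metrics.timing.total / 1000;
console.log(`${agent.name} (reported): ${reportedSeconds.toFixed(1)}s`);
```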
Cost Benchmark
Focus on minimizing cost:
```ts
defineTestSuite({
  matrix: {
    agent: [sonnetAgent, haikuAgent],
    maxTurns: [4, 8, 16]
  },
  test: ({ agent, maxTurns }) => {
    vibeTest(
      `${agent.name} with ${maxTurns} turns - cost`,
      async ({ runAgent, expect }) => {
        const result = await runAgent({
          agent,
          prompt: '/simple refactoring task',
          maxTurns
        });

        // Ensure quality threshold
        expect(result.files.changed().length).toBeGreaterThan(0);

        // Log cost
        console.log(
          `${agent.name} (${maxTurns} turns): $${result.metrics.cost.total.toFixed(4)}`
        );

        // Fail if too expensive
        expect(result.metrics.cost.total).toBeLessThan(0.10);
      }
    );
  }
});
```
Accuracy Benchmark
Measure how often the model gets it right:
```ts
const testCases = [
  { input: 'calculate fibonacci', expectedOutput: 'fibonacci function' },
  { input: 'validate email', expectedOutput: 'email validation regex' },
  { input: 'parse JSON', expectedOutput: 'JSON.parse with error handling' }
];

defineTestSuite({
  matrix: {
    agent: [sonnetAgent, haikuAgent, opusAgent],
    testCase: testCases
  },
  test: ({ agent, testCase }) => {
    vibeTest(
      `${agent.name} - ${testCase.input}`,
      async ({ runAgent, judge, expect }) => {
        const result = await runAgent({
          agent,
          prompt: `/implement ${testCase.input}`
        });

        const judgment = await judge(result, {
          rubric: {
            name: 'Correctness',
            criteria: [
              {
                name: 'implementation',
                description: `Correctly implements ${testCase.expectedOutput}`
              }
            ]
          }
        });

        expect(judgment.passed).toBe(true);
      }
    );
  }
});
```
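Each generated test passes or fails on its own; if you also want a single accuracy percentage per agent, you can tally the judge verdicts and report them in an `afterAll` hook, much like the Statistical Analysis section below. A minimal sketch (the `accuracy` map and `record` helper are illustrative additions, not part of the vibe-check API; call `record(agent.name, judgment.passed)` inside each test):

```ts
import { afterAll } from 'vitest';

// Illustrative tally of judge verdicts per agent
const accuracy = new Map<string, { passed: number; total: number }>();

function record(agent: string, passed: boolean) {
  const entry = accuracy.get(agent) ?? { passed: 0, total: 0 };
  entry.total++;
  if (passed) entry.passed++;
  accuracy.set(agent, entry);
}

afterAll(() => {
  for (const [agent, { passed, total }] of accuracy.entries()) {
    console.log(`${agent}: ${((passed / total) * 100).toFixed(0)}% accuracy (${passed}/${total})`);
  }
});
```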
Statistical Analysis
Track results across multiple runs:
```ts
import { afterAll } from 'vitest';

interface RunStats {
  agent: string;
  runs: number;
  passRate: number;
  avgCost: number;
  avgQuality: number;
}

const stats = new Map<string, {
  total: number;
  passed: number;
  totalCost: number;
  totalQuality: number;
}>();

defineTestSuite({
  matrix: {
    agent: [sonnetAgent, haikuAgent, opusAgent],
    run: [1, 2, 3, 4, 5] // 5 runs per agent
  },
  test: ({ agent, run }) => {
    vibeTest(
      `${agent.name} run ${run}`,
      async ({ runAgent, judge }) => {
        const result = await runAgent({
          agent,
          prompt: '/implement feature'
        });

        const judgment = await judge(result, { rubric: qualityRubric });

        // Track stats
        const agentStats = stats.get(agent.name) || {
          total: 0,
          passed: 0,
          totalCost: 0,
          totalQuality: 0
        };

        agentStats.total++;
        if (judgment.passed) agentStats.passed++;
        agentStats.totalCost += result.metrics.cost.total;
        agentStats.totalQuality += judgment.score || 0;

        stats.set(agent.name, agentStats);
      }
    );
  }
});

afterAll(() => {
  console.log('\n=== Statistical Summary ===\n');

  const results: RunStats[] = [];

  for (const [agent, data] of stats.entries()) {
    results.push({
      agent,
      runs: data.total,
      passRate: (data.passed / data.total) * 100,
      avgCost: data.totalCost / data.total,
      avgQuality: data.totalQuality / data.total
    });
  }

  // Sort by pass rate
  results.sort((a, b) => b.passRate - a.passRate);

  for (const result of results) {
    console.log(`${result.agent}:`);
    console.log(`  Pass rate: ${result.passRate.toFixed(1)}%`);
    console.log(`  Avg cost: $${result.avgCost.toFixed(4)}`);
    console.log(`  Avg quality: ${result.avgQuality.toFixed(2)}`);
    console.log('');
  }
});
```
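Averages can hide run-to-run variability. If you want a rough measure of spread as well, keep the individual quality scores and compute a standard deviation. A small sketch (the `qualityScores` map is an illustrative addition to the stats tracked above; push `judgment.score` into it inside each test):

```ts
// Illustrative addition: per-agent list of individual quality scores
const qualityScores = new Map<string, number[]>();

function stdDev(values: number[]): number {
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const variance = values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
  return Math.sqrt(variance);
}

// e.g. stdDev([0.82, 0.85, 0.79, 0.88, 0.84]) ≈ 0.03
```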
Best Practices
1. Use Consistent Prompts
```ts
// ✅ Good: Same prompt for fair comparison
const BENCHMARK_PROMPT = '/implement user authentication';

defineTestSuite({
  matrix: {
    agent: [sonnetAgent, haikuAgent]
  },
  test: ({ agent }) => {
    vibeTest(`${agent.name}`, async ({ runAgent }) => {
      await runAgent({ agent, prompt: BENCHMARK_PROMPT });
    });
  }
});

// ❌ Bad: Different prompts make comparison unfair
defineTestSuite({
  matrix: {
    agent: [sonnetAgent, haikuAgent]
  },
  test: ({ agent }) => {
    const prompt = agent.name === 'sonnet'
      ? '/implement auth with details'
      : '/implement auth'; // Inconsistent!
  }
});
```
2. Set Consistent Timeouts
```ts
// ✅ Good: Same timeout for all
defineTestSuite({
  matrix: {
    agent: [sonnetAgent, haikuAgent]
  },
  test: ({ agent }) => {
    vibeTest(`${agent.name}`, async ({ runAgent }) => {
      await runAgent({
        agent,
        prompt: '/task',
        timeout: 60_000 // Same for all
      });
    });
  }
});
```
3. Run Multiple Iterations
```ts
// ✅ Good: Multiple runs for statistical significance
defineTestSuite({
  matrix: {
    agent: [sonnetAgent, haikuAgent],
    iteration: [1, 2, 3, 4, 5]
  },
  test: ({ agent, iteration }) => {
    vibeTest(`${agent.name} run ${iteration}`, async ({ runAgent }) => {
      // ...
    });
  }
});
```
4. Separate Quality and Speed Benchmarks
```ts
// ✅ Good: Separate benchmarks for different goals
describe('quality benchmarks', () => {
  // Focus on quality, allow longer time
});

describe('speed benchmarks', () => {
  // Focus on speed, simpler tasks
});
```
What’s Next?
Now that you understand benchmarking, explore:
- Matrix Testing → Deep dive into matrix testing
- Using Judge → Learn more about judges
- Cost Optimization → Reduce benchmark costs
Or dive into the API reference:
- defineTestSuite() → Matrix testing API
- judge() → Judge API documentation
- RunResult → Access metrics for benchmarking