# Matrix Testing
Matrix testing allows you to run the same test across multiple configurations using Cartesian product expansion. This is essential for comparing models, prompts, or parameters systematically.
## Why Matrix Testing?

**Problem:** You want to compare multiple configurations (models, prompts, parameters) without duplicating test code.
### Without Matrix Testing (Repetitive)

```typescript
vibeTest('sonnet refactors auth', async ({ runAgent, expect }) => {
  const result = await runAgent({
    agent: sonnetAgent,
    prompt: '/refactor src/auth.ts',
    maxTurns: 8
  });
  expect(result).toCompleteAllTodos();
});

vibeTest('haiku refactors auth', async ({ runAgent, expect }) => {
  const result = await runAgent({
    agent: haikuAgent,
    prompt: '/refactor src/auth.ts',
    maxTurns: 8
  });
  expect(result).toCompleteAllTodos();
});

vibeTest('sonnet with 16 turns refactors auth', async ({ runAgent, expect }) => {
  const result = await runAgent({
    agent: sonnetAgent,
    prompt: '/refactor src/auth.ts',
    maxTurns: 16
  });
  expect(result).toCompleteAllTodos();
});

// ... 1 more test (haiku with 16 turns)
// Total: 4 tests (2 agents × 2 maxTurns)
```
**Issues:**
- Duplicate test code (DRY violation)
- Manual Cartesian product calculation
- Hard to add new configurations
### With Matrix Testing (Clean)

```typescript
import { defineTestSuite, defineAgent } from '@dao/vibe-check';

const sonnetAgent = defineAgent({ name: 'sonnet', model: 'claude-3-5-sonnet-latest' });
const haikuAgent = defineAgent({ name: 'haiku', model: 'claude-3-5-haiku-latest' });

defineTestSuite({
  matrix: {
    agent: [sonnetAgent, haikuAgent],
    maxTurns: [8, 16]
  },
  test: ({ agent, maxTurns }) => {
    vibeTest(`${agent.name} with ${maxTurns} turns`, async ({ runAgent, expect }) => {
      const result = await runAgent({ agent, prompt: '/refactor src/auth.ts', maxTurns });

      expect(result).toCompleteAllTodos();
    });
  }
});

// Generates 4 tests automatically (2 agents × 2 maxTurns):
// ✓ sonnet with 8 turns
// ✓ haiku with 8 turns
// ✓ sonnet with 16 turns
// ✓ haiku with 16 turns
```
**Benefits:**
- DRY (single test definition)
- Automatic Cartesian product
- Easy to add configurations (just add to arrays)
## How It Works

### Cartesian Product Expansion

Matrix testing generates one test per combination in the Cartesian product of the matrix values:

```
agents   = [A, B]
maxTurns = [8, 16]
prompts  = [P1, P2]

Total tests = 2 × 2 × 2 = 8

Generated combinations:
1. A, 8, P1
2. A, 8, P2
3. A, 16, P1
4. A, 16, P2
5. B, 8, P1
6. B, 8, P2
7. B, 16, P1
8. B, 16, P2
```

Each combination becomes a separate test with unique parameters.
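Under the hood this is an ordinary Cartesian product over the matrix's value arrays. As a mental model, here is a minimal, illustrative sketch of such an expansion (the `cartesian` helper below is hypothetical, not part of the `@dao/vibe-check` API):

```typescript
// Hypothetical helper -- illustrates the expansion, not the library's internals.
// cartesian({ a: [1, 2], b: ['x'] }) -> [{ a: 1, b: 'x' }, { a: 2, b: 'x' }]
function cartesian<T extends Record<string, readonly unknown[]>>(
  matrix: T
): Array<{ [K in keyof T]: T[K][number] }> {
  let combos: Array<Record<string, unknown>> = [{}];
  for (const [key, values] of Object.entries(matrix)) {
    // Each dimension multiplies the combination count by values.length.
    combos = combos.flatMap(combo =>
      values.map(value => ({ ...combo, [key]: value }))
    );
  }
  return combos as Array<{ [K in keyof T]: T[K][number] }>;
}

const combos = cartesian({
  agent: ['A', 'B'],
  maxTurns: [8, 16],
  prompt: ['P1', 'P2']
});
console.log(combos.length); // 8 -- one generated test per combination
```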
## Basic Usage

### `defineTestSuite` API

```typescript
defineTestSuite({
  /** Matrix parameters (Cartesian product) */
  matrix: {
    param1: [value1, value2, ...],
    param2: [valueA, valueB, ...],
    // ...
  },

  /** Optional suite name */
  name?: string,

  /** Test function receiving one combination */
  test: (combo) => {
    vibeTest(`test ${combo.param1} ${combo.param2}`, async ({ runAgent }) => {
      // Test logic using combo.param1, combo.param2, etc.
    });
  }
});
```
**Parameters:**

- `matrix` - Object where each key maps to an array of values
- `name` - Optional suite name for grouping
- `test` - Function receiving one combination of parameters
### Example: Model Comparison

```typescript
import { defineTestSuite, defineAgent } from '@dao/vibe-check';

const sonnet = defineAgent({ name: 'sonnet', model: 'claude-3-5-sonnet-latest' });
const haiku = defineAgent({ name: 'haiku', model: 'claude-3-5-haiku-latest' });

defineTestSuite({
  name: 'Model Comparison',
  matrix: { agent: [sonnet, haiku] },
  test: ({ agent }) => {
    vibeTest(`${agent.name} refactors code`, async ({ runAgent, expect }) => {
      const result = await runAgent({ agent, prompt: '/refactor src/auth.ts' });

      expect(result).toCompleteAllTodos();
      expect(result).toStayUnderCost(5.00);
    });
  }
});

// Generates 2 tests:
// ✓ sonnet refactors code
// ✓ haiku refactors code
```
## Common Patterns

### Pattern 1: Model Benchmarking

Compare different models on the same task:

```typescript
const models = [
  defineAgent({ name: 'sonnet', model: 'claude-3-5-sonnet-latest' }),
  defineAgent({ name: 'haiku', model: 'claude-3-5-haiku-latest' }),
  defineAgent({ name: 'opus', model: 'claude-3-opus-latest' })
];

defineTestSuite({
  name: 'Model Benchmark: Code Refactoring',
  matrix: { agent: models },
  test: ({ agent }) => {
    vibeTest(`${agent.name} refactors auth module`, async ({ runAgent, expect }) => {
      const result = await runAgent({
        agent,
        prompt: '/refactor src/auth.ts with comprehensive tests'
      });

      // Quality checks
      expect(result).toCompleteAllTodos();
      expect(result).toHaveChangedFiles(['src/auth.ts', 'tests/auth.test.ts']);

      // Performance tracking
      console.log(`${agent.name} metrics:`, {
        cost: result.metrics.totalCostUsd,
        tokens: result.metrics.totalTokens,
        duration: result.metrics.durationMs
      });
    });
  }
});
```
### Pattern 2: Prompt Engineering

Test multiple prompt variations:

```typescript
const prompts = [
  'Refactor src/auth.ts',
  'Refactor src/auth.ts with comprehensive tests',
  'Refactor src/auth.ts. Add tests. Improve types. Document public API.'
];

defineTestSuite({
  name: 'Prompt Variations',
  matrix: { prompt: prompts },
  test: ({ prompt }) => {
    vibeTest(`prompt: "${prompt.slice(0, 30)}..."`, async ({ runAgent, expect }) => {
      const result = await runAgent({ prompt });

      expect(result).toCompleteAllTodos();

      console.log(`Prompt effectiveness:`, {
        prompt: prompt.slice(0, 50),
        filesChanged: result.files.changed().length,
        cost: result.metrics.totalCostUsd
      });
    });
  }
});
```
### Pattern 3: Multi-Dimensional Matrix

Test combinations of models, turns, and prompts:

```typescript
defineTestSuite({
  name: 'Comprehensive Comparison',
  matrix: {
    agent: [sonnet, haiku],
    maxTurns: [8, 16, 32],
    prompt: ['/refactor', '/refactor with tests']
  },
  test: ({ agent, maxTurns, prompt }) => {
    vibeTest(
      `${agent.name}, ${maxTurns} turns, ${prompt}`,
      async ({ runAgent, expect }) => {
        const result = await runAgent({ agent, maxTurns, prompt });

        expect(result).toCompleteAllTodos();

        // Track efficiency
        console.log({
          config: { agent: agent.name, maxTurns, prompt },
          turnsUsed: result.timeline.all().length,
          cost: result.metrics.totalCostUsd
        });
      }
    );
  }
});

// Generates 12 tests (2 agents × 3 maxTurns × 2 prompts)
```
### Pattern 4: Cost vs Quality Trade-offs

Compare cost/quality across models:

```typescript
defineTestSuite({
  name: 'Cost vs Quality',
  matrix: {
    agent: [
      defineAgent({ name: 'haiku', model: 'claude-3-5-haiku-latest' }),
      defineAgent({ name: 'sonnet', model: 'claude-3-5-sonnet-latest' })
    ]
  },
  test: ({ agent }) => {
    vibeTest(`${agent.name} quality check`, async ({ runAgent, expect }) => {
      const result = await runAgent({ agent, prompt: '/implement feature X with tests' });

      // Quality evaluation
      const judgment = await judge(result, {
        rubric: {
          name: 'Implementation Quality',
          criteria: [
            { name: 'tests', description: 'Has comprehensive tests' },
            { name: 'types', description: 'Properly typed' },
            { name: 'docs', description: 'Well documented' }
          ]
        }
      });

      console.log(`${agent.name} results:`, {
        passed: judgment.passed,
        cost: result.metrics.totalCostUsd,
        qualityScore: judgment.criteria.filter(c => c.passed).length
      });

      expect(judgment.passed).toBe(true);
    });
  }
});
```
### Pattern 5: Parameter Tuning

Find optimal parameters:

```typescript
defineTestSuite({
  name: 'Parameter Tuning',
  matrix: {
    maxTurns: [4, 8, 16, 32],
    temperature: [0.0, 0.5, 1.0]
  },
  test: ({ maxTurns, temperature }) => {
    vibeTest(
      `maxTurns=${maxTurns}, temp=${temperature}`,
      async ({ runAgent, expect }) => {
        const result = await runAgent({
          agent: defineAgent({
            name: 'tuned',
            model: 'claude-3-5-sonnet-latest',
            // Note: temperature config depends on SDK integration
          }),
          prompt: '/refactor codebase',
          maxTurns
        });

        expect(result).toCompleteAllTodos();

        // Find optimal settings
        console.log({
          maxTurns,
          temperature,
          completed: result.timeline.todos().every(t => t.status === 'completed'),
          cost: result.metrics.totalCostUsd
        });
      }
    );
  }
});

// Generates 12 tests (4 maxTurns × 3 temperatures)
```
## Advanced Usage

### Dynamic Test Names

Generate descriptive test names from combinations:

```typescript
defineTestSuite({
  matrix: {
    agent: [sonnet, haiku],
    maxTurns: [8, 16]
  },
  test: ({ agent, maxTurns }) => {
    const name = `${agent.name} (${maxTurns} turns) refactors auth`;

    vibeTest(name, async ({ runAgent, expect }) => {
      // Test logic
    });
  }
});
```
### Conditional Tests

Skip certain combinations:

```typescript
defineTestSuite({
  matrix: {
    agent: [sonnet, haiku, opus],
    complexity: ['simple', 'complex']
  },
  test: ({ agent, complexity }) => {
    // Skip opus on simple tasks (too expensive)
    if (agent.name === 'opus' && complexity === 'simple') {
      return;
    }

    vibeTest(`${agent.name} handles ${complexity}`, async ({ runAgent }) => {
      // Test logic
    });
  }
});
```
### Comparison Reporting

Aggregate results for comparison:

```typescript
const results: Array<{ agent: string; cost: number; quality: number }> = [];

defineTestSuite({
  matrix: { agent: [sonnet, haiku] },
  test: ({ agent }) => {
    vibeTest(`${agent.name} comparison`, async ({ runAgent, expect }) => {
      const result = await runAgent({ agent, prompt: '/implement feature' });

      const judgment = await judge(result, {
        rubric: {
          name: 'Quality',
          criteria: [{ name: 'test', description: 'Has tests' }]
        }
      });

      // Collect results
      results.push({
        agent: agent.name,
        cost: result.metrics.totalCostUsd || 0,
        quality: judgment.passed ? 1 : 0
      });

      expect(result).toCompleteAllTodos();
    });
  }
});

// After all tests, compare results
test('compare all results', () => {
  console.table(results);

  // Find best value (lowest cost with quality)
  const bestValue = results
    .filter(r => r.quality === 1)
    .sort((a, b) => a.cost - b.cost)[0];

  console.log('Best value:', bestValue.agent);
});
```
## Best Practices

- **Keep matrices small** - Large matrices generate many tests (can be slow/expensive)
- **Use meaningful names** - Include parameter values in test names for clarity
- **Track metrics** - Log cost/tokens/duration to compare configurations
- **Start small** - Begin with 2-3 values per dimension, expand as needed
- **Consider cost** - Each combination = 1 full agent run (can be expensive)
- **Skip selectively** - Use conditional logic to skip unnecessary combinations
- **Aggregate results** - Collect data for comparison reports
## When to Use Matrix Testing

### ✅ Good Use Cases

- **Model comparison** - Benchmark different Claude models
- **Prompt engineering** - Test multiple prompt variations
- **Parameter tuning** - Find optimal maxTurns, temperature, etc.
- **Configuration testing** - Test different workspace setups
- **Cost analysis** - Compare cost vs quality trade-offs

### ❌ Avoid Matrix Testing For

- **Single configuration** - Just use a regular `vibeTest`
- **Large matrices** - 50+ tests = slow and expensive
- **Flaky tests** - Matrix amplifies flakiness across combinations
- **Debugging** - Too many tests make debugging harder
## Comparison: Matrix vs Manual

| Aspect | Matrix Testing | Manual Tests |
|---|---|---|
| Code | DRY (single definition) | Repetitive |
| Maintenance | Easy (add to arrays) | Hard (update each test) |
| Scalability | Automatic combinations | Manual calculation |
| Test Count | N × M × … (Cartesian) | Explicit count |
| Use Case | Comparisons, benchmarks | One-off tests |
## Troubleshooting

### Too Many Tests Generated

**Problem:** Matrix generates 100+ tests; runs are slow and expensive.

**Cause:** Too many dimensions or values per dimension.

**Solution:** Reduce matrix size, use conditional skipping, or split into multiple smaller matrices.
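For example, a full 3-agent × 4-budget matrix (12 tests) can often be split into two focused suites that answer the same questions with fewer runs. A sketch reusing the agents defined earlier on this page (test bodies elided):

```typescript
// Suite 1: sweep the turn budget on a single model (4 tests).
defineTestSuite({
  name: 'Turn Budget Sweep (sonnet only)',
  matrix: { agent: [sonnet], maxTurns: [4, 8, 16, 32] },
  test: ({ agent, maxTurns }) => {
    vibeTest(`${agent.name} with ${maxTurns} turns`, async ({ runAgent, expect }) => {
      // Test logic
    });
  }
});

// Suite 2: compare models at one fixed budget (3 tests).
defineTestSuite({
  name: 'Model Comparison (fixed budget)',
  matrix: { agent: [sonnet, haiku, opus], maxTurns: [16] },
  test: ({ agent, maxTurns }) => {
    vibeTest(`${agent.name} with ${maxTurns} turns`, async ({ runAgent, expect }) => {
      // Test logic
    });
  }
});

// 4 + 3 = 7 tests instead of 3 × 4 = 12
```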
### Test Names Not Descriptive

**Problem:** All tests have generic names like "test 1", "test 2".

**Cause:** Not using parameters in the test name.

**Solution:** Include parameter values in the test name:

```typescript
vibeTest(`${agent.name} with ${maxTurns} turns`, ...)
```
### Hard to Compare Results

**Problem:** Can't easily compare outcomes across combinations.

**Cause:** No aggregation or reporting.

**Solution:** Collect results in an array and print a comparison table after the tests run (see the Comparison Reporting pattern above).
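A compact sketch of that approach, mirroring the Comparison Reporting pattern (the row shape here is just an example):

```typescript
const rows: Array<{ config: string; cost: number }> = [];

defineTestSuite({
  matrix: { agent: [sonnet, haiku], maxTurns: [8, 16] },
  test: ({ agent, maxTurns }) => {
    vibeTest(`${agent.name} / ${maxTurns} turns`, async ({ runAgent, expect }) => {
      const result = await runAgent({ agent, prompt: '/refactor src/auth.ts', maxTurns });

      // Collect one row per combination for the final comparison.
      rows.push({
        config: `${agent.name}/${maxTurns}`,
        cost: result.metrics.totalCostUsd || 0
      });

      expect(result).toCompleteAllTodos();
    });
  }
});

// Runs after the generated tests and renders one table for all combinations.
test('compare configurations', () => {
  console.table(rows);
});
```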
## See Also

- API Reference: `defineTestSuite` - Full API documentation
- Your First Evaluation - Matrix testing in action
- Benchmarking Guide - Advanced comparison techniques
- Cost Optimization - Minimize matrix testing cost
## Quick Reference

```typescript
import { defineTestSuite, defineAgent } from '@dao/vibe-check';

// Define configurations
const agents = [
  defineAgent({ name: 'sonnet', model: 'claude-3-5-sonnet-latest' }),
  defineAgent({ name: 'haiku', model: 'claude-3-5-haiku-latest' })
];

// Matrix testing
defineTestSuite({
  name: 'Model Comparison',
  matrix: {
    agent: agents,
    maxTurns: [8, 16],
    prompt: ['/refactor', '/refactor with tests']
  },
  test: ({ agent, maxTurns, prompt }) => {
    vibeTest(
      `${agent.name}, ${maxTurns} turns, ${prompt}`,
      async ({ runAgent, expect }) => {
        const result = await runAgent({ agent, maxTurns, prompt });
        expect(result).toCompleteAllTodos();

        // Log for comparison
        console.log({
          agent: agent.name,
          maxTurns,
          prompt,
          cost: result.metrics.totalCostUsd
        });
      }
    );
  }
});

// Generates 8 tests (2 agents × 2 maxTurns × 2 prompts)
```