Matrix Testing

Matrix testing allows you to run the same test across multiple configurations using Cartesian product expansion. This is essential for comparing models, prompts, or parameters systematically.


Problem: You want to compare multiple configurations (models, prompts, parameters) without duplicating test code.

import { vibeTest } from '@dao/vibe-check';
// sonnetAgent and haikuAgent are defined with defineAgent, as in the solution below

vibeTest('sonnet refactors auth', async ({ runAgent, expect }) => {
  const result = await runAgent({
    agent: sonnetAgent,
    prompt: '/refactor src/auth.ts',
    maxTurns: 8
  });
  expect(result).toCompleteAllTodos();
});

vibeTest('haiku refactors auth', async ({ runAgent, expect }) => {
  const result = await runAgent({
    agent: haikuAgent,
    prompt: '/refactor src/auth.ts',
    maxTurns: 8
  });
  expect(result).toCompleteAllTodos();
});

vibeTest('sonnet with 16 turns refactors auth', async ({ runAgent, expect }) => {
  const result = await runAgent({
    agent: sonnetAgent,
    prompt: '/refactor src/auth.ts',
    maxTurns: 16
  });
  expect(result).toCompleteAllTodos();
});

// ... 1 more test (haiku with 16 turns)
// Total: 4 tests (2 agents × 2 maxTurns)

Issues:

  • Duplicate test code (DRY violation)
  • Manual Cartesian product calculation
  • Hard to add new configurations
Solution: define the matrix once and let defineTestSuite generate every combination:

import { vibeTest, defineTestSuite, defineAgent } from '@dao/vibe-check';

const sonnetAgent = defineAgent({ name: 'sonnet', model: 'claude-3-5-sonnet-latest' });
const haikuAgent = defineAgent({ name: 'haiku', model: 'claude-3-5-haiku-latest' });

defineTestSuite({
  matrix: {
    agent: [sonnetAgent, haikuAgent],
    maxTurns: [8, 16]
  },
  test: ({ agent, maxTurns }) => {
    vibeTest(`${agent.name} with ${maxTurns} turns`, async ({ runAgent, expect }) => {
      const result = await runAgent({
        agent,
        prompt: '/refactor src/auth.ts',
        maxTurns
      });
      expect(result).toCompleteAllTodos();
    });
  }
});

// Generates 4 tests automatically (2 agents × 2 maxTurns):
// ✓ sonnet with 8 turns
// ✓ haiku with 8 turns
// ✓ sonnet with 16 turns
// ✓ haiku with 16 turns

Benefits:

  • DRY (single test definition)
  • Automatic Cartesian product
  • Easy to add configurations (just add to arrays; see the sketch below)
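
For example, adding a third agent is a one-line change, and the same suite then generates 6 tests. A sketch reusing the agents defined above (opus uses the claude-3-opus-latest alias, as in the benchmark example later in this page):

const opusAgent = defineAgent({ name: 'opus', model: 'claude-3-opus-latest' });

defineTestSuite({
  matrix: {
    agent: [sonnetAgent, haikuAgent, opusAgent], // one new array entry
    maxTurns: [8, 16]
  },
  test: ({ agent, maxTurns }) => {
    // same test body as above
  }
});
// Now generates 6 tests (3 agents × 2 maxTurns)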

Matrix testing generates one test per combination of matrix values:

agents   = [A, B]
maxTurns = [8, 16]
prompts  = [P1, P2]

Total tests = 2 × 2 × 2 = 8

Generated combinations:

1. A, 8, P1
2. A, 8, P2
3. A, 16, P1
4. A, 16, P2
5. B, 8, P1
6. B, 8, P2
7. B, 16, P1
8. B, 16, P2

Each combination becomes a separate test with unique parameters.
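
The expansion itself is a plain Cartesian product. A minimal sketch of how it can be computed (illustrative only, not vibe-check's internal implementation; the ordering shown matches the listing above, but the library's actual ordering is not guaranteed):

function expandMatrix<T extends Record<string, readonly unknown[]>>(
  matrix: T
): Array<{ [K in keyof T]: T[K][number] }> {
  // Start with one empty combination, then extend it key by key
  let combos: Array<Record<string, unknown>> = [{}];
  for (const [key, values] of Object.entries(matrix)) {
    combos = combos.flatMap(combo =>
      values.map(value => ({ ...combo, [key]: value }))
    );
  }
  return combos as Array<{ [K in keyof T]: T[K][number] }>;
}

expandMatrix({ agent: ['A', 'B'], maxTurns: [8, 16] });
// => [{ agent: 'A', maxTurns: 8 }, { agent: 'A', maxTurns: 16 },
//     { agent: 'B', maxTurns: 8 }, { agent: 'B', maxTurns: 16 }]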


defineTestSuite({
  /** Matrix parameters (Cartesian product) */
  matrix: {
    param1: [value1, value2, ...],
    param2: [valueA, valueB, ...],
    // ...
  },
  /** Optional suite name */
  name?: string,
  /** Test function receiving one combination */
  test: (combo) => {
    vibeTest(`test ${combo.param1} ${combo.param2}`, async ({ runAgent }) => {
      // Test logic using combo.param1, combo.param2, etc.
    });
  }
});

Parameters:

  • matrix - Object where each key maps to an array of values
  • name - Optional suite name for grouping
  • test - Function receiving one combination of parameters (its typing is sketched below)
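
In TypeScript terms, the options can be sketched roughly as follows (an approximation for illustration; the library's actual exported type names may differ):

// Approximate shape, for illustration only
type Matrix = Record<string, readonly unknown[]>;

interface TestSuiteOptions<M extends Matrix> {
  /** Each key maps to an array of candidate values */
  matrix: M;
  /** Optional suite name for grouping */
  name?: string;
  /** Called once per combination; registers one or more tests */
  test: (combo: { [K in keyof M]: M[K][number] }) => void;
}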
import { vibeTest, defineTestSuite, defineAgent } from '@dao/vibe-check';

const sonnet = defineAgent({ name: 'sonnet', model: 'claude-3-5-sonnet-latest' });
const haiku = defineAgent({ name: 'haiku', model: 'claude-3-5-haiku-latest' });

defineTestSuite({
  name: 'Model Comparison',
  matrix: {
    agent: [sonnet, haiku]
  },
  test: ({ agent }) => {
    vibeTest(`${agent.name} refactors code`, async ({ runAgent, expect }) => {
      const result = await runAgent({
        agent,
        prompt: '/refactor src/auth.ts'
      });
      expect(result).toCompleteAllTodos();
      expect(result).toStayUnderCost(5.00);
    });
  }
});

// Generates 2 tests:
// ✓ sonnet refactors code
// ✓ haiku refactors code

Compare different models on the same task:

const models = [
  defineAgent({ name: 'sonnet', model: 'claude-3-5-sonnet-latest' }),
  defineAgent({ name: 'haiku', model: 'claude-3-5-haiku-latest' }),
  defineAgent({ name: 'opus', model: 'claude-3-opus-latest' })
];

defineTestSuite({
  name: 'Model Benchmark: Code Refactoring',
  matrix: { agent: models },
  test: ({ agent }) => {
    vibeTest(`${agent.name} refactors auth module`, async ({ runAgent, expect }) => {
      const result = await runAgent({
        agent,
        prompt: '/refactor src/auth.ts with comprehensive tests'
      });

      // Quality checks
      expect(result).toCompleteAllTodos();
      expect(result).toHaveChangedFiles(['src/auth.ts', 'tests/auth.test.ts']);

      // Performance tracking
      console.log(`${agent.name} metrics:`, {
        cost: result.metrics.totalCostUsd,
        tokens: result.metrics.totalTokens,
        duration: result.metrics.durationMs
      });
    });
  }
});

Test multiple prompt variations:

const prompts = [
  'Refactor src/auth.ts',
  'Refactor src/auth.ts with comprehensive tests',
  'Refactor src/auth.ts. Add tests. Improve types. Document public API.'
];

defineTestSuite({
  name: 'Prompt Variations',
  matrix: { prompt: prompts },
  test: ({ prompt }) => {
    vibeTest(`prompt: "${prompt.slice(0, 30)}..."`, async ({ runAgent, expect }) => {
      const result = await runAgent({ prompt });
      expect(result).toCompleteAllTodos();

      console.log(`Prompt effectiveness:`, {
        prompt: prompt.slice(0, 50),
        filesChanged: result.files.changed().length,
        cost: result.metrics.totalCostUsd
      });
    });
  }
});

Test combinations of models, turns, and prompts:

defineTestSuite({
  name: 'Comprehensive Comparison',
  matrix: {
    agent: [sonnet, haiku],
    maxTurns: [8, 16, 32],
    prompt: ['/refactor', '/refactor with tests']
  },
  test: ({ agent, maxTurns, prompt }) => {
    vibeTest(
      `${agent.name}, ${maxTurns} turns, ${prompt}`,
      async ({ runAgent, expect }) => {
        const result = await runAgent({ agent, maxTurns, prompt });
        expect(result).toCompleteAllTodos();

        // Track efficiency
        console.log({
          config: { agent: agent.name, maxTurns, prompt },
          turnsUsed: result.timeline.all().length,
          cost: result.metrics.totalCostUsd
        });
      }
    );
  }
});

// Generates 12 tests (2 agents × 3 maxTurns × 2 prompts)

Compare cost/quality across models:

defineTestSuite({
  name: 'Cost vs Quality',
  matrix: {
    agent: [
      defineAgent({ name: 'haiku', model: 'claude-3-5-haiku-latest' }),
      defineAgent({ name: 'sonnet', model: 'claude-3-5-sonnet-latest' })
    ]
  },
  test: ({ agent }) => {
    vibeTest(`${agent.name} quality check`, async ({ runAgent, expect }) => {
      const result = await runAgent({
        agent,
        prompt: '/implement feature X with tests'
      });

      // Quality evaluation via judge() (assumed to be exported by '@dao/vibe-check')
      const judgment = await judge(result, {
        rubric: {
          name: 'Implementation Quality',
          criteria: [
            { name: 'tests', description: 'Has comprehensive tests' },
            { name: 'types', description: 'Properly typed' },
            { name: 'docs', description: 'Well documented' }
          ]
        }
      });

      console.log(`${agent.name} results:`, {
        passed: judgment.passed,
        cost: result.metrics.totalCostUsd,
        qualityScore: judgment.criteria.filter(c => c.passed).length
      });

      expect(judgment.passed).toBe(true);
    });
  }
});

Find optimal parameters:

defineTestSuite({
  name: 'Parameter Tuning',
  matrix: {
    maxTurns: [4, 8, 16, 32],
    temperature: [0.0, 0.5, 1.0]
  },
  test: ({ maxTurns, temperature }) => {
    vibeTest(
      `maxTurns=${maxTurns}, temp=${temperature}`,
      async ({ runAgent, expect }) => {
        const result = await runAgent({
          agent: defineAgent({
            name: 'tuned',
            model: 'claude-3-5-sonnet-latest'
            // Note: temperature is not wired through here; whether and where
            // it can be set depends on the SDK integration
          }),
          prompt: '/refactor codebase',
          maxTurns
        });
        expect(result).toCompleteAllTodos();

        // Find optimal settings
        console.log({
          maxTurns,
          temperature,
          completed: result.timeline.todos().every(t => t.status === 'completed'),
          cost: result.metrics.totalCostUsd
        });
      }
    );
  }
});

// Generates 12 tests (4 maxTurns × 3 temperatures)

Generate descriptive test names from combinations:

defineTestSuite({
  matrix: {
    agent: [sonnet, haiku],
    maxTurns: [8, 16]
  },
  test: ({ agent, maxTurns }) => {
    const name = `${agent.name} (${maxTurns} turns) refactors auth`;
    vibeTest(name, async ({ runAgent, expect }) => {
      // Test logic
    });
  }
});

Skip certain combinations:

defineTestSuite({
  matrix: {
    agent: [sonnet, haiku, opus], // agents defined as in the benchmark above
    complexity: ['simple', 'complex']
  },
  test: ({ agent, complexity }) => {
    // Skip opus on simple tasks (too expensive); returning without calling
    // vibeTest means no test is registered for this combination
    if (agent.name === 'opus' && complexity === 'simple') {
      return;
    }
    vibeTest(`${agent.name} handles ${complexity}`, async ({ runAgent }) => {
      // Test logic
    });
  }
});

Aggregate results for comparison:

const results: Array<{ agent: string; cost: number; quality: number }> = [];

defineTestSuite({
  matrix: { agent: [sonnet, haiku] },
  test: ({ agent }) => {
    vibeTest(`${agent.name} comparison`, async ({ runAgent, expect }) => {
      const result = await runAgent({
        agent,
        prompt: '/implement feature'
      });
      const judgment = await judge(result, {
        rubric: {
          name: 'Quality',
          criteria: [{ name: 'test', description: 'Has tests' }]
        }
      });

      // Collect results for the comparison test below
      results.push({
        agent: agent.name,
        cost: result.metrics.totalCostUsd ?? 0,
        quality: judgment.passed ? 1 : 0
      });

      expect(result).toCompleteAllTodos();
    });
  }
});

// After all matrix tests in this file, compare results
// (a plain test with no agent fixtures; defined last so it runs after the generated tests)
test('compare all results', () => {
  console.table(results);

  // Find best value (lowest cost among passing runs)
  const bestValue = results
    .filter(r => r.quality === 1)
    .sort((a, b) => a.cost - b.cost)[0];
  if (bestValue) console.log('Best value:', bestValue.agent);
});

  1. Keep matrices small - Large matrices generate many tests (can be slow/expensive)
  2. Use meaningful names - Include parameter values in test names for clarity
  3. Track metrics - Log cost/tokens/duration to compare configurations
  4. Start small - Begin with 2-3 values per dimension, expand as needed
  5. Consider cost - Each combination = 1 full agent run (can be expensive; see the size-check sketch after this list)
  6. Skip selectively - Use conditional logic to skip unnecessary combinations
  7. Aggregate results - Collect data for comparison reports
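
Because every combination is a full agent run, it can help to sanity-check the matrix size before launching a suite. A minimal sketch (matrixSize is a hypothetical helper, not part of vibe-check):

// Hypothetical helper: count combinations before running anything
function matrixSize(matrix: Record<string, readonly unknown[]>): number {
  return Object.values(matrix).reduce((n, values) => n * values.length, 1);
}

const size = matrixSize({
  agent: [sonnet, haiku],
  maxTurns: [8, 16, 32],
  prompt: ['/refactor', '/refactor with tests']
});
console.log(size); // 12 - each one is a full (billable) agent run
if (size > 20) {
  throw new Error('Matrix too large; trim a dimension before running');
}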

Matrix testing works well for:

  • Model comparison - Benchmark different Claude models
  • Prompt engineering - Test multiple prompt variations
  • Parameter tuning - Find optimal maxTurns, temperature, etc.
  • Configuration testing - Test different workspace setups
  • Cost analysis - Compare cost vs quality trade-offs

Avoid matrix testing for:

  • Single configurations - Just use a regular vibeTest
  • Large matrices - 50+ tests are slow and expensive
  • Flaky tests - A matrix amplifies flakiness across combinations
  • Debugging - Too many tests make failures harder to isolate

Aspect        Matrix Testing            Manual Tests
Code          DRY (single definition)   Repetitive
Maintenance   Easy (add to arrays)      Hard (update each test)
Scalability   Automatic combinations    Manual calculation
Test Count    N × M × … (Cartesian)     Explicit count
Use Case      Comparisons, benchmarks   One-off tests

Problem: Matrix generates 100+ tests, runs are slow and expensive.

Cause: Too many dimensions or values per dimension.

Solution: Reduce matrix size, use conditional skipping, or split into multiple smaller matrices.
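
For example, a three-dimensional matrix can often be split into two focused suites that keep only the combinations worth paying for (a sketch reusing the agents from the examples above):

// Instead of one 2 × 3 × 2 = 12-test matrix, run two smaller suites
defineTestSuite({
  name: 'Turn Budget (sonnet only)',
  matrix: { agent: [sonnet], maxTurns: [8, 16, 32] },
  test: ({ agent, maxTurns }) => {
    // Test logic
  }
});

defineTestSuite({
  name: 'Model Comparison (fixed budget)',
  matrix: { agent: [sonnet, haiku], prompt: ['/refactor', '/refactor with tests'] },
  test: ({ agent, prompt }) => {
    // Test logic
  }
});
// 3 + 4 = 7 tests instead of 12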

Problem: All tests have generic names like “test 1”, “test 2”.

Cause: Not using parameters in test name.

Solution: Include parameter values in test name:

vibeTest(`${agent.name} with ${maxTurns} turns`, ...)

Problem: Can’t easily compare outcomes across combinations.

Cause: No aggregation or reporting.

Solution: Collect results in an array and print a comparison table after the tests run, as in the aggregation example above.



import { vibeTest, defineTestSuite, defineAgent } from '@dao/vibe-check';

// Define configurations
const agents = [
  defineAgent({ name: 'sonnet', model: 'claude-3-5-sonnet-latest' }),
  defineAgent({ name: 'haiku', model: 'claude-3-5-haiku-latest' })
];

// Matrix testing
defineTestSuite({
  name: 'Model Comparison',
  matrix: {
    agent: agents,
    maxTurns: [8, 16],
    prompt: ['/refactor', '/refactor with tests']
  },
  test: ({ agent, maxTurns, prompt }) => {
    vibeTest(
      `${agent.name}, ${maxTurns} turns, ${prompt}`,
      async ({ runAgent, expect }) => {
        const result = await runAgent({ agent, maxTurns, prompt });
        expect(result).toCompleteAllTodos();

        // Log for comparison
        console.log({
          agent: agent.name,
          maxTurns,
          prompt,
          cost: result.metrics.totalCostUsd
        });
      }
    );
  }
});

// Generates 8 tests (2 agents × 2 maxTurns × 2 prompts)