Matrix Testing

Matrix testing allows you to run the same test across multiple configurations using Cartesian product expansion. This is essential for comparing models, prompts, or parameters systematically.


Problem: You want to compare multiple configurations (models, prompts, parameters) without duplicating test code.

import { vibeTest } from '@dao/vibe-check';
// sonnetAgent and haikuAgent are defined with defineAgent, as in the solution below

vibeTest('sonnet refactors auth', async ({ runAgent, expect }) => {
  const result = await runAgent({
    agent: sonnetAgent,
    prompt: '/refactor src/auth.ts',
    maxTurns: 8
  });
  expect(result).toCompleteAllTodos();
});

vibeTest('haiku refactors auth', async ({ runAgent, expect }) => {
  const result = await runAgent({
    agent: haikuAgent,
    prompt: '/refactor src/auth.ts',
    maxTurns: 8
  });
  expect(result).toCompleteAllTodos();
});

vibeTest('sonnet with 16 turns refactors auth', async ({ runAgent, expect }) => {
  const result = await runAgent({
    agent: sonnetAgent,
    prompt: '/refactor src/auth.ts',
    maxTurns: 16
  });
  expect(result).toCompleteAllTodos();
});

// ... 1 more test (haiku with 16 turns)
// Total: 4 tests (2 agents × 2 maxTurns)

Issues:

  • Duplicate test code (DRY violation)
  • Manual Cartesian product calculation
  • Hard to add new configurations
Solution: define the matrix once and let defineTestSuite generate every combination:

import { vibeTest, defineTestSuite, defineAgent } from '@dao/vibe-check';

const sonnetAgent = defineAgent({ name: 'sonnet', model: 'claude-3-5-sonnet-latest' });
const haikuAgent = defineAgent({ name: 'haiku', model: 'claude-3-5-haiku-latest' });

defineTestSuite({
  matrix: {
    agent: [sonnetAgent, haikuAgent],
    maxTurns: [8, 16]
  },
  test: ({ agent, maxTurns }) => {
    vibeTest(`${agent.name} with ${maxTurns} turns`, async ({ runAgent, expect }) => {
      const result = await runAgent({
        agent,
        prompt: '/refactor src/auth.ts',
        maxTurns
      });
      expect(result).toCompleteAllTodos();
    });
  }
});

// Generates 4 tests automatically (2 agents × 2 maxTurns):
// ✓ sonnet with 8 turns
// ✓ haiku with 8 turns
// ✓ sonnet with 16 turns
// ✓ haiku with 16 turns

Benefits:

  • DRY (single test definition)
  • Automatic Cartesian product
  • Easy to add configurations (just add to arrays; see the sketch below)
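
For example, adding a third agent is a one-line change, and the same suite then generates 6 tests. A sketch reusing the agents defined above (opus uses the claude-3-opus-latest alias, as in the benchmark example later in this page):

const opusAgent = defineAgent({ name: 'opus', model: 'claude-3-opus-latest' });

defineTestSuite({
  matrix: {
    agent: [sonnetAgent, haikuAgent, opusAgent], // one new array entry
    maxTurns: [8, 16]
  },
  test: ({ agent, maxTurns }) => {
    // same test body as above
  }
});
// Now generates 6 tests (3 agents × 2 maxTurns)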

Matrix testing generates one test per combination of matrix values:

agents   = [A, B]
maxTurns = [8, 16]
prompts  = [P1, P2]

Total tests = 2 × 2 × 2 = 8

Generated combinations:

1. A, 8, P1
2. A, 8, P2
3. A, 16, P1
4. A, 16, P2
5. B, 8, P1
6. B, 8, P2
7. B, 16, P1
8. B, 16, P2

Each combination becomes a separate test with unique parameters.
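
The expansion itself is a plain Cartesian product. A minimal sketch of how it can be computed (illustrative only, not vibe-check's internal implementation; the ordering shown matches the listing above, but the library's actual ordering is not guaranteed):

function expandMatrix<T extends Record<string, readonly unknown[]>>(
  matrix: T
): Array<{ [K in keyof T]: T[K][number] }> {
  // Start with one empty combination, then extend it key by key
  let combos: Array<Record<string, unknown>> = [{}];
  for (const [key, values] of Object.entries(matrix)) {
    combos = combos.flatMap(combo =>
      values.map(value => ({ ...combo, [key]: value }))
    );
  }
  return combos as Array<{ [K in keyof T]: T[K][number] }>;
}

expandMatrix({ agent: ['A', 'B'], maxTurns: [8, 16] });
// => [{ agent: 'A', maxTurns: 8 }, { agent: 'A', maxTurns: 16 },
//     { agent: 'B', maxTurns: 8 }, { agent: 'B', maxTurns: 16 }]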


defineTestSuite({
  /** Matrix parameters (Cartesian product) */
  matrix: {
    param1: [value1, value2, ...],
    param2: [valueA, valueB, ...],
    // ...
  },
  /** Optional suite name */
  name?: string,
  /** Test function receiving one combination */
  test: (combo) => {
    vibeTest(`test ${combo.param1} ${combo.param2}`, async ({ runAgent }) => {
      // Test logic using combo.param1, combo.param2, etc.
    });
  }
});

Parameters:

  • matrix - Object where each key maps to an array of values
  • name - Optional suite name for grouping
  • test - Function receiving one combination of parameters (its typing is sketched below)
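
In TypeScript terms, the options can be sketched roughly as follows (an approximation for illustration; the library's actual exported type names may differ):

// Approximate shape, for illustration only
type Matrix = Record<string, readonly unknown[]>;

interface TestSuiteOptions<M extends Matrix> {
  /** Each key maps to an array of candidate values */
  matrix: M;
  /** Optional suite name for grouping */
  name?: string;
  /** Called once per combination; registers one or more tests */
  test: (combo: { [K in keyof M]: M[K][number] }) => void;
}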
import { vibeTest, defineTestSuite, defineAgent } from '@dao/vibe-check';

const sonnet = defineAgent({ name: 'sonnet', model: 'claude-3-5-sonnet-latest' });
const haiku = defineAgent({ name: 'haiku', model: 'claude-3-5-haiku-latest' });

defineTestSuite({
  name: 'Model Comparison',
  matrix: {
    agent: [sonnet, haiku]
  },
  test: ({ agent }) => {
    vibeTest(`${agent.name} refactors code`, async ({ runAgent, expect }) => {
      const result = await runAgent({
        agent,
        prompt: '/refactor src/auth.ts'
      });
      expect(result).toCompleteAllTodos();
      expect(result).toStayUnderCost(5.00);
    });
  }
});

// Generates 2 tests:
// ✓ sonnet refactors code
// ✓ haiku refactors code

Compare different models on the same task:

const models = [
  defineAgent({ name: 'sonnet', model: 'claude-3-5-sonnet-latest' }),
  defineAgent({ name: 'haiku', model: 'claude-3-5-haiku-latest' }),
  defineAgent({ name: 'opus', model: 'claude-3-opus-latest' })
];

defineTestSuite({
  name: 'Model Benchmark: Code Refactoring',
  matrix: { agent: models },
  test: ({ agent }) => {
    vibeTest(`${agent.name} refactors auth module`, async ({ runAgent, expect }) => {
      const result = await runAgent({
        agent,
        prompt: '/refactor src/auth.ts with comprehensive tests'
      });

      // Quality checks
      expect(result).toCompleteAllTodos();
      expect(result).toHaveChangedFiles(['src/auth.ts', 'tests/auth.test.ts']);

      // Performance tracking
      console.log(`${agent.name} metrics:`, {
        cost: result.metrics.totalCostUsd,
        tokens: result.metrics.totalTokens,
        duration: result.metrics.durationMs
      });
    });
  }
});

Test multiple prompt variations:

const prompts = [
  'Refactor src/auth.ts',
  'Refactor src/auth.ts with comprehensive tests',
  'Refactor src/auth.ts. Add tests. Improve types. Document public API.'
];

defineTestSuite({
  name: 'Prompt Variations',
  matrix: { prompt: prompts },
  test: ({ prompt }) => {
    vibeTest(`prompt: "${prompt.slice(0, 30)}..."`, async ({ runAgent, expect }) => {
      const result = await runAgent({ prompt });
      expect(result).toCompleteAllTodos();

      console.log(`Prompt effectiveness:`, {
        prompt: prompt.slice(0, 50),
        filesChanged: result.files.changed().length,
        cost: result.metrics.totalCostUsd
      });
    });
  }
});

Test combinations of models, turns, and prompts:

defineTestSuite({
  name: 'Comprehensive Comparison',
  matrix: {
    agent: [sonnet, haiku],
    maxTurns: [8, 16, 32],
    prompt: ['/refactor', '/refactor with tests']
  },
  test: ({ agent, maxTurns, prompt }) => {
    vibeTest(
      `${agent.name}, ${maxTurns} turns, ${prompt}`,
      async ({ runAgent, expect }) => {
        const result = await runAgent({ agent, maxTurns, prompt });
        expect(result).toCompleteAllTodos();

        // Track efficiency
        console.log({
          config: { agent: agent.name, maxTurns, prompt },
          turnsUsed: result.timeline.all().length,
          cost: result.metrics.totalCostUsd
        });
      }
    );
  }
});

// Generates 12 tests (2 agents × 3 maxTurns × 2 prompts)

Compare cost/quality across models:

defineTestSuite({
  name: 'Cost vs Quality',
  matrix: {
    agent: [
      defineAgent({ name: 'haiku', model: 'claude-3-5-haiku-latest' }),
      defineAgent({ name: 'sonnet', model: 'claude-3-5-sonnet-latest' })
    ]
  },
  test: ({ agent }) => {
    vibeTest(`${agent.name} quality check`, async ({ runAgent, expect }) => {
      const result = await runAgent({
        agent,
        prompt: '/implement feature X with tests'
      });

      // Quality evaluation via judge() (assumed to be exported by '@dao/vibe-check')
      const judgment = await judge(result, {
        rubric: {
          name: 'Implementation Quality',
          criteria: [
            { name: 'tests', description: 'Has comprehensive tests' },
            { name: 'types', description: 'Properly typed' },
            { name: 'docs', description: 'Well documented' }
          ]
        }
      });

      console.log(`${agent.name} results:`, {
        passed: judgment.passed,
        cost: result.metrics.totalCostUsd,
        qualityScore: judgment.criteria.filter(c => c.passed).length
      });

      expect(judgment.passed).toBe(true);
    });
  }
});

Find optimal parameters:

defineTestSuite({
  name: 'Parameter Tuning',
  matrix: {
    maxTurns: [4, 8, 16, 32],
    temperature: [0.0, 0.5, 1.0]
  },
  test: ({ maxTurns, temperature }) => {
    vibeTest(
      `maxTurns=${maxTurns}, temp=${temperature}`,
      async ({ runAgent, expect }) => {
        const result = await runAgent({
          agent: defineAgent({
            name: 'tuned',
            model: 'claude-3-5-sonnet-latest'
            // Note: temperature is not wired through here; whether and where
            // it can be set depends on the SDK integration
          }),
          prompt: '/refactor codebase',
          maxTurns
        });
        expect(result).toCompleteAllTodos();

        // Find optimal settings
        console.log({
          maxTurns,
          temperature,
          completed: result.timeline.todos().every(t => t.status === 'completed'),
          cost: result.metrics.totalCostUsd
        });
      }
    );
  }
});

// Generates 12 tests (4 maxTurns × 3 temperatures)

Generate descriptive test names from combinations:

defineTestSuite({
  matrix: {
    agent: [sonnet, haiku],
    maxTurns: [8, 16]
  },
  test: ({ agent, maxTurns }) => {
    const name = `${agent.name} (${maxTurns} turns) refactors auth`;
    vibeTest(name, async ({ runAgent, expect }) => {
      // Test logic
    });
  }
});

Skip certain combinations:

defineTestSuite({
  matrix: {
    agent: [sonnet, haiku, opus], // agents defined as in the benchmark above
    complexity: ['simple', 'complex']
  },
  test: ({ agent, complexity }) => {
    // Skip opus on simple tasks (too expensive); returning without calling
    // vibeTest means no test is registered for this combination
    if (agent.name === 'opus' && complexity === 'simple') {
      return;
    }
    vibeTest(`${agent.name} handles ${complexity}`, async ({ runAgent }) => {
      // Test logic
    });
  }
});

Aggregate results for comparison:

const results: Array<{ agent: string; cost: number; quality: number }> = [];

defineTestSuite({
  matrix: { agent: [sonnet, haiku] },
  test: ({ agent }) => {
    vibeTest(`${agent.name} comparison`, async ({ runAgent, expect }) => {
      const result = await runAgent({
        agent,
        prompt: '/implement feature'
      });
      const judgment = await judge(result, {
        rubric: {
          name: 'Quality',
          criteria: [{ name: 'test', description: 'Has tests' }]
        }
      });

      // Collect results for the comparison test below
      results.push({
        agent: agent.name,
        cost: result.metrics.totalCostUsd ?? 0,
        quality: judgment.passed ? 1 : 0
      });

      expect(result).toCompleteAllTodos();
    });
  }
});

// After all matrix tests in this file, compare results
// (a plain test with no agent fixtures; defined last so it runs after the generated tests)
test('compare all results', () => {
  console.table(results);

  // Find best value (lowest cost among passing runs)
  const bestValue = results
    .filter(r => r.quality === 1)
    .sort((a, b) => a.cost - b.cost)[0];
  if (bestValue) console.log('Best value:', bestValue.agent);
});

  1. Keep matrices small - Large matrices generate many tests (can be slow/expensive)
  2. Use meaningful names - Include parameter values in test names for clarity
  3. Track metrics - Log cost/tokens/duration to compare configurations
  4. Start small - Begin with 2-3 values per dimension, expand as needed
  5. Consider cost - Each combination = 1 full agent run (can be expensive; see the size-check sketch after this list)
  6. Skip selectively - Use conditional logic to skip unnecessary combinations
  7. Aggregate results - Collect data for comparison reports
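
Because every combination is a full agent run, it can help to sanity-check the matrix size before launching a suite. A minimal sketch (matrixSize is a hypothetical helper, not part of vibe-check):

// Hypothetical helper: count combinations before running anything
function matrixSize(matrix: Record<string, readonly unknown[]>): number {
  return Object.values(matrix).reduce((n, values) => n * values.length, 1);
}

const size = matrixSize({
  agent: [sonnet, haiku],
  maxTurns: [8, 16, 32],
  prompt: ['/refactor', '/refactor with tests']
});
console.log(size); // 12 - each one is a full (billable) agent run
if (size > 20) {
  throw new Error('Matrix too large; trim a dimension before running');
}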

Matrix testing works well for:

  • Model comparison - Benchmark different Claude models
  • Prompt engineering - Test multiple prompt variations
  • Parameter tuning - Find optimal maxTurns, temperature, etc.
  • Configuration testing - Test different workspace setups
  • Cost analysis - Compare cost vs quality trade-offs

Avoid matrix testing for:

  • Single configurations - Just use a regular vibeTest
  • Large matrices - 50+ tests are slow and expensive
  • Flaky tests - A matrix amplifies flakiness across combinations
  • Debugging - Too many tests make failures harder to isolate

Aspect        Matrix Testing            Manual Tests
Code          DRY (single definition)   Repetitive
Maintenance   Easy (add to arrays)      Hard (update each test)
Scalability   Automatic combinations    Manual calculation
Test Count    N × M × … (Cartesian)     Explicit count
Use Case      Comparisons, benchmarks   One-off tests

Problem: Matrix generates 100+ tests, runs are slow and expensive.

Cause: Too many dimensions or values per dimension.

Solution: Reduce matrix size, use conditional skipping, or split into multiple smaller matrices.
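
For example, a three-dimensional matrix can often be split into two focused suites that keep only the combinations worth paying for (a sketch reusing the agents from the examples above):

// Instead of one 2 × 3 × 2 = 12-test matrix, run two smaller suites
defineTestSuite({
  name: 'Turn Budget (sonnet only)',
  matrix: { agent: [sonnet], maxTurns: [8, 16, 32] },
  test: ({ agent, maxTurns }) => {
    // Test logic
  }
});

defineTestSuite({
  name: 'Model Comparison (fixed budget)',
  matrix: { agent: [sonnet, haiku], prompt: ['/refactor', '/refactor with tests'] },
  test: ({ agent, prompt }) => {
    // Test logic
  }
});
// 3 + 4 = 7 tests instead of 12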

Problem: All tests have generic names like “test 1”, “test 2”.

Cause: Not using parameters in test name.

Solution: Include parameter values in test name:

vibeTest(`${agent.name} with ${maxTurns} turns`, ...)

Problem: Can’t easily compare outcomes across combinations.

Cause: No aggregation or reporting.

Solution: Collect results in an array and print a comparison table after the tests run, as in the aggregation example above.



import { vibeTest, defineTestSuite, defineAgent } from '@dao/vibe-check';

// Define configurations
const agents = [
  defineAgent({ name: 'sonnet', model: 'claude-3-5-sonnet-latest' }),
  defineAgent({ name: 'haiku', model: 'claude-3-5-haiku-latest' })
];

// Matrix testing
defineTestSuite({
  name: 'Model Comparison',
  matrix: {
    agent: agents,
    maxTurns: [8, 16],
    prompt: ['/refactor', '/refactor with tests']
  },
  test: ({ agent, maxTurns, prompt }) => {
    vibeTest(
      `${agent.name}, ${maxTurns} turns, ${prompt}`,
      async ({ runAgent, expect }) => {
        const result = await runAgent({ agent, maxTurns, prompt });
        expect(result).toCompleteAllTodos();

        // Log for comparison
        console.log({
          agent: agent.name,
          maxTurns,
          prompt,
          cost: result.metrics.totalCostUsd
        });
      }
    );
  }
});

// Generates 8 tests (2 agents × 2 maxTurns × 2 prompts)