Your First Test
In this tutorial, you’ll write a real-world test that validates agent behavior using custom matchers, reactive watchers for fail-fast testing, and LLM-based evaluation with the judge.
What You’ll Build
A test that validates an agent can:
- Refactor code with proper test coverage
- Complete all TODOs without leaving the code in a broken state
- Stay within a cost budget
- Meet quality standards (validated by an LLM judge)
We’ll use:
- Custom matchers - toCompleteAllTodos(), toHaveChangedFiles(), toStayUnderCost()
- Reactive watchers - Fail fast if agent violates constraints during execution
- Judge - LLM-based quality evaluation against a rubric
Scenario: Code Refactoring Test
Let’s test an agent that refactors authentication code and adds tests.
Step 1: Write the Test
Create tests/refactor.vibe.test.ts:

import { vibeTest } from '@dao/vibe-check';

vibeTest('refactor adds comprehensive tests', async ({ runAgent, expect, judge }) => {
  const result = await runAgent({
    prompt: `
      Refactor src/auth.ts to:
      - Add TypeScript strict types
      - Add comprehensive unit tests
      - Remove any TODO comments

      Budget: Stay under $2.00
    `
  });

  // 1. Check basic completion
  expect(result).toCompleteAllTodos();

  // 2. Verify file changes
  expect(result).toHaveChangedFiles([
    'src/auth.ts',      // Original file
    'src/auth.test.ts'  // New test file
  ]);

  // 3. No files should be deleted
  expect(result).toHaveNoDeletedFiles();

  // 4. Verify cost budget
  expect(result).toStayUnderCost(2.00);

  // 5. Check specific file content
  const authTest = result.files.get('src/auth.test.ts');
  expect(authTest).toBeDefined();

  const testContent = await authTest?.after?.text();
  expect(testContent).toContain('describe');
  expect(testContent).toContain('expect');
});
Step 2: Run the Test
bun run vitest tests/refactor.vibe.test.ts
You should see:
✓ tests/refactor.vibe.test.ts (1)
  ✓ refactor adds comprehensive tests (18.3s)

Test Files  1 passed (1)
     Tests  1 passed (1)

Vibe Check Cost Summary
─────────────────────────────
Total Cost: $0.42
Total Tokens: 12,847
Step 3: Add Reactive Watchers (Fail Fast)
Waiting for the agent to finish only to discover it violated a constraint is inefficient. Use reactive watchers to fail fast during execution.
Why Watchers?
Problem: The agent deletes critical database files 2 minutes into a 5-minute run, but the test does not fail until the run ends 3 minutes later.
Solution: A watcher catches the violation as soon as it happens and aborts the execution.
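To make the timing argument concrete, here is a minimal, self-contained simulation of the fail-fast idea. It does not use the library: the `AgentSnapshot`, `SnapshotWatcher`, and `runWithWatchers` names are hypothetical, and the real abort mechanism may work differently. The sketch only shows that a watcher checked at every step stops the run at the 2-minute mark instead of the 5-minute mark.

```typescript
// Hypothetical snapshot of agent state at one point during a run.
interface AgentSnapshot {
  elapsedSec: number;
  changedPaths: string[];
}

// A watcher inspects a snapshot and throws to signal a violation.
type SnapshotWatcher = (snapshot: AgentSnapshot) => void;

interface RunOutcome {
  abortedAtSec: number | null; // null means the run finished normally
  reason: string | null;
}

// Feed snapshots to the watchers in order and stop at the first violation.
function runWithWatchers(
  snapshots: AgentSnapshot[],
  watchers: SnapshotWatcher[]
): RunOutcome {
  for (const snapshot of snapshots) {
    for (const watcher of watchers) {
      try {
        watcher(snapshot);
      } catch (err) {
        const reason = err instanceof Error ? err.message : String(err);
        return { abortedAtSec: snapshot.elapsedSec, reason };
      }
    }
  }
  return { abortedAtSec: null, reason: null };
}

// Simulated 5-minute run that touches database/ at the 2-minute mark.
const simulatedRun: AgentSnapshot[] = [
  { elapsedSec: 60, changedPaths: ["src/auth.ts"] },
  { elapsedSec: 120, changedPaths: ["src/auth.ts", "database/schema.sql"] },
  { elapsedSec: 300, changedPaths: ["src/auth.ts", "database/schema.sql", "src/auth.test.ts"] },
];

const noDatabaseChanges: SnapshotWatcher = ({ changedPaths }) => {
  const offender = changedPaths.find((p) => p.startsWith("database/"));
  if (offender !== undefined) {
    throw new Error(`Protected path modified: ${offender}`);
  }
};

const runOutcome = runWithWatchers(simulatedRun, [noDatabaseChanges]);
console.log(runOutcome.abortedAtSec, runOutcome.reason);
```

In this simulation the run is cut off at 120 seconds, so the remaining 180 seconds of agent time (and cost) are never spent.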
Add Watchers to Your Test
import { vibeTest } from '@dao/vibe-check';

vibeTest('refactor with fail-fast watchers', async ({ runAgent, expect }) => {
  // Start execution (returns AgentExecution, not RunResult)
  const execution = runAgent({
    prompt: 'Refactor src/auth.ts --add-tests'
  });

  // Register watchers for real-time assertions
  execution.watch(({ files, metrics }) => {
    // Fail fast if agent touches database directory.
    // toContain() on an array matches whole elements only, so compare path prefixes.
    const changedPaths = files.changed().map(f => f.path);
    expect(changedPaths.some(p => p.startsWith('database/'))).toBe(false);

    // Fail fast if cost exceeds budget
    if (metrics.totalCostUsd && metrics.totalCostUsd > 2.0) {
      throw new Error(`Cost exceeded: $${metrics.totalCostUsd}`);
    }
  });

  // Wait for completion
  const result = await execution;

  // Final assertions (only run if watchers never failed)
  expect(result).toCompleteAllTodos();
  expect(result).toHaveChangedFiles(['src/auth.ts', 'src/auth.test.ts']);
});
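One detail worth spelling out when guarding a directory: checking an array of paths for the string 'database/' compares whole elements, so a changed file such as database/schema.sql would not match. The helper below is a small sketch of a prefix-aware check. The function name and the convention that entries ending in "/" denote directories are assumptions of this example, not part of the library.

```typescript
// Entries ending in "/" are treated as directory prefixes; all other entries
// must match the full path exactly. This convention is an assumption of the sketch.
function findProtectedChanges(
  changedPaths: string[],
  protectedEntries: string[]
): string[] {
  return changedPaths.filter((path) =>
    protectedEntries.some((entry) =>
      entry.endsWith("/") ? path.startsWith(entry) : path === entry
    )
  );
}

const changedNow = ["src/auth.ts", "database/migrations/001.sql", ".env"];

// Whole-element membership never sees the directory itself in the list.
const naiveHit = changedNow.indexOf("database/") !== -1;

// The prefix-aware check reports both protected paths.
const protectedHits = findProtectedChanges(changedNow, ["database/", ".env"]);

console.log(naiveHit, protectedHits);
```

Inside a watcher, the result could be asserted with something like `expect(protectedHits).toEqual([])`, which also lists the offending paths in the failure message.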
Multiple Watchers
You can register multiple watchers. They execute sequentially in registration order:

const execution = runAgent({ prompt: '/refactor' });

// Watcher 1: File constraints
execution.watch(({ files }) => {
  const changed = files.changed().map(f => f.path);
  // Match the database directory by prefix; match .env by exact path
  expect(changed.some(p => p.startsWith('database/'))).toBe(false);
  expect(changed).not.toContain('.env');
});

// Watcher 2: Cost constraints
execution.watch(({ metrics }) => {
  // Treat a cost that has not been reported yet as 0
  expect(metrics.totalCostUsd ?? 0).toBeLessThan(5.0);
});

// Watcher 3: Tool failures
execution.watch(({ tools }) => {
  const failed = tools.failed();
  expect(failed.length).toBeLessThan(3); // Max 2 failures allowed
});

const result = await execution;
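The ordering guarantee can be pictured with a tiny registry that stores watchers in an array and returns itself from `watch()` so calls can be chained. This is only an illustration: `WatcherRegistry` and `notify` are hypothetical names, and the assumption that a throwing watcher prevents later watchers from running for that update is not confirmed by this tutorial.

```typescript
type StateWatcher<S> = (state: S) => void;

class WatcherRegistry<S> {
  private readonly watchers: StateWatcher<S>[] = [];

  // Returns `this` so registrations can be chained.
  watch(fn: StateWatcher<S>): this {
    this.watchers.push(fn);
    return this;
  }

  // Invoke every watcher in registration order; a throw propagates immediately.
  notify(state: S): void {
    for (const fn of this.watchers) {
      fn(state);
    }
  }
}

const callOrder: string[] = [];

const registry = new WatcherRegistry<{ costUsd: number }>()
  .watch(() => {
    callOrder.push("files");
  })
  .watch(({ costUsd }) => {
    callOrder.push("cost");
    if (costUsd > 5.0) {
      throw new Error(`Cost exceeded: $${costUsd}`);
    }
  })
  .watch(() => {
    callOrder.push("tools");
  });

registry.notify({ costUsd: 1.2 });
console.log(callOrder.join(" -> ")); // files -> cost -> tools
```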
Step 4: Add LLM-Based Evaluation with Judge
Custom matchers validate structure and metrics, but what about code quality? Use the judge to evaluate subjective criteria with an LLM.
What is Judge?
The judge is a specialized agent that:
- Takes your rubric (evaluation criteria)
- Reads the RunResult (code changes, tool calls, etc.)
- Returns a structured judgment (passed/failed + scores)
Add Judge to Your Test
import { vibeTest } from '@dao/vibe-check';

vibeTest('refactor with quality evaluation', async ({ runAgent, judge, expect }) => {
  const result = await runAgent({
    prompt: 'Refactor src/auth.ts --add-tests'
  });

  // Use judge to evaluate code quality
  const judgment = await judge(result, {
    rubric: {
      name: 'Code Quality Rubric',
      criteria: [
        {
          name: 'has_tests',
          description: 'Added comprehensive unit tests with good coverage',
          weight: 0.4
        },
        {
          name: 'type_safety',
          description: 'Uses TypeScript strict types appropriately',
          weight: 0.3
        },
        {
          name: 'no_todos',
          description: 'No TODO comments left in code',
          weight: 0.2
        },
        {
          name: 'readable',
          description: 'Code is clean and well-documented',
          weight: 0.1
        }
      ],
      passThreshold: 0.7 // 70% score to pass
    }
  });

  // Check if judgment passed
  expect(judgment.passed).toBe(true);

  // Inspect individual criteria scores
  console.log('Quality Scores:');
  for (const criterion of judgment.criteria) {
    console.log(`  ${criterion.name}: ${criterion.score}/1.0`);
  }

  console.log(`Overall: ${judgment.overallScore}/1.0`);
});
Understanding the Rubric
{
  name: 'Code Quality Rubric',
  criteria: [
    {
      name: 'has_tests',           // Unique identifier
      description: 'Added tests',  // What to evaluate
      weight: 0.4                  // 40% of overall score
    },
    // ...
  ],
  passThreshold: 0.7  // Minimum score to pass (0.0 - 1.0)
}
How it works:
- Judge reads file changes from result.files
- Evaluates each criterion (0.0 - 1.0 score)
- Computes weighted overall score
- Returns passed: true if score ≥ passThreshold
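The weighting arithmetic can be worked through by hand. The sketch below assumes the overall score is a weighted average normalized by the sum of the weights; the judge's exact aggregation formula is not documented in this tutorial, and the per-criterion scores are made-up values for illustration.

```typescript
interface ScoredCriterion {
  name: string;
  weight: number;
  score: number; // 0.0 - 1.0, as returned per criterion
}

// Weighted average, normalized so weights need not sum to exactly 1.
function weightedScore(criteria: ScoredCriterion[]): number {
  const totalWeight = criteria.reduce((sum, c) => sum + c.weight, 0);
  if (totalWeight === 0) {
    return 0;
  }
  const weighted = criteria.reduce((sum, c) => sum + c.weight * c.score, 0);
  return weighted / totalWeight;
}

// Weights from the rubric above; scores are hypothetical.
const scoredCriteria: ScoredCriterion[] = [
  { name: "has_tests", weight: 0.4, score: 0.9 },   // contributes 0.36
  { name: "type_safety", weight: 0.3, score: 0.8 }, // contributes 0.24
  { name: "no_todos", weight: 0.2, score: 1.0 },    // contributes 0.20
  { name: "readable", weight: 0.1, score: 0.5 },    // contributes 0.05
];

const overallScore = weightedScore(scoredCriteria);
const rubricPassed = overallScore >= 0.7; // passThreshold from the rubric

console.log(overallScore.toFixed(2), rubricPassed); // 0.85 true
```

Note how the low `readable` score costs only 0.05 of the total because its weight is 0.1, while the same score on `has_tests` would cost 0.2.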
Complete Example: Putting It All Together
Here’s a complete test using all features:

import { vibeTest } from '@dao/vibe-check';

vibeTest('comprehensive refactoring validation', async ({ runAgent, judge, expect }) => {
  // 1. Start execution with watchers for fail-fast
  const execution = runAgent({
    prompt: `
      Refactor src/auth.ts:
      - Add TypeScript strict types
      - Add comprehensive unit tests
      - Remove TODO comments
      - Don't modify database/ directory
    `
  });

  // 2. Register reactive watchers
  execution
    .watch(({ files }) => {
      // Fail fast if protected files touched.
      // Match the database directory by prefix; match .env by exact path.
      const changed = files.changed().map(f => f.path);
      expect(changed.some(p => p.startsWith('database/'))).toBe(false);
      expect(changed).not.toContain('.env');
    })
    .watch(({ tools }) => {
      // Fail fast if too many tool failures
      expect(tools.failed().length).toBeLessThan(3);
    })
    .watch(({ metrics }) => {
      // Fail fast if cost exceeded (a cost not yet reported counts as 0)
      expect(metrics.totalCostUsd ?? 0).toBeLessThan(3.0);
    });

  // 3. Wait for completion
  const result = await execution;

  // 4. Structural assertions (fast)
  expect(result).toCompleteAllTodos();
  expect(result).toHaveChangedFiles(['src/auth.ts', 'src/auth.test.ts']);
  expect(result).toHaveNoDeletedFiles();
  expect(result).toStayUnderCost(2.50);

  // 5. Content validation
  const authFile = result.files.get('src/auth.ts');
  const authTestFile = result.files.get('src/auth.test.ts');

  expect(authFile).toBeDefined();
  expect(authTestFile).toBeDefined();

  const authContent = await authFile?.after?.text();
  const testContent = await authTestFile?.after?.text();

  expect(authContent).not.toContain('TODO');
  expect(testContent).toContain('describe(');
  expect(testContent).toContain('expect(');

  // 6. Tool usage validation
  const writeToolUses = result.tools.used('Write');
  expect(writeToolUses).toBeGreaterThan(0);

  // 7. Quality evaluation with judge (slower, but thorough)
  const judgment = await judge(result, {
    rubric: {
      name: 'Refactor Quality',
      criteria: [
        {
          name: 'has_tests',
          description: 'Added comprehensive unit tests with edge cases',
          weight: 0.4
        },
        {
          name: 'type_safety',
          description: 'Uses TypeScript strict mode properly',
          weight: 0.3
        },
        {
          name: 'clean_code',
          description: 'Code is readable and well-structured',
          weight: 0.3
        }
      ],
      passThreshold: 0.75
    }
  });

  expect(judgment.passed).toBe(true);

  // Log quality report
  console.log('\n📊 Quality Report:');
  for (const criterion of judgment.criteria) {
    const icon = criterion.score >= 0.7 ? '✅' : '⚠️';
    console.log(`${icon} ${criterion.name}: ${criterion.score.toFixed(2)}`);
  }
  console.log(`\n🎯 Overall Score: ${judgment.overallScore.toFixed(2)}`);
});
All Available Custom Matchers
Here’s the complete list of matchers you can use:
File Matchers
// Assert specific files changed (supports globs)
expect(result).toHaveChangedFiles(['src/**/*.ts', 'tests/**/*.test.ts']);

// Assert no files deleted
expect(result).toHaveNoDeletedFiles();
Tool Matchers
// Assert specific tool was used
expect(result).toHaveUsedTool('Edit');

// Assert tool was used minimum number of times
expect(result).toHaveUsedTool('Bash', { min: 2 });

// Assert only allowed tools were used
expect(result).toUseOnlyTools(['Edit', 'Read', 'Write']);
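Conceptually, an allowlist check like toUseOnlyTools reduces to a set difference between the tools that were used and the tools that are permitted. The following sketch shows that logic in isolation; `findDisallowedTools` is a hypothetical helper, not the matcher's actual implementation.

```typescript
// Return the distinct tool names that are not on the allowlist.
function findDisallowedTools(usedTools: string[], allowedTools: string[]): string[] {
  const allowed = new Set(allowedTools);
  const offenders = usedTools.filter((tool) => !allowed.has(tool));
  return Array.from(new Set(offenders)); // de-duplicate, keep first-seen order
}

// Hypothetical sequence of tool calls from a run.
const toolCalls = ["Read", "Edit", "Bash", "Edit", "Bash"];
const disallowed = findDisallowedTools(toolCalls, ["Edit", "Read", "Write"]);

// A non-empty result means the allowlist was violated.
console.log(disallowed);
```

Reporting the offending names, rather than a bare pass or fail, makes it easier to decide whether to widen the allowlist or tighten the prompt.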
Quality Matchers
// Assert all TODOs completed
expect(result).toCompleteAllTodos();

// Assert no errors in logs
expect(result).toHaveNoErrorsInLogs();

// Assert rubric passes (uses judge internally)
await expect(result).toPassRubric({
  name: 'Quality',
  criteria: [
    { name: 'correct', description: 'Works correctly' }
  ]
});
Cost Matchers
// Assert cost stayed under budget
expect(result).toStayUnderCost(5.00); // Max $5.00
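For readers curious how a matcher of this kind is shaped, Vitest custom matchers return an object with a `pass` flag and a `message` function, and are registered through `expect.extend`. The function below follows that shape for a cost check. It is a sketch, not the library's source: `checkStaysUnderCost` is a hypothetical name, a missing cost is treated as 0, and a cost exactly equal to the budget is assumed to pass.

```typescript
// Shape returned by Vitest/Jest-style custom matchers.
interface MatcherResult {
  pass: boolean;
  message: () => string;
}

// Minimal slice of a run result; totalCostUsd may be absent.
interface CostCarrier {
  metrics: { totalCostUsd?: number };
}

function checkStaysUnderCost(received: CostCarrier, maxUsd: number): MatcherResult {
  const cost = received.metrics.totalCostUsd ?? 0; // assumption: missing cost counts as 0
  return {
    pass: cost <= maxUsd, // assumption: spending exactly the budget still passes
    message: () =>
      `expected cost $${cost.toFixed(2)} to stay under $${maxUsd.toFixed(2)}`,
  };
}

const withinBudget = checkStaysUnderCost({ metrics: { totalCostUsd: 0.42 } }, 5.0);
const overBudget = checkStaysUnderCost({ metrics: { totalCostUsd: 6.1 } }, 5.0);

console.log(withinBudget.pass, overBudget.pass, overBudget.message());
```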
Hook Data Matchers
// Assert all hooks were captured successfully
// (Useful for debugging hook capture issues)
expect(result).toHaveCompleteHookData();
Best Practices
✅ DO: Use Matchers for Structure
expect(result).toHaveChangedFiles(['src/index.ts']);
expect(result).toCompleteAllTodos();
expect(result).toStayUnderCost(1.0);
✅ DO: Use Watchers for Invariants
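execution.watch(({ files }) => {
  // Never touch database: compare path prefixes, since files.changed()
  // returns file entries rather than plain strings
  const touchedDatabase = files.changed().some(f => f.path.startsWith('database/'));
  expect(touchedDatabase).toBe(false);
});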
✅ DO: Use Judge for Quality
const judgment = await judge(result, {
  rubric: {
    name: 'Code Quality',
    criteria: [
      { name: 'readable', description: 'Code is clean' }
    ]
  }
});
expect(judgment.passed).toBe(true);
❌ DON’T: Use Judge for Simple Checks
// ❌ Bad: Using judge for simple checks
await judge(result, {
  rubric: {
    criteria: [
      { name: 'file_exists', description: 'README.md exists' }
    ]
  }
});

// ✅ Good: Use matchers instead
expect(result).toHaveChangedFiles(['README.md']);
❌ DON’T: Register Watchers After await
// ❌ Bad: Watchers must be registered before await
const result = await runAgent({ prompt: '...' });
result.watch(() => {}); // Error: can't watch completed execution

// ✅ Good: Register before await
const execution = runAgent({ prompt: '...' });
execution.watch(() => {});
const result = await execution;
Debugging Failed Tests
Check the HTML Report
Open .vibe-artifacts/reports/index.html to see:
- Full conversation transcript
- Tool call timeline
- File diffs
- Error messages
Inspect Captured Data
vibeTest('debug failed test', async ({ runAgent }) => {
  const result = await runAgent({ prompt: '...' });

  // Print metrics
  console.log('Metrics:', result.metrics);

  // Print file changes
  console.log('Files changed:', result.files.changed().map(f => f.path));

  // Print tool calls
  console.log('Tools used:', result.tools.all().map(t => t.name));

  // Print TODOs
  console.log('TODOs:', result.todos);

  // Check hook capture
  console.log('Hook capture:', result.hookCaptureStatus);
});
Check Hook Capture Status
If tests fail unexpectedly, verify hooks were captured:

expect(result.hookCaptureStatus.complete).toBe(true);

if (!result.hookCaptureStatus.complete) {
  console.log('Missing events:', result.hookCaptureStatus.missingEvents);
  console.log('Warnings:', result.hookCaptureStatus.warnings);
}
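When several tests need the same diagnostic, the status object can be turned into a single readable message. The helper below is a sketch: the field names come from the snippet above, but the element types (arrays of strings), the `describeHookCapture` name, and the sample event name are assumptions made for illustration.

```typescript
// Field names follow the tutorial; the element types are assumptions.
interface HookCaptureStatus {
  complete: boolean;
  missingEvents: string[];
  warnings: string[];
}

// Build a multi-line summary suitable for a log line or an assertion message.
function describeHookCapture(status: HookCaptureStatus): string {
  if (status.complete) {
    return "Hook capture complete";
  }
  const lines = ["Hook capture incomplete"];
  for (const missing of status.missingEvents) {
    lines.push(`  missing event: ${missing}`);
  }
  for (const warning of status.warnings) {
    lines.push(`  warning: ${warning}`);
  }
  return lines.join("\n");
}

// Hypothetical status object; the event name is illustrative only.
const sampleStatus: HookCaptureStatus = {
  complete: false,
  missingEvents: ["PostToolUse"],
  warnings: ["hook log truncated"],
};

console.log(describeHookCapture(sampleStatus));
```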
Next Steps
You’ve learned how to write comprehensive tests! Now explore:
- Build Your First Workflow → - Multi-stage automation pipelines
- First Evaluation → - Matrix testing and benchmarking
- Custom Matchers Guide → - Deep dive into all matchers
- Reactive Watchers Guide → - Advanced watcher patterns
- Using Judge → - Complete judge documentation
Reference
- vibeTest API → - Complete API reference
- RunResult Interface → - All captured data
- AgentExecution → - Watcher API
- Custom Matchers → - Matcher reference