
Testing Guides

Learn how to write effective tests for your Claude Code agents using vibe-check’s powerful testing features.


Testing Claude Code agents requires specialized tools and patterns. These guides cover:

  • Real-time monitoring - Watch execution progress and fail fast on errors
  • Multi-run tracking - Aggregate state across multiple agent invocations
  • Assertion patterns - Use custom matchers for common validation scenarios
  • Benchmarking - Compare performance across models and configurations

Real-Time Monitoring

Monitor agent execution in real time and implement fail-fast assertions.

You’ll learn:

  • Using AgentExecution.watch() for real-time monitoring
  • Accessing partial results during execution
  • Implementing early termination conditions
  • Detecting errors and anomalies as they happen

Use cases:

  • Failing fast on file deletions
  • Monitoring tool usage patterns
  • Budget enforcement during execution
  • Real-time quality gates
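The fail-fast pattern above can be sketched without the library: consume a stream of execution events and throw the moment a failure condition appears, rather than waiting for the run to finish. This is only an illustration of the technique — the event shape (`tool`, `path`, `costUsd`) is hypothetical, and vibe-check's real `AgentExecution.watch()` API may expose events differently.

```typescript
// Hypothetical event shape -- vibe-check's real event type may differ.
interface ToolEvent {
  tool: string;    // e.g. "Read", "Edit", "Delete"
  path?: string;   // file affected, if any
  costUsd: number; // cumulative cost so far
}

// Watch an event stream and fail fast: abort on the first file deletion
// or as soon as the budget is exceeded, instead of after the full run.
async function failFast(
  events: AsyncIterable<ToolEvent>,
  maxCostUsd: number,
): Promise<ToolEvent[]> {
  const seen: ToolEvent[] = [];
  for await (const ev of events) {
    seen.push(ev);
    if (ev.tool === "Delete") {
      throw new Error(`File deleted during run: ${ev.path}`);
    }
    if (ev.costUsd > maxCostUsd) {
      throw new Error(`Budget exceeded: $${ev.costUsd} > $${maxCostUsd}`);
    }
  }
  return seen;
}
```

The same loop structure supports any real-time quality gate: add a condition, throw early, and the test fails minutes sooner than a post-hoc assertion would.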

Multi-Run Tracking

Track and aggregate state across multiple agent runs for comprehensive testing.

You’ll learn:

  • Building cumulative context objects
  • Aggregating results from multiple runs
  • Cross-run state analysis
  • Persistent state management

Use cases:

  • Multi-step test workflows
  • Progressive refinement testing
  • Long-running automation validation
  • State evolution tracking
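A cumulative context object can be as simple as a class that folds each run's result into shared totals, so cross-run assertions see the whole workflow. The `RunResult` fields below are illustrative stand-ins, not vibe-check's actual result type.

```typescript
// Hypothetical per-run result -- field names are illustrative only.
interface RunResult {
  filesChanged: string[];
  costUsd: number;
  todosCompleted: number;
}

// Cumulative context: aggregates state across multiple agent runs so
// tests can assert on the workflow as a whole, not just the last run.
class CumulativeContext {
  readonly filesChanged = new Set<string>(); // deduplicated across runs
  totalCostUsd = 0;
  totalTodos = 0;

  record(run: RunResult): void {
    run.filesChanged.forEach((f) => this.filesChanged.add(f));
    this.totalCostUsd += run.costUsd;
    this.totalTodos += run.todosCompleted;
  }
}
```

Calling `record()` after each invocation lets a final assertion check, for example, that the whole multi-step workflow stayed under budget even when every individual run did.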

Assertion Patterns

Master all available matchers for comprehensive test assertions.

You’ll learn:

  • File-based matchers (toHaveChangedFiles, toHaveNoDeletedFiles)
  • Tool-based matchers (toHaveUsedTool, toUseOnlyTools)
  • Quality matchers (toCompleteAllTodos, toPassRubric)
  • Cost matchers (toStayUnderCost)
  • Chaining matchers for complex assertions

Use cases:

  • Validating file operations
  • Enforcing tool usage policies
  • Quality gates with LLM judges
  • Budget enforcement
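To make the matcher categories concrete, here are plain-function stand-ins for the checks behind three of them. vibe-check's real matchers hook into the test runner's `expect()`; these functions only sketch the underlying logic, and the `ResultLike` shape is an assumption.

```typescript
// Hypothetical result shape -- stands in for an agent run's outcome.
interface ResultLike {
  deletedFiles: string[];
  toolsUsed: string[];
  costUsd: number;
}

// Logic behind a toHaveNoDeletedFiles-style matcher.
function hasNoDeletedFiles(r: ResultLike): boolean {
  return r.deletedFiles.length === 0;
}

// Logic behind a toUseOnlyTools-style matcher: every tool used must
// appear in the allowed list.
function usedOnlyTools(r: ResultLike, allowed: string[]): boolean {
  return r.toolsUsed.every((t) => allowed.includes(t));
}

// Logic behind a toStayUnderCost-style matcher.
function stayedUnderCost(r: ResultLike, maxUsd: number): boolean {
  return r.costUsd <= maxUsd;
}
```

Chaining matchers in a test is the equivalent of requiring all of these predicates to hold for the same run result.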

Benchmarking

Generate Cartesian-product tests for systematic model and configuration benchmarking.

You’ll learn:

  • Using defineTestSuite for test generation
  • Creating configuration matrices
  • Comparing model performance
  • Analyzing benchmark results

Use cases:

  • Model comparison (Sonnet vs Opus vs Haiku)
  • Configuration optimization
  • Regression testing across versions
  • Performance profiling
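The core of matrix benchmarking is Cartesian-product expansion: given named axes of options, emit one configuration per combination. This generic helper sketches what a `defineTestSuite`-style generator does under the hood; the axis names (`model`, `temperature`) are illustrative, not vibe-check's required keys.

```typescript
// Expand named axes into one config object per combination.
function cartesian<T extends Record<string, readonly unknown[]>>(
  axes: T,
): Array<{ [K in keyof T]: T[K][number] }> {
  return Object.entries(axes).reduce<Array<Record<string, unknown>>>(
    // For each axis, branch every partial config once per axis value.
    (acc, [key, values]) =>
      acc.flatMap((partial) => values.map((v) => ({ ...partial, [key]: v }))),
    [{}],
  ) as Array<{ [K in keyof T]: T[K][number] }>;
}

// 3 models x 2 temperature settings -> 6 benchmark configurations.
const matrix = cartesian({
  model: ["sonnet", "opus", "haiku"],
  temperature: [0, 0.7],
});
```

Generating one test per entry in `matrix` gives systematic coverage: adding a fourth model or a third temperature automatically multiplies out into the full set of new combinations.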