Rubric & JudgeResult
Rubric
defines evaluation criteria for LLM-based judgments, and JudgeResult
contains the structured evaluation results. These types power the judge()
function for quality assessment.
Rubric Interface
Section titled “Rubric Interface”interface Rubric { name: string; criteria: RubricCriterion[]; passingThreshold?: number;}
Properties
Section titled “Properties”name: string
Human-readable name for this rubric.
Example:
const rubric: Rubric = { name: 'Code Quality Assessment', criteria: [...]};
criteria
Section titled “criteria”criteria: RubricCriterion[]
Array of evaluation criteria.
RubricCriterion Interface:
interface RubricCriterion { name: string; description: string; weight?: number; threshold?: number;}
Example:
const rubric: Rubric = { name: 'Authentication Quality', criteria: [ { name: 'security', description: 'Uses secure authentication patterns (bcrypt, JWT, etc.)', weight: 3 }, { name: 'error-handling', description: 'Handles auth errors gracefully with user-friendly messages', weight: 2 }, { name: 'test-coverage', description: 'Includes unit tests for auth functions', weight: 1 } ]};
passingThreshold
Section titled “passingThreshold”passingThreshold?: number
Minimum score required to pass (0.0 to 1.0). Default: 0.7
Example:
const rubric: Rubric = { name: 'Strict Quality Gate', criteria: [...], passingThreshold: 0.9 // Require 90% to pass};
RubricCriterion Interface
Section titled “RubricCriterion Interface”interface RubricCriterion { name: string; description: string; weight?: number; threshold?: number;}
Properties
Section titled “Properties”name: string
Short identifier for the criterion (used in results).
Naming Conventions:
- Use kebab-case:
error-handling
,code-quality
- Be specific:
jwt-security
rather thansecurity
- Avoid spaces:
test-coverage
nottest coverage
description
Section titled “description”description: string
Detailed description of what the judge should evaluate.
Best Practices:
- Be specific and measurable
- Include examples of good/bad patterns
- Mention tools or techniques to look for
- Keep under 200 characters
Example:
{ name: 'error-handling', description: 'Uses try-catch blocks with specific error types. Logs errors appropriately. Returns user-friendly error messages.'}
weight
Section titled “weight”weight?: number
Relative importance of this criterion (default: 1).
How Weights Work:
- Higher weight = more impact on overall score
- Weights are normalized (e.g., weights [1, 2, 3] → [16.7%, 33.3%, 50%])
- Default weight is 1 if not specified
Example:
const rubric: Rubric = { name: 'Security Review', criteria: [ { name: 'auth', description: 'Authentication is secure', weight: 5 // Most important }, { name: 'input-validation', description: 'Input is validated', weight: 3 // Important }, { name: 'logging', description: 'Security events are logged', weight: 1 // Nice to have } ]};
threshold
Section titled “threshold”threshold?: number
Minimum score for this criterion (0.0 to 1.0). If set, failure on this criterion can fail the entire rubric.
Example:
{ name: 'security', description: 'Code has no security vulnerabilities', threshold: 1.0 // Must score 100% on security}
JudgeResult Interface
Section titled “JudgeResult Interface”interface JudgeResult { passed: boolean; score: number; criteria: CriterionResult[]; reasoning: string; suggestions?: string[];}
Properties
Section titled “Properties”passed
Section titled “passed”passed: boolean
Whether the evaluation passed based on passingThreshold
.
Example:
const judgment = await judge(result, { rubric });
if (judgment.passed) { console.log('✓ Quality gate passed');} else { console.log('✗ Quality gate failed');}
score: number
Overall weighted score (0.0 to 1.0).
Calculation:
score = Σ(criterion.score × criterion.weight) / Σ(criterion.weight)
Example:
const judgment = await judge(result, { rubric });
console.log(`Score: ${(judgment.score * 100).toFixed(1)}%`);
if (judgment.score >= 0.9) { console.log('Excellent quality');} else if (judgment.score >= 0.7) { console.log('Good quality');} else { console.log('Needs improvement');}
criteria
Section titled “criteria”criteria: CriterionResult[]
Per-criterion evaluation results.
CriterionResult Interface:
interface CriterionResult { name: string; score: number; reasoning: string; passed: boolean;}
Example:
const judgment = await judge(result, { rubric });
console.log('Criterion breakdown:');judgment.criteria.forEach(c => { const status = c.passed ? '✓' : '✗'; const percent = (c.score * 100).toFixed(0); console.log(`${status} ${c.name}: ${percent}%`); console.log(` ${c.reasoning}`);});
reasoning
Section titled “reasoning”reasoning: string
Judge’s overall reasoning for the score.
Example:
const judgment = await judge(result, { rubric });
console.log('Judge reasoning:');console.log(judgment.reasoning);
suggestions
Section titled “suggestions”suggestions?: string[]
Optional improvement suggestions from the judge.
Example:
const judgment = await judge(result, { rubric });
if (judgment.suggestions && judgment.suggestions.length > 0) { console.log('Improvement suggestions:'); judgment.suggestions.forEach((s, i) => { console.log(`${i + 1}. ${s}`); });}
Common Rubric Patterns
Section titled “Common Rubric Patterns”Code Quality
Section titled “Code Quality”const codeQualityRubric: Rubric = { name: 'Code Quality', criteria: [ { name: 'readability', description: 'Code is well-formatted, uses clear variable names, and includes helpful comments', weight: 2 }, { name: 'maintainability', description: 'Functions are small and focused. Code follows DRY principle. Logic is easy to follow', weight: 3 }, { name: 'error-handling', description: 'Uses try-catch blocks appropriately. Handles edge cases. Provides meaningful error messages', weight: 2 }, { name: 'type-safety', description: 'Uses TypeScript types correctly. No any types. Proper null checks', weight: 1 } ], passingThreshold: 0.75};
Security Review
Section titled “Security Review”const securityRubric: Rubric = { name: 'Security Review', criteria: [ { name: 'authentication', description: 'Uses secure auth patterns (bcrypt, JWT). No plaintext passwords. Proper session management', weight: 5, threshold: 0.9 // Critical }, { name: 'input-validation', description: 'Validates and sanitizes all user input. Prevents SQL injection, XSS, and other attacks', weight: 4, threshold: 0.9 // Critical }, { name: 'sensitive-data', description: 'No API keys, passwords, or secrets in code. Uses environment variables', weight: 3, threshold: 1.0 // Must be perfect }, { name: 'security-headers', description: 'Sets appropriate security headers (CSP, HSTS, etc.)', weight: 2 } ], passingThreshold: 0.85};
Test Coverage
Section titled “Test Coverage”const testCoverageRubric: Rubric = { name: 'Test Coverage', criteria: [ { name: 'unit-tests', description: 'Core functions have unit tests with good coverage (>80%)', weight: 3 }, { name: 'edge-cases', description: 'Tests cover edge cases, error conditions, and boundary values', weight: 2 }, { name: 'test-quality', description: 'Tests are clear, isolated, and maintainable. Good use of describe/it structure', weight: 1 } ], passingThreshold: 0.7};
Accessibility
Section titled “Accessibility”const a11yRubric: Rubric = { name: 'Accessibility', criteria: [ { name: 'semantic-html', description: 'Uses semantic HTML elements (header, nav, main, button, etc.)', weight: 2 }, { name: 'aria-labels', description: 'Proper ARIA labels and roles for screen readers', weight: 3 }, { name: 'keyboard-nav', description: 'All interactive elements are keyboard accessible. Logical tab order', weight: 2 }, { name: 'color-contrast', description: 'Sufficient color contrast for text (WCAG AA compliance)', weight: 1 } ], passingThreshold: 0.8};
Documentation
Section titled “Documentation”const docsRubric: Rubric = { name: 'Documentation Quality', criteria: [ { name: 'api-docs', description: 'All public functions have JSDoc comments with @param and @returns', weight: 3 }, { name: 'readme', description: 'README includes setup instructions, usage examples, and API reference', weight: 2 }, { name: 'inline-comments', description: 'Complex logic has explanatory comments. Comments explain "why" not "what"', weight: 1 } ], passingThreshold: 0.75};
Usage Examples
Section titled “Usage Examples”Basic Evaluation
Section titled “Basic Evaluation”import { vibeTest } from '@dao/vibe-check';
vibeTest('code quality evaluation', async ({ runAgent, judge, expect }) => { const result = await runAgent({ prompt: '/implement user login' });
const rubric: Rubric = { name: 'Login Implementation', criteria: [ { name: 'security', description: 'Uses bcrypt for passwords, JWT for tokens', weight: 3 }, { name: 'error-handling', description: 'Handles invalid credentials gracefully', weight: 2 } ] };
const judgment = await judge(result, { rubric });
expect(judgment.passed).toBe(true); expect(judgment.score).toBeGreaterThan(0.7);});
Detailed Analysis
Section titled “Detailed Analysis”vibeTest('detailed quality analysis', async ({ runAgent, judge }) => { const result = await runAgent({ prompt: '/implement' });
const rubric: Rubric = { name: 'Implementation Quality', criteria: [ { name: 'correctness', description: 'Implementation is correct', weight: 5 }, { name: 'performance', description: 'Code is efficient', weight: 3 }, { name: 'style', description: 'Follows style guide', weight: 1 } ], passingThreshold: 0.8 };
const judgment = await judge(result, { rubric });
console.log('=== Evaluation Results ==='); console.log(`Overall: ${judgment.passed ? 'PASS' : 'FAIL'}`); console.log(`Score: ${(judgment.score * 100).toFixed(1)}%`); console.log(`\nReasoning: ${judgment.reasoning}`);
console.log('\nCriterion breakdown:'); judgment.criteria.forEach(c => { const status = c.passed ? '✓' : '✗'; console.log(`${status} ${c.name}: ${(c.score * 100).toFixed(0)}%`); console.log(` ${c.reasoning}`); });
if (judgment.suggestions) { console.log('\nSuggestions:'); judgment.suggestions.forEach((s, i) => { console.log(`${i + 1}. ${s}`); }); }});
Threshold Enforcement
Section titled “Threshold Enforcement”vibeTest('critical security check', async ({ runAgent, judge, expect }) => { const result = await runAgent({ prompt: '/implement payment processing' });
const rubric: Rubric = { name: 'Payment Security', criteria: [ { name: 'pci-compliance', description: 'Follows PCI DSS requirements. No card data in logs', weight: 5, threshold: 1.0 // Must be perfect }, { name: 'encryption', description: 'Uses TLS for transmission. Encrypts sensitive data at rest', weight: 4, threshold: 0.9 }, { name: 'error-handling', description: 'Never exposes sensitive details in errors', weight: 3, threshold: 0.9 } ], passingThreshold: 0.95 };
const judgment = await judge(result, { rubric, throwOnFail: true });
// Only reaches here if all thresholds passed expect(judgment.passed).toBe(true);});
Custom Result Format
Section titled “Custom Result Format”import { z } from 'zod';
const CustomResultSchema = z.object({ passed: z.boolean(), score: z.number(), securityScore: z.number(), performanceScore: z.number(), issues: z.array(z.string())});
type CustomResult = z.infer<typeof CustomResultSchema>;
vibeTest('custom judgment format', async ({ runAgent, judge }) => { const result = await runAgent({ prompt: '/implement' });
const rubric: Rubric = { name: 'Multi-Aspect Review', criteria: [ { name: 'security', description: '...', weight: 3 }, { name: 'performance', description: '...', weight: 2 } ] };
const judgment = await judge<CustomResult>(result, { rubric, resultFormat: CustomResultSchema });
console.log('Security score:', judgment.securityScore); console.log('Performance score:', judgment.performanceScore); console.log('Issues:', judgment.issues);});
Best Practices
Section titled “Best Practices”1. Write Clear Criteria
Section titled “1. Write Clear Criteria”// ✅ Good: Specific and measurable{ name: 'error-handling', description: 'Uses try-catch for async operations. Logs errors with context. Returns user-friendly messages (not stack traces)'}
// ❌ Bad: Vague and subjective{ name: 'quality', description: 'Code is good'}
2. Set Appropriate Weights
Section titled “2. Set Appropriate Weights”const rubric: Rubric = { name: 'API Implementation', criteria: [ { name: 'correctness', description: 'API returns correct responses', weight: 5 // Most critical }, { name: 'performance', description: 'Responses are fast (<100ms)', weight: 3 // Important }, { name: 'documentation', description: 'Endpoints are documented', weight: 1 // Nice to have } ]};
3. Use Thresholds for Critical Criteria
Section titled “3. Use Thresholds for Critical Criteria”const rubric: Rubric = { name: 'Production Deployment', criteria: [ { name: 'security', description: 'No security vulnerabilities', threshold: 1.0 // Must be perfect }, { name: 'tests', description: 'All tests passing', threshold: 1.0 // Must be perfect }, { name: 'performance', description: 'Meets performance targets', weight: 2 // Important but not blocking } ], passingThreshold: 0.9};
4. Keep Criteria Focused
Section titled “4. Keep Criteria Focused”// ✅ Good: Focused criteriaconst rubric: Rubric = { name: 'Authentication', criteria: [ { name: 'password-hashing', description: 'Uses bcrypt with salt rounds >= 10' }, { name: 'jwt-security', description: 'JWT tokens have expiration and proper signing' }, { name: 'session-management', description: 'Sessions timeout and can be invalidated' } ]};
// ❌ Bad: Vague umbrella criteriaconst rubric: Rubric = { name: 'Authentication', criteria: [ { name: 'security', description: 'Authentication is secure' } // Too broad ]};
5. Set Realistic Thresholds
Section titled “5. Set Realistic Thresholds”const rubric: Rubric = { name: 'Code Review', criteria: [...], passingThreshold: 0.7 // ✅ Good: Achievable (70%) // passingThreshold: 0.95 // ❌ Too strict for general use // passingThreshold: 0.5 // ❌ Too lenient};
See Also
Section titled “See Also”- judge() → - Function that uses rubrics
- Using Judge → - Judge usage guide
- Rubrics Guide → - Rubric design patterns
- Custom Matchers → -
toPassRubric()
matcher