AI Agent Evaluation Framework

Design systems at scale fail quietly. Rules get written, distributed across files, and then slowly ignored — not out of negligence, but because agents and engineers alike lack reliable ways to discover them. This project builds an evaluation harness to measure exactly that gap.

Inspired by Vercel’s research on passive vs. active context retrieval, the framework tests whether a coding agent surfaces the right rules at the right moment — given only a task prompt and access to a project’s file tree.

The interesting question isn’t whether an agent can follow a rule it’s been given. It’s whether it can find the rule in the first place.

Approach

The harness operates in two modes. In passive mode, all relevant context is pre-loaded into the agent’s context window, which establishes a ceiling score. In active mode, the agent must retrieve context itself, simulating real usage. The delta between the two scores is the metric that matters: it isolates what is lost to retrieval rather than to rule-following.
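
As a rough illustration of the passive/active delta, the sketch below assumes each test case yields one score per mode, normalized to the range 0 to 1. The ModeScores shape and the function names are hypothetical and are not taken from the framework itself.

```typescript
// Hypothetical shape: one score per mode for a single test case, each in [0, 1].
interface ModeScores {
  passive: number; // all relevant rules pre-loaded (the ceiling)
  active: number;  // the agent had to find the rules itself
}

// Gap between the ceiling and what the agent achieved on its own.
// A value near 0 means retrieval cost the agent almost nothing.
function retrievalGap(scores: ModeScores): number {
  return scores.passive - scores.active;
}

// Mean gap across a run of test cases; returns 0 for an empty run.
function meanRetrievalGap(results: ModeScores[]): number {
  if (results.length === 0) return 0;
  const total = results.reduce((sum, r) => sum + retrievalGap(r), 0);
  return total / results.length;
}
```

Reporting the gap alongside the raw passive score helps separate two failure modes: a low passive score suggests the agent cannot follow the rule even when given it, while a high passive score with a large gap points at discovery.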

Each test case is a minimal reproduction: a task, a set of distributed rule files, and a scoring rubric. Results are logged and compared across agent configurations, surfacing which retrieval strategies hold up under realistic conditions.
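
One way a test case of this kind might look as a TypeScript file is sketched below. Everything here is an assumption made for illustration: the TestCase, RuleFile and RubricItem types, the example file path, and the button task are invented and do not describe the framework's real schema.

```typescript
// Hypothetical types for a minimal reproduction: a task, rule files, a rubric.
interface RuleFile {
  path: string;    // where the rule lives in the project tree
  content: string; // the rule text the agent is expected to discover
}

interface RubricItem {
  id: string;
  description: string;
  check: (output: string) => boolean; // deterministic pass/fail on agent output
}

interface TestCase {
  name: string;
  task: string;
  ruleFiles: RuleFile[];
  rubric: RubricItem[];
}

// Illustrative case: the rule is tucked away in a nested README.
const buttonColorCase: TestCase = {
  name: "primary-button-uses-token",
  task: "Add a primary button to the settings page.",
  ruleFiles: [
    {
      path: "design/tokens/README.md",
      content: "Use the color.primary token; never hard-code hex values.",
    },
  ],
  rubric: [
    {
      id: "uses-token",
      description: "Output references the color.primary token",
      check: (output) => output.includes("color.primary"),
    },
    {
      id: "no-hex",
      description: "Output contains no hard-coded hex colors",
      check: (output) => !/#[0-9a-fA-F]{3,8}\b/.test(output),
    },
  ],
};

// Fraction of rubric items the output satisfies; 0 if the rubric is empty.
function scoreOutput(testCase: TestCase, output: string): number {
  if (testCase.rubric.length === 0) return 0;
  const passed = testCase.rubric.filter((item) => item.check(output)).length;
  return passed / testCase.rubric.length;
}
```

Keeping each case to a single task and a handful of rule files makes a failure easy to attribute: if the score drops in active mode, the only thing that changed is whether the agent found the file.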

Stack

The framework is built as a lightweight CLI tool, designed to run alongside any Claude Code project. Test cases are plain TypeScript files; scoring is deterministic where possible, Claude-graded where not.
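
The "deterministic where possible, Claude-graded where not" split could be sketched as follows. This assumes each criterion may carry an optional deterministic check and that the model grader is passed in as a plain async function. The Criterion and Grader types are hypothetical, and no real Claude API call is shown; the caller would supply the grader.

```typescript
// Hypothetical grader signature: the caller wires this to a model of their choice.
type Grader = (criterion: string, output: string) => Promise<boolean>;

interface Criterion {
  description: string;
  check?: (output: string) => boolean; // present only when a deterministic check exists
}

type ScoringMethod = "deterministic" | "model";

// Decide which path a criterion takes, without running anything.
function scoringMethod(criterion: Criterion): ScoringMethod {
  return criterion.check ? "deterministic" : "model";
}

// Prefer the deterministic check; fall back to the injected grader otherwise.
async function scoreCriterion(
  criterion: Criterion,
  output: string,
  grade: Grader,
): Promise<{ passed: boolean; method: ScoringMethod }> {
  if (criterion.check) {
    return { passed: criterion.check(output), method: "deterministic" };
  }
  const passed = await grade(criterion.description, output);
  return { passed, method: "model" };
}
```

Recording the method next to each result makes it possible to report how much of a score rests on model judgment, which is useful when comparing runs across agent configurations.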