Case Study Evaluation
Quality & Token Benchmarks
Direct measurements of context efficiency, exploration overhead, and API footprint comparing CostAffective against CodeGraph on large repositories.
Verified Performance Claim
- 45.9% fewer tokens
- 54.3% fewer exploration loops
- 42.1% fewer tool interactions
- 100% local
Benchmark Conditions
Environment Parameters

| Parameter | Value |
|---|---|
| Repository | Continue OSS |
| Total Files | 3,203 |
| Source Files | 1,985 |
| Directories | 557 |
| Model | big-pickle |
| Objective | Generate Unit Catalog, Integration Map, Architecture Overview, Benchmark Harness |
| Prompt Scope | Same deliverables |
| Environment | OpenCode |
| Repository Size | Identical |
| Model Reference | Identical |
CostAffective vs CodeGraph
Comparative Metrics

| Metric | CostAffective | CodeGraph | Winner |
|---|---|---|---|
| Total Tokens | 4,708,835 | 8,707,328 | 🏆 CostAffective |
| Main Session Tokens | 2,812,057 | 4,216,693 | 🏆 CostAffective |
| Subagent Tokens | 1,896,778 | 4,490,635 | 🏆 CostAffective |
| API Calls | 89 | 134 | 🏆 CostAffective |
| Subagent Calls | 43 | 94 | 🏆 CostAffective |
| Cache Read Tokens | 2,556,672 | 4,012,160 | 🏆 CostAffective |
| Output Tokens | 58,858 | 34,042 | 🏆 CostAffective |
| Deliverables Generated | 4 | 4 | Tie |
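The headline percentages can be cross-checked against the comparison table. The sketch below recomputes them from the published figures. The page does not define "exploration loops" or "tool interactions", so the mapping used here (subagent calls for the former, API calls plus subagent calls for the latter) is an assumption, chosen because it reproduces the published numbers.

```python
def percent_reduction(baseline: float, candidate: float) -> float:
    """Return how much smaller `candidate` is than `baseline`, in percent."""
    return (baseline - candidate) / baseline * 100.0


# Figures copied from the comparison table as (CostAffective, CodeGraph).
total_tokens = (4_708_835, 8_707_328)
api_calls = (89, 134)
subagent_calls = (43, 94)

tokens_saved = percent_reduction(total_tokens[1], total_tokens[0])

# Assumption: "exploration loops" corresponds to subagent calls.
loops_saved = percent_reduction(subagent_calls[1], subagent_calls[0])

# Assumption: "tool interactions" corresponds to API calls + subagent calls.
interactions_saved = percent_reduction(
    api_calls[1] + subagent_calls[1],
    api_calls[0] + subagent_calls[0],
)

print(f"{tokens_saved:.1f}% fewer tokens")
print(f"{loops_saved:.1f}% fewer exploration loops")
print(f"{interactions_saved:.1f}% fewer tool interactions")
```

Under those assumptions the three values round to 45.9%, 54.3%, and 42.1%. The main-session and subagent token rows also sum to the Total Tokens row for both tools.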
Global Retriever Leaderboard
Multi-Repo Aggregated Scores

| Retriever Engine | Avg Input Tokens | Context Reduction | Agent Call Loops | Retrieval Quality | Memory Footprint |
|---|---|---|---|---|---|
| CostAffective (Winner) | 685 | 74.0% | 4.2 | 98.5% | 14MB |
| CodeGraph | 1,219 | 53.8% | 6.8 | 96.2% | 820MB |
| Serena | 1,704 | 35.4% | 9.2 | 90.1% | 142MB (Cloud) |
| Graphify | 1,820 | 31.0% | 10.5 | 89.4% | 680MB |
| Ripgrep (Grep) | 2,640 | 0% | 14.0 | 61.4% | 4.2MB |
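The Context Reduction column is not defined on the page. One reading, assumed here, is that it measures each engine's average input tokens against the Ripgrep row, which is listed at 0%. The sketch below recomputes the column under that assumption and compares it with the published values.

```python
# (avg input tokens, published context reduction %) copied from the leaderboard.
leaderboard = {
    "CostAffective": (685, 74.0),
    "CodeGraph": (1_219, 53.8),
    "Serena": (1_704, 35.4),
    "Graphify": (1_820, 31.0),
    "Ripgrep (Grep)": (2_640, 0.0),
}

# Assumption: Ripgrep is the baseline that every reduction is measured against.
baseline_tokens = leaderboard["Ripgrep (Grep)"][0]


def context_reduction(tokens: int, baseline: int) -> float:
    """Percentage of baseline input tokens that the engine avoids loading."""
    return (1 - tokens / baseline) * 100.0


rows = []
for engine, (tokens, published) in leaderboard.items():
    computed = context_reduction(tokens, baseline_tokens)
    rows.append((engine, computed, published))
    print(f"{engine:<16} computed={computed:5.2f}%  published={published:5.1f}%")
```

Each recomputed value lands within 0.1 percentage points of the published one, which is consistent with the Ripgrep-baseline reading once rounding of the averaged inputs is taken into account.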
74% Token Context Reduction
CostAffective trims files into compressed AST scope snippets, so AI assistants load far fewer input tokens during chat sessions.
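To illustrate the general idea of trimming a file to its AST scopes, here is a minimal sketch using Python's standard `ast` module. It is not CostAffective's implementation, which the page does not describe. The `outline` helper and the sample source are hypothetical, and the sketch only walks top-level and directly nested class and function definitions.

```python
import ast


def outline(source: str) -> str:
    """Return class and function signatures with their bodies omitted."""
    lines: list[str] = []

    def visit(node: ast.AST, depth: int) -> None:
        indent = "    " * depth
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                keyword = "async def" if isinstance(child, ast.AsyncFunctionDef) else "def"
                # ast.unparse (Python 3.9+) renders the argument list as source text.
                lines.append(f"{indent}{keyword} {child.name}({ast.unparse(child.args)}): ...")
                visit(child, depth + 1)
            elif isinstance(child, ast.ClassDef):
                lines.append(f"{indent}class {child.name}: ...")
                visit(child, depth + 1)

    visit(ast.parse(source), 0)
    return "\n".join(lines)


# Hypothetical sample file, used only to show the size difference.
sample = '''
class Cache:
    def get(self, key, default=None):
        value = self._store.get(key, default)
        return value

def load(path):
    with open(path) as fh:
        return fh.read()
'''

snippet = outline(sample)
print(snippet)
print(f"{len(sample)} characters reduced to {len(snippet)}")
```

The outline keeps the names and signatures an agent needs to navigate the file and drops the bodies, which is where most of the tokens are.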
98.5% Retrieval Quality
Static symbol mapping resolves precise references without losing logical dependencies, which reduces the risk of the assistant hallucinating symbols or call paths.
Download Raw Reports
Get complete, unedited JSON logs and benchmark sheets compiled directly from CLI runs.
Benchmark Methodology
Evaluations were compiled by running standard coding tasks (resolving interface definitions, compiling structural call routes, mapping code dependencies) across 15 separate repositories written in Go, Python, and TypeScript, ranging in size from 100 to 5,000 source files.
- Input Tokens: Average number of input tokens loaded by coding agents. Lower is better.
- Retrieval Quality (accuracy): Static code pointers returned by the retriever, verified against manually written oracle definitions. Higher is better.
- Memory Footprint: Host system RAM consumed during a full codebase indexing scan. Lower is better.
- Watchdog Sync Latency: Time from a file change notification to the corresponding SQLite update. Lower is better.
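The quality metric above compares retrieved code pointers against a manual oracle. The exact scoring rule is not given, so the sketch below assumes the simplest one: the share of oracle pointers that the retriever also returned. The `Pointer` type, the `retrieval_quality` function, and the sample paths and symbols are all hypothetical.

```python
from typing import NamedTuple


class Pointer(NamedTuple):
    """A static code pointer: where a symbol is defined (hypothetical shape)."""
    path: str
    symbol: str
    line: int


def retrieval_quality(retrieved: set[Pointer], oracle: set[Pointer]) -> float:
    """Percentage of oracle pointers that also appear in the retrieved set."""
    if not oracle:
        raise ValueError("oracle must contain at least one pointer")
    return len(retrieved & oracle) / len(oracle) * 100.0


# Hypothetical oracle, written by hand for one task.
oracle = {
    Pointer("core/config.ts", "loadConfig", 42),
    Pointer("core/llm.ts", "ILLM", 10),
    Pointer("gui/app.tsx", "App", 7),
    Pointer("core/index.ts", "Core", 3),
}

# Hypothetical retriever output: three correct pointers and one wrong line number.
retrieved = {
    Pointer("core/config.ts", "loadConfig", 42),
    Pointer("core/llm.ts", "ILLM", 10),
    Pointer("gui/app.tsx", "App", 7),
    Pointer("core/index.ts", "Core", 99),
}

score = retrieval_quality(retrieved, oracle)
print(f"Retrieval quality: {score:.1f}%")
```

This rule scores recall only. A stricter variant could also penalize extra pointers that are not in the oracle, and the published figures may well use a different formula.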