Case Study Evaluation

Quality & Token Benchmarks

Direct measurements of context efficiency, exploration overhead, and API footprint comparing CostAffective against CodeGraph on large repositories.

Verified Performance Claim

45.9% fewer tokens • 54.3% fewer exploration loops • 42.1% fewer tool interactions • 100% Local

Benchmark Conditions

Environment Parameters

Metric	Value
Repository	Continue OSS
Total Files	3,203
Source Files	1,985
Directories	557
Model	big-pickle
Objective	Generate Unit Catalog, Integration Map, Architecture Overview, Benchmark Harness
Prompt Scope	Same deliverables
Environment	OpenCode
Repository Size	Identical
Model Reference	Identical

CostAffective vs CodeGraph

Comparative Metrics

45.9% Savings

Metric	CostAffective	CodeGraph	Winner
Total Tokens	4,708,835	8,707,328	🏆 CostAffective
Main Session Tokens	2,812,057	4,216,693	🏆 CostAffective
Subagent Tokens	1,896,778	4,490,635	🏆 CostAffective
API Calls	89	134	🏆 CostAffective
Subagent Calls	43	94	🏆 CostAffective
Cache Read Tokens	2,556,672	4,012,160	🏆 CostAffective
Output Tokens	58,858	34,042	🏆 CostAffective
Deliverables Generated	4	4	Tie
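The headline percentages can be re-derived from the table above. The sketch below is an illustration, not the benchmark's own tooling. It assumes that "exploration loops" corresponds to the Subagent Calls row and that "tool interactions" means API calls plus subagent calls; the page does not state either mapping, but both reproduce the published figures. The helper name `percent_fewer` is hypothetical.

```python
# Figures copied from the Comparative Metrics table: (CostAffective, CodeGraph).
TOTAL_TOKENS = (4_708_835, 8_707_328)
API_CALLS = (89, 134)
SUBAGENT_CALLS = (43, 94)


def percent_fewer(candidate: int, baseline: int) -> float:
    """Relative reduction of candidate versus baseline, in percent (1 decimal)."""
    return round((1 - candidate / baseline) * 100, 1)


token_savings = percent_fewer(*TOTAL_TOKENS)

# Assumption: "exploration loops" are the subagent calls.
loop_savings = percent_fewer(*SUBAGENT_CALLS)

# Assumption: "tool interactions" are API calls plus subagent calls.
interaction_savings = percent_fewer(
    API_CALLS[0] + SUBAGENT_CALLS[0],
    API_CALLS[1] + SUBAGENT_CALLS[1],
)

print(f"tokens: {token_savings}% fewer")              # tokens: 45.9% fewer
print(f"loops: {loop_savings}% fewer")                # loops: 54.3% fewer
print(f"interactions: {interaction_savings}% fewer")  # interactions: 42.1% fewer
```

Taken on its own, the API Calls row gives a smaller reduction (89 versus 134, about 33.6%), so the 42.1% figure only follows under the combined-calls assumption.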

Global Retriever Leaderboard

Multi-Repo Aggregated Scores

Target Repositories: Mixed Source CLI Suite

Retriever Engine	Avg Input Tokens	Context Reduction	Agent Call Loops	Retrieval Quality	Memory Footprint
CostAffective (Winner)	685	74.0%	4.2	98.5%	14MB
CodeGraph	1,219	53.8%	6.8	96.2%	820MB
Serena	1,704	35.4%	9.2	90.1%	142MB (Cloud)
Graphify	1,820	31.0%	10.5	89.4%	680MB
Ripgrep (Grep)	2,640	0%	14.0	61.4%	4.2MB
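The Context Reduction column appears to be measured against the Ripgrep row, which is listed at 0%. A minimal sketch under that assumption: the formula is not stated on the page, and the function name `context_reduction` is hypothetical. With this formula the listed values are reproduced when the result is truncated, not rounded, to one decimal place.

```python
import math

# Avg Input Tokens column from the leaderboard above.
AVG_INPUT_TOKENS = {
    "CostAffective": 685,
    "CodeGraph": 1219,
    "Serena": 1704,
    "Graphify": 1820,
    "Ripgrep (Grep)": 2640,
}

# Assumption: Ripgrep is the baseline, since its reduction is listed as 0%.
BASELINE = AVG_INPUT_TOKENS["Ripgrep (Grep)"]


def context_reduction(tokens: int, baseline: int = BASELINE) -> float:
    """Percent reduction versus the baseline, truncated to one decimal place."""
    percent = (1 - tokens / baseline) * 100
    return math.floor(percent * 10) / 10


reductions = {name: context_reduction(t) for name, t in AVG_INPUT_TOKENS.items()}
for name, value in reductions.items():
    print(f"{name}: {value}%")
```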

74% Token Context Reduction

CostAffective trims files into compressed AST scope snippets, so AI assistants load an average of 685 input tokens, compared with 2,640 for the Ripgrep (Grep) baseline.
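To illustrate the general idea of an AST scope snippet, the sketch below uses Python's standard `ast` module to keep class and function signatures while dropping their bodies. It is an assumption-laden illustration, not CostAffective's actual format or implementation, neither of which is described here. The names `scope_snippet` and `SAMPLE` are hypothetical, and the code handles Python source only.

```python
import ast

SAMPLE = '''
class Loader:
    def load(self, path: str):
        with open(path) as handle:
            data = handle.read()
        return data.splitlines()

def main():
    loader = Loader()
    return loader.load("config.txt")
'''


def scope_snippet(source: str) -> str:
    """Return an outline of classes and functions with bodies replaced by '...'."""
    lines = []

    def walk(body, depth):
        indent = "    " * depth
        for node in body:
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                keyword = "async def" if isinstance(node, ast.AsyncFunctionDef) else "def"
                # ast.unparse (Python 3.9+) renders the parameter list as source text.
                lines.append(f"{indent}{keyword} {node.name}({ast.unparse(node.args)}): ...")
            elif isinstance(node, ast.ClassDef):
                lines.append(f"{indent}class {node.name}:")
                walk(node.body, depth + 1)

    walk(ast.parse(source).body, 0)
    return "\n".join(lines)


snippet = scope_snippet(SAMPLE)
print(snippet)
```

The outline keeps the names and signatures an assistant needs to navigate the file while omitting statement-level detail, which is where most of the token cost sits.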

98.5% Retrieval Quality

Static symbol mapping resolves references precisely while preserving logical dependencies, which reduces the risk of AI assistants hallucinating symbols.
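A static symbol map can be pictured as an index from each definition name to the file and line where it is declared. The sketch below builds one with Python's `ast` module. It is a simplified illustration under stated assumptions, not the tool's real data model: `build_symbol_index` and the in-memory `FILES` mapping are hypothetical, and only Python definitions are indexed.

```python
import ast
from collections import defaultdict

# Hypothetical in-memory "repository": path -> source text.
FILES = {
    "a.py": "def parse(x):\n    return x\n",
    "b.py": "from a import parse\n\nclass Loader:\n    def run(self):\n        return parse(1)\n",
}


def build_symbol_index(files):
    """Map each class or function name to the (path, line) pairs that define it."""
    index = defaultdict(list)
    for path, source in files.items():
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                index[node.name].append((path, node.lineno))
    return dict(index)


index = build_symbol_index(FILES)
# Resolving the call to parse() in b.py points at its definition in a.py.
print(index["parse"])
```

Because the lookup is derived from the parsed source and not from text search, a reference either resolves to a real definition or is absent from the index.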

Download Raw Reports

Get complete, unedited JSON logs and benchmark sheets compiled directly from CLI runs.

global_leaderboard.md repository_breakdown.md

Benchmark Methodology

Evaluations were compiled by running standard coding tasks (resolving interface definitions, compiling structural call routes, and mapping code dependencies) across 15 separate repositories written in Go, Python, and TypeScript, ranging in size from 100 to 5,000 source files.

Input Token Limits: Input tokens loaded by coding agents for the standard tasks. Lower is better.
Accuracy / Quality: Static code pointers returned by each retriever, verified against manually written oracle definitions. Higher is better.
Memory Footprint: Host system RAM consumed during a full codebase indexing scan. Lower is better.
Watchdog Sync Latency: Time from a file change notification to the corresponding SQLite update. Lower is better.
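As one way to picture the quality check, the sketch below scores a retriever's code pointers against a hand-written oracle. It assumes the score is the share of oracle pointers the retriever returned (a recall-style measure); the methodology does not say whether recall, precision, or another measure was used. The function name `retrieval_quality` and the sample pointers are hypothetical.

```python
def retrieval_quality(retrieved, oracle):
    """Percent of oracle (path, line) pointers that the retriever returned."""
    if not oracle:
        raise ValueError("oracle must contain at least one pointer")
    return len(set(retrieved) & set(oracle)) / len(set(oracle)) * 100


# Hypothetical example: four oracle pointers, three of them found.
oracle = {("core/index.ts", 12), ("core/index.ts", 88), ("cli/run.go", 40), ("util/io.py", 7)}
retrieved = {("core/index.ts", 12), ("core/index.ts", 88), ("cli/run.go", 40), ("cli/run.go", 99)}

score = retrieval_quality(retrieved, oracle)
print(f"{score:.1f}%")
```

Under this definition an extra, incorrect pointer (such as the last retrieved entry) does not lower the score; a precision-style measure would penalize it.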