CostAffective
Case Study Evaluation

Quality & Token Benchmarks

Direct measurements of context efficiency, exploration overhead, and API footprint comparing CostAffective against CodeGraph on large repositories.

Verified Performance Claim
45.9% fewer tokens | 54.3% fewer exploration loops | 42.1% fewer tool interactions | 100% Local

Benchmark Conditions

Environment Parameters
Parameter | Value
Repository | Continue OSS
Total Files | 3,203
Source Files | 1,985
Directories | 557
Model | big-pickle
Objective | Generate Unit Catalog, Integration Map, Architecture Overview, Benchmark Harness
Prompt Scope | Same deliverables
Environment | OpenCode
Repository Size | Identical
Model Reference | Identical

CostAffective vs CodeGraph

Comparative Metrics
45.9% Savings
Metric | CostAffective | CodeGraph | Winner
Total Tokens | 4,708,835 | 8,707,328 | 🏆 CostAffective
Main Session Tokens | 2,812,057 | 4,216,693 | 🏆 CostAffective
Subagent Tokens | 1,896,778 | 4,490,635 | 🏆 CostAffective
API Calls | 89 | 134 | 🏆 CostAffective
Subagent Calls | 43 | 94 | 🏆 CostAffective
Cache Read Tokens | 2,556,672 | 4,012,160 | 🏆 CostAffective
Output Tokens | 58,858 | 34,042 | 🏆 CostAffective
Deliverables Generated | 4 | 4 | Tie
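
The headline percentages can be reproduced from the comparison table. The sketch below is a hedged reconstruction, not the project's own benchmark script: it assumes that "exploration loops" corresponds to Subagent Calls and that "tool interactions" is the sum of API Calls and Subagent Calls, because those mappings reproduce the published figures. The helper name `percent_fewer` is hypothetical.

```python
# Figures copied from the CostAffective vs CodeGraph comparison table.
costaffective = {"total_tokens": 4_708_835, "api_calls": 89, "subagent_calls": 43}
codegraph = {"total_tokens": 8_707_328, "api_calls": 134, "subagent_calls": 94}


def percent_fewer(ours: int, theirs: int) -> float:
    """Relative reduction of `ours` versus `theirs`, in percent, rounded to one decimal."""
    return round((theirs - ours) / theirs * 100, 1)


token_savings = percent_fewer(costaffective["total_tokens"], codegraph["total_tokens"])

# Assumption: "exploration loops" means subagent calls.
loop_savings = percent_fewer(costaffective["subagent_calls"], codegraph["subagent_calls"])

# Assumption: "tool interactions" means API calls plus subagent calls.
interaction_savings = percent_fewer(
    costaffective["api_calls"] + costaffective["subagent_calls"],
    codegraph["api_calls"] + codegraph["subagent_calls"],
)

print(f"{token_savings}% fewer tokens")
print(f"{loop_savings}% fewer exploration loops")
print(f"{interaction_savings}% fewer tool interactions")
```

Under these assumptions the three values come out to 45.9, 54.3, and 42.1, matching the performance claim at the top of the page.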

Global Retriever Leaderboard

Multi-Repo Aggregated Scores
Target Repositories: Mixed Source CLI Suite
Retriever Engine | Avg Input Tokens | Context Reduction | Agent Call Loops | Retrieval Quality | Memory Footprint
CostAffective (Winner) | 685 | 74.0% | 4.2 | 98.5% | 14 MB
CodeGraph | 1,219 | 53.8% | 6.8 | 96.2% | 820 MB
Serena | 1,704 | 35.4% | 9.2 | 90.1% | 142 MB (Cloud)
Graphify | 1,820 | 31.0% | 10.5 | 89.4% | 680 MB
Ripgrep (Grep) | 2,640 | 0% | 14.0 | 61.4% | 4.2 MB
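
The Context Reduction column appears to be derived from Avg Input Tokens, with Ripgrep as the 0% baseline. The following sketch rests on that assumption and on a second one: that values are truncated, not rounded, to one decimal, since truncation is what reproduces every row. The function name `context_reduction` is hypothetical.

```python
# Average input tokens per engine, copied from the leaderboard.
avg_input_tokens = {
    "CostAffective": 685,
    "CodeGraph": 1219,
    "Serena": 1704,
    "Graphify": 1820,
    "Ripgrep (Grep)": 2640,
}

# Assumption: the plain Ripgrep search is the 0% reference point.
BASELINE = "Ripgrep (Grep)"


def context_reduction(tokens: int, baseline: int) -> float:
    """Percent fewer tokens than the baseline, truncated to one decimal place."""
    # Integer arithmetic in tenths of a percent avoids floating-point drift.
    tenths = (baseline - tokens) * 1000 // baseline
    return tenths / 10


reductions = {
    name: context_reduction(tokens, avg_input_tokens[BASELINE])
    for name, tokens in avg_input_tokens.items()
}

for name, value in reductions.items():
    print(f"{name}: {value}%")
```

With ordinary rounding, three rows would differ by a tenth of a point (for example 74.1% for CostAffective), which is why the sketch truncates.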

74% Token Context Reduction

CostAffective trims files into compressed AST scope snippets, so AI assistants load an average of 685 input tokens where a plain Ripgrep search loads 2,640.
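
To illustrate the general idea of an AST scope snippet, here is a minimal sketch that reduces a Python file to its class and function signatures using the standard-library `ast` module. It is not CostAffective's implementation, which this page does not describe; the function `outline` and the sample source are hypothetical.

```python
import ast


def outline(source: str) -> str:
    """Return class and function signatures with bodies elided."""
    lines = []

    def visit(node: ast.AST, depth: int) -> None:
        indent = "    " * depth
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                prefix = "async def" if isinstance(child, ast.AsyncFunctionDef) else "def"
                # Only plain positional parameters are listed in this sketch.
                params = ", ".join(arg.arg for arg in child.args.args)
                lines.append(f"{indent}{prefix} {child.name}({params}): ...")
            elif isinstance(child, ast.ClassDef):
                lines.append(f"{indent}class {child.name}:")
                visit(child, depth + 1)  # descend to pick up methods

    visit(ast.parse(source), 0)
    return "\n".join(lines)


SAMPLE = '''\
class Cart:
    def add(self, item, qty):
        self.items[item] = self.items.get(item, 0) + qty
        return self.items[item]

def total(cart, prices):
    return sum(prices[name] * qty for name, qty in cart.items.items())
'''

snippet = outline(SAMPLE)
print(snippet)
```

The outline keeps the names an agent needs for navigation while dropping the bodies, which is where most of a file's tokens sit.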

98.5% Retrieval Quality

Static symbol mapping resolves precise references without losing logical dependencies, which reduces the risk of AI hallucinations.

Download Raw Reports

Get complete, unedited JSON logs and benchmark sheets compiled directly from CLI runs.

Benchmark Methodology

Evaluations were compiled by running standard coding tasks (resolving interface definitions, tracing structural call routes, mapping code dependencies) across 15 separate repositories written in Go, Python, and TypeScript, ranging in size from 100 to 5,000 source files.

  • Avg Input Tokens: Average number of input tokens loaded by coding agents. Lower is better.
  • Accuracy / Quality: Evaluated using static code pointers verified against manual oracle definitions. Higher is better.
  • Memory Footprint: Host system RAM consumption during a full codebase indexing scan. Lower is better.
  • Watchdog Sync Latency: Time from a file change notification to the corresponding SQLite update. Lower is better.