The problem
An AI coding assistant working in a real repository spends most of its budget on two things, and neither of them is thinking.
Rediscovery. The model reads the same files over and over to answer questions it has effectively already answered: where is this defined, who calls this, what does this module do. Each read pushes thousands of tokens into the context window.
The prompt cache. Providers cache the conversation so repeated context is cheaper to resend. But the cache is not free. Every turn pays to read the entire resident context. And any change to earlier context, or a short idle gap, invalidates the cache and forces a full rewrite of everything resident.
A single measured call was billed at $2.95, of which $2.84 was the cache write of roughly 455,000 tokens of resident context. The model's output that turn was under 4,000 tokens. The expensive part was the size of the context being carried, not the answer.
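The proportions in that call can be checked with a few lines of arithmetic. This sketch uses only the three figures quoted above; the per-million-token rate it derives is implied by those figures and is not a published price.

```python
# Figures quoted above for the single measured call.
total_bill_usd = 2.95
cache_write_usd = 2.84
resident_tokens = 455_000  # quoted as "roughly", so derived values are approximate

# Share of the bill spent rewriting the resident context into the cache.
cache_write_share = cache_write_usd / total_bill_usd

# Implied cache-write rate per million tokens (derived, not a published price).
implied_rate_per_mtok = cache_write_usd / resident_tokens * 1_000_000

print(f"cache write share of bill: {cache_write_share:.1%}")
print(f"implied write rate: ${implied_rate_per_mtok:.2f} per million tokens")
```

On these numbers the cache write is about 96% of the bill, at an implied rate of about $6.24 per million tokens.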
The insight
A tool that connects to an editor over MCP cannot control how or when the client caches. Cache breakpoints and time-to-live are decided by the client, not the server. There is exactly one lever a server does control:
How many tokens ever enter the resident context window in the first place.
Shrink that, and both costs fall at once: a smaller window is cheaper to read every turn and cheaper to rewrite when the cache is invalidated. Every design decision in CostAffective serves this one goal: keep tokens out of the window without losing information.
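A toy cost model makes the "both costs fall at once" point concrete. It assumes, as a simplification, that every turn re-reads the whole resident context and every invalidation rewrites all of it. The function name, the prices, and the session shape are placeholders chosen for illustration, not provider rates or measured data.

```python
def session_cache_cost(resident_tokens, turns, invalidations,
                       read_usd_per_mtok, write_usd_per_mtok):
    """Toy model: each turn reads the resident context from cache,
    and each invalidation rewrites all of it."""
    read_cost = turns * resident_tokens * read_usd_per_mtok / 1_000_000
    write_cost = invalidations * resident_tokens * write_usd_per_mtok / 1_000_000
    return read_cost + write_cost


# Placeholder prices (USD per million tokens), for illustration only.
READ_PRICE, WRITE_PRICE = 0.50, 6.00

# Same hypothetical session, with the resident window halved in the second run.
full = session_cache_cost(400_000, turns=50, invalidations=3,
                          read_usd_per_mtok=READ_PRICE,
                          write_usd_per_mtok=WRITE_PRICE)
lean = session_cache_cost(200_000, turns=50, invalidations=3,
                          read_usd_per_mtok=READ_PRICE,
                          write_usd_per_mtok=WRITE_PRICE)

print(f"full window: ${full:.2f}, lean window: ${lean:.2f}")
```

Because both terms are linear in the resident token count, halving the window halves the read cost and the rewrite cost together, whatever the actual prices are.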
The approach
CostAffective is a local MCP server that does three things in service of that goal.
1. Answer from a local index, not from raw files
It parses your repository once with Tree-sitter and stores symbols, references, and call edges in a local SQLite index. Navigation questions are answered from that index in a few tokens instead of by dumping files. A token-budgeted repository summary gives the high-level map without ever emitting a giant tree at session start.
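As a rough illustration of answering navigation questions from an index, the sketch below builds a tiny SQLite database and queries it. The schema, table names, and sample rows are hypothetical and are not CostAffective's actual layout. The rows are inserted by hand where a real indexer would fill them from a Tree-sitter parse, and a references table is omitted for brevity.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE symbols (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    kind TEXT NOT NULL,
    file TEXT NOT NULL,
    line INTEGER NOT NULL
);
CREATE TABLE calls (
    caller_id INTEGER NOT NULL REFERENCES symbols(id),
    callee_id INTEGER NOT NULL REFERENCES symbols(id)
);
CREATE INDEX idx_symbols_name ON symbols(name);
CREATE INDEX idx_calls_callee ON calls(callee_id);
""")

# Sample rows; a real indexer would produce these from parsed source files.
conn.executemany(
    "INSERT INTO symbols (id, name, kind, file, line) VALUES (?, ?, ?, ?, ?)",
    [
        (1, "load_config", "function", "app/config.py", 12),
        (2, "main", "function", "app/cli.py", 40),
        (3, "run_server", "function", "app/server.py", 8),
    ],
)
conn.executemany(
    "INSERT INTO calls (caller_id, callee_id) VALUES (?, ?)",
    [(2, 1), (3, 1)],
)


def where_defined(name):
    """Answer 'where is this defined' with (file, line) pairs."""
    return conn.execute(
        "SELECT file, line FROM symbols WHERE name = ?", (name,)
    ).fetchall()


def who_calls(name):
    """Answer 'who calls this' with (caller, file, line) rows."""
    return conn.execute(
        """
        SELECT caller.name, caller.file, caller.line
        FROM calls
        JOIN symbols AS callee ON callee.id = calls.callee_id
        JOIN symbols AS caller ON caller.id = calls.caller_id
        WHERE callee.name = ?
        ORDER BY caller.file
        """,
        (name,),
    ).fetchall()


print(where_defined("load_config"))
print(who_calls("load_config"))
```

Each answer is a handful of short rows, so the model gets the location or the caller list without the contents of any source file entering the window.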
2. Give the model tools to keep large content out of the window
The context-control tools form a loop: stash, recall, remember. They let the model move large output and durable facts out of the resident window without loss, because the content is relocated to disk rather than deleted.
stash_context
Park a large blob out of context and get back a short handle. The full content is written to disk.
recall
Pull back only the slice a query needs, within a token budget, never the whole blob.
remember
Persist the durable conclusion as a per-repository fact so it is not re-derived next turn.
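The three tools can be pictured with a minimal file-backed sketch. Everything here is an assumption made for illustration: the storage layout, the hash-based handle, the substring matching in recall, and the four-characters-per-token estimate. The function names follow the tools above, but the bodies are not CostAffective's implementation.

```python
import hashlib
import json
import tempfile
from pathlib import Path

# Hypothetical storage layout; a temp directory stands in for the real store.
STORE = Path(tempfile.mkdtemp(prefix="stash-demo-"))
FACTS = STORE / "facts.json"


def estimate_tokens(text):
    # Crude assumption: roughly 4 characters per token.
    return max(1, len(text) // 4)


def stash_context(content):
    """Write the full content to disk and return a short handle."""
    handle = hashlib.sha256(content.encode("utf-8")).hexdigest()[:12]
    (STORE / f"{handle}.txt").write_text(content, encoding="utf-8")
    return handle


def recall(handle, query, token_budget):
    """Return only the lines mentioning the query, within the token budget."""
    lines = (STORE / f"{handle}.txt").read_text(encoding="utf-8").splitlines()
    picked, used = [], 0
    for line in lines:
        if query.lower() not in line.lower():
            continue
        cost = estimate_tokens(line)
        if used + cost > token_budget:
            break
        picked.append(line)
        used += cost
    return "\n".join(picked)


def remember(key, fact):
    """Persist a durable conclusion in a JSON facts file."""
    facts = json.loads(FACTS.read_text(encoding="utf-8")) if FACTS.exists() else {}
    facts[key] = fact
    FACTS.write_text(json.dumps(facts, indent=2), encoding="utf-8")


# Usage: a large log stays on disk; only one matching line comes back.
log = "\n".join(f"step {i}: ok" for i in range(5000))
log += "\nstep 5000: ERROR timeout in db.connect"

handle = stash_context(log)
snippet = recall(handle, "error", token_budget=50)
remember("build-failure", "Build fails on a db.connect timeout")

print(handle, "->", snippet)
```

In this sketch the window only ever holds the 12-character handle, one matching line, and a one-sentence fact, while the full log remains on disk and can be read back unchanged.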
3. Make the model use all of this, automatically
None of it helps if the model does not reach for it. The costaffective-session skill is a small piece of session-awareness guidance (about 275 tokens) that teaches the model the lean workflow once per session. It is delivered through MCP's instructions field, so every connected editor loads it on connect, and it also ships as a native Claude Code skill. That fixed, tiny cost pays for itself the first time it prevents a single large blob from entering the window.
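For readers unfamiliar with the instructions field: in MCP, a server's response to the client's initialize request may include an optional instructions string alongside its capabilities and server info. The sketch below shows that shape as a plain dictionary. The guidance text, the server name, and the version are placeholders, and the real skill's wording is not reproduced here.

```python
import json

# Hypothetical guidance text; not the actual costaffective-session wording.
SESSION_GUIDANCE = (
    "Answer navigation questions from the index before reading files. "
    "Stash large tool output, recall only the slice you need, "
    "and remember durable conclusions."
)


def build_initialize_result(guidance):
    """Shape of an MCP initialize result that carries server instructions."""
    return {
        "protocolVersion": "2025-06-18",
        "capabilities": {"tools": {}},
        # Placeholder name and version, for illustration only.
        "serverInfo": {"name": "costaffective", "version": "0.0.0"},
        "instructions": guidance,
    }


result = build_initialize_result(SESSION_GUIDANCE)
print(json.dumps(result, indent=2))
```

Because the guidance travels in the handshake itself, it is paid for once per connection and does not have to be repeated in each tool description.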
Why not just summarize or delete old context?
Because that loses information. Summarizing collapses detail you may need later; deleting drops it outright. Stashing relocates tokens rather than discarding them. The full content stays on disk and is always recoverable with recall. That was a hard design constraint from the start: reduce the window without ever dropping context.
The result
The same philosophy runs end to end: answer from the index, budget the summaries, stash the large stuff, recall only the slice, remember the conclusions, and make the editor do it by default. On the Continue OSS repository this adds up to 45.9% fewer tokens, 54.3% fewer exploration loops, and 42.1% fewer tool interactions, all while running entirely locally.