Why is the prompt cache the main cost in long AI coding sessions?

Providers cache the conversation so repeated context is cheaper to resend, but every turn still pays to read the entire resident context, and any change to earlier context or a short idle gap forces a full rewrite of everything resident. In long sessions this dominates the bill. In one measured call, $2.84 of a $2.95 charge was the cache write of about 455,000 tokens, while the model output was under 4,000 tokens.

What can an MCP server actually control about caching?

Nothing about how or when the client caches; breakpoints and time-to-live are decided by the client. The only lever a server controls is how many tokens ever enter the resident context window. Shrinking that makes every turn cheaper to read and cheaper to rewrite.

How does CostAffective reduce the context window without losing information?

It answers from a local index instead of dumping files, budgets repository summaries, and provides stash_context and recall to move large output out of the window behind a small handle. Stashing relocates tokens to disk rather than deleting them, so the full content is always recoverable.

The Vision: Why CostAffective Cuts Prompt-Cache Cost

The problem

An AI coding assistant working in a real repository spends most of its budget on two things, and neither of them is thinking.

Rediscovery. The model reads the same files over and over to answer questions it has effectively already answered: where is this defined, who calls this, what does this module do. Each read pushes thousands of tokens into the context window.

The prompt cache. Providers cache the conversation so repeated context is cheaper to resend. But the cache is not free. Every turn pays to read the entire resident context. And any change to earlier context, or a short idle gap, invalidates the cache and forces a full rewrite of everything resident.

A single measured call was billed at $2.95, of which $2.84 was the cache write of roughly 455,000 tokens of resident context. The model's output that turn was under 4,000 tokens. The expensive part was the size of the context being carried, not the answer.
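The proportions in that call are easier to see as arithmetic. The short sketch below uses only the three figures quoted above. The per-million rate it prints is derived from those figures, not a published price, and the token count is approximate.

```python
# Figures quoted in the text for one measured call.
total_charge_usd = 2.95
cache_write_usd = 2.84
cache_write_tokens = 455_000  # approximate, per the text

# Share of the bill that went to writing the cache.
cache_write_share = cache_write_usd / total_charge_usd

# Implied price per million cache-write tokens (derived, not a published rate).
implied_usd_per_million = cache_write_usd / cache_write_tokens * 1_000_000

# Everything that was not the cache write (output, reads, fresh input).
remainder_usd = total_charge_usd - cache_write_usd

print(f"cache write share: {cache_write_share:.1%}")
print(f"implied rate: ${implied_usd_per_million:.2f} per million tokens")
print(f"everything else: ${remainder_usd:.2f}")
```

On these numbers the cache write is about 96% of the charge, and everything else, including the model's output, fits in the remaining $0.11.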

The insight

A tool that connects to an editor over MCP cannot control how or when the client caches. Cache breakpoints and time-to-live are decided by the client, not the server. There is exactly one lever a server does control:

How many tokens ever enter the resident context window in the first place.

Shrink that, and both costs fall at once: a smaller window is cheaper to read every turn and cheaper to rewrite when the cache is invalidated. Every design decision in CostAffective serves this one goal: keep tokens out of the window without losing information.
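A toy cost model makes the "both costs fall at once" claim concrete. It is a simplification under stated assumptions: the window size is held constant for the whole session, every turn reads the full window, every invalidation rewrites it, and the two prices are placeholders chosen only to keep the arithmetic readable. They are not any provider's actual rates.

```python
def session_cache_cost(resident_tokens, turns, rewrites,
                       read_usd_per_mtok, write_usd_per_mtok):
    """Toy model: each turn reads the whole window, each invalidation rewrites it."""
    read_cost = turns * resident_tokens * read_usd_per_mtok / 1_000_000
    write_cost = rewrites * resident_tokens * write_usd_per_mtok / 1_000_000
    return read_cost + write_cost

# Placeholder prices (assumptions for illustration, not real provider rates).
READ_USD_PER_MTOK = 0.50
WRITE_USD_PER_MTOK = 6.00

# Same session shape, two window sizes.
large = session_cache_cost(400_000, turns=50, rewrites=5,
                           read_usd_per_mtok=READ_USD_PER_MTOK,
                           write_usd_per_mtok=WRITE_USD_PER_MTOK)
small = session_cache_cost(200_000, turns=50, rewrites=5,
                           read_usd_per_mtok=READ_USD_PER_MTOK,
                           write_usd_per_mtok=WRITE_USD_PER_MTOK)

print(f"400k-token window: ${large:.2f}")
print(f"200k-token window: ${small:.2f}")
```

Because both terms are proportional to the resident token count, halving the window halves the read cost and the rewrite cost together, whatever the actual prices are.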

The approach

CostAffective is a local MCP server that does three things in service of that goal.

1. Answer from a local index, not from raw files

It parses your repository once with Tree-sitter and stores symbols, references, and call edges in a local SQLite index. Navigation questions are answered from that index in a few tokens instead of by dumping files. A token-budgeted repository summary gives the high-level map without ever emitting a giant tree at session start.
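As a rough illustration of answering navigation questions from an index, the sketch below builds a tiny SQLite database and queries it. The schema, table names, sample rows, and the helpers where_defined and who_calls are all hypothetical. CostAffective's real schema is not described here, and the Tree-sitter parsing step is replaced by hand-written rows.

```python
import sqlite3

# In-memory database standing in for the on-disk index (assumed schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE symbols (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    kind TEXT NOT NULL,
    path TEXT NOT NULL,
    line INTEGER NOT NULL
);
CREATE TABLE calls (
    caller_id INTEGER NOT NULL REFERENCES symbols(id),
    callee_id INTEGER NOT NULL REFERENCES symbols(id)
);
CREATE INDEX idx_symbols_name ON symbols(name);
""")

# Hand-written rows in place of what a parser would extract.
conn.executemany(
    "INSERT INTO symbols (id, name, kind, path, line) VALUES (?, ?, ?, ?, ?)",
    [
        (1, "load_config", "function", "app/config.py", 12),
        (2, "main", "function", "app/cli.py", 40),
        (3, "run_server", "function", "app/server.py", 7),
    ],
)
conn.executemany("INSERT INTO calls (caller_id, callee_id) VALUES (?, ?)",
                 [(2, 1), (3, 1)])

def where_defined(name):
    """Answer 'where is this defined?' as short path:line strings."""
    rows = conn.execute(
        "SELECT path, line FROM symbols WHERE name = ? ORDER BY path", (name,)
    ).fetchall()
    return [f"{path}:{line}" for path, line in rows]

def who_calls(name):
    """Answer 'who calls this?' by following call edges to the callee."""
    rows = conn.execute(
        """
        SELECT caller.name, caller.path, caller.line
        FROM calls
        JOIN symbols AS callee ON callee.id = calls.callee_id
        JOIN symbols AS caller ON caller.id = calls.caller_id
        WHERE callee.name = ?
        ORDER BY caller.path
        """,
        (name,),
    ).fetchall()
    return [f"{n} ({p}:{l})" for n, p, l in rows]

print(where_defined("load_config"))
print(who_calls("load_config"))
```

Each answer is a handful of short strings, whereas reading the three files to find the same facts would put their full contents into the window.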

2. Give the model tools to keep large content out of the window

The context-control tools form a loop: stash, recall, remember. They let the model move large output and durable facts out of the resident window without loss, because the content is relocated to disk, not deleted.

stash_context

Park a large blob out of context and get back a short handle. The full content is written to disk.

recall

Pull back only the slice a query needs, within a token budget, never the whole blob.

remember

Persist the durable conclusion as a per-repository fact so it is not re-derived next turn.

3. Make the model use all of this, automatically

None of it helps if the model does not reach for it. The costaffective-session skill is a small piece of session-awareness guidance (about 275 tokens) that teaches the model the lean workflow once per session. It is delivered through the MCP protocol's instructions field, so every connected editor loads it on connect, and it also ships as a native Claude Code skill. That fixed, tiny cost pays for itself the first time it prevents a single large blob from entering the window.
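For readers unfamiliar with the instructions field: in MCP, the server's reply to the client's initialize request may carry an optional instructions string alongside its capabilities and server info. The sketch below builds such a reply as a plain dictionary. The server name, version, and guidance text are placeholders, and the actual wording of the costaffective-session guidance is not reproduced here.

```python
import json

# Placeholder guidance text (the real skill text is not shown in this document).
GUIDANCE = (
    "Prefer index lookups over reading whole files. "
    "Stash large outputs and recall only the slice you need. "
    "Remember durable conclusions."
)

# Shape of a JSON-RPC initialize result; field values are illustrative.
initialize_response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "protocolVersion": "2024-11-05",
        "capabilities": {"tools": {}},
        "serverInfo": {"name": "example-server", "version": "0.0.0"},
        "instructions": GUIDANCE,  # optional field the client can give to the model
    },
}

print(json.dumps(initialize_response, indent=2))
```

Since the guidance travels in the handshake itself, it does not depend on any editor-specific configuration, although how a given client surfaces the field to the model is up to that client.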

Why not just summarize or delete old context?

Because that loses information. Summarizing collapses detail you may need later; deleting drops it outright. Stashing relocates tokens rather than discarding them. The full content stays on disk and is always recoverable with recall. That was a hard design constraint from the start: reduce the window without ever dropping context.

The result

The same philosophy runs end to end: answer from the index, budget the summaries, stash the large stuff, recall only the slice, remember the conclusions, and make the editor do it by default. On the Continue OSS repository this adds up to 45.9% fewer tokens, 54.3% fewer exploration loops, and 42.1% fewer tool interactions, with everything running locally.


The expensive part isn't the answer. It's the context.