RAG is dead. Long live context engineering.

Why 2M-token windows changed the retrieval game — and what to build instead.

by Vincent Tat · Apr 14, 2026 · 9 min

For two years, every AI product had the same spine: chunk documents, embed, stuff top-k into the prompt, pray. Retrieval-augmented generation became synonymous with "enterprise AI." Vendors sold it. Consultants deployed it. Conference talks celebrated it.

Then the context windows grew. Not a little: roughly a thousand-fold, from a few thousand tokens to 2M. And the question quietly shifted from "what should we retrieve?" to "what should we leave out?"

The retrieval tax

Every RAG pipeline pays three hidden costs: chunking destroys structure, embeddings flatten meaning, and top-k ranking discards the long tail. For a customer support bot answering FAQs, that's fine. For reasoning across a codebase, a legal corpus, or a research archive, it's catastrophic.

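# The old way: chunk, embed, retrieve the top k, then prompt
chunks = split(doc, size=512)
vectors = embed(chunks)
results = top_k(query, vectors, k=5)  # the k best-matching chunks
answer = llm(query + "\n\n" + "\n\n".join(results))

# The new way: send the whole document
answer = llm(query + "\n\n" + doc)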
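To make the retrieval tax concrete, here is a toy, runnable version of the old way. It is a sketch, not a real pipeline: `split`, `words`, and `top_k` are hypothetical helpers, and plain word overlap stands in for an embedding model, which would likely score these chunks differently. The document and query are made up for illustration.

```python
import re

def split(doc: str, size: int) -> list[str]:
    """Fixed-size character chunks that ignore sentence boundaries."""
    return [doc[i:i + size] for i in range(0, len(doc), size)]

def words(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for an embedding (assumption)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def top_k(query: str, chunks: list[str], k: int) -> list[str]:
    """Rank chunks by word overlap with the query and keep the best k."""
    q = words(query)
    return sorted(chunks, key=lambda c: len(q & words(c)), reverse=True)[:k]

doc = (
    "Refunds are allowed within 30 days of purchase. "
    "Exception: digital goods are never refundable."
)
query = "Can I return my e-book within 30 days?"

chunks = split(doc, size=40)
retrieved = top_k(query, chunks, k=1)

print(chunks)     # the exception clause is cut in two by the chunk boundary
print(retrieved)  # only the chunk that shares words with the query survives
```

With these inputs the retriever keeps the chunk about the 30-day window and drops the exception for digital goods, which is the part the question actually turns on. Passing `doc` whole avoids both problems at the cost of more tokens.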

What replaces it

Context engineering. The discipline of deciding what a model sees, in what order, at what fidelity — without pre-emptively throwing information away. It looks less like database design and more like film editing.
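One way to picture "what, in what order, at what fidelity" is a small assembler that never drops a section outright. The sketch below is an illustration under loud assumptions: the `Section` type, the `salience` scores, the precomputed summaries, and the whitespace-based `count_tokens` are all hypothetical, and a real system would use the model's own tokenizer.

```python
from dataclasses import dataclass

@dataclass
class Section:
    title: str
    full: str        # full-fidelity text
    summary: str     # precomputed low-fidelity version
    salience: float  # hypothetical relevance score, higher is more important

def count_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (assumption)."""
    return len(text.split())

def assemble(sections: list[Section], budget: int) -> str:
    """Keep every section, in document order, upgrading the most salient to full text."""
    # Start with everything at summary fidelity so nothing is thrown away.
    use_full = [False] * len(sections)
    used = sum(count_tokens(s.summary) for s in sections)

    # Spend the remaining budget on the most salient sections first.
    by_salience = sorted(range(len(sections)),
                         key=lambda i: sections[i].salience, reverse=True)
    for i in by_salience:
        extra = count_tokens(sections[i].full) - count_tokens(sections[i].summary)
        if used + extra <= budget:
            use_full[i] = True
            used += extra

    # Emit in the original order; titles are not counted against the budget here.
    parts = [f"[{s.title}]\n{s.full if full else s.summary}"
             for s, full in zip(sections, use_full)]
    return "\n\n".join(parts)

sections = [
    Section("Intro", "alpha " * 10, "alpha", 0.2),
    Section("Contract terms", "beta " * 10, "beta", 0.9),
    Section("Appendix", "gamma " * 10, "gamma", 0.5),
]
context = assemble(sections, budget=15)
print(context)
```

The design choice worth noticing is the starting point: the budget is spent on upgrading fidelity rather than on admitting or rejecting passages, so the long tail stays visible to the model in compressed form.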

The tools are different. Caching layers that keep assembled context around for each session. Summarizers that compress low-salience passages. Routers that decide when to fall back to retrieval and when to stream the full document. This is the new stack.
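Two of those pieces are small enough to sketch. What follows is a minimal illustration, not a reference design: `route`, `SessionCache`, the `reserve_for_answer` default, and the token counts are all assumptions made up for the example.

```python
from typing import Callable

def route(doc_tokens: int, query_tokens: int, window: int,
          reserve_for_answer: int = 1024) -> str:
    """Return "full" when the whole document fits in the window, else "retrieve"."""
    available = window - reserve_for_answer - query_tokens
    return "full" if doc_tokens <= available else "retrieve"

class SessionCache:
    """Remembers the assembled context per (session, document) pair."""

    def __init__(self) -> None:
        self._store: dict[tuple[str, str], str] = {}

    def get_or_build(self, session_id: str, doc_id: str,
                     build: Callable[[], str]) -> str:
        key = (session_id, doc_id)
        if key not in self._store:
            # Only pay the assembly cost the first time this session sees the doc.
            self._store[key] = build()
        return self._store[key]

# Usage: a 1.5M-token document fits a 2M window, a 3M-token one does not.
print(route(doc_tokens=1_500_000, query_tokens=200, window=2_000_000))
print(route(doc_tokens=3_000_000, query_tokens=200, window=2_000_000))

cache = SessionCache()
context = cache.get_or_build("session-1", "contract.pdf", lambda: "assembled context")
```

A real router would probably weigh cost and latency as well as fit, but the shape is the same: retrieval becomes the fallback branch, not the default.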

