RAG is dead. Long live context engineering.

Why 2M-token windows changed the retrieval game — and what to build instead.

by Vincent Tat·Apr 14, 2026·9 min

For two years, every AI product had the same spine: chunk documents, embed, stuff top-k into the prompt, pray. Retrieval-augmented generation became synonymous with "enterprise AI." Vendors sold it. Consultants deployed it. Conference talks celebrated it.

Then the context windows grew. Not a little — a thousand-fold. And the question quietly shifted from "what should we retrieve?" to "what should we leave out?"

The retrieval tax

Every RAG pipeline pays three hidden costs: chunking destroys structure, embeddings flatten meaning, and top-k ranking discards the long tail. For a customer support bot answering FAQs, fine. For reasoning across a codebase, a legal corpus, or a research archive: catastrophic.

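# The old way (pseudocode): keep a few chunks, discard the rest
chunks = split(doc, size=512)                 # fixed-size chunks; structure is lost
vectors = embed(chunks)                       # one vector per chunk
results = top_k(query, chunks, vectors, k=5)  # the five best-matching chunks
answer = llm(query + "\n\n" + "\n".join(results))

# The new way (pseudocode): send the whole document
answer = llm(query + "\n\n" + doc)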
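
To make the retrieval tax concrete, here is a minimal runnable sketch of the same contrast. It reuses the names `split`, `embed`, and `top_k` from the pseudocode above, but their bodies are assumptions made for illustration: the "embedding" is a toy bag of words rather than a learned model, chunk size is counted in words rather than tokens, and the `llm` call is left out so that only the prompts are compared.

```python
import re
from collections import Counter

def split(doc: str, size: int) -> list[str]:
    """Fixed-size chunks of `size` words (a stand-in for token-based chunking)."""
    words = doc.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag of lowercase words. Real systems use a learned model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def score(a: Counter, b: Counter) -> int:
    """Word-overlap similarity between two bags of words."""
    return sum((a & b).values())

def top_k(query: str, chunks: list[str], k: int) -> list[str]:
    """Return the k chunks that share the most words with the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: score(q, embed(c)), reverse=True)
    return ranked[:k]

doc = (
    "The refund window is 30 days from delivery. "
    "Exceptions to this apply to items bought during a clearance sale. "
    "Those can only be exchanged, never refunded."
)
query = "What is the refund window?"

chunks = split(doc, size=8)

# Old way: only the best-matching chunk reaches the model.
retrieved_prompt = query + "\n\n" + "\n".join(top_k(query, chunks, k=1))

# New way: the whole document reaches the model.
full_prompt = query + "\n\n" + doc

print(retrieved_prompt)
print("---")
print(full_prompt)
```

In this toy case the retrieved prompt contains the 30-day rule but not the clearance-sale exception, because the exception shares no words with the query and sits in a different chunk. The full prompt carries both.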

What replaces it

Context engineering. The discipline of deciding what a model sees, in what order, at what fidelity — without pre-emptively throwing information away. It looks less like database design and more like film editing.
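
One way to picture "what, in what order, at what fidelity" is a small assembler that never drops a section, only lowers its resolution. The sketch below is an illustration under stated assumptions, not a reference implementation: the `Section` type, the caller-supplied `salience` score, the word-count token estimate, and the first-sentence "summary" are all hypothetical stand-ins.

```python
from dataclasses import dataclass

@dataclass
class Section:
    title: str
    text: str
    salience: float  # hypothetical relevance score in [0, 1], supplied by the caller

def n_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words. Real tokenizers differ."""
    return len(text.split())

def first_sentence(text: str) -> str:
    """Lowest-fidelity form of a section: its first sentence only."""
    return text.split(". ")[0].rstrip(".") + "."

def assemble(sections: list[Section], budget: int) -> str:
    """Keep every section; spend the budget on full text for the most salient ones.

    The budget counts section bodies only (titles are ignored), and the sketch
    assumes the reduced-fidelity versions already fit within it.
    """
    # Start with every section at reduced fidelity so nothing is dropped outright.
    rendered = [first_sentence(s.text) for s in sections]
    used = sum(n_tokens(t) for t in rendered)

    # Upgrade sections to full text, most salient first, while the budget allows.
    by_salience = sorted(range(len(sections)),
                         key=lambda i: sections[i].salience, reverse=True)
    for i in by_salience:
        extra = n_tokens(sections[i].text) - n_tokens(rendered[i])
        if used + extra <= budget:
            rendered[i] = sections[i].text
            used += extra

    # Emit in original document order, not in salience order.
    return "\n\n".join(f"{s.title}\n{body}" for s, body in zip(sections, rendered))

sections = [
    Section("Overview", "This contract covers hosting. It runs for two years. "
                        "Renewal is automatic.", 0.2),
    Section("Liability", "Liability is capped at fees paid. The cap excludes data loss. "
                         "Claims expire after one year.", 0.9),
    Section("Notices", "Notices go by email. Postal copies are optional.", 0.1),
]
context = assemble(sections, budget=25)
print(context)
```

With a budget of 25 words, the Liability section is kept in full while Overview and Notices are reduced to their first sentences, and all three still appear in their original order.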

The tools are different. Caching layers that persist state per session. Summarizers that compress low-salience passages. Routers that decide when to fall back to retrieval and when to stream the full document. This is the new stack.
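
Two of those pieces, the router and the per-session cache, can be sketched in a few lines. Treat this as a rough illustration rather than a production design: the 2M window comes from the subtitle, while `RESERVED_TOKENS`, the word-count token estimate, and the `SessionCache` class are assumptions introduced here.

```python
import hashlib
from typing import Callable

WINDOW_TOKENS = 2_000_000  # window size taken from the article's subtitle
RESERVED_TOKENS = 50_000   # hypothetical headroom for the query and the answer

def n_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words. Real tokenizers differ."""
    return len(text.split())

def route(doc: str, window: int = WINDOW_TOKENS,
          reserved: int = RESERVED_TOKENS) -> str:
    """Return "full" when the whole document fits in the window, else "retrieve"."""
    return "full" if n_tokens(doc) <= window - reserved else "retrieve"

class SessionCache:
    """Remembers built contexts per session so repeated turns skip the rebuild."""

    def __init__(self) -> None:
        self._store: dict[tuple[str, str], str] = {}
        self.builds = 0  # how many times a context was actually built

    def get_or_build(self, session_id: str, doc: str,
                     build: Callable[[str], str]) -> str:
        # Key on the session and a hash of the document, so an edited
        # document triggers a rebuild instead of serving a stale context.
        key = (session_id, hashlib.sha256(doc.encode("utf-8")).hexdigest())
        if key not in self._store:
            self._store[key] = build(doc)
            self.builds += 1
        return self._store[key]

short_doc = "word " * 10
long_doc = "word " * 500

# A tiny window makes the routing decision visible in a toy example.
print(route(short_doc, window=100, reserved=20))
print(route(long_doc, window=100, reserved=20))

cache = SessionCache()
cache.get_or_build("session-1", short_doc, str.upper)
cache.get_or_build("session-1", short_doc, str.upper)  # served from the cache
cache.get_or_build("session-2", short_doc, str.upper)  # new session, new build
print(cache.builds)
```

The routing rule here is deliberately blunt: it looks only at size. A real router would likely also weigh cost, latency, and how much of the document the query plausibly touches.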

