Context Has Gravity: The Hidden Infrastructure Bottleneck Behind Agentic AI

For the last two years, AI scaling has been framed first as a race for GPUs, then as a race for power, then as a race for inference capacity. Now another important constraint is becoming visible: context.

At first, context sounds like a software problem: prompt design, retrieval, memory, or application logic. But once AI systems become agents, context stops being just a product feature and becomes an infrastructure problem.

A chatbot can be relatively stateless: it receives a prompt, generates a response, and the interaction ends. An agent is different. It may plan, call tools, search documents, query systems, write code, check outputs, ask for approval, repeat steps and continue working over time. That means the agent does not just need compute. It needs memory, it needs state, and it needs access to the right history, the right data, the right tool outputs and the right operating context at the right moment. That changes the AI infrastructure question yet again.

The next phase of AI will not only be about where models are trained or where inference runs. It will be about where context lives, how it moves, how it is reused, how it is governed, and how much it costs to keep intelligence useful over time.

Agents change what context means

In the early phase of generative AI, most enterprise use cases were built around short interactions. A user asked a question, the model answered, and the value of the system depended mainly on model quality, latency and cost per token. Agentic AI is much more complex.

A production agent may need to maintain continuity across a long task. It may need to remember earlier decisions, retrieve relevant documents, understand permissions, interact with systems, track intermediate outputs, and know when a previous instruction has been superseded. It may also need to operate across multiple sessions, multiple users, multiple tools and multiple data environments.

That means context becomes dynamic. It is not simply a larger prompt, it is a constantly changing operating state.

Anthropic has described this shift as the move from prompt engineering to context engineering. It defines context as the full set of tokens and information available to a model during inference, including system instructions, tools, message history, external data and other state. It also argues that context is a critical but finite resource for agents, not something that can be expanded without consequence.
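
To make that definition concrete, here is a minimal sketch of context as a data structure, assuming a generic agent loop. The field names are illustrative assumptions, not Anthropic's or any vendor's API.

```python
from dataclasses import dataclass, field

@dataclass
class AgentContext:
    """Everything the model sees at inference time, per the definition
    above. Fields are illustrative assumptions, not a vendor API."""
    system_instructions: str
    tool_definitions: list[str] = field(default_factory=list)
    message_history: list[dict] = field(default_factory=list)
    retrieved_documents: list[str] = field(default_factory=list)
    scratchpad: str = ""  # intermediate agent state carried between steps

    def token_estimate(self) -> int:
        # Rough heuristic: roughly four characters per token in English.
        text = self.system_instructions + self.scratchpad
        text += "".join(self.tool_definitions + self.retrieved_documents)
        text += "".join(m.get("content", "") for m in self.message_history)
        return len(text) // 4
```

Every field competes for the same finite token budget, which is the "finite resource" point in practice.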

That point matters because many people still assume larger context windows solve the problem. They help, but they do not remove the underlying constraint. A larger window can hold more information, but the model still has to attend to it, reason over it, and find the signal inside the noise.

In practical terms, long context is not free. It creates latency. It consumes memory. It increases cost. It complicates orchestration. It can degrade quality if irrelevant information crowds out the useful information. Cloudflare made the same point when it launched Agent Memory in April 2026, noting that even as context windows grow beyond one million tokens, context rot remains an unresolved problem.

That is the real issue. The market does not just need bigger context windows, it needs better context systems. Ultimately it needs engineers who understand the need for tighter resource control, a skill I personally believe has been in steady decline over the last three decades.
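
In its simplest form, a better context system is a budgeting step rather than a bigger window: rank candidate items and drop what does not fit. A minimal sketch, assuming a greedy packing policy and the same rough four-characters-per-token estimate:

```python
def fit_to_budget(items: list[tuple[float, str]], budget_tokens: int) -> list[str]:
    """Greedy context packing: keep the highest-relevance items that fit.
    `items` holds (relevance_score, text) pairs; the scoring and the
    greedy policy are illustrative assumptions, not a standard."""
    kept, used = [], 0
    for score, text in sorted(items, key=lambda it: it[0], reverse=True):
        cost = max(1, len(text) // 4)  # rough token estimate
        if used + cost <= budget_tokens:
            kept.append(text)
            used += cost
    return kept
```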

Context becomes physical

The reason this matters for infrastructure is that context has to exist somewhere.

It may appear to the user as memory, history or personalisation, but underneath it becomes data movement, storage, retrieval, cache management and memory hierarchy. The more agentic the workload, the more those layers matter.

One of the most important examples is the KV cache.

In simple terms, when a transformer model processes text, it creates internal representations that allow it to generate the next token without recalculating the entire history every time. That cached state is essential for efficient inference. For short prompts, this may be manageable. For long running agents, multi turn workflows and very large context windows, it becomes a major memory and storage challenge.
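
Back-of-envelope arithmetic shows the scale. Assuming illustrative dimensions for a generic transformer, not any specific model, the cache grows linearly with sequence length:

```python
def kv_cache_bytes(seq_len: int, layers: int, kv_heads: int,
                   head_dim: int, dtype_bytes: int = 2) -> int:
    """Per-sequence KV cache: two tensors (K and V) per layer, each of
    shape [seq_len, kv_heads, head_dim], 2 bytes each in fp16/bf16."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * seq_len

# Illustrative dimensions, not a real model spec:
# 64 layers, 8 KV heads, head dimension 128, 2-byte precision.
per_token = kv_cache_bytes(1, 64, 8, 128)
per_million = kv_cache_bytes(1_000_000, 64, 8, 128)
print(f"{per_token / 1024:.0f} KiB per token")    # 256 KiB
print(f"{per_million / 1e9:.0f} GB at 1M tokens") # 262 GB, per sequence
```

At those assumed dimensions, a single million-token sequence needs more cache than most accelerators hold in high bandwidth memory, and every concurrent agent session adds its own copy.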

NVIDIA is now explicit about this. It says that as context windows increase, KV cache requirements grow proportionally, while recomputing that history becomes much more expensive. It also says agentic systems are putting pressure on existing memory and storage tiers because traditional storage was not designed for ephemeral, latency sensitive inference context.

That is the point at which context stops being abstract.

It becomes a question of GPU memory, system RAM, local storage, shared storage, network bandwidth, latency, energy efficiency and utilisation. If context sits too far from the accelerator, the system pays for it in delay, power and cost per useful token. If it sits only in scarce high bandwidth memory, capacity becomes the constraint. If it is pushed into generic storage, performance suffers.
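
A hedged sketch of that placement trade-off, with the tier names, thresholds and access-recency heuristic all as assumptions; a real policy would also weigh bandwidth, capacity pressure and cost per gigabyte:

```python
from enum import Enum

class Tier(Enum):
    HBM = "gpu_hbm"            # scarce, fastest: active KV cache
    HOST_RAM = "host_ram"      # larger, slower: recently paused sessions
    SHARED = "shared_storage"  # cheapest: cold or archival context

def place_context(last_access_s: float, reuse_likelihood: float) -> Tier:
    """Toy placement policy. Thresholds are illustrative assumptions."""
    if last_access_s < 1.0 and reuse_likelihood > 0.5:
        return Tier.HBM
    if last_access_s < 300.0:
        return Tier.HOST_RAM
    return Tier.SHARED
```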

This is why context has gravity.

Just as data gravity shaped cloud architecture, context gravity will start to shape AI infrastructure. The more valuable, persistent and operationally specific context becomes, the more it influences where inference should run, where memory should sit, and how the surrounding system should be designed.

The memory wall is moving into the market

This is no longer just a research issue. The infrastructure stack is already responding.

Google’s latest TPU 8i is a useful signal. Google describes it as a reasoning system for inference and reinforcement learning, built for the low latency needs of agentic workflows and mixture of experts models. It triples on chip SRAM to 384 MB and increases high bandwidth memory to 288 GB, specifically to help host larger KV caches on silicon and reduce idle time during long context decoding.

That tells us something important. The next generation of AI hardware is not only being designed around raw compute, it is being designed around memory placement, cache behaviour, latency and agent concurrency.

NVIDIA is moving in the same direction from another layer of the stack. Its CMX context memory storage platform is designed to create a new context memory tier for long context and agentic inference. NVIDIA says this can improve token throughput and power efficiency compared with traditional storage approaches.

Clearly agentic AI is forcing the infrastructure industry to redesign around context, not just compute. That means memory bandwidth, cache reuse, context compression, retrieval quality, storage hierarchy and orchestration will all become more commercially important.

Now the bottleneck is not just, can we run the model? It is, can we run the model with the right context, at the right latency, at the right cost, repeatedly, across real workflows?

Long context will not remove the need for discipline

There is a temptation in the market to treat large context windows as the answer to every enterprise AI problem. Put more into the prompt. Give the model more documents. Keep a longer history. Let the agent see everything. And that approach will fail more often than people expect.

The best agentic systems will not be the ones that pass the most information to the model. They will be the ones that pass the most useful information to the model, at the right point in the workflow.

That distinction really matters.

Enterprise context is messy. It includes documents, policies, permissions, historical decisions, live telemetry, user preferences, audit trails, tool outputs, operational constraints and exceptions. Much of it is irrelevant at any given moment. Some of it is sensitive. Some of it is out of date. Some of it conflicts with other information. Some of it should never leave a specific environment.

So, the hard problem is not simply storage, it is controlled relevance.

Agents need to remember, but they also need to forget. They need to retrieve, but they also need to filter. They need continuity, but they also need governance. They need access to enterprise data, but not unlimited access to everything.
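
A minimal sketch of controlled relevance under those constraints, with illustrative fields and policy. The ordering is the point: governance and freshness are hard gates applied before relevance is even consulted.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class ContextItem:
    text: str
    relevance: float         # e.g. a retrieval similarity score
    allowed_roles: set[str]  # governance: who may see this item
    updated_at: datetime     # freshness: timezone-aware timestamp

def controlled_retrieve(items: list[ContextItem], role: str,
                        max_age: timedelta, k: int) -> list[str]:
    """Filter first (permissions, staleness), then rank by relevance.
    Access control and age are hard gates, not ranking signals."""
    now = datetime.now(timezone.utc)
    eligible = [i for i in items
                if role in i.allowed_roles and now - i.updated_at <= max_age]
    eligible.sort(key=lambda i: i.relevance, reverse=True)
    return [i.text for i in eligible[:k]]
```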

That is why context engineering will become a core operating discipline for enterprise AI. It sits between application design, data architecture, infrastructure, security, FinOps and compliance. The organisations that treat it as a minor prompt optimisation exercise will struggle to scale. The ones that treat it as part of the infrastructure layer will have a better chance of making agents reliable and economical.

Context gravity changes workload placement

The previous infrastructure question was, where should inference run? That is still the right question, but context makes the answer more specific.

A generic assistant may work well from a centralised inference pool. A batch summarisation job may not need to sit close to the user. A frontier model serving many customers may benefit from hyperscale economics.

But an agent embedded in a factory, an energy network, a hospital, a regulated enterprise campus or a field operations environment may behave differently. In those cases, the value of the agent may depend on local data, low latency, resilience, security, sovereignty and operational context.

The issue is not only where the user sits, it is where the relevant context sits.

If the agent depends on live telemetry, local documents, sensitive records, control systems or site specific operating history, then pushing every step back to a distant central region may create avoidable cost, latency, bandwidth demand and governance complexity. If the useful context is local, some of the inference may need to move closer to that context.

That does not mean everything moves to the edge. It means the infrastructure model becomes more stratified.

One layer will remain centralised around frontier training and large shared inference pools. A second layer will become regional, shaped by power availability, customer demand, data sovereignty and enterprise proximity. A third layer will sit closer to physical systems, where the cost of distance becomes operationally real.

Context gravity strengthens that layered model. It means workload placement will not be decided by model size alone. It will be decided by the relationship between compute, power, data, context, latency, resilience and governance.
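
As a toy illustration of that layered decision, here is a rule-of-thumb placement function; the inputs and thresholds are assumptions, and a production policy would weigh power, resilience and cost as well:

```python
def placement_layer(latency_budget_ms: float, data_is_sovereign: bool,
                    context_is_local: bool) -> str:
    """Toy placement rule for the three-layer model described above.
    Inputs and thresholds are illustrative assumptions."""
    if data_is_sovereign and context_is_local:
        return "edge"      # the context cannot leave the site
    if latency_budget_ms < 50 or context_is_local:
        return "regional"  # close enough to the context and the user
    return "central"       # shared inference pools win on economics
```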

Cost per token is not enough

The market often measures inference cost in tokens. That measure is useful, but it is too narrow for agentic AI.

An agent may use many model calls to complete one task. It may retrieve documents, call tools, validate outputs, update memory and repeat steps. It may use cheap models for some stages and larger models for others. It may consume context that was created earlier, stored elsewhere and recalled later.

So the real metric is not simply cost per token, it is cost per useful, context aware action.

A low cost model call is not cheap if the agent forgets the relevant history, repeats work, retrieves the wrong information, loses state, or requires human intervention to repair the workflow. A more expensive system may be cheaper in practice if it uses context efficiently, avoids unnecessary calls, reduces latency and completes the task reliably.
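
One way to make that concrete is to account for the whole workflow rather than individual calls. A sketch, with illustrative cost categories:

```python
def cost_per_useful_action(model_call_costs: list[float],
                           retrieval_costs: list[float],
                           human_repair_cost: float,
                           tasks_completed: int) -> float:
    """Total workflow spend divided by tasks actually completed.
    A cheap model that loses state and triggers retries or human
    intervention can lose to a dearer one on this metric."""
    if tasks_completed == 0:
        return float("inf")
    total = sum(model_call_costs) + sum(retrieval_costs) + human_repair_cost
    return total / tasks_completed

# Nine cheap calls, one retrieval, one manual fix, one completed task:
print(cost_per_useful_action([0.002] * 9, [0.01], 5.00, 1))  # 5.028
```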

This is why AI cost management is moving quickly into the mainstream. The FinOps Foundation’s 2026 State of FinOps report says 98 per cent of respondents now manage AI spend, up from 31 per cent two years earlier, with AI cost management identified as the number one skillset teams need to develop.

That trend will only become more important as agents move from experiments to production systems.

McKinsey has also warned that as AI workloads expand, IT infrastructure costs could increase two to three times by 2030 while budgets remain broadly flat. It also found that 62 per cent of organisations are experimenting with or piloting AI agents, but no more than 10 per cent are scaling agents in any given business function.

That gap is the opportunity. The market is interested in agents, but scaling them is still hard. One reason is that the underlying infrastructure, data and operating models were not built for persistent, context rich, autonomous workflows.

Energy still sets the boundary

Context does not replace the power problem, it adds another layer to it.

AI infrastructure is already energy constrained. Context heavy agents make that problem more specific.

Every unnecessary retrieval, every avoidable model call, every poorly managed cache, every repeated workflow step and every misplaced inference request consumes real infrastructure capacity. At scale, poor context management becomes poor energy management.

The industry has spent a lot of time asking how to secure more megawatts. That remains critical. But the next question is how intelligently those megawatts are converted into useful work.

If context can be stored, compressed, recalled and reused efficiently, the same infrastructure can deliver more value. If context is handled badly, expensive compute sits idle, memory pressure rises, storage traffic increases and energy is wasted on repeated work.

In that sense, context engineering is not only an AI quality problem, it is an infrastructure efficiency problem.

What serious builders do differently

For infrastructure builders, operators and investors, the implication is clear. AI infrastructure can no longer be planned only around racks, GPUs, power and cooling. Those remain essential, but they are not enough.

The real question is, what context does this workload need to perform useful work?

Recognising the importance of that question changes the development model. It means understanding the decision loop before choosing the infrastructure layer. It means knowing whether the workload needs persistent memory, live data access, low latency retrieval, strict sovereignty, auditability, or continuity during connectivity loss. It means designing storage, networking, orchestration and security around the context requirements of the workload, not treating them as afterthoughts. It also means being disciplined about what should not move outward.

Not every workload needs local inference. Not every agent needs persistent memory. Not every context store needs to be close to the user. Not every enterprise process justifies a dedicated infrastructure layer. The mistake is not in believing context matters. The mistake is assuming that all context has the same value, the same sensitivity, the same latency requirement and the same economic profile.

The winners will be more precise. They will understand which context is hot, which is warm, which is cold, which is regulated, which is reusable and which should be discarded. They will optimise the whole loop, from model to memory, from retrieval to storage, from network path to energy profile, from governance policy to useful outcome.

The real opportunity

The first AI infrastructure race was about securing GPUs. The second was about securing power. The third is about placing inference intelligently. The next will be about context.

That does not mean context replaces compute, power or data, it means those layers start to interact in a more complex way. Once agents become persistent, memory becomes part of the operating model. Once workflows become multi step, context becomes part of the cost base. Once AI enters regulated and physical environments, context becomes part of the governance model. Once inference is distributed across central, regional and edge layers, context becomes part of the placement decision.

That is why context has gravity. It pulls infrastructure towards the places where useful state, operational data, latency requirements and governance boundaries actually exist. It changes which sites matter, which architectures make sense and which systems can deliver value repeatedly rather than occasionally.

The companies that win the next phase of AI infrastructure will not simply own the biggest clusters or the cheapest tokens. They will operate the best context systems.

They will know how to accumulate context without creating noise. They will know how to retrieve context without wasting tokens. They will know how to store context without overloading expensive memory. They will know how to govern context without slowing every workflow. And they will know where intelligence should run when the context that makes it useful is no longer centralised.

In agentic AI, intelligence is not just generated, it is accumulated, compressed, retrieved, governed and reused.

That makes context one of the most important infrastructure layers in the AI economy.
