Rade Kovacevic is the co-founder and CEO of PolarGrid, an edge compute company building a global, real-time AI inference delivery network.
For the last decade, we’ve trained ourselves to expect the internet to feel instant. Pages load in milliseconds. Video streams without buffering. Real-time collaboration just works.
That expectation didn’t happen by accident. Starting with Akamai, whose founders pioneered distributed caching techniques in 1997, content delivery networks (CDNs) were built out globally, with infrastructure distributed geographically so that static data (content that stays constant and isn’t updated often) didn’t need to travel far to reach users.
That evolution didn’t stop with the first generation of the web. Each major shift in how we use the internet has renewed focus on content delivery. Akamai helped make early web content load instantly at global scale; as the internet became mobile and security-sensitive, Cloudflare expanded the CDN into a programmable, globally distributed edge, bringing compute, storage, and security closer to users. And as streaming, APIs, and real-time applications took off, platforms like Fastly optimized their networks further for dynamic, low-latency delivery.
AI, however, represents the next shift—and CDNs, as they exist today, weren’t built for it.
The hidden cost of AI inference
Most AI inference today (the act of running a trained model to generate a response) happens in large, centralized cloud data centres. When a user interacts with an AI application, their request often travels hundreds or thousands of kilometres to reach a graphics processing unit (GPU), and then back again with a result.
That round trip routinely introduces network latencies of 100 milliseconds or more. For some applications, a delay of half a second is tolerable. For many emerging ones, it isn’t.
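To see why, it helps to start from the physics floor. The distances and speeds below are illustrative assumptions rather than measurements, but they show how quickly kilometres turn into milliseconds before any routing, handshakes, or compute are added:

```python
# Back-of-the-envelope propagation delay for an inference round trip.
# Assumption: light travels through fibre at roughly 200,000 km/s,
# i.e. about 200 km per millisecond. Real paths add routing hops,
# TLS handshakes, and queueing on top of this floor.

FIBRE_KM_PER_MS = 200

def round_trip_ms(distance_km: float) -> float:
    """Propagation delay alone, out and back, in milliseconds."""
    return 2 * distance_km / FIBRE_KM_PER_MS

print(round_trip_ms(2000))  # 20.0 ms for a user 2,000 km from a centralized GPU region
print(round_trip_ms(100))   # 1.0 ms for a user 100 km from a nearby edge node
```

Real-world routes are rarely straight lines, and every hop, handshake, and retransmission multiplies that floor, which is how a request that crosses a continent routinely lands at 100 milliseconds or more.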
As AI moves beyond chat interfaces into real-time use cases—voice and video agents, gaming, robotics, industrial automation, augmented reality, fraud detection, and autonomous systems—latency stops being an optimization problem and starts being a product constraint.
Take voice agents as an example. Conversational pauses that drag on for a second feel unnatural and quickly erode trust. For voice assistants and customer support agents, every additional network hop compounds delay; speech must be captured, transmitted, processed, and responded to in near real time. When inference is routed to distant, centralized data centres, even state-of-the-art models struggle to feel conversational. The result isn’t just slower responses; it’s AI that feels artificial, brittle, and frustrating to use.
In these environments, tens of milliseconds of network latency matter. Hundreds are unacceptable.
Why AI didn’t follow the content delivery network playbook
The internet already solved latency once. CDNs cache and distribute static content across hundreds of edge locations worldwide, ensuring data is always close to the user.
AI inference breaks that model.
Unlike static content, AI-generated responses require GPU computation on every single request. Every prompt produces a different output, which means responses can’t be precomputed or reliably cached. These workloads are inherently compute-bound; they must be generated at request time rather than served from memory.
AI workloads have defaulted to centralized hyperscale data centres, not because those are ideal for latency, but because that’s where GPUs, orchestration tools, and developer workflows already exist. As demand for inference capacity has continued to outstrip supply, the scale of that centralization has only grown.
The consequence is a growing mismatch between how fast users expect AI to feel and how far AI requests actually travel.
Latency is becoming the bottleneck
We’re already seeing this tension emerge.
Developers can train increasingly capable models, but deploying them in a way that feels responsive to users is becoming harder. Multi-zone cloud configurations are complex, expensive, and time-consuming to implement, often requiring weeks of infrastructure work and specialized expertise.
Meanwhile, centralized inference creates a structural disadvantage for companies building products that are sensitive to delays. Even with faster GPUs and language processing units, physics still applies. Distance adds delay.
As AI adoption accelerates, this problem won’t remain niche. It will affect user experience, enterprise reliability, and safety-critical systems.
Advances in chips and model optimization are reducing inference time inside the data centre. Faster GPUs, specialized accelerators, and more efficient architectures have meaningfully improved how quickly models can generate outputs.
But as compute latency shrinks, network latency becomes a larger share of the end-to-end experience. When the model itself can produce a response in tens of milliseconds, the physical distance between users and GPUs increasingly determines whether an application feels responsive or not.
The limits of centralization
The solution isn’t just faster chips or more aggressive optimization. It’s architectural.
Just as CDNs distribute content, AI inference needs to be physically distributed—running on GPUs at the network edge, closer to where requests originate.
Processing requests closer to end users dramatically reduces the distance data must travel, cutting network latency by more than 70 percent in many real-world scenarios and enabling sub-30 millisecond round-trip times.
As more applications demand sub-second end-to-end latency, centralized architectures hit a ceiling for real-time use cases, and meeting those latency targets requires addressing the last mile of the network.
Crucially, this approach turns new models, optimization techniques, and hardware into a tailwind for better user experiences. But capturing that tailwind at scale requires a software layer that can:
- Seamlessly deploy inference workloads across geographically distributed GPU nodes;
- Intelligently route requests to the optimal location based on a variety of complex inputs (see the sketch after this list);
- Enable multi-zone deployments from a developer-centric console, without the operational complexity.
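To make the second point concrete, here is a deliberately simplified sketch of latency-aware routing. The node names, weights, and scoring function are hypothetical, not a description of PolarGrid’s production router, which would also have to weigh model placement, capacity, cost, and failover:

```python
# Hypothetical sketch: choose an inference node by trading off the
# user's measured network latency against each node's current load.
from dataclasses import dataclass

@dataclass
class GpuNode:
    region: str
    rtt_ms: float       # round-trip time measured from the user's vantage point
    utilization: float  # 0.0 (idle) to 1.0 (saturated)

def score(node: GpuNode, latency_weight: float = 1.0, load_weight: float = 50.0) -> float:
    """Lower is better: penalize distant nodes and busy nodes."""
    return latency_weight * node.rtt_ms + load_weight * node.utilization

def route(nodes: list[GpuNode]) -> GpuNode:
    """Pick the node with the best combined latency/load score."""
    return min(nodes, key=score)

candidates = [
    GpuNode("us-central", rtt_ms=95.0, utilization=0.40),
    GpuNode("toronto-edge", rtt_ms=12.0, utilization=0.70),
    GpuNode("montreal-edge", rtt_ms=18.0, utilization=0.30),
]
print(route(candidates).region)  # montreal-edge: nearby and lightly loaded
```

Even a toy scorer like this makes the trade-off concrete: the closest node is not always the right node, and the decision has to be made per request, in real time.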
In other words, AI needs its own delivery network. That realization is what led us to build PolarGrid.
PolarGrid is designed as a content delivery network for AI inference—a distributed, edge-GPU platform that allows developers to deploy their models close to end users in minutes, not weeks. By routing inference requests intelligently, we’re able to deliver real-time AI performance without requiring teams to become infrastructure experts.
The models are the same. The hardware is the same. What’s different is where inference runs, how the network identifies the optimal server, and how easily it can be deployed.
We believe distributed inference will become foundational infrastructure, just as CDNs did for the modern web. As AI applications become more interactive, more embedded, and more real-time, latency will quietly determine what’s possible.
The internet became invisible when it became fast. If AI is going to feel truly integrated into everyday life, the same thing needs to happen again.
Latency may be invisible to users—but it’s about to define who wins in AI.
The opinions and analysis expressed in the above article are those of its author, and do not necessarily reflect the position of BetaKit or its editorial staff. It has been edited for clarity, length, and style.
