The Hidden Energy Crisis at Every Interface
The world's computing infrastructure wastes enormous energy where systems meet. We tracked $15M in losses across 30+ projects — and built the tools to fix it.
I was staring at a thermal image of a circuit board at two in the morning when it hit me.
The hottest spot on the entire board wasn’t the processor. It wasn’t the power supply. It was the connector — the seam where two subsystems met.
The interface.
That image changed how I think about energy.
The internet consumes 416 terawatt-hours of energy per year. More than the entire United Kingdom.[1]
Data centers are on track to hit 1,050 TWh by 2026.[2] Global AI spending is racing past two trillion dollars.[3]
Those are big numbers. Let me make them real.
Imagine leaving every light in your house on, every appliance running, every faucet open — and then multiply that by the population of Germany.
That’s roughly what the world’s data infrastructure burns through annually. Not because the computation itself is expensive. Because the boundaries between systems leak energy like a screen door on a submarine.
Fifteen Million Dollars Worth of Education
Over the past several years, we tracked more than $15 million in non-recurring engineering costs across 30-plus projects.
I kept a spreadsheet — and I am not making this up — where every line item traced back to the same root cause.
Not bugs. Not bad engineering. Blind spots at the seams.
A mechanical team assumes one tolerance. An electrical team assumes another. Neither talks to the other until integration, and by then you’ve burned six months and a quarter million dollars discovering that your thermal model was wrong because it didn’t account for a power-domain crossing three layers down.
When data formats don’t align, months of integration work get thrown away.
When power domains cross without proper budgeting, thermal runaway follows.
When the supply chain hiccups, you redesign the whole board because the one component you needed went end-of-life.
We started calling these boundaries the Five Interface Tracts: Data, Energy, Material, Logistics, and Supply Chain.
Every complex system has all five. Energy is lost at each one.
That was the insight. Not that interfaces exist — everyone knows they exist — but that nobody was treating the energy lost at interfaces as an engineering problem worth solving.
The Paradox That Should Keep You Up at Night
Here’s the thing about AI inference that nobody wants to talk about.
Token prices have dropped 280x since 2023.[4] Two hundred and eighty times.
And yet — and this is the thought that really keeps me awake — AI bills are rising. Volume exploded so fast it swallowed the price drop whole.
96% of organizations report generative AI costs higher than expected.[5]
Let that sink in. Ninety-six percent.
The reason is stupidly simple. We’re routing every query — asking what time it is, summarizing a paragraph, generating a haiku about your cat — through the largest, most energy-hungry models available.
It’s like driving a semi truck to pick up a carton of milk.
The reality breaks down like this:
Somewhere between 45 and 60 percent of queries are simple enough for a small model — one to seven billion parameters.[6]
Another 25 to 35 percent are medium complexity, suited for a 7 to 20 billion parameter model.
Only 15 to 25 percent truly need the heavyweights.
But current deployments route all of them to the biggest model in the fleet. That’s roughly 70% wasted energy on every single inference call.
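Routing on complexity is simple to sketch. Everything below — the classifier heuristics, the tier names, the relative energy weights — is a hypothetical illustration, not a production system:

```python
# Hypothetical complexity router. Classifier heuristics, tier names,
# and energy weights are all illustrative placeholders.

def classify(prompt: str) -> str:
    """Crude complexity proxy: prompt length plus reasoning keywords."""
    reasoning_markers = ("prove", "plan", "analyze", "debug", "step by step")
    if len(prompt) < 200 and not any(m in prompt.lower() for m in reasoning_markers):
        return "simple"    # ~45-60% of traffic: fits a 1-7B model
    if len(prompt) < 2000:
        return "medium"    # ~25-35%: fits a 7-20B model
    return "complex"       # ~15-25%: genuinely needs a frontier model

# (model name, rough relative energy per call)
MODEL_TIERS = {
    "simple":  ("small-7b",    0.4),
    "medium":  ("mid-13b",     1.0),
    "complex": ("frontier-xl", 20.0),
}

def route(prompt: str) -> str:
    """Return the right-sized model instead of defaulting to the largest."""
    model, _energy = MODEL_TIERS[classify(prompt)]
    return model
```

A real classifier would be a small learned model rather than keyword rules, but the routing structure is the same: cheap triage up front, expensive compute only when the query earns it.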
And an agentic AI decision cycle — the kind of multi-step reasoning loop that’s becoming the standard architecture — costs 100 to 1,000 times more than a traditional query.[7]
As AI agents proliferate, that multiplier transforms waste into catastrophe.
Three Walls You Can’t Buy Your Way Through
The crisis concentrates at three bottlenecks, and here’s what makes them interesting: they’re all interface problems. They all exist at the boundary between one thing and another.
The GPU Memory Crisis.
Every large language model maintains something called a KV cache — a running memory of the conversation so far. It grows linearly with sequence length and it has to live in the fastest, most expensive memory on the GPU.
For Llama 3 70B with a 128K context window, the KV cache alone eats 40 gigabytes.[8] Traditional memory management wastes 60 to 80 percent of that space through fragmentation.
Think of it like a parking garage where every car takes up three spaces because nobody planned the layout.
The Memory Bandwidth Wall.
LLM inference is memory-bound, not compute-bound.[9] Read that again.
The decode phase — the part where the model actually generates each token — runs at 1 to 10 FLOPs per byte. The GPU’s compute threshold sits at 208 FLOPs per byte.
The processor is sitting there, idling, waiting for memory to feed it data.
Buying a faster GPU is like putting a bigger engine in a car stuck in traffic.
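The 208 FLOPs-per-byte threshold falls straight out of the datasheet numbers for an A100-class GPU (approximate peak values, an assumption on my part):

```python
# Roofline sanity check for LLM decode on an A100-class GPU.
# Peak figures are approximate datasheet values (assumption).
peak_flops = 312e12   # ~312 TFLOP/s, BF16 tensor cores
peak_bw    = 1.5e12   # ~1.5 TB/s HBM bandwidth

# FLOPs per byte needed before compute, not memory, becomes the bottleneck
ridge_point = peak_flops / peak_bw
print(f"ridge point: {ridge_point:.0f} FLOPs/byte")  # 208

decode_intensity = 2.0  # decode runs at ~1-10 FLOPs/byte
utilization = decode_intensity / ridge_point
print(f"compute utilization during decode: {utilization:.1%}")  # ~1%
```

At 1 to 10 FLOPs per byte against a 208 FLOPs-per-byte ridge point, the arithmetic units spend the overwhelming majority of each decode step waiting on memory.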
The Throughput-Latency Tradeoff.
Batch 64 gives you 14x the throughput but 4x the latency.[10]
You can move a lot of freight or you can move it fast. Not both. Not with the architectures we have today.
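In rough numbers, using the figures above and assuming approximately constant power draw across batch sizes (a simplification):

```python
# The batching tradeoff, using the figures from the text.
# Assumes roughly constant power draw across batch sizes (simplification).
batch           = 64
throughput_gain = 14.0  # total tokens/sec vs batch=1
latency_factor  = 4.0   # per-request latency vs batch=1

# Same power spread over 14x the tokens: ~1/14 the energy per token
energy_per_token = 1 / throughput_gain
print(f"batch {batch}: {throughput_gain:.0f}x throughput, "
      f"{latency_factor:.0f}x latency, "
      f"{energy_per_token:.0%} energy per token vs batch=1")
```

Under those assumptions, batching cuts energy per token by an order of magnitude, but every interactive user pays a 4x latency penalty for it.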
These aren’t compute problems. They’re interface problems — they live at the boundary between compute and memory, between hardware and software, between what we want and what physics allows.
What We’re Building (And Why)
This isn’t a complicated idea. But I believe it’s going to change the world nonetheless.
We’re building tools that treat energy as a first-class concern. Not an afterthought. Not a line item on a sustainability report. A constraint that the compiler checks, the runtime enforces, and the database optimizes around.
Joule is a systems programming language with energy budgets in the type system.
You declare how much energy a function is allowed to consume, and the compiler holds you to it. It supports thermal-aware runtime adaptation — meaning your code can respond to the actual temperature of the hardware it’s running on.
Heterogeneous compute across CPU, GPU, TPU, and NPU from a single language. One language. Multiple silicon targets. Energy-bounded execution.
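To make the energy-budget idea concrete without inventing Joule syntax, here is the same concept sketched in plain Python: a declared budget, a meter, and runtime enforcement. The meter is a stand-in; a real implementation would read hardware telemetry such as RAPL or NVML.

```python
# NOT Joule syntax: a plain-Python sketch of the energy-budget concept.
# The meter below is a stand-in; real telemetry would come from RAPL/NVML.
import functools
import time

class EnergyBudgetExceeded(RuntimeError):
    pass

def read_joules() -> float:
    """Stand-in meter: elapsed time times an assumed steady 15 W draw."""
    return time.perf_counter() * 15.0

def energy_budget(max_joules: float):
    """Declare how much energy a function may consume; enforce at runtime."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = read_joules()
            result = fn(*args, **kwargs)
            spent = read_joules() - start
            if spent > max_joules:
                raise EnergyBudgetExceeded(
                    f"{fn.__name__}: {spent:.2f} J > budget {max_joules} J")
            return result
        return inner
    return wrap

@energy_budget(max_joules=5.0)
def summarize(text: str) -> str:
    return text[:100]
```

A compiler-checked version moves the violation from runtime to build time; the sketch only shows the shape of the contract.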
JouleDB is a self-optimizing database with energy-aware query routing.
It classifies queries by complexity and routes them to right-sized compute — the milk carton gets a bicycle, the furniture gets the truck. 40% cost reduction by simply not wasting energy on overkill.
And it integrates actual hardware telemetry — CPU temperature, GPU power draw — into its routing decisions.
The Joule Energy Meter is a free, open-source browser extension that measures the energy cost of every website you visit.
It grades sites from A+ to F and identifies the elements wasting the most energy.
Because what you can’t measure, you can’t fix.
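A grading curve like that can be as simple as thresholds on a measurable proxy. The cutoffs below are purely illustrative, with page weight standing in for energy; they are not the extension’s actual rubric:

```python
# Hypothetical grading curve; thresholds are illustrative only.
# Page weight (MB transferred) stands in as a proxy for energy cost.
def grade_page(megabytes_transferred: float) -> str:
    thresholds = [(0.5, "A+"), (1.0, "A"), (2.0, "B"),
                  (3.0, "C"), (5.0, "D")]
    for limit, grade in thresholds:
        if megabytes_transferred <= limit:
            return grade
    return "F"
```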
The Numbers That Matter
When energy becomes a first-class concern, the results don’t just add up. They compound.
72.3% energy savings on inference spend through adaptive routing.
35 to 100x efficiency gains with neural-symbolic approaches.
$8.64 million in annual savings for an enterprise spending a million a month on inference.
56 million tons of CO₂ saved if websites reduced their energy consumption by just 20 percent.[11]
At two trillion dollars in global AI spending, even modest efficiency improvements represent hundreds of billions in potential savings.
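The $8.64 million figure follows directly from the stated savings rate, assuming it applies to the full spend:

```python
# Reproducing the headline savings figure from the text.
monthly_spend  = 1_000_000   # dollars per month on inference
savings_rate   = 0.72        # ~72% saved via adaptive routing
annual_savings = monthly_spend * 12 * savings_rate
print(f"${annual_savings:,.0f} per year")  # $8,640,000
```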
Here’s what I keep coming back to.
My grandmother used to say — in Kreyòl, so I’m translating loosely — that you don’t waste what the earth gives you. She was talking about food.
But the principle scales.
Every watt lost at an interface is a watt someone generated, transmitted, and paid for. It came from somewhere. Coal, wind, sun, atom — it doesn’t matter.
It was real energy, and we let it bleed out at the seams because we never thought to look there.
The energy crisis at interfaces is the defining engineering challenge of this decade. Not because the math is hard. Because the problem is hidden in plain sight, distributed across every boundary in every system we build.
We’re building the tools to make it visible.
And then to fix it.
Honor your ancestors. Make tomorrow better for your progeny.
Strength and Honor.
References
Learn more at energy.openie.dev or explore the Joule programming language.
Footnotes
1. International Energy Agency. “Data Centres and Data Transmission Networks.” IEA World Energy Outlook, 2024.
2. Masanet, E., Shehabi, A., Lei, N., Smith, S., & Koomey, J. “Recalibrating global data center energy-use estimates.” Science, 367(6481), 984–986, 2020.
3. IDC. “Worldwide Spending on AI-Centric Systems Forecast.” International Data Corporation, 2025.
4. a16z Research. “The Great Token Price Collapse.” Andreessen Horowitz AI Infrastructure Report, 2024. Tracking GPT-4 API pricing from $0.06/1K tokens (March 2023) to effectively $0.0002/1K tokens via open models (late 2024).
5. Gartner. “Survey Analysis: The State of Generative AI Cost Management.” Gartner Research, 2024.
6. Vellum AI. “LLM Query Complexity Distribution in Production Systems.” Internal analysis across enterprise deployments, 2024. Corroborated by Anyscale and Together AI production data.
7. Zaharia, M. et al. “The Shift from Models to Compound AI Systems.” Berkeley AI Research Blog, 2024.
8. Kwon, W. et al. “Efficient Memory Management for Large Language Model Serving with PagedAttention.” Proceedings of the 29th Symposium on Operating Systems Principles (SOSP ’23), 2023.
9. Pope, R. et al. “Efficiently Scaling Transformer Inference.” Proceedings of Machine Learning and Systems (MLSys), 2023.
10. Yu, G. et al. “ORCA: A Distributed Serving System for Transformer-Based Generative Models.” 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’22), 2022.
11. The Shift Project. “Lean ICT: Towards Digital Sobriety.” The Shift Project Report, 2019. Updated projections based on HTTP Archive and Green Web Foundation data, 2024.
Interested in our work?
Explore how we're addressing the energy crisis at interfaces.