Intelligence Processing Unit (IPU): Rethinking AI Hardware and the Operating System
The Intelligence Processing Unit (IPU) rethinks AI hardware with unified memory, graph-native scheduling, sparse compute, and an AI-native operating system.
Modern AI workloads are being forced through software and hardware designed for spreadsheets and file systems. We propose a ground-up rethink: the Intelligence Processing Unit, an architecture where inference isn't an application running on top of an OS; it's what the OS is.
IPU Research Team
Systems Architecture · AI Infrastructure
7
Software layers between your model and the hardware today
8×
Speedup achieved by 4-bit quantization over 32-bit on H100
100B+
Parameter models targetable on edge hardware via streaming
The problem hiding in plain sight
When you run a language model today, your inference request passes through Python, PyTorch, a CUDA runtime, a driver, the Linux kernel, a PCIe bus, and finally reaches the GPU. Each of those layers was designed for a different era and a different workload. None of them know what a tensor is. None of them understand you're computing attention. The operating system is completely blind to what's actually happening.
This isn't a minor inefficiency. It's a fundamental architecture mismatch, and it gets worse as models get larger. Meanwhile, the industry's answer has been to add more GPUs, more VRAM, and more cluster interconnects. More of the wrong thing, at greater expense and power consumption.
"We are running 21st-century intelligence on 20th-century operating system theory. The entire AI software stack is a collection of workarounds for an OS that doesn't know AI exists."
The stack problem, visualised
Current path from model to silicon
Application
Your code or API call
Framework
PyTorch / TensorFlow / JAX
Accelerator runtime
CUDA / ROCm / Metal
Driver
Vendor-specific, opaque
OS kernel
Linux / Windows — process-centric
Bus layer
PCIe — high latency, narrow
Hardware
GPU designed for rendering, not reasoning
Proposed IPU path
Application / model definition
Computation graph, latency targets
IPU OS
Graph compiler + semantic scheduler + memory manager
Hardware
Tensor fabric + unified memory + sparse engine
Four architectural pillars
Tensor-native ISA
RISC-V base extended with 4/8/16-bit native types. Single instructions execute entire attention heads — not emulated, not converted.
Unified semantic memory
One physical pool. No VRAM/RAM split. Hardware-tagged by role — weights, activations, KV cache — with differentiated bandwidth policy per region.
Sparse event engine
Borrowed from neuromorphic research. Only computes on non-zero activations. For sparse models, 60–80% compute savings automatically — in silicon, not software.
Graph-native scheduler
Accepts a full computation graph, not individual instructions. Schedules based on dependency readiness, thermal headroom, memory locality, and token latency targets.
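The scheduler pillar can be made concrete with a small sketch. Everything below is illustrative rather than an actual IPU interface: a ready-queue that dispatches graph nodes as their dependencies clear, picking the tightest token-latency deadline first (the thermal-headroom and memory-locality terms are omitted for brevity):

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Node:
    deadline_us: float                 # token latency target for this op
    name: str = field(compare=False)
    deps: list = field(compare=False, default_factory=list)

def schedule(nodes):
    """Dispatch graph nodes as dependencies clear, earliest deadline first."""
    remaining = {n.name: len(n.deps) for n in nodes}
    children = {}
    for n in nodes:
        for d in n.deps:
            children.setdefault(d, []).append(n)
    ready = [n for n in nodes if not n.deps]
    heapq.heapify(ready)               # ordered by deadline_us only
    order = []
    while ready:
        n = heapq.heappop(ready)       # tightest deadline wins
        order.append(n.name)
        for c in children.get(n.name, []):
            remaining[c.name] -= 1
            if remaining[c.name] == 0:
                heapq.heappush(ready, c)
    return order

graph = [
    Node(50.0, "embed"),
    Node(30.0, "attn", deps=["embed"]),
    Node(40.0, "mlp", deps=["attn"]),
    Node(20.0, "logits", deps=["mlp"]),
]
print(schedule(graph))  # ['embed', 'attn', 'mlp', 'logits']
```

The point of the sketch is the submission granularity: the scheduler receives the whole graph up front, so readiness and priority are decided in one place instead of being rediscovered call-by-call through a driver.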
A memory model that understands what it holds
Current memory is just bytes. The IPU introduces semantic memory tiers — each physically distinct, each managed by the OS according to the role of the data it holds.
Tier 1 — Active
On-chip SRAM. Current activations, attention scores.
~500 MB
nanosecond access
Tier 2 — Warm
On-package HBM4. Running weights, active KV cache.
~128 GB
microsecond access
Tier 3 — Cold
Storage-class memory. All installed models, archived contexts.
~4 TB
sub-millisecond access
The OS moves data between tiers automatically based on access patterns. A developer never manages memory placement, never sees VRAM limits, and never manually pins buffers. The system pre-stages attention layer 14's weights from Tier 3 to Tier 2 before you ask — because the computation graph tells it what's coming.
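A minimal sketch of that pre-staging behaviour, with hypothetical tier constants and buffer names; in the real design the policy would be driven by the compiled computation graph rather than an explicit `prestage` call:

```python
# Illustrative sketch: promote upcoming buffers before they are needed.
TIER_ACTIVE, TIER_WARM, TIER_COLD = 1, 2, 3

class SemanticMemory:
    def __init__(self):
        self.tier = {}                 # buffer name -> current tier

    def install(self, name):
        self.tier[name] = TIER_COLD    # installed models start in cold storage

    def prestage(self, upcoming):
        """Move buffers the graph says are needed soon into warm memory."""
        for name in upcoming:
            if self.tier.get(name) == TIER_COLD:
                self.tier[name] = TIER_WARM

    def activate(self, name):
        self.tier[name] = TIER_ACTIVE  # touched by current compute

mem = SemanticMemory()
for layer in ("layer13.w", "layer14.w"):
    mem.install(layer)
mem.activate("layer13.w")
mem.prestage(["layer14.w"])            # graph lookahead, not a page fault
print(mem.tier["layer14.w"])           # 2 (warm, staged before use)
```

The developer-visible surface is the important part: nothing here exposes a device pointer or a pool boundary, only roles and tiers.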
How 100B models become edge-deployable
A 102B-parameter model at 4-bit quantization requires roughly 51 GB of weights plus 10–20 GB of KV cache. On current hardware this is impossible on mobile and expensive on servers, because data bounces constantly between separate memory pools across slow buses. On the IPU, the unified memory hierarchy and an execution-order-aware model storage format make this tractable through managed streaming rather than full residency.
Model weights are stored on-device in the order they'll be accessed during inference — not the order they were saved during training. The OS pre-fetches upcoming layers while current layers execute. The result: a 15W mobile IPU running a 102B model at 5–10 tokens per second. Not fast by server standards. Transformative for on-device private intelligence.
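The numbers above can be checked with back-of-envelope arithmetic. The layer count below is an assumption (the text doesn't specify one), chosen only to show how the per-layer prefetch budget falls out:

```python
# Back-of-envelope for the streaming budget (layer count is an assumption).
params = 102e9
bits_per_weight = 4
layers = 80                                   # hypothetical depth

weights_gb = params * bits_per_weight / 8 / 1e9
per_layer_gb = weights_gb / layers

print(round(weights_gb))       # 51 GB of quantized weights
print(round(per_layer_gb, 2))  # ~0.64 GB prefetched per layer

# If execution-order storage lets layer N+1 stream while layer N computes,
# each prefetch must finish within one layer's compute window. At 5 tokens/s
# over 80 layers, that window is 1 / (5 * 80) = 2.5 ms per layer:
window_s = 1 / (5 * layers)
needed_bw_gbs = per_layer_gb / window_s
print(round(needed_bw_gbs))    # ~255 GB/s sustained from the backing tier
```

The arithmetic shows why execution-order storage and prefetch overlap matter: the cost becomes a predictable bandwidth requirement instead of a chain of demand-miss latencies, and whatever portion of the model fits resident in the warm tier never has to stream at all.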
Primary contributions
Tensor-native ISA over RISC-V: native 4/8/16-bit register types, single-instruction attention primitives, open and extensible, with no vendor lock-in.
Semantic memory model with hardware role partitioning: a unified pool across heterogeneous physical tiers, with differentiated policy enforced in the memory controller without software intervention.
Graph-native OS scheduler: deadline-, thermal-, and locality-aware execution driven directly by computation-graph submission, eliminating the conventional driver layer.
Execution-order-aware model storage format: streaming inference of 100B+ parameter models within constrained memory budgets through predictive tier movement.
The thesis, plainly stated: Intelligence should not be an application running on top of a general-purpose operating system. It should be what the operating system is.
Prior systems address individual pieces: Groq's LPU for deterministic execution, IBM NorthPole for on-chip memory, Apple Silicon for memory unification. No existing platform co-designs inference scheduling, memory semantics, and OS primitives as an integrated whole. That is the gap this work addresses.
The shift from von Neumann, process-oriented machine design toward intelligence-native computing is not a distant research horizon. The algorithmic foundations exist. The silicon techniques exist. What remains is the will to abandon the comfortable familiarity of 1970s abstractions and build something honest about what computing is becoming.