Intelligence Processing Unit (IPU): Rethinking AI Hardware and the Operating System
The Intelligence Processing Unit (IPU) rethinks AI hardware with unified memory, graph-native scheduling, sparse compute, and an AI-native operating system.
Modern AI workloads are being forced through software and hardware designed for spreadsheets and file systems. We propose a ground-up rethink: the Intelligence Processing Unit, an architecture where inference isn't an application running on top of an OS; it's what the OS is.
IPU Research Team
Systems Architecture · AI Infrastructure
7
Software layers between your model and the hardware today
8×
Speedup achieved by 4-bit quantization over 32-bit on H100
100B+
Parameter models targetable on edge hardware via streaming
The problem hiding in plain sight
When you run a language model today, your inference request passes through Python, PyTorch, a CUDA runtime, a driver, the Linux kernel, a PCIe bus, and finally reaches the GPU. Each of those layers was designed for a different era and a different workload. None of them know what a tensor is. None of them understand you're computing attention. The operating system is completely blind to what's actually happening.
This isn't a minor inefficiency. It's a fundamental architecture mismatch, and it gets worse as models get larger. Meanwhile, the industry's answer has been to add more GPUs, more VRAM, and more cluster interconnects. More of the wrong thing, at greater expense and power consumption.
"We are running 21st-century intelligence on 20th-century operating system theory. The entire AI software stack is a collection of workarounds for an OS that doesn't know AI exists."
The stack problem, visualised
Current path from model to silicon
Application
Your code or API call
Framework
PyTorch / TensorFlow / JAX
Accelerator runtime
CUDA / ROCm / Metal
Driver
Vendor-specific, opaque
OS kernel
Linux / Windows — process-centric
Bus layer
PCIe — high latency, narrow
Hardware
GPU designed for rendering, not reasoning
Proposed IPU path
Application / model definition
Computation graph, latency targets
IPU OS
Graph compiler + semantic scheduler + memory manager
Hardware
Tensor fabric + unified memory + sparse engine
Four architectural pillars
Tensor-native ISA
RISC-V base extended with 4/8/16-bit native types. Single instructions execute entire attention heads — not emulated, not converted.
Unified semantic memory
One physical pool. No VRAM/RAM split. Hardware-tagged by role — weights, activations, KV cache — with differentiated bandwidth policy per region.
Sparse event engine
Borrowed from neuromorphic research. Only computes on non-zero activations. For sparse models, 60–80% compute savings automatically — in silicon, not software.
Graph-native scheduler
Accepts a full computation graph, not individual instructions. Schedules based on dependency readiness, thermal headroom, memory locality, and token latency targets.
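The scheduler pillar can be made concrete with a small sketch. Everything below is illustrative rather than an actual IPU interface: a ready-queue that dispatches graph nodes as their dependencies clear, picking the tightest token-latency deadline first (the thermal-headroom and memory-locality terms are omitted for brevity):

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Node:
    deadline_us: float                 # token latency target for this op
    name: str = field(compare=False)
    deps: list = field(compare=False, default_factory=list)

def schedule(nodes):
    """Dispatch graph nodes as dependencies clear, earliest deadline first."""
    remaining = {n.name: len(n.deps) for n in nodes}
    children = {}
    for n in nodes:
        for d in n.deps:
            children.setdefault(d, []).append(n)
    ready = [n for n in nodes if not n.deps]
    heapq.heapify(ready)               # ordered by deadline_us only
    order = []
    while ready:
        n = heapq.heappop(ready)       # tightest deadline wins
        order.append(n.name)
        for c in children.get(n.name, []):
            remaining[c.name] -= 1
            if remaining[c.name] == 0:
                heapq.heappush(ready, c)
    return order

graph = [
    Node(50.0, "embed"),
    Node(30.0, "attn", deps=["embed"]),
    Node(40.0, "mlp", deps=["attn"]),
    Node(20.0, "logits", deps=["mlp"]),
]
print(schedule(graph))  # ['embed', 'attn', 'mlp', 'logits']
```

The point of the sketch is the submission granularity: the scheduler receives the whole graph up front, so readiness and priority are decided in one place instead of being rediscovered call-by-call through a driver.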
A memory model that understands what it holds
Current memory is just bytes. The IPU introduces semantic memory tiers — each physically distinct, each managed by the OS according to the role of the data it holds.
Tier 1 — Active
On-chip SRAM. Current activations, attention scores.
~500 MB
nanosecond access
Tier 2 — Warm
On-package HBM4. Running weights, active KV cache.
~128 GB
microsecond access
Tier 3 — Cold
Storage-class memory. All installed models, archived contexts.
~4 TB
sub-millisecond access
The OS moves data between tiers automatically based on access patterns. A developer never manages memory placement, never sees VRAM limits, and never manually pins buffers. The system pre-stages attention layer 14's weights from Tier 3 to Tier 2 before you ask — because the computation graph tells it what's coming.
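A minimal sketch of that pre-staging behaviour, with hypothetical tier constants and buffer names; in the real design the policy would be driven by the compiled computation graph rather than an explicit `prestage` call:

```python
# Illustrative sketch: promote upcoming buffers before they are needed.
TIER_ACTIVE, TIER_WARM, TIER_COLD = 1, 2, 3

class SemanticMemory:
    def __init__(self):
        self.tier = {}                 # buffer name -> current tier

    def install(self, name):
        self.tier[name] = TIER_COLD    # installed models start in cold storage

    def prestage(self, upcoming):
        """Move buffers the graph says are needed soon into warm memory."""
        for name in upcoming:
            if self.tier.get(name) == TIER_COLD:
                self.tier[name] = TIER_WARM

    def activate(self, name):
        self.tier[name] = TIER_ACTIVE  # touched by current compute

mem = SemanticMemory()
for layer in ("layer13.w", "layer14.w"):
    mem.install(layer)
mem.activate("layer13.w")
mem.prestage(["layer14.w"])            # graph lookahead, not a page fault
print(mem.tier["layer14.w"])           # 2 (warm, staged before use)
```

The developer-visible surface is the important part: nothing here exposes a device pointer or a pool boundary, only roles and tiers.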
How 100B models become edge-deployable
A 102B-parameter model at 4-bit quantization requires roughly 51 GB of weights plus 10–20 GB of KV cache. On current hardware this is impossible on mobile and expensive on servers, because data bounces constantly between separate memory pools across slow buses. On the IPU, the unified memory hierarchy and an execution-order-aware model storage format make this tractable through managed streaming rather than full residency.
Model weights are stored on-device in the order they'll be accessed during inference — not the order they were saved during training. The OS pre-fetches upcoming layers while current layers execute. The result: a 15W mobile IPU running a 102B model at 5–10 tokens per second. Not fast by server standards. Transformative for on-device private intelligence.
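The numbers above can be checked with back-of-envelope arithmetic. The layer count below is an assumption (the text doesn't specify one), chosen only to show how the per-layer prefetch budget falls out:

```python
# Back-of-envelope for the streaming budget (layer count is an assumption).
params = 102e9
bits_per_weight = 4
layers = 80                                   # hypothetical depth

weights_gb = params * bits_per_weight / 8 / 1e9
per_layer_gb = weights_gb / layers

print(round(weights_gb))       # 51 GB of quantized weights
print(round(per_layer_gb, 2))  # ~0.64 GB prefetched per layer

# If execution-order storage lets layer N+1 stream while layer N computes,
# each prefetch must finish within one layer's compute window. At 5 tokens/s
# over 80 layers, that window is 1 / (5 * 80) = 2.5 ms per layer:
window_s = 1 / (5 * layers)
needed_bw_gbs = per_layer_gb / window_s
print(round(needed_bw_gbs))    # ~255 GB/s sustained from the backing tier
```

The arithmetic shows why execution-order storage and prefetch overlap matter: the cost becomes a predictable bandwidth requirement instead of a chain of demand-miss latencies, and whatever portion of the model fits resident in the warm tier never has to stream at all.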
Primary contributions
Tensor-native ISA over RISC-V: native 4/8/16-bit register types, single-instruction attention primitives, open and extensible, with no vendor lock-in.
Semantic memory model with hardware role partitioning: a unified pool across heterogeneous physical tiers, with differentiated policy enforced in the memory controller without software intervention.
Graph-native OS scheduler: deadline-, thermal-, and locality-aware execution driven directly by computation-graph submission, eliminating the conventional driver layer.
Execution-order-aware model storage format: streaming inference of 100B+ parameter models within constrained memory budgets through predictive tier movement.
The thesis, plainly stated: Intelligence should not be an application running on top of a general-purpose operating system. It should be what the operating system is.
Prior systems address individual pieces: Groq's LPU for deterministic execution, IBM NorthPole for on-chip memory, Apple Silicon for memory unification. No existing platform co-designs inference scheduling, memory semantics, and OS primitives as an integrated whole. That is the gap this work addresses.
The shift from von Neumann, process-oriented machine design toward intelligence-native computing is not a distant research horizon. The algorithmic foundations exist. The silicon techniques exist. What remains is the will to abandon the comfortable familiarity of 1970s abstractions and build something honest about what computing is becoming.