Google Research has released TPU v4, its latest supercomputer, which scales machine learning workloads nearly 10 times better than its predecessor, TPU v3, and is 1.2-1.7 times faster than Nvidia A100 GPUs while using 1.3-1.9 times less power.
Google is confident that TPU v4 will emerge as the preferred computing choice for the demands of modern large language models (LLMs), thanks to its superior performance, scalability, and availability.
TPU v4's key improvements are the expansion to 4,096 chips per pod and the introduction of Optical Circuit Switches (OCSes), which interconnect the chips over optical data links. Together these changes improve the system's scalability, security, power efficiency, and overall performance.

TPUs vs CPUs and GPUs
TPUs, CPUs, and GPUs are all processors, each with characteristics that suit them to particular types of computing workloads:
- CPUs are general-purpose processors that can handle a wide range of tasks
- GPUs are optimized for graphics-intensive workloads
- TPUs are specialized processors designed specifically for machine learning tasks
A short history of TPU
TPU is an application-specific integrated circuit (ASIC) developed by Google to accelerate AI workloads, particularly neural network machine learning. Google first deployed it internally in 2015, and it was designed around Google's TensorFlow framework.
TPU v1 (released in 2016) was designed to accelerate inference workloads. It used an 8-bit matrix multiplication engine with a clock speed of 700 MHz and 28 MiB of on-chip memory.
TPU v2 (released in 2018) was designed to accelerate both training and inference workloads. It improved performance, memory capacity, and network connectivity, raising memory bandwidth to 600 GB/s and per-chip performance to 45 teraFLOPS.
TPU v3 (announced in 2018) was roughly twice as powerful per chip as v2; with four times as many chips per pod, a v3 pod delivered about an 8-fold increase in performance over a v2 pod.
TPU v4 (announced in 2021) delivers more than 2x the per-chip performance of TPU v3, with each pod containing 4,096 v4 chips interconnected by OCSes.
The major upgrades in TPU v4
The major upgrades in TPU v4 are the adoption of a 3D torus topology and the interconnection of 4,096 chips via dynamically reconfigurable OCSes.
TPU v4 is the first supercomputer to deploy reconfigurable OCSes, which let the system scale up dynamically and meet growing bandwidth requirements; this improves the scalability of ML systems by nearly 10 times over TPU v3.
The team wanted to scale the network up by four times compared to TPU v3 and to use optical links instead of electrical links. As a network grows, the traffic that must cross between its two halves grows with it, creating a bottleneck. To address this, the team replaced TPU v3's 2D torus network topology with a 3D torus topology.
Furthermore, to increase the reliability of the system, the OCSes act like a plugboard, rerouting connections to bypass any failed units.
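The wrap-around wiring of a 3D torus can be made concrete with a small sketch. The function below is purely illustrative (not Google's routing code) and assumes a 4x4x4 block like the one described in the next section:

```python
def torus_neighbors(x, y, z, dim=4):
    """Return the six neighbouring coordinates of chip (x, y, z) in a dim^3 torus."""
    return [
        ((x + 1) % dim, y, z), ((x - 1) % dim, y, z),
        (x, (y + 1) % dim, z), (x, (y - 1) % dim, z),
        (x, y, (z + 1) % dim), (x, y, (z - 1) % dim),
    ]

# A corner chip still has six neighbours thanks to the wrap-around links.
print(torus_neighbors(0, 0, 0))
```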
TPU v4 architecture
The figure below shows the optical links leaving the six faces of a 4x4x4 (3D) block: 16 links per face, for a total of 16 × 6 = 96 optical links per block. Links on opposite faces must connect to the same OCS.
Hence, each block connects to 96/2 = 48 OCSes. Because OCSes are reconfigurable, the interconnect can be adjusted quickly to match the application, the number of nodes, and the shape of the job being run, yielding notable improvements in training time.
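The same counting can be written out explicitly. The sketch below only restates the arithmetic from the paragraph above; the block size and links-per-face values are those quoted there:

```python
# Counting the optical links of a single 4x4x4 block, as described above.
dim = 4
links_per_face = dim * dim                 # 16 optical links leave each face
faces = 6
links_per_block = faces * links_per_face   # 6 * 16 = 96
ocs_per_block = links_per_block // 2       # opposite faces share an OCS -> 48
print(links_per_block, ocs_per_block)      # 96 48
```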

The TPU v4 package, similar to v3, consists of two Tensor Cores (TC), each with four 128×128 Matrix Multiply Units (MXUs), a Vector Processing Unit (VPU), and a Vector Memory (VMEM).
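As a rough sanity check, the peak matrix-multiply throughput implied by that layout can be estimated directly. In the sketch below, the ~1.05 GHz clock is an assumption based on publicly quoted TPU v4 specifications rather than something stated in this article; the other figures come from the package description above:

```python
# Rough peak-throughput estimate from the package layout described above.
tensor_cores = 2
mxus_per_core = 4
macs_per_mxu = 128 * 128
flops_per_mac = 2                 # one multiply plus one add
clock_hz = 1.05e9                 # assumed TPU v4 clock speed (public spec, not from this article)

peak_flops = tensor_cores * mxus_per_core * macs_per_mxu * flops_per_mac * clock_hz
print(f"~{peak_flops / 1e12:.0f} TFLOPS")  # ~275 TFLOPS (BF16)
```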

Furthermore, each TPU v4 is equipped with SparseCores, dataflow processors that accelerate models relying heavily on embeddings by 5x to 7x while consuming only about 5 percent of the TPU v4's die area and power.
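For context, the memory-bound access pattern that SparseCores target looks roughly like the following embedding gather-and-pool. This is a plain NumPy illustration with made-up sizes, not the actual TPU embedding API:

```python
import numpy as np

# Toy sizes for illustration; real recommendation models use far larger tables.
vocab_size, embed_dim = 100_000, 128
table = np.random.default_rng(0).standard_normal((vocab_size, embed_dim), dtype=np.float32)

# Each training example carries a handful of sparse feature IDs.
batch_ids = np.array([[12, 987, 44_310],
                      [7, 7, 25_000]])

# Gather the embedding rows and sum-pool them per example.
pooled = table[batch_ids].sum(axis=1)
print(pooled.shape)   # (2, 128)
```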

OCSes can reconfigure the interconnect topology in real time, improving scale, availability, utilization, modularity, deployment, security, power, and performance.

Performance
On a log-log scale, the chart below shows the scalability of TPU v4 production workloads across various model types.

The following chart illustrates the performance of an internal recommendation model across CPUs, TPU v3, TPU v4, and TPU v4 with embeddings in CPU memory (not utilizing SparseCore).
Notably, the TPU v4 SparseCore runs recommendation models about 3x faster than TPU v3 and outperforms CPU-based systems by 5-30x.

The TPU v4's performance ratios further demonstrate its advantage (see chart below).

MLPerf 2.0 benchmark results show that the A100 uses on average 1.3x-1.9x more power than TPU v4.
| MLPerf Benchmark | A100 | TPU v4 | Ratio (A100 / TPU v4) |
|---|---|---|---|
| BERT | 380 W | 197 W | 1.93 |
| ResNet | 273 W | 206 W | 1.33 |
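The ratio column is simply the measured A100 power divided by the TPU v4 power; a quick sketch of the arithmetic:

```python
# How the ratio column is computed: average A100 power / TPU v4 power.
measured_watts = {"BERT": (380, 197), "ResNet": (273, 206)}
for model, (a100_w, tpu_v4_w) in measured_watts.items():
    print(f"{model}: {a100_w / tpu_v4_w:.2f}x")   # BERT: 1.93x, ResNet: 1.33x
```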
Conclusion
Google's Tensor Processing Unit (TPU) competes with Nvidia's accelerators; Nvidia recently unveiled the H100 as the successor to the A100.
Google’s TPU v4 is a custom-built AI accelerator designed specifically for ML workloads, while Nvidia produces a range of general-purpose graphics processing units (GPUs) that are commonly used for ML and other applications.
The performance and capabilities of these hardware components depend on the specific use case and workload. In certain scenarios, TPU v4 may outperform Nvidia GPUs in terms of speed and efficiency for ML tasks, while in others, Nvidia GPUs may be a better choice.