AlphaDev: how deep reinforcement learning can find better sorting algorithms than humans

Google DeepMind introduces AlphaDev, a new deep reinforcement learning (DRL) agent that can discover and optimize fundamental algorithms such as sorting or hashing. AlphaDev can produce algorithms that are accurate and efficient by working at CPU instruction level and minimizing their latency. The AI-driven agent was able to create small sorting algorithms from scratch that outperform the best human-designed ones.

The algorithms discovered by the AlphaDev agent have been integrated into the standard sort function of the LLVM C++ library. The model can also apply to other domains, such as matrix multiplication and hashing and find efficient routines for them.

The research proves that AI can surpass the current human benchmark and is able to discover new and better algorithms for basic problems.

The picture below shows the AlphaDev agent architecture.

The AlphaDev agent is made of a Transformer Encoder and a CPU State Encoder. The Transformer Encoder receives as input the program instructions written in assembly. The CPU State Encoder network receives as input the current state of memory and registers.

Many sorting algorithms have reached a stage where human experts cannot optimize them further, leading to a computational bottleneck.

To address this challenge, the team introduces a game called AssemblyGame with a single player, AlphaDev. The player is a DRL agent trained to find a better sorting routine using a series of low-level CPU instructions (assembly instructions).

The connection between C++ and assembly programs. a. A C++ function that sorts any input sequence with a maximum of 2 elements. b. The equivalent assembly representation that the C++ compiler produces for the function in a.

To effectively play the game, the learning algorithm can incorporate both DRL and stochastic search optimization algorithms. The player must navigate through the assembly instructions to find a sorting algorithm that is both provably correct and fast.

This game is very difficult for two main reasons:  

  1. The large search space of assembly instructions, comparable to highly complex games like chess and Go.
  2. The nature of the reward function, where a single incorrect instruction can invalidate the entire algorithm. A single incorrect assembly instruction has the potential to render the entire algorithm incorrect.  

The goal is to train the learning agent (AlphaDev) to generate efficient and correct assembly algorithms by iteratively exploring different actions and receiving rewards based on their performance. The sorting algorithm should be correct and have low-latency (usually length and latency are highly correlated). Upon taking an action, the player receives a reward.

The reward consists of two components: algorithm correctness and latency. Algorithm correctness is evaluated by comparing the generated outputs with the expected outputs.

The game is played for a limited number of steps, after which it is terminated. 

  • Winning the game corresponds to generating a correct and low-latency algorithm using assembly instructions. 
  • Losing the game occurs when the generated algorithm is incorrect or correct but inefficient.
The AssemblyGame and algorithm correctness computation. In this example, the output D′ does not match the expected output B′ and the algorithm is therefore incorrect.

Main Components

AlphaDev consists of two core components: a learning algorithm and a representation function.

AlphaDev’s learning algorithm is an extension of DeepMind’s AlphaZero, a well-known DRL and a MonteCarlo tree search algorithm used to master the games of chess, shogi and go.

AlphaZero uses a neural network to guide a search process in order to solve complex games. In the case of AlphaDev, the learning algorithm based on AlphaZero is adapted to address the challenges of AssemblyGame and find optimal assembly programs for sorting.

The representation function in AlphaDev helps the agent to interpret and understand the assembly instructions. The primary representation function utilized in AlphaDev is based on Transformers

AlphaDev vs AlphaZero

  1. AlphaDev uses a single-player game formulation instead of a two-player game formulation. 
  2. AlphaDev uses a different reward function that takes into account the accuracy and latency of the algorithm at the CPU instruction level 
  3. AlphaDev uses a different action space. It can copy an element from one position to another, or swap multiple elements at once (swap and copy moves). This helps it to find novel shortcuts and create better sorting algorithms.

The approach

In their approach the authors focused on two kinds of small sort algorithm: (1) the fixed sort and (2) the variable sort. Fixed sort algorithms can only sort sequences of a fixed length (for example, sort 4 can only sort sequences of length 4), while variable sort algorithms can sort sequences of varying size.

AlphaDev successfully discovered new fixed and variable sort algorithms that outperform the state-of-the-art human benchmarks. These algorithms were created from scratch by the agent, without any prior human-designed algorithms.

The picture below shows 2 sorting algorithms for sequences of length 2, 3, or 4: a human benchmark algorithm (a) and the AlphaDev’s sorting algorithm (b). 

AlphaDev’s sorting algorithm (b) is more efficient than the human benchmark algorithm for sequences of length 4, because it uses a simplified sorting routine that avoids unnecessary comparisons. It firstly calls sort 3, and then applied a simplified sort 4 routine to sort the fourth number.

AlphaDev discovered a different sorting algorithm: human benchmark (a) and AlphaDev (b) 


The new sorting methods made sorting faster by up to 70% for sequences with five elements and by around 1.7% for sequences with over 250,000 elements.

The fixed sort solutions found by AlphaDev for sort 3, sort 4, and sort 5 have been integrated into the standard sort function in the LLVM C++ library. This library is widely used by millions of users, including universities and various international companies.

There are two algorithmic discoveries made by AlphaDev that contribute to more efficient sorting algorithms by reducing the number of assembly instructions: the “AlphaDev swap move” and the “AlphaDev copy move”. It identified cases where certain inputs can be copied directly to the corresponding outputs without any additional operations, resulting in a reduction of instructions (“copy move”).

The team compared the performance of AlphaDev to stochastic search optimization approaches, highlighting the advantages of the AlphaDev approach. 

Furthermore, they apply AlphaDev to other domains beyond sorting, demonstrating the versatility and generality of the approach in discovering efficient algorithms in various problem domains.


The use of AI, particularly DRL, has allowed for the discovery of new, high-performing algorithms that improve the fundamental sorting routines.

The AlphaDev agent, trained through DRL demonstrates the capability to discover sorting algorithms that outperform existing human benchmarks in terms of both correctness and latency. This AI agent can be used for further advancements in other domains as well.

Learn more:

Research paper: “Faster sorting algorithms discovered using deep reinforcement learning” (on Nature)

Other popular posts