YOLOv10 is a recent advancement in the YOLO family of real-time object detection models. It achieves state-of-the-art performance with significantly lower computational demands than previous versions.
Developed by researchers at Tsinghua University in Beijing, the new version refines the design of its predecessors and eliminates the non-maximum suppression (NMS) step during inference, improving both the efficiency and the accuracy of the detector.
For the PyTorch implementation of YOLOv10, visit the official GitHub repository, where all the necessary resources are provided. The code is released under the AGPL-3.0 license. Pretrained weights are also available for download, allowing users to leverage the model’s capabilities without training it from scratch. The weights come in a range of sizes to suit different requirements: Nano (N), Small (S), Medium (M), Balanced (B), Large (L), and Extra-large (X).
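If you want to try the model right away, the snippet below shows one way to load a pretrained checkpoint and run inference. It is a minimal sketch assuming the official repository’s Ultralytics-style Python API and a locally downloaded weight file (the file name `yolov10n.pt` is illustrative):

```python
# Minimal sketch: load a pretrained YOLOv10 checkpoint and run inference.
# Assumes the official repository's Ultralytics-style API and that the
# weight file (e.g. yolov10n.pt) has already been downloaded locally.
from ultralytics import YOLOv10

model = YOLOv10("yolov10n.pt")        # Nano variant; swap in s/m/b/l/x weights as needed
results = model.predict("image.jpg")  # single forward pass, no NMS post-processing
results[0].show()                     # visualize the detected boxes and labels
```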
The YOLO approach
YOLO (You Only Look Once) is a family of object detection models known for their ability to process images quickly and efficiently, making them ideal for real-time applications.
Traditional object detection methods usually work in two steps. First, a Region Proposal Network suggests regions of the image where objects might be. Then, each proposed region is examined more closely to classify the object and refine its location.
YOLO takes a different approach. Instead of proposing regions first, it divides the image into a grid and directly predicts the objects and their locations for each grid cell. This lets YOLO detect objects in a single pass over the image, making it very fast.
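To make the grid idea concrete, here is a toy sketch of the classic single-pass YOLO output layout (the original YOLOv1 formulation; YOLOv10’s heads differ in detail, but the single-pass principle is the same):

```python
import torch

# Classic YOLO (v1-style) output layout: the network maps an image to an
# S x S grid, where each cell predicts B boxes (x, y, w, h, confidence)
# plus C class probabilities -- all in one forward pass.
S, B, C = 7, 2, 80
predictions = torch.randn(1, S, S, B * 5 + C)  # dummy head output for one image

cell = predictions[0, 3, 4]                    # predictions for grid cell (3, 4)
boxes = cell[: B * 5].reshape(B, 5)            # B boxes: x, y, w, h, confidence
class_scores = cell[B * 5 :]                   # C class scores for this cell
print(boxes.shape, class_scores.shape)         # torch.Size([2, 5]) torch.Size([80])
```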
Introducing YOLOv10
YOLOv10 builds on previous YOLO versions and introduces several key improvements that enhance its accuracy and speed:
- Efficiency-driven architecture design. Various components of the model were optimized for both efficiency and accuracy, including the network structure, the feature extraction layers, and the loss functions.
- NMS-free training. YOLOv10 introduces a training strategy that removes the need for NMS during inference. NMS is traditionally used in object detection models like YOLO to eliminate duplicate bounding boxes, ensuring that only the most relevant one is retained for each detected object. By bypassing NMS, YOLOv10 reduces latency, making the model faster and more efficient (a short sketch of the NMS step it removes follows this list).
- Faster yet equally proficient detection. Typically, YOLO models improve their object recognition by learning from many candidate predictions per object. This rich supervision is valuable, but it requires a cleanup step at inference to remove redundant detections, which slows the model down. YOLOv10 minimizes redundant detections from the outset, resulting in a faster and more accurate object recognition system.
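To make the removed step concrete, here is a minimal sketch of classic NMS post-processing using torchvision’s built-in operator (the box coordinates and scores are made up for illustration):

```python
import torch
from torchvision.ops import nms

# Sketch of the classic NMS post-processing step that YOLOv10 removes.
# boxes: (N, 4) in (x1, y1, x2, y2) format; scores: (N,) confidences.
boxes = torch.tensor([[10., 10., 100., 100.],
                      [12., 12., 102., 102.],   # near-duplicate of the first box
                      [200., 200., 300., 300.]])
scores = torch.tensor([0.90, 0.85, 0.75])

keep = nms(boxes, scores, iou_threshold=0.5)    # suppress overlapping duplicates
print(keep)  # tensor([0, 2]) -- the near-duplicate (index 1) is discarded
```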
Model architecture
The YOLOv10 architecture incorporates a dual-label assignment system that improves the model’s ability to detect and classify objects in real time. This novel approach combines two strategies during training: one-to-many assignment (multiple predictions per object) and one-to-one assignment (a single best prediction per object).
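In more detail, the paper aligns the two heads with a consistent matching metric, so that the one-to-one head’s single pick agrees with the top candidates of the one-to-many head:

$$
m(\alpha, \beta) = s \cdot p^{\alpha} \cdot \mathrm{IoU}(\hat{b}, b)^{\beta}
$$

Here $p$ is the classification score, $\hat{b}$ and $b$ are the predicted and ground-truth boxes, $s$ indicates whether the prediction’s anchor point lies inside the instance, and both heads share the same $\alpha$ and $\beta$, keeping their rankings consistent.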
The figure below shows how YOLOv10 works (a) and, for the YOLOv8-S model (b), how often the single best guess for each actual object (one-to-one assignment) agrees with the many candidate guesses (one-to-many assignments). The bar groups in (b) represent:
- Top-1: the model’s single best guess
- Top-5: the best guess plus four other strong candidates
- Top-10: the best guess plus nine other candidates
The model’s main components are:
- Backbone network that processes the input image and extracts features.
- Neck that takes the features from the backbone and combines them in a way that preserves important information from different scales. It’s like putting together pieces of a puzzle to get a better picture.
- PAN (Path Aggregation Network) layers that are specific layers within the neck. They make sure that the detailed features from the lower levels and the more abstract features from the higher levels work together. This helps the model make more accurate guesses about what’s in the image.
- One-to-many Head that generates several predictions for each object during training. This gives a wide range of feedback, helping the model learn better.
- One-to-one Head that generates a single best prediction for each object during inference. This eliminates the need for NMS, making the process faster and more efficient.
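A minimal sketch of how the dual heads could be wired during training versus inference is shown below. The class and method names are hypothetical; the real implementation lives in the official repository:

```python
import torch
import torch.nn as nn

class DualAssignmentDetector(nn.Module):
    """Toy sketch of YOLOv10-style dual-label assignment (hypothetical names)."""

    def __init__(self, backbone: nn.Module, neck: nn.Module,
                 one2many_head: nn.Module, one2one_head: nn.Module):
        super().__init__()
        self.backbone = backbone
        self.neck = neck
        self.one2many_head = one2many_head  # rich supervision during training
        self.one2one_head = one2one_head    # single prediction per object

    def forward(self, images: torch.Tensor):
        features = self.neck(self.backbone(images))
        if self.training:
            # Both heads are supervised: the one-to-many branch provides a
            # dense learning signal, while the one-to-one branch learns to
            # commit to a single best box per object.
            return self.one2many_head(features), self.one2one_head(features)
        # At inference only the one-to-one head runs, so no NMS is needed.
        return self.one2one_head(features)

# Toy usage with placeholder modules (real backbone/neck/heads are far richer):
det = DualAssignmentDetector(nn.Identity(), nn.Identity(),
                             nn.Identity(), nn.Identity())
det.eval()
out = det(torch.randn(1, 3, 640, 640))  # inference path: one-to-one head only
```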
Experiments
YOLOv10 uses YOLOv8 as its baseline model. YOLOv8 is known for its balance between latency and accuracy and comes in various model sizes suitable for different needs. YOLOv10 was trained and evaluated on the COCO dataset, a large-scale object detection, segmentation, and captioning dataset known for its diversity and complexity, with 80 object categories for detection.
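For reference, reproducing a COCO evaluation with the repository’s Ultralytics-style API might look like the sketch below (it assumes the framework ships a `coco.yaml` dataset config and fetches val2017 on first use):

```python
from ultralytics import YOLOv10

# Sketch of a COCO val2017 evaluation run; the reported mAP@0.5:0.95
# corresponds to the APval column in the table below.
model = YOLOv10("yolov10s.pt")
metrics = model.val(data="coco.yaml")
print(metrics.box.map)  # mean AP over IoU thresholds 0.50:0.95
```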
The evaluation results show that YOLOv10 outperforms the earlier YOLO versions and other leading models (see the table below), achieving superior accuracy and efficiency.
| Model | Params (M) | FLOPs (G) | AP<sup>val</sup> (%) | Latency (ms) | Forward latency (ms) |
|---|---|---|---|---|---|
| YOLOv6-3.0-N | 4.7 | 11.4 | 37.0 | 2.69 | 1.76 |
| Gold-YOLO-N | 5.6 | 12.1 | 39.6 | 2.92 | 1.82 |
| YOLOv8-N | 3.2 | 8.7 | 37.3 | 6.16 | 1.77 |
| YOLOv10-N | 2.3 | 6.7 | 39.5 | 1.84 | 1.79 |
| YOLOv6-3.0-S | 18.5 | 45.3 | 44.3 | 3.42 | 2.35 |
| Gold-YOLO-S | 21.5 | 46.0 | 45.4 | 3.82 | 2.73 |
| YOLOv8-S | 11.2 | 28.6 | 44.9 | 7.07 | 2.33 |
| YOLOv10-S | 7.2 | 21.6 | 46.8 | 2.49 | 2.39 |
| RT-DETR-R18 | 20.0 | 60.0 | 46.5 | 4.58 | 4.49 |
| YOLOv6-3.0-M | 34.9 | 85.8 | 49.1 | 5.63 | 4.56 |
| Gold-YOLO-M | 41.3 | 87.5 | 49.8 | 6.38 | 5.45 |
| YOLOv8-M | 25.9 | 78.9 | 50.6 | 9.50 | 5.09 |
| YOLOv10-M | 15.4 | 59.1 | 51.3 | 4.74 | 4.63 |
| YOLOv6-3.0-L | 59.6 | 150.7 | 51.8 | 9.02 | 7.90 |
| Gold-YOLO-L | 75.1 | 151.7 | 51.8 | 10.65 | 9.78 |
| YOLOv8-L | 43.7 | 165.2 | 52.9 | 12.39 | 8.06 |
| RT-DETR-R50 | 42.0 | 136.0 | 53.1 | 9.20 | 9.07 |
| YOLOv10-L | 24.4 | 120.3 | 53.4 | 7.28 | 7.21 |
The figure below illustrates the model’s performance in terms of latency-accuracy (left) and size-accuracy (right) as compared to previous versions and other contemporary detectors, such as RT-DETR.
- The latency-accuracy graph shows how fast and accurate the YOLOv10 model can process images. The lower the latency and the higher the accuracy, the better the model is for real-time applications.
- The size-accuracy graph shows the relationship between the size of the model (how much memory it takes up) and its accuracy. Smaller models are generally faster and require less computational power, which is great for devices with limited resources.
An ideal real-time object detection model needs to be fast, accurate, and as small as possible. The experiments show that YOLOv10 achieves this combination better than comparable models.
Conclusion
YOLOv10 sets a new benchmark in real-time object detection. Its novel training strategy eliminates the need for NMS during inference, significantly reducing latency and making the model even better suited to real-time applications such as autonomous vehicles, surveillance and security, and augmented reality.
Read more:
- Paper on arXiv: “YOLOv10: Real-Time End-to-End Object Detection”
- GitHub repository