Computer vision

MinerU2.5, a vision-language model for efficient document parsing

MinerU2.5 (paper, code) is a vision-language model for document parsing that converts complex documents, such…

November 8, 2025
Create long AI videos locally with FramePack from Stanford

FramePack is a next-frame prediction neural network for high-quality and efficient video generation,…

April 27, 2025
InfiniteYou, photo customization with identity preservation

ByteDance introduced InfiniteYou (InfU), a powerful model that allows flexible photo modifications based…

April 6, 2025
Meta’s VGGT reconstructs 3D scenes in seconds [CVPR 2025]

VGGT (Visual Geometry Grounded Transformer) is an advanced AI model that is able…

April 1, 2025
Create dynamic multi-angle videos with CAT4D diffusion model

CAT4D is a new AI model for creating 4D scenes from single-camera videos.…

December 20, 2024
LivePortrait, a fast and free AI tool to animate portraits

LivePortrait is an AI-powered tool that creates lifelike animations from portraits. Simply provide…

August 7, 2024
Magic Insert, the new style-aware drag-and-drop technology from Google

Magic Insert is a new method proposed by Google that lets you drag and drop…

July 22, 2024
Depth Anything V2, a highly capable depth estimation model

Depth Anything V2 is a powerful new monocular depth estimation model, delivering significantly…

July 8, 2024
YOLOv10, a faster and more accurate object detection model

YOLOv10 is a recent advancement in the YOLO family of real-time object detection models that achieves…

June 25, 2024
Grounding DINO 1.5, a powerful open-set object detection model

Grounding DINO 1.5 is a series of powerful open-set object detection models capable…

May 27, 2024