Computer vision
-

MinerU2.5, a vision-language model for efficient document parsing
MinerU2.5 (paper, code) is a parsing vision-language model that converts complex documents, such…

Create long AI videos locally with FramePack from Stanford
FramePack is a next-frame prediction neural network for high-quality and efficient video generation,…

InfiniteYou, photo customization with identity preservation
ByteDance introduced InfiniteYou (InfU), a powerful model that allows flexible photo modifications based…
Meta’s VGGT reconstructs 3D scenes in seconds [CVPR 2025]
VGGT (Visual Geometry Grounded Transformer) is an advanced AI model that is able…

Create dynamic multi-angle videos with CAT4D diffusion model
CAT4D is a new AI model for creating 4D scenes from single-camera videos.…

LivePortrait, a fast and free AI tool to animate portraits
LivePortrait is an AI-powered tool that creates lifelike animations from portraits. Simply provide…

Magic Insert, the new style-aware drag-and-drop technology from Google
Magic Insert is a new method proposed by Google that lets you drag and drop…

Depth Anything V2, a highly capable depth estimation model
Depth Anything V2 is a powerful new monocular depth estimation model, delivering significantly…

YOLOv10, a faster and more accurate object detection model
YOLOv10, a recent advance in the YOLO family of real-time object detection models, achieves…

Grounding DINO 1.5, a powerful open-set object detection model
Grounding DINO 1.5 is a series of powerful open-set object detection models capable…