PoseDiffusion: a novel diffusion framework for camera pose estimation with epipolar geometry constraints

PoseDiffusion is a new method for camera pose estimation, i.e. finding the location and orientation of a camera relative to a scene. It combines the power of diffusion models with traditional epipolar geometry constraints. Determining the camera pose is a fundamental challenge in computer vision, with many applications including 3D reconstruction, augmented reality, and robot navigation.

PoseDiffusion was developed by a research team from the Visual Geometry Group at the University of Oxford and Meta AI. The new model surpasses traditional methods and generalizes to different datasets without further training.

The problem

Camera pose estimation is a long-standing problem in computer vision. To find the camera pose, we need to estimate the camera’s parameters (e.g. position, orientation, focal length) from a collection of images of the scene taken from different viewpoints.

The problem with the conventional pipeline (keypoint matching, RANSAC, bundle adjustment) is that it is not robust to noise, outliers, occlusions, and dynamic scenes; a code sketch of that pipeline follows the list below:

  • handcrafted keypoint matching may produce wrong matches due to ambiguity, illumination changes, or perspective distortion.
  • RANSAC may fail to find the optimal model.
  • bundle adjustment may be slow and sensitive to the initial estimate and the choice of optimization method.
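As a point of reference, here is a minimal sketch of the classical two-view pipeline that these failure modes belong to, written with OpenCV. The image paths and the intrinsics K are assumed values, and the snippet is only meant to show where each step sits, not to reproduce any particular system:

```python
# Minimal classical two-view pipeline (assumed image paths and intrinsics).
import cv2
import numpy as np

img1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view2.jpg", cv2.IMREAD_GRAYSCALE)

# 1) Handcrafted keypoints + descriptor matching (can produce wrong matches).
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)
matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(des1, des2)
pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# 2) RANSAC on the essential matrix (may miss the best model when inliers are few).
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])          # assumed intrinsics
E, inlier_mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)

# 3) Recover the relative pose; bundle adjustment would refine it afterwards.
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inlier_mask)
print("relative rotation:\n", R, "\ntranslation direction:", t.ravel())
```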

PoseDiffusion offers a new perspective on these difficulties. It combines a diffusion model with geometric constraints derived from the relationship between pairs of images, and it outperforms traditional methods in challenging scenarios such as a few pictures with wide baselines between them (taken from very different viewpoints).

What are diffusion models?

Diffusion models are models that learn to create new data that resemble the data they are trained on. For instance, if they are trained on images of cats, they learn to generate new images of cats.

They use a neural network that is trained to denoise data corrupted with Gaussian noise by learning to reverse the diffusion process. They do this by following two processes, the forward process and the reverse process (a toy sketch follows the list below):

  • In the forward process, the model gradually transforms the data samples into pure noise by adding noise in small steps.
  • In the reverse process, the model learns to reverse this transformation by removing noise in small steps, until it produces realistic data.
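For intuition, here is a toy NumPy sketch of both processes on a small 1-D sample, using a generic DDPM-style noise schedule. This is an illustration of the mechanics only, not the PoseDiffusion code:

```python
# Toy forward/reverse diffusion on a tiny data vector (generic DDPM-style schedule).
import numpy as np

T = 1000                                # number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)      # noise added at each forward step
alphas_bar = np.cumprod(1.0 - betas)    # cumulative "signal kept" factor

def forward_diffuse(x0, t, rng):
    """Forward process: jump from clean data x0 straight to its noisy version x_t."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * noise
    return x_t, noise

def reverse_step(x_t, t, predicted_noise):
    """One reverse step: remove the noise that the network predicts at step t."""
    alpha_t = 1.0 - betas[t]
    return (x_t - betas[t] / np.sqrt(1.0 - alphas_bar[t]) * predicted_noise) / np.sqrt(alpha_t)

rng = np.random.default_rng(0)
x0 = np.array([0.5, -1.2, 3.0])                     # a clean data sample
x_t, true_noise = forward_diffuse(x0, t=500, rng=rng)
# A trained denoiser would predict `true_noise`; here we reuse it to show the mechanics.
x_less_noisy = reverse_step(x_t, t=500, predicted_noise=true_noise)
```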

What are epipolar geometric constraints?

Epipolar constraints are the geometric relations that exist between two images of the same 3D scene taken from different viewpoints.

They are based on the idea that a 3D point (P) and its projections onto the two images (p and p’) lie on a common plane called the epipolar plane (the gray surface). This plane also contains the centers of the two cameras (O1 and O2) that observe the same 3D point (P). The line joining the two camera centers is called the baseline (the orange line).

The general setup of epipolar geometry: the epipolar plane (the gray region), the baseline (the orange line) and the epipolar lines (the blue lines)

PoseDiffusion uses these epipolar geometry constraints to guide the sampling process of the diffusion model, improving both its efficiency and its accuracy.
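Concretely, when two poses are correct, matched points p and p’ must satisfy the epipolar constraint p’ᵀ F p ≈ 0, where F is the fundamental matrix implied by the two cameras. A common scalar form of this constraint is the Sampson epipolar error, sketched below in NumPy; this is an illustrative implementation of the error itself, not the authors' guidance code:

```python
# Sampson epipolar error: how far matched points are from satisfying p2^T F p1 = 0.
import numpy as np

def sampson_error(F, p1, p2):
    """F: (3, 3) fundamental matrix; p1, p2: (N, 2) matched pixel coordinates."""
    ones = np.ones((p1.shape[0], 1))
    x1 = np.hstack([p1, ones])              # homogeneous coordinates in image 1
    x2 = np.hstack([p2, ones])              # homogeneous coordinates in image 2
    Fx1 = x1 @ F.T                          # epipolar lines in image 2
    Ftx2 = x2 @ F                           # epipolar lines in image 1
    num = np.sum(x2 * Fx1, axis=1) ** 2     # (x2^T F x1)^2 per correspondence
    den = Fx1[:, 0]**2 + Fx1[:, 1]**2 + Ftx2[:, 0]**2 + Ftx2[:, 1]**2
    return num / den                        # one residual per correspondence
```

During sampling, a score like this can be driven towards zero, which is how the epipolar constraints steer the estimated poses (see the sampling sketch further below).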

How does PoseDiffusion work?

The picture below shows how PoseDiffusion predicts a camera’s parameters (denoted by x) for a given set of images (denoted by I).

Camera pose estimation with PoseDiffusion

The model starts with a set of randomly initialized poses and iteratively refines them to estimate each camera's pose, i.e. its extrinsic and intrinsic parameters (both are written out as matrices in the sketch after the list below).

  • The extrinsic parameters are the position and orientation of the camera.
  • The intrinsic parameters are the focal length, the pixel size, the image center, the skew, and the lens distortion.
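As a quick illustration, both parameter groups are usually written as matrices in the pinhole camera model. The numbers below are assumed values; PoseDiffusion predicts such quantities for every input image:

```python
# Pinhole camera model: intrinsics K and extrinsics [R | t] (assumed example values).
import numpy as np

# Intrinsics: focal lengths, principal point (image center); skew and distortion omitted.
fx, fy = 1200.0, 1200.0
cx, cy = 640.0, 360.0
K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])

# Extrinsics: rotation R (orientation) and translation t (location of the world
# origin in camera coordinates; here it sits 5 units in front of the camera).
R = np.eye(3)
t = np.array([[0.0], [0.0], [5.0]])

# Together they project a 3D world point X to pixel coordinates.
X = np.array([[0.3], [0.1], [2.0], [1.0]])   # homogeneous world point
P = K @ np.hstack([R, t])                    # 3x4 projection matrix
x = P @ X
print("projected pixel:", (x[:2] / x[2]).ravel())
```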

Framework

The model consists of two stages: training and inference (see the picture below). In the first stage, the diffusion model is trained on sets of images with their corresponding ground-truth camera poses, learning to identify the most probable camera poses for each set of pictures. In the second stage, the trained model is used for inference: it recovers the camera poses by reversing the diffusion process.

PoseDiffusion framework

For example, say you have a camera and you take some pictures of a scene from different angles. You want to know the camera pose for each picture: where the camera was when you took it and how it was pointing. The camera pose can be described by the coordinates of the camera and the direction it was facing (in the picture these numbers are denoted x).

p(x | I) tells us how likely each possible pose x of the camera is for the given pictures I. If p(x | I) is high for a certain pose x, it means that pose x is very likely to match the pictures I.

  • During training, the model learns to maximize p(x | I) from a dataset of (x, I) pairs (a sketch of such a training step follows below).
  • During inference, the model takes a set of images (I) and outputs the camera poses (x).

The training dataset is made of multiple images of a scene taken from different viewpoints, and the corresponding camera poses (position and orientation) for each image.
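To make the training stage concrete, here is a hedged sketch of what a single training step could look like: noise the ground-truth poses, then ask a denoiser conditioned on image features to predict that noise. The `encoder` and `denoiser` arguments are hypothetical stand-ins, not the paper's actual networks, and the noise-prediction objective is one common way to set up a diffusion loss:

```python
# Sketch of one training step for p(x | I) with hypothetical encoder/denoiser networks.
import torch

T = 100
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def training_step(encoder, denoiser, optimizer, images, poses):
    """images: (N, 3, H, W) frames of one scene; poses: (N, D) flattened camera parameters."""
    t = torch.randint(0, T, (1,))                    # random diffusion step
    noise = torch.randn_like(poses)
    noisy_poses = (alphas_bar[t].sqrt() * poses
                   + (1.0 - alphas_bar[t]).sqrt() * noise)

    features = encoder(images)                       # condition on the input images I
    pred_noise = denoiser(noisy_poses, t, features)  # predict the injected noise
    loss = torch.nn.functional.mse_loss(pred_noise, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```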

Visualization of sampling iterations

The video above shows some sampling iterations for pose estimation. Given a set of input frames, the model samples p(x|I) step by step.
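A conceptual sketch of such a sampling loop is shown below: the poses start as pure noise and are denoised step by step, with an optional `guidance_fn` hook where a scalar epipolar score (such as the Sampson error above) can nudge each step toward geometric consistency, in the spirit of classifier guidance. The names and the exact update rule are illustrative, not the authors' implementation:

```python
# Sketch of step-by-step sampling of p(x | I) with optional geometric guidance.
import torch

@torch.no_grad()
def sample_poses(denoiser, features, pose_dim, T, betas, alphas_bar,
                 guidance_fn=None, scale=0.1):
    x = torch.randn(pose_dim)                              # start from pure noise
    for t in reversed(range(T)):
        pred_noise = denoiser(x, torch.tensor([t]), features)
        alpha_t = 1.0 - betas[t]
        x = (x - betas[t] / (1.0 - alphas_bar[t]).sqrt() * pred_noise) / alpha_t.sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)  # keep a little noise until the end
        if guidance_fn is not None:                        # geometry-guided nudge
            with torch.enable_grad():
                x_req = x.detach().requires_grad_(True)
                err = guidance_fn(x_req)                   # scalar epipolar error of the poses
                grad, = torch.autograd.grad(err, x_req)
            x = x - scale * grad                           # push toward epipolar consistency
    return x
```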

Qualitative comparison

The figure below shows the results of PoseDiffusion compared to RelPose, COLMAP+SPSG, and the ground truth (shown in red). The input images I are taken from the CO3Dv2 dataset. A missing camera indicates that the method failed for that frame.

Pose estimation on CO3Dv2 dataset

Create a 3D model of a scene from a set of 2D photos 

With the new method you can create a 3D model of a scene from a set of 2D photos (photogrammetry). Imagine you have a camera and you take pictures of a building from different angles.

You can use those pictures to figure out the shape and size of the building. The model uses probabilistic diffusion to estimate the best position and orientation of the camera for each picture, and those poses are then used to estimate the 3D points of the building.
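For illustration, once the camera poses are estimated, each pair of matching pixels can be lifted to a 3D point with linear (DLT) triangulation, a standard building block of photogrammetry. The sketch below is a generic NumPy implementation, not part of PoseDiffusion itself:

```python
# Linear (DLT) triangulation of one 3D point from two estimated cameras.
import numpy as np

def triangulate_point(P1, P2, uv1, uv2):
    """P1, P2: (3, 4) projection matrices K [R | t]; uv1, uv2: (2,) matched pixels."""
    A = np.stack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)      # the null vector of A is the homogeneous 3D point
    X = Vt[-1]
    return X[:3] / X[3]              # back to Euclidean coordinates
```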

Conclusion

PoseDiffusion is a new method for camera pose estimation that outperforms other methods by combining the strengths of traditional and learned methods.

The new approach addresses one of the main challenges of learned methods: generalization across datasets, even when the training data has a different camera pose distribution. For example, a model trained on object-centric captures, where the camera circles around an object, can still handle datasets where the cameras are arranged very differently around the scene.

The model can be used for 3D reconstruction, augmented reality, and robot navigation.
