GST: Precise 3D Human Body from a Single Image with Gaussian Splatting Transformers

Visual Geometry Group, University of Oxford

[Teaser figure: input image → precise 3D human body, with novel-view renderings.]


Given a single input image, Gaussian Splatting Transformers (GST) predicts a precise 3D human pose and shape, together with a full-color model that enables novel-view rendering, including clothing.
GST is fast (50 FPS) and relies solely on multi-view supervision (no direct 3D supervision).

Abstract

We base our work on 3D Gaussian Splatting (3DGS), a scene representation composed of a mixture of Gaussians. Predicting such a mixture for a human from a single input image is challenging: the mixture has a non-uniform density (with a many-to-one relationship to input pixels) and must satisfy strict physical constraints, while remaining flexible enough to accommodate a variety of clothes and poses. Our key observation is that the vertices of standardized human meshes (such as SMPL) provide an adequate density and approximate initial positions for the Gaussians. We can then train a transformer model to jointly predict comparatively small adjustments to these positions, the remaining Gaussian attributes, and the SMPL parameters. We show empirically that this combination, using only multi-view supervision, achieves fast inference of 3D human models from a single image without test-time optimization, expensive diffusion models, or 3D point supervision. We also show that it improves 3D pose estimation by better fitting human models that account for clothing and other variations.
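As a toy illustration of the multi-view supervision signal: the predicted Gaussians are rendered from held-out camera views and compared photometrically against the corresponding ground-truth images. The sketch below is a minimal stand-in, assuming a simple L2 photometric loss; the random arrays replace the differentiable splatting renderer and real views used in the paper.

```python
import numpy as np

def photometric_loss(rendered, target):
    """Mean squared error between a rendered view and the ground-truth image.

    In the real pipeline, `rendered` would come from differentiably
    splatting the predicted Gaussians into a supervising camera.
    """
    return float(np.mean((rendered - target) ** 2))

# Toy stand-ins for one supervising view (H x W x 3 images in [0, 1]).
rng = np.random.default_rng(0)
target = rng.uniform(0.0, 1.0, size=(8, 8, 3))
rendered = np.clip(target + rng.normal(0.0, 0.05, size=target.shape), 0.0, 1.0)

loss = photometric_loss(rendered, target)
```

Summing this loss over several views gives the multi-view objective; no 3D ground truth is needed, because the gradient flows through the renderer back to the Gaussian parameters.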

Pipeline

Given a single input image, GST uses a Vision Transformer (ViT) to predict both a 3D human pose (in the form of SMPL parameters) and a refined full-color 3D model (in the form of 3D Gaussian Splats). Additional input tokens produce each individual Gaussian's color c, opacity α, scale, rotation, and position offset δ. Each Gaussian's position μ is obtained by displacing a vertex v of the SMPL mesh by its offset δ, so the model can be seen as a refinement, or residual, over the interpretable SMPL mesh, enabling multi-view rendering with higher visual fidelity.
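The μ = v + δ parameterization can be sketched as follows. SMPL does have 6,890 vertices, but the tanh squashing and the `MAX_OFFSET` cap are illustrative assumptions here, chosen only to show how the offsets stay small relative to the mesh, not values taken from the paper.

```python
import numpy as np

N_VERTS = 6890        # SMPL mesh vertex count (one Gaussian anchored per vertex)
MAX_OFFSET = 0.05     # hypothetical bound (metres) keeping splats near the surface

def gaussian_positions(smpl_vertices, raw_offsets, max_offset=MAX_OFFSET):
    """mu = v + delta: each Gaussian sits near its anchor vertex.

    A tanh squashing bounds the learned residual, so the Gaussians
    remain a small refinement over the interpretable SMPL mesh.
    """
    delta = max_offset * np.tanh(raw_offsets)
    return smpl_vertices + delta

rng = np.random.default_rng(0)
vertices = rng.normal(size=(N_VERTS, 3))   # stand-in for posed SMPL vertices
raw = rng.normal(size=(N_VERTS, 3))        # stand-in for transformer offset outputs
mu = gaussian_positions(vertices, raw)
```

Because the offsets are residuals over SMPL vertices, the density of Gaussians automatically follows the mesh, and the SMPL parameters remain interpretable alongside the appearance model.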

Predicting 3D SMPL Parameters and 3D Gaussian Splats from a Single Image



[Figure: qualitative examples — for each input image, the predicted SMPL mesh and the predicted 3DGS rendering.]

Video

Comparisons

Speed

3D Pose Methods

Human Novel View Methods

BibTeX

@misc{prospero2024gstprecise3dhuman,
      title={GST: Precise 3D Human Body from a Single Image with Gaussian Splatting Transformers},
      author={Lorenza Prospero and Abdullah Hamdi and Joao F. Henriques and Christian Rupprecht},
      year={2024},
      eprint={2409.04196},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2409.04196},
}