TransHuman: A Transformer-based Human Representation for Generalizable Neural Human Rendering

ReLER Lab, CCAI, Zhejiang University · Alibaba DAMO Academy
ICCV 2023

Task Overview

The task of generalizable neural human rendering trains conditional Neural Radiance Fields (NeRF) on multi-view videos of different characters, and generalizes to unseen performers in a single feed-forward pass at inference time, without per-subject optimization.
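
For intuition, here is a minimal sketch of what "conditional" means in this setting: besides a 3D point and view direction, the radiance field takes a condition feature derived from the reference views of a performer, so a new identity needs no retraining. All names and dimensions below (e.g. `ConditionalNeRF`, `cond_dim`) are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class ConditionalNeRF(nn.Module):
    """Toy conditional radiance field: (point, view dir, condition) -> (rgb, sigma).

    The condition vector would come from the reference views of an unseen
    performer, enabling single-pass generalization without test-time
    optimization. Purely illustrative; the paper's architecture differs.
    """
    def __init__(self, cond_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + 3 + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # rgb (3) + density (1)
        )

    def forward(self, x, d, cond):
        out = self.mlp(torch.cat([x, d, cond], dim=-1))
        rgb = torch.sigmoid(out[..., :3])
        sigma = torch.relu(out[..., 3:])
        return rgb, sigma

# Query 1024 sample points for a new performer described by one condition vector.
pts, dirs = torch.randn(1024, 3), torch.randn(1024, 3)
cond = torch.randn(64).expand(1024, 64)
rgb, sigma = ConditionalNeRF()(pts, dirs, cond)
```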

Why TransHuman?

Previous methods have primarily used a SparseConvNet (SPC)-based human representation to process the painted SMPL. However, such an SPC-based representation i) is optimized in the volatile observation space, which leads to pose misalignment between the training and inference stages, and ii) lacks the global relationships among human parts that are critical for handling the incomplete painted SMPL.

To tackle these issues, we present a new framework named TransHuman, which learns the painted SMPL under the canonical space and captures the global relationships between human parts with transformers.


Method Overview

TransHuman is mainly composed of Transformer-based Human Encoding (TransHE), Deformable Partial Radiance Fields (DPaRF), and Fine-grained Detail Integration (FDI).

TransHE first captures the global relationships between human parts with transformers, operating under the canonical space (see the sketch below).
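
A rough PyTorch sketch of the idea: painted-SMPL vertex features in the canonical pose are pooled into part-level tokens, and a standard transformer encoder lets every part attend to every other, giving the global context an SPC lacks. The uniform chunking used for grouping here is a stand-in for brevity; the paper's actual grouping and tokenization scheme differs, and all shapes and names are assumptions.

```python
import torch
import torch.nn as nn

N_VERTS, FEAT, N_PARTS, DIM = 6890, 32, 300, 128  # 6890 = SMPL vertex count

class TransHESketch(nn.Module):
    """Toy transformer-based human encoding.

    Painted-SMPL vertex features (canonical pose) are pooled into part
    tokens, then a transformer encoder models part-to-part relationships.
    Illustrative only; TransHuman's real grouping is more sophisticated.
    """
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(FEAT, DIM)
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, painted_smpl):               # (B, 6890, FEAT)
        B = painted_smpl.shape[0]
        pad = (-painted_smpl.shape[1]) % N_PARTS   # pad so vertices split evenly
        x = nn.functional.pad(painted_smpl, (0, 0, 0, pad))
        tokens = x.view(B, N_PARTS, -1, FEAT).mean(dim=2)  # pool each chunk
        return self.encoder(self.proj(tokens))     # (B, N_PARTS, DIM) part tokens

part_feats = TransHESketch()(torch.randn(2, N_VERTS, FEAT))
```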

Then, DPaRF deforms the coordinate system from the canonical space back to the observation space and encodes a query point as an aggregation of coordinates and condition features, as sketched below.
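
Conceptually, each part token carries a coordinate frame that deforms with the pose: a query point is expressed in each observation-space part frame and fused with that part's condition feature. A hedged sketch of this step, assuming rigid per-part transforms `R`, `t` (e.g. from the fitted SMPL) and a simple distance-weighted aggregation over the nearest parts; names, shapes, and the fusion rule are illustrative.

```python
import torch

def dparf_encode(query, R, t, part_feats, k=4):
    """Toy deformable partial radiance field encoding.

    query:      (3,) point in observation space
    R, t:       (P, 3, 3), (P, 3) rigid transforms taking each canonical
                part frame to the observed pose (e.g. from fitted SMPL)
    part_feats: (P, D) part tokens from the TransHE stage
    Returns a fused (coordinate, feature) code from the k nearest parts.
    """
    # Express the query point in every part's local, pose-deformed frame.
    local = torch.einsum('pij,pj->pi', R.transpose(1, 2), query - t)   # (P, 3)
    dist = local.norm(dim=-1)
    idx = dist.topk(k, largest=False).indices        # k nearest part frames
    codes = torch.cat([local[idx], part_feats[idx]], dim=-1)  # (k, 3 + D)
    w = torch.softmax(-dist[idx], dim=0)             # distance-based weights
    return (w.unsqueeze(-1) * codes).sum(dim=0)      # (3 + D,) point code

P, D = 300, 128
code = dparf_encode(torch.randn(3), torch.eye(3).expand(P, 3, 3),
                    torch.randn(P, 3), torch.randn(P, D))
```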

Finally, FDI gathers fine-grained observation-space information from pixel-aligned appearance features under the guidance of the human representation.
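
One way to realize such guided integration is cross-attention: the coarse human-representation code of a query point attends to pixel-aligned features sampled from the reference views at the point's projections. A minimal sketch under that assumption; the number of views, the dimensions, and the single attention layer are illustrative choices, not the paper's exact design.

```python
import torch
import torch.nn as nn

class FDISketch(nn.Module):
    """Toy fine-grained detail integration via cross-attention.

    The coarse human-representation code (query) gathers appearance
    detail from pixel-aligned features of the reference views (keys /
    values). Illustrative only; the paper's module may differ.
    """
    def __init__(self, dim: int = 128):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, human_code, pix_feats):
        # human_code: (B, 1, dim)  code of one query point
        # pix_feats:  (B, V, dim)  pixel-aligned features from V reference views
        fused, _ = self.attn(human_code, pix_feats, pix_feats)
        return fused + human_code  # residual: coarse code refined by image detail

out = FDISketch()(torch.randn(8, 1, 128), torch.randn(8, 3, 128))
```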

Please refer to our paper for more details.


Novel View Synthesis on ZJU-MoCap

360° rendering results on ZJU-MoCap given 3 reference views of unseen poses/identities. With no test-time optimization or finetuning, our TransHuman achieves significantly better clarity, with detailed textures and consistent brightness, than NHP [18], which employs an SPC-based human representation.

Mesh Reconstruction on ZJU-MoCap

Although 3D surface reconstruction is not our main goal, our method achieves more complete geometry with fine details such as wrinkles.

BibTeX

@InProceedings{Pan_2023_ICCV,
    author    = {Pan, Xiao and Yang, Zongxin and Ma, Jianxin and Zhou, Chang and Yang, Yi},
    title     = {TransHuman: A Transformer-based Human Representation for Generalizable Neural Human Rendering},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {3544-3555}
}