DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation

Abstract

We propose DiffSHEG, a Diffusion-based approach for Speech-driven Holistic 3D Expression and Gesture generation of arbitrary length. While previous works focused on co-speech gesture or expression generation individually, the joint generation of synchronized expressions and gestures remains barely explored. To address this, our diffusion-based co-speech motion generation transformer enables uni-directional information flow from expression to gesture, which helps the model better match the joint expression-gesture distribution. Furthermore, we introduce an outpainting-based sampling strategy for arbitrarily long sequence generation in diffusion models, offering flexibility and computational efficiency. Our method provides a practical solution that produces high-quality, synchronized expressions and gestures driven by speech. Evaluated on two public datasets, our approach achieves state-of-the-art performance both quantitatively and qualitatively. Additionally, a user study confirms the superiority of DiffSHEG over prior approaches. By enabling the real-time generation of expressive and synchronized motions, DiffSHEG shows potential for various applications in the development of digital humans and embodied agents.

Demo Video

Framework

DiffSHEG framework overview. Left: Audio Encoders and UniEG-Transformer Denoiser. Given an audio clip, we encode the audio into a low-level Mel-spectrogram feature and a high-level HuBERT feature. The audio features are concatenated with other optional conditions, such as text, and then fed into the UniEG-Transformer Denoiser. The denoising block fuses the conditions with the noisy motion at diffusion step t and feeds the result into style-aware Transformers to obtain the predicted noise. The condition flow is enforced to be uni-directional, from expression to gesture. Right: The detailed architecture of the style-aware Transformer encoder and the motion-condition fusion residual block.
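The uni-directional flow in the caption can be illustrated with a small toy sketch. Everything here is a hypothetical stand-in: `toy_denoiser` replaces the style-aware Transformers with a fixed linear map, `fuse` replaces the fusion residual block with plain concatenation, and passing the predicted expression noise to the gesture branch is an assumption, since the caption does not say which expression signal is forwarded. The only property the sketch aims to show is the direction of dependence: the gesture branch reads from the expression branch, never the reverse.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def fuse(noisy_motion: Sequence[float], condition: Sequence[float]) -> Vector:
    """Toy stand-in for motion-condition fusion: plain concatenation."""
    return list(noisy_motion) + list(condition)


def toy_denoiser(features: Sequence[float], out_dim: int) -> Vector:
    """Hypothetical placeholder for a Transformer: a fixed linear map."""
    total = sum(features)
    return [0.1 * total + 0.01 * i for i in range(out_dim)]


def unieg_step(
    noisy_expr: Sequence[float],
    noisy_gest: Sequence[float],
    audio_cond: Sequence[float],
) -> Tuple[Vector, Vector]:
    """One denoising call with expression -> gesture information flow."""
    # Expression branch: sees only its own noisy motion and the audio condition.
    expr_noise = toy_denoiser(fuse(noisy_expr, audio_cond), len(noisy_expr))
    # Gesture branch: additionally conditioned on the expression prediction
    # (assumption: the predicted noise is what gets forwarded).
    gest_cond = list(audio_cond) + expr_noise
    gest_noise = toy_denoiser(fuse(noisy_gest, gest_cond), len(noisy_gest))
    return expr_noise, gest_noise
```

In this toy setup, changing `noisy_gest` leaves the expression output untouched, while changing `noisy_expr` changes the gesture output.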

Arbitrarily Long Sampling

We propose Fast Outpainting-based Partial Autoregressive Sampling (FOPPAS). Instead of conditioning the model on previous frames during training, FOPPAS achieves arbitrarily long sampling via outpainting at test time, with no additional training, which offers more flexibility and wastes less computation. The following algorithm describes a single-clip pass of our FOPPAS method.
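As a rough illustration of the general outpainting idea only (not the FOPPAS algorithm itself), the sketch below chains fixed-length clips by overwriting the overlapping prefix of each new clip, at every denoising step, with the previous clip's last frames noised to the current level. The names `sample_clip`, `sample_long`, and `toy_denoise_step`, the linear noise schedule, and the one-scalar-per-frame motion are all assumptions made for brevity; audio conditioning and the "fast" and "partial" aspects of FOPPAS are not modeled.

```python
import math
import random
from typing import Callable, List, Sequence

Denoiser = Callable[[List[float], int], List[float]]


def noised(frame: float, level: float, rng: random.Random) -> float:
    """Noise a clean frame to the given level (0 = clean, near 1 = mostly noise)."""
    return math.sqrt(1.0 - level * level) * frame + level * rng.gauss(0.0, 1.0)


def sample_clip(denoise_step: Denoiser, clip_len: int, num_steps: int,
                rng: random.Random,
                known_prefix: Sequence[float] = ()) -> List[float]:
    """Denoise one clip; frames covered by known_prefix are outpainted from."""
    x = [rng.gauss(0.0, 1.0) for _ in range(clip_len)]
    for step in range(num_steps, 0, -1):
        x = denoise_step(x, step)        # model update for all frames
        level = (step - 1) / num_steps   # noise level after this step
        for i, frame in enumerate(known_prefix):
            x[i] = noised(frame, level, rng)  # overwrite the known region
    return x


def sample_long(denoise_step: Denoiser, total_len: int, clip_len: int,
                overlap: int, num_steps: int, seed: int = 0) -> List[float]:
    """Chain clips until total_len frames exist; the model is never retrained."""
    if not 0 <= overlap < clip_len:
        raise ValueError("overlap must satisfy 0 <= overlap < clip_len")
    rng = random.Random(seed)
    motion = sample_clip(denoise_step, clip_len, num_steps, rng)
    while len(motion) < total_len:
        prefix = motion[-overlap:] if overlap else []
        clip = sample_clip(denoise_step, clip_len, num_steps, rng, prefix)
        motion.extend(clip[overlap:])    # keep only the newly generated frames
    return motion[:total_len]


def toy_denoise_step(x: List[float], step: int) -> List[float]:
    """Hypothetical stand-in for a trained denoiser: shrink toward zero."""
    return [0.9 * v for v in x]
```

For example, `sample_long(toy_denoise_step, total_len=50, clip_len=20, overlap=5, num_steps=10)` returns 50 frames built from three clips, each continuing from the last five frames of the one before it.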

User Study

The figure shows user preference percentages on four metrics: realism, gesture-speech synchronism, expression-speech synchronism, and motion diversity. On both datasets and across all metrics, our method is preferred by a clear margin. DSG and DG are abbreviations of DiffuseStyleGesture and DiffGesture, respectively.

Quantitative Comparison

On the BEAT dataset, we compare our DiffSHEG with CaMN, DiffGesture, DiffuseStyleGesture (DSG), and LDA, all using audio and person ID as input. Note that the baseline methods were originally designed for gesture generation only, so we apply the same procedure independently for expression generation. On the SHOW dataset, we compare with LS3DCG and TalkSHOW. Ablation studies are conducted on both datasets to demonstrate the effectiveness of our UniEG-Transformer design. Note that we use SRGR on the BEAT dataset and PCM on the SHOW dataset. An asterisk (*) indicates that the results are computed using the pre-trained checkpoints provided by the authors of TalkSHOW.

BibTeX

@inproceedings{chen2024diffsheg,
      title     = {DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation},
      author    = {Chen, Junming and Liu, Yunfei and Wang, Jianan and Zeng, Ailing and Li, Yu and Chen, Qifeng},
      booktitle = {CVPR},
      year      = {2024}
    }