DreamTalk: A Diffusion-based Expressive Talking Head Generation Framework

When Expressive Talking Head Generation Meets Diffusion Probabilistic Models

Yifeng Ma1, Shiwei Zhang2, Jiayu Wang2, Xiang Wang3, Yingya Zhang2, Zhidong Deng1

1Tsinghua University, 2Alibaba Group, 3Huazhong University of Science and Technology

Diffusion models have shown remarkable success in a variety of downstream generative tasks, yet remain under-explored for the important and challenging task of expressive talking head generation. In this work, we propose DreamTalk, a framework that fills this gap through meticulous design, unlocking the potential of diffusion models for generating expressive talking heads. Specifically, DreamTalk consists of three crucial components: a denoising network, a style-aware lip expert, and a style predictor. The diffusion-based denoising network consistently synthesizes high-quality audio-driven face motions across diverse expressions. To enhance the expressiveness and accuracy of lip motions, we introduce a style-aware lip expert that guides lip-sync while remaining mindful of speaking styles. To eliminate the need for an expression reference video or text, an additional diffusion-based style predictor infers the target expression directly from the audio. In this way, DreamTalk harnesses powerful diffusion models to generate expressive faces effectively while reducing reliance on expensive style references. Experimental results demonstrate that DreamTalk generates photo-realistic talking faces with diverse speaking styles and achieves accurate lip motions, surpassing existing state-of-the-art counterparts.
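To make the denoising component concrete, the sketch below shows a standard DDPM ancestral-sampling loop conditioned on audio features and a style code, in the spirit of the framework described above. This is a minimal illustration, not the paper's implementation: the `denoiser` function, the schedule values, and the motion shape are all assumptions standing in for the learned network and its real hyperparameters.

```python
import numpy as np

def denoiser(x_t, t, audio_feat, style_code):
    """Stand-in for the learned denoising network (hypothetical):
    a toy function so the sampling loop below is runnable."""
    return 0.1 * x_t + 0.01 * (audio_feat + style_code)

def ddpm_sample(audio_feat, style_code, T=50, shape=(64, 3), seed=0):
    """Standard DDPM reverse process conditioned on audio and style."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, T)       # assumed linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = rng.standard_normal(shape)           # start from pure noise
    for t in reversed(range(T)):
        eps = denoiser(x, t, audio_feat, style_code)
        # posterior mean of x_{t-1} given the predicted noise
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x  # denoised face-motion sequence

motion = ddpm_sample(audio_feat=0.5, style_code=0.2)
print(motion.shape)  # (64, 3)
```

The same loop would, in the actual system, drive a face-motion decoder; here the point is only the conditioning signature: every denoising step sees both the audio features and the style code.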

The code and checkpoints are released.


Generalization Capabilities: Songs
送别 Farewell (Chinese), Love Story (English)
More Songs
上海滩 The Bund (Cantonese), Lemon (Japanese), All For Love (English)
Generalization Capabilities: Out-of-domain Portraits

Generalization Capabilities: Speech in Multiple Languages
Speech in Chinese, French, German, Italian, Japanese, Korean, and Spanish
Generalization Capabilities: Noisy Audio

Speaking Style Manipulation
Adjusting the Scale of Classifier-free Guidance; Style Code Interpolation
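The two controls named above can be sketched in a few lines. This is a hedged illustration of the general techniques, not DreamTalk's code: the function names and the assumption that style codes interpolate linearly are mine; classifier-free guidance itself is the standard formulation of mixing conditional and unconditional noise predictions with a scale `s`.

```python
import numpy as np

def guided_eps(eps_cond, eps_uncond, s):
    """Classifier-free guidance: s = 1 recovers the conditional prediction;
    larger s exaggerates the conditioned (style) signal."""
    return eps_uncond + s * (eps_cond - eps_uncond)

def interpolate_styles(z_a, z_b, alpha):
    """Blend two style codes; alpha in [0, 1] moves from style A to style B."""
    return (1.0 - alpha) * z_a + alpha * z_b

eps_c = np.array([1.0, 2.0])
eps_u = np.array([0.0, 0.0])
print(guided_eps(eps_c, eps_u, 2.0))     # amplified prediction: [2. 4.]

z_a, z_b = np.zeros(4), np.ones(4)
print(interpolate_styles(z_a, z_b, 0.25))  # [0.25 0.25 0.25 0.25]
```

Raising the guidance scale strengthens how much the sampled motion follows the style condition, while interpolating style codes yields intermediate speaking styles, which is what the demos in this section vary.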
Speaking Style Prediction

If you are seeking an exhilarating challenge and the chance to work on AIGC and large-scale pretraining, you have come to the right place. We are searching for talented, motivated, and imaginative researchers to join our team. If you are interested, please don't hesitate to send your resume to yingya.zyy@alibaba-inc.com


@article{ma2023dreamtalk,
  title={DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models},
  author={Ma, Yifeng and Zhang, Shiwei and Wang, Jiayu and Wang, Xiang and Zhang, Yingya and Deng, Zhidong},
  journal={arXiv preprint arXiv:2312.09767},
  year={2023}
}