DreamTalk

When Expressive Talking Head Generation Meets Diffusion Probabilistic Models

Yifeng Ma¹, Shiwei Zhang², Jiayu Wang², Xiang Wang³, Yingya Zhang², Zhidong Deng¹

¹Tsinghua University, ²Alibaba Group, ³Huazhong University of Science and Technology

Diffusion models have shown remarkable success in a variety of downstream generative tasks, yet remain under-explored in the important and challenging task of expressive talking head generation. In this work, we propose the DreamTalk framework to fill this gap, employing a careful design to unlock the potential of diffusion models in generating expressive talking heads. Specifically, DreamTalk consists of three crucial components: a denoising network, a style-aware lip expert, and a style predictor. The diffusion-based denoising network consistently synthesizes high-quality audio-driven face motions across diverse expressions. To enhance the expressiveness and accuracy of lip motions, we introduce a style-aware lip expert that guides lip-sync while taking the speaking style into account. To eliminate the need for an expression reference video or text, an additional diffusion-based style predictor predicts the target expression directly from the audio. In this way, DreamTalk harnesses powerful diffusion models to generate expressive faces effectively and reduces the reliance on expensive style references. Experimental results demonstrate that DreamTalk generates photo-realistic talking faces with diverse speaking styles and accurate lip motions, surpassing existing state-of-the-art counterparts.

The code and checkpoints are released.

Overview

Generalization Capabilities: Songs

送别 Farewell (Chinese), Love Story (English)

More Songs

上海滩 The Bund (Cantonese), Lemon (Japanese), All For Love (English)

Generalization Capabilities: Out-of-domain Portraits

Generalization Capabilities: Speech in Multiple Languages

Speech in Chinese, French, German, Italian, Japanese, Korean, and Spanish

Generalization Capabilities: Noisy Audio

Speaking Style Manipulation

Adjusting the Scale of Classifier-free Guidance; Style Code Interpolation
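
The two manipulation modes named above can be illustrated with a minimal sketch. It assumes the commonly used classifier-free guidance combination (extrapolating from an unconditional prediction toward a style-conditioned one) and plain linear interpolation between two style codes; whether DreamTalk uses exactly these forms is an assumption, and the function names and toy vectors are hypothetical, not taken from the released code.

```python
from typing import List


def cfg_combine(uncond: List[float], cond: List[float], scale: float) -> List[float]:
    """Classifier-free guidance: start from the unconditional prediction and
    move toward the style-conditioned one by a factor of `scale`.
    scale = 0 returns the unconditional prediction, scale = 1 the conditional
    one, and scale > 1 exaggerates the effect of the style condition."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


def interpolate_style(code_a: List[float], code_b: List[float], alpha: float) -> List[float]:
    """Linear blend of two style codes: alpha = 0 gives code_a, alpha = 1 gives code_b."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(code_a, code_b)]


# Toy vectors standing in for denoiser outputs and style codes (hypothetical values).
pred_uncond = [0.0, 1.0]
pred_cond = [1.0, 3.0]
guided = cfg_combine(pred_uncond, pred_cond, scale=2.0)

style_a = [1.0, 0.0]
style_b = [0.0, 1.0]
mixed = interpolate_style(style_a, style_b, alpha=0.25)
```

In this sketch, raising `scale` pushes the output further from the unconditional prediction, which is one plausible reading of how a guidance scale could strengthen or weaken a speaking style, while sweeping `alpha` from 0 to 1 moves gradually from one style code to the other.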

Speaking Style Prediction

If you are looking for an exciting challenge and the chance to work on AIGC and large-scale pretraining, you have come to the right place. We are looking for talented, motivated, and imaginative researchers to join our team. If you are interested, please send your resume to yingya.zyy@alibaba-inc.com.

References

@article{ma2023dreamtalk,
title={DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models},
author={Ma, Yifeng and Zhang, Shiwei and Wang, Jiayu and Wang, Xiang and Zhang, Yingya and Deng, Zhidong},
journal={arXiv preprint arXiv:2312.09767},
year={2023}
}

DreamTalk

Diffusion-based Expressive Talking Head Generation Framework.