Motivation

In this work, we tackle the challenge of enhancing the realism and expressiveness of talking head video generation by focusing on the dynamic and nuanced relationship between audio cues and facial movements.

Prior methods struggle to capture the rich expressiveness of human faces and individualized facial styles.

Problem

The mapping between audio and facial expression is inherently ambiguous (one-to-many), which can destabilize the model; additional controls are also needed (+ speed control + face region control).

Method


Use a Reference Feature Map to keep the subject's identity consistent across frames.

Use a Face Region mask to provide a weak signal about the face's location.

Use a pretrained audio encoder to provide the audio features.

Use a Speed Encoder to provide a weak head-motion speed signal.

Leverage SD 1.5's strong image generation capability while adding Temporal Layers to give the model temporal awareness.
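The Temporal Layer idea above is typically realized as self-attention over the frame axis, applied independently at each spatial location, so a pretrained image backbone gains cross-frame consistency. A minimal numpy sketch of this pattern follows; the function name, weight shapes, and single-head setup are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_self_attention(x, Wq, Wk, Wv):
    """Single-head temporal attention (illustrative sketch).

    x: (B, F, H, W, C) video features.
    Attention runs over the frame axis F independently at every
    spatial location (h, w), leaving per-frame spatial layers untouched.
    """
    B, F, H, W, C = x.shape
    # Fold the spatial grid into the batch so each (h, w) position
    # attends only across its own F frames.
    t = x.transpose(0, 2, 3, 1, 4).reshape(B * H * W, F, C)
    q, k, v = t @ Wq, t @ Wk, t @ Wv
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(C), axis=-1)
    out = attn @ v
    # Restore the original (B, F, H, W, C) layout.
    return out.reshape(B, H, W, F, C).transpose(0, 3, 1, 2, 4)

# Toy usage: 1 clip, 4 frames, 2x2 latent grid, 8 channels.
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4, 2, 2, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) * 0.1 for _ in range(3))
y = temporal_self_attention(x, Wq, Wk, Wv)
```

Because the spatial dimensions are folded into the batch, this layer adds temporal mixing without changing the feature shape, so it can be interleaved with frozen SD 1.5 spatial blocks.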

Limitation