
Task

Stylizable human video generation driven by text and reference images

# Input
Images: reference images of the person
Text: description of the events involving the person

# Output
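Video: a stylized video that matches the text, keeps the human subject (and the background?) consistent, and shows logically plausible motion

How long should the video be, and at what frame rate?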

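Expected contributions

[Quality] Smoother human motion, better consistency of the person's features, and better logical alignment between the text and the person.
[Architecture] A novel conditional diffusion-transformer architecture. Previous conditional video generation work has been mostly based on U-Net architectures, and conditional control in DiT architectures remains underexplored, especially in the human face domain, which has strong conditional-control requirements and promising applications.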

| Model | Components |
| --- | --- |
| VDT | Stable Diffusion VAE; Patchify (t/s embedding); Transformer (Temporal&Spatial attn in one block); Diffusion Schedule |
| LATTE | Stable Diffusion VAE; Patchify (t/s embedding); Transformer (Temporal/Spatial attn in one block); Diffusion Schedule |
| OpenSora, PKU-YUAN-Lab (Li Yuan's group, Peking University School of Electronic and Computer Engineering) (github.com) | To be surveyed |
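
The "VAE, then patchify, then temporal/spatial attention" pipeline in the table can be made concrete with some token-count arithmetic. The following is a minimal sketch, not either paper's implementation. It assumes per-frame encoding with the Stable Diffusion VAE (8x spatial downsampling, no temporal compression), a hypothetical patch size of 2, and a hypothetical 16-frame 256x256 clip. `VideoSpec`, `latent_token_grid`, and `attention_groups` are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class VideoSpec:
    frames: int   # number of frames T
    height: int   # pixel height
    width: int    # pixel width


def latent_token_grid(spec, vae_downsample=8, patch_size=2):
    """Return (T, tokens_per_frame) after per-frame VAE encoding and patchify.

    Assumes no temporal compression; patch_size=2 is a hypothetical choice.
    """
    stride = vae_downsample * patch_size
    if spec.height % stride or spec.width % stride:
        raise ValueError("height and width must be divisible by vae_downsample * patch_size")
    h = spec.height // stride
    w = spec.width // stride
    return spec.frames, h * w


def attention_groups(frames, tokens_per_frame):
    """Token-index groups that attend to each other.

    Spatial attention: all patches within one frame.
    Temporal attention: the same patch position across all frames.
    """
    def idx(t, n):
        return t * tokens_per_frame + n

    spatial = [[idx(t, n) for n in range(tokens_per_frame)] for t in range(frames)]
    temporal = [[idx(t, n) for t in range(frames)] for n in range(tokens_per_frame)]
    return spatial, temporal


spec = VideoSpec(frames=16, height=256, width=256)  # hypothetical clip size
T, N = latent_token_grid(spec)
spatial, temporal = attention_groups(T, N)
print(T, N, T * N)                        # 16 256 4096
print(len(spatial), len(spatial[0]))      # 16 256
print(len(temporal), len(temporal[0]))    # 256 16
```

Under these assumptions, full joint attention would cover 4096 tokens, while factorized attention works on sequences of 256 (spatial) or 16 (temporal) tokens. This is one reason the open questions on video length and frame rate directly affect the architecture choice.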

《VDT: GENERAL-PURPOSE VIDEO DIFFUSION TRANSFORMERS VIA MASK MODELING》


《Latte: Latent Diffusion Transformer for Video Generation》


VDT and LATTE have broadly similar architectures. They differ in the design of the position embedding and the transformer block.

VDT specifically highlights the importance of the condition-token concat scheme, which is not reflected in Latte.
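
To illustrate what condition-token concat means in practice, here is a minimal, dependency-free sketch. It is an assumption-laden toy, not VDT's code: attention is single-head with identity Q/K/V projections, the token values are made up, and `denoise_with_condition` is a hypothetical name. The point it shows is that condition tokens (for example, from reference-image latents) are joined to the noisy tokens along the sequence axis, attended to jointly, and then dropped from the output.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (illustration only)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)])
    return out


def denoise_with_condition(cond_tokens, noisy_tokens):
    """Concatenate condition and noisy tokens along the sequence axis,
    attend jointly, then keep only the noisy-token positions."""
    joint = cond_tokens + noisy_tokens   # sequence-level concat
    mixed = self_attention(joint)
    return mixed[len(cond_tokens):]      # drop the condition positions


# Made-up 2-dimensional tokens, for illustration only.
cond = [[1.0, 0.0], [0.0, 1.0]]
noisy = [[0.5, 0.5], [0.2, -0.3], [0.0, 0.0]]
out = denoise_with_condition(cond, noisy)
print(len(out), len(out[0]))  # 3 2
```

Because the condition enters as ordinary tokens, the transformer block itself needs no extra cross-attention or modulation layers. The cost is a longer sequence, which grows with the number of reference images.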

Animate Anyone	Consistent and Controllable Image-to-Video Synthesis for Character Animation
Champ	Controllable and Consistent Human Image Animation with 3D Parametric Guidance
EMO	Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions
AnimateDiff	Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
To be added