
Task

Use a Diffusion-Transformer architecture to build a single video model that unifies video generation, video prediction, and video frame interpolation.

Motivation

The domain of video generation encompasses a variety of tasks, such as unconditional generation, video prediction, interpolation, and text-to-video generation. Prior research has typically focused on individual tasks, often incorporating specialized modules for downstream fine-tuning.

The video generation domain contains many tasks (unconditional generation, video prediction, video interpolation) and many modalities of conditioning; a transformer can seamlessly integrate information from multiple modalities, framed as multiple tasks, into one model.

Transformers, unlike U-Net, which is designed mainly for images, are inherently capable of capturing long-range or irregular temporal dependencies, thanks to their powerful tokenization and attention mechanisms.

The U-Net architecture is designed mainly for images, whereas Transformers can model long temporal ranges (tokenization & attention).
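To make the tokenization point concrete, here is a minimal sketch of turning a video into a transformer token sequence, where each token is one spatial patch of one frame. The function name and patch size are illustrative assumptions; real video diffusion transformers patchify an autoencoder latent rather than raw pixels.

```python
import numpy as np

def video_to_tokens(video: np.ndarray, patch: int = 4) -> np.ndarray:
    """Flatten a video of shape (T, H, W, C) into a token sequence:
    one token per spatial patch per frame, so attention can relate
    patches across both space and time.

    Hypothetical sketch; actual models patchify a latent, not pixels."""
    t, h, w, c = video.shape
    assert h % patch == 0 and w % patch == 0
    # (T, H/p, p, W/p, p, C) -> (T, H/p, W/p, p, p, C)
    x = video.reshape(t, h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)
    # Flatten to (num_tokens, token_dim) = (T * H/p * W/p, p*p*C)
    return x.reshape(t * (h // patch) * (w // patch), patch * patch * c)
```

A 2-frame 8x8 RGB clip with 4x4 patches thus becomes 8 tokens of dimension 48, and the same flattening works for any clip length, which is what lets attention handle long or irregular temporal spans.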

Only when a model has learned (or memorized) worldly knowledge (e.g., spatiotemporal relationships and physical laws) can it generate videos corresponding to the real world. Model capacity is thus a crucial component for video diffusion.

The model needs a lot of data and a large parameter count to learn world knowledge (spatiotemporal relationships, physical laws), and Transformers are more scalable.


Contribution

  1. The first work to successfully apply diffusion transformers to video.
  2. A spatiotemporal masking mechanism that reaches SOTA-level performance across a range of tasks, e.g., dynamics prediction for 3D objects.
  3. A comprehensive study of how VDT captures temporal relationships, handles conditional information, and trains efficiently.
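The unifying idea behind the masking mechanism in contribution 2 is that generation, prediction, and interpolation differ only in which frames are observed. A minimal sketch, assuming a per-frame binary mask (the function name and `num_cond` parameter are hypothetical):

```python
import numpy as np

def make_task_mask(num_frames: int, task: str, num_cond: int = 2) -> np.ndarray:
    """Build a per-frame conditioning mask: 1 = frame is given as a
    condition, 0 = frame the diffusion model must generate.

    Hypothetical helper showing how one mask format can express all
    three tasks the notes list; the real mechanism operates on
    spatiotemporal tokens, not whole frames."""
    mask = np.zeros(num_frames, dtype=np.int64)
    if task == "unconditional":
        pass  # nothing observed: generate every frame from noise
    elif task == "prediction":
        mask[:num_cond] = 1  # condition on the first frames, predict the rest
    elif task == "interpolation":
        mask[0] = 1  # condition on the two endpoint frames
        mask[-1] = 1  # and fill in the frames between them
    else:
        raise ValueError(f"unknown task: {task}")
    return mask
```

During training, masked (0) positions carry noised latents and unmasked (1) positions carry the clean conditioning frames, so a single model learns all three tasks by sampling different masks.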

Main

Condition Control

Masking

Conclusion