
Task

Use a Diffusion-Transformer architecture to build a single video model that unifies video generation, video prediction, and video frame interpolation.

Motivation

The domain of video generation encompasses a variety of tasks, such as unconditional generation, video prediction, interpolation, and text-to-video generation. Prior research has typically focused on individual tasks, often incorporating specialized modules for downstream fine-tuning.

The video generation domain contains many tasks (unconditional generation, video prediction, video interpolation) and many modalities of conditioning; a transformer can seamlessly integrate information from multiple modalities, framed as multiple tasks, into one model.

Transformers, unlike U-Net, which is designed mainly for images, are inherently capable of capturing long-range or irregular temporal dependencies, thanks to their powerful tokenization and attention mechanisms.

The U-Net architecture is designed mainly for images, whereas Transformers can model long temporal ranges (tokenization & attention).
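To make the tokenization point concrete, here is a minimal sketch of turning a video into a transformer token sequence, where each token is one spatial patch of one frame. The function name and patch size are illustrative assumptions; real video diffusion transformers patchify an autoencoder latent rather than raw pixels.

```python
import numpy as np

def video_to_tokens(video: np.ndarray, patch: int = 4) -> np.ndarray:
    """Flatten a video of shape (T, H, W, C) into a token sequence:
    one token per spatial patch per frame, so attention can relate
    patches across both space and time.

    Hypothetical sketch; actual models patchify a latent, not pixels."""
    t, h, w, c = video.shape
    assert h % patch == 0 and w % patch == 0
    # (T, H/p, p, W/p, p, C) -> (T, H/p, W/p, p, p, C)
    x = video.reshape(t, h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)
    # Flatten to (num_tokens, token_dim) = (T * H/p * W/p, p*p*C)
    return x.reshape(t * (h // patch) * (w // patch), patch * patch * c)
```

A 2-frame 8x8 RGB clip with 4x4 patches thus becomes 8 tokens of dimension 48, and the same flattening works for any clip length, which is what lets attention handle long or irregular temporal spans.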

Only when a model has learned (or memorized) worldly knowledge (e.g., spatiotemporal relationships and physical laws) can it generate videos corresponding to the real world. Model capacity is thus a crucial component for video diffusion.

The model needs a lot of data and a large parameter count to learn world knowledge (spatiotemporal relationships, physical laws), and Transformers are more scalable.


Contribution

  1. The first work to successfully apply diffusion transformers to video.
  2. A spatiotemporal masking mechanism that reaches SOTA-level performance across a range of tasks, e.g., dynamics prediction for 3D objects.
  3. A comprehensive study of how VDT captures temporal relationships, handles conditional information, and trains efficiently.
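The unifying idea behind the masking mechanism in contribution 2 is that generation, prediction, and interpolation differ only in which frames are observed. A minimal sketch, assuming a per-frame binary mask (the function name and `num_cond` parameter are hypothetical):

```python
import numpy as np

def make_task_mask(num_frames: int, task: str, num_cond: int = 2) -> np.ndarray:
    """Build a per-frame conditioning mask: 1 = frame is given as a
    condition, 0 = frame the diffusion model must generate.

    Hypothetical helper showing how one mask format can express all
    three tasks the notes list; the real mechanism operates on
    spatiotemporal tokens, not whole frames."""
    mask = np.zeros(num_frames, dtype=np.int64)
    if task == "unconditional":
        pass  # nothing observed: generate every frame from noise
    elif task == "prediction":
        mask[:num_cond] = 1  # condition on the first frames, predict the rest
    elif task == "interpolation":
        mask[0] = 1  # condition on the two endpoint frames
        mask[-1] = 1  # and fill in the frames between them
    else:
        raise ValueError(f"unknown task: {task}")
    return mask
```

During training, masked (0) positions carry noised latents and unmasked (1) positions carry the clean conditioning frames, so a single model learns all three tasks by sampling different masks.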

Main

Condition Control

Masking

Conclusion