
Use a Diffusion Transformer architecture to build one unified video processor covering video generation, video prediction, and video frame interpolation.
The domain of video generation encompasses a variety of tasks, such as unconditional generation, video prediction, interpolation, and text-to-video generation. Prior research has typically focused on individual tasks, often incorporating specialized modules for downstream fine-tuning.
The video generation domain contains many tasks (unconditional generation, video prediction, frame interpolation) and many modal conditions; a Transformer can seamlessly integrate information from multiple modalities, framed as multiple tasks, into one model.
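One way to see how a single model can cover all three tasks is through per-frame condition masks: each task is just a different choice of which frames are given as conditions and which are generated. The function below is an illustrative sketch of this idea, not code from any specific implementation:

```python
import numpy as np

def task_mask(num_frames: int, task: str) -> np.ndarray:
    """Binary mask over frames: 1 = frame given as a condition, 0 = frame to generate."""
    mask = np.zeros(num_frames, dtype=np.int64)
    if task == "unconditional":
        pass                      # nothing observed; generate every frame
    elif task == "prediction":
        mask[:2] = 1              # condition on the first two frames, predict the rest
    elif task == "interpolation":
        mask[0] = mask[-1] = 1    # condition on both endpoints, fill in between
    else:
        raise ValueError(f"unknown task: {task}")
    return mask

# One model, three tasks -- only the mask changes.
print(task_mask(6, "unconditional"))  # [0 0 0 0 0 0]
print(task_mask(6, "prediction"))     # [1 1 0 0 0 0]
print(task_mask(6, "interpolation"))  # [1 0 0 0 0 1]
```

Because the conditioning is expressed as a mask over tokens rather than as a task-specific module, the same Transformer backbone can be trained or evaluated on any of these tasks without architectural changes.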
Transformers, unlike the U-Net, which is designed mainly for images, are inherently capable of capturing long-range or irregular temporal dependencies, thanks to their powerful tokenization and attention mechanisms.
The U-Net architecture is designed mainly for images, whereas the Transformer has long-range temporal modeling ability (tokenization & attention).
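The tokenization step is what gives the Transformer this ability: the video tensor is cut into spatiotemporal patches and flattened into one token sequence, so attention can relate any two positions regardless of their temporal distance. A minimal sketch (the patch sizes and tensor layout here are assumptions for illustration, not taken from the notes):

```python
import numpy as np

def patchify_video(video: np.ndarray, pt: int, ph: int, pw: int) -> np.ndarray:
    """Cut a (T, H, W, C) video into (pt, ph, pw) spatiotemporal patches
    and flatten them into a (num_tokens, token_dim) sequence."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    v = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Bring the patch-grid axes to the front, the within-patch axes after.
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)
    return v.reshape(-1, pt * ph * pw * C)

video = np.random.rand(8, 32, 32, 3)           # 8 frames of 32x32 RGB
tokens = patchify_video(video, pt=2, ph=8, pw=8)
print(tokens.shape)  # (4 * 4 * 4, 2 * 8 * 8 * 3) = (64, 384)
```

Each of the 64 tokens covers a small space-time cube, and self-attention over this flat sequence treats a dependency between frames 0 and 7 no differently from one between adjacent frames, which is the long-range modeling the notes refer to.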
Only when a model has learned (or memorized) worldly knowledge (e.g., spatiotemporal relationships and physical laws) can it generate videos corresponding to the real world. Model capacity is thus a crucial component for video diffusion.
The model needs large amounts of data and a large parameter count to learn world knowledge (spatiotemporal relationships, physical laws), and the Transformer is more scalable.