Diffusion-based text-to-image generation has achieved remarkable progress and realistic content quality, greatly promoting the development of text-to-video generation. Although equipped with powerful image diffusion models, video generation still requires massive labeled data and high training costs. Recently, some works have focused on cost-effective video generation in a one-shot or few-shot manner, building on image diffusion models with minimal demand for video data and computing resources. However, these video generation models support only a single motion pattern, which raises the question: how can we improve generation freedom with a light training burden? In this paper, we explore cost-effective video generation for adaptive motion concepts by learning motion priors from a small set of video data. Specifically, we construct a learnable bank of motion concepts and propose a Dual-Semantic-guided Motion Attention module that locates the corresponding concept in the bank under the guidance of textual and visual semantics. The extracted motion priors are inserted into the video latents via a lightweight motion injection layer, which integrates motion semantics effectively with far fewer parameters than a conventional temporal attention layer. In addition, we introduce a temporal-aware noise prior and an inter-frame consistency constraint to strengthen the learning of temporal dependencies and improve video smoothness. Extensive experiments validate that the proposed method can adaptively learn motion priors from a small set of training videos and generate smooth videos involving single or multiple motion concepts. The results demonstrate that our method outperforms existing few-shot video generation methods and even some large-scale video generation models.
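The core retrieval step described above can be illustrated with a minimal sketch. The code below is an assumption-based illustration, not the paper's implementation: the module names, the fusion of the two semantic embeddings by summation, and all dimensions (`num_concepts`, `dim`) are hypothetical choices made only to show how a learnable motion-concept bank could be queried by attention with dual-semantic guidance.

```python
import torch
import torch.nn as nn


class DualSemanticMotionAttention(nn.Module):
    """Sketch (assumed design): retrieve a motion prior from a learnable
    concept bank, with the attention query built from textual and visual
    semantics. Fusion by summation is an illustrative assumption."""

    def __init__(self, num_concepts: int = 16, dim: int = 320):
        super().__init__()
        # Learnable bank of motion-concept embeddings (one row per concept).
        self.bank = nn.Parameter(torch.randn(num_concepts, dim))
        self.to_q = nn.Linear(dim, dim)  # query from the fused semantics
        self.to_k = nn.Linear(dim, dim)  # keys computed from the bank
        self.to_v = nn.Linear(dim, dim)  # values computed from the bank

    def forward(self, text_emb: torch.Tensor, visual_emb: torch.Tensor) -> torch.Tensor:
        # Fuse textual and visual semantics into one query vector per sample.
        q = self.to_q(text_emb + visual_emb)               # (B, dim)
        k = self.to_k(self.bank)                           # (N, dim)
        v = self.to_v(self.bank)                           # (N, dim)
        # Scaled dot-product attention over the concept bank.
        attn = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)  # (B, N)
        # Weighted mixture of bank values = extracted motion prior.
        return attn @ v                                    # (B, dim)


bank_attn = DualSemanticMotionAttention()
prior = bank_attn(torch.randn(2, 320), torch.randn(2, 320))
print(prior.shape)  # torch.Size([2, 320])
```

In the full method, such a prior would then be injected into the video latents by the lightweight motion injection layer rather than by a full temporal attention block, which is where the parameter savings come from.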
[Figure: Qualitative comparison of video generation methods. Top: methods exposed to more than 10 million videos during pretraining or fine-tuning (Gen-2, MotionDirector, AnimateDiff, I2VGen-XL, and TI2V-Zero, which is given the first frame). Bottom: methods exposed to fewer than 100 videos (Tune-A-Video, LAMP, and MotionPrior (ours)). Example generation prompts: "A beautiful girl plays the guitar under fireworks in the sky.", "A helicopter and many birds fly over the sea.", "A horse is running in the rain.", "A Pikachu plays the guitar in the garden besides a waterfall.", "A branch in rain.", "A woman plays the guitar.", "Apache flies in sky.", "Waterfall and a Ferrari." Video editing examples with input videos: "A jeep car is moving on the snow." and "A girl runs beside a river, Van Gogh style." Results are shown at 16, 32, 60, and 100 frames.]