MotionPrior: Exploring Efficient Learning of Motion Concepts for Few-shot Video Generation


The Hong Kong Polytechnic University
IEEE TIP, 2026

Abstract

Diffusion-based text-to-image generation has achieved remarkable progress and realistic generation quality, greatly promoting the development of text-to-video generation. However, even when equipped with powerful image diffusion models, video generation still requires massive labeled data and high training cost. Recently, some works have pursued cost-effective video generation in a one-shot or few-shot manner, building on image diffusion models with minimal demand for video data and computing resources. These models, however, support only a single motion pattern, which raises the question: how can we improve generation freedom with a light training burden? In this paper, we explore cost-effective video generation with adaptive motion concepts by learning motion priors from a small set of video data. Specifically, we construct a learnable bank of motion concepts and propose a Dual-Semantic-guided Motion Attention module to locate the corresponding concept in the bank under the guidance of textual and visual semantics. The extracted motion priors are inserted into the video latents via a lightweight motion injection layer, which integrates motion semantics effectively with far fewer parameters than a conventional temporal attention layer. In addition, we introduce a temporal-aware noise prior and an inter-frame consistency constraint to strengthen the learning of temporal dependencies and improve video smoothness. Extensive experiments validate that the proposed method can adaptively learn motion priors from a small set of training videos and generate smooth videos involving single or multiple motion concepts. The results demonstrate that our method outperforms existing few-shot video generation methods and even some large-scale video generation models.
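To make the components above concrete, here is a minimal PyTorch sketch of a learnable motion-concept bank queried by dual-semantic (textual + visual) attention, together with a lightweight motion injection layer. This is an illustrative reconstruction, not the released implementation: the module names, dimensions, additive query fusion, and gated per-channel injection are all assumptions.

```python
# Hedged sketch of a motion-concept bank with dual-semantic attention and a
# lightweight injection layer. Sizes and fusion choices are assumptions, not
# the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualSemanticMotionAttention(nn.Module):
    """Retrieve a motion prior from a learnable bank, guided jointly by a
    pooled text embedding and a pooled visual embedding."""

    def __init__(self, num_concepts: int = 16, dim: int = 320):
        super().__init__()
        self.bank = nn.Parameter(torch.randn(num_concepts, dim) * 0.02)  # one row per concept
        self.to_q_text = nn.Linear(dim, dim)
        self.to_q_vis = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

    def forward(self, text_emb: torch.Tensor, vis_emb: torch.Tensor) -> torch.Tensor:
        # text_emb, vis_emb: (B, dim). Additive fusion of the two guides is an assumption.
        q = self.to_q_text(text_emb) + self.to_q_vis(vis_emb)
        k, v = self.to_k(self.bank), self.to_v(self.bank)          # (N, dim)
        attn = F.softmax(q @ k.t() / k.shape[-1] ** 0.5, dim=-1)   # (B, N) concept weights
        return attn @ v                                            # (B, dim) motion prior

class MotionInjection(nn.Module):
    """Lightweight stand-in for a full temporal-attention layer: modulate the
    video latents with the retrieved prior via a zero-initialized gate."""

    def __init__(self, dim: int = 320):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.gate = nn.Parameter(torch.zeros(1))  # identity mapping at the start of training

    def forward(self, latents: torch.Tensor, motion_prior: torch.Tensor) -> torch.Tensor:
        # latents: (B, F, C, H, W); motion_prior: (B, C)
        shift = self.proj(motion_prior)[:, None, :, None, None]
        return latents + self.gate * shift
```

The zero-initialized gate lets training start from the pretrained image model's behavior, a common adapter trick (e.g., ControlNet-style zero initialization) assumed here rather than taken from the paper. The temporal-aware noise prior and inter-frame consistency constraint can be sketched in the same spirit; the shared/independent mixing weight `alpha` and the L1 frame-difference penalty are illustrative choices, not the paper's exact formulation.

```python
# Hedged sketch of a temporal-aware noise prior and an inter-frame consistency
# term; `alpha` and the L1 penalty are assumptions for illustration.
import torch
import torch.nn.functional as F

def temporal_aware_noise(shape, alpha: float = 0.5, device: str = "cpu") -> torch.Tensor:
    """Correlate diffusion noise across frames: mix a component shared by all
    frames with per-frame independent noise, keeping unit variance."""
    b, f, c, h, w = shape
    shared = torch.randn(b, 1, c, h, w, device=device)          # common across frames
    indep = torch.randn(b, f, c, h, w, device=device)           # frame-specific
    return alpha * shared + (1 - alpha ** 2) ** 0.5 * indep

def inter_frame_consistency(pred_video: torch.Tensor) -> torch.Tensor:
    """Penalize large differences between adjacent predicted frames to
    encourage smooth motion. pred_video: (B, F, C, H, W)."""
    return F.l1_loss(pred_video[:, 1:], pred_video[:, :-1])
```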




Learning Multiple Motion Concepts ⇨ Comparison with state-of-the-art methods

"A horse runs on the grass."

Exposed video data: > 10 Million
(Pretraining or Finetuning)
Gen-2
MotionDirector
AnimatedDiff
I2VGen-XL
TI2V-Zero




Exposed video data: < 100
Tune-A-Video
LAMP
MotionPrior (ours)
Given the first frame


"Fireworks over the mountains."

Exposed video data: > 10 Million
(Pretraining or Finetuning)
Gen-2
MotionDirector
AnimatedDiff
I2VGen-XL
TI2V-Zero




Exposed video data: < 100
Tune-A-Video
LAMP
MotionPrior (ours)
Given the first frame


"Birds fly over the city."

Exposed video data: > 10 Million
(Pretraining or Finetuning)
Gen-2
MotionDirector
AnimatedDiff
I2VGen-XL
TI2V-Zero




Exposed video data: < 100
Tune-A-Video
LAMP
MotionPrior (ours)
Given the first frame



Learning Multiple Motion Concepts ⇨ Generating Composite Motions

"A beautiful girl plays the guitar
under fireworks in the sky."
"A helicopter and many birds
fly over the sea."
"A horse is running
in the rain."
"A Pikachu plays the guitar in the
garden besides a waterfall."

Learning Single Motion Concept ⇨ Few-Shot

"A branch in rain."
"A woman plays the guitar"
"Apache flies in sky."
"Waterfall and a Ferrari."

Learning Single Motion Pattern ⇨ One-Shot


Input Video ⇨ "A jeep car is moving on the snow."

Input Video ⇨ "A girl runs beside a river, Van Gogh style."

Training with 16 Frames ⇨ Inference for Longer Videos

16 frames
32 frames
60 frames
100 frames