Diffusion-based text-to-image generation has achieved remarkable progress and realistic content quality, greatly promoting the development of text-to-video generation. Although equipped with powerful image diffusion models, video generation still requires massive labeled data and high training costs. Recently, some works have focused on cost-effective video generation in a one-shot or few-shot manner, building on image diffusion models with minimal demand for video data and computing resources. However, these video generation models support only a single motion pattern, which raises the question: how can we improve generation freedom with a light training burden? In this paper, we explore cost-effective video generation for adaptive motion concepts by learning motion priors from a small set of video data. Specifically, we construct a learnable bank of motion concepts and propose a Dual-Semantic-guided Motion Attention module that locates the corresponding concept in the bank under the guidance of textual and visual semantics. The extracted motion priors are inserted into the video latents via a lightweight motion injection layer, which integrates motion semantics effectively with far fewer parameters than a conventional temporal attention layer. In addition, we introduce a temporal-aware noise prior and an inter-frame consistency constraint to strengthen the learning of temporal dependencies and improve video smoothness. Extensive experiments validate that the proposed method can adaptively learn motion priors from a small set of training videos and generate smooth videos involving single or multiple motion concepts. The results demonstrate that our method outperforms existing few-shot video generation methods and even some large-scale video generation models.
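The core retrieval step described above can be illustrated with a minimal sketch. The code below is an assumption-based illustration, not the paper's implementation: the module names, the fusion of the two semantic embeddings by summation, and all dimensions (`num_concepts`, `dim`) are hypothetical choices made only to show how a learnable motion-concept bank could be queried by attention with dual-semantic guidance.

```python
import torch
import torch.nn as nn


class DualSemanticMotionAttention(nn.Module):
    """Sketch (assumed design): retrieve a motion prior from a learnable
    concept bank, with the attention query built from textual and visual
    semantics. Fusion by summation is an illustrative assumption."""

    def __init__(self, num_concepts: int = 16, dim: int = 320):
        super().__init__()
        # Learnable bank of motion-concept embeddings (one row per concept).
        self.bank = nn.Parameter(torch.randn(num_concepts, dim))
        self.to_q = nn.Linear(dim, dim)  # query from the fused semantics
        self.to_k = nn.Linear(dim, dim)  # keys computed from the bank
        self.to_v = nn.Linear(dim, dim)  # values computed from the bank

    def forward(self, text_emb: torch.Tensor, visual_emb: torch.Tensor) -> torch.Tensor:
        # Fuse textual and visual semantics into one query vector per sample.
        q = self.to_q(text_emb + visual_emb)               # (B, dim)
        k = self.to_k(self.bank)                           # (N, dim)
        v = self.to_v(self.bank)                           # (N, dim)
        # Scaled dot-product attention over the concept bank.
        attn = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)  # (B, N)
        # Weighted mixture of bank values = extracted motion prior.
        return attn @ v                                    # (B, dim)


bank_attn = DualSemanticMotionAttention()
prior = bank_attn(torch.randn(2, 320), torch.randn(2, 320))
print(prior.shape)  # torch.Size([2, 320])
```

In the full method, such a prior would then be injected into the video latents by the lightweight motion injection layer rather than by a full temporal attention block, which is where the parameter savings come from.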
[Figure: Qualitative comparison of video generation methods. Top: methods exposed to more than 10 million videos during pretraining or fine-tuning (Gen-2, MotionDirector, AnimateDiff, I2VGen-XL, and TI2V-Zero, which is given the first frame). Bottom: methods exposed to fewer than 100 videos (Tune-A-Video, LAMP, and MotionPrior (ours)). Example generation prompts: "A beautiful girl plays the guitar under fireworks in the sky.", "A helicopter and many birds fly over the sea.", "A horse is running in the rain.", "A Pikachu plays the guitar in the garden besides a waterfall.", "A branch in rain.", "A woman plays the guitar.", "Apache flies in sky.", "Waterfall and a Ferrari." Video editing examples with input videos: "A jeep car is moving on the snow." and "A girl runs beside a river, Van Gogh style." Results are shown at 16, 32, 60, and 100 frames.]