In this paper, we present ControlVideo, a novel method for text-driven video editing. Leveraging the capabilities of text-to-image diffusion models and ControlNet, ControlVideo aims to improve the fidelity and temporal consistency of videos that align with a given text while preserving the structure of the source video. It achieves this by incorporating additional conditions such as edge maps, and by fine-tuning the key-frame and temporal attention on the source video-text pair with carefully designed strategies. We conduct an in-depth exploration of ControlVideo's design to inform future research on one-shot tuning of video diffusion models. Quantitatively, ControlVideo outperforms a range of competitive baselines in faithfulness and consistency while still aligning with the textual prompt. It also delivers videos of high visual realism and fidelity with respect to the source content, demonstrates flexibility in using controls that carry varying degrees of source-video information, and supports combinations of multiple controls.
ControlVideo incorporates three key components: visual conditions on all frames to amplify the source video's guidance; key-frame attention, which aligns all frames with a selected key frame; and temporal attention modules followed by a zero-initialized convolutional layer, for temporal consistency and faithfulness. These components and the corresponding fine-tuned parameters were chosen through a systematic empirical study. At inference, we apply DDIM inversion to the source video with the trained ControlVideo and then generate the edited video from the target prompt via DDIM sampling.
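The key-frame and temporal attention components can be sketched as below. This is a minimal PyTorch illustration under stated assumptions, not the authors' implementation: the tensor layout `(batch*frames, tokens, dim)`, the module names, and the choice of frame 0 as the key frame are assumptions. The zero-initialized convolution mirrors the ControlNet-style "zero convolution", so the temporal branch contributes nothing at the start of fine-tuning and the pretrained image model's behavior is preserved.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeyFrameAttention(nn.Module):
    """Attention where every frame's queries attend to the keys/values
    of a single selected key frame (a sketch; layout is an assumption)."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, num_frames: int, key_idx: int = 0):
        bf, n, d = x.shape                              # (batch*frames, tokens, dim)
        b = bf // num_frames
        q = self.to_q(x)                                # queries come from every frame
        # keys/values come only from the selected key frame, broadcast to all frames
        kv_src = x.view(b, num_frames, n, d)[:, key_idx]
        kv_src = kv_src.unsqueeze(1).expand(b, num_frames, n, d).reshape(bf, n, d)
        k, v = self.to_k(kv_src), self.to_v(kv_src)
        q, k, v = (t.view(bf, n, self.heads, d // self.heads).transpose(1, 2)
                   for t in (q, k, v))                  # split heads
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(bf, n, d)
        return self.to_out(out)

class TemporalAttentionBlock(nn.Module):
    """Attention over the frame axis, followed by a zero-initialized 1x1
    convolution so the residual branch starts as an identity mapping."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.zero_conv = nn.Conv1d(dim, dim, kernel_size=1)
        nn.init.zeros_(self.zero_conv.weight)           # "zero convolution":
        nn.init.zeros_(self.zero_conv.bias)             # output is zero at init

    def forward(self, x: torch.Tensor, num_frames: int):
        bf, n, d = x.shape
        b = bf // num_frames
        # rearrange so attention mixes information across frames for each token
        h = x.view(b, num_frames, n, d).permute(0, 2, 1, 3).reshape(b * n, num_frames, d)
        hn = self.norm(h)
        h = self.attn(hn, hn, hn)[0]
        h = self.zero_conv(h.transpose(1, 2)).transpose(1, 2)
        h = h.reshape(b, n, num_frames, d).permute(0, 2, 1, 3).reshape(bf, n, d)
        return x + h                                    # residual; identity at init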
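The inference pipeline (DDIM inversion of the source video, then DDIM sampling conditioned on the target prompt) can likewise be sketched. The denoiser call `unet(x, t, emb)`, standing in for the fine-tuned ControlVideo model with its per-frame visual conditions folded in, and the tensor `alphas_cumprod` of cumulative noise-schedule products are hypothetical names; the update equations are the standard deterministic DDIM steps.

```python
import torch

@torch.no_grad()
def ddim_invert(latents, unet, alphas_cumprod, src_emb, timesteps):
    """Run the deterministic DDIM update in reverse, mapping the source
    video's latents to the noise that (approximately) regenerates them.
    `timesteps` is an ascending list of ints, e.g. range(0, 1000, 20)."""
    x = latents
    for t_prev, t in zip(timesteps[:-1], timesteps[1:]):   # low noise -> high noise
        eps = unet(x, t_prev, src_emb)                     # predicted noise
        a_prev, a_t = alphas_cumprod[t_prev], alphas_cumprod[t]
        x0 = (x - (1 - a_prev).sqrt() * eps) / a_prev.sqrt()  # predicted clean latent
        x = a_t.sqrt() * x0 + (1 - a_t).sqrt() * eps          # step toward more noise
    return x

@torch.no_grad()
def ddim_sample(noise, unet, alphas_cumprod, target_emb, timesteps):
    """Standard deterministic DDIM sampling from the inverted noise,
    conditioned on the *target* prompt embedding to produce the edit."""
    x = noise
    for t, t_prev in zip(reversed(timesteps[1:]), reversed(timesteps[:-1])):
        eps = unet(x, t, target_emb)
        a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps
    return x
```

Starting sampling from the inverted noise rather than random noise is what lets the edit preserve the source video's structure; the target prompt steers the content.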
ControlVideo builds on several excellent related works:
High-Resolution Image Synthesis with Latent Diffusion Models
Adding Conditional Control to Text-to-Image Diffusion Models
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
@article{zhao2023controlvideo,
  title={ControlVideo: Adding Conditional Control for One Shot Text-to-Video Editing},
  author={Zhao, Min and Wang, Rongzhen and Bao, Fan and Li, Chongxuan and Zhu, Jun},
  journal={arXiv preprint arXiv:2305.17098},
  year={2023}
}