ControlVideo: Adding Conditional Control for One Shot Text-to-Video Editing

1 Dept. of Comp. Sci. & Tech., BNRist Center, THU-Bosch ML Center, THBI Lab, Tsinghua University
2 Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China
Beijing Key Laboratory of Big Data Management and Analysis Methods, Beijing, China
3 ShengShu, Beijing, China
4 Pazhou Laboratory (Huangpu), Guangzhou, China

Abstract

In this paper, we present ControlVideo, a novel method for text-driven video editing. Leveraging the capabilities of text-to-image diffusion models and ControlNet, ControlVideo aims to enhance the fidelity and temporal consistency of videos that align with a given text while preserving the structure of the source video. This is achieved by incorporating additional conditions such as edge maps and by fine-tuning the key-frame and temporal attention on the source video-text pair with carefully designed strategies. We explore ControlVideo's design in depth to inform future research on one-shot tuning of video diffusion models. Quantitatively, ControlVideo outperforms a range of competitive baselines in terms of faithfulness and consistency while still aligning with the textual prompt. Additionally, it delivers videos with high visual realism and fidelity w.r.t. the source content, and it demonstrates both flexibility in utilizing controls that contain varying degrees of source video information and the potential for combining multiple controls.

Method

Our method.

ControlVideo incorporates three key components: visual conditions for all frames, which amplify the source video's guidance; key-frame attention, which aligns all frames with a selected one; and temporal attention modules, each followed by a zero convolutional layer, for temporal consistency and faithfulness. These three components and the corresponding fine-tuned parameters were chosen through a systematic empirical study. At inference, starting from the trained ControlVideo, we apply DDIM inversion and then generate the edited video from the target prompt via DDIM sampling.
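To make the two attention components concrete, here is a minimal pure-Python sketch, not the authors' implementation. It assumes each frame is a list of token feature vectors, omits the learned query/key/value projections, and stands in for the zero convolutional layer with a scalar `out_weight` initialized to zero. The function names `attend`, `key_frame_attention`, and `temporal_attention_residual` are hypothetical and chosen only for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of feature vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out


def key_frame_attention(frames, key_index=0):
    """Every frame queries the tokens of one selected key frame (assumed layout)."""
    key_tokens = frames[key_index]
    return [attend(tokens, key_tokens, key_tokens) for tokens in frames]


def temporal_attention_residual(frames, out_weight=0.0):
    """Attend across frames at each token position, then add the result back.

    out_weight is a scalar stand-in for the zero convolutional layer:
    at 0.0 the temporal branch contributes nothing to the output.
    """
    num_frames, num_tokens = len(frames), len(frames[0])
    result = [[None] * num_tokens for _ in range(num_frames)]
    for t in range(num_tokens):
        track = [frames[f][t] for f in range(num_frames)]
        mixed = attend(track, track, track)
        for f in range(num_frames):
            result[f][t] = [x + out_weight * m
                            for x, m in zip(track[f], mixed[f])]
    return result


# Toy video: 3 frames, 2 tokens per frame, 2 channels per token.
video = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [1.0, 1.0]],
    [[0.0, 2.0], [2.0, 0.0]],
]
aligned = key_frame_attention(video, key_index=0)
unchanged = temporal_attention_residual(video, out_weight=0.0)
```

In this sketch, every token in `aligned` is a convex combination of the key frame's tokens, which is one way to read "aligns all frames with a selected one". With `out_weight` at zero, the temporal branch leaves the features untouched, so a newly added temporal module would not disturb the pretrained model's output before fine-tuning.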

Results

HED Boundary Control

+ with red hair

+ Krenz Cushart style

+ with golden dress

+ at night, foggy, soft cinematic lighting

a car → a red car

a cat → a black cat

a swan → a Swarovski crystal swan

+ watercolor painting

Canny Edge Map Control

a car → a red car

+ autumn

+ with red hair

+ at night, foggy, soft cinematic lighting

a building → a wooden building, at night

+ Vincent van Gogh style

Depth Map Control

+ admiring beautiful sunrising, early morning

+ Toei Animation style

+ with romantic pure red candlestick, beautifully backlit, matte painting concept art

ink diffuses in water → gentle green ink diffuses in water, beautiful light

Pose Control

a person is dancing → Michael Jackson is dancing

a person is dancing → Sherlock Holmes is dancing, on the street of london, raining

a person is dancing → a person wearing blue jeans is dancing

a person is dancing → a brown bear is skateboarding

More Results with 48 Frames

+ with exquisite and rich makeup

+ with rich makeup

Comparisons

Source Video
Ours
Stable Diffusion
Tune-A-Video
vid2vid-zero
Video-P2P
FateZero

+ with red hair

+ Krenz Cushart style

+ with rich makeup

+ with rich makeup

BibTeX

@article{zhao2023controlvideo,
  title={ControlVideo: Adding Conditional Control for One Shot Text-to-Video Editing},
  author={Zhao, Min and Wang, Rongzhen and Bao, Fan and Li, Chongxuan and Zhu, Jun},
  journal={arXiv preprint arXiv:2305.17098},
  year={2023}
}