a car, Vincent van Gogh style
a car, autumn
a red car
a girl, red hair
a girl, at night, foggy, soft cinematic lighting
a girl, Krenz Cushart style
a wooden building, at night
a black swan is swimming in a river, Vincent van Gogh style
Michael Jackson is dancing
Sherlock Holmes is dancing, on the street of London, raining
a person wearing blue jeans is dancing
a person is dancing, Makoto Shinkai style
a brown bear is skateboarding
a girl with red hair
a girl, Krenz Cushart style
a girl with golden dress
a girl, at night, foggy, soft cinematic lighting
a red car
a black cat
a Swarovski crystal swan is swimming in a river
a jeep car is moving on the road, watercolor painting
a jeep car is moving on the road, beautiful autumn
a man on a snow mountain, realistic
the back view of a woman admiring a beautiful sunrise, early morning
the back view of a woman, Toei Animation style
a cake with romantic pure red candlestick, beautifully backlit, matte painting concept art
a girl with rich makeup
In this paper, we present ControlVideo, a novel method for text-driven video editing. Leveraging the capabilities of text-to-image diffusion models and ControlNet, ControlVideo aims to enhance the fidelity and temporal consistency of videos that align with a given text prompt while preserving the structure of the source video. This is achieved by incorporating additional conditions such as edge maps, and by fine-tuning the key-frame and temporal attention on the source video-text pair with carefully designed strategies. We conduct an in-depth exploration of ControlVideo's design to inform future research on one-shot tuning of video diffusion models. Quantitatively, ControlVideo outperforms a range of competitive baselines in faithfulness and consistency while still aligning with the text prompt. It also delivers videos with high visual realism and fidelity to the source content, demonstrating flexibility in using controls that carry varying degrees of source-video information, as well as the potential for combining multiple controls.
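To make the key-frame attention idea concrete, here is a toy numpy sketch, not the actual ControlVideo implementation (which operates inside a latent-diffusion UNet with learned Q/K/V projections — identity projections are assumed here for brevity):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def key_frame_attention(frames, key_idx=0):
    """Attention in which every frame's queries attend to the keys and
    values of one selected key frame, pulling all frames toward it.

    frames: array of shape (F, N, D) = (num_frames, tokens_per_frame, channels).
    Returns an array of the same shape.
    """
    F, N, D = frames.shape
    q = frames            # queries come from each frame
    k = frames[key_idx]   # keys come from the key frame only, shape (N, D)
    v = frames[key_idx]   # values come from the key frame only
    attn = softmax(q @ k.T / np.sqrt(D), axis=-1)  # (F, N, N)
    return attn @ v                                # (F, N, D)
```

Because every frame reads its values from the same key frame, the outputs share that frame's appearance statistics, which is the alignment effect the fine-tuning exploits.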
ControlVideo incorporates three key components: visual conditions on all frames to amplify the source video's guidance; key-frame attention, which aligns every frame with a selected key frame; and temporal attention modules followed by a zero convolutional layer, which together improve temporal consistency and faithfulness. The components and the corresponding fine-tuned parameters were chosen through a systematic empirical study. During inference, built on the trained ControlVideo, we first apply DDIM inversion to the source video and then generate the edited video from the target prompt via DDIM sampling.
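The DDIM inversion-then-sampling loop can be sketched with a toy noise predictor; this is an illustration of the generic DDIM update, not ControlVideo's fine-tuned UNet, and `eps_model` is a stand-in assumption:

```python
import numpy as np

def make_alpha_bars(T=50, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t), the standard DDPM/DDIM schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def ddim_step(x, eps, ab_from, ab_to):
    """One deterministic DDIM transition between two noise levels."""
    x0_pred = (x - np.sqrt(1.0 - ab_from) * eps) / np.sqrt(ab_from)
    return np.sqrt(ab_to) * x0_pred + np.sqrt(1.0 - ab_to) * eps

def ddim_invert(x0, eps_model, alpha_bars):
    """Run the DDIM update forward in time, x_0 -> x_T (inversion)."""
    x = x0
    ab = np.concatenate([[1.0], alpha_bars])  # ab[0] = 1 at t = 0
    for t in range(len(alpha_bars)):
        x = ddim_step(x, eps_model(x, t), ab[t], ab[t + 1])
    return x

def ddim_sample(xT, eps_model, alpha_bars):
    """Run the DDIM update backward in time, x_T -> x_0 (sampling)."""
    x = xT
    ab = np.concatenate([[1.0], alpha_bars])
    for t in reversed(range(len(alpha_bars))):
        x = ddim_step(x, eps_model(x, t), ab[t + 1], ab[t])
    return x
```

With a noise predictor that is independent of its inputs the round trip is exact; with a real text-conditioned UNet, inversion is only approximate, which is why editing then conditions sampling on the target prompt rather than expecting perfect reconstruction.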
+ with red hair
+ Krenz Cushart style
+ with golden dress
+ at night, foggy, soft cinematic lighting
a car → a red car
a cat → a black cat
a swan → a Swarovski crystal swan
+ watercolor painting
a car → a red car
+ autumn
+ with red hair
+ at night, foggy, soft cinematic lighting
a building → a wooden building, at night
+ Vincent van Gogh style
+ admiring a beautiful sunrise, early morning
+ Toei Animation style
+ with romantic pure red candlestick, beautifully backlit, matte painting concept art
ink diffuses in water → gentle green ink diffuses in water, beautiful light
a person is dancing → Michael Jackson is dancing
a person is dancing → Sherlock Holmes is dancing, on the street of London, raining
a person is dancing → a person wearing blue jeans is dancing
a person is dancing → a brown bear is skateboarding
+ with exquisite and rich makeup
+ with rich makeup
+ with red hair
+ Krenz Cushart style
+ with rich makeup
+ with rich makeup
ControlVideo builds on a number of excellent related works:
High-resolution image synthesis with latent diffusion models
Adding Conditional Control to Text-to-Image Diffusion Models
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
@article{zhao2023controlvideo,
title={ControlVideo: Adding Conditional Control for One Shot Text-to-Video Editing},
author={Zhao, Min and Wang, Rongzhen and Bao, Fan and Li, Chongxuan and Zhu, Jun},
journal={arXiv preprint arXiv:2305.17098},
year={2023}
}