Jiahui (Gabriel) Huang,
Yuhe Jin,
Kwang Moo Yi,
Leonid Sigal
The University of British Columbia
[PDF]
[code]
This paper has been accepted to ECCV 2022 and selected for an Oral presentation.
We introduce layered controllable video generation, where, without any supervision, we decompose the initial frame of a video into foreground and background layers, with which the user can control the video generation process simply by manipulating the foreground mask. This video is a quick abstract of our method; please see the paper for more details.
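As a rough sketch of the layered formulation (the tensor shapes and function names below are our own illustrative assumptions, not the exact interface used in the paper), each generated frame can be viewed as a mask-weighted blend of a foreground layer and a background layer:

```python
import torch

def composite_frame(fg_rgb, bg_rgb, fg_mask):
    """Alpha-composite a foreground layer over a background layer.

    fg_rgb, bg_rgb: (B, 3, H, W) image tensors with values in [0, 1]
    fg_mask:        (B, 1, H, W) soft foreground mask in [0, 1]
    """
    # Each pixel is a blend: the mask selects the foreground,
    # (1 - mask) selects the background.
    return fg_mask * fg_rgb + (1.0 - fg_mask) * bg_rgb
```

Editing the foreground mask before this composition is what exposes the control handle to the user.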
We show a screen recording of a real-time demo of our model reacting to a user's inputs. In this demo, the user can specify not only the direction of movement but also its speed, demonstrating that our model accepts continuous control parameters rather than a discrete set of actions.
Here we compare videos generated by our method against videos generated by other baselines (SAVP [35], CADDY [37], and MoCoGAN [47]) on both the BAIR and Tennis datasets. We show results for all three variants of our model (see the paper for more details).
Most noticeably, on both datasets, all three variants of our model generate motions that consistently correspond to the ground-truth sequences, whereas the other baselines fail to do so to varying degrees. In addition, in terms of image quality, our method is superior to the competitors.
Here we show how our model reacts to user control signals. For each sample, we repeatedly apply the same movement control signal in the direction the arrows are pointing. Notice how the generated motions precisely follow the directions of the arrows.
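As an illustration of what such a directional control could look like (a minimal sketch under our own assumptions; the actual control interface in the paper may differ), the control can be realized as a per-step translation of the foreground mask, where the shift magnitude plays the role of speed:

```python
import torch
import torch.nn.functional as F

def shift_mask(mask, dx, dy):
    """Translate a foreground mask by (dx, dy) pixels.

    mask:   (B, 1, H, W) foreground mask
    dx, dy: horizontal / vertical shift in pixels; the sign gives the
            direction and the magnitude the speed.
    """
    B, _, H, W = mask.shape
    # Affine grid that samples the mask at positions offset by the
    # control vector (normalized to [-1, 1] grid coordinates).
    theta = torch.tensor(
        [[1.0, 0.0, -2.0 * dx / W],
         [0.0, 1.0, -2.0 * dy / H]],
        dtype=mask.dtype, device=mask.device,
    ).unsqueeze(0).repeat(B, 1, 1)
    grid = F.affine_grid(theta, mask.shape, align_corners=False)
    return F.grid_sample(mask, grid, align_corners=False)
```

Applying the same (dx, dy) at every step corresponds to the constant control signal used in the samples above.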
Action Mimicking: Our method can extract the motion from a driving sequence and apply it to different appearances (starting frames).
Frame Animation: We can animate a single input image with a variety of different motions.
Multi-Object Animation: Our method can generate and control videos with multiple moving objects by simply overlaying two individually controlled mask sequences, as sketched below.
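A minimal sketch of this overlay (the function name and tensor shapes are assumptions for illustration): two independently controlled masks can be merged with an element-wise maximum before being fed to the generator.

```python
import torch

def overlay_masks(mask_a, mask_b):
    """Merge two independently controlled foreground masks.

    mask_a, mask_b: (B, 1, H, W) masks in [0, 1]. The element-wise
    maximum keeps a pixel as foreground if either object occupies it.
    """
    return torch.maximum(mask_a, mask_b)
```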
[35]  Alex X. Lee, Richard Zhang, Frederik Ebert, Pieter Abbeel, Chelsea Finn, and Sergey Levine. "Stochastic adversarial video prediction." arXiv preprint, 2018.
[37]  Willi Menapace, Stephane Lathuiliere, Sergey Tulyakov, Aliaksandr Siarohin, and Elisa Ricci. "Playable Video Generation." CVPR, 2021.
[47]  Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. "MoCoGAN: Decomposing motion and content for video generation." CVPR, 2018.