[ECCV 2024] VideoStudio: Generating Consistent-Content and Multi-Scene Videos

Fuchen Long, Zhaofan Qiu, Ting Yao and Tao Mei

HiDream.ai Inc.

Abstract

Recent breakthroughs in diffusion models have significantly expanded the possibilities of generating high-quality videos from text prompts. Most existing works tackle the single-scene scenario, in which only one video event occurs against a single background. Extending generation to multi-scene videos is nevertheless not trivial: it requires managing the logic between scenes while preserving a consistent visual appearance of key content across them. In this paper, we propose a novel framework, namely VideoStudio, for consistent-content and multi-scene video generation. Technically, VideoStudio leverages Large Language Models (LLMs) to convert the input prompt into a comprehensive multi-scene script that benefits from the logical knowledge learnt by the LLM. The script for each scene includes a prompt describing the event, the foreground/background entities, and the camera movement. VideoStudio identifies the common entities throughout the script and asks the LLM to detail each entity. The resultant entity description is then fed into a text-to-image model to generate a reference image for each entity. Finally, VideoStudio outputs a multi-scene video by generating each scene video via a diffusion process that takes the reference images, the descriptive prompt of the event, and the camera movement into account. The diffusion model incorporates the reference images as the condition and alignment to strengthen the content consistency of multi-scene videos. Extensive experiments demonstrate that VideoStudio outperforms state-of-the-art (SOTA) video generation models in terms of visual quality, content consistency, and user preference.

Framework

An overview of our VideoStudio framework for consistent-content and multi-scene video generation. VideoStudio consists of three main stages: (1) multi-scene video script generation, (2) entity reference image generation, and (3) video scene generation. In the first stage, an LLM converts the input prompt into a comprehensive multi-scene script. The script for each scene includes the descriptive prompt of the event in the scene, a list of foreground objects or persons, the background, and the camera movement. We then ask the LLM to detail the common foreground/background entities across scenes. In the second stage, these entity descriptions are fed into a text-to-image (T2I) model to produce reference images. Finally, in the third stage, VideoStudio-Img takes the descriptive prompt of the event and the reference images of the entities in each scene as conditions to generate a scene-reference image. VideoStudio-Vid then takes the scene-reference image, together with the temporal dynamics of the action depicted in the descriptive prompt and the camera movement from the script, as inputs and produces a video clip for each scene.
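The three stages can be read as a simple data flow. The following is a minimal sketch of that flow, not the released implementation: the `SceneScript` fields, the `common_entities` rule (an entity counts as "common" if it appears in at least two scenes), and the five injected callables (`write_script`, `describe_entity`, `text_to_image`, `scene_image`, `scene_video`) are all hypothetical names chosen for illustration. The last two stand in for the roles played by VideoStudio-Img and VideoStudio-Vid.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class SceneScript:
    """One scene of the multi-scene script (field names are illustrative)."""
    prompt: str            # descriptive prompt of the event
    foreground: List[str]  # foreground objects or persons
    background: str        # background entity
    camera_movement: str   # e.g. "static", "zoom in"

    def entities(self) -> List[str]:
        return self.foreground + [self.background]


def common_entities(script: List[SceneScript], min_scenes: int = 2) -> List[str]:
    """Entities appearing in at least `min_scenes` scenes, in first-seen order.

    The threshold is an assumption of this sketch, not taken from the paper.
    """
    counts: Counter = Counter()
    for scene in script:
        counts.update(set(scene.entities()))  # count each entity once per scene
    seen, result = set(), []
    for scene in script:
        for name in scene.entities():
            if counts[name] >= min_scenes and name not in seen:
                seen.add(name)
                result.append(name)
    return result


def generate_multi_scene_video(
    user_prompt: str,
    write_script: Callable[[str], List[SceneScript]],          # stage 1: LLM
    describe_entity: Callable[[str, List[SceneScript]], str],  # stage 1: LLM
    text_to_image: Callable[[str], Any],                       # stage 2: T2I model
    scene_image: Callable[[str, List[Any]], Any],              # stage 3: image step
    scene_video: Callable[[Any, str, str], Any],               # stage 3: video step
) -> List[Any]:
    """Run the three stages and return one clip per scene."""
    # Stage 1: prompt -> multi-scene script.
    script = write_script(user_prompt)
    # Stage 2: one reference image per common entity.
    refs: Dict[str, Any] = {
        name: text_to_image(describe_entity(name, script))
        for name in common_entities(script)
    }
    # Stage 3: scene-reference image, then a clip, for each scene.
    clips = []
    for scene in script:
        scene_refs = [refs[n] for n in scene.entities() if n in refs]
        image = scene_image(scene.prompt, scene_refs)
        clips.append(scene_video(image, scene.prompt, scene.camera_movement))
    return clips


# Toy script (hand-written here; in the framework an LLM would produce it).
demo_script = [
    SceneScript("Cat puts cherry on a pie.", ["cat", "pie"], "kitchen", "static"),
    SceneScript("Cat puts pie on the table.", ["cat", "pie", "table"], "kitchen", "zoom in"),
    SceneScript("There is a house and many trees.", ["house"], "forest", "pan left"),
]
print(common_entities(demo_script))  # → ['cat', 'pie', 'kitchen']
```

Passing the models in as callables keeps the sketch independent of any particular LLM, T2I, or video backbone. Because the reference images are created once and reused in every scene that mentions the entity, the same appearance can carry across clips.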

Results

ActivityNet Captions dataset: given multi-scene prompts

input prompt:

  • A woman is resting next to crashing water.
  • She is smoking a pipe.
  • She blows out a plume of smoke.

input prompt:

  • A woman stands in a room.
  • She is wearing a blue polka dot dress.
  • She is playing a violin.
  • She smiles at the end.

input prompt:

  • This woman is shown putting on her makeup while looking in the mirror.
  • First she makes her eyes up and winks at the camera.
  • Then she says something and turns back around to do what she was doing.

input prompt:

  • A man is seen sitting in front of a bucket holding clothes in his hands.
  • Another person is seen sitting next to him sticking his hand in a bucket.
  • The men continue washing clothes next to one another while dipping them into a bucket.

Coref-SV dataset: given multi-scene prompts

input prompt:

  • There is a house and many trees
  • Cat puts cherry on a pie. Cat is done with the pie.
  • Cat puts pie on the table. Cat looks very happy. There are bread, a book, apples, and a pie on the table.
  • Cat tastes pie and Cat thinks it is delicious. Cat turns over the page.
  • Cat marvels at the picture on the book. Cat eats a piece of pie.

input prompt:

  • Mouse is looking for something in Mouse's library.
  • Mouse is standing on the ladder and Mouse is finding something on the bookshelf.
  • Mouse found the book. Mouse climbs down a ladder.
  • Mouse looks at the book and questions himself.
  • Mouse came up with an idea and Mouse decides to make something.

input prompt:

  • Teddy-Bear is reading a book, turning the page by his right paw.
  • In the story, Teddy-Bear wears an armor, holding a sword and riding on a white horse.
  • There are trees outside the window. Teddy-Bear cheers by raising both his paws.
  • The clock is running fast. Teddy-Bear is reading a book.

input prompt:

  • There is a mountain covered with snow and trees on it. Mouse is reading a book.
  • Mouse is holding a book and makes a happy face.
  • Mouse looks happy and talks.
  • Mouse is holding a flower by her right paw.
  • Mouse is smiling and talking while holding a flower on her right paw.
  • Mouse is ripping a petal from the flower.
  • Mouse is pulling petals off the flower.

MSR-VTT dataset: given single-sentence prompts

input prompt:

  • A camera captures the train in beautiful scenic landscape

input prompt:

  • A young man with blue hair is making cake.

input prompt:

  • Spanish language music video

input prompt:

  • A person with red clothes is preparing dessert in the kitchen.

input prompt:

  • A baby girl is sitting in the cradle with mother

input prompt:

  • A man and a woman drive a car from hills to city

Given real images as reference images

input prompt:

  • The cat lies in the room
  • The cat lies in the driving car
  • The cat plays in the flowers

Real Reference Images:

Generated Video:

input prompt:

  • The parrot stands in the bedroom
  • The parrot stands in the forest
  • The parrot stands in front of the river

Real Reference Images:

Generated Video:

input prompt:

  • The motorcyclist stays in the town
  • The motorcyclist is riding on the road under the sunset
  • The motorcyclist is riding on the moon

Real Reference Images:

Generated Video: