
Microsoft Researchers Propose NUWA-XL: A Novel Diffusion Over Diffusion Architecture For Extremely Long Video Generation

By Dhanshree Shripad Shenwai


The field of generative models has recently seen a surge of interest in visual synthesis. Prior work has already demonstrated high-quality image generation, but video poses greater practical difficulties than images because of its duration: the average feature film runs over 90 minutes, a typical cartoon episode is about 30 minutes, and even short clips on platforms such as TikTok ideally run 21 to 34 seconds.

Microsoft’s research team has developed an innovative architecture for generating long videos. Most existing work produces long videos segment by segment in sequence, which creates a gap between training on short clips and inference on long videos, and the sequential generation itself is slow. The new method instead follows a coarse-to-fine process: a global diffusion model first produces keyframes spanning the whole duration, and local diffusion models then iteratively fill in the content between adjacent keyframes, so frames at the same granularity can be generated in parallel. This straightforward yet effective approach allows direct training on long videos, narrowing the training-inference gap, and lets all segments be generated in parallel.
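To make the coarse-to-fine idea concrete, here is a minimal sketch of the recursion: keyframes first, then repeated infilling between adjacent frames at finer and finer granularity. The `global_diffusion` and `local_diffusion` functions below are hypothetical placeholders (random frames and linear blending), not the paper's models, and in the real architecture each segment is conditioned on its own prompt.

```python
import numpy as np

FRAME_SHAPE = (64, 64, 3)  # toy resolution for illustration


def global_diffusion(prompts):
    """Produce one coarse keyframe per prompt (placeholder: random frames)."""
    return [np.random.rand(*FRAME_SHAPE) for _ in prompts]


def local_diffusion(prompt, first_frame, last_frame, n_new):
    """Fill n_new frames between two given frames (placeholder: linear blend)."""
    alphas = np.linspace(0.0, 1.0, n_new + 2)[1:-1]
    return [(1 - a) * first_frame + a * last_frame for a in alphas]


def generate_long_video(prompts, depth, frames_per_segment=3):
    """Recursively refine coarse keyframes into a dense frame sequence."""
    frames = global_diffusion(prompts)          # coarse storyline keyframes
    for _ in range(depth):                      # each level adds finer detail
        refined = [frames[0]]
        for left, right in zip(frames[:-1], frames[1:]):
            refined += local_diffusion(prompts[0], left, right, frames_per_segment)
            refined.append(right)
        frames = refined
    return frames


video = generate_long_video(["a cartoon chase scene"] * 4, depth=2)
print(f"generated {len(video)} frames")
```

Because every segment between two keyframes depends only on those two frames, all segments at a given level can in principle be denoised in parallel, which is where the reported inference speedup comes from.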

The most important contributions are as follows:

The research team proposes NUWA-XL, a “Diffusion over Diffusion” architecture, by framing long video generation as a novel “coarse-to-fine” process.

NUWA-XL is the first model directly trained on long videos (3376 frames), bridging the training-inference gap for generating such videos.

Parallel inference is made possible by NUWA-XL, which drastically shortens the time required to generate lengthy videos. When producing 1024 frames, NUWA-XL accelerates inference by 94.26 percent.

To validate the model’s efficacy and provide a benchmark for long video generation, the research team built a new dataset called FlintstonesHD.

Methods

Temporal KLVAE (T-KLVAE)

To avoid the computational burden of training and sampling diffusion models directly on pixels, KLVAE compresses an input image into a low-dimensional latent representation before the diffusion process is applied. To transfer knowledge from the pre-trained image KLVAE to video, the researchers propose Temporal KLVAE (T-KLVAE), which augments the original spatial modules with added temporal convolution and attention layers.
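The sketch below illustrates the general idea of adding temporal layers on top of pre-trained spatial ones; the module name, shapes, and residual design are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class TemporalBlock(nn.Module):
    """Hypothetical temporal add-on for a spatial (image) autoencoder."""

    def __init__(self, channels, num_heads=4):
        super().__init__()
        # 1D convolution mixes information across time at each spatial location
        self.temporal_conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        # self-attention over the time axis captures longer-range dependencies
        self.temporal_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x):
        # x: (batch, time, channels, height, width)
        b, t, c, h, w = x.shape
        y = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)   # (B*H*W, C, T)
        y = self.temporal_conv(y)
        y = y.permute(0, 2, 1)                                  # (B*H*W, T, C)
        y, _ = self.temporal_attn(y, y, y)
        y = y.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)     # back to (B, T, C, H, W)
        return x + y  # residual path leaves the pre-trained spatial behavior intact


frames = torch.randn(2, 8, 16, 32, 32)  # (batch, time, latent channels, H, W)
print(TemporalBlock(16)(frames).shape)
```

Keeping the new temporal layers on a residual path is one common way to reuse image-pretrained weights while learning video dynamics.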

Mask Temporal Diffusion (MTD)

As the foundational diffusion model of the proposed Diffusion over Diffusion architecture, the researchers present Mask Temporal Diffusion (MTD). Global diffusion forms the “coarse” storyline of the video from L prompts alone, while local diffusion additionally takes the first and last frames of a segment as inputs. The proposed MTD accepts input conditions with or without first and last frames, so a single model serves both global and local diffusion. The paper lays out the full MTD pipeline and then uses an UpBlock to illustrate how the different input conditions are fused.
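The following is a minimal sketch of how one model can handle both conditioning modes via a frame mask; `denoise_step` is a hypothetical stand-in for a video U-Net step, not the authors' code.

```python
import torch


def build_condition(frames, use_first_last):
    """Return (condition_frames, mask) for a clip of shape (T, C, H, W)."""
    cond = torch.zeros_like(frames)
    mask = torch.zeros(frames.shape[0], 1, 1, 1)
    if use_first_last:               # local diffusion: first and last frames are known
        cond[0], cond[-1] = frames[0], frames[-1]
        mask[0], mask[-1] = 1.0, 1.0
    return cond, mask                # global diffusion: all-zero condition and mask


def denoise_step(noisy, cond, mask, prompt_emb):
    """Placeholder for one denoising step; a real model would be a video U-Net."""
    return noisy - 0.1 * mask * (noisy - cond)   # nudge known frames toward the condition


clip = torch.randn(16, 3, 64, 64)                # a 16-frame segment
prompt_emb = torch.randn(1, 512)                 # hypothetical text embedding

# Local diffusion: condition on the first and last frames of the segment.
cond, mask = build_condition(clip, use_first_last=True)
x = torch.randn_like(clip)
for _ in range(10):
    x = denoise_step(x, cond, mask, prompt_emb)
print(x.shape)
```

Passing `use_first_last=False` yields an all-zero mask, i.e. the unconditional (global) case, which is how one network can cover both levels of the hierarchy.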

Although the proposed NUWA-XL improves the quality of long video generation and accelerates inference, some limitations remain. First, the researchers validate NUWA-XL only on the publicly available Flintstones cartoon, because no open-domain long video dataset (such as movies or TV episodes) is currently available; with preliminary progress toward building such a dataset, they hope to eventually extend NUWA-XL to the open domain. Second, direct training on long videos narrows the training-inference gap but places heavy demands on data. Finally, although NUWA-XL speeds up inference, this improvement requires powerful GPUs to support parallel inference.

In summary, the researchers propose NUWA-XL, a “Diffusion over Diffusion” architecture, by framing long video generation as a novel “coarse-to-fine” process. NUWA-XL is the first model directly trained on long videos (3376 frames), bridging the training-inference gap in long video generation, and its parallel inference speeds up the generation of 1024 frames by 94.26 percent. To further verify the model’s efficacy and offer a benchmark for long video generation, they construct the new FlintstonesHD dataset.

Check out the Paper and Project. All credit for this research goes to the researchers on this project. Also, don’t forget to join our 26k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.


