Automated Audio Source Analysis and Sound Effect Generation for Silent Videos using Vision Language Models

Auto-Foley is a proof-of-concept for a workflow that automates sound design for silent videos, using a vision language model for scene analysis and a text-to-sfx model for sound generation.

Click on a video to mute or unmute it.

Overview

The workflow instructs a vision language model to analyze the scene content and to generate text prompts for audio sources that could plausibly exist in it. These prompts are sent to a text-to-sfx model, and the resulting audio files are composited back onto the original video at the appropriate timings.
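The data flow above can be sketched in a few lines. This is a minimal illustration, not the actual implementation: the function name, the JSON schema, and field names like `start_s` are assumptions about what a structured VLM response could look like.

```python
import json

def parse_audio_sources(vlm_response: str) -> list[dict]:
    """Parse a (hypothetical) JSON answer from the vision language model
    into audio-source events.

    Each event carries a text prompt for the text-to-sfx model plus the
    timing at which the generated clip is composited back onto the video.
    """
    events = json.loads(vlm_response)
    return [
        {
            "prompt": e["prompt"],           # sent to the text-to-sfx model
            "start_s": float(e["start_s"]),  # placement in the final mix
            "duration_s": float(e["duration_s"]),
        }
        for e in events
    ]

# Illustrative VLM output for a short outdoor clip:
response = '''[
  {"prompt": "footsteps on gravel", "start_s": 0.0, "duration_s": 2.5},
  {"prompt": "distant birdsong, ambient", "start_s": 0.0, "duration_s": 6.0}
]'''
events = parse_audio_sources(response)
```

Each parsed event would then drive one text-to-sfx request and one compositing step.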

Proof-of-concept

This project explores the idea of using vision language models in sound design workflows.

Rather than focusing on a specific implementation of that idea, the workflow follows a modular design in which both the vision language model and the text-to-sfx service are replaceable.
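That modularity can be expressed as two small interfaces that any backend may satisfy. The protocol and method names below are illustrative assumptions, shown here with dummy backends in place of real API clients:

```python
from typing import Protocol

class VisionLanguageModel(Protocol):
    """Any service that can propose audio-source prompts for a video."""
    def describe_audio_sources(self, video_path: str) -> list[str]: ...

class TextToSfx(Protocol):
    """Any service that turns a text prompt into audio bytes."""
    def generate(self, prompt: str) -> bytes: ...

def run_pipeline(vlm: VisionLanguageModel, sfx: TextToSfx,
                 video_path: str) -> list[bytes]:
    # The pipeline only depends on the two protocols, so either
    # component can be swapped without touching this function.
    prompts = vlm.describe_audio_sources(video_path)
    return [sfx.generate(p) for p in prompts]

# Dummy backends standing in for real API clients:
class DummyVlm:
    def describe_audio_sources(self, video_path: str) -> list[str]:
        return ["rain on a window"]

class DummySfx:
    def generate(self, prompt: str) -> bytes:
        return b"RIFF..."  # placeholder for encoded audio

clips = run_pipeline(DummyVlm(), DummySfx(), "scene.mp4")
```

Swapping in a different VLM or sfx provider then only requires a thin adapter implementing the matching protocol.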

For the sake of simplicity and rapid development, the OpenAI and ElevenLabs API services were used. While functional, the output quality depends on both the complexity and the duration of the input video.

Comparison

Auto-Foley takes a different approach than traditional video-to-audio architectures such as Google DeepMind's V2A technology.

Converting the scene analysis to text before generating audio from it loses synchronization details in translation, whereas models like V2A generate synchronized audio directly from the visual input.

V2A

Auto-Foley

The Auto-Foley output has an incorrect rhythm pattern compared to the visual performance.

The use of a vision language model does offer one advantage: it can describe ambient audio sources that should be heard but are not visible in the input.

Because such inputs are short and non-complex, and because ambient background noise can mask small flaws in synchronization, Auto-Foley is still suitable as the final step of AI video generation workflows. A comparison with Tencent's Hunyuan Video shows how Auto-Foley performs in that scenario.

Auto-Foley

Auto-Foley shows how simply linking two readily available API services can produce results comparable to current solutions in certain settings.

Editor

Another benefit of the Auto-Foley workflow is that it enables human intervention between its two main components. Instead of producing a single audio track, Auto-Foley generates a separate file for each audio source. This allows for manual synchronization adjustments, removal of unwanted sounds, or generation of new audio sources.
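Keeping one file per audio source means the final mix is just a sum of tracks at their offsets, so a track can be shifted, removed, or regenerated before mixing. The sketch below illustrates this with plain sample lists and a toy sample rate; the real implementation would operate on audio buffers, and all names here are assumptions:

```python
SAMPLE_RATE = 4  # toy rate to keep the example readable

def mix(tracks: list[tuple[float, list[float]]]) -> list[float]:
    """Sum each (start_seconds, samples) track onto a single timeline."""
    end = max(int(start * SAMPLE_RATE) + len(s) for start, s in tracks)
    out = [0.0] * end
    for start, samples in tracks:
        offset = int(start * SAMPLE_RATE)  # nudging this value re-syncs a source
        for i, v in enumerate(samples):
            out[offset + i] += v
    return out

# Two independent source tracks; either could be edited or dropped
# before mixing -- the manual adjustments the editor supports.
mixed = mix([(0.0, [0.5, 0.5]), (0.5, [1.0])])
```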

Auto-Foley Editor demonstrates this editing capability in a Gradio interface.

Basic input

Advanced control