Click on a video to mute or unmute it.
The workflow instructs a vision language model to analyze the contents of the scenes and to generate text prompts for possible audio sources that could exist in them. These prompts are then sent to a text-to-sfx model. The resulting audio files are then composited back onto the original video at their appropriate timings.
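The handoff between the two models can be represented by a small intermediate schema. A minimal sketch of that idea (the `AudioEvent` structure and the JSON field names are illustrative assumptions, not the project's actual format):

```python
import json
from dataclasses import dataclass

@dataclass
class AudioEvent:
    """One audio source proposed by the vision language model."""
    prompt: str        # text prompt passed to the text-to-sfx model
    start_s: float     # onset within the video, in seconds
    duration_s: float  # how long the sound should last

def parse_vlm_response(raw: str) -> list[AudioEvent]:
    """Parse the VLM's JSON answer into audio events.

    Assumes the model was instructed to reply with a list of
    {"prompt": ..., "start": ..., "duration": ...} objects.
    """
    return [
        AudioEvent(e["prompt"], float(e["start"]), float(e["duration"]))
        for e in json.loads(raw)
    ]

# Example of a response the VLM might produce for a street scene:
raw = '[{"prompt": "car passing by on wet asphalt", "start": 1.5, "duration": 3.0}]'
events = parse_vlm_response(raw)
```

Each event can then be sent to the text-to-sfx service independently, and its `start_s` value decides where the resulting clip is placed in the final mix.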
This project explores the idea of using vision language models in sound design workflows.
Rather than focusing on a specific implementation of that idea, the workflow follows a modular design in which both the vision language model and the text-to-sfx service are replaceable.
For the sake of simplicity and rapid development, the API services of OpenAI and ElevenLabs were used. While functional, the quality of the output is influenced by both the complexity and duration of the input video.
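Both services are plain HTTP APIs, so swapping either one out means changing little more than a request payload. A hedged sketch of what the two payloads look like (endpoint paths and field names reflect the public OpenAI chat-completions and ElevenLabs sound-generation APIs at the time of writing; the model name and instruction text are placeholders and should be checked against current documentation):

```python
import json

OPENAI_URL = "https://api.openai.com/v1/chat/completions"
ELEVENLABS_URL = "https://api.elevenlabs.io/v1/sound-generation"

def vision_request(frame_url: str) -> dict:
    """Chat-completions payload asking a vision model for audio sources."""
    return {
        "model": "gpt-4o",  # placeholder model name
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "List audio sources that could exist in this scene as JSON."},
                {"type": "image_url", "image_url": {"url": frame_url}},
            ],
        }],
    }

def sfx_request(prompt: str, duration_s: float) -> dict:
    """Sound-generation payload; field names follow ElevenLabs'
    sound-effects docs at the time of writing and may change."""
    return {"text": prompt, "duration_seconds": duration_s}

body = json.dumps(sfx_request("rain on a tin roof", 4.0))
```

Replacing either service then only requires a new payload builder, which is what makes the modular design above practical.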
Auto-Foley takes a different approach compared to traditional video-to-audio architectures, like Google DeepMind's V2A technology.
Converting the scene analysis to text before generating audio from it causes synchronization details to get lost in translation, whereas models like V2A generate synchronized audio directly from the visual input.
Auto-Foley
The Auto-Foley output has an incorrect rhythm pattern compared to the visual performance.
Using a vision language model does offer one advantage: it can describe ambient audio sources that should be heard, but are not visible in the input.
Since AI-generated videos tend to be short and simple, and since ambient background noise masks small flaws in synchronization, Auto-Foley remains suitable as the final step in AI video generation workflows. A comparison with Tencent's Hunyuan Video shows how Auto-Foley performs in that scenario.
Auto-Foley shows how simply linking two readily available API services together can create results comparable to current solutions in certain environments.
Another benefit of the Auto-Foley workflow is that it enables human intervention between its two main components. Instead of producing a single audio track, Auto-Foley generates a separate file for each audio source. This allows for manual synchronization adjustments, removal of unwanted sounds, or generation of new audio sources.
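With each source in its own file, an edit is just a change to that source's entry before the final mix. A minimal sketch of such a mixing step (pure Python at the sample level; the function and parameter names are illustrative, not taken from the project):

```python
def mix_sources(sources, total_samples):
    """Overlay per-source clips onto a silent master track.

    sources: list of (samples, offset) pairs, where `samples` is a
    list of float amplitudes and `offset` is the clip's start index.
    Shifting `offset` re-synchronizes a clip; dropping a pair from
    the list removes that sound entirely.
    """
    master = [0.0] * total_samples
    for samples, offset in sources:
        for i, s in enumerate(samples):
            j = offset + i
            if 0 <= j < total_samples:
                master[j] += s  # simple additive mix
    return master

# Two clips: the second nudged 2 samples later to fix its timing.
mixed = mix_sources([([1.0, 1.0], 0), ([0.5], 2)], 4)
# → [1.0, 1.0, 0.5, 0.0]
```

In practice the same adjustments would be applied to the generated audio files themselves, but the principle is the same: synchronization, removal, and regeneration all operate on individual sources rather than a single baked-in track.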
Auto-Foley Editor demonstrates this editing capability in a Gradio interface.
Basic input
Advanced control