[Arxiv] [Code]

We introduce WavCraft, an AI-empowered assistant that leverages large language models (LLMs) to edit audio content following human instructions. Specifically, WavCraft prompts LLMs to decompose users’ demands into several tasks and tackle each task collaboratively with the corresponding model. By combining in-context learning with a set of expert models, WavCraft enriches audio content with more details and rationales, helping users control the quality of the audio. Moreover, WavCraft can cooperate with humans via dialogue interaction and even create audio content without specific user guidance. Experiments demonstrate that WavCraft outperforms existing methods, especially when editing local regions of audio clips. WavCraft can also follow complex instructions to edit and even create audio content on top of input recordings, which further meets the demands of audio producers in practice.
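The decompose-then-dispatch idea described above can be illustrated with a minimal sketch. This is not WavCraft's actual implementation: the task names (`generate`, `mix_at_start`), the plan format, and the toy "expert" functions are all hypothetical, and the plan is written by hand in place of a real LLM response. Audio is represented as a plain list of samples to keep the example self-contained.

```python
from typing import Callable, Dict, List

Audio = List[float]  # mono samples; a stand-in for a real waveform


def generate_clip(length: int, value: float) -> Audio:
    """Hypothetical generation expert: returns a constant-valued clip."""
    return [value] * length


def mix_at_start(base: Audio, clip: Audio) -> Audio:
    """Hypothetical mixing expert: overlays `clip` on the start of `base`."""
    # Copy the base and pad with silence if the clip is longer than the base.
    out = list(base) + [0.0] * max(0, len(clip) - len(base))
    for i, sample in enumerate(clip):
        out[i] += sample
    return out


# Registry mapping task names (assumed, not from the paper) to expert functions.
EXPERTS: Dict[str, Callable[..., Audio]] = {
    "generate": generate_clip,
    "mix_at_start": mix_at_start,
}


def run_plan(plan: List[dict], inputs: Dict[str, Audio]) -> Dict[str, Audio]:
    """Run each step with its expert, storing results under named variables."""
    env = dict(inputs)
    for step in plan:
        expert = EXPERTS[step["task"]]
        # String arguments that name an existing clip are resolved from env.
        args = {
            key: env[val] if isinstance(val, str) and val in env else val
            for key, val in step["args"].items()
        }
        env[step["out"]] = expert(**args)
    return env


# Hand-written plan standing in for what an LLM might return for the
# instruction "add a bell in the beginning".
plan = [
    {"task": "generate", "args": {"length": 2, "value": 0.5}, "out": "bell"},
    {"task": "mix_at_start", "args": {"base": "input", "clip": "bell"}, "out": "edited"},
]

result = run_plan(plan, {"input": [0.25, 0.25, 0.25, 0.25]})
print(result["edited"])  # [0.75, 0.75, 0.25, 0.25]
```

In this sketch the plan is plain data, so each step stays inspectable and can be revised before execution, which loosely mirrors the dialogue-based control described in the abstract.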

<aside> 📖 Basic features

Advanced features

Basic features

We present case studies comparing WavCraft with SOTA end-to-end audio editing and generation models. For audio editing, we evaluated (a) SDEdit, (b) AUDIT, and (c) WavCraft. For text-to-audio generation, we evaluated (a) AudioLDM, (b) Tangle, (c) WavJourney, and (d) WavCraft.


Instruction: add a bell in the beginning


Machine gun, while bell in the beginning_input.wav


Machine gun, while bell in the beginning_sdedit.wav


Machine gun, while bell in the beginning_audit.wav


Machine gun, while bell in the beginning_wavcraft.wav


Instruction: drop a short firework explosion in the end


Vehicle horn, car horn, honking, while fireworks in the end.wav


Vehicle horn, car horn, honking, while fireworks in the end_audit.wav


Vehicle horn, car horn, honking, while fireworks in the end_sedit.wav


Vehicle horn, car horn, honking, while fireworks in the end_output.wav


Instruction: replace wind instrument with drum kit