Google today detailed SoundStream, an end-to-end “neural” audio codec that can provide higher-quality audio while encoding different sound types, including clean speech, noisy and reverberant speech, music, and environmental sounds. The company claims this is the first AI-powered codec to work on both speech and music while running in real time on a smartphone processor.
Audio codecs compress audio to reduce storage and bandwidth requirements. Ideally, the decoded audio should be perceptually indistinguishable from the original, and the codec should introduce little latency. While most codecs leverage domain expertise and carefully engineered signal processing pipelines, there’s been interest in replacing handcrafted pipelines with AI that can learn to encode on the fly.
Earlier this year, Google released Lyra, a neural audio codec trained to compress low-bitrate speech. SoundStream extends this work with a system consisting of an encoder, decoder, and quantizer. The encoder converts audio into a coded signal that’s compressed using the quantizer and converted back to audio using the decoder. Once trained, the encoder and decoder can be run on separate clients to transmit audio over the internet, and the decoder can operate at any bitrate.
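To make the encoder, quantizer, and decoder roles concrete, here is a minimal toy sketch. It is not Google's implementation: SoundStream's components are learned neural networks, whereas this example assumes a fixed framing step as the “encoder” and a tiny hand-written codebook. The names `FRAME`, `encode`, `quantize`, `decode`, and `bitrate_bps` are hypothetical and exist only for illustration.

```python
import math

FRAME = 4  # hypothetical frame length, in samples


def encode(samples):
    """Toy 'encoder': split the signal into fixed-size frames.
    (In a neural codec this step is a learned network.)"""
    return [samples[i:i + FRAME] for i in range(0, len(samples) - FRAME + 1, FRAME)]


def quantize(frames, codebook):
    """Replace each frame with the index of its nearest codebook vector.
    Only these small integers need to be transmitted."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: dist(f, codebook[k]))
            for f in frames]


def decode(indices, codebook):
    """Toy 'decoder': look up each index and concatenate the vectors."""
    out = []
    for k in indices:
        out.extend(codebook[k])
    return out


def bitrate_bps(codebook_size, frames_per_second):
    """Bits per second when each frame is sent as one codebook index."""
    return math.log2(codebook_size) * frames_per_second


# Illustrative data: a four-entry codebook costs 2 bits per frame.
codebook = [[0, 0, 0, 0], [1, 1, 1, 1], [-1, -1, -1, -1], [1, -1, 1, -1]]
samples = [0.9, 1.1, 1.0, 0.8, -0.1, 0.0, 0.1, 0.0]

indices = quantize(encode(samples), codebook)   # sender side
reconstructed = decode(indices, codebook)       # receiver side
```

In this sketch the sender transmits only `indices`, and the receiver needs only the shared codebook to rebuild an approximation of the signal. The bitrate follows from how many bits each index takes and how many frames are sent per second.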
In traditional audio processing pipelines, compression and enhancement (i.e., the removal of background noise) are typically performed by different modules. But SoundStream is designed to carry out compression and enhancement at the same time. At 3kbps, SoundStream outperforms the popular Opus codec at 12kbps and approaches the quality of EVS at 9.6kbps while using 3.2 to 4 times fewer bits, Google claims. Moreover, SoundStream performs better than the current version of Lyra when compared at the same bitrate.
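The “3.2 to 4 times fewer bits” figure follows directly from the bitrates quoted above, as this short calculation shows. The variable names are illustrative, and the per-minute size is simple arithmetic on the 3kbps figure, not a number from Google.

```python
# Bitrates quoted in the article, in bits per second.
SOUNDSTREAM_BPS = 3_000
OPUS_BPS = 12_000
EVS_BPS = 9_600

# How many times more bits the comparison codecs use.
ratio_vs_opus = OPUS_BPS / SOUNDSTREAM_BPS   # 4.0
ratio_vs_evs = EVS_BPS / SOUNDSTREAM_BPS     # 3.2

# Payload size of one minute of audio at 3kbps (8 bits per byte).
bytes_per_minute = SOUNDSTREAM_BPS * 60 // 8  # 22500 bytes
```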
Here’s reference audio before processing with SoundStream:
And here’s the audio after processing:
Google cautions that SoundStream is still in the experimental stages. However, the company plans to release an updated version of Lyra that incorporates its components to deliver both higher audio quality and “reduced complexity.”
“Efficient compression is necessary whenever one needs to transmit audio, whether when streaming a video or during a conference call. SoundStream is an important step toward improving machine learning-driven audio codecs. It outperforms state-of-the-art codecs, such as Opus and EVS, can enhance audio on demand, and requires deployment of only a single scalable model, rather than many,” Google research scientist Neil Zeghidour and staff research scientist Marco Tagliasacchi wrote in a blog post. “By integrating SoundStream with Lyra, developers can leverage the existing Lyra APIs and tools for their work, providing both flexibility and better sound quality.”