Neural Audio Compression | What is Residual Vector Quantization?

AssemblyAI · 11.12.2024 · 2,172 views · 64 likes


Video description
AI-based methods for learnable codecs are revolutionizing how we store and transmit audio and video data, and lie at the heart of cutting-edge AI models like Google's SoundStream and Meta's EnCodec. Learn how RVQ and neural compression work in this video explainer.

References:
- Blog post on latency at AssemblyAI: https://www.assemblyai.com/blog/lower-latency-new-pricing/
- Tutorial on Text-to-Video apps in Python: https://youtu.be/Tlxe3l_m3PA
- Google's SoundStream: https://research.google/blog/soundstream-an-end-to-end-neural-audio-codec/
- Meta's EnCodec paper: https://arxiv.org/abs/2210.13438
- Paper on perceptual metrics: https://arxiv.org/pdf/1801.03924

Video sections:
00:00 Why learnable codecs?
01:50 Neural compression
03:17 RVQ
04:27 Final thoughts

▬▬▬▬▬▬▬▬▬▬▬▬ CONNECT ▬▬▬▬▬▬▬▬▬▬▬▬
🖥️ Website: https://www.assemblyai.com
🐦 Twitter: https://twitter.com/AssemblyAI
🦾 Discord: https://discord.gg/Cd8MyVJAXd
▶️ Subscribe: https://www.youtube.com/c/AssemblyAI?sub_confirmation=1
🔥 We're hiring! Check our open roles: https://www.assemblyai.com/careers
🔑 Get your AssemblyAI API key here: https://www.assemblyai.com/?utm_source=youtube&utm_medium=referral&utm_campaign=yt_marco_5

#MachineLearning #DeepLearning

Table of contents (4 segments)

Why learnable codecs?

Have you ever wondered how Spotify can stream high-quality music using less data than ever before, or how multimodal AI models can compress different types of sounds while preserving the qualities of the original audio? The secret lies in a technique called residual vector quantization, or RVQ. This powerful method is revolutionizing how we store and transmit audio, and it's at the heart of cutting-edge AI models like Google's SoundStream and Meta's EnCodec. In this video we'll break down how neural compression and RVQ work, and how AI-inspired approaches based on learnable codecs might eventually replace traditional codecs like MP3, reshaping the future of video and audio compression.

Before we dive in, let's understand why audio compression is so important. Did you know that over 80% of internet traffic today comes from streaming audio and video? That's a massive amount of data being transferred every single second. Traditional compression methods like MP3 work by removing parts of the audio that humans supposedly can't hear. While this works pretty well, it's a one-size-fits-all approach that can sometimes lead to noticeable quality loss. A fixed algorithmic approach can also be a poor fit outside its target domain: MP3, for example, is optimized for music but can struggle with other types of sound. Neural-network-based methods, however, offer the flexibility to adapt dynamically to diverse audio types, preserving quality across different contexts.

This is where neural compression takes a fundamentally different approach. Instead of following fixed rules about what humans can and can't hear, it learns patterns directly from the data and can compress audio in a way that preserves what matters most to human perception. What's really exciting is that when these models are combined with iterative denoising (diffusion) models for upscaling resolution, or other super-resolution techniques inspired by the computer vision domain, they can recreate data with extremely high faithfulness to the original. In practice, this might lead us to new and flexible near-lossless compression techniques. Let's start with

Neural compression

the basics of how neural audio compression works. Imagine you're trying to compress a few seconds of audio data. Here's what happens: first, an encoder neural network converts the audio waveform into a sequence of vectors. Think of these as something like DNA sequences that capture the essential characteristics, or features, of the sound. But there's a problem: the audio encoder produces vectors in a continuous space, so these vectors can take any combination of real numbers. In order to efficiently transmit them to the decoder, we need to replace them with the closest vectors from a fixed, finite set, a process called vector quantization. This finite set of reference vectors is called a codebook, essentially a lookup table. The idea is that instead of transmitting the original high-dimensional vectors, we can just send a single integer: the index of the closest matching vector in the codebook.

Sounds quite simple, right? But here's the catch. Bit rate refers to the amount of data used to encode the audio per second, and this strategy of replacing encoded vectors with their nearest codebook vectors works well in theory but runs into a problem as we increase the bit rate. At a very low bit rate like 1 kilobit per second the approach can work fine, but as the bit rate goes up, say to 3 kilobits per second with the encoder producing 100 vectors per second, we would need a giant codebook with more than a billion unique vectors to represent all the possible values, and that's just infeasible in practice. This is where traditional vector quantization hits a wall, and where residual vector quantization comes in.
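The nearest-neighbor lookup and the bit-rate arithmetic above can be sketched in a few lines of NumPy. This is a minimal illustration with a random, untrained codebook (all sizes and names here are hypothetical, not from SoundStream or EnCodec); real neural codecs learn the codebook jointly with the encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 8-dimensional feature vectors, a 64-entry codebook.
DIM, CODEBOOK_SIZE = 8, 64
codebook = rng.normal(size=(CODEBOOK_SIZE, DIM))

def quantize(vec, codebook):
    """Return the index of the nearest codebook vector (Euclidean distance)."""
    dists = np.linalg.norm(codebook - vec, axis=1)
    return int(np.argmin(dists))

# One continuous encoder output vector...
feature = rng.normal(size=DIM)
idx = quantize(feature, codebook)   # ...becomes a single transmitted integer,
reconstructed = codebook[idx]       # which the decoder looks up in its copy.

# The bit-rate math from the text: at 3 kbps with 100 vectors per second,
# each index gets 30 bits, so a flat codebook needs 2**30 entries.
bits_per_vector = 3000 // 100       # 30 bits
flat_codebook_size = 2 ** bits_per_vector
print(bits_per_vector, flat_codebook_size)  # 30 1073741824 (over a billion)
```

The last two lines make the "more than a billion unique vectors" claim concrete: 30 bits per index implies 2^30 ≈ 1.07 billion codebook entries.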

RVQ

RVQ solves this problem with a simple insight: instead of compressing everything at once, why not do it in stages? This might remind you of how deep neural networks process images, a discovery that revolutionized computer vision over the past decade. Researchers found that these networks naturally organize visual information across their layers: early layers detect basic features like edges and colors, middle layers combine these into more complex patterns, and deeper layers recognize high-level features like faces or objects.

But back to RVQ. The idea is to replace traditional vector quantization with a multi-layered approach, so that the feature vectors are processed over several quantization layers. The first layer quantizes the vectors at moderate resolution, and each subsequent layer processes the residual error from the previous one, obtained by taking the vector difference. By splitting the quantization process across multiple layers, the required codebook size can be reduced drastically: in our previous example, just five quantization layers in the RVQ component reduce the codebook size from over 1 billion vectors down to just 320. The beauty of this structure is that each layer only needs to handle a portion of the signal, making the whole process much more efficient than trying to capture everything at once with a single codebook. Neural compression
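The staged scheme above can be sketched as follows, again with random, untrained codebooks purely for illustration (the function names and sizes are assumptions, not the SoundStream or EnCodec implementation). Five layers of 64 vectors each carry the same 30 bits per frame as one 2^30-entry codebook, but store only 5 × 64 = 320 vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

DIM, LAYERS, LAYER_SIZE = 8, 5, 64   # 5 stages * 64 entries = 320 vectors total
codebooks = [rng.normal(size=(LAYER_SIZE, DIM)) for _ in range(LAYERS)]

def rvq_encode(vec, codebooks):
    """Quantize in stages: each layer encodes the residual left by the last."""
    residual = vec.copy()
    indices = []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(idx)
        residual = residual - cb[idx]   # pass the leftover error downstream
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected vector from every layer."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

vec = rng.normal(size=DIM)
codes = rvq_encode(vec, codebooks)   # 5 small integers, 6 bits each
approx = rvq_decode(codes, codebooks)
```

Note that 5 layers × log2(64) = 30 bits per frame, matching the 3 kbps / 100 vectors-per-second budget from the previous section, while the stored codebook shrinks from over a billion entries to 320.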

Final thoughts

methods based on residual vector quantization are revolutionizing audio and video codecs. Quantization techniques like RVQ have become essential to text-to-speech and text-to-music generators, for example. If you're interested in how we deal with model latency when managing large audio files for our API, check out Ryan's blog post linked in the description to learn how we can process and transcribe a 3-hour podcast in roughly 100 seconds with our models. Also, if you're looking to get hands-on with generative AI, check out Smitha's tutorial where she shows you how to create text-to-video applications in Python. It's a great starting point if you want to build something cool with these technologies, and I'll see you in the next videos.
