# Evolution of Autoencoders - Explained!

## Metadata

- **Channel:** CodeEmporium
- **YouTube:** https://www.youtube.com/watch?v=XyWNmHZi1oA
- **Date:** 06.04.2026
- **Duration:** 27:49
- **Views:** 1,146
- **Source:** https://ekstraktznaniy.ru/video/49643

## Description

In this video, we take a look at a core component of DALL-E text-to-image generation: discrete autoencoders. What are they? Why do we have them? What do they look like? We specifically look at vanilla autoencoders, variational autoencoders and VQ-VAEs.

ABOUT ME
⭕ Subscribe: https://www.youtube.com/c/CodeEmporium?sub_confirmation=1
📚 Medium Blog: https://medium.com/@dataemporium
💻 Github: https://github.com/ajhalthor
👔 LinkedIn: https://www.linkedin.com/in/ajay-halthor-477974bb/

RESOURCES
[1 📚] Slides: https://link.excalidraw.com/p/readonly/GElfVd51jvXhYuQ6yjns
[2 📚] Paper that suggests Autoencoders address “back propagation without a teacher”: https://proceedings.mlr.press/v27/baldi12a/baldi12a.pdf
[3 📚] Early paper on auto encoders that compared performance vs PCA (2006): https://www.cs.toronto.edu/~hinton/absps/science.pdf 
[4 📚] 2013 VAE paper: https://arxiv.org/abs/1312.6114?utm_source=chatgpt.com 
[5 📚] 2017 VQ-VAE paper: https://papers.nips.cc/paper_files/paper/2017/file/7a98af17e63a0a

## Transcript

### What is DALL-E? [0:00]

Greetings, fellow learners. In this video, we are going to talk about the core of DALL-E: autoencoders. This is the first in a two-part series on DALL-E, a system that takes some text and generates an image related to that text. One core component of it is a discrete variational autoencoder. So, before explaining DALL-E as a system, I thought it would be nice to understand the progression of ideas that made DALL-E possible. Specifically, that's the evolution of autoencoders: from autoencoders for compression, to variational autoencoders that introduced the generation component, to vector quantized variational autoencoders (VQ-VAEs) that added discretization. And it is a type of discrete variational autoencoder, like this VQ-VAE, that is actually going to be used in DALL-E, which we'll look at in the next video. So, let's take a look at these three concepts: the what, the why, and the how.

### Autoencoders [1:08]

So, let's start with autoencoders. As a brief one-line explanation, an autoencoder is a neural network that is trained to compress and reproduce an input signal. You can imagine that it has an encoder-decoder architecture, and each of these could be, for example, feed-forward layers. They could be convolutional networks, or, in modern day, they can also be transformer layers. We take an input and pass it to the encoder, which compresses it into this vector here, the latent code. That is then passed into a decoder that reconstructs the original image. So, why do we have autoencoders? Well, it is to learn a compressed representation of the input image via backpropagation without a supervisor. When I say supervisor, I mean that there is no explicit label attached to this data. The label is the original image itself, which serves as the reconstruction target, and the network learns via backpropagation of errors. So, how are autoencoders trained? They're trained via backpropagation to minimize a reconstruction loss. This reconstruction loss could be a mean squared error for color images, where we compare pixel values one by one, or a binary cross-entropy loss for binary (black-and-white) images. So, that's the core essence of how autoencoders are trained and how they work.
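The two reconstruction losses mentioned above can be sketched in a few lines. This is a minimal illustration in plain Python, not code from the video: images are assumed to be flattened lists of pixel values in [0, 1], and the names `mse_loss`, `bce_loss`, and the sample pixel lists are made up for the example. A real model would use a tensor library and average over a batch.

```python
import math

def mse_loss(original, reconstruction):
    """Mean squared error, comparing pixel values one by one."""
    n = len(original)
    return sum((x - y) ** 2 for x, y in zip(original, reconstruction)) / n

def bce_loss(original, reconstruction, eps=1e-7):
    """Binary cross-entropy: original pixels are 0 or 1,
    reconstruction holds predicted probabilities in (0, 1)."""
    total = 0.0
    for x, p in zip(original, reconstruction):
        p = min(max(p, eps), 1.0 - eps)  # clamp so log() never sees 0
        total += -(x * math.log(p) + (1 - x) * math.log(1 - p))
    return total / len(original)

# Hypothetical 4-pixel binary image and two candidate reconstructions.
pixels = [0.0, 1.0, 1.0, 0.0]
good = [0.1, 0.9, 0.8, 0.2]   # close to the input
bad = [0.9, 0.1, 0.2, 0.8]    # nearly inverted

print(mse_loss(pixels, good), mse_loss(pixels, bad))
print(bce_loss(pixels, good), bce_loss(pixels, bad))
```

In both cases the closer reconstruction gets the lower loss, which is the signal backpropagation uses to update the encoder and decoder.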

### What are Variational Autoencoders? [2:59]

So, let's now move on to the second concept, that is VAEs. So, variational autoencoders, these are neural networks that are trained to compress images into a latent space, just like autoencoders, but that latent space is probabilistic and structured. So, let's see why and how that works.

### How VAE is better suited for generation than AE. [3:22]

So, why do we have variational autoencoders? Let's say that we train this autoencoder on dog images. Now, when we want to do some generation, we just need the decoder to generate the image. But we need to sample this input vector from somewhere. So, what should it actually be? If you just sample random values for this vector from some arbitrary distribution, you'll just get gibberish. But what if there was a specific distribution that we could sample from to get latent values that always generate some image? By that, I mean: let's say that we have this entire embedding space, where any possible vector that we can input to the decoder exists. And let's say that there is this one small region, let's just call it this yellow-looking puddle over here, where we can sample a vector, pass it to the decoder, and it will always generate at least some relevant image. If I sample a different point in this region, we get another vector and another image. Sample yet another point, and it will generate yet another image. Variational autoencoders do this at a high level, and that is why we have them. So, succinctly stated: randomly sampling a vector to input to the decoder can give us a garbage output. Variational autoencoders solve this by allowing us to sample the vector from a continuous, fixed distribution. That distribution tends to be a Gaussian distribution, but it can in principle be something else.

### VAE structure and forward pass [5:18]

So, now that we understand the what and the why of variational autoencoders, let's look at how the forward and backward passes are done. Variational autoencoders have three main components: an encoder, a sampling layer over here, and then a decoder over here. We first take an input image and pass it to an encoder, which, again, let's just say is a convolutional network. The idea of this encoder is that it outputs the parameters of a distribution from which the latent vector is going to come. Now, in order to describe a probability distribution, you typically just need its parameters. For example, for a Gaussian distribution, you need to know the mean and the standard deviation. If you know those two parameter values, you know what the Gaussian looks like. So, because we assume a Gaussian distribution, this encoder outputs a mean and a standard deviation, and with those we have the entire Gaussian distribution. It is from this Gaussian distribution that we sample a latent vector. This vector is passed into a decoder, and the decoder then reconstructs the original image. So, this is, at least at a high level, how this works. Now, an issue, though, is that over here we're going to be

### Reparameterization trick [7:09]

computing some loss, which we'll talk about very shortly. This loss has to be backpropagated through the network, as these are neural networks that learn via backpropagation. So, we need to be able to compute gradients through every component of this network. Through the decoder, this is fine: we can always compute how each of the parameter values changes the loss. For example, del L by del Z we can compute quite easily through backpropagation. But then we get to the sampling layer, which becomes very tricky. Sampling is a stochastic procedure, and hence it is very difficult to understand how changing the mean or changing the variance is actually going to affect the loss of the network. So, how do we deal with this so that the encoder can effectively learn? Well, we can use something called the reparameterization trick. The idea is that instead of sampling this latent vector directly from this normal distribution, we write the latent vector as mu plus sigma times epsilon, where epsilon is sampled from a standard Gaussian distribution. The product is a Hadamard product, so it's an element-wise multiplication of two vectors. This works because the result is still a sample from the original distribution. You can see this intuitively: the result is centered at the mean, and we add some variation to it, scaled by sigma. So, you can imagine that Z is going to be a value around mu plus sigma or mu minus sigma, maybe somewhere in between, and maybe slightly outside of that range. That reflects the distribution itself. So, it is quite literally a reparameterization of the original normal distribution that we saw over here.
What this now allows us to do, apart from still giving us a sample from this distribution, is easily compute the gradients that we need. For backpropagation, we want to know how changing mu will affect the loss. Through the chain rule, that's how Z affects the loss times how mu affects Z. If you write that out, you'll see that del Z by del mu is one, based on this equation over here, if you take the derivative with respect to mu. So, del L by del mu is the same thing as del L by del Z; the gradient just passes through here. So now we have a way for the gradients to pass through. Similarly, for sigma, the standard deviation, del L by del sigma is how changing Z changes the loss times how changing sigma changes Z. If you take the derivative of Z with respect to sigma, you get this term epsilon, which is constant for a specific forward and backward pass. So, you end up with epsilon times del L by del Z. Effectively, we now have gradients that can pass through Z to mu, and to sigma or sigma squared as well. And when we have gradients here, we can backpropagate through the encoder, and so the encoder can effectively learn. So, that's why we're doing this reparameterization trick, and how it's so helpful here. One thing to note is that I write sigma squared over here, and I write sigma here.
That is intentional; I just didn't want to clutter this graph too much. In many cases, you'll actually see that the encoder, for numerical stability, outputs log sigma squared, because that's a value that can range from negative infinity to positive infinity without being constrained to the positive space, as sigma squared is over here. But taking logarithms or squares is differentiable at the end of the day. So, we can still use the chain rule to get from how changing Z affects the loss to how this log sigma squared affects the loss. It's all differentiable. I just wanted to make that a clear note.
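To make the reparameterization trick concrete, here is a small sketch in plain Python, written for this transcript rather than taken from the video. It assumes the encoder outputs `mu` and `log_var` (log sigma squared) as lists, and it takes a made-up value `grad_z` to stand in for del L by del Z coming back from the decoder. The function names are hypothetical.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Forward pass: z = mu + sigma * eps (element-wise), eps ~ N(0, 1)."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    sigma = [math.exp(0.5 * lv) for lv in log_var]  # sigma = exp(log_var / 2)
    z = [m + s * e for m, s, e in zip(mu, sigma, eps)]
    return z, eps, sigma

def backward_through_sampling(grad_z, eps):
    """Backward pass via the chain rule:
    dL/dmu    = dL/dz * dz/dmu    = dL/dz * 1
    dL/dsigma = dL/dz * dz/dsigma = dL/dz * eps
    """
    grad_mu = list(grad_z)
    grad_sigma = [g * e for g, e in zip(grad_z, eps)]
    return grad_mu, grad_sigma

rng = random.Random(0)            # fixed seed so the run is repeatable
mu = [0.5, -1.0]
log_var = [0.0, math.log(4.0)]    # corresponds to sigma = [1.0, 2.0]

z, eps, sigma = reparameterize(mu, log_var, rng)
grad_z = [0.3, -0.7]              # assumed dL/dz from the decoder
grad_mu, grad_sigma = backward_through_sampling(grad_z, eps)

print("z =", z)
print("dL/dmu =", grad_mu)
print("dL/dsigma =", grad_sigma)
```

Because `eps` is drawn once and then treated as a constant, the randomness sits outside the path from `mu` and `sigma` to `z`, which is what lets the gradients flow back into the encoder.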

### VAE loss function [12:09]

Now, let's look at the actual loss components, of which there are two. One is the reconstruction loss, and this is the same as we saw in the case of the autoencoder, where we're just trying to make sure that the reconstruction is similar to the original image. Then we have this KL divergence loss, and this is to ensure that the encoder's latent distribution, that is, this distribution over here, stays as close as possible to the prior, the standard Gaussian distribution. We made the assumption that, when we do the generation phase, the latent vector is going to be sampled from a standard Gaussian distribution. That is our prior. So, we want to make sure that this mu and sigma squared, the parameters of our posterior distribution, are pushed towards the standard Gaussian. Mu should be pushed towards zero, and sigma squared should be pushed towards one. But they won't be exactly there, because they are also tugged by the reconstruction loss. We have to reconstruct the image correctly while making sure that the latent space stays around this standard Gaussian distribution. That's the intuition of how this works. So, we get a loss, and via backpropagation we can learn the parameters of the encoder and decoder.
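For a diagonal Gaussian posterior and a standard Gaussian prior, the KL term has a well-known closed form, KL = -1/2 * sum(1 + log sigma^2 - mu^2 - sigma^2). The sketch below evaluates it in plain Python. It is an illustration rather than the video's code: the encoder is assumed to output `mu` and `log_var` as lists, and `kl_weight` is a hypothetical knob for weighting the two terms.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var)
    )

def vae_loss(reconstruction_loss, mu, log_var, kl_weight=1.0):
    """Total loss: reconstruction term plus (optionally weighted) KL term."""
    return reconstruction_loss + kl_weight * kl_to_standard_normal(mu, log_var)

# The KL term is zero exactly when mu = 0 and sigma^2 = 1 (log_var = 0).
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
# Moving the mean away from zero is penalized.
print(kl_to_standard_normal([1.0], [0.0]))
# Shrinking or inflating the variance is penalized too.
print(kl_to_standard_normal([0.0], [math.log(0.25)]))
```

The first call shows why the KL term alone would pull every posterior onto the prior; the reconstruction term is what pulls back.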

### VAE inference [13:49]

So, when you want to do some inference, let's say the model is trained. We just take the trained decoder. I put it in green here because it's trained. We now just sample from a standard Gaussian distribution. We get a vector Z, and this is passed to the decoder in order to generate an image. The cool thing is that with this continuous distribution, we can sample from any part of it and still get some relevant image. And varying the latent vector slightly can also help us control what kind of images are generated. So, you can imagine that in this grid of faces, each face was generated by varying Z, that vector, ever so slightly, so that we can go from frowning faces to neutral faces to smiling faces, and we can control this effectively.
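The inference procedure described above (sample Z from the prior, decode, then vary Z slightly) can be sketched as follows. Everything here is illustrative: `toy_decoder` is a hypothetical stand-in for a trained decoder network, and linear interpolation is just one simple way to vary Z gradually between two samples.

```python
import math
import random

def sample_prior(dim, rng):
    """Draw z ~ N(0, I), the prior used at generation time."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def interpolate(z_start, z_end, steps):
    """Evenly spaced latent vectors from z_start to z_end (steps >= 2)."""
    path = []
    for i in range(steps):
        t = i / (steps - 1)
        path.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path

def toy_decoder(z):
    """Hypothetical decoder: maps each latent value to a 'pixel' in (0, 1)."""
    return [1.0 / (1.0 + math.exp(-v)) for v in z]

rng = random.Random(0)
z_a = sample_prior(4, rng)
z_b = sample_prior(4, rng)

# Walk from z_a to z_b in 5 steps and decode each point along the way.
images = [toy_decoder(z) for z in interpolate(z_a, z_b, 5)]
for image in images:
    print([round(p, 3) for p in image])
```

With a real trained decoder, each step along such a path would produce a slightly different image, which is the kind of gradual change shown in the grid of faces.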

### What is VQ-VAE, forward pass, loss [14:44]

So, on to vector quantized variational autoencoders. These are VAE-style models with a discrete latent representation. That was the one-line definition of the what. Let's see how it works. We have an original image. We pass this to an encoder, and let's say this is a convolutional network. So, it has a bunch of convolution, activation, and pooling layers, which eventually transform the image into a tensor. The tensor, let's just say, is of shape 32 cross 32 cross D, where D could be something like 512. The way that I like to interpret this for now is that it's a grid of continuous latent vectors along the depth. So, 32 * 32 is 1,024: it's 1,024 vectors of D dimensions. Each of these D-dimensional vectors is a continuous vector that represents some region of the image. Next, we have something called a learnable codebook of embeddings. You can imagine these are just D-dimensional vectors. This one is a 512-dimensional vector, and so are all the others, all the way over here. And let's say there are 8,000 of these or something like that. These are all learnable vectors. Now, for this encoder vector here, we want to find the nearest neighbor in the codebook. Let's say it's this codebook vector called E2. That entry number is recorded here. So, you can imagine this is just a 32 cross 32 grid, and we're putting the index of the codebook entry in each cell. So, this is entry number two. Whereas this last one is entry number three, which is the nearest neighbor for this last vector. Then we use the actual codebook vectors as input to the decoder. So, you have 32 cross 32 cross D again, and each of these D-dimensional vectors, instead of being a continuous encoder vector, is now a codebook vector.
We pass these discrete vectors into the decoder, and the decoder reconstructs the image. As a part of this VQ-VAE, there are three major losses that contribute to the loss function. One is the reconstruction loss, which is what we talked about for the autoencoder and the variational autoencoder. The second one is the codebook loss. What this does is push the selected code, right over here, this E2, towards the corresponding output of the encoder. And then we have the commitment loss. What that does is push the corresponding output of the encoder towards the selected codebook vector, treating the codebook as fixed. So, you can imagine that these two loss functions are kind of symbiotic. They keep each other in check. We add these losses together, or do a weighted addition, and that gives the final loss.
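The nearest-neighbor lookup and the two codebook-related losses can be sketched in plain Python. This is a toy illustration, not the video's code: it uses a 3-entry codebook of 2-dimensional vectors instead of thousands of 512-dimensional ones, the function names are made up, and the weight `beta=0.25` is an assumed value for the commitment term.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(encoder_vectors, codebook):
    """For each encoder vector, find the nearest codebook entry.
    Returns the entry indices and the selected codebook vectors."""
    indices, quantized = [], []
    for v in encoder_vectors:
        k = min(range(len(codebook)),
                key=lambda i: squared_distance(v, codebook[i]))
        indices.append(k)
        quantized.append(codebook[k])
    return indices, quantized

def vq_loss_terms(encoder_vectors, quantized, beta=0.25):
    """Both terms measure the same distance. In a real autodiff framework
    they differ in where the gradient is blocked:
    - codebook loss: encoder output held fixed, codebook entry moves
    - commitment loss: codebook entry held fixed, encoder output moves
    """
    dist = sum(
        squared_distance(v, q) for v, q in zip(encoder_vectors, quantized)
    ) / len(encoder_vectors)
    codebook_loss = dist
    commitment_loss = beta * dist
    return codebook_loss, commitment_loss

codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]           # toy codebook
encoder_vectors = [[0.9, 1.2], [3.5, 4.1], [0.1, -0.2]]   # toy encoder output

indices, quantized = quantize(encoder_vectors, codebook)
codebook_loss, commitment_loss = vq_loss_terms(encoder_vectors, quantized)
print(indices)        # entry numbers that get stored in the index grid
print(codebook_loss, commitment_loss)
```

The total loss would then be the reconstruction loss plus these two terms. Plain Python has no automatic differentiation, so the "held fixed" part appears only in the comments here; frameworks express it with a stop-gradient operation.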

### Straight through estimator [18:06]

And this entire network is learned via backpropagation. But there's one thing to note here. When you backpropagate this loss, it can go into the decoder just fine, and we can get up to here. The gradients propagate well because this decoder is just a convolutional network or something similar. But from these discrete codebook vectors to these continuous encoder vectors over here, how do we actually get the gradients to propagate across that section? After all, if you vary this encoder vector ever so slightly, what will happen? Well, if you vary it only a little bit, in most positions it's the same codebook entry that gets selected. So, the codebook entry doesn't change, and hence the downstream loss doesn't change. So, the gradient, for the most part, is zero, and the encoder will not learn. But there's going to be some point in between where you vary this by just a little bit and the nearest codebook vector changes. All of a sudden, the gradient spikes. A gradient that is flat with occasional spikes is really not very useful for learning the parameters of this encoder. So, to combat this issue, we do something called straight-through estimation, where whatever gradients we compute for these codebook vectors over here are simply propagated, unchanged, to the encoder vectors over here. This is more of a heuristic, but it's practical, and it does kind of make sense, right? At the end of the day, we do want these vectors to be similar to each other, so it makes sense for their gradients to be the same as well. So, I hope it makes sense how this entire network can now effectively learn.
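A one-dimensional toy example shows both halves of this argument: the true gradient through the quantizer is zero for a small perturbation, while the straight-through estimator copies the decoder-side gradient to the encoder output. The scalar codebook, the squared-error `decoder_loss`, and its `target` are all hypothetical choices made for this sketch.

```python
def nearest(value, codebook):
    """Snap a scalar to the closest codebook value."""
    return min(codebook, key=lambda c: abs(c - value))

def decoder_loss(z_q, target=0.8):
    """Toy downstream loss as a function of the quantized value."""
    return (z_q - target) ** 2

codebook = [0.0, 1.0, 2.0]
z_e = 0.3                      # continuous encoder output
z_q = nearest(z_e, codebook)   # quantized value fed to the decoder

# True gradient through the quantizer, estimated by finite differences.
# A tiny nudge to z_e selects the same codebook entry, so the loss is flat.
h = 1e-4
true_grad = (
    decoder_loss(nearest(z_e + h, codebook))
    - decoder_loss(nearest(z_e - h, codebook))
) / (2 * h)

# Straight-through estimator: compute dL/dz_q at the quantized value,
# then pass it to z_e unchanged, as if quantization were the identity.
grad_z_q = 2 * (z_q - 0.8)
ste_grad_z_e = grad_z_q

print("z_q =", z_q)
print("true gradient wrt z_e =", true_grad)
print("straight-through gradient wrt z_e =", ste_grad_z_e)
```

In autodiff frameworks this is commonly written as `z_e + stop_gradient(z_q - z_e)`, which equals `z_q` in the forward pass but has the gradient of `z_e` in the backward pass.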

### Posterior Collapse [20:08]

All right. So, now that we've talked about the what and the how of VQ-VAEs, let's talk about the why. A primary advantage of VQ-VAEs over the original vanilla variational autoencoder is that they avoid posterior collapse. I just want to clarify some terminology here. This is the vanilla variational autoencoder. The encoder outputs the parameters of a distribution given some image. So, we have this idea of a posterior distribution and a prior distribution. A prior distribution is essentially a distribution without seeing any data. A posterior distribution, which is what the output of this encoder describes, is a distribution after you see some input data. We have some input data, and we get this distribution over here. So, this is a posterior distribution that's constructed from these parameters, and this latent vector is sampled from that posterior distribution. So, now that's clarified. During training, the decoder should use Z, this vector, to reconstruct the image. But if this decoder is very expressive and very powerful, it might actually learn to reconstruct the image without relying too much on Z. Because of this, the reconstruction loss can be minimized with very little dependence on what Z actually is. Now, the reconstruction loss is what typically pulls against the KL divergence. So, once there's only a weak dependence of the reconstruction loss on Z, the encoder doesn't need to worry about what it encodes, and it is more flexible with its outputs, mu and sigma squared. It is more flexible with this posterior distribution. So, it can focus on just minimizing the KL divergence.
And the best way to minimize the KL divergence, because it is just trying to make the posterior close to a standard Gaussian, is to literally push mu towards zero and push sigma squared towards one. So, the output posterior distribution is pushed almost all the way onto the prior Gaussian distribution, and hence there is a posterior collapse. What this could entail during the generation phase is that even if you vary Z meaningfully, it doesn't create a meaningful change in the generated images. Now, this is just an example that shows images that were somewhat recovered from posterior collapse, but I hope it illustrates the point: we might get generated images that look kind of like average faces, but we can't meaningfully control them by changing Z, and hence this becomes a problem. Vector quantized variational autoencoders don't even have this concept of posterior collapse. This is because the encoder no longer outputs the parameters of a distribution, and hence there's no posterior to collapse. There are other issues, like codebook collapse, but that's separate from this original posterior collapse.

### Discrete representations [23:52]

Now, a second reason why we would use vector quantized variational autoencoders is their discrete representation. With this learnable codebook there are, let's say, 8,000 vectors of 512 dimensions, and this information has to be reused across all training images. So, no matter how many training images there are, all of that representation has to be captured within this one fixed-size codebook. This forcing function enables better generalization. This is as opposed to variational autoencoders, which output continuous vectors that are very flexible, but that may tend towards memorizing specific signals with high pixel variation, for example. And of course, discrete codebook indices are more computationally efficient and cheaper to store.
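The storage point can be checked with some quick arithmetic. The numbers below are assumptions based on the figures mentioned in the transcript: a 32 by 32 grid, a codebook of 8,192 entries (the transcript says "like 8,000"), 512 dimensions per vector, and 4-byte floats for the continuous case.

```python
def bits_per_index(codebook_size):
    """Bits needed to address any entry in the codebook."""
    return (codebook_size - 1).bit_length()

grid_positions = 32 * 32      # 1,024 latent positions per image
codebook_size = 8192          # assumed; the transcript says "like 8,000"
dim = 512                     # dimensions per latent vector
bytes_per_float = 4           # assuming 32-bit floats

# Discrete: store one codebook index per grid position.
discrete_bytes = grid_positions * bits_per_index(codebook_size) / 8

# Continuous: store the full D-dimensional vector per grid position.
continuous_bytes = grid_positions * dim * bytes_per_float

print(bits_per_index(codebook_size), "bits per index")
print(discrete_bytes, "bytes per image as indices")
print(continuous_bytes, "bytes per image as continuous vectors")
print(continuous_bytes / discrete_bytes, "x smaller")
```

The codebook itself still has to be stored once (8,192 x 512 floats under these assumptions), but that cost is shared across every image.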

### Compatibility with sequence models (and DALL-E) [25:00]

And the third reason, which is more of a nice-to-have right now and segues well into the DALL-E discussion, is the compatibility with sequence models. Let's say we have a trained encoder from a vector quantized VAE. We pass in an image and get a tensor of continuous vectors, which we can snap to discrete vectors. These now become 1,024 image tokens that are going to be used in DALL-E, which we'll look at in more detail in the next video. This works out well because a lot of models today work with sequences of tokens. That approach was originally created for text, but now you can see we can do the same for images, and hence leverage these very powerful transformer architectures when you have a significant amount of data.
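Turning the grid of codebook indices into a token sequence is just a flattening step. The sketch below assumes row-by-row (row-major) ordering and fills the grid with placeholder indices. Neither detail comes from the video, and the function names are made up for the example.

```python
def grid_to_tokens(index_grid):
    """Flatten a 2-D grid of codebook indices, row by row, into a sequence."""
    return [idx for row in index_grid for idx in row]

def tokens_to_grid(tokens, width):
    """Inverse operation: cut the sequence back into rows of `width`."""
    return [tokens[i:i + width] for i in range(0, len(tokens), width)]

# Placeholder 32 x 32 grid of indices into an assumed 8,192-entry codebook.
grid = [[(r * 32 + c) % 8192 for c in range(32)] for r in range(32)]

tokens = grid_to_tokens(grid)
print(len(tokens))    # one token per grid position
print(tokens[:8])     # the start of the sequence a transformer would see
```

Each token is an integer index, just like a word-piece ID in a text model, so the same sequence machinery applies.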

### Quiz Time [25:56]

Quiz time. Have you been paying attention? Let's quiz you to find out. What advantages do vector quantized VAEs have over vanilla VAEs? A, they avoid posterior collapse. B, they use a discrete latent space. C, they always train faster than VAEs. Or D, they don't need an encoder network. I'll give you a few seconds to answer this question, and just note that multiple options may be correct. The correct options are A and B. Did you get them right? Please share your reasoning in the comments below and let's have a discussion. And at this point, if you think I deserve it, please do consider giving this video a like, because it will help me out a lot. That's going to do it for quiz time, but before we go, let's generate a summary.

### Summary [26:52]

So, in this video we took a look at the what, why, and how of autoencoders, variational autoencoders, and vector quantized variational autoencoders. And it's a discretized version of the VAE, similar to this vector quantized variational autoencoder, that's going to be used in DALL-E as a core component for image tokenization. I hope all of this makes sense, and if you want some more reading material, I'm going to put all of the reference papers in the description below. I highly encourage you to go through them. I hope you can use this as a supplement to whatever you're learning when you're tackling these concepts. They're very math heavy and kind of difficult, but if you stare at them long enough, I'm sure you'll get it. So, thank you so much, and I'll see you in the next one.
