Is it better than DALL-E 2? | How does Imagen Actually Work?
9:13


AssemblyAI · 13.07.2022 · 5,517 views · 154 likes


Video description
After GLIDE and DALL-E 2, we have a new image generation model: Imagen! Like its predecessors, Imagen also uses diffusion models to achieve great results. In this video, let's learn why Imagen is special, what its architecture looks like, and how it creates the photorealistic images it does.

Here is the article by Ryan O'Connor on Imagen: https://www.assemblyai.com/blog/how-imagen-actually-works/
Our video on Diffusion Models: https://youtu.be/yTAMrHVG1ew
Article on Diffusion Models: https://www.assemblyai.com/blog/diffusion-models-for-machine-learning-introduction/

00:00 Introduction
00:43 Why is Imagen special?
01:17 The Architecture
01:47 Text Encoder
03:17 Image Generator
05:27 Classifier-free Guidance
07:04 Super Resolution Models
07:37 Model Evaluation
08:34 Wrap-up

Is Imagen better than DALL-E 2? It is hard to answer, since both Imagen and DALL-E 2 are not publicly available, but from the published results it looks like both models perform at a very similar level. They each have their own pros and cons, of course.

How does Imagen work? Imagen is mainly based on a language model for caption understanding and a diffusion model for image generation.

Is Imagen open source? Not yet. Google has decided not to release Imagen for public use before there are more safeguards in place.

▬▬▬▬▬▬▬▬▬▬▬▬ CONNECT ▬▬▬▬▬▬▬▬▬▬▬▬
🖥️ Website: https://www.assemblyai.com
🐦 Twitter: https://twitter.com/AssemblyAI
🦾 Discord: https://discord.gg/Cd8MyVJAXd
▶️ Subscribe: https://www.youtube.com/c/AssemblyAI?sub_confirmation=1
🔥 We're hiring! Check our open roles: https://www.assemblyai.com/careers
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
#MachineLearning #DeepLearning

Table of contents (9 segments)

Introduction

Big companies are coming out with their own image generation models one after another, so in this video let's take a closer look at Google's latest model. Imagen is a caption-conditioned image generation model, meaning that given a caption, it generates highly relevant, high-resolution images. Let's look at some examples. Here is a photo of a corgi riding a bike in Times Square while wearing sunglasses and a beach hat. Here is another example: a transparent sculpture of a duck made out of glass, and the sculpture is in front of a painting of a landscape. And this last one is the cutest one, which is a cute corgi living in a house made out of sushi. All right, so we've seen DALL-E 2

Why is Imagen special?

and GLIDE before, so what makes Imagen so special? Well, two things: photorealism and language understanding. The creators of Imagen claim that it has a very deep language understanding, being able to understand the physical attributes of objects and how they relate to each other. Additionally, according to its creators, Imagen creates unprecedentedly photorealistic images. So let's take a deeper look at how Imagen works and understand how this level of photorealism and this level of language understanding has been possible. Imagen

The Architecture

has a couple of components working together to generate images. First, the text encoder accepts a caption as input and turns it into a text encoding. The image generator takes a fully noised image plus the text encoding and generates a small image. This small image is passed through two super-resolution models to be upscaled to 1024 by 1024 pixels. Now let's take a closer look at each of these components. The text

Text Encoder

encoder is a pre-trained transformer called T5. It is basically a language model made by Google back in 2019: a text-to-text model that is trained on many problems at once, for example translation, sentence acceptability judgment, sentence similarity estimation, and summarization. Inside Imagen, T5 is frozen, so while Imagen is being trained, T5 is not being updated. If you watched our video on DALL-E 2, you might remember that the text encoder used in DALL-E 2 was CLIP, and CLIP was a model trained on image and text pairs, so that was a model that has something to do with images, whereas T5 has only been trained on text. You might of course ask at this point: why did the creators of Imagen opt for T5, a model that has only been trained on text? The main reason is data size. As we mentioned, T5 is trained only on text data, and text data is more abundant in the world than image-and-caption pairs. Thanks to this, it was possible to train T5 with much more data than you could train a CLIP model with. The creators of Imagen wanted to see whether an encoder that is a big language model trained on a big and diverse text dataset would outperform a model like CLIP that was trained on text and image pairs, and it seems that using a language model has paid off for them. The main functionality
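"Frozen" has a concrete meaning in training terms, which a toy sketch can make explicit. The "models" here are just single matrices I've invented for illustration, not the real T5 or U-Net: the trainable generator weights are updated every step, while the encoder weights never receive an update.

```python
import numpy as np

rng = np.random.default_rng(0)
encoder_w = rng.standard_normal((4, 4))    # frozen, pretrained text encoder
generator_w = rng.standard_normal((4, 4))  # trainable image generator

tokens = rng.standard_normal((3, 4))       # a toy "caption"
encoder_before = encoder_w.copy()

for _ in range(10):                        # toy training loop
    encoding = tokens @ encoder_w          # forward pass through frozen encoder
    out = encoding @ generator_w
    grad = out                             # pretend gradient of some loss
    generator_w -= 0.01 * (encoding.T @ grad)  # update the generator only
    # no update to encoder_w: it stays exactly as pretrained

print(np.array_equal(encoder_w, encoder_before))  # True
```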

Image Generator

of generating images in Imagen is done by diffusion models. We have talked about how diffusion models work before on this channel, so I will leave a link to our video somewhere here or in the description below. Alternatively, if you want to go more in depth and understand the math behind diffusion models, you can go and check out Ryan O'Connor's blog post; I will leave a link for that in the description too. But to quickly recap: diffusion models are generative models that can generate images from noise. They are trained by feeding an image through a Markov chain and, at each time step of that chain, adding a little bit of Gaussian noise; once the image consists completely of noise, the model tries to reverse this process and arrive at the original image. After they are trained, diffusion models can generate images from nothing but noise. This reversal of the noise-addition process is done by convolutional neural networks; the specific one used in diffusion models is called a U-Net, and that's exactly what is used in Imagen too. The main feature of U-Nets is that the input dimension and the output dimension are the same. The network consists of residual blocks, multi-head attention, down- and up-sampling, and skip connections between the layers, and the residual blocks consist of batch normalization, a ReLU activation layer, and a 3x3 convolution in sequence. In Imagen, and generally in diffusion models as a whole, the same denoising U-Net is used at every time step, but different amounts of noise need to be removed at each time step. For this, the time step information needs to be passed to the network too; to deal with this, the authors opted for the positional encodings that were introduced in the original transformer paper, "Attention Is All You Need". And lastly, of course, we need to inject the information generated by the text encoder into this diffusion model. For this, at each time step the text encoding is pooled and added to the time step embedding, and
both are later added to the image. In order to make the generated images be as
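Two of the ingredients just described can be written down concretely: the forward noising process, and the sinusoidal time step embedding that tells the shared U-Net which step it is on. This is a numpy sketch under the standard DDPM formulation (an assumption; Imagen's exact schedule and embedding details differ), using the closed-form jump to step t rather than iterating the Markov chain step by step.

```python
import numpy as np

def forward_noise(x0, t, betas, rng):
    """Jump to time step t: x_t = sqrt(a_bar)*x0 + sqrt(1 - a_bar)*eps."""
    alpha_bar = np.prod(1.0 - betas[: t + 1])  # product of (1 - beta) up to t
    eps = rng.standard_normal(x0.shape)         # Gaussian noise
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

def timestep_embedding(t, dim):
    """Sinusoidal embedding in the style of 'Attention Is All You Need'."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)     # a toy linear noise schedule
x0 = rng.standard_normal((64, 64, 3))      # a toy "image"
x_t = forward_noise(x0, 999, betas, rng)   # at the last step: almost pure noise
emb = timestep_embedding(500, 128)
print(x_t.shape, emb.shape)  # (64, 64, 3) (128,)
```

In the real model, the pooled text encoding is added to this time step embedding, and the sum conditions the U-Net at every step.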

Classifier-free Guidance

relevant as possible to the caption, Imagen uses something called classifier-free guidance, which increases the fidelity of the images at the cost of their diversity. Here's how it works: the diffusion model makes one prediction using the text guidance, and then one without any text guidance. Their difference is used to find out the direction in which the text-conditioned image lies, and the model extrapolates towards that direction. This helps the model actually capture the essence of the caption that was given to it. But there is a problem with it: large guidance weights generate good-quality images, but they sometimes cause the images to look saturated or unnatural. To avoid this, the authors came up with two techniques to make sure that the pixel values being extrapolated stay in the allowed range, called static thresholding and dynamic thresholding. Static thresholding clips any pixel value outside the [-1, 1] range to -1 for negative values and 1 for positive values. With dynamic thresholding, a percentile of the absolute pixel values is chosen at each time step; if this percentile exceeds 1, the pixel values are clipped to the percentile's value and then divided by it, effectively being scaled to between -1 and 1. Dynamic thresholding was found to lead to much better photorealism and alignment, especially for large guidance weights, and that's why it was the approach chosen by the authors. The last step of Imagen
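The guidance step and dynamic thresholding can be sketched in a few lines of numpy. Here `eps_cond` and `eps_uncond` are made-up stand-ins for the U-Net's noise predictions with and without the text encoding, and the percentile value is illustrative, not Imagen's exact setting.

```python
import numpy as np

def classifier_free_guidance(eps_cond, eps_uncond, w):
    # Extrapolate from the unconditional prediction towards the
    # conditional one; w > 1 pushes past it (large guidance weight).
    return eps_uncond + w * (eps_cond - eps_uncond)

def dynamic_threshold(x, percentile=99.5):
    # Pick a percentile s of |x|; if s > 1, clip to [-s, s] and
    # rescale by s so the result lies in [-1, 1].
    s = max(np.percentile(np.abs(x), percentile), 1.0)
    return np.clip(x, -s, s) / s

rng = np.random.default_rng(0)
eps_uncond = rng.standard_normal((8, 8))
eps_cond = rng.standard_normal((8, 8))
guided = classifier_free_guidance(eps_cond, eps_uncond, w=7.0)  # large weight
x = dynamic_threshold(guided)
print(x.min() >= -1.0 and x.max() <= 1.0)  # True
```

Static thresholding would simply be `np.clip(x, -1.0, 1.0)`; the dynamic version adapts the clipping range to the actual distribution at each step, which is why it degrades less at large guidance weights.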

Super Resolution Models

is to upscale the images that are created. There are two steps: first small to medium, and then medium to large. The small-to-medium model takes a 64 by 64 image and turns it into a 256 by 256 image, and the medium-to-large model takes this image and turns it into a high-resolution 1024 by 1024 image. Both the small-to-medium and the medium-to-large models are diffusion models that are again conditioned on the caption information. We are quickly

Model Evaluation

approaching a point in time where evaluating these image generation models is going to be subjective, but we're not there yet. So here is what the Imagen creators did to see how well their model performed. Based on the authors' tests, Imagen was preferred 39.2 percent of the time in comparison to the original reference image in terms of photorealism, and it was on par with the original reference images in terms of accurately depicting what is said in the caption. But the authors did not think that this evaluation was sufficient, so they came up with their own set of challenging prompts to evaluate image generation models on, called DrawBench. The images generated from these prompts were shown to human raters, and based on the reported results, Imagen outperforms DALL-E 2, GLIDE, and a bunch of other image generation models in terms of both image fidelity and caption alignment. Imagen seems to be performing

Wrap-up

quite well, but we cannot yet see it for ourselves, because the model hasn't been made public yet. Though if the current trend is any indication, it is likely that in the near future we're going to see more, bigger, and better image generation models. If you would like to learn about another image generation model, namely DALL-E 2, or learn more about how diffusion models work, you can go and check out our other videos on this channel. But for now, thanks for watching, and I will see you in the next video.
