# How AI connects text and images

## Metadata

- **Channel:** 3Blue1Brown
- **YouTube:** https://www.youtube.com/watch?v=j5h9dBtWZXM
- **Date:** 21.08.2025
- **Duration:** 1:42
- **Views:** 103,478

## Description

From this guest video by @WelchLabs on how diffusion models work: https://youtu.be/iv-5mZ_9CPY

## Contents

### [0:00](https://www.youtube.com/watch?v=j5h9dBtWZXM) Segment 1 (00:00 - 01:00)

Over the last few years, AI systems have become astonishingly good at turning text prompts into videos. At the core of how these models operate is a deep connection to physics. This generation of image and video models works using a process known as diffusion, which is remarkably equivalent to the Brownian motion we see as particles diffuse, but with time run backwards and in high-dimensional space. But what exactly is the connection to Brownian motion here? And how is our model able to use text input so expressively?

In February 2021, a team at OpenAI released a new model architecture called CLIP. CLIP is composed of two models, one that processes text and one that processes images. The output of each of these models is a vector of length 512, and the central idea is that the vectors for a given image and its caption should be similar.

If I take two pictures of myself, one not wearing a hat and one wearing a hat, and pass both of these into our CLIP image model, we get two vectors in our embedding space. Now, if I take the vector corresponding to me wearing a hat and subtract the vector of me not wearing a hat, we get a new vector in our embedding space. Now, what text might this new vector correspond to? Mathematically, we took the difference of me wearing a hat and me not wearing a hat.

We can search for corresponding text by passing a bunch of different words into our text encoder. Testing a set of a few hundred common words, the top-ranked match is the word "hat", followed by "cap" and "helmet". This is a remarkable result. The learned geometry of CLIP's embedding space allows us to operate mathematically on the pure ideas or concepts in our images and text.
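The hat experiment can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the video's actual code: it does not load CLIP. The 512-dimensional vectors are random stand-ins, built so that "wearing a hat" adds one shared direction to the image vector. Cosine similarity is assumed as the ranking measure, since the transcript does not say which one was used. All names (`rank_words`, `hat_direction`, the four-word vocabulary) are hypothetical.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def rank_words(query, word_vectors):
    """Return the words sorted from most to least similar to the query."""
    scores = {w: cosine_similarity(query, v) for w, v in word_vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)


rng = np.random.default_rng(0)
dim = 512  # vector length mentioned in the transcript

# Assumption: random stand-ins for what an image encoder would output.
person = rng.normal(size=dim)
hat_direction = rng.normal(size=dim)
image_no_hat = person
image_with_hat = person + hat_direction

# Assumption: stand-ins for text-encoder outputs. "hat" and "cap" are
# placed near the hat direction; "dog" and "car" are unrelated.
word_vectors = {
    "hat": hat_direction + 0.3 * rng.normal(size=dim),
    "cap": hat_direction + 0.6 * rng.normal(size=dim),
    "dog": rng.normal(size=dim),
    "car": rng.normal(size=dim),
}

# Subtract the two image vectors, then search the vocabulary.
difference = image_with_hat - image_no_hat
ranking = rank_words(difference, word_vectors)
print(ranking)
```

With real CLIP encoders the same two steps apply: subtract the image embeddings, then rank the text embeddings of candidate words by similarity to the difference vector.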

---
*Source: https://ekstraktznaniy.ru/video/11499*