# Gemini Embedding 2 - Audio, Text, Images, Docs, Videos

## Metadata

- **Channel:** Sam Witteveen
- **YouTube:** https://www.youtube.com/watch?v=zUkKvWBJ_0I

## Contents

### [0:00](https://www.youtube.com/watch?v=zUkKvWBJ_0I) Segment 1 (00:00 - 05:00)

Okay, if someone asked you today to build a search system that could handle text, images, audio recordings, video clips, and even things like documents and PDFs, all in the same search, what would your pipeline look like? Up until recently, when I worked on things like this, you would end up using multiple vector stores and multiple embedding models, and the system would get pretty complicated pretty fast. Now jump forward to a model I covered almost two months ago from the Qwen team, which basically allowed us to do embeddings with both text and images at the same time. I do think the Qwen model is really cool, but what I couldn't talk about at the time was that I was already testing another model, which just got released yesterday, and which takes this whole multimodal embedding system even further than just text and images. And this is the Gemini Embedding 2 model. This is the first natively multimodal embedding model from the Gemini team that can not only cover text and images, but can also take videos of up to 2 minutes without having to convert them to any other format. It can take audio files without having to transcribe them or anything like that. And it can even take files like PDFs and embed them natively in their format, without needing to convert them to plain text first. So whereas in the past you would need to spin up perhaps a text embedding model, then something like a CLIP or SigLIP embedding model, then use something like Whisper to transcribe your audio, this model basically replaces all of that: the five different models you needed, the five indexes, and the five different headaches have all been collapsed into a single API call. So the idea with this model is that it can take text, images, video, audio, and PDFs and put them into the same shared vector space.
One model, one index, and one query to access an embedding that you can then use for a variety of different tasks. Okay, so real quick, for anyone who's new to this and not really sure how embeddings work, I covered that quite a bit in the video I made about the Qwen models and how multimodal embeddings work. But the simplest way to think about this is that you can take any piece of content, whether that's a sentence of text, an image, a chunk of audio, a video, or even a PDF file, and the model is going to convert it into a list of numbers, specifically a vector that lives in a high-dimensional space. The key property of those numbers is that they've extracted the semantic information from that particular piece of content. You can think of that vector as basically an address in n-dimensional space: things that sit in a similar region of the space tend to be semantically similar overall. Meaning, if I've got some text about a cat, an image of a cat, and some speech talking about a cat, they're all going to end up in roughly the same location. Now, when we think of space, we tend to think about it in three dimensions, but for a model like this, the full representations are over 3,000 dimensions. That's what allows it to encode things well enough that, when you do similarity lookups, you can find content that relates to your specific query. Now, historically, every modality needed its own model to do this kind of conversion. You would need things like a text embedding model and an image embedding model, something like a CLIP or SigLIP model. And then for things like audio speech, people often wouldn't even encode it directly; what they would actually do is transcribe it and then encode the transcript as text.
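The similarity lookup described above typically uses cosine similarity between vectors. A minimal sketch, using tiny 4-dimensional toy vectors standing in for the model's 3,000+ dimensional embeddings (the vectors and their labels here are made up for illustration):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 for identical directions, near 0 for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dim "embeddings" standing in for real 3,072-dim vectors (illustrative only).
cat_text  = [0.9, 0.1, 0.0, 0.2]   # "a photo of a cat"
cat_image = [0.8, 0.2, 0.1, 0.3]   # embedding of an actual cat picture
car_text  = [0.0, 0.9, 0.8, 0.1]   # "a red sports car"

print(cosine_similarity(cat_text, cat_image))  # high: same concept, different modality
print(cosine_similarity(cat_text, car_text))   # lower: different concepts
```

The point of a shared multimodal space is exactly that the first score (text vs. image of the same concept) comes out higher than the second (unrelated concepts), regardless of which modality each vector came from.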
Now, what this did was make multimodal RAG and multimodal search both challenging to do, because not only would you require multiple models, but you would often have multiple indexes to search, and you would need a whole reranking or fusion layer to work out what to actually bring back to the user. It was usually very messy to do, very expensive to maintain, and often very slow to run as well. And this is where the Gemini Embedding 2 model really changes that. We've now got one unified space where a written description of a product and a photo of that product end up close together in the same vector space. This means that your users can write a text query and retrieve the most semantically similar results, whether they be text, images, video, audio,

### [5:00](https://www.youtube.com/watch?v=zUkKvWBJ_0I&t=300s) Segment 2 (05:00 - 10:00)

PDFs, or they can throw in an image, or even just a raw recording of a person saying what they want, and then get an embedding that actually finds the content they want. This whole unified element is really a game changer here. And one of the other key parts of this is that, if you want to, you can not only embed each of the different modalities separately, you can also pass in multiple modalities at once. For example, we can pass in an image and text in a single request to get back an embedding that represents the combination of the two. This allows you to build a whole bunch of different things. For example, if I've got a picture of a watch band I like and I describe the sort of watch, but I don't actually have a picture of that part, I can create an embedding that represents those two things combined, and then use that to do lookups against pictures of watches or videos of watches, etc. And in fact, if we come into the demo that they've got here, you can see that we can do searches by images. So here I'm selecting the picture of the cats, and sure enough, we're getting pictures of cats that look like these black and white cats coming back. We're also getting videos coming back that have been embedded into a similar space. You can notice here that the model has not only worked out cats semantically, it's worked out this black and white pattern, specifically on the face of a cat, and represented that. So it's giving us back both images and videos that match. You can see we're doing the same with the soccer team here, where not only are we getting soccer content back, but it's picked up on the yellow uniforms they're actually wearing. Now, just as we could do this via image search, we can also do it with speech search. So if we listen to this audio, "a tiger", you can see that's a very simple little piece of speech.
But if we now do a search on the embedding made from that, we're getting back images of tigers, and we're even getting back videos where a tiger appears in the video. So you can see in this demo, which has got over a million images and over half a million videos in it, that it's able to take these audio clips, quickly do a lookup, and find the relevant images and videos. All right. So if we look at some of the details and limitations of this: when you're passing the different modalities in, you're obviously limited in what you can pass. For text, that's up to 8,000 tokens. Now, most of the time you're probably going to want to do some kind of chunking. I'm not sure I'd really want to be putting 6,000 to 8,000 words in to get a representation of the whole thing; you're more likely to go for smaller chunks. But if you want to, you can go up to 8,000 tokens. You can also pass up to six images at a time, and videos of up to two minutes long. Now, again, with the video example, I might want to chunk it much smaller than 2 minutes. If I've got a video that's 6 hours long and I want to be able to search it, I might even chunk it down to 15-second or 30-second chunks, embed all of those, and then do text search over that video. Perhaps I want to type in something like: when does a woman in a red dress appear in the video? Obviously, the smaller the chunks are, the more specifically I'll be able to return the exact time that happened in the video. But this is definitely a way now that you can do search over really long videos.
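The chunking step described above is just time arithmetic before any embedding happens. A minimal sketch, where `chunk_spans` is a hypothetical helper (not part of any SDK) that produces the (start, end) second spans you would then cut and embed one by one:

```python
def chunk_spans(duration_s: float, chunk_s: float = 30.0) -> list[tuple[float, float]]:
    """Split a long video into (start, end) spans of at most `chunk_s` seconds.
    Each span would be embedded separately so search results map to timestamps."""
    spans = []
    start = 0.0
    while start < duration_s:
        spans.append((start, min(start + chunk_s, duration_s)))
        start += chunk_s
    return spans

# A 6-hour lecture chunked into 30-second pieces: 720 embeddings to index.
spans = chunk_spans(6 * 3600, chunk_s=30.0)
print(len(spans), spans[0], spans[-1])
```

Each chunk stays well under the model's 2-minute video limit, and the span boundaries let you report "the woman in the red dress appears at 1:23:30" rather than just "somewhere in this 6-hour file".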
I can certainly see some really nice use cases where you could take, for example, some of the university courses that are maybe 25, 30, or 50 hours long, encode the video, the audio, and any PDFs of slides that you had for each of those lessons, and then be able to ask it: hey, which lessons talked about this specific topic and had a diagram about it? Up until now, that really hasn't been something you've been able to do easily. And I do think this is where all this gets really interesting: the ability to come up with ideas for new products and new ways of using this technology that just weren't possible in the past. All right, so Google's published a bunch of benchmarks for this. I'm not really going to go through these, but interestingly, this is already doing text-to-text similarity better than the original Gemini embedding 001 model, and it's outperforming the other multimodal models that are out there for image-to-text and text-to-image. But really where this shines is

### [10:00](https://www.youtube.com/watch?v=zUkKvWBJ_0I&t=600s) Segment 3 (10:00 - 15:00)

the fact that it can do all of the modalities together. On top of this, the way this model is built incorporates Matryoshka representation learning, which means that if you don't want the full-size embedding back, which is 3,072 dimensions, you can get embeddings back that are half or a quarter of that size. That can be useful where perhaps you don't need the fine-grained semantics of knowing exactly what color the cats were; you just want to know whether there was a cat there or not, and you want the performance boost of not having to store such large embeddings, plus the speed of being able to look up the embeddings faster because they're shorter. On top of releasing this in the Gemini API for AI Studio and Vertex AI, they've also teamed up with many of the agentic frameworks like LangChain and LlamaIndex, and vector store companies like Chroma DB and Qdrant, to get support for this on day zero as it's released. So I think the best thing is to jump into a Colab and have a play with this, and then you can get a sense of what this can actually do. And I would love to hear from you in the comments: what are some of the ideas you can see yourself building, now that you can index across all these modalities? So let's jump into the notebook. All right. So if we come into the notebook here, I basically made a little notebook just to show you some of the key features. The model itself is still in preview; it's the Gemini Embedding 2 preview model. And I basically put together how you would use it without any external agent frameworks or anything like that. We need the Google GenAI SDK to do this, and you need your Gemini key. All right. So we bring down some example content. If we run that through and see what it is, we can see we've got the jetpack backpack, which is an image that's been used since Gemini 1 for a lot of the demos.
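Taking a shorter Matryoshka-style embedding usually just means keeping the leading components. A minimal sketch of that idea; note that L2-renormalizing after truncation is a common convention for MRL embeddings, and whether the API returns truncated vectors pre-normalized is an assumption you should check against the docs:

```python
import math

def truncate_embedding(vec: list[float], dim: int) -> list[float]:
    """Keep the first `dim` components of an MRL-trained embedding and
    L2-renormalize, so cosine similarity still behaves sensibly."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Stand-in for a full 3,072-dim vector; in practice you'd go 3,072 -> 1,536 or 768.
full = [0.5, -0.5, 0.25, 0.25, 0.1, 0.1, 0.05, 0.05]
half = truncate_embedding(full, 4)

print(len(half))
print(sum(x * x for x in half))  # ~1.0 after renormalization
```

Halving the dimension halves your storage and roughly halves similarity-computation cost, at the price of losing the fine-grained detail carried in the trailing components.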
We've got a scone image in here. We've got a cat image in here. And we've got an audio file in here: "It is so peaceful walking through the trees with the leaves crunching underfoot." All right, so next up we've got some helper functions I've put together. Basically, to call the actual model, you're just going to use client.models.embed_content. I made one helper for embedding text, one for embedding images, one for embedding audio, and then later on we'll look at some simple straightforward ones as well. All right. So you can see that with these it's pretty easy to embed something, and we'll get back these 3,072-dimensional vectors. And that's going to be the same size vector whether we're doing images, text, or audio; in fact, all the modalities will produce this same length. If we want to do something like text-to-image similarity, here I've got a bunch of text descriptions, and you can see that some of those fit the images that we've got there. We can then just embed those, embed the actual images that we had as well, these three images, and then compute the similarity. Now, this would normally be done by a framework; if you're using something like LangChain, it can handle all this for you. But just to show you: sure enough, when we look at the jetpack picture, which is a sketch of a jetpack backpack, what comes back highest? Remember, this is not a percentage; just think of it as a similarity score. And you can see here that the highest match for that one is "a person flying with a jetpack". Well, the jetpack is there, but not really the person, and not really flying through the sky. But that still registers higher than, obviously, the cat image, for example.
And we can see that "a rocket launching into space", which kind of makes sense, registers higher than the cat image. If we look at the cat image, by the way, we can see it registers much higher for "a cute ginger cat sitting and looking at the camera" than for any of the other texts in here. Now, if we do the same for audio, all we need to do is encode the audio file that we've got there. We've already encoded our text, so now we can just compare that audio file to the text. And sure enough, the text that was talking about trees and walking in nature is getting the highest score by far here, right? So this one is coming back as the closest text to that particular audio file. If we want to do a reverse search, where we pass in an image and get text back, we can also do that here. So this is just putting in the various
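The cross-modal lookups above all reduce to the same operation: score a query embedding against every candidate embedding and take the highest. A minimal sketch with made-up toy vectors (a real pipeline would get these from the embedding API and let a vector store do the search over millions of items):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def best_match(query_vec: list[float], candidates: dict[str, list[float]]) -> tuple[str, float]:
    """Return (label, score) of the candidate embedding closest to the query."""
    scored = [(label, cosine(query_vec, vec)) for label, vec in candidates.items()]
    return max(scored, key=lambda pair: pair[1])

# Toy embeddings: an audio query about a forest walk scored against text candidates.
audio_query = [0.7, 0.1, 0.6]
texts = {
    "walking through a peaceful forest": [0.8, 0.0, 0.6],
    "a sketch of a jetpack backpack":    [0.1, 0.9, 0.1],
    "a cute ginger cat":                 [0.2, 0.1, 0.9],
}

label, score = best_match(audio_query, texts)
print(label)
```

Because audio and text land in the same space, the "reverse" search (image in, text out) is exactly the same code with the roles of query and candidates swapped.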

### [15:00](https://www.youtube.com/watch?v=zUkKvWBJ_0I&t=900s) Segment 4 (15:00 - 20:00)

text. For this image, the closest text match is "a sketch of a jetpack backpack", which makes sense. Now, you could go through and try adding "a pen sketch on lined paper of the jetpack backpack", and that would probably score even higher. So you can play around with the level of detail of these. You can see that, sure enough, the scone image scores with the scones with jam and cream on a plate. And our last one, the cute ginger cat, is scoring the highest there. So this is definitely working, right? If we're doing similarity matching like this, we've got something going well here. The other thing, too, is that we can run a full cross-modality similarity check and basically see what is similar to what. So obviously the text "a sketch of a jetpack backpack" is going to come back with a score of one for matching itself; that's what this one line is doing. But we can see that the second highest thing in there is actually this text matching another text, higher than the actual image, in this case. So they're pretty close, right? We can see that there. And we can also check the audio against the text and compare images. Obviously, those are nowhere near as high as this audio matching this text of trees and nature sounds in a peaceful forest. Okay. So in this one, we're just looking at how you would embed a video. We just download a video file, and you can see here that this is still just using client.models.embed_content with the Gemini Embedding 2 preview model. The important part is that we're doing it from bytes, and we need to pass in the actual mime type so that it knows it's a video file. Once it's got that, it can encode it fine. It's the same sort of thing for embedding PDFs, right? You've got the option of using the Files API, if you wanted to upload them and provide them that way, or you can do it from bytes here.
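For the from-bytes path described above, the mime type is the one thing you must get right, since raw bytes carry no file name. A minimal sketch using Python's standard `mimetypes` module; `mime_for` is a hypothetical helper, and the exact request shape the SDK expects should be taken from the official docs:

```python
import mimetypes

def mime_for(path: str) -> str:
    """Pick the mime type to send alongside raw bytes so the model knows
    how to interpret the payload (video, audio, PDF, image, ...)."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        raise ValueError(f"cannot infer mime type for {path}")
    return mime

# These values are what a from-bytes embed request would carry as its mime type.
print(mime_for("lecture_clip.mp4"))   # video/mp4
print(mime_for("slides.pdf"))         # application/pdf
print(mime_for("narration.mp3"))      # audio/mpeg
```

If you upload via the Files API instead, the service can record the mime type at upload time, which is why the transcript treats the two paths as interchangeable.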
And again, the mime type now is going to be application/pdf, and both of these are going to give us back the embedding, which is 3,072 dimensions long. Another thing that you want to think about as you're using this is: if you're passing in multiple pieces of content, do you want separate embeddings for all of them, or do you want to aggregate them into one? So, for example, let's say we've got a Twitter post or some kind of social media post with a text component and an image component. If we want to make one embedding for the whole post, then what we do is this: in the actual call, in the types.Content, in parts, we pass in a list of parts, where all of them are going to be joined and we get this averaged-out embedding back. So you'll see that even though we've got two parts of one piece of content going in, the text and an image, we're getting one embedding back. If we instead pass in a list of pieces of content, you will get multiple embeddings back. So you can see here that in the first case, the contents is just one piece of content with two parts to it, and in the second case, we've actually got two pieces of content going in. So if we wanted to pass in six images and get six embeddings back for them, we would do it that way. But if we wanted to aggregate an embedding over those images, then we would just pass them in as parts, with each part being a separate image. So, they've got some notes in here about how to use that.
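You can also build the aggregated embedding yourself from the separate per-part embeddings, as the transcript suggests. A minimal sketch with toy vectors; note that treating the pooled embedding as an L2-renormalized arithmetic mean is an assumption here, and the API's own multi-part pooling may differ:

```python
import math

def average_embeddings(vectors: list[list[float]]) -> list[float]:
    """Component-wise mean of several embeddings, L2-renormalized: a
    do-it-yourself pooled embedding for a multi-part piece of content."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean]

# Toy post: one embedding for the text, one for the attached image.
text_vec  = [1.0, 0.0, 0.0, 0.0]
image_vec = [0.0, 1.0, 0.0, 0.0]
post_vec = average_embeddings([text_vec, image_vec])
print(post_vec)  # points "between" the two inputs
```

Doing the pooling yourself means you only call the API once per part and can later re-weight (say, text-heavy vs. image-heavy posts) without re-embedding anything.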
I really think you want to experiment with that yourself if you're working out what's going to be the best way to represent things in a RAG system. If you've got something like social media posts where you want one embedding per post, you might go for the aggregated way of doing it. You can always do that yourself too, where you take the separate embeddings and average them out, right? So you just work out the average of the two, three, four, or five embeddings that you've got there. Overall, though, while this may not be as sexy as something like a Gemini 3.1 model coming out, if you're actually building AI apps, embedding stuff is one

### [20:00](https://www.youtube.com/watch?v=zUkKvWBJ_0I&t=1200s) Segment 5 (20:00 - 20:00)

of the core tools that you're using all the time. So it's definitely worth checking this out. And while generally I like to use open embedding models, just for having the control, unfortunately there's nothing out there like this that has the quality of these embeddings over all of the modalities with one model. So, anyway, let me know in the comments what you think. If you've got any really good ideas of how you plan to use this, I would definitely be interested to see what sort of use cases people are most interested in. As always, if you found the video useful, please click like and subscribe, and I will talk to you in the next video. Bye for now.

---
*Source: https://ekstraktznaniy.ru/video/22373*