[Paper Analysis] On the Theoretical Limitations of Embedding-Based Retrieval (Warning: Rant)

Yannic Kilcher · 11.10.2025 · 35,030 views · 1,189 likes

Video description
Paper: https://arxiv.org/abs/2508.21038 Abstract: Vector embeddings have been tasked with an ever-increasing set of retrieval tasks over the years, with a nascent rise in using them for reasoning, instruction-following, coding, and more. These new benchmarks push embeddings to work for any query and any notion of relevance that could be given. While prior works have pointed out theoretical limitations of vector embeddings, there is a common assumption that these difficulties are exclusively due to unrealistic queries, and those that are not can be overcome with better training data and larger models. In this work, we demonstrate that we may encounter these theoretical limitations in realistic settings with extremely simple queries. We connect known results in learning theory, showing that the number of top-k subsets of documents capable of being returned as the result of some query is limited by the dimension of the embedding. We empirically show that this holds true even if we restrict to k=2, and directly optimize on the test set with free parameterized embeddings. We then create a realistic dataset called LIMIT that stress tests models based on these theoretical results, and observe that even state-of-the-art models fail on this dataset despite the simple nature of the task. Our work shows the limits of embedding models under the existing single vector paradigm and calls for future research to develop methods that can resolve this fundamental limitation. 
Authors: Orion Weller, Michael Boratko, Iftekhar Naim, Jinhyuk Lee Links: Homepage: https://ykilcher.com Merch: https://ykilcher.com/merch YouTube: https://www.youtube.com/c/yannickilcher Twitter: https://twitter.com/ykilcher Discord: https://ykilcher.com/discord LinkedIn: https://www.linkedin.com/in/ykilcher If you want to support me, the best thing to do is to share out the content :) If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this): SubscribeStar: https://www.subscribestar.com/yannickilcher Patreon: https://www.patreon.com/yannickilcher Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2 Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Table of contents (10 segments)

Segment 1 (00:00 - 05:00)

Hello there. Today we're going to look at "On the Theoretical Limitations of Embedding-Based Retrieval." And this paper is really interesting because it proves something about embedding-based retrieval, so think vector databases in RAG and things like this. The paper proves, effectively, that there are limitations to what can be retrieved, and this has huge implications for basically all of AI nowadays, especially AI that works with any sort of data. Or so this paper would have you believe. The title obviously evokes certain feelings; it demands attention: whoa, we're all doing embedding-based retrieval nowadays, and there are limitations on it. I will make one change to the title that makes it more accurate and maybe much less impressive. The paper is "On the Theoretical Limitations of Embedding-Based Retrieval of Random Data." Now, we'll go through the paper and I'll explain what I mean. But effectively, the TL;DR, and please watch the rest of the video, it's really good for my stats, but the TL;DR is: if you have random combinations of data, then a certain embedding dimension is not going to be able to represent all of them. Wow. That was completely not obvious to anyone. This is so unexpected. In any case, I don't want to downplay the paper too much here. It's one thing that something is kind of obvious, and it's another thing to really rigorously go through and prove it and show it in experimental results and so on. But I just wanted to take away the expectation that this is going to mean anything for any practical application. It is not. It simply says that for a set of n elements there are n choose 2 two-element subsets, and you need a corresponding number of dimensions to represent them all; otherwise you can't retrieve an arbitrary 2-set. So what are we talking about, for the people who are thoroughly confused now about what this all means?
We're generally in the domain of information retrieval, which works as follows. You have a whole data set, let's represent it with this poorly drawn bucket right here, and in here are lots of, let's say, documents. On the internet you can think websites; on your Dropbox you may have uploaded docs; it could also be pictures and whatnot, but we'll just consider documents for now. So let's say these are documents: Word documents, PDFs, text messages, or websites. What a search engine does is build what's called an index, some kind of data structure that allows the search engine, when the user over here has some search request, for example, I want to know about the Paris Metro, to consult the index and very quickly retrieve data from this bucket. Why don't we go directly to the bucket? Because if there are a billion documents in there, it takes a really long time to go through all of them and check whether "Paris Metro" is in there. So we build an index. Classically, these indexes have been reverse lookup tables. I would analyze the documents and, as I do, build up a vocabulary over my entire corpus. The word "dog" appears in documents number 1, 3, 5, 6, and so on. The word "Paris" appears in documents 4, 8, 13, and so on. The word "Berlin" appears in some others; you get the idea. So, like in the back of a book, I have this index and I can look up where the stuff appears. If I search for "Paris Metro", I simply look up where "Paris" is, where "Metro" is, and take the intersection of those. Maybe "Metro" is in 6, 8, and 19. You can see that document 8 is in both. So that's the

Segment 2 (05:00 - 10:00)

document I retrieve. And now you can imagine, if this is really a billion documents, you'll get a lot of documents that have both "Paris" and "Metro" in them. So you need a way of ranking those, and that's where very classic metrics like Best Match 25 (BM25) come in. This is a very common ranking technique, but ranking is not the topic of this paper. The topic is simply retrieval: can you or can you not, from this query, retrieve the relevant things from the database that match the query, and exclude things that don't match? Okay. Now this is the good old way of searching, and there is the new, modern way, which is embedding-based searching. There you no longer analyze the words in the documents; instead, from each document you create what's called an embedding vector. You have some kind of neural network here, you ship the document in, and out you get some sort of a vector. When your query comes, like "Paris Metro", you also ship it into the same neural network and you also get out a vector. What you do now is consider the inner product between those vectors. These ones look pretty similar, they point in a similar direction, and that's why this document would be more relevant than that document over there. What this paper is concerned with is the same setup: they have documents, they have queries, and they want to know whether it is possible for an embedding model to be so good at assigning vectors to documents and queries that, for any given query, the relevant documents will score above the non-relevant documents. Cool. That's effectively it.
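The classic inverted-index lookup described above, posting lists per word, intersected across the query words, can be sketched in a few lines. This is a toy illustration only; the document texts and IDs are made up, and a real engine adds stemming, ranking, and compressed posting lists:

```python
# Toy inverted index: map each word to the set of document IDs containing it,
# then answer a query by intersecting the posting lists of its words.

docs = {
    1: "the dog barks",
    3: "a dog in paris",
    4: "paris is lovely",
    6: "dog rides the metro",
    8: "the paris metro map",
    13: "paris tourism guide",
    19: "metro timetable",
}

# Build the reverse lookup table: word -> set of doc IDs.
index = {}
for doc_id, text in docs.items():
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

def search(query):
    """Return IDs of documents that contain every query word."""
    posting_lists = [index.get(w, set()) for w in query.split()]
    return set.intersection(*posting_lists) if posting_lists else set()

result = search("paris metro")  # only document 8 contains both words
```

As in the transcript's example, "paris" appears in documents 3, 4, 8, 13 and "metro" in 6, 8, 19, so the intersection is document 8.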
If you go a bit further than this, you'll know these are vectors. So, what's a common embedding dimension, let's say these are 1,024 numbers, and you can take the inner product over these 1,024 numbers. We call these dense embedding models because almost all of these numbers will be non-zero; the vectors point in some direction, and the idea is that vectors pointing in a similar direction are similar somehow, they're related. However, and this might not be super obvious, the classic index is also a vector-based embedding, and you can think of it like this: your embedding dimensionality is not 1,024; it is the size of the vocabulary. Your vector has as many dimensions as there are words in either your language or your corpus, depending on how you build your vocabulary. It is very uncommon for these kinds of indexes to use SentencePiece or WordPiece tokenizers or byte-pair encodings or anything like we use in large language models, or frankly even in these embedding models. Much more common is to simply build a stemmed vocabulary of the whole language. So you have the vocabulary size as the embedding dimension, and then, for example, in the embedding vector for "Paris" you put ones at positions 4, 8, and 13, I cannot write numbers, and this goes on for a long time because there are a lot of words. No, wait, this is wrong, this is wrong. Obviously the vector is for a document. Yeah. Okay. So this here will be, I don't know, document number eight, and sorry about that. You embed the

Segment 3 (10:00 - 15:00)

documents, so the documents need to be assigned a vector, and the document will have a one at the positions of the words it contains. So "Paris" here is the second word in the vocabulary; at position two you have a one. Then somewhere down, "Metro" will be, I don't know, the 94th word; at position 94 there's also a one. There will obviously be other words in this document too, and those positions have ones, and the rest is zero. We call these sparse embeddings because they're zero almost everywhere, except at the positions that correspond to the words contained in the document. You can go a bit fancier: if a document contains a word twice, you can put a two right there and not only a one. But that's the general idea. And then you can do the same thing, an inner product between two of these vectors, and that will tell you. If the query is "Paris Metro", we do the exact same thing with the query: we put a one at the corresponding words, "Paris" is here, "Metro" is here. And you can see that the inner product between those two will be non-zero because they do in fact have overlapping words. The more overlapping words they have, the higher the inner product goes. And you could even define other inner products, rather than just the pointwise multiplication and summation, that would correspond to various retrieval functions, such that all words need to be present or something like this. So what I'm trying to say is that this here is actually also vector-based retrieval. It's an embedding, even if a very trivial one, and it is vector-based retrieval, except that it's typically sparse whereas the neural one is typically dense.
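The sparse bag-of-words embedding just described can be sketched directly: a vector as long as the vocabulary, with a one wherever the text contains that word, so the inner product counts overlapping words. The tiny vocabulary here is invented for illustration:

```python
# Sparse "embedding": one dimension per vocabulary word, a 1 at every
# position whose word occurs in the text. The inner product of a query
# vector and a document vector counts their shared words.

vocab = ["berlin", "paris", "dog", "map", "metro", "ticket"]
pos = {w: i for i, w in enumerate(vocab)}

def embed(text):
    v = [0] * len(vocab)
    for w in text.split():
        if w in pos:
            v[pos[w]] = 1
    return v

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

doc = embed("paris metro map")
query = embed("paris metro")
overlap = dot(query, doc)  # number of shared vocabulary words
```

Note the dimensionality equals the vocabulary size, which is exactly why this trivial embedding can represent any combination of words in a document.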
Sorry, this is a long preamble to the paper, but I feel it's important to put everything into perspective: typically when we talk about vector-based retrieval we think of neural networks and whatnot, but no, it's kind of all just a matter of framing. And what's important here is the dense vectors. Why are they different? Throughout this paper, it's going to be obvious: the sparse setting is different because the number of dimensions is so much bigger that it's much more flexible; you can represent any combination of words being in a document. The dense number of dimensions is way smaller, which means you don't have as much flexibility to represent arbitrary combinations, and that's a feature of neural-based retrievers. But this paper makes it into a bug. So let's go through the paper. It's a quick one to go through, I promise, because we won't follow the math too closely. They're saying information retrieval has moved from models dominated by sparse techniques, such as BM25, to those that use neural language models. That's what we just described. And we're predominantly using single-vector models, meaning we take a document and produce a single embedding for it. Now, I know with long documents you can chunk them and so on, but those are implementation details, if you will. For now, consider one document to be one embedding; even if we consider chunked documents, it does not change anything about the results here. Now, the point of the paper is that people have expected these neural models to solve increasingly complicated retrieval problems, to the point where people are effectively expecting miracles of these embedding vectors.
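For contrast with the sparse case, single-vector dense retrieval reduces to: encode everything once, then rank by inner product. A minimal sketch, where the hand-picked toy vectors stand in for the output of an actual neural encoder (no real model is involved, and the document names are made up):

```python
# Dense retrieval sketch: documents and the query are each mapped to one
# vector by some encoder (here: hard-coded stand-in vectors), and relevance
# is scored by the inner product between the query and document vectors.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical 4-dimensional "embeddings" of three documents.
doc_vectors = {
    "paris_metro_map": [0.9, 0.8, 0.1, 0.0],
    "berlin_ubahn":    [0.1, 0.9, 0.2, 0.0],
    "apple_pie":       [0.0, 0.1, 0.9, 0.4],
}

query_vector = [0.8, 0.7, 0.0, 0.1]  # stand-in embedding of "Paris Metro"

# Rank all documents by inner product with the query, best first.
ranking = sorted(doc_vectors,
                 key=lambda d: dot(query_vector, doc_vectors[d]),
                 reverse=True)
```

The key contrast with the sparse sketch is the fixed, small dimension: whatever the corpus, everything must be squeezed into those few numbers, which is exactly the constraint the paper's theory is about.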
They're saying: even with the rise of instruction-following retrieval benchmarks, models are asked to represent any relevance definition for any query. So the point of the paper is effectively that people are now expecting way too much of these embedding models, to the point where they expect arbitrary instruction following and arbitrary retrieval. And the paper has a point

Segment 4 (15:00 - 20:00)

like people really do expect miracles out of embedding vectors alone. That's why, if people are new to the information retrieval landscape, especially nowadays, they will often exclusively use embedding vectors, which is already how you recognize a beginner. They'll be like: okay, we're going to vectorize all the data, and then we're going to do nearest-neighbor retrieval, and that's going to solve all our problems, that's going to be magic. And that is definitely not the case. So the paper has a point that expectations nowadays are too high, and with instruction-following retrieval even higher. However, the paper rests its entire case on this. The paper is saying: since people are now expecting more of these models, let's effectively prove that you cannot expect arbitrary things from these models. And to me there's still a disconnect between those two. High expectations, I agree, they are too high. But the paper now says: we're going to prove that these models can't do arbitrary things, arbitrary combinations. It's almost a good argument. There are combinations of things that these models can't retrieve, no matter what. But like Gödel's incompleteness theorem, this is completely pointless for any practical application. In my mind this does not proportionally address the over-expectations people have of the models, which are still very much in line with what could be learned but are just a bit outside the sphere of what's possible currently. All right. So they're saying the QUEST data set uses logical operators to define different concepts and so on. Sure, that's really hard for embedding models.
If you have logical operators, so if you're not aware, if you search for something like this with embedding vectors, you may actually get stuff that has exactly this, but you will also get stuff that has combinations of things, that talks about two of these, and so on. They're also saying: yeah, there are definitions of relevance that require reasoning. So as soon as logic, logical reasoning, and abstraction enter the problem, embedding vectors really aren't your friend anymore. And yeah, these benchmarks push the boundaries of what dense retrievers are capable of. They're saying: we seek to understand at a more fundamental level what the limitations are. Our work aims to connect known theoretical results in geometric algebra with modern advances in neural information retrieval. They derive a lower bound on the embedding dimension needed to represent a given combination of relevant documents and queries, and they show that for a given embedding dimension there exist top-k combinations of documents that cannot be returned. Okay, let's pause here and frame it like this. I have a set of elements, and you get two numbers to describe each of them. You can assign this one 1 4, that one 2 5, that one 9 9, and so on. The mechanism to retrieve one is the inner product. You specify two numbers as your query, I compute the inner product, and I give you the element that matches. Can you retrieve any arbitrary single element? The answer is yes, because if I want to retrieve this one, my query is going to be 9 9; that element is going to have the highest inner product. If you want to retrieve this one, the query is going to be 1 4; that element is going to have the highest inner product. This is trivial; I'm not trying to trick you here. Now the crux comes. Can you retrieve

Segment 5 (20:00 - 25:00)

combinations of two elements? With what we have right here, let me see, I want you to retrieve these two. Okay, maybe my query is going to be 2 4; this is not twenty-four, this is the vector (2, 4). The inner product here is okay with the one; the two is better; the four is going to match this and this quite well. Oh, sorry, I made a mistake. Actually, you already see the problem: the nines here are just going to blow this inner product out of the water; their products are way higher. So maybe we need some sort of normalized inner product, actually cosine similarity, between these. That's my mistake, I'm sorry. So what I said before isn't true if we simply use the raw inner product; we want the cosine similarity, effectively just how similar the numbers are. What I'm trying to say is that as soon as it comes to combinations of two of these things, then, let's say I'm the adversary and I get to add elements to the set to thwart the query maker's plans. Say you want to hit exactly in the middle between the elements (2, 5) and (1, 4), to put a query in that's exactly in the middle, so that these two elements are closest to it. I can always put another element there that is closer to the query, and that makes it impossible to craft a single query that will hit both of these elements but not the element in the middle. Since each element is two numbers, this is a two-dimensional landscape. You can imagine the elements being spread out in the landscape, and it's just a matter of fact that if you do nearest-neighbor retrieval in two dimensions, you cannot craft your embedding so that you can retrieve an arbitrary 2-set of elements. A single element works, but just combinatorially you cannot retrieve arbitrary 2-sets.
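The two-dimensional intuition above can be checked by brute force: place a few points in the plane, sweep every query direction, and count how many distinct top-2 result sets inner-product retrieval can ever return. This is my own toy construction, not the paper's experiment; the points are simply spread evenly on a circle:

```python
import math

# With n = 6 documents embedded in 2 dimensions, there are n choose 2 = 15
# possible top-2 subsets, but a linear (inner-product) query can only ever
# return the two documents that are angularly adjacent to its direction.
# Sweeping all query directions counts how many top-2 sets are achievable.

n = 6
docs = [(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
        for i in range(n)]

def top2(q):
    order = sorted(range(n),
                   key=lambda i: q[0] * docs[i][0] + q[1] * docs[i][1],
                   reverse=True)
    return frozenset(order[:2])

# Sweep 3600 query directions around the circle.
achievable = {top2((math.cos(t * 2 * math.pi / 3600),
                    math.sin(t * 2 * math.pi / 3600)))
              for t in range(3600)}

total_pairs = n * (n - 1) // 2  # 15 pairs exist in principle
```

Only the 6 angularly adjacent pairs ever show up, out of 15 possible: a miniature version of the paper's claim that the embedding dimension caps the number of returnable top-k subsets.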
Do we need a giant amount of math to recognize that? No. Is this very obvious? Yes. Okay, you've got the whole paper. The rest is just math to find out exactly how many dimensions you need, which is interesting, but I hope this doesn't come as a surprise. Now, why is this not a problem for embedding-based retrieval? Because in embedding-based retrieval the whole assumption is that there is structure to the data and to the queries. I want to retrieve "Paris Metro" because that makes sense; it makes sense to retrieve these two things together, because the data has structure: cities have transportation systems, the metro is a transportation system, and so on. The nature of language and the world means that there is structure, and the whole point of machine learning is that data and the world have structure, and we can use this structure to create a compressed representation. By compressing we give up certain freedoms. We give up the freedom to represent arbitrary combinations of elements in favor of having a more condensed and generalizable representation. That's the entire point of machine learning. And this paper is doing a lot of math and wording around the fact that if you're doing machine learning, you're assuming a structure that you learn, and thereby giving up the ability to represent arbitrary combinations of data. Yes, that's the whole point. If I wanted to represent arbitrary combinations of data, I would just store the data, which is exactly what BM25 does. Okay. So, sorry, again: the reason why this paper has absolutely zero practical consequence is that we use embedding-based retrieval in situations where the data has structure and where that structure is what we're

Segment 6 (25:00 - 30:00)

interested in, and we're not interested in making arbitrary combinations. And yes, embeddings are now being pushed to things that they're not supposed to be used for. But I don't think anyone seriously uses embedding-based retrieval to represent arbitrary combinations, and this paper is just hinging on that point: look, you cannot make arbitrary combinations happen. It still does nothing practical. Okay. So, two things happen in this paper. One, they construct a data set, which is as follows. They have names, like Jon, Leslie, and so on, and they have things that one can like: quokkas, apples, bananas, rabbits, candy, and so on. They're going to construct 2-sets: Jon likes quokkas and apples, someone else likes quokkas and candy, Leslie likes apples and candy, and so on. Okay, arbitrary combinations as 2-sets. And then they're going to show that if your embedding dimension is too small, you cannot represent arbitrary 2-sets here; you cannot retrieve arbitrary 2-sets with inner-product embedding-based retrieval.
Yeah, again very obvious, but they construct the data set to show where exactly that point is where you fail to retrieve; on the other hand, they theoretically derive where that point is, and the two match up. That's pretty cool; I know I'm dunking on this paper, but that's pretty cool. Okay, so I will not go through the entire math here, but effectively the idea behind the math is to build so-called qrel (query relevance) matrices, where the matrix is generally zero, except it's one if a document is relevant to a query. Then you can build another matrix, called the score matrix, which is what your search engine actually gives you. What you want is to see that the two matrices are, not equal, but such that the ordering within each row is the same: for every query, the ordering of what you get out of the search engine puts the relevant documents above the non-relevant documents. Then they move to what's called the sign rank. The sign rank is, by definition, the smallest integer d such that there exists a rank-d matrix whose entries have the same signs as another given matrix. And if you think about it a little, these are not easy things to prove by any means; I could not prove them, so props to the math here. But in principle, if you can construct a matrix that has the same signs as another matrix, there's an equivalence between that and having a matrix whose ordering matches the relevance ordering. So the sign rank defines the rank of that matrix. And you might know that, for example, if you simply take one vector and another vector and build their outer product, you get a matrix.
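The outer-product construction just mentioned is a one-liner; here is a toy sketch with made-up vectors:

```python
# Outer product of two vectors: entry (i, j) is u[i] * v[j].
# Every row of the result is v scaled by a single number, so the
# matrix carries only one "direction" of structure.

u = [1, 2, 3]
v = [4, 5]

outer = [[a * b for b in v] for a in u]  # a 3x2 matrix

# Row i equals v scaled by u[i]; e.g. row 1 is row 0 times u[1] / u[0].
scale = outer[1][0] / outer[0][0]
```

Because every row is a multiple of every other, the result has rank one, which is the point the transcript makes next.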
An outer-product matrix like that has rank one, because effectively every row and column are just scalar multiples of each other. What that would mean is that your data has a lot of structure. Effectively, with a rank-one matrix, let's say you have a two-dimensional space but your data all lies on one line; that's a one-dimensional data set. If you wanted to represent this data, it would be rank-one data, because all you would need to know is where on this line a point lies; you already know it's not lying anywhere off this line. That's rank-one data. In higher dimensions you can imagine the same thing. Now, the point is: the sign rank of this matrix is effectively a direct predictor of how many dimensions you need to represent a given data set. If I wanted to build an embedding model here, I would need one-dimensional

Segment 7 (30:00 - 35:00)

embeddings, but as soon as my data set looks like this, I need two-dimensional embeddings. There are more complicated things than linear relationships. If my data set looks like this circle, I also need just one dimension; I just need to know the angle from the top, along the circle, where my data point is. But here we mostly care about linear relationships, which is fair in high dimensions; that kind of non-linear structure is really uncommon, except, I don't know, in geometric deep learning or something. So again, they're going to prove something about these things, and ultimately, if you look at statements like this one, they relate the sign rank of a matrix to the ordering between what the search engine gives you and the actual relevance matrix. As a consequence they get what they say here: in the context of a vector embedding model, this provides a lower and upper bound on the dimension of vectors required to exactly capture a given set of retrieval objectives. Practically, this means for any fixed dimension d there exists a binary relevance matrix which cannot be captured via d-dimensional embeddings, as there are matrices with arbitrarily high sign rank. In other words, retrieval tasks whose qrel matrices have a higher sign rank are more difficult to capture, requiring embedding models with a higher embedding dimension. So if you have more data, and that data is of a more complicated structure, and the most complicated structure is obviously arbitrary combinations of two elements of sets, right?
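The row-wise order-preserving condition the paper relies on, every relevant document scoring strictly above every non-relevant one, per query, can be sketched as a small check. The matrices here are invented toy data, not anything from the paper:

```python
# A binary "qrel" matrix says which documents are relevant to each query;
# a score matrix is what the retriever actually outputs. The retriever
# captures the task iff, row by row, every relevant document scores
# strictly higher than every non-relevant one.

qrels = [
    [1, 1, 0],  # query 0: documents 0 and 1 are relevant
    [0, 1, 1],  # query 1: documents 1 and 2 are relevant
]

scores = [
    [0.9, 0.7, 0.1],
    [0.2, 0.8, 0.6],
]

def row_wise_order_preserving(qrels, scores):
    for rel_row, score_row in zip(qrels, scores):
        relevant = [s for r, s in zip(rel_row, score_row) if r == 1]
        irrelevant = [s for r, s in zip(rel_row, score_row) if r == 0]
        if relevant and irrelevant and min(relevant) <= max(irrelevant):
            return False  # some non-relevant doc outranks a relevant one
    return True

ok = row_wise_order_preserving(qrels, scores)
```

The paper's theory is about when score matrices satisfying this condition can even exist for embeddings of a given dimension.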
Arbitrary 2-sets are the most difficult thing you can do to a search engine: how about you build a nearest-neighbor retrieval index where I can retrieve arbitrary 2-sets of elements. That's the most difficult thing; you can even adversarially construct a data set like this, and that's what they're doing. And again, it's in a similar vein to how, in any formal system, there are statements that are true that the formal system itself cannot prove. It's a bit like that. Second, if we are able to embed a given matrix A in a row-wise order-preserving manner, so if we're actually in a situation where we can represent a piece of data with a given embedding dimension, this implies a bound on the sign rank of 2A minus the all-ones matrix. In particular, this suggests a practical mechanism for determining an upper bound on the sign rank of matrices via gradient-descent optimization of free embedding representations. And this is interesting, because the way they experimentally show that their theory holds is by defining an experiment where the vectors themselves are directly optimizable via gradient descent. So we don't train an embedding model; we have a data set of queries and documents, and we directly gradient-descend into these embeddings. This has no relation to any training data or anything like that. We directly optimize on the test set; we are trying to overfit to the test set. However, the test set is this adversarial test set, where we have arbitrary combinations as 2-sets.
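The free-embedding experiment can be sketched in pure Python: the document and query vectors themselves are the parameters, and we gradient-descend directly on the test constraints with a hinge loss. All sizes, the margin, and the learning rate here are made up for illustration; the paper's actual setup uses much larger instances and proper optimizers:

```python
import random

# "Free embedding" sketch: no encoder, the vectors ARE the parameters.
# Each query corresponds to one 2-subset of documents and must score both
# of its relevant documents above every other document (hinge loss per
# violated relevant/irrelevant pair), optimized by plain SGD.

random.seed(0)
n_docs, dim, margin, lr = 4, 4, 0.1, 0.05
docs = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(n_docs)]

queries = [(a, b) for a in range(n_docs) for b in range(a + 1, n_docs)]
qvecs = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in queries]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def total_loss():
    loss = 0.0
    for (a, b), q in zip(queries, qvecs):
        for p in (a, b):
            for neg in range(n_docs):
                if neg not in (a, b):
                    loss += max(0.0, margin - dot(q, docs[p]) + dot(q, docs[neg]))
    return loss

initial = total_loss()
for _ in range(500):
    for (a, b), q in zip(queries, qvecs):
        for p in (a, b):
            for neg in range(n_docs):
                if neg in (a, b):
                    continue
                if margin - dot(q, docs[p]) + dot(q, docs[neg]) > 0:
                    # Active hinge: SGD step on q and both documents.
                    for k in range(dim):
                        q[k] += lr * (docs[p][k] - docs[neg][k])
                        docs[p][k] += lr * q[k]
                        docs[neg][k] -= lr * q[k]
final = total_loss()
```

With dim equal to the number of documents this instance is representable (one-hot documents would solve it exactly), so direct optimization drives the loss down; the paper's point is that past a critical n for a given dimension, no assignment of free vectors works at all.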
And what you can already expect, and what obviously happens, is that as you increase the number of people who can like things, and the number of things they can like accordingly, it gets harder and harder for a given embedding dimension to capture all of the relationships, even if you can directly optimize the vectors. And that's what they do right here. And you can hopefully see in the

Segment 8 (35:00 - 40:00)

graph here the critical n value where the dimensionality is too small to successfully represent all the top-2 combinations. So what they're going to do is pick an embedding dimension, let's say 25, and figure out at which n, at which size of data, they can no longer, no matter how they optimize the embedding vectors, represent this data. That's going to land somewhere, and then what they care about is this relationship right here, which in their case seems to be of degree three for top-2 combinations, and that apparently matches the theoretical predictions, which is pretty cool. Now, the critical n values for certain embedding sizes: if you look at, for example, 1,000-dimensional vectors, you can go up to about 4 million, where n is the number of elements in the set from which you have to retrieve 2-sets. That critical n value is at 4 million, which I find crazy; I would have expected much less, though I guess it makes sense, you have a thousand dimensions. But still: with a thousand-dimensional embedding you can represent arbitrary combinations of 2-sets over up to 4 million elements. That's a lot. So even under this theoretical limitation, on a wild case adversarially constructed to make embeddings break, we can still represent 4 million things accurately, without flaw. That's pretty cool. Now, obviously, in practice you don't get to directly optimize the embedding vectors, because you don't have the test set, but still, I would not have expected that. So then they go on, and this is where it gets a bit more shady. They go on and say: oh, how about real-world data sets, and then first they hate on real-world data sets.
They say real-world datasets typically come with a static evaluation set and a limited number of queries, as relevance annotation is expensive to do for each query. This means, practically, that the space of queries used for evaluation is a very small sample of the space of potential queries. And yes, that is true. Retrieval datasets are notoriously small, because if you want to do it correctly, you have to get a human, give them a query, and have them look at every single thing in the corpus and decide not just whether it's relevant, but how relevant it is in relation to everything else in the corpus. That's a giant amount of work. And even if you don't do that fully, annotation for retrieval is still really hard and expensive. Now, what they say is: the space of queries is a very small sample of the space of potential queries. And that's where I have a bit of a quarrel, because again, I get that some people expect too much of embeddings, but it's not useful to want to support arbitrary combinations of queries. Data isn't random, retrieval intents and queries aren't random, and which data people are looking for, given which queries, isn't a random process. So it's not useful to evaluate search engines and retrieval quality on their ability to retrieve arbitrary subsets of data. That's just not useful. The whole point of machine learning is to assume the data has structure and to exploit that structure, so we can get away with compressed representations and gain generalizability. That's what we do, not arbitrary boolean combinations of things.

Segment 9 (40:00 - 45:00)

they say: "Although it is not possible to instantiate all combinations when using large-scale corpora, search evaluation datasets are a proxy for what any user would ask for, and ideally would be designed to test many combinations, as users will do." Then they say that this isn't happening in practice, and that has a reason: users do not ask for arbitrary combinations of things. Most people are looking for things that make sense to search for, and that is the same sense of structure the data has. And that's where I think this paper falls off a cliff a little bit: when it tries to connect this too much to actual practical application. It would be fine if the paper said, look, we have a theoretical bound, effectively a combinatorics result. Cool. And they can say, look, this relates to embedding retrieval in the following way. Cool. But the moment they try to make this about the practical implications, and go after the benchmarks here, which are flawed, absolutely, and yes, people do expect too much of embedding vectors, that's a bit too much. So they have this dataset, and current embedding models are obviously terrible at it. Who could have guessed, right? And then they ask themselves: is this domain shift? Are we simply operating in a different domain? Let's fine-tune on a train/test split of our dataset. Random data.
I'm not sure how this is surprising. I don't know, come on: training on LIMIT, on the training set, does not significantly help, indicating that the issue is not domain shift. Yes, I'm sorry, but I could have told you that, and you knew this would happen. I get it, this is paper writing, this is academia, but "although our queries look similar to standard web search queries"? No, they don't. And embedding models are not a database, okay? Then they look at the correlation with MTEB, which is a retrieval benchmark, and they're like, "Wow, look at that, there's no obvious correlation between how models perform on our dataset and how they perform on a standard retrieval benchmark." No crap, because your dataset is not testing any sort of practical retrieval capability. So, in the limitations after the discussion, they say: although our experiments provide theoretical insight for the most common type of embedding model, single-vector dense embeddings, they do not necessarily hold for other architectures, such as multi-vector models; we leave it to future work to extend our theoretical connections to these settings. I can tell you right now what's going to happen with multi-vector models: your dimensions just have to go up. If you have five vectors per document, then there is going to be a dimension at which they cannot represent arbitrary, probably, six-element sets, I don't know; maybe you can get up to five-factorial sets or something like this. But there is going to be some number of combinatorial things you cannot represent, because you simply don't have enough representational power. This isn't about how many vectors you have. This is simply about how many

Segment 10 (45:00 - 48:00)

numbers you can play with and how those numbers interact with each other. And if you don't have enough numbers, you cannot represent arbitrary data with them. So yeah, that future work is possibly themselves writing another paper, and again people are going to go, "Oh, wow, I always found these embedding vectors a bit suspicious." Cool. So that's two papers for the price of one. And they say: we did not show a theoretical result for the setting where the user allows some mistakes. So, what if you don't need to retrieve things exactly and are allowed some mistakes? We leave that to future work. I can tell you what's going to happen there too: you're going to get a bound that says, if you allow for 5% mistakes, then your embedding dimension has to be larger by some function depending on those 5%. Whether that's a high-probability bound or an absolute one, I don't know yet; you can very likely get an absolute bound. It's simply a bit more math; the principle is exactly the same as here. So, three papers for the price of one. Okay, that was it. Sorry, I know I was very negative on this paper. This is a solid paper. I have absolutely no doubt that the math is correct. It is explained well, it is written well, and they do the exact experimental research that aligns with their theory: they show that yes, indeed, their theory does hold when they put it to the test. My gripes with this paper are how they try to connect it to the quote-unquote real world, and the sort of implications they draw from that.
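For concreteness, here is what multi-vector "late interaction" scoring looks like, in the ColBERT style; this is my own minimal sketch, not anything from the paper. The argument above is that this buys you more numbers per document (m vectors of dimension d means m*d numbers), not an escape from the counting bound:

```python
import numpy as np

def maxsim_score(query_vecs, doc_vecs):
    """ColBERT-style MaxSim: each query vector is matched to its
    best-scoring document vector, and the best matches are summed.
    query_vecs: (n_query_vecs, d), doc_vecs: (n_doc_vecs, d)."""
    sims = query_vecs @ doc_vecs.T        # (n_query_vecs, n_doc_vecs)
    return float(sims.max(axis=1).sum())  # best doc vector per query vector
```

With two orthogonal query vectors and a document containing matching vectors, each query vector finds its own best match, so the score is the sum of the per-vector maxima.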
And that is not necessarily a criticism of the people here, but more of the academic landscape nowadays, where a lot is riding on writing impactful papers, and impact nowadays rests much more on perception than on actual content. So you can't even fault them for doing this, because it might just be the landscape that forces everyone to put a cool title on it and connect it to practice in a way that maybe doesn't necessarily carry these big implications. I don't know; if I were a PhD student nowadays, I'd probably do the exact same thing, because this is going to decide whether you get a good job in a prestigious lab or not. And if you're at a company like Google, this might decide whether PhD students want to come work for you: you want the best talent, so you need the publicity, you need to show that you are at the forefront, so that people want to come to you. A lot is riding on this with the rise of AI, and yeah, that's the times we live in. That's it. Sorry for being super duper grumpy. Thanks to the people who wrote this paper; I still appreciate it, and it's a good paper. I would definitely accept it as a reviewer. That's it. Thank you. Bye-bye.
