[Classic] Word2Vec: Distributed Representations of Words and Phrases and their Compositionality
Duration: 31:22


Yannic Kilcher · 16.07.2020 · 25,537 views · 1,069 likes


Video description
#ai #research #word2vec

Word vectors have been one of the most influential techniques in modern NLP to date. This paper describes Word2Vec, which is the most popular technique to obtain word vectors. The paper introduces the negative sampling technique as an approximation to noise contrastive estimation and shows that this allows the training of word vectors from giant corpora on a single machine in a very short time.

OUTLINE:
0:00 - Intro & Outline
1:50 - Distributed Word Representations
5:40 - Skip-Gram Model
12:00 - Hierarchical Softmax
14:55 - Negative Sampling
22:30 - Mysterious 3/4 Power
25:50 - Frequent Words Subsampling
28:15 - Empirical Results
29:45 - Conclusion & Comments

Paper: https://arxiv.org/abs/1310.4546
Code: https://code.google.com/archive/p/word2vec/

Abstract: The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
Authors: Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher
LinkedIn: https://www.linkedin.com/in/yannic-kilcher-488534136/

If you want to support me, the best thing to do is to share out the content :) If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Contents (9 segments)

Intro & Outline

Hi there! Today we'll look at "Distributed Representations of Words and Phrases and their Compositionality" by Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado and Jeffrey Dean. This is another historical paper; it's the middle one of the three papers that introduce the original word2vec algorithm. As you might know, word2vec was extremely influential in NLP from this paper on, basically until recently, where it has sort of gone out of fashion a bit in research with the rise of things like ELMo and BERT, but it's still very relevant. So we'll look at this historical paper today with the hindsight of being a couple of years into the future: as you can see right here, this was released in 2013, so it's seven years later now, and we'll look back and see what they said back then about the system. This is not going to be a well-polished PowerPoint presentation of how word2vec works; we're going to look at the paper and read it together. If you like content like this, if you like historical paper readings, let me know in the comments, share it out if you do like it, and of course subscribe, because I enjoy these kinds of historical papers, even though many people might already know what these things are. Okay, let's go through the paper, pick up their ideas and put them in context. They say: "the recently introduced continuous skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships."

Distributed Word Representations

The skip-gram model was already introduced by Mikolov in an earlier paper that came out, I believe, only one or two months prior to this one. As I said, word2vec is a series of papers; I don't think there is a paper actually called "word2vec". They released the code along with the paper, and the code was called word2vec. So the skip-gram model was introduced previously, but it is restated right here. In the skip-gram model, what you're trying to do is get a distributed word representation. What does that mean? It means that for each word in your language, let's take these words right here, you want to come up with a vector that somehow describes that word in a continuous fashion. So the word "to" might be mapped to, I don't know, (0.1, 0.9, 0.3), "learn" might be mapped to (-0.5, ...), and so on. Each word gets assigned a vector in the same-dimensional space, and what the previous paper kind of discovered is that if you do this correctly, then these vectors have some interesting properties. We can already jump ahead, because this was already researched a bit in the last paper: the semantics of these vectors will be something like this. Here they have a two-dimensional PCA, so these are the first two dimensions of the thousand-dimensional skip-gram vectors. The vectors they obtain can do things like this, where they can show that in these spaces there appears, for example, to be a vector direction that characterizes the capital of a country. So if you take a few countries and their capitals and average that difference vector, you get kind of a direction for the capital-ness of a city given a country; you can see that there is a pretty clear relation here. Now, some of these things have later been revised such that they ultimately ended up being not that impressive. For example, there was always this kind of math with vectors, and I believe this might be in the last paper, where they discovered that if you take the vector for "king", subtract the vector for "man" and add the vector for "woman", that would result in the vector for "queen". The way they did it was basically to do this calculation right here, and then, at the point they ended up at, they searched for the nearest neighbor in their vocabulary, and that turned out to be "queen". But in order to actually get "queen", you have to exclude the original word "king". People quickly discovered that if you don't exclude the original word, the result of this kind of arithmetic will almost always lead back to the original word, and then a lot of these analogy tasks are simply the result of discarding that word during the nearest-neighbor search, and "queen" just happens to be one of the closest remaining words; it's much less dependent on which exact calculation you do here. So there has been a lot of follow-up work analyzing and criticizing this vector math, but definitely we know that these word vectors turned out to be extremely helpful, syntactically and semantically relevant, in downstream tasks, because they have performed very well in practice.
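To make the king − man + woman arithmetic concrete, here is a tiny numpy sketch. The vectors are made-up toy values (real word2vec vectors are learned from a corpus; the word list and the noise scale are my own assumptions for illustration), but it shows the nearest-neighbor search and why excluding the input words matters:

```python
import numpy as np

rng = np.random.default_rng(0)
words = ["king", "queen", "man", "woman", "apple"]
# Hypothetical toy vectors; real word2vec vectors come from training.
vecs = {w: rng.normal(size=8) for w in words}
# Force the analogy to hold approximately in this toy space:
vecs["queen"] = (vecs["king"] - vecs["man"] + vecs["woman"]
                 + rng.normal(scale=0.05, size=8))

def analogy(a, b, c, exclude_inputs=True):
    """Word nearest to vec(a) - vec(b) + vec(c) by cosine similarity."""
    target = vecs[a] - vecs[b] + vecs[c]
    banned = {a, b, c} if exclude_inputs else set()
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return max((w for w in words if w not in banned),
               key=lambda w: cos(vecs[w], target))
```

With `exclude_inputs=False`, the search often collapses back to one of the query words, which is exactly the criticism mentioned above.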

Skip-Gram Model

So how does the skip-gram model work? How does it assign vectors to each word? First of all, it has a dictionary: there is an input word, and for each word the dictionary basically says that, you know, the word "to" is going to be mapped to this vector (0.1, ...), the word "learn" is going to be mapped to that vector, and so on. Then you also have these output vectors right here. What you're trying to do is take a phrase from the data set, like this one right here, take out one word, like this word "vector" right here, and frame this as a prediction task, in this case four different prediction tasks. So you're telling your machine: I give you the word "vector"; which other words are around the word "vector"? You just tell it that, nothing else, and the correct answers in this case would be "to", "learn", "word" and "representations". So you construct four different training examples, where you have an x and a y: the x is always "vector" and the y is "to"; in the next training sample the x is "vector" and the y is "learn"; and so on. Each training sample is a classification task, and the classification task is: you have the input word and you classify it into one of many, many classes, namely there are as many classes as you have words in the dictionary, so each word in the dictionary has a class associated with it. In ImageNet you have like a thousand classes, and that's already a lot, but in these tasks you're going to have a hundred thousand classes, because there are a hundred thousand words in the English language that you want to treat, and there are many more, but in this case they leave out all the words that appear fewer than five times in their corpus. That's still a lot of words, so it's a super-duper large classification task. But ultimately, if you do something like this, the representation that you end up with is going to be very good at these kinds of downstream tasks, and that's what they discovered. So their skip-gram model is nothing else than taking a word and predicting the surrounding words from that word. This is the formal statement of the skip-gram objective: the objective is to maximize the average log probability, this one. So for the word we're considering, the word w_t, we want to maximize the log probability of each word w_{t+j} around w_t in a context window of size c. That's exactly what we did before: we take a word, like this "model" right here, and from it we predict all of the words around it in a given window. That's the entire objective, and that will give you very good representations. And this is how you would implement it: you have these vector representations v that come from your original dictionary, those are the things you learn, and then, because you have like a 30,000-way classifier, you know that a classification layer is nothing else than a linear layer followed by a softmax operation, and that linear layer also has parameters: these are the primed vectors v'. So first you have the lookup in the dictionary for the word vector right here, and this is the vector of the classification layer. Now there are modifications where you can share these vectors and so on, but ultimately you care about the dictionary vectors right here, and the other vectors are simply the classification layer's weights. So you can see that what you're trying to maximize is the inner product between the word you're considering and the words around that word, and since you're doing a classification task, you need to normalize. This is the normalization constant, and it goes over your entire vocabulary. That's what they tackle here: they say W is the number of words in the vocabulary, and this formulation is impractical because the cost of computing the gradient is proportional to W, which is often large: 10^5 to 10^7 terms. So tens of millions of terms in your vocabulary; that's just not feasible. People have been trying different ways to get around very large numbers of classes, and here it seems that this is really their bottleneck. In the previous paper they already showed that this objective can give you very good word representations, but now we need to get around the fact that we have such large vocabularies.
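To make the full-softmax objective concrete, here is a minimal numpy sketch of one training position. This is my own toy illustration (tiny vocabulary, random untrained vectors), not the paper's implementation; it just shows the two embedding tables and the normalization over the whole vocabulary that makes this expensive:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["to", "learn", "word", "vector", "representations"]
V, D = len(vocab), 8                        # vocab size, embedding dimension
idx = {w: i for i, w in enumerate(vocab)}

W_in = rng.normal(scale=0.1, size=(V, D))   # "dictionary" input vectors v_w
W_out = rng.normal(scale=0.1, size=(V, D))  # output / classifier vectors v'_w

def log_p(context, center):
    """Full-softmax log p(context | center): inner products with every
    output vector, normalized over the entire vocabulary (the costly part)."""
    scores = W_out @ W_in[idx[center]]
    scores -= scores.max()                  # numerical stability
    return scores[idx[context]] - np.log(np.exp(scores).sum())

# One training position: center word "vector", its four context words as targets.
objective = np.mean([log_p(c, "vector")
                     for c in ["to", "learn", "word", "representations"]])
```

The gradient of `log_p` touches every row of `W_out`, which is exactly why this is proportional to W and infeasible for real vocabularies.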

Hierarchical Softmax

The first idea here is the hierarchical softmax, and this is kind of a tangent. I find this paper, by the way, sort of hard to read, because it's like a half engineering paper. So first they introduce this hierarchical softmax, which is kind of a distraction: it's a "here is what we considered first but then didn't end up using". They do compare with it, but the flow of the text is such that you expect this to be part of the final model, which it isn't. In the hierarchical softmax, instead of having this giant multi-class classification task, you take all of these classes and put them into a sort of tree. So instead of classifying, let's say, a thousand ways, we first classify two ways, and then again two ways from each node. A thousand is about two to the ten, so we need approximately ten layers of this before we actually arrive at a thousand classes, but it also means that we only have two-way classifications each time. So in the hierarchical softmax we build trees like this, and then, given a word, we look up its vector and classify it at each of these nodes. Your output isn't going to be a thousand log probabilities; your output is going to be a binary log probability for each of the nodes along the path. So you ask: is it in the upper half or the lower half of my classes? Okay, the upper half. Is it in the upper or lower half of that? And so on, and you learn to predict all of these junctions, which ends up with you having to predict much less. Now, of course, you are constrained: you impose a very big prior on the class distribution, and the classes aren't independent anymore, namely, if two classes are in the same subtree, their predictions are going to be correlated, because the path to them is partially the same. So how you arrange the classes here is very important, and there has been a lot of work on this, but as I said, this is rather a distraction: hierarchical softmax is one way to solve this, however they went with a different way right here.
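Here is a tiny sketch of the binary-decision idea, for a made-up balanced tree over four words (the tree layout, paths and codes are my own toy assumptions; the real implementation builds a Huffman tree over the vocabulary). Each internal node gets its own vector, and a word's probability is the product of the two-way decisions along its root-to-leaf path:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Balanced toy tree over 4 words: 3 internal nodes (root=0, children=1,2).
paths = {"the": [0, 1], "cat": [0, 1], "sat": [1, 0], "mat": [1, 1]}  # bug? no:
paths = {"the": [0, 1], "cat": [0, 1], "sat": [0, 2], "mat": [0, 2]}  # node indices
codes = {"the": [0, 0], "cat": [0, 1], "sat": [1, 0], "mat": [1, 1]}  # left/right bits

node_vecs = rng.normal(scale=0.1, size=(3, D))  # one vector per internal node
v_center = rng.normal(scale=0.1, size=D)        # input vector of the center word

def log_p(word):
    """log p(word | center) as a chain of binary (two-way) classifications."""
    lp = 0.0
    for node, bit in zip(paths[word], codes[word]):
        p_right = sigmoid(node_vecs[node] @ v_center)
        lp += np.log(p_right if bit else 1.0 - p_right)
    return lp
```

Only log₂(V) node vectors are touched per word instead of all V output vectors, and the sigmoids guarantee the leaf probabilities still sum to one.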

Negative Sampling

They went with this approach called negative sampling, which has been very influential, not only in word2vec: negative sampling is one of the cornerstones of the current trend in self-supervised learning, in contrastive estimation and so on. All of this pops up in unlikely ways in other fields, and I'm not going to say it originated here, but it definitely was introduced into the popular deep learning world right here. So they say: an alternative to the hierarchical softmax is noise contrastive estimation (NCE). NCE posits that a good model should be able to differentiate data from noise by means of logistic regression. That seems very reasonable; this is similar to the hinge loss and so on. "While NCE can be shown to approximately maximize the log probability of the softmax, the skip-gram model is only concerned with learning high-quality vector representations, so we are free to simplify NCE as long as the vector representations retain their quality. We define negative sampling by the following objective." This is very interesting: they say, okay, noise contrastive estimation approximately maximizes the log probability, so NCE would actually be the correct way to approximate their problem; however, as long as something reasonable comes out, they're free to change it up a bit. So they go with this negative sampling approach, and you can see that it's almost the same. It's written a bit differently from the original softmax, because that was written as a fraction and here it's a sum, but what you're trying to maximize in the negative sampling framework is still the inner product of the word you're considering and the words around it. You're still trying to predict the words around you, but now, instead of having this prediction softmax over all of the classes, you only compare against a subset of classes: you sample k words from your vocabulary at random, and you're trying to minimize the inner product between those words and your word. So what does that ultimately lead to? It leads to the following. You have a word, like this word "negative" here, and you're not trying so much to predict the word "sampling"; what you're trying to say is: in my space right here, I simply want "sampling" to be closer than any other word that's not in the context window. So here is my word "negative", and "sampling", and I want these two to be close, and if I sample another word, like the word "cake", I simply want that to be further away than the word "sampling". So this is now a comparison: it's not "I classify 'sampling' as the highest class"; it's "I want to classify the word 'sampling' higher than the other sampled classes". And this is now much easier: instead of a thousand-, ten-thousand- or million-way classification, I now have a (k+1)-way classification. Pretty easy: I simply sample k other words, and I assume that, because I have so many words, the chance that I actually sample one that's in my context window is very small. So I sample other words and say: well, these other words are random, they have nothing to do with the current frame that I'm looking at, so they can be whatever they want, but at least they should be farther away than the words that are actually in my context. And that is negative sampling: the process of sampling negatives, and then making sure that the positives, which in this case are the words in the context, are classified with a higher probability than the negatives, for a given input word. That's it, that's negative sampling. And of course, as I said, you recognize this from current things like self-supervised learning, where you want to have the same image augmented twice go through the pipeline: you augment with a little bit of different noise each time, and at the end you say these two should be close together while this other one should be far apart. It's the exact same thing here, except that you obtain the positive and negative samples differently: the positive samples are everything in the context, and the negative samples are just randomly drawn from the data set. And that works much, much faster, and it turns out to give you vectors that are pretty good, and you can train with higher-dimensional vectors and bigger vocabularies. This has turned out to be very influential. Now, with the rise of BERT and so on, word2vec is kind of getting forgotten, but this was a revolution in distributed vectors. Distributed representations sort of existed before, but they weren't really something people used. What people would do before is n-gram models: they would chunk up their sentences into overlapping n-grams and then have a big giant table where they index their n-grams, so the word "hello" is ID 1, the phrase "hello there" is ID 2, and so on. You have a big table for all the n-grams, and then you would do this kind of bag-of-words estimation, where you take whatever n-grams appear in a sentence and run a big classification that associates the n-grams with each other. So distributed word representations were kind of a revolution at that point, especially distributed representations that actually outperformed these old n-gram methods.
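The (k+1)-way comparison can be written down in a few lines. This is a hedged toy sketch of the negative-sampling objective for a single (center, context) pair, with random untrained vectors (the dimensions and K are my own choices, not the paper's settings): log σ(v′_pos · v) plus the sum of log σ(−v′_neg · v) over the K sampled negatives.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 8, 5          # embedding dimension, number of negative samples

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v_center = rng.normal(scale=0.1, size=D)     # input vector of the center word
v_pos = rng.normal(scale=0.1, size=D)        # output vector of one true context word
v_negs = rng.normal(scale=0.1, size=(K, D))  # output vectors of K random "noise" words

# Negative-sampling objective for one (center, context) pair:
#   log σ(v'_pos · v)  +  Σ_k log σ(-v'_neg_k · v)
# i.e. K+1 binary comparisons instead of a full-vocabulary softmax.
objective = (np.log(sigmoid(v_pos @ v_center))
             + np.sum(np.log(sigmoid(-(v_negs @ v_center)))))
loss = -objective
```

Maximizing this pushes the true context word's inner product up and the K random words' inner products down, which is exactly the "closer than any non-context word" intuition.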

Mysterious 3/4 Power

There are a number of tricks right here that are, I think, not understood to this day. For example, the question is how you sample these negative samples. The formula basically says: get k words from your vocabulary at random, according to this distribution right here. Now how are you going to do that? Basically, you have a spectrum of options. One end of the spectrum is completely uniform: we sample each word with the same probability. The other end is something like sampling according to the unigram distribution; these are opposites in this fashion. On the one hand you say: hey, some words appear way, way more often than others; shouldn't we prefer them when we sample? If in the corpus one word appears 50 times more often than another, shouldn't we sample it 50 times more often as a negative, because it's so abundant and should get a higher classification accuracy? Whereas on the other hand you could say: no, we should simply sample every word in our dictionary uniformly. They came up with something in between. They say: "both NCE and negative sampling have the noise distribution as a free parameter. We investigated a number of choices and found that the unigram distribution raised to the 3/4 power (i.e., U(w)^{3/4}/Z) significantly outperformed the unigram and the uniform distributions, for both NCE and negative sampling, on every task, including language modeling." This, I think, is a mystery to this day: it turned out that this exponent is magically much better than an exponent of one, or even one half; you might reasonably assume the square root would be a natural choice, but 3/4 turned out to be very good and very mystical. So what does it mean? It means you strike a balance between words that appear often and words that don't. Usually in these kinds of corpora you have a power law: very few words appear very often, and then there is a very long tail of words. What you want is to sample these tail words more than they would appear under the raw unigram distribution, because if you sample purely according to the unigram distribution, you basically disregard the tail words, you forget about them, and your performance suffers, because they do appear every now and then. So you want to push the frequent words down a little bit, and the optimal amount for that "little bit" turns out to be raising the distribution to the 3/4 power. Strange, but it turned out to work well.
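You can see the flattening effect of the 3/4 power on a made-up power-law count table (the counts below are toy numbers of my own, purely for illustration):

```python
import numpy as np

counts = np.array([1000.0, 200.0, 50.0, 10.0, 1.0])  # toy power-law unigram counts

unigram = counts / counts.sum()      # raw unigram distribution U(w)
smoothed = counts ** 0.75
smoothed /= smoothed.sum()           # noise distribution P(w) ∝ U(w)^{3/4}

# The head word loses probability mass, the tail words gain it:
# the distribution sits between the unigram and the uniform extremes.
```

With these toy counts the head word drops from about 79% to about 69% of the sampling mass, while the rarest word's mass goes up several-fold, which is exactly the "in-between" behavior described above.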

Frequent Words Subsampling

The other thing they do is a subsampling of frequent words. Again, this is a way to push down the words that appear often. They say: the most frequent words can easily occur hundreds of millions of times, like "in", "the" and "a". Such words usually provide less information value than the rare words. For example, while the skip-gram model benefits from observing the co-occurrences of "France" and "Paris", it benefits much less from observing the frequent co-occurrences of "France" and "the", as nearly every word frequently co-occurs within a sentence with "the". So they use another trick to counter this imbalance between rare and frequent words: a simple subsampling approach, where each word in the training set is discarded with a probability P(w_i) computed by that formula right there. You might be asking, again, why this formula? This is the discard probability of a word, and it is one minus the square root of t over f, where t is a chosen threshold and f is the frequency with which the word appears in the corpus. As you can see, as a word appears more often in the corpus, this discard probability goes up, so a word is discarded with higher probability the more often it appears. They chose this subsampling formula because it aggressively subsamples words whose frequency is greater than t while preserving the ranking of the frequencies. "Although this subsampling formula was chosen heuristically, we found it to work well in practice. It accelerates learning and even significantly improves the accuracy of the learned vectors of the rare words, as will be shown in the following sections." So again, something sort of arbitrary; it's more understandable than the 3/4 power, but still arbitrary. They experimented around, found that this works well, and then everybody ended up using it. That's how this kind of stuff happens. Okay.
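The subsampling formula itself is tiny; here is a sketch with the threshold t = 10⁻⁵ that the paper suggests for large corpora (the example frequencies below are my own illustrative numbers):

```python
import math

t = 1e-5  # threshold suggested in the paper for large corpora

def discard_prob(freq):
    """P(w) = 1 - sqrt(t / f(w)): probability of dropping a training
    occurrence of a word with relative corpus frequency `freq`,
    clipped at 0 so rare words are always kept."""
    return max(0.0, 1.0 - math.sqrt(t / freq))
```

For a word like "the" with, say, f ≈ 0.05, almost every occurrence is dropped (P ≈ 0.986), while any word with f ≤ t is never dropped; in between, the ranking of frequencies is preserved.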

Empirical Results

So now we get into the empirical results, and the empirical results in this case were already sort of given in the previous paper, but here they have the analogical reasoning task, where you can see that negative sampling did outperform the others by quite a bit: the negative sampling approaches outperformed the hierarchical softmax and the noise contrastive estimation. In the previous paper they also compared with other baselines and saw that it outperforms those as well, while being time-efficient. You can see that, especially with these subsampling approaches, the training time is down to 36 minutes, and I think they train on a huge corpus. This word2vec code turned out to be really efficient, and that's also why it got so popular. They did the same thing for phrases, like "New York Times" and so on, but this was more of a side aspect; the phrase vectors turned out to be rather a side thing of the actual paper.

Conclusion & Comments

As I said, this paper is very different from other research papers, in that it's sort of half an engineering paper, and all of these papers are kind of hard to read, because they just state some things, and the order, and sometimes the reasons why they do things, are a bit odd. But you can't deny that it had quite the effect on the community. It is a very cool paper, a very cool series of papers, and it's very cool that they actually released the code, and that they made the code so super-duper efficient, even on a single machine. That was great, because, being Google, they could have just released code that is only efficient on a distributed data center, and they didn't do that. It's not really like today anymore: today, when they release code, you always need like 50 cloud TPUs to run it, and it's still cool that they release code, but this was really a step towards democratizing AI. So that was my rant about word2vec. I hope you enjoyed this, and I hope it was still useful to you, even though most of you probably already knew word2vec. I'll see you next time, bye!
