Contrastive Language Encoding | Ellie Kitanidis | OpenAI Scholars Demo Day 2021
OpenAI · 10.05.2021
Video description
Learn more: https://openai.com/blog/openai-scholars-2021-final-projects#ellie

Contents (12 segments)

Intro

Awesome. Hi everyone, my name is Ellie, and for my project I've been working on pre-training a language representation model using contrastive learning.

Outline

I'll start by giving some background on what contrastive learning is and putting this work into context. Then I'll spend some time going over our framework and procedure, and finally I'll share some preliminary results and discuss future steps.

Contrastive learning

Contrastive learning is a self-supervised method for learning representations. Like a lot of ideas in deep learning, contrastive loss is not really new, but it has recently become a very popular and active area of research in computer vision because of its success at learning visual representations. The concept is pretty intuitive: similar inputs should be mapped closer together in representation space, and some contrastive frameworks also explicitly structure their loss functions so that dissimilar inputs are mapped farther apart. The high-level conceptual diagram on this slide (which shouldn't be taken literally) illustrates the idea: two representations of the same fundamental concept, like two images of the same person, should be mapped close together, while two distinct concepts, like two different people, should be mapped farther apart. In practice, an input image x is usually augmented to create two different versions, x1 and x2; the augmentation is typically some combination of random cropping, reorientations, and other types of noise injection. The two versions x1 and x2 then form a positive pair, and the loss function seeks to maximize the similarity between them.
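To make this concrete, here is a minimal sketch of one common formulation of this idea, an InfoNCE-style loss with in-batch negatives, written in PyTorch. This is illustrative background rather than the loss used in this project; the in-batch-negatives setup and the temperature value are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    """InfoNCE-style contrastive loss with in-batch negatives.

    z1, z2: (batch, dim) embeddings of the two augmented views x1, x2.
    Row i of z1 and row i of z2 form a positive pair; every other row
    in the batch serves as a negative for row i.
    """
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / temperature                      # (batch, batch) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)  # positives sit on the diagonal
    return F.cross_entropy(logits, targets)               # pull positives together, push negatives apart
```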

Relevant literature

While contrastive learning has become a booming area of research in computer vision, its applications to natural language have so far been pretty limited. I've listed here, under NLP, basically all of the relevant papers I came across, most of which hit arXiv just within the last few months. Because I'm limited to 10 minutes, I won't go through each of these individually, but I'm putting them up on the slide so that you can look them up and read about them if you're interested. What I will say is that in basically all of these prior works, either they're not pre-training from scratch (they're fine-tuning an existing pre-trained model), or they're using contrastive learning as an auxiliary objective alongside a more traditional language modeling objective. So we became interested in whether you could pre-train a language model from scratch using just contrastive learning by itself, in the interest of looking for new language modeling approaches that might have new benefits, either in performance across some subset of tasks, or in scaling, or both. I'm specifically motivated by the search for better sentence-level representations in language modeling.

Model architecture

For our contrastive setup, we re-implemented a recent computer vision paper and matched its reported performance. In this so-called SimSiam framework, no negative pairs are used, just positive pairs x1 and x2. The augmented data pairs are fed into an encoder, which produces the representations; those are then fed into a projector and a predictor that are trained simultaneously with the encoder. The loss function in this case is pretty simple: it just tries to maximize the cosine similarity between the positive pairs. The innovation of this particular framework is the addition of the predictor and the stop-gradient, such that the cosine similarity is actually computed between p1, the output of the predictor on x1, and z2, the output of the projector on x2, with back-propagation only calculated through the branch with the predictor (sketched in code below). In SimSiam this was shown to help avoid mode collapse, where the model just learns to map everything to the same place in representation space, since we don't have negative pairs to balance out that attractive force.

I should mention that for our encoder we basically just use a Transformer encoder, with a few tweaks from OpenAI's Sparse Transformer implementation, namely the modified residual block and some special initialization schemes and activation functions.

Our model is pre-trained on the Pile, a recent large language dataset that captures a diverse range of modalities: everything from books, to code from GitHub repositories, to web pages, medical papers, and so on.

Now, a big question, possibly the biggest question, in applying contrastive learning to language is: what should our augmentation method be? Most, if not all, of the papers I cited earlier use some form of noise injection, such as randomly deleting or replacing words, rearranging the order of words, and so on. However, we felt that, perhaps more so than in the case of images, injecting noise into language does not necessarily create invariant representations. If I make random changes to the structure and content of my sentence, I'm not really saying the same sentence, or, generally, anything coherent at all. So we thought a more principled approach would be to choose pairs of text that are near each other in a document and thus likely to have related content and style. Once again, due to the 10-minute limit, I won't go into the technical details of our data processing pipeline, but I do have a few notes on the slide for reference if anyone wants to dive in deeper later.

Now we come to the preliminary results. I'll start by saying that when I did my PhD, the physics department had an unusual way of doing things: instead of doing our dissertation defense at the very end of the PhD, we did it about halfway through, the reasoning being that they wanted us to get a lot of expert feedback early on. The reason I'm telling you this is that part of me really wishes I could give you an update on this project in a few weeks, when we have more conclusive results to share. On the other hand, I think it's actually very valuable that I get to show it to you now and solicit your feedback and ideas.
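For reference, here is a minimal PyTorch-style sketch of the SimSiam objective described above, with the stop-gradient on the projector branch. The module names (`encoder`, `projector`, `predictor`) are placeholders for illustration, not the project's actual code.

```python
import torch.nn.functional as F

def simsiam_loss(p1, p2, z1, z2):
    """Symmetrized negative cosine similarity, as in SimSiam.
    p1, p2 are predictor outputs; z1, z2 are projector outputs."""
    # .detach() is the stop-gradient: gradients flow only through the
    # predictor branch, which SimSiam found helps avoid the mode
    # collapse that positive-only training can otherwise cause.
    return -(F.cosine_similarity(p1, z2.detach(), dim=-1).mean() +
             F.cosine_similarity(p2, z1.detach(), dim=-1).mean()) / 2

def training_step(encoder, projector, predictor, x1, x2):
    # x1, x2: the two "views" of one example -- here, two nearby
    # chunks of text rather than two augmented copies of an image.
    z1, z2 = projector(encoder(x1)), projector(encoder(x2))
    p1, p2 = predictor(z1), predictor(z2)
    return simsiam_loss(p1, p2, z1, z2)
```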

Pre-training difficulties

The one thing I can state with a lot of confidence at this point is that training these models is really hard. So far we're finding these spiky plateaus in the loss function after the first few thousand iterations, pretty much no matter what we do. I've plotted a few examples here, but they all look like this. So far, hyperparameter sweeps are not giving us any insight into the problem. I will say that I have not exhaustively searched the parameter space by any means, since I've only been working on training this for a couple of weeks, so there could still be some insight to find there, but so far we haven't seen anything. Increasing dropout and/or weight decay also hasn't seemed to help. We also experimented with adding a tanh before or after the average pool at the end of our encoder; that did not seem to particularly help either.

Contrastive vs. Masked Language Modeling

We wanted to test whether this training difficulty is related to the contrastive loss, so we tried training the model with an MLM-like objective instead. For anyone not in the language modeling world, that stands for masked language modeling. Basically, we replace 10% of the tokens with random tokens, take our same encoder, and feed its output into a linear and sigmoid layer instead, so now the task is to predict the replacement mask. This is slightly different from the BERT objective of trying to predict the masked tokens: since we don't have a decoder or anything generative, we're just trying to predict the mask itself. When we train this model, we get much more nicely behaved loss curves, so that does seem to indicate that it is the contrastive loss that's causing the problems.
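A rough sketch of this replaced-token objective as described, assuming a per-position linear head on the encoder output; the 10% rate is the figure from the talk, while the names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def corrupt(tokens, vocab_size, replace_prob=0.10):
    """Replace ~10% of token ids with random ids; return the corrupted
    sequence plus the binary mask of which positions were replaced."""
    mask = torch.rand(tokens.shape, device=tokens.device) < replace_prob
    random_ids = torch.randint(vocab_size, tokens.shape, device=tokens.device)
    return torch.where(mask, random_ids, tokens), mask.float()

def replaced_token_loss(encoder, head, tokens, vocab_size):
    # head: a linear layer mapping each hidden state to a single logit.
    # The task is to predict the replacement mask itself, not to
    # reconstruct the original tokens, so no decoder is needed.
    corrupted, mask = corrupt(tokens, vocab_size)
    logits = head(encoder(corrupted)).squeeze(-1)  # (batch, seq_len)
    return F.binary_cross_entropy_with_logits(logits, mask)
```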

Query-key test

As another probe, we decided to see how well the model can match queries to relevant keys. We randomly selected a thousand text pairs from the data, with the same selection parameters as were used in pre-training, and let these be query-key pairs. We then created a cosine similarity matrix of encoded queries dotted with encoded keys and calculated the percentage of queries for which the corresponding key was among the top k most similar. For example, out of a thousand query-key pairs, if the model were totally random, I would expect the matching key to be in the query's top five most similar 0.5% of the time; instead, we consistently see about 20%, and likewise for the top 100 we see 60%. What's interesting is that doing the same test on the MLM version of the model, we actually get smaller values, 5% and 25% respectively, so perhaps the contrastive version of the model is better at this type of task. Of course, this could very well wash out at larger scales and with more training; these models were only trained for roughly the first 2,000 iterations because of the noisy plateau issues. So this is far from a conclusive statement, more of an interesting preliminary finding.
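This probe is straightforward to reproduce; here is a sketch, assuming the encoder outputs one vector per text.

```python
import torch
import torch.nn.functional as F

def topk_match_rate(queries, keys, k):
    """Fraction of queries whose true key (the one at the same index)
    lands in the top-k most cosine-similar keys.
    Random baseline is k / num_keys, e.g. 5/1000 = 0.5% for k=5."""
    q = F.normalize(queries, dim=-1)
    kk = F.normalize(keys, dim=-1)
    sims = q @ kk.T                                # (n, n) cosine similarity matrix
    topk = sims.topk(k, dim=-1).indices            # (n, k) indices of most similar keys
    truth = torch.arange(q.size(0)).unsqueeze(-1)  # true key index for each query
    return (topk == truth).any(dim=-1).float().mean().item()
```

With 1,000 pairs, the numbers above correspond to `topk_match_rate(queries, keys, 5)` of roughly 0.20 for the contrastive model against the 0.005 random baseline.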

Evaluation on 2 GLUE tasks

Finally, we do have evaluation set up for two downstream tasks from the GLUE benchmark set. Of course, at this stage of training, the results and the configurations used to get them are far from final; we include them more to give a sense of where the performance gap currently is. The two tasks we evaluate on are RTE and MNLI. These are both textual entailment classification tasks: in other words, for a given premise and hypothesis, does the premise imply that the hypothesis is true? I'm going a little over on time, so I'll skip the details of implementation and just talk quickly about this table, which shows the maximum validation accuracy for these two tasks. They're both classification tasks; RTE has two classes and MNLI has three. The third row is the result of using a pre-trained BERT model that I took off Hugging Face, so the values aren't exactly the values from the BERT paper, but they're the values I got in my fine-tuning pipeline, and they're very close to the values from the paper. We see that for our model, in both versions, there's about a five percent gap on RTE; the two versions perform pretty similarly to each other, but definitely not as well as the pre-trained BERT. For MNLI, which is a much larger dataset and therefore much slower, I found that our model was very slow to converge, so I don't have a final value, nor do I think it's particularly helpful to try to get one at this stage. The training does start with a validation accuracy of about 60%, so the model is doing something, but the slowness to converge does indicate that it's doing a lot of training from scratch, which is unfortunate. The last thing I'll say is that this is obviously not an apples-to-apples comparison, not only because we cut the pre-training so prematurely, but also because our model is smaller than BERT-base. Still, I think it's helpful to see that five percent gap and whether it can be closed by improving those parameters. Okay, very quickly, next steps.
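For reference, here is a hypothetical version of a fine-tuning head for these tasks, not the speaker's actual pipeline: pool the encoder's output over the tokenized (premise, hypothesis) pair and classify, with 2 classes for RTE and 3 for MNLI. All names and the pooling choice are assumptions of the sketch.

```python
import torch.nn as nn

class EntailmentClassifier(nn.Module):
    """Hypothetical fine-tuning head for RTE (2 classes) / MNLI (3 classes):
    mean-pool the encoder's hidden states and apply a linear classifier."""
    def __init__(self, encoder, hidden_dim, num_classes):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        hidden = self.encoder(token_ids)  # (batch, seq_len, hidden_dim)
        pooled = hidden.mean(dim=1)       # average pool, matching pre-training
        return self.head(pooled)
```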

Immediate next steps

Obviously, our immediate next steps are to try to improve the training: do a lot more experiments with learning rate and augmentation parameters, and maybe expand our scope a bit to include negative pairs and different types of loss functions, potentially also exploring different augmentation methods that might work better.

Future directions

More broadly, if we were to get this type of model to work, we would obviously be interested in probing its scaling behavior, and also its performance on a wider variety of tasks. I've gone a little over, but thank you all so much for listening, and of course thank you to my mentor, the program coordinators, and my fellow scholars; this has been an amazing experience. I'll take a couple of questions and then pass things along.

The first question is: "These results make me curious about the input pairs you selected. Could you talk a little more about that?" Yes, let me briefly show you a slide that I had skipped.

Data input/output

We started out thinking about how to do this at the sentence level, so we began with sentence pairs, allowing them to be separated by some distance that was a hyperparameter. We actually made it very open-ended by setting a minimum and a maximum separation, such that the separation could be random within that interval. Eventually we moved toward larger pieces of text, mostly because the evals all involve more than one sentence, so it made sense to take on that level of abstraction. We now choose chunks of text of some number of sentences, which is also a hyperparameter, and we've explored those three hyperparameters as well. We then tokenize using a subword tokenization and truncate or pad to the maximum sequence length (there is a sketch of this sampling scheme below).

Okay, I'll take one more question: "Do you plan to try out contrastive losses with negative pairs, such as NCE, to compare with SimSiam/BYOL loss performance?" Yes. As I was saying, so far these preliminary results do seem to indicate that the SimSiam framework might not work so well for a messier, more complex domain like language, so we would definitely be interested in trying more complex loss functions that do include the negative terms; something like NCE would definitely be something we want to look at. Okay, I've gone a little over, but thank you all so much for the questions. If I didn't get to your question, please look me up on LinkedIn, or on Slack if you're at OpenAI; I would love to keep chatting about it offline. I will now pass it on to Kudzo.
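As referenced above, here is a minimal sketch of the pair-selection scheme as described, with the three hyperparameters mentioned (chunk length, minimum and maximum separation); the function name and default values are illustrative assumptions.

```python
import random

def sample_text_pair(sentences, chunk_len=4, min_sep=0, max_sep=3):
    """Draw a positive pair: two chunks of `chunk_len` consecutive
    sentences from the same document, separated by a random gap of
    between `min_sep` and `max_sep` sentences."""
    sep = random.randint(min_sep, max_sep)
    span = 2 * chunk_len + sep
    if len(sentences) < span:
        return None  # document too short for this draw
    start = random.randint(0, len(sentences) - span)
    first = " ".join(sentences[start : start + chunk_len])
    second = " ".join(sentences[start + chunk_len + sep : start + span])
    return first, second
```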
