# Breaking Contrastive Models with the SET Card Game | Legg Yeung | OpenAI Scholars Demo Day 2021

## Metadata

- **Channel:** OpenAI
- **YouTube:** https://www.youtube.com/watch?v=AKg0tzunYP0
- **Date:** May 10, 2021
- **Duration:** 15:45
- **Views:** 2,397
- **Source:** https://ekstraktznaniy.ru/video/11587

## Description

Learn more: https://openai.com/blog/openai-scholars-2021-final-projects#legg

## Transcript

### Intro [0:00]

Hi, my name is Legg. I'll be talking about breaking contrastive models with the SET card game. This project is motivated by a particular failure mode of CLIP. Consider this text prompt, which asks for a red cube on top of a green cube on top of a blue cube: the image forms a natural pairing with the text.

### CLIP failure modes [0:22]

If I shuffle the colors in both the image and the text, I get another pairing. If I continue this procedure to generate all six pairings of texts and images and feed them into CLIP, I'd expect CLIP to predict a six-by-six logits matrix with a strong diagonal. But the predicted logits are actually quite even. It seems that when there are multiple entities, relations, and attributes, CLIP struggles.
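For concreteness, here is a minimal sketch of how one might reproduce this six-by-six experiment with the open-source `clip` package (https://github.com/openai/CLIP); the image filenames and exact prompt wording are placeholders, not the assets from the talk.

```python
import itertools

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Six permutations of (top, middle, bottom) cube colors.
perms = list(itertools.permutations(["red", "green", "blue"]))
prompts = [f"a {a} cube on top of a {b} cube on top of a {c} cube"
           for a, b, c in perms]
# Hypothetical filenames: one rendered image per color permutation.
images = [preprocess(Image.open(f"cubes_{'_'.join(p)}.png")) for p in perms]

with torch.no_grad():
    logits_per_image, _ = model(torch.stack(images).to(device),
                                clip.tokenize(prompts).to(device))

# With correct attribute binding this 6x6 matrix would have a strong
# diagonal; in practice the logits come out quite even.
print(logits_per_image.softmax(dim=-1).cpu().numpy().round(3))
```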

### Dot-Product Retrieval in Contrastive Models [0:50]

My intuition for why this happens has to do with the dot-product retrieval layer between the query and key representation vectors in standard contrastive models. We can view this layer as performing linear classification, in which a linear boundary q is encoded to separate the key data points.
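Restated in standard notation (mine, not the talk's slides): the query vector doubles as the normal of a separating hyperplane over the keys.

```latex
% Dot-product retrieval viewed as linear classification over the keys.
% The threshold tau is an illustrative addition.
\[
  \operatorname{score}(q, k_i) = q^{\top} k_i ,
  \qquad
  \hat{y}_i = \operatorname{sign}\!\bigl( q^{\top} k_i - \tau \bigr),
\]
\[
  \text{so } q \text{ defines the linear boundary }
  \{\, x \in \mathbb{R}^{d} : q^{\top} x = \tau \,\}
  \text{ separating positive keys from negative keys.}
\]
```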

### Intuition 1. Vapnik-Chervonenkis (VC) Dimension [1:10]

Under this perspective, the VC dimension limitation applies. For example, we have six keys, and one query may match with three of them as positives. But there are actually 2^6 ways to assign some keys as positive and others as negative, and we can ask whether our hypothesis class can throw all 2^6 such queries at the keys: this is the idea of shattering. In general, if our queries are limited to vector dimension d, VC theory states that they cannot match with all possible subsets of d+1 keys. In datasets like ImageNet there are only a thousand labels; in the contrastive model setup this corresponds to a thousand static queries, for example "match all the dogs", so model performance is far from hitting the limits imposed by the vectors' VC dimension. But on datasets like CLEVR, many dynamic queries can be formed, such as identifying all the scenes with one or more blue objects. Here we need to subset the scenes in many dynamic ways, so we may hit the limit imposed by the vectors' VC dimension. I also have a secondary intuition related to the poor approximation of full-rank query-key matrices using vectors that are shorter than the rank; feel free to ask me about it.
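As a quick numeric check of the shattering claim (my sketch, not code from the talk), the snippet below enumerates all 2^(d+1) labelings of d+1 random keys and asks a linear program whether some query q realizes each labeling through the sign of its dot product with the keys; with d-dimensional queries, at least one labeling of d+1 keys is always unrealizable.

```python
import itertools

import numpy as np
from scipy.optimize import linprog

def separable(K, y):
    """Is there a query q with y_i * (k_i . q) >= 1 for all keys?
    A pure feasibility LP: the objective is a constant zero."""
    n, d = K.shape
    A_ub = -(y[:, None] * K)                      # -(y_i k_i) . q <= -1
    res = linprog(c=np.zeros(d), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * d, method="highs")
    return res.success

rng = np.random.default_rng(0)
d = 4
K = rng.normal(size=(d + 1, d))                   # d+1 keys in R^d
labelings = itertools.product([-1.0, 1.0], repeat=d + 1)
shattered = all(separable(K, np.array(y)) for y in labelings)
print(f"{d + 1} keys shattered by {d}-dim queries: {shattered}")  # False
```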

### Model Architecture [2:34]

To validate my intuitions, I set up two architectures. One is a contrastive model, in which I encode queries and keys separately and score their compatibility using the dot-product retrieval layer. The other is a non-contrastive model, which scores each query-key pair as one continuous sequence. The contrastive model uses an eight-layer transformer to encode the query symbols and an embedding lookup to encode the key symbols, while the non-contrastive model uses a four-layer transformer to encode the concatenated query-key symbols. The contrastive model has 17 million parameters, while the non-contrastive model has only half of that. My experimental goal is to show that contrastive models with limited vector representation dimensions are worse than non-contrastive models that use half the parameters.

Task-wise, I borrowed the well-known SET card game. It suits my purpose because of some nicely extendable properties: each card has multiple attributes and multiple possible values per attribute, and these dimensions are scalable. Any pair of cards forms a query that evaluates to a key card, because in a complete set of three cards, the cards must be either all the same or all different on every attribute. This query-key matrix has a regular pattern, but it is still full rank.
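A minimal PyTorch sketch of how these two architectures might be wired up; the pooling, widths, and vocabulary handling are my assumptions, not the talk's exact hyperparameters.

```python
import torch
import torch.nn as nn

class ContrastiveModel(nn.Module):
    """Encodes queries and keys separately; compatibility is a dot product
    of d_repr-dimensional vectors (the dimension varied in the experiments)."""
    def __init__(self, vocab, d_model=256, d_repr=64, n_layers=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.to_query = nn.Linear(d_model, d_repr)    # query representation
        self.key_table = nn.Embedding(vocab, d_repr)  # embedding lookup for keys

    def forward(self, query_tokens, key_ids):
        h = self.encoder(self.embed(query_tokens))    # (B, L, d_model)
        q = self.to_query(h.mean(dim=1))              # (B, d_repr)
        k = self.key_table(key_ids)                   # (N, d_repr)
        return q @ k.t()                              # (B, N) dot-product logits

class NonContrastiveModel(nn.Module):
    """Scores each query-key pair as one concatenated sequence."""
    def __init__(self, vocab, d_model=256, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.score = nn.Linear(d_model, 1)

    def forward(self, query_tokens, key_tokens):
        seq = torch.cat([query_tokens, key_tokens], dim=1)
        h = self.encoder(self.embed(seq))             # (B, L_q + L_k, d_model)
        return self.score(h.mean(dim=1)).squeeze(-1)  # one scalar per pair
```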

### The SET card game - Extensions [4:02]

To extend this game, I introduced the star regular expression, such that queries can evaluate to subsets of key cards. In addition, I introduced set union and symmetric difference operators to make the queries and the returned subsets even more dynamic. I'm going to show an example game and how the models would play it.

### The Extended SET card game - How to play [4:20]

Here's a deck of 27 cards, each with 3 attributes and 3 possible values per attribute. Each query has eight pairs of cards with set union operators between them. Depending on the query, the matching subset of keys and its size vary: this particular query returns 13 matching positive keys, while a different query may return a smaller or larger subset of keys, such as 3 or 21.

A training example is a tuple of query symbols and a key symbol sampled from the matching positive keys; we can sample different queries and their keys this way to make a batch. The contrastive model uses the InfoNCE training objective, which normalizes the dot product over all the query-key pairings in a batch-size-by-batch-size query-key matrix, so the scores are penalized across both columns and rows. The non-contrastive model uses a conventional cross-entropy objective, in which the scores are penalized across the keys in the support.

At test time, the models are given queries such as this one, and they score the compatibility of the query with each of the 27 keys. There are 13 ground truth keys for this query, so one crude metric is to measure how much the top 13 predictions overlap with the 13 ground truth keys; for now I call this metric the top-k prediction ground truth overlap. A more sensitive metric is the KL divergence between the normalized predicted scores and a ground truth distribution constructed by dividing one evenly among all the ground truth keys.
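Below are hedged sketches of the two training objectives and the two evaluation metrics just described; the tensor shapes, names, and helper structure are my assumptions, not the project's code.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(logits):
    """Symmetric InfoNCE over a (batch x batch) query-key logit matrix:
    the diagonal holds the sampled positive pairs, and scores are
    penalized across both rows and columns."""
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def kl_to_uniform_positives(logits, positive_mask):
    """KL(ground truth || prediction) for one query over all 27 keys,
    where the ground truth spreads mass 1 evenly over the positive keys."""
    truth = positive_mask.float() / positive_mask.sum()
    return F.kl_div(F.log_softmax(logits, dim=-1), truth, reduction="sum")

def top_k_overlap(logits, positive_mask):
    """Fraction of the top-|positives| predictions that are positives."""
    k = int(positive_mask.sum())
    top = logits.topk(k).indices
    return positive_mask[top].float().mean().item()
```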

### Putting it all together [6:09]

Putting all of this together, I have seven contrastive models with vector representation dimensions from 512 down to 4, all with 17 million parameters, and a non-contrastive model at half that size. I have five different games based on a 27-card deck, from the SET game in its original form through the aforementioned extensions to a final game of shattering 27 numbers. They have increasing levels of difficulty, characterized by the total number of unique key subsets the queries should separate. We tried each game on all the models.

### The SET card game - 27 cards results [6:48]

Here's a results plot. On the y-axis we have KL divergence loss; on the x-axis we have those five games, ordered from left to right by increasing difficulty, measured by the total number of key subsets the queries should separate, in powers of two. The first game is the original SET; it requires the queries to separate 2^4.57 subsets of keys, and all the models did very well. The second game includes the star regular expression and requires the queries to separate 2^6 subsets of keys, so the contrastive model with vector dimension 4 starts to do poorly. As we introduce the set union and symmetric difference operators, the models need to separate between 2^21 and 2^23 subsets of keys, so the contrastive models with vector dimensions 8 to 20 also do worse and worse. Finally, on shattering 27 numbers, the model needs to satisfy a VC dimension of at least 27, so all the models with vector dimensions less than 27 fail to varying extents. Notice that throughout these games, the contrastive models with vector dimensions 512 and 27 and the non-contrastive model with half the parameters performed consistently well. The top-k overlap metric shows a similar, agreeing trend; here higher is better, so the plot looks like the KL loss plot flipped upside down.

To understand why the models with shorter vector representations are worse, I zoomed into one game for some analysis. This is a representative query from the erroneous predictions: it matches with 13 ground truth positive keys and 14 ground truth negative keys. The perfect model normalizes the dot products, distributes the probability mass evenly among the positive keys, and puts nothing on the negative keys. The contrastive models with vector representation dimensions 512 and 27 are not far from perfect, but if we drop the dimension to 20 or below, we start to see mass moving to the negative keys, until at dimension 4 the model performs about the same as a completely even distribution. It seems that as we drop the vector representation dimension, the model becomes less able to develop a preference among the keys.

Using the entropy of predictions as a measure of this phenomenon, what we saw in the previous example generalizes to all the queries that match with 10 to 20 ground truth positive key cards: the entropy of predictions increases as the vector dimension decreases. Notice that queries in this game return up to 2^21 subsets of keys, and starting at dimension 20, entropy grows more dramatically. This trend is also reflected in the aggregate of all queries in the test sets: these entropies increase monotonically as the vector dimension decreases.
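The entropy measure used in this analysis is simple to compute from the scores; a minimal sketch, assuming `logits` holds one predicted score per key for a single query.

```python
import torch
import torch.nn.functional as F

def prediction_entropy(logits):
    """Entropy (in bits) of the normalized scores over the keys.
    A completely even distribution over 27 keys gives log2(27),
    about 4.75 bits; a model with a clear preference scores much lower."""
    p = F.softmax(logits, dim=-1)
    return -(p * torch.log2(p.clamp_min(1e-12))).sum(dim=-1)
```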

### The SET card game - 81 cards results [10:05]

In addition to this 27-card version, I also trained an 81-card version of this particular game. The setup is similar: the contrastive models range from vector representation dimension 512 down to 16, all at around 17 million parameters, and the non-contrastive model is half that size. Here are the plots. The trends are similar: KL loss decreases as the vector dimension increases, and the top-k overlap metric increases with the vector dimension. Notice that the game has 81 cards, so in theory, with a vector dimension of 81 or above, we should be able to solve this game perfectly, and we do see some agreeing trends.

### Training [10:53]

Here are a few things I learned from training these models. More difficult games require more gentle learning rate schedules. Cosine schedules help to unstick models from local minima in some cases. Efficiency of convergence depends on initialization schemes. Using dot products instead of cosine similarity with temperature works better for this data. Overall, I think these observations are more specific to the nature of this synthetic game than to contrastive models in general.
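For concreteness, the two scoring variants compared here look roughly like this; the temperature value is a common default, not necessarily the one tried in the project.

```python
import torch.nn.functional as F

def dot_scores(q, k):
    """Raw dot products, which worked better on this data."""
    return q @ k.t()

def cosine_scores(q, k, temperature=0.07):
    """CLIP-style alternative: L2-normalized vectors with a temperature."""
    return (F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).t()) / temperature
```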

### Adjacent Topics [11:27]

Besides what I just told you, I also worked on some adjacent topics during the past six months. They include variable binding, for which I did some literature review on tensor products and a study of DALL·E and CLIP failure modes. Then I looked at pointwise mutual information for classification, specifically comparing the InfoNCE versus the cross-entropy objective for zero-shot classification on toy problems. And I looked at various versions of game rules and operators and observed that the rank of their query-key matrices tracks these changes. Finally, I want to thank you for listening, and to thank my mentor Gabe for giving me a lot of valuable guidance and feedback over these six months. I'll be updating my blog over the weekend with the final blog post, so feel free to check it out next week, and feel free to contact me with further questions. Now I'll move over to the Q&A. There is a question on the intuition behind entropy going up as the embedding dimension goes down, so let me go to the corresponding slide.

### Error Analysis (27 cards) [13:06]

From the error analysis, I saw that as the vector dimension goes down, the vectors themselves become less able to encode enough signal to separate the different keys. I would be curious to do more investigation on that front, but in general, the observation is that as the vector dimension goes down, the models become less and less able to develop a preference among the keys. There's also a question on the intuition about rank, so let me go to the corresponding slide.

### Intuition 2. Poor Rank Approximation [13:55]

The intuition is that our query-key matrix is full rank. If I have fewer keys than queries, this full-rank matrix has rank equal to the support of the keys, say 27. If my vector representation size is less than 27, then taking dot products between my queries and keys is mathematically equivalent to multiplying a key matrix of size (support of keys) × (something less than 27) with a query matrix of size (something less than 27) × (support of queries). These vectors are shorter than the full rank of 27, so when we multiply the two matrices, the resulting matrix M' is not going to be full rank: it has at most the rank of whatever the vector representation dimension is. There was also a question about whether the non-contrastive... oh, I see that my time is up, so I'm going to hand the time over to Sam, look at the third Q&A question offline, and make a reply there.
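A small numeric illustration of this rank argument (my sketch): factoring the scores through d-dimensional query and key vectors caps the rank of the resulting score matrix at d, so a full-rank 27-key matrix cannot be reproduced when d < 27.

```python
import numpy as np

rng = np.random.default_rng(0)
n_keys, n_queries, d = 27, 100, 20   # vectors shorter than the full rank 27

K = rng.normal(size=(n_keys, d))     # key matrix: 27 x d
Q = rng.normal(size=(n_queries, d))  # query matrix: 100 x d
M_prime = Q @ K.T                    # dot-product score matrix M'

print(np.linalg.matrix_rank(M_prime))  # at most d = 20, never 27
```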
