# Language Models are Open Knowledge Graphs (Paper Explained)

## Metadata

- **Channel:** Yannic Kilcher
- **YouTube:** https://www.youtube.com/watch?v=NAJOZTNkhlI
- **Date:** 02.11.2020
- **Duration:** 52:16
- **Views:** 37,088
- **Source:** https://ekstraktznaniy.ru/video/13283

## Description

#ai #research #nlp

Knowledge Graphs are structured databases that capture real-world entities and their relations to each other. KGs are usually built by human experts, which costs considerable amounts of time and money. This paper hypothesizes that language models, which have increased their performance dramatically in the last few years, contain enough knowledge to use them to construct a knowledge graph from a given corpus, without any fine-tuning of the language model itself. The resulting system can uncover new, unknown relations and outperforms all baselines in automated KG construction, even trained ones!

OUTLINE:
0:00 - Intro & Overview
1:40 - TabNine Promotion
4:20 - Title Misnomer
6:45 - From Corpus To Knowledge Graph
13:40 - Paper Contributions
15:50 - Candidate Fact Finding Algorithm
25:50 - Causal Attention Confusion
31:25 - More Constraints
35:00 - Mapping Facts To Schemas
38:40 - Example Constructed Knowledge Graph
40:10 - Experimental Results
47:25 - Example Discovered

## Transcript

### Intro & Overview [0:00]

Hi there. Today we'll look at "Language Models are Open Knowledge Graphs" by Chenguang Wang, Xiao Liu, and Dawn Song. On a high level, this paper proposes to construct knowledge graphs, which are structured objects usually built by human experts, either fully manually or semi-manually with heavy human involvement. It proposes to construct them automatically, by simply using a pre-trained language model together with a corpus to extract the knowledge graph from. The cool thing about this paper is that there is no training involved: there is no model that learns how to construct a knowledge graph. The entire knowledge is simply extracted by running the corpus once, one forward pass of the corpus through the pre-trained language model, and that constructs the knowledge graph. That's the core message of this paper. They say: this paper shows how to construct knowledge graphs from pre-trained language models, without human supervision. And it turns out the way they do it works pretty well on standard knowledge graph construction benchmarks. So that's the paper in a nutshell. We'll go through all of it, including a bunch of criticisms I have, but remember that it is a pre-print. Usually at this point I'd say: if you like this content, don't hesitate to share it out, and so on. But today we're going to try something different, in three, two, one...

### TabNine Promotion [1:40]

Stop! It's sponsor time. This video is sponsored by TabNine. TabNine uses deep learning to help you write code faster. What could possibly go wrong if you do that? No, I'm joking. Take a look at this piece of code here: I was trying to refresh some Elastic indices, and as you can see, all I typed was "could", and TabNine completes it to "could not refresh", because above I was trying to call a refresh method. This is something I haven't seen any other completion engine do yet. Compared to a regular completion engine, TabNine is trained on lots of open-source projects, and it combines this with your code to predict what you want to do, as opposed to predicting what's possible, which is what a classic engine does. TabNine uses a GPT-based model, and it downloads that model onto your machine, so the code never leaves your machine. There is an opt-in feature where you can run it in the cloud; that gives you a bit of a better beam search and better-quality predictions, and it saves you a bit of RAM. I myself use TabNine: I just have it on by default, and I'm pretty happy with it. I use it through CoC, integrated into my Neovim, but you can also get it in Sublime, Atom, IntelliJ, VS Code, even Jupyter notebooks, and you can use it together with a classic completion engine, so you really get the best of both worlds. Whenever you see me code in a coding video, look out for the TN marker next to the completions: those are the completions by TabNine.
It doesn't only work for Python; it actually works for pretty much any programming language that isn't completely obscure. If you go to this link within 72 hours of when this video is released, you'll get three months of TabNine Professional for free. The professional version removes the project-size limit of the free version, and it also gives you access to that sweet cloud inference. After the three months, you're automatically kicked out of the pro version; there's no auto sign-up, there's really nothing to lose. I mean, the only bad thing here is that TabNine itself is written in Rust; if that's the worst thing about an offer, it's a pretty good deal. Again, I use this myself and I'm pretty happy with it. So if you sign up at tabnine.com slash promotion slash Yannic Kilcher within 72 hours of when this video is released, you'll get a free three months of TabNine Pro, no strings attached. And now, enjoy the video. Thanks! Alright, I hope that was fun.

### Title Misnomer [4:20]

Let's get back to the paper; let's get into the paper. So first of all, what is my first criticism of this paper? The title. There are some disturbing trends in the last few years in machine learning papers, and one of them can maybe be encapsulated by the phrase "is all you need". Since "Attention Is All You Need", people have discovered that if they just append this to whatever their paper is about, the paper will get much more notoriety. And the same thing, I think, is a bit at play here with the "are", because in recent times we've seen a bunch of papers that show equivalences between models; a famous example is that transformers are Hopfield networks, in some regard. These papers are pretty cool, right? Even if the two things are not exactly equal all the time, if you can say, look, there is a setting, under these assumptions, in this situation, these two models actually are the same, that's a pretty cool recognition, a pretty cool thing to show, and it's very useful for academia and practice, I believe. However, I believe the "are" keyword, the "is" keyword, should be reserved for when two things are equivalent. Whereas here, in the very first sentence (at least they're honest), they say: we show how to construct knowledge graphs from pre-trained language models. So essentially they're going to use a language model to approximately construct a knowledge graph, and they're also going to use a bunch of other auxiliary models that all come pre-trained. But still, they do not show an equivalence of language models and knowledge graphs in this paper, not at all. So, I see that you can get somewhere with these titles, but maybe people will be disappointed if they read the paper, which actually is a cool paper, believe me. Alright.

### From Corpus To Knowledge Graph [6:45]

So as I said, what we usually have is a corpus. A corpus is simply a bunch of text pieces; you can think of just the text in Wikipedia. Here, from the Wikipedia page about Bob Dylan: "is a songwriter", "was awarded a Nobel Prize", "signed Albert Grossman". These are easy sentences, right? Sentences are usually larger and longer, and so on. And what you want to do is extract a knowledge graph. The knowledge graph has two distinct things. It has entities: "Bob Dylan" is an entity, "songwriter" is an entity, "Nobel Prize" is an entity; you can sort of think of them as nouns. And the second part of knowledge graphs are the relations: here "occupation", "sign", "award received", and so on. The relations connect two entities. There is always what's called a head of a triple, a head of a fact, which in this case is Bob Dylan three times; then there is a tail, which is sort of like the object of the verb; and then there is the relation, which is described by the verb. Now, there are two stages of constructing such a knowledge graph, and any system that does this probably goes through both. First, you extract a set of candidates. This is not the knowledge graph yet, because these are still strings: you extract a bunch of string triples, as you can see here, and as the sentences get more complicated, it gets more and more difficult to extract these triples. The second part is that you need to map the candidates to a schema, and these schemas are usually defined by humans. So here we're still going to rely on humans to define the schema: there is one list of entities, listed by the humans, and at some point it says "Bob Dylan", with a bunch of mentions of Bob Dylan associated with it and a clear ID; in this case the ID is Q392 in that knowledge graph. And the system not only needs to extract these facts, but then also map them to the correct schema entries.
This second stage is a bunch of standard tasks. Mapping something like the word "Dylan" in its context to the entity Bob Dylan, which you can think of as the Wikipedia page of Bob Dylan, is a task called entity linking. Similar tasks exist for the relations, like mapping "awarded" to "award received": maybe there's some kind of dictionary entry for "award received", what it means, and a bunch of examples, and you're supposed to map one to the other. These are standard tasks, and the system we're looking at is not much concerned with them; it simply uses pre-existing methods to do these things. The system we're looking at today does the first part: it takes text and comes up with candidate facts about the text. How these are then mapped to the schema is a different question; there are pretty cool things in this paper about that step too, but we're first going to look at the first step and then at the second.
So how does this system do this? There have been machine learning models for this before, but being machine learning, they all have some sort of training corpus, where you have the facts as a training set and a separate set of facts as a test set, and you try to learn, from the conjunction of the text and the training facts, how to extract facts. Not this system: it simply uses a pre-trained language model. The reasoning is the following. We used to think that we could do NLP best by having a knowledge graph, this set of very structured data. We could answer something like "what's the age of Barack Obama's wife?": you go to the entity Barack Obama, look at the relation "spouse", go to Michelle Obama, and look up her birth date, which is all structured information in this graph. Search engines like Google have this built in; there is a knowledge graph entry that sometimes pops up when you search an entity in Google, and these have been very useful for answering questions like this. However, in recent years, language models have become better and better; things like BERT or GPT-2 have become better than these expert systems, let's call them, at answering questions. By the way, if you want to hear a very cool and solid argument for where this kind of expert system, this structured, human-annotated or extracted information, can still come in handy in natural language understanding, I would recommend the Machine Learning Street Talk episode we had with Walid Saba, an extremely interesting person; I can recommend listening to that, and it should be out any day now if it is not already. So the language models have become better and better at these tasks without having this structured information. The hypothesis is: maybe these language models already contain the information that's necessary to construct these structured facts. Because structured facts are what we should, let's say, use to answer these questions (we feel that structured information is better than unstructured), and the language models are pretty good at these tasks, maybe we can get the structured information out of the language models. So that's what they do. They say the
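The multi-hop lookup described above (Barack Obama, over the "spouse" relation, to a birth date) can be sketched as a tiny triple store. Everything here is illustrative toy data, not any real knowledge graph's schema:

```python
# Minimal sketch of a knowledge graph as a set of (head, relation, tail)
# triples, with the multi-hop lookup described above ("age of Barack
# Obama's wife"). Toy facts for illustration only.
triples = {
    ("Bob Dylan", "occupation", "songwriter"),
    ("Bob Dylan", "award_received", "Nobel Prize in Literature"),
    ("Barack Obama", "spouse", "Michelle Obama"),
    ("Michelle Obama", "date_of_birth", "1964-01-17"),
}

def tails(head, relation):
    """Return all tails t such that (head, relation, t) is a known fact."""
    return [t for (h, r, t) in triples if h == head and r == relation]

# Two-hop query: Barack Obama -> spouse -> date_of_birth
spouse = tails("Barack Obama", "spouse")[0]
print(tails(spouse, "date_of_birth"))  # ['1964-01-17']
```

The point of the structured representation is exactly this: each hop is a cheap, exact lookup, which is what the paper wants to recover from an unstructured language model.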

### Paper Contributions [13:40]

contributions are as follows: we show how to construct knowledge graphs from pre-trained language models; the graphs are constructed with a single forward pass of the pre-trained language models, without fine-tuning, over the textual corpora. I think this is a very strong point about this paper, and it also shows that if you're some PhD student somewhere and you don't necessarily have the resources to train the next GPT-3 model, or even to fine-tune it, there is still research to be done: if you have enough resources to forward-pass your data, which often takes much less than training, you can still do very cool research. I think this paper shows that explicitly. They also say this helps researchers explicitly understand what the language models learn, bridging the deep language model and knowledge graph communities through enhanced model transparency. They say: we propose an unsupervised two-stage approach, MAMA (M-A-M-A, which stands for Match and Map), to first match the candidate facts in the corpora with the knowledge stored in language models (that's the first step we looked at), then map the matched candidate facts to both fixed and open schemas to produce a knowledge graph. And then they say they produce a new type of knowledge graph: sometimes the facts they extract can't really be mapped to a schema entry, and we're going to look at that, because I think a bit critically of it. They say: namely, the open knowledge graph consists of mapped facts in the fixed schema of existing knowledge graphs annotated by humans, and unmapped facts in the open schema that are new with respect to the reference knowledge graph schema. So what they claim here is that their system finds new relations that don't even exist in the schema and is able to uncover, to build, new additional schema entries; they call this the open knowledge graph. I'm a bit skeptical of this, as we are going to see.

### Candidate Fact Finding Algorithm [15:50]

So, the first step: how do you come up with candidate facts? Say you have a sentence, and this is honestly a very poor example (I get it, it must be short, but it's a poor example, so stay with me): "Dylan is a songwriter", and you would like to extract a fact from it. The paper is not really written clearly on how this works; you can parse it out, but the description is kind of distributed. Step one is: run spaCy. spaCy is a standard library for NLP, used here to extract noun phrases, or as they call them, noun chunks. So step one has nothing to do with the language model; you simply find the noun phrases, which here are "Dylan" and "songwriter". These noun phrases define the head and the tail of the fact, so you already have two of the three things. So the entire method they're proposing is: step one, run spaCy to find the head and the tail of the facts; step two, question mark for now; step three, use the entity linking system and the relation linking system to construct the knowledge graph. Step one is "steal underpants" and step three is "profit", so step two is obviously where their system comes in. Step two is: here is the head and here is the tail in the text; somewhere in between there might be a relation, and we need to figure out where it is.
So how does this method figure that out? You already see that the assumptions here are very restrictive. You use spaCy to extract noun phrases, which means you're probably already going to miss a lot of things that are not recognized as noun phrases; they also say that spaCy's annotations are sometimes error-prone, and that's why they miss a lot of things. And secondly, there is the assumption that the relation must lie between the two things, textually. You can run the algorithm forward and backward, but still, the relation must be in between, and it must be encoded as a, let's say, semi-accurate string in there (I guess the rest is up to the relation linker). These assumptions are super constraining in the kinds of things you can find, and you'll see in the experiments that their biggest flaw is a very low recall. So do all the systems on this task, apparently, but they still have a very low recall, and it's because they constrain their problem so much. I'm going to guess that if they didn't constrain their problem so much, they would maybe have better recall, but their precision would plummet, because if you let these things run wild, they just over-extract: basically every verb in every sentence becomes a relation. Like "I ate a banana": "I, ate, banana" would be a triple, not necessarily a really valuable entry in any knowledge graph (though a banana has a lot of carbs, so I would want to know about that). So you see that the task is now reduced from building knowledge graphs to: given a head span and a tail span in the string, extract any span in between that describes the relation.
The way the algorithm does this is where it uses the language model. It does something similar to dynamic programming; if you've seen dynamic-programming search algorithms, say string-matching algorithms, this is going to be similar. We start from the head in the string (there could be text before it); we simply locate the head, "Dylan", and start there. Then we look at its attention matrix. The attention matrix (I've done many videos on attention) basically says, for a sequence, how much each token attends to each other token, how much information is sent from each other token to this token. So this up here would be the query, and these would be the keys; the attention matrix specifies that. Since we locate things between the head and the tail, we want to disregard everything behind the query and only look ahead in the sentence. That's why part of the attention matrix is crossed out, as you can see with the X's: exactly because we only search in one direction. From the token "Dylan" we can look at three things: "is", "a", or "songwriter", and the question is simply where we go next with this algorithm. There's no interpretation yet; where we go next is simply answered by taking the highest-scoring entry in that column of the attention matrix. I look at the attention column of the token "Dylan" and take the highest-scoring entry; 0.3 is highest here, so "is" gets into my candidate fact. Once I put "is" in, I go to "is", look at the corresponding attention column, and see what's now the biggest entry: it is 0.4, which is "songwriter". You can see that we skip the "a"; that's how we leave out some text, by skipping it. This can create artifacts, like holes in the middle, and so on. But we skip "a", go directly to the 0.4, and discover that this is our tail. So we put our tail into the candidate, and since the tail is the last word, we can stop the algorithm. There is no need to go on, even if there were text behind the tail, because as soon as we are at the tail (which we already know; remember, we're given the head and the tail) we stop. So: we simply go forward, always taking the biggest entry in the attention matrix, until we reach the tail. That's the algorithm.
It's described here, but in a way that uses these "actions", like START, YIELD, and STOP. Maybe I'm not understanding something, but it seems completely unnecessary to describe these actions. START says the search starts from the head, which is added as the initial candidate; YIELD says the token with the largest score from the attention matrix is appended to the end to yield the new candidate; and STOP says we stop. And the algorithm description basically just says: while we're not done, if it's not the STOP action, we continue. It doesn't tell you anything. This is a super unclear description of the algorithm: the whole logic you would want to know about is in this "action manager", which produces the action and does the actual work of figuring out which token to pick next and where to go. That is nowhere; the algorithm as written just describes beam search. The little extra sophistication that comes in is that you don't do this deterministically; you actually do it via beam search, but you can just generalize the greedy procedure. So the description is a bit sloppy, with the whole actions and action manager and whatnot, while the one thing they don't describe formally is how to actually select the next token, which is basically the entire meat of the algorithm. In any case, here is something that confuses me.
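The greedy forward search just described can be sketched over a toy attention matrix. The numbers follow the "Dylan is a songwriter" walkthrough (0.3 to "is", 0.4 to "songwriter"); this is an illustrative reconstruction, not the authors' code, and the real MAMA system uses beam search instead of pure greedy selection:

```python
# Greedy variant of the candidate-fact search described above: starting at
# the head, repeatedly jump to the highest-attention token *ahead* of the
# current position until the given tail is reached. Toy attention values.
tokens = ["Dylan", "is", "a", "songwriter"]

# attn[i][j]: attention score between token i and a later token j.
attn = [
    [0.0, 0.3, 0.1, 0.2],   # from "Dylan": "is" scores highest (0.3)
    [0.0, 0.0, 0.2, 0.4],   # from "is": "songwriter" scores highest (0.4)
    [0.0, 0.0, 0.0, 0.1],
    [0.0, 0.0, 0.0, 0.0],
]

def extract_candidate(head_idx, tail_idx):
    """Walk forward from head to tail, greedily following attention scores."""
    path, degree, pos = [head_idx], 0.0, head_idx
    while pos != tail_idx:
        # Only look ahead: the crossed-out half of the matrix is ignored.
        nxt = max(range(pos + 1, tail_idx + 1), key=lambda j: attn[pos][j])
        degree += attn[pos][nxt]   # the "matching degree" accumulates scores
        path.append(nxt)
        pos = nxt
    return [tokens[i] for i in path], degree

fact, degree = extract_candidate(0, 3)
print(fact, round(degree, 2))  # ['Dylan', 'is', 'songwriter'] 0.7
```

Note how "a" is skipped because the jump from "is" goes straight to the higher-scoring "songwriter"; that is exactly the hole-creating behavior discussed above, which constraint three later forbids.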

### Causal Attention Confusion [25:50]

Here, so, fair enough: they take the attention matrix and cross out these X's. And they say they can take models like BERT, and, as I said, BERT has a full attention matrix, everything attends to everything. But they can also take things like GPT-2. Now GPT-2 is an autoregressive language model, which means you produce each token one after another, so each token, when you train or even when you evaluate, can only attend to the things before it. You see the problem with what this method requires? It is the exact opposite: each token's attention entries are deleted such that only the entries ahead of it remain. You don't actually get GPT-2 to give you an attention matrix that looks ahead, because it only ever looks behind. So maybe what's happening is that the query and key matrices are switched up in some way.
In that case, when we want to interpret the algorithm the way they write it down: if I am at a particular part of what I think is the relation between the two entities, how am I going to find whether there is more to the relation (it could be a multi-word relation, like "has a child with"; I can't think of other multi-word relations right now), or whether we are done with the relation and go to the tail? What this thing is saying is that we should ask the language model. If this is really the query, and you are at the word "is", and this is a BERT-style language model, what you want to know is: if I were to delete this word, which other words in the sentence that are ahead of me are very informative for predicting this particular word? That's the query style. If the answer turns out to be that "songwriter" is quite important for that (maybe "Dylan" is too, but we only look ahead), and the word "a" is not as important as the word "songwriter", because "songwriter" gives an indication that there should be an "is" (songwriter is a profession and there's a person in front of it), then the attention matrix would have that in mind, and that's how this construction is made. However, if this is the key, we have to think the other way around: if we are at "is", we look ahead and ask, if I were to delete the word "a", how well could I reconstruct it from this word "is"? Or if I delete "songwriter", how well could I reconstruct that from "is"? I think there are interpretations for both of these readings.
But what I want to convey is that none of these things are really amenable to constructing a knowledge graph. It's quite interesting that this stuff actually works, because all it asks is: how well does one word inform about the presence of another word, how well can one word predict another? And from that information we construct this knowledge graph, which is probably a testament to the fact that knowledge graphs, if you extract them from a corpus, maybe aren't so much about knowledge, but more about grammar. I think that's what goes on here, because these language models are a lot about grammar, a lot about which words appear together frequently. It's a mix of grammar and basic word knowledge: given that "songwriter" is the object here, the word "is", being the verb, is probably quite important for it. And these triples always appear a bit like compressed sentences, which are very grammatically relevant. So I'm not buying the hypothesis that there is much knowledge in these language models and that's why this works. What I much rather think is that they are really, really good at grammar and at statistical association between words across the language, and that's why they can extract these candidate facts so well. Okay, so that's what I think about the
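The directionality mismatch discussed above can be made concrete with masks. A causal (GPT-2-style) mask lets position i attend only to positions j ≤ i, while the search wants scores for j > i; this toy illustration is independent of any real model:

```python
# Illustration of the directionality issue discussed above: a causal
# (GPT-2-style) attention mask allows position i to attend only to j <= i,
# while the forward search needs scores for the positions j > i.
n = 4  # sequence length, e.g. ["Dylan", "is", "a", "songwriter"]

causal_mask = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
lookahead_mask = [[1 if j > i else 0 for j in range(n)] for i in range(n)]

for row_c, row_f in zip(causal_mask, lookahead_mask):
    print(row_c, row_f)
# Each row of the causal mask is zero exactly where the search needs
# scores, which is why the query/key roles must effectively be swapped.
```

In other words, the positions a causal model can score and the positions the search needs are complementary, matching the "exact opposite" complaint above.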

### More Constraints [31:25]

algorithm. They do constrain it some more, as if it didn't already have enough constraints, but the constraints all make sense. First, the matching degree, which is simply the sum of all the attention matrix entries we encountered during the search (all the ones we didn't skip, counted together), the matching degree of the triple, must be above some threshold. That's the first constraint. They give an example for the sentence "Rolling Stone wrote: no other pop song has so thoroughly challenged artistic conventions", where the extracted candidate fact is (Rolling Stone, wrote, pop song). Again, you can see it's mostly grammar: spaCy extracts "Rolling Stone" and "pop song", and the language model extracts, like, the only verb in between, "wrote". Requiring a minimum matching degree makes a lot of sense, because a high matching degree means that, going by the attention matrix, the words in the candidate fact follow from each other: the language model thinks that "wrote" is a very good follow-up to "Rolling Stone", and "pop song" is a very good follow-up to "wrote" (or the other way around, depending on which way the attention matrix goes). The language model thinks these words make sense together, in the context of the sentence, of course, of the entire sentence. As I said, you can sort of think of this as a bit of a summarization paper, but with more constraints.
Constraint number two is that the frequency of r is above a threshold: the relation itself shouldn't be too specific; it should actually appear a bunch of times in the corpus. So you go through the corpus once, extract all the candidate facts (my pen just dropped), then you count them, go through the candidate facts again, and delete all the ones below a certain count. People usually do this with things like stop words or rare words; it's pretty standard and makes a lot of sense. Constraint number three: the relation r must be a contiguous sequence in the sentence. We have an example from the same Rolling Stone sentence: (Rolling Stone, wrote challenged, conventions), which the language model would like to extract, because, in the context of that sentence, these words jump to each other in the attention matrix (you can predict them from each other very well), but "wrote challenged" is not contiguous in the sentence. This is what I said before could happen; with this constraint they exclude it. Okay.
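The three filters just described can be sketched as a simple post-processing pass over candidate facts. The thresholds, the degree values, and most of the example facts are made up for illustration; only the shape of the filtering follows the paper's description:

```python
# Sketch of the three constraints described above, applied to candidates
# of the form (head, relation, tail, matching_degree, contiguous).
from collections import Counter

candidates = [
    ("Rolling Stone", "wrote", "pop song", 0.9, True),
    ("Rolling Stone", "wrote challenged", "conventions", 0.8, False),
    ("Dylan", "is", "songwriter", 0.7, True),
    ("I", "ate", "banana", 0.1, True),
    ("Bob Dylan", "is", "artist", 0.6, True),
]

MIN_DEGREE = 0.5    # constraint 1: matching degree above a threshold
MIN_REL_FREQ = 2    # constraint 2: relation must recur in the corpus

rel_freq = Counter(r for (_, r, _, _, _) in candidates)

kept = [
    (h, r, t)
    for (h, r, t, degree, contiguous) in candidates
    if degree >= MIN_DEGREE          # constraint 1
    and rel_freq[r] >= MIN_REL_FREQ  # constraint 2
    and contiguous                   # constraint 3: contiguous in sentence
]
print(kept)  # [('Dylan', 'is', 'songwriter'), ('Bob Dylan', 'is', 'artist')]
```

Note how each filter removes a different failure mode: the low-degree "ate banana" triple, the non-contiguous "wrote challenged", and the relation "wrote" that (in this tiny toy corpus) occurs only once.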

### Mapping Facts To Schemas [35:00]

So, for the second part, where they actually have to map a candidate fact to a fact in the schema: as I said, they use pre-made solutions, entity linking and relation mapping against the schema. I won't go into this, except to say that whenever they find a match, they call it a mapped fact, and whenever they don't, they call it an unmapped fact. An unmapped candidate means that at least one of h, r, t is not mapped to the schema. There are two types: partially unmapped facts, where some are mapped, and completely unmapped facts, where none of h, r, and t are mapped to the schema. For example: "Jacob was a registered Mennonite."
Now, it's a cool thing if a model like this can actually come up with new facts, and not only new mapped facts, which is something you would expect: if humans provide some kind of schema and then build a knowledge graph, this is never complete, so if you can automatically fill in missing facts, that's very cool. (Though I would say that humans, when constructing knowledge graphs, should probably also build negative connections, saying: yes, it is conceivable that Elvis was a vegan, because a lot of texts talk about it, but in fact he explicitly was not. I don't think that's what we have in knowledge graph software.) It would be cool if this model could fill in new facts under the schema, and it would also be cool if it could uncover completely new relations that hadn't been considered by the human makers of the knowledge graph: if the knowledge graph itself is incomplete, then by the same argument the schema is probably also incomplete. This paper is sort of trying to sell their system as something that can do that, and I believe it, to a degree. But: "Jacob was a registered Mennonite"? Maybe I'm completely wrong, but from the sentence "Jacob was a registered Mennonite in Amsterdam": Mennonite is a religion, I think, and I'm very sure that any of these knowledge graphs, with the schemas they have, have "being in a religion" or "being of a certain faith" in their relations table somewhere. And I'm also pretty sure that Mennonite is large enough that it would actually appear as an entity. Maybe Jacob not; maybe Jacob is an unknown Jacob, we don't know who Jacob is. But this seems more like a failure of the entity linker and the relation linker than an uncovered new relation or an uncovered new entity. So take this stuff with a grain of salt. Now, they are very honest about this, but just to say
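The mapped/partially-unmapped/completely-unmapped distinction can be sketched as alias-dictionary lookups. Real systems use trained entity and relation linkers; the alias tables below are toy assumptions (Q392 is the Bob Dylan ID mentioned earlier in the video, the rest is invented):

```python
# Toy sketch of the mapping stage: link surface strings to schema entries
# via alias dictionaries, and classify the result as in the paper's
# mapped / partially unmapped / completely unmapped terminology.
entity_aliases = {"dylan": "Q392", "bob dylan": "Q392"}        # -> entity IDs
relation_aliases = {"awarded": "award_received", "is": "occupation"}

def map_fact(head, rel, tail):
    """Return the schema-mapped triple plus its mapping status."""
    h = entity_aliases.get(head.lower())
    r = relation_aliases.get(rel.lower())
    t = entity_aliases.get(tail.lower())
    if all([h, r, t]):
        status = "mapped"
    elif not any([h, r, t]):
        status = "completely unmapped"
    else:
        status = "partially unmapped"
    return (h, r, t), status

print(map_fact("Dylan", "awarded", "Nobel Prize"))
# (('Q392', 'award_received', None), 'partially unmapped')
```

The "Jacob was a registered Mennonite" complaint above corresponds to a lookup miss in tables like these: the string fails to link even though a suitable schema entry plausibly exists, which is a linker failure, not a genuinely new relation.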

### Example Constructed Knowledge Graph [38:40]

that's probably what happens most often. So here you can see the graph for Bob Dylan, constructed from the Wikipedia pages that are, they say, around the page of Bob Dylan, so I guess one, two or three hops away, something like this. And you can see the blue stuff is stuff that we already knew, so that's what the humans also found when looking at this. Then the yellow stuff, I believe, is either new relations, so whenever things are annotated, they're in the schema: you can see this is an entity in the schema, because it's annotated, this is a relation in the schema, but the arrow is new, so the humans hadn't yet extracted the fact that Bob Dylan was a member of Artists United Against Apartheid. The yellow also sometimes means that there is a new thing, so here "tour with" is a relation that's extracted that is not in the knowledge graph yet, also this one. And it's pretty cool, right, that you can extract these things automatically. There's a lot of yellow stuff here, which means there's now a lot of new information that this extracted, and a lot of this new information is actually mapped to the schema, right? Bob Dylan, residence in, Duluth, I don't know how to pronounce that, by the way. Yes, so that's fairly cool.

### Experimental Results [40:10]

They do some of these knowledge base tasks. So in these tasks, what you'd have, I believe, is always a head and a relation given: you have a document, you are given a head and a relation, and you're asked, what's the tail of this, right? And then you ask the system, and the system will tell you. So you have these baselines, and these baselines, I believe, are specifically made to extract these knowledge representations; they might even be trained, I don't know that. But you can see that MAMA, even the smallest one here, beats those by quite a bit. Now, you can see that the recall is significantly lower than the precision, which is a direct result of how many constraints there are on the system, and tells you sort of, going forward, what the improvements can be.

So they analyze a lot of this. A first recognition is that larger and deeper language models produce knowledge graphs of higher quality; BERT language models outperform GPT-2 language models under similar model sizes, which is interesting. MAMA is scalable to larger corpora, which again, as we said, you don't need to train it, and larger corpora embed more complete knowledge graphs, which is something we would expect.

The other interesting part is the unmapped facts. The numbers you can actually compute only for the mapped facts, right, because that's where you have data: humans produced the knowledge graphs from this, that's what you can compare with. Now, for the unmapped facts, they say: "We turn to study the quality of the candidate facts that are not mapped to the above reference knowledge graph schema, but are in the open schema generated by MAMA. We manually judge such unmapped facts generated by our best method from 100 sampled documents in Wikidata and TAC KBP, respectively." So they go, as researchers, they look at these things and judge whether or not they're true, given these documents in Wikipedia. They say the quality of unmapped facts is verified. So the claim is that they've looked at them and they are good. "We find that 35.3% of the unmapped facts are true on Wikidata. We find that 83.2% of those true facts are partially unmapped facts", for example, "Bob Dylan, tour with, the Grateful Dead". And if this really isn't in the schema, right, this is a nice relation that you might think humans would miss, because touring with someone is not the first thing that would come to mind if you had to come up with a bunch of relations between entities, but it is something that is regularly used for musicians. So that is an application where certainly an automated system can even extend the schema, right? The "tour with" relation is not within the schema of Wikidata, while both head and tail are in the schema. The remaining true facts are completely unmapped facts, for example, this "Jacob was a registered Mennonite". And they also say accurate entity detection is desired, where they say a lot of the errors are due to spaCy detecting incorrect entities, or due to incorrect or missing entity linking by those systems. The rest of the errors made by MAMA are incorrect relation phrases, such as uninformative relation phrases, for example, "Bob Dylan, made, his breakthrough". Oh, what can you do? What other verb would you put there?

Yeah, but okay, we're going to look at a few last things right here. They have a bunch of experiments where they show, you know, the beam size has an influence, and constraints number one and number two that we looked at have an influence, right, so you can tune these things a bit. What is interesting here is that they try to look at either the attention matrix of the last layer or of all the layers, and interestingly, the system performs better if you only look at the attention matrix in the last layer. Now, they reduce that attention matrix, because there are multiple heads, using max or mean, and you can see they perform similarly. But it is interesting that only the last layer works best, and they argue in the text that we know the last layers have kind of higher-level features than the lower layers. But I recall there are multiple papers, like, I've done videos about them, "What does BERT learn?" and so on, I think even something in conjunction with lottery tickets, that show that in a transformer, at least I think, it is the middle layers that encode the most kind of semantic knowledge. Because the lower ones, yes, they are for kind of low-level features, but the upper ones are again for low-level features, because the task right here at the end is to predict an individual word or token, right? So you'd expect that the features in the attention matrix there go back to kind of more grammatical features and so on, and that the highest-level features are actually somewhere in the middle. I don't know if they only tested all versus last, in which case, yeah, I believe that; but if they tested each one individually and it still turned out that last is the best, that would kind of add to my hypothesis that what happens here is more of a grammatical effect of extracting the correct candidate verb in between the head and the tail. All right, so that gives more weight to my hypothesis. So, to repeat, my hypothesis is that it's kind of a grammatical thing that's going on here, because the only task of this model is basically to find the correct string span for the relation between head and tail, because it's already given head and tail from the text. Their hypothesis is more like, the language models have a lot of knowledge built into them, and we can extract that knowledge; they make it sound like the language model has this semantic knowledge in it. Okay.
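The head-reduction step they mention (collapsing a layer's multiple attention heads into one matrix via max or mean) can be sketched as follows. This is a minimal sketch under made-up shapes: the head count, sequence length, and random data here are illustrative, not the paper's actual model dimensions.

```python
import numpy as np

# One layer of attention from a transformer: (heads, seq_len, seq_len).
# The candidate-fact search needs a single (seq_len, seq_len) matrix to
# walk over, so the heads are reduced with either mean or max.
num_heads, seq_len = 12, 5
rng = np.random.default_rng(0)
attn = rng.random((num_heads, seq_len, seq_len))  # stand-in attention weights

attn_mean = attn.mean(axis=0)  # average over the head axis
attn_max = attn.max(axis=0)    # element-wise maximum over the head axis

print(attn_mean.shape, attn_max.shape)  # both (5, 5)
```

Either reduction yields one token-to-token score matrix per layer; the paper then only uses the last layer's reduced matrix, which is the choice being debated above.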

### Example Discovered Facts [47:25]

So let's look at a bunch of mapped facts right here. You can maybe check out a lot of them yourself, but we'll just look at one in each category: "blah blah's mail yada yada is in worse shape, however, Klaus told a press conference at the western city of Essen", where yada yada, and it extracts this company and maps it to "city of headquarters". Maybe they leave out some text here. What I want to get to is the unmapped facts. Where are the unmapped facts? Just to kind of show you: mapped facts, unmapped facts, okay. So for the unmapped facts, what I feel, and you can judge for yourself, please, just to pre-bias you before we look at them, is that a lot of the time it simply extracts things that it can't assign, right? It's a failure to assign, it's not a new thing, because in these schemas, like, you haven't seen the schemas, but from the last table you kind of get a feel of what's contained in them. So: "Ernst Haeckel was born 16th of February 1834 in Potsdam". Okay, so the extracted thing is "Haeckel, was born on 17th of February 1833 in, Potsdam". So it maps Haeckel, that is in the knowledge base schema, but "was born on 17th of February 1833 in" is simply a failure of the relation linker. Then: "He was also a pacifist until the First World War", yada yada. And then "Ernst Haeckel", and then "was" and "a pacifist" are both not in the schema. Now, maybe pacifism isn't in the schema, though I would guess pacifism has a Wikipedia page, so it must be in this schema, because it's Wikidata. But "was", you know, the relation here should be something like a political leaning or something like this, which is certainly in the knowledge base, right? Then you have things like "Haeckel was awarded the title of Excellency". So you have, correctly, Haeckel again recognized, "award received" is in the schema, nice, "Excellency" as a tail. And "Excellency", you know, what do you want? This is not a fact, right? "The award" or "the title of Excellency" would be kind of the thing. So this is a failure of spaCy.

### Conclusion & My Comments [50:40]

So again, I've seen few facts here that would actually be a genuine addition to the schema, ones that should be considered, and I absolutely believe that the schema is incomplete, don't get me wrong. Like, the schema is probably less than one percent of what it should be, right, if we did a thorough job. I just don't think that this system here is that good; I think that the things this system comes up with mostly are simply failures of its subsystems, rather than genuinely new entries to the schema. That's different from when it genuinely discovers a new mapping between already established things, for example "Pauline Baynes, educated at, this college", right? So these are new facts that all fit in the schema, and the system might be very nice for that. All right, so that was my kind of estimation of this paper. I hope I didn't drag on it too much. As I said, it's very cool work, actually. Look at the appendix, it's giant, go check it out. Please tell me what you think about it in the comments, any feedback is welcome, and I will see you next time. Bye.
