# Semantic Parsing English to GraphQL | Andre Carerra | OpenAI Scholars Demo Day 2020

## Metadata

- **Channel:** OpenAI
- **YouTube:** https://www.youtube.com/watch?v=liMJS5DrnlQ
- **Date:** 09.07.2020
- **Duration:** 23:29
- **Views:** 3,766

## Description

Learn more: https://openai.com/blog/openai-scholars-2020-final-projects#andre

## Contents

### [0:00](https://www.youtube.com/watch?v=liMJS5DrnlQ) Intro

I'd like to talk about semantic parsing of English to GraphQL. First, a little background on GraphQL for those who don't have much of one: GraphQL is basically a query language for your API. It can be compared to SQL, which is a query language for databases; GraphQL can cover a whole API, which means it can cover a broad set of business logic, databases, and data types. One of the strengths of GraphQL is its ease of developer use: it lends itself well to nested relations, it provides a schema which serves as an API contract, and it can make the development experience for software engineers a little easier. Here we can see a little of what GraphQL looks like. GraphQL lets you describe your data in a schema (the slide shows a small example of that), which captures the relationships between different types of data. With that schema we can send a query, and that query gives us a predictable result: exactly what we expect to receive from this interface.
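The slide itself isn't reproduced in the transcript, but a minimal schema and query in this spirit (the type and field names here are illustrative, not taken from the talk) might look like:

```graphql
# Illustrative schema: a Song belongs to an Artist.
type Artist {
  name: String!
  songs: [Song!]!
}

type Song {
  name: String!
  releaseDate: String!
  artist: Artist!
}

type Query {
  songs: [Song!]!
}
```

A query against this schema names exactly the fields it wants, and the response mirrors that shape:

```graphql
query {
  songs {
    name
    releaseDate
  }
}
```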

### [1:23](https://www.youtube.com/watch?v=liMJS5DrnlQ&t=83s) Semantic Parsing

Semantic parsing is the task of converting a natural language utterance to a logical form: a machine-understandable representation of its meaning. In this case we're converting English natural language to GraphQL, which is the logical form. Here's what this looks like: if we have some English prompt, such as "what is the name and date of the song released most recently", and some GraphQL schema that defines the data, we can find the corresponding GraphQL query.

Why did I want to do this project? There are a few reasons. First, I wanted to understand the limits of general language models for semantic parsing. This task is similar to machine translation, where we take some input language and output a target language; in our case we input English and output GraphQL. Second, I wanted to see how it could ease the learning curve for developers: if a developer can interact with a model that generates queries for them, so they can see how those queries are structured, it might make things easier. Finally, this could be potential tooling for non-technical data users. An example would be a manager who uses Salesforce: instead of asking an engineer to write a custom query, they could type the question out in English and get a response.

There has been previous work on semantic parsing over a broad range of languages and domains. For my use case I was particularly interested in the SQL datasets; as you can see here, there are several, covering a broad range of domains and query complexities. Of these, the one that stood out the most (and I'll cover why in a second) is Spider. The problem was that there were no GraphQL datasets, and I wanted to train a model to generate GraphQL, which would be very difficult without one. So I looked into Spider specifically. Spider is a dataset for semantic parsing of natural language text to SQL. It has 10,000 questions and around 5,000 unique complex SQL queries, covering 200 different databases across 138 different domains. The train and test (validation) splits contain different queries and different databases, so for a system to perform well it must generalize to new queries, new database schemas, and new questions, which is a difficult task. This task has been tackled for the last few years with pretty good results: on this slide you can see the top five entries on the leaderboard, which reach around 60 percent exact match accuracy on the test set.

### [5:19](https://www.youtube.com/watch?v=liMJS5DrnlQ&t=319s) Converting SQL to GraphQL

For my task I had to create a GraphQL dataset, and the Spider dataset served as a great starting point. The first step was converting SQL to GraphQL. Starting out, I didn't even know whether this was possible, and it took some time to understand what tools I could use to accomplish it. The two most important tools were Hasura and pgloader: Hasura generates a GraphQL schema based off of a database, and pgloader converted the SQLite databases to Postgres, which allowed me to run Hasura on top of them. Then the bulk of the work was converting SQL abstract syntax trees to GraphQL abstract syntax trees, and I'll cover what that means right now. Here's a simple example. On the left side we have the SQL query `SELECT count(*) FROM songs`. We parse this query into a tree (again, this is a very simple example); that tree is then converted to a GraphQL tree, as you can see on the right side; and finally that tree is serialized back out as a raw GraphQL query.
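A toy sketch of the AST-to-AST idea for this one example can be written in a few lines of Python. This is illustrative only: the function names and dict-based tree shapes are my own assumptions, and the real conversion had to handle full Spider queries, not just this one pattern. The `<table>_aggregate` naming follows Hasura's convention for aggregate queries.

```python
# Toy sketch of SQL AST -> GraphQL AST conversion for
# `SELECT count(*) FROM songs`.  Tree shapes are illustrative.

def sql_count_ast(table):
    """A minimal SQL AST for `SELECT count(*) FROM <table>`."""
    return {"select": [{"agg": "count", "arg": "*"}], "from": table}

def to_graphql_ast(sql_ast):
    """Map the SQL tree onto a Hasura-style aggregate query tree."""
    table = sql_ast["from"]
    return {
        "field": f"{table}_aggregate",  # Hasura names aggregates <table>_aggregate
        "children": [{"field": "aggregate",
                      "children": [{"field": "count", "children": []}]}],
    }

def render(node, indent=0):
    """Serialize a GraphQL AST node back into raw query text."""
    pad = "  " * indent
    if not node["children"]:
        return f"{pad}{node['field']}"
    inner = "\n".join(render(c, indent + 1) for c in node["children"])
    return f"{pad}{node['field']} {{\n{inner}\n{pad}}}"

query = "query {\n" + render(to_graphql_ast(sql_count_ast("songs")), 1) + "\n}"
print(query)
```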

### [6:51](https://www.youtube.com/watch?v=liMJS5DrnlQ&t=411s) Validation

Generating this dataset required validation scripts, and there were a few things I looked for in them. When verifying the dataset I checked the syntax, making sure the queries that were formed were actual GraphQL queries; I validated the syntax against the schema, making sure the keywords used in each query were valid for the schema we were looking at; and I executed those queries against an endpoint to make sure they were valid. What this whole process resulted in is that about half of the queries were transferred over. The big obstacle was that Hasura doesn't include a GROUP BY clause: GROUP BY is used very often in SQL, and Hasura didn't have a good way to transfer it over. This could have been done manually, but I wasn't able to do that within the time limits of the program. In the end, because of the validation scripts, I was very confident in the dataset. Diving into the details, it contains 160 schemas across 138 different domains, around 4,300 unique English prompts, and around 2,400 unique GraphQL queries.
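A hand-rolled sketch of the first two validation passes might look like the following. This is my own toy version, not the talk's actual scripts, which presumably used a proper GraphQL parser; the third pass (executing against a live endpoint) isn't shown because it needs a running server. The schema vocabulary here is assumed for illustration.

```python
# Toy versions of two of the three validation passes described above:
# 1) crude syntax check, 2) field names must exist in the schema.

SCHEMA_FIELDS = {"songs", "name", "release_date"}  # illustrative schema vocabulary

def syntactically_valid(query):
    """Pass 1: braces must balance and never close more than they open."""
    depth = 0
    for ch in query:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

def valid_against_schema(query, schema_fields):
    """Pass 2: every field keyword in the query must exist in the schema."""
    tokens = query.replace("{", " ").replace("}", " ").split()
    fields = [t for t in tokens if t != "query"]
    return all(f in schema_fields for f in fields)

good = "query { songs { name release_date } }"
bad = "query { songs { title } }"  # `title` is not in the schema

assert syntactically_valid(good)
assert valid_against_schema(good, SCHEMA_FIELDS)
assert not valid_against_schema(bad, SCHEMA_FIELDS)
print("validation checks passed")
```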

### [8:18](https://www.youtube.com/watch?v=liMJS5DrnlQ&t=498s) Models

After that, the next step was experimenting on the dataset and seeing what kind of results I could get. I experimented with a few different models; the ones that stood out the most were BART and T5, two models that researchers from different groups have come up with. Both are encoder-decoder transformer models; where they vary is that BART uses a bidirectional encoder. Both lend themselves very well to translation tasks, so I thought they could lend themselves to my task as well. As I mentioned before, the process looks like this: we input an English prompt, in this case "what is the name and date of the song released most recently", concatenated with the GraphQL schema (it looks a little like what you see on the left here); that is passed through our model, T5; and the output, or target, is the GraphQL query we see on the right. This is done with an autoregressive objective. Part of this required the use of a validation metric.
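A sketch of how one such input/target pair might be assembled is below. The exact serialization format is an assumption on my part; the talk only says that the prompt and the schema are concatenated as the input, and the Hasura-style `order_by`/`limit` arguments in the target are a plausible guess at the query, not a quote from the slide.

```python
# Build one training example for the seq2seq objective described above.
# The concatenation format is assumed; the talk only says prompt + schema.

def make_example(prompt, schema_sdl, target_query):
    source = f"translate English to GraphQL: {prompt} schema: {schema_sdl}"
    return {"source": source, "target": target_query}

schema = "type Song { name: String! release_date: String! }"
example = make_example(
    "what is the name and date of the song released most recently",
    schema,
    "query { songs(order_by: {release_date: desc}, limit: 1) { name release_date } }",
)
print(example["source"])
print(example["target"])
```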

### [9:40](https://www.youtube.com/watch?v=liMJS5DrnlQ&t=580s) Validation Metric

I needed a validation metric to make sure my results looked like what I expected them to: I wanted my outputs to look like my targets. The validation metric I came up with was exact set matching accuracy. What this means is that since I could parse my GraphQL queries, the target and the output, into abstract syntax trees, I could compare those two trees to see whether they matched. In these abstract syntax trees the order of the child nodes doesn't matter, so two trees with a different order of child nodes can be equivalent. For example, the slide shows two small trees where a green child node appears on the left in one and on the right in the other, and the two represent equivalent queries; I wanted to make sure the validation metric could handle that.
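The order-insensitive comparison can be sketched by canonicalizing each tree before comparing. The tuple-based node representation here is my own; the point is only that siblings compare as a set, not a sequence.

```python
# Sketch of order-insensitive tree equality for exact set matching.
# Each AST node is (label, children); siblings are sorted into a
# canonical order, so their original order does not affect equality.

def canonical(node):
    """Recursively sort each node's children into a canonical form."""
    label, children = node
    return (label, tuple(sorted(canonical(c) for c in children)))

def exact_set_match(tree_a, tree_b):
    return canonical(tree_a) == canonical(tree_b)

# The talk's example: selecting (name, release_date) in either order.
q1 = ("songs", [("name", []), ("release_date", [])])
q2 = ("songs", [("release_date", []), ("name", [])])
q3 = ("songs", [("name", []), ("artist", [])])

assert exact_set_match(q1, q2)      # same fields, different order -> equal
assert not exact_set_match(q1, q3)  # different fields -> not equal
print("exact set matching works")
```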

### [10:48](https://www.youtube.com/watch?v=liMJS5DrnlQ&t=648s) Example

A good example of this is the query here: the song name and release date could be switched, and the query would still be equivalent. So what did the results look like?

### [11:02](https://www.youtube.com/watch?v=liMJS5DrnlQ&t=662s) Results

With these two models, and specifically with T5, which performed the best, I got 46 to 50 percent exact set matching accuracy on the GraphQL validation set, compared to the 20 percent SQL exact set matching accuracy I got with the same models. This is an interesting result for a couple of reasons. The range of 46 to 50 comes from the same model: the 46 percent model was trained only on the GraphQL queries, while the 50 percent model was also trained on the SQL queries, and for some reason that made it perform better. My guess is that the model learned which keywords were important across schemas. As I mentioned before, the best existing Spider models get around 65 percent exact match accuracy, as we can see here. So why did my model perform worse? Because the models on that leaderboard use architectures specific to SQL: they can only produce SQL and wouldn't be able to produce GraphQL, whereas my model can produce both SQL and GraphQL. This sets up future work where we could look for ways to increase accuracy on both, and maybe even across other query languages. As I mentioned before, since this dataset is based on the Spider dataset, a model has to generalize over new schemas, new questions, and new queries to perform well, and to me 50 percent accuracy says the model is able to generalize pretty well. What I failed to mention earlier is that exact set matching accuracy is more of a lower bound, because multiple different queries can return the same information, and it's harder to recognize that equivalence when parsing the trees. So what does this look like? It'll be helpful to see it in action.

### [13:35](https://www.youtube.com/watch?v=liMJS5DrnlQ&t=815s) Demo

Here we have the database we're using, called music_1, which has information about different music genres and artists, and we can ask a question such as "what is the country of the artist named Enrique". We generate a GraphQL query, as we can see below on the left side, then we send that query to the server and get a response; the answer to the question is that the country is USA. Here's another example, which shows that the model is able to generalize to a different database. There are 160 different databases, and instead of music_1 we'll look at flight_2, which has information about different airlines and flights. We ask a new question, "give the airline with the abbreviation UAL", generate a GraphQL query, send it, and get a response: in this case, United Airlines. This is just a small example of what it can do.
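The generated queries aren't reproduced in the transcript, but a Hasura-style query for the first demo question might plausibly look like this (the table and field names are assumptions, not from the talk):

```graphql
query {
  artist(where: {name: {_eq: "Enrique"}}) {
    country
  }
}
```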

### [14:52](https://www.youtube.com/watch?v=liMJS5DrnlQ&t=892s) Future work

In the next couple of days I want to release these models and code so that everybody is free to use and improve them. I've been working on a paper to submit to arXiv, and within the next week I'll be submitting that as well. A little longer term, there are a couple more tasks. One is to add more examples that take advantage of GraphQL: the current examples only exercise the simpler aspects of GraphQL, but it wouldn't be too difficult to add more complex ones. Another is to test on an enterprise schema: Salesforce and GitHub both have GraphQL APIs, and I'd like to see what kind of results I could get semantic parsing English to those endpoints.

I just wanted to say thanks to a few communities and people. I want to thank OpenAI for the opportunity to be part of the Scholars program; I've learned a lot from the people at OpenAI, and I enjoyed having the flexibility to work on this project. Thanks also to the Hugging Face and PyTorch Lightning communities, who made themselves very available for questions on how to use their tooling. I want to thank my mentor Melanie for taking the time to work with me; as Sam said, it felt like having somebody working with me, getting through the more difficult parts of getting into the field. Thanks also to Christina and Mariah, who were in charge of the Scholars program, were very helpful, and made themselves available to all of the scholars whenever we needed help. And I want to thank my wife Noel, who has been very supportive throughout this whole program. Now I'll open it up for any questions.

### [17:00](https://www.youtube.com/watch?v=liMJS5DrnlQ&t=1020s) Questions

I can see the questions here on the Q&A.

**Q: Are your results using models trained from scratch, or fine-tuned pre-trained models? Also, what are the model sizes?**

Good question. The models I was using were pre-trained on large corpora; in the case of T5, for example, it was trained on a corpus generated from the internet, so it has learned a lot about language in general. What I did was fine-tune those models for my specific task in a couple of different ways, the main one being converting English to GraphQL. As for model sizes, these models weren't extremely big: I was able to fit both onto the GPUs available on Google Colab, which gives around 16 gigabytes of memory, and run them fine with some smaller batch sizes. Anybody should be able to do that with a Google account.

**Q: How often does this model generate syntactically invalid outputs?**

That's a good question as well. As I mentioned before, the validation metric also covers examples that are not valid, and any invalid example is counted as wrong. So 50 percent means 50 percent were valid and the correct query I expected, which also means the upper bound on syntactically invalid outputs is 50 percent. In practice, though, when I looked at it, it tended to be much smaller: I would say something like five percent of the outputs were syntactically invalid, and that's because this model has the flexibility to output different types of query languages.

**Q (from mris): What was the most challenging part of the project?**

I think the most challenging part was definitely converting the SQL queries to GraphQL queries. It involved a lot of parsing, using trees and graphs to figure out what the right GraphQL query would be. All in all, that process probably took a month by itself to work out, and at the very beginning I didn't even know if it was possible; as it turned out, about half of the queries were possible to transfer over.

**Q (from Alec Radford): Do you think general-purpose architectures like T5 are sufficient, or is there still a need for domain-specific architectures like the SQL-specific ones you mentioned?**

I've been thinking about this question, and I think it depends on the difficulty, because there is a difference between a model whose whole architecture is tied to SQL and a model where just the output heads are tied to SQL. If we could replace just the SQL head and put a GraphQL head on there (and this could be future work, obviously), that would be great, though it costs the time required to create those heads. I think we probably need to explore both a little better and compare them; that's a good place to look in the future.

**Q: How often does the model generate syntactically invalid outputs?**

I already covered that: probably around five percent of the time.

**Q: What's the main metric you've used for evaluating your model?**

That's the metric I mentioned previously, exact set matching accuracy. I convert the GraphQL queries into abstract syntax trees and compare the two trees to each other; that way I could evaluate my model after every epoch.

**Q (from Christina): Really cool work! How do you imagine the model architecture differing for GraphQL as opposed to SQL? Is the problem space different in any way?**

I feel like most of the model could remain the same, because these general architectures, as we've seen with other examples, can be used for translation tasks, text generation, classification, and things like that. My intuition is that the work will go into preparing the right types of heads, and that requires somebody who understands how to output these abstract syntax trees directly. But I think that as long as the general architecture, in other words the middle part of the model, is able to understand the English, these models will be able to perform well.

---
*Source: https://ekstraktznaniy.ru/video/11595*