# What is BERT and how does it work? | A Quick Review

## Metadata

- **Channel:** AssemblyAI
- **YouTube:** https://www.youtube.com/watch?v=6ahxPTLZxU8
- **Date:** 17.01.2022
- **Duration:** 8:56
- **Views:** 82,176

## Description

BERT is a versatile language model that can be easily fine-tuned to many language tasks. But how has it learned the language so well? And what is a language model? And what does it mean to fine-tune a model?

Want to give AssemblyAI’s automatic speech-to-text transcription API a try? Get your free API token here 👇
https://www.assemblyai.com/?utm_source=youtube&utm_medium=referral&utm_campaign=yt_mis_15

We learn all that in this video.

What are Transformers? - https://youtu.be/_UVfwBqcnbM
What is GPT-3 and how does it work? - https://youtu.be/xB6hZwYsV2c
Source code of BERT - https://github.com/google-research/bert

## Contents

### [0:00](https://www.youtube.com/watch?v=6ahxPTLZxU8) Intro

BERT is one of those models that are based on the famous Transformer architecture and had a gigantic impact on the world of AI when they were first published. So in this video, let's see what BERT is, how it works, and how you can use it too.

This video is brought to you by AssemblyAI, a company that is making a state-of-the-art speech-to-text API. If you want to use this API for free, you can go ahead and get your free API token using the link in the description.

Okay, but what is BERT? BERT is basically a language

### [0:32](https://www.youtube.com/watch?v=6ahxPTLZxU8&t=32s) BERT language model

model that has the capacity to learn specific language tasks. It can do this because it understands language: it has an understanding of how words relate to each other, and that way, if you fine-tune it, you can have it perform well on specific types of language tasks.

As you might have heard elsewhere, you basically train these models in two separate stages. The first is the actual training of the language model, which we call pre-training, and the second is fine-tuning: training the model for a specific task. These tasks could be question answering, sentiment analysis, text classification, or named entity recognition. Because these models understand language, they are able to perform a lot of different tasks that have to do with language.

Pre-training a model like BERT takes a very long time and needs a lot of data. That's why it has been pre-trained for us: if you want to use BERT right now, all you need to do is fine-tune it for the task you need.

Okay, but what does BERT look like? Let's look at its architecture. If you remember Transformers (and if you don't know anything about Transformers, you can go watch our Transformers video; I will leave a link for you), we have encoders and decoders: the encoder learns the context of the language, and the decoder performs the specific task we need done. And if you also remember GPT-3, which we made a video about (I will again leave the link somewhere here so you can go watch it if you haven't yet), with GPT-3 what we did was stack only decoders on top of each other; that was the main idea behind GPT-3's architecture. Well, with BERT, what we do is stack encoders on top of each other, so BERT basically consists only of encoders.
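The encoder-stacking idea can be sketched in plain Python. Below is a toy, single-head self-attention "encoder layer" applied twelve times, since BERT-base stacks 12 real encoder layers. Everything here is a simplified illustration with made-up dimensions, not the actual BERT implementation (which uses multi-head attention, feed-forward sublayers, and layer normalization):

```python
import math
import random

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(seq):
    # Toy single-head self-attention: every position attends to every position.
    d = len(seq[0])
    out = []
    for query in seq:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in seq]
        weights = softmax(scores)
        out.append([sum(w * value[i] for w, value in zip(weights, seq)) for i in range(d)])
    return out

def encoder_layer(seq):
    # Attention plus a residual connection (feed-forward and LayerNorm omitted).
    attended = self_attention(seq)
    return [[x + a for x, a in zip(tok, att)] for tok, att in zip(seq, attended)]

def encoder_stack(seq, n_layers=12):  # BERT-base stacks 12 encoder layers
    for _ in range(n_layers):
        seq = encoder_layer(seq)
    return seq

rng = random.Random(0)
seq = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(3)]  # 3 tokens, dim 4
contextualized = encoder_stack(seq)
# The stack preserves shape: a sequence of vectors in, a sequence of vectors out.
```

The point of the sketch is that an encoder stack maps a sequence of token vectors to an equally long sequence of context-aware vectors; there is no decoder producing anything else.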

### [2:26](https://www.youtube.com/watch?v=6ahxPTLZxU8&t=146s) Transformers

The context is learned in the encoder section of the architecture, and then this context is passed to the decoder to help it complete the task it was trained on, which most of the time was translation. BERT is just these encoders stacked on top of each other, so at the end you have a model that learns the context of language, and that's it. That's why we call it a language model: a model that has a good understanding of language and how it works.

Of course, in terms of architecture, BERT is not only a stack of encoders. It needs a way for the inputs to be embedded, and it needs a way for the output to make sense. So you might have different types of layers after the output of your encoders, and different types of inputs based on what you're feeding into the model. The output side will change based on what you're training BERT on, or, later, what you're fine-tuning it on. For the input layer, three pieces of information are embedded into the inputs.

The first is positional encoding. If you remember from our Transformers video, we give all the input words of a sentence to the Transformer at the same time, which makes it hard to know where a word belongs in the sentence. That's why you use positional encodings: to pass that location information to the Transformer.

The second is segment (or sentence) embeddings. These do basically the same thing as positional encodings, except that they distinguish the first sentence from the second. When you're training BERT on tasks like question answering or next sentence prediction, you might give the model more than one sentence, so you need a way to mark which tokens belong to the first sentence and which to the second.

And lastly, we have token embeddings.
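The three embedding types are combined by elementwise summation to form BERT's input representation. Here is a toy sketch with a made-up five-word vocabulary and tiny dimensions; real BERT uses learned 768-dimensional vectors, and the specific numbers below are invented purely for illustration:

```python
# Toy illustration: BERT's input is token + segment + position embeddings, summed.
HIDDEN = 4

token_emb = {  # made-up vectors; real BERT learns these during pre-training
    "[CLS]": [0.1, 0.0, 0.0, 0.0],
    "my":    [0.0, 0.2, 0.0, 0.0],
    "dog":   [0.0, 0.0, 0.3, 0.0],
    "[SEP]": [0.0, 0.0, 0.0, 0.4],
    "barks": [0.5, 0.0, 0.0, 0.0],
}
segment_emb = [[0.01] * HIDDEN, [0.02] * HIDDEN]          # sentence A vs. sentence B
position_emb = [[0.001 * p] * HIDDEN for p in range(16)]  # one vector per position

def embed(tokens, segment_ids):
    # Sum the three embeddings elementwise for every position in the input.
    return [
        [t + s + p for t, s, p in zip(token_emb[tok], segment_emb[seg], position_emb[pos])]
        for pos, (tok, seg) in enumerate(zip(tokens, segment_ids))
    ]

tokens = ["[CLS]", "my", "dog", "[SEP]", "barks", "[SEP]"]
segment_ids = [0, 0, 0, 0, 1, 1]  # marks first sentence vs. second sentence
inputs = embed(tokens, segment_ids)
```

Because the three components are simply added, every input vector carries its word identity, its sentence membership, and its position all at once.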
Token embeddings are basically the numerical representation of each single word.

Okay, now we understand what the architecture looks like. But how do you train something that just understands the core of language? You're not trying to train it on one single task, right? This is not a specialized model; it is supposed to understand language in general. So what do you do to train this model, to teach it how language works? Normally, what we did before was a next word prediction task: you take a sentence and train the model to try to predict the next word. We saw this in Transformers. But what they did to train BERT was to train it on two different tasks. The first one is called masked language modeling, and the second one

### [5:12](https://www.youtube.com/watch?v=6ahxPTLZxU8&t=312s) Masked Language Modeling

is called next sentence prediction.

With masked language modeling, what we do is take a sentence and mask 15% of the words in it, basically leaving them blank, and the goal of the model is to predict what needs to go into those blanks.

With next sentence prediction, you give your model two sentences that either do or do not come one after another, and it is the model's goal to tell you whether they belong together.

By training BERT on these two tasks, researchers were able to get a really well-performing language model. But of course, you might not want to use BERT only for next sentence prediction or masked language modeling; you might want to use it for other tasks. For that, you need to do fine-tuning. To fine-tune a
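The data preparation for these two pre-training tasks can be sketched in a few lines. This is a simplified version: the published BERT recipe additionally replaces some of the chosen tokens with random words or leaves them unchanged (the 80/10/10 split), which is omitted here:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, rng=random):
    """Masked language modeling data prep: hide ~15% of the tokens."""
    masked = list(tokens)
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    targets = {}  # position -> original token the model must recover
    for pos in rng.sample(range(len(tokens)), n_to_mask):
        targets[pos] = tokens[pos]
        masked[pos] = "[MASK]"
    return masked, targets

def make_nsp_pair(sentences, idx, rng=random):
    """Next sentence prediction data prep: 50% true next sentence, 50% random."""
    if rng.random() < 0.5:
        return sentences[idx], sentences[idx + 1], True   # genuine continuation
    return sentences[idx], rng.choice(sentences), False   # randomly drawn sentence

rng = random.Random(0)
tokens = "the quick brown fox jumps over the lazy dog again".split()
masked, targets = mask_tokens(tokens, rng=rng)  # 2 of the 10 tokens become [MASK]
```

The model never sees the `targets` dictionary directly; it is only used to score the model's guesses for the blanked-out positions, and the True/False label from `make_nsp_pair` is what the next-sentence-prediction head is trained to predict.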

### [6:02](https://www.youtube.com/watch?v=6ahxPTLZxU8&t=362s) Fine-tuning

BERT model, you basically need two things: first, a new output layer that you're going to plug onto the end of BERT, specific to the task that you're trying to perform; and second, a dataset, again specific to the task that you're trying to achieve.

For example, if you want to do sentiment analysis, you basically need to plug in an output layer of neurons after BERT that will classify the input it gets (which is the output of BERT) into the different sentiment labels you want. You can think of it like designing a normal deep neural network: you design what the output needs to be based on the task. It's the same logic when you're doing it with BERT, too. Or, if you want to do named entity recognition, you can feed the output tokens that come from BERT for each word into a classification layer that will classify them into different named entity labels.

When you're doing fine-tuning, what is mainly being updated are the parameters of this new output layer that you just plugged onto BERT. There are still some parameters being updated inside BERT too, but those are really minor updates; that's why what you get at the end is a very fast fine-tuning process.

The Google researchers who made BERT very generously shared the source code with the public, so you can find the link to their repository in the description below too. A couple of things you need to know about this library in general, or any other BERT model you might find online (for example, in the Hugging Face library): there are different sizes of BERT, and there are different languages of BERT. You can choose to work with it in Spanish, Chinese, English, whatever you want. You also get a choice between the BERT-base model, which has 110 million parameters, and the large model, which generally works a little bit better (of course, bigger is better, right?) but has 340 million parameters. You have to remember, though, that these are already trained parameters: you do not have to train this from scratch. So if your computer can handle it, you can go with the large BERT model and see how that works for you.

I hope this video was helpful for understanding what BERT is, how it was trained, how it works under the hood, its architecture, and how you can get started with it. I will leave some helpful links in the description below so you can start playing around with BERT if you want to. Thanks for watching! If you liked this video, don't forget to give us a like, and maybe even subscribe to be one of the first people to know when we publish a new video. Before you go, don't forget to grab your free API token from AssemblyAI using the link in the description. For now, have a nice day, and I will see you in the next video.
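The "110 million parameters" figure for BERT-base can be roughly reproduced from its published configuration (12 encoder layers, hidden size 768, feed-forward size 3072, a 30,522-token WordPiece vocabulary, 512 positions, 2 segments). The back-of-the-envelope count below uses my own bookkeeping for the bias and LayerNorm terms, so treat it as an approximation rather than an official accounting:

```python
# Back-of-the-envelope parameter count for BERT-base.
H, FF, LAYERS = 768, 3072, 12
VOCAB, POSITIONS, SEGMENTS = 30522, 512, 2

# Input embeddings: token + position + segment tables, plus one LayerNorm (gain + bias).
embeddings = (VOCAB + POSITIONS + SEGMENTS) * H + 2 * H

# Each encoder layer: Q, K, V, and output projections (weights + biases),
# the feed-forward up- and down-projections, and two LayerNorms.
attention = 4 * (H * H + H)
feed_forward = (H * FF + FF) + (FF * H + H)
layer_norms = 2 * 2 * H
per_layer = attention + feed_forward + layer_norms

# Pooler: one dense layer applied to the [CLS] token's output.
pooler = H * H + H

total = embeddings + LAYERS * per_layer + pooler
print(f"{total / 1e6:.1f}M parameters")  # ≈ 109.5M, i.e. the quoted "110 million"
```

Doubling the depth to 24 layers and widening the hidden size to 1024 (with 4096 feed-forward units), as BERT-large does, pushes the same arithmetic to roughly 340 million parameters.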

---
*Source: https://ekstraktznaniy.ru/video/13251*