Feedback Transformers: Addressing Some Limitations of Transformers with Feedback Memory (Explained)
Duration: 43:51


Yannic Kilcher · 02.02.2021 · 16,011 views · 538 likes


Video description
#ai #science #transformers

Autoregressive Transformers have taken over the world of language modeling (GPT-3). However, in order to train them, people use causal masking and sample parallelism, which means computation only happens in a feedforward manner. As a result, higher-layer information that would already be available goes unused in the lower layers of subsequent tokens, which costs the overall model computational capability. Feedback Transformers trade off training speed for access to these representations, and demonstrate remarkable improvements on complex reasoning and long-range dependency tasks.

OUTLINE:
0:00 - Intro & Overview
1:55 - Problems of Autoregressive Processing
3:30 - Information Flow in Recurrent Neural Networks
7:15 - Information Flow in Transformers
9:10 - Solving Complex Computations with Neural Networks
16:45 - Causal Masking in Transformers
19:00 - Missing Higher Layer Information Flow
26:10 - Feedback Transformer Architecture
30:00 - Connection to Attention-RNNs
36:00 - Formal Definition
37:05 - Experimental Results
43:10 - Conclusion & Comments

Paper: https://arxiv.org/abs/2002.09402
My video on Attention: https://youtu.be/iDulhoQ2pro

ERRATA: Sometimes I say "Switch Transformer" instead of "Feedback Transformer". Forgive me :)

Abstract: Transformers have been successfully applied to sequential, auto-regressive tasks despite being feedforward networks. Unlike recurrent neural networks, Transformers use attention to capture temporal relations while processing input tokens in parallel. While this parallelization makes them computationally efficient, it restricts the model from fully exploiting the sequential nature of the input. The representation at a given layer can only access representations from lower layers, rather than the higher level representations already available.
In this work, we propose the Feedback Transformer architecture that exposes all previous representations to all future representations, meaning the lowest representation of the current timestep is formed from the highest-level abstract representation of the past. We demonstrate on a variety of benchmarks in language modeling, machine translation, and reinforcement learning that the increased representation capacity can create small, shallow models with much stronger performance than comparable Transformers.

Authors: Angela Fan, Thibaut Lavril, Edouard Grave, Armand Joulin, Sainbayar Sukhbaatar

Links:
TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher
LinkedIn: https://www.linkedin.com/in/yannic-kilcher-488534136/
BiliBili: https://space.bilibili.com/1824646584

If you want to support me, the best thing to do is to share out the content :) If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Table of contents (12 segments)

Intro & Overview

Hi there. Today we're looking at Addressing Some Limitations of Transformers with Feedback Memory, also known as Feedback Transformers, by Angela Fan, Thibaut Lavril, Edouard Grave, Armand Joulin and Sainbayar Sukhbaatar of Facebook AI Research and LORIA. On a high level, this paper, as it says in the title, addresses some limitations of transformers, specifically of decoding transformers that are trained with causal masking. The problem is that these transformers don't make use of all of the information they compute, even though they technically could; they sacrifice it in order to train in parallel, and we'll see what that means. To alleviate this, the paper introduces feedback memories, and thereby arrives at a model called the Feedback Transformer that takes into account all of the available information. Now, this new model can't train as fast, because it can't be trained in parallel like the old model. However, with this technique you can build models that are significantly more shallow, so fewer layers, and the models will also remember things for longer. This is especially helpful when multiple steps of reasoning are required over a fairly long sequence, so we're going to see some tasks from reinforcement learning and other sequence tasks where these feedback memories really make a difference. In any case, if you like content like this, don't hesitate to share it out and tell all your friends about it, that would be awesome. All right.

Problems of Autoregressive Processing

So, what's the deal with transformers, what are they doing wrong? As I already said, we are specifically in the case of this sort of decoder-only transformer. These graphics here are a bit confusing at first sight; I found I had to dig into the paper and read it, it was not necessarily clear from these diagrams, so I'm going to try to build up what's wrong. What we're trying to do is something like language modeling. It's not only language modeling, but in any case we have a sequence of inputs, which I'm just going to represent as circles, and we want to predict whatever the next circle is. These could be actions to be performed in a reinforcement-learning world, or these could be words of a sentence right up to here, and then you are supposed to predict the next word; that's called a language model. Many things fall into this category; for example, GPT-3 is trained in exactly this way. In order to do this, you have to have a model that somehow takes all of these things, builds a representation, and then outputs this thing right here. Okay, that's good in itself. So how did we usually do it? The first attempts at this, of course,

Information Flow in Recurrent Neural Networks

were recurrent neural networks, and I'm going to go over them here because they're going to be important, even though you probably already know what they are. All of the models we're going to look at today build representations of this input data, which I'm going to represent with little boxes: they build these latent representations. The data in a recurrent neural network flows like this: each input goes up into a hidden representation, computed by a neural network layer, and then the hidden representations are transformed into each other over time. So the first input comes in, it is forward-propagated to the next time step, at which point the next input is consumed and merged with the previous hidden state, and that is propagated forward into the next time step, and so on. At the end, you take this representation and output whatever the next label is, and I'm purposefully drawing this up here to say the data flow is something like this. There have been improved versions of RNNs that do multiple layers of this, so the next layer would be here; this is a multi-layer RNN. This could be an LSTM, this could be a plain RNN, and so on. They do the same thing, but each hidden representation also feeds into the hidden representation of the layer above, and these hidden representations are also connected with a recurrent connection over time, building sort of a grid. The output of the top-right box goes into predicting the next token or action or whatnot, because, as you can maybe see, all the information flows up and to the right. That is what an RNN does. Now, this is very well connected information. However, think about it in terms of information flow: imagine, for example, that these two tokens need to communicate to solve the task. This could be a name, Frank, and this could be a word referring to Frank, like "he". In order to know who "he" is, these two tokens somehow need to communicate; I hope that's sort of clear. Here they can communicate by transferring information from step to step, like over here, and then in this hidden representation the information can be combined. But you can see that the number of steps the information has to travel is fairly large. It can also be combined here, if the information first flows up one layer and then over, and so on. This is the drawback of recurrent neural networks: very often the information has to flow along many steps of computation before it can be combined with something else.
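The step-by-step flow described above can be sketched in a few lines of Python. This is a toy illustration with scalar states and a fixed linear cell (my own stand-in, not the paper's model): notice how the first token's contribution to the lowest layer halves at every time step, so distant tokens only communicate through a long chain of updates.

```python
# Toy multi-layer RNN: information travels one time step (and one layer)
# per update, so two distant tokens need many steps to interact.

def rnn_step(h_prev, x, w_h=0.5, w_x=0.5):
    """One hidden-state update: merge previous state with current input."""
    return w_h * h_prev + w_x * x  # a linear stand-in for the RNN cell

def run_rnn(tokens, num_layers=2):
    """Propagate a sequence through stacked recurrent layers."""
    states = [0.0] * num_layers        # one hidden state per layer
    history = []
    for x in tokens:
        inp = x
        for layer in range(num_layers):
            states[layer] = rnn_step(states[layer], inp)
            inp = states[layer]        # layer l feeds layer l+1
        history.append(list(states))
    return history

# a "signal" on the first token, zeros afterwards
history = run_rnn([1.0, 0.0, 0.0, 0.0])
# history[t][l] is the layer-l state at time t; the first token's
# influence on the lowest layer halves at every step: 0.5, 0.25, ...
```

The decay here is an artifact of the fixed weights, but the structural point stands: there is no direct connection between far-apart positions.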

Information Flow in Transformers

A different approach is the transformer. A transformer handles sequences in a different enough way: whenever it builds the representation for the next layer, for example this representation right here, it aggregates all of the information from the previous layer. Every one of these representations, also this one, aggregates all the information from the previous layer; let me draw this in blue. That's a lot better, because now every node can communicate with every other node in a single computation step, not in as many computation steps as the two nodes are apart. You need to help transformers a bit with positional encodings, but in essence this is a more powerful way of processing sequences, and you can do it in many layers: the next layer will have access to even more, so this representation will draw information from all of the previous representations. This happens by means of an attention mechanism; if you don't know what an attention mechanism is, watch my video on Attention Is All You Need, where I explain how this works. Suffice to say, the information is aggregated over the whole sequence, layer by layer. There is a fundamental reason why this is important.
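The "every node reads every other node in one step" idea can be sketched as a single dot-product attention step. This is a minimal scalar sketch (my own illustration, not the paper's notation): a high key score lets position 3 pull position 0's value directly, with no intermediate hops.

```python
# Single attention step over scalar keys/values: one weighted average
# connects a query position to every other position at once.
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(query, keys, values):
    """Dot-product attention: score each key, average the values."""
    weights = softmax([query * k for k in keys])
    return sum(w * v for w, v in zip(weights, values))

# the querying position reads position 0's value (5.0) in one step,
# because position 0 has the only non-zero key
out = attention(query=1.0, keys=[2.0, 0.0, 0.0, 0.0], values=[5.0, 1.0, 1.0, 1.0])
```

Contrast this with the RNN above, where the same interaction would need as many update steps as the positions are apart.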

Solving Complex Computations with Neural Networks

Namely, we sometimes want to do very complex computations. You can look at an example in the appendix, where they give this example of code interpretation. The model is given a piece of code as text and simply has to go over it and decide what the output is. You can see it has print statements, and the model needs to decide what the output of the entire program is. It has if statements, so conditional statements; variables that are set, but also incremented and decremented, printed, then updated again; and conditions on the variables, for example a condition between two variables z and x. So this is quite complex for a model to solve. If you let an RNN do this task: a plain RNN has these inputs and one vector as the hidden state, so everything needs to be saved in the space of this one vector, and the longer the sequence, the more noise you introduce, and so on. If the relevant pieces are very far apart, as here, where you need to keep track of the states of all these variables, RNNs tend to do worse the longer the task. Transformers, not so much: a transformer that ingests this token can look at any other token in a single step. However, on this task transformers also reach their limits, because, as I said, in order to do complex computation you need multiple layers. A single transformer layer, in fact a single neural network layer, can only do linear operations. It has a non-linearity at the end, but everything is connected with everything in a neural network layer: these are neurons, and this here is a giant weight matrix W; something like this can also be the attention matrix. At the heart of every neural network layer there is a linear operation, and a linear operation can only do so much: notably, it can't solve things like the XOR problem, and it can't do if-conditions or keep track of and update variables. Let's break this down. Say we have this text: x = 1, then x plus something, then if x greater than 3 then x minus something. A one-layer transformer will be able to look at all of these at the same time, but it will not be able to look at them in sequence; it cannot have a dependence between them. It cannot say: because I incremented here, this is greater than three, and then this happened (actually it's not greater than three, so this didn't happen). It cannot do that reasoning; it can only look at each of these lines individually and then integrate them in a linear fashion. It could integrate the plus by saying: whatever x is, I need one more. It could integrate x = 1, and the two together would maybe give you the result that x is 2. But the if-condition and so on, it cannot do in one layer; for that you need multiple layers with non-linearities. By having multiple layers, a transformer could technically have four nodes right here: the first node might combine these two, representing x = 2; this node could represent the if-condition x greater than 3 and (I'm just imagining) point to the node fulfilling the condition; and this node could point to the x-minus part. Now I have a simpler program, you see: after one layer I have a simpler program, simply by linearly combining things. In the next layer I can combine these: this one tells me x = 2, this one is x greater than 3, which I can now evaluate given those two, and that might result in a weight of zero, because x is in fact not greater than three. I could save that weight of zero right here, so this node now represents zero, this node still represents x = 2, this pointer evaluates maybe to minus one, and this node (I'm just making stuff up here) could represent the connection between these two. Then in the next layer I can finally do my aggregation: this and this get combined, this is zero because it's minus one times zero, plus the two right here, and I get my final answer, x = 2. That is not exactly how it happens, but you can see that if your only method is linearly combining things layer by layer, you have to go quite a convoluted way to achieve multi-step reasoning, and you can only do it with non-linearities involved. One step of reasoning is roughly one layer with a non-linearity, and thereby the number of reasoning steps is limited by the depth of the transformer: the number of reasoning steps, incrementing or decrementing a variable, is directly linked to how many layers you have. That is a drawback, and that drawback can be solved with these memories. So let's look at how a decoding-only transformer is trained.
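As a concrete miniature of the kind of program described above (this exact snippet is my own made-up instance, not taken from the paper's appendix), predicting the final value requires several dependent reasoning steps, which is exactly what a single linear layer cannot chain together:

```python
# A tiny instance of the code-interpretation task: to predict `result`,
# a model must track state across an assignment, an increment, and a
# conditional -- three dependent steps, not one linear combination.
x = 1
x += 1              # after this line, x == 2
if x > 3:           # condition is False here, so the branch is skipped
    x -= 1
result = x          # the model would have to predict this value: 2
```

Each line is trivial on its own; the difficulty is that the effect of the `if` depends on the outcome of the increment, which depends on the assignment.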

Causal Masking in Transformers

Again: the transformer can include things from anywhere, but what people usually do is causal masking, because we want to predict the next thing at every position. So here we have a sentence, and we make samples of it: if I input those two tokens, I want to predict this one; if I input those three, I want to predict this one; if I input those four, I want to predict this one. I can do all of this in one pass if I set up my information flow so that tokens only have access to whatever is behind them; these are the decoding-only transformers. So if you think of this token right here: in order to predict it, we only have access to what came before it, like when you write a book and write the next word, you've only written the words in front of it. We just say the representation here cannot draw information from over here, that's forbidden; we let it draw information only from its own node and from everything to the left of it (sometimes, depending on how it's set up, only from the left). The same goes for this one, and this one, and this one can draw information from here and from here. So the property of long-range information flow is still here, by means of connections like this one or this one. However, we simply cannot draw any information from the right.
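The "only look behind you" rule is usually implemented as a causal attention mask. A minimal sketch (a plain boolean matrix; real implementations typically add a large negative value to masked attention scores instead):

```python
# Causal mask: position i may attend to position j only if j <= i.
def causal_mask(seq_len):
    """mask[i][j] is True iff token i is allowed to attend to token j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

mask = causal_mask(4)
# token 0 sees only itself; token 3 sees everything up to and
# including itself; nothing to the right is ever visible
```

This mask is what makes the parallel training trick work: all positions can be trained at once because each one provably never reads a future token.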

Missing Higher Layer Information Flow

All right. Also notice how this information flows, and the difference between a recurrent network and this one: in a recurrent network there are lateral connections within a layer; here there are none, but instead there are these long-range connections from the previous layers. What's missing in both of them are connections such as the following (do I have another color? black, okay). If you look at this node, it can draw from here, and with the recurrent connection it can maybe also draw from these, but technically it should also be able to draw from this one, because by the time I reach the prediction of the next token from here, I can certainly have computed this representation up here; nothing stops me from building in a connection like this one. That's exactly what these feedback transformers criticize in the old-style transformers: they only go feed-forward, meaning they only go up the layers; they don't even have lateral connections like recurrent networks, only forward connections within the layers, and that limits the number of computation steps you can do. In contrast, with the feedback memory, information can flow all the way up and then down again. Let's actually look at their diagram; maybe it's not as confusing anymore (actually it's still confusing, because we still need to introduce the memory). I'm just going to draw two layers right here, so information can flow like this. The first step is the same: we have nothing here to look at, we can only draw information from the left, so that's all we can do. In the second step, say we've computed the first step and actually output a token like this one; we continue, because we're auto-regressive, we always input whatever we output. What we can now do is draw from this and this, which is what this representation could draw from in a normal transformer; but now we could technically also draw information from here, because we've already computed these things in the last step. The reason transformers usually don't do this is that you then cannot parallelize training. In a setting like we've seen before, you can train this whole sequence in parallel: if I have five tokens, I can make five samples out of that and train them in parallel. That's no longer possible here, because in order to have access to this information, I must already have computed the full forward pass for the first sample. So that's the drawback. However, it might be valuable to have that highest-layer information, especially since that is the one that predicted the next token; a lot of information about that token is probably in that highest-level representation, whereas with the previous transformer we could only draw information from down here. So we have access to higher-layer representations of the past, and that means the information can actually flow all the way to the end, and then back again; every time, we have access to the highest layers of representation. If we look at this node, we could draw from all of the representations we've previously computed: we could look at what this token was (that's what a normal transformer could look at as well), but we could also look at what the first token's last layer computed, which is probably very informative. So now you can see that the reasoning depth is sort of unbounded: before, even though I had maybe five tokens, I could only do two steps of reasoning across them, because one step of reasoning is one layer. I could learn to save a variable here and learn to increment it right here, but I couldn't do more. Here, I can learn a function for saving a variable, incrementing it, and so on, and do all of this processing with the variable. Then, when the next token comes around, maybe an increment, I can look at the end right here, at the representation of the saved variable, increment it, and store it in this representation; the next layer can come around, look at this representation, and say: oh, you've incremented it after you saved it, so this is the current state, and then go ahead and modulate it as well, say apply an if-condition. The next token can look at that if-condition and at the value of the variable, so it has two layers of compute just to implement that if-condition on the current value of the variable, whereas the old transformer would have to start from scratch every time: okay, here's how the variable starts, here's where it's incremented, here I'm going to do an if-condition. This transformer does the computation and can then store information in these higher-layer representations, where all the next steps can look at it. Now, if you look at the light blue thing, that's a lot of arrows; this amount of attention

Feedback Transformer Architecture

connections would pretty much explode any system, and that's why this paper simplifies it, and here is where another trade-off comes in. So: you can't train it as fast, that's number one. Number two, they say: we're not going to let you look at all of these hidden representations (every square here is a hidden representation). Instead, for each token, after the information has passed and we've computed its hidden representations, we're going to mash them together: we take the two of them, and maybe also the token embedding, and build one so-called memory representation of that token. All of this is now incorporated in this memory representation, and the next layers, instead of looking at the individual representations, all look at this memory representation. First of all, that saves space, it saves memory; and second of all, you can share the key and value computation of the attention mechanism, with only the query representation differing between the layers: that's query number two, that's query number one. Once you have those, you also build a memory from the second token, and then the third token can look at both the memory of the second token and of the first token. So you still have that transformer long-range information flow, but now you have a summary, these memory blocks, per token, and that's exactly what we see in the diagram. And that's already the model: the Feedback Transformer is a transformer that forward-propagates not in parallel but token by token. It forward-propagates, builds this memory, and then all the next tokens, instead of paying attention to things in their own layer like so, pay attention to previous memories (again, the arrows should go in this direction). So that is a feedback transformer: it retains the long-range information flow, but the information doesn't flow from same-layer representations; the information actually flows from memory, and the memory is a weighted sum of all of the representations of a given token, including higher layers like this one. So information can flow from higher layers earlier in the sequence to lower layers later in the sequence, and that allows each sequence element to do as many reasoning steps as there are layers, whereas in a normal transformer the entire sequence only had that many reasoning steps. Here, reasoning steps are per token, whereas previously reasoning steps were per sequence, and that is of course more powerful. That is pretty much the model.
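The token-by-token loop with a per-token memory can be sketched as follows. This is a deliberately simplified scalar sketch of the idea (my own stand-ins: a fixed linear mix replaces real attention and feed-forward blocks, and the learned layer weights are just softmax inputs), not the paper's actual architecture:

```python
# Sketch of the feedback-memory idea: after computing all layers for a
# token, collapse them into ONE memory via a learned softmax-weighted
# sum; every layer of later tokens attends to memories only.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def build_memory(layer_states, layer_weights):
    """Weighted sum over all layer representations of one token."""
    ws = softmax(layer_weights)
    return sum(w * h for w, h in zip(ws, layer_states))

def feedback_forward(tokens, num_layers=3, layer_weights=(0.0, 0.0, 0.0)):
    """Process tokens strictly one at a time; each layer of the current
    token reads the shared memories of all previous tokens."""
    memories = []
    for x in tokens:
        h = x
        layer_states = []
        for _ in range(num_layers):
            # stand-in for attention over memories + feed-forward:
            context = sum(memories) / len(memories) if memories else 0.0
            h = 0.5 * h + 0.5 * context
            layer_states.append(h)
        # collapse ALL layers (including the topmost) into one memory
        memories.append(build_memory(layer_states, list(layer_weights)))
    return memories

mems = feedback_forward([1.0, 0.0, 0.0])
# the second token's LOWEST layer already sees a memory that contains
# the first token's HIGHEST layer -- the key property of the model
```

Note the structural point the sketch preserves: the loop over tokens is outermost, so training cannot be parallelized across positions, but memory built from the top layer of token t is visible to the bottom layer of token t+1.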

Connection to Attention-RNNs

Now, okay, I have one thing to remark here. They consider the RNN right here on the right, and how it's different: you can clearly see that in the RNN, information needs to travel many steps to arrive somewhere. That has been the drawback of RNNs, but people have solved this in RNNs using, you guessed it, attention. In fact, attention mechanisms were first introduced to help RNNs overcome exactly this problem. An RNN with an attention mechanism looks like something you're very familiar with: we build these hidden representations (let's just consider a one-layer RNN for now), it goes like this, and there are these recurrent connections right here; that's an RNN. But if we help this with an attention mechanism, we say: whenever you compute, for example, this representation, you're allowed not only to use this connection, but also to look back at the previous hidden representations and aggregate information using an attention mechanism. That's where attention mechanisms originally came from in this domain. And if I look at this feedback transformer model, I very much just see a bit of an elaborate RNN. If you tilt this graphic, you will see it; we can do this together. I'm going to draw three things down here, but instead of going up with the squares, I'm going next to each other: three squares representing the three layers. Before, they were oriented upward; now I've tilted them to the right. With the way the memory is built, the information flows like this and like this (I'll fill in the other connections shortly), and the memory is built from those three, like this. Now, when you compute this node, for example, what you're allowed to do is look back at the memories, so you have connections like this (I keep drawing these arrows the wrong way around): this one attends to the memories of the previous layer. And if you see this as a recurrent neural network, you are exactly right. I don't exactly know what else to say: this is an RNN with an attention mechanism. It's just that, in the construction of the things you can attend to, people usually just took the hidden state of the RNN cell as the thing to attend to; here, I guess, you also drop the recurrent connection, because you can only attend to the memories. So there is no recurrent connection, but there is a connection like this to the things here. Okay, it's a bit convoluted; it's like halfway between an RNN and a transformer, because you don't strictly have the recurrent connection, so you don't have anything like right here, but you do have, for example, this connection to all three things down here. If you view this part as a kind of RNN cell, then this is an RNN with an attention mechanism, or something extremely similar. And the attention mechanisms in RNNs actually did solve this long-computation problem; that was exactly why they were introduced. At some point, people realized: wait, we don't actually need the recurrent connections, and that's how you end up with transformers. So this here is sort of a hybrid between the two. If you want to go further, you could actually think of making multiple layers of these memory representations, and then you're sort of back at the same problem you started with; you kind of recurse into the problem. But I don't want to go into that.

Formal Definition

So, you can see here: instead of the next-layer representation attending to all of its left neighbors in the previous layer, you have the same thing attending to all the previous memories, and a memory is built as a weighted sum over all the layers. The most important thing for their model is this expression right here: you can see that the sum now goes over all the layers, even the layers above the layer we are currently computing; it's just that it's from previous time steps. They also explain how you can, as I said, share the keys and the values. That's not necessarily important, but it's something you can do with this model that you couldn't do before, because before, not all the layers were attending to the same memory.
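The weighted sum he points at can be written compactly. The notation here is mine, reconstructed from the description in this section (a sketch, not copied from the paper): let $h_t^l$ be the layer-$l$ representation of token $t$, $L$ the number of layers, and $w \in \mathbb{R}^{L+1}$ a vector of learned scalars.

```latex
% Memory of token t: a learned softmax-weighted sum over ALL of its
% layer representations, including the topmost one.
m_t = \sum_{l=0}^{L} \bigl[\operatorname{softmax}(w)\bigr]_l \, h_t^l
```

Attention at any layer of a later token $t' > t$ then attends over the single shared sequence of memories $m_1, \dots, m_{t'-1}$ rather than over per-layer states, which is what makes sharing keys and values across layers possible.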

Experimental Results

layers were attending to the same memory now you can do that so they demonstrate this on tasks such as language modeling where you can see blue here is the classic transformers and these are different sizes so to the right you kind of go shallower in the transformer and you can see as you go shallower so as you have less layers the decoding speed increases for both of these models however the transformer model the classic model it sinks in performance a lot more than the feedback transformer thanks to those feedback connections however you know here you can see and i would bet maybe if you go to the left here that the classic transformer would beat the feedback transformer simply because the feedback transformer isn't a generalization so it also needs to do this trade-off so it trades off speed down here and also it trades of sort of mixing that memory they have a very interesting by the way this is reinforcement learning uh where you need to remember things for quite long and that is also a domain where they excel at so here they actually look at the different kinds of memory and these are a bit deceptive down here i think to have the whole impression you need to do this over multiple time steps and actually uh kind of see how they develop and then you can see more clearly but you can see that their performance so this here is the feedback transformer and this here is kind of the original transformer where you can see it only goes up the layers they see here that if you introduce recurrent connections that helps a little bit but not too much because the only thing you gain basically is this lateral connection here that you didn't have before however if you do top only meaning that you can attend to the previous time step only to the topmost representation so whereas before you could attend only to things below you or at the same height as you now you can only attend to the topmost so information like flows like this and then can flow down again and then flows up 
again. If you do that, you get almost all of the performance of the Feedback Transformer; I hope you see this, lower is better here. And this is without the memory, actually; this is the full generalization I talked about, and you get almost all the way there by doing top-only attention. So their reasoning, the fact that regular Transformers don't have access to these higher-layer representations in the next steps of computation, I think that's really valid. You know, experiments on reinforcement learning in grid worlds are fun, and I don't necessarily believe all experiments in papers, but this is a finding that does strike me as quite fundamental, and it validates their claims.

They have other experiments where they vary this sort of top-only attention: it's not just the top, they choose a single layer whose representation the next tokens can attend to. If they say you can only attend to layer one of the previous tokens, you do get pretty bad performance, well, worse performance, and as you go up the layers you get better and better. Then there is "average all", which is almost what they do; the Feedback Transformer is a learned average, a weighted sum where the weights are learned. In fact, in the last row here they do almost get there. I don't know, that could be experimental noise; I totally believe that you can gain a little bit by doing this learned feedback aggregation, but you can see that if you are only allowed to attend to layers five and six, you are already doing fairly well. And this is a summarization task, so a language task, not a constructed task like their RL tasks, and that is fairly convincing, I would say.

The trade-offs are evident. They have a table somewhere
where, in training, they are much slower. At inference, however, they can actually speed up quite a bit, because they share a lot of the weights among layers that other models don't. Here you can see, for example in language modeling, that the original Transformer has a much higher training speed, this is, I think, tokens per second, than the Feedback Transformer. But the Feedback Transformer is much faster at inference, because at inference both models need to go token by token anyway, since they are autoregressive, whereas at training time the original Transformer can process the sequence in parallel, while the Feedback Transformer again has to go token by token, because it always has to compute all the layers for one token before it can move on to the next token.

They have some more experiments showing that as you decrease the memory, so if you sort of constrain these models, the Feedback Transformer performs much better than the original Transformer. They also compare to LSTMs, I believe, on these kinds of sequence tasks that you come up with to probe the properties of your model.

So, does this mean we can replace

Conclusion & Comments

transformers? Probably not. If you can afford to build a large enough Transformer, that will probably still outperform the Feedback Transformer, and it will train faster, which can be quite important. However, if you have very special tasks where you need long-range dependencies or really multiple steps of non-linear reasoning, or if you are constrained in your resources but do actually have the time to train it as a trade-off, then the Feedback Transformer might be something for you.

All right, that was it from me. Thanks for listening, share it out, and I'll see you next time. Bye!
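As a small aside, the learned average discussed above (the Feedback Transformer's memory for a token is a softmax-weighted sum of that token's representations across all layers) can be sketched in a few lines. This is a toy NumPy illustration, not the paper's implementation; the function name and shapes are my own assumptions.

```python
import numpy as np

def feedback_memory(layer_states, w):
    # layer_states: (num_layers + 1, d_model) -- the token's representation
    # at the embedding and after each layer
    # w: learnable logits, one per layer, shape (num_layers + 1,)
    weights = np.exp(w - w.max())
    weights /= weights.sum()        # softmax over layers
    return weights @ layer_states   # single memory vector, shape (d_model,)
```

With all-zero logits this reduces to a plain average over layers, which is the "average all" baseline mentioned above; learning `w` lets the model emphasize whichever layers turn out to be most useful.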
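To make concrete why training cannot be parallelized across positions, here is a hedged sketch of the sequential loop: every layer of token t must finish before token t+1 starts, because token t+1 attends to the merged memory slot of token t. All names here (`embed`, `layers`, `run_feedback_transformer`) are placeholders of mine, not the paper's code, and the layers are stand-ins for attention blocks.

```python
import numpy as np

def run_feedback_transformer(tokens, layers, embed, w):
    # memory holds one merged vector per already-processed token;
    # every layer of the current token reads this same shared memory
    memory = []
    for tok in tokens:
        h = embed(tok)
        states = [h]
        for layer in layers:
            h = layer(h, memory)   # in a real model: attention over `memory`
            states.append(h)
        # collapse all layer states of this token into one memory slot
        weights = np.exp(w - w.max())
        weights /= weights.sum()
        memory.append(weights @ np.stack(states))
    return memory
```

Because the memory slot for token t depends on the topmost layer of token t, the lowest layer of token t+1 has a data dependency on the highest layer of token t; that is exactly the serialization described above, which slows down training but costs nothing extra at autoregressive inference.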
