# ML Compilers: Bringing ML to the Edge - Chip Huyen, Instructor at Stanford University

## Metadata

- **Channel:** DataScienceGO
- **YouTube:** https://www.youtube.com/watch?v=2QIXxJ_lfuk
- **Date:** 06.11.2021
- **Duration:** 26:42
- **Views:** 1,314
- **Source:** https://ekstraktznaniy.ru/video/45950

## Description

The success of an ML model today still depends on the hardware it runs on, which makes it important for people working with ML models in production to understand how their models are compiled and optimized to run on different hardware accelerators. This talk starts with the benefits of bringing ML to edge devices, which motivates the need for optimizing compilers for ML models. It continues to discuss how compilers work, and ends with an exciting new direction: using ML to power compilers that can speed up ML models!

Chip Huyen is an engineer and founder working on infrastructure for real-time machine learning. Through her work with Snorkel AI, NVIDIA, and Netflix, she has helped some of the world's largest organizations deploy machine learning systems. She teaches Machine Learning Systems Design at Stanford. She's also published four bestselling Vietnamese books.

## Transcript

### Introduction []

Hi everyone, thanks Dr. Joe Perez for the introduction. I feel nervous going after Joe because he is such an energetic speaker, so I'm going to try my best. Today I'm going to talk about a topic I'm very interested in working on: machine learning compilers and how to bring machine learning to the edge. Dr. Perez gave a great introduction about me, so I'll just reiterate that I'm working on a startup for real-time machine learning, and if you want to chat more about it, please let me know. I was never a good systems engineer, and when I was in school I was terrified of anything compiler related, but the more I work on machine learning in production, the more I realize it's really impossible not to touch compilers. I think I first realized this from a quote from Soumith Chintala of PyTorch: he was talking about how for some years we were competing on frameworks, like PyTorch versus TensorFlow, and the next battle, I believe, is about compilers: how to make machine learning models run really fast. So on the agenda today: we'll first go over the importance of machine learning on the edge, then we'll talk about why you need compilers and how compilers work, and in the last session we'll talk about how we can use machine learning to speed up machine learning models.

So the first topic. Edge computing is one of the phrases that has gotten so much traction in the last decade, and for good reason. Just to refresh for the audience: cloud computing is when a lot of the computation is done on the cloud, on servers, for example AWS, GCP, or Azure, and edge computing is when a large chunk of computation is done on edge devices, which can be your phone, a tablet, a smartwatch, or even the browser. Examples of machine learning models that run on the cloud are a lot of smart assistants: when you talk to Alexa, Siri, or Google Assistant, the speech recognition system runs on the cloud and then they respond back to you. An edge computing example is those same smart assistants: even though most queries to smart assistants go to the cloud, certain queries run on device. For example, the wake phrase "OK Google" works with Google Assistant even when the phone is not connected to the internet. (Sorry, I said "OK Google" and now my phone just woke up.) Another example is predictive text: when you type on your phone, it suggests what you might want to type next. Or unlocking the phone with fingerprints or faces. I have that last example in red because for the task of unlocking the phone using fingerprints or faces, it's very important to do it on device, since a lot of people might feel unsafe having their biometric information on the cloud. So there are a lot of benefits to

### Benefits of edge computing [3:32]

edge computing. The first is that when you do the computation on the device, your application can work without an internet connection, or with unreliable connections. For example, when I was helping companies deploy machine learning models, a lot of them had very strict no-internet policies: warehouses with confidential information can't connect their machines to the internet, so whatever application you have must handle the request locally. There's a caveat: your application might be able to do the computation on device, but it might still need external information from the internet to work. For example, with ETA estimation, estimating the time of arrival, the computation might be possible on the device, but you still need to connect to the internet to get the traffic information needed for the most accurate predictions.

Another nice thing about edge computing is that you don't have to worry about network latency. If you have a model on the cloud somewhere, first you have to send the data, the query, from the device to the server, then the server generates predictions, and then you send the prediction back to the device. It's a round trip, and it can be very slow; I have seen a lot of applications where the network latency is actually the bottleneck, not the inference latency of the model. An example we talked about is predictive texting: for predictive texting to be useful, the suggestions have to be faster than the speed at which you type, because if they're slower, there's no point in using predictive texting.

Another benefit is that edge computing means fewer concerns about privacy. First of all, because you don't have to send data over the network, there's less chance of the data being intercepted on the way. Another thing is that doing things on the cloud usually means a lot of people's data is stored in the same location, so if there's a breach at that location, a lot of people are affected. Edge computing also makes it easier to comply with regulations, for example GDPR requests; there are a lot of considerations when you ship data to the cloud somewhere. There's a caveat here too: edge computing can reduce privacy concerns, but it doesn't eliminate them entirely. When you keep all the data on a smart device, it might make it easier for someone to just take the device and run away with all the data.

I think the last benefit of edge computing, which from my observations is the reason pushing a lot of companies toward it, is cost. The more computation we can push to the edge, the less computation we have to do on the cloud, and the less we have to pay for servers; cloud bills are very expensive. I keep talking to companies and they complain about how expensive it is to do things on the cloud, and there are a lot of Reddit and Hacker News threads where people complain about how one mistake almost bankrupted them. Here's an interesting graph of the climbing cloud costs of different companies over the years. Because of the benefits of edge computing, many companies are heavily investing in making better chips, better hardware, to run computation on device. You can see big companies from Tesla to Google to Apple, even Facebook, all trying to develop chips so they can run machine learning models on device, and there are also many startups that have raised billions of dollars to develop better hardware. Okay, so now that we've talked about the importance of edge computing, let's talk about why you need compilers and how that relates to all the edge devices and edge computing.

### How to run your models on different hardware? [8:05]

So given that there is so much different hardware, one question is: after you have developed a model using an arbitrary framework, how do you run that model on arbitrary hardware? I think it's first a question of

### 1. Compatibility [8:22]

compatibility. You have so many different frameworks you can develop a model with: PyTorch, TensorFlow, scikit-learn, MXNet, or a new and hot framework like JAX, and there are so many different kinds of hardware: GPUs, TPUs, phone chips like ARM chips. How can you run a model built in an arbitrary framework on arbitrary hardware?

### 2. Performance across frameworks [8:48]

Related to this is the problem of performance. In a machine learning workflow you use a lot of different libraries: you might use pandas, NumPy, and PyTorch, and even though each library might have certain optimizations within itself, there are no end-to-end optimizations across libraries. In one study, researchers from Stanford found that a typical data science workload using NumPy, pandas, and TensorFlow ran about 23 times slower in one thread compared to hand-optimized code. So if you optimize your code, you can make it run so much faster. What I usually see in companies is that data scientists or machine learning engineers develop a workflow and try to deploy it, and when things get slow, the company starts hiring optimization engineers to try to boost the performance of the system. However, optimization across libraries and across hardware can be very

### Backends: memory layout + compute primitives [9:57]

difficult, and the reason is that different hardware backends have very different compute primitives and memory layouts. If you want to do some kind of computation efficiently, you want to leverage the memory and buffer layout of the hardware, and every type of hardware does it differently. Here you can see the memory layouts of different hardware backends: CPU, GPU, and TPU, and you can see they have different primitives. Traditionally, CPUs use the scalar as the compute primitive, GPUs use the vector, a one-dimensional array, whereas TPUs use the tensor, a two-dimensional array, as their compute primitive. Note that in deep learning models you have high-dimensional operations, and you need to convert them into these low-dimensional compute primitives. So it's really hard, because you need to understand how each hardware backend is laid out in order to optimize your code for that backend. What has usually happened so far is that framework developers, when building a new framework like PyTorch or TensorFlow, offer support across only a narrow range of server-class hardware first; they say, let's get our framework to work really fast on, say, GPUs. At the same time, hardware vendors developing new hardware offer their own kernel libraries for a narrow range of frameworks; for example, NVIDIA has people writing kernels so that GPUs work really fast with common frameworks. The problem is that you can get hardware lock-in: if you develop a model with one framework, you might be stuck with the hardware that supports that framework. It also makes it incredibly difficult for framework developers to develop new frameworks, because now they have to provide support for multiple hardware backends.
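To make the primitives point concrete, here is a toy sketch (all function names invented for illustration) of lowering the same high-level operation, a matrix-vector product, onto a scalar-primitive backend versus a vector-primitive backend:

```python
def matvec_scalar(M, v):
    """Lowered for a scalar-primitive backend (CPU-style):
    every multiply-add is an individual scalar instruction."""
    out = []
    for row in M:
        acc = 0.0
        for m, x in zip(row, v):
            acc += m * x          # one scalar multiply-add at a time
        out.append(acc)
    return out

def matvec_vector(M, v):
    """Lowered for a vector-primitive backend (GPU-style):
    each row is handled as one vector multiply plus a reduce."""
    def vmul(a, b):               # stand-in for a vector instruction
        return [x * y for x, y in zip(a, b)]
    return [sum(vmul(row, v)) for row in M]

M, v = [[1.0, 2.0], [3.0, 4.0]], [5.0, 6.0]
assert matvec_scalar(M, v) == matvec_vector(M, v) == [17.0, 39.0]
```

The results are identical; what differs is the shape of the instructions the backend actually executes, which is exactly what a compiler has to decide per target.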

### Compiling: lowering & optimizing [12:11]

So that's where compilers come in. Optimizing compilers do two things. First, they help with compatibility, via lowering: they lower the model code into hardware-native code. Lowering here means generating hardware code for your models; it's not translation, because there is no one-to-one mapping from model code to hardware-native code. The other thing optimizing compilers do is help with performance: in the process of lowering your model code into hardware-native code, the compiler also optimizes the model to run on that hardware. It sounds wonderful, and you might wonder how compilers do that. Let's go back to the previous slide about compatibility: we have multiple hardware backends and multiple framework frontends. How do we make a model run on an arbitrary hardware backend? Instead of

### Bridging frontend & backend [13:20]

trying to get every single framework to work with every single backend, what if we have a common language, a middleman, so that we can convert any framework code into this common language, and then we only need to generate code from the common language for each hardware backend? For framework developers, instead of making the framework work with every hardware backend, you only have to make it work with this common language; similarly, hardware backends only have to work with the common language instead of with every single framework.
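The middleman idea can be sketched in a few lines (every name below is made up for illustration, not a real API): each frontend maps its ops into a shared vocabulary, and each backend consumes only that vocabulary, so adding a frontend or a backend means writing one mapping instead of N.

```python
# Hypothetical mini "common language": frontends lower to a shared op
# vocabulary; backends generate kernels only from that vocabulary.
IR_FROM_TORCH = {"torch.matmul": "ir.matmul", "torch.relu": "ir.relu"}
IR_FROM_TF    = {"tf.linalg.matmul": "ir.matmul", "tf.nn.relu": "ir.relu"}

CODE_FOR_GPU = {"ir.matmul": "gpu_gemm", "ir.relu": "gpu_relu"}
CODE_FOR_CPU = {"ir.matmul": "cpu_gemm", "ir.relu": "cpu_relu"}

def lower(program, frontend_map, backend_map):
    """Frontend ops -> shared IR ops -> backend kernel names."""
    return [backend_map[frontend_map[op]] for op in program]

# The same TensorFlow program targets either backend through one IR.
prog = ["tf.linalg.matmul", "tf.nn.relu"]
assert lower(prog, IR_FROM_TF, CODE_FOR_GPU) == ["gpu_gemm", "gpu_relu"]
assert lower(prog, IR_FROM_TF, CODE_FOR_CPU) == ["cpu_gemm", "cpu_relu"]
```

Real compilers do far more than table lookups, of course, but the N-frontends-times-M-backends problem collapsing to N + M mappings is the core of the design.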

### Different IR levels [13:55]

So this common language is called an intermediate representation, and there are different levels of intermediate representation. (I'll just say "IR", because I'm not a native speaker and I have difficulty saying long phrases like that.) You have high-level IRs, which, if you use TensorFlow, you're pretty familiar with: the computational graph. Basically, it's a graph of how the computation in the model is done. Then you have low-level IRs, which are language agnostic; examples are the IRs of LLVM and GCC. And there are mid-level, tuned IRs in between, which we're going to go into later.

### How to optimize your models [14:40]

One thing I've been very curious about for a long time is how to optimize our models. There are standard compiler optimizations, for example vectorization, loop tiling, explicit parallelism, and using the cache well. Here's a visualization of how loop tiling works. Loop tiling is when you have a nested loop and, instead of computing like on the left, doing the first row and then the second row, you leverage how the memory cache works: you process the data in blocks, like on the right, and it's more efficient.
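The iteration-order change the slide shows can be sketched in pure Python (the cache benefit itself won't appear here, since Python lists aren't cache-tight, but the transformation is the same one a compiler applies):

```python
def traverse_row_major(n):
    """Visit an n x n grid row by row (the 'left' picture)."""
    return [(i, j) for i in range(n) for j in range(n)]

def traverse_tiled(n, tile):
    """Visit the same grid tile by tile (the 'right' picture):
    each tile x tile block is finished before moving on, so the
    working set stays small enough to fit in cache."""
    order = []
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, n)):
                    order.append((i, j))
    return order

# Both orders visit every cell exactly once; only the order differs.
assert sorted(traverse_tiled(4, 2)) == sorted(traverse_row_major(4))
assert traverse_tiled(4, 2)[:4] == [(0, 0), (0, 1), (1, 0), (1, 1)]
```

Because the set of visited cells is unchanged, the transformation is always legal for loops without cross-iteration dependencies; the compiler's job is picking a tile size that matches the cache.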

### Operator fusion [15:25]

Another optimization is operator fusion, which is a very similar idea: you fuse different operators together so they're done at the same time, which is more efficient. Here's a graph showing how operator fusion can improve performance; it can give you up to a 2x speedup in certain cases.
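A minimal sketch of what fusion buys you (toy "kernels", not any real backend's): the unfused version writes a full intermediate buffer between the two ops, while the fused kernel computes both per element in one pass over the data.

```python
# Unfused: two kernels, each reading and writing a whole buffer.
def add_kernel(a, b):
    return [x + y for x, y in zip(a, b)]

def relu_kernel(a):
    return [max(x, 0.0) for x in a]

# Fused: one kernel computes relu(a + b) element by element,
# halving the number of memory passes and skipping the
# intermediate buffer entirely.
def add_relu_fused(a, b):
    return [max(x + y, 0.0) for x, y in zip(a, b)]

a, b = [1.0, -3.0, 2.0], [0.5, 1.0, -5.0]
assert relu_kernel(add_kernel(a, b)) == add_relu_fused(a, b) == [1.5, 0.0, 0.0]
```

On real accelerators, memory traffic, not arithmetic, usually dominates elementwise ops, which is why fusing them can approach the 2x gains mentioned above.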

### Graph optimization [15:51]

Another kind of optimization is graph optimization, which is more general. Given a computational graph, you can do vertical fusion, fusing operators in the vertical direction, or horizontal fusion, as in this example, and this can help you run one pass through the computational graph faster.
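Horizontal fusion can be illustrated with a toy graph (hypothetical ops for illustration): two branches read the same input, so fusing them lets one traversal of the input feed both branches instead of two separate passes.

```python
# Toy computational graph: x -> double and x -> square are sibling
# branches that both read the same input x.
def double_op(xs):
    return [2 * x for x in xs]

def square_op(xs):
    return [x * x for x in xs]

def fused_branches(xs):
    # Horizontally fused: one traversal of xs produces both
    # branch outputs at once.
    doubled, squared = [], []
    for x in xs:
        doubled.append(2 * x)
        squared.append(x * x)
    return doubled, squared

xs = [1, 2, 3]
assert fused_branches(xs) == (double_op(xs), square_op(xs))
```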

### Why is it hard? [16:15]

So this is actually really hard, because even though there are common methods for these optimizations, doing them well is still very hard. The optimization has to be hardware dependent: it really depends on the processor, memory, cache, latency hiding, and layout. It's also operator dependent, because it depends on the operators, the kinds of computation you do in the model, and new models are being developed all the time, with different ways of combining operators; there are new operators and new models all the time, so every time there's a new model, you have to think about how to optimize it. And there are many different possible paths to execute a graph: for example, if you have three operators A, B, and C, you might fuse A with B, or B with C, or A, B, and C all together. So what a lot of companies do right now is come up with hand-tuned optimizations. It's heuristics based: they get some really experienced engineers, whom nobody can fire because they're the only ones in the company who know how to do this kind of stuff. For example, NVIDIA might hire an optimization engineer to optimize ResNet for the new Tensor Core GPUs. The problem is that this is not optimal, because it's heuristics based, and even though a lot of heuristics can be very efficient, they're usually not optimal. It's also not adaptive: if you have new hardware, new frameworks, or new models, it's hard to adapt the optimizations to these new things. So an idea is: what if, instead of hand-tuning, we use machine learning to automate the optimizations?
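A tiny sketch of why the search space blows up (a simplification: it considers only fusing contiguous runs of operators in a chain): a chain of n ops already has 2^(n-1) groupings, before you even consider schedules, layouts, or tile sizes.

```python
def fusion_groupings(ops):
    """All ways to split a chain of ops into contiguous fused groups."""
    if len(ops) <= 1:
        return [[ops]] if ops else [[]]
    result = []
    for cut in range(1, len(ops) + 1):
        head = ops[:cut]                      # first fused group
        if cut == len(ops):
            result.append([head])
        else:
            for rest in fusion_groupings(ops[cut:]):
                result.append([head] + rest)
    return result

groups = fusion_groupings(["A", "B", "C"])
# 2^(3-1) = 4 groupings: A|B|C, A|BC, AB|C, ABC
assert len(groups) == 4
assert [["A", "B", "C"]] in groups and [["A"], ["B"], ["C"]] in groups
```

Real graphs are DAGs, not chains, and each grouping still has to be scheduled, so exhaustive enumeration is hopeless; that's the motivation for the learned search below.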

### Idea: automate the optimization process [18:19]

So the idea is: what if we explore all possible paths to execute a graph to find the optimal one? You could explore a path, run it to find out how long it takes to execute, and then just choose the path that's fastest. But that's too slow, because there are too many possible paths. So there's been a new research direction where we use machine learning to narrow down the search space and find approximately the optimal path; you don't have to find the truly optimal one, you can approximate it. Here's an example of an auto-scheduler, which is what TVM does. First you break the graph into subgraphs, then you predict how big each subgraph is so you can allocate time to it: instead of having to run each subgraph end to end and measure it, you predict how long it's going to take. Then you find the predicted-optimal way to execute each subgraph, and you stitch the subgraphs together to execute the entire graph end to end. You might be familiar with what I'm describing: the TVM auto-scheduler, which is more general, and cuDNN autotune; if you haven't tried cuDNN autotune for PyTorch, you should definitely do that. And here are some nice results of how AutoTVM works. Of course, benchmarking is really hard, can sometimes be hard to reproduce, and you sometimes have to take the results with a grain of salt, but here it shows that the TVM auto-scheduler and AutoTVM actually run faster than hand-tuned kernels from cuDNN. Here's an example showing TVM compared to other optimization methods on a Raspberry Pi. TVM is actually an open source project, and if you want, you can install it using pip install. One catch with using TVM, though: because it's using machine learning, it explores a lot of possible paths, so it can be pretty slow for the first run; it can take hours, even days. But the nice thing is that you only have to compile the model once. So if you have to deploy the model across many devices, you compile it once; even if each inference gains only a small speedup, across many inferences it can be a huge saving. Anyway, I think now it's time for Q&A. Thank you so much, everyone; if you have any questions, please let me know.

All right, Chip, thank you so much for presenting that information, very useful. Let's look at the questions we've got here. What can happen when you use test data instead of actual data with machine learning?

Sorry, can you repeat the question?

Yes: what can happen when you use test data instead of actual data with machine learning? I guess, how can that throw off the bias, maybe?

Okay, I'm a bit confused by the question, because what are you using the test data for? In machine learning, you use test data to evaluate your model; or do you mean test data as in toy data?

Yeah, that's a good one. Let's see, Pooja has just checked in; she says she can take that question later, because it's probably more in the context of what she'll be talking about. All right, our next question. This one says: I've heard the term data mining thrown around quite a bit with machine learning; what's the difference between the two?

Oh, that's an interesting question: data mining versus machine learning. I can see the confusion, because they both deal
with a lot of data. I think machine learning is actually a subset of data mining. Data mining is when you have a lot of data and you try to generate insight from it, and machine learning is a subset of that, because it also leverages data to generate predictions or help you understand the data in different ways. So data mining is a very general, big field: any time you want to extract information from a large amount of data.

Okay, and let's see here; there's a response to that. Very good. It says: data mining uses techniques created by machine learning for predicting results, while machine learning is the capability of the computer to learn from a mined data set. There we have it. All right, excellent; I hope this answers your question. And the last question, unless another one comes in: there's always going to be bias when looking at data and the issues you're trying to resolve; what is the trade-off between bias and variance?

That's very interesting. Like you said, there will always be a trade-off, because there are certain algorithms and techniques that help you reduce the bias, and certain techniques that reduce the variance at the cost of higher bias. So I don't think there's a way around it, but I think understanding the trade-off helps you choose the kind of algorithm that is best for your solution. I think Andrew Ng has some pretty great lectures on this. It can also help you decide, for example, whether you need more or less data: if an algorithm has very high bias, getting more data may not help, so whether the model has higher variance or higher bias helps you decide the amount of data that corresponds to it.

I couldn't agree with you more, Chip. Either you've got erroneous or overly simplistic assumptions, which is going to lead to your bias, or, with variance, you've got too much complexity in your learning algorithm, which leads to the algorithm being highly sensitive to high degrees of variation. That seems to make more sense to me than making erroneous assumptions about the data when you're putting it through the models; those are the components of reducible error. All right, very good. Well, Chip, thank you so much for being with us this afternoon. What's the best way for people to connect with you?

I'm not sure I'm still sharing my screen, but you can reach me on my Twitter or my email; just shoot me an email anytime. Thank you so much for having me.

Absolutely, Chip, thank you for being here, and we wish you all the best. Thank you; have a nice day. All right.
