# How to make your CPU as fast as a GPU - Advances in Sparsity w/ Nir Shavit

## Metadata

- **Channel:** Yannic Kilcher
- **YouTube:** https://www.youtube.com/watch?v=0PAiQ1jTN5k
- **Date:** 17.09.2022
- **Duration:** 50:20
- **Views:** 52,021
- **Source:** https://ekstraktznaniy.ru/video/17125

## Description

#ai #sparsity #gpu 

Sparsity is awesome, but only recently has it become possible to properly handle sparse models at good performance. Neural Magic does exactly this, using a plain CPU. No specialized hardware needed, just clever algorithms for pruning and forward-propagation of neural networks. Nir Shavit and I talk about how this is possible, what it means in terms of applications, and why sparsity should play a much larger role in the Deep Learning community.

Sponsor: AssemblyAI
Link: https://www.assemblyai.com/?utm_source=youtube&utm_medium=social&utm_campaign=yannic_autochapters

Check out Neural Magic: https://neuralmagic.com/
and DeepSparse: https://github.com/neuralmagic/deepsparse

OUTLINE:
0:00 Introduction
1:08 Sponsor: AssemblyAI
2:50 Start of Interview
4:15 How Neural Magic was founded
5:10 What is Sparsity about? 
9:30 Link between the human brain and sparsity
12:10 Where should the extra resource that the human brain doesn't have go?
14:40 Analogy for Sparse Arch

## Transcript

### Introduction [0:00]

Today I'm talking to Nir Shavit about sparsity. Nir has long been active in the field as a professor at the Technion and MIT, and has been awarded various prizes, such as the Gödel Prize in 2004 and the Dijkstra Prize in 2012. He's also the founder of a company called Neural Magic that questions one of the fundamental core principles of current machine learning, namely that you need GPUs. Neural Magic uses various techniques such as sparsity, which we're going to talk about today, but also other optimization techniques, to make inference on models like BERT as fast on a regular CPU as on a GPU. This is pretty huge and can have vast implications on where you can deploy these models and just how expensive it gets to roll them out to many people in many places. So today we'll talk about the biological foundations for sparsity, why we shouldn't attempt to replicate the brain, and just what it takes to make something go really fast on just the CPU. I hope you enjoy this conversation. If you do, give Nir and his company a follow, and I'll see you around. Bye-bye.

### Sponsor: AssemblyAI [1:08]

Hi, this video is sponsored by AssemblyAI. AssemblyAI does real-time and batch transcription of audio and video files, powered by the latest advances in artificial intelligence. So if you are a developer, or work for a company that's looking to get more out of your audio or video data through transcription and audio intelligence, AssemblyAI is the best place to go. Not only do they have a user interface where you can just upload stuff, but they also have a very powerful API. But transcription isn't all they do: once your audio is transcribed, they can post-process it in many different optional ways, so they can do things like speaker classification or annotations of various forms inside of your audio. One feature I'd like to particularly highlight today is Auto Chapters. For this, simply provide auto_chapters equals true on your upload, and AssemblyAI will, after it's transcribed your audio, automatically recognize chunks of audio where you talk about the same thing, give you a summary of those chunks, and a neat single description headline of what you were talking about there. This is absolutely ideal for anyone who does any sort of long-form podcasting, or videos like mine, where viewers are very helped by the fact that there are chapter annotations, and to have these done automatically is just absolutely great. So if you're interested, head on over to AssemblyAI and use the link in the description to let them know that I sent you. They are the single API to transcribe and understand audio; they do so in batch and in real time via WebSocket; they accept all kinds of audio and video formats; and they do so in over 15 languages. Give it a try, and thank you very much to AssemblyAI for sponsoring this video. And now let's get into the video.

### Start of Interview [2:50]

The topic of sparsity is a big thing in neural networks right now, mostly because we have no idea, really, how to do it, and I think that's exciting times for the future. So welcome. What brings you into the sparse world? Actually, I've been a professor of computer science for many years, I worked on multi-cores for more than 30 years, and I got involved in computational neurobiology in the last 10 years. And one of the things that you really see in the brain is how sparse its computation is; it really is very sparse. And so, looking at neural networks, we see that there is a similar phenomenon to what happens in brains happening in neural networks, where you can actually reduce the number of parameters through pruning by huge amounts and preserve the accuracy of the performance of the network. And that kind of says: okay, if we really want to have brain-like performance, sparsity is probably one of the tools that we want to use to get there. So that's how I got into this.

### How Neural Magic was founded [4:15]

And you founded a company that also works in this direction, right? Do you want to talk about that a little bit? Yes. Neural Magic was founded because of what we were seeing in my lab. I was busy doing machine learning at a large scale for biology projects, and what we realized was that we could get CPUs to run at GPU speeds. At the time it was a Pascal GPU, and we could make just a regular CPU do what the Pascal GPU was doing, through the use of sparsity and other similar techniques. And so we said: okay, there's real commercial value here for people, because you don't need an accelerator; you can just do it on your commodity CPU. And that's Neural Magic. So what we do is deliver, through sparsity and similar optimization techniques, GPU performance on CPUs. That is quite a

### What is Sparsity about? [5:10]

promise. Maybe let's first dive a little bit into sparsity itself. What is it about sparsity? You mentioned the brain is very sparse, yet the way we currently train neural networks is very dense, and we can accelerate dense neural networks much better. Is it just the saving of parameters, or is there something more to sparse connections than to dense connections? What do we know? That's a good question. Clearly, what we're doing today is not the sparsity that we will be doing in the future. What I mean by that is, your brain is sparse way beyond the levels of what we see in neural networks today. Your typical brain, in terms of compute: your cortex is like a cell phone of compute, but the graph is enormous; you need petabytes to basically hold it. So, a cell phone of compute on a petabyte or more of memory. But the accelerators that we build are designed to deliver petaflops of compute on a cell-phone-sized memory; their memory is very limited, because they use this high-bandwidth memory. So in a sense we're building the opposite of what we want. If we want to mimic the brain, we should not busy ourselves so much with the amount of compute, and rather worry about how it is that we implement this very large graph. It's a very large graph, but it's extremely sparse; that's the point. And, as you asked, the sparsity is not necessarily the same sparsity that we do today through pruning techniques; it's a combination of a very sparse architecture together with sparsity in what we call, in machine learning, the kernels. So it's not just that the kernels are sparse; everything in the design is very sparse. And we don't know yet how to design very sparse architectures. Part of that has to do with the fact that machine learning grew up in the GPU world, where sparsity is not actually an advantage, because you're doing lockstep computations, so you win nothing by being very sparse. And therefore we don't see those architectural sparsity things yet, but I'm expecting that to happen; this should come along. And even more than that, what I expect is things like the Pathways model from Google and so on, where even if you have a very large model, you don't execute the full model layer after layer, but rather you execute small regions of the model at any given time, per input. That's another form of sparsification of your computation, and that is what the brain really does: when you see an input, your brain typically uses a very small fraction of its total graph to do the computation. And so that's where we're headed. We're not there yet; we don't know how to do it, but this is the goal. And that's the old "you only use 10% of the brain at any given time", right? Yeah, that's right. Really, from energy considerations, the brain is like a cell phone; it isn't this massive monster multi-GPU thing that we use today. And so my expectation is that, as we learn more and more about how to design sparse networks, we're going to see them become the standard. They're not the standard right now, because we started the whole journey by applying flops, and applying flops is still the main paradigm. But we will see it appear, both in hardware accelerators and in CPUs, this idea that we can utilize sparsity to get really great performance gains. Yeah, that's coming.
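The "execute only small regions of the model per input" idea can be sketched as routing each input through just one of several expert sub-networks. This is a toy stand-in for Pathways-style conditional computation, not anything discussed concretely in the interview; the gate, expert count, and dimensions are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts = 8, 4
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]  # sub-networks
gate = rng.normal(size=(d, n_experts))                         # routing weights

def forward(x, k=1):
    """Run the input through only the top-k experts; the rest stay idle."""
    scores = x @ gate                        # one score per expert
    top = np.argsort(scores)[-k:]            # pick the k highest-scoring experts
    return sum(x @ experts[i] for i in top)  # only k of n_experts matrices touched

x = rng.normal(size=d)
y = forward(x, k=1)  # one expert fires; three quarters of the parameters sit idle
```

With `k=1`, each input activates a quarter of the parameters, which is the "small fraction of the total graph" point in miniature.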

### Link between the human brain and sparsity [9:30]

Now, the question is a little bit a chicken-and-egg problem: is the brain sparse because it has the limitations of cell-phone power, or does the brain only need cell-phone power because sparsity is such a good architecture? Which causes which? So I would say, consider the whole notion of parallelism in the brain. If you think about it: imagine that you need to do a billion operations per second, and what you have are these very slow chemical devices, neurons, to do that. So you need a billion operations, a billion firings of neurons, in a second. How are you going to do that? Well, what you need is massive parallelism. You've got to get massive parallelism; if you can do the massive parallelism, you can get the billion operations. And so our brains are parallel, if you will, because we have this special medium. Now, on a modern multiprocessor, you can get a billion or 10 billion instructions executed per second, sequentially; you don't really need parallelism for it. And so what I'm trying to say is: the whole idea of how brains evolved is clearly because of the way they're implemented, but we should not think of going and implementing this in silicon in the same way, because what we really should think about is that both of these things are Turing complete. You can implement the algorithm; you just need to know what the algorithm is, and then on silicon we'll implement the best algorithm we can, of the brain, but we don't have to have the exact architecture of the brain to do that. Does that make sense? That's what I'm trying to say: let's implement the algorithm, but not necessarily the architecture. So when I say sparsity, I really mean algorithmic sparsity, and it doesn't mean that you have to have a very sparse silicon VLSI circuit to do this. That's not the case. Yeah, that's a good segue: given that we do have the flops that we don't have in the brain, it naturally is a different system. We do have

### Where should the extra resource that the human brain doesn't have go? [12:10]

teraflops, petaflops even, in these giant compute clusters. Where should we put them, in your opinion? Where should that extra resource that the brain doesn't have go? Should it go into sequentially executing what the brain executes in parallel, or where should we put it? So the first thing I want to say is that we have those flops, but they're costing us a lot. You just have to open the papers to see what the cost of the flops is: it's an enormous energy drain, and it's also an enormous architectural drain on what we're doing. And so I would say we want to get rid of the flops, because probably we don't need them, especially as you go from the data center down to the edge. In the data center, you can put your Google data warehouse right next to a waterfall or whatever you want, next to a source of energy. When you're doing this on your cell phone, or on a tiny device at the edge, every little bit of energy that you waste is critical for you. And so what we really want to do is move away from the flops and move more towards the very energy-efficient way that brains work, because this adding of more flops is a momentary thing for us. So yes, we can do this, but at a very high cost, and no, we don't want to do this forever; we want to find ways to cut the cost, reduce the compute. And there's one other thing that I want to say, and that is: architecturally, we generate the flops, right now at least, by running many, many tiny cores, thousands of tiny cores, typically, and architectures like that require a lot of connections to the memory, this high-bandwidth memory, and this thing doesn't scale. So in a sense we're trading flops for memory. If you use a CPU today, you could get a terabyte on your desktop, but not on a GPU. And so losing the flops is going to enable us to change the architecture: if we don't need so many flops, then we can actually increase the size of our memory, which will make us able to hold these giant

### Analogy for Sparse Architecture [14:40]

models very cheaply, if you will. If I explain a deep neural network to someone, I usually start with a fully connected layer: you say, here is a layer of neurons, and they have their connections, and each connection has a little weight, and so on. You usually describe a dense, fully connected architecture, and that is, I want to say, conceptually easy for people to grasp. Do you have an analogy for sparse architectures? Could you conceptualize, for someone who doesn't know, what a sparse architecture is and how to think about it? What is different? Yeah. The way we do sparsity today, I don't know what it'll look like in the future, but today sparsity looks like this: imagine that between two layers of the neural network there are cords from one layer to the next, like springs attached, and these are, of course, the connections, the weights that we're using in the computation. And sparsity means I take scissors and I chop, chop, chop, until I have five or ten percent of those cords left. And those cords, it turns out, if I do this kind of pruning, are good enough to capture the accuracy of the model as it was before, because a lot of the connections are not important for this process. That's the big discovery, and modern research in techniques for sparsification plays along this kind of game. So you can do this kind of unstructured thing that I just described, where you arbitrarily cut in many places based on the effectiveness, or you can also structurally take things out. So in a lot of the modern models, we're removing pieces that are not necessary; we do architecture search to find these places to prune. That's where the whole game of efficiency in neural networks is right now.
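The scissors-and-cords picture corresponds to what the literature calls unstructured magnitude pruning: cut the smallest-magnitude weights until only a chosen fraction remains. A minimal NumPy sketch (the 90% level and the layer size are illustrative, not numbers from the interview):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))   # a dense layer: every "cord" present

sparsity = 0.90                   # chop until ~10% of the cords are left
threshold = np.quantile(np.abs(W), sparsity)  # smallest magnitudes get cut
mask = np.abs(W) >= threshold
W_pruned = W * mask               # pruned weights are exactly zero

print(f"{mask.mean():.1%} of the weights survive")
```

The "structured" variant mentioned afterwards removes whole rows, columns, or blocks instead of individual weights, which is easier for hardware to exploit.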

### Possible future for sparse architectures as the standard architecture for neural networks [16:48]

The game is how do I cut this thing down. In the brain, there are certainly some systems, like the visual system, that are clearly organized into layers, but there are many other systems that have no resemblance to layers: there are connections going up and down and left and right, and between the halves of the brain, and so on. Is there a possible future where this could become a standard architecture for neural networks, where the notion of layers and things like this isn't even really a thing anymore? Or is there some fundamental way where we say: no, there are probably always going to be layers, but it's just going to be sparsity between those layers? So, look: we have a full connectome of essentially only a couple of animals, a worm and a fruit fly, that's it. And as I said, you don't see a lot of layering there; it looks more like a mess, a very sparse mess. And I wouldn't venture to guess what a cortex looks like; we don't have that yet. We're working very hard on it; these are very hard computational problems. To be able to go and get a model, we just want to do a mouse, and even a mouse is too big for us to do right now, like a small mammal. But I would venture to guess that yes, the answer is that it's an extremely sparse architecture, and that it will not look like layers. Now, you can impose a layer structure on any graph; it's not so much that I'm saying there aren't layers. Sure, I can take the graph and layer it; I can do a BFS on it and layer it. But the point is more that, by design, when I think about it, I'm not going to think about it as a sequence of layers, where the change that I make is a change in a layer, one layer being different from the other, but rather it'll be a combination of thinking about paths, different paths, and I'll do different things along different paths. That's kind of the idea. If you think about it, there's recent research from MIT: people can detect an image in 0.013 seconds, in 13 milliseconds. In 13 milliseconds you can detect, you can say, what an image is. There's almost no time for neurons to fire; this thing is extremely parallel, uses very little compute, and gets you an answer. And a large part of that is prediction, because you're already expecting something. So we need to learn how to do those things, and machine learning right now is at a very naive, early stage. And so, given that, and given the things that we are doing right now, it's not a surprise that we're doing the brute-force, massive-compute kind of thing. That's always what you do, and with time we're going to get better and better at it. So that's kind of how I see this

### Pruning & Sparsification [20:08]

progressing. Speaking of becoming better: the flatworm is sparse, the mouse is sparse, the human is certainly sparse, yet our best models today are all big, dense, computation-hungry things. Every time I prune, I sparsify and so on, I get savings in CPU or GPU, I get savings in my storage, but I also get a little bit worse, right? That's the common thing in pruning today: I get just a tiny bit worse than the dense model I prune from. Why do you think that is? Is it just the fact that we prune from a dense model, or what's holding back the sparse models? So how about if I turn this around for you. You can take BERT-base, which is a common model that people use, and you can sparsify BERT-base. At Neural Magic we sparsified it 95%, so a 95%-sparse BERT-base: one twentieth of the compute, way beyond anything a GPU does, even if you run it at full throttle. It's just cutting the compute so much that there's really almost nothing to compute; it's just moving data. I'm exaggerating, of course, but it really becomes a data-movement problem rather than a compute problem. And you lose less than one percent accuracy. And I say: okay, great, so you've done that, and you've gotten all this speedup, but you say, "oh Nir, you lost less than one percent accuracy." What I say instead is: forget that. Take BERT-large, a much more accurate model, several points more accurate than BERT-base, and prune it so that, with 20x less compute, it's actually faster than BERT-base. And so now you have the accuracy and you have great compute, and this is through sparsity. By sparsifying the larger model, I actually delivered you the best of both worlds: little compute and great accuracy. And that's how I want you to think about sparsity: it's a way of enabling us to run much larger, more accurate dense models, but because we sparsified them, we're getting great performance. That's how to think about it. What's the limit currently? We always need the dense model first in this pruning setup: we first need the dense model, then we go to the sparse model.
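The "95% sparse, one twentieth of the compute" figure is just counting surviving multiply-adds; a quick sanity check (the dense FLOP count is a placeholder, not BERT-base's real number):

```python
dense_flops = 1_000_000          # placeholder: FLOPs for one dense forward pass
sparsity = 0.95                  # 95% of the weights pruned away
sparse_flops = dense_flops * (1 - sparsity)

speedup_bound = dense_flops / sparse_flops   # upper bound on the speedup
print(round(speedup_bound))      # 20
```

In practice the realized speedup is below this bound, since irregular memory access in sparse kernels eats into it, which is exactly why the problem shifts from compute to data movement.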

### What keeps us from building sparse models? [22:57]

We get huge savings at inference time. What keeps us from just building the sparse model in the first place? Great, so this is kind of the lottery-ticket question, if you will. There is research on this; actually, Dan Alistarh, one of our consultants at Neural Magic, works exactly on this kind of stuff. We know how to run a training session right now, for models, where you start out and you need to do only a certain fraction of the forward and backward passes dense, and then immediately you can already start pruning while training. So there is research going in that direction. But you are right that right now, at least, standardly, if you look at what's going on out there, we do most of the time take a standard model, and from dense we sparsify it, and so on. But the thing to remember, and now I'm not talking about the research, because the research is going to get there, Yannic, I don't know to what extent or how fast this will happen, but we will learn how to build sparse architectures: start sparse and continue sparse. Nature does this, and so there's no reason why we won't be able to do it. But I want to say something about today's machine learning, where you start with the dense model and then you have to sparsify: this is really not the common paradigm for most users of neural networks. For most users, a model is given to them from a known architecture, and then they transfer-learn onto it. Most people do that rather than train from scratch; they really use the model that somebody already worked very hard to build, for their specific use case, and then they transfer-learn onto it. So this is what you can do with sparsity: you can take a sparse model and sparse-transfer-learn onto it. It's extremely efficient, because you're running at the speed of the sparse network. So you can sparse-transfer, and then you don't need all of this starting with dense. And we're seeing more and more sparse networks appear in the literature and in the database collections of machine learning models, and as we have more and more of these initial good sparse models, people are going to learn to start with the sparse model already.
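Sparse transfer learning, as described, can be sketched as ordinary fine-tuning with the inherited pruning mask held fixed, so every update keeps the zeros in place. A toy gradient step (mask density, shapes, and learning rate are made up; this is a sketch of the general idea, not Neural Magic's actual recipe):

```python
import numpy as np

rng = np.random.default_rng(0)
mask = rng.random((16, 16)) < 0.10    # sparsity pattern inherited from the sparse model
W = rng.normal(size=(16, 16)) * mask  # start from the sparse checkpoint

def sgd_step(W, grad, lr=0.01):
    """Update only the surviving weights; pruned positions stay exactly zero."""
    return (W - lr * grad) * mask

grad = rng.normal(size=(16, 16))      # stand-in for a real task gradient
W = sgd_step(W, grad)

print(np.count_nonzero(W[~mask]))     # 0: the sparsity pattern is preserved
```

Because the mask never changes, every forward and backward pass during fine-tuning runs at the sparse model's speed, which is the efficiency point made above.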

### Why are GPUs so unsuited for sparse models? [25:34]

Commercially, I think that's what we're going to see more and more of. Yeah. You mentioned this a bit already, but why are GPUs so unsuited for sparse models, and what makes CPUs, in the way you do it, really suited for sparse models? Or are they even suited, or are you simply seeing that they're better? Yeah, I mean, look: the GPU architecture is designed around very small cores and tiny caches. You're not going to go and throw all that away just because you discovered sparsity. So you're trying to do sparsity while keeping this kind of lockstep execution structure, and this is difficult to do sparse; you really need a different kind of setup to get an advantage out of sparsity. Now, it's not like you can't do that: people can design, and have designed, hardware that utilizes sparsity efficiently. There is such hardware; it's just not GPU-like, not like the accelerators that we have today. But all of these accelerators have a different problem that has to do with the memory, because of the way they're designed: they typically have very small memories. Even the ones that can run sparse still have the limitation of their memory size. So the reason that CPUs are attractive is not so much that you have a natural way of running sparsity, because you can run asynchronously with large cores, but rather that the large cores enable very easy access to very large memory pools. The advantage of having strong, powerful cores is really that I can put several terabytes of memory next to them and run easily, and that's where the big advantage is going to be. As we understand more and more about how to build giant models that don't run the whole model layer by layer at a time, the compute will be less important; what matters is actually the ability to hold that model in one place and run it, rather than break it apart onto 8 or 16 GPUs. That's going to be your advantage. So I'm saying it's not so much that you can't build a piece of hardware to run sparse; you can, but you should build it to look like a CPU, in the sense that you can access a lot of memory, because you're not doing tiny cores. That's my two cents. So CPUs are good because they have a fast connection to large memory, but also, over the years, we've put more and more levels of cache onto the CPU. How much do you have to take this into account when you're building? And maybe you can explain a little bit what your company does in terms of software: do you build compilers, or can I just run TensorFlow or something?

### CPU and GPU in connection with memory [28:47]

Yeah, so let me explain. First of all, the connection between the CPU and the memory is slow. The GPU has faster memory and faster access to it, smaller but faster; CPU memory is slow but large, very large. But CPUs have a cache hierarchy, as you said, and if you know how to utilize your cache hierarchy, then, if you're running in the L1 cache of the CPU, you're running as fast as the GPU. There's nothing the GPU does that the CPU can't do once you're in cache. In fact, CPU caches are much faster than GPU caches, and the performance is better. So the question then, and this is what Neural Magic does, is this: we sparsify the model. Machine learning has been about, okay, I need to meet a certain latency, and because I couldn't meet that latency with a CPU, we added the GPU, and boom, there's machine learning with GPUs; now I can meet the latency. But there are two ways to deal with latency: one is to add more flops, and the other is to reduce the flops. And so sparsity, instead of adding more flops in hardware, reduces the number of flops needed in software. But now that you have this very sparse model, because the CPU memory is slow, what happens is you hit a bottleneck, and if you do this layer after layer, it's very

### What Neural Magic does [30:14]

hard to move the data in and out. So what Neural Magic invented is a way of running neural networks depth-wise. We have this technology, which we call tensor columns, where essentially you can break the model lengthwise and run each one of these columns in cache. And because you're not really leaving L2, you rarely leave L2, you actually get great performance. So, in a sense, what we're doing is using the natural ability of CPUs to prefetch things from memory and then run in cache. And because this cache hierarchy on CPUs has evolved over 70, or maybe I'm exaggerating, 60 years of hardware design, it's a very well understood thing, where people know how to optimize it; especially the big chip makers really know how to make these caches work well. And so, with these really good cache hierarchies, you get great performance by running the model depth-wise. So that's Neural Magic: we take the model, sparsify it, so now it doesn't need the compute, and then we run it on the CPU and get speed, because we're running in cache. And if you look at the numbers, some of which we haven't published yet, we are at the speed of an A100, even faster, in terms of how long it takes: a four-core CPU can, in terms of latency, do what an A100 does on a common model like BERT. Given that it's sparse? Yes, by sparsifying it and running it, you can make a four-core CPU do what the A100 does. So it's really now a matter of throughput, and the A100 has a lot of throughput. So now the question is: how many cores do you want on your CPU to meet the throughput of the A100? And again, the story is that the big providers are adding more and more cores, so you're going to be able to compete better with the GPUs down the road. So that's the story of Neural Magic. Yeah. So the way I can imagine these tensor columns is that, because I execute depth-wise, the values that I need for the next step in the computation are the results of the very last step and therefore are already going to be in cache. And since everything is sparse, I don't need all of the last layer for the current step.
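The "it becomes a data-movement problem" point is easiest to see in a compressed sparse row (CSR) matrix-vector product, where only the stored non-zeros are moved and multiplied. This is a hand-rolled sketch for intuition; real kernels such as DeepSparse's tensor columns are far more sophisticated:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64)) * (rng.random((64, 64)) < 0.05)  # ~95% sparse layer

# Build a CSR representation: non-zero values, their column indices, row pointers.
vals, cols, rowptr = [], [], [0]
for row in W:
    nz = np.nonzero(row)[0]
    vals.extend(row[nz]); cols.extend(nz); rowptr.append(len(vals))
vals, cols = np.array(vals), np.array(cols)

def csr_matvec(x):
    """One multiply-add per stored weight: roughly 5% of the dense FLOPs."""
    y = np.empty(len(rowptr) - 1)
    for i in range(len(y)):
        s, e = rowptr[i], rowptr[i + 1]
        y[i] = vals[s:e] @ x[cols[s:e]]
    return y

x = rng.normal(size=64)
print(np.allclose(csr_matvec(x), W @ x))  # True: same result, far fewer FLOPs
```

Note the gathers through `cols`: almost all the remaining work is fetching data, not arithmetic, which is why keeping the working set resident in cache matters so much.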

### How do you deal with overlaps in tensor columns? [32:54]

I have it already, right? And of course, when you think about a neural network, there are overlaps between these columns, and the question is how you deal with the overlaps in a way that doesn't kill your computation. And that's the magic of it: there's an algorithm that allows you to do that, and because you can do it, you manage to run this way, you don't hit this memory bottleneck, and boom, you're in business. Yeah. So it's almost like GPUs enable us to do dense models, but I think models have also almost co-evolved with the GPU, so people have started building models to fit the GPU architecture better. Especially something like a transformer; that's made for GPUs. Is there a type of

### The best type of sparsity to execute on a CPU [33:41]

sparse model like if you could wish for the best possible sparse but you know there's different kinds of sparsity like what is the best type of sparsity to let's say execute on a cpu if we want to look forward and we want to especially build architectures for that yeah this goes back to your original for one of the first questions you asked right it's about a different structure for the neural network execution so we should forget the synchronous layer after layer execution and think about the fact that you know we can run through a model right in multiple paths with multiple computing units use the same weight structure and so on of the model right but run at different speeds and by running going through the model in different paths i can get from the same model multiple answers to my questions which is kind of what i believe what your brain does so what happens there is you have this network but it's not like you know it's all firing like this layer after layer it's rather you have these asynchronous flows going through it right even going through matching pads and cpus are naturally built for this thing now i'm not saying that somebody can't build a beautiful fpga that will perhaps have a better closer structure to what a brain does maybe so but you know but there is an advantage to being commodity okay the fact that the cpu can do other things is a big win if i can make if i can move everything to software is really is the thing then i can really get all the advantages of modern software so i'm not pulling hardware accelerators i'm saying great you know they have a role and so on and so forth but they come at a price right and the price for any organization is that you instead of just downloading or shipping your product with the machine learning piece you have to ask the client to buy a certain accelerator or run it with a certain accelerator and this all goes away if we can figure out how to make the cpus do what the gpus do right then we have then we're back 
into this beautiful world of containerized, movable software, and that's really where I would love machine learning to move. And maybe down the road — you know, CPUs have a history of absorbing the key components of any new paradigm that shows up. Virtualization started out as tricks on a CPU, and then later the features were added. Networking had special accelerators, and then they moved into the CPU. I'm expecting that whatever features are necessary for machine learning to run well will move into the CPU, and we won't need an outside accelerator to make this thing work.

I think that's, by the way, also the story of GPUs themselves: they were already consumer-ish, available hardware, and then they absorbed machine learning — not necessarily the best architecture for it. But let's say there's already all this hardware out there: very good CPUs next to very good GPUs. How do we get the best out of a machine like that? Now, we've advocated for moving things to the CPU — we have some advantages there — but what if I have a box with both? Currently I just use my CPU to ship data to the GPU
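The asynchronous, multi-path execution described here can be sketched in a few lines — a toy illustration in plain Python/NumPy, where the layer sizes, the paths, and the threading are all invented for the example (this is not Neural Magic's engine):

```python
# Toy sketch of multi-path, asynchronous execution: several workers share
# one weight structure but traverse it along different paths, so a single
# model yields several answers. All names here are illustrative.
from concurrent.futures import ThreadPoolExecutor

import numpy as np

rng = np.random.default_rng(0)
# Shared weight structure: a stack of small linear layers.
weights = [rng.standard_normal((8, 8)) for _ in range(4)]

def run_path(x, path):
    """Run the input through the layers listed in `path`, in that order."""
    for i in path:
        x = weights[i] @ x  # linear layers only, to keep the sketch simple
    return x

x = rng.standard_normal(8)
# Two different paths through the *same* weights -> two different answers,
# e.g. a full pass and an "early exit" that skips layers 1 and 3.
paths = [[0, 1, 2, 3], [0, 2]]
with ThreadPoolExecutor(max_workers=2) as ex:
    answers = list(ex.map(lambda p: run_path(x, p), paths))

print(len(answers))  # two answers from one model
```

The point of the sketch is only the shape of the computation: the weights are shared and read-only, so the workers need no synchronization, and each path can finish at its own speed.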

### What kind of architecture would make the best use out of a combined system of CPUs and GPUs? [37:24]

— right, that's what my CPU does. But is there a way, or what kind of architecture, would make the best use out of a combined system of CPUs and GPUs?

I think this is really the vision NVIDIA has today, at least for their Grace Hopper architecture. It's essentially this: there will be a CPU and a GPU connected to one another, and the CPU will do all the things that are memory-intensive and the GPU will do all the data-intensive things. The problem with this kind of model — and it's a beautiful model, by the way, I'm not saying anything bad about it; if you really want to build a GPU world, it's a great thing to do — is that how much you utilize your attached GPU has to do with how you write your application, because you need to move the data into and out of the GPU, and that's slow. It's exactly like going to memory: the GPU is not sitting in your caches. If you're on the CPU computing something out of a cache, and suddenly you get a page fault and have to go fetch something from memory — that's the latency the GPU introduces here. So if you're going to design with that, you have to create really good software to pipeline things, and this is at the level of the application, so the application programmer has a big programming task. This is a great solution for large-scale projects — Facebook is going to get a thousand or ten thousand of these, or Google ten thousand or a hundred thousand, and then it's worthwhile to write this kind of complex software. But if you're Joe Company and you have your little thing, I don't think you want to be writing that interface. So it's great for large things — data-center things, big things — but I'm very doubtful this is going to be
effective at the edge, if you can actually utilize the CPU for it. And I will say one more thing: the modern way hardware designers think about it is built-in modules. If you look at AMD's latest architecture, you essentially have these CCXs — so even though the machine has maybe 40 or 50 or 60 cores, they're grouped into groups of eight, and each group of eight is a little piece of the die. I think Intel is shifting in that direction too. Nothing prevents you from making pieces of that die be specialized pieces of hardware, like a GPU — you don't have to have an outside device. So if you ask me what the future is going to look like, it's probably large machines with multiple dies, and on these dies we might have a GPU die, we might have an accelerator. That's more like what I expect to happen, rather than having a massive accelerator on the side.

If we hear about sparsity, and things not being in layers, and so on, then naturally the topic of graph neural networks
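The application-level pipelining burden Shavit describes — overlapping device transfers with compute so the accelerator never sits idle — can be sketched as a two-stage producer/consumer pipeline. The stages below are simulated stand-ins for host-to-device copies and kernel launches; nothing here is specific to Grace Hopper or any real driver API:

```python
# Toy two-stage pipeline: a "transfer" thread feeds a bounded queue while a
# "compute" thread drains it, so the two stages overlap instead of running
# strictly one after the other. In a real CPU+GPU app the stages would be
# H2D copies and kernel launches; here they are simulated arithmetic.
import queue
import threading

def transfer_stage(batches, q):
    for b in batches:
        q.put(b * 2)   # stand-in for copying a batch to the device
    q.put(None)        # sentinel: no more work

def compute_stage(q, results):
    while True:
        b = q.get()
        if b is None:
            break
        results.append(b + 1)  # stand-in for the device-side kernel

q = queue.Queue(maxsize=2)     # bounded buffer = pipeline depth of 2
results = []
producer = threading.Thread(target=transfer_stage, args=(range(5), q))
consumer = threading.Thread(target=compute_stage, args=(q, results))
producer.start(); consumer.start()
producer.join(); consumer.join()

print(results)  # → [1, 3, 5, 7, 9]
```

The bounded queue is the key design choice: it caps how far the transfer stage can run ahead, which is exactly the double-buffering discipline the application programmer has to write by hand in the CPU+GPU model.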

### Graph Neural Networks in connection to sparsity [41:04]

is very close to that, at least in people's imagination. Do you have anything to say about where current graph neural networks stand with respect to sparsity?

Yeah, I would think of graph neural networks as a different kind of thing. I use some graph neural networks in my research, and the idea there is that we can use them to solve graph problems that would otherwise be very complicated to solve by brute force. Now, it's not generally applicable — there are quite a few limitations — but as a tool, rather than thinking about the neural network itself as looking like a graph neural network, I could use graph neural networks to find what we call motifs in a neural network. For example, when we try to look at how brains are structured — when we look at the graphs of brains and try to understand whether there is a motif that repeats itself in the graph — then using a graph neural network is a really nice way to try to find these motifs efficiently, because the problem itself is PSPACE-complete, or actually we don't know — it's graph isomorphism — so clearly we don't know how to do the brute-force algorithm well, but the graph neural network can come to our aid here. So I would say that right now I don't really see a neural network design that is specific to that, or a way that it helps, but in research it definitely works, and we really want to use these networks to help us in

### Intrinsic connection between the Sparsification of Neural Networks, Non Layer-Wise Computation, Blockchain Technology, Smart Contracts and Distributed Computing [43:04]

research.

This might be a bit of a tech-bro question, but if I hear that I can do sparse computation and reduce the FLOPs and so on — is there any intrinsic connection between the sparsification of neural networks, the non-layer-wise computation, and blockchain technology, smart contracts, and distributed computing? Have you ever given this any thought, or is that completely off?

Yeah, look — I think nothing is completely off with respect to machine learning, in the sense that I'm sure machine learning will find its way into all of those areas; it's a matter of time. Right now, all the work there doesn't need the efficiency that machine learning offers, because machine learning in the end is an optimization. But when all these blockchain algorithms become more commonplace and we need to provide them with things like further security, or analysis, and so on, I think then we're going to see applications of machine learning there, and with that, all these things about sparsity are going to appear as well. But for me, the whole story of sparsity is the story of a phenomenon that is very prevalent in nature — and that, surprisingly or not surprisingly, shows up in machine learning. It strengthens my belief that even though the exact computations we're doing are not the same as spiking neural networks and brains, there is a lot of commonality there. The emergence of these similar phenomena — sparsity, pruning, and so on — and the fact that we can get benefits from them, tells me: okay, these are related. I think that's a very interesting point to keep in mind.

### Neural Magic's target audience [45:23]

With Neural Magic, who is your main target audience? Who, listening to this, do you want to let know: we are exactly for you?

So we span the gamut from the data center to the edge. We're just now moving into providing the same properties for Arm architectures, and so I would say the exciting new thing at Neural Magic is that we're moving from doing this for AMD and Intel architectures to doing it for Arm, which means we're going to span all the way to the very bottom of the food chain, if you will. I think this is very exciting, because sparsity has a dual role as you go down the food chain: for the large accelerators, whether the memory footprint is large or small is not that important, but as I go down, sparsity gives me two things. Neural Magic gives you speed, but it also makes the model extremely small, so you're getting a small, accurate model running on a very small device — and this typically is an Arm device. So that's the audience that I'd like to say: hey, we're coming, and we're going to deliver the same things we deliver for Intel and AMD, now for Arm, at the very end, at the very edge.

If you say edge, do you mean smartphones? Security cameras? Robots? Everything — I mean, I'm not going to do everything to start with, but yes, we're aiming in that direction.

And — with the danger that this is going to become a marketing-opportunity question — how easy is it to get started with what you're doing? Let's say I've done my TensorFlow tutorials, I know how to build a model and train it and so on; how much does it take for me to transition, or to apply what you're doing?

Yeah, so you just go to our website, download DeepSparse — our engine — download our ML tooling, and
immediately you just pick a sparse model and transfer-learn onto it with our tools. We have recipes: you have a model, you have a recipe — exactly what you would do if you went to Hugging Face and downloaded a model. You do the same kind of thing: you sparse-transfer-learn onto it, and you're in business. So it's not very hard, and we're working on making it even easier; that's one of our goals, to make it really easy to do this. The advantage, of course, is that people are already busy quantizing their models to get more performance, so this is like quantizing in some sense — you're going to do the same kind of thing
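What sparsification buys, concretely — the "small and fast" dual role discussed above — can be illustrated with plain magnitude pruning and a CSR-style compressed layout. This is a generic NumPy sketch of the idea, not DeepSparse or one of its recipes, and the 90% sparsity level is just an example figure:

```python
# Generic sketch of unstructured magnitude pruning: zero out the 90% of
# weights with the smallest magnitude, then store only the survivors in a
# CSR-like layout. Work and storage now scale with the nonzeros, not with
# the full matrix.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)

# Keep only the top 10% of weights by magnitude.
thresh = np.quantile(np.abs(W), 0.90)
W_pruned = np.where(np.abs(W) >= thresh, W, 0.0).astype(np.float32)

# Pack into CSR: per-row nonzero values plus their column indices.
indptr, indices, data = [0], [], []
for row in W_pruned:
    nz = np.nonzero(row)[0]
    indices.extend(int(j) for j in nz)
    data.extend(float(v) for v in row[nz])
    indptr.append(len(indices))

def csr_matvec(x):
    # Only the stored nonzeros are touched -- roughly 10% of the dense work.
    y = np.zeros(len(indptr) - 1, dtype=np.float32)
    for i in range(len(y)):
        for k in range(indptr[i], indptr[i + 1]):
            y[i] += data[k] * x[indices[k]]
    return y

x = rng.standard_normal(256).astype(np.float32)
assert np.allclose(csr_matvec(x), W_pruned @ x, atol=1e-3)

dense_bytes = W.nbytes                                     # 256 * 256 * 4
sparse_bytes = 4 * (len(data) + len(indices) + len(indptr))
print(sparse_bytes < dense_bytes / 3)  # the stored model shrinks roughly 5x
```

The same compression is why the footprint argument matters more as you go down the food chain: the pruned-and-packed model both computes less and fits in the small memory of an edge device.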

### Is there a type of model where it works particularly well and the type where it doesn't? [48:16]

and get a lot more performance.

Is there a type of model where it works particularly well, and a type where it doesn't? I'm thinking, you know, convnets, recurrent networks, autoregressive models, maybe the big language models — what is it best at?

Yeah, so right now it's best at BERT and YOLO models. We do computer vision and we do language models, but not the large language models — we haven't done large language models yet. So for those types of things — the BERTs and the YOLOs, the variants of EfficientNet and all these guys, visual transformers — these are the things we do right now, and all our technology is available for those. I'd love to do the large models: a CPU is a natural environment for running these giant models — these trillion-or-whatever-parameter models that people talk about splitting across 16 GPUs — they fit on your desktop. So clearly a CPU is a natural place to run a very large model. That will be a target, but not right now.

Very exciting. Are there any last things you want to get out, maybe about Neural Magic or sparsity in general?

Well, our whole machine learning software stack is open source, and we'd love people to come in and help us build better sparsity, use sparsity in their models, and tell us about what they're doing. We have a community, and we'd love you to join it.

Excellent. Nir, thank you so much for being here today; this was very pleasant. Thank you very much. Bye-bye.
