# Imbue - training a 70B model from scratch! (w/ Bowei - head of infra)

## Metadata

- **Channel:** Aleksa Gordić - The AI Epiphany
- **YouTube:** https://www.youtube.com/watch?v=wTE8Dk6I80A
- **Date:** 16.09.2024
- **Duration:** 59:26
- **Views:** 2,659

## Description

Become a Patreon: https://www.patreon.com/theaiepiphany
👨‍👩‍👧‍👦 Join our Discord community: https://discord.gg/peBrCpheKE

Bowei joined us from Imbue to talk about Imbue's latest endeavor: building the infra to support a 70B model training! They wrote up an amazing blog post with a lot of details describing the grind that was needed to set everything up from scratch. :)

▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ 
Blog post: https://imbue.com/research/70b-infrastructure/ (From bare metal to a 70B model: infrastructure set-up and scripts)
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ 

⌚️ Timetable: 
00:00 - 00:45 Intro
00:45 - 02:25 Hyperstack GPUs (sponsored)
02:25 - 11:30 Bowei's background
11:30 - 18:30 More on Imbue, their research, their focus
18:30 - 26:20 Training a 70B model
26:20 - 45:40 Building a cluster from scratch
45:40 - 59:25 Anecdotes, Q&A

▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ 
💰 SPONSOR

The AI Epiphany - https://www.patreon.com/theaiepiphany
One-time donation - https://www.paypal.com/paypalme/theaiepiphany 

Huge thank you to these AI Epiphany patreons:
Eli Mahler
Petar Veličković

▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ 

💼 LinkedIn - https://www.linkedin.com/in/aleksagordic/ 
🐦 Twitter - https://twitter.com/gordic_aleksa 
👨‍👩‍👧‍👦 Discord - https://discord.gg/peBrCpheKE

📺 YouTube - https://www.youtube.com/c/TheAIEpiphany/
📚 Medium - https://gordicaleksa.medium.com/ 
💻 GitHub - https://github.com/gordicaleksa 
📢 AI Newsletter - https://aiepiphany.substack.com/

▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬

#imbue #infra #70b #llm

## Contents

### [0:00](https://www.youtube.com/watch?v=wTE8Dk6I80A) 00:45 Intro

I think we can start. Yeah, Bowei, thanks for joining. Bowei is the head of infrastructure at Imbue — maybe you can introduce yourself and then we can kick it off from there.

Sure, yeah. I'm so glad to be here today with you, Aleksa, and your server. I'm head of infrastructure at Imbue. We're a pretty small AI startup, based in San Francisco, California, in the United States, and I'm here to talk to you today about something that you ML folks may or may not have experience working with — more of the hardware, the bare metal, and the infrastructure, and what goes on underneath the training that happens on these GPUs.

### [0:45](https://www.youtube.com/watch?v=wTE8Dk6I80A&t=45s) 02:25 Hyperstack GPUs (sponsored)

Hey guys, I want to give a huge shout-out to the Hyperstack folks, who have generously sponsored my compute over the past month or so. Basically I got 16 H100s — that's two eight-GPU nodes — and the performance has been amazing, roughly a 2x speedup compared to my A100s. So I want to quickly show you how you can get started yourself; it takes basically three steps. You go to Environments, you create a new environment, you give it a name, and you pick between Canada and Norway — that's the first step. Second step: you go to SSH Keys, create a new pair, pick the environment you just created, give it a name, and paste your public SSH key — that's it. Finally, go to Virtual Machines, deploy a new machine, give it a name, select the environment (let's select Canada here because they have more compute there), then select the hardware you want — for example H100s — select the image, select the SSH key you just created, and hit deploy. That's it — literally a couple of minutes to get started.

It was really easy for me to create those two nodes I just mentioned and get started running LLM trainings. The documentation is super cool — I could actually solve all of my problems just by looking at their docs — and they additionally have a Slack channel where they were super helpful, so I can't recommend them enough. Honestly, the thing they focus on: a lot of GPU providers focus on big enterprises, so oftentimes you can't get on-demand H100s, whereas Hyperstack (NexGen Cloud) focuses particularly on that — you can get top-of-the-line hardware on demand even if you're an individual or a smaller team. They also serve bigger companies, but that's their edge here. So without further ado, let's go back to the talk — I do suggest you check them out — and let's continue.

### [2:25](https://www.youtube.com/watch?v=wTE8Dk6I80A&t=145s) 11:30 Bowei's background

Before we go there, let's maybe just quickly walk through your journey — how you ended up at Imbue — given that some people here will care about landing a job, maybe even at Imbue. So maybe you can share a bit more about how you got there.

Yeah, super happy to. I actually come from more of a full-stack, standard software engineering background. Imbue is my fifth startup. My first job out of university was at an ML/AI startup called Hive — I believe it's still around — where I was doing ML/AI infra. After that I went off and did non-AI infra stuff: I was at three other startups, all pretty small, in the range of 10 to 100 people. Imbue is my fifth startup now; I joined roughly a year ago, and I've been doing infrastructure here for around a year, mostly setting up the infrastructure needed for our 70B model. My educational background before that: I studied pure math at university, and I was on a PhD track before I dropped out to do startups and ML stuff.

Nice. How did you decide — when you're doing a PhD — obviously there are tested names in the industry, like OpenAI or DeepMind or Google, those are battle-tested so to speak. So how did you decide on your first startup while doing a PhD in math? Was it that somebody you knew recommended it, "hey, come here," or was it more proactive on your side?

Yeah — so the question is: when you're doing a PhD, or doing research, how do you decide which companies to apply to and which to accept? At the time I was doing research in something completely unrelated to ML — I was in the math partial-differential-equations space — but I had done some math competitions, and I had a friend who sort of enticed me into startups and software in general, and I had done a little programming in classes before. So I came on at a very junior position, at a place that wasn't even doing ML/AI at the time, but what really struck me about them was that they worked with data at a pretty large scale, which really positioned them well to make the transition into AI. This was back in 2015, when image CNNs were first coming online — AlexNet is the paper that they sort of based their initial model architecture off of.

Nice — long time ago. Yeah, ancient history now for some of these folks. So maybe walking quickly through the decision to go to startups as opposed to big tech — how did you think about it back then, if you remember?

Honestly, I was pretty young and pretty naive about the industry. I wasn't really sure how to get myself into some of those bigger tech roles — I was going to career fairs and such — but I was coming from a non-traditional background, and what I found is that startups really value different things from big tech, who look more closely at résumés, GPA, relevant internships, and coursework. I essentially went on-site to interview with the startup — what was then called another name but is now Hive — I talked to the founder, and the founder gave me a really impressive startup-energy kind of vibe: "we want to work hard, we want to learn." The founder himself was really young and was trying to figure out what direction to best steer the company, so I sort of joined off vibes, pretty much.

Nice, okay — that's how I roll as well. My heuristic for enrolling in electrical engineering was: I'm looking for the hardest thing in my country, and electrical engineering was considered one of the hardest things to do, so I just went that way. It's not always the best heuristic, but it worked out for me. Having said that, unless people in the audience have some questions, we can continue — and by the way, if you have questions, raise a hand and I'll let you speak so we can curate it. Okay, awesome, you can kick it off, Bowei.

Yeah, I'll just continue — and again, please interrupt. I also have a habit, because I used to be a grad student, of restating questions that get asked to me, just to make sure I understand what's being asked and so that other people can hear them as well, so I'll probably be doing that throughout this talk. Anyway, that's a little about me. I can also talk a bit more about why I joined Imbue — oh, sorry — yeah, please interrupt me if it's a question; if it's just a comment you can chat amongst yourselves in the chat. Feel free to ignore the chat. Yes, tell us more about Imbue.

Yeah. So prior to this I was at a full-stack startup doing a mixture of web development and backend — we were building a collaborative whiteboard for engineers — and while I was there I was thinking longer-term: what do I want to do with my career, where do I see myself in engineering? In one of my previous roles I had interacted heavily with one of the infrastructure engineers and the infrastructure manager, and that team struck me as having really deep knowledge and the ability to debug any problem. You're doing some ML training, or you're looking into your web server and it goes down, and you're asking: why is this happening? What docs can I check for these errors? Sometimes you're able to debug it yourself, but sometimes, when you're out of ideas, you have to go further down the stack — down to the kernel level and even the hardware level, sometimes poke through the cloud abstractions — and infrastructure in general struck me as the folks who know how to get those answers. That's one reason I started looking for an infra role.

It just so happens that Imbue, around a year before, had raised a Series B of around 200 million USD from various venture capital folks, and they were interested in scaling up — training larger models and building out and owning their own hardware infrastructure to do it. So it was really great timing for me, because I had just started exploring infra roles — reaching out to some contacts in my network is how I heard about Imbue — and it was also a good opportunity because, again, this was around a year ago, and ChatGPT and generative LLMs were super hyped back then, so it seemed like a good fit. The interview process really convinced me that Imbue was where I wanted to work for a longer time. I really appreciate how thoughtful the interview process is here. For instance, one thing we do that's fairly unique, that I haven't seen in many other places: we have candidates come on-site to our San Francisco office for one to two days, and that way they can actually integrate with the team and figure out what it's like working day-to-day with the CEO, the CTO, and the rest of the technology stack. That's something I personally always wondered about when interviewing at startups — I'm doing all these coding questions, I'm doing some architecture design, but what do you folks actually do day-to-day? I feel like Imbue's interview process really answered that for me.

Nice. And when you say you fly out the candidates, does that mean somebody who's already in the Bay Area, or would you fly out somebody from Asia or Europe?

Oh yeah, we've flown out people from the UK, for instance, or the East Coast of the United States, or other places as well. And we do understand there's a time-zone difference, so we try to give them a day or so beforehand to just adjust and get some sleep if they hadn't slept on the plane. And again, we're around 30 headcount right now at Imbue, so it's definitely startup vibes. There's a wide range of AI startups, and one example here is that we don't really have titles — everyone is technically "member of technical staff", so I get to make up whatever title I want.

How nice. Nice, awesome — thanks for sharing the context there.

### [11:30](https://www.youtube.com/watch?v=wTE8Dk6I80A&t=690s) 18:30 More on Imbue, their research, their focus

Some people actually helped me with these slides. Here's a little more about Imbue — this is our pitch, and you can also find it on our website, imbue.com. We are interested in building AI that can build software, essentially. We recently finished training a 70B model from scratch, so that's mostly what I'm going to center this talk around, but I'm also open to getting distracted and going off on tangents.

That's how I first heard about your work. I actually met your founder at NeurIPS 2022 or thereabouts — I don't think you had raised the 200 million back then — but you've probably pivoted since, because back then you were positioning as an AGI lab, and nowadays you're probably not trying to compete with the OpenAIs of the world. But super cool.

Yeah, we've definitely pivoted since then, you're right. I think one of the charms of being at a small startup is being flexible and lean and open to listening to what the market is telling us. For instance, maybe even a year ago people were honestly a little scared of AGI — "we're building too fast, we're going to get there too soon, there's going to be a singularity." Nowadays I feel the industry in general has settled into the idea that these generative LLMs offer solutions, but there are also a lot of problems with them. One of the problems we're focusing on is reasoning, and the ability to verify accurately whether the code that's generated is correct or not, and it seems like there's still a lot of research to do on that front before we can say these AIs are ready to take over the world and all they need is more GPUs.

Okay. And I understand there's a decent number of researchers in the audience. I asked a researcher to give me the pitch on this just yesterday. One of the reasons the researchers we do have are really happy here: it's a small team, there are lots of opportunities to contribute, and there are a lot of opportunities to discuss ideas amongst the team. One thing we do every week is a paper reading party — we pick a paper we think is relevant to our interests, read it together, discuss, and see which ideas can actually be folded into our day-to-day work next week or next month.

In terms of hiring, we're really interested in people who can be mostly self-sufficient. We don't really have the bandwidth to take on interns, or many fresh researchers who are still finding their feet in the field — though of course we're willing to make exceptions — and we're definitely interested in people who are a little more experienced on the research side. We don't require that you come with a lot of ideas or theory; we're mostly interested in: can you run and replicate experiments, can you make significant progress day-to-day, can you help out the team, pair with other researchers, and make progress on our research directions.

Like I said before, we are particularly interested in the verification process. Given an LLM that generates a snippet of code, how can we train an LLM to be better at distinguishing a good generation from a bad generation, and what is the process for even defining good and bad generations? Can we fold these into unit tests? Can we give the LLM some runtime information? Stuff like this is all within scope. The other research direction we're interested in is adding more chain-of-thought reasoning traces to generations, and the hope with both of these directions is that, regardless of which large open-source foundation model — a LLaMA 3 or GPT equivalent — comes out in the next year or so, these layers can continue to be useful on top of whatever baseline foundation model. And we're obviously willing to fine-tune on top of these as well.

Nice. I've got to say I'm surprised you're not exploiting interns more — for most companies that's just cheap labor: "graduate student descent", as I think Karpathy (or somebody) said.

Yeah, I'd say it's more to do with our management structure and our culture. We don't really think of work at Imbue as being very top-down, where you get told what to do and have to grind it out and ship. We want everyone to feel responsible for the work they do and for the work that other people in their immediate vicinity are contributing, to have more ownership of company direction and be able to make more of their own decisions — as long as we're aligned at a high level with the CEO, the CTO, and the rest of the research team, you have a lot of say in which specific things you think are most viable.

Maybe a quick question: how focused are you on actual products versus research, if you can share that?

I'd say it's changed recently. We used to be very focused on training a big foundation model and doing research there. Recently we've felt the need for more accurate, more realistic training data in order to deliver a good product experience, so in the last quarter we've scaled up the product side. That also means there are more roles open for generic SWE and backend work, and for a mixture between research and product — doing some small-scale experiments and things like that on the product side. So, to your question about the split between product and research: we're maybe a little more on the product side now, but even there, at our roots I think we're a research lab, and we're doing a lot of research and product prototyping on that side as well. Hope that answers your question.

Yep, thanks.

Okay, yeah — I can talk a little bit about the 70B model that we trained.
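Before moving on — as a rough illustration of the "fold verification into unit tests" idea Bowei describes above (not Imbue's actual pipeline), here is a minimal sketch that runs a generated snippet against a small test suite in a subprocess and returns a pass/fail signal that could label good vs. bad generations. The function name, temp-dir layout, and the assumption that `pytest` is installed are all my own.

```python
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

def run_candidate(generated_code: str, test_code: str, timeout_s: int = 10) -> bool:
    """Return True if the generated snippet passes the unit tests (illustrative only)."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_dir = Path(tmp)
        (tmp_dir / "candidate.py").write_text(textwrap.dedent(generated_code))
        (tmp_dir / "test_candidate.py").write_text(textwrap.dedent(test_code))
        try:
            # Run the tests in a fresh interpreter; a zero exit code counts as "good".
            result = subprocess.run(
                [sys.executable, "-m", "pytest", "-q", "test_candidate.py"],
                cwd=tmp_dir, timeout=timeout_s, capture_output=True,
            )
        except subprocess.TimeoutExpired:
            return False  # hangs count as bad generations
        return result.returncode == 0

# Example: label one generation.
good = run_candidate(
    "def add(a, b):\n    return a + b\n",
    "from candidate import add\n\ndef test_add():\n    assert add(2, 3) == 5\n",
)
print("pass" if good else "fail")
```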

### [18:30](https://www.youtube.com/watch?v=wTE8Dk6I80A&t=1110s) 26:20 Training a 70B model

So we pre-trained this from scratch. We gathered our own datasets as well — we mostly use open-source datasets, but we also had some proprietary eval data which we used to tune hyperparameters, and we tuned our own splitting and cleaning algorithms for the large corpora of data we trained on. There are posts on both of those on our website, imbue.com. I'll skip a slide.

I think people might want to know what actually goes into the 70B model. I also got this from someone else, because this actually predates my time at Imbue — we were training smaller, 7B-scale models before I joined, to get us scaled up onto the large cluster. At the time we were based on standard architectures, following LLaMA; we have FlashAttention in there as well, and on top of that a few tweaks to the tokenizer. The hyperparameter search sweep we use is called CARBS — we developed it in-house and we have a paper on it — and we also have some CUDA kernel acceleration code that was put in at some point. I would say most of the code in our codebase on the training side is just scaffolding to get all of this running, debuggable, and effectively scaled on multi-GPU. For instance, around torchrun we have our own wrapper, which we use to inject debug information and to set all the environment variables we need to debug on the NCCL side correctly.

Quick question — you mentioned CUDA kernels. Do you have in-house CUDA experts, or is it mostly "we don't have anybody dedicated to this role, and I picked it up because I wanted to learn it"? How did you go about it?

It's much more the latter. We don't even have a lot of expertise on the CUDA kernel side. We'd read some blog posts saying it's good practice to profile which kernels are being slow, we noticed that one or two particular kernels were slow, and then one person on the team said, "I really want to figure out how this works and see if I can improve it — there seems to be a lot of upside here." He then spent maybe one to two weeks figuring out how to insert the right code there.

How do you think about hiring for those roles? Would you even consider having a dedicated person doing just low-level stuff? I'm thinking out loud — what's the scale of a startup where you can justify having such experts? I guess it depends on how much money they can save you.

Yeah, that's definitely a big factor. Before making a decision we'd want to do some profiling and get a good estimate of the potential upside of that optimization. As for hiring someone with that specific expertise: our general mode of hiring here is that we want people to be broadly capable and also have rather narrow specialties. So if someone has a lot of experience in CUDA, we still want them to be quite competent at other research tasks, in case there aren't a lot of critical kernels to write, or in case we write one that's amazing, speeds up everything, and is exactly what we needed — then we'd want them to shift onto other tasks, like more of the debugging, scaffolding, or resource-allocation and scheduling type code as well.
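Bowei mentions a thin wrapper around torchrun that injects debug information and sets the NCCL-related environment variables. A minimal sketch of that idea might look like the following — the variable choices and the `RUN_ID` metadata are illustrative assumptions, not Imbue's actual wrapper (`NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are real NCCL variables).

```python
#!/usr/bin/env python3
"""Illustrative torchrun wrapper: set NCCL/debug env vars, then exec torchrun."""
import os
import sys

def main() -> None:
    env = dict(os.environ)
    # Turn on NCCL logging so hangs/timeouts leave useful traces.
    env.setdefault("NCCL_DEBUG", "INFO")
    env.setdefault("NCCL_DEBUG_SUBSYS", "INIT,NET")
    # Example of run metadata injected for debugging; this name is made up.
    env.setdefault("RUN_ID", "local-debug")

    # Pass everything after this script straight through to torchrun, e.g.
    #   ./launch.py --nproc_per_node=8 train.py --config conf.yaml
    args = ["torchrun"] + sys.argv[1:]
    os.execvpe(args[0], args, env)

if __name__ == "__main__":
    main()
```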
A related question: where do you find people? Do you look at profiles, see who's contributing in the open-source community, use some websites — what's the flow for you?

Yeah, we have a recruiting team — we can probably talk more about this — but we do a lot of sourcing from our contacts in the research community. We have some graduates from Stanford and various other labs, and we reach out through their connections. We also try to keep track of promising fresh grads and promising researchers. For instance, like you mentioned, you met Kanjun at NeurIPS — I believe we went to NeurIPS last year as well and sort of sent feelers out into the community, trying to figure out who's good, who's working on relevant stuff, and, most importantly, who's interested in joining a small, scrappier startup like us and exploring what we want to explore.

And then, going off on a tangent: one other thing I work on is our internal job scheduler for allocating GPU compute resources. It's not Slurm, which is obviously very popular in the community, and it's not Kubernetes, which I feel is the other big player in the scheduler space. But I'm very happy with it — it was built internally, mainly using Python code that hooks into OS calls, into our Docker containers, and into SSH. It doesn't handle any logs; we have a separate log aggregation system for that, the standard Prometheus/Grafana setup. I've personally worked with both Slurm and Kubernetes, and I'd say our system is better day-to-day — it's more optimized for solving the problems we want to solve.
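Purely as an illustration of the pattern Bowei describes — plain Python orchestrating SSH and Docker directly, rather than Slurm or Kubernetes — here is a toy sketch. The hostnames, image name, and `launch_job` helper are all hypothetical; a real scheduler would also track GPU availability, retries, container lifecycles, and log shipping.

```python
import subprocess

# Hypothetical host list; a real scheduler would query an inventory service.
HOSTS = ["gpu-node-001", "gpu-node-002"]

def launch_job(host: str, image: str, command: str, gpus: str = "all") -> None:
    """Start a containerized job on a remote host over SSH (illustrative only)."""
    docker_cmd = f"docker run -d --gpus {gpus} --network host {image} {command}"
    # Fire-and-forget; the real system would capture the container id and poll it later.
    subprocess.run(["ssh", host, docker_cmd], check=True)

if __name__ == "__main__":
    for host in HOSTS:
        launch_job(host, "registry.local/trainer:latest", f"python train.py --host {host}")
```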
Maybe talking about the 70B model: how many tokens did you use, and, since you were training from scratch, did you experiment with muP?

Yeah, I can talk about both — you asked about token count and muP. I believe it was 200 trillion tokens, and the way we actually determined the right token count, and which data mixes go into those tokens, is through the hyperparameter optimizer and the Chinchilla scaling laws. We trained a bunch of smaller models first: we have a system that runs small models in parallel, one per server, and also medium-sized models over InfiniBand — multi-GPU, multi-server — and we trained a bunch of those in parallel too. Using that, we were able to find the right token count relative to the actual amount of money we're spending on this compute.

You also asked about muP. I think midway through the 70B training — the large, three-month training run — we started looking into muP as preparation for a subsequent run we wanted to do. I think we got it merged into the codebase, but we didn't get a chance to run the hyperparameter scaling and actually test it at scale. Also, I wasn't particularly involved in the muP work, so I don't know too much about it, but it's definitely something we looked into. Thanks for asking.

No worries. I guess the number of tokens was probably 200 billion, not trillion?

Oh, sorry — yeah, okay. And then hopefully I get to talk about the stuff I'm excited about: the hardware and the compute.
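For context on the Chinchilla-style reasoning mentioned just above, here is a back-of-the-envelope sketch. It uses the widely cited rules of thumb (roughly 20 training tokens per parameter, and total training compute of about 6·N·D FLOPs); these are generic numbers, not Imbue's actual budget or what their CARBS sweep produced.

```python
# Rough Chinchilla-style estimate for a 70B-parameter model.
n_params = 70e9            # model parameters
tokens_per_param = 20      # Chinchilla rule of thumb
d_tokens = n_params * tokens_per_param
flops = 6 * n_params * d_tokens   # ~6*N*D training FLOPs heuristic

print(f"compute-optimal tokens ~ {d_tokens / 1e12:.1f}T")
print(f"training compute       ~ {flops:.2e} FLOPs")
# In practice the token budget is tuned against the actual dollar/compute
# budget and data mix, which is what the in-house hyperparameter sweep was for.
```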

### [26:20](https://www.youtube.com/watch?v=wTE8Dk6I80A&t=1580s) 45:40 Building a cluster from scratch

So: this is 511 hosts, around 4,000 GPUs — H100s. They're located in the Northwestern United States, pretty close to us since we're on the West Coast. This is a long-term rental, and I can speak a little more about our motivations here. This was right when H100s were coming online, and we noticed there were a lot of supply issues, so we worked really closely with one of our partners, Voltage Park, to secure a long-term contract with them. Because they were also building multiple data centers at the time, they were able to use that scale and build-out to essentially jump the queue with NVIDIA and get early access to these H100s. We were definitely not the first to get H100s online, but I think we were the first to get them online in the specific server chassis that we use.

This process started around August 2023 and has been ongoing — I think we got the final delivery somewhere around October — and we've been fixing issues and learning how to service an InfiniBand cluster of this size entirely in-house, so that's been a huge learning experience, and it's what I'm most excited to talk about today. We got actual training running smoothly, without tremendous amounts of downtime — around 90% uptime — somewhere around February or March; we basically couldn't start training the large model effectively before then. Around May or June we wrapped it up, and we took the time to clean up all the scripts, open-source some of them, and write up the training process. So if you're interested, go to our blog post — there's a GitHub link — and you can see what goes into all the setup scripts.

Again, I don't know how much background you all have on the infrastructure side of what it takes to set up one of these clusters, so please stop me if I'm using unfamiliar acronyms or things that could use more explanation.

Go ahead — there's no way everybody will understand everything, and we can always Google stuff. So yeah, go ahead.

This was a huge learning process for us as well. We got access to these hosts as bare-metal hosts. We have just short of physical access to the data center — which we don't have ourselves, but we have a relationship with a company that does — and we have access to the management consoles on all these devices, and that's about it. They didn't actually ship with any OSes; they shipped with SSDs and drives, but we had to set those up too, and I'll talk about that.

This is our hardware spec: the H100s, pretty beefy CPUs, one terabyte of RAM per machine, and around 10 terabytes of disk space. These are all networked using InfiniBand, and InfiniBand is the technology that allows really fast, direct GPU-to-GPU communication. Since these GPUs send so much data, it's not efficient for the eight GPUs to all send data up to the CPU and then out through one Ethernet connection; instead, each GPU has its own dedicated 400-gigabit cable that goes up into a switch.
All 4,000 GPUs are then networked together in the InfiniBand network architecture, and that's a separate network from the Ethernet, which is what we use to SSH and to pass around the training data and so on.

Here's an overview of the different tiers of the InfiniBand topology. We were given this in a table format — basically you can think of it as an incidence graph: one node on this switch, one node on that switch, and a table of which connections go where — so we had to work out a high-level understanding of the topology ourselves, which was definitely a learning experience as well.

Quick question here: how much of this network topology is just off-the-shelf? Did you have to think through multiple design options and pick one, or did you have to do it completely custom? Maybe walk us through it — I'm not familiar with this.

Yeah, so the overall design was given to us, and then we essentially had to verify it and make sure that the cabling on the ground actually matches the design — which is something we did write a small script for. But this is a pretty standard topology, recommended by NVIDIA. The one in particular we use is called a rail-optimized fat tree. What that means is that in a block of 32 hosts — that's 256 GPUs — all of the GPUs at index zero are connected together and then directly to the upstream leaf switch, and similarly, in that block of 32 hosts, the 32 GPUs at index one are all connected together. We call each of these indexes a rail: GPU zero on all the hosts is rail zero. This makes it so that all the GPUs on rail zero can communicate with each other at lower latency, because there's one fewer hop in the network than, say, GPU zero on one host talking to GPU five on another host.

One of the reasons the network is designed like this is a technology on the hosts called NVLink, which is basically another path for the data. So if data on GPU zero on one host wants to go to GPU five on a different host, it has sort of two choices: it can go through the full InfiniBand network, all the way up to tier four at the top of the tree, or it can use NVLink on one of the machines — hop over to the GPU on the matching rail and take the shortcut that only goes one step up in the tree and one step down.

How are those routing decisions made? Which part of the software stack handles making those decisions?

So — what part of the software handles the routing, and how is it determined, when one GPU wants to send data to another, how it actually gets through the network? NVIDIA has provisioned us with the standard software here, called UFM, Unified Fabric Manager. We have one dedicated non-GPU server that runs it; it talks via InfiniBand to the management interfaces on all the InfiniBand switches, and each of those switches has a little routing algorithm that says, "I've received a data packet from this GPU destined for that GPU — where do I send it?" And on the GPU side, as I understand it, the NCCL all-reduce algorithms are aware of this topology and try to make the same-rail GPUs talk to each other slightly more.
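To make the "rail" idea concrete, here is a tiny illustrative helper — my own toy model of the topology described above, not Imbue's verification script — mapping a (host, GPU index) pair to its rail and to a hypothetical leaf-switch label within a 32-host block.

```python
HOSTS_PER_BLOCK = 32   # 32 hosts x 8 GPUs = 256 GPUs per block, as described above
GPUS_PER_HOST = 8

def rail_and_leaf(host_id: int, gpu_index: int) -> tuple[int, str]:
    """Return (rail, leaf-switch label) for a GPU in a rail-optimized fat tree.

    Every GPU with the same index inside a block shares a rail and hangs off
    the same leaf switch; the switch naming here is purely hypothetical.
    """
    assert 0 <= gpu_index < GPUS_PER_HOST
    block = host_id // HOSTS_PER_BLOCK
    rail = gpu_index
    return rail, f"block{block}-leaf{rail}"

# GPU 0 on host 3 and GPU 0 on host 17 share rail 0 and a leaf switch (one hop apart);
# GPU 0 and GPU 5 on different hosts sit on different rails and need extra hops.
print(rail_and_leaf(3, 0), rail_and_leaf(17, 0), rail_and_leaf(17, 5))
```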
Got it — so unless you're working at a very low level, at the NIC level or below, you don't really care; you're abstracting the hardware away completely?

Absolutely — and this is why a lot of ML researchers don't really see this day-to-day. They train on top of NCCL: the CUDA drivers call into the NCCL bindings, I believe, so when you're spinning up a training run you don't see any of this; it's all abstracted. But physically this data has to go through these wires at some point, and it matters very much to us which of the wires fail, if any, and where they sit in the network. At the lowest level, there is redundancy only in the sense that traffic can go via another GPU on the same host — but the link from each GPU up to its leaf switch is unique. So if that cable fails, we fail pretty fast: it results in a fairly obvious degradation, because the other seven links often don't have enough bandwidth for the kind of training you want to do on a 70B. In contrast, the links between tier two and tier three, or tier three and tier four, are optimized for throughput, and there are many parallel links there. If one of them goes down, it's detrimental in the sense that the routers have to remap the network and reroute traffic around it, which incurs latency costs, but it's much less harmful to the overall bandwidth of, say, one specific machine.

I can also talk about what this looks like during training. There are the InfiniBand folks at the low level who will tell you "this cable is down" or "this link is flapping" — flapping just means it's sometimes up and sometimes down, depending on when you look — and then there are the ML researchers who care about their training benchmarks: seconds per step, MFU (model FLOPs utilization), GPU utilization. We're a small team, so there's a lot of interconnect between the people in charge of both, and I in particular can speak to what happens if a link goes down — how you can tell just from the MFU curve.

One thing that has happened to us quite a few times: we kick off training, there's the five-minute startup where it initializes the weights on all the GPUs and negotiates the NCCL initialization, it trains for a bit, and then NCCL randomly hangs. Usually you'll see it's not doing any training — it's stuck in one of the training steps, in the backwards pass for instance — and if you fetch the NCCL debug information about what it's doing, it's just waiting: one GPU is waiting to receive data from another GPU, but the other GPU claims it already sent it, and the data is lost in the network. It may be a fairly large tensor, and you ask, "can you tell me which exact bit of the tensor was lost?" — and no, it just hangs for five minutes, and it will generally continue hanging until it hits the NCCL timeout, at which point it crashes with a CUDA error that trickles up through torch, and you're left wondering: how do I debug this?
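A minimal, hedged sketch of the two knobs implied here — turning on NCCL's own logging and shortening the collective timeout so a hang surfaces as an error instead of a long silent stall. The specific values are illustrative; `NCCL_DEBUG` is a real NCCL variable and `timeout` is a real argument of `torch.distributed.init_process_group`. This is meant to run under torchrun, which provides the rank/world-size environment.

```python
import os
from datetime import timedelta

import torch.distributed as dist

# Ask NCCL to log initialization and transport details; when a rank hangs,
# these logs (plus the InfiniBand fabric logs) are what you correlate.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,NET")

# Shorten the collective timeout so a lost packet shows up as a crash you can
# restart from, rather than a multi-minute hang. Value chosen for illustration.
dist.init_process_group(
    backend="nccl",
    timeout=timedelta(minutes=5),
)
```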
Generally the answer is: you have to look at the InfiniBand fabric and figure out which link went down at the wrong time and caused one of the data packets to get dropped. The NCCL version you're using probably has no recovery method for a tensor falling out of the network in the middle of a synchronized all-reduce. So this was a pretty big pain point for us during training — not the first pain point we had with the InfiniBand network, but definitely one of the most painful later in the training process.

Why do you think the timeouts are set by default to a crazy big number, like 30 minutes? Did you ever observe a situation where a packet arrived 29 minutes late?

No — but one thing people do is, in some instantiations of the training loop, they'll do an NCCL all-reduce, and in the middle of that there'll be an upload of your weights: you're on step 20,000, you want to save a checkpoint, so you save it up. My understanding is that some training frameworks try to do that in the middle of the NCCL work. That's one reason it's actually important not to set the NCCL timeout too low: if you have a bad Ethernet uplink, you try to upload your checkpoint to S3, and you have very high NCCL parallelism — so you're essentially doing another forward or backward step while uploading checkpoints — then the one or two machines responsible for uploading the checkpoint will slow down the rest of the NCCL group they're in, and you'll hit the NCCL timeout. It times out the specific low-level NCCL op it's doing, regardless of what's causing the slowdown, whether it's something in the hardware or something you're doing in the software stack.

That makes sense. It's just that you have a machine that costs many millions of dollars waiting on a cheap Ethernet uplink — so I still think you should probably fix the uplink rather than increase the timeout.

Yeah. One of the first things we did, for instance: we have these 500 hosts, and when we save weights, the weights are sharded, so we tell each individual server to save its own shard. But we try to save those locally to the local disk first, and we have an async process in Python — basically another thread — that picks them up and sends them to shared storage, or to S3, or wherever you need them. We definitely prefer shared storage inside the data center, since that's significantly faster in terms of Ethernet bandwidth than going up to S3, because, just due to the nature of our cluster, the 100-gigabit link we have to the public internet is shared among all the nodes in the data center. Whether it's one node trying to max out the uplink or 500 nodes all trying to push as much data as they can through the pipe, the pipe is only so big, and you're not going to have a good time.
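A minimal sketch of the "save locally, upload in the background" pattern described above — my own illustration, not Imbue's code; the directory paths and shard naming are hypothetical.

```python
import shutil
import threading
from pathlib import Path
from queue import Queue

import torch

LOCAL_DIR = Path("/local_ssd/checkpoints")    # fast local disk (hypothetical path)
SHARED_DIR = Path("/shared_fs/checkpoints")   # shared storage inside the DC (hypothetical path)

_upload_queue: Queue = Queue()

def _uploader() -> None:
    """Background thread: copy finished shards to shared storage off the critical path."""
    SHARED_DIR.mkdir(parents=True, exist_ok=True)
    while True:
        path = _upload_queue.get()
        shutil.copy2(path, SHARED_DIR / path.name)
        _upload_queue.task_done()

threading.Thread(target=_uploader, daemon=True).start()

def save_shard(state_dict: dict, step: int, rank: int) -> None:
    """Write this rank's shard to local disk, then hand it to the uploader thread.

    The training loop only blocks on the fast local write; the slower copy to
    shared storage (or S3) happens asynchronously.
    """
    LOCAL_DIR.mkdir(parents=True, exist_ok=True)
    path = LOCAL_DIR / f"step{step:08d}_rank{rank}.pt"
    torch.save(state_dict, path)
    _upload_queue.put(path)
```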
These are all details that matter when you're training a network of this size, and they also depend on your specific hardware deployment. Maybe you'll be working somewhere else, at another company, and they have a really good Ethernet uplink but not as good bandwidth on local storage; or maybe the local storage has good bandwidth but a relatively high latency cost, and that latency then runs up against whatever Python S3 library you're using. So it benefits a lot from knowing exactly where in the stack, and which hardware components, are the bottlenecks.

There's a question from Suraj: he's curious whether there was any specific reason to write your own scheduler instead of using a subset of features from either Slurm or a lightweight Kubernetes.

Yeah — so the question is basically: why didn't we pick the good parts of Slurm and Kubernetes rather than writing our own? This was a decision that honestly predates me a little, but one upside I can see of working with our own system is that we can tune it to do specific things that we want and that are relevant to our cluster and how we want to use it. For instance, we use Docker images, and we don't want to pull all of them down through the 100-gigabit pipe we have to the public internet, so we integrate with a piece of software called Kraken — k-r-a-k-e-n, on GitHub, written by Uber, open source. It essentially uses a BitTorrent-like mechanism to share the exact same Docker image across the entire network, and it was relatively easy to integrate into our scheduler, so that when you schedule a new job — you've just built a new image with a bug fix, pushed it up, and want it pulled down on all 500 hosts — it happens as fast as possible, which keeps the development cycle quick. That's obviously also possible in Slurm or Kubernetes, but since we wrote the scheduler and we're more familiar with it, it's easier to do on our end.

Okay, so going back to how this all started. One thing about the cluster is that we received the machines in chunks of 128, and the first thing we do after getting access to the management consoles is provision the OS on them. We use software called MAAS — open-sourced by Canonical/Ubuntu, it stands for "Metal as a Service". We connect to the management controller, we boot the machines over PXE boot — essentially sending the OS image over the internal network out to all these machines — we give them IP addresses via DHCP, and we try to boot them into the OS.

At the beginning there were a lot of boxes that simply failed to boot. This was just due to hardware issues: these boxes have to be physically shipped from somewhere and installed, and there's a certain failure rate when you plug in all the cables and power things on for the first time. These machines have a lot of internal components that are relatively susceptible to jostling, whether that's in the delivery truck or when they're getting lifted up into the server rack and slotted in. So our first step was to diagnose all the broken machines from the hardware perspective, make sure the OS is flashed onto them, and make sure the essential NVIDIA driver, Mellanox driver, and InfiniBand software
stack are all installed. At that point the nodes are ready for single-node GPU training, and that's what we used them for in the beginning, before we had validated that the InfiniBand fabric was working. That means things like running individual dev notebooks — Jupyter notebooks; we obviously don't have that many — and also running some small hyperparameter-tuning experiments, stuff like that.

### [45:40](https://www.youtube.com/watch?v=wTE8Dk6I80A&t=2740s) 59:25 Anecdotes, Q&A

I guess I'm actually going to skip through most of my slides and try to talk more about some of the experiences we had setting these up.

I think that's great — and by the way, can you share the slides after the meeting? Somebody already asked in the chat.

Oh yeah. Most of the slides in the middle were honestly taken from our infrastructure write-up, which is available on our blog, so I want to use this time to get you folks more insider info.

So, in the very early days we were asking: why do these boxes have so many hardware issues? One thing we found out is, first, that this is kind of standard for GPU deployments at this scale. It's new hardware, and the design of the server boxes themselves is non-trivial, so we got on calls with basically everyone involved — Dell, NVIDIA, Voltage Park, and the folks who are physically hands-on in our data center — to try to understand which components are most likely to fail, which components are important to check more often and can fail on their own, as opposed to components that are more likely to fail only on power cycles.

One of the other things we invented back when we were provisioning the OS, and that we're still doing to this day: there's a command on Linux (Ubuntu) called dmesg. When the server first boots up, it records diagnostics of every boot operation, kernel operation, and setup step — every piece of hardware that produces any diagnostics around boot shows up in dmesg. Because we expect all of these machines to be exactly the same, we know roughly what the allowed dmesg errors and warnings are — there are obviously a lot of errors and warnings in there that aren't particularly relevant to our training. So we went through and cataloged each and every one of these boot messages: has this one been seen before? Yes. Was it seen during a good training run? Yes. Okay, then we can ignore it. And if anything new shows up, we say: something appeared on this machine that we don't expect — let's look into this machine a little more and run some extra diagnostics, because it's doing something it shouldn't. This is also extremely specific to us: it's maybe one of the health checks that we either didn't share publicly, or did share but that won't really work for you as-is, because it interfaces with every single piece of hardware in your stack. If you have slightly different hardware from a slightly different manufacturer, it's not going to match up against the specific allow-list we built, so it's something you folks would probably want to rebuild if you run into this issue — again, it's a custom stack.
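As an illustration of the dmesg allow-list idea — not Imbue's actual health check, whose real allow-list is hardware-specific, as Bowei notes — a minimal sketch might look like this. The allow-list patterns are invented, and `dmesg` typically needs root privileges.

```python
import re
import subprocess

# Hypothetical allow-list: regexes for boot messages previously seen on
# known-good machines during good training runs.
ALLOWED_PATTERNS = [
    re.compile(r"ACPI: .* ignored"),
    re.compile(r"usb \d+-\d+: device descriptor read"),
]

def unexpected_dmesg_lines() -> list:
    """Return error/warning boot messages that are not on the allow-list."""
    out = subprocess.run(
        ["dmesg", "--level=err,warn"], capture_output=True, text=True, check=True
    ).stdout
    return [
        line for line in out.splitlines()
        if not any(p.search(line) for p in ALLOWED_PATTERNS)
    ]

if __name__ == "__main__":
    lines = unexpected_dmesg_lines()
    if lines:
        print("unexpected boot messages, run extra diagnostics:")
        print("\n".join(lines))
```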
Yeah, I'm going to go back to the overview. So after OS installation, after we'd rooted out all the OS and hardware issues, the InfiniBand cluster — the fabric health — is the next thing we check. We learned about the InfiniBand technology by talking with the NVIDIA folks, and there are a couple of low-level tests we always run first, which are basically bandwidth and latency tests on each of the cables. This is something we used to do manually in the beginning: okay, this box is looking sort of funny; it has eight InfiniBand links, so let's run what is essentially an internet speed test, but configured over the InfiniBand cables, to make sure each cable is operating at the expected speed. If they're all running at around 400 gigabits per second, that's good; if not, there's an issue.

Honestly, it's a bit of an art deciding when is "too low", because these cables are rated for 400 gigabits, but in production they're generally not expected to run at the full 400, and especially during training it's not always required to run them at maximum. So we're pretty happy with anything above roughly 360 gigabits, and sometimes even lower is fine — somewhere in the vicinity of 300 gigabits isn't ideal, but if it only happens once in a while, maybe it can be attributed to the switch on the other end being a little busy or not as responsive as we'd like. We went through a variety of thresholds: if you set the threshold too high, you get false positives — "hello, we're literally using this node in training, it just came out of a good training run that we have good MFU numbers for, and the tests are telling us its InfiniBand bandwidth is too low and it should be taken down." That's not ideal; we're going to run out of nodes if we keep taking them out like that. So it's definitely more of an art than a science there.
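A toy version of that thresholding logic, using the numbers mentioned above. The per-link measurements and `mlx5_*` device labels are hypothetical; the actual measurement would come from a perftest-style bandwidth test between the two ends of each link, which is not shown here.

```python
def classify_link(measured_gbps: float) -> str:
    """Rough health classification for one InfiniBand link, per the thresholds above."""
    if measured_gbps >= 360.0:
        return "ok"                    # comfortably within the expected production range
    if measured_gbps >= 300.0:
        return "degraded"              # tolerable if rare; maybe a busy peer switch
    return "flag-for-maintenance"      # well below rating; investigate the cable/switch

# Example sweep over hypothetical measurements for one host's 8 links.
measurements = {f"mlx5_{i}": g for i, g in enumerate([398, 391, 402, 355, 298, 400, 388, 396])}
for link, gbps in measurements.items():
    print(link, gbps, classify_link(gbps))
```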
What else... I'd also like to talk about the garbage-collection-related MFU droop. This was probably the first issue we ran into when we started scaling up from small 7B models to the 70B model. The first run we saw it on was, I believe, on 80 machines — we hadn't run a multi-machine InfiniBand run of that size yet — and we saw this really weird behavior: it would start off with good performance, good step timings, each training step going pretty fast, and then gradually over time it would degrade to around 70%. I can't draw it here, but it would gradually degrade, and then we'd have to take it down and restart the training from the latest checkpoint, and when we did, it would magically go back up to 100% of the MFU we expected to see. Why is this happening? Our first thought was obviously something physical — this smells heat-related; maybe the GPUs and the InfiniBand cables are getting too hot in the data center when we run, and that's causing the performance degradation. That's pretty difficult to test, but one thing that really made us suspicious: right after we saved a checkpoint is when we would see the biggest increase in variance in how much the MFU was degrading, and every time we saved a checkpoint it would get slightly worse, whereas in between checkpoints the droop was roughly constant. So what could be causing this?

We debugged all over the stack. We ran the Nsight profiler, we ran py-spy — Python thread spying — to see where Python was spending its time. We also took out real data: we wondered whether this was a data-loader issue, some bottleneck in how fast data was being fetched off disk, so we disabled that, and we made sure the random seed for data shuffling was entirely deterministic. Eventually someone made the really astute observation that these machines might be getting slightly more and more out of sync with each other, because we would profile each individual machine and see that one machine was being slow — but the next time around it would be some other machine, which is really suspicious. And eventually someone thought of garbage collection: maybe it's the Python garbage collector — maybe one machine randomly decides to do garbage collection while the other machines are waiting on the NCCL all-reduce. When that happens, that one machine slows everything down, and as the run progresses, if you do something like checkpointing — which passes a lot of data through Python memory — there are many more opportunities for machines to do garbage collection at random times and gradually desynchronize the all-reduces as the run goes on. So we put in a custom environment variable to disable garbage collection and only do it deterministically when we need to, and that actually fixed the MFU droop. We were just so happy when we figured that out.
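The fix Bowei describes maps to a well-known pattern: turn off Python's automatic garbage collector in the training processes and trigger collection deterministically, so every rank pays the GC cost at the same step. A minimal sketch — the step interval and the environment-variable name are illustrative, not Imbue's actual values.

```python
import gc
import os

# Gate the behavior on an env var so it can be toggled per run (the name is made up).
if os.environ.get("DISABLE_AUTO_GC", "1") == "1":
    gc.disable()  # stop automatic collections that fire at random times on each rank

GC_EVERY_N_STEPS = 1000  # illustrative interval

def maybe_collect(step: int) -> None:
    """Run garbage collection at the same, deterministic step on every rank,
    so no single machine stalls the NCCL all-reduce while the others wait."""
    if step % GC_EVERY_N_STEPS == 0:
        gc.collect()

# In the training loop:
#   for step, batch in enumerate(loader):
#       loss = train_step(batch)
#       maybe_collect(step)
```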
That was an incredible find — that's a tricky one. There's one question from S. — maybe you can read it; it's a bit longer.

Oh yeah, sure: has there been incompatibility due to the OS, firmware, Mellanox, and drivers coming from different sources — did you need a patched driver to continue? Ah, yes, we did, in the very beginning. We had an issue with the latest NVIDIA driver that we started the training on, and we had to go to NVIDIA support and ask them why this driver seemed incompatible with the other side of our stack, and they just told us: "yeah, that driver is a little broken on your specific hardware stack — either go up or go down." We actually ended up going down, and we've been pinned on that driver version ever since. Maybe that's not what you want to hear, but that's what's working for us. That goes to one of the points we raised in the blog post: change as little as possible at any given time. If it works with one driver, stay on that driver for as long as you can, make sure you know the rest of the errors in your training stack, and then, once you're confident you're experienced with the usual sources of errors, you can try upgrading the driver — and be ready to roll back if necessary. We've also had this happen with other components in the firmware stack: there may be a BIOS firmware we need to upgrade, and we always make sure to do it on half of the cluster first, run some basic profiling — essentially run the 70B training for around 100 or so iterations and make sure it attains the same training speed as the other half of the cluster — before we spend the time to upgrade the entire cluster to the same version. And if it doesn't pass that test, we just roll it back down and notify NVIDIA or our vendors, or look around for what on earth is going on with that incompatibility.

Are you even a serious infra engineer if you don't know the driver version when somebody wakes you up in the middle of the night? Those just have to become your friends.

All vendors will always tell you to upgrade their drivers to the most recent version, and I would take that with a grain of salt. Sometimes they're not as reliable as they think they are, and it's up to you to verify that the most recent driver version actually fixes your problem and doesn't introduce other problems elsewhere in the stack.

I think we're out of time — maybe if you have an interesting one from the last slide that you want to share, and then we can slowly wrap it up.

Yeah, this is maybe a little embarrassing, but something that has happened to us: we got stuck for a week, twice, on two separate, unrelated configs that we had hardcoded in environment variables. One time it was the CUDA visible devices not being set properly, and the other time we were literally running with the wrong architecture and trying to match up the MFU numbers against two obviously different configurations. At the time we were so confused — what is going on? — and we didn't think to check. So one learning from us: try to make sure all your runs are as reproducible as possible. We obviously pin the git versions, but these two configs in particular were not being loaded from git or from local flat files, so we lost an unfortunate week there, being very confused and thinking we were going crazy. And no, it's not us, it's just the configs — just remember to check the configs.

Awesome. Bowei — by the way, am I butchering your name? Bowei or Bo? — That's correct, yeah. — Awesome, cool. Thanks for sharing the struggle; the struggle is real. I guess most folks will never have to build their own cluster. I did have my share of dealing with some of these low-level details — in the llm.c framework that I've been working on with Andrej Karpathy and a couple of other folks, the lowest we went was NCCL, so I never had to actually go below that and install the OS on separate machines and do everything you guys had to do. So it was really interesting to read through the blog, and also to have you here on the server.

Yeah, thanks for having me, Aleksa.

---
*Source: https://ekstraktznaniy.ru/video/49220*