# [Live] Data Professor - Bioinformatics from Scratch Episode 5  [Part 2]

## Metadata

- **Channel:** Data Professor
- **YouTube:** https://www.youtube.com/watch?v=YCQfbm4L8ew
- **Date:** 02.05.2026
- **Duration:** 33:29
- **Views:** 414

## Description

Let's build a bioinformatics project from scratch! Feel free to share your questions and interact with other participants in the chat.

🐙 Code https://github.com/dataprofessor/aromatase

## Contents

### [0:00](https://www.youtube.com/watch?v=YCQfbm4L8ew) Segment 1 (00:00 - 05:00)

All right. Yeah. So, I think the computer kind of crashed, and we're going to resume our session in a few moments here. I'll wait a few moments for all of you guys to join us. So, we're starting soon. All right, I think we're live. The live stream is happening all at once on the Data Professor channel, on my LinkedIn profile, and on the Facebook page for Data Professor. I think the crash happened because we were building a lot of machine learning models at the same time. It was using, I think, almost 100 gigabytes of memory, so it kind of overflowed the machine. We'll probably have to reduce the number of models that we run at the same time, so we're going to compute fewer at once. Let me resume our session in Cortex Code and share my screen. For those of you who are new here, we're doing a live stream, and this is part two of today's stream. Prior to this, we were resuming our session on building a bioinformatics project. Unfortunately, the machine learning models that we were building kind of crashed the system. Very sorry for that, and we're resuming. So, let me share the screen in a few moments. All right. Share entire screen. Cool. I'm going to open up the coding assistant session, and we're using Cortex Code for that. Cortex Code is an AI coding assistant from Snowflake, and it has context awareness for essentially everything to do with data. Snowflake is a data platform, so you could store data, you could build models, you could use AI inference. All of that we're using here in this project. Let's wait a few moments for Cortex Code to update and resume. And for those of you who have just joined us again or joined us for the first time, this is part two of today's live stream.
Because in part one we encountered a technical difficulty. We were running too much model building. We were running 16 different machine learning algorithms, and it crashed the system, I think probably because the memory ran out, and then the streaming software froze. When I was looking at it, it was consuming 90 gigabytes of memory or so. So, it kind of crashed. Yeah. So, maybe we don't do the auto update. Let's try again. No auto update. Yeah, let's probably do

### [5:00](https://www.youtube.com/watch?v=YCQfbm4L8ew&t=300s) Segment 2 (05:00 - 10:00)

that. All right. And then we're going to resume. Let's resume work on ML model building of the aromatase project. It's going to look at the agents.md file in order to get the context, and it's figured out that we have already created the... okay, so it's looking through all of our script files. Let's wait a few moments for it to get its context. All right. So, yeah, we already have the test set data here. Previously, we ran evaluation of the trained model on the training set and the 10-fold CV set, but it crashed. Do we have to start over, or do we have results saved somewhere? Let's see what the AI is going to figure out. This time around we're not going to run it in one entire chunk, but in small chunks. Okay, it's incomplete. Missing CV. So, it seems to say that the training and test runs have finished but the 10-fold CV was not finished. So, let's reassure. So, training set model performance metrics were collected and the runs were completed. Can we save it? Save results separately for training, testing, and 10-fold cross-validation. Yeah. So, let's enter plan mode so that we don't have to iterate through several back-and-forths, and we'll start with the plan that we wanted to do. That is better. Okay. We have a few questions, but I'll answer as soon as we have the plan finished and running. So, in a few moments, I'll answer the questions in the chat. All right. So, here we're going to have the train, CV, and test. Okay, that looks good. But the CV might take some time, right? Would you like Cortex to proceed with this plan? Can we not run it all at once but run in small batches? Otherwise the computer could crash.

### [10:00](https://www.youtube.com/watch?v=YCQfbm4L8ew&t=600s) Segment 3 (10:00 - 15:00)

Okay, so it's going to run. It's probably going to perform the run, and it's exiting plan mode. Let's see if it's going to initiate the run in small batches. Let's get the paper that we were using previously, which was, I think, the [unclear] here. Let's go to materials and methods. I think we were here, at the multivariate analysis. As we were building the classification models in the prior studies, we were using these to evaluate the performance of the models. But here we're also going to build regression models as well. Okay. So in the meantime, I'll answer the question in the chat. We have Nissa, who asked the question... oops, let me extend the time. Yeah, so you mentioned that you're confused about Substructure and Klekota-Roth in the live discussion of episode two: is it the same as MACCS? No, they're not the same. They're different molecular fingerprints. But what is the same is that they're both molecular fingerprints. Is it the same as MACCS or the Morgan fingerprint? They're different. What I mean by saying that they're molecular fingerprints is that they are essentially describing the chemical features of the molecule, and they're describing them in different ways. Most of them are just binary, right? So they're like on/off: the molecule has or does not have the molecular feature, and there are several columns of those. So each molecule is uniquely described by this vector of molecular features. As an example, let's say the Substructure fingerprint is more interpretable, because I think there are like 307 columns, and each column represents a unique molecular feature. For example, does it have a carbon, I mean, a carboxylic acid? Does it have an aldehyde? Does it have an amine group? Does it have a pyridine group? If it has it, then the value will be one. If it does not, then the value will be zero.
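
The presence/absence idea described here can be sketched in a few lines of Python. This is a toy illustration, not the fingerprint code from the stream: the feature names and the hand-assigned group counts are assumptions made for the example, and a real fingerprint would detect these groups by matching substructure patterns with a cheminformatics toolkit. It also shows how a count variant differs from the binary one.

```python
# Toy illustration only: the functional-group counts are assigned by hand.
# A real fingerprint would find them by matching substructure patterns
# against each molecule with a cheminformatics toolkit.
FEATURES = ["carboxylic_acid", "aldehyde", "primary_amine", "pyridine_ring"]

GROUP_COUNTS = {
    "acetic acid": {"carboxylic_acid": 1},
    "succinic acid": {"carboxylic_acid": 2},
    "glycine": {"carboxylic_acid": 1, "primary_amine": 1},
    "nicotinic acid": {"carboxylic_acid": 1, "pyridine_ring": 1},
    "benzaldehyde": {"aldehyde": 1},
}


def count_fingerprint(name):
    """One column per feature; the value is how many times the group occurs."""
    counts = GROUP_COUNTS[name]
    return [counts.get(feature, 0) for feature in FEATURES]


def binary_fingerprint(name):
    """One column per feature; 1 if the group is present, 0 if it is absent."""
    return [1 if count > 0 else 0 for count in count_fingerprint(name)]


for name in GROUP_COUNTS:
    print(f"{name:15s} binary={binary_fingerprint(name)} "
          f"count={count_fingerprint(name)}")
```

Succinic acid has two carboxylic acid groups, so its count vector starts with 2 while its binary vector starts with 1. That is the only difference between the two encodings.
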
Other molecular fingerprints are described by different molecular features. I think there's a SMARTS pattern for each of the features being described, so you'll be able to see what it looks like. If you take a look at this research article that I have on screen here, I think we do mention briefly what they are. Let's see. Do we have Substructure? Yeah. We also use the fingerprints here, and we're using them from the PaDEL-Descriptor software, but in this particular live stream we're creating our own version. So, as mentioned already, one or zero denotes presence or absence, and two of them have a count version. I think Substructure has a count version, and another one, which I can't remember at the moment, has a count version, probably Klekota-Roth. Right. Okay, hope that helps. So, let's see. Now it's saving. Incremental saves: each model's results are appended

### [15:00](https://www.youtube.com/watch?v=YCQfbm4L8ew&t=900s) Segment 4 (15:00 - 20:00)

immediately after it finishes so a crash doesn't lose completed work. Yeah, that sounds good. Okay, I think this looks nice. So it's going to run in batches, with the first batch being models one through five. Or better yet, please have it run a single model at a time instead of batches of five. So we're not going to do batches. We're going more granular, so after each machine learning algorithm it's going to save the results. I hope that it will not accumulate a lot of memory as it computes, but can just dump the results into the results file. Yep. So, it's going to process the models one at a time. If we run all... Will it save the results as soon as each one completes? Just to make sure. We're not going to do it in parallel, otherwise it'll crash the live stream again. This one. The first one. Oh, wait. It says that. Okay. Is this explaining? All right. So, we're running batch one, and in the meantime, let's answer some questions. So Lisa replied: in this research, you want to compare the fingerprints? Yes, that is correct. We're comparing the 12 different fingerprints to see which one provides the best performance. All of them are relatively cheap to compute, and they describe the unique chemical structure pretty granularly. And we have Sheik asking the question: what AI platform do you use for machine learning? I'm currently using my local computer, but for the AI coding assistant here, I'm using Cortex Code. I do have a signup link if you're interested. It'll be in the video description after the live stream, and the link is also provided in prior live streams as well. So yeah, we're using Cortex Code here. All right. So the performance is shown here. Model one has been completed. Okay. Nice. Proceed with batch two to three. Yeah.
We're starting slowly so that we get a feel for how fast it runs, but also because different algorithms will use different amounts of compute. All right. So the second and third models are completed. Performance is not so great. Let's continue with models four through five.
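
The "one model at a time, append as soon as it finishes" approach can be sketched roughly as follows. This is not the script generated during the stream. The file layout, the column names, and the `evaluate` callable are hypothetical, and only the Python standard library is used. The `split` column is one way to keep training, cross-validation, and test results apart. Separate files would work as well.

```python
import csv
import os

# Hypothetical column layout for the results file.
FIELDS = ["model", "split", "metric", "value"]


def completed_models(path):
    """Return the names of models that already have rows in the results file."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="") as fh:
        return {row["model"] for row in csv.DictReader(fh)}


def append_rows(path, rows):
    """Append rows to the CSV, writing the header only for a new or empty file."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)


def run_one_at_a_time(model_names, evaluate, path):
    """Evaluate models sequentially, saving each result before starting the next.

    `evaluate` is a hypothetical callable: it takes a model name and returns a
    list of dicts with the keys in FIELDS (for example one row per split).
    """
    done = completed_models(path)
    for name in model_names:
        if name in done:
            continue  # finished before an earlier crash, so skip it
        rows = evaluate(name)
        append_rows(path, rows)  # on disk before the next model starts
```

Because nothing is held in memory between models, rerunning `run_one_at_a_time` after a crash would only redo the model that was in progress when the process died.
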

### [20:00](https://www.youtube.com/watch?v=YCQfbm4L8ew&t=1200s) Segment 5 (20:00 - 25:00)

So, you know, building machine learning models in the age of agents is pretty easy. Back in the day, or I mean quite recently, you would probably run it in a Colab notebook, or you could create Python scripts to run your model. Okay, here the fourth and fifth models are looking better. So, I hope I answered the question. All righty. So, it's completed. Decision tree is pretty high for the training set. Proceed. And I'm going to look at the resource usage by going to Utilities, going to Activity Monitor, and then looking at memory consumption and CPU. So, yeah, we're consuming pretty high memory here. My laptop has 16 gigs of memory and we're using up 14 gigs already. So if we were to run it in parallel... Yeah, so it's still running. It kind of froze a bit just a moment ago. I hope that we're still good on the live stream. So let me go check the chat. We have AJ. Hey, welcome. Yeah, this is our fifth episode, so we're continuing with the machine learning model building, and we're making pretty good progress here. Hey, welcome Sonia. This is our fifth episode. Did you use a deep learning model? I haven't really used any deep learning models so far, but we could definitely try that in a future live stream. Not sure. Is it crashing? Okay. So, it seems like the camera has some issues. I think it's because of the model building. Let me know in chat if you can hear me. Let me see if I can reshare my screen.

### [25:00](https://www.youtube.com/watch?v=YCQfbm4L8ew&t=1500s) Segment 6 (25:00 - 30:00)

Yeah. Okay. So, you guys can hear me, but the screen is black. I mean, I'm still sharing the screen, and it's still showing, but I guess only the audio part is working. I believe it's because I'm running the full machine learning model training, and some specific machine learning algorithms are probably consuming a lot of memory. So it's batch number eight and nine. Let me see which machine learning algorithms those were. Yeah, as expected. I think it's probably because of extra trees or gradient boost. It's probably spawning a lot of trees as part of the training. Let me see if I can stop the training. Batches eight and nine had some issues. Let me try refreshing. Okay, I refreshed the browser, but I can't turn on the camera. But yeah, with only audio working for you guys today, I think we'll probably conclude the demo part. I'll be happy to answer questions if you have any in the chat. But before that, let me try sharing the screen. All right, cool. I think the screen works now. Do you see the screen? Okay, so we have Rohead mentioning that he could use Codex. What is Mac flow? Oh yeah. So as mentioned already, Sonia mentioned the screen. Okay. But now the screen is not black. That's great. So I think you can see the screen. That's cool. But unfortunately I can't turn on my camera, so you'll probably just see the screen for now. Let me know in the chat if you have any other questions about the project so far. And I think the culprit is here. So I had to quit the model training for batches eight and nine. Batches eight and nine crashed the computer. What algorithm was it? So, I'll probably have to run this after the live stream. Okay. So, I think it was probably doing some parallelism, running jobs in parallel. Okay. Using all CPU cores.
Can we have it use no more than two CPU cores? Because the number of jobs was specified by default to be -1, which is like unlimited, so it'll use all of the CPU cores of the computer.
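
To make the "-1 means all cores" point concrete, here is a small sketch of how an `n_jobs` value in the scikit-learn/joblib style could be translated into a worker count and then capped at two. It assumes the training script uses scikit-learn-style estimators, which the stream does not state outright. The helper names `effective_workers` and `cap_n_jobs` are made up for this example and are not part of any library.

```python
import os


def effective_workers(n_jobs, n_cpus=None):
    """Translate a scikit-learn/joblib-style n_jobs value into a worker count.

    Follows the joblib convention for negative values: -1 is all CPUs,
    -2 is all but one, and so on. None is treated as 1, which is the usual
    behaviour outside a joblib backend context.
    """
    n_cpus = n_cpus or os.cpu_count() or 1
    if n_jobs is None:
        return 1
    if n_jobs == 0:
        raise ValueError("n_jobs == 0 has no meaning")
    if n_jobs < 0:
        return max(1, n_cpus + 1 + n_jobs)
    return n_jobs


def cap_n_jobs(n_jobs, max_workers=2, n_cpus=None):
    """Return a positive n_jobs value that never exceeds max_workers."""
    return min(effective_workers(n_jobs, n_cpus), max_workers)


# On an 8-core machine, -1 would mean 8 workers; capped, it becomes 2.
print(effective_workers(-1, n_cpus=8), cap_n_jobs(-1, max_workers=2, n_cpus=8))
```

The capped value would then be passed as `n_jobs=` wherever the script sets it, typically on the estimator and on the cross-validation call. Not every estimator accepts `n_jobs`, so that part depends on the algorithm.
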

### [30:00](https://www.youtube.com/watch?v=YCQfbm4L8ew&t=1800s) Segment 7 (30:00 - 33:00)

Yeah, and the cross-validation is probably a culprit as well, together with the algorithm. Or maybe just, can we have n_jobs as one? Yeah. Let's see. And then we're going to run batch eight. Yeah, but it's going to take longer. I could run this at full speed after the live stream. But the next part would be to have the results for the training set, the testing set, and also the cross-validation set for this particular molecular fingerprint, called MACCS, which is one of 12. We have 12 fingerprints and we are testing against 16 machine learning algorithms. So essentially we have 12 * 16, a total of 192 machine learning model and fingerprint combinations. That's a lot of models that we're building: 192 models. And that will allow us to see at a glance... maybe we'll create a performance heat map to show at a glance how all of the model and fingerprint combinations are performing. Once we have that, we're probably going to select the best algorithm to use, and then we're going to perform hyperparameter optimization. For that, we're also going to use a lot of compute as well. As for today's episode, I think we made a lot of progress, but unfortunately, because of the heavy compute, it kind of crashed the live stream. So, we'll probably continue in the next episode. Join us tomorrow for the live stream as we continue with our machine learning model building. Between now and tomorrow, I'm going to run the remainder of the calculations. This is only one, right? This is only one fingerprint versus 16 algorithms. So I'm going to run the other 170 or so model combinations. Any final questions before I end the live stream? If not, then thanks for joining us today, and we're going to continue with the next episode tomorrow. So I'll see all of you tomorrow, and until then, happy coding.
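
The arithmetic behind the model count can be checked with a short sketch. The fingerprint and algorithm names below are placeholders, since the full lists are not given in the stream. Only the counts (12 and 16) come from the source.

```python
from itertools import product

# Placeholder names: only the counts (12 fingerprints, 16 algorithms)
# come from the stream.
fingerprints = [f"fingerprint_{i}" for i in range(1, 13)]
algorithms = [f"algorithm_{i}" for i in range(1, 17)]

# Every fingerprint is paired with every algorithm.
combos = list(product(fingerprints, algorithms))
print(len(combos))  # 192

# One fingerprint against all 16 algorithms is the part run in this episode.
done = [combo for combo in combos if combo[0] == "fingerprint_1"]
remaining = len(combos) - len(done)
print(remaining)  # 176
```

So the "170 or so" remaining combinations mentioned above works out to 176.
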

---
*Source: https://ekstraktznaniy.ru/video/49815*