[Live] Data Professor - Bioinformatics from Scratch Episode 5 [Part 1]
38:59

Data Professor 02.05.2026 331 views 7 likes

Video description
Let's build a bioinformatics project from scratch! Feel free to share your questions and interact with other participants in the chat.

Table of contents (8 segments)

Segment 1 (00:00 - 05:00)

Hello, and welcome to the live stream. Let me wait a few moments before we get started with the Bioinformatics from Scratch live stream. This is going to be episode 5, so let me take a look at what we've done so far. Here we have three AI coding assistant sessions open, which we ran in parallel, and we also have the research article here, which we're using as a kind of template or guide to follow.
If you're just joining and this is the first live stream you've seen from this channel: we're essentially building a bioinformatics research project live, and we have already compiled the bioinformatics data in the prior live streams. In the first episode, we manually compiled the data through the graphical user interface in the internet browser. In episode 2, we used a programmatic approach: we took the API docs page for the ChEMBL database and fed it into the AI coding assistant so that it could compile the data on our behalf. In the third episode, we performed data curation and prepared the non-redundant data set by removing all of the redundancy in the data, and we started on a little bit of the exploratory data analysis. In the fourth episode, we performed exploratory data analysis, or EDA, by building a Streamlit app.
In the third and fourth episodes, right here, we also created our own Python implementation that lets us compute the descriptors fairly quickly. The original implementation was in Java, and I think the estimate was about 3 hours to complete the computation for one of the molecular fingerprint types. With the Python implementation, we can complete it, I think, more than 10 times quicker: from 3 hours to only 3 minutes. Let's have a look here.
Right. In the fourth episode, we also prepared the class column: we took the pChEMBL value, which is the dependent variable that we're using, and through this pChEMBL criterion we categorized each molecule as inactive, active, or intermediate, depending on which of the following ranges its pChEMBL value falls within. The pChEMBL is essentially the negative log-transformed value of the IC50 and also the Ki, which we got from the ChEMBL database. This is the breakdown: the number of inactives is almost 1,400, the number of actives is almost a thousand, and the number of intermediates, which are between 6 and 7, is less than a thousand, at about 900. And 483 we probably won't use, because they don't have a pChEMBL value. Essentially, these values will allow us to build machine learning models. And if we're using the

Segment 2 (05:00 - 10:00)

classes here, we're going to build classification models. And if we use the numerical pChEMBL value, we're going to build regression models. So there is ample opportunity for us to build these two types, the regression and also the classification models.
Let's see the third AI coding assistant session that we've used. Yeah, this is the EDA app that we built. Let me show you if it's still running. Okay, right here. Yep, it's still running. This is the EDA app that we built using Streamlit, and it's essentially an EDA dashboard. It tells you how many compounds you have, how many unique assays the compounds came from, how many research publications they came from, and also the year span of the data set, which was probably taken from the publication information. We've also applied Lipinski's rule of five to see how many compounds pass it versus the total number of compounds. These are the ranges of the pChEMBL value, and this is the median value. We also have the pie chart here, I mean the donut chart. And based on this other classification, I think this is different from what we have: they have very potent, moderate, and weak. But it's also worth trying this classification as well. And these are the other ones. I think we've already covered this in the fourth episode.
So now we're going to move on to here. Toward the end of the fourth episode, we created two data split algorithms for us to use: Kennard-Stone and the random split. We're essentially going to split our data 80/20, so 80% will be used for training the model and 20% will be the test set. So let's continue. But before continuing, if you have questions, please drop them in the chat. Okay, we have YoYoEdits123, and you're studying to become a data analyst. Okay. And please reply.
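As a minimal sketch of how the pChEMBL value and the class column described above could be derived: pChEMBL is the negative base-10 logarithm of a molar potency such as IC50 or Ki, so a value reported in nanomolar converts as 9 minus log10 of that value. The function names and the exact handling of the 6 and 7 boundaries are assumptions for illustration, not the code used on the stream; only the 6-to-7 intermediate band comes from the discussion.

```python
import math


def pchembl_from_nm(value_nm):
    """Convert a potency in nanomolar (e.g. IC50 or Ki) to -log10 of the molar value."""
    if value_nm <= 0:
        raise ValueError("potency must be positive")
    # -log10(value_nm * 1e-9) == 9 - log10(value_nm)
    return 9.0 - math.log10(value_nm)


def activity_class(pchembl, low=6.0, high=7.0):
    """Assumed cut-offs: below `low` is inactive, `low` up to `high` is
    intermediate, `high` and above is active. Missing values stay unlabeled."""
    if pchembl is None:
        return None
    if pchembl < low:
        return "inactive"
    if pchembl < high:
        return "intermediate"
    return "active"


# Illustrative potencies only, not values from the data set.
for ic50_nm in (50.0, 500.0, 5000.0):
    p = pchembl_from_nm(ic50_nm)
    print(f"{ic50_nm:>7.1f} nM -> pChEMBL {p:.2f} -> {activity_class(p)}")
```

The class labels would feed a classification model, while the numeric pChEMBL value would be the target of a regression model.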
Drop your questions in the chat and I'll try my best to answer all of them. We're probably going to answer questions while we're waiting for tasks to complete. Let me share my screen. One moment. All right, these are some of the ongoing sessions that we've had so far. Let me do a quick check, and then I'm going to look at the folder where we created the molecular fingerprints. We calculated 12 different types of molecular fingerprints, and we're essentially going to use them to describe the molecules, which will then be used for building the machine learning model. So the molecular fingerprints will serve as the independent variables, the X variables, and the pChEMBL will serve as the dependent variable, or the Y variable, which is what we'll be predicting. So, if you think of it in mathematical

Segment 3 (10:00 - 15:00)

terms, you have Y equals f of X, where f of X means a function of X. X would be the descriptors, which describe the molecular features of the compound for which we're predicting whether it is active or inactive. A molecule that is bioactive will have specific features, and those that are inactive will have their own distinct features. That will allow us to figure out whether a compound is active or inactive based on the molecular features that we've computed.
As you'll see in prior episodes, we created the scripts here using our AI coding assistants. We've also done the EDA, and these are the fingerprints that we have calculated so far. We have the atom pair, and we saved it as a CSV file. I think we have 12 of these. Yeah, 12 of 50, with 50 being the entirety of our project folder. And here we have the agents.md, which collects all of the rules that we have for our project. The agents.md was created automatically by us telling the assistant to write down all of the rules or fixes that it figured out when creating our workflow. So, whenever it encounters an error and figures out how to solve it, we want that lesson to be saved into the agents.md file so that in the future it will be able to recognize the error and try to fix it on its own.
Okay, so Insh asked a question: you're planning a small review paper on scientific machine learning; please suggest a simple research problem or direction that you can work on, preferably something beginner-friendly. Okay, you're writing a review article, meaning that you're going to perform an exhaustive literature review, and you want it to be simple. By simple, I think it can mean many things, and the possibilities for such an article are pretty much countless. You could focus on the use of specific machine learning algorithms in a niche field of study.
Like what I'm doing here with aromatase, which is actually kind of niche. When you say that you want to find an anti-cancer drug, it's pretty broad, right? If you want to find breast cancer drugs, then it's more specific. And if you want to find breast cancer drugs targeting the aromatase enzyme, then you become more targeted. It's a similar concept with writing a review article: you want to narrow down your topic to be super specific to a certain area, because if you want it to be universal and apply to everything, that's going to take you a lot of time, and it's simply impossible to do in the first article that you write. So I would recommend narrowing it down, like a funnel, into a specific niche that you can write about. Hope that's helpful.
Okay, I think I remember now what we were working on. As I mentioned earlier, we created the train/test split. So let's see where we're at. We're here, right? And then the statistical analysis, I think we could keep this for later. So let's get started on building out the multivariate analysis, essentially the machine learning models: the classification models and also the regression models. Here, I think we were using only one specific machine learning algorithm, which is the random forest. Okay. So, let's go ahead.

Segment 4 (15:00 - 20:00)

All right, so we're going to continue with this session. Let's build a machine learning model against the pChEMBL values. We're going to build a regression model for the random split, 80/20, and we're going to use seed number 42 for the random split so that it is deterministic and repeatable. Otherwise, if the seed number is randomized, then every time you build the model you will get a different combination of data samples in the 80% set and in the 20% set.
So, let's see. It says, "Let me clarify a few things about what you'd like for the ML modeling. Which algorithm..." Okay, so it's asking us which algorithm we want to use. I mean, a good set would be these, right? But additionally: "Please suggest all possible ML algorithms that can be used. I want a representative for each ML algorithm type." All right, let's see what it has, and then we go with next. It's asking us whether the models should be trained on all 12 fingerprints or a subset. Let's just use one as an example, and then submit. If it works well, then we're just going to expand that to the remaining 11 molecular fingerprints.
Let's see what machine learning algorithms it will suggest for us. Essentially, we're going to use the scikit-learn package, which provides several different types of machine learning algorithms. The major classes are classification and regression. We have the X and the Y variables: the Y variable is the bioactivity, and the X variables are the molecular fingerprints or molecular descriptors. Clustering is a different type. With clustering, we're not predicting the bioactivity; clustering groups the data samples based purely on the X variables, which are the molecular features, molecular descriptors, or molecular fingerprints.
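The effect of fixing the seed for the 80/20 random split, discussed earlier in this segment, can be sketched with the standard library alone. This is an illustrative stand-in, not the split script built on the stream; the function name `random_split` is hypothetical. In scikit-learn, `train_test_split(..., test_size=0.2, random_state=42)` plays the same role.

```python
import random


def random_split(n_samples, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices) for a seeded random split."""
    indices = list(range(n_samples))
    # A dedicated Random instance keeps the shuffle independent of global state.
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return sorted(indices[n_test:]), sorted(indices[:n_test])


train_a, test_a = random_split(100)
train_b, test_b = random_split(100)

# Same seed, same membership in the 80% and 20% sets on every run.
print(test_a == test_b)
print(len(train_a), len(test_a))
```

With an unseeded shuffle, the two calls would generally put different samples in the test set, which makes results hard to compare across runs.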
So, molecular features, molecular descriptors, and molecular fingerprints are all synonymous with each other here. Based on the molecular descriptors, it will cluster the data samples, meaning that chemical structures that are similar can be clustered together. If they look alike, they'll be clustered together. If they look different, they'll end up in different clusters, as shown here in different colors. Each of the colors here is a cluster. scikit-learn also has several auxiliary functions that you can use for machine learning, like dimensionality reduction, model selection, preprocessing, and several more. Data splitting is also covered here, under model selection.
So, we're going to build the regression model. These are all of the models that it has. I already prompted it to suggest which ones are usable, because a lot of them are quite similar. So, for example, we have

Segment 5 (20:00 - 25:00)

random forest, and we have XGBoost, but that one is from another Python library. We want each one to be representative of the different types of ML algorithms. Actually, they're both tree-based, but XGBoost has been shown in a lot of Kaggle machine learning competitions to be pretty robust. So let's have a look here. "Would you like Cortex to proceed with this plan?" Let's look at the plan. Our target is the pChEMBL value, correct? Because it is a continuous numerical value, like 7.58 or 9.51, for example. They're decimals, so yes, it is continuous. The target means the Y variable, and our X variables will be the features. We're using the MACCS fingerprint, and we have already performed filtering. That's good. We're also doing the random split and assigning the seed number of 42, and this has already been computed. And the packages that we're using are here.
Okay, before beginning, I want to fix the Python package versions so that it is repeatable. For XGBoost and scikit-learn, we have already used these version numbers, but for the others we should fix the version numbers. "Please fix the version numbers to be the latest as of today and put those in the requirements.txt file." Let's have it re-implement that. While it's doing that, let me read what we have here. We have linear ridge regression, k-nearest neighbors, and support vector regression with the RBF kernel. The RBF kernel essentially lets the model work implicitly in a higher-dimensional feature space, so it can capture nonlinear relationships in the X variables. Then there's decision tree, the bagging ensemble, which is the random forest, and we also have XGBoost. You see that these two are from the same family of algorithms, but it's okay. Random forest should be more computationally efficient, though; it should take less time to compute.
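One way to pin versions as described above is to read what is installed in the current environment and emit `name==version` lines for requirements.txt. The sketch below uses the standard library's `importlib.metadata`; the helper name `pinned_requirements` and its `lookup` parameter are hypothetical, and the package list is limited to the two libraries named on the stream. It is not the assistant's actual procedure.

```python
from importlib.metadata import PackageNotFoundError, version


def pinned_requirements(packages, lookup=version):
    """Return 'name==version' lines for the packages found in this environment."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={lookup(name)}")
        except PackageNotFoundError:
            # Packages that are not installed here are skipped rather than guessed.
            continue
    return lines


if __name__ == "__main__":
    # Printed instead of written to disk, so an existing requirements.txt is untouched.
    for line in pinned_requirements(["scikit-learn", "xgboost"]):
        print(line)
```

Redirecting the output to requirements.txt records the exact versions, so the environment can be recreated later with `pip install -r requirements.txt`.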
And the neural network here, we have... Okay, let's see, for linear models what do we have here? Okay, so there are actually several types. We also have logistic regression. What if we were to ask it for a more comprehensive list? Let's have a look here. Okay, so we have 16. That's a good size. Let's just do all 16. Next, submit. All right, cool. We now have 16 algorithms that we're going to use against 12 different molecular fingerprints, so 16 by 12. But here we're using just one of the 12, the MACCS fingerprint.
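The size of the full experiment grid mentioned above, 16 algorithms against 12 fingerprints, can be enumerated with `itertools.product`. The labels below are placeholders, because the stream does not list all 16 algorithms or all 12 fingerprint types by name.

```python
from itertools import product

# Placeholder labels, assumed for illustration only.
algorithms = [f"algorithm_{i:02d}" for i in range(1, 17)]
fingerprints = [f"fingerprint_{i:02d}" for i in range(1, 13)]

# Every (fingerprint, algorithm) pair is one model to train and evaluate.
experiments = list(product(fingerprints, algorithms))
print(len(experiments))  # 16 * 12 = 192
```

Starting with a single fingerprint, as done here, means running 16 of these 192 combinations first.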

Segment 6 (25:00 - 30:00)

All right, so the plan here... Yeah, looks good. Okay: "Check R-squared values are reasonable." Yes, because we're building a regression model, we're going to evaluate it with the R-squared values, but also: "Please also use RMSE together with R-squared as the model performance metrics." Let's revise the plan here. The ML model script will be the seventh file here, and we've already pinned the specific version numbers for the ML libraries. I think it looks pretty good now. Let's go ahead and run it. It should take a while.
All of these boxes that you see here are agentic tool calls, and they're just telling us what it's doing. Prior to this it invoked the plan mode, and now it has exited it and is writing out the Python library versions that we're using to the requirements.txt file. This will allow us, at a future point in time, to reproduce this entire project using the exact version numbers. If something breaks in the future, we'll know which specific version numbers we used, and we can trace back and make adjustments accordingly. If we don't fix the version numbers here, in the future we'll probably have to guess, and it'll probably be a headache. Okay, now it's saying that some of the version numbers are incompatible, but it's fixing that on our behalf.
Okay, it has already edited the machine learning model building script, and now it's going to attempt to run it. What it's doing here is changing the directory to the working directory, which is the aromatase folder, and invoking Python to run the model building script. And now it's using the MACCS fingerprint with the 80/20 random split of the data and the seed number of 42. Wow, that was fast. All 16 models trained successfully. All 16 models evaluated.
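For reference, the regression metrics requested above, R-squared and RMSE, along with mean absolute error, can be written out in a few lines. This is a plain-Python sketch of the standard definitions, not the stream's script; scikit-learn provides equivalents such as `r2_score` and `mean_absolute_error` in `sklearn.metrics`. The sample values are invented purely to exercise the functions.

```python
import math


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when all observed values are equal")
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented pChEMBL-like values, not results from the stream.
observed = [5.2, 6.1, 7.4, 8.0]
predicted = [5.6, 6.0, 7.0, 8.3]
print(f"R2={r_squared(observed, predicted):.3f}",
      f"RMSE={rmse(observed, predicted):.3f}",
      f"MAE={mae(observed, predicted):.3f}")
```

Higher R-squared and lower RMSE or MAE indicate a better fit, and reporting both guards against reading too much into either one alone.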
Okay, here's the result. The best model here is random forest. That is my personal favorite, because it's fairly quick to compute and, based on what I've seen, it gives pretty good performance across several data scenarios. So that's pretty much my go-to. You can also see that the error is low when we're using random forest, as indicated here by the RMSE and the mean absolute error. This result is on the test set, right? Let's also get the results for the training set and for 10-fold cross-validation. Here we have the data split into 80% and 20%. The 80% is used to train the model, and once we have it, we're going to apply the trained model to make predictions on the test set and also on the training set. If we apply it to the training set, it's kind of like recall, because we're essentially using the same set of data to train the model and then to evaluate the trained model. So

Segment 7 (30:00 - 35:00)

that should give you fairly good performance, but it's like the baseline. Then we're going to apply the trained model to an unknown set of data samples that the model has never seen before, and that is an indicator of how well the model performs on unknown data. We're also going to do 10-fold cross-validation, where we use the 80% to rigorously evaluate the performance further. So let's tell it to give us the training and the 10-fold cross-validation results. "Can you also evaluate the model performance on the training set and also the 10-fold CV set?" CV is cross-validation. "Afterwards, please save the model performance results in a CSV file for all three sets, which are the training, test, and CV sets." Okay, let's do it. This should take a few moments.
In the meantime, I'll go to the chat to see if we have some more questions. We have Running Nuggets: "Good afternoon, new viewer here." Welcome to the channel. "Is this some kind of molecule library fingerprinting for some kind of classical ML?" Yes, we're using ML models, and we've already pre-calculated the molecular features of the compounds. We're using a bioinformatics data set, or more specifically you could call it cheminformatics. We've already compiled the data from the ChEMBL database, and let me show you what it looks like. From the ChEMBL database, we searched for aromatase and then selected the human variant of the enzyme. If you scroll down, you see all of the compounds, but the data is in here, so we click here. In the first episode, we did this manually. Here's the bioactivity: here are the IC50 values, this is the unit type, and these are the pChEMBL values that we have. And in episode 2, we did this programmatically by using the API docs.
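The three evaluations requested earlier in this segment (training set, held-out test set, and 10-fold cross-validation on the 80%) can be sketched end to end as follows. Everything here is an assumption for illustration: the data are synthetic, and a one-nearest-neighbor predictor on Tanimoto similarity stands in for the 16 scikit-learn models used on the stream. In scikit-learn, `KFold` and `cross_val_score` cover the cross-validation step.

```python
import csv
import io
import math
import random


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def predict_1nn(fit_X, fit_y, query):
    """Toy model: return the target of the most similar fitted molecule."""
    best = max(range(len(fit_X)), key=lambda i: tanimoto(fit_X[i], query))
    return fit_y[best]


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def make_toy_data(n, n_bits=32, seed=0):
    """Synthetic, unique, non-empty fingerprints with a made-up target."""
    rng = random.Random(seed)
    X, seen = [], set()
    while len(X) < n:
        bits = frozenset(b for b in range(n_bits) if rng.random() < 0.3)
        if bits and bits not in seen:
            seen.add(bits)
            X.append(bits)
    y = [5.0 + 0.2 * len(bits) + rng.gauss(0, 0.1) for bits in X]
    return X, y


def evaluate(X, y, test_fraction=0.2, k=10, seed=42):
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    n_test = round(len(X) * test_fraction)
    test_idx, train_idx = order[:n_test], order[n_test:]
    X_tr, y_tr = [X[i] for i in train_idx], [y[i] for i in train_idx]
    X_te, y_te = [X[i] for i in test_idx], [y[i] for i in test_idx]

    # Training set: predict the same molecules the model was fitted on.
    train_pred = [predict_1nn(X_tr, y_tr, x) for x in X_tr]
    # Test set: molecules the model has never seen.
    test_pred = [predict_1nn(X_tr, y_tr, x) for x in X_te]

    # k-fold CV on the training portion only: each fold is held out exactly once.
    cv_true, cv_pred = [], []
    for start in range(k):
        fold = list(range(start, len(X_tr), k))
        held = set(fold)
        fit_X = [x for i, x in enumerate(X_tr) if i not in held]
        fit_y = [v for i, v in enumerate(y_tr) if i not in held]
        for i in fold:
            cv_true.append(y_tr[i])
            cv_pred.append(predict_1nn(fit_X, fit_y, X_tr[i]))

    return {"train": rmse(y_tr, train_pred),
            "test": rmse(y_te, test_pred),
            "cv": rmse(cv_true, cv_pred)}


def results_to_csv(results):
    """Serialize the three RMSE values as CSV text (one row per evaluation set)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["set", "rmse"])
    for name in ("train", "test", "cv"):
        writer.writerow([name, f"{results[name]:.4f}"])
    return buffer.getvalue()


X, y = make_toy_data(100)
results = evaluate(X, y)
print(results_to_csv(results))
```

With this toy model the training error is zero by construction, since each training molecule is its own nearest neighbor. That is why training-set performance alone is only a baseline, and why the test and cross-validation figures are the ones to judge a model by.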
In episode 2, we just provided this API docs link to the AI coding assistant for it to figure out how to retrieve the data. In a nutshell, that's what we've done so far. And then we have another question: "Would you share the code for us, maybe on GitHub?" Yeah, that's a good callout. I've been intending to share the code on GitHub. In the meantime, you could go to github.com/dataprofessor, and I believe we have a bioinformatics folder there, but those files are from the prior version of Bioinformatics from Scratch. For the live stream, we're starting over from scratch as well. So I'll definitely include some of the data and the files in that folder. Maybe I'll create a subfolder called live stream so that you'll know that it comes from this series of videos. But yeah, that's a good suggestion.

Segment 8 (35:00 - 38:00)

Okay, let's see.
