# 05-Fine-Tuning Grounding DINO for Scientific Image Analysis

## Metadata

- **Channel:** DigitalSreeni
- **YouTube:** https://www.youtube.com/watch?v=QNAzdCcklds
- **Date:** 11.05.2026
- **Duration:** 22:45
- **Views:** 1,047

## Description

In this video we fine-tune Grounding DINO on kidney histology images so it reliably detects glomeruli across both H&E and IHC staining, using only 24 annotated images and a single GPU. The entire training run takes under 5 minutes (with a decent GPU).

This is part of the Applied LLMs for Scientists series. In the previous videos we built an annotation tool combining Grounding DINO and SAM 2 for text-prompted segmentation. This video closes the loop: we take the annotations generated by that tool, fine-tune the model on them, and load the resulting checkpoint back into the annotation tool for deployment.

What we cover:

- Why base Grounding DINO struggles on scientific images and what fine-tuning fixes
- COCO annotation format and how to merge per-image JSONs into a training dataset
- Fine-tuning with a frozen backbone using the Hugging Face Transformers API
- Reading live training curves and validation F1 scores inside the GUI
- Before/after comparison of base vs. fine-tuned model on held-out test images
- Loading the fine-tuned checkpoint back into the annotation tool

All code is available on GitHub: https://github.com/bnsreenu/LLM-Assisted-Scientific-Image-Annotation-Tool

#Python #MachineLearning #ComputerVision #ImageAnalysis #DigitalPathology #GroundingDINO #SAM2 #ScientificImaging #DeepLearning #DigitalSreeni

## Contents

### [0:00](https://www.youtube.com/watch?v=QNAzdCcklds) Segment 1 (00:00 - 05:00)

Hello everyone, welcome back to my channel, DigitalSreeni. I hope you are a subscriber; if not, this would be a great time to hit that subscribe button. This tutorial is focused on fine-tuning the Grounding DINO model for scientific image analysis. Before we jump into the demo, I want to spend a few minutes on the conceptual side, on exactly what the plan is, because the demo will make a lot more sense if you understand what fine-tuning actually means in the context of Grounding DINO. In the previous videos in this series (I hope you watched the last four), we built an annotation tool that uses Grounding DINO and SAM 2 to detect and segment objects in scientific images, just by typing a text description. That works reasonably well out of the box; we saw that. But if you have tried it on your own microscopy data, you have probably noticed that the confidence scores are low and you need to drop the threshold quite a bit to catch everything. Even then you may not catch the objects you're interested in, because the Grounding DINO model did not see much of your type of data, or maybe never saw anything of that sort. So we're going to fix that today. We are going to take Grounding DINO and teach it specifically what glomeruli look like in kidney histology images. Of course, replace that with your own sample. I'm going to show you the whole process, so you should be able to reproduce it on your own samples. I did talk about Grounding DINO a little bit a couple of tutorials ago, but let's quickly recap what it actually is, because it's important for us. Grounding DINO is what's called an open-set detector.
Most object detectors you may have encountered, like YOLO or Faster R-CNN, are trained on a fixed list of categories. You give them an image and they tell you whether that's a cat or a car or a person, and so on. Grounding DINO does something fundamentally different. Instead of a fixed category list, it takes a text prompt. We saw that in the last tutorial, and you'll see it in this one, where we have written just "glomerulus" or "renal glomerulus". You type whatever object or organelle you're trying to find in your images, and it attempts to find objects matching that description anywhere in the image. There's no retraining required for new categories. I don't think I put in a slide for this, and I did talk about it in the last tutorial, but bear with me for a moment. The model has two encoders: an image encoder that processes the image and a text encoder that processes your text. These two representations are fused together in a cross-modality decoder, which is what produces the bounding boxes that you see here on the screen. Each box comes with a confidence score reflecting how well the detected region matches your text description. The limitation, and that's the key point, is that Grounding DINO was trained on natural images, photographs of everyday objects. It has never seen your H&E-stained tissue, so it has never seen a glomerulus. Maybe it saw a few in published journals, but not enough. So when you ask it for a glomerulus, it's making its best guess based on visual patterns that superficially resemble what it learned during pre-training.
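
The text prompt described above can be sketched in a few lines. The Hugging Face documentation for Grounding DINO asks for lowercase text queries that each end with a period; the helper below applies that convention. The function name `build_prompt` is hypothetical, and how the annotation tool itself combines the class name and the extra phrases is not shown in the video, so treat this as an illustration rather than the tool's code.

```python
def build_prompt(phrases):
    """Join class phrases into a single Grounding DINO style text prompt.

    Each phrase is stripped, lowercased, and terminated with a period.
    Blank entries and exact duplicates are skipped.
    """
    cleaned = []
    for phrase in phrases:
        text = phrase.strip().lower().rstrip(".").strip()
        if text and text not in cleaned:
            cleaned.append(text)
    return " ".join(f"{text}." for text in cleaned)


prompt = build_prompt(["Glomerulus", "renal glomerulus "])
print(prompt)  # glomerulus. renal glomerulus.
```

Keeping each phrase as its own period-terminated chunk is what lets several descriptions of the same object share one prompt.
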
That's why the confidence scores on scientific images tend to be very low, and that's why we are going to fine-tune. This slide shows the problem and the solution side by side. On the left, you see the base model, straight off Hugging Face with no additional training. It finds one glomerulus, maybe two if you're lucky, at very low confidence. You have to drop your threshold to something like 0.01 or, you know, 0.15 or so just to get those two detections. On the right, you see the same image after fine-tuning. If you have lots of images, you get higher confidence of

### [5:00](https://www.youtube.com/watch?v=QNAzdCcklds&t=300s) Segment 2 (05:00 - 10:00)

0.9, but in our case we are going to work with only 10 or 15 images, I think around 20, so you can't expect 0.9 in this demonstration. Ideally, that is possible if you have enough training data, and then you can use a sensible threshold to detect all these objects. In this case the model has learned what a glomerulus actually looks like in your specific imaging conditions: this specific stain, this magnification, this tissue preparation. That knowledge did not exist in the pre-trained model because it was never part of the training data. That's the point I want you to appreciate. Now let's look at the actual steps involved. There are five steps here, and I want to walk you through each of them. Step one is annotation. We use our annotation tool from the previous video and click on the glomeruli across a set of training images. The tool runs Grounding DINO and SAM 2 to detect these objects, and then we correct any mistakes by manually adding the ones it missed. Step two is merging. When we save our annotations in the tool, it generates a separate JSON file for each image. Image one may have, let's say, 10 glomeruli, image two maybe seven, and so on. We need to merge those into a single JSON file, because that's what the Grounding DINO fine-tuning tool expects. Step three is training. We load the fine-tuning tool, point it at our training and validation JSONs, select the base Grounding DINO model as the starting point, and click start training; the tool handles everything. Step four is evaluation, which means looking at the training curves and so on, and step five is deployment.
We take the saved checkpoint and load it into the annotation tool that we built in the last tutorial. That's basically the idea. This is the slide I always get questions about, so let me explain it carefully. When we say we are fine-tuning Grounding DINO, we're not retraining the whole model; it's in the name. Retraining would require weeks of compute and millions of images. We freeze the heavy parts, just like with any convolutional neural network, where you freeze the base and only train the top layers. Here the frozen parts are the text backbone and the image backbone, which is a Swin Transformer. We only update the cross-modality decoder and the detection head. That's it. Those are the components that learn to match visual features to text tokens and produce bounding boxes. If that doesn't make sense, don't worry. I have my code, you can use it, and it works great. If you want to know more, plug it into Claude or whatever your favorite LLM is and try to understand every line or every block of code. In my first few videos, about five years ago, I used to explain every line. Maybe I should do that so people learn more, but I think the LLM tools do a much better job of explaining things based on your learning style. OK, enough digressing. This is important for two reasons. First, it means training is fast, because we're not training the entire thing, only the cross-modality decoder and detection head. Second, it's stable even with a very small dataset. We'll see that in our data right now.
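
A minimal sketch of the freezing step described above. In PyTorch, a parameter is frozen by setting `requires_grad = False`, and the function below does that for every parameter whose name contains a backbone fragment. The fragments `"backbone."` and `"text_backbone."` and the toy parameter names are assumptions for illustration; check them against the names your own model reports, since the author's actual code is not shown in the video. The stand-in objects keep the example runnable without PyTorch installed.

```python
from types import SimpleNamespace

# Assumed name fragments for the two frozen encoders (image and text).
# Verify against the output of model.named_parameters() for your model.
FROZEN_FRAGMENTS = ("backbone.", "text_backbone.")


def freeze_backbones(named_parameters, frozen_fragments=FROZEN_FRAGMENTS):
    """Freeze parameters whose names match; return the trainable names."""
    trainable = []
    for name, param in named_parameters:
        if any(fragment in name for fragment in frozen_fragments):
            param.requires_grad = False  # excluded from gradient updates
        else:
            param.requires_grad = True
            trainable.append(name)
    return trainable


# Hypothetical stand-ins for (name, parameter) pairs from a real model.
toy_parameters = [
    ("model.backbone.conv_encoder.weight", SimpleNamespace(requires_grad=True)),
    ("model.text_backbone.embeddings.weight", SimpleNamespace(requires_grad=True)),
    ("model.decoder.layers.0.weight", SimpleNamespace(requires_grad=True)),
    ("bbox_embed.0.weight", SimpleNamespace(requires_grad=True)),
]

trainable_names = freeze_backbones(toy_parameters)
print(trainable_names)
```

With a real PyTorch model, the same function could be called as `freeze_backbones(model.named_parameters())`, and only the parameters that still have `requires_grad` set would then be handed to the optimizer.
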
In our demo, you'll see training for 20 epochs. That's more than enough on 24 images. I took 12 H&E images and 12 IHC images of glomeruli and trained on them. It literally took 5 minutes on my GPU. So we'll probably just run the training from scratch; I'll pause the video and then we can continue from that point. In some cases it can take an hour. Again, it depends on how many images you have.

### [10:00](https://www.youtube.com/watch?v=QNAzdCcklds&t=600s) Segment 3 (10:00 - 15:00)

OK. This slide is probably not worth talking about. The key point is 24 images: 12 H&E and 12 IHC. I'm going to show you all of this anyway, so there's no point in discussing it. Let's jump in, train our model, and see whether it's any better than the base model that we downloaded from Hugging Face. This part, the annotation tool, is the code from our last tutorial. I did enhance it a little bit; you'll see that later on. I almost said we don't need this, but in fact we do need it first. We want to fine-tune, but before fine-tuning we have to generate our labels, the ground truth data. So let's go ahead and run this. I did correct a mistake from last time: I added a large font size so you can see it in the video. Right now we have Grounding DINO base. What I added is a custom fine-tuned model option, so we can load our own custom models; previously it was only these two options. Let's leave it at Grounding DINO base, because we don't have a trained model yet, and for SAM it's SAM 2. Let's load the selected models. They're loaded, so now let's load an image. Let me show you the images that I have. These are all the images I have for training: 12 H&E images and 12 IHC images. I have them at different magnifications and with slightly different stains. Within each stain they're not drastically different, but between H&E and IHC they are obviously completely different: one is blue, one is pinkish. That's pretty much it. This is my training data. Not a lot of glomeruli here. So let's go back, and I am going to load these images one by one.
Don't worry, I'm not going to show all of these, but I want to show the process in case you haven't watched the previous tutorial. Once you load the image, this is where I type "glomerulus". Then, not another class, but when you click on glomerulus you can add additional phrases that describe the object. It helps; the more text, the better. So let's add "glomeruli". Either way is fine. We have these two terms, so let's drop the threshold to about 0.15 and run the LLM detection. Let's hope it detects a few of these. Oh, it did not detect anything. That's why we need to train our own model. Let's remove this one, clear all, and run the LLM detection again. It's not bad, actually. It detected some of these. Let's delete this large mask, and this one, and that one. OK, we have these detected. Now I'm going to add masks using the SAM tool on the ones that it did not detect. You see how amazing this is. Once I have them all, I click save masks to folder and pick where to save. That's what I have done, and you can see all the masks right here. From each image we have a whole bunch of masks, each one being one glomerulus. But the mask images do not matter anymore. What matters is these files: rat kidney HE1 coco.json, rat kidney HE2 coco.json, and so on. These match the names of my images: rat kidney HE1, rat kidney HE2, and so on. For each of these I have a JSON file. To train our model, these JSON files need to be merged into one single COCO file. To do that, I have added a tool here called merge COCO annotations.
You browse to your masks folder, images folder, and output folder, and set your validation fraction. I leave it at 20%; you can change that. Then it saves the output. In my case I saved it into the JSONs for training folder, and I have a training JSON and a validation JSON. That's it. You can do all of this using your own code. At the end of the day, these are two JSON files, training and validation, that contain the bounding box information for each image and for each glomerulus in each image.
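
For anyone who wants to do the merge in their own code, here is a minimal sketch. It assumes each per-image file is a standard COCO dictionary with `images`, `annotations`, and `categories` keys, where each annotation carries a `bbox` in COCO's `[x, y, width, height]` order. The function names, the toy data, and the rounding rule for the validation split are hypothetical; the tool's own merge logic is not shown in the video.

```python
import random


def _merge(docs, categories, name_to_id):
    """Combine COCO dicts into one, renumbering image and annotation ids."""
    out = {"images": [], "annotations": [], "categories": categories}
    for doc in docs:
        local_names = {c["id"]: c["name"] for c in doc["categories"]}
        image_ids = {}  # id inside the per-image file -> id in merged file
        for image in doc["images"]:
            new_id = len(out["images"]) + 1
            image_ids[image["id"]] = new_id
            out["images"].append({**image, "id": new_id})
        for ann in doc["annotations"]:
            out["annotations"].append({
                **ann,
                "id": len(out["annotations"]) + 1,
                "image_id": image_ids[ann["image_id"]],
                "category_id": name_to_id[local_names[ann["category_id"]]],
            })
    return out


def merge_coco(per_image_docs, val_fraction=0.2, seed=0):
    """Split per-image COCO dicts by image and merge each split."""
    docs = list(per_image_docs)
    # One shared category table so train and val agree on category ids.
    names = sorted({c["name"] for d in docs for c in d["categories"]})
    categories = [{"id": i, "name": n} for i, n in enumerate(names, start=1)]
    name_to_id = {c["name"]: c["id"] for c in categories}

    random.Random(seed).shuffle(docs)
    n_val = round(len(docs) * val_fraction)
    train = _merge(docs[n_val:], categories, name_to_id)
    val = _merge(docs[:n_val], categories, name_to_id)
    return train, val


def toy_doc(file_name, n_boxes):
    """Hypothetical per-image COCO dict with n_boxes glomerulus boxes."""
    return {
        "images": [{"id": 0, "file_name": file_name, "width": 512, "height": 512}],
        "annotations": [
            {"id": k, "image_id": 0, "category_id": 0,
             "bbox": [10 * k, 10 * k, 40, 40], "area": 1600, "iscrowd": 0}
            for k in range(n_boxes)
        ],
        "categories": [{"id": 0, "name": "glomerulus"}],
    }


docs = [toy_doc(f"image_{i}.png", n_boxes=i + 1) for i in range(5)]
train, val = merge_coco(docs, val_fraction=0.2, seed=0)
print(len(train["images"]), len(val["images"]))  # 4 1
```

The split is done per image rather than per annotation, so all boxes from one image land in the same file. Each result can be written out with `json.dump` to produce the training and validation JSONs.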

### [15:00](https://www.youtube.com/watch?v=QNAzdCcklds&t=900s) Segment 4 (15:00 - 20:00)

That's pretty much it. Once you have that, you're ready to train the model, and hopefully you understood how we are generating our training data. I say training, but it's fine-tuning. Let's switch to the next script and run it; it should open a GUI that looks very similar. Again, I can increase the font size to the largest so you can see things while they're happening. Now, what are we fine-tuning? We are fine-tuning the base model, and it's already selected right there. You can browse and load whatever model you want to fine-tune. For the train JSON file, the one we just saved is in our JSONs for training folder, so let's select the train JSON and the validation JSON. Then the checkpoints folder. Let's create a new folder for checkpoints, because I don't want to overwrite the old one, training checkpoints. Let's call it temp checkpoints. All set. We are doing 20 epochs, and I'm going to leave the other settings as they are. Let me hit start training. Like I said, it took about 5 minutes the last time I did this. As you can see, it's doing epoch 1 out of 20, and as soon as an epoch is done, it places a data point up here on the plot. I'll wait until that's done, pause the video, and continue after training finishes. Then we'll see whether our newly trained model is any better than the default model, which would show that fine-tuning helps. OK, it looks like the training went fine. How do we know it's doing any better? Let's do a preliminary check right now.
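
The video description mentions validation F1 scores in the GUI, but the transcript does not say how they are computed. One common way to score a detector, sketched here under that caveat, is to match predicted boxes to ground-truth boxes by intersection over union (IoU) and compute precision, recall, and F1 from the matches. The function names, the greedy matching, and the 0.5 IoU cutoff are assumptions, and the boxes are invented toy values in COCO `[x, y, width, height]` order.

```python
def iou(box_a, box_b):
    """Intersection over union of two [x, y, width, height] boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def detection_f1(predictions, ground_truth, iou_threshold=0.5):
    """Greedy one-to-one matching, highest-scoring prediction first.

    predictions: list of (box, score); ground_truth: list of boxes.
    Returns (precision, recall, f1).
    """
    unmatched = list(ground_truth)
    true_positives = 0
    for box, _score in sorted(predictions, key=lambda p: p[1], reverse=True):
        best = max(unmatched, key=lambda gt: iou(box, gt), default=None)
        if best is not None and iou(box, best) >= iou_threshold:
            unmatched.remove(best)  # each ground-truth box matches once
            true_positives += 1
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented toy boxes: two predictions overlap a ground-truth box, one misses.
ground_truth = [[0, 0, 10, 10], [20, 20, 10, 10], [40, 40, 10, 10]]
predictions = [([0, 0, 10, 10], 0.9), ([21, 21, 10, 10], 0.6), ([100, 100, 10, 10], 0.3)]
precision, recall, f1 = detection_f1(predictions, ground_truth)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.667 0.667
```
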
Once training is done, let's load a test image and run a comparison. The model is already selected because we just trained it; otherwise you load a model first. Let's run the test comparison. It loads the image, and you see how both models are doing with the same settings. Here, first of all, you can see one, two; on this side it combined both into one, and on the other you have one, two, so it separated these two. You got one there, and these two are detected in addition. Let's decrease the threshold to, I don't know, 0.15 and run the comparison again. Now you can see the base model on the left-hand side starting to pick up a few more, and the right-hand side is also picking up a few more, including some of these other ones. We'll import this into our annotation tool and do a decent comparison, but this is how the model is performing. Now that we have trained a model, let's close this, open our annotation tool, and load this model into it. Again, let me increase the font size so that you can see. In fact, let's use Grounding DINO first, the base one, and select a decent image. I'm thinking about something a bit challenging. How about this one? IHC. This is a bit more challenging than the H&E images. Let's see how the models perform. The class name is of course glomerulus, and once you select it, let's also add another phrase, glomeruli. With those, let's leave everything else as is and run the LLM detection. Let's see if it detects any. Well, it did not detect any with those settings. You see: glomerulus, glomeruli, 0.25.
Let's drop it down to 0.2, for example, and run this again. I hope we can detect something with the base model. This is the base model, by the way, not the model we just trained. This is the default, so this is what you would be doing without the fine-tuned model. Oh, let's drop it to 0.1. Wow.
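
The threshold lowering above amounts to a simple filter on confidence scores. The sketch below uses invented scores, not values measured in the video, chosen only to show how detections that cluster between 0.10 and 0.15 stay invisible until the threshold drops below them. The function name `keep_above` and the dictionary layout are hypothetical.

```python
def keep_above(detections, threshold):
    """Return the detections whose confidence is at least `threshold`."""
    return [d for d in detections if d["score"] >= threshold]


# Invented scores for illustration only.
toy_detections = [
    {"label": "glomerulus", "score": score}
    for score in (0.14, 0.13, 0.12, 0.11)
]

for threshold in (0.25, 0.15, 0.1):
    kept = keep_above(toy_detections, threshold)
    print(f"threshold {threshold}: {len(kept)} detections")
# threshold 0.25: 0 detections
# threshold 0.15: 0 detections
# threshold 0.1: 4 detections
```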

### [20:00](https://www.youtube.com/watch?v=QNAzdCcklds&t=1200s) Segment 5 (20:00 - 22:00)

I'm a bit surprised. OK, so we detected something at 0.1, but it also detects this entire region. You can go down and delete that entire big region and this one, and then you have one, two, three, four glomeruli detected using the base model. Let's go back to 0.15; I don't like all that fuzzy stuff. Let's delete all of these and see what we detect with the 0.15 setting. Did we detect anything? No. So here we're not detecting anything. I hope we'll find something with the new model. Let's select the custom fine-tuned option and pick the one from temp checkpoints, because that's the one we just saved, and choose the final checkpoint. Let's load the model and see how this one works. I'm not going to change anything: same image, same settings. Let's click run LLM detection, and I hope we'll pick up some objects here; otherwise our training is useless. There you go. Isn't that amazing? Under the same settings, we did not detect even a single one using the base model, and that changed with just a handful of training images. You can go ahead and do a Google search, find some images, and download them. I bet this model is going to do much better than the base model at detecting glomeruli on any of those images. In this case we trained it on two different stain variants, and it's still not bad. The more variants you add, the more versatile the model becomes, and you don't need a lot of data. You may ask about some of these detections, so let's go down and see. It did an amazing job. The only two problems are these two, and I can just delete these two masks. Actually, this one is also not useful. Now let's see. That's pretty good, I would say.
So now you can save your masks, and you're all done. This video turned out to be slightly longer than I wanted, but I hope you appreciate this approach and this tool. Again, the link to the code for all of this is down in the description, or just search for my GitHub name, bnsreenu: B N S R E E N U. I created my GitHub account long before I started my DigitalSreeni channel, so it doesn't use that name. Thank you very much. I'll see you again in one of the future tutorials.

---
*Source: https://ekstraktznaniy.ru/video/50844*