# 04-Building a Desktop Annotation Tool - PyQt5 + Grounding DINO + SAM 2

## Metadata

- **Channel:** DigitalSreeni
- **YouTube:** https://www.youtube.com/watch?v=4l9Vi7xMhUw
- **Date:** 08.05.2026
- **Duration:** 12:49
- **Views:** 557

## Description

The final video in this series brings everything together into a practical desktop application you can use on your own images without writing a single line of code.

The tool combines Grounding DINO and SAM 2 in a PyQt5 GUI with a complete annotation workflow: load an image, define your classes, add multiple plain-text detection phrases per class, set per-class confidence thresholds, run detection, and review the results. Objects missed by the automatic pipeline are added with a single click. Wrong masks are removed by clicking on them. When you are satisfied, the tool exports one labeled PNG mask per object instance plus a JSON summary — a format directly compatible with training pipelines in Python, ImageJ, and MATLAB.
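The export step described above (one PNG per object instance plus a JSON summary) might be organized along these lines. This is only a sketch: the field names, the file naming pattern, and the helper `build_summary` are hypothetical and are not taken from the tool's source code.

```python
import json

def build_summary(image_name, instances):
    """Build a JSON-serializable summary for exported instance masks.

    `instances` is a list of (class_name, score) tuples. The schema used
    here is illustrative only, not the tool's actual format.
    """
    counters = {}
    records = []
    for class_name, score in instances:
        # Number instances per class so every mask gets a unique file name.
        counters[class_name] = counters.get(class_name, 0) + 1
        index = counters[class_name]
        records.append({
            "class": class_name,
            "instance": index,
            "score": score,
            "mask_file": f"{class_name}_{index:03d}.png",
        })
    return {"image": image_name, "num_objects": len(records), "objects": records}

# Made-up example values, for illustration only.
summary = build_summary("kidney.png", [("glomerulus", 0.31), ("glomerulus", 0.08)])
print(json.dumps(summary, indent=2))
```

Keeping the class name and instance index in both the file name and the JSON record makes it easy to pair each mask with its label when loading the data for training.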

Key features covered in this video: per-class threshold control (scientific images often need box thresholds as low as 0.05), multiple detection phrases per class for visually ambiguous objects, a cross-class NMS pass to handle overlapping detections, and a GPU memory management strategy that allows large models to run on GPUs with as little as 4 GB VRAM. We also cover how to download and run models completely offline — important for anyone working on a corporate network with proxy restrictions.
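The cross-class NMS pass mentioned above can be illustrated with a plain greedy non-maximum suppression that ignores class labels. This is a generic sketch, not the tool's implementation; the box format `(x1, y1, x2, y2)`, the dictionary keys, the 0.5 IoU threshold, and the example detections are all assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def cross_class_nms(detections, iou_threshold=0.5):
    """Greedy NMS over all detections, regardless of class label.

    Each detection is a dict with 'box', 'score', and 'label' keys.
    The highest-scoring box wins whenever two boxes overlap too much.
    """
    kept = []
    for det in sorted(detections, key=lambda d: d["score"], reverse=True):
        if all(iou(det["box"], k["box"]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept

# Made-up detections: the first two overlap heavily but carry different labels.
detections = [
    {"box": (10, 10, 60, 60), "score": 0.40, "label": "glomerulus"},
    {"box": (12, 12, 62, 62), "score": 0.20, "label": "tubule"},
    {"box": (100, 100, 150, 150), "score": 0.10, "label": "tubule"},
]
kept = cross_class_nms(detections)
print([d["label"] for d in kept])
```

Per-class NMS would keep both overlapping boxes here because their labels differ. Running the pass across classes keeps only the higher-scoring one, so the same object is not exported twice under two names.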

Full source code: https://github.com/bnsreenu/LLM-Assisted-Scientific-Image-Annotation-Tool/blob/main/annotation_tool_v2.py

#PyQt5 #ImageAnnotation #SAM2 #GroundingDINO #Python #DeepLearning #Microscopy #AIforScience #ComputerVision #SegmentAnything

## Contents

### [0:00](https://www.youtube.com/watch?v=4l9Vi7xMhUw) Segment 1 (00:00 - 05:00)

Hello everyone. Welcome back to my channel, DigitalSreeni, and welcome to the fourth tutorial in our Applied LLMs for Scientists mini-series. In this video, I am going to demonstrate an annotation tool for scientific images that lets you describe what you're looking for in plain English. Just type a word like mitochondria, or something like elongated oval objects, and the tool finds it and draws the masks for you. So there is no manual drawing and no retraining of a model. We are using two recent pre-trained open-vocabulary AI models to make this happen. One is Grounding DINO, for detection, and the other is SAM 2, which we use for segmentation. So, let's go ahead and jump in.

But one note: before we actually look at the tool, it's worth spending 30 seconds on why this problem is actually hard. If you have ever annotated biological or microscopy images, you know that drawing pixel-level masks by hand across hundreds of images is genuinely painful, and of course it takes hours. On top of that, the boundaries of what counts as a valid object shift between images, which introduces inconsistencies that are difficult to control even for a single person, let alone between researchers. And classical detection frameworks like YOLO or Faster R-CNN require you to pick from a fixed list of categories at training time, which does not work well for scientific structures that have unusual shapes or no labels. These are the three problems that the tool is designed to address.

Now, this is something I talked about in the last tutorial and the one before that. There are two steps in this pipeline. The first one is Grounding DINO. This is the detection step I just mentioned. It takes free-form text as input.
You type your description, and it finds the bounding boxes. The second is SAM 2, which takes each of those bounding boxes from step one and produces a precise pixel-level mask for the object inside it. Okay? That is pretty much it. Please watch the tutorial before this one in this playlist, because that's where we did exactly this, but using a Colab notebook, so you can see exactly how these steps are put together. Once you have that idea, we can go ahead and look at our product, if you want to call it that, and start generating real masks.

Here is a quick summary of this series. The first video was an introduction. Well, not just an introduction: you learn a little bit about these two models. The two videos prior to this one were the notebooks: Grounding DINO, and then the last one, where we stitched the DINO plus SAM 2 pipeline together. Now we are putting everything together into an application that I'm going to demonstrate right now.

The code for this application is just a single Python file. It's not multiple files or anything. If you go to my GitHub repo for this specific project, you can see the Python files right there. You can also see the previous ones, the notebooks, that walk you through this process, which is very good for learning. But more importantly, I also included download_models.py, so you can go ahead and download the models, and it places them wherever they need to be. I say wherever they need to be because the code expects them to be under C/HF models. This can be changed if you want. Other than that, there is nothing much. If you're using an IDE, go ahead and run the code, or just run it from the command line. It doesn't matter.
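The idea of pointing the tool at a local models folder can be sketched as follows. This is an assumption-laden example, not the tool's code: it assumes the models are stored as Hugging Face style folders containing a `config.json`, and the folder name `grounding-dino-tiny` and the helper `resolve_local_model` are hypothetical. `HF_HUB_OFFLINE` is the Hugging Face Hub environment variable that disables network access. The demo uses a temporary directory in place of the real models root.

```python
import os
import tempfile
from pathlib import Path

def resolve_local_model(root, name):
    """Return the local folder for a model and switch the Hub to offline mode.

    Raises FileNotFoundError if the folder does not look like a saved model
    (assumed here to mean: it has no config.json).
    """
    model_dir = Path(root) / name
    if not (model_dir / "config.json").is_file():
        raise FileNotFoundError(
            f"No config.json in {model_dir}; run the download script first"
        )
    # Tell Hugging Face libraries not to attempt any network requests.
    os.environ["HF_HUB_OFFLINE"] = "1"
    return model_dir

# Demo with a throwaway folder standing in for the real models root.
with tempfile.TemporaryDirectory() as root:
    demo = Path(root) / "grounding-dino-tiny"   # hypothetical folder name
    demo.mkdir()
    (demo / "config.json").write_text("{}")
    found_name = resolve_local_model(root, "grounding-dino-tiny").name
    try:
        resolve_local_model(root, "missing-model")
        missing_raises = False
    except FileNotFoundError:
        missing_raises = True

print(found_name, missing_raises)
```

With the `transformers` library, the resolved folder would typically be passed as a path to `from_pretrained(..., local_files_only=True)`, which loads from disk without contacting the Hub.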
Let me go ahead and make it full screen. One thing I should have included is a way of increasing the font size on the right-hand side. I usually build that into my applications. It's one of the first things I do, because it's difficult for me to see stuff without my glasses. But since I built this, I think I know where things are. So I'm a bit semi-blind today. Okay, enough about me.

Now, this is divided into multiple steps, seven steps right here. Step number one is models. These are the models we downloaded, the ones that are available. So, load the selected models. By the way, I have a GPU here. It runs even if you don't have a GPU, but obviously it's going to be slow. That's it. Once the models are loaded, go ahead and load the image, and let's load our

### [5:00](https://www.youtube.com/watch?v=4l9Vi7xMhUw&t=300s) Segment 2 (05:00 - 10:00)

rat kidney image with the glomeruli. This is a relatively difficult image to segment because, as you can see, the background is pretty busy. You see these glomeruli. Of course, we can see them as humans; we are trained to see these types of things. But can the computer see them? Whether we can find them with our text prompts is the question, right?

So, the next step is to add the classes. You can add any number of classes and give each a name. I'm going to call this one glomerulus and say okay. When you click on it, it opens up a small panel right here that says phrases for glomerulus. These are the phrases that go into the DINO model. This is the prompt. I'm just giving one word as a prompt, and in this case it probably works okay. You can add multiple phrases, different phrases that describe your objects. We'll check that in a minute once we're done with this. But let's come back to our class. You can add multiple classes, for example some other objects in the background. But let's stick with one for now; it works for multiple classes, as I said.

Once this is set, there are the box threshold, text threshold, and NMS threshold. I talked about these in the last couple of tutorials, so I'm not going to repeat that. Let's go ahead and run LLM detection and see if it detects anything. Absolutely nothing. That's because a box threshold of 0.25 is probably okay for natural images, where you're trying to detect, I don't know, a cat, a dog, a car. But for scientific images, which the model may not have seen many of in training, we need to go lower. Because glomerulus, what does that even mean to a normal, average person? Probably not much. If you're doing research in this field, of course you know what it is, right?
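The effect of the box threshold can be sketched with a simple per-class filter. The scores below are made up for illustration, and `filter_by_class_threshold` is a hypothetical helper, not a function from the tool. It shows why a default of 0.25 can return nothing while a lower value such as 0.05 lets low-confidence detections through.

```python
def filter_by_class_threshold(detections, thresholds, default=0.25):
    """Keep detections whose score meets the threshold set for their class.

    `thresholds` maps class name to box threshold; classes without an
    entry fall back to `default`.
    """
    return [
        d for d in detections
        if d["score"] >= thresholds.get(d["label"], default)
    ]

# Made-up confidence scores for an object type the detector rarely saw.
detections = [
    {"label": "glomerulus", "score": 0.12},
    {"label": "glomerulus", "score": 0.07},
    {"label": "glomerulus", "score": 0.03},
]

strict = filter_by_class_threshold(detections, {})                      # default 0.25
relaxed = filter_by_class_threshold(detections, {"glomerulus": 0.05})   # per-class value

print(len(strict), len(relaxed))
```

Keeping the threshold per class means a well-known object type can stay at a strict value while a rare scientific structure gets a much lower one in the same run.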
When the DINO model was trained, the training data probably included some text and annotated images that gave it some idea about glomeruli. Otherwise, there's no way you're going to segment anything by typing just glomerulus. Let's drop the threshold down to 0.05, because our DINO model probably was not trained on a lot of glomeruli, so the confidence is very low. Let's go ahead and do that and run LLM detection. And there you go. It did an amazing job right here, as you can see.

For the remaining ones, I can scroll down and again pick the class. There's only one class right now. Add mask, and here I'm only using the SAM 2 model. It's just designed to detect edges. SAM 2 is not a text-driven model like DINO, so it has nothing to do with the prompts. You just click, and it picks the boundaries. That's it. And if you don't like anything, you can delete it.

Okay, so we got 22 different objects in this image, and you can go ahead and save masks to folder, in which case it saves 22 different PNG images. Each image is a binary image: the background is zero and the foreground, whatever that object is, has a value of one. Okay? So if you open these masks in the default Windows or Mac image viewer and you see nothing, don't complain. I do get those complaints that, oh, the program is not working. No, it is working. It's just that Windows displays everything on a scale from zero to 255, and in our case we only have values of zero and one, so obviously you're not seeing anything. It's a shame that I have to teach this, but it's understandable for people who have never done this type of work. When they open the image, they see nothing, so you can understand why they're asking these questions. Okay, enough. Now let's go ahead and clear all, do one more thing, and end this video.
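To make a 0/1 mask visible in an ordinary image viewer, the foreground can be rescaled to 255 before saving a preview copy. The sketch below uses only the Python standard library and a tiny made-up mask; the helper names `scale_binary_mask` and `encode_gray_png` are hypothetical. With NumPy the scaling step is simply `mask * 255`.

```python
import struct
import zlib

def scale_binary_mask(mask):
    """Map a 0/1 mask (list of rows) to 0/255 so viewers show the object."""
    return [[255 if v else 0 for v in row] for row in mask]

def encode_gray_png(rows):
    """Encode 8-bit grayscale rows as PNG bytes using only the standard library."""
    height, width = len(rows), len(rows[0])

    def chunk(tag, data):
        # Each PNG chunk is: length, type, data, CRC over type + data.
        body = tag + data
        return (struct.pack(">I", len(data)) + body
                + struct.pack(">I", zlib.crc32(body) & 0xFFFFFFFF))

    # IHDR: width, height, bit depth 8, color type 0 (grayscale), no interlace.
    header = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    # Every scanline starts with filter type 0 (no filtering).
    raw = b"".join(b"\x00" + bytes(row) for row in rows)
    return (b"\x89PNG\r\n\x1a\n" + chunk(b"IHDR", header)
            + chunk(b"IDAT", zlib.compress(raw)) + chunk(b"IEND", b""))

mask = [[0, 1, 1, 0],
        [0, 1, 1, 0]]                 # tiny made-up binary mask
visible = scale_binary_mask(mask)     # same shape, values 0 or 255
png_bytes = encode_gray_png(visible)  # write these bytes to a .png file to view
print(visible[0], len(png_bytes))
```

The original 0/1 files should stay untouched for training, since many pipelines expect the label values exactly as exported. The rescaled copy is only for visual inspection.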
Let's load the mitochondria image, which is a bit more difficult because this is a grayscale image. It has no color information, which makes it harder. On top of that, look at the background. So let's go ahead and add the class mitochondria. Mitochondria, right? I hope I spelled it right. I'm typing blind. Okay, so mitochondria, and let's go ahead and run LLM detection. Of course, with a 0.25 threshold, not much. Let's drop this to 0.05 like we did last time and run LLM detection. I think it should detect something. Yeah, it is detecting some of these, but I'm not happy with this. You see this big lump? I can go down, click delete mask, and delete this big lump. Now it's okay, decent. Although these two are combined right there, and this one is a

### [10:00](https://www.youtube.com/watch?v=4l9Vi7xMhUw&t=600s) Segment 3 (10:00 - 12:00)

bit bleeding outside of its edges. So I'm not that happy with this. Let me bump it back up to 0.15 and run the detection again. Now I'm not detecting anything. So maybe I should add a phrase that gives some context. I'm typing elongated small oval objects. That is what I'm describing, and I hope it helps. Now let's not change anything else and just run LLM detection. And voila, there you go. An amazing job, as you can see. It detected this mitochondrion fine. It did not detect this one. By the way, if you have better ways of describing mitochondria in terms of shape, go ahead and use them.

Now, why did it work with elongated small oval objects? Because this is an actual natural-language English phrase. The model converts it to embeddings, goes back to the image, and identifies the objects that we just described. When you type mitochondria, there were probably not enough examples in the training data, so it doesn't know: "Hey, I can't see mitochondria here. Maybe one here, one there." Because we don't even know what type of mitochondria images it was trained on, if they made it into the training at all, right? Maybe they were bright-field images, not electron microscope images or something.

Okay, and for the remaining ones, it's easy, right? You can just go ahead and click, click, click. By the way, I'm not fast-forwarding this video. It is really that fast. I'm talking while I'm clicking here. That is, that fast on my system with 20 GB of GPU memory, which is pretty common, right? Pretty standard, I should say, if you have a GPU-based workstation. Okay, now you can go ahead and save the masks. That's it.
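The multiple-phrases-per-class idea used for the mitochondria class can be sketched as a small prompt builder. Grounding DINO text queries are conventionally lowercased and separated by periods. Whether the tool sends all phrases in one prompt or runs them one at a time is not stated in the video, so combining them here is an assumption, and `build_prompt` is a hypothetical helper.

```python
def build_prompt(class_phrases):
    """Join all phrases into one lowercased, period-separated prompt.

    Returns the prompt plus a lookup from each phrase back to its class,
    so a detection matched to any phrase can be labeled with the class name.
    """
    phrase_to_class = {}
    for class_name, phrases in class_phrases.items():
        for phrase in phrases:
            cleaned = phrase.strip().lower().rstrip(".").strip()
            if cleaned:
                phrase_to_class[cleaned] = class_name
    prompt = " . ".join(phrase_to_class) + " ." if phrase_to_class else ""
    return prompt, phrase_to_class

prompt, phrase_to_class = build_prompt(
    {"mitochondria": ["Mitochondria", "elongated small oval objects"]}
)
print(prompt)
```

Mapping every phrase back to one class name is what lets a descriptive phrase stand in for a technical term the detector barely knows, while the exported masks still carry the scientific label.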
Now, the second thing that I wish I had done, the first being adjustable font size, is to introduce the concept of a project, where we open a whole bunch of images, work on some images at a time, save the project, and continue later. That would be nice, and maybe in the future I'll do it. If there is someone out there with enough skills who wants to take this, enhance it, and add functionality to make it even more useful for others, please go ahead and help me out there. Okay, I hope you find this tool useful and that you learned something from this tutorial series. If so, please do not forget to hit the subscribe button. I'll see you in a future tutorial. Thank you very much.

---
*Source: https://ekstraktznaniy.ru/video/50847*