# Context R-CNN: Long Term Temporal Context for Per-Camera Object Detection (Paper Explained)

## Metadata

- **Channel:** Yannic Kilcher
- **YouTube:** https://www.youtube.com/watch?v=eI8xTdcZ6VY
- **Date:** 28.06.2020
- **Duration:** 34:22
- **Views:** 12,886
- **Source:** https://ekstraktznaniy.ru/video/13477

## Description

Object detection often does not occur in a vacuum. Static cameras, such as wildlife traps, collect lots of irregularly sampled data over a large time frame and often capture repeating or similar events. This model learns to dynamically incorporate other frames taken by the same camera into its object detection pipeline.

OUTLINE:
0:00 - Intro & Overview
1:10 - Problem Formulation
2:10 - Static Camera Data
6:45 - Architecture Overview
10:00 - Short-Term Memory
15:40 - Long-Term Memory
20:10 - Quantitative Results
22:30 - Qualitative Results
30:10 - False Positives
32:50 - Appendix & Conclusion

Paper: https://arxiv.org/abs/1912.03538

My Video On Attention Is All You Need: https://youtu.be/iDulhoQ2pro

Abstract:
In static monitoring cameras, useful contextual information can stretch far beyond the few seconds typical video understanding models might see: subjects may exhibit similar behavior over multiple days, and background objects remain static. Due to power and storage constraints, 

## Transcript

### Intro & Overview [0:00]

Hi there! Today we'll look at "Context R-CNN: Long-Term Temporal Context for Per-Camera Object Detection" by Sara Beery, Guanhang Wu, Vivek Rathod, Ronny Votel, and Jonathan Huang. On a high level, this paper does object detection for cameras that stay in the same place for a long time, for example these wildlife trap cameras or traffic cameras right here. It proposes to do object detection by incorporating data from the images the camera has seen in the past to help the detection in the current frame, and it does so via an attention mechanism that it runs over a memory of past data. We're going to take a look at how this is done and how well it works, so stick around if you want to know. As always, if you enjoy content like this, consider sharing it out and telling your friends about it, subscribe if you haven't, and tell me what you think in the comments. So, the paper starts off by describing the problem, and the problem is fairly simple:

### Problem Formulation [1:10]

You want to do object detection in images. Object detection is the task of: if I give you an image, you should tell me what is on the image and where. So in this case here you would have to draw me this bounding box and say "this is a deer"; at the bottom you would have to draw bounding boxes (maybe they have to be rectangular, maybe not) and say "this is a bus", "here is a truck", "here is another car", and so on. There can be many objects in an image, there can be one object, there can be objects of different classes, or there can be no objects at all. So this is just object detection, and there have been many papers on this. Specifically, there has been this R-CNN, and this is the model we're going to extend. So the R-CNN model, or specifically the Faster R-CNN model that we're going to build on, is a model
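As a tiny illustration of what "what and where" means as data (my own sketch; the names and types are not from the paper), an object detector's output for one image can be represented like this:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """One detected object: a rectangular bounding box plus a class label."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    label: str
    score: float  # detector confidence in [0, 1]

# An image can yield many detections, one, or none at all.
detections = [
    Detection(120.0, 200.0, 310.0, 420.0, "deer", 0.92),
    Detection(40.0, 50.0, 200.0, 180.0, "bus", 0.88),
]
empty_image_detections: list[Detection] = []  # image with no objects
```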

### Static Camera Data [2:10]

that can detect these bounding boxes in single images. But now we consider the situation where we have a camera that records images for a long time. These wildlife trap cameras often sit there for months, and it's not that easy to make use of them, because in addition to there being a lot of data, they have motion triggers. So there could be nothing for a long time, and then an animal walks into the trap, and then you have a bunch of images, like one per second for ten seconds, and then you have nothing again for a day or two, and then you have ten images again because another animal walks in, or maybe doesn't, and so on. So you have irregular sampling frequencies and very different distances between frames, and all of this makes the data ill-suited for models like temporal convolutions or things like LSTMs, because they don't work super well with data like this. Now, I know there are formulations where LSTMs can do this, but they don't work very well with these super long contexts and irregular sampling frequencies.

So the idea is: if we have a frame right here, like this one, and we want to detect what's on it, we should be able to pull information from other frames that the same camera has seen, like from this one or from this one right here, and we should be able to do so in a dynamic way. Now, why could that help? If you look, for example, down here, these images have been taken, they say, on separate days, but you can see this thing right here is in both images, or a very similar thing; that is probably that bus's regular route. So in order to classify whether or not this here is a bus, it might be very helpful to also look at this picture right here and see: ah, it's at about the same location, it looks the same, and it also looks like a bus. That kind of gives evidence that this other thing could also be a bus.

There are also background objects. Sometimes the single-frame detectors get confused: a detector might label this here as a car, because the exact lighting in this picture is just off by the right amount to confuse it. But considering this picture over here, maybe it recognizes: no, that's not a car, and it can bring that evidence over to this frame and conclude: ah, maybe this is the same thing, so it's not a car. So this is not the same as simply adding training data: we really exploit the fact that these images come from the same camera, in the same location, filming the same thing. All of this stays within the same camera; it's not just adding i.i.d. training data.

The same goes for animals: often the same animal has its regular route, or within these bursts of ten images the same animal will walk around a bit, and maybe here it's half occluded, but in a different image you see: aha, here I see the nose, and that helps you make a better prediction. Also, animals are often in crowds, and that helps: if you see that there are other deer around, the probability that this is a deer increases rapidly.

So how are we going to do this? We're going to build an attention mechanism that can do these kinds of looks into the past (and also a little bit into the future, as we will see, but mainly into other images from the same camera) in a dynamic way, and we'll learn how to address those other images from a memory bank.
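To make the irregular-sampling point concrete, here is a small sketch (my own illustration, not from the paper) that groups motion-trigger timestamps into bursts: long gaps separate events, while frames within an event are about a second apart.

```python
def group_bursts(timestamps, max_gap=60.0):
    """Group capture times (in seconds) into bursts separated by gaps longer
    than `max_gap`. Camera-trap data looks like a few dense bursts scattered
    over days, which is exactly what breaks fixed-stride temporal models."""
    bursts = []
    for t in sorted(timestamps):
        if bursts and t - bursts[-1][-1] <= max_gap:
            bursts[-1].append(t)   # same event, frames roughly 1 s apart
        else:
            bursts.append([t])     # long silence before this frame: new event
    return bursts

# Ten frames one second apart, then nothing for two days, then three more.
times = [float(i) for i in range(10)] + [172800.0, 172801.0, 172802.0]
bursts = group_bursts(times)  # two bursts: one of 10 frames, one of 3
```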

### Architecture Overview [6:45]

So, the architecture is described right here. As you can see, we are still in the business of doing object detection, so what we'll do is sort of hijack an existing object detector, and the object detector we're going to hijack is this Faster R-CNN. That's an object detector for the single-frame problem, which means you have one image and you're supposed to detect what's on it. It has two stages, as you can see. Stage one: if you have an image, and let's say there's some stuff on it, stuff here, stuff there, what stage one is supposed to do is extract regions of interest. It simply says: well, there might be something right here, in these regions of interest. Then it describes each of these regions of interest using features. So it extracts these regions of interest, and each region of interest gets features assigned to it. I think these are like 7 by 7 by 2048 features, but let's just say, for the sake of describing it, that these are just a vector of features for each region of interest. So each region of interest is going to be associated with one vector of features that this model extracts, the next region of interest also has a vector, and so on. Stage two then takes each one of these vectors and assigns a class to it; this would be "deer" right here. So stage one proposes regions of interest along with features, and then stage two takes each of these regions of interest and classifies them, basically. I guess there are many in-between steps (this is massively simplified: there's non-maximum suppression, there is a kind of alignment stage where you can refine the bounding box, and so on), but in essence these are two stages, and you can see that this system here goes in between the two. So all of this right here we shove in between the two stages; we'll still use stages one and two, but in between, in this thing right here, we'll try to sort of pimp these features such that the stage-two detector has an easier time classifying. We're going to pimp these features by incorporating extra information, because these features right now, if we just do it vanilla, are just from the current frame, and we're going to add to them information from other frames of the same camera. And we're going to do it in two different ways.
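The two-stage pipeline with the context module wedged in between can be sketched in a few lines of plain Python, where stage one, stage two, and the two attention blocks are placeholder callables (all names are illustrative assumptions, not the paper's actual API):

```python
def detect(keyframe, neighbor_frames, long_term_bank,
           stage1, stage2, attend_short, attend_long):
    """Sketch: Context R-CNN wedged between the two Faster R-CNN stages.

    `stage1` maps a frame to (boxes, per-box feature vectors); `stage2`
    maps a (possibly context-enriched) feature vector to a class label.
    `attend_short` / `attend_long` stand in for the attention blocks.
    """
    boxes, feats = stage1(keyframe)
    # Short-term memory: stage-one features from frames around the keyframe.
    short_mem = [f for frame in neighbor_frames for f in stage1(frame)[1]]
    # Enrich each keyframe feature by attending into both memories.
    feats = [attend_long(attend_short(f, short_mem), long_term_bank)
             for f in feats]
    # Stage two then classifies the enriched features as usual.
    return [(box, stage2(f)) for box, f in zip(boxes, feats)]
```

With identity attention and dummy stages this reduces to a plain single-frame detector, which is exactly the baseline the paper extends.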

### Short-Term Memory [10:00]

The first way is this short-term memory, and the second way is the long-term memory. The two are slightly different: as you can guess, the short-term memory only covers a short time period around the current frame, while the long-term memory reaches across a very long time horizon into the past. You can see we're trying to classify this blue frame right here, which we call the keyframe. So what we'll do is run it through stage one, cool, so we have features for each region of interest, and then, as you can see, this goes here, and through these residual connections it goes into stage two over here. So basically, stage two still receives the same input, whatever stage one outputs for the keyframe, but we're going to add to that twice, two things, as I said.

The short-term memory is added right here. Now, how do we build the short-term memory? We build it simply by considering all the frames around the keyframe, and you can see this right here: the current window around the keyframe, which can be one frame around it, or two frames, or three frames, just a few frames around the current frame. And this can be fairly helpful, as we said: for example, if the deer moves a bit, or the car moves a bit, or it gets into slightly different lighting and so on, it can help us very much to classify the current keyframe if we also have features from the surrounding frames. So for each of these surrounding frames, we also run them through the stage-one detector to extract regions of interest, and all of these features go into this short-term memory bank right here. There are different strategies: you don't always have to extract all of the regions of interest, you can also extract just the top one, and so on, or you can extract the mean, since these are fairly consistent (the camera is at the same place). There are many ways you can do this, but what you ultimately end up with is a short-term memory bank where you have lots of these feature vectors for the regions of interest of the surrounding frames.

Now, if this here is your occluded deer, so this is the half-occluded deer, and you want to consider information from the surrounding frames: maybe this is three frames, like one, two, three, and two is the keyframe; maybe in the next frame the deer moves a bit and you see its nose, and that particular region of interest is relevant. So how do you now get, out of this entire memory, the feature vector that would be helpful? The answer is: you get it through an attention mechanism. You can see that right here; the way the short-term memory is added is through this attention block. They describe the attention block right here, and it is a fairly standard attention mechanism. I've done a video on "Attention Is All You Need", so if you don't know what an attention mechanism is, go check it out, but you can see it's very standard. You have the input features, which are the features that come from the keyframe, and you have the context features, which are all the features in your memory bank. You encode the input features into a query using fully connected layers, and the context features into keys, and then you match the queries with the keys, and the softmax gives you a weighting over the context features, from which you aggregate the values. This is a standard attention mechanism. What does it mean? It basically means that each of these vectors right here will emit a key that describes what kind of information is contained in that vector, and the vector over here will emit a query that describes what sort of information it is looking for in order to describe what's in the region of interest as well as possible. Then you simply match the query with the keys to determine which key fits best to that query, and whichever one fits best, let's say this one here, you take that vector from the memory bank and incorporate it together with the current information that you already have. So that's how you address things from other frames using an attention mechanism.

Now, if this were all, we could train this right now. We could train all of this, because all of this is differentiable: the stage-one detector right here is differentiable, the attention mechanism is differentiable, the stage-two detector is differentiable. All differentiable, cool, we can train this end-to-end. Now, what's the problem? The problem is this long-term memory.
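The standard attention block just described (queries from the keyframe boxes, keys and values from the memory bank, softmax weighting, then adding the aggregated context back) can be sketched in NumPy; the projection-matrix names and shapes are my own assumptions for illustration, not the paper's exact configuration:

```python
import numpy as np

def attention_pool(input_feats, context_feats, W_q, W_k, W_v, W_out):
    """Single-head attention over a memory bank.

    input_feats:   (n, d) per-box features from the keyframe (become queries)
    context_feats: (m, d) features in the memory bank (become keys/values)
    W_q, W_k, W_v, W_out: learned projections (shapes are illustrative).
    """
    q = input_feats @ W_q        # queries: "what information am I looking for?"
    k = context_feats @ W_k      # keys:    "what information do I contain?"
    v = context_feats @ W_v      # values to aggregate from the memory
    scores = q @ k.T / np.sqrt(k.shape[1])          # match queries to keys
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)            # softmax over the memory
    context = w @ v                                  # weighted sum of values
    return input_feats + context @ W_out             # residual add per box
```

A box whose query matches a memory key strongly (say, the frame where the deer's nose is visible) gets that memory feature mixed into its own representation.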

### Long-Term Memory [15:40]

In this long-term memory, ideally, we would want to fit, let's say, an entire day, an entire week, or even an entire month of data from one of these cameras, and it's just not feasible to expand the current window to an entire month or an entire week for many of those cameras. Even though they have a low frame rate and so on, it's still too much for everything to be differentiable and backpropagatable, so we can't really backprop through this long-term memory. In essence, what we want to do is exactly the same: we want to build up a memory of all of the regions of interest, or maybe selected regions, or the best regions of interest, whatever heuristic strategy we have, of the past, of whatever this camera has seen, let's say in the last month or in the current week or something like this. We want to build all of this up and then use an attention mechanism, just the same, in order to incorporate it. But we have to come up with these features in some other way, a way where we don't need to backprop. We can't really use the stage-one detector we're training, because then we'd have to backprop through it. Now, an easy proposal is to simply use it anyway but put a stop-gradient on it, so we don't backprop through it. That is one way, but the paper decides on a different way: for all of the past, we'll take a pretrained object detector, not the one we're training currently, but one that was pretrained on something like COCO, which is an object detection dataset, or pretrained on COCO and then fine-tuned on the task you're interested in, in a single-frame fashion. In whatever way, we'll take a pretrained object detector, or region-of-interest extractor, and for each frame in the past it will give us the regions of interest along with the features. And these are the features that we then put into the memory bank. (Sorry, my tablet just crashed a bit; there we go.) So we'll take a pretrained extractor right here that gives us features for regions of interest, we'll put those into the memory bank, and then we will use an attention mechanism to incorporate them. Now, the attention mechanism we can train, but we cannot train the extractor for the features, and this is the difference to the short-term memory, where we can actually train the feature extractor in order to help us with building the memory. Here, the memory is simply built without a goal in mind, basically, and the attention mechanism has to learn that it doesn't work with features that were made for its task; it works with features that were originally created for a different task, and they're not going to change. But as we'll see, this can be handled. So that's what they do: they incorporate short-term and long-term memory into their stage-two prediction, and the stage-two prediction simply takes in all of those features and classifies the object. And that's the architecture of Context R-CNN: it's R-CNN with long- and short-term context. They describe very different ways of how they build the memory and how they build the features; I kind of glossed over this. There's a lot of consideration in building these things, and you have to look at the paper for how exactly they do it; I'm more interested in the high-level architecture and the ideas behind it.
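The long-term memory construction described above can be sketched as follows, assuming a frozen, pretrained detector; the function and parameter names are hypothetical:

```python
def build_long_term_bank(past_frames, frozen_extractor, top_k=1):
    """Build the long-term memory with a pretrained, *frozen* extractor
    (e.g. a detector pretrained on COCO). Since the extractor is never
    updated, no gradient has to flow through the bank; only the attention
    that reads the bank is trained. Keeping just the `top_k` highest-scoring
    boxes per frame is one of the selection heuristics mentioned above.

    past_frames: iterable of (timestamp, frame) pairs.
    frozen_extractor: frame -> [(score, feature), ...] with no gradients.
    """
    bank = []
    for timestamp, frame in past_frames:
        detections = frozen_extractor(frame)
        detections.sort(key=lambda d: d[0], reverse=True)  # best boxes first
        for score, feature in detections[:top_k]:
            bank.append((timestamp, feature))  # keep the capture time too
    return bank
```

Storing the timestamp alongside each feature is what later lets you inspect which points in time the attention addresses.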

### Quantitative Results [20:10]

When they do this, they outperform the single-frame baselines by quite a bit. This SS and this CCT are the wildlife datasets (Snapshot Serengeti and Caltech Camera Traps), whereas this CC, I think, is CityCam, the street dataset. As you can see, they outperform the single-frame baseline by quite a bit. Now, interestingly, as they increase the time horizon of this long-term memory (they can choose how much information to put into it) from one minute to one hour to one day and so on, the performance goes up and up, which is a strong indication that these features from the longer time horizon actually help: you don't have more parameters, you simply increase the amount of information in the memory bank, and if the performance goes up, you can make a very strong claim that this is due to having more information in that memory bank. I couldn't really think of any other explanation right here. They also investigate different memory strategies and do a lot of ablations, where they ask: what if we only have the short-term attention, or only the long-term attention, or only self-attention? That means attention only within the current frame, but across regions of interest, which is interesting if you have, say, a herd of animals. They all help, but the long-term attention tends to help the most on this metric, and the short-term attention helps a lot on this one (these are two different metrics, not datasets, sorry about that). In essence, it helps the most when you combine the two, and that's pretty cool to see. So they do

### Qualitative Results [22:30]

some qualitative results, which I find very interesting. For example, they can visualize what the attention weights of their models are. Here you always have a very long time frame, I think an entire month, in this long-term memory bank. In the top example, the large frame is the one you actually want to classify, and the other frames are the frames with the top attention scores, so where the attention weights are the highest. So in order to classify this, which other frames does the model pay attention to? You can see right here that they are spread across the entire month (here is the timeline): the most-attended-to pictures span the whole month, and almost all of them actually contain that warthog. So this must be its regular route, and the model recognizes that and pulls in information from all these other images in order to correctly classify it here. On the other hand, in the next example, this gazelle (my tablet crashed right here), it also puts all the weight on images of that same gazelle, but you can see that the gazelle was maybe only there for this one particular moment, and all the pictures this camera has of it are from the very few moments the gazelle was around. You can see they all come from the same point in time, or very close points in time, and the model puts a lot of weight on wherever the gazelle is. That's a pretty strong indication that it actually learns to pull in the correct information, be that from a long time horizon or from a short one, as necessary.

You can also see right here that they visualize where the top attention weights go, in terms of how far the attended frames are from the frame they are trying to classify. These graphics are somewhat tricky to interpret: this here always indicates the total time span of the buffer, so here the memory buffer contains pictures from one hour before until one hour after the keyframe you want to classify. The keyframe is at minute zero, and the memory buffer contains images from 60 minutes before to 60 minutes after. It's not real time, right? You go back through your footage and try to classify, so you can also pull in images from the future. You can see most attention is on the current frame, which makes sense, since you're trying to classify the current frame, and then it falls off as you go further and further away. And this is across the entire dataset, not a specific example, which also makes sense: most of the time, the relevant information is probably closer in time rather than farther away. But you can also see that the distribution is pretty spread out, so the model makes use of the entire range of time, and you can see that throughout, even with an entire day in the buffer, or two days, even with an entire week before and after in the buffer, and even with an entire month. Especially when you have an entire week in the buffer, you can see the periodicity through the days: the model tends to pay attention to images from the same time of day as the current keyframe. That's a fairly good indication that the model has actually learned to address this memory by its content. Now, night versus day isn't super difficult, because you can just go on the brightness and so on, but still, it's pretty cool to see that this is actually happening.

They also have some failure cases of the single-frame model that their model is able to handle, up here, and they make a lot of sense. Here you can see that there is an object moving out of frame, and the single-frame detector wasn't able to recognize it, probably because it's moving out of frame, whereas the new Context R-CNN is able to detect it, probably because it looked at the frame just before, where the car was somewhere back here, and it could correctly classify it. (Well, just disregard my drawings here.) It also managed to recognize this animal in the back, which the old single-frame model hadn't, probably by looking at the frames next to it, or by looking at other frames of herds of animals and realizing that usually, when there are two elephants, there are more. Here you can see that the object is highly occluded, so we're talking about an object at the very edge of the frame, or an object that's poorly lit; this one is particularly impressive, and it's also an example where the animals are often in herds: if you see one deer, the likelihood that there are other deer is very high. In this particular camera, by aggregating information from different frames, the model can see that it's maybe always the same patch of ground that comes up, and here the single-frame detector detects this patch as a vehicle, where it shouldn't; the new model, the Context R-CNN, is able to recognize that this thing is present in all of the frames, and in most frames the single-frame detector doesn't detect it as a vehicle, so it can carry that information over. Now you can already see what the downsides might be: if the single-frame detector is very, very sure in a single frame that this is a car, it could carry that information over to the other frames. So even though the single-frame detector might have failed in that particular frame, if it fails super hard, it might shout that to all the other frames, basically dominating the memory, saying, "look, this is a car, I'm pretty sure", and it will carry that over to all of the other frames. They say that in one of these high-confidence mistakes, it basically detected the same tree as a giraffe over and over again. What I find

### False Positives [30:10]

particularly interesting is this curve they look at: on the bottom you have confidence thresholds, so how confident the model is, and on the y-axis you have the number of false positives. You can see that in the low-confidence regime, the Context R-CNN has fewer false positives than the single-frame detector. The green line here is when you only have positive boxes, so when you only include regions of interest where there is an actual object, which in this case is actually hurtful: you also want the regions of interest where there is nothing, because that helps you avoid false positives in other frames. That's why the orange line is below the green line. But strangely, in the high-confidence regime, you can see that the single-frame model has fewer false positives than the Context R-CNN, and I like the text they have on this: "In Figure 7 we can see that adding empty representations reduces the number of false positives across all confidence thresholds compared to the same model with only positive representations. We investigated the 100 highest-confidence false positives from Context R-CNN and found that in almost all of them (97 out of 100) the model had correctly found and classified animals that were missed by human annotators." So basically, these graphs are even underestimating how good the model is, because the model appears to be better than the human annotators of the test set. I find that to be pretty impressive. And here you can see failure modes, where they say that, when exploring the confident false positives on the Snapshot Serengeti dataset, the three out of a hundred images that were not human annotation failures were cases where Context R-CNN erroneously detected an animal, all of the same tree, highly confidently predicted to be a giraffe. So this is a failure mode: when the model is highly confident, it might spill that over to other frames, because we now aggregate

### Appendix & Conclusion [32:50]

the information within the same camera across the frames. It should be said, of course, that their train/test split is such that the same camera never appears in both the training data and the testing data; the testing data contains entirely different cameras than the training data, just so there is no information leakage. So that's the model and how it works. It's pretty cool: it wedges itself in between the two stages of any single-frame object detector that has these two stages, and it's a pretty neat idea to bring in context from the past, or even the future, of the same camera. Just a quick glance at the appendix: they have lots of different examples right here. In one example, their camera kind of fell over, and they say, well, it still worked; the system was still able to do attention across this failure, this tipping over of the camera. They have more examples right here, which I find pretty impressive, like these super-low-light scenes where it correctly detects the opossum. And yeah, I invite you to check out the paper; the code, they say, should be out soon. And I'll see you next time. Bye.
