# Google crawlers behind the scenes

## Metadata

- **Channel:** Google Search Central
- **YouTube:** https://www.youtube.com/watch?v=JpweMBnpS4Q

## Contents

### [0:00](https://www.youtube.com/watch?v=JpweMBnpS4Q) Segment 1 (00:00 - 05:00)

Hello and welcome to the latest episode of Search Off the Record. In this show, we from the Search Relations team here at Google try to give you a glimpse of what's happening behind the scenes. And with me today is Gary. Hello, Gary. — Hello. How are you doing? — I'm great. — Fantastic. Let me change that. — Okay. — I want to talk about crawling. — Oh, no. No, no, no, no. — Actually, no. I want to talk about crawlers, because I'm wondering if we ever discussed what exactly our crawling infrastructure looks like. People keep talking about Googlebot as if it's almost a living thing, or at least a specific program. But there's no Googlebot.exe that you double-click and then it launches, right? — There's not. — It works a bit differently, no? — What? — You taught me that. — Yeah. Well, you're correct. — Do we want to elaborate a little bit on that? So how can I imagine Googlebot? What does our crawling infrastructure roughly look like? — I mean, calling it Googlebot is a misnomer, and it's something that worked well back in the day, perhaps the early 2000s, because back then we probably had one crawler, because we had one product. But then soon after, another product came out; I think that was AdWords. And then we started having more crawlers, and then more products came out, and then more crawlers. But the Googlebot name somehow stuck, and generally, when we were talking about our crawling infrastructure, we tended to call it Googlebot. That was wildly inaccurate, because Googlebot was just one thing communicating with our crawler infrastructure. I don't know if that makes sense. — How can I imagine that? What do you mean by communicating with our crawler infrastructure? Googlebot is our crawler infrastructure, no? — Well, man, that's what I've been saying for the past three minutes. — Yeah, but I still can't picture it. — So Googlebot is not our crawler infrastructure. Our crawler infrastructure doesn't have an external name. It has an internal name; doesn't matter what it is. Let's call it Jack. And it is, I don't know how to put it, software as a service, if you like. — Oh, okay. — Like, what's that? SaaS. — Mhm. — Right. — Yep. — And so Jack has API endpoints, so to say, and you can call those API endpoints to do a fetch from the internet. — Mhm. Right. — And when you do those API calls, you also need to specify some parameters: how long you are willing to wait for the bytes to come back, what user agent you want to send, what robots.txt product token you want to obey, and all these parameters. And we do set defaults for most of these things, not all of them, so you can generally omit them, which makes these calls simpler, I guess, because you don't have to specify all that stuff. But otherwise it's really just an API call to something in the cloud, or in some random data center, and that will perform a fetch for you. — Okay. — As a software developer or a product or whatever. — That's really nice. So I guess there's also a team that manages it, because effectively what I'm doing is outsourcing all these decisions to someone else. — Yeah. — Okay. — So this product, because we can call it a product at this point even if it's internal, has been around for a [snorts] very, very long time. Technically, it's been around since Google existed.
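As a rough illustration of the "crawling as a service" idea Gary describes, a fetch call with per-request parameters and defaults might look something like the sketch below. Every name, type, and default here is hypothetical; Google's internal API is not public.

```python
# Minimal sketch of a shared fetch service ("Jack") and one client call.
# All names and defaults below are invented for illustration.
from dataclasses import dataclass, field

@dataclass
class FetchRequest:
    url: str
    user_agent: str = "ExampleBot/1.0"        # UA string sent to the server
    robots_product_token: str = "ExampleBot"  # robots.txt rules to obey
    timeout_seconds: float = 30.0             # how long to wait for the bytes

@dataclass
class FetchResponse:
    status: int
    headers: dict = field(default_factory=dict)
    body: bytes = b""

class CrawlService:
    """Stand-in for the shared crawl infrastructure that clients call into."""

    def fetch(self, request: FetchRequest) -> FetchResponse:
        # The real service would check robots.txt against the product token,
        # apply politeness limits, and perform the fetch from its own egress.
        raise NotImplementedError

# A client such as "Googlebot" is just one caller; omitted parameters fall
# back to the defaults above, which keeps the call simple.
request = FetchRequest(url="https://example.com/", timeout_seconds=10.0)
```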
— There were some changes to it, because the original version was more or less just a wget running on some random engineer's workstation, if we think back to 1998 or '99. And then, as more products came out, the more staffing it needed, the more resources it needed, and of course we needed to re-architect the whole thing to enable teams to call this service. But in essence it has always been

### [5:00](https://www.youtube.com/watch?v=JpweMBnpS4Q&t=300s) Segment 2 (05:00 - 10:00)

doing the same thing. Basically, you tell it to fetch something from the internet without breaking the internet, and it will do that if the restrictions on the site allow it. That's it. If I wanted to put it in one sentence, that would be it. — Okay. So basically I hand over a bunch of configuration, and part of that configuration is the bunch of URLs that I want crawled, and then I hand that over to the service, and it comes back with something to me, right? — Yeah, pretty much. — And that something probably is the HTTP response, the headers and the body, and maybe some additional metadata. Cool. So basically Googlebot is just a piece of this configuration that I hand over, a name, basically. — Say that again, sorry. — So Googlebot is not really a program, but a piece of this configuration that I hand over. Basically just the name of the configuration, so to speak. — Well, it is one of the callers of the SaaS. — Okay. — Like, it's not even part of the configuration. It's just the name that one particular team is using for their fetches that are sent to this central service. — So basically, like, one of the clients. — One of the clients, yeah. Exactly. Exactly that. — Well, that suggests that there are other clients. — Yeah, sure. I mean, we try to document a big chunk of them, but Google is a big company, so there are lots of teams that want to fetch from the internet. So there are lots of crawlers, lots of named crawlers, which means that we would need to document dozens if not hundreds of different crawlers or special crawlers or fetchers. And on a simple HTML page, that's kind of infeasible. So we try to draw a line and say that if the crawler is really tiny, meaning that it doesn't fetch too much from the internet, then we try not to document it, because the real estate on the crawler site, on developers.google.com/crawlers, is actually quite valuable. — Mhm. — We might try to deal with that differently, but for the moment, basically just the major crawlers and special crawlers and fetchers are documented, quite literally because of lack of space. — You say fetchers and crawlers. What's the difference? — The simplest way to explain it is that crawlers do work in batch, and fetchers work on an individual URL basis. Meaning that you give a URL to a fetcher, and it will fetch just that one URL. You cannot give it a list of URLs to fetch. — Okay. — And for crawlers, it's usually a constant stream of URLs, and it's running continuously, fetching for your team from the internet. And internally we also have this policy that fetchers need to be in some way user-controlled. — So basically there's someone on the other end who's waiting for the response of the fetcher. — Okay. Yeah. While with crawlers it's like, just do it when you have the time. — Ah, right. So if there's an automatic system that consumes the response and then does something whenever it's available, then we can obviously treat that differently than if someone clicks a button and waits for a result. — Right. Okay. Got it. And that's the difference between fetcher and crawler. — I think so, yeah. — Okay. Cool. — I mean, I'm pretty sure that there are more differences. Like, for example, the IP ranges that they are fetching from are different.
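As a rough sketch of that batch-versus-single-URL split (interfaces invented for illustration, not Google's actual API):

```python
# Hypothetical shapes of the two kinds of clients described above.
from typing import Iterable, Iterator

def fetch_one(url: str) -> bytes:
    """Fetcher: exactly one URL per call; a user waits for this result."""
    raise NotImplementedError  # would call the shared fetch service

def crawl(url_stream: Iterable[str]) -> Iterator[bytes]:
    """Crawler: a continuous stream of URLs, worked through in batch,
    whenever the infrastructure has the time."""
    for url in url_stream:
        yield fetch_one(url)
```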
But otherwise, it's pretty much the same infrastructure, more or less; it's just performing the same task differently. — Right. So I guess if we have documented at least the major crawlers, and maybe even the fetchers, then people probably know about them. But you said you only document the major ones. So if I were to start a new project, and I needed to somehow have people type in their URL and click a button, then you wouldn't necessarily document that specific project if it's small enough, right? — Yeah, exactly. — Okay. — Mhm. There is basically a trigger for us documenting it. I spent way too much time coming up with basically something like SQL queries to trigger alerts for us internally when a crawler or fetcher passes a certain threshold of

### [10:00](https://www.youtube.com/watch?v=JpweMBnpS4Q&t=600s) Segment 3 (10:00 - 15:00)

number of fetches per day. And if that alert triggers internally, then a bug would get opened, an issue internally, that would say: hey, there's a new large crawler in town, and perhaps you want to document it. And then we would go and look at the properties of that crawler, what it's doing and why. We would check with the team to ensure that they are not doing something accidentally, because we also had instances where we got a complaint about a crawler doing something on a site, and then we looked at it, and the team was like, no, that crawler is unlaunched, we unlaunched it two years ago, that's not possible. And then we were looking at our logs, and yeah, it was fetching. We tracked it down to some random job that they forgot to turn off when the project was sunset, and it kept fetching from the internet for no good reason. But nowadays that's really rare, because we have all this monitoring and all these checks in place to ensure that the fetches or crawls we are doing actually have some utility internally. — Mhm. — Not just, like, randomly fetching.
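The internal alerting Gary describes is essentially a scheduled query over fetch logs plus an issue filed when a client crosses a volume threshold. A minimal sketch of that idea, with a made-up table name, column names, and threshold:

```python
# Hypothetical sketch of the "new large crawler in town" alert.
DAILY_FETCH_THRESHOLD = 1_000_000  # made-up cutoff, fetches per day

ALERT_QUERY = """
SELECT client_name, COUNT(*) AS fetches
FROM fetch_log
WHERE fetch_date = CURRENT_DATE()
GROUP BY client_name
HAVING COUNT(*) > {threshold}
"""

def check_for_new_large_crawlers(run_query, file_issue):
    """Run daily; file one documentation issue per client over the limit."""
    rows = run_query(ALERT_QUERY.format(threshold=DAILY_FETCH_THRESHOLD))
    for row in rows:
        file_issue(
            title=f"New large crawler in town: {row['client_name']}",
            body="Crossed the daily fetch threshold; consider documenting it "
                 "on the public crawler list.",
        )
```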
— And on the utility note, there's also really aggressive caching on our side internally, and that's regardless of the HTTP caching mechanisms. So for example, if, let's say, Google News fetched something 10 seconds ago, does it make any sense to go out with another crawler, one supplying data to web search, and fetch that thing again? It probably doesn't. So basically we just hand over the copy that we got 10 seconds ago, to avoid these things. But then there are also tricky cases where different projects might have different policies about reusing content fetched for something else. Let's say that something random like AdWords cannot reuse content that was fetched for web search. — That makes sense. And you said something about a job that was still running. So I'm guessing this infrastructure is huge and has to consume a lot of URLs every day. So I'm guessing we're not running this from, like, your computer on your desk, right? — So this is going a lot into our infrastructure, but imagine the same way that Google Cloud has those runner instances, or whatever they are called, we have something similar internally. So basically I can bring up a job on some remote server in some random data center in Atlanta, and the job would be a C++ program that I compiled into a binary and run there. — Oh, okay. — But within that program, I would make the API calls. So basically I can instruct that program to egress to an API endpoint of that SaaS crawler infrastructure thingy and set up a crawl or whatever. — So yeah, do I have to do that manually, or is it smart enough to schedule an egress point that makes sense? For instance, if something is geoblocked. — Oh, pet peeve. Geoblocking is interesting, because generally we don't have the infrastructure for handling it. So the typical egress points that we have, like the IPs that start with 66, like 66.249 and so on, have the US as their assigned country. — Mhm. — And if you dig into it, it's going to be Mountain View, California, which means that, and we have this in the docs, we are typically crawling from the US. And when someone is geoblocking, our typical crawler will have an IP address from that location, from California, and we will not be able to fetch. We are most likely going to get some sort of error: either an HTTP error, let's say a 403 blocked, or some sort of network error, let's say a connection timeout, where some random router with a firewall set to block requests from outside specific regions just drops the connection; it wouldn't even send back an echo. And the way we deal with this is trying

### [15:00](https://www.youtube.com/watch?v=JpweMBnpS4Q&t=900s) Segment 4 (15:00 - 20:00)

to find IP addresses within our assigned pools that have a location set to a different country, and then lease those IP addresses to the crawling infrastructure. But these egress points were not designed for high-capacity crawling, so they don't have the capacity to handle crawling for everyone in, let's say, Romania or Germany or Switzerland. Well, Switzerland is tiny, so maybe yes. So we are very frugal when it comes to assigning crawls to those IP addresses. But technically we kind of can, and sometimes we do, especially if we know that the utility of that content is very high. It's a really bad example, but let's say enough people search for blue-eyed Martin. — Oh, god. Why? My eye color comes up again. Okay. All right. — That literally never comes up. Anyway, if someone is searching for blue-eyed Martin, and we know that there is a site in Germany that has that content, then we would make an effort to egress from Germany to be able to fetch that content, if the content would otherwise be blocked or geofenced. But, and again, this was a bad example, don't quote me. Let's say that John Mueller said this, my manager. But in theory, that's how it works. — Okay. All right. — It's a very, very bad idea to rely on this. — Okay. So, no geoblocking for Googlebot if you reliably want to be... — Yeah. — ...crawlable. I see. But another thing that comes to mind: there might be people geoblocking things, but in general, a crawler can generate a lot of traffic. Do we have some sort of behavioral rules or best practices for our side of things? Because I guess if I build a project and I say, "Hey, Google crawling infrastructure, here's my configuration. Please crawl these bazillion URLs every hour," will it just do that, or is there some sort of guidance on how our crawlers should behave? — How our crawlers should behave, how? — Because you can overwhelm the internet, basically. — Right. So that kind of thing is handled at the infrastructure level. — Mhm. — And it's actually one of the reasons why we have that infrastructure: we need to be able to force teams to not break the internet. Let's say that I'm a new engineer. — Mhm. — And I come to Google, I sit down, I quickly get access to one of the machines in a data center, and I start scripting. I write a bash or shell script, open a socket, and start streaming in data. That particular server has a 10-gigabit connection, and I go to martinsplitt.com and start pulling data at 10 gigabits per second. I think that your server, or at least your host, is not going to like that. — Yeah. — So what we are doing instead is that generally you cannot egress directly from the servers that are running in our data centers, unless you are calling one of the fetch services, like one of the crawler infrastructure endpoints, and egress that way. And then, say, martinsplitt.com started slowing down on repetitive fetches, so basically from the baseline the connection time just went up and up and up; then we have to slow down, and the infrastructure will throttle the requests that it is sending to martinsplitt.com. If it gets a 503 HTTP response, then it slows down even more, because that actually means that the server was most likely overwhelmed in some way. But then 403, 404, all of those, they don't mean anything here. That's just, like, a random client error, like you sent the wrong URL or something like that.
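The throttling behavior Gary outlines reduces to a simple control loop: back off hard on 503, ease off when latency drifts above baseline, and treat plain client errors as neutral. A sketch of that logic, with illustrative numbers only:

```python
# Hypothetical per-host rate adjustment based on server health signals.
MAX_RATE = 10.0  # made-up ceiling, requests per second

def adjust_crawl_rate(rate: float, status: int,
                      latency: float, baseline_latency: float) -> float:
    if status == 503:
        return rate * 0.5             # server likely overwhelmed: back off hard
    if latency > 2 * baseline_latency:
        return rate * 0.8             # connections slowing down: ease off
    if status in (403, 404):
        return rate                   # client error, says nothing about load
    return min(rate * 1.1, MAX_RATE)  # healthy responses: recover slowly
```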
So yeah, the please-don't-break-the-internet part, that's in the crawler infrastructure, at the infrastructure level, and generally that's not something that individual teams can control. — Okay, so I can't screw it up with my own project. That's nice to hear. Are there any other general guidelines that

### [20:00](https://www.youtube.com/watch?v=JpweMBnpS4Q&t=1200s) Segment 5 (20:00 - 25:00)

the crawler infrastructure prescribes, so to say? — I mean, there's a bunch of things that are for our own protection, or our infrastructure's protection, like for example the infamous 15-megabyte default limit. That is set at the infrastructure level, and basically any crawler that doesn't override that setting is going to have a 15-megabyte limit. — Mhm. — So basically it starts fetching the bytes from the server, or whatever the server is sending, and there's an internal counter, and when it reaches 15 megabytes, it basically stops receiving the bytes. I don't know if it closes the connection or not. I think it doesn't close the connection; it just signals to the server that, okay, you can stop now, I'm good. But then individual teams can override that, and that happens quite a bit. For example, for Google Search specifically, the limit is overridden to 2 megabytes. — For everything? — Well, mostly everything. For example, for PDFs it's, I don't know, 64 or whatever. — Okay. — Because PDFs can get huge. Like the HTTP standard: if you export it as a PDF, I think you said that, then it's 96 megabytes or something. — I think so. Yeah, it was huge. I remember that. — But that means that it would overwhelm our infrastructure if we fetched the whole thing, converted it to HTML, blah blah, and then started processing it. It's just overwhelming, because it's so much data. And the same goes for HTML, the HTML Living Standard. If you have, like, a 14-megabyte page, we're not going to fetch that. We are going to fetch the individual pages, because fortunately they also had enough brainpower to have individual pages for individual features of HTML. We can fetch those pages, but we are not going to get anything useful out of the 14-megabyte one-pager of the HTML standard. — Yeah. — And other crawlers, I never worked on other crawlers, but I'm sure they have different settings. I could imagine, for example, that even individual projects can have different settings for the same thing. For example, I can imagine that if we need to index something very fast, then the truncation limit could be 1 megabyte. I don't know if that's the case, but I could imagine it, because if you need to push something through the indexing pipeline within seconds, then it's easier to deal with less data. — That's true. I think in general it is useful to have cleared up this idea of crawling being a monolithic kind of thing. It is more like software as a service, where Search, or web search specifically, is one client, and not a monolithic kind of thing. And as you said, the configuration can change. It can even change within, let's say, Googlebot. If I'm looking for an image, we probably allow images to be larger than 2 megabytes, I guess, because images easily are larger than 2 megabytes. PDFs, we allow 64 or whatever is documented; we'll link the documentation. But I think that makes perfect sense. And if you think about it as a service we call with a bunch of parameters, then it makes a lot more sense to see that, oh, okay, there's different configuration, and this configuration can change at the request level; it's not like Googlebot is always the same. Okay. Wow. All right. That was... — That was something, huh? — That was a whole bunch of stuff. Yeah, that was a lot of stuff. I think that was useful, though, and I hope that our listeners think the same way.
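A small sketch of the byte-counting truncation described earlier in this segment: count bytes as the response streams in and stop at the limit, which a client can override per request. The 15-megabyte default is from the episode; everything else is illustrative.

```python
# Hypothetical truncation of a streamed response at a configurable limit.
DEFAULT_LIMIT = 15 * 1024 * 1024  # infrastructure-level default from the episode

def read_truncated(chunks, limit: int = DEFAULT_LIMIT) -> bytes:
    """Consume response chunks until the limit is reached, then stop."""
    received = bytearray()
    for chunk in chunks:
        remaining = limit - len(received)
        if remaining <= 0:
            break                       # enough bytes; stop receiving
        received.extend(chunk[:remaining])
    return bytes(received)

# e.g. a fast indexing path might pass limit=1 * 1024 * 1024 instead
```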
Let us know in the comments below if you're interested in more stuff like this, or if this was useful or not, and subscribe to the podcast and tune in next time. Thanks so much, Gary, for being here with me today. — Are you a cop? — I'm not a cop. — Then don't tell them how to live their life. — I don't... I'm just making suggestions here. I'm just... — Okay. — Rude. — Fine. — Okay, fine. — Bye. — Fine. Goodbye. — We've been having fun with these podcast episodes. I hope you, the listener, have found them both entertaining and insightful, too. Feel free to drop us a note on LinkedIn or chat with us at one of the next events we go to. If you have any thoughts, let us know. And of course, don't forget to like and subscribe. Thank you so much for listening, and goodbye.

---
*Source: https://ekstraktznaniy.ru/video/21749*