Analysing Robots.txt at scale with HTTP Archive and BigQuery
27:41

Google Search Central · 23.04.2026 · 2,509 views · 68 likes

Video description
In this episode of Search Off the Record, Martin and Gary turn a simple robots.txt question into a data‑driven deep dive using HTTP Archive, WebPageTest, custom JavaScript metrics, and BigQuery. They explore how millions of real robots.txt files are actually written in 2025–2026, which directives and user‑agents are most common, and what that means for modern crawling and AI bots. Perfect for beginner to mid‑level developers and SEOs, you’ll learn how large‑scale web measurement works (HTTP Archive, Chrome UX Report, Web Almanac), and how to turn raw crawl data into actionable SEO insights. Subscribe for more candid conversations about crawling, indexing, and the data behind how Google Search and the web really work. Resources: Web Almanac → https://almanac.httparchive.org/en/2025/ Robotstxt custom metric for the HTTP Archive → https://github.com/HTTPArchive/custom-metrics/pull/191 robots.txt parser change → https://github.com/google/robotstxt/commit/4af32e54b715442bb04cd0470e99192f0ffb9792#commitcomment-178586774 Episode transcript → https://goo.gle/sotr108-transcript Listen to more Search Off the Record → https://goo.gle/sotr-yt Subscribe to Google Search Channel → https://goo.gle/SearchCentral Search Off the Record is a podcast series that takes you behind the scenes of Google Search with the Search Relations team. #SOTRpodcast #SEO #GoogleSearch Speakers: Martin Splitt, Gary Illyes

Table of contents (6 segments)

Segment 1 (00:00 - 05:00)

Hello. It is I, Gary. From Google Search. And you're hearing me today because we have yet another episode of Search Off the Record. You know, the podcast from the Google Search team discussing all things digital and shedding some light on how Google Search works sometimes. Or how the internet works sometimes. Lots of things. Um, let's see. Am I alone? I'm not alone. I'm Martin. Hello. Did you almost forget about me? Am I that forgettable to you? Yes. Ouch. Wow. Well, don't ask questions if you're not prepared to be hurt by the answers. Okay, fine. Hello. Hello. It's been a while since I've seen you. Like less than 24 hours. Oh, true. Yeah, we saw each other yesterday. That's true. Yeah. Do you have anything exciting coming up? Uh, Search Central Live events are coming up. We're visiting the world once again. Yes. Where are you going first? Um, my first one is going to be Brazil. Oh. Please eat some acai for me. Oh, I will. Oh, God, yes. Thank you. Oh, yes. I very much appreciate that. It's so good. But I guess you don't want to talk about acai today. Oh, I could talk about acai. I read so much about acai the past few years. The first time I went to Brazil for a conference, it was an external conference. Someone introduced me to acai. I never had that before. And it was basically just walking one block from the hotel. I think it was with Pedro Diaz, former Googler. Now, I think he's developer something. He has a company where they are developing stuff and also some SEO. Anyway, he just took me to this very small corner shop. And he was like, you have to try acai. And I'm like, "Oh, no. Leave me alone." And he's like, "No, no, no. You have to try it." And then I tried it. And then I kept eating acai. On one day I had like five bowls of acai because it was so good. And I kept ordering via room service. And at one point the room service person was like, "Yeah, we think you should stop." And I'm like, "No."
And the person was like, "Yeah, it's already like 7:00 p.m. or something. Like you should really stop because you're not going to sleep." And I'm like, "What do you mean, you're not going to sleep?" Like it's full of, not caffeine, but one of those other components that just make you not sleep. Great. I haven't slept for two days. But you're right. I'm not here to talk about acai today. I had a saga and you helped me a little bit with that saga. And I thought that we'd talk about that saga. Oh, the... I think the SQL stuff we did for Web Almanac. Well, yeah. But the project was bigger. So, to give some background, we received a pull request on the official robots.txt repository to add two new rules/directives to the unsupported tags list. Basically, Search Console would report that it recognizes these tags but Google doesn't support them. And the pull request was great. Like it was a very good idea. The person who sent it, well, the username is 3 x 10 raised to 8. I don't know who that is, but the pull request was great and it was a good idea. But I don't know about other companies, but at Google we try to not do things arbitrarily but rather collect data and then say that yes, this makes sense based on the data. And John Mueller, our manager, had this idea: how about we don't just add this one tag but look through, let's say, the top 10 or top 15 tags and add the ones that we don't have in that list yet. Because that would give us a decent starting point, a decent baseline. And we'd be able to say that okay, we are documenting the top 10 of these tags that we don't support, right? And fast forward two days, I'm struggling to find a public repository of robots.txt files that we could use to identify these tags. And he suggests the HTTP Archive. I have never used the HTTP Archive

Segment 2 (05:00 - 10:00)

before, other than looking at the reports that they do. I think it's called the Almanac or something. Yep, the Web Almanac. So, I don't know how it works. I don't know where the data lives. I don't know anything about it. In fact, I still don't. Do you? Yeah, I do. I used to contribute to the SEO chapter a few times. Mhm. Okay. Want to teach me how it works? Okay, yeah, sure. So, you know nothing about it, I'm assuming. I honestly know like literally nothing. Like we did some code for it the past few days. But I don't know how it's used or why it's used. I don't know the data sets. I know literally nothing. Okay. So, it has been running for the last, I don't know, how many years. It's definitely been around since 2019, it must have been, because I think I was involved in the 2019 edition. I believe that it was running before that as well. But I'm not sure about that. But the idea is that you basically look at a large number of websites, or web pages more specifically, and look at how the web changes, or like things that you can learn from looking at large quantities of websites. For instance, what language are they in? Are they mobile-friendly? Are they using HTTPS? Are they, I don't know, using canonicals? These kinds of things, like all sorts of stuff that you can basically infer from looking at the source code of a website or from the behavior of a website. So, it has to do a crawl. So, it has to basically know all these things. And then it also has to kind of quote-unquote render them. So, it has to do some sort of analysis on them to get some additional data. For instance, performance, like how fast is this loading? What do the Core Web Vitals look like, and so on and so forth. You can't get that from just a crawl. You have to actually sort of run the website in a browser to get this kind of data.
And these two things can be combined and then a bunch of people set out every year to ask questions about the big data set of information they have. So, there's like two stages. In stage one they're like, "Hmm, I wonder, I don't know, for instance, I wonder how many words per page there are for all the websites that we will look at." And then they write some script that gets this data from the crawl. With words on a page, you can probably get some of that information from a crawl. But if it uses JavaScript, you would have to get that also from the rendered version. So, when you say a crawl, what is that? Like who's crawling what? Okay. So, we start with a bunch of URLs that we know exist. And I believe that these URLs are coming from the Chrome UX Report, if I remember correctly. Okay. So, if you opt into it, your Chrome browser sends data to an aggregate report, basically, that says like, "Hey, here is what we've seen in terms of performance data, for instance, from real users opening this website." It doesn't say like Martin or Gary have seen these numbers, but it basically aggregates it. So, on average across all the people who have visited this website until now or in the last year. I'm actually not exactly sure how Chrome UX Report segments the data. But basically, all these URLs, all these websites have been visited by someone who sends the data into the Chrome UX Report. And then you can query it. So, it's a public data set of aggregated user experience metrics for websites. Interesting. And in this set are millions of URLs. I believe it's like 16-something million. Just a huge data set. Okay. Historically, those have mostly been homepages. So, they kind of filtered it down to only get homepage data. Kind of arguing like, "Oh, you know, it's probably the more popular part of every website to go to the homepage." Like you go to ebay.com or to amazon.com or to google.com rather than to like google.com/howsearchworks.
That is a page on this website, but it's probably not one that we have a lot of data on. So, historically it has been focusing on the homepages. But in the recent couple of years, and I'm not sure when they started this, at some point they expanded to what they call secondary pages. Okay. So, you can kind of say like, "Oh, we are only interested in homepages," or "we are also interested in how homepages perform or look compared to quote-unquote secondary pages." Right? Okay. Because usually we have this kind of stuff in the Chrome UX Report. And for some websites, they might also be much more popular than the homepage, for instance, and then the homepage gets a bit neglected and then

Segment 3 (10:00 - 15:00)

whatever secondary page you have is more popular, so you put more effort into it. And then they basically run a crawler. I'm not exactly sure how they are crawling it, but they are basically doing a big run. I think what they do is they run it through, uh, WebPageTest. Are you familiar with WebPageTest? Uh, that's some service, uh... Yes. Okay. Yes. Uh, webpagetest.org. You can go there. You can type in a URL and it runs your website in an actual browser. And I believe what they do is they do that. They have like their own instance. Uh, they're probably paying for that. I'm not sure. Or they have some sort of collaboration with webpagetest.org. And then they take the list of URLs that they got from the Chrome UX Report and they basically run these through a browser instance on a server hosted by WebPageTest. Okay. And that's what they do to crawl, I believe. Cool. But as you run it in a browser, you get a bunch of information that you don't get if you were basically using curl or wget or whatever on the command line to kind of just download the HTML. Like for instance, you can tell what amount of CSS has actually been used and how much is unused. You can run a Lighthouse test on it. And you can run some JavaScript that you can control, and that's what we wrote. Remember? That's the JavaScript that we created. Oh, I remember. Yes. I remember. Yes, of course you do, because you love JavaScript so much. JavaScript. So that was also weird to me because I didn't realize that you can use JavaScript for this kind of stuff. But anyway, the way I discovered this whole thing works, well, not how it works, but how the data is stored. I don't know if you can download it or not, but there's also a BigQuery data set, or data sets. Yes. And then you can basically write SQL queries to query those data sets. Yeah. That's the second step. Yeah. Which, um, can be very harsh on your wallet. As I learned.
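How harsh a query is on the wallet follows from how on-demand querying is usually billed: by the number of bytes the query scans. A minimal sketch of that arithmetic, assuming a per-TiB price that you pass in yourself. The function name and all the figures below are hypothetical and are not taken from the episode. BigQuery can also report the bytes a query would scan in a dry run, which gives you the input to this calculation before anything is charged.

```python
TIB = 2 ** 40  # number of bytes in one tebibyte


def estimate_query_cost(bytes_scanned: int, usd_per_tib: float) -> float:
    """Estimate the on-demand cost in USD of scanning `bytes_scanned` bytes.

    `usd_per_tib` is an assumption supplied by the caller; check the current
    price list for your region rather than trusting a hardcoded value.
    """
    if bytes_scanned < 0 or usd_per_tib < 0:
        raise ValueError("inputs must be non-negative")
    return bytes_scanned / TIB * usd_per_tib


if __name__ == "__main__":
    # Hypothetical example: a query that scans 40 TiB at an assumed 6.25 USD per TiB.
    cost = estimate_query_cost(40 * TIB, usd_per_tib=6.25)
    print(f"{cost:.2f} USD")  # prints "250.00 USD"
```

Selecting only the columns you need and filtering on a single crawl date are the usual ways to shrink the scanned bytes, and with them the bill.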
That is true because the data is relatively large, I guess. I literally remember that Daniel Waisberg, uh, our teammate, he wrote a blog post about how to avoid large charges [snorts] on BigQuery. What? When you are digging into Search Console data. And when I got the charge, I ran one query, like one large query. And I got like hundreds of dollars' worth of charges for that one particular query. And yes, it was running for quite a while, but still, it's like hundreds of dollars for it. So yeah, yay. It happens. And it's an open source project, so yeah, I will just absorb it, I guess, but it was painful. Anyway. What? What do you mean? Uh, reminds me of the, what could a banana cost, Michael? $10? Oh, yeah, yeah. I love that. It's so good. It's a great show as well. Yeah. I thought it was coffee, no? From Mean Girls or something. Anyway. Arrested Development. It was, uh... Oh, yeah, yeah. Arrested Development. Yeah. Anyway, and we quickly figured out that no one is actually requesting robots.txt files. So the data sets don't typically have robots.txt files in them. Mhm. Which was also very painful because I already paid like hundreds of dollars for that one particular query. This is great. [snorts] But don't stop laughing. I'm so sorry. You are not. No. And then more internal discussions, and then we realized: why don't we just put this in the custom metrics data set, which again is not something that I knew of. Mhm. If I knew of that thing, then probably I wouldn't have run that initial query that cost me so much. Do you know about the custom metrics? Yeah, so step number one is kind of exactly what you then did, the custom metrics bit. Where we take the URLs and we run them through WebPageTest. And as WebPageTest says, "Okay, this page is now done. There's nothing happening anymore in this test browser," you can run some JavaScript on whatever you got. And I believe that there are some URLs in there that are robots.txt URLs, because I think in the SEO chapter there is a robots.txt analysis. I'm not exactly sure how you can filter for robots.txt specifically from all the URLs that we have. But basically, that's step one. Like you gather these metrics. And they are custom metrics because they're not exposed by default. For instance, if you run a Lighthouse test, you have certain things like, I don't know, uh, I think Lighthouse tests for the core

Segment 4 (15:00 - 20:00)

web vitals, so you can basically say like, "Hey, from this database that is created from all the things that we run through WebPageTest, I want to see from each of the pages the lighthouse.corewebvitals, I don't know, largest contentful paint." And then you get the numbers. And then you can do things like, "Hey, so what's the average? What's the maximum? What's the minimum? Blah, blah. What's the 90th percentile?" You can do these kinds of things, but that's not a custom metric, because these metrics are kind of like default and you can get them just by running the page through the browser. But then you can run these extra scripts that are looking at the content, and you can do things like, for instance, you can say, "Hey, give me all the child elements that are in the head." And then later on in the queries part, you get a list of all the things that are in the head. Let's say you call it custom.head-invalid-elements or something. And then you can say like, "Okay, so for all of the things that we have in this database of these head elements, which are ones that don't belong there? Or which is the most likely head element that we are seeing? Or, I don't know, what's the charset that people are setting in their meta charset element that they have in the head?" And that's how you get any sort of metric that isn't available by default to a browser or Lighthouse or whatever other tools we're running. I think there's another one, Wappalyzer or something like that, that gives you information like what framework has this been built with, or what content management system is this using? So if it's not in these kinds of default tools that are running, then you can add custom code to get out what you need. And that's what you did, right? Yeah. I mean, we did it. Okay, fair enough. Like we did, yeah. You wrote the code. I looked at it and cried only a little. So that's the suggestion that we got from Barry Pollard.
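The average, maximum, minimum and 90th percentile mentioned here are plain aggregates over one number per page. A minimal sketch of those four aggregates, assuming the per-page values have already been pulled out of the data set into a list. The function name and the sample values are hypothetical, and the percentile uses the inclusive interpolation method from Python's standard library, which may differ slightly from what a SQL engine computes.

```python
import statistics


def summarize(values):
    """Return min, max, mean and 90th percentile for a list of numbers."""
    if len(values) < 2:
        raise ValueError("need at least two data points")
    ordered = sorted(values)
    # quantiles(n=10) returns the 9 decile cut points; the last one is p90.
    p90 = statistics.quantiles(ordered, n=10, method="inclusive")[8]
    return {
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
        "p90": p90,
    }


if __name__ == "__main__":
    # Hypothetical largest-contentful-paint values in milliseconds, one per page.
    lcp_ms = [1200, 1800, 2500, 3100, 4000]
    print(summarize(lcp_ms))
```

For the five sample values the mean is 2520 and the inclusive 90th percentile is 3640.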
So he pointed us to their GitHub repository for custom metrics. And then there we found this um weirdo JavaScript function or class, I don't remember what it is. Anyway, that is actually extracting some limited number of rules. But they were hardcoded, right? So basically, it was a noindex, a noarchive, I don't know, crawl delay, whatever. Basically, just counting those that they knew of already. And we needed the exact opposite. We wanted to learn of all the rules that people are using. Not just the ones that we know about. Mhm. So we twisted it around and we got some really good comments from Barry and some other folk in the GitHub community. And then we started collecting data. I think we submitted it February 3rd or something like that and then it was merged a bit later, but it was submitted right before the next run. So basically, the next crawl. I don't know. — Yeah, the next run basically. Probably using the wrong terminology, but basically, we managed to get in data for the February 1st data set. Oh, nice. Okay, that's really nice. Yeah. And then again, go back to BigQuery or wait for the run to complete, go back to BigQuery and then run the query again, get heart attack. And then just use that data. And yeah, that's the story. Do you want to talk about our JavaScript or not? Uh we can talk about a little bit about the JavaScript because I basically start to remember things, which is I think generally good. Great. So you know what would have helped a lot? What? — Martin. What? If we had a JavaScript parser. You were less than enthusiastic and interested and now is like the most important thing. — Okay, fine. — told you a bunch of times that there are people who actually need it and finally it was me who needed it. And I was very disappointed that I didn't know I didn't I'm so sorry, sweetie. Are you? I apologize. Do you? Yes. I actually do. 
We are going to include the link to that JavaScript function and I'm going to ping you this so you can also see it, because you probably don't have it. I have it open. You do? Mhm. How did you find it? It's very secret. No, sir. So, some discoveries. Basically, what I was trying to do, and then you confirmed that we can do that, is to roughly imitate what the C++ parser is doing. Mhm. And that is basically going line by line. And then I thought about going character by character, but it doesn't make sense when you are doing this kind of stuff. Because you are not looking for one specific tag or rule, you are looking for anything that looks like a rule. Yes. Right? Yes. So, I am really, really bad at writing regex or regexp.

Segment 5 (20:00 - 25:00)

So, I asked the toaster, or the AI chatbot, to write me a regex, because it is really good at writing regexes for some reason. I don't know why. Maybe there is lots of training data for it. But it came up with this monstrosity of a regex on line 58. Yeah, that one is scary. I mean, regex in general is difficult, but like this one, jeez. Yeah, and it basically just came up with that. I tested it over and over again. I actually ran it through a fuzzer. So, basically just to try to break it, basically test its limits, and it didn't break. So, I was happy with it. And then we are just matching each line that we extracted that starts with something that resembles a key-value pair. Mhm. Separated by a colon. And then we are just extracting that. And that will produce lots of weird stuff. Like, if you look at the distribution, maybe I will put this on LinkedIn or something. Well, maybe not LinkedIn, but what's the bird, the new bird thing? Bluesky. Bluesky. If you look at the distribution of rules that it extracted... Oh. It is... How do I show it to you? I don't know. Send me a link. Okay. Where is Martin? I'm here. Martin Splitt. Okay, link. Mortimer. Mortimer. So, if you look at the distribution... Oh, yeah. Ooh. It's basically an extremely sharp drop-off after the really popular ones. Mhm. So, basically you can see that we have the "other" bucket, which is basically all the lines that had a colon in them or something like that. But after allow and disallow and user-agent, the drop is extremely drastic. Like, even if you put it in log scale. I have one in log scale as well because that's showing it better. And also people can extract this from BigQuery as well, from the HTTP Archive. Yeah, from BigQuery. This is now in the latest crawl data. Mhm. It's in the custom metrics. If you look at this one, you can see that even on log scale, the drop-off is extremely sharp. So, basically there is a large chunk of robots.txt files that contain these tags.
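The line-by-line approach described here, keeping any line that looks like a key and a value separated by a colon, can be sketched as follows. This is an illustration only. The pattern and the function name are hypothetical and much simpler than the regex in the actual custom metric, which is not reproduced in this transcript.

```python
import re
from collections import Counter

# Hypothetical pattern: a key made of letters and hyphens, a colon, then any value.
RULE_LINE = re.compile(r"^\s*([A-Za-z][A-Za-z-]*)\s*:\s*(.*)$")


def count_rule_names(robots_txt: str) -> Counter:
    """Count every key that appears in a 'key: value' line, known or not."""
    counts = Counter()
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        match = RULE_LINE.match(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "User-agent: *\n"
        "Disallow: /private/\n"
        "Allow: /private/public.html\n"
        "# a comment\n"
        "Crawl-delay: 10\n"
        "Dissallow: /typo/\n"
        "Sitemap: https://example.com/sitemap.xml\n"
    )
    for name, n in count_rule_names(sample).most_common():
        print(name, n)
```

Because the pattern accepts any key, it also picks up typos such as `dissallow` and unrelated text such as a CSS declaration like `padding: 0;`. That is the price of discovering rules rather than counting a hardcoded list, and it is why the long tail of the distribution is noisy.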
And then there's broken files like johnmu.com/robots.txt or garye.com/robots.txt, which contain just fun stuff, so to say. Actually, there's a bunch of pages that probably don't have a robots.txt and give us some sort of error page here. Yeah, yeah. Like, lots of HTML pages with, uh, the CSS in it. Yeah. With CSS in... Yeah, exactly. That's why you see all those, uh, padding and IMG and... IMG, A, color with... Yeah. We can also use this to identify the typos of the disallows. So, I'm probably going to expand the typos that we accept. I just realized we might be able to filter these out in the query or in the custom metric. Okay. If you have ideas, I'm happy to review it, because I'm so good at JavaScript, as you know. Yeah. I mean, logically speaking, we have to check that we get a 200 status back, so we will avoid all the 404 pages. Sure. We can probably tell if its content type is text/html and then just not deal with it. Well, if you are strict with the parsing, then it's fine to ignore those, but technically Google does want to parse out rules from normal HTML files as well. We are not doing that if the HTTP status is not 200, right? That's correct. Yeah. Anyway, and then all these things that it extracted, plus some additional data that was always there, like the raw byte size of the robots.txt file, those are put in a JSON file and then basically put in the custom metrics data set. Is it a data set? What is it? Mhm. Data set, I would say. Okay. Yeah, and that's how we expanded our understanding of robots.txt rules. With data. That's really cool. Wow. And I think it's really nice because that might make its way into the SEO chapter for this year's Web Almanac, because they just have more information available. Oh. Ah. Yeah. I did not know that. Yeah, yeah. I think they have... Let me check. I think the Web Almanac, in the SEO chapter, it's brilliant. I definitely highly recommend reading it.
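The filtering idea floated here, only parse a response that came back with a 200 and skip it when the content type says it is an HTML page, can be sketched like this. It is the strict variant from the conversation, not Google's behavior, since the speakers note that Google does try to pull rules out of HTML as well. The function names and the JSON field names are hypothetical.

```python
import json


def should_parse(status: int, content_type: str) -> bool:
    """Strict filter: parse only 200 responses that are not served as HTML."""
    if status != 200:
        return False
    # Ignore parameters such as "; charset=utf-8" and compare case-insensitively.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type != "text/html"


def build_record(body: bytes, status: int, content_type: str) -> str:
    """Serialize the per-file facts as JSON, including the raw byte size."""
    record = {
        "status": status,
        "size_bytes": len(body),
        "parsed": should_parse(status, content_type),
    }
    return json.dumps(record)


if __name__ == "__main__":
    print(build_record(b"User-agent: *\n", 200, "text/plain; charset=utf-8"))
    print(build_record(b"<html>Not found</html>", 404, "text/html"))
```

Keeping the status and size in the record even for skipped files means the share of 404s and HTML error pages can still be counted later.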
I think they do discuss... Yeah, here. So, robots.txt is discussed. For instance, the status codes. Oh.

Segment 6 (25:00 - 27:00)

Like, 84.9% of the URLs that they had looked at from the crawl set basically have a 200. 13% have a 404, and then others are weird, like timeouts; 403 and 500 are negligible basically. Less than a percent. Oh. Each. Uh, robots.txt size in kilobytes: most of them are between 0 and 100 kilobytes. Huh. Yeah. I mean, that makes sense. Like, you can put that much stuff in it. A lot of them contain asterisk as the user agent. Makes sense. AdsBot-Google is the most often mentioned. Googlebot only appears in 6.2% of the robots.txt files they looked at, but AdsBot-Google in 9.8% last year. Oh. Interesting. Yeah. Huh. Interesting. So, yeah, they have a bunch of stuff here. Cool. Nice. This has been fun. Well, Martin, guess what? Do you want to talk about something else? Well, yes, but we cannot. Oh. So, you leave me. Well, you are leaving me. Uh, quite literally, you are moving to... Fine. ...a different country. Temporarily. I'll be back. You don't have to worry about that. Yes. Will you? Yes, of course. And we will be back to all of you out there as well with a new episode soon. You are saying my line. Yeah, because I'm a sweetheart like that. I reduce your work. Oh, fantastic. Now, I don't have to deal with AI anymore. You taking... That's worse. Less predictable. More unstable. Well, Martin, thank you so much for chatting with me. You are the only one who's still chatting with me. And for the listeners, thank you also for listening to us. Please like and subscribe wherever you get your podcasts. And please do, because if you want to listen to more episodes, then we need numbers. True. Because we are a data-driven Martin and Gary. Well, Martin, again, nice chatting with you. Goodbye. Nice talking to you. Bye-bye. We've been having fun with these podcast episodes. I hope you, the listener, have found them both entertaining and insightful, too.
Feel free to drop us a note on LinkedIn or chat with us at one of the next events that we go to if you have any thoughts. And of course, don't forget to like and subscribe. Thank you and goodbye.
