How Browsers Really Parse HTML (and What That Means for SEO)

Google Search Central

Segment 1 (00:00 - 05:00)

Well, hello, hello everybody. This is Martin from the Search Relations team and welcome to a new episode of Search Off the Record. With me today is Gary. Hello Gary. — I don't want to talk about it. — Okay, but I do want to talk about something because I had a thought. — Oh no, not again, Martin. — I know. — Oh, we asked you this so many times. — I mean, they are few and far between, but I thought I'd give it a go again. You know, it's — Oh, — it's a new year. It's a new me. It's a new thought. I promise it's the only one for this quarter. Is that okay? — So, what were you thinking? — I don't think we've ever discussed how HTML parsing works, which I think is kind of important to understand. And I see that especially when I look at people who have been working with the web for a long time, not all of them are paying that much attention. And I realize that I'm not paying enough attention because there's a new kid on the block that I have heard about but not really looked into and that's client hints. But I think before we can discuss client hints, we should talk about how that generally works. And I think you're the right person because you and I discussed parsing beforehand. So shall we talk about that? Okay. — Or can I say no? — You can say no, but I'm going to talk to you about it anyway. — As if that had made a difference. I don't know when you will learn that, but uh — interesting. — I'm an excited puppy. I'm going to bark up to you anyway. So, — okay. — HTML. — That's such an exciting topic. So, okay, stepping back. Why are you bringing this up now? — I know that the way you build your websites has an impact on how it performs in terms of perceptible speed for the user as well as how it performs when crawlers have to interact with it. And there are things coming and have come in the last couple of years that I honestly slept on. — Okay. — So I think now is a good time, as good as any time really, to catch people out there up, as well as me, on this a bit.
— So basically nothing happened. You just want to talk about it because we haven't talked about this. — Yeah. Pretty much. — All right. I was asking because I kind of find that when we are grounding these discussions in an issue that we found, it's more interesting to me to talk about, because you are explaining an issue versus trying to describe a system, but of course this should be fine as well. — Do you have an issue in parsing that I'm not aware of? — Oh, so many. I mean, parsing HTML is notoriously — challenging. — What's the PC term? Yeah, let's go with challenging. You would think as a fellow developer who probably started in the early 2000s or even in the 1990s that you can just write, or at least when you were a newbie, you might have thought that hey, I can just write this nice regex, or "rejex" as John Mueller would say, and that will work, right? — I did that — right? — I did that. It did not work. — Yeah, me too. I think everyone who ever tried to develop something for the internet at one point in their life will have written a piece of regex that probably worked for some cases but not for all cases, right? — Yep. And that is because technically HTML is supposed to be this beautiful structured thing. But in reality, it's just a mess because, well, it has to work all the time in browsers, which means that browsers are extremely lenient about what they accept, which in turn makes the developers extremely lenient, and then they spit out random stuff in their Notepad.exe which will work in the browsers, will work for the users, but it's going to be a nightmare to parse. — True. — Right. — Yes. And I found the standard also quite lenient. — Yeah. — It allows a lot of stuff. It's interesting. Yeah. — Yeah. We should probably link to the standard in the episode notes, but basically it's a living standard. It keeps changing depending on what the web needs, right? — Yeah, I would say that's true. And it postulates how browsers and user agents

Segment 2 (05:00 - 10:00)

in general should deal with what's out there on the web. And they're trying to minimize breaking what is already on the web. Yeah. — While keeping it flexible enough for new stuff, which I think is pretty cool. It's a pretty impressive effort, I would say. — Yeah. I mean, it's been alive for like 30 years or so. So, — yeah. Wild. Even like the age is a testament to how cool it is. — I also remember that when I started building websites, I was absolutely in like madly in love with the validator. There's like a thing that tells you if your HTML is valid or not. — Oh yeah. — And then I was very depressed when I found out it doesn't matter as much. — What's this called? Uh W3C — Validator. Yeah. I also used to obsess about that as a younger newbie developer. Is it Nubai or Newbie? — Newbie, I believe. — Okay, newbie. We had uh two native English speakers on the team anyway. And I was obsessing about it quite a bit and then eventually I kind of noticed that it really doesn't matter. like it doesn't matter for the browsers, search engines unless you do something utterly stupid with your HTML. It's just going to work. I think that this has evolved to this stage where we are now because in the earlier days when you still had Netscape for example, Netscape is a very very old browser for listeners. Back in those days, you did have to do really hacky stuff, — right? Because you had Netscape, you had Internet Explorer, what else? — Safari at some point. — At some point. Yes. The early Firefox. And they were lenient in different ways. — Mhm. — And then you had to do some hacks like including special CSS for just I don't know, Internet Explorer or Netscape. — I remember that. Yeah. like the star hack. Yeah. This only works in certain browsers, so you can use it to address those Yeah. — quirks and those ones. Yeah. Cross browser compatibility was a huge issue. — Yeah. 
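Editor's sketch, picking up the regex point from the start of the conversation and the leniency described here: the example below compares a typical first-attempt regex with Python's standard-library `html.parser` on deliberately sloppy markup. The sample markup, the regex, and the helper names are illustrative assumptions. Also note that `html.parser` is a tolerant tokenizer, not a full implementation of the HTML standard's parsing algorithm.

```python
import re
from html.parser import HTMLParser

# Deliberately sloppy but browser-acceptable markup: an unquoted attribute,
# an uppercase tag with single quotes, and a tag split across two lines.
SLOPPY_HTML = """
<a href=/plain>one</a>
<A HREF='/upper'>two</A>
<a class="x"
   href="/multiline">three</a>
"""

# A typical first attempt: assumes lowercase, href first, double quotes.
NAIVE_HREF = re.compile(r'<a href="([^"]*)"')


class HrefCollector(HTMLParser):
    """Collects href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser hands us lowercased tag and attribute names.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def hrefs_with_regex(html):
    return NAIVE_HREF.findall(html)


def hrefs_with_parser(html):
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


print(hrefs_with_regex(SLOPPY_HTML))
print(hrefs_with_parser(SLOPPY_HTML))
```

The regex finds none of the three links, while the tokenizer recovers all of them. A more elaborate regex could catch these three cases, but each new variation needs another patch, which is the experience both hosts describe.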
And back then looking at the W3C validator actually mattered because the more valid your HTML was, in theory at least, the better it worked across different browsers. But nowadays I think that matters very little. — Okay. — I don't know if you agree with that or not. — I do agree with that. But I know that there is nuance here, which is like there are ways to break things that then break expectations in ways where then like yeah — there's always ways to break things. — Some people are like, oh, so you have to be like 100% compliant to the spec. And then other people are like, nah, it doesn't matter. And then they build something that doesn't work and they're like, well, why does this not work, right? — Yeah. — So it's not as easy as saying like it doesn't matter or it matters a lot or a little, because it depends on what you're looking at. Right. — Yeah. I'll give you an example because I think we have a video, and I'm going to link the video in the description of the podcast as well, where I discuss with, I believe it was Bastian Grimm, a case where they had hreflang link tags in the head where they belong. — Yeah. — But before them there was a script, which can also appear in the head, that is legitimate, like spec compliant. But then the script injected an iframe right after itself and that kind of closed the head and then the links moved into the body and that's where our infrastructure ignored them. Correctly, I would argue — right? — Oh, I would strongly argue for that as well. So if you go back to the standard, the living standard, and then you look at — yes — what can appear where, you focus on meta tags in this case or link tags, and it uses some kind of flowery language about where those tags or elements or whatever you want to call them can appear. Basically, for example, for meta tags, it says that they can only appear, and this is from memory, like I might use the wrong words: meta tags can only appear in sections where metadata is defined or in the context of metadata definitions.
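Editor's sketch of the hreflang case just described, under loud assumptions: the input stands for the markup after the script has already injected its iframe, `HEAD_TAGS` is a simplified stand-in for the standard's "in head" parsing rules, and the class and function names are hypothetical. It only models the idea that the first non-metadata start tag implicitly ends the head. It is not a spec-complete tree builder and not how any search engine is implemented.

```python
from html.parser import HTMLParser

# Start tags that may be processed while the parser is still "in head".
# Assumption: simplified from the HTML standard's parsing rules.
HEAD_TAGS = {"base", "basefont", "bgsound", "link", "meta", "noframes",
             "noscript", "script", "style", "template", "title"}


class HreflangLocator(HTMLParser):
    """Records whether each hreflang link lands in the head or the body."""

    def __init__(self):
        super().__init__()
        self.head_closed = False
        self.closed_by = None   # what ended the head
        self.found = []         # (section, hreflang, href)

    def handle_starttag(self, tag, attrs):
        if tag in ("html", "head"):
            return
        # Anything that is not head content implicitly closes the head.
        if not self.head_closed and tag not in HEAD_TAGS:
            self.head_closed = True
            self.closed_by = tag
        if tag == "link":
            attributes = dict(attrs)
            if "hreflang" in attributes:
                section = "body" if self.head_closed else "head"
                self.found.append(
                    (section, attributes["hreflang"], attributes.get("href")))

    def handle_endtag(self, tag):
        if tag == "head" and not self.head_closed:
            self.head_closed = True
            self.closed_by = "</head>"


def locate_hreflang(html):
    locator = HreflangLocator()
    locator.feed(html)
    locator.close()
    return locator.found, locator.closed_by


SAMPLE = """<html><head>
<title>Demo</title>
<link rel="alternate" hreflang="en" href="https://example.com/en/">
<script src="/widget.js"></script><iframe src="/pixel.html"></iframe>
<link rel="alternate" hreflang="de" href="https://example.com/de/">
</head><body><p>Hi</p></body></html>"""

links, closed_by = locate_hreflang(SAMPLE)
print(closed_by)
print(links)
```

In this sample the iframe ends the head, so the `en` link counts as head metadata and the `de` link is reported as sitting in the body, even though both were written above the `</head>` tag in the source.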
And that is a very broad thing to say. But then if you start looking at the spec again and start looking at where you are allowed to define metadata, it's just the head. Like I couldn't find any other place where you're allowed to define metadata. — Oh man, I think I looked at it and I think there is one specific case where you can do it in the body, but it's like a really limited edge case. Maybe I'm hallucinating. — No, no, no. You are not. I have to specify. A meta tag with a name

Segment 3 (10:00 - 15:00)

attribute — Ah, okay. — can only appear in the context of other metadata, and other metadata can only appear in the head. So for example, if you take the meta name robots element, right? Because that's a named — Mhm. — meta element. — Mhm. — According to the standard, that can only appear in the head. — Okay. Yeah, that's very possible. And I think charset can also only appear in — Yeah. — the head. Yeah. — Yes. And I haven't looked at link tags because I didn't have a reason to. I mean, recently at least, but I would assume that those can only appear in the head also. — I think that's the case. Yes. — I mean, we can look it up — and if they're appearing in body, they're like discarded or something. — It being standard. Now, we are using our favorite search engine. I'm not going to say what it is because reasons. Okay. The link element. — The link element. Yeah. Where metadata — it is metadata content. And the context in which this element can be used is where metadata content is expected. — Mhm. — Which takes us back to the head. — True. — But then weirdly enough, the standard also says that in a noscript element that is a child of a head element, it can appear. — Sure. That makes perfect sense. — And then finally it says that if the element is allowed in the body, where phrasing content is expected. I don't know what that means. So we click that. So phrasing content is the text of the document as well as elements that mark up that text at the intra-paragraph level. Well, this is not helpful. — Oh, but it's stuff like most of it. — Yeah, — I wonder. Hm. — Okay, I figured it out. I'm a genius. It's the... Also, I'm very humble, if you haven't — noticed. Yeah. — So, it's the same as with the meta name. If a link element has an itemprop attribute or has a rel attribute that contains only keywords that are body-ok, whatever that means, then the element is said to be allowed in the body. — Okay. Yeah, that makes sense. So I'm assuming that you can use it for RDF stuff.
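Editor's sketch of the rule Gary reads out: a link element is allowed in the body if it has an `itemprop` attribute, or a `rel` attribute made up only of body-ok keywords. The keyword set below is written from memory of the HTML standard and may be out of date, and `link_allowed_in_body` is a hypothetical helper name, so treat this as an illustration and check the standard for the current list.

```python
# Link types marked "body-ok" in the HTML standard.
# Assumption: listed from memory; the standard is the source of truth.
BODY_OK_KEYWORDS = {
    "dns-prefetch",
    "modulepreload",
    "pingback",
    "preconnect",
    "prefetch",
    "preload",
    "stylesheet",
}


def link_allowed_in_body(attrs):
    """Return True if a <link> with these attributes may appear in the body.

    `attrs` is a plain dict of attribute name to value.
    """
    if "itemprop" in attrs:
        return True
    rel = attrs.get("rel")
    if not rel:
        return False
    # rel is a space-separated, case-insensitive list of keywords, and
    # every one of them has to be body-ok.
    keywords = rel.lower().split()
    return bool(keywords) and all(k in BODY_OK_KEYWORDS for k in keywords)


print(link_allowed_in_body({"rel": "canonical", "href": "https://example.com/"}))
print(link_allowed_in_body({"rel": "stylesheet", "href": "/late.css"}))
print(link_allowed_in_body({"itemprop": "url", "href": "https://example.com/"}))
```

Under this rule `rel="canonical"` and `rel="alternate"` never qualify, which matches the conclusion in the conversation that links carrying metadata for search engines belong in the head.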
I don't know. But anyway, — anyway, generally you would expect them in the head, I would argue. Links, — right? — But in body, you can also find it, but only if it's used for very specific purposes. So for example, again going back to the '90s, well not '90s, early 2000s or mid-2000s, you remember pingback? — Yes. — Like on blogs you could find those pingback thingies, and pingback is okay. Like if it's a rel pingback, or link rel pingback, that is okay in the body for some reason, I don't know why. — Prefetch, preload is also okay. Stylesheet is okay. Yeah — because it's not technically metadata, it's a thing that will change how the page looks — in case of stylesheet. — With preload you are just instructing the browser to do some magic in the background to load the next thing faster. So yeah, but as you said, in general, you would expect link elements, or at least those that carry some form of metadata, to be in the head. — Yeah. — And I would argue that it's really quite dangerous to have link elements that carry metadata in the body, right? — Yeah, I see that. I see your point and I think I follow that. Yes. — Okay. I'm still like mulling over this body-ok bit. So, if it has an itemprop, then it can potentially live in the body, — right? — Okay. It has to have one of two properties, but I'm not sure which of them needs to be there. Like, it has itemprop or probably href, I guess. — Um I don't know. Um I mean, we can look it up. — Yeah. Anyway, so. Right. But why does the browser for instance close the head when it sees something that shouldn't be there? — Well, exactly because that assumes that the page finished loading the things, right? — Mhm. — That should be in the head. So, for example, if you put a paragraph, like a p element, in the head, that's basically content. Metadata is not shown on the page. — True, true, true, true. Right?

So the metadata is in the head and then whenever we see something that isn't metadata then the browser has to assume that the intention was yeah — that this is shown to the user and that would mean that the body has started and it quote unquote has missed

Segment 4 (15:00 - 20:00)

that the body has started so it starts the body for us automatically. — Okay, got it. — Yeah, I think so. And for search engines that's probably the same, like they try at least to behave more like browsers. — Mhm. — Sometimes that works, sometimes that doesn't work. And they would accept these tags, elements, in the head but not in the body. — Right. — And some of the things that you can specify for search engines actually carry a lot of weight. Let's say rel canonical. — Yeah. That's a pretty strong signal to a search engine. Yeah. — Yeah. But if we allowed that in the body, then mischievous Martin Splitt could show up on my blog and put it in a comment, and because I'm really bad at escaping HTML, Martin could hijack my page, point with a rel canonical to his blog, and suddenly I don't have anything in search anymore. — Ah, okay. But wait, interesting point but counterpoint. If I have the power to inject random markup that gets parsed as markup, I could inject a script, right, — and ask it to add the link rel canonical to the head as well? No? — Yeah. Isn't it much simpler to just have a link tag? Like you would want to go for the simplest. — Sure. But if it doesn't work, then I can still use JavaScript to get around that limitation. — Ah, sure. Oh, see, mischievous Martin Splitt is mischievous. Mischievous. — I see what you mean. — Yeah. — Ah, but we can get around that. — By not rendering. — By rendering. — Oh, how? — Oh, wait. No, — no. I was thinking that if the link was introduced by rendering, — we can tell that because we have the original thing and we have the thing after rendering. — Yeah, I wanted to say something relatively stupid. It's like we don't accept the link canonical if it was injected by rendering. But we cannot do that. We have to — Yeah. — accept the link canonical if — because there's legitimate cases where that, yeah, is done. Yeah.
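Editor's sketch of the "original thing versus the thing after rendering" comparison, framed as a site-audit helper. Everything here is an assumption for illustration: the function names, the status labels, and the idea that `rendered_html` was captured separately (for example, from a headless browser). It says nothing about how any search engine actually handles canonicals.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collects href values of <link rel="canonical"> elements."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel_keywords = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_keywords and attributes.get("href"):
            self.canonicals.append(attributes["href"])


def find_canonicals(html):
    finder = CanonicalFinder()
    finder.feed(html)
    finder.close()
    return finder.canonicals


def compare_canonicals(raw_html, rendered_html):
    """Compare canonicals in the server response and in the rendered DOM."""
    raw = find_canonicals(raw_html)
    rendered = find_canonicals(rendered_html)
    if raw == rendered:
        status = "consistent"
    elif not raw and rendered:
        status = "added_by_rendering"
    else:
        status = "differs_after_rendering"
    return {"raw": raw, "rendered": rendered, "status": status}


RAW = '<head><link rel="canonical" href="https://example.com/a"></head>'
RENDERED = '<head><link rel="canonical" href="https://example.com/b"></head>'
print(compare_canonicals(RAW, RENDERED)["status"])
```

A report of `added_by_rendering` covers the legitimate case mentioned above, where the canonical can only be set with JavaScript. A report of `differs_after_rendering` points at a page that states two different intentions and is worth a closer look.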
I think we're coming to a point where we need to realize, as well as the people listening to us, and this is interesting because we actually are thinking about this as we speak, that there are decisions to be made by whoever is consuming HTML that are going beyond the standard, right? The standard is just like: you should do this if you want to work with an HTML document. But there are additional rules that you can and probably have to put on top of the standard that are not defined by the standard because they are application specific. — Yeah. So this is interesting because I know that browsers are doing a few things that are not exactly described in the standard either. As in, when you run into a script tag, normally, unless you use any modifying attributes on the script tag, the browser kind of stops doing things there, executes the JavaScript, and then carries on. Otherwise, if it's basically just like HTML, head with some metadata, body, some text, some images, then it kind of does like a preliminary scan to see if there's any images, so it can start downloading those in the background, and then kind of starts building the DOM tree and the render tree and making sure that you basically start seeing text as soon as it possibly can. So it doesn't parse the whole HTML and then show you things. It shows you things as it goes through the HTML. And the standard doesn't specify how that works. That's how browsers kind of work, I believe. — Yeah. But there are exceptions. There are specific metadata bits and pieces that do give us as the website owners, and us in terms of us as a search engine company, hints and suggestions as to what to prioritize and how to do things, right? There's like DNS prefetch. Yeah, there's preload, I believe, as well, and then there's like script defer, script async. Are we using any of that? — Sure. — Nice. — Like I don't know what we are using from those things. I don't think we are using much because we don't need to. — Okay.

— Like it's very helpful if you have like a crappy internet to do DNS prefetching for example. In our case, we don't need to because we can talk very fast to all the cascading DNS servers, for example, to resolve whatever. Or preconnect, like why would we preconnect, like we are not — fair — following links, right — true — for example, and even for rendering the fetching of resources is not synchronous
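Editor's sketch of the loading behavior Martin outlines: a plain script pauses parsing, `defer` and `async` change that, and link keywords such as `dns-prefetch` and `preload` act as hints. The helper below lists scripts with a rough loading mode and collects hint links. The classification rules are a simplification of the HTML standard, and the names `script_mode`, `LoadingAudit`, and `audit_loading` are hypothetical.

```python
from html.parser import HTMLParser

# Link keywords treated as resource hints here (assumption: not exhaustive).
HINT_RELS = {"dns-prefetch", "preconnect", "prefetch", "preload", "modulepreload"}
# Type values treated as classic JavaScript (assumption: simplified list).
CLASSIC_TYPES = {"", "text/javascript", "application/javascript"}


def script_mode(attributes):
    """Roughly classify how a <script> interacts with HTML parsing."""
    src = attributes.get("src")
    script_type = (attributes.get("type") or "").strip().lower()
    is_module = script_type == "module"
    if not is_module and script_type not in CLASSIC_TYPES:
        return "not-executed"      # data block such as JSON-LD
    if "async" in attributes and (src or is_module):
        return "async"             # runs whenever it has loaded
    if is_module or (src and "defer" in attributes):
        return "deferred"          # runs after parsing has finished
    return "parser-blocking"       # parsing pauses until it has run


class LoadingAudit(HTMLParser):
    def __init__(self):
        super().__init__()
        self.scripts = []  # (src or "<inline>", mode)
        self.hints = []    # (rel keyword, href)

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "script":
            source = attributes.get("src") or "<inline>"
            self.scripts.append((source, script_mode(attributes)))
        elif tag == "link":
            for keyword in (attributes.get("rel") or "").lower().split():
                if keyword in HINT_RELS:
                    self.hints.append((keyword, attributes.get("href")))


def audit_loading(html):
    audit = LoadingAudit()
    audit.feed(html)
    audit.close()
    return audit.scripts, audit.hints


SAMPLE = """<head>
<link rel="dns-prefetch" href="//cdn.example.com">
<link rel="preload" href="/app.css" as="style">
<script src="/a.js"></script>
<script src="/b.js" defer></script>
<script src="/c.js" async></script>
<script type="module" src="/d.js"></script>
<script>var x = 1;</script>
<script type="application/ld+json">{"@type": "Thing"}</script>
</head>"""

scripts, hints = audit_loading(SAMPLE)
for source, mode in scripts:
    print(source, mode)
print(hints)
```

In the sample, only `/a.js` and the inline script are reported as parser-blocking. As the conversation goes on to say, these attributes and hints mostly matter for how fast a browser can show the page to a user, not for how a crawler fetches it.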

Segment 5 (20:00 - 25:00)

right — true, yeah, because we're doing batch stuff — yeah, and we don't refetch the resources necessary for a page all the time. It's basically we are caching on our side, or well, yeah, the resources, to save some bandwidth and host load and whatnot for the site itself. Same with preload. If we are not synchronous then we don't particularly need to listen and look at preload. — True. — So these are very useful for browsers. I was super super excited about it when this came out in the late 2000s, I think, — because it was so easy to see how much it helps. Like, you just dropped one of these tags or keywords in a link element and it sped up things so much — because you were on an internet that was not necessarily great. You had to connect from your location to servers that were thousands of kilometers or miles away, and all these little things like uh preconnect and I don't know, DNS prefetch and preload or prefetch, they were doing stuff in the background that you didn't have to do anymore. — Yeah. — So yeah, I remember at one point Google introduced this link. I think it was preload for — Yes. — the first search result. — Yes. — Or something like that. Or first two or first three or whatever. Something like that. And when I noticed that, in my brain, again, this was like before I joined Google, in my brain that was nothing short of magic. — Mhm. — Because it loaded the page, the search result page, and I clicked the first result because I'm a sheep and I do what other people are also doing. Click the first search result and it like, that just, it was on my screen immediately and to me that was mind-blowing. So for browsers it can make a huge difference to use these, but for search, eh. — Did you know that one of the couple of reasons that we had this AMP cache was the preload thing? — No. — Because preload has a few problems and that's why it was deactivated.

I'm not sure if it's back, but I think it was deactivated for a while in browsers because with preload the problem is you're effectively like triggering an action that normally a user would, and then you're like — giving cookies and stuff, so people could infer like, ah — they have seen me in search results or somewhere else, and that was problematic — of course — and you could avoid that by having like the AMP cache in between, because then the AMP cache would download things from the server without cookies and without being able to trace it back to a user. That's one of the things where I'm like, "Oh, the AMP cache makes sense." But then the discussion was so heated that people — other issues with it and it had a lot of issues. So I think that's fair. Yeah. So you would say these like link rel prefetch and stuff is not useful from an SEO perspective, but it is very useful for users still. — I mean, it depends how far you want to go with SEO, because there are plenty of studies out there, independent studies even, that show that people do appreciate quite a bit when things load fast. — Of course. Yeah. — And they convert better. I don't remember what the studies say, but I kind of remember that they convert better. — Retention is higher. Yeah. — Yeah. Retention is higher. So if SEO is just about search engine optimization and just the technical part of it, then these link hints don't really, or link keywords don't really matter. If you step beyond the technical SEO and you also start looking at, once the user is on my site or on the site that I manage, how can I retain them? How can I convert them better? Then they can become quite useful. — No. — Yeah. But it's tricky to measure that, right? So that's — sure — that's why not many people are paying attention to it. So I'm happy that we're calling this out.

And I think in general it makes sense as an SEO, especially if you're on the technical side, to understand what valid markup should look like and if a deviation from the specification is okay or if it's a deviation that is potentially problematic. — Right. So would you agree that for example meta tags and uh link tags belong in the head? — I would agree, yes. — When they provide hints for search engines at least. — Yeah, I would say so. Okay. — Especially because you can assume that something that is in the body was probably not put there deliberately, or at least not with good intentions

Segment 6 (25:00 - 30:00)

— right? — Because sometimes we have this problem with like mixed signals, especially when JavaScript is involved. Like if you have a canonical that is there the first time we fetch the HTML from the server and then the JavaScript changes it. We actually advise against doing that, changing something with JavaScript, because then it's like, what is the intention here? Was the other one kind of like accidental? — Yeah. — Was the other one the right one and now accidentally they changed? — Yeah. — Uh I understand that there are situations where for whatever technical reason you can't have them in the initial HTML and then add them with JavaScript. Fine. But like these mixed signals make it difficult and tricky to understand the intention. So giving as clear an intention as possible I think is generally the course of action. And I believe that the metadata then should also sit in the head to be very explicit, like this is our intention. — Yeah. — Okay. Huh. Cool. I think that made sense, which is surprising. So, okay. We talked about parsing. We talked about hints in the metadata. We talked about metadata in general. I think that caught us up on the topic. We finally discussed this in the podcast. I mean, you still have the body, but I think the body itself is kind of boring. — Yeah, that's just the content, — right? But there's no, like, I don't see how there are gotchas there. Like there's stylistic choices that you can make. And I'm talking about the source, not the like what you see. — Yeah. — Like for example, internally I'm really fussy about breaking lines close to 80 characters because then it's easier to review stuff. — Yeah, on your Commodore 64. — Sure. Have you seen my setup? — Yeah, it's a nice setup. I like the vintage. — Anyway, but like for the majority of the programming languages that we use at Google, one of the big ones is C++. And I wrote a lot of C++ at Google, or C and C++.
And for that you have to break the line at 80. — Mhm. — So everything needs to fit in 80. — Says the style guide. Yeah. — That's the style guide. Yes. And most of our review apps or software that we have, they will tailor for that for those 80 characters. So the review platform that we are using is going to do really weird things when something runs more than 80 characters and then if you are reviewing big documents all those little weird line breaks is going to be really weird to review. So yeah, I'm breaking at 80 characters as much as possible, even HTML, but other than that, I cannot think [snorts] of other things that you can — I actually have a question for the body. — All right. — What's your stance on semantic markup? So are you — what's that? expecting a difference between me just having like a paragraph element and then some text with links and images and another paragraph and me kind of like using headlines randomly or there's like an HTML 5 algorithm or structure like the standard says like oh you should do this with like one H1 and then you can use article and section elements on a page to kind of give more semantic meaning and header and footer and nav and all this kind of stuff. Does that make a difference from a search engine perspective? — I don't think so. Unless you do something really weird. — Okay. I think it helps as in like for the users and for the browsers, but I don't think it helps a search engine that much as well. Yeah. — Well, you asked me about search engines. — Yeah. So, search engines, you think it's like a small difference in practice? — I think so because like you can say that something is valid. That's a binary thing. It's very hard to say that something is close to valid. 
— Mhm. — And then like what do you do there when something is just close to valid, for example, and this doesn't exist so don't try to come up with conspiracies, but you cannot give a ranking boost to valid HTML for example — true — because for example if I miss a closing span then suddenly my HTML is not valid. It will not change anything for the user. — True. Hm, interesting, but that's good to know and that's something that I think comes up every now and then. It's like, oh, we should use only one H1 element and then H2 for all the different sections versus just use H1s for all my sections. Like, I think that's generally fine, especially because visually you can still do something with it if you don't care too much about — Yeah. — the structure semantically. Okay, cool. That was an interesting conversation. And thank you so much, Gary, for um

Segment 7 (30:00 - 32:00)

talking about parsing with me. That was wild. — Yeah, — I liked it. That was good. So, we can take away a few things that I didn't know or wasn't sure about beforehand. — Okay. — Like metadata in the body, for instance, not necessarily a great idea. — Yeah. — As we discussed, HTML validity, not as important as we developers like to think sometimes. — What else? — Semantic markup, not that important. Useful for accessibility and users but not that important for search engines at least, — right? — And I think like performance and performance improvements for users do have sort of secondary effects on SEO but not necessarily primary effects, because the way that we as a search engine are using the documents is different from how browsers for users are using them. So — indeed. — Yeah, I think that was really interesting and I think those are a few really good takeaways. We can ask the audience, like feel free to comment on this podcast and reach out on social media to me because Gary doesn't like to be talked to, I hear. — Yep. — Yeah. Would be interesting to hear if you would like more of this kind of stuff or if this is too nerdy. — Yeah. And I think one of the problems is that Martin and I probably can talk about this for seven more hours because it is a wild topic and — um it is quirky to say the least — and there's lots of facets that we can explore. So, if you have questions, just yell at Martin or John Mueller and please — leave me out of the yelling. — Thank you. — Leave us comments below this episode on the podcast platform that you're most happy with, and we look forward to hearing whether this is something that you all are interested in or if this is a nerdy echo chamber. Anyway, thank you all so much for listening and thanks a lot to Gary for being here with me today. Thank you. — Yeah. — I wish you all a fantastic day. Take care and talk to you next time. Goodbye. We've been having fun with these podcast episodes.

I hope you, the listener, have found them both entertaining and insightful, too. Feel free to drop us a note on LinkedIn or chat with us at one of the next events we go to. If you have any thoughts, let us know. And of course, do not forget to like and subscribe. Thank you so much for listening and goodbye.
