description. We'll call this Google Maps Scraper No API. I'm just going to add a tag for the n8n course here to keep things very simple. Okay. So, the first step is I'm just going to add a manual trigger. The reason I'm doing this is because I'm not going to connect this to my Google Sheet, at least for the purposes of this demo. I'm just going to keep things super simple and super easy, and we can talk about adding a sheet input later. The next thing we're going to need is an HTTP Request node. Okay, now this is where we're going to put in the URL of our Google Maps search. Google Maps is scraped using a very specific URL: it's www.google.com/maps/search/ and then we put in the search query. You can't have spaces in the query, which is why we join the words with a plus instead. There are two additional options I'm also going to add. The first is Ignore SSL Issues and the second is Response, where we're going to include the response headers and status too. Now, when I click Test Step, what's going to happen is it's actually going to perform that HTTP request against the Google Maps backend, and then we're going to receive a giant blob of HTML. Hidden within this HTML is a bunch of links that we can then take, make more HTTP requests to, and extract email addresses from directly. The question is, how do we actually get those links? Well, what I'm going to do here is share a little snippet of code that I've used for this purpose. It's actually very straightforward. I should say that you don't need to use code for this, but I thought it was simple enough that I just asked ChatGPT to whip me up a little snippet in ten seconds, and it did a pretty good job. So, I'm going to add a Code node, which lets us run some custom JavaScript or Python. I'm then just going to clear out the boilerplate, and let me run you through what the code looks like. Okay, so what we're going to do is grab some of that information on the left-hand side here and package it all inside an input variable. In JavaScript, we do that by writing const. Now, the purpose of what I'm about to show you is not that I expect you to learn JavaScript or figure it all out just by watching me write this. It's to show you how easy and simple it is to grab data that's sitting in a no-code format and use a couple of lines of code to quickly convert it, without requiring a ton of execution time. The cool thing is you can now ask AI to do large portions of this for you. I just know that this particular snippet of code works, which is why I'm reusing it. Essentially, we're going to store all of the data in an input variable. So I'm going to type a dollar sign, which lets me select the specific item that I want to pull, and then input, giving us $input. Now, n8n has this convention where it returns all data from previous nodes as an array of items, so we have to select the first element of that array. Even if we're only getting one item, which in our case we are, it's technically an array of items and we have to select the first one. Kind of annoying, I know. Talk about annoying, there's another convention here which I don't really talk about much, which is the json property.
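As a concrete example, here's roughly how that search URL gets built, using the "calgary dentists" query from later in this demo. This is just an illustrative sketch of the URL format described above, not code that belongs in the workflow itself:

```javascript
// Illustrative sketch: constructing the Google Maps search URL for the HTTP Request node.
// "calgary dentists" is the example query used later in this demo.
const query = "calgary dentists";
const url = "https://www.google.com/maps/search/" + query.trim().replace(/\s+/g, "+");
console.log(url); // https://www.google.com/maps/search/calgary+dentists
```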
In order to access this data, we first have to go through that json property, and then finally, at the very end, we can actually select the data we want. Okay, so for all intents and purposes the scraped page now lives inside this input variable. All right, moving on. What we have to do next is build out the pattern that we're going to use to extract all of the URLs. So what I did is I asked ChatGPT a moment ago to build me a regex, which is a regular expression, a pattern-matching syntax used for exactly this kind of extraction. I know this because I use regex all the time to extract, parse, and do various things like this, but if you just had a brief five-minute conversation with ChatGPT and asked it how you'd do this, it would probably suggest a regex as well. So don't think this is some super convoluted, scary programming stuff. What I'm going to do is copy that pattern, write const regex, and paste it in along with a couple of additional characters: a slash at the beginning and a slash g at the end. The g just stands for global, and again, it's one of those little formatting things. From here, we have the input data, all of the scraped stuff, in code, and we have the pattern that we're going to apply to it. Now we actually have to do the applying. The way I do this is I write const, since const is just the convention in JavaScript, and then URLs, or why don't we just use websites, that's probably a little easier for people to understand. We're going to take input and match it against this regex. Then, finally, what we get as a result is a giant list of websites. So what I want to do next is return these websites in the format that n8n is expecting. I'm going to return websites.map, and for each website I'm going to return that website. Map uses this specific format with an equals sign and a greater-than symbol, which is basically an arrow. Then we return things nested within one layer: I'm going to go json, and then I'll return my website inside that. Okay, so if I didn't screw something up, when I click Test Step we should get a giant list of websites under this website field, which we do. Now, what you'll see is we got a giant list of websites, but these aren't really websites related to our search. We have schema, Google, gg, whatever. It isn't until I go way farther forward that we actually start getting the dental care websites I was looking for. Okay, this is a problem. Obviously, we don't actually care about Google and gstatic and so on, so what we have to do is remove them. And that's what this next step is going to be: filtering out all of these really annoying domains that don't add anything, and giving us a nice tight list of only the dental websites that are left. Because remember, what we're doing is basically going to Google Maps, pumping in Calgary dentists, and scraping the entire page, right? So we need to take this data and filter out all of the additional links it gives us. But never fear, that's actually very easy to do in n8n. What I'm going to do is press P, which pins my output data. Then over here, I'm going to add a Filter node. Filter allows us to remove items matching a condition.
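Putting those pieces together, here's a minimal sketch of what that URL-extraction Code node looks like. The regex is my own approximation of the ChatGPT-generated pattern rather than the author's verbatim one, and the property holding the page HTML (data below) is an assumption that depends on your HTTP Request node's response settings (it may be body instead):

```javascript
// Minimal sketch of the URL-extraction Code node, assuming the scraped HTML lives in
// json.data of the first incoming item. The regex is an approximation, not the exact
// pattern from the video.
const input = $input.first().json.data;

// Roughly match http/https URLs anywhere in the scraped HTML (g flag = find every match)
const regex = /https?:\/\/[^\s"'<>\\]+/g;

const websites = input.match(regex) || [];

// Return one n8n item per URL, nested under "json" as n8n expects
return websites.map(website => ({ json: { website } }));
```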
So, I'm just going to drag website in over here. Now, what I want to do, if you think about it, is remove all of those bogus ones: schema, Google, a bunch of stuff. In the Filter node, to do that, just go to String and then Does Not Contain. So, first of all, I don't want anything to contain schema. Next up, I don't want it to contain anything with the term google, right? I saw a couple of other terms there that I'm just going to pump in really quickly, and we'll go back and forth until we get all of this cleaned up. I think it was gg, right? I don't want it to contain that either. Let's test this and see how it works. So, we fed in 302 items, and as you can see, it's only returning 133, so we're actually getting pretty far. And it looks like we're getting dental domains on the first page now, which is nice. We still have gstatic, so we've got to get rid of those. What else? gstatic, search, opencare. Opencare might actually be good, I'm not entirely sure. What else have we got? Mostly gstatic, but then we also have can5, recallmax, okay, I don't know what that is, chatnow. So gstatic is really the main one. Why don't I add another condition: I also don't want it to return anything that contains the term gstatic. Okay, let's try this. Now, as you can see, we're just progressively filtering, but this looks pretty good. What you'll notice, though, is that we're getting a ton of duplicates, aren't we? Richmond Dental, Concept Dentist, Pathways, Heritage, Heritage again. That's not good; we need to remove these, and that's what I'm going to do next. Okay, so first of all, I'm going to press P again to pin the data, and now I'm going to remove duplicates. How do you do that? Dedupe, or actually it's the Remove Duplicates node here, and I'm going to choose Remove Items Repeated Within Current Input. This is the easiest and simplest way to immediately remove duplicates from the preceding node. If I press Test Step, you'll see that we fed in 60 and now only have 27 left. Okay, very easy, awesome. So, now that I've removed the duplicates, think about it: what have we done up until now? We've scraped a bunch of data over here, then extracted the URLs with code, then filtered those URLs, and now we're removing all the duplicates among those URLs. Well, the next thing we have to do is obviously start scraping the individual pages to look for emails, right? So, I could theoretically just add another HTTP request here, and I would feed in the URL of the website that I want. What that would do is immediately process all 27 items in the list. But I'll be honest, I've tried this before, and if you try to have n8n process all 27 websites at once, your IP address usually gets blocked. So what we have to do is practice some basic scraping hygiene here and be a little bit smart about how we perform all of these scrapes. The way that I like to do it is with what's called a Loop Over Items, or Split In Batches, node. So just type loop.
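As an aside, if you'd rather keep the blocklist in one place instead of stacking conditions in the Filter node, the filtering and even the dedupe could be collapsed into a single Code node. This is just a hypothetical alternative to the no-code nodes used on screen, with the blocklist terms taken from the demo above:

```javascript
// Hypothetical alternative to the Filter + Remove Duplicates nodes: drop the junk
// domains mentioned above and remove repeated URLs in one pass.
const blocklist = ["schema", "google", "gg", "gstatic"]; // terms from the demo

const seen = new Set();
const results = [];

for (const item of $input.all()) {
  const website = item.json.website;
  const isJunk = blocklist.some(term => website.toLowerCase().includes(term));
  if (!isJunk && !seen.has(website)) {
    seen.add(website);
    results.push({ json: { website } });
  }
}

return results;
```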
The loop node looks pretty intimidating if you've never used it before, because there are all these arrows going in and out of different nodes, and there's this Replace Me node, which means nothing to most people. But let me run you through it really simply. What it does is take the output of the previous node as input and then run the loop for all 27 items, so it'll do everything we put inside it over and over and over again, 27 times, and on the 28th run it'll say, hey, there's nothing left back here, okay, I guess we're done, and it'll proceed down the done path. That's all that's really going on here. You have to feed the output of the loop body back into its input in order for this to make sense. So, what I'm going to do now is add the HTTP Request node right over here. And then, just before I map all this, I'm also going to add a little Wait node with a wait of one second, because I've had a couple of issues in the past where I don't have any waits between my HTTP requests, and as a result, you know, I can demo it or whatever, but I don't just want this to be demoable, I actually want you to be able to use it. In practice, if you don't have any waits, you can get IP blocked pretty easily when you're doing any sort of scraping, so I usually recommend, at least for testing purposes, putting some waits in. Okay. Then the output of this Wait node is going to be the input to the Loop Over Items node, and now we've essentially closed the loop, and we have this done branch which we can fire off afterwards. All right. Just to make things minimally ugly, I'm going to move this stuff down here. Now what I need to do is get the input into the Loop Over Items node. This is kind of annoying to do; I'm just going to click Test Step and see. But we can't actually do this because of how I've connected things, so why don't we just delete this connection and retest. Okay, so now we have the loop branch, which contains the website, so we can feed this into the HTTP request and add this loop branch in. Now that we have access to that, we can just drag the website field over. A couple of additional things I'm going to set under the Redirects tab: I'm going to go Do Not Follow Redirects, because some websites will redirect you multiple times, and when you hit a redirect loop, it just makes the node error out, which is kind of annoying. And then I'll have that wait, and it's just going to go through all 27 items. Okay, pretty straightforward. Now, I'll be real, I don't want to take 30 seconds to run this every time for the demo. So, what I'm going to do is cut the input way down. See how it says 27 items? There's a quick little hack that allows us to do this during testing in n8n: just add a Limit node and set the limit to something really small, like three. Because of this, the workflow will take 27 items as input but only spit out three, which means that when we run this through our loop, it's going to take about 3 seconds, not 27. This is just going to help me do this video a little bit faster and retain my sanity, while also minimizing the likelihood that we get IP blocked because we're running a lot of requests. Okay, work smart, not hard. Now I'm going to execute this workflow. As you can see: first item done, second item done, third item done.
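Conceptually, the Loop Over Items plus Wait plus HTTP Request combination is doing something like the plain JavaScript below: one request at a time with a one-second pause between them. This is purely an illustrative sketch of the rate-limiting idea (the URLs are made-up placeholders), not code you'd paste into n8n:

```javascript
// Illustrative only: what the loop + 1-second Wait node is doing under the hood.
// The URLs here are placeholders, not real results from the demo.
const websites = [
  "https://example-dentist-one.com",
  "https://example-dentist-two.com",
  "https://example-dentist-three.com",
];

async function scrapeSequentially() {
  for (const website of websites) {
    const response = await fetch(website);               // one request at a time
    const html = await response.text();                  // raw HTML, parsed in a later step
    console.log(website, "returned", html.length, "characters");
    await new Promise(resolve => setTimeout(resolve, 1000)); // 1-second pause between requests
  }
}

scrapeSequentially();
```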
And then once we're done, it actually returns three items. How cool is that? So, what is it returning right now? Well, it's returning all of the HTML from the websites that we just scraped, so a bunch more code. But this isn't really what we're looking for, is it? No, it's not. I'll tell you what we're looking for: the email addresses. Okay, so how do we find the email addresses here? Well, that's where another Code node is going to come in handy. In this code block, instead of finding URLs, we're going to parse out emails. So, I'm just going to stick this in over here. Then the output of this code block is going to loop back and be the input to the loop, and once that's done, we can get into some final data processing and we should be good to go. Okay, so what are we going to do in this code block? Well, I just pasted in a bunch of the same stuff from before, and if you think about it, we're basically going to do the exact same thing that we did for the URLs, just for emails instead. So, we're going to run a bunch of code that takes the data we're feeding in from that Wait node, which we already have inside input, and then look specifically for emails. So, I have another regex here; it's just that instead of matching URLs, I'm going to ask it to find me all the emails. Let me go back to ChatGPT and say, okay, great, now build me a simple regex that finds all emails in a website scrape instead. It's going to write me something very similar. I don't actually know if this is entirely good to go; I'm going to try it and we'll see what happens, but I usually just run it and play it by ear. Add the slash at the start and the slash g at the end again. Then where it said const websites, I'll go const emails. Instead of returning websites.map, I'll return emails.map; instead of website, I'll go email; and under json, I'll go email. Okay, I don't know if this is going to entirely work; we'll give it a go. Okay, we couldn't find any emails in the first three, so I just pumped the limit up to 10, and it looks like we are now getting a couple of email addresses, which is pretty cool. Yeah, that's to be expected; we're not going to get emails for everything. In the earlier demo, I think we pumped in like 300 or something and got like 100. So, that's that. Now that we have a bunch of email addresses, we're going to proceed down the done branch. What do we have to do on this done branch? Well, if you think about it, we're outputting a bunch of emails, and everything is nested within this emails array, so we're going to have a bunch of email arrays. To make a long story short, we have to split all of these out so that each email is its own object instead of sitting inside separate arrays, and then we're going to take that data and add it to our Google Sheet. So, what does this actually look like in practice? We have to get the data out of this loop in order to access it, so I'm just going to add a Wait of one second and push it all the way through. So, it's scraping, scraping, okay, and once it finishes, we now have access to those 10 items. Let's take a look at what this data looks like.
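For reference, here's a minimal sketch of what an email-extraction Code node along these lines could look like. The regex is my own approximation rather than the exact ChatGPT output, the data property is an assumption about where the HTTP Request node put the page HTML, and I've bundled the matches into a single emails array per page, since that's the shape the later Filter and Split Out steps work with:

```javascript
// Minimal sketch, not the exact node from the video. Assumptions: the page HTML lives
// in json.data, and downstream nodes expect one item per page with an "emails" array.
const input = $input.first().json.data;

// Roughly match email addresses anywhere in the scraped HTML (g flag = find every match)
const regex = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

// match() returns null when nothing is found, which is exactly what the later
// Filter step removes
const emails = input.match(regex);

return [{ json: { emails } }];
```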
So, for some of these, we're not going to have an email, because some of these will have come back null. Okay. And if you ever find yourself getting an error on an HTTP request, what you can do is go to Settings and set On Error to Continue. In reality, n8n can't scrape every web page, so we're just throwing away the ones it can't scrape, for simplicity. But for the ones it can, we're going to have email addresses, as you can see: Macleod Trail Dental, Galaxy Dental, Scenic Smile, Satin, right. And now that we have all of these, what we need to do essentially is aggregate all of the individual emails and remove all of the null entries. So, I'm going to go down here and add a Filter first; we're just going to use the filter to remove all the null entries. Let me pin this so we don't have to run that again. What I'm going to do is feed in emails, and emails is an array, right? Let me just check under Schema or JSON: is it an array? Yeah, emails is an array. So, I'm going to go down to Array and just check that emails exists. Okay, so this should filter out all of those null entries. Cool. And now we have three items out of, what did we feed in, 10. And with these three items, as you can see, it has scraped multiple emails and aggregated them into a single array per website, so it scraped three or four instances of the info@ address for Galaxy Dental, for example, and a couple more for the others. What we want to do is stick all of this into one giant list and then run through and deduplicate it. So, what I'm going to add next is, how would I do this? I'd use Split Out, I think. Yeah, pretty sure. I'm just going to go emails here, and this should basically concatenate all of these together into one list. Cool. Once I have this, I'm going to dedupe it, and this will now filter all of the many down to four. Beautiful. Now that we have four, what can we do? Well, now we can, I don't know, add them to a Google Sheet or something. So, let me go down to Append Row in Sheet; that's what we're looking for. I'm just going to use my own credential, this one right over here. Then we'll pick the document from the list; I think it's Scrape Without Paying For APIs, right here. Right. Then the sheet was Emails, and we should just dump the email directly in here. Okay. And I'm going to use the Minimize API Calls option, because I've obviously had some issues with this in the past, where I've done so many demos that it's dumped a bunch of stuff into a Google Sheet, I've run into API rate limits, and then I can't record my video for half an hour. So, I'm not going to allow that to happen to me today. Why don't we go back to this email list, delete all of them, then go over here. Why don't I just pin my outputs, and finally I'll just run this and see how it works. Oh, perfect, it just dumped all four. Very good, and, yeah, in a nutshell, that's more or less how to do it. Okay, so now a couple of gotchas that I think are pretty common for people, as well as a couple of ways to extend the system. The first way you could extend the system: right now, all we're really doing is scraping the homepage of each of these websites. Realistically, the email addresses aren't just buried on the homepage; they're buried across all the pages. So, you know, over here where we extract the URLs, well, what you can do is first extract the URL.
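If you prefer, the Split Out plus Remove Duplicates pair could also be collapsed into one Code node. Here's a rough sketch of that alternative (not what's built on screen), assuming each incoming item still carries the emails array produced by the previous step:

```javascript
// Hypothetical alternative to the Split Out + Remove Duplicates nodes: flatten every
// item's "emails" array into one list, drop duplicates, and emit one item per email.
const unique = new Set();

for (const item of $input.all()) {
  for (const email of item.json.emails || []) {
    unique.add(email.toLowerCase()); // normalize case so Foo@x.com and foo@x.com dedupe
  }
}

return [...unique].map(email => ({ json: { email } }));
```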
Then you make an HTTP request to that URL, and instead of extracting emails at that point, you extract the other URLs on that page that you can access. Then you run a third loop, and that third loop goes through each of the URLs you just extracted from the homepage and does the exact same thing we're doing here, i.e. extracting emails. So now, if you think about it, we're significantly increasing the total number of pages we're scraping from. I just didn't do that for simplicity's sake; I wanted to give you a little nugget that you could build out yourselves. This isn't the first time people have built a system like this, it's not like it's revolutionary, and this is definitely one of the more basic scraper systems I could have put together, but I just wanted everybody here to have a good place to start for more advanced scraping applications. Now, after you're done with this, something to look into: if you run this at any sort of scale, eventually this initial Google Maps HTTP request will run into Google's rate limits. They'll figure out that you're scraping them, which you are, their detection systems will flag it, and they'll start throttling you so you can only do maybe one request every hour or something. If you want to get around this, there are a couple of options. The most common is to use a proxy. Proxies are basically third-party services that you pass the request through before it goes to the end URL, which in our case is Google; they sanitize the request and make it appear legitimate by sprinkling in a bunch of additional data. There are a variety of proxy services you can use for this purpose. I'm not going to recommend any specific one, but the way the HTTP Request node works is that if you want to add a proxy, you just go down to the bottom, click Proxy, and paste it in. The exact value is going to depend on the proxy service you're using; they all have slightly different formats, and they'll give you a username and a password and so on, but that's how you do it. And if you are going to search for something like that, then obviously search for a search engine results proxy, a SERP proxy; that's the thing you should start by googling. Which SERP proxy you pick depends on whether you want residential or datacenter IPs, and there's a little more nuance there. If you want to learn more about how all of that works, check out the Apify course on proxies. It's probably the best-written one on the internet today. I am affiliated with Apify, but you pay nothing to access that free resource.
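For what it's worth, most HTTP proxies accept a connection string in roughly the shape below, which is the kind of value you'd paste into the HTTP Request node's Proxy field. The host, port, and credentials here are placeholders; your provider's docs will give you the exact format:

```javascript
// Placeholder values only: substitute the host, port, username, and password your
// proxy provider gives you. The general shape is protocol://user:pass@host:port.
const proxyUrl = "http://your-username:your-password@proxy.example.com:8080";
```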