# Scraping Amazon Product Data using Python and Power BI

## Metadata

- **Channel:** Parker Stevens - BI Elite
- **YouTube:** https://www.youtube.com/watch?v=xUs6F7qBhbo
- **Source:** https://ekstraktznaniy.ru/video/44766

## Transcript

### Segment 1 (00:00 - 05:00)

What's up, everybody, it's Parker with BI Elite. It's been a minute since I posted a video here on YouTube, but we're back in 2022, and I want to bring you a lot of Power BI content. I've been thinking that this year I'd like to spend more time talking about Python programming and how you can incorporate Python into your data visualization in Power BI, so if that's something you'd like to see, definitely let me know down in the comments. It's always interesting to hear where you all would like this channel to go in the next year.

For this first video, I want to focus on how you can use Python to scrape data from a web page. This offers some benefits over how Power BI might do it out of the box: with Python you have a lot of flexibility in what you scrape via the HTML elements on the page. In this example specifically, we're going to scrape price data off of an Amazon product page. You'll be able to connect to a web page, scrape the prices off of it, monitor the changes over time, and maybe set up alerts if the price dips below some threshold you set. There are a lot of cool things you can do with Python and web scraping, so let's go ahead and dive into this quick tutorial on how to set it up.

First, we're looking at the finished product; there's not much here. The whole idea is that you supply a few URLs, and for each URL we connect and grab the title and the price. Right now I'm just looking at some random items; I think all four products I've thrown in here are yoga-mat related, with varying prices, a couple more expensive and a couple cheaper. I'm going to take you over to the Python script and walk through it line by line. We're not going to type it out together, since that might take a little long, but we'll walk through it line by line because it actually isn't too much,
and I'll show you how each individual line affects this script. Afterwards, we'll move into a blank Power BI file, and I'll show you how you can set this up in a really cool way.

So here we are in the Python script. This is the entire script, everything you need. Before we actually dive in, make sure you have Python set up and installed on your machine; I have another video on how to set up Python and hook it up to Power BI Desktop, and I'll link that video down in the description. Once Python is set up in your environment, you have to import a few libraries. The first library is requests, which is what allows us to reach out to a web page and get some information back (if it isn't already installed, `pip install requests` will add it). The next one is Beautiful Soup, and I'll show you how to install it in just a second; that's how you parse the HTML that comes back to you. And then pandas, which is basically a way to create what are called data frames within Python. That one is for the Power BI piece, because Power BI needs to receive the output of a Python script in a pandas DataFrame; we'll talk about that more in a little bit.

With that, let's go ahead and install our dependencies. Over in Command Prompt, since I'm on Windows, I've navigated to where my script lives. You'll just need to create your own script, or if you'd rather download this script, I'll include it in the description below, along with the final Power BI file that has the script already in it, in case you want to go one way or the other. The easiest way to install any library is with pip: type `pip` and you'll see some information come up if you have it installed (this is also covered in my other video on installing Python). We need to install Beautiful Soup, so type `pip install beautifulsoup4`; you'll see I already have it. You also need pandas, so `pip install pandas`, and once that requirement is satisfied, we have everything we need.

Now let's fire up the Python shell. You can see I'm currently running 2.7.17, which is a very old version of Python; you can run this with Python 3, any version you like, whatever you have set up on your machine. I'll paste in everything I've imported and hit Enter, and if I don't get any errors, I've successfully imported these libraries, meaning they live on my machine and I can run whatever else I want.

Once we've gotten through step one, we can move on to the actual code of this Python script. A general overview of what we're doing: in this first line, we set the headers of our web request. This is important because it makes the request look like it's coming from an actual user rather than a bot. If you try to scrape data off of Amazon without setting these headers with this user agent, it's going to know that you're a bot
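As a sketch, the setup described in this segment looks like the following. The User-Agent string below is just an example of a browser-like value, not the exact one used in the video:

```python
import requests                  # HTTP requests (pip install requests)
from bs4 import BeautifulSoup    # HTML parsing (pip install beautifulsoup4)
import pandas as pd              # DataFrames for Power BI (pip install pandas)

# A browser-like User-Agent so the request doesn't look like a bot.
# This particular string is illustrative, not the one from the video.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/96.0.4664.110 Safari/537.36"
    )
}
```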

### Segment 2 (05:00 - 10:00)

and it's going to say, hey, we're not actually able to give you this information. So you have to set these headers. Once you have the headers, you can set a URL; I have this long URL, which is just this product page. Then you get into the actual web scraping piece. I'm calling `requests.get`, which sends a GET request to that URL, passing in the headers, and that gives me a response. To get the HTML of that response, I use `response.text`, and then we parse that with Beautiful Soup, which puts it into a format Beautiful Soup can handle. From there, we find specific pieces of the page that give us the information we want, like the title and the prices. The next few lines parse through what we have in Beautiful Soup, find the pieces of information we want, and put them in a specific format. The final step is to turn it into a pandas DataFrame, and we'll talk about that a little more at the end. Those are all the steps; there's really not too much that goes into it.

With that, I'm going to copy this line by line and paste it into the Python shell to show you what's actually happening in the background. I paste in my headers, and if I type `headers`, you can see the value I just pasted in. Pretty easy. Same with the URL: copying that and pasting it in, if I type `url` you can see it's just my long URL, and it's a string. Now the step where we actually do something: `response = requests.get(url, headers=headers)`. That takes a second because it's a real web request. If I now type `response`, it tells me `<Response [200]>`. 200 is a success code, meaning we successfully connected to that website and got information back. If you weren't getting a successful response, you'd probably see a 400-level code, like 404 (Not Found) or 403 (Forbidden); as long as you have a 200, you should be good.

So now we have this response object. To see what it looks like, you type `response.text`. I'll take that line and paste it in, so now my `html` variable is this really long output of all the HTML (I probably shouldn't have printed that). Similarly, the next step turns it into the Beautiful Soup format. If I look at `soup`, it's still just a lot of HTML code, but now we can actually start looking for the items we want to find in it.

For example, the title. I've already done the hard work of finding what we're searching for, but you actually do that from the browser itself. Say I want to find this title right here: if you click on it, right-click, and choose Inspect while you're in Chrome, it opens up the HTML behind the web page and shows you exactly what you're looking at. This is a pretty easy thing to pick up on because it has an id: this `span` HTML element has an id of `productTitle`. So we can specifically look in our HTML for the span whose id equals `productTitle`, and that's exactly what we're doing here. The code for it is `soup.find`, looking for the `span` element with an id of `productTitle`. If I copy and paste that in, you see the output here: it's everything starting with "jim cope" all the way to "fitness", and I've added a little bit extra (`.strip()`) to strip off the extra whitespace that comes through with that value. Pasting that line in, `title` is just the title we want; it's a very long title. That's the basic call to Beautiful Soup you'll use to extract elements from an HTML page, and this one was easy because it had an id we could target exactly: the span whose id equals `productTitle` returns back the information associated with that HTML element.

The price is a little more difficult to find. If I right-click the number and Inspect, you see it highlighted: a `span` with class `a-price-whole`. Opening it up, you'll see the number 59, then a dot, then a price fraction of 99, which together give us the entire price, 59.99. To pick up on the 59, we come back to our code; it's the same `soup.find` syntax. Let's go back and make sure it's a `span`: yes, that's the `span` HTML element.
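The find-by-id call just described can be sketched against a stand-in snippet of HTML. Only the `productTitle` id is taken from the real page; the markup and title text below are illustrative:

```python
from bs4 import BeautifulSoup

# Stand-in for the product page HTML; the real page is far longer,
# but the id "productTitle" is the hook the video targets.
html = """
<html><body>
  <span id="productTitle">
     Example Yoga Mat, Extra Thick, Non-Slip
  </span>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first (here, only) element with that id;
# strip() removes the surrounding whitespace the page includes.
title = soup.find("span", id="productTitle").get_text().strip()
print(title)  # Example Yoga Mat, Extra Thick, Non-Slip
```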

### Segment 3 (10:00 - 15:00)

You might also see this as a `div`; they're very similar in how they work with Beautiful Soup, but this one is held inside of a `span`. Our class is called `a-price-whole`, so we need to pass in `class_="a-price-whole"`. Pasting that in, we see `59.`, and that dot is helpful because it already gives us our decimal. In case this doesn't work for the currency you're working with, you might have to do something a little different, but for US dollars it works well.

The difference between passing in an id and a class: for an id, you can expect there will only be one element on the page with that id. With a class, you can have multiple instances; you might expect the same class to be used multiple times on a page, in which case you might have to do something a little different. `soup.find` just returns the first element it finds on the page. If you want to return all of the items that have that class, you can use `soup.find_all`, which gives you a list of items you can pick and choose from, but since I only seem to have one price-whole and one price-fraction, I'm just using `soup.find`. The same thing goes for the fraction price just below: `a-price-fraction` is that 99, so I do the exact same thing with `class_="a-price-fraction"`. One quick note: this is `class_` with an underscore, because `class` is a reserved keyword in Python.

Once I have everything I want from the page itself, I copy and paste it all in to make sure all of those variables are initialized. So now I have these individual pieces, my whole price and my fraction price, and I want my final price, which is just the whole price plus the fraction price. Don't be confused by that plus: if we look at `whole_price`, it's actually a string, and `fraction_price` is a string as well, `59.` and `99`. So `whole_price + fraction_price` isn't doing any math; it's concatenating those strings, and we end up with `59.99`, which is still a string. I want that as a numeric value, so I wrap what we just had, `whole_price + fraction_price`, in `float(...)`, and that gives us the numeric format.

That's basically everything: we now have the title as we want it, and we have the final price. Now, Power BI expects the script to return this data in a pandas DataFrame, and that's where the pandas import comes in. We create a new variable called `df` and set it equal to `pd.DataFrame`, passing in the two variables we care about, title and final price, plus a little extra that says `columns=["Title", "Price"]`. This creates a two-column DataFrame with one row, just a title and a final price, with the column names Title and Price. If I copy this and paste it in... I don't have `final_price`; yes, I never actually initialized it. Let me copy in that line of code, paste it, and run the previous line again. Now we can see what `df` looks like: a DataFrame with two columns, the title and the price. Very easy.

So now we have a script that takes one URL, connects to that web page, and gives us the information back. Next we'll move over to Power BI and talk about how to run this script for each element in a table: for example, if I have five Amazon URLs, I can run the script five times and get back the information we want. I'm actually going to set this up as a function within Power BI to make it extremely flexible, so you can call it whenever you need.

Let's go ahead and head over to a blank Power BI file. I'm going to grab a couple of Amazon URLs; maybe you're watching this and already have a list of URLs you want to connect to, which would be great, but I'm just going to take a few random ones. I'll paste in the one we've been looking at as a testing item, call this first column `url`, and call the table `Data`. Let's grab just a few more: this really expensive one as a second row, and one more from further down the page. So I have three URLs. Now let me click Edit, which is going to open up Power Query, where we will then create our function that calls that Python script.
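The price assembly and DataFrame steps from this segment can be sketched as follows, with placeholder values standing in for the strings scraped from the page:

```python
import pandas as pd

# The pieces BeautifulSoup hands back are strings, so "+" concatenates
# rather than adds; float() then converts the joined text to a number.
title = "Example Yoga Mat, Extra Thick, Non-Slip"  # placeholder title
whole_price = "59."     # text of the span with class "a-price-whole"
fraction_price = "99"   # text of the span with class "a-price-fraction"

final_price = float(whole_price + fraction_price)   # 59.99

# Power BI reads the script's output from a pandas DataFrame,
# so wrap the two values in a one-row frame with named columns.
df = pd.DataFrame([[title, final_price]], columns=["Title", "Price"])
print(df)
```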

### Segment 4 (15:00 - 20:00)

So here we are. The next thing I'm going to do is click New Source, go to More, and pick the Python script source (you can just search for "Python script"). I'll connect, take everything here, paste it in, and click OK. That creates a new query, runs the Python script, and gives us back our outputted DataFrame: under the Python folder we see a DataFrame called `df`, because we named it `df`. Power BI works by looking at all the DataFrames within your Python script, and it can return any of those DataFrames; we just have to select the one we want. We only have one, so we'll click on `df` and click OK.

There's nothing dynamic about this right now; it's just the one URL we're passing specifically into the script. We don't want that; we want to pass in the URL dynamically. First, I'm going to rename this query to something like `fxGetAmazonData` (function, get Amazon data). Then, to turn it into a function, we add open and close parentheses with a parameter, which I'll call `url as text`, so we have a parameter called `url` with a text data type, followed by the goes-to arrow `=>`, which makes it a function. Now we can take this `url` parameter and splice it into our code. The entire Python script is in here as one long line, which is kind of hard to navigate, so bear with me. After the first line with the headers, we scroll over until we find the URL. Everything after the first single quotation mark, I delete; it takes a second to clear out that really long URL. Now I have my opening quotation mark and my closing quotation mark, and I just need to separate this one long string into multiple strings and concatenate the parameter in between: first string `&` `url` `&` second string. This still builds one long string, but I'm closing off the first piece, throwing in `url`, and opening up the second piece. That's my entire dynamic Python script, and that's all we actually need, so let's click Done.

As an example, I can invoke this function with a single URL: grab a URL, throw it in, and see if it works. If it's working, it gives us back just the title and price, and there it goes: "gxm matt large yoga", yep, that looks to be the one, for 98.98. Perfect. To run this for our entire table of data, we go back to our Data table and choose Add Column, Invoke Custom Function, and run our function on the `url` column, which is our only column. A couple more steps once it returns the data: expand that, and that's everything; we have our URL, and we get our title and our price.

It's as easy as that, and you can run this as much as you want. The good thing about doing this with Python is that it's pretty performant: you just run that script as many times as you need, and each run may take just a second to connect to the web page and return. One performance consideration you might want to think about: we're running this as a function, so it calls the individual Python script for each item in your table. That might be fine (I haven't done any performance testing), but it might be better to run the script once over the table as a whole rather than per row. If anyone actually does performance testing on that, please let me know your findings. I like running things as an individual function so we can reuse it on other tables of data if we need to.

But that's the entire trick. I thought that was a lot of fun to do. Taking this a step further, what would be fun is to set up alerts that say, if a price goes above or below a threshold, send you an email or something like that; or maybe you track the history and see if it's below its seven-day rolling

### Segment 5 (20:00 - 20:00)

average. You can set up an alert there; there are a lot of different ways you can go. If that's something you all would like to see, let me know, because I think we could have a lot of fun with this Amazon data or something similar. And I guess we'll end it there. If you liked this video, make sure you hit the subscribe and like buttons; it really helps me out as a content creator and helps me continue making Power BI content. If you do want to see more Python and Power BI videos, make sure you leave a comment down below. And with that, I'll see you in the next one.
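For reference, the whole per-URL flow from the video can be sketched as Python functions playing the same role as the Power Query function. The parsing step is split out so it can be exercised without a live request; the `productTitle` id and `a-price-whole`/`a-price-fraction` classes are the ones discussed in the video, while the function names and everything else are illustrative:

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

def parse_product(html: str) -> pd.DataFrame:
    """Pull the title and price out of a product page's HTML.

    Assumes Amazon-style markup: a span with id "productTitle" and
    spans with classes "a-price-whole" / "a-price-fraction".
    """
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("span", id="productTitle").get_text().strip()
    whole = soup.find("span", class_="a-price-whole").get_text()
    fraction = soup.find("span", class_="a-price-fraction").get_text()
    # The pieces are strings ("59." and "99"); concatenate, then convert.
    return pd.DataFrame([[title, float(whole + fraction)]],
                        columns=["Title", "Price"])

def get_amazon_data(url: str, headers: dict) -> pd.DataFrame:
    """Fetch one product page and return a one-row DataFrame."""
    response = requests.get(url, headers=headers)
    response.raise_for_status()   # fail loudly on non-200 status codes
    return parse_product(response.text)
```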
