Building a Native Correlation Plot in Power BI

Building a Native Correlation Plot in Power BI

Machine-readable: Markdown · JSON API · Site index

Поделиться Telegram VK Бот
Транскрипт Скачать .md
Анализ с AI

Оглавление (3 сегментов)

Segment 1 (00:00 - 05:00)

what's up everybody it's parker with bi elite this video is for my statistics and data science and nerds out there in this video we're going to learn how to create a correlation matrix within power bi a correlation matrix or correlation plot is really helpful because it allows you to see the correlation between two different variables in your data set for example we are looking at the famous mt cars data set which gives you information on different cars for example it gives you the miles per gallon for a specific car it gives you the horsepower for a specific r and pitting those two against each other we can see how correlated those two variables are for example if i want to look at my horsepower column i can see that it is positively correlated with the number of carburetors in the car it is positively correlated with a number of cylinders in the car but horsepower is negatively correlated with miles per gallon that makes sense because the higher the horsepower you have the lower your miles per gallon in your car is so i should back up for just one second if it is red it means there is a negative correlation between those variables if it is blue it means there is a positive correlation between those variables and to fully understand what this is showing us i want to go ahead and plot this real quick so let's create a quick scatter plot and if i go to my cars data set and in our recent example we're doing horsepower versus miles per gallon so i'm going to take horsepower on the x miles per gallon on the y and i can just simply add a trendline we see that it trends downwards meaning it should be red here showing a negative correlation in the previous example as well we did horsepower versus number of cylinders so let me change miles per gallon to number of cylinders and we see oh sorry horsepower on the x cylinders on the y and we see a positive correlation that's why it is colored blue it's nice to see this in matrix format because it's going to boil it down to a single number it doesn't make us have to interpret the slope of the line or the distribution of the points it's just going to easily tell us are these two variables correlated yes or no there is a custom visual available on appsource that lets you create a correlation plot nice and easy but it does require that you download and install r and i know a lot of people don't want to have to download and install r and enable script visuals so that's why i'm making this video so we can create this visual natively within power bi there's a lot of power in creating visuals natively within power bi because it allows us to customize however we want to so with that let's go ahead and dive in i'm going to try to keep this a little bit briefer than usual because there are a good number of calculations that we have to create so let's get to it firstly let's take a quick look at our raw data set if you do want access to this data set you can just download this power bi file uh by going to the blog post link down in the description it's going to have all of these different tables here but right now i'm going to show you our main data table it's called cars we see we have an index column that's just going to tell us which number of car we're looking at we see that we have data for 32 different cars so for car number one it has a value of 21 miles per gallon it has six cylinders in the engine a displacement value of 160 horsepower of 110 for example i don't necessarily know what all of these columns mean if you want to learn i'll link this uh link down in the description as well it just has a nice description of all of these variables i'm not incredibly knowledgeable about cars so i'll stick to what i do know so as you might expect if you have some knowledge of cars you can expect that as the horsepower goes up the miles per gallon is going to go down as the weight just kind of makes sense with how cars are built and that's exactly what we're going to see in that correlation plot so the next step after you load in the data i'm going to go to power query here so the next step after you load in the data i've just added an index column which is actually pretty important so i added an index column starting from one and i reordered it to put the index in the very beginning so that's all i have for my raw data table and then i'm going to create a cars pivot table which is just referencing my cars table and i'm going to unpivot all columns other than my index column let me just replicate what i've done there so starting with my cars table i have my index column highlighted here right click unpivot other columns that's going to put it in a nice vertical format where i have my index so all of these values of 1 was my first row of data that we just looked at i have my attribute and i have my value so we need it in this format in order to do our calculations properly and then finally i have two tables that are identical here attributes one this is basically just getting a distinct list of our attributes and in order to do that i just referenced my car's pivot table

Segment 2 (05:00 - 10:00)

i removed all columns other than the attribute and we see that it just replicates over and over so i just removed the duplicates in that column that gives us all of our individual variables that we want to analyze and then attributes to is the exact same thing i just duplicated but i could have referenced as well but here we are with just two tables that give us all of our attributes so we have those four tables only the last three are really necessary uh you can disable the load of the raw data table if you want to but i just kept it in so i could show you the scatter plots so with that that's all of the data that we need in the proper format so i'm going to go ahead and close and apply and quickly in the modeling view we see that there is no relationship between attributes 1 and attributes 2. so in our visualization view i'm going to create a new tab and we can build this from scratch so let's create a matrix visual let's bring this to the middle so in attributes 1 i'm going to take my attribute column and throw it in the rows and attributes 2 i'm going to take my attribute column and throw it into columns so that doesn't work until we throw in a measure i have a measure called one which is literally just equal to the value of one i'm gonna throw that in the values here so we can see there's just the value of one for every different combination of our rows and columns so that's working just fine for now and now we get into the actual calculation of the correlation coefficient which is just going to give us the representation of the red or blue color so in order to actually calculate this i'm going to show you the specific calculation it looks like this so this r is the correlation coefficient and we have just different variables that we need to calculate in order to arrive at the final single number to bring it down quickly n is the total number of samples so cars in our data set x is basically our value along the x y is y so this one specifically right here is the sum of x times y we then want to subtract the sum of the x multiplied by y so you can see this is very specific kind of looks like a statistics calculation a very specific one so if we just follow this to a t we arrive at this correlation coefficient i'm going to go ahead and skip the bottom here because we're actually going to put that into action in our final measure calculation but basically you can kind of get your head wrapped around how we need to calculate all these different variables sum of x times y sum of x sum of y i see sum of x squared y squared so those are the five or six calculations that we need to replicate they're all going to be extremely similar though so we just need to wrap our heads around one of those calculations all right so in my car's pivot table i've already created these calculations and we can start to throw this in just maybe a table just to understand how these calculations are working here so in if you recall from our page here in is the total number of samples total number of cars so that's easy just distinct count of my car's pivot index so if we throw this in we see we have 32 cars that's perfect nextly we need to understand the sum of x y so sum of x times y so basically that is going to look at our data set let me look at our car's view actually let's um pick an example here so let's say we want to figure out the correlation between horsepower on the x and miles per gallon on the y so what we need to do is we need to sum x times y so that is miles per gallon times horsepower and we're summing that all the way down so if we want to understand this imagine a calculated column that's the horsepower times miles per gallon in this example it'll be somewhere times somewhere around 2000 so calculate the product of the two and then sum them all that's how we're basically doing that so in order to put that into code let's take a look at our x y so it's a fairly easy pattern once you understand it so current x is selected value of my attributes 2. remember my attributes 2 is along my x my columns current y is my attribute's one attribute which is going down my y or my rows and then we need to create a virtual table i'm just going to call it var virtual and set it equal to the summarize column so we're summarizing our pivot table our car's pivot table we are going to group based on the index and then i just want to grab the value for my current x so it's basically looking just like my cars table for index one give me the horsepower value along my x give me the miles per gallon value along my y so basically it's just getting the max value where my attribute is

Segment 3 (10:00 - 14:00)

my x which in this case we're talking about miles per gallon similarly doing it for the y grabbing that value where my current y is horsepower and then finally we just return the sum x of my virtual table x times y so as you can see we're creating this virtual table so it's going to go from 1 to 32 all of my index values and it's grabbing the x in the y value and then finally we're iterating through that table multiplying x times y so pretty easy to do we're going to replicate this pattern for all of our calculations for example it is the exact same pattern but we're basically just summing x for y we're just summing y for x squared we're summing x times x for y squared or summing y times y and i think that is all of our calculations so that's pretty much it um if you need a second to wrap your head around that one maybe plot that one out let that soak in and then our final measure is just this correlation coefficient and what this is it is this exactly so we have our n times x y minus x times y divided by square root of n times x squared minus x squared times n times y squared minus y squared that is exactly what it's doing not too much to talk about there but once we have it done like that we can remove actually let me go ahead and get rid of my table we can remove the ones from here i'm going to throw in my and my correlation coefficient right now is rounded to zero decimal places but if i let's add a couple decimal places we see the actual correlation coefficients uh for these different combinations of variables so now if i want to create some coloring i go to my correlation coefficient in the values conditional formatting and add a background color so i'm going to build this conditional formatting based on color scale based on correlation coefficient calculation so my lowest value i'm actually going to create custom from negative one it's going to be a red color to a value of 1 which is blue so negative 1 is red positive 1 is blue anywhere in between will be kind of a purple color let's go ahead and click ok and there is our correlation matrix so in order to just format this properly we can get rid of our subtotals so i get rid of my row subtotals my column don't need that anymore i can also change my correlation coefficient conditional formatting font color to be the exact same thing so i want my font color to match my background color so it just doesn't show up anymore so that's looking pretty good and then you also might be wondering how to make them squares it's actually pretty easy to do you just need to go to values and text size anything over like 20 will make a will make somewhat of a square i did forget one step um so for the correlation coefficient we want to leave that at zero decimal places because we want our values to just be one uh digit that's going to allow us to create the squares so we can do that by coming to the formatting values let's change our text size to be like 20 24 let's say make them more squares so that's looking pretty good you can size this however you want but that's a pretty nice correlation visual you can change the formatting of this table however you'd like to but you can just see this nice correlation plot one last thing here you can notice when you have variable 1 versus variable two and they're the exact same you see that they are perfectly correlated that makes sense because a change in one is the exact same change in the other so you're going to have that nice blue across the entire way then everyone else is going to have some other form of blue or red because they're probably not perfectly correlated so that was a long video but a lot of fun i enjoyed making this natively in power bi i hope you liked this video if you did please make sure to hit the subscribe button is the best way to show your support of the channel and helps me continue creating power bi content if you like the way i present power bi concepts make sure you check out my training over at training. bealey. com we have some awesome training on power bi dax alteryx and sql at the moment and we're going to add more and more very soon and with that i'll see you in the next video

Другие видео автора — Parker Stevens - BI Elite

Ctrl+V

Экстракт Знаний в Telegram

Экстракты и дистилляты из лучших YouTube-каналов — сразу после публикации.

Подписаться

Дайджест Экстрактов

Лучшие методички за неделю — каждый понедельник