Data Science Made Easy with Statistics & Probability | Data Science Roadmap 2026 | Edureka Live
41:52

edureka! · 12.05.2026 · 1,253 views · 50 likes

Video description
🔥 Edureka Data Science with Python Certification Course: https://www.edureka.co/data-science-python-certification-course (Use Code: YOUTUBE20)
🔥 Edureka Integrated MS+PGP Program in Data Science & AI: https://www.edureka.co/dual-certification-programs/ms-data-science-pgp-gen-ai-ml-birchwood
This session on Statistics and Probability will cover all the fundamentals of stats and probability.

Contents (9 segments)

Segment 1 (00:00 - 05:00)

Hi everyone, a very good evening. All right guys, let's get started on statistics and probability for data science. Everyone knows that there are three pillars of data science, and one of them is statistics. The first is programming, the information systems side that we talk about; the second is statistics, and of course probability comes under statistics; and the third is the business domain. So it is very important to understand the statistical aspect of data science: how we can use statistics and the concepts of probability for predictive analytics, for exploratory data analysis, or for a visual understanding of the data. There are various ways you can use statistics and its concepts in data science, sampling techniques for example. I can tell you from my experience that statistics is about knowing the tools with which you can perform analysis, and those statistical concepts are very useful.

For today, we'll start with an introduction to probability, then talk about the world of statistics in data science, and we'll also look at it in Python. All right. So, let's get into the main content, which is probability. You might already know what probability is, but let me introduce it. Probability is nothing but the chance of something happening, or how likely it is that an event will occur. For example, what is the chance that someone will [inaudible]? Are you getting this?

So, as the table shows, consider a die, just a simple die. It has six outcomes, one to six, and the desired outcome we are choosing is four. So four is the desired outcome, but the total number of outcomes is six. The desired outcome is only one: what is the chance that a four comes up? It's only one side out of six possible outcomes, so 1/6 is nothing but the probability of getting a four.

With probability there is also the concept of a random experiment, a very important concept: it may be rolling the die, or it may be looking for the probability that a customer buys something. These are all random experiments, where you are trying to understand how likely something is to happen. With that, we talk about the sample space, which is nothing but all the possible outcomes. If I talk about rolling a die, the possible outcomes are 1 to 6, right? This is called the sample space of an event. And with reference to that event, we talk about the probabilities. Okay?
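
The die example above can be written out as a minimal sketch. The helper name `probability` is an illustrative assumption, not something shown in the session; it simply divides the number of desired outcomes by the size of the sample space.

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}

# The event of interest: rolling a four.
event = {4}


def probability(event, sample_space):
    """Classical probability: desired outcomes / total outcomes.

    Hypothetical helper; it assumes every outcome is equally likely.
    """
    favourable = event & sample_space
    return Fraction(len(favourable), len(sample_space))


p_four = probability(event, sample_space)
print(p_four)         # 1/6
print(float(p_four))  # the same probability as a decimal
```

Using `Fraction` keeps the result exact, which makes it easy to compare against the 1/6 worked out by hand.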

Segment 2 (05:00 - 10:00)

Then you can also see joint events. A joint event is, for example, rolling a four on both dice, so events can occur jointly. You can also see the example here: a student can get 100 marks in statistics and 100 marks in probability, so that's a joint event. Or the outcome of a ball delivered can be a no-ball and a six, right? That's a combination, a joint event. But sometimes events are disjoint, meaning they don't have a common outcome. With 100 marks in statistics and 100 marks in probability there is a common outcome, but sometimes there is not. A single card drawn from a deck cannot be a king and a queen together. It's not possible, right? We can get the probability of a king separately and of a queen separately, but both events cannot happen together. That's why they are disjoint events. Okay.

As we move ahead, there are a couple more things we should understand to see how probability is used in data science. The idea here is the distribution, the probability distribution: we'd like to understand the probabilities of the possible outcomes. To give you an example, let's say I have taken a sample of 100 drugs, and I want to know how many of them will perform the desired task [inaudible]. Here we talk about at least 10 doses, which means it can be 10 or it can be more than 10 as well. In that case we need to understand the distribution of the probability, and there are multiple ways to understand it: the first is the probability density function, called the PDF, and then there is the normal distribution and the central limit theorem as well. The central limit theorem is basically about sampling. Let's move ahead and understand probability. Yeah.
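
The difference between joint and disjoint events can be checked by simple counting. The sketch below is an illustration under the assumption of a standard 52-card deck and two fair dice; the variable names are made up for the example.

```python
from fractions import Fraction
from itertools import product

# A standard 52-card deck as (rank, suit) pairs.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))

kings = {card for card in deck if card[0] == "K"}
queens = {card for card in deck if card[0] == "Q"}

# Disjoint events: a single card cannot be both a king and a queen.
p_king = Fraction(len(kings), len(deck))
p_queen = Fraction(len(queens), len(deck))
p_king_and_queen = Fraction(len(kings & queens), len(deck))
p_king_or_queen = Fraction(len(kings | queens), len(deck))

# Joint event: rolling a four on both of two dice.
rolls = list(product(range(1, 7), repeat=2))
both_four = [roll for roll in rolls if roll == (4, 4)]
p_both_four = Fraction(len(both_four), len(rolls))

print(p_king_and_queen)  # 0, the events never happen together
print(p_king_or_queen)   # equals p_king + p_queen because they are disjoint
print(p_both_four)       # 1/36
```

For disjoint events the probability of "either" is just the sum of the two separate probabilities, which is what the last check on the cards shows.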
That is nothing but the PDF, the probability density function. It relates to the idea of random variables. A random variable is nothing but a numerical representation of the outcome of an event, and that's how we can empirically work out the probabilities of something, which is done using the probability density function. The PDF tells you the possible outcomes and their probabilities. So when you see that equation, this is called a probability distribution.

Now when you talk about the normal distribution, it is the killer of every distribution, right? The normal distribution is a symmetric probability distribution of a normal random variable X, and there's a formula for it. You can see a nice bell-shaped curve, which represents the data, and you can see the data is distributed within mu ± 3 sigma. What is mu? That's nothing but the mean of the data, and sigma is nothing but the standard deviation. So according to the normal distribution, almost all of the data (about 99.7%) should lie between the mean minus three standard deviations and the mean plus three standard deviations. If the data is spread like this, we can say it is close to normally distributed. And guys, trust me, the normal distribution is very useful. We use it as a reference to check whether the distribution of any data is close to normal or not. Okay.

One more thing is important: we call it a standard normal distribution when the mean is zero and the variance is one. But that's a next-level concept; we should understand the normal distribution first. So you can see the normal distribution's probability curve here, and you can compare your data with this nice bell-shaped curve. Okay?
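
To make the mu ± 3 sigma statement concrete, here is a small sketch using only the standard library. The function names are assumptions for illustration; the density is the usual normal formula, and the cumulative probability is computed with `math.erf`.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and std dev sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Probability that a normal random variable is <= x."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))


def share_within(k, mu=0.0, sigma=1.0):
    """Share of the distribution lying within mu ± k * sigma."""
    return normal_cdf(mu + k * sigma, mu, sigma) - normal_cdf(mu - k * sigma, mu, sigma)


for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```

The defaults `mu=0.0` and `sigma=1.0` correspond to the standard normal distribution mentioned above; passing other values gives the same shares, since they depend only on the number of standard deviations.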
And you can understand how much it deviates from the normal distribution. That will

Segment 3 (10:00 - 15:00)

tell you about the distribution of your data. All right. As I told you, we are not covering the central limit theorem right now. There are different types of probabilities as well, like marginal probabilities and joint probabilities, but before understanding these concepts we should revise the basic statistical concepts. So I think we should move to statistics. Statistics is a core idea when working with data. You can see that statistics is an area of applied mathematics concerned with data collection, analysis, interpretation and presentation. There can be a lot of examples, like the one you can see here: your company has created a new drug that may cure cancer. How would you conduct a test to confirm its effectiveness? You want statistical evidence, and you can apply some statistical concepts here to find out whether the drug is really working or not. Now let's get started with some of the terminology, and that is very important, guys. Okay.
When you talk about population and sampling, the proper term is inferential statistics. I'll tell you what inferential statistics is. There are two types of statistics: one is descriptive statistics and the other is inferential statistics, and those concepts rest on the idea of population and sampling. Let me give you that idea.

The population is the entire data that you are interested in. Sometimes the population is huge and you cannot handle that amount of data, or rather, it's not about handling: you cannot even collect that data. To give you a very simple example, suppose I ask you what the average income of people in America is, or what the average length of dolphins in the ocean is. It's not possible to come up with that. Or if I ask you what the percentage of lead in a Maggi packet is. It's not possible to go to each and every person in America and find out what their salary is. So that is the population. Sometimes it is possible, when the population is small and we can work on it directly. Whenever we study the population itself, it is descriptive statistics: we talk about the central measures of the population, like the mean, median, mode, skewness, kurtosis and distribution. In some cases the population can be handled, but as in the examples I've given, it is often not possible to have the entire population and study the data. In such cases we go for sampling. A sample is nothing but a carefully chosen subset of the entire population.
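
The relationship between a population and a sample can be simulated. The sketch below is only an illustration: the incomes are synthetic values drawn from a lognormal distribution (an assumption, not data from the session), and the point is that the mean of a modest random sample lands close to the mean of the full population.

```python
import random
import statistics

random.seed(42)  # fixed seed so the illustration is repeatable

# Hypothetical population: 100,000 synthetic incomes (not real data).
population = [random.lognormvariate(10.5, 0.6) for _ in range(100_000)]

# A sample: 1,000 members chosen at random from the population.
sample = random.sample(population, 1_000)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

print(round(population_mean))
print(round(sample_mean))
```

In practice the population mean is the quantity that cannot be computed directly; here it is available only because the population was generated for the example.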
A well-chosen sample means we take a subset of the entire population, which we believe we cannot handle, and we study that sample, or maybe multiple samples, and then we conclude something about the population. Take the same example: let's say I have carefully chosen 20 people from each state of the USA, asked them what their salary is, and concluded something like "the average salary of people in America is $1,000." Are you getting the idea? I have not dealt with the entire population, only with the carefully chosen sample. This type of statistics, guys, is called inferential statistics: we would like to infer something, to get an idea about the population, by studying samples. That is what we call inferential statistics. All right. But as we see, the samples are very important here. So there are multiple sampling techniques which let us trust that the

Segment 4 (15:00 - 20:00)

samples we are working on can be used to estimate the parameters of the population. Some of the popular sampling techniques are shown here; some are probabilistic and some are non-probabilistic. Random sampling, systematic sampling and stratified sampling are probabilistic. We can also do snowball sampling, convenience sampling, judgmental sampling and quota sampling; these are non-probabilistic. The point is that in probabilistic sampling, every item in the population has a known chance (in simple random sampling, an equal chance) of being part of the sample. That is probability sampling. In non-probability sampling we do not consider the probability of selection. But let's look at a few of them.

The first one is random sampling. As the name suggests, we choose some random items from the entire population, so each member of the population has an equal chance of being selected. For example, I randomly choose some of my students and ask how they are studying or what preparation they are doing for the exam. That's pure random sampling.

The second technique is systematic sampling. Here we have a system: a rule or some criteria by which we perform the sampling. It is not randomized. Let's say my strategy is to choose every second record, so from the population I pick every second item. I can have any other strategy as well, but it is not random; that's why it is called systematic sampling. Every nth record is chosen from the population to be part of the sample.

The third one is also very interesting, and it is called stratified sampling.
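
The three probabilistic techniques named above can be sketched with the standard library. Everything here is illustrative: the population of 20 people is hypothetical, the function names are made up, and the stratified version simply draws the same fraction from each group so that the sample keeps the population's proportions.

```python
import random

random.seed(0)  # fixed seed so the illustration is repeatable

# Hypothetical population: 20 people, each tagged with a gender (the stratum).
population = [{"id": i, "gender": "M" if i % 2 == 0 else "F"} for i in range(20)]


def simple_random_sample(items, n):
    """Every item has the same chance of being picked."""
    return random.sample(items, n)


def systematic_sample(items, step, start=0):
    """Pick every step-th record, beginning at index start."""
    return items[start::step]


def stratified_sample(items, key, fraction):
    """Draw the same fraction from every stratum to preserve its share."""
    strata = {}
    for item in items:
        strata.setdefault(item[key], []).append(item)
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)
        sample.extend(random.sample(members, k))
    return sample


print(simple_random_sample(population, 5))
print(systematic_sample(population, step=2, start=1))  # every second record
print(stratified_sample(population, "gender", 0.2))    # 2 males and 2 females
```

With 10 males and 10 females in the population, a 20% stratified sample contains two of each, which matches the male/female illustration used for stratified sampling in the session.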
When you talk about stratified sampling, we talk about a stratum. What is a stratum, by the way? It is again a subset of the population, but one whose members share some common characteristic. That is called a stratum. The idea is simply to keep whatever share each stratum has in the population. For example, I have a male subset and a female subset. With stratified sampling, I follow the same ratio of male and female items in my sample. So if I have chosen two males from the male subset of the population, I will also choose two females from the female subset. That is stratified sampling: we try to maintain the ratios of the population.

I have already given you the idea of inferential and descriptive statistics, but here it is again. Descriptive statistics uses the data to provide a description of the population, either through numerical calculations or through graphs or tables. Can you see that we work directly on the population? It's not samples that we are working on. So descriptive statistics can involve numerical calculations, or it can be visual EDA, where we draw charts such as bar charts, histograms and box plots to understand the data. It mainly focuses on the main characteristics of the data. Think about it like this: all these people wear t-shirts or clothes, so what is the maximum size of clothes they are wearing, or the average size, or the minimum size? You can see we are working on the entire population.

The second type of statistics is called inferential. I have already given you the idea: we would like to infer and predict something about the population, because we cannot measure the entire population. So the idea here is to

Segment 5 (20:00 - 25:00)

answer questions about the population with the help of samples; that is called inferential statistics. Now, what do we have in descriptive statistics and inferential statistics? In descriptive statistics we mainly talk about measures of central tendency: we would like to get an idea of the central tendency of the data. Remember that we are talking about the population here. Then we also talk about measures of variability, or measures of spread, which is the same thing: we would like to understand how the data is spread. Here we can also talk about skewness, kurtosis and so on.

What exactly do we do in measures of central tendency? We talk about the mean, median, mode and related things. Variance, range, interquartile range and standard deviation come under variability, because there we would like to understand how the data is spread. So I hope you got the idea: the three basic things, mean, median and mode, belong to central tendency, that is, the center of the data, and variability tells you the spread of the data (not exactly the distribution, but the spread) using the range, IQR, variance and standard deviation.

Now, the measures of center. There are some examples here, but I'd like to show you this in the code part, so I'm skipping through this part quickly. You probably know what the mean is: it's nothing but the average. Say I want the average horsepower delivered by these cars: you take the average of these values by getting the total of all eight values and dividing by the total number of observations. What is the median? It is the central value in a sorted data set. It's the middle value; that's why it's called the median.
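
As a quick sketch of the mean and median, the block below uses eight horsepower values. These numbers are illustrative stand-ins and are not copied from the slide; `median_of` is a hypothetical helper that sorts the values and takes the middle one, averaging the two middle values when the count is even.

```python
import statistics

# Illustrative horsepower values for eight cars (assumed, not from the slide).
hp = [110, 110, 93, 110, 175, 105, 245, 62]

# Mean: total of the values divided by the number of observations.
mean_hp = sum(hp) / len(hp)


def median_of(values):
    """Middle value of the sorted data; average the two middle values if even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(mean_hp)                # 126.25
print(median_of(hp))          # 110.0
print(statistics.median(hp))  # the standard library gives the same median
```

The mean (126.25) sits above the median (110) here because one large value, 245, pulls the average up while leaving the middle of the sorted data unchanged.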
But how do you calculate the median? You put the values in order, ascending order by the way, and then look for the middle value. If there is no single middle value, we take the two middle values and average them, and we consider that the middle value. The mode is nothing but the most popular item. Mostly the mode is calculated on categorical data, where there are categories. So it's nothing but the most frequently occurring item in the sample. If you look at the HP column in this example, the most popular item is 110, so the mode of the HP column is 110. Or take the displacement column: there it is 160. So the mode is the value that occurs most often.

Next we talk about the range, the interquartile range, the variance and the standard deviation. It's all about spread, guys. The first one is the range. The range is nothing but the gap between the minimum value and the maximum value, or the interval over which the data is spread. Of course the data is spread between the minimum and maximum values. For example, in a class of 60 students, the minimum marks obtained in the statistics course is 25 and the maximum is 98, so the range is 98 minus 25.

The IQR, or interquartile range, is about quartiles. We can divide the sorted data into quarters on a percentile basis: the first quarter is 0 to 25% of the data, ending at Q1; the second is 25 to 50%, ending at Q2; the third is 50 to 75%, ending at Q3; and the remaining 75 to 100% is the fourth quarter. But what exactly is the IQR? The interquartile range is nothing but the difference between the

Segment 6 (25:00 - 30:00)

percentile values: Q3, which is the 75th percentile, minus Q1, which is the 25th percentile. Make sense? You can see I'm calculating Q1 and Q3 in this data, and the IQR is the gap between Q3 and Q1. This is also useful when you're working with percentile data. The interquartile range is like the middle of the data, because it covers 25% of the data plus 25% of the data here.

Then we talk about the variance, which is nothing but how far the values are from their mean. You can see the formula: it takes each data point minus x-bar, which is the mean of the data points. We square it because this difference can be positive or negative. Then we take the sum of all the squared differences and divide by n. That's the variance. The standard deviation is the square root of the variance. The standard deviation is more useful, guys, because it is in the same unit as the actual data.

Now we'll move to the hands-on. Next I want to jump into the code and show you how Python can be used to calculate all these statistical measures. Guys, Python is a super useful, simple and intuitive language. Everyone can learn it, whether you are from a technical or a non-technical background. So let's jump into the code. I'm going to take a data set, because I want to give you a real-world feel for how we work on data sets and on statistical measures. I have some outputs here already, so let me just clear all those outputs.
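
Before the pandas walkthrough, the mode and the spread measures described above (range, IQR, variance, standard deviation) can be sketched with the standard library alone. The marks and horsepower lists are hypothetical; only the minimum of 25 and maximum of 98 echo the classroom example. Note that quartile values depend on the interpolation method, so `statistics.quantiles` may differ slightly from what pandas reports.

```python
import math
import statistics

# Hypothetical marks; the minimum (25) and maximum (98) echo the class example.
marks = [25, 40, 55, 60, 72, 81, 98]

# Range: maximum minus minimum.
data_range = max(marks) - min(marks)  # 98 - 25 = 73

# Quartiles and IQR (Q3 - Q1), using the module's default method.
q1, q2, q3 = statistics.quantiles(marks, n=4)
iqr = q3 - q1

# Variance as described in the session: squared deviations from the mean,
# summed and divided by n (the population form).
mean = statistics.fmean(marks)
variance = sum((x - mean) ** 2 for x in marks) / len(marks)

# Standard deviation: square root of the variance, in the data's own units.
std_dev = math.sqrt(variance)

# Mode: the most frequent value (illustrative horsepower values).
mode_hp = statistics.mode([110, 110, 93, 110, 175, 105, 245, 62])

print(data_range, iqr, round(variance, 2), round(std_dev, 2), mode_hp)
```

Dividing by n gives the population variance; many libraries, pandas included, divide by n - 1 by default to estimate the variance from a sample, so the two numbers will not match exactly on the same data.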
Then I will start afresh. Whenever we do something in Python, we always start with the libraries: we import the libraries we're going to use. These are all general-purpose libraries, guys. You can see I'm importing pandas, NumPy, math and seaborn. In the courses we cover all these libraries in detail, but time does not permit that here, so I'm just importing them. The next thing I'm doing is filtering warnings: if the code sometimes gives you warnings, you can just ignore them. So we have done that.

Next, I'm using Fortune 500 companies data. It is openly available on platforms like Kaggle, and you can also download it from data.world. It is freely available; that's why I'm taking it as an example. I have uploaded the CSV file here in the Files section. I'm using Google Colab, by the way, guys. And I'm good to read the data from this CSV file. So df is nothing but a DataFrame, the storage structure in pandas: we read the data from the file and store it in the DataFrame.

Before I move ahead, guys, one thing I'd like to tell you: all these statistical measures come under the idea of exploratory data analysis, or knowing your data more deeply. This step is a mandatory step in any data science work. Think about machine learning, predictive modeling, deep learning, NLP, whatever it is: exploratory data analysis is a mandatory step. We explore the data using descriptive or inferential concepts. All right. Now, let's see the data.
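
The setup described above might look roughly like the sketch below. The actual notebook code and file name are not shown in the transcript, so this is an assumption: a tiny in-memory CSV with made-up companies and column names stands in for the Fortune 500 file, and `io.StringIO` replaces the uploaded file path.

```python
import io
import warnings

import pandas as pd

# Silence library warnings, as the session does before loading the data.
warnings.filterwarnings("ignore")

# Stand-in for the uploaded CSV file (hypothetical rows and column names).
csv_text = """name,rank,year,revenue_mil
Company A,1,1996,500.0
Company B,2,1996,120.0
Company C,3,1996,
Company D,4,1996,80.0
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first five rows by default; df.head(10) would show ten
```

With a real file, the `io.StringIO(csv_text)` argument would be replaced by the path of the uploaded CSV.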
I mean, I have not shown you the data yet, but before I jump into the data itself, I'd like to know the shape of the data. Right now, we are dealing with a huge data set. You can just

Segment 7 (30:00 - 35:00)

see 13,940 records, right? I think there's a typo again here, so let me just correct it. It is 13,940; I think I changed the data set, and that was the older observation. And what is that 16, by the way? That's nothing but the number of columns we are dealing with. So this is huge data. Now it's time to see a preview of the data. You can simply call the head function. By default, head gives you the top five rows, or records, from the DataFrame, and if you want more you can supply any number. So this is your data set, and it's a huge one. You can see the names of the companies: General Motors Corporation, Ford Motor Company, Exxon Corporation, Walmart, AT&T Bell Labs, and their rankings, their year of establishment, the industries they belong to, and the sector, which is sometimes null. Then the headquarters: these are states of the US and the headquarter city. Where that information is missing, you see NaN here. This one is important: revenue in millions, the revenue in millions of dollars. Some of the values are not given, which is why you see NaN, but not always. We are looking at this revenue. So this is the data we are working with.

The next thing we want to do is view a summary of the data set. Of course it is not possible to explore the entire data set, so we go for summaries so that we get to know more about the data. The info function from pandas is the best choice for this. So I am calling df.info(), where df is the DataFrame. All right, let's run it. This is a very interesting output from info. You can see there are 13,000-plus entries, and here are the names of the columns, like name, rank, year, industry, sector and so on.
Then there is the non-null count: how many values we have that are not null. It gives you an idea of how many null values are present in each column. You can see that headquarter city has 7,495 non-null values, but in total we have 13,940 records. That means this column has less data than the total, so there are null values, and we are going to detect them. One more interesting thing you can find here is the data type: what type of data we are dealing with. Whenever you see object, it is string data, and int64 and float64 are numerical data. So the company name is of course a string, and rank is numerical, an int64.

Moving ahead, let's explore the data in terms of null values, statistical measures and everything. As I said, there are a lot of null values in many of the columns, since those columns have fewer values than the total. So it is important to find out how many values are missing. In df.isnull().sum(), isnull detects whether each value is null, and sum then gives you the total number of nulls in the DataFrame, column-wise. The output is straightforward: name does not have any null values. Sector has 9,440 null values. Headquarter city has some null values. Then market value in millions, profit in millions, assets in millions: there are a lot of null values here. There are methods, guys, to get rid of these null values. We can use simple ones like the mean or median, or sometimes the mode for categorical variables.
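
The null-value checks described above can be tried on a small frame. The data and column names below are hypothetical, not the Fortune 500 file. The sketch counts nulls per column and then fills them with the simple strategies mentioned: the median for a numeric column and the mode for a categorical one.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing values.
df = pd.DataFrame({
    "name": ["Company A", "Company B", "Company C", "Company D", "Company E"],
    "sector": ["Retail", None, "Energy", None, "Retail"],
    "revenue_mil": [500.0, 120.0, np.nan, 80.0, 100.0],
})

df.info()  # prints column names, non-null counts and dtypes

# Column-wise count of missing values.
null_counts = df.isnull().sum()
print(null_counts)

# Simple imputation: median for numeric data, mode for categorical data.
filled = df.copy()
filled["revenue_mil"] = filled["revenue_mil"].fillna(filled["revenue_mil"].median())
filled["sector"] = filled["sector"].fillna(filled["sector"].mode()[0])
print(filled)
```

`mode()` returns a Series because several values can tie for most frequent, which is why the first entry is selected with `[0]`.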
We'll also go for some smarter techniques. SMOTE, for instance, generates synthetic data; it is the synthetic minority oversampling technique. We'll also use KNN to

Segment 8 (35:00 - 40:00)

fill in the null values. But it is important to understand this first. Now we are coming to the core idea we want to understand, which is descriptive statistics. We'll be finding out a lot of things like the count, mean, standard deviation and median. We can calculate all of these separately, no problem, but there's a function in pandas called describe which gives you the summary straight away. I'll also show you how to calculate those things separately.

So df.describe() is the function I'm calling; look at the output. It only considers the numerical columns, which is why there are fewer columns here. It shows the count, mean, standard deviation, minimum value and so on. To give you an example, the mean revenue, the average revenue of a Fortune 500 company, is about $20,225 million. What is the minimum revenue a company is generating, maybe the 500th company? $48,000 million, or maybe million dollars, right? Sorry. And the maximum is like this. You can see how widely the data ranges. Besides the minimum and maximum values we have the 25th percentile, 50th percentile and 75th percentile, so you can easily calculate the IQR from here for any column. It's a very good idea to call describe because it gives you all of these in one go.

But as I told you, we can also go for individual columns. Before I go for individual things like the mean, mode and median, I can also include the object type, because here I only have numerical data. We can use describe to check the object-type columns as well. Object type means strings. You can see a lot of object-type columns: name, industry, sector, headquarter city, founder, all of this.
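
Here is a small sketch of `describe` on hypothetical data, including the IQR read off its output and the object-type variant. The text columns are created with `dtype=object` explicitly, an assumption made so that `include="object"` selects them regardless of how a given pandas release infers string columns.

```python
import pandas as pd

# Hypothetical data; text columns are stored with dtype=object on purpose.
df = pd.DataFrame({
    "name": pd.Series(["A", "B", "C", "D", "E", "F"], dtype=object),
    "sector": pd.Series(
        ["Retail", "Retail", "Energy", "Retail", "Tech", "Energy"], dtype=object
    ),
    "revenue_mil": [50.0, 80.0, 100.0, 120.0, 150.0, 900.0],
})

# Numeric summary: count, mean, std, min, 25%, 50%, 75%, max.
summary = df.describe()
print(summary)

# The IQR can be read straight off the describe output.
iqr = summary.loc["75%", "revenue_mil"] - summary.loc["25%", "revenue_mil"]
print(iqr)

# Summary of text columns: count, unique, top (most frequent) and freq.
text_summary = df.describe(include="object")
print(text_summary)
```

For text columns, `top` and `freq` are the mode and how often it occurs, which is the frequency figure the speaker reads out next.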
So you can see these are the things we can get: the count, the number of unique values, the top value, and its frequency. You can see here US Bangor has got a frequency of 28, so like multiple companies in Bangkok, like this. Now, as I told you, we can also compute the measures of central tendency separately. Here I'm getting the mean of this revenue in million column, and you can see the mean is this. It has already been shown in the describe function output, but you can definitely call the mean function on a specific column. This is the median; I'm calculating it with the median function. You can also see the mode here. Mode probably will not be a good idea here, but you can see these are the three unique values and their frequencies. Most of the time, this measure is mostly useful for categorical data.

Now the next thing we can do is look at the distribution of the revenue. We need to understand how the data is distributed in that revenue column. There are various ways to do it. We can call the PMF from the statsmodels API, but the simplest way to understand it is this distribution plot. Here I'm passing the data, which is nothing but my revenue in million column. bins is 10, meaning I want the data in 10 intervals. I want a histogram, that's why hist is True. kde is the density curve, the distribution function that I'm going to use here. And revenue in million is the label, by the way. Just look at the distribution plot. It tells you that the data is heavily right skewed. You see very few values at the high end, because very few companies have very high revenue, while most of the revenues are low. The distribution is highly right skewed, as we call it. So it can be, you know, huge. We can do a lot of things, like the minimum value separately, the maximum value separately, and the range we can compute like this. Okay, we can also get the

Segment 9 (40:00 - 41:00)

maximum value for it. You can see that we can also get the variance; we just have to call the var function, guys. You can observe that pandas is used for every operation here; every time I'm just using the DataFrame and the functions from pandas. The variance is large because it is a squared value. We can also calculate the standard deviation; you can see it is a smaller value. We can also calculate percentiles. Let's say you're talking about Q2, the 50th percentile: call the quantile function with 0.5, simply. And then, let's say you want the 25th percentile, like this. And the IQR is going to be like this; you can see the value here. So there are a lot of details we can find out using this. In inferential statistics you can go on to things like skewness and more ahead. Yes.
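As a rough sketch of the separate calculations walked through in the last two segments (mean, median, mode, range, variance, standard deviation, quantiles, IQR, skewness), assuming a small invented revenue column rather than the real Fortune 500 data. The Series names and values are hypothetical; the methods are standard pandas Series methods. The distribution plot itself is not reproduced here; the skewness value stands in as a numeric check of the right skew.

```python
import pandas as pd

# Hypothetical revenue values (in millions), invented for illustration.
revenue = pd.Series([100.0, 200.0, 300.0, 400.0, 1000.0],
                    name="revenue_in_million")
# Hypothetical categorical column, where mode is most meaningful.
sector = pd.Series(["Retail", "Energy", "Retail", "Tech", "Retail"],
                   name="sector")

# Central tendency
mean_value = revenue.mean()
median_value = revenue.median()
top_sector = sector.mode()[0]  # mode() returns a Series; take the first entry

# Dispersion
value_range = revenue.max() - revenue.min()
variance = revenue.var()  # sample variance (ddof=1), in squared units
std_dev = revenue.std()   # square root of the variance, in original units

# Percentiles and IQR
q1 = revenue.quantile(0.25)
q2 = revenue.quantile(0.50)  # same as the median
q3 = revenue.quantile(0.75)
iqr = q3 - q1

# Shape: a positive skew means a long right tail (right-skewed data),
# which also shows up as the mean sitting above the median.
skewness = revenue.skew()

print(f"mean={mean_value}, median={median_value}, mode(sector)={top_sector}")
print(f"range={value_range}, var={variance}, std={std_dev:.2f}")
print(f"Q1={q1}, Q2={q2}, Q3={q3}, IQR={iqr}, skew={skewness:.2f}")
```

The variance comes out much larger than the standard deviation because it is measured in squared units, which matches the observation in the session.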
