# Top 10 SQL Cleaning Functions

## Метаданные

- **Канал:** Absent Data
- **YouTube:** https://www.youtube.com/watch?v=aDaGuNtqkzw
- **Дата:** 25.05.2026
- **Длительность:** 31:25
- **Просмотры:** 4,728

## Описание

In this video we clean a realistic business orders dataset using Microsoft SQL Server and real-world SQL data cleaning techniques. We’ll walk through how analysts clean and standardize data before it can be used for dashboards, reporting, BI analysis, or marketing analytics.

Topics Covered:
LTRIM() / RTRIM()
LOWER()
UPPER()
COALESCE()
NULLIF()
CASE WHEN
TRY_CAST()
TRY_CONVERT()
ROW_NUMBER()
Views

We also cover:
removing duplicate rows
creating clean reporting views
standardizing DMA regions
handling invalid data safely
building a reusable SQL cleaning workflow

Dataset  Included:
https://github.com/Gaelim/youtube/blob/master/business_orders_raw.csv

This is a simulated dataset created for educational purposes.

Practice SQL Questions Online with StrataScratch (Affiliate)
https://stratascratch.com/?via=absentdata

Find me on LinkedIn:
https://www.linkedin.com/in/gaelimholland/

#SQL #SQLServer #DataAnalytics #DataAnalyst #DataCleaning #BusinessIntelligence #LearnSQL #SQLTutorial #ETL #Database #MarketingAnalytics #DataEngineering #MicrosoftSQLServer

## Содержание

### [0:00](https://www.youtube.com/watch?v=aDaGuNtqkzw) Segment 1 (00:00 - 05:00)

Hi, welcome to the video where we are going to cover the top 10 or so data cleaning functions in SQL. We're going to clean a data set, create a view, and get it ready for production, whether that's going into a dashboard or being analyzed. So, we have this data set here uh and we have some tasks that we want to perform over these columns. So the columns we have are the order ID, the business name, account handle, the DMA region, uh business type, order date, order amount in the client tier. So we're going to process all of these things, uh all of these columns, and this is going to be kind of a part of our pipeline, our cleaning pipeline. So that means everything from removing duplicates, standardization, um where we're looking at handling duplication, blank values, uh we're going to make sure we convert all the data types and then uh create our view for us to use. If you want to see some of the functions we're going to cover, they're all right there. We might have a few more or a few less, but this is pretty much um primarily the ones that you're going to use and they may differ slightly from the database management system that you're using. So, as you see right now, I have a select query and all I'm doing is bringing in our business orders raw. Let's handle duplicates first by creating another column. And the way we're going to just simply do that is just make a comma and uh now use a window function where we count the rows based on our uh specification. So row number is a function and a window function has over with it. So over and then I am going to kind of space this out a little bit so we can see it. So uh make it nicely formatted. So let's look at our data set. We have the order ID and business ID. Those are pretty unique. And when we put them together, that should make each one of those rows supremely unique. So let's partition or segment in a sense by and let's say let's segment by uh the order ID and the business ID. That means each one of those rows is tied to the I the order ID and the business ID. And if we have duplicates, we should be able to um count those from one to two to three, however many duplicates are for that particularly that specific set. So then I'm going to call this as row num. Uh and that looks like that's good to go. I'm going to run that. Looks like there's an error somewhere. Oh, um, we also need to put order by inside this uh clause in a sense. So, let's order by uh let's order by the order ID. All right, that should do it. All right, order by Oh, let's see what I did wrong. Business ID is spelled wrong. Biz looks like this. multiple S's there. There we go. So, what has this essentially done? It's ordered by the order ID. We have made unique rows here with uh these two elements or columns. And then now we can go over to our row number and you can see it's one. This means it's a single count. So, we can just go down a little bit and now we can see a two here. And this looks like that's about 20. What? 27. So, let's go over row 27 there. And then you can see now we have these two orders 26 and 27. We have the same information, but what's going to give it away is you can see the two order amounts and the dates are the same. So, we can go down a little bit more. See if we see another one. pretty sizable data set. Okay, let's look at this one. That looks about 164 and 165. Also, you can see the same business name, same account handle, everything's the same. And then, of course, um if we get down to that amount, you can see that it's just a duplication. So, we're going to remove those. And we can easily do this. uh um at the moment or we can do it later. I'm

### [5:00](https://www.youtube.com/watch?v=aDaGuNtqkzw&t=300s) Segment 2 (05:00 - 10:00)

go what I'm going to do now is if you use a wear function that's just going to bring back the um the duplicates that we don't need because it's just going to bring back one. So if I just put where row num equals one, what's that's going to do? I uh oh well, let's do it this way. I don't want to write uh this over again. So what I'm going to do is first let me comment this out. And now if I run that again and the reason why is because it can't pick up the row number. So let's make this into a CTE. Let's call that with duplicate check. And then what we're going to do with duplicate check as and then let's encapsulate that. And what this does is turn it into a kind of a temporary table. And then I am going to just say all right now I want to select uh everything from duplicate check. So that means this is already logged in. Now we can get rid of this. And now if I run this now I have gotten rid of all of this information. Now, as we build our clean data set, we want to go row by row. So, I don't need to process the order ID and business ID. So, I'm going to bring those in. So, let's say order ID, business ID, and now what I need to do is fix the business name. So, I'm going to bring in those three columns and just, you know, organize this a little bit. There we go. So, I am going to organize that a little bit. And let's pull this back. There we go. All right. So, so if I run this, we're only going to be pulling in those three columns. And looks like I am having an error somewhere. There we go. I'm only going to pull back those three columns. Now, we want to be able to look at what we're doing as we clean. So I am going to just take this value here and after the query I'm going to end this query with the semicolon. And then next I'm going to pull up the raw table so I can look at them at the same time while I'm cleaning. So from and then we just take business orders raw here because that is our untransform data set. And then I run both. And now I have both of those in the window. So the first thing we want to handle is you can see we have non-standardized um business names. Some of them are capitalized, not. So what we want to do is clean that up. So we can use upper or lower. I'm going to use um upper or lower for the business name. Let's just go ahead and use lower just for this purposes. So, the first thing we're going to do is wrap that in lower. So, lower is just going to lower all everything there. There we go. And then make sure we keep the same name because we're just replacing the column right with our clean version. And let's make sure we have everything standardized. And then I'm going to write after this standardize. So I'm going to use a document string with that uh double dash standardized. So now we have that standardized. All that's going to do is lower the name for us. Now, we could use capitalize or anything, but also we want to remove the spaces that may be before or after this. And you could use

### [10:00](https://www.youtube.com/watch?v=aDaGuNtqkzw&t=600s) Segment 3 (10:00 - 15:00)

trim, but some of your database managements don't use trim. They use a combination of LTI, which is left trim and right trim. So, I'm going to say LTIM. So, I'm just going to chain these together. Ltrim and then I'm going to just, you know, put that in the parentheses. So now I've lrimmed it and now I can also R trim it. And then run this. Make sure that's all done. Run it. And now we have no missing spaces. So very easy way to handle that. Now that means we could standardize these by just removing the spaces just in case stuff was entered wrongly like a form. So now we have our three columns done. I'm going to go here and go to the next column which is and all I'm doing is copying and pasting that because that's my standardization. And I am going to um get our account handle. So, let me make sure this is here standardize. And then I'm going to call this as account handle. And then we want to use the same thing here inside that function. So, account handle. Now, what do we want to do with this account handle? We have nles. So a lot of times when we have nulls, we want to be able to say, okay, if there's a null, how do we handle it? We could just leave it blank as a null. But let's process this by using a function called coales. And coales handles nulls. What it's is like an ifn. So I'm going to say, okay, if this value, and you see how we're chaining these functions together. If this value is null, I want you to give me. So now I'm putting a comma and I'm going to say um missing handle and now cover that with that quotation and now we have handled the null and we also have lowered the account numbers. So they're standardized the account handle they're standardized and we have handled nulls. We're calling this as handle. So those are also standardized. So you could also say standardize and hand and process nulls. So this is just when you have this query, you can whoever has it knows what you've done. Of course, you could add a pretty detailed document string there, but let's uh let's keep writing. I just spell this wrong. It says incorrect syntax near coales. Oh, let's see what I've done here. No issue there. We are missing a comma. So that is the problem. Okay, so now that's handled. And you can see where there's no we have a missing handle there. Let's move to the next one. So our next column here is going to be the DMA region. Uh, we can do the same, right? We can just take this, make sure we capture all of it, put a comma behind account handle so we don't run into the problem that we just ran in before. And now just drop this in. We are now pulling the DMA region. All right. And then we also want to call this DMA region. And all we want to do is standardize those and then we should be good to go with DMA region. So now DMA region is okay. Now there is um a capitalization here that we have in some of the database management systems. I think it's called incap or something like that. So you might if you're not using Microsoft SQL Server, you might want to use this instead. And what that will do is capitalize the first word. But this is a bit hard with um these um uh what do you call these DMA regions because sometime they're combination of both. Okay. So I'm just going to say standardize that. So we're good to go with that. I am going to make a comma

### [15:00](https://www.youtube.com/watch?v=aDaGuNtqkzw&t=900s) Segment 4 (15:00 - 20:00)

here because now we're going to handle the next one. And there. Uh, okay. I'm going to just copy this again. Run that again. And now, um, we're going to do quite a bit more with this. So, first we want to rtrim, ltrim them, but also we want to deal with um, pulling these together. So, finance and retail, hospitality. Um, I'm going to start adding these together and grouping them together. So, we want to get the business type, right? So, this business type here and I'm going to just get business type. And that part is done. Now what we want to do is we want to use a case. So case is like using if. So case when and we want to format this easily. So we can just replicate. So I'm going to say case when. So this win here is going to be uh like the condition. So I'm going to say case when business type and I'm going to say when that business type is in. So I'm going to choose let's say um finance and technology. So I'm going to say finance and they're going to be lowered. So that means we're pulling in the lowered one technology. When know when you find either of those, then I want you to give me fin tech. So that is what we're going to use for that. And then now all I need to do is copy this part and get another when. And I'm going to say when uh let's say I'll do just uh logistics by itself. So I'm not going to worry about that one. Oh, let's do manufacturing and logistics for one. This is just an example. manufacturing and then logistics. I'm just going to call this uh all of that logistics. This is just a example for us how we want to put things together. And then we have marketing and hospitality. Let's do one of those. Let's say marketing and hospitality. And then we can call this our uh and I'm going to actually put retail in there also. Retail. And I'm just going to put that in our e-commerce business. Ecommerce and then I'm just going to say anything else. So we have those three else. Uh let's just keep let's just say we keep the category, right? So I'm going to just say else give me the category. So this is just us organizing things and then we also need to end this and we need to say as uh the business type there. So now we say as business type, we got a comma there. We have our case when statement. This should be good to go.

### [20:00](https://www.youtube.com/watch?v=aDaGuNtqkzw&t=1200s) Segment 5 (20:00 - 25:00)

Let's see. And what do we have? We have an error right there. So we don't have a W and win. We also have uh a period somewhere. Then e-commerce. So, let's see where our period is. Incorrect syntax. Looks like Oh, we have an extra comma there. All right. So, that might be it. Let's see. Let's run it. Okay, that did it. So, now you can see we have these categories that we created. Now, we can process the next column. So I'm just going to say grouped and standardized. All right. So next let's deal with our date column. So we have two ways of doing this. We can use cast or we can use a uh something called tririccast. And both are pretty standard. um they work exactly the same. So you want to be able to um just kind of use them interchangeably. So I'll use tririccast. And what tririccast just tries and um does that and it will it's like a safer way to use cast but it's the same thing. It's cast. So I say let's do instead of Oh, let's do convert. Convert is also a good one. So, let's do try convert. And we're going to convert that date. So, we just tell it, okay, this is the data type I want to use and order date as order date. And then we close this off. and run this. We have a comma. No, we don't. Comma order type. Try convert. Spell that right. Convert. There we go. Still didn't work. Okay. Order. Order date. Order date. Should have seen that. Order date. Order underscore date. All right. And then we need to close this up. Okay. There we go. So now you see this has been turned into a date. How do you want to handle nulls? That's up to us. Usually with dates, I don't change them or drop them when I'm processing information in the beginning just because we don't know how they want to be handled by our staff. So, uh, convert to data type. So, I'm going to just leave that as it is. Put a comma here so we can go to the next one. And then the next one we have here is we have order amount. So okay, we also need to deal with this. Now let's use cast for this. So cast is similar. You can use tririccast. So I'm going to say um first I don't want to have a null. So I'm just going to put a zero there. And we know we can use in that cast we can use coales because coales deals with uh the missing null. So I'm just going to write this nicely coales and then I'm going to use order amount say if that's null give me a zero. So I'm going to make sure we have this uh cast here. So now I'm this is what we want to do is cast right. So now that we have this value with zeros we want to say okay I want to cast it as a decimal and I want that decimal you can put I don't know 12 spaces in front of the period and two behind. And now what we can do, I'm just going to write this. And now that's a comma as and then our what is it called? The order amount. Okay. So now this is what we have. So we have a comma here. We have cast. Uh let's pull this. This is a little bit too far over. There we go. Um

### [25:00](https://www.youtube.com/watch?v=aDaGuNtqkzw&t=1500s) Segment 6 (25:00 - 30:00)

and then we should have this cleaned up where it has all the values in front and then only two spaces in back so that doesn't have these long repeating um figures after the decimal. So, let's see. Let's see if that works. Yes, it does. Now, you see there is a zero. Now, we do have negative values in here. We should handle those, but we need to get information from our bosses. So I'm going to say convert it and convert it dt type add and add it zeros for nulls. This just gives us a um way to track what we're doing. And then the last thing here is what we have uh we have our client here. And all we need to do is for our client here is just grab one of these and then come back down where we have client here. And I'm going to say I just want that to be trimmed and lowered. And then I'm going to call this our client tier. that as client tier. And we just, you know, we just say standardize. All right. Uh let's run that. Okay. Now we have a cleaned data set. So what should we do next? We want to save this whole query as a view. So we have this long query and I'm just going to pull this down a little bit. Then I'm going to go ahead and get rid of this. So I don't need that. And if I run that again, now we can see we have this clean data set. We have this full query here. Uh it's pretty long. So now what we can do is create a view. So we can go here and we could say create And then we want to give it a name. So let's call it business orders clean. And then maybe also just for you can put the date, right? date that you did it 2026 just so you have a record. That's not necessary. Um, so create I think it's as view. Oops, hitting the wrong one. As view Oh, actually we create view. This is the way it's done. Create view as. So we can do it this way. That is the right should be the right uh terminology. And a lot of times when you're creating a view, some database managements do this themselves. They put a VW in front. Um, now let's run that. And you can see what it says is completed here. And then what I can do is go over and let's see if we can open up this. And then we go down to our views. And now you can see you have this whole stored table. So I'm just going to pull the first thousand rows of that. And now we have our clean data set. Now I want to show you one more cleaning step and how to update your view. So let's go back over to our main query here and well let's circle back. As you see here, the order ID has ORD and a dash. That doesn't add any value to this order ID. So maybe we want to replace it. And that's what we're going to do. We're going to use the replace function. So, we go R, go to our view, go circle down to the main query, and then we're going to go to where we have our order ID, and let's use replace. And we encapsulate that. And now, what do we want to replace? We want to replace OD dash. And what do we want to replace it with? Just an empty string. Nothing there. So, we just put an empty string.

### [30:00](https://www.youtube.com/watch?v=aDaGuNtqkzw&t=1800s) Segment 7 (30:00 - 31:00)

And then we call this as order. ID. Oops, we don't want it capitalized. Order ID. Now, uh, let's put a comma. And the one thing that will happen because it says create, wait, we create a view. So, I'm just going to show you what happens if we run it. So, if we run this, what we're going to do is get this error that there already is a view that exists. So as you are thinking we want to say no we don't want to create a view it already exists. So let's alter it. A all alter and if we run that now we don't get an error. And then if we circle back over here and we rerun this um SQL query that pulls in that view and execute it. Boom. It updates for you. So that means as your team or someone uses this view anytime you go back and change it will be updated for them. So tell me how you would handle the order I order date that has a null and tell me how you would deal with these negative values here. Also let me know if you would like a part two to this because I'll be happy to do an advanced cleaning uh video for you if we get enough interest. Thank you. I hope that

---
*Источник: https://ekstraktznaniy.ru/video/52913*