# Watch Me Clean Messy Location Data with SQL

## Метаданные

- **Канал:** Absent Data
- **YouTube:** https://www.youtube.com/watch?v=MGIawRRtNQc
- **Дата:** 15.03.2026
- **Длительность:** 18:11
- **Просмотры:** 2,225

## Описание

#sql  #datascience #dataanlysis 
In this video we take a messy location column with ZIP codes, cities, and states all jumbled together, extract just the ZIP code using PATINDEX and SUBSTRING, map it to a region using a lookup table, and find out which region is driving the most revenue.

Here is our SQL workflow
PATINDEX — finds the position of the ZIP code in the messy location string
SUBSTRING — extracts exactly 5 characters from that position
LEFT JOIN — matches the clean ZIP code to the lookup table
SUM — totals revenue by region
AVG — calculates average deal size by region

Find the Datasets here:
Main Sales Table
https://github.com/Gaelim/youtube/blob/master/sales_location.csv
Zipcode mapping table 
https://github.com/Gaelim/youtube/blob/58941e21855c867f6731625dd8d8b509868aa483/zip_lookup.csv

My Linked Profile
https://www.linkedin.com/gaelimholland

## Содержание

### [0:00](https://www.youtube.com/watch?v=MGIawRRtNQc) Segment 1 (00:00 - 05:00)

In this episode, we're going to clean up some messy location data and analyze it by regional revenue in minutes. So, let's get started. So, here's our problem statement. Our problem statement says, our location column is a mess. Zip codes, cities, and states are all jumbled together in a single field within our data. So, we it's really impossible for us to get regional data. Clean it up and identify which region is driving the most revenue. So, let's look at this table that we're looking at. We have the sales ID, the sales date, the product, the location, and revenue. Now, look at our location column. We have the first instance is New York and the zip code. Our second instance is just a zip code. And we have New York and a zip code again, which is a different zip code. San Francisco and a zip code and so on. So, it's very jumbled. So, our goal is to find a way to pull out the location and link it to a region because a state is not a region. And a lot of times in databases, we have a mapping table. And that mapping table will give us additional information that we can join to our existing table or existing query. So, I have this mapping table and let's just run that. And I'll actually run these two together so you can see them side by side. So, on the top is our original data. So, we're just we're doing a wild card select. And on the bottom here is our mapping table. And our mapping table has zip code, region, and city. Region is what we want. Now, we could extract the city here, but that'll be much harder because uh the city is different lengths and the city might be spelled different like San Francisco might be San Fran in some of the instances. So, the zip code is our best bet. And also um the zip code is going to be you a unique identifier, right? Because the zip code is one as opposed to here, you can see that uh a single zi- uh a single zip code is here when we have multiple cities. So, now we are going to use this mapping table to bring back the region into this uh table. Okay. So, we don't need the sales date and we don't need product. So, the most important thing for us to get is the location and the revenue column. So, first, I'm going to just comment out my mapping table for right now. And then I am going to just kind of start building this a little bit. I am going to use my uh query here and I'm going to bring back the location. And then now, once I have that location, um I'm not really concerned with anything else um at the moment. So, I'm just going to bring back the location. And I also want to be able to identify where my um zip code is and within this column. So, what we going to use is a pattern index function. And that's called patindex. Patindex. And that function takes two arguments or two different uh variables. The first thing we need to do is to tell it uh what is our uh, pattern? So, uh, we want to index the zip code. The zip code has 1 2 3 4 5 digits, but that can start in the beginning like uh, number two or can start at the end. So, that means we don't know exactly where and the index is going to tell us exactly where that five-digit code is. So, first we start off with our parentheses. So, those are those two and we want to say it's somewhere maybe before or after. And this is what this percent is going to do. So, now we just need to tell it to bring back a number. So, any number between zero and nine and we just bracket that off. So, now this will bring back one number. I want five numbers as my zip code. So, I'm just going to copy this and I'm just going to control C, control V, control V, control V until I have that pattern of five there. Now, I have my comma and

### [5:00](https://www.youtube.com/watch?v=MGIawRRtNQc&t=300s) Segment 2 (05:00 - 10:00)

I can promptly run this and get a response. Okay, what did I do wrong there? Two arguments. I need to make sure I'm pulling in the location column. My mistake there. And now if I run that, let's take a look. Uh, here is our location column and here's the uh, index that we're bringing back. So, you can see, let's go to number two. 0 1 uh, 07101 starts at the first index. So, we can see that this is one. Now, New York has seven digits that we can see, but it also has spaces. That's why it starts at 10. So, this is going to give us the location of the zip code in each one of these. So, now all we need to do is pull out the five digits that start at the index or of 10, which is a 10 spot, or pull out the zip code that starts at one and pull out five of these. If you used to Excel, you can think about using left or right or the middle as a way to find the index, but this is a little bit more nuanced. So, now that we have this, I just want to bring back the information within the string. And we can bring back with a string from a string within a string with the function substring. So, I'm just going to wrap my path index with substring. Substring also takes two um arguments or two bits of information. It gets the location. That's the column that we want. And then we give it the index. Now, we're lucky. We use path index to bring back the index. So, now we can get our index and then we need to tell it how many uh how many um bits of information to bring back, which is the fifth uh the five different digits, which would be the third piece of information we need to give this function. I think I said two, but it's actually three parts. So, now we tell it to bring back five spaces or five characters within that string starting for in this example, start with number 10, start with number one, 15. Now, this should bring back what we're looking for. So, let's see if we have any errors. It doesn't look like it. So, let's just run this function. I'll hit execute and now you can see we have that zip code. I'm going to alias this column and call it our as clean uh zip clean. Let's call it zip {underscore} clean. And that will be our column. Now that we have this information, we also want the revenue from this. So, let's bring back uh the revenue column. So, revenue from that sales location column and rerun that again. And now we have our location, we pulled out a zip, and then we have the revenue. Now we can use uh something like our mapping table. Now, one thing I want to do is I don't think I have any um zeros in there. No, so we don't need any where statements to filter anything out. Looks like the zip code came out pretty good. Uh so, now let's turn this into a CTE or common table expression. Let me just pull this down a little bit so we can see everything. Yeah. So, let's turn this into a CTE. So, with I'm going to call this our sales table. And then I'm going to bracket that with our quotate uh yeah, our parentheses there. Um and then I'm going to select everything from the sales table and that would allow me to bring back the sales, anything from that sales table as you see. Uh what did we do wrong? Oh, with as Sorry, with as sales I would uh sorry, with sales as getting ahead of myself. The next one will be a little different. Mhm. With sales as, now there we go. Um, so this is your CTE structure for our uh CT. Now, next what we want to do is join the zip

### [10:00](https://www.youtube.com/watch?v=MGIawRRtNQc&t=600s) Segment 3 (10:00 - 15:00)

code to this. But, um let's go to join here, and what we're going to do is we're going to join our mapping table. And let me get rid of this. So, we want to join our mapping table. So, let's this left join. And what do we want to join? That mapping table, which is here. So, it's called zip lookup. So, I'm just going to copy that. Join zip look lookup, and how are we going to join that? We're going to join it on and then now we should alias these tables, right? So, I'm going to call this A and B. Just so we can uh have a clear alias. So, I'm going to join this on our zip clean, which is our table at the top here equals uh B, which is our lookup table, with our zip code. So, now we have this table. But, what do we want to bring back from this? Now, let's clean up this query a little bit. So, let's just say and let me pull this down. There we go. So, let's we definitely want to be bringing back um our um information from our top table, which is the location A A we want to bring back the zip clean uh a. revenue. And then in the B table, right? We don't need to bring back the zip code again. We just want to bring back the region cuz that's what we most concerned with. So, now we have our sales, location, the zip code, the revenue, and now the region from our mapping table. That's why we alias it B. Let's run that and see if we got any issues. Uh, we do something near near. Let's see. Incorrect. Oh, we just need a commas here. So, that should Come on. Come on. Yep. There we go. So, now we have location, zip clean, revenue, and the region. Now you know what's next. We only need to get the uh, table where we are bringing back. Let's go back up to our problem statement. Uh, we want to identify which regions are driving the most revenue. So, what we're going to do here is we're going to turn this into a CTE. And let's call this uh, map mapped maybe regions. And as and then now we put this within our uh, brackets there. Uh, Now, you can see this is a little bit different with sales as. Once you pull in another CTE, you need to start with the naming this way. So, this gives us a second naming. Now, once we have that, we need to also put a comma there to indicate those things are different. And now what we can do is we can use another select statement here. And we can say select and let's bring back everything just to test. So, we're going to bring back everything from our map to region. And this should be good to go, but let's see what up. Yes, everything's good. So, now all we need to do is get our aggregations going. So, our aggregation, we can let's move this down a little bit. We want to bring back what? We want to bring back the region as we see there. And let's like make sure this is cleanly formatted. So, let's bring this down here. And then let's get rid of the wild card. And let me see if I can show you the whole query while we're here. So, I'm going to bring this down a little bit more. Then I'm going to go up here. And yeah, we can kind of see the whole query there. Okay. So, what do we want to bring back? We want to get the region. I'm really making this a little bit ugly. So, let's get rid of all of this. It's getting a little bit complicated. There you go. So, we want to bring back region. Uh and then we definitely want the sum of our revenue, right? So, we the revenue. So, sum of revenue. And let's call that as total revenue. And what else do we want? Let's get the count. And that should tell us um the number of transactions, right?

### [15:00](https://www.youtube.com/watch?v=MGIawRRtNQc&t=900s) Segment 4 (15:00 - 18:00)

Count. And pull this up so you can see it. Count of everything. So, we'll just use wild card there. Cuz we didn't bring back any IDs or anything like that. So, let's just bring back uh the wild card. So, count of everything and we'll call that as total transactions. And one thing I like to do is uh count uh Oh, sorry. I like to get the average of the revenue also. So, revenue and then let's get the average. And we can call that like average deal size because that's going to be the average transaction size. And then we make sure we have our uh commas there. And then because this is a aggregation, we need to group, right? So, let's group by the region. And then let's order by uh total revenue. I think I can do this. Revenue. Um let's do descending. And now I'm going to see if I can zoom out. There we go. I zoomed out a little bit. I think you Let me zoom out a little bit more. Now you can see that whole um that whole query here. And this would be our two CTEs. And if we don't have any issues, let's see we're missing a comma there. Comma, comma. Revenue. This looks like it should run, but let's go and find out. Okay. Boom. Bam. Now that's what we got. Now probably to make this look uh slightly better, you probably want to round these things because you can see we have these all these significant digits and this can be ugly. You could do that in Excel like once you export it, but let's do that now. All we're going to do is wrap this statement in a round uh and then put a two there. So, it only We're just going to round each one of these. So, let's do this again. Round and then putting two. and now if we run that again and that should be all good, yes? Okay, so now we have our regions ordered by the total um revenue descending. We have our total transactions and then we have the average deal size. Now this is a great table that you could just pass on the management because that answers our problem statement on which region, they only wanted one uh identify which region is driving the most revenue, we answered that in our query. I hope that helps. If you have another way of doing this, please comment in the section below. If you have any suggestions, but make sure you subscribe and pass this on to someone you think you can help. Thank you.

---
*Источник: https://ekstraktznaniy.ru/video/52916*