# SQL Tutorial - How to compare multiple columns in SQL

## Метаданные

- **Канал:** BeardedDev
- **YouTube:** https://www.youtube.com/watch?v=3RkZwgxhilw

## Содержание

### [0:00](https://www.youtube.com/watch?v=3RkZwgxhilw) Segment 1 (00:00 - 05:00)

hello and welcome to my channel i am bearded dev in this video we're going to be talking about the age-old problem of comparing multiple columns in sql as you can see on screen here i'm trying to simulate sort of a slowly changing dimension process but this applies to anywhere you need to compare multiple columns now we have our first table which is our customer staging as we can see we've got the source key and we're going to be using that to join the data on and then we've got some attributes related to people or customers and that is actually what we're going to be comparing now i've set this up so i can see that uh her original debra austin has actually changed her title and her last name and i can see in the third row as well carl smith has actually no moved address so these are the kinds of things that we want to actually detect and we're going to be going through a few methods of how we can carry out these checks to um find out when a value within any of those columns has changed and we'll be introducing some new sql functions that should be available in whatever version of sql you are using now if you do enjoy the video or are interested in anything in regards to data engineering or data analysis do subscribe to the channel and check out the other videos if you do enjoy the video don't forget to hit that thumbs up button and comment below your thoughts i do really appreciate it and it does help the channel grow now the first approach i've got here on screen as i mentioned in the introduction we will be joining these two tables on source key but what we actually do within our where clause then is comparing every single column to see if any of them have actually changed and as we can see this process is a bit tedious i don't actually have a large amount of columns here but if i did i would have to write out a very long where clause to compare but if we go ahead and execute that query and have a look at that results that's at least going to give us a starting point so we can see out of the three customers we're just working with a small sample of data here out of those three customers we can see we are getting the correct results back we've identified that deborah title and name has changed although it doesn't tell me specifically what's changed within this query it's just that something has changed and the same for carl as well so that's the traditional or old school approach is just to compare every single column but sql uh in and within t sql microsoft sql server or azure sql database there are a few different functions we could use to have a look at this process and that's what we're going to have a look at in this video so i've actually written out the queries prior to recording as it just saves me you watching me type so the first option we've got is and if we have a look at checksum what that actually does is computes an integer value for what we pass in so if i just have a look at a checksum on the customer staging table we can see we get an integer value returned for a comma separated list of columns that we pass in so if we just amend that what we can actually do is within our where clause we are again doing the same join on source key and then passing in our columns from table a to a checksum and then doing a not equal to our columns from table b so if we have a look at running this query and if it's not clear on screen they will be available in the description as well so we get exactly the same results so there's our four first option we have checksum function but when it comes to checksum there are some limitations it works quite well in this case because we're only working with a small amount of data the first limitation is checksum will return an integer value so we are limited on the amount of unique combinations we can create i think off the top of my head it's something like

### [5:00](https://www.youtube.com/watch?v=3RkZwgxhilw&t=300s) Segment 2 (05:00 - 10:00)

4. 294 million the range of an integer data type so that means if we had customers exceeding that amount then we'd have some what are called collisions and that is where we pass different values into the function but it returns the same checksum value so that's something to bear in mind now the other thing is dashes and trailing spaces are ignored so i've got some examples here as you can see i've got one and minus one as strings and the minus one has got some trailing spaces after it no training space is probably not a bad thing to have those ignored typically we don't really enter values with trailing spaces on purpose so if i execute this we can see that those actually return the same value so in this case we can see that dash is completely ignored and again trailing spaces but like i say that wouldn't be a concern the other problem is case sensitivity so again if we want to check certain values but we are working with case sensitive probably not traditional in a database but yeah if we are working with case sensitive data then that's something to bear in mind if i pass in lower case bearded dev and uppercase beardeddev again they're going to return the same checksum value so the next step is how do we get over these limitations well there's a sort of an another function binary checksum and again that is going to return an integer value if i just show you that in the select that's again going to return an integer value so if we go ahead and pass in our columns to binary checksum execute the query we can see we get the two results as we'd expect which is great but again binary checksum will come with limitations the first is again exactly the same as checksum so we are limited by the data type so again we could have that collision there dashes and trailing spaces are considered so if we compare the same examples we can see we'll get different values for the one and minus one with the training spaces case sensitivity is also considered so we get different values for uppercase and lowercase beardeddev but we do have a limitation of strings it will only compare the first 255 characters and if you have a look at this string here what i've set and we just check the length of this string 255 characters and i do believe that deserves a like for how i've managed to do that so that string is hello and welcome to my channel i am bearded dev in this video we'll be discussing how to compare multiple columns in sql want to learn more about data engineering and data analysis then subscribe to the channel if you are enjoying the video hit the so what we're going to do is compare a binary checksum on the string as original and then we're going to concatenate that string with to hear your thoughts at the end which admittedly is my mistake should be hit the like button that makes sense and i've now revealed that i had to work quite hard to get that string to 255 characters so if we go ahead and have a look at that uh and we'll see so we can see the length is two five the original uh which is just a binary checksum on that string has been returned as this value here ended in one zero eight minus um and then exactly the same as happened when we've added we've concatenated like button so we do have that problem with binary checksum so overcome some of the limitations of checksum but introduces some others so there is a third approach and this is my preferred approach as well and if we this is the hash points function and if we actually have a look at the documentation on microsoft docs there's just this paragraph i want to read out here

### [10:00](https://www.youtube.com/watch?v=3RkZwgxhilw&t=600s) Segment 3 (10:00 - 15:00)

so if at least one of the values in the expression list changes the checksum will probably change however this is not guaranteed therefore to detect whether values have changed we recommend use of checksum only if your app application can tolerate an occasional mischange so if you're happy that some changes may be missed then checksum could be a good choice otherwise consider using hash bytes instead the probability that hash bytes will return the same result for two different inputs is much lower compared to checksum i think it's well over the billions actually of hash bytes compared to uh checksum so let's have a look at hash points and hash bytes is not just let's go straight to hash points it does come with his own limitations so the first is as we can see here we need to choose an algorithm um so there's some algorithms that are more legacy that do still function but will return sort of a deprecated event in your database and i'll leave a link to the microsoft docs page on those in the description so you can have a bit of more look into detail of that the major problem with hash bytes is it can only accept one value so we're going to have to concatenate our string which we're doing here and then we're passing that into the hash bytes function and the first thing we declare is the algorithm and then the string we're actually passing in so again if i drop that into the select and we have a look at that look what that looks like and we'll see a major difference here is that hash bytes returns a binary value uh via binary value rather than the integer so we can see there the range of the different values we could have and again uh we deploy that to our b table as well so if we have a look at running that then again we're getting exactly the same results so limitations in terms of hash points we're going to pay a small penalty in terms of performance because of how the algorithm works there's a different range of values and then because we need to apply that concatenation thing to bear in mind with hash bytes as well is we will need to add a separator so best practice is to add something into a concat perhaps a pipe for example if commas are available in the data then we need to add that in so if i look at this simple example here where i'm comparing the concatenation of one and ten to a hundred and ten if we have a look at the hash points that's returned we can see on screen they're exactly the same value so that's why we need to add that separator in and you should choose a separator that's not going to exist within your data so pipe is a traditional one that i would choose lastly the recommendation i would make would be to actually store these hash bytes values these well what the values of the hash points function actually return so and actually store that var binary value within the data itself otherwise we're going to end up with some complex comparisons in terms of what we're doing within our where clauses but also we can't really have visibility of how that data has been processed so imagine this runs daily um and we see some changes that we can't understand why it's happened it'd be very easy to see when we can have the hash bytes columns available in both tables so i'll just go through a quick recommendation of that so if we alter our customer staging table and add in our hash match well i'm calling the column hash match we then update our customer staging of course this would normally be within the insert process into our customer staging table we do the same within customer and again apply the same

### [15:00](https://www.youtube.com/watch?v=3RkZwgxhilw&t=900s) Segment 4 (15:00 - 16:00)

update and then we can then do a simple select again we're going to join on source key but then we only actually have to compare those two columns if we execute that query we can see now we have got this column with uh a lot of uh maybe not easy to understand what that value is but it makes it easier if i was to look back and say well okay we've this customer's change today why um it just makes it easier for debugging so that would actually be my recommendation and that's a process i typically implement especially for scds within the within stored procedures really hope you have enjoyed that video like i said i'd love to get your thoughts in the comments below if you're not already a subscriber hit that subscribe button there is lots of great content on the channel i'll see you in the next one thanks a lot for watching

---
*Источник: https://ekstraktznaniy.ru/video/44473*