# Performance best practices

## Метаданные

- **Канал:** Azure Synapse Analytics
- **YouTube:** https://www.youtube.com/watch?v=QYbpLSUoghI

## Содержание

### [0:00](https://www.youtube.com/watch?v=QYbpLSUoghI) Segment 1 (00:00 - 05:00)

hello everyone and welcome to the latest episode of fabric espresso my name is philli and today we'll talk about fabric Warehouse performance best practices now fabric Warehouse provides you with a great performance out of the box it does so by making sure that data that you ingest into fabric Warehouse is always in the best shape for optimal querying it also automatically scales for you meaning that it provides the amount of resources needed to a query for optimal execution however there are some small things you can check in your workload to squeeze that last drop of performance and then there are some things that are out of control of the engine for example the location of your data or for example maybe you're quering tables that are not injested in the fabric house so maybe it's not in optimal shape and we will discuss how you can get the best performance there as well let's start with the collocation of your resources fabric Warehouse is software as a service meaning the compute is somewhere in the cloud in some Asia region and when you execute a query in order for data to be processed it needs to be brought read from the storage and brought where the compute is your compute is in the region where the capacity is so when you attach the capacity to the works space you defined which where the compute is located when you ingest the data is ingested into the same region where your comput is and when you query that data you get the best performance because Network latency is so low because the data is in the same region where the compute is you can also create shortcuts uh to query data in adless Gen 2 storage account in the same region for example or maybe shortcut to a lake house table in a same region in those cases your performance will be great however you can create shortcuts pointing to aess Gen 2 storage accounts in remote region and you can create shortcuts in pointing to some other clouds like Amazon S free or Google Cloud Storage in those cases data needs to be read from the remote region from different part of the world and to be brought where the compute is for processing in that case you may experience some performance overhead now I'm not saying that you should not use shortcuts are awesome shortcuts are really great they allow you to query data directly from where it is and join with some other Warehouse table or a lake house table or whatever you need to do it literally enables you to have a single copy environment however you should be mindful about this potential Network um overhead in transferring the data so if you suffer from um bad performance or you want to get your performance better you might consider to move large tables into the fabric Warehouse and maybe keep those reference tables where they are and use shortcuts to access them you can access fabric Warehouse from anywhere in the world um you can access it for example from your local laptop and um there you can use browser to access the fabric Fab UI or you may use SQL Server management Studio or asure data Studio to write in the execute queries please be mindful that your queries May return large amounts of data and in that case it may take some time because data needs to be transferred from the region where the fabric is to your local laptop if you're are looking for U low latencies uh please use S SM SMS and ads because fabric web UI uses non-binary protocol to access fabric it uses an API um while SS SMS and ads use binary protocol to communicate directly to the with the fabric so obviously fabric febbi have a slight overhead compared to SS SMS and ads let's talk about your data model first and foremost your model needs to accurately represent the real world scenario and only then if you're not happy with performance you can check data types and maybe further optimize use smallest data type that accommodate values for example you may have a column year defined as a big in while that's fine it will probably work fine uh bigs can hold a huge range of integer values and you probably don't need that big range for a column like here so you should consider using your int or even small int small in should be quite fine in terms of ranges of values if you're using decimals you should specify scale such precision and scale that accommodates your values use proper data types for example if you know that day times will be stored use daytime to column do not use a character column and then store a date

### [5:00](https://www.youtube.com/watch?v=QYbpLSUoghI&t=300s) Segment 2 (05:00 - 10:00)

and a time there in a string format if you know that your column will not have characters uh well don't use a character column use integers for example if you have a character column and you know that your column values will not have fixed lengths or most of them will not then use a VAR car instead of car if you know that values in a character column will not be longer than certain length then please specify the proper length instead of using um the largest possible supported um length it means for example if you have name probably varar 50 is good enough to hold each and every name that exists uh instead of using Vare 8000 that being said string improvements are planned for a fabric warehouse and they're planned for this year as you can see in a public road map if you are using some tool to create a table tables for you for example some tool to ingest the data please make sure that you check the data types that tool use to create table collums just like John showed us in a previous episode in a case of pipelines we mentioned already that fabric Warehouse automatically maintains tables that you ingest into fabric Warehouse meaning you do not need to care about the number of files sizes of files quality of groups and so on but fabric Warehouse allows you to query tables that are not ingested into fabric Warehouse by using shortcuts so you can have shortcuts to um louse table or a shortcut to any table that is created by any other engine in that case table layout and storage might not be in the best shape for Optimal Performance and fabric Warehouse cannot optimize it because the table is not under fabric Warehouse control still there are some things that you can do there you can use fabric Warehouse to query tables that are not ingested into fabric Warehouse so since it is the same engine just this time reading other tables that are not ingested into fabric Warehouse still the same rules apply the same best practices let's start with checking the data types since same rules apply I would say that here when quering Lous tables or tables produced by other engines the most important thing to check is how character columns are defined Lakehouse users are not used to specify uh character column lengths as you can see in this example the same applies to py spark users as well and this results in having a character column without defined maximum length in the metadata here we have a ppar script that inspects the metadata to confirm that it is defined as a string type but there is no any metadata telling us what is the maximum uh column length since there is no maximum column length SQL engine sees this column as column of maximum supported length this can negatively impact query performance and here's how you can define a character column with specifying maximum length for that column using Spar SQL so it's pretty much the same as in SQL server and if we take a look again at the metadata of this column we can see that in a metadata we can find the length of a column since maximum column length is defined we can see that SQL is able to recognize it now you probably noticed that we Define the column as worker five but Warehouse sees it as varar 20 the reason is simple when you define a maximum column length in spark you specify maximum number of characters when you're specifying the maximum column length in a fabric Warehouse you specify the maximum number of bytes and since Delta is used underlying there is par uses UTF encoding to store strings in UTF coding any character Latin cilic or emojis can take from one up to four characters so in order to be able that to make sure that fabric Warehouse can read properly five characters that are encoding with utf8 we need to Define it as varker 20 bytes in the case of fabric Warehouse after you checked data types particularly strings please make sure that the table is optimized statement is used to optimize the layout of the table in the storage meaning to make sure that the number of files is good and that the file sizes are more or less equal so when do you need to optimize the table well whenever it changes significantly or after some period of time you can take for example an extreme example which is streaming meaning that at the high throughput the rows are being ingested in a very small batches probably just one row at a time

### [10:00](https://www.youtube.com/watch?v=QYbpLSUoghI&t=600s) Segment 3 (10:00 - 15:00)

meaning that one insert one row results in one transaction resulting in one new file so number of files can easily add up and then the engine struggles reading opening the files from the storage reading the metadata and probably spends more time there than actually reading the data and processing the data within those files so with optimize you make sure that the number of files is not that big and the files themselves are of larger size hence much better for optimal quering so when it comes to reading uh tables using shortcuts including Lous tables you need to make sure that the proper data types are used and that the table is optimized if you're not uh happy with a performance then you can consider partitioning the table so if you are often quering the table and using the same columns in aware Clause you can consider partitioning the table you partition table by those uh columns that you often use in filters and in that way you help the engine dramatically reduce the io it's easy to partition the table when you create it you need to specify a column or specify a partition by clause and specify columns that you want to partition this table on this will result in rearranging grows in a storage in different set of files so without partitioning you will have all uh files in the from the for this table in one folder while when table is partitioned for example by column year you will have folders year equals 2016 year 2 equals 2017 and so on and there you will have only files that contain value rows with where this column year has values 2016 or 2017 this helps the engine to dramatically reduce the io uh to read only files from specific partition ition that you're interested in here we have a query um simple one that counts rows from the table that is not partitioned and counts only rows from specific year and month this query finished in 32 seconds because it had to scan billions of rows and to apply a function year and month to each one of it the same query just this time when uh targeting partition table will take only 2. 6 seconds so obviously we got a dramatic performance Improvement by reducing the Io if you still need better performance and you have some columns that you often use in filtering but not out of those partitioning columns you can consider set ordering your table by those columns so what is set ordering is rearranging the rows in a way that ranges of values uh it's highly unlikely for ranges of values to overlap between row groups and files you can think of it something like sorting on specific column but not exactly sorting how does it help well when the engine reads parquet files are consisted of row groups and each row group contains a set of rows obviously but each row group has metadata about minimum and maximum column for each and every column in that row group so when Engine reads the file it reads the row group metadata and can decide whether it should read or should just skip that row group based on your ver condition this way we also reduce the iio it just works differently than partitioning here we have a query that also counts the rows but it uses a different column for filtering so not a partitioning column and this table is uh neither partition nor set ordered this query took almost 10 seconds we can set order this table we use the same optimized statement provide the name of a table and then we set order specified set order uh by keyword and specify one or more columns that we want the table to be set ordered by please know that the impact of set ordering is the positive impact on performance is the largest for the First Column that you specify if you specify more columns and filter by second or third well data is set ordered by the First Column first then by second and then you cannot expect as much of performance Improvement as when filtering on the First Column that you specify here this time we have a query that also counts rows using the same condition as previously targeting the same column but this time table is set ordered and this query took only 2. 5 seconds so what we did here is that we applied set ordering and reduce the duration of a query from 9 plus seconds to 2. 5 seconds so while fabric

### [15:00](https://www.youtube.com/watch?v=QYbpLSUoghI&t=900s) Segment 4 (15:00 - 17:00)

provides you with a great performance out of the box we showed how you can tweak your workload to squeeze that last drop of performance we also discussed how location of your resources where your tables are compared to where your computer is can impact performance we mentioned that using the tool on your local laptop introduces some uh you know um overhead because of the network transfer when you are returning the large amounts of rows then we discussed how you can optimize performance with curing data that is not ingested into the warehouse so louse tables or using shortcuts to query any other table that was produced by any other engine and you can do that by checking the data types of course primarily uh the lengths of character columns then optimizing the table to make sure that the layout and the storage is in a best shape for Optimal Performance meaning number of files and sizes of files that should be done on a regular basis depending on the rate of changes of your table you know in inserting deleting and updating rows and then if you have some columns that you often filter on you can partition the table in lake house and finally you can also set order the table by some columns some addition additional columns that you want to filter on or you filter on off now please note that partitioning is supported in a Lakehouse and as operation on a table but partitioning is not supported in a warehouse so you cannot partition a warehouse table because Warehouse provides the performance great performance out of the box however fabric Warehouse allows you to read partition tables that are created by other engine and as a reminder at the end please take a look at our previous episode where John talked about ingestion best practices if you like the content please share like And subscribe if you have any questions or comments please let them know in the comment area as usual this was Phillip bye

---
*Источник: https://ekstraktznaniy.ru/video/44745*