Querying structs just got 6x faster in DuckDB 1.2

Querying structs just got 6x faster in DuckDB 1.2

Machine-readable: Markdown · JSON API · Site index

Поделиться Telegram VK Бот
Транскрипт Скачать .md
Анализ с AI

Оглавление (1 сегментов)

Segment 1 (00:00 - 04:00)

in this video we're going to look at updates around structs in Duck DB 1. 2 so let's launch duck DB 1. 2 and we're going to create ourselves a struct stored in a variable called struct one uh struct is basically a dictionary or a hashmap so you can have key values so we're going to have a key of name and a value of duct DB and then a key of first release and a value of 2019 and these can be nested as far as you like let's get back that variable so we can see what it looks like and you see at the top in the table header you can see it's a struct name vure first relase integer let's have a look at our first new feature which is the struct extract at function this function lets you read from a struct by index rather than key it's one indexed so the first value is one the second value is two and so on so if we call that function on struct one and say position one we get back duct DB if we say position two we get back 2019 all right next feature struct concat this lets us concatenate as many structs as we like so we'll create ourselves another struct two and this one's going to have a description in process SQL database if we then call the struct concap function get struct one get struct two you see they're now combined together the next change is that we can Now cast structs even if the structure doesn't match perfectly so for example if we try and cast that concatenated struct to a new struct that has a name string and a description string it'll be able to do that we could even add a field that doesn't exist so in this case Fu and it will then just put that one as null finally there have been big improvements around projection push down on structs when reading native database files as well as when scanning Json and parket files what this means is that when we're filtering on a part of a struct even just retrieving a particular nested field it now doesn't need to scan the entire struct we're going to test it out with some Json files that I've downloaded from the stats bomb open data repository so let's have a look at one of the event files so these files contain over 12 million football match events and you can see this one in particular has the starting 11 for a team in a match I'm going to turn my timer on I'm just going to create a table called events we're going to read all the Jason files we're going to tell it Union by name because each of these files has a slightly different structure we want to just get it to include all the structures kind of concatenate them all together and then we're going to say records is false and what this means is it's going to put everything in a single column rather than trying to kind of break it out into columns and we run that and you can see we've now created the table we can check the count of events is just over 12 million and then let's describe that events table and if we scroll up you can see this has got a very complicated structure because it's capturing loads of different types of events I'm just going to convert that to park format so that we can then go and compare the performance afterwards and you can see it takes just over 10 seconds to do that now we're going to launch duck DB 1. 1 and 1. 2 side by side in readon mode so that we can compare query performance we're going to start with this query here that works out which team has the most possession so you see it's going into the Json structure possession team and then name and Counting everything and you see in 1. 1 that takes 0. 4 seconds and then in 1. 2 it takes 0. 06 seconds so that's about 6 and a half times faster let's try another query so this one here works the pass recipient of high passes that were incomplete by Japan where the recipient is not null and you see again 1. 1 0. 4 seconds and in 1. 2 it takes about 0. 07 seconds so again six times faster this time how about if we go against the parket file instead so 1. 1 first is 1. 7 seconds and then 1. 2 is 0. 2 seconds so that's just under N9 times faster okay and finally what about if we go against the raw Json files we don't need records equals false this time so 1. 1 is off and running and it's still going we might have to speed this up a bit for the video and eventually it completes in 82 seconds how about 1. 2 again we're going to have to speed this up for the video and you see it takes just over 13 seconds so again six times faster

Другие видео автора — Learn Data with Mark

Ctrl+V

Экстракт Знаний в Telegram

Экстракты и дистилляты из лучших YouTube-каналов — сразу после публикации.

Подписаться

Дайджест Экстрактов

Лучшие методички за неделю — каждый понедельник