How to Create and Reuse Pipelines with "Package and Pull" CLI


Contents (2 segments)

Segment 1 (00:00 - 05:00)

What's up, data pipeliners! Welcome back to another episode on writing data pipelines with Kedro. In today's episode we're going to be talking about `kedro pipeline pull` and `kedro pipeline package`: two CLI commands that ship with Kedro and let you take advantage of your modular pipelines. Let's go ahead and check it out.

Okay, what I have here is a very simple Kedro pipeline. This is the example pipeline that you get when you do `kedro new` and answer yes to the example prompt, and I've called the project kedro-pipeline-pull. What we're going to be doing today is experimenting with the `kedro pipeline` command, which lets you do a lot of things with your Kedro pipelines. With 0.16.3 the team put much more emphasis on making pipelines modular, that is, on pipeline reusability, and to support that reusability they've added a few subcommands to the `pipeline` group.

The first is `create`, which lets you scaffold a Kedro pipeline from the command line as a template for how you write your pipelines. If I type `kedro pipeline create` and give it a name, say de1, what we get is template code that is named, or namespaced, with de1. You can see from the listing that it created pipeline tests as well as configuration files, and finally the actual Python source code itself, each in its respective place: source, tests, and configuration. Pipelines created this way are very standard: if we look inside, there's nothing in nodes.py, and there's only a very bare-minimum pipeline creation function, but it also comes with a README, which is really great.

For us it's going to be a very simple pipeline. We'll take a node with a lambda function that just prints out the head of its input x. The input to this pipeline is the example iris data, and the output is None. So it will read the example iris data and print out the first few rows.

Now, in order to package up and reuse this pipeline you have to do a few things, and this is important. The first is that the pipeline needs to exist as a kind of parent or root pipeline. So here we register it under the key de1 and import the pipeline we just created. Because the pipeline is registered under the de1 key, I can do `kedro run --pipeline de1` and it will actually run the de1 pipeline. Your pipeline needs to be runnable from the command line in order for you to use these packaging functions.

Now if I type `kedro pipeline`, let's take a look at the command line again. Here we have `create`, which we just used, `delete`, which you don't want to run just yet, as well as `list`, `package`, and `pull`. `describe` is a nice helper, somewhat unrelated, that lets you see what nodes exist. In this case we're using a lambda, so unfortunately the node-name parsing doesn't really work here; if we used a proper function with a proper name, you would be able to see it.

The next thing we're going to do is `kedro pipeline package`. This takes the modular pipeline we just created, packages it up, and makes it available for other people to use. So we run `kedro pipeline package` and type in de1.
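The pipeline definition and registration described above might look like this minimal sketch. The dataset name comes from the standard iris starter; the registration hook and file location vary across Kedro versions (newer releases use `register_pipelines` in `pipeline_registry.py`), so treat this as an illustration rather than the exact project code:

```python
# Sketch of src/<package>/pipelines/de1/pipeline.py plus registration.
# Assumptions: the "example_iris_data" catalog entry from `kedro new`'s
# example project, and a register_pipelines-style hook.
from kedro.pipeline import Pipeline, node


def create_pipeline(**kwargs) -> Pipeline:
    # A single node that takes the iris data and prints its first rows.
    # It has one input and produces no catalog outputs.
    return Pipeline(
        [
            node(
                lambda x: print(x.head()),
                inputs="example_iris_data",
                outputs=None,
                name="print_head",
            ),
        ]
    )


def register_pipelines() -> dict:
    # Exposing the pipeline under the "de1" key is what makes
    # `kedro run --pipeline de1` work, which packaging relies on.
    de1 = create_pipeline()
    return {"de1": de1, "__default__": de1}
```
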
It's going to go into our Kedro project, find the de1 pipeline, and pull out all the resources that it uses: everything inside that modular pipeline folder, as well as the configuration and any tests we've created for it. So now we've actually created a package, and that package lives inside the source dist folder right inside the project. What I'm going to do next is delete the pipeline from the project: I'll manually remove it from the configuration, from the source, and from the tests. So now let's pretend I don't have a de1 pipeline; in fact, we're going to comment out its registration as well. Now, if I don't have a de1 pipeline and I wanted one, then maybe on your internal servers or internal Git repository you can distribute

Segment 2 (05:00 - 08:00)

the wheel file that was created by `kedro pipeline package`. Once we have access to that wheel file and have downloaded it, all we need to do is type `kedro pipeline pull`, which takes the path to the wheel file as an argument. So here we put in the path to the de1 wheel file, hit enter, and it reads that package, expands it, and installs all of the files necessary to run that pipeline. It's really cool, it's just like magic. And here we have our de1 pipeline yet again: on the left-hand side you can see the pipeline we created earlier with the example iris data, and we can reuse it again in our project, almost as if we never deleted it in the first place. So if I hit Ctrl+Z a bunch of times here to restore the registration, then when we type `kedro run --pipeline de1` we can rerun that pipeline and get the head of the data.

Now, there are a few caveats with using modular pipelines. The first is that if you want to import any libraries or any code from the rest of your project, you need to make sure that code lives inside the modular pipeline folder. For example, suppose I take this print function and move it into a separate file; we'll call the function print_head. If I want to be able to package this pipeline for reuse, I need to make sure I am not referring to the current project in the code. Right here, kpp is the current project's package, but if I want to reuse this pipeline in another project, I can't assume kpp exists there. Instead, we use relative imports: we write the import with a leading dot, `.lib`, which means we pull the print_head function from the lib file located directly inside the package itself.

This, of course, leads to some problems: you could potentially have code duplication and these kinds of things. It will definitely depend on the standards your team has decided on for how you write your pipelines: maybe you keep your library functions in another pipeline and import that pipeline, or you keep your library functions as part of your starter. There are a lot of ways you can go about this because it's very flexible, but it will be up to each individual team to decide what works best for them. At the very least, if you want to import files, you have to make sure the file exists inside the pipeline folder and is imported with a relative package name.

The other thing to note when you are pulling a package is that your project actually has to be runnable. If there are any errors in your project, then your package pull is going to fail. For example, here de1 is missing from the imports: if I try to run this project, it's going to fail because it can't do the import, and as a result `kedro pipeline pull` is also going to fail. So let's comment that out one more time, and there it goes: the pipeline pull works, and here we have our lib file as well as the pipeline file with the relative path to the library. And that's it for today's video. Thank you very much for joining me. If you made it this far, make sure that you button that like, sub that scribe, and ring that ding if you want to know when we are pipelining, and I'll see you in the next one. Take care, bye!
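The relative-import caveat can be sketched like this. The file names (`lib.py`, the `kpp` package) are assumptions taken from the walkthrough; the key point is that the helper lives inside the modular pipeline folder and is imported relatively:

```python
# Sketch of keeping helper code inside the modular pipeline folder so the
# packaged pipeline stays self-contained.
#
# src/<package>/pipelines/de1/lib.py
import pandas as pd


def print_head(x: pd.DataFrame, n: int = 5) -> None:
    """Print the first n rows of a DataFrame (the node's only side effect)."""
    print(x.head(n))


# src/<package>/pipelines/de1/pipeline.py would then use a *relative* import,
# so the code never names the enclosing project package:
#
#     from .lib import print_head
#
# An absolute import like `from kpp.pipelines.de1.lib import print_head`
# would break as soon as the pipeline is pulled into a project that
# is not called kpp.

# Quick demonstration on a tiny frame:
df = pd.DataFrame({"sepal_length": [5.1, 4.9, 4.7], "species": ["setosa"] * 3})
print_head(df, n=2)
```
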
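The end-to-end flow from the video can be sketched as shell commands. The `src/dist` output location and the wheel filename are assumptions based on Kedro 0.16.x defaults, so check the output of the `package` step for the actual path in your version:

```shell
# Scaffold, package, and later pull a modular pipeline.
kedro pipeline create de1            # scaffold source, tests, and config for de1
kedro pipeline package de1           # build a wheel, e.g. into src/dist/

# Ship the wheel via an internal package index, git repo, or shared drive,
# then in the consuming project:
kedro pipeline pull src/dist/de1-0.1-py3-none-any.whl

# After registering the pipeline under the "de1" key:
kedro run --pipeline de1
```
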

More videos by this author — DataEngineerOne
