# Creating Shared Catalogs for your Kedro Projects on GitHub

## Metadata

- **Channel:** DataEngineerOne
- **YouTube:** https://www.youtube.com/watch?v=GwSj64Uqnhk

## Contents

### [0:00](https://www.youtube.com/watch?v=GwSj64Uqnhk) Segment 1 (00:00 - 05:00)

What's up, data pipeliners, DataEngineerOne here. In today's episode we're going to be talking about how you can share your data catalog. As your Kedro projects grow and as your teams start to use Kedro, there's going to be a need to share your data assets, and in this episode I'm going to show you how you can use GitHub to share your catalog entries with other team members and the rest of your organization. Let's go ahead and get started.

Now, with the release of 0.16.5, we have this new hooks.py file. In this hooks.py file you have a bunch of hooks that the project uses in order to construct the pipelines. We have `register_pipelines`, `register_config_loader`, and `register_catalog`. We're not going to be talking about the other two today; we're only going to focus on `register_catalog`. This is the key.

As you know, Kedro uses the catalog to hold all of its datasets as catalog entries. The registration of the catalog simply reads the catalog description from the catalog.yaml file and then loads it as a catalog object. For us to be able to share our data sources, we need to inject new catalog data into this function, such that your project can read a catalog from a different source. And so this is what I've done here today. Located in the description you will find a gist for this file. This is the hook.

Now, the shared catalog hook that I've written uses GitHub to hold your catalog.yaml entries. The reason I chose GitHub is that it's much easier to collaborate using a Git repository, and GitHub is one of the largest Git repository holders in the known internet, in the known universe. I think it's a fantastic way to share your catalog entries, because this way you can add in any kind of management or curation that you would like, taking advantage of the tools that Git gives you.

So here in this example I've written a few functions: one to read a GitHub repo file and one to read a gist file. These two functions take an access token, a gist ID or repo name, and a file path or name, and in the case of the repo, a branch. They go directly to that repository or gist, download the file that you point at, and return its raw contents. Then I take the raw contents and transform them into a dictionary using `yaml.load`, and then we feed that new catalog configuration into `DataCatalog.from_config` and return that data catalog.

As you can see, the `register_catalog` function is just superb. It's fantastic because it allows us to do such complicated things very simply and easily. By just moving `register_catalog` into this separate hook, I can create a single file that you can use inside of your projects to take advantage of shared catalogs.

The hook relies on a GitHub access token, and thanks to this `register_catalog` hook we actually do have access to the credentials file, so I've added this GitHub access token to my credentials.yaml. Next, it takes that access token and passes it into a library called PyGithub. It's a fantastic library that allows you to interact with GitHub using Python directly, and the hook uses that library and that access token to download the data.

I have a gist right here. The gist I'm using is one that I wrote, and as you can see it's just a very simple iris dataset catalog entry. This should be familiar to everyone who's ever watched a few videos on this channel. Next, of course, I have to pass in the name of the file, because a gist allows you to have multiple files, so we have to make sure that we put in the name. This `yaml.load` will return a dictionary, and then I pass that dictionary into `DataCatalog.from_config`.

If you would like, you can actually combine the previous catalog, the local catalog, with the shared catalog by doing something as simple as dictionary splatting. I can just splat the loaded catalog, and then splat the local catalog, and it'll combine the two catalogs for this new catalog output. So we can actually get the best of both worlds: our local as well as our remote.

Next, you need to make sure that you're actually adding the hook and removing the previous implementation of the hook. So inside of hooks.py we already

### [5:00](https://www.youtube.com/watch?v=GwSj64Uqnhk&t=300s) Segment 2 (05:00 - 07:00)

have a `register_catalog` function. The truth is that you can only have one `register_catalog` hook implementation, so we have to remove one of these in order to get this to work. In this case, I'm going to go into hooks.py and remove `register_catalog` from the project hooks that are there. Next, I'm going to take the register catalog hook itself and add it into the run.py context.

Now, this is one method of adding your hooks to your project context. There are actually a few others, namely using the .kedro.yml file or even the pyproject.toml file to add those hooks. You can take a closer look at that API in the Kedro documentation. For now, I'm just going to use the project context `hooks = (...)` tuple paradigm.

Since we've already added it here, if I open up my Jupyter notebook and do `%reload_kedro` (this is a Kedro Jupyter notebook, so I can do that) and type in `catalog.list()`, you will see that this iris dataset does exist. Then, if I remove the hook and restart the notebook, listing the catalog one more time, we see that the iris dataset has disappeared. And again, just like a magic trick, I put the hook back, I reload Kedro, I re-list the catalog, and voila, we have the iris data here.

Using GitHub, of course, is just one way that you can share your catalog entries. Using fsspec, and even the built-in datasets that Kedro supports, you can also create your own custom catalog sharing mechanisms that will work for your projects and your project teams.

That's it for today's episode. Thank you for joining me. If you enjoy this content, make sure that you button that like, sub that scribe, and ring that ding if you want to know when we are pipelining, and I'll see you guys in the next one. Take care, bye.

And now I have to take a photo for the thumbnail. All right, I think these are good enough. I don't know.
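
The flow described in the episode (download a YAML file from a gist or a repo, parse it into a dictionary, then splat it together with the local catalog) might look roughly like the sketch below. This is not the author's gist. The function names, parameter names, and the `or {}` fallbacks are assumptions made for illustration. The third-party imports (PyGithub and PyYAML) are done lazily inside the functions so that the pure merge helper can be tried without either library installed.

```python
from typing import Any, Dict, Optional

# A Kedro catalog config maps dataset names to their settings.
CatalogConfig = Dict[str, Dict[str, Any]]


def read_gist_file(access_token: str, gist_id: str, file_name: str) -> str:
    """Return the raw text of one file in a GitHub gist (hypothetical helper).

    Requires PyGithub. Newer PyGithub releases prefer
    Github(auth=Auth.Token(...)) over passing the token positionally.
    """
    from github import Github  # lazy import: only needed when fetching

    gist = Github(access_token).get_gist(gist_id)
    # A gist can hold several files, so the file name selects one of them.
    return gist.files[file_name].content


def read_github_repo_file(
    access_token: str, repo_name: str, file_path: str, branch: str
) -> str:
    """Return the raw text of one file in a repo, e.g. "owner/name" (hypothetical helper)."""
    from github import Github  # lazy import: only needed when fetching

    repo = Github(access_token).get_repo(repo_name)
    content_file = repo.get_contents(file_path, ref=branch)
    return content_file.decoded_content.decode("utf-8")


def parse_catalog(raw_yaml: str) -> CatalogConfig:
    """Turn raw catalog YAML into a dictionary.

    The episode uses yaml.load; safe_load is swapped in here because the
    file comes from a remote source.
    """
    import yaml  # lazy import: PyYAML

    return yaml.safe_load(raw_yaml) or {}


def merge_catalogs(
    shared: CatalogConfig, local: Optional[CatalogConfig]
) -> CatalogConfig:
    """Combine shared and local configs by dictionary splatting.

    The local catalog is splatted last, so on a name clash the local
    entry replaces the shared one.
    """
    return {**shared, **(local or {})}
```

Inside a `register_catalog` hook implementation, the merged dictionary would then take the place of the local `catalog` argument passed to `DataCatalog.from_config`, with the access token read from whichever key it was stored under in the credentials file. Splatting the local catalog last is a design choice: it lets a project override a shared entry without editing the shared file.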

---
*Source: https://ekstraktznaniy.ru/video/38948*