# Installing and Using Local AI for n8n

## Metadata

- **Channel:** n8n
- **YouTube:** https://www.youtube.com/watch?v=xz_X2N-hPg0
- **Date:** 23.07.2024
- **Duration:** 29:55
- **Views:** 48,913
- **Source:** https://ekstraktznaniy.ru/video/15633

## Description

You can now use n8n with locally run LLMs for cheaper or more privacy-focused operation. Join n8n engineer Oleg as he shows three different methods for setting this up.

This video was part of the n8n Community Hangout in July 2024.

Links:

- Full Community Hangout: https://www.youtube.com/watch?v=noBdpccJCgs 
- Sign up for future n8n Community Hangouts: https://n8n.io/community/events
- Sign up for the n8n Community Newsletter: https://n8n.io/newsletter/

You can download the local AI package here: https://github.com/n8n-io/demo-setup/

## Transcript

### Segment 1 (00:00 - 05:00)

So I'm going to tell you how you can run local AI in n8n and what some of the available options are. We're going to show three ways, if we have time, starting with the most convenient and simplest one and moving to a somewhat more robust one.

You might be wondering why you would even want to run AI locally, or not necessarily locally, but on your own hardware, using either open-source models or models you train yourself. The big advantage, obviously, is privacy: you might not want to share your data with providers like OpenAI or Anthropic, and prefer to keep it offline and local. Another reason could be cost effectiveness, because it might be cheaper to do your processing on your own hardware than to pay for all the tokens.

The first, very simple way is to use an OpenAI API analog. One example is an app called LM Studio, or some other similar app, which gives you an API very similar to the OpenAI API but lets you download and host open-source models. For example, here I have a Meta Llama 3 model loaded. There's a local server I can start that gives me a URL, and I can take that URL into the OpenAI node in n8n, say the chat model, where there's an option called Base URL. This option lets you override the API endpoint the node sends completion requests to. I'm going to put in host.docker.internal, which resolves to my machine's address; because I'm running n8n in Docker, I need to reference the host this way. You can select some credentials, but for LM Studio it doesn't matter, since it doesn't require one; you just set something so the credential check doesn't complain. Now if we refresh, we should see the models available from LM Studio. Right now there are only two: the Meta Llama chat model and an embedding model. If I select the chat one and say "hello, how are you", we should see some logs as it does the processing, and eventually it finishes; there we go, it's telling me it's doing great. We can do the same thing for embeddings: again there's a Base URL we can change, and then we can use the embedding model from LM Studio.

Another way to run local models is to use Ollama. Ollama is a wrapper around llama.cpp, so it also lets you download models and does the inference for you. As Bart mentioned, we prepared a Docker Compose file and a repository you can use that has all of this already set up. If we take a look at the docker-compose file, it basically sets up n8n for you with Postgres (that's this service here), it pulls the latest Ollama version and binds a volume to it, and it also sets up Qdrant for storing your vector store embeddings. There are two profiles you can run this with: the CPU profile, or the Nvidia GPU profile.
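As a rough sketch of what the OpenAI node does once the Base URL is overridden, here is a plain request builder. The port 1234 (LM Studio's default server port) and the model name are assumptions for illustration, not taken from the video:

```python
import json

# Sketch of the request the OpenAI node sends after the Base URL override.
# Assumptions: LM Studio's server runs on its default port 1234, reached
# from a Docker container via host.docker.internal; model name is illustrative.
BASE_URL = "http://host.docker.internal:1234/v1"

def chat_request(model: str, prompt: str) -> tuple[str, bytes]:
    """Return the endpoint URL and JSON body for an OpenAI-style chat completion."""
    url = f"{BASE_URL}/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return url, body

url, body = chat_request("meta-llama-3-8b-instruct", "Hello, how are you?")
```

Because the body shape matches the OpenAI API, any OpenAI-compatible server (LM Studio, Ollama's compatibility endpoint, etc.) can sit behind the same node.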

### Segment 2 (05:00 - 10:00)

The Nvidia GPU profile allows inference to use your GPU, so it's much quicker, but having a GPU is not a requirement. I already have it running here; let me stop it. You would start it like this: `docker compose --profile cpu up`. This starts all the required services, as we can see, and now n8n is running again. If you want to download models, you have to get a shell inside the container: `docker exec -it ollama bash` gives you access to Ollama. `ollama list` shows the list of available models, and if you want to pull a model, you can pick one from the library, for example Mistral, and it would start downloading; once the model is downloaded you'd see it in the list. I'm not going to download it right now because I already have a few models we can use, so I'll stop that and show you how to use the Ollama node.

Here we had our OpenAI chat model; we can remove that now. Let's say we want to describe an image using an open-source model. I'm going to connect an Ollama Chat Model node and select the credential. In the credential you just need to provide the URL; you see here it's just "ollama", again because we're using Docker, so we can use the container names rather than their IPs. The container name is ollama, so when these containers reference each other inside the Docker network we can just say ollama, plus the port Ollama runs on (11434 by default). The connection tested successfully and it gives me the list of models. As we said, we're going to use an image/feature detection model, which is LLaVA, here the Llama 3 based one. We don't have to change any of the settings, although there are plenty; for example, to keep the model loaded on the GPU for more than 5 minutes, there's a way to configure that. I'll also tweak the temperature a bit.

Now we can send it a file. We recently released a new feature called file uploads, so in the chat trigger there's now an option to allow file uploads; this is available both for the development chat and when you make it publicly available. I'm going to toggle this and set it so only one image is allowed. Then we need to pass the image: you see in the prompt we're taking everything from the previous node, but that wouldn't include the binary file. For that I just need to add another message that only contains the binary file: message type binary, the field the binary is going to be in, and image detail, which isn't important for this model. Then if we go into the chat, you see there's now an option to add files. We can select this image and ask what's on it. You see there's work happening in the background as it processes the image; oh, that was quite quick, and it's correct: it is in fact a large clock mounted on the side of a building. This is the Astronomical Clock in Prague. Maybe it could be a bit more creative, but it's not wrong. Let's ask it something else.
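The binary message the node builds can be sketched against Ollama's raw HTTP API, where vision models like LLaVA accept base64-encoded images in an `images` array alongside the prompt. The container hostname follows the Docker setup above; the exact model tag is an assumption:

```python
import base64
import json

# Sketch of the payload behind the "binary" image message.
# Assumptions: the Ollama container is reachable by its Docker name on the
# default port 11434, and "llava" is a vision model that was already pulled.
OLLAMA_GENERATE = "http://ollama:11434/api/generate"

def vision_payload(image_bytes: bytes, question: str) -> bytes:
    """Ollama's generate API takes vision input as base64 strings
    in an "images" array next to the text prompt."""
    return json.dumps({
        "model": "llava",
        "prompt": question,
        "images": [base64.b64encode(image_bytes).decode()],
        "stream": False,  # one JSON response instead of a token stream
    }).encode()

body = vision_payload(b"\x89PNG\r\n", "What is in this image?")
```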

### Segment 3 (10:00 - 15:00)

There you go: it was correctly able to tell me the city as well, Prague. (No, I promise it doesn't know about the file name; just messing with you.)

This Docker setup also contains Qdrant. Qdrant is a vector store provider with an open-source version; we can see its UI here, started on port 6333, and we can see the collections. Let me delete this one, and we can populate a new collection. Another use case we might want is the classic "chat with a PDF", asking questions about a document, so let's set that up really quick. We're going to keep the chat trigger, but now we add the Qdrant vector store node. The way this works: we take the binary data from the chat input, assuming it's a document, chunk the document and populate the vector store with it, and then pass the user's prompt to an agent that has access to a vector store tool, which lets it query the vector store we just populated; based on that, it should give us an answer.

So let's set that up using the Qdrant vector store. You can see again that we're referencing the Docker container; we don't have an API key set up, so we can leave that empty. For the Qdrant collection we could set it statically, or we can just use, for example, the session ID from the chat, so that different users get different sessions. We need to change the operation mode to "insert documents", then we can connect it and insert the session ID as the collection name; we don't need any options right now. For embeddings we connect Ollama embeddings and select the credentials, which gives us all the models we have; let's use the mxbai-embed-large model.

Then we need to chunk the document. For that we connect the data loader, change the type of data to binary because we're going to be uploading files, and say we want to load all of the binary data it gets. We connect a token text splitter: we know this embedding model has a context window of 724 tokens, so we don't want to go above that; let's keep the chunk size at 650 tokens and the chunk overlap at 50. That should populate the vector store, so let's see if it actually works. Ah, I still have the setting that only allows images; actually we don't need to restrict it, we can just accept everything for this session. We have the Bitcoin white paper that we can upload, and we can ask "what is the size of the BTC block header". This isn't going to work yet; we just want to verify that it's populating the vector store, and that seems to be happening. If we go to the Qdrant UI, we can check that a new collection has been created with the session ID, and it contains the chunked content; it also has some metadata, but that's not really important right now.

Okay, now that we have this in the vector store, we can set up the other part and add an agent. For Ollama, the Tools Agent is not supported yet, because until yesterday Ollama didn't support function calling; it just got merged, so in the coming weeks it should become available as well.
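The token splitter's behaviour, a fixed chunk size with a sliding overlap, can be sketched like this. Whitespace-separated words stand in for real model tokens, which is an approximation:

```python
def split_tokens(tokens: list, chunk_size: int = 650, overlap: int = 50) -> list:
    """Fixed-size chunking with overlap, the idea behind a token splitter.
    Each chunk repeats the last `overlap` tokens of the previous chunk so
    no passage loses its surrounding context entirely."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
        start += chunk_size - overlap
    return chunks

# Toy run with small numbers so the overlap is visible.
demo = split_tokens("the quick brown fox jumps over the lazy dog today".split(),
                    chunk_size=4, overlap=1)
```

With the video's settings (chunk size 650, overlap 50), a 1000-token document yields two chunks that share 50 tokens.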

### Segment 4 (15:00 - 20:00)

That will increase its capabilities, but right now we need to use the Conversational Agent. We take the prompt; we need to set it to "define below", because the previous node is no longer the chat, it's the Qdrant vector store. So we go through the schema from our chat and map the chat input; we'll leave the options empty. One important thing to change: you can see there are 10 items, which would mean this agent executes 10 times, which is not what we want, so we go into the node settings and enable "execute once". Then we connect a chat model, the local Ollama one: we use Llama 3 and decrease the temperature a bit again.

Now we need to give the agent access to the tool, so we add the Vector Store Tool. We need to name it, keep the description simple, and say we want to retrieve four chunks from the vector store. Then we connect the actual vector store: we know we have it in Qdrant, which is already set up; we just need to set the collection, and again we use the same session ID as before. We go into mapping and select the chat JSON and session ID, so this matches what we have in the Qdrant UI; no options needed here. That connects the vector store, and we connect the same Ollama embedding model as we used for the chunking. It is quite important to use the same model for retrieval as you used for populating the vector store; otherwise the tokenization is different and the scores wouldn't make much sense, they wouldn't be comparable.

We also need to connect a model for the Vector Store Tool itself, because the way it works is: the tool retrieves those chunks from the vector store, then a model answers based on the query, and that answer gets passed to our agent. I can just copy this one and connect it, so it's using the same Ollama model as the agent. This one we want to make really deterministic, so we set the temperature to zero to reduce hallucinations as much as possible; that way we can be somewhat sure it's only answering based on the chunks and not making things up.

Now we have it connected like this and should be able to execute the agent, and it tells me that the size of the Bitcoin block header is 80 bytes, which is correct. We can take a look at the logs to see how it got that answer. First, in this Ollama model we see the classic conversational-agent prompt; it executed the tool "user file knowledge base" with "what is the size of the BTC block header", and the tool responded with "the size of the Bitcoin block header is 80 bytes". In the tool execution we should be seeing the individual chunks from the vector store; I'm not sure why they're not showing here, that's something we need to look into. Oh, but you can see them here as they're passed to the LLM: these are the chunks from our vector store; it retrieved four chunks and passed them along.
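The point about scores not being comparable across embedding models comes down to how vector stores rank chunks: the score is typically cosine similarity, which is only meaningful when both vectors come from the same model's embedding space. A minimal sketch of the score itself:

```python
import math

def cosine(a: list, b: list) -> float:
    """Cosine similarity between two embedding vectors: the score a vector
    store uses to rank stored chunks against the query embedding. Comparing
    vectors produced by two different embedding models gives meaningless scores."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Identical directions score 1.0 and orthogonal directions score 0.0, which is why "retrieve four chunks" means the four highest-scoring vectors under this measure.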

### Segment 5 (20:00 - 25:00)

The vector store LLM was then able to provide this answer, which is a bit more detailed, and our agent LLM answered in a simpler way with exactly what we asked: the size is 80 bytes.

The final method I can show you for hosting these models is to do it on a rented GPU. I have a pod set up using RunPod.io, which lets you rent GPU VPS servers to run your models on. The way I created this: I want to deploy an RTX 4090; you can see the price per hour and the available resources. I select the GPU, set the template to PyTorch 2.1, and edit the template: you need to expose the Ollama port, like this; you might want to increase the volume disk a bit to be able to store the models; and then you go into the environment variables and add OLLAMA_HOST set to 0.0.0.0 to make sure Ollama binds to the address we'll be using. I already have this pod set up, and it also contains a model I downloaded, so we don't have to wait. I start the pod, and once it's running we can connect to it via the terminal. It seems Ollama is not installed, so I'm just going to install it again. Okay, now Ollama is installed and running on localhost on this port. There's the model we need to quickly fetch, so let's pull Mistral. Now that I've fetched the model, we can access it: we use this URL and see the list of models. We can copy that into n8n; maybe not in this workflow, let's just disable that one and add a new Basic LLM Chain. We add new credentials with this URL, see it connected successfully, and there is the model we've just downloaded, and we can talk to it. We execute this and get a response, "a large language model trained by Mistral AI", and that is coming from our pod. You could see it was significantly quicker than running it on my local machine, because we're now using a GPU, an RTX 4090. I'm going to share the link to this Docker Compose file and instructions on how to run it.
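Pointing n8n at the rented pod comes down to building the right base URL for the Ollama credential. A tiny sketch, assuming a hypothetical public host and Ollama's default port 11434:

```python
# Sketch of the URLs involved in pointing n8n at a rented GPU pod.
# The host below is a hypothetical placeholder; 11434 is Ollama's default
# port, and OLLAMA_HOST=0.0.0.0 on the pod makes Ollama listen on it
# externally rather than on localhost only.

def ollama_base_url(host: str, port: int = 11434) -> str:
    """The base URL to paste into the n8n Ollama credential."""
    return f"http://{host}:{port}"

def tags_endpoint(base_url: str) -> str:
    """GET this endpoint to list the models the pod has pulled."""
    return f"{base_url.rstrip('/')}/api/tags"

base = ollama_base_url("203.0.113.7")  # TEST-NET address standing in for a pod IP
```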

### Segment 6 (25:00 - 29:00)

I'll share it after the session; Bart is probably going to send it over. That should get you started with local AI in n8n. That's it from my side, thank you for your attention, and if you have any questions, please share them.

[Host] Brilliant. As mentioned, I'll add the link to the video when I publish it early next week; for now you could post it in the chat so everyone has immediate access. And I imagine you can also just go to our GitHub repository and find it there, correct?

[Oleg] That is correct, yeah.

[Host] Okay, awesome. Cool, let's go back to my slides and see what the questions are. I'll speed up a little because we might be going a little over time. Someone wants to know if you can split paragraphs into chunks; I think that's what you just covered, right? Is it possible to split on longer units?

[Oleg] Yeah, you would set up one of the text splitters that we have. I used the token splitter, which is static, based on the number of tokens, but there's also a recursive text splitter, which lets you split on conditions like paragraphs or markdown headlines, and so on.

[Host] Someone wants to know the specs of the Mac you're running this on.

[Oleg] I have Apple Silicon, an M2, and I think 64 GB of RAM. But for these models: the Llama 3 model I used was the 8-billion-parameter one, so if you run it with 8-bit quantization, 8 GB of RAM should be enough, and if you want to run it on a GPU, maybe 16 GB of VRAM.

[Host] Thank you. And you showed Qdrant as your vector store a couple of times; is that a preference of ours, or can you use any vector store you want?

[Oleg] For this use case, even the in-memory vector store would be fine, because we don't care much about preserving it long term; we just use it to answer the user's question. I also like to use the Supabase vector store, because then you can also have your database there and they have support for files, but Qdrant is also a good option.

[Host] Thanks. The next one is a long one, I need to read it, sorry. A viewer asks about embeddings: "If I used an OpenAI embedding model to create embeddings for our user profiles a few months ago, and it's now been sunset and replaced by their 4o model, does that mean those embeddings are no longer compatible for retrieval? Just wondering if those previously embedded records are being ignored now, or if, because it's still OpenAI, it's backwards compatible."

[Oleg] I think the 4o model is not an embedding model; they semi-recently released the text-embedding-3-small model. But either way, you would need to re-run the embeddings, re-embed the content, because, as I mentioned, the tokenization method changed, so the vectors aren't comparable and the scores you would get wouldn't be very high.

[Host] I'll just chime in real quick, I think it's easier this way. I had received an email from them saying the recommendation was to move from their old embedding model to the new one; they said they were sunsetting the old thing and moving everyone over, which I did. Now I'm curious whether that means all of those old records where I already did the embeddings are no longer being retrieved. It sounds like yes is the answer?

[Oleg] They are probably still being retrieved; you would still see some chunks come back, it's just that the scores would be very low, so there's no guarantee those chunks are really relevant.

[Host] Got it, okay. And just checking: yep, that's the last question. If you have any others, we can chat about them later. Thanks, Oleg, that was very comprehensive, and it was really great to see all the different methods we have available. Please remember to share that link if you haven't already, and I'll make sure it gets shared out next week as well.
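The recursive text splitter mentioned in the Q&A can be sketched as follows: try the coarsest separator (paragraph breaks) first, recurse into oversized pieces with finer separators, and hard-split only as a last resort. This is a simplified approximation of the idea, not n8n's actual implementation:

```python
def recursive_split(text: str, chunk_size: int = 200,
                    separators: tuple = ("\n\n", "\n", " ")) -> list:
    """Split on the coarsest separator first (paragraphs), recursing into
    oversized pieces with finer separators; hard-split as a last resort."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:  # nothing left to split on: cut at fixed offsets
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, chunk_size, finer))
    return [c for c in chunks if c]
```

Because paragraphs are tried first, most chunks end on natural boundaries, which tends to produce more coherent embeddings than fixed-size token windows.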
