# LiteParse - The Local Document Parser

## Metadata

- **Channel:** Sam Witteveen
- **YouTube:** https://www.youtube.com/watch?v=_lpYx03VVBM

## Contents

### [0:00](https://www.youtube.com/watch?v=_lpYx03VVBM) Segment 1 (00:00 - 05:00)

Okay, so your coding agent can write thousands of lines of Python, but as soon as you start giving it PDFs and other documents to use, you often find that a lot of that useful context just disappears. Tables get flattened, charts disappear, numbers hallucinate, and you often end up writing janky workarounds around things like PyPDF or the various OCR models out there just to get the basic text out of these files. So, in this video, I want to look at LiteParse, which is a new tool that the team at LlamaIndex has open-sourced. And the tool itself is really cool, but that's not the interesting part for me here. The interesting part is why they built it. And really, that comes down to the fact that they've just admitted that the framework era, the thing that they helped create, being one of the first open-source packages to do things like RAG frameworks, is basically over. So, in this video, I want to dive into what they're doing, why they're doing it, and what it means for anyone who's building with agents right now. Okay, so before I get into the technical stuff, I think it's important to understand a little bit about who LlamaIndex is. Unfortunately, I never really covered them much on the channel, and that was not because I didn't think their framework was very good. I did feel that their docs were often lacking, and with the rate of change going on, it was often hard to keep up with the different abstractions, etc. But from the start, they were one of the first people to build a very solid LLM framework, particularly for RAG. And this is what LlamaIndex was really known for, for a long time: being this hardcore RAG framework that could do lots of different types of RAG. And Jerry, one of the founders, built the first version of LlamaIndex back in, I think, November 2022, right as the whole RAG wave was starting.
And I think back then they were one of the first to really see that LLMs needed some kind of data layer, that really it was all about getting the right data, putting that into the context window, and then using that to get the best responses back. So they grew this thing to 47,000 GitHub stars with over 5 million monthly downloads. And it really became the starter toolkit for anyone building RAG applications, partly because they were so good at building the different connectors for the different kinds of data that you wanted to put into your RAG, and doing a lot of work around things like chunking strategies and how to actually break things down for use in a RAG application. But here's the thing that they wrote about in a recent blog post that I think is actually kind of more important. From day one, they kept seeing the same pattern over and over. Developers would try to build RAG chatbots over PDFs. They'd try every OCR tool that was out there. And generally, they would find that the results were pretty terrible for a lot of this stuff. Existing OCR tools would misalign tables, ignore charts, introduce text gibberish when they didn't understand a graphic, etc. And I think that's what initiated this pivot, where they moved away from being purely an LLM framework or an agentic framework to being much more about building parsing infrastructure and document understanding. Now, this became a very cool product called LlamaParse, and a couple of times I thought about making a video about it; I just never got around to it. And I should point out that I first actually met Jerry and Simon in person about two years ago. And I was very impressed, having lunch with Jerry and talking to him about where he saw a lot of these things going. And I think even back then they were making some bold calls. But now, in their recent blog post, I think they're making the boldest one yet.
So the blog post basically says LlamaIndex is more than a RAG framework; it's agentic document processing. And what's remarkable about this blog post is just how candid it is. Jerry basically lays out three reasons why the framework era is ending. And I think if you've been following what I've been saying on this channel, this is going to sound very familiar. The first reason is that agent reasoning has gotten way better. If you think back to 2023, the best we had was simple ReAct agents and deterministic workflows. Now, agent loops can do extended reasoning, self-correction, and multi-step planning, and the gap between what a basic tool-calling agent could do in 2023 compared to what Claude Code, Claude Cowork, OpenAI's Codex, etc. can do in an agent loop today is massive. The second reason is that MCP and skills have changed how agents discover and use tools. You no longer need a framework integration for every single tool. The agent can discover tools on its own. And if you think about what that means for LlamaIndex or LangChain, where a huge part of

### [5:00](https://www.youtube.com/watch?v=_lpYx03VVBM&t=300s) Segment 2 (05:00 - 10:00)

their value was always these integrations, that value has been eaten by this new protocol layer. The third reason is that coding agents themselves have changed how people build software. When something like Claude Code or Codex or even Antigravity can just write the Python for you, the value of framework abstractions has dropped significantly. You don't need a library to wrap LLM calls anymore. And what I really admired is that Jerry literally writes, "General-purpose LLM frameworks aren't as central as they used to be, and that's okay." Now, this echoes one of the things I've been giving live talks about over the past year: that you should learn in the frameworks and ship in Python. And I've got a whole video coming up just about that. But I do think it's really interesting that here you've got the founder of one of the biggest frameworks basically saying the same thing. So the cool thing for LlamaIndex is that, as orchestration has gotten commoditized, we still have the problem of getting clean text out of documents. And this is a bigger deal than most people realize. The vast majority of enterprise knowledge is locked in PDFs, PowerPoints, Word docs, Excel files, all those sorts of things. And the temptation is to think, okay, we've got frontier vision models now, so just screenshot everything and prompt the model to parse it into Markdown. And look, I thought that myself for a long time as well. But it turns out that often that just doesn't work at production scale. Frontier vision models struggle with the long tail of stuff: dense tables with hundreds of rows and columns, line charts, not to mention things like handwritten forms and other data like that. Most production OCR stacks are processing millions of pages every month, and you just don't want to be burning expensive vision tokens on text-heavy pages that don't need vision.
And often the existing tools around these kinds of OCR systems have always been really fragile. They break with layout changes, they need frequent retraining, and they often produce error rates that cause lots of problems when you're doing something in production. And while you can look at the error rates on benchmarks, compare one system to another, and see that they're not that different, in the real world the difference between 90% parsing accuracy and 99% parsing accuracy is huge. That gap is the difference between automating a process end to end and needing a human to review every single output. Now, LlamaIndex has focused on this for a while and built a very nice product in LlamaParse, and that's a paid product that, while it has a free tier, is clearly built for enterprise. But I do admire that Jerry and the team didn't throw away their open-source roots. They've taken the lessons they learned from building LlamaParse and open-sourced a smaller, lighter version of it. So, that brings us to LiteParse. LiteParse is their open-source document parser. It's completely free, and you don't even need a GPU here. In many ways, you can think of this as LlamaIndex's answer to PyPDF or, you know, the different kinds of Markdown extractors, etc. Now, this doesn't just support PDF files. It supports 50 different file formats, from office docs to even raw images. And the interesting thing is that they've built this in a way to be plug-and-play with Claude Code, with OpenClaw, and with the agentic systems that we're seeing people use for single-user agents. Okay. So, interestingly, this is not Python. It's got a Python wrapper if you want to use it that way, but it's TypeScript native, built on PDF.js and Tesseract.js. There's no API key needed. You don't even need a GPU to run this. So literally, you can have your coding agent run this locally.
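To put a rough number on that 90% versus 99% point, here's a back-of-the-envelope sketch. The figures and the independence assumption are mine, not from the video: if pages fail independently, the chance a whole document parses cleanly is the per-page accuracy raised to the page count, so small per-page gaps compound into huge document-level gaps.

```python
def prob_all_pages_correct(per_page_accuracy: float, pages: int) -> float:
    """Probability every page in a document parses correctly,
    assuming each page fails independently with the same rate."""
    return per_page_accuracy ** pages

doc_pages = 100
p90 = prob_all_pages_correct(0.90, doc_pages)  # roughly 0.00003
p99 = prob_all_pages_correct(0.99, doc_pages)  # roughly 0.37

print(f"90% per page -> {p90:.6f} chance of a clean {doc_pages}-page doc")
print(f"99% per page -> {p99:.3f} chance of a clean {doc_pages}-page doc")
```

Under this (simplistic) model, a 99%-accurate parser gets about a third of 100-page documents fully right, while a 90%-accurate one gets essentially none, which is why the former can run unattended and the latter needs a human in the loop.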
So there are a lot of interesting things about how it works. Generally, most of these parsing tools will try to detect tables and then convert them to Markdown. And in many ways, that introduces lots of steps where you could have failures. You've got to detect where the table is, you've got to work out the rows and the columns, and you've got to convert all of that into some kind of structured format. What LiteParse does instead is preserve the spatial layout by projecting text onto a spatial grid. So basically, it keeps things where they appear on the page, using indentation and whitespace. And it turns out that LLMs actually understand this quite well; they've been trained on ASCII tables, on code indentation, on READMEs, etc. The other cool thing about LiteParse is that it enables this two-stage agent pattern. You parse text fast for initial understanding, and then you fall back to screenshots when you need deeper visual reasoning. So if your coding agent does actually need to understand something at a much higher degree, it can pass that screenshot into a multimodal model, which

### [10:00](https://www.youtube.com/watch?v=_lpYx03VVBM&t=600s) Segment 3 (10:00 - 11:00)

can then basically reason over it and extract what it actually needs. Now, the key thing there is that you pay for those calls when you need them. But for the majority of the time, when you don't need that, you don't pay for those calls. Now, you can also output to JSON files, which gives you bounding boxes. If you need the precise location of data, you've got it. And if you want to incorporate a better-quality OCR model or something like that, you can also do that here. So they've actually included example servers for PaddleOCR and EasyOCR. And my guess, just from looking at those, is that you could have Claude Code pretty much write an integration for any particular OCR model that you wanted to use. So obviously, if you need something bigger, or something that can be done at scale, perhaps for a multi-user agent system, in that case you would basically use LlamaParse. But if you're looking for something that you can run yourself and run locally, and, like I said before, you can even do it without any GPU, you definitely should check out LiteParse. So, as always, I've put the link in the description. If you're building agents that need to touch documents at all, this is definitely worth giving a try. And I think the broader trend here is important: the value is moving down the stack, right? And I think for anyone who's building in this space, it's really worth thinking about where your defensible layer actually is. Anyway, let me know in the comments what you think. Are frameworks dead? Like I said, I've got a whole video that I'm finishing up just about this. I'd love to hear what you're seeing, especially if you're dealing with this kind of stuff in production. And if you want to find out more about single-user agents versus multi-user agents, check out this video over here.
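The spatial-grid idea described earlier, keeping text where it appears on the page using only whitespace, can be sketched roughly like this. This is my illustration of the general technique, not LiteParse's actual implementation: `project_to_grid`, `CHAR_W`, and `LINE_H` are hypothetical names, and a real parser would work from per-glyph font metrics rather than fixed average widths.

```python
CHAR_W, LINE_H = 6.0, 12.0  # assumed average glyph width / line height in points

def project_to_grid(spans: list[tuple[float, float, str]]) -> str:
    """Place each text span on a character grid derived from its page
    coordinates, so table alignment survives as plain indentation.
    spans: (x, y, text) tuples, with y increasing down the page."""
    rows: dict[int, dict[int, str]] = {}
    for x, y, text in spans:
        row, col = round(y / LINE_H), round(x / CHAR_W)
        line = rows.setdefault(row, {})
        for i, ch in enumerate(text):
            line[col + i] = ch  # later spans overwrite on collision
    if not rows:
        return ""
    out = []
    for r in range(min(rows), max(rows) + 1):
        line = rows.get(r, {})
        width = max(line) + 1 if line else 0
        out.append("".join(line.get(c, " ") for c in range(width)).rstrip())
    return "\n".join(out)

# Two rows of a made-up invoice table; no table detection needed,
# the column alignment falls out of the x coordinates.
page = [
    (0.0,  0.0, "Item"), (120.0,  0.0, "Qty"), (180.0,  0.0, "Price"),
    (0.0, 12.0, "Bolt"), (120.0, 12.0, "40"),  (180.0, 12.0, "0.10"),
]
print(project_to_grid(page))
```

The output reads as an aligned ASCII table, which is exactly the kind of whitespace-structured text LLMs have seen plenty of in code and READMEs.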

---
*Source: https://ekstraktznaniy.ru/video/22369*