Lessons From Processing a Billion Pages with Reducto

Jason Liu · 20.01.2026 · 595 views · 10 likes

Video description

Document parsing remains a critical challenge despite advances in frontier models. Even models achieving 99% accuracy can produce errors that break downstream applications, and frontier models consistently hallucinate information not present in documents. In this talk, Evan Vogelbaum (Machine Learning and Engineering), Yifei Hu (Machine Learning Researcher), and Alvin Ryanputra (Engineer) from Reducto discuss pushing the frontiers of document processing, sharing insights from processing over 1 billion documents and announcing their Series B funding led by a16z.

We discuss:
• Why document parsing remains critical despite the existence of frontier models
• How frontier models fail on complex documents with handwritten text, dense tables, watermarks, and ambiguous forms - and why better parsing quality always improves downstream tasks (RAG, information extraction, chatbots)
• Reducto's blended approach: dividing documents into regions and routing to specialized models (OCR vs VLMs) based on complexity rather than one-size-fits-all solutions
• Research showcases: autoregressive layout detection on text-dense pages, novel agentic chart extraction achieving near-pixel-perfect accuracy, and edit endpoints for automated form filling
• Citations and bounding boxes for building trust - grounding extracted information back to source documents, including citing when information is missing
• Real-world benchmarks: Reducto outperforming AWS Textract, Microsoft Azure, and Google Cloud OCR on challenging documents
• Human-in-the-loop workflows through form schemas for customers filling the same forms repeatedly (immigration, healthcare intake, government documents)

The team shares practical insights on routing between traditional OCR and VLMs based on document complexity, handling long-range dependencies across page boundaries, and building production-grade document processing pipelines.
About Reducto: https://reducto.ai/

Connect with the speakers:
Evan’s LinkedIn: https://www.linkedin.com/in/evan-vogelbaum-b6383771/
Yifei’s LinkedIn: https://www.linkedin.com/in/yifei-hu-683499113
Yifei’s X/Twitter: https://x.com/hu_yifei
Alvin’s LinkedIn: https://www.linkedin.com/in/alvinryanputra/
Alvin’s X/Twitter: https://x.com/AlvinRyanputra

TIME STAMPS
00:00 Introductions
01:30 Problem Framing: The Challenges of OCR in 2025
13:15 Reducto's Approach to Document Parsing
15:25 Research Showcase: Innovations in Document Parsing
18:20 Form Filling and Editing with Reducto
25:22 Q&A and Closing Remarks
29:34 Routing Different Extractors Based on Complexity
30:37 Citation and Location Grounding in PDFs
33:43 Open Source Contributions and Community Engagement
35:38 Handling Complex Tables and Diagrams
39:45 Benchmarking and Evaluation in Document Parsing

If you want to learn more about improving RAG applications, check out: https://improvingrag.com/

Stay updated:
X/Twitter: https://x.com/jxnlco
LinkedIn: https://www.linkedin.com/in/jxnlco
Site: https://jxnl.co/
Newsletter: https://subscribe.jxnl.co/

Table of contents (11 segments)

Introductions

Hey everyone. Today I have a pretty exciting guest: some friends from Reducto are coming by to talk about pushing the frontiers of document processing. I think everyone sort of assumes you can use something like Docling, or just pass documents to Gemini, and it's going to work. But really, even if you have something that's 99% accurate, a single PDF could have a handful of errors. So today we have Reducto talking about what they're looking at, how they think about evaluating these systems, and also showing you some of the horror stories that I've seen in the PDF mines. — Awesome. Well, thank you very much, Jason. My name is Evan. I help out with machine learning and engineering here at Reducto. I'm joined by Yifei, our PDF sommelier and one of our machine learning researchers, as well as Alvin, one of our engineers. We're super pumped to be here to discuss how we've been pushing the frontiers of document parsing at Reducto. Before I get into the full presentation, I also want to share that we just announced our Series B today, led by a16z. This comes at a really exciting time for the company. We've processed over 1 billion documents, and that's honestly just getting started; most of that has been in the last several months. So now I want to get into what we're

Problem Framing: The Challenges of OCR in 2025

here to present today. We have three points that we want to cover. We want to start with a little bit of problem framing. Yifei is going to talk through why the document OCR problem is still relevant in 2025, when we have things like Gemini 2.5 Pro, and why this problem is so hard for even frontier models or specialized optical character recognition models to get right. After that problem framing, I'm going to talk with Alvin about our take on document parsing and beyond, and do some really cool research showcases of ideas that we've taken all the way from the lab to production, which have pushed what models are able to do in the document parsing space. So with that said, Yifei, please feel free to take it away, and let's discuss why document OCR is such a relevant problem in 2025. — Thank you, Evan, and thank you everyone for joining. We'll cut to the chase here. We have a scanned form from, I don't know, a zoning permit application. In the top left corner there's what looks like a stamp of some sort; I cannot really read it. Then in the middle there's a small box, which I believe contains a zone number that I cannot really read. Down below there's a bigger box with all the checkboxes, but instead of ticking a real square checkbox, people need to circle the options. All these elements make parsing extremely difficult, yet all of this is very important information we cannot miss when we deal with documents. Let's go to the next page and see another example. Here is a very dense handwritten form of some sort. English is not my first language, so I cannot read most of it. But if we look at the red box in the middle, there is some text overlapping the center divider. I don't know if it's for the gauge or if it's part of the right side. Jason, can you tell?
I literally cannot say. — I feel like the seven looks like a bracket for the right side, but the six might be something else. Goodness. — Exactly. Okay, let's go to the next page. So when we say document parsing is still relevant in 2025, we really mean it. Internally we have a benchmark called RD-FormsBench for information extraction. Let's go to the next page. It's a very simple test dataset we use to benchmark internal models. The input is basically a scanned form, like the ones we showed in the previous few pages, with handwritten text, stamps, watermarks, and checkboxes, and we want to extract information from it. The user normally provides a document and a JSON schema for information extraction, and the expected output is a JSON object containing all the information that should be extracted from the document. So this is a test for information extraction, but we also use it to test our parse model. This page just shows you that Reducto specializes in information extraction; if you have that use case, please talk to us. But let's go to the next one, which is more interesting in my opinion. We have an ablation study on the extraction score based on different parsing results. Here we have our internal extract model, called RD Extract v2. The naming could be better, but it's just v2. We run it on the parsing results from different internal parsing pipelines. We found that with our default pipeline but a slightly different layout detection model, we get much better extraction accuracy, because some of the information we parsed incorrectly in the past is fixed by this layout model. Why? Because the layout model didn't cut off any text and grouped all the regions in the correct way. But let's go to the next page.
Our point is that better document parsing always gives you better downstream task accuracy. The downstream task can be RAG, information extraction, building a chatbot to talk to documents, or anything else, but the parsing quality, the data quality, is really the core. So why is this still so hard in 2025? We have OCR technology from decades ago, and we have OCR models everywhere, open source and closed source. I have worked on these document OCR and parsing problems for quite a long time, and I'm still struggling with certain use cases like this one. We have a huge, dense table. If you are on a screen without zooming, you cannot really see anything. But these are some of the documents and tables our customers send us, and they want a very clean representation of the table in either Markdown or HTML. Let's go to the next page. At the bottom you see the original table image, and on top is the result from a frontier model. I don't want to reveal the name. We basically sent this image crop and asked the model: please parse this table as HTML, do not hallucinate, do not do this, do not do that, do not make up anything. We tried really hard with the prompting. But then look at the area in the red box. You get some extra data points in some columns of the parsing result, and you are missing some data points from a different column. So imagine you are building a financial chatbot or financial agents. You need to get data from big financial tables with 100 rows and 20 columns, and all of a sudden you get extra information that's not even part of the document. You cannot say your agent has higher quality, because from the bottom up the information is just wrong. Here is another example, from Twitter. I follow a lot of people talking about document parsing in different languages as well.
So this friend on Twitter is from China, and he has a test case of a Chinese table with a huge watermark in the middle. For humans, watermarks are not a big deal; we can ignore them because our brains are smart enough. But according to the author of the tweet, none of the frontier models can parse this table perfectly. He tried prompting Gemini 2.5 Pro and suggested that it is probably the best model for this case, even though the results are not perfect. Someone asked, "Hey, I use the same Gemini 2.5 Pro, but I got some very bad results. I'm always missing some data. So what was your prompt?" And the author gave a super long prompt, basically telling the model: don't do this, do this instead. You can see the issue here. You can definitely get very good results from prompting the frontier models, but it's quite difficult to adjust a prompt every single time you have a new document or a tricky use case. You cannot write custom prompts for everything. And in Reducto's internal challenging-docs channel, we always share cases where even humans debate what they are. Previous page, please. On the left we have zero and O. I still believe that before the letter S all the circles mean zero, and after the letter S they mean O. I'm open to debate. And on the right, we have this little piece: is it VS, IS, or LS? You really need good OCR quality and some context from the full document to understand that maybe it means liters per second rather than VS or IS. So these are the tricky cases where we found that most of the models fail. And when we talk about cutting-edge vision language models, or any autoregressive models, we have other issues that are flaws in their nature.
If you're into large model research, you know that models can repeat themselves in cases like this: you send an image with "dot dot dot" to Gemini 2.5 Flash. I ran this test a while ago, when it was the newest model. Gemini 2.5 Flash will basically give you dot dot dot until the end of the context window. The repetition problem is basically a sampling issue. If you raise the temperature from 0.1 to around 0.5 or 0.7, the issue mostly goes away. But with a higher temperature, the model is more likely to make up content that's not even in the document. So it's a tricky trade-off: do I want to address the repetition problem, or do I want to avoid the model hallucinating extra information? Next, Evan will talk about Reducto's take on parsing documents. — Cool. Thank you so much, Yifei. I think at this point you all have a good idea that document parsing can be a really hard problem. We have very challenging examples where you need really fine-grained understanding of the image just to figure out what text is in the document. And if you try to use an end-to-end large language model approach, there are always going to be issues with hallucination or sampling. Actually, just as I was browsing this morning... — Oh, I think somebody has their volume on. Thank you very much for that. — As I was browsing Hacker News this morning, I saw a post about somebody trying to use GPT-5 mini to extract medical information, I think for a med school application or something. They were shocked by just how much these large frontier models can hallucinate information that's not even in the image or the document. So how do we get around these issues?
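
Editor's sketch: the temperature trade-off described in this segment can be illustrated with plain temperature-scaled softmax. This is not Reducto's code or any vendor's sampler; the logits are made-up numbers, with index 0 standing in for "emit another dot". It only shows why a low temperature concentrates nearly all probability on the repeated token, while a higher one leaves room for alternatives (including wrong ones).

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities, scaled by a sampling temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical next-token logits: index 0 is "repeat the dot",
# the remaining entries are other candidate tokens.
logits = [4.0, 2.0, 1.0, 0.5]

low = softmax_with_temperature(logits, 0.1)
high = softmax_with_temperature(logits, 0.7)

print(f"T=0.1: P(repeat) = {low[0]:.4f}")
print(f"T=0.7: P(repeat) = {high[0]:.4f}")
```

At the lower temperature the repeated token is sampled almost every time, which is how a loop sustains itself; at the higher temperature the other tokens get a real share of the probability mass, which breaks loops but also admits tokens that are not supported by the page.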

Reducto's Approach to Document Parsing

Well, Reducto's secret sauce is that we use a blended approach. We divide and conquer. Here we have an example from our Reducto Studio product, where somebody has uploaded an example report from this fake brokerage, Goldman Stanley, for Jazz Pharmaceuticals. You can see that we first passed our layout models over the document and identified each of the different regions. We have this table in green over here, all of these headers and titles in red up here, and this text in orange over here. Then for each of these regions, we use the right model for the task. For these orange text boxes, we might use a more traditional OCR model that's very good at taking in very simple regions and producing the correct results with high precision and high recall. But for these much more complex regions, where you have handwritten forms and vertical text along with horizontal text, we might use a more advanced vision language model, trained in-house specifically to take images that are very hard to parse and output the correct Markdown for them. Most importantly, whenever we extract information from a document, we make sure to provide references and citations back to the document, so that the end user can trust that the values they're getting are actually represented in the document. We even give confidences with an in-house citation confidence model. You can see an example here, where we've correctly extracted the municipality and the street name and given their locations in the document. But one of the most impressive things is that even for information we wanted to extract but that was missing from the document, we were able to cite where in the document we saw that it was missing. This is something that most frontier models just aren't really capable of at the moment.
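
Editor's sketch: the divide-and-conquer idea in this segment (detect regions first, then send each region to the model suited to it) might look roughly like the following. Everything here is hypothetical: the `Region` fields, the `choose_parser` rules, and the threshold are illustrative assumptions, not Reducto's actual routing logic, which the talk does not spell out.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """One layout region found on a page (hypothetical fields)."""
    kind: str          # e.g. "text", "title", "table", "form"
    handwritten: bool  # whether the region contains handwriting
    char_count: int    # rough measure of how dense the region is


def choose_parser(region: Region, dense_threshold: int = 2000) -> str:
    """Pick a parser name for one region using simple illustrative rules."""
    if region.handwritten or region.kind == "form":
        return "vlm"  # hard regions go to the slower, stronger model
    if region.kind == "table" and region.char_count > dense_threshold:
        return "vlm"  # very dense tables are also treated as hard
    return "ocr"      # simple printed text goes to traditional OCR


regions = [
    Region("text", handwritten=False, char_count=400),
    Region("table", handwritten=False, char_count=5000),
    Region("form", handwritten=True, char_count=120),
]
plan = [(r.kind, choose_parser(r)) for r in regions]
print(plan)
```

The point of the design is cost and accuracy together: cheap, precise OCR handles the easy majority of regions, and the expensive model is reserved for the regions where it actually changes the result.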

Research Showcase: Innovations in Document Parsing

I now want to jump into our research showcase and talk about some of the really cool ideas that we've been pioneering on the machine learning team. As Yifei mentioned, layout detection is super important to our pipeline. When you think about layout and detecting objects on a page, you might think of simple, smaller models like YOLO or a segmentation model. But we have so much high-quality in-house data from our in-house labeling team that we've also been able to train autoregressive models to output layout boxes on even very text-dense pages, which means we're outputting hundreds and hundreds of tokens with no hallucinations and very high fidelity in our predictions. Another example that's very close to my heart is some of the work that one of the engineers on our team, Earth, has been pioneering on chart extraction. Up to this point we've mostly discussed the case where you have a document with some text and you want to get the text out. But think about, for example, a medical report that has a chart of a patient's vitals, or a brokerage report that might show stock price performance over time. That's critical information in the document that you want your models to be aware of. But right now, most models are very bad at taking in that information from the chart and outputting it in a textual Markdown format. We've pioneered a truly novel agentic approach that uses a mix of strong and weak vision language models, so that we can take in really complex charts, like this spiky line chart over here, and reproduce them with near-pixel-perfect accuracy. The way a traditional vision language model might handle something like this is to split the input into patches, so that this region right over here of purely blank space receives just as much attention as this very information-rich region over here.
And that doesn't really make sense if what you're going for is high-quality extraction. The pipeline that we've pioneered, and hope to release in general availability over the coming weeks, takes a blended approach to the whole figure and uses different models to extract each part of the figure: the individual lines, the spikes, the axes, and so on. The end result is very high-quality extraction that is 100% grounded in the original image and matches it to nearly pixel-perfect precision. The last idea I want to talk about is something that Alvin on our team has really been leading, and it takes us from the traditional parsing and extraction world into the world of actually acting on documents with our edit endpoint. So Alvin, I'll let you

Form Filling and Editing with Reducto

take it from here, and just let me know when you want to move to the next slide. — Right. Thanks, Evan. At Reducto, we're really building the interface for documents. You're able to read from documents using our parse and extract endpoints, to grab the data out of your documents and store it in whatever format you want. The natural evolution of this is: how can we use the data to write back into the documents? We've had enterprises take the output of our parse and extract endpoints, plus whatever other data sources they may have, and use our edit endpoint to fill out a form for their use case. So we're starting out with edit for form filling. Next slide. This is an example of filling out an intake form. It's a Word document with some patient information. Next slide. The majority of our cases are PDFs. This is an SS-4 form, and you can see the filled version on the right. Next slide. The core problem of editing boils down to two things: detecting the form fields, which is where do we fill in the data, and describing the form fields, which is what data should go in each field. To a human, these things come really naturally. But to do it programmatically, we've had to perfect each individual problem in order to have a working product. Next slide. For detecting form fields, these are some examples. Some PDFs have form fields, but a good number of them are simply images, so we have to use our own in-house trained models to detect text boxes and checkboxes. In these two examples, it's not too challenging, because these forms have a lot of lines. They have great features that let the model detect the fields really easily. So we started out doing really well on this, but after some experimentation we realized that we were missing certain types of forms. On the next slide, if you look at this form, there are no lines, only words.
So the next challenge is: how can we handle this type of form? We've had to specifically look for forms in this type of format and train on that data. Next slide. We've also had more ambiguous cases. If you look at the bottom left, under medical history, it's really challenging to identify what you're supposed to do with all this scattered text. After much training, data wrangling, and error analysis, we were able to get to a pretty reasonable result. Next slide. This is also a common format that differs from typical forms. These are more like questionnaires, or Word documents that were converted into PDFs, and we've had to train on these specific types of documents as well. Next slide. The next really challenging problem is describing form fields. Our pipeline uses a mix of VLMs and many of Reducto's underlying models that power our parse and extract pipelines. One technique we use is that PDF metadata is really valuable. A lot of government documents are basically labeled: they have a description in each form field, so we use those as a starting point to achieve even higher accuracy. Also, form fields sometimes require additional context, because a field may have its header on the previous page, or may be part of a really long table that spans multiple pages and has its header one or two pages back. We have to account for all these cases in order to fill out a form usefully. Next slide. The process of iterating from zero to one, as I've illustrated, is that we've had to improve our model, run some evals or iterate with our users, and really understand where we're falling short. We're always discovering new types of forms and unusual ways you can fill out forms.
It's always a cycle of iterating to get to a place that really delivers value for our customers. Lastly, I wanted to show you something we've built called form schemas. A form schema describes all the form fields in a document, along with descriptions of what to fill in for each field. We found that while a good number of customers want to fill a variety of PDFs, there is a subset of users who want to fill the same form over and over again. Think of a startup that does immigration: they're filling out the same immigration forms for a wide variety of users. In those cases, it makes more sense to run this pipeline once. In our studio, we've made this process really easy. They first call our edit endpoint in the studio. We return the form schema in the response, and from there they have full control to modify and improve the descriptions. They can even adjust the boxes, moving them up, down, left, or right. They can pass in additional values, such as specific values or whether to fill a form field at all, and much more. So we've enabled a kind of human-in-the-loop control over this process to deliver maximum value for our customers. With edit, we've seen some really amazing use cases across many industries. We've had customers that fill out questionnaires and customers that fill out intake forms, and there are more generic consumer use cases as well. Maybe someday you can fill out your tax forms using this. But to end off, we hope that sharing a part of how we've trained these models and iterated through all these technologies has shown you how we're really pushing the frontiers of document processing. Parsing a document may seem really simple, but that's not all we do. There's much more to it, and we're continuously trying to improve our tech.
So thanks, Yifei and Evan, and thanks, Jason, for organizing this. Happy to take any questions.
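
Editor's sketch: a form schema as described in this segment (a list of fields, each with a description and a box that a human can adjust before the form is filled repeatedly) could be modeled like this. The field names, the normalized `(left, top, width, height)` box convention, and the `nudge` and `fill` helpers are all assumptions made for illustration; the talk does not show the real schema format returned by the edit endpoint.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class FormField:
    """One fillable field in a hypothetical form schema."""
    name: str
    description: str  # what should be written in this field
    page: int
    bbox: Tuple[float, float, float, float]  # (left, top, width, height), assumed normalized
    value: Optional[str] = None
    skip: bool = False  # reviewer can mark a field as "do not fill"


def nudge(field: FormField, dx: float = 0.0, dy: float = 0.0) -> FormField:
    """Return a copy of the field with its box moved by (dx, dy)."""
    left, top, width, height = field.bbox
    return replace(field, bbox=(left + dx, top + dy, width, height))


def fill(schema: List[FormField], values: Dict[str, str]) -> List[FormField]:
    """Apply values by field name, leaving skipped or unmatched fields untouched."""
    filled = []
    for field in schema:
        if field.skip or field.name not in values:
            filled.append(field)
        else:
            filled.append(replace(field, value=values[field.name]))
    return filled


schema = [
    FormField("full_name", "Applicant's legal name", page=1, bbox=(0.25, 0.5, 0.5, 0.125)),
    FormField("signature", "Leave blank for a wet signature", page=2,
              bbox=(0.25, 0.75, 0.5, 0.125), skip=True),
]
schema[0] = nudge(schema[0], dy=0.125)  # human reviewer moves the box down
filled = fill(schema, {"full_name": "Ada Lovelace", "signature": "ignored"})
print(filled)
```

Because the reviewed schema is reusable, the detection and description steps run once per form type, and each new applicant only requires a call like `fill` with fresh values.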

Q&A and Closing Remarks

Great. Wow. I'm speechless. I'm really excited for some of the chart extraction stuff. I was aware of the edit models and have been sharing that with some folks, but the chart extraction looks amazing. — Yeah, that's something we've been iterating on for almost two months now, and it's come so far. So I'm really pumped. We already have it working, I'd say, for the vast majority of cases. We want to get to a point where, when we release it, it performs uniformly across the board with almost no failure cases. But if you have really hard charts, send them our way and we'll run the pipeline. Almost every single time, the results have been spectacular, so we're super pumped to release it soon. — And I guess the idea here is that the chart extraction would also just output the table, right? — Correct. What I showed was our table representation replotted, because that's much easier than reading a super dense table, and it shows how accurate the model is. — Yeah, I feel like I could talk to you for 30 minutes about that, because I was like, oh, there are date ticks and there are missing date ticks. Do you... Anyway, maybe... — We'll do a follow-on for the chart extraction GA, how's that? — Perfect. Let's do it. I think everyone in the chat is kind of mind-blown by some of the capabilities, especially the examples you've pulled up. I think this one goes back to maybe the first question, which is about AWS Textract and Document Intelligence. I think people have a sense of what that answer now looks like, but could you talk a little bit more about those? I definitely know they don't do some of the things that you just demonstrated. — Yeah, absolutely.
I think Textract and Document Intelligence are often batch products, meant for people who aren't as accuracy-sensitive, or who are using them because, say, they use Google Cloud for everything and this is just one of the services on offer. At the end of the day, I don't think they're bringing the same kind of focus and iteration speed to improving the model pipeline that Reducto is. This is what we specialize in. This is exactly what we do, and it shows in the benchmarks. We have a benchmark called RD-TableBench, where we compare our table parsing capability to AWS Textract, Microsoft Azure, and Google Cloud OCR. Across all of them, you can see that a specialized model and a specialized approach produce a higher level of accuracy than these big batch products. — That totally makes sense. One of the things people are asking about is synthetic data. You can imagine a lot of the chart extraction might rely on synthetic data, right? You can just generate different versions with pyplot or stuff like that. Again, I think Textract does not do any of that kind of stuff. — Correct. Synthetic data is definitely a part of our overall approach. It's interesting how we've evolved on that. We thought it would be the catch-all, but we've found that there are ways to use real charts in creative ways, along with our in-house labeling team and other, larger models. For the stuff that I showed, I think basically all of the training data was real charts.
You might ask how we could possibly collect the data for that, because extracting the data from a chart is incredibly time-consuming and you need to be very detail-oriented. I'd say that if that's the kind of problem that really interests you, we have a careers page and we're hiring for the machine learning team. There's a lot of really exciting stuff going on here. — Sweet. You talked a little bit about this layout model and the fact that it's autoregressive. I'm going to try to mix two questions together. Is the idea that the autoregressive layout model also determines which method you use? It sounds like the model is saying: this is a simple layout, I can use OCR; this is a more complex layout, I'll use the VLM. I think people are asking how you route between these different extractors. — Yeah, Yifei, do you want to take this

Routing Different Extractors Based on Complexity

one? — Sure. In Reducto's pipeline, it's not like you send a document in, we run one big model, and give you the result. — Yeah. — Instead we have a lot of smaller steps: we run OCR with our OCR model, we run layout with our layout model, then the table extraction model, and for certain handwritten content what we call the key-value parsing model, and so on. So there are many points where we need to make a decision based on the complexity of the text, its length, and its size, and we route to the model that can handle that part. — I see, that makes sense. — Mhm. — Very cool. — And our autoregressive models are only needed for the most challenging cases. They are not the fastest models, and they are quite expensive to run, but when we have to run them, they always give us the best result. — Very cool. I think the top four or five questions were basically all about location grounding and things like that.

Citation and Location Grounding in PDFs

— One question folks often have about answering questions on PDFs, or doing RAG over PDFs, is the ability to cite regions. How do you think about integrating the RAG citation component that a downstream user might build with whatever data Reducto returns? — Yifei, you should take this one. — This is a great question. When you send a document to a generic OCR service or a vision language model, you don't really get location or positional information in the parsing result. You put an image in, you get text out, and you don't know where on the page that text is. That was a fundamental issue we faced when we started building our extract and citation models. So we developed a lot of strong models keeping in mind from the beginning that grounding is everything. When we parse your document, from the parse stage on we already know that this word is located at this position and this empty input field is in the top right corner. We basically have a pixel map for every single element we return to the customer. We don't expose that to the customer, but internally that pipeline powers our very accurate parse service and helps us give you very good word-level or character-level citations if necessary. — Yeah. And I just want to emphasize that this is not something we patch on at the end with a prompt like "please find the location in the document." The models that Yifei has built are hand-tuned specifically for this task, to make sure we are accurately citing the location in the document. It's so important to know that you have a correct result, and you can only know that if you know where in the document it came from. — Makes sense. Do you have an example of even a toy application that tries to do this citation?
I'm really curious to see one, or even just show one on the playground, where maybe you ask a question like "what's the total revenue" and a little box shows up on the... — So if you go to Reducto Studio right now, you can probably get a free account with 50k free credits. You create a pipeline called extract, upload your document, upload a schema for the information you are looking for, and turn on citations. That's a toggle in our config options. Once you turn on citations, you will see the returned information. Let's say in this image, you have boxes corresponding to every single field extracted. So everyone can try it right now. — Let me see, there's a bunch of questions around
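To make the idea of grounded extraction concrete, here is a minimal sketch of what a field with a bounding-box citation could look like on the consumer side. It is not Reducto's response schema: the `BoundingBox` and `ExtractedField` classes, the normalized-coordinate convention, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class BoundingBox:
    """Hypothetical citation box; coordinates are fractions of the page (0..1)."""
    page: int
    left: float
    top: float
    width: float
    height: float

    def to_pixels(self, page_width_px: int, page_height_px: int) -> Tuple[int, int, int, int]:
        """Return (x0, y0, x1, y1) in pixels, e.g. for drawing an overlay box."""
        x0 = round(self.left * page_width_px)
        y0 = round(self.top * page_height_px)
        x1 = round((self.left + self.width) * page_width_px)
        y1 = round((self.top + self.height) * page_height_px)
        return (x0, y0, x1, y1)


@dataclass(frozen=True)
class ExtractedField:
    """Hypothetical extracted value together with where it was found."""
    name: str
    value: Optional[str]
    citation: Optional[BoundingBox]  # None in this sketch means "no location found"


def describe(field: ExtractedField, page_width_px: int, page_height_px: int) -> str:
    """Render a field and its source location as a human-readable line."""
    if field.value is None or field.citation is None:
        return f"{field.name}: not found in document"
    box = field.citation.to_pixels(page_width_px, page_height_px)
    return f"{field.name} = {field.value!r} (page {field.citation.page}, pixels {box})"


# Made-up example values, not taken from any real document.
field = ExtractedField(
    name="total_revenue",
    value="1,250,000",
    citation=BoundingBox(page=3, left=0.5, top=0.25, width=0.25, height=0.125),
)
missing = ExtractedField(name="tax_id", value=None, citation=None)

print(describe(field, page_width_px=1000, page_height_px=2000))
print(describe(missing, page_width_px=1000, page_height_px=2000))
```

Keeping the box next to the value is what lets a UI draw the "little box" over the source page, and lets a reviewer check a value against the exact region it came from.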

Open Source Contributions and Community Engagement

Yeah, let's talk a little bit about open source. I know that you have done some open benchmarks. I think people are asking: are there any plans to open source any of these components? — Yeah, I know Yifei actually open sourced one of our first OCR models. RD1, I think it is. — It's RolmOCR. — Yeah, sorry, RolmOCR. I think RD1 was the internal name. But yeah, that has been a super cool contribution. And recently we saw that the Hugging Face team used RolmOCR to build a new PDF extraction dataset. So we're really excited to see that the community has picked this up. Also, for the RD-TableBench that I mentioned, all of that data is available publicly on Hugging Face. We even have this super cool comparison tool where you can take a table and see exactly what the parsed output was for it, as well as the ground truth. So that's definitely a really important thing for us, to contribute back to the community. And of course we also want to use as much open source as we can and contribute to any discussions that are happening around OCR. I think Yifei has been pioneering a lot of that on Twitter, where he engages very regularly with all the discourse going on there. — Like Ruben just said, "I've learned so much from OCR and RolmOCR. Thanks, Yifei, for all the contributions." — I think RolmOCR is a classic case of open source collaboration. The base model was from Qwen, the dataset was from AI2, and we did some data processing and training, and that's RolmOCR. So it's really a huge collaboration between a lot of parties. — Makes sense. I know the last time you

Handling Complex Tables and Diagrams

guys did a talk, you talked a lot about table representations, and it's something I've been pushing a lot of my customers to do: when tables are simple, represent them in Markdown; if they're very complex, represent them as HTML or XML. Does Reducto do that automatically? Can it detect that, okay, this is a 2x4 table, so represent it this way, or is there some other representation behind the scenes? — Yeah, it's a great question. Right now we don't do any of that auto-routing, but users have full control, and we support several different formats: HTML and Markdown, of course, as well as JSON outputs, and I think even XML, if not now then very soon. So right now it's user-controlled, but honestly I think it would not be too hard to integrate auto-routing in the future. — Makes sense. Someone asked another question. We just got chart extraction and editing, and someone asked: will you offer a service to parse 2D engineering drawings? I'm curious where flowcharts and diagrams come in, in the future. I know I have a lot of Mermaid diagrams or manufacturing workflows that people have questions about. — Yeah, that's actually a super good question. We have one of the machine learning researchers on our team, Rey.
He's worked a lot on creating good layout models for that use case, for example building diagrams, where you have a floor plan or something like that. And we have another guy on our team, Josh, one of our solutions engineers, who's been working on exactly that Mermaid diagram use case, where you might have an org chart or something like that, and he's built some very impressive POCs there. So those ideas are in the pipeline right now, because this is just what we want to be doing: taking all the really strong parse and extract capabilities we have and building applications on top of them. — That's sweet. Yeah, we were spending about three months working on a blueprint extraction system with YOLO and all that stuff, and for the day I can — It's pretty hard, all from scratch. I mean, this is why Reducto was built in the first place. Raunak has told me some of the founding story of Reducto: they were originally trying to do a similar RAG-style thing, but customers kept saying, "Hey, I want my PDFs in this. I want to be able to have it inject my documents into my queries." So at first they turned to AWS Textract or, I think, Google Cloud OCR, and they were like, "Man, the outputs are not good. Why are they so bad?" They just looked at what the OCR services were returning and thought, "Wow, that's terrible." That was a big motivation for Reducto in the first place: to be the foundation layer that people can build on top of to make all of these really cool RAG applications. And now we're even starting to provide some of those as services, like with edit, which Alvin has been pioneering.
I think that's a great example of something you can only do if you have really strong parsing and extraction capabilities. — Yeah, it totally makes sense. When we did the blueprint extraction, it was like, okay, first we have to spin up a data labeling team that can do bounding box modeling, and then, okay, how do we do inference, and all that stuff. It's been a huge hassle. Someone else even mentioned mapping schemas from electrical circuits. I imagine those are all things within the scope of — what you guys are tackling. — Exactly. — Next time you'll be helping design chips with — Yeah. That would be awesome. Bring it on. We'll wait for the Broadcom contract with the vector. Let me take another look at some of these questions. As I'm scanning through them, one question I'd like to ask you is: when people are building these new AI applications, what is something you feel people are missing? What's the mistake they're making? Maybe with processing, maybe with RAG over PDFs. Think about that as I scan through some of these other questions.
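The Markdown-for-simple, HTML-for-complex rule of thumb raised at the start of this section can be sketched as a small routing function. This is only an illustration of that heuristic, not something Reducto ships (the speakers say auto-routing is not implemented today). The `Cell` type and the definition of "complex" (any merged or multi-line cell, or ragged rows) are assumptions.

```python
from dataclasses import dataclass
from html import escape
from typing import List


@dataclass
class Cell:
    """Hypothetical parsed table cell."""
    text: str
    rowspan: int = 1
    colspan: int = 1


def choose_format(rows: List[List[Cell]]) -> str:
    """Return "markdown" for plain rectangular grids, "html" otherwise."""
    for row in rows:
        for cell in row:
            # Markdown tables cannot express merged or multi-line cells.
            if cell.rowspan > 1 or cell.colspan > 1 or "\n" in cell.text:
                return "html"
    widths = {len(row) for row in rows}
    return "markdown" if len(widths) == 1 else "html"


def to_markdown(rows: List[List[Cell]]) -> str:
    """Render a simple grid, treating the first row as the header."""
    def line(cells: List[str]) -> str:
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    header, *body = rows
    lines = [line([c.text for c in header]), line(["---"] * len(header))]
    lines += [line([c.text for c in row]) for row in body]
    return "\n".join(lines)


def to_html(rows: List[List[Cell]]) -> str:
    """Render any grid, preserving rowspan and colspan."""
    out = ["<table>"]
    for row in rows:
        cells = []
        for c in row:
            attrs = ""
            if c.rowspan > 1:
                attrs += f' rowspan="{c.rowspan}"'
            if c.colspan > 1:
                attrs += f' colspan="{c.colspan}"'
            cells.append(f"<td{attrs}>{escape(c.text)}</td>")
        out.append("<tr>" + "".join(cells) + "</tr>")
    out.append("</table>")
    return "\n".join(out)


def render(rows: List[List[Cell]]) -> str:
    """Route a table to the simplest format that can represent it."""
    return to_markdown(rows) if choose_format(rows) == "markdown" else to_html(rows)


simple = [[Cell("Item"), Cell("Qty")], [Cell("Bolt"), Cell("4")]]
merged = [[Cell("Totals", colspan=2)], [Cell("Bolt"), Cell("4")]]
print(render(simple))
print(render(merged))
```

The design choice is to fall back to HTML whenever Markdown would silently lose structure, since a merged header flattened into a Markdown row is exactly the kind of error that misleads a downstream model.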

Benchmarking and Evaluation in Document Parsing

— Yeah, that's a really good question. I think one thing that is often very hard in this space is benchmarking properly. A thing we see a lot is that people will benchmark, say, five documents. But if you're building an application that you want to reach tens of thousands of users, you've got to account for the fact that you're going to get an enormous heterogeneity of documents coming in. So if you're trying to benchmark different providers or different methods, it's really important to have an automated pipeline that you can feed thousands of examples into, so you can figure out whether choice A is better than choice B across all of those examples, rather than going through five examples yourself and making a very high-variance estimate of which method is better. This is actually something I helped build out in, I think, my first month here. It's a framework we have called autoeval. Basically, it does automatic evaluation across the wide variety of document types we get from our internal data sources, and we use it to make production decisions day-to-day about which model we should be serving. We want to see that the new models we're pushing out actually produce superior results, not just on a handful of samples that were labeled in a dataset, but across the whole rich heterogeneity of documents that your applications are going to experience. I think we're also going to try to make this easier for other people. Obviously, we've built the internal pipeline, but now that it works so well for us, we want other people to experience it.
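The point about five documents giving a high-variance estimate can be illustrated with a little statistics. The sketch below is not the internal evaluation framework described above; it is a generic A-versus-B win-rate calculation with a Wilson score interval, and the judgment lists are synthetic numbers chosen only to show the effect of sample size.

```python
import math
from collections import Counter
from typing import Iterable, Tuple


def wilson_interval(wins: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a win rate (z=1.96 is roughly 95% confidence)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = wins / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (center - half, center + half)


def compare(judgments: Iterable[str]) -> Tuple[float, Tuple[float, float]]:
    """Summarize per-document judgments ("A", "B", or "tie"); ties are ignored."""
    counts = Counter(judgments)
    decided = counts["A"] + counts["B"]
    return counts["A"] / decided, wilson_interval(counts["A"], decided)


# Synthetic judgments: the same 80% observed win rate at two sample sizes.
small = ["A"] * 4 + ["B"] * 1
large = ["A"] * 800 + ["B"] * 200

for label, judgments in (("5 docs", small), ("1000 docs", large)):
    rate, (low, high) = compare(judgments)
    print(f"{label}: A wins {rate:.0%}, plausible range {low:.0%} to {high:.0%}")
```

With five documents the interval still includes a win rate below 50%, so the data cannot rule out that B is actually better. With a thousand documents the same observed rate gives a narrow interval well above 50%.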
So I think Josh on our team has been pioneering this concept called RD Arena, which we hope to share with people soon, where you can put in a batch of documents and we'll run all the parses for you across Reducto, Textract, or other services you might want to use, and you can actually be the judge. You just go through and choose A or B, which one's better, very quickly. That way you can get a very high-sample estimate of the best provider for your use case. — That's great. I'm actually working on a Jupyter notebook arena as well, so I'm really excited to see these come out. I mean, that's going to be super expensive because of the code. But I think that also addresses some of the questions folks had about comparing your models with things like LlamaParse or these other providers. I can imagine once these arenas come out, it'll be much easier. — Yeah, exactly. We're actually really gunning to do it. When we were first discussing the idea, everybody was like, "Yeah, that's an obvious thing to do," because we want to be in those comparisons. We want people to compare us to LlamaParse or what have you, since we know that when they see the outputs, they're going to say, "Yeah, Reducto is the higher quality output." — Yeah. I mean, once you have things like chart extraction and all that other stuff, it'll just be a situation where none of the other vendors can be on the leaderboard. — Yeah. This was something that just blew us away when we first saw the results, and really all credit goes to this guy on our team, Arth, one of our engineers. He's put in an enormous amount of work on this. I will tell you it was not this pretty in the beginning.
This is something that took months of serious iteration. But I don't think anybody else has figured out something like the pipeline we put together. It's a really novel kind of pipeline, and it's incredibly high accuracy. So we're super pumped to be releasing it generally at some point soon. In the meantime, send us your hardest charts on Twitter. We're happy to run them through the pipeline. — Oh, you guys should make a Twitter account that's just, you know, "evil PDF" or something. You just have them — Automatically. Yeah. I think right now Yifei is basically that Twitter account. But we should set that up. — Yeah. I took a screenshot of that chart and was comparing it side by side while the talk was happening. I was like, "Oh my god, this is really good." — Yeah. — All right. Any other questions from the chat? — Nothing too crazy. I think a lot of them were just around metadata and things like that. Were there any other things you wanted to chat about as well? — Yeah, I can talk a little bit about metadata since that came up. We are a machine learning company first and foremost. We have a machine learning team that builds all these cool models. But we're not going to miss out on very good data whenever it's there, and there is a class of PDFs that you can parse really quickly and really accurately using high-quality metadata. That's another thing we've built internally as part of our pipeline, called hybrid extraction mode, where it looks at the metadata and asks: is this actually good, or is this something we just can't use because it's inaccurate to the representation of the document?
We use that a lot with financial clients, for example, who often have these super long, digitally native brokerage reports that need to be extracted with very high accuracy. In those cases the metadata is actually very helpful, so it's a part of our pipeline where it's good. I think that goes back to the overall approach of Reducto. We're not a one-trick pony in any way. We try to take each of the subparts of document parsing and use the best tool for that problem, rather than sticking to just a VLM or just traditional OCR. — Yeah. I remember sharing some examples with you on legal document redlining, where you would have the original document, but then you have words copied and then deleted, and some other squiggly lines. I imagine those are things you could train custom models on, if they don't already exist on the — Correct. Actually, they do, and they're part of the Series B release. — Oh, sweet. Okay. I need to look at some of the docs now. I feel like I've got to go send 20 emails to different clients. — Yeah. — Can you talk a little bit more about the actual product offerings now? Can anyone just sign up, set up an API key, and get going, or is there anything else folks need to do? — No. With the Series B release, we've released two big things on the product side. One is a pay-as-you-go offering, so that people can just sign up and immediately get started processing documents, with no onboarding or anything like that required. The other thing, as Yifei mentioned, is I think we're doing, what, 15k free credits. That way it costs you nothing to try out Reducto, which means there's almost no reason not to.
Because when you can get things like this chart extraction, potentially, or these super high accuracy bounding boxes that Yifei showed, I think it's worth at least going for it. And now anybody can try out their hardest docs on Reducto totally for free, with the 15k credits we have as part of the Series B announcement. — Sweet. Amazing. Well, one question here was around page boundaries. This happens a lot. When I was doing a lot of manufacturing work, it would often be the case that we have a PDF where the header is on page one, but the PDF is 15 pages long. I'm sure Yifei has thought about this, but I'm curious what the callouts would be. — Yeah. Yifei, do you want to talk about that? I know you've thought a lot about it. — Yeah. Let's clarify the page boundary issue. When you have a long document, you have long-range dependencies: the title for table two might be two pages before table two, for example. — Yeah. — So if you are doing RAG, I would say you have to use very large chunks to maintain the overall context. But there are also some fancy tricks. I'm not the RAG expert here; Jason, please jump in anytime or correct me. You can also prompt the model to summarize the overall context for, say, these two pages and attach that summary to the RAG chunks before you embed them, and so on. But if you are working on information extraction tasks with super long documents, we found that it's always easier to parse the document first and then send the text to the model for any kind of information extraction or similar work, instead of sending the images to the model.
We know these frontier models can handle your documents either as image inputs, a series of images, or you can give them the parse result as a long context. But we found that everything works way better with parsed documents, and with text input you can push the context much longer. Internally we tested extraction models on 100-plus pages, just one shot, no fancy tricks, 100-plus documents, asking for maybe 100 different fields. Our models can just get it if the input is in text form. But if your input is 100 images, it'll be a very different story. Hallucinations are everywhere. So when you have long documents with long-dependency issues, we would say always parse them and then work with the parsed text. I've seen this will try to like merge the columns back or combine everything. — That totally makes sense. I think with that we're basically at time. Give everyone five minutes or so to get organized and go to the next meeting. But wow, this is really incredible. I feel like the advanced chart thing is something that just needs to be retweeted a thousand times. I'm really excited. — Yeah. Well, I'm super pumped for that. If people have examples they want to send our way, we're happy to run the pipeline, and we're super pumped to be releasing it at some point soon. And not just this, but all of the upcoming offerings from Reducto, whether it's improving edit, improving our extract and parse pipelines, or the really frontier things. I think we just have a lot of exciting energy right now, and with the Series B, there's never been a higher-growth time at the company.
So I also just want to broadcast out there: if this is something that interests you, we're hiring for GTM, for engineering, for machine learning, for DevOps. Whatever your experience is, I think you'll have something really valuable to bring to the team. So please reach out at reducto.ai/careers. We would love to meet you. Perfect. — Thank you. Thank you very much.
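The chunk-context trick mentioned in the page-boundary answer (summarize the preceding pages and attach that summary to each chunk before embedding) can be sketched as follows. This is a minimal sketch under stated assumptions: one chunk per parsed page, a caller-supplied `summarize` function standing in for a model call, and a toy `first_line_summary` helper that exists only so the example runs without any external service.

```python
from typing import Callable, List


def contextualize_chunks(
    pages: List[str],
    summarize: Callable[[str], str],
    window: int = 2,
) -> List[str]:
    """Prefix each page-level chunk with a summary of up to `window` preceding pages."""
    chunks = []
    for i, page_text in enumerate(pages):
        start = max(0, i - window)
        preceding = "\n".join(pages[start:i])
        if not preceding:
            # The first page has nothing before it, so it is embedded as-is.
            chunks.append(page_text)
            continue
        context = summarize(preceding)
        header = f"[Context from pages {start + 1}-{i}: {context}]"
        chunks.append(header + "\n" + page_text)
    return chunks


def first_line_summary(text: str) -> str:
    """Toy stand-in for a model call: return the first non-empty line."""
    for line in text.splitlines():
        if line.strip():
            return line.strip()
    return ""


# Made-up parsed pages: the table title appears two pages before the last rows.
pages = [
    "Table 2: Torque limits by part",
    "M6 bolt | 10 Nm",
    "M8 bolt | 25 Nm",
]

for chunk in contextualize_chunks(pages, first_line_summary, window=2):
    print(chunk)
    print("---")
```

In a real pipeline `summarize` would call a language model, and the resulting strings are what get embedded, so a query about "Table 2" can still retrieve rows that sit on a later page.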
