We spent a year building enterprise RAG that actually works. The exact stack, the extraction blo
Everyone thinks building RAG is easy until they are staring down a 500-page unstructured PDF full of schematics and broken tables.
For the past year, my team and I have been building the architecture to process massive, complex documents, make them AI-ready, and spin up agents that actually reason. So far, we have been exclusively working with highly technical companies. We’ve battle-tested our pipelines on complex schematics, architectural graphs, dense code blocks, graphics, and nightmare image tables.
Now, we are looking to take this exact architecture and point it at the law and insurance sectors.
Getting data out of complex documents and into a single source of truth across PuppyGraph/Neo4j and Qdrant is a complete nightmare. Here is the exact stack we use, the failures we hit, and the system we finally built to make it work.
Phase 1: The Extraction Bloodbath
Without pristine data, your agents are just confidently spewing rubbish.
We started with deterministic pipelines, running heavy regex on complex docs. It failed completely. We then moved to PyMuPDF, Camelot, and pdfplumber to try to parse structure, alongside Tesseract for the images.
The core problem? OCR is inherently fuzzy.
When you are dealing with critical field data or technical specs, a single OCR failure on a complex image, or a confusion of an "l" with a "1" or an "i", ruins downstream computation.
We tore that down and tested Unstructured.io, Docling, and MinerU. There is no silver bullet, but deploying our own MinerU instance has been an absolute beast. Its pipeline mode produces great results, but if you want industry-grade extraction (VLM and hybrid modes), you have to pay the hardware tax of at least 16GB VRAM.
The OCR AI Fix: OCR is still a problem even with good models. To fix this, we isolate only the broken OCR chunks and submit them to an LLM specifically to correct the broken formatting. You have to mind your context windows here, because models degrade fast on heavy context. We use Anthropic’s Claude, which handles larger context windows far better for this specific needle-in-a-haystack correction.
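To make the "isolate only the broken chunks" step concrete, here is a minimal Python sketch. It is not our production code: the confusable-character regex, the `suspect_ratio` and `repair_chunks` names, and the 5% threshold are all assumptions for illustration, and the LLM call is passed in as a plain callable so the routing logic stays independent of any particular model or SDK.

```python
import re
from typing import Callable, List

# Hypothetical heuristic: a commonly confused letter (l, I, O) sitting
# directly next to a digit, e.g. "4l" or "2O", is a likely OCR error.
SUSPECT = re.compile(r"\d[lIO]|[lIO]\d")


def suspect_ratio(chunk: str) -> float:
    """Fraction of whitespace-separated tokens that look OCR-damaged."""
    tokens = chunk.split()
    if not tokens:
        return 0.0
    flagged = sum(1 for token in tokens if SUSPECT.search(token))
    return flagged / len(tokens)


def repair_chunks(
    chunks: List[str],
    correct: Callable[[str], str],
    threshold: float = 0.05,  # assumed cut-off, tune on your own data
) -> List[str]:
    """Send only suspicious chunks through `correct`; pass the rest through."""
    return [
        correct(chunk) if suspect_ratio(chunk) >= threshold else chunk
        for chunk in chunks
    ]
```

In practice `correct` would wrap a call to whichever LLM you use, with a prompt that asks for formatting repair only. Keeping each request to a single chunk is one way to keep the context small.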
Phase 2: Storing the Truth (Graph vs. Vector)
Once you have the extracted data, how do you store it so the AI can actually reason?
1. The Vector DB (Qdrant)
Vector DBs introduce a whole new set of headaches. Chunking strategies dictate your retrieval success. Too fine, and you lose the semantic benefits. Too coarse, and the vector gets diluted.
And extracted tables? A broken-up table is completely useless. It has to stay in the exact same chunk to maintain spatial context. To get retrieval right, we implemented hybrid ingestion using SPLADE/BM25 (for keyword precision) and Voyage embeddings (for semantic depth).
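The "tables stay in one chunk" rule can be sketched as a small packing routine. This is an illustrative Python example under stated assumptions: the `Block` type, its `kind` labels, and the character budget are hypothetical, and a real pipeline would probably count tokens rather than characters.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    kind: str  # "text" or "table" (hypothetical labels from the extractor)
    text: str


def chunk_blocks(blocks: List[Block], max_chars: int = 1000) -> List[str]:
    """Pack whole text blocks into chunks; never split or merge a table."""
    chunks: List[str] = []
    current: List[str] = []
    size = 0

    def flush() -> None:
        nonlocal current, size
        if current:
            chunks.append("\n".join(current))
            current, size = [], 0

    for block in blocks:
        if block.kind == "table":
            # A table always becomes its own chunk, even if it is oversized.
            flush()
            chunks.append(block.text)
            continue
        if size + len(block.text) > max_chars:
            flush()
        current.append(block.text)
        size += len(block.text)

    flush()
    return chunks
```

The trade-off is that an oversized table produces an oversized chunk. That is deliberate here: a large intact table is still retrievable, while a table cut in half is not.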
2. The Graph DB (PuppyGraph / Neo4j)
This is the ultimate source of truth. We build the graph ontology, strictly defining the schema: nodes and relationships. If you map this well, you get incredible, highly relational data. If you don't, your agents produce nonsense.
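One way to enforce a strict ontology is to validate every node and relationship against it before anything is written to the graph. The Python sketch below is an assumption-heavy illustration: the labels (`Document`, `Component`, `Spec`), the relationship types, and the `validate` helper are all made up for the example and are not our actual schema or any graph database's API.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

# Hypothetical ontology: allowed node labels and (source, relation, target) triples.
NODE_LABELS: Set[str] = {"Document", "Component", "Spec"}
EDGE_TYPES: Set[Tuple[str, str, str]] = {
    ("Document", "DESCRIBES", "Component"),
    ("Component", "HAS_SPEC", "Spec"),
}


@dataclass(frozen=True)
class Edge:
    src: str  # source node id
    rel: str  # relationship type
    dst: str  # target node id


def validate(nodes: Dict[str, str], edges: List[Edge]) -> List[str]:
    """Return a list of ontology violations; an empty list means the batch is clean."""
    errors: List[str] = []
    for node_id, label in nodes.items():
        if label not in NODE_LABELS:
            errors.append(f"{node_id}: unknown label {label!r}")
    for edge in edges:
        if edge.src not in nodes or edge.dst not in nodes:
            errors.append(f"{edge.src}-[{edge.rel}]->{edge.dst}: dangling endpoint")
            continue
        triple = (nodes[edge.src], edge.rel, nodes[edge.dst])
        if triple not in EDGE_TYPES:
            errors.append(f"{edge.src}-[{edge.rel}]->{edge.dst}: not in ontology")
    return errors
```

Rejecting a batch with a non-empty error list keeps extraction mistakes out of the graph, so they do not turn into relationships that agents later treat as fact.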
Phase 3: Retrie…
Source: reddit, domain: tech, retention score: 0.64