This tree search framework achieves 98.7% accuracy on documents where vector search fails



A new open-source framework called PageIndex tackles one of the oldest problems in retrieval-augmented generation (RAG): handling very long documents.

The classic RAG workflow (slicing documents, calculating embeddings, storing them in a vector database, and retrieving top matches based on semantic similarity) works well for basic tasks such as Q&A on small documents.

But as companies try to scale RAG toward high-stakes workflows—auditing financial statements, analyzing legal contracts, navigating pharmaceutical protocols—they run into an accuracy hurdle that chunk-level optimization can't solve.

PageIndex abandons the standard chunk-and-embed pipeline entirely and treats document retrieval not as a search problem, but as a navigation problem.

AlphaGo for documents

PageIndex addresses these limitations by borrowing a concept from AI rather than search engines: tree search.

When humans need to find specific information in a dense manual or long annual report, they don’t go through each paragraph in a linear fashion. They consult the table of contents to identify the relevant chapter, then the section and finally the specific page. PageIndex forces the LLM to reproduce this human behavior.

Instead of pre-computing vectors, the framework constructs a "Global Index" of the document structure, creating a tree structure where the nodes represent chapters, sections and subsections. When a query arrives, the LLM performs a tree search, explicitly classifying each node as relevant or irrelevant based on the full context of the user’s request.
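
To make the idea concrete, here is a minimal sketch of such a tree index and search loop in Python. The `Node` fields and the `llm_is_relevant` helper are illustrative assumptions, not PageIndex's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One entry in the document's table-of-contents tree."""
    title: str
    summary: str                 # short description of what the section covers
    page_range: tuple[int, int]  # (start_page, end_page) in the source document
    children: list["Node"] = field(default_factory=list)

def llm_is_relevant(query: str, node: Node) -> bool:
    """Hypothetical helper: prompt an LLM with the query plus the node's title and
    summary, and have it judge whether the section could help answer the query."""
    raise NotImplementedError

def tree_search(query: str, node: Node, hits: list[Node]) -> list[Node]:
    """Walk the tree top-down, descending only into branches the model judges relevant."""
    if not llm_is_relevant(query, node):
        return hits
    if not node.children:        # leaf section: a candidate to read in full
        hits.append(node)
        return hits
    for child in node.children:
        tree_search(query, child, hits)
    return hits
```

In practice the relevance judgments would be batched or cached, but the control flow is the point: the model navigates the document's structure rather than matching vectors.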

"In computer terms, a table of contents is a tree representation of a document, and its navigation corresponds to a tree search," » said Zhang. "PageIndex applies the same basic idea – tree searching – to document retrieval and can be thought of as an AlphaGo-style system for retrieval rather than gaming."

This shifts the architectural paradigm from passive retrieval, where the system simply retrieves matching text, to active navigation, where an agent model decides where to search.

The limits of semantic similarity

There is a fundamental flaw in the way traditional RAG handles complex documents. Vector retrieval assumes that the text most semantically similar to a user's query is also the most relevant. In professional fields, this assumption frequently fails.

Mingtian Zhang, co-founder of PageIndex, cites financial reporting as a prime example of this failure mode. If a financial analyst asks an AI about EBITDA (earnings before interest, taxes, depreciation and amortization), a standard vector database will retrieve every chunk where the acronym or a similar term appears.

"Multiple sections may mention EBITDA with similar wording, but only one section defines the precise calculation, adjustments or reporting scope relevant to the question," Zhang told VentureBeat. "A similarity-based retriever has difficulty distinguishing these cases because the semantic signals are almost indistinguishable."

This is the "intention vs. content" gap. User does not want to find the word "Ebitda"; they want to understand the “logic” behind it for that specific quarter.

Additionally, traditional embeddings strip the query of its context. Since embedding models have strict input-length limits, the retrieval system typically sees only the specific question asked, ignoring previous turns in the conversation. This detaches the retrieval step from the user's reasoning process: the system matches documents against a short, decontextualized query rather than the full history of the problem the user is trying to solve.

Solving the multi-hop reasoning problem

The concrete impact of this structural approach is most visible in "multi-hop" queries, which require the AI to follow a breadcrumb trail through different parts of a document.

In a recent benchmark test known as FinanceBench, a system built on PageIndex called Mafin 2.5 achieved a state-of-the-art accuracy score of 98.7%. The performance gap between this approach and vector systems becomes evident when analyzing how they handle internal references.

Zhang gives the example of querying the total value of deferred assets in a Federal Reserve annual report. The main section of the report describes the "change" in value but does not list the total. The text, however, contains a footnote: "See Appendix G of this report… for more detailed information."

A vector system usually fails here. The text in Appendix G bears no resemblance to the user’s query on deferred assets; it’s probably just a table of numbers. Since there is no semantic match, the vector database ignores it.

However, the reasoning-based retriever reads the clue in the main text, follows the structural link to Appendix G, locates the correct table, and returns the exact figure.

The latency trade-off and infrastructure change

For enterprise architects, the immediate concern with an LLM-driven retrieval process is latency. Vector searches return in milliseconds; having an LLM "read" a table of contents means a significantly slower user experience.

However, Zhang explains that the latency perceived by end users can be negligible because of how retrieval is woven into the generation process. In a classic RAG configuration, retrieval is a blocking step: the system must query the database before it can start generating a response. With PageIndex, retrieval happens inline, during the model's reasoning process.

"The system can start streaming immediately and recover as it generates," » said Zhang. "This means that PageIndex does not add an additional “fetch gate” before the first token, and the time to first token (TTFT) is comparable to a normal LLM call."

This architectural change also simplifies the data infrastructure. By removing the need for embeddings, businesses no longer have to maintain a dedicated vector database. The tree-structured index is lightweight enough to live in a traditional relational database such as PostgreSQL.

This addresses a growing problem in LLM systems with retrieval components: the complexity of keeping vector stores in sync with living documents. PageIndex separates structure indexing from text extraction. If a contract is changed or a policy updated, the system can handle small changes by reindexing only the affected subtree rather than reprocessing the entire corpus of documents.
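
As an illustration of how lightweight that can be, here is a hypothetical adjacency-list schema and subtree-reindexing routine, sketched with SQLite standing in for PostgreSQL; neither the schema nor the `rebuild` callback comes from PageIndex:

```python
import sqlite3

# Hypothetical adjacency-list schema: one row per table-of-contents node,
# linked to its parent by id. No vector store is involved.
conn = sqlite3.connect("page_index.db")   # PostgreSQL would work the same way
conn.execute("""
    CREATE TABLE IF NOT EXISTS toc_nodes (
        id         INTEGER PRIMARY KEY,
        parent_id  INTEGER REFERENCES toc_nodes(id),
        doc_id     TEXT,
        title      TEXT,
        summary    TEXT,
        start_page INTEGER,
        end_page   INTEGER
    )
""")

def reindex_subtree(conn, node_id: int, rebuild) -> None:
    """When one section of a document changes, drop and rebuild only that node's
    subtree instead of reprocessing the whole corpus. `rebuild` is a hypothetical
    callback that re-parses the changed section and reinserts its nodes."""
    rows = conn.execute("""
        WITH RECURSIVE subtree(id) AS (
            SELECT id FROM toc_nodes WHERE id = ?
            UNION ALL
            SELECT t.id FROM toc_nodes t JOIN subtree s ON t.parent_id = s.id
        )
        SELECT id FROM subtree
    """, (node_id,)).fetchall()
    placeholders = ",".join("?" * len(rows))
    conn.execute(f"DELETE FROM toc_nodes WHERE id IN ({placeholders})", [r[0] for r in rows])
    rebuild(conn, node_id)
    conn.commit()
```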

A decision matrix for the enterprise

Although the accuracy gains are compelling, tree search does not universally replace vector search. The technology is best viewed as a specialized tool for "deep work" rather than a catch-all for every retrieval task.

For short documents, such as emails or chat logs, the entire context often fits within the context window of a modern LLM, making any retrieval system unnecessary. Conversely, for tasks based purely on semantic discovery, such as recommending similar products or finding content with a similar "vibe," vector embeddings remain the better choice because the goal is proximity, not reasoning.

PageIndex falls squarely in the middle: long, highly structured documents where the cost of error is high. This includes technical manuals, FDA filings, and merger agreements. In these scenarios, the requirement is auditability. An enterprise system must be able to explain not only the answer, but also the path it took to find it (for example, confirming that it checked Section 4.1, followed the reference to Appendix B, and summarized the data there).

The future of agentic retrieval

The rise of frameworks like PageIndex signals a broader trend in the AI stack: the evolution toward "Agentic RAG." As models become more capable of planning and reasoning, the responsibility for finding data shifts from the database layer to the model layer.

We're already seeing this in the coding space, where agents like Claude Code and Cursor are moving away from simple vector search toward active exploration of the codebase. Zhang believes general document retrieval will follow the same trajectory.

"Vector databases still have suitable use cases," » said Zhang. "But their historical role as the default database for LLMs and AI will become less clear over time."


