In this post, we describe our Mixture-of-PageRanks (MixPR) RAG system, which is built to perform long-context tasks in a highly computationally efficient manner. We cover the key features of the algorithm and the SOTA results it achieves across a variety of long-context benchmarks. MixPR can augment any existing foundation model, robustly outperforms frontier long-context models, and can extend effective LLM context lengths into the billions of tokens, all while running efficiently on CPU.
Long-context LLMs are highly capable but require enormous compute and memory resources. This makes them expensive to serve in the cloud and renders them largely impractical for compute-constrained scenarios such as on-device applications. This raises the question of how we can reduce the cost of processing long contexts without hurting performance.
At Zyphra, we are developing fast but effective RAG systems to preprocess long-context inputs for LLMs. The idea is to use RAG to select a subset of the text most relevant for the task at hand before sending it to the LLM. By only inputting a small portion of the context to the LLM, the cost of LLM inference can be drastically reduced.
We previously presented a preliminary graph-based RAG system engineered for this purpose and demonstrated its effectiveness on the HashHop benchmark, achieving state-of-the-art accuracy with unprecedented speed and memory efficiency.
Now, we present an expanded version of our system and demonstrate its effectiveness across an extensive set of long-context benchmarks. Details can be found in our new paper. This new version of the algorithm, which we call Mixture-of-PageRanks (MixPR), achieves SOTA results across a variety of standard long-context benchmarks while being highly computationally efficient. Below is a summary of the algorithm and the main results.
Previous works applying RAG to long-context tasks have two main limitations. First, they tend to focus on simple QA-style questions, even though many long-context tasks are more complex, e.g., reasoning or summarization. Second, they do not analyze or focus on reducing the compute cost of the retrieval pipeline.
RAG models are normally tested on standard information-retrieval benchmarks, where it is assumed that a pre-embedded vector database is available to answer questions at test time. However, in long-context tasks the data the RAG system retrieves from, i.e., the long context, is given at test time. This means the RAG system must perform not just fast retrieval but also fast database construction. If a RAG system is very slow at constructing the database, it will be much less useful as a long-context preprocessing algorithm.
We have developed a RAG system that can handle long-context inputs and solve complex tasks beyond simple QA. Our system uses a novel retriever based on PageRank, the classic graph-based ranking algorithm originally developed for Google's search engine. A novel feature of our algorithm is that it mixes two different types of PageRank, which correspond to two main categories of tasks: local and global retrieval.
Tasks requiring local retrieval are those where we want to retrieve text chunks that are semantically and syntactically related to the query, e.g., QA, key-value retrieval, reasoning, etc. Tasks requiring global retrieval are those where we want to retrieve the most important parts of the overall document, where importance is independent of query-relatedness. For example, for summarizing books we may want to retrieve the text chunks describing the book’s main events, which will not directly relate to the specific content of the phrase “summarize this book” but will nonetheless be important in relation to the structure of the events in the book.
We argue that standard PageRank (PR) provides a way to address global retrieval tasks. Given an adjacency matrix, A, linking items in a database, PR assigns importance to each item based only on its structural importance in the graph (e.g., the number of incoming connections and the importance of its neighbors). We visualize this in the bottom of the figure above, where structurally important nodes are shown in dark blue.
Conversely, Personalized PageRank (PPR) can provide a way to do local retrieval for difficult tasks. PPR weights items in a database based both on their relatedness to a ‘personalization vector’ and based on their structural importance. We use the personalization vector to bias weights toward a node representing the query. We visualize this in the middle row of the figure above.
PPR can be seen as working akin to approximate Bayesian inference: it tries to maximize the likelihood, P(X | Z), of the input data X (i.e., maximize similarity to the query), while also maximizing the prior probability, P(Z), of the node (i.e., the structural importance of the node in the graph). In local retrieval tasks, an X is given in the form of a query. In global retrieval tasks, no X is given (the query does not contain relevant information to be searched), and so we simply maximize the prior probability/structural importance over our weights Z.
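To make the relationship between the two variants concrete, here is a minimal, dense-matrix power-iteration sketch in which standard PR is simply PPR with a uniform teleportation vector. The damping factor, tolerance, and normalization choices are illustrative defaults, not the settings used in MixPR.

```python
import numpy as np

def pagerank(A, personalization=None, damping=0.85, tol=1e-8, max_iter=100):
    """Power iteration over a column-stochastic transition matrix.

    A:               (n, n) nonnegative adjacency / similarity matrix.
    personalization: teleportation distribution over nodes; None gives
                     standard PR, a query-similarity vector gives PPR.
    """
    n = A.shape[0]

    # Column-normalize so each column is a probability distribution.
    col_sums = A.sum(axis=0)
    col_sums[col_sums == 0] = 1.0
    M = A / col_sums

    # Uniform teleportation = standard PageRank; query-biased = personalized PageRank.
    v = np.full(n, 1.0 / n) if personalization is None else personalization / personalization.sum()

    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_next = damping * (M @ r) + (1.0 - damping) * v
        if np.abs(r_next - r).sum() < tol:
            break
        r = r_next
    return r_next
```

Retrieval then amounts to taking the top-k chunks by score, with k chosen to fit the LLM's context budget.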
Finally, we need a way to do routing between PR and PPR based on the type of query. We use the base LLM in the RAG system with a zero-shot prompt to classify the query as either requiring global retrieval (route to PR) or local retrieval (route to PPR). This simple method worked very well on multiple LLMs.
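As an illustration of how lightweight this router can be, the sketch below performs the classification with a single zero-shot call. The prompt wording and the `llm` callable are hypothetical stand-ins, not the exact prompt used in our experiments.

```python
ROUTING_PROMPT = (
    "You will be given a user query about a long document.\n"
    "Answer with exactly one word:\n"
    "  LOCAL  - the query asks about specific facts, entities, or passages\n"
    "  GLOBAL - the query asks for a summary or overall view of the document\n\n"
    "Query: {query}\n"
    "Answer:"
)

def route_query(llm, query: str) -> str:
    """Classify a query as needing global (PR) or local (PPR) retrieval."""
    answer = llm(ROUTING_PROMPT.format(query=query)).strip().upper()
    return "global" if answer.startswith("GLOBAL") else "local"
```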
The final retrieval algorithm can be summarized as follows:
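Below is a rough end-to-end sketch of the retrieval step, reusing the `pagerank` and `route_query` helpers sketched above. The `embed` function (assumed to return row-normalized chunk embeddings), the chunking, and `top_k` are assumptions made for illustration.

```python
import numpy as np

def mixpr_retrieve(llm, embed, chunks, query, top_k=20):
    """Sketch of MixPR retrieval: build a similarity graph over chunks,
    route the query, then score chunks with PR or PPR."""
    E = embed(chunks)                  # (num_chunks, d), rows L2-normalized
    A = E @ E.T                        # cosine-similarity graph
    np.fill_diagonal(A, 0.0)           # drop self-loops

    if route_query(llm, query) == "global":
        scores = pagerank(A)           # structural importance only
    else:
        q = embed([query])[0]
        personalization = np.maximum(E @ q, 0.0)   # bias toward query-similar chunks
        if personalization.sum() == 0:
            personalization = None                 # fall back to uniform teleportation
        scores = pagerank(A, personalization=personalization)

    top = np.argsort(scores)[::-1][:top_k]
    return [chunks[i] for i in sorted(top)]        # keep original document order
```

The retrieved chunks are then concatenated and passed to the LLM in place of the full long context.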
Next, how do we implement MixPR so it can be executed quickly, in real time, on compute-limited hardware? Although the abstract MixPR algorithm is simple, it alone does not specify an implementation that is compute- and memory-efficient. We need a way to embed the text, construct a graph between text chunks, and perform retrieval quickly.
Previous SOTA graph-RAG systems tend to have a slow embedding process in which an LLM writes out entity pairs from the text, which are then used to construct a graph (e.g., see here and here).
To speed this process up, we embed text chunks with a symbolic program that creates sparse embeddings from keyword statistics using the TF-IDF algorithm. This program is fast and runs entirely on the CPU, and the embeddings are stored in memory-efficient sparse matrices. To build the graph, we compute the proximity matrix of cosine-similarity values between all text embeddings and use it directly as the adjacency matrix. This operation can be done with a single matrix multiply.
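A minimal sketch of this construction, assuming scikit-learn's `TfidfVectorizer` as the keyword-statistics embedder (the default settings here are placeholders, not the exact configuration from the paper):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def build_sparse_graph(chunks):
    """Sparse TF-IDF chunk embeddings and a cosine-similarity adjacency matrix,
    all kept in scipy sparse (CSR) format so everything runs quickly on CPU."""
    vectorizer = TfidfVectorizer()          # rows are L2-normalized by default,
    E = vectorizer.fit_transform(chunks)    # so dot products are cosine similarities
    A = E @ E.T                             # adjacency matrix from one sparse matmul
    return vectorizer, E, A
```

At query time, `vectorizer.transform([query])` produces a sparse embedding of the query in the same space, which can serve as the personalization vector for PPR.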
The figure above shows that this process of chunking text, embedding, constructing the graph, and retrieving text is very fast, taking only a few seconds to process contexts of a million-plus tokens on a desktop CPU, and much faster than the standard pipeline that uses dense embeddings.
A small sample of our results can be seen in the figures above and below; please see our paper for the full results. We test our MixPR model across multiple long-context benchmarks, including RULER, a synthetic long-context suite with over 12 different tasks; BABILong, a natural-language long-context reasoning benchmark; and HashHop, a synthetic long-context reasoning benchmark. All of these benchmarks contain tasks that go beyond simple QA.
Across the benchmarks there are 8 tasks requiring multi-hop retrieval. Compared to nearest neighbor baselines and the base LLMs, MixPR performs better on difficult multi-hop retrieval questions.
MixPR also outcompetes nearest neighbor baselines on global retrieval tasks.
Importantly, MixPR, which uses standard PR for global retrieval and PPR for local retrieval, outcompetes a retriever that uses PPR for all tasks, supporting our claim that mixing PR and PPR, rather than relying on PPR alone, provides performance benefits.
How does MixPR compare to SOTA on these benchmarks? We match the SOTA model on HashHop, outcompeting it at very long context lengths. We also achieve SOTA results on RULER, where MixPR models outperform base long-context LLMs. MixPR with GPT-4o achieves the second-best result on BABILong, behind only a specialized recurrent memory transformer finetuned specifically for BABILong, and ranks first among previous RAG models on the BABILong leaderboard. See our paper for full result tables.
At Zyphra, we are committed to developing AI systems that can work on a variety of devices, including compute-constrained devices like phones and personal computers. We believe RAG will play an important role in dealing with the computational costs of conditioning language models on large databases of texts. Our MixPR RAG system provides evidence for this claim by showing how RAG can be used in a highly compute efficient way while still being effective on long-context tasks.
RAG is not all we are working on here at Zyphra. To get more information about our work on model training, data curation, and algorithmic innovation, check out our other blog posts.
We argue that standard PageRank (PR) provides a way to address global retrieval tasks. Given an adjacency matrix, A, linking items in a database, PR assigns importance to each item based only on its structural importance in the graph (e.g., the number of incoming connections and the importance of its neighbors). We visualize this at the bottom of the figure above, where structurally important nodes are shown in dark blue.
Conversely, Personalized PageRank (PPR) provides a way to do local retrieval for difficult tasks. PPR weights items in the database based both on their relatedness to a ‘personalization vector’ and on their structural importance. We use the personalization vector to bias the weights toward a node representing the query. We visualize this in the middle row of the figure above.
PPR can be seen as akin to approximate Bayesian inference: it tries to maximize the likelihood, P(X | Z), of the input data X (i.e., maximize similarity to the query), while also maximizing the prior probability, P(Z), of the node (i.e., the structural importance of the node in the graph). In local retrieval tasks, an X is given in the form of a query. In global retrieval tasks, no X is given (the query does not contain relevant information to be searched), and thus we only want to maximize the prior probability, i.e., structural importance, over our weights Z.
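To make the two modes concrete, here is a minimal sketch (an illustration following standard PageRank conventions, not our exact implementation) of computing PR and PPR by power iteration over a sparse, column-normalized similarity graph. Standard PR corresponds to a uniform restart (personalization) vector, i.e., the prior-only case above, while PPR concentrates the restart mass on the query node.

import numpy as np
from scipy import sparse

def pagerank_scores(A, personalization=None, alpha=0.85, iters=50):
    # A: sparse, nonnegative adjacency/similarity matrix over chunks (n x n).
    # personalization: restart distribution over nodes; None (uniform) gives
    # standard PR, while a query-concentrated vector gives Personalized PageRank.
    n = A.shape[0]
    col_sums = np.asarray(A.sum(axis=0)).ravel()
    col_sums[col_sums == 0] = 1.0
    M = A @ sparse.diags(1.0 / col_sums)            # column-stochastic transition matrix
    p = np.full(n, 1.0 / n) if personalization is None else personalization / personalization.sum()
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = alpha * (M @ r) + (1.0 - alpha) * p     # PageRank fixed-point iteration
    return r                                        # higher score = more important chunk

In the Bayesian reading above, the restart vector p plays the role of the likelihood (similarity to the query) and the graph structure plays the role of the prior; with a uniform p, the iteration reduces to standard PR, i.e., structural importance alone.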
Finally, we need a way to route between PR and PPR based on the type of query. We use the base LLM in the RAG system with a zero-shot prompt to classify the query as requiring either global retrieval (routed to PR) or local retrieval (routed to PPR). This simple method worked well across multiple LLMs.
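As an illustration of the router (the prompt below is a paraphrase rather than the exact prompt from our paper, and call_llm stands in for whatever chat-completion client is in use):

ROUTER_PROMPT = (
    "You will be given a query over a long document. Reply with exactly one word.\n"
    "Reply 'global' if the query concerns the document as a whole (e.g., summarization).\n"
    "Reply 'local' if it asks about specific facts, entities, or reasoning over particular passages.\n"
    "Query: {query}"
)

def route_query(query, call_llm):
    # Returns 'global' (route to standard PR) or 'local' (route to PPR).
    reply = call_llm(ROUTER_PROMPT.format(query=query)).strip().lower()
    return "global" if reply.startswith("global") else "local"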
The final retrieval algorithm can be summarized as follows: chunk and embed the long context, build a similarity graph over the chunks, classify the query to route between PR and PPR, run the selected PageRank variant over the graph, and pass the top-scoring chunks to the LLM.
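A minimal end-to-end sketch of that loop, reusing the pagerank_scores and route_query helpers above, might look as follows; chunk_text and embed_and_link are placeholders for the chunking and graph-construction steps described in the next section (with the query appended as the final node of the graph).

import numpy as np

def mixpr_retrieve(context, query, call_llm, chunk_text, embed_and_link, top_k=16):
    # Select the top_k chunks of `context` most useful for answering `query`.
    chunks = chunk_text(context)
    A = embed_and_link(chunks, query)              # similarity graph; query is the last node
    if route_query(query, call_llm) == "local":
        p = np.zeros(A.shape[0])
        p[-1] = 1.0                                # restart at the query node -> PPR
    else:
        p = None                                   # uniform restart -> standard PR
    scores = pagerank_scores(A, personalization=p)
    keep = np.argsort(scores[:-1])[::-1][:top_k]   # rank chunk nodes, drop the query node
    return [chunks[i] for i in sorted(keep)]       # selected chunks in document order

The retrieved chunks can then be concatenated in document order and passed to the LLM together with the query.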
Next, how do we implement MixPR so it can be executed quickly, in real time, on compute-limited hardware? Although the abstract MixPR algorithm is simple, it alone does not specify how to implement the algorithm so that it is compute- and memory-efficient. We need a way to embed the text, construct a graph between text chunks, and perform retrieval quickly.
Previous SOTA graph RAG systems tend to have a slow embedding process in which an LLM extracts entity pairs from the text, which are then used to construct a graph (e.g., see here and here).
To speed this process up, we embed text chunks using a symbolic program that creates sparse embeddings from keyword statistics via the TF-IDF algorithm. This program is fast and runs entirely on the CPU. The embeddings are stored in memory-efficient sparse matrices. To construct the adjacency matrix, we compute the matrix of pairwise cosine similarities between all chunk embeddings and use it directly as the graph's adjacency matrix. This operation can be done with a single matrix multiply.
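As a rough illustration of this step (using scikit-learn's TfidfVectorizer in place of our exact keyword-statistics program, and without any sparsification of the similarity matrix), the sparse embeddings and the similarity-based adjacency matrix can be built in a few lines, with all pairwise cosine similarities obtained from a single sparse matrix multiply once the rows are L2-normalized:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

def embed_and_link(chunks, query):
    # Sparse TF-IDF embeddings for all chunks plus the query (appended as the last row),
    # and a cosine-similarity adjacency matrix computed with one sparse matmul.
    texts = list(chunks) + [query]
    X = TfidfVectorizer().fit_transform(texts)   # (n+1) x vocab CSR matrix, CPU-only
    X = normalize(X)                             # L2-normalize rows
    A = X @ X.T                                  # pairwise cosine similarities
    A.setdiag(0.0)                               # remove self-links
    return A.tocsr()

For very long contexts one might additionally threshold or top-k-sparsify A to bound its memory footprint; the sketch keeps the full similarity matrix for clarity.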
The figure above shows that this process of chunking the text, embedding, constructing the graph, and retrieving text is very fast, taking only a few seconds to process contexts of over a million tokens on a desktop CPU. It is also much faster than the standard pipeline that uses dense embeddings.
A small sample of our results can be seen in the figures above and below; please see our paper for the full results. We test our MixPR model across multiple long-context benchmarks, including RULER, a synthetic long-context dataset with over 12 different tasks; BABILong, a natural-language long-context reasoning benchmark; and HashHop, a synthetic long-context reasoning benchmark. All of these benchmarks contain tasks that go beyond simple QA.
Across the benchmarks there are 8 tasks requiring multi-hop retrieval. On these difficult multi-hop questions, MixPR outperforms both nearest-neighbor baselines and the base LLMs.
MixPR also outperforms nearest-neighbor baselines on global retrieval tasks.
Importantly, MixPR, which uses standard PR for global retrieval and PPR for local retrieval, outperforms a retriever that applies PPR to every task. This supports our claim that mixing PR and PPR, rather than relying on PPR alone, provides performance benefits.
How does MixPR compare to SOTA on these benchmarks? On HashHop, we match the SOTA model and outperform it at very long context lengths. We also achieve SOTA results on RULER, where MixPR models outperform the base long-context LLMs. MixPR with GPT-4o achieves the second-best result on BABILong, behind only a specialized recurrent-memory transformer finetuned specifically for BABILong, and ranks first among RAG models on the BABILong leaderboard. See our paper for the full result tables.
At Zyphra, we are committed to developing AI systems that can work on a variety of devices, including compute-constrained devices like phones and personal computers. We believe RAG will play an important role in managing the computational costs of conditioning language models on large bodies of text. Our MixPR RAG system provides evidence for this claim by showing how RAG can be used in a highly compute-efficient way while remaining effective on long-context tasks.
RAG is not all we are working on here at Zyphra. To get more information about our work on model training, data curation, and algorithmic innovation, check out our other blog posts.