There has recently been growing interest in conversational agents with long-term memory, which has led to the rapid development of language models that use retrieval-augmented generation (RAG). Until recently, most work on RAG has focused on information retrieval from large databases of text, such as Wikipedia, rather than information from long-form conversations. In this paper, we argue that effective retrieval from long-form conversational data faces two unique problems compared to static database retrieval: 1) time/event-based queries, which require the model to retrieve information about previous conversations based on time or the order of a conversational event (e.g., the third conversation on Tuesday), and 2) ambiguous queries that require surrounding conversational context to understand. To better develop RAG-based agents that can deal with these challenges, we generate a new dataset of ambiguous and time-based questions that builds upon a recent dataset of long-form, simulated conversations, and demonstrate that standard RAG-based approaches handle such questions poorly. We then develop a novel retrieval model which combines chain-of-table search methods, standard vector-database retrieval, and a prompting method to disambiguate queries, and demonstrate that this approach substantially improves over current methods at solving these tasks. We believe this new dataset and more advanced RAG agent can act as a key benchmark and stepping stone towards effective memory-augmented conversational agents that can be used in a wide variety of AI applications.
Conversational agents, such as chatbots, personal assistants, and language-interfacing operating systems, are currently seeing rapid development. One specific area of interest is conversational agents that utilize retrieval-augmented generation (RAG), which imbues LLMs with long-term memory. However, the popular question-answering (QA) benchmarks that typically test RAG systems focus primarily on information retrieval (IR) from a static database of texts, such as Wikipedia. The increasing importance of conversational agents raises the question of how to address the unique challenges RAG models face in conversational contexts but not in offline, static database retrieval.
There seem to be two crucial challenges conversational agents face that are not tested in most standard database retrieval benchmarks:

1. Time/event-based queries, which require the model to retrieve information about previous conversations based on time or the order of a conversational event (e.g., the third conversation on Tuesday).
2. Ambiguous queries, which require surrounding conversational context to understand.
We found that there is currently a lack of benchmarks that directly test models on both of these challenges simultaneously in conversational contexts. Further, current conversational LLMs that use standard vector database retrieval methods do not directly address these challenges.
To address these challenges, we at Zyphra created a new open-source benchmark of time-based and ambiguous questions over long-form conversations, and developed a retrieval method that combines chain-of-table search, standard vector-database retrieval, and query rewriting.
We created an open-source dataset by augmenting an existing long-form chat dataset known as LoCoMo [1]. We used the open-source chat logs from the LoCoMo dataset and created three types of questions based on them.
Vector database retrieval returns text chunks based on their semantic similarity to the query text. Vector search is the standard for RAG chatbots, but alone it will not work for the sorts of questions in our dataset, which require retrieval based on the metadata of the text. To deal with this issue, we combine vector database search with a tabular search method known as chain-of-table.
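To make the baseline concrete, here is a minimal sketch of plain semantic retrieval in Python. The embed() helper is a toy stand-in for whatever sentence-embedding model a real system would call; it is an assumption for illustration, not part of our implementation.

```python
# Minimal sketch of plain semantic (vector) retrieval, the standard RAG baseline.
# embed() is a toy stand-in for a real sentence-embedding model.
import numpy as np

def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))  # placeholder, not a real model
    v = rng.normal(size=128)
    return v / np.linalg.norm(v)

def semantic_retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Rank chunks purely by cosine similarity to the query. There is no notion of
    # dates, session order, or speakers, which is why time/event-based questions fail.
    q = embed(query)
    return sorted(chunks, key=lambda c: float(q @ embed(c)), reverse=True)[:k]
```

A question like "what did we discuss in the third conversation on Tuesday?" shares little semantic content with the messages it refers to, so similarity ranking alone has no reliable way to locate them.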
First, we store the chat text in a table, where each row represents one response. Columns hold metadata for that response along with an index to its associated semantic vector. Our retrieval algorithm then combines chain-of-table operations over this metadata table with semantic search over the stored vectors to retrieve the relevant text; a minimal sketch is given below.
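The sketch below illustrates what such a memory table and combined retrieval step might look like. All names here (ChatMemory, embed, the metadata columns) are illustrative assumptions for exposition, and the hand-written metadata filter stands in for the table operations that a chain-of-table style LLM would generate from the query.

```python
# Hypothetical sketch: a metadata table over chat responses plus vector search.
# Column names and the embed() helper are illustrative, not our exact schema.
import numpy as np
import pandas as pd

def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))  # toy stand-in for an embedding model
    v = rng.normal(size=128)
    return v / np.linalg.norm(v)

class ChatMemory:
    def __init__(self) -> None:
        # One row per response; columns hold metadata plus an index into the vector store.
        self.table = pd.DataFrame(columns=["speaker", "date", "session", "turn", "text", "vec_id"])
        self.vectors: list[np.ndarray] = []

    def add(self, speaker: str, date: str, session: int, turn: int, text: str) -> None:
        self.vectors.append(embed(text))
        self.table.loc[len(self.table)] = [speaker, date, session, turn, text, len(self.vectors) - 1]

    def retrieve(self, query: str, metadata_filter: str | None = None, k: int = 3) -> list[str]:
        # Step 1 (chain-of-table style): narrow candidates with table operations over the
        # metadata, e.g. a filter derived from "the third conversation on Tuesday".
        rows = self.table if metadata_filter is None else self.table.query(metadata_filter)
        if rows.empty:
            return []
        # Step 2: rank the surviving rows by semantic similarity to the query.
        q = embed(query)
        scores = [float(q @ self.vectors[i]) for i in rows["vec_id"]]
        return list(rows.assign(score=scores).nlargest(k, "score")["text"])

# Usage example with made-up chat turns.
mem = ChatMemory()
mem.add("Alice", "2023-05-09", 3, 1, "I adopted a puppy named Biscuit this weekend.")
mem.add("Bob", "2023-05-09", 3, 2, "What breed is Biscuit?")
print(mem.retrieve("What pet did Alice adopt?", metadata_filter="date == '2023-05-09' and session == 3"))
```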
To deal with ambiguous queries, we also adapt state-of-the-art query rewriting methods to our algorithm, with promising results.
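As a rough illustration of the idea (not our exact prompts or method), a rewriting step can ask an LLM to turn the ambiguous question into a standalone one before retrieval; call_llm below is a placeholder for whatever model is queried.

```python
# Hypothetical query-rewriting step; the prompt wording and call_llm hook are
# illustrative assumptions, not the exact prompts used in our system.
from typing import Callable

def rewrite_query(ambiguous_query: str, recent_turns: list[str], call_llm: Callable[[str], str]) -> str:
    context = "\n".join(recent_turns[-5:])  # the last few turns supply the disambiguating context
    prompt = (
        "Rewrite the question so it is fully understandable without the conversation below.\n"
        f"Conversation:\n{context}\n"
        f"Question: {ambiguous_query}\n"
        "Standalone question:"
    )
    return call_llm(prompt).strip()

# The rewritten, self-contained question is then passed to the retrieval step above, e.g.
# rewrite_query("When did she say that happened?", recent_turns, call_llm=my_model)
```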
We find our CoTable+Semantic retrieval method significantly outperforms standard semantic retrieval methods.
We see that for both GPT-3.5 and a Mistral-7B model, our method achieves almost perfect recall and very high F2 scores on the items it retrieves from memory, showing that it retrieves the correct text (high recall) without pulling in many extra irrelevant items (high F2). This is unlike existing semantic-based retrieval methods, which perform poorly because they are unable to handle time-related metadata correctly.
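For reference, the F2 score is the F-beta measure with beta = 2, which weights recall twice as heavily as precision:

$$F_2 = \frac{(1 + 2^2)\cdot \text{precision}\cdot \text{recall}}{2^2\cdot \text{precision} + \text{recall}} = \frac{5\,\text{precision}\cdot\text{recall}}{4\,\text{precision} + \text{recall}}$$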
For ambiguous queries, we find that augmenting this method with state-of-the-art query rewriting allows the model to perform almost as well as it does on unambiguous questions (see Table 2).
An important capability for conversational agents is an understanding of when conversational events occurred in the past, together with an ability to handle ambiguous queries, which are commonplace in conversational contexts. We believe our dataset can be used as a basis to test how well various IR methods understand temporal and conversational context, and our empirical results suggest that our approach of combining chain-of-table and semantic retrieval methods provides a useful starting point.
[1] Maharana, A., Lee, D. H., Tulyakov, S., Bansal, M., Barbieri, F., & Fang, Y. (2024). Evaluating very long-term conversational memory of llm agents. arXiv preprint arXiv:2402.17753.