💡 The Align AI Research Review is a weekly dive into the most relevant research papers, AI industry news and updates!
Self-Retrieval is an architecture for end-to-end information retrieval that utilizes a large language model. It improves the efficacy of downstream applications and substantially outperforms previous retrieval methods.
The emergence of large language models (LLMs) has significantly changed how people acquire information. Previously, people turned to information retrieval (IR) systems for their information needs; with LLMs, the paradigm has shifted towards information generation.
IR continues to play a significant role in this paradigm shift; however, its direct assistance to humans has been replaced by indirect assistance to LLMs.
A clear example is retrieval-augmented generation (RAG; Lewis et al., 2020), where the LLM first uses an information retrieval system to fetch documents relevant to the information need and then generates a response based on the retrieved documents.
Traditional information retrieval systems cannot fully meet the demands of LLMs, exploit the robust functionalities offered by LLMs, or support downstream applications compatible with LLMs due to their isolated architecture and limited interaction.
More precisely, an interaction limited to retrieving information and feeding it into the LLM's context cannot ensure that the retrieved information is exactly what the LLM needs and can use.
Tang et al. (2024) proposed Self-Retrieval, an LLM-driven, end-to-end information retrieval architecture that leverages the capabilities of LLMs throughout the IR process, fully integrating the required functionalities of an IR system into a single LLM (see fig 1).
Self-Retrieval achieves this by successively generating natural language indices and document segments, and performing self-evaluation to score and rank the generated documents.
What does Self-Retrieval primarily focus on when conducting passage-level retrieval?
Self-retrieval comprises three components in its architecture: indexing, retrieval, and self-assessment. For indexing, the LLM uses self-supervised learning to internalize the corpus into its parameters, which builds the index.
When retrieving information in response to a user query, the LLM generates the relevant passages after generating the natural language index. Self-assessment involves the LLM assigning a score to passages to determine whether they satisfy the query requirements.
Self-retrieval consists of the following essential elements:
Query: The query is the user input, which serves as the instruction to the LLM.
Natural language index: The natural language index acts as a comprehensive resource for every document in the corpus, linking the query directly to the corresponding passage.
Generated passage: The LLM generates the required passage directly.
Passage assessment: After completing retrieval, the LLM generates a response for self-assessment. This response states whether the passage is relevant or irrelevant to the query, and it is then used to assign a score to the generated passage.
Indexing in Self-retrieval
By utilizing self-supervised learning, self-retrieval initially constructs an index from a given corpus. The objective of indexing is to incorporate the corpus within the parameters of the LLM (see fig 2).
Each sentence in a passage is treated as a query: the LLM takes the sentence as input and learns to generate the passage it came from. In this manner, the corpus can be organized and then memorized by the LLM.
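The pairing of sentences with their source passages can be sketched as follows. This is only an illustration under stated assumptions: the regex sentence splitter and the names `split_sentences` and `build_indexing_pairs` are hypothetical and not taken from the paper.

```python
import re
from typing import List, Tuple


def split_sentences(passage: str) -> List[str]:
    """Naive splitter (an assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", passage.strip())
    return [p for p in parts if p]


def build_indexing_pairs(corpus: List[str]) -> List[Tuple[str, str]]:
    """Return (input sentence, target passage) pairs for self-supervised indexing."""
    pairs = []
    for passage in corpus:
        for sentence in split_sentences(passage):
            # The sentence plays the role of the query; the full passage is the target.
            pairs.append((sentence, passage))
    return pairs


corpus = [
    "Paris is the capital of France. It lies on the Seine.",
    "Tokyo is the capital of Japan.",
]
pairs = build_indexing_pairs(corpus)
for sentence, passage in pairs:
    print(repr(sentence), "->", repr(passage))
```

Each pair would then be used as one training example, so that every sentence becomes a cue for recalling its passage.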
Retrieval via natural language index
The self-retrieval model can perform retrieval for the input query once the corpus index has been generated. Passages are then generated by the LLM after mapping the query into natural language indexes. Self-retrieval specifically involves two steps to produce the retrieved passage.
First, the self-retrieval model generates the natural language index. To improve the quality of this index, ten candidate indexes are first generated via beam search; the model then generates the final natural language index using the query and the candidates as inputs.
Second, after the generation of the natural language index, the LLM will use both the query and the natural language index as inputs to produce the retrieved passage.
The proposed system employs a trie-based constrained decoding algorithm (Chen et al., 2020) so that the generated passages exactly match passages in the provided corpus.
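To make the idea of trie-based constrained decoding concrete, here is a minimal sketch. It assumes whitespace tokenization and a toy scoring function standing in for the LLM; the names `TokenTrie`, `constrained_greedy_decode`, and `toy_score` are hypothetical, and greedy decoding is used here only for brevity.

```python
from typing import Callable, Dict, List, Sequence

EOS = "<eos>"  # hypothetical end-of-sequence marker


class TokenTrie:
    """Prefix tree over token sequences; every corpus passage is one path."""

    def __init__(self) -> None:
        self.children: Dict[str, "TokenTrie"] = {}
        self.is_end = False

    def insert(self, tokens: Sequence[str]) -> None:
        node = self
        for tok in tokens:
            node = node.children.setdefault(tok, TokenTrie())
        node.is_end = True

    def allowed_next(self, prefix: Sequence[str]) -> List[str]:
        """Tokens that keep `prefix` on a path to some corpus passage."""
        node = self
        for tok in prefix:
            if tok not in node.children:
                return []
            node = node.children[tok]
        options = sorted(node.children)
        if node.is_end:
            options.append(EOS)
        return options


def constrained_greedy_decode(
    trie: TokenTrie,
    score_fn: Callable[[Sequence[str], str], float],
    max_len: int = 32,
) -> List[str]:
    """Pick the best-scoring token at each step, but only among trie-allowed tokens."""
    prefix: List[str] = []
    for _ in range(max_len):
        options = trie.allowed_next(prefix)
        if not options:
            break
        best = max(options, key=lambda t: score_fn(prefix, t))
        if best == EOS:
            break
        prefix.append(best)
    return prefix


corpus = ["the cat sat", "the cat ran away", "a dog barked"]
trie = TokenTrie()
for passage in corpus:
    trie.insert(passage.split())


def toy_score(prefix: Sequence[str], token: str) -> float:
    # Stand-in for the LLM's next-token score (an assumption for this demo).
    return 1.0 if token in {"the", "cat", "ran", "away"} else 0.0


decoded = " ".join(constrained_greedy_decode(trie, toy_score))
print(decoded)
```

Because every step is restricted to tokens that extend a real passage, the output can only be a passage that exists in the corpus. With Hugging Face Transformers, a similar restriction could be supplied through the `prefix_allowed_tokens_fn` argument of `generate`, using token IDs instead of strings.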
Self-assessment in Self-retrieval
The Self-retrieval model uses self-assessment to score passages. The score consists of two components: the natural language index score and the confidence-aware passage score.
The natural language index score is taken directly from the generation probability of the natural language index. For the confidence-aware passage score, the self-retrieval model first assesses the passage, and the score is then calculated from that assessment.
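A rough sketch of how the two components might be computed and combined is shown below. Two points are assumptions made for illustration, not details given above: the passage score is taken to be the probability the model assigns to a "relevant" assessment, and the two components are multiplied together. The function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def sequence_probability(token_logprobs: Sequence[float]) -> float:
    """Generation probability of a sequence = exp(sum of its token log-probabilities)."""
    return math.exp(sum(token_logprobs))


def combined_score(index_logprobs: Sequence[float], p_relevant: float) -> float:
    # ASSUMPTION: combine index score and passage score by multiplication.
    # ASSUMPTION: p_relevant is the model's probability of answering "relevant".
    return sequence_probability(index_logprobs) * p_relevant


def rank_passages(
    candidates: List[Tuple[str, Sequence[float], float]]
) -> List[Tuple[str, float]]:
    """Sort (passage, index_logprobs, p_relevant) candidates by combined score, best first."""
    scored = [(text, combined_score(lp, p)) for text, lp, p in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


candidates = [
    ("passage B", [-0.05, -0.05], 0.3),  # confident index, weak assessment
    ("passage A", [-0.1, -0.2], 0.9),    # slightly less confident index, strong assessment
]
ranked = rank_passages(candidates)
for text, score in ranked:
    print(f"{text}: {score:.4f}")
```

In this toy example the strongly assessed passage ranks first even though its index was generated with slightly lower probability, which is the kind of trade-off a two-part score allows.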
What data is utilized to train LLMs for Self-retrieval?
The self-retrieval model utilizes four distinct categories of training data to develop its various capabilities:
(1) Mapping the query to a natural language index: The self-retrieval model generates the natural language index from the query input.
(2) Retrieving passages: The ability to retrieve passages is crucial. The self-retrieval model takes the query and natural language index candidates as input and produces the natural language index and passage as output. To acquire the natural language index candidates for this training data, the proposed system first trains an auxiliary model on data in format (1) and then generates the candidates via beam search.
(3) Self-assessment: The self-retrieval model takes the query, natural language index, and passage as input and generates the passage assessment. To build the rejection (irrelevant) examples, negative passages are randomly selected from the corpus.
(4) Self-supervised instances: Because the annotated (query, passage) instances may not cover all passages, a few self-supervised instances are included to mitigate catastrophic forgetting.
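The training data categories described above could be assembled along these lines. This is a hedged sketch: the dictionary layout, the `kind` labels, the "relevant"/"irrelevant" target strings, and the function name `build_training_instances` are all hypothetical choices made for illustration.

```python
import random
from typing import Any, Dict, List


def build_training_instances(
    query: str,
    index: str,
    candidates: List[str],
    passage: str,
    sentence: str,
    corpus: List[str],
    rng: random.Random,
) -> List[Dict[str, Any]]:
    """Build one example of each training data category for an annotated (query, passage) pair."""
    # Negative passage for the rejection example: sampled at random from the rest of the corpus.
    negative = rng.choice([p for p in corpus if p != passage])
    return [
        {"kind": "query_to_index", "input": {"query": query}, "target": index},
        {
            "kind": "retrieval",
            "input": {"query": query, "candidates": candidates},
            "target": {"index": index, "passage": passage},
        },
        {
            "kind": "assessment",
            "input": {"query": query, "index": index, "passage": passage},
            "target": "relevant",
        },
        {
            "kind": "assessment",
            "input": {"query": query, "index": index, "passage": negative},
            "target": "irrelevant",
        },
        # Self-supervised instance: a sentence from the passage maps back to the passage.
        {"kind": "self_supervised", "input": {"query": sentence}, "target": passage},
    ]


corpus = [
    "Paris is the capital of France. It lies on the Seine.",
    "Tokyo is the capital of Japan.",
    "Canberra is the capital of Australia.",
]
instances = build_training_instances(
    query="What is the capital of France?",
    index="Paris",
    candidates=["Paris", "France", "Seine"],
    passage=corpus[0],
    sentence="It lies on the Seine.",
    corpus=corpus,
    rng=random.Random(0),
)
for inst in instances:
    print(inst["kind"], "->", inst["target"])
```

In practice the candidate list would come from the auxiliary model's beam search rather than being written by hand as it is here.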
Analyzing its Impact on Information Retrieval
Compared to the most effective sparse and dense retrieval baselines, self-retrieval shows an average improvement of 11% in MRR@5.
These findings suggest that Self-retrieval helps the LLM efficiently memorize and comprehend the corpus's knowledge, as well as generate the necessary passages.
Self-retrieval demonstrates commendable performance in retrieving passages, potentially surpassing the capabilities of dense retrievers.
This highlights the tighter integration of retrievers and LLMs achieved through the self-retrieval approach.
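For readers unfamiliar with the metric quoted in this section, MRR@5 (mean reciprocal rank, cut off at rank 5) can be computed as in the sketch below. The function name and the toy document IDs are illustrative only.

```python
from typing import List, Sequence, Set


def mrr_at_k(ranked_lists: Sequence[Sequence[str]], relevant_sets: Sequence[Set[str]], k: int = 5) -> float:
    """Mean over queries of 1/rank of the first relevant result within the top k (0 if none)."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_lists)


ranked_lists: List[List[str]] = [
    ["d3", "d1", "d7"],                    # first relevant hit at rank 2 -> 1/2
    ["d2", "d9", "d4", "d8", "d5", "d6"],  # relevant hit at rank 6, outside top 5 -> 0
    ["d5", "d6"],                          # relevant hit at rank 1 -> 1
]
relevant_sets = [{"d1"}, {"d6"}, {"d5"}]
score = mrr_at_k(ranked_lists, relevant_sets, k=5)
print(score)
```

A higher value means relevant passages tend to appear nearer the top of the returned list.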
What Key Insights Does Self-Retrieval Offer for Scaling LLMs and Downstream Applications?
Self-retrieval uses the parameters of the LLM to store the corpus to be retrieved, internalizing the documents and creating a natural language index.
The research paper assessed the effectiveness of the proposed method by applying it to models with 3B and 7B parameters using 40,000 documents.
However, further investigation is needed to understand the scaling law governing the relationship between document size and model parameters. Moreover, the paper used RAG to validate the effectiveness of self-retrieval for downstream applications that rely on LLMs.
Wrap-Up
Self-Retrieval is an end-to-end LLM-driven information retrieval architecture. Self-Retrieval consists of three steps: indexing, retrieval, and self-assessment. This design allows a single LLM to entirely execute the retrieval task.
Because the LLM internalizes the corpus, it can generate passages from the corpus directly, which makes its output high-confidence and traceable.