[AARR] G-EVAL: NLG Evaluation using GPT-4 with Better Human Alignment

Author

Align AI R&D Team

💡 The Align AI Research Review is a weekly dive into the most relevant research papers, AI industry news and updates!


The reliability of LLMs in evaluating their own outputs raises questions about leveraging a model like GPT-4 for the bulk of evaluations.

Recent research suggests using large language models (LLMs) as reference-free metrics for evaluating natural language generation (NLG) outputs. These metrics offer the advantage of flexibility, as they can be applied to novel tasks that lack human references. However, LLM-based evaluators tend to show lower correlation with human judgments than medium-sized neural evaluators.

In response, the Microsoft Cognitive Services Research team proposed the G-EVAL framework, which employs large language models with chain-of-thought (CoT) reasoning and a form-filling paradigm to evaluate the quality of NLG outputs [1].

Let's explore the G-EVAL framework and its approach to assessing the quality of NLG outputs.

The G-EVAL evaluator is prompt-based and consists of three main elements:

  • First, a prompt defines the evaluation task and the desired evaluation criteria. Expressed in natural language, this prompt outlines the task and includes criteria tailored to the NLG task at hand, such as coherence, conciseness, or grammar.

  • Second, a chain-of-thought (CoT): a sequence of intermediate instructions describing the detailed evaluation steps. Some criteria call for evaluation instructions that go beyond a simple definition, so these steps are generated automatically by the LLM (auto CoT).

  • Third, a scoring function that calls the LLM with the prompt, the auto-generated CoT, the input context, and the text to be evaluated, then computes the score from the probabilities of the returned score tokens. The evaluation itself is performed with a form-filling paradigm, as in the sketch below.
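
To make the form-filling scoring concrete, here is a minimal sketch of a G-EVAL-style coherence scorer for summaries. It is illustrative rather than the authors' implementation: it assumes the OpenAI Python SDK with token log-probabilities enabled, hard-codes the evaluation steps that G-EVAL would generate automatically via auto CoT, and uses a condensed prompt.

```python
# Minimal sketch of a G-EVAL-style scorer (illustrative, not the authors' code).
# Assumes the OpenAI Python SDK (`pip install openai`) and an OPENAI_API_KEY env var.
import math
from openai import OpenAI

client = OpenAI()

# Simplified prompt: task definition, criterion, evaluation steps (hard-coded here;
# G-EVAL generates these steps automatically with CoT), and a form-filling slot.
PROMPT = """You will be given one summary written for a news article.
Your task is to rate the summary on one metric.

Evaluation Criteria:
Coherence (1-5) - the collective quality of all sentences.

Evaluation Steps:
1. Read the news article carefully and identify its main topic and key points.
2. Read the summary and check how well it covers them in a clear, logical order.
3. Assign a coherence score from 1 to 5.

Source Article:
{article}

Summary:
{summary}

Evaluation Form (score ONLY):
- Coherence:"""


def g_eval_coherence(article: str, summary: str, model: str = "gpt-4") -> float:
    """Return a probability-weighted coherence score in [1, 5]."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": PROMPT.format(article=article, summary=summary)}],
        max_tokens=2,
        logprobs=True,
        top_logprobs=5,  # inspect the top candidate tokens for the score slot
    )
    # Find the token position that fills the score slot, then weight each
    # candidate score (1-5) by its probability: score = sum_i p(s_i) * s_i.
    for position in resp.choices[0].logprobs.content:
        weighted, mass = 0.0, 0.0
        for cand in position.top_logprobs:
            token = cand.token.strip()
            if token in {"1", "2", "3", "4", "5"}:
                p = math.exp(cand.logprob)
                weighted += p * int(token)
                mass += p
        if mass > 0:
            return weighted / mass
    return float("nan")
```

The probability weighting over score tokens is what gives G-EVAL fine-grained, continuous scores rather than a single integer rating.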


Does G-EVAL prefer LLM-based outputs?

To investigate this potential drawback, the researchers compared the evaluation scores of LLM-generated summaries with those of human-written summaries in an experiment on the summarization task.

This allowed for a closer examination of whether LLMs exhibit a preference for their own outputs over carefully composed texts of higher quality.

The experiment used the dataset from Zhang et al. (2023), which collected high-quality news-article summaries written by freelance writers. Annotators then compared these human-written summaries with summaries generated by an LLM (GPT-3.5, text-davinci-003).


Can we rely on LLMs to assess LLM outputs?

Align AI's analysis pinpointed some questions raised by this study.

  • Do LLMs show a bias towards the outputs they generate during evaluations?

The G-EVAL authors tested this by comparing the scores assigned to LLM-generated summaries with the scores assigned to higher-quality summaries written by humans.

The findings are as follows:

The LLM consistently gives higher scores to the GPT-3.5 summaries, even though human judges prefer the human-written ones. Though not ideal, the situation isn't as dire as it first appears, but it does call for further explanation: assessing LLM outputs that are already of high quality is inherently challenging. So, the question is:


Should we consider using LLMs as evaluators?

The answer lies somewhere in between. While LLMs offer exceptional efficiency in handling vast data volumes, making them valuable for scaling evaluation processes, their current Spearman correlation with human judgments suggests they're not yet reliable enough to be the sole evaluators.

Align AI suggests a hybrid approach as the most effective strategy. By integrating the computational power of LLMs with the expert judgment of human evaluators, you can get the best of both worlds.
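
As one illustration of what such a hybrid pipeline could look like (our own sketch, not something proposed in the paper), an LLM evaluator can score every output and route only the uncertain, mid-range cases to human reviewers. The `llm_score` callable and the threshold values below are hypothetical.

```python
# Hypothetical triage step for a hybrid LLM + human evaluation pipeline (illustrative).
# `llm_score` is assumed to return a probability-weighted score in [1, 5],
# e.g. the g_eval_coherence sketch shown earlier.
from typing import Callable, List, Tuple


def triage(
    samples: List[Tuple[str, str]],           # (article, summary) pairs
    llm_score: Callable[[str, str], float],
    low: float = 2.5,                         # assumed thresholds, tune per task
    high: float = 4.0,
) -> Tuple[list, list]:
    """Keep confident LLM scores; send borderline cases to human review."""
    auto, needs_human = [], []
    for article, summary in samples:
        score = llm_score(article, summary)
        if low <= score <= high:
            needs_human.append((article, summary, score))  # ambiguous: human judgment
        else:
            auto.append((article, summary, score))         # clearly good or bad: keep LLM score
    return auto, needs_human
```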


Learn More!

  • Read more about the paper [Link]

  • Check out the code for this paper [Link]

  • How to create a custom metric that uses LLMs for evaluation? [Link]

Want to ensure that your AI chatbot is performing the way it is supposed to?

United States

3003 N. 1st Street, San Jose, California

South Korea

23 Yeouidaebang-ro 69-gil, Yeongdeungpo-gu, Seoul

India

Eldeco Centre, Malviya Nagar, New Delhi
