Large Language Models (LLMs) have drawn massive attention for their outstanding performance across a variety of tasks, frequently outperforming supervised models and, in some circumstances, even humans. Impressive as these capabilities are, prior research has identified a number of functional constraints that limit their real-world usefulness. In particular, these models are sensitive to subtleties in prompt wording, to few-shot demonstrations, and to how those demonstrations are organized, which poses a considerable performance issue and hampers the objective assessment of LLMs’ abilities.
In recent research from Megagon Labs, a group of researchers has studied the robustness of LLMs on multiple-choice questions (MCQs), a popular task for testing their capacity for reasoning and fact retrieval. The investigation focuses on how LLMs respond to the reordering of the choices in multiple-choice tests. A thorough study reveals that when the answer choices are reordered, a significant performance gap emerges, ranging from roughly 13% to 75% across several benchmarks.
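To make the setup concrete, here is a minimal sketch of how such order sensitivity could be measured: score the same benchmark under every fixed permutation of the option slots and compare the best and worst orderings. The helpers `query_model` and `format_prompt` are hypothetical stand-ins, not code from the paper.

```python
import itertools

LETTERS = "ABCD"

# Hypothetical stand-in for a real LLM call (e.g., GPT-4 through an API);
# it must return the single letter of the option the model picks.
def query_model(prompt: str) -> str:
    raise NotImplementedError  # plug in an actual model call here

def format_prompt(question: str, options: list[str]) -> str:
    lines = [question] + [f"{l}. {o}" for l, o in zip(LETTERS, options)]
    return "\n".join(lines) + "\nAnswer with a single letter:"

def accuracy_per_ordering(dataset):
    """Score the whole dataset once per fixed permutation of option slots.

    dataset: list of (question, options, correct_option) triples,
    each with the same number of options (here, four).
    """
    n = len(dataset[0][1])
    scores = {}
    for perm in itertools.permutations(range(n)):
        correct = 0
        for question, options, answer in dataset:
            reordered = [options[i] for i in perm]
            letter = query_model(format_prompt(question, reordered))
            if reordered[LETTERS.index(letter)] == answer:
                correct += 1
        scores[perm] = correct / len(dataset)
    return scores

# The sensitivity gap discussed above corresponds to
# max(scores.values()) - min(scores.values()).
```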
Following a thorough analysis, the team hypothesizes that the observed sensitivity arises when LLMs are uncertain among their top-2 or top-3 candidate answers. The order of the options can then favor certain predictions among these top candidates because of a positional bias induced by the question’s wording. Interesting patterns emerge in the top-two candidates that either amplify or attenuate the model’s preference for particular option placements.
To accentuate the bias, the team adopts the strategy of placing the top-two candidates in the first and last positions; conversely, to mitigate the bias, they suggest scattering these candidates among the surrounding options. A variety of experiments were carried out to validate the hypothesized sensitivity. Additionally, two distinct calibration techniques were used to improve the LLMs’ predictions, yielding performance gains of up to 8 percentage points across several models and benchmarks.
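As an illustration of these two ordering strategies, the sketch below assumes the model’s top-two candidate options are already known (for example, from a first pass over the question); the function names are illustrative and not taken from the paper.

```python
def amplify_bias_order(options: list[str], top2: tuple[str, str]) -> list[str]:
    """Place the two leading candidates in the first and last slots,
    the arrangement hypothesized to accentuate positional bias."""
    rest = [o for o in options if o not in top2]
    return [top2[0], *rest, top2[1]]

def mitigate_bias_order(options: list[str], top2: tuple[str, str]) -> list[str]:
    """Embed both leading candidates among the distractors so that
    neither occupies the first or last slot (assumes >= 2 distractors)."""
    rest = [o for o in options if o not in top2]
    return [rest[0], top2[0], top2[1], *rest[1:]]

# Example with a four-option MCQ whose top-two candidates are B and D:
options = ["option A", "option B", "option C", "option D"]
print(amplify_bias_order(options, ("option B", "option D")))
# ['option B', 'option A', 'option C', 'option D']
print(mitigate_bias_order(options, ("option B", "option D")))
# ['option A', 'option B', 'option D', 'option C']
```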
The research sets out to answer three questions: the extent of the sensitivity, i.e., to what degree LLMs are affected by the order of options in MCQs; the factors contributing to this sensitivity; and how LLMs’ robustness to option order can be enhanced. To answer the first question, experiments were run with GPT-4 and InstructGPT on five different MCQ benchmarks, revealing a sizable sensitivity gap of up to 75% in the zero-shot setting. Regarding the second question, the data suggest that positional bias drives the sensitivity: LLMs tend to favor particular placements when they are uncertain about the best choice among the top candidates. For the final question, the study showed that two distinct calibration techniques improved LLM performance by up to 8 percentage points.
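The post does not spell out which two calibration techniques were used, so the following is only a generic sketch in the same spirit: aggregate the model’s answers over several shuffled orderings and take a majority vote over option contents rather than slots. It reuses the hypothetical `query_model`, `format_prompt`, and `LETTERS` helpers from the earlier sketch.

```python
import itertools
from collections import Counter

def calibrated_answer(question: str, options: list[str], n_orders: int = 6) -> str:
    """Majority vote over the model's picks across several option orderings."""
    votes = Counter()
    # For brevity this takes the first n_orders permutations in lexicographic
    # order; sampling random permutations would work just as well.
    perms = itertools.islice(itertools.permutations(range(len(options))), n_orders)
    for perm in perms:
        reordered = [options[i] for i in perm]
        letter = query_model(format_prompt(question, reordered))
        votes[reordered[LETTERS.index(letter)]] += 1  # vote by option text, not slot
    return votes.most_common(1)[0][0]
```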
In conclusion, this study emphasizes the necessity of confronting LLMs’ sensitivity to prompt components and their arrangement. By examining the subtleties of their responses to reordered options in multiple-choice questions, it sheds light on the decision-making processes of LLMs, which can improve their usability and reliability in real-world settings.
Check out the pre-print paper. All credit for this research goes to the researchers on this project.