Large language models (LLMs) have become essential tools for many tasks, from translating articles to detecting financial fraud. Despite their impressive capabilities and versatility, these models sometimes generate inaccurate responses. Compounding this issue, LLMs can be overconfident in their incorrect answers or underconfident in their correct ones, making it challenging for users to gauge when a model can be trusted.

To address this, researchers typically calibrate machine-learning models so that their confidence levels align with their accuracy. A well-calibrated model should be less confident about incorrect predictions and more confident about correct ones. However, traditional calibration methods fall short for LLMs, because a single LLM is applied to many diverse tasks.
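One standard way to make "align confidence with accuracy" concrete (a general illustration, not something specific to this work) is the expected calibration error, which compares a model's average confidence to its accuracy within confidence bins. A minimal sketch in Python, using hypothetical prediction arrays:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Compare average confidence to accuracy within equal-width confidence bins.

    confidences: predicted probability of the chosen answer, shape (N,)
    correct:     1 if the prediction was right, else 0, shape (N,)
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight the gap by the fraction of samples in the bin
    return ece

# Example: a model that claims 90% confidence but is right only 60% of the time
conf = np.array([0.9, 0.9, 0.9, 0.9, 0.9])
hit = np.array([1, 1, 1, 0, 0])
print(expected_calibration_error(conf, hit))  # large gap -> poorly calibrated
```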

Researchers from MIT and the MIT-IBM Watson AI Lab have introduced a new calibration method designed explicitly for LLMs. The method, called Thermometer, builds a smaller auxiliary model that runs on top of an LLM to calibrate it. Thermometer is more efficient than existing methods, requiring less computational power while preserving the model's accuracy and producing better-calibrated responses on previously unseen tasks.

Thermometer helps users identify situations in which a model is overconfident about false predictions, so they can avoid deploying it in scenarios where it could fail. Traditional machine-learning models are typically calibrated for a single task, which is impractical for LLMs that perform many tasks: calibrating with a conventional method for one task might compromise the model's performance on another.

Calibrating an LLM usually involves sampling from the model multiple times to gather different predictions and then aggregating them to obtain better-calibrated confidence. Because LLMs have billions of parameters, this approach is computationally expensive. Thermometer, however, leverages a classical calibration method called temperature scaling to calibrate an LLM for new tasks efficiently.
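Temperature scaling itself is simple: the model's output logits are divided by a single scalar T before the softmax, which softens (T > 1) or sharpens (T < 1) the resulting confidence without changing which answer is ranked highest. A minimal illustration, not the authors' code:

```python
import numpy as np

def softmax(x):
    x = x - x.max()          # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum()

logits = np.array([4.0, 1.0, 0.5])   # hypothetical scores for three answer choices

print(softmax(logits))        # raw probabilities, roughly [0.93, 0.05, 0.03]
print(softmax(logits / 2.0))  # T = 2: same ranking, but lower confidence
```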

In this context, “temperature” refers to a scaling parameter that adjusts a model’s confidence to align with its prediction accuracy. The correct temperature is traditionally determined using a labelled validation dataset of task-specific examples. Since LLMs are often applied to new tasks, obtaining labelled datasets is nearly impossible. For example, a user deploying an LLM to answer customer questions about a new product likely lacks a dataset with such questions and answers.
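For reference, the classical recipe fits T by minimizing the negative log-likelihood of a labelled validation set while the underlying model stays frozen. The sketch below assumes cached validation logits and labels and uses PyTorch; it shows the conventional procedure, not Thermometer itself:

```python
import torch

# Hypothetical cached outputs of a frozen model on a labelled validation set
val_logits = torch.randn(500, 4)            # 500 multiple-choice examples, 4 options each
val_labels = torch.randint(0, 4, (500,))    # ground-truth answer indices

log_T = torch.zeros(1, requires_grad=True)  # optimize log(T) so T stays positive
optimizer = torch.optim.Adam([log_T], lr=0.05)

for _ in range(200):
    optimizer.zero_grad()
    T = log_T.exp()
    loss = torch.nn.functional.cross_entropy(val_logits / T, val_labels)
    loss.backward()
    optimizer.step()

print("fitted temperature:", log_T.exp().item())
```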

Instead, the researchers train an auxiliary model on top of the LLM to predict the temperature needed to calibrate it for new tasks. They train this Thermometer model using labelled datasets from a few representative tasks, after which it can generalise to new tasks in a similar category without additional labelled data.

A Thermometer model trained on multiple-choice question datasets, such as those with algebra and medical questions, could calibrate an LLM for answering geometry or biology questions. The Thermometer model accesses a small part of the LLM's inner workings to predict the right temperature for calibrating its predictions on a specific task. The technique does not require multiple training runs and only slightly slows the LLM, preserving its accuracy.
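The actual architecture and training objective are described in the paper; the sketch below only conveys the general idea of an auxiliary network that maps LLM hidden features to a predicted temperature, trained on labelled data from several tasks. All names, shapes, and the training loop here are illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

class TemperaturePredictor(nn.Module):
    """Illustrative auxiliary model: LLM hidden features -> positive temperature."""
    def __init__(self, feature_dim=4096, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Softplus(),  # keep the predicted temperature positive
        )

    def forward(self, features):
        # features: (batch, feature_dim) hidden states taken from the frozen LLM
        return self.net(features).squeeze(-1)

# Assumed training step: labelled examples drawn from several representative tasks.
# For each example we have the LLM's hidden features, its answer logits, and the label.
def training_step(predictor, optimizer, features, llm_logits, labels):
    optimizer.zero_grad()
    T = predictor(features).unsqueeze(-1)                     # per-example temperature
    loss = nn.functional.cross_entropy(llm_logits / T, labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# At deployment, a new task's unlabelled prompts are enough: the predictor outputs
# a temperature (for example, averaged over the task's examples) that is then used
# to rescale the frozen LLM's logits, as in ordinary temperature scaling.
```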

In comparisons with several baselines across multiple tasks, Thermometer consistently produced better-calibrated uncertainty measures while requiring significantly less computation. A Thermometer model trained for a smaller LLM can also be applied directly to calibrate a larger LLM within the same family.

In future work, the researchers aim to adapt Thermometer to more complex text-generation tasks and to apply the technique to even larger LLMs. They also plan to quantify the diversity and number of labelled datasets needed to train a Thermometer model so that it generalises to new tasks.

Source: MIT News
