Researchers have found a simple but effective fix for a problem that can slow large language models such as ChatGPT.
Extended human-AI conversations can degrade the performance of chatbots powered by large language models such as ChatGPT.
Researchers from MIT and other institutions have identified an unexpected root cause of this issue and developed a straightforward solution that lets a chatbot sustain continuous conversation without crashing or slowing down.
Some existing approaches evict the earliest pieces of data when the cache exceeds its capacity, which can cause the model to fail.
The researchers' technique ensures that these first data points remain in memory, enabling a chatbot to continue a conversation indefinitely. Their method, StreamingLLM, allows a model to remain efficient even in conversations exceeding 4 million words, and proved more than 22 times faster than an alternative that avoids crashing by constantly recomputing parts of earlier conversations.
This capability could let a chatbot hold extended conversations throughout the workday without frequent reboots, enabling efficient AI assistants for tasks such as copywriting, editing, or code generation.
Large language models convert data, such as the words in a user query, into representations called tokens. Many models use an attention mechanism that operates on these tokens to generate new text.
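To make the idea of tokenization concrete, here is a toy sketch. Real LLM tokenizers use learned subword vocabularies; the function and vocabulary below are purely illustrative and not any model's actual tokenizer.

```python
def toy_tokenize(text, vocab):
    """Map each whitespace-separated word to an integer token ID,
    assigning a new ID to any word not yet in the vocabulary."""
    tokens = []
    for word in text.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)
        tokens.append(vocab[word])
    return tokens

vocab = {}
ids = toy_tokenize("how do attention caches work", vocab)
print(ids)  # [0, 1, 2, 3, 4]
```

A repeated word reuses its existing ID, so `toy_tokenize("work work", vocab)` on the same vocabulary would return the same ID twice.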
An AI chatbot typically generates new text by referencing text it has recently seen, storing those tokens in memory as a KV cache for later use. The attention mechanism builds a grid over all tokens in the cache, called an "attention map," which records how strongly each token relates to every other. Understanding these relationships is a key part of how large language models produce human-like text. But as the cache grows, the attention map grows with it, slowing computation.
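The relationship between the KV cache and the attention map can be sketched as follows. This is a minimal illustration under assumed names (`KVCache`, `attention_map`), not the internals of any particular model: each cached token contributes a key vector, and the map is the softmax of the query-key dot products.

```python
import numpy as np

class KVCache:
    """Minimal illustrative KV cache: one key and value vector per token."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def attention_map(self, queries):
        """Softmax of Q @ K^T: each row shows how strongly one token
        attends to every token currently in the cache."""
        K = np.stack(self.keys)
        Q = np.stack(queries)
        scores = Q @ K.T / np.sqrt(K.shape[1])
        scores -= scores.max(axis=1, keepdims=True)  # numerical stability
        weights = np.exp(scores)
        return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
cache = KVCache()
for _ in range(4):  # cache four tokens
    cache.append(rng.normal(size=8), rng.normal(size=8))

amap = cache.attention_map(cache.keys)
print(amap.shape)  # (4, 4)
```

Note that the map is an n-by-n grid over the n cached tokens, so its size grows quadratically with the cache, which is why a large cache slows computation.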
If encoding content requires more tokens than the cache can hold, the model's performance drops. One popular model has a capacity of 4,096 tokens, while a typical academic paper contains around 10,000. To work around this, researchers use a "sliding cache" that evicts the oldest tokens to make room for new ones. However, the model's performance often plummets as soon as that first token is evicted, rapidly degrading the quality of newly generated text.
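The sliding-cache eviction policy can be sketched in a few lines. The class name and capacity are illustrative; the point is only the behavior: once the window is full, every new token pushes out the oldest one, including the very first.

```python
from collections import deque

class SlidingCache:
    """Fixed-capacity token window; the oldest token is evicted
    automatically when a new one arrives (deque maxlen behavior)."""
    def __init__(self, capacity):
        self.tokens = deque(maxlen=capacity)

    def add(self, token):
        self.tokens.append(token)

cache = SlidingCache(capacity=4)
for t in range(6):           # stream in 6 tokens with room for only 4
    cache.add(t)
print(list(cache.tokens))    # [2, 3, 4, 5] -- tokens 0 and 1 were evicted
```

It is precisely the eviction of token 0 here that, per the researchers' observation, causes generation quality to collapse.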
The researchers discovered that keeping the initial token in the sliding cache lets the model maintain its performance even when the cache's capacity is exceeded. In their recent paper, they also identify the reason for this phenomenon.
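The fix amounts to a small change in the eviction rule: pin the first token(s) in place and slide the window over only the remaining slots. The sketch below illustrates that idea; the class and parameter names are hypothetical and do not reflect StreamingLLM's actual API.

```python
class SinkCache:
    """Illustrative cache that always retains the first `num_sinks`
    tokens and evicts the oldest token after them when full."""
    def __init__(self, capacity, num_sinks=1):
        self.capacity = capacity
        self.num_sinks = num_sinks
        self.tokens = []

    def add(self, token):
        self.tokens.append(token)
        if len(self.tokens) > self.capacity:
            # Evict the oldest non-pinned token;
            # positions 0 .. num_sinks-1 are never removed.
            del self.tokens[self.num_sinks]

cache = SinkCache(capacity=4, num_sinks=1)
for t in range(6):
    cache.add(t)
print(cache.tokens)  # [0, 3, 4, 5] -- token 0 survives the sliding window
```

Compared with the plain sliding cache above, the only difference is which slot gets evicted, yet keeping that first position is what lets the model's quality hold up indefinitely.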