Microsoft's multimodal large language model (MLLM) Kosmos-1 can process both visual and linguistic inputs and can be applied to a range of tasks, including image captioning, visual question answering, and more.
ChatGPT has popularised LLMs such as the GPT models and their ability to turn a text prompt or input into an output.
In a paper titled "Language Is Not All You Need: Aligning Perception with Language Models", Microsoft's AI researchers claim that while users are impressed by these conversational skills, LLMs still struggle with multimodal inputs such as visual and audio cues. The study proposes that multimodal perception, or knowledge acquisition and "grounding" in the real world, is necessary to move from ChatGPT-like capabilities to artificial general intelligence (AGI).
Last year, Alphabet-owned robotics company Everyday Robots and Google's Brain Team demonstrated the relevance of grounding in getting robots to follow human descriptions of physical tasks by utilising LLMs. The approach consisted of grounding the language model in tasks that were feasible in a given real-world setting. Likewise, Microsoft employed grounding in its Prometheus AI model to combine OpenAI's GPT models with real-world feedback from Bing's search ranking and search results.
Microsoft claims that its Kosmos-1 MLLM can perceive general modalities, follow instructions (zero-shot learning), and learn in context (few-shot learning). The study states, "The objective is to align perception with LLMs such that the models can see and talk."
The paper's examples show how MLLMs like Kosmos-1 could automate tasks in a range of situations: telling a Windows 10 user how to restart their computer (or complete any other task from a visual prompt), reading a web page to initiate a web search, interpreting health data from a device, captioning images, and so on. The model cannot, however, analyse videos.
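Kosmos-1 itself has no public API, but a short Python sketch can illustrate how tasks like these reduce to prompts that interleave images with text. The ImageSegment and TextSegment types, the prompt wording, and the model.generate() call below are illustrative assumptions, not Microsoft's interface:

# Hypothetical sketch of how a Kosmos-1-style multimodal prompt might be built
# for image captioning or visual question answering. Everything here is an
# assumption for illustration; it is not the published Kosmos-1 interface.
from dataclasses import dataclass
from typing import List, Union

@dataclass
class ImageSegment:
    path: str  # image handed to the vision encoder

@dataclass
class TextSegment:
    text: str  # plain text interleaved with the image embeddings

Prompt = List[Union[ImageSegment, TextSegment]]

def captioning_prompt(image_path: str) -> Prompt:
    """Zero-shot captioning: the image followed by a short instruction."""
    return [ImageSegment(image_path), TextSegment("Describe this image:")]

def vqa_prompt(image_path: str, question: str) -> Prompt:
    """Visual question answering: the image, the question, and an answer cue."""
    return [ImageSegment(image_path), TextSegment(f"Question: {question} Answer:")]

# Usage with some hypothetical model object exposing generate():
# model.generate(vqa_prompt("settings_screenshot.png", "How do I restart this PC?"))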
The researchers also evaluated Kosmos-1's performance on the Raven IQ test. The results revealed a "significant performance difference between the current model and the average level of adults." Still, the model's accuracy suggested that MLLMs may be able to "perceive abstract conceptual patterns in a nonverbal context" by aligning perception with language models.
With Microsoft's intention to leverage Transformer-based language models to make Bing a stronger competitor to Google Search, the research into "web page question answering" seems intriguing.
Conclusion
In the paper, the researchers introduce Kosmos-1, a multimodal large language model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). In particular, they train Kosmos-1 from scratch on large-scale multimodal web corpora, including arbitrarily interleaved text and images, image-caption pairs, and text-only data.
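The training mix is easiest to picture with a few toy records. The field names and structure below are assumptions made for illustration, not the actual dataset schema used by Microsoft:

# Toy examples of the three kinds of web-scale training records the paper
# describes: text-only data, image-caption pairs, and documents where text
# and images are arbitrarily interleaved. The field names are assumptions.
text_only_record = {
    "type": "text",
    "content": "Large language models are trained on raw web text ...",
}

image_caption_record = {
    "type": "image-caption",
    "image": "cat_on_sofa.jpg",
    "caption": "A grey cat curled up on a red sofa.",
}

interleaved_record = {
    "type": "interleaved",
    # within one web document, text and images may appear in any order
    "segments": [
        {"text": "The recipe starts with three ingredients:"},
        {"image": "ingredients.jpg"},
        {"text": "Mix them together and bake for twenty minutes."},
        {"image": "finished_cake.jpg"},
    ],
}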
The researchers evaluate a range of settings, including zero-shot, few-shot, and multimodal chain-of-thought prompting, across a wide variety of tasks without any fine-tuning or gradient updates (a sketch of the three prompt formats follows the list below). Experimental results show that Kosmos-1 performs impressively on
(i) language understanding, generation, and even OCR-free NLP (directly fed with document images),
(ii) perception-language tasks like multimodal dialogue, image captioning, and visual question answering, and
(iii) vision tasks like image recognition with descriptions (specifying classification via text instructions).
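The three evaluation settings differ mainly in how the prompt is laid out. Below is a minimal sketch of each; the "<image:...>" placeholder, the prompt wording, and the model.generate() call are assumptions for illustration rather than the paper's actual format:

# Minimal sketch of the three prompting settings evaluated in the paper.
# The "<image:...>" marker and all wording are illustrative assumptions.

# Zero-shot: a single image plus an instruction, with no worked examples.
zero_shot = "<image:test.jpg> Question: What animal is shown? Answer:"

# Few-shot (in-context learning): a few worked examples precede the test input.
few_shot = (
    "<image:demo1.jpg> Question: What animal is shown? Answer: a dog. "
    "<image:demo2.jpg> Question: What animal is shown? Answer: a parrot. "
    "<image:test.jpg> Question: What animal is shown? Answer:"
)

# Multimodal chain-of-thought: first elicit a rationale grounded in the image,
# then condition the final answer on that generated rationale.
rationale_prompt = "<image:test.jpg> Describe this picture in detail:"
# rationale = model.generate(rationale_prompt)  # hypothetical model call
# answer_prompt = f"<image:test.jpg> {rationale} Question: What animal is shown? Answer:"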
Furthermore, the researchers show that MLLMs can benefit from cross-modal transfer, i.e., transferring knowledge from language to multimodal settings and from multimodal settings back to language. Finally, they also introduce a dataset based on the Raven IQ test, which diagnoses the nonverbal reasoning capability of MLLMs.
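One plausible way to pose a Raven-style item to an MLLM is to score each candidate completion of the puzzle matrix and keep the best one. The sketch below assumes a hypothetical score_candidate() helper and may not match the paper's exact evaluation protocol:

# Hedged sketch: treat a Raven item as picking the candidate image the model
# rates as the most plausible completion of the puzzle matrix.
# score_candidate() is a hypothetical stand-in for querying the model.
from typing import List

def score_candidate(matrix_image: str, candidate_image: str) -> float:
    """Placeholder: return the model's confidence that candidate_image
    correctly completes matrix_image (e.g. the probability of an
    affirmative continuation). Not implemented here."""
    raise NotImplementedError

def solve_raven_item(matrix_image: str, candidates: List[str]) -> str:
    """Pick the candidate the model scores as the most likely completion."""
    return max(candidates, key=lambda c: score_candidate(matrix_image, c))

# Usage:
# answer = solve_raven_item("matrix.png", ["a.png", "b.png", "c.png", "d.png"])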