Microsoft has released a detailed paper on OmniParser, a screen-parsing method for pure vision-based GUI agents. The paper was written by Yadong Lu, Jianwei Yang, Yelong Shen and Ahmed Awadallah.

The recent success of large vision-language models shows excellent potential for driving agent systems that operate on user interfaces. However, the power of multimodal models like GPT-4V as general agents across operating systems and applications remains largely underestimated because of the lack of a robust screen-parsing technique that can reliably identify interactable icons within the user interface, understand the semantics of the various elements in a screenshot, and accurately associate the intended action with the corresponding region on the screen. OmniParser fills these gaps: it is a comprehensive method for parsing user interface screenshots into structured elements.

Developing the model

The team first curated an interactable icon detection dataset from popular web pages and an icon description dataset. These datasets were used to finetune two specialized models: a detection model that parses interactable regions on the screen and a caption model that extracts the functional semantics of the detected elements. OmniParser significantly improves GPT-4V's performance on the ScreenSpot benchmark, and on the Mind2Web and AITW benchmarks, OmniParser with screenshot-only input outperforms GPT-4V baselines that require additional information beyond the screenshot.
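To make the two-stage idea concrete, here is a minimal sketch of such a parsing pipeline: detect candidate interactable regions, caption each crop, and emit a structured element list a downstream vision-language model can reason over. The detector weights path and the choice of a YOLO-style detector and BLIP-style captioner are illustrative assumptions, not the released OmniParser models.

```python
# Hedged sketch of a screen-parsing pipeline in the spirit of OmniParser.
# "interactable_icon_detector.pt" is a hypothetical finetuned checkpoint.
from PIL import Image
from ultralytics import YOLO
from transformers import BlipProcessor, BlipForConditionalGeneration

detector = YOLO("interactable_icon_detector.pt")  # assumed finetuned region detector
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def parse_screenshot(path: str) -> list[dict]:
    """Return structured elements: bounding box plus a functional description."""
    image = Image.open(path).convert("RGB")
    detections = detector(image)[0]
    elements = []
    for i, box in enumerate(detections.boxes.xyxy.tolist()):
        x1, y1, x2, y2 = map(int, box)
        crop = image.crop((x1, y1, x2, y2))
        inputs = processor(images=crop, return_tensors="pt")
        caption_ids = captioner.generate(**inputs, max_new_tokens=20)
        caption = processor.decode(caption_ids[0], skip_special_tokens=True)
        elements.append({"id": i, "bbox": [x1, y1, x2, y2], "description": caption})
    return elements
```

The structured list (together with a screenshot overlaid with numbered boxes) can then be passed to GPT-4V or another vision-language model, which only has to choose an element ID and an action rather than ground pixels itself.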

Curated dataset

The team curated an interactable icon detection dataset containing 67k unique screenshot images, each labelled with bounding boxes of interactable icons derived from the DOM tree. They first took a uniform sample of 100k popular, publicly available URLs from the ClueWeb dataset and collected the bounding boxes of the interactable regions of each webpage from its DOM tree. They also collected 7k icon-description pairs to finetune the caption model.
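A small sketch of how such DOM-derived labels could be harvested is shown below. The selector list and the use of Playwright are illustrative assumptions; the article does not specify the exact curation tooling.

```python
# Hedged sketch: screenshot a page and record bounding boxes of elements the DOM
# marks as interactable (links, buttons, form controls, click handlers).
from playwright.sync_api import sync_playwright

INTERACTABLE_SELECTOR = "a, button, input, select, textarea, [role=button], [onclick]"

def collect_boxes(url: str, out_png: str) -> list[dict]:
    """Save a screenshot and return bounding boxes of likely interactable elements."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 800})
        page.goto(url, wait_until="networkidle")
        page.screenshot(path=out_png)
        boxes = []
        for el in page.query_selector_all(INTERACTABLE_SELECTOR):
            box = el.bounding_box()  # None for hidden elements
            if box and box["width"] > 0 and box["height"] > 0:
                boxes.append(box)
        browser.close()
        return boxes

# Example: collect_boxes("https://example.com", "screenshot.png")
```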

Plugin-ready

To further demonstrate that OmniParser is a plugin choice for off-the-shelf vision-language models, the team showcased its performance combined with the recently announced vision-language models Phi-3.5-V and Llama-3.2-V. Their findings show that the finetuned interactable region detection (ID) model significantly improves task performance compared with the Grounding DINO model without ID, with local semantics, across all subcategories for GPT-4V, Phi-3.5-V and Llama-3.2-V. In addition, the local semantics of icon functionality significantly help the performance of every vision-language model. A sketch of why the approach is model-agnostic follows below.
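Because OmniParser's output is just structured text plus an annotated screenshot, the same parse can be handed to any downstream vision-language model. The prompt format below is an illustrative assumption, not the exact one used in the paper.

```python
# Hedged sketch of the "plugin" idea: format the parsed elements into a prompt
# that any chat-capable vision-language model can consume.
def build_agent_prompt(task: str, elements: list[dict]) -> str:
    lines = [f"Task: {task}", "Detected UI elements (id, bbox, description):"]
    for el in elements:
        lines.append(f"  [{el['id']}] bbox={el['bbox']} -> {el['description']}")
    lines.append("Respond with the id of the element to act on and the action to take.")
    return "\n".join(lines)
```

Swapping the backbone model (GPT-4V, Phi-3.5-V, Llama-3.2-V) only changes the client call that consumes this prompt and the annotated screenshot; the parsing stage stays the same.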
