AI-NEWS · October 26, 2024

Microsoft Launches New Model OmniParser: Understanding Screenshot Content Instantly with GPT-4V

## GPT-4V and OmniParser: Improving Image Understanding

### Background
GPT-4V is a powerful model capable of understanding image content and performing tasks based on images. However, it struggles to interpret graphical user interfaces (GUIs) accurately and often behaves as if it cannot see the screen, clicking or tapping in the wrong place.

### Introduction to OmniParser
OmniParser, a new tool from Microsoft, addresses GPT-4V's limitations by translating screenshots into a structured, text-based representation that GPT-4V can understand. It combines the outputs of an interactive icon detection model, an icon description model, and an OCR (Optical Character Recognition) module to create a DOM-like representation of the UI.

### Key Features
1. **Interactive Icon Detection**: OmniParser identifies all interactive icons and buttons on the screen, marking them with unique IDs.
2. **Functional Description**: It describes each icon's function in text form.
3. **Text Recognition**: It extracts on-screen text to provide context.
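
The three features above can be pictured as one merge step that produces the DOM-like text handed to GPT-4V. The following is a minimal sketch under stated assumptions, not OmniParser's actual API: the names `UIElement`, `build_elements`, and `to_prompt`, the pixel box format, and the output line format are all hypothetical, and the detector, description model, and OCR module are assumed to have already run.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Hypothetical box format for illustration: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class UIElement:
    element_id: int  # unique ID the language model can refer to
    kind: str        # "icon" or "text"
    box: Box
    content: str     # icon description, or the text read by OCR


def build_elements(
    icon_boxes: Sequence[Box],
    icon_descriptions: Sequence[str],
    ocr_results: Sequence[Tuple[Box, str]],
) -> List[UIElement]:
    """Merge detector, description, and OCR outputs into one ID-tagged list."""
    if len(icon_boxes) != len(icon_descriptions):
        raise ValueError("each detected icon needs exactly one description")
    elements: List[UIElement] = []
    for box, description in zip(icon_boxes, icon_descriptions):
        elements.append(UIElement(len(elements), "icon", box, description))
    for box, text in ocr_results:
        elements.append(UIElement(len(elements), "text", box, text))
    return elements


def to_prompt(elements: Sequence[UIElement]) -> str:
    """Serialize the elements as plain text, one line per element."""
    return "\n".join(
        f"[{e.element_id}] {e.kind} at {e.box}: {e.content}" for e in elements
    )


if __name__ == "__main__":
    elements = build_elements(
        icon_boxes=[(10, 10, 42, 42)],
        icon_descriptions=["settings gear"],
        ocr_results=[((60, 12, 180, 40), "Wi-Fi")],
    )
    print(to_prompt(elements))
```

With a listing like this, the model can answer with an element ID rather than guessing raw pixel coordinates, which is the kind of grounding the article describes.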

### Performance Evaluation
- **ScreenSpot Test**: Significantly improved GPT-4V’s accuracy, reaching a reported average of about 73% and outperforming models trained specifically for graphical interfaces.
- **Mind2Web Test**: Improved web browsing task performance, surpassing GPT-4V's HTML-assisted accuracy.
- **AITW Test**: Significantly improved performance on mobile navigation tasks.

### Shortcomings
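1. **Confusion with Repetitive Icons/Text**: When the same icon or text appears several times on screen, OmniParser can confuse the instances; more detailed descriptions are needed to tell them apart.
2. **Box Drawing Accuracy**: Bounding boxes are sometimes misaligned, which can lead to clicks that miss the intended button.
3. **Icon Misunderstanding**: An icon can be misinterpreted when described in isolation; contextual information is needed for an accurate description.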
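
The bounding-box shortcoming can be illustrated with a little arithmetic. This is a rough sketch that assumes the agent clicks the center of the predicted box (a common convention, not something the article states); `click_point` and `hits` are hypothetical helper names, and the coordinates are made up.

```python
from typing import Tuple

# Hypothetical box format: (left, top, right, bottom) in pixels.
Box = Tuple[float, float, float, float]
Point = Tuple[float, float]


def click_point(box: Box) -> Point:
    """Return the center of a box, assumed here to be where the agent clicks."""
    left, top, right, bottom = box
    return ((left + right) / 2, (top + bottom) / 2)


def hits(point: Point, target: Box) -> bool:
    """Check whether a click lands inside the true button area."""
    x, y = point
    left, top, right, bottom = target
    return left <= x <= right and top <= y <= bottom


if __name__ == "__main__":
    true_button = (100, 100, 140, 120)
    well_aligned = (98, 99, 141, 122)   # predicted box close to the button
    misaligned = (130, 100, 200, 120)   # predicted box drifts to the right

    print(hits(click_point(well_aligned), true_button))
    print(hits(click_point(misaligned), true_button))
```

Under this assumption, a box that overlaps the button only partially can still place its center outside the button, so the click misses even though the element was detected.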

### Future Prospects
Researchers continue to refine OmniParser, with the goal of making it a dependable companion to GPT-4V for GUI tasks.

### Key Takeaways
1. **Enhanced Task Execution**: OmniParser helps GPT-4V better understand screen content.
2. **Proven Effectiveness**: It improved GPT-4V's results on the ScreenSpot, Mind2Web, and AITW tests.
3. **Areas for Improvement**: Continuous improvement is underway, and the tool's future looks promising.

[Paper Link](#)

---
**Copyright AIbase Base 2024**

Source: https://www.aibase.com/news/12748