Been experimenting with using a VLM to do GUI automation purely from screenshots for a few months now. No DOM parsing, no accessibility APIs, the model just looks at what's on screen and decides where to click. Wanted to share some things that surprised me.

The project is Mano-P, a 4B VLA model quantized to w4a16 that runs on Apple Silicon. On my M4 Pro (32GB) I'm getting about 476 tok/s prefill and 76 tok/s decode, peaking at around 4.3GB of memory. So each "think about the screen and decide what to do" cycle takes roughly 3-4 seconds, which honestly felt slow at first, but for background automation tasks it's fine.
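Back-of-envelope for where the 3-4 seconds goes, in case anyone wants to plug in their own numbers. The two throughput figures are the ones I measured above; the token counts (1200 prompt tokens for screenshot plus instruction, 80 output tokens) are placeholder assumptions for illustration, not measurements from Mano-P.

```python
PREFILL_TOK_PER_S = 476.0  # measured prefill throughput from the post
DECODE_TOK_PER_S = 76.0    # measured decode throughput from the post


def step_latency_s(prompt_tokens: int, output_tokens: int) -> float:
    """Estimate one screenshot-to-action cycle as prefill time plus decode time."""
    prefill_s = prompt_tokens / PREFILL_TOK_PER_S
    decode_s = output_tokens / DECODE_TOK_PER_S
    return prefill_s + decode_s


# Hypothetical token counts, chosen only to illustrate the arithmetic.
print(round(step_latency_s(prompt_tokens=1200, output_tokens=80), 2))  # 3.57
```

This ignores screenshot capture and the click itself, so treat it as a lower bound on the real cycle time.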

What actually works well: filling out forms, navigating between apps, opening files, basic web browsing. Stuff where the UI elements are reasonably sized and the task is straightforward.

What breaks: dense toolbars with tiny icons (looking at you, Excel ribbon), anything where two buttons look almost identical, and long multi-step flows where one wrong click early on cascades into total failure. We ended up needing a "verify after each click" step where the model re-screenshots to check if the expected thing happened. Without that, success rate on complex tasks was terrible.
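Rough shape of the verify-after-each-click loop, as a minimal sketch. Every name here (`run_with_verification`, `take_screenshot`, `propose_action`, `perform`, `looks_done`, `max_retries`) is a hypothetical stand-in that you pass in yourself; none of it is the actual Mano-P API, and the retry policy is just one plausible choice.

```python
from typing import Callable, Sequence, TypeVar

Screen = TypeVar("Screen")
Action = TypeVar("Action")


def run_with_verification(
    steps: Sequence[str],
    take_screenshot: Callable[[], Screen],
    propose_action: Callable[[Screen, str], Action],
    perform: Callable[[Action], None],
    looks_done: Callable[[Screen, str], bool],
    max_retries: int = 2,
) -> bool:
    """Run each step, re-screenshot after acting, and retry if the check fails.

    All callables are hypothetical hooks supplied by the caller.
    Returns False as soon as one step cannot be verified, so an early
    wrong click does not silently cascade through the remaining steps.
    """
    for step in steps:
        for _attempt in range(max_retries + 1):
            screen = take_screenshot()
            perform(propose_action(screen, step))
            # Fresh screenshot: did the expected thing actually happen?
            if looks_done(take_screenshot(), step):
                break
        else:
            return False  # retries exhausted for this step
    return True
```

The cost is an extra screenshot and model pass per step, so it roughly doubles the per-step time, but it stops a bad early click from wrecking the rest of the flow.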

The thing that surprised me most was how much the token pruning mattered. GUI screenshots are expensive in tokens. We developed a pruning method (GSPruning) that keeps spatially important tokens and cuts the rest. It gets about a 2-3x throughput improvement without destroying the spatial awareness the model needs to find UI elements. Without it the model was way too slow for interactive use.
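To make the general idea concrete, here is a toy sketch of score-based patch pruning that keeps each surviving token's grid position. To be clear, this is not GSPruning: the function name `prune_patch_tokens`, the per-patch `scores` input, and the `keep_ratio` value are all assumptions made up for illustration, and how the importance scores get computed is left out entirely.

```python
from typing import List, Sequence, Tuple


def prune_patch_tokens(
    scores: Sequence[Sequence[float]], keep_ratio: float = 0.4
) -> List[Tuple[int, int]]:
    """Keep the highest-scoring patches and return their (row, col) positions.

    `scores` is a hypothetical 2D grid of per-patch importance values.
    Returning grid coordinates (in row-major order) means the kept tokens
    still know where they sit on screen, which a click target needs.
    """
    flat = [(s, r, c) for r, row in enumerate(scores) for c, s in enumerate(row)]
    n_keep = max(1, round(len(flat) * keep_ratio))
    top = sorted(flat, key=lambda t: t[0], reverse=True)[:n_keep]
    return sorted((r, c) for _, r, c in top)


# Toy 2x3 grid: keep the three highest-scoring patches.
grid = [[0.1, 0.9, 0.2],
        [0.8, 0.05, 0.7]]
print(prune_patch_tokens(grid, keep_ratio=0.5))  # [(0, 1), (1, 0), (1, 2)]
```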

OSWorld benchmark: 58.2% success rate (specialized model category). Not amazing honestly, it means it still fails ~42% of the time on complex desktop tasks. But for the repetitive stuff I actually use it for, reliability is much higher.

In local mode the whole thing runs on-device with no cloud calls, which was the original motivation. Didn't want screenshots of my desktop going anywhere.

Repo if anyone wants to look: https://github.com/Mininglamp-AI/Mano-P (Apache 2.0)

Curious if anyone else has tried pure vision approaches for desktop automation. Most stuff I see here is text/code focused. Is there much interest in GUI agents running locally or is that too niche?
