ProSoftArena

Benchmarking Hierarchical Capabilities of Multimodal Agents in Professional Software Environments

Jiaxin Ai, Yukang Feng, Fanrui Zhang, Jianwen Sun, Zizhen Li, Chuanhao Li,
Yifan Chang, Wenxiao Wu, Ruoxi Wang, Mingliang Zhai, Kaipeng Zhang†

†Corresponding Author: zhangkaipeng@pjlab.org.cn
ProSoftArena_overview

🌈 ProSoftArena. We establish the first hierarchical taxonomy of agent capabilities in professional software environments and curate a comprehensive benchmark covering 6 disciplines, 20 subfields, and 13 core professional applications. We construct a VM-based real computer environment for reproducible evaluation and uniquely incorporate a human-in-the-loop evaluation paradigm.

News

✨ [12/29/2025] We release our paper and project page. The data and code will be openly available soon!

Data Samples

Samples

Representative task samples across six core domains in ProSoftArena.

Evaluation Framework

Framework

Automated and Human-in-the-loop Evaluation Framework.

Experimental Results

Main Results

L4 Results

BibTeX