CV-Arena: An Open Benchmark for Instructional Computer Vision Problem Solving with Human-AI Collaborative Preferences

Fangzhou Lin1,2,3, Peiran Li1, Lingyu Xu2, Wenjing Chen1, Qianwen Ge4, Shuo Xing1,
Mingyang Wu1, Xiangbo Gao1, Siyuan Yang1, Kazunori Yamada3, Ziming Zhang2,
Haichong Zhang2, Zhen Dong5,6, Ming-Hsuan Yang7, Zhengzhong Tu1★

1Texas A&M University  ·  2Worcester Polytechnic Institute  ·  3Tohoku University  ·  4Georgia Tech  ·  5NVIDIA  ·  6UCSB  ·  7UC Merced

★ Corresponding author: tzz@tamu.edu

CV-Arena task taxonomy
CV-Arena covers 16 instruction-based visual task types across restoration, physical interaction, semantic manipulation, geometric / structural control, and typography / UI restoration — all on high-resolution real images.

Abstract

Instruction-guided image editing is becoming a general interface for visual work, yet existing benchmarks still focus largely on narrow appearance edits and do not fully capture the diversity of real-image tasks in professional workflows. Here, we define instructional computer vision problem solving as a broader formulation of image editing: given a real input image and a natural-language instruction, a system must produce an edited output that realizes the requested transformation while satisfying explicit preservation, geometric, physical, and usability constraints.

We introduce CV-Arena, an open benchmark designed to evaluate this capability at professional scale. CV-Arena contains 12K high-resolution real-image instruction pairs spanning 16 instruction-based visual task types, constructed using CogRetriever, a dual-track retrieval-and-curation pipeline that combines targeted web search, agentic query refinement, verification, and traceability.

To evaluate models at scale while preserving human fidelity, we propose Active Elo, a human-AI collaborative preference protocol that leverages CV-Judge — a logic-gated, multi-dimensional VLM evaluator — to reject clear failures and resolve high-confidence comparisons, and to route close, high-quality comparisons to expert raters. Mixed human and AI supervision is then aggregated through reliability-weighted Elo updates.

Our comprehensive evaluation of 21 systems, including proprietary, open-source, and agentic models, on CV-Arena reveals persistent gaps in instruction adherence, physical reasoning, structural control, and fine-grained detail preservation. We further develop CV-Agent, a lightweight agentic model that combines planning, editing, and verification, and demonstrate that closed-loop reasoning is a promising direction for professional-grade instruction-following visual editing.

CogRetriever: Dual-Track Data Curation

CV-Arena is built from open-domain real images at ≥ 2048² resolution through CogRetriever, a dual-track pipeline that combines targeted web search, agentic query refinement, and verification. Both tracks share an end-to-end verification stage so that every instruction–image pair is traceable to its source.

CogRetriever pipeline
CogRetriever: agentic and manual tracks for retrieving, refining, and verifying real-image editing tasks.
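
As a rough illustration of what a shared verification stage might check, the sketch below filters candidate pairs by resolution and provenance. The `Candidate` schema, the field names, and the individual checks are assumptions made for this example, and reading "≥ 2048²" as a total pixel count is also an assumption; none of this is taken from the CogRetriever implementation.

```python
from dataclasses import dataclass

# Assumption: the resolution threshold is interpreted as a total pixel count.
MIN_PIXELS = 2048 * 2048


@dataclass(frozen=True)
class Candidate:
    """One retrieved instruction-image pair (hypothetical schema)."""
    image_id: str
    width: int
    height: int
    instruction: str
    source_url: str  # provenance kept so the pair stays traceable
    track: str       # "agentic" or "manual"


def verify(c: Candidate) -> list[str]:
    """Return the reasons a candidate fails verification (empty list = pass)."""
    problems = []
    if c.width * c.height < MIN_PIXELS:
        problems.append("resolution below threshold")
    if not c.instruction.strip():
        problems.append("empty instruction")
    if not c.source_url.startswith(("http://", "https://")):
        problems.append("missing or untraceable source")
    if c.track not in ("agentic", "manual"):
        problems.append("unknown track")
    return problems


def curate(candidates):
    """Keep only the candidates that pass every check."""
    return [c for c in candidates if not verify(c)]
```

Returning the list of failure reasons, rather than a bare boolean, keeps a record of why each pair was dropped, which fits the traceability goal described above.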

Active Elo with CV-Judge

Pairwise model comparison is the de facto standard for evaluating generative systems, but exhaustive expert annotation does not scale. Active Elo routes comparisons through CV-Judge, a logic-gated multi-dimensional VLM evaluator that rejects clear failures and resolves high-confidence wins. Only close, high-quality pairs are escalated to expert raters; mixed supervision is then aggregated via reliability-weighted Elo updates.
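
The routing decision can be sketched as a small gate. Everything concrete here is an assumption for illustration: the `JudgeResult` fields, the single aggregated score per output, the `margin` threshold, and the rule that a pair with exactly one gated failure is resolved automatically. The paper's actual CV-Judge logic may differ.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    REJECT = "reject"        # both outputs are clear failures
    AUTO_RESOLVE = "auto"    # the judge decides without human input
    ESCALATE = "expert"      # close, high-quality pair goes to expert raters


@dataclass
class JudgeResult:
    """Hypothetical CV-Judge output for one pair of edited images."""
    a_passes_gate: bool  # output A clears the hard failure checks
    b_passes_gate: bool
    score_a: float       # aggregated multi-dimensional score in [0, 1]
    score_b: float


def route(r: JudgeResult, margin: float = 0.15):
    """Return (route, winner), where winner is "A", "B", or None."""
    if not (r.a_passes_gate or r.b_passes_gate):
        return Route.REJECT, None
    if r.a_passes_gate != r.b_passes_gate:
        # Exactly one output failed the gate: treat the other as the winner.
        return Route.AUTO_RESOLVE, "A" if r.a_passes_gate else "B"
    if abs(r.score_a - r.score_b) >= margin:
        # Both are acceptable and the gap is large enough to trust the judge.
        return Route.AUTO_RESOLVE, "A" if r.score_a > r.score_b else "B"
    return Route.ESCALATE, None
```

A larger `margin` sends more pairs to experts and fewer to the automatic path, so it acts as the knob trading annotation cost against reliance on the judge.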

Active Elo bootstrap protocol
Active Elo protocol: CV-Judge filters and resolves; experts disambiguate close calls.
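
One simple way to realize a reliability-weighted Elo update is to scale the standard Elo step by a per-source weight, so that expert votes move ratings more than judge votes. The sketch below does exactly that; the weights in `RELIABILITY`, the K-factor of 32, and the initial rating of 1000 are hypothetical values chosen for the example, not the settings used in the paper.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def weighted_elo_update(r_a, r_b, outcome_a, reliability, k=32.0):
    """Update both ratings after one comparison.

    outcome_a: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    reliability: weight in [0, 1] that scales the size of the step.
    """
    delta = k * reliability * (outcome_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


# Hypothetical per-source weights: expert votes count fully, judge votes less.
RELIABILITY = {"expert": 1.0, "cv_judge": 0.6}


def run_elo(comparisons, initial=1000.0):
    """comparisons: iterable of (system_a, system_b, outcome_a, source)."""
    ratings = {}
    for a, b, outcome_a, source in comparisons:
        r_a = ratings.get(a, initial)
        r_b = ratings.get(b, initial)
        ratings[a], ratings[b] = weighted_elo_update(
            r_a, r_b, outcome_a, RELIABILITY[source]
        )
    return ratings
```

Because each update adds to one rating exactly what it subtracts from the other, the total rating mass stays constant regardless of the weights.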

Leaderboard

Full 21-system leaderboard across four evaluation settings (Active Elo, ours, and three single-source baselines), reproduced from Table 4 of the paper.

Win rate matrix across 21 systems
Pairwise win-rate matrix across 21 systems on CV-Arena.
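
For readers who want to build a matrix of this kind from their own comparison logs, a minimal sketch follows. It assumes each record is a `(winner, loser)` pair and that ties are simply left out; the function name and input format are illustrative and not taken from the CV-Arena code.

```python
from collections import defaultdict


def win_rate_matrix(systems, results):
    """Build a pairwise win-rate matrix.

    systems: ordered list of system names (defines row and column order).
    results: iterable of (winner, loser) pairs; ties are assumed excluded.
    Returns matrix[i][j] = fraction of i-vs-j comparisons won by system i,
    or None on the diagonal and for pairs that were never compared.
    """
    wins = defaultdict(int)
    for winner, loser in results:
        wins[(winner, loser)] += 1

    matrix = []
    for a in systems:
        row = []
        for b in systems:
            total = wins[(a, b)] + wins[(b, a)]
            if a == b or total == 0:
                row.append(None)
            else:
                row.append(wins[(a, b)] / total)
        matrix.append(row)
    return matrix
```

For any compared pair, the entries at `[i][j]` and `[j][i]` sum to 1, which is a quick sanity check on the logged results.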

Visual Comparisons

Qualitative comparisons across CV-Arena tasks, contrasting outputs from leading proprietary, open-source, and agentic editors on the same instruction–image pairs.

Main visual comparison across 21 systems
Representative outputs across 21 systems on CV-Arena, spanning restoration, semantic manipulation, and physical interaction tasks.

Additional visual comparisons
Additional comparisons highlighting geometric / structural control and typography / UI restoration cases.

Open Challenges

Across 21 evaluated systems, CV-Arena surfaces recurring difficulties in physical reasoning, structural and geometric control, and fine-grained detail preservation.

Representative open challenges on CV-Arena
Representative open challenges on CV-Arena across instruction adherence, physical plausibility, geometric control, and detail preservation.

Citation

@article{lin2026cvarena,
  title   = {CV-Arena: An Open Benchmark for Instructional Computer Vision Problem Solving with Human-AI Collaborative Preferences},
  author  = {Lin, Fangzhou and Li, Peiran and Xu, Lingyu and Chen, Wenjing and Ge, Qianwen and Xing, Shuo and Wu, Mingyang and Gao, Xiangbo and Yang, Siyuan and Yamada, Kazunori and Zhang, Ziming and Zhang, Haichong and Dong, Zhen and Yang, Ming-Hsuan and Tu, Zhengzhong},
  journal = {arXiv preprint},
  year    = {2026}
}