CV-Arena: An Open Benchmark for Instructional Computer Vision Problem Solving

Abstract

Instruction-guided image editing is becoming a general interface for visual work, yet existing benchmarks still focus largely on narrow appearance edits and do not fully capture the diversity of real-image tasks in professional workflows. Here, we define instructional computer vision problem solving as a broader formulation of image editing: given a real input image and a natural-language instruction, a system must produce an edited output that realizes the requested transformation while satisfying explicit preservation, geometric, physical, and usability constraints.

We introduce CV-Arena, an open benchmark designed to evaluate this capability at professional scale. CV-Arena contains 12K high-resolution real-image instruction pairs spanning 16 instruction-based visual task types, constructed using CogRetriever, a dual-track retrieval-and-curation pipeline that combines targeted web search, agentic query refinement, verification, and traceability.

To evaluate models at scale while preserving human fidelity, we propose Active Elo, a human-AI collaborative preference protocol. Active Elo uses CV-Judge, a logic-gated, multi-dimensional VLM evaluator, to reject clear failures and resolve high-confidence comparisons, and routes close, high-quality comparisons to expert raters. Mixed human and AI supervision is then aggregated through reliability-weighted Elo updates.

Our evaluation of 21 systems on CV-Arena, including proprietary, open-source, and agentic models, reveals persistent gaps in instruction adherence, physical reasoning, structural control, and fine-grained detail preservation. We further develop CV-Agent, a lightweight agentic model that combines planning, editing, and verification, and demonstrate that closed-loop reasoning is a promising direction for professional-grade instruction-following visual editing.

CogRetriever: Dual-Track Data Curation

CV-Arena is built from open-domain real images at ≥ 2048² resolution through CogRetriever — a dual-track pipeline that combines targeted web search, agentic query refinement, and verification. Both tracks share an end-to-end verification stage so that every instruction–image pair is traceable to its source.

CogRetriever: agentic and manual tracks for retrieving, refining, and verifying real-image editing tasks.

Active Elo with CV-Judge

Pairwise model comparison is the de facto standard for evaluating generative systems, but exhaustive expert annotation does not scale. Active Elo routes comparisons through CV-Judge, a logic-gated multi-dimensional VLM evaluator that rejects clear failures and resolves high-confidence wins. Only close, high-quality pairs are escalated to expert raters; mixed supervision is then aggregated via reliability-weighted Elo updates.

Active Elo protocol: CV-Judge filters and resolves; experts disambiguate close calls.
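
The routing and aggregation described above can be illustrated with a minimal Python sketch. It is not the CV-Arena implementation: the function names, the thresholds, the reliability weights, the choice to discard pairs where both outputs fail, and the use of the standard Elo expected-score formula with a weight-scaled K-factor are all assumptions made for illustration.

```python
# Illustrative sketch only. Every constant and name here is an assumption,
# not a value or API taken from CV-Arena.
K_FACTOR = 32.0          # assumed base Elo step size
JUDGE_WEIGHT = 0.5       # assumed reliability of an automated verdict
EXPERT_WEIGHT = 1.0      # assumed reliability of an expert verdict
FAIL_THRESHOLD = 0.3     # assumed: judge score below this is a clear failure
MARGIN_THRESHOLD = 0.2   # assumed: score gap at or above this is high-confidence


def route(score_a: float, score_b: float) -> str:
    """Decide who resolves a comparison, given judge scores in [0, 1]."""
    a_fail = score_a < FAIL_THRESHOLD
    b_fail = score_b < FAIL_THRESHOLD
    if a_fail and b_fail:
        return "discard"   # assumed policy: nothing useful to rank
    if a_fail or b_fail:
        return "judge"     # one clear failure, the judge resolves it
    if abs(score_a - score_b) >= MARGIN_THRESHOLD:
        return "judge"     # high-confidence win
    return "expert"        # close, high-quality pair goes to a human


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo win probability of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def weighted_elo_update(r_a: float, r_b: float,
                        outcome_a: float, weight: float) -> tuple[float, float]:
    """Apply one Elo update scaled by the verdict's reliability weight.

    outcome_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    """
    delta = K_FACTOR * weight * (outcome_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


if __name__ == "__main__":
    print(route(0.80, 0.75))                                    # close pair
    print(weighted_elo_update(1500.0, 1500.0, 1.0, JUDGE_WEIGHT))
    print(weighted_elo_update(1500.0, 1500.0, 1.0, EXPERT_WEIGHT))
```

Under these assumptions, a verdict from the automated judge moves ratings half as far as an expert verdict on the same pair, which is one simple way to let cheaper but noisier supervision contribute without dominating the ranking.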

Visual Comparisons

Qualitative comparisons across CV-Arena tasks, contrasting outputs from leading proprietary, open-source, and agentic editors on the same instruction–image pairs.

Representative outputs across 21 systems on CV-Arena, spanning restoration, semantic manipulation, and physical interaction tasks.

Additional comparisons highlighting geometric / structural control and typography / UI restoration cases.

Citation

@article{lin2026cvarena,
  title   = {CV-Arena: An Open Benchmark for Instructional Computer Vision Problem Solving with Human-AI Collaborative Preferences},
  author  = {Lin, Fangzhou and Li, Peiran and Xu, Lingyu and Chen, Wenjing and Ge, Qianwen and Xing, Shuo and Wu, Mingyang and Gao, Xiangbo and Yang, Siyuan and Yamada, Kazunori and Zhang, Ziming and Zhang, Haichong and Dong, Zhen and Yang, Ming-Hsuan and Tu, Zhengzhong},
  journal = {arXiv preprint},
  year    = {2026}
}
