Paper Archive

Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

5.0/10

Miaosen Zhang, Xiaohan Zhao, Zhihong Tan, Zhou Huoshen, Yijia Fan, Yifan Yang, Kai Qiu, Bei Liu, Justin Wagle, Chenzhong Yin, Mingxi Cheng, Ji Li, Qi Dai, Chong Luo, Xu Yang, Xin Geng, Baining Guo 5/12/2026 arxiv

computer vision

Computer-use agents (CUAs) automate on-screen work, as illustrated by GPT-5.4 and Claude. Yet their reliability on complex, low-frequency interactions is still poor, limiting user trust. Our analysis of failure cases from advanced models suggests a long-tail pattern in GUI operations, where a relati...

Keywords: gpt

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

5.0/10

Haiwen Diao, Penghao Wu, Hanming Deng, Jiahao Wang, Shihao Bai, Silei Wu, Weichen Fan, Wenjie Ye, Wenwen Tong, Xiangyu Fan, Yan Li, Yubo Wang, Zhijie Cao, Zhiqian Lin, Zhitao Yang, Zhongang Cai, Yuwei Niu, Yue Zhu, Bo Liu, Chengguang Lv, Haojia Yu, Haozhe Xie, Hongli Wang, Jianan Fan, Jiaqi Li, Jiefan Lu, Jingcheng Ni, Junxiang Xu, Kaihuan Liang, Lianqiang Shi, Linjun Dai, Linyan Wang, Oscar Qian, Peng Gao, Pengfei Liu, Qingping Sun, Rui Shen, Ruisi Wang, Shengnan Ma, Shuang Yang, Siyi Xie, Siying Li, Tianbo Zhong, Xiangli Kong, Xuanke Shi, Yang Gao, Yongqiang Yao, Yves Wang, Zhengqi Bai, Zhengyu Lin, Zixin Yin, Wenxiu Sun, Ruihao Gong, Quan Wang, Lewei Lu, Lei Yang, Ziwei Liu, Dahua Lin 5/12/2026 arxiv

computer vision

Recent large vision-language models (VLMs) remain fundamentally constrained by a persistent dichotomy: understanding and generation are treated as distinct problems, leading to fragmented architectures, cascaded pipelines, and misaligned representation spaces. We argue that this divide is not merely...

EgoForce: Forearm-Guided Camera-Space 3D Hand Pose from a Monocular Egocentric Camera

5.0/10

Christen Millerdurai, Shaoxiang Wang, Yaxu Xie, Vladislav Golyanik, Didier Stricker, Alain Pagani 5/12/2026 arxiv

machine learning

Reconstructing the absolute 3D pose and shape of the hands from the user's viewpoint using a single head-mounted camera is crucial for practical egocentric interaction in AR/VR, telepresence, and hand-centric manipulation tasks, where sensing must remain compact and unobtrusive. While monocular RGB ...

Keywords: transformer

From Web to Pixels: Bringing Agentic Search into Visual Perception

5.0/10

Bokang Yang, Xinyi Sun, Kaituo Feng, Xingping Dong, Dongming Wu, Xiangyu Yue 5/12/2026 arxiv

computer vision

Visual perception connects high-level semantic understanding to pixel-level perception, but most existing settings assume that the decisive evidence for identifying a target is already in the image or frozen model knowledge. We study a more practical yet harder open-world case where a visible object...

Keywords: segmentation

CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives

5.0/10

Yihao Meng, Zichen Liu, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Yue Yu, Hanlin Wang, Haobo Li, Jiapeng Zhu, Yanhong Zeng, Xing Zhu, Yujun Shen, Qifeng Chen, Huamin Qu 5/12/2026 arxiv

natural language processing

Autoregressive video generation aims at real-time, open-ended synthesis. Yet, cinematic storytelling is not merely the endless extension of a single scene; it requires progressing through evolving events, viewpoint shifts, and discrete shot boundaries. Existing autoregressive models often struggle i...

Keywords: attention

AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward

5.0/10

Runhui Huang, Jie Wu, Rui Yang, Zhe Liu, Hengshuang Zhao 5/12/2026 arxiv

computer vision

In this paper, we propose AlphaGRPO, a novel framework that applies Group Relative Policy Optimization (GRPO) to AR-Diffusion Unified Multimodal Models (UMMs) to enhance multimodal generation capabilities without an additional cold-start stage. Our approach unlocks the model's intrinsic potential to...

Revisiting Photometric Ambiguity for Accurate Gaussian-Splatting Surface Reconstruction

5.0/10

Jiahe Li, Jiawei Zhang, Xiao Bai, Jin Zheng, Xiaohan Yu, Lin Gu, Gim Hee Lee 5/12/2026 arxiv

machine learning

Surface reconstruction with differentiable rendering has achieved impressive performance in recent years, yet the pervasive photometric ambiguities have strictly bottlenecked existing approaches. This paper presents AmbiSuR, a framework that explores an intrinsic solution upon Gaussian Splatting for...

LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

5.0/10

Di Wu, Zixiang Ji, Asmi Kawatkar, Bryan Kwan, Jia-Chen Gu, Nanyun Peng, Kai-Wei Chang 5/12/2026 arxiv

natural language processing

Long-term memory is crucial for agents in specialized web environments, where success depends on recalling interface affordances, state dynamics, workflows, and recurring failure modes. However, existing memory benchmarks for agents mostly focus on user histories, short traces, or downstream task su...

Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation

5.0/10

Kexuan Shi, Hanxuan Li, Zeju Qiu, Yandong Wen, Simon Buchholz, Weiyang Liu 5/12/2026 arxiv

natural language processing

We introduce Pion, a spectrum-preserving optimizer for large language model (LLM) training based on orthogonal equivalence transformation. Unlike additive optimizers such as Adam and Muon, Pion updates each weight matrix through left and right orthogonal transformations, preserving its singular valu...

Keywords: pretraining

Elastic Attention Cores for Scalable Vision Transformers

5.0/10

Alan Z. Song, Yinjie Chen, Mu Nan, Rui Zhang, Jiahang Cao, Weijian Mai, Muquan Yu, Hossein Adeli, Deva Ramanan, Michael J. Tarr, Andrew F. Luo 5/12/2026 arxiv

computer vision

Vision Transformers (ViTs) achieve strong data-driven scaling by leveraging all-to-all self-attention. However, this flexibility incurs a computational cost that scales quadratically with image resolution, limiting ViTs in high-resolution domains. Underlying this approach is the assumption that pair...

Keywords: transformer, attention, vision transformer, classification


Papers for May 13, 2026
