Paper Archive

EventHub: Data Factory for Generalizable Event-Based Stereo Networks without Active Sensors


9.0/10

Luca Bartolomei, Fabio Tosi, Matteo Poggi, Stefano Mattoccia, Guillermo Gallego 4/2/2026 arxiv

computer vision

We propose EventHub, a novel framework for training deep-event stereo networks without ground truth annotations from costly active sensors, relying instead on standard color images. From these images, we derive either proxy annotations and proxy events through state-of-the-art novel view synthesis t...

Keywords: event-based vision, stereo, novel view synthesis, proxy events, data distillation, domain generalization, RGB-to-event


ActionParty: Multi-Subject Action Binding in Generative Video Games


9.0/10

Alexander Pondaven, Ziyi Wu, Igor Gilitschenski, Philip Torr, Sergey Tulyakov, Fabio Pizzati, Aliaksandr Siarohin 4/2/2026 arxiv

generative models

Recent advances in video diffusion have enabled the development of "world models" capable of simulating interactive environments. However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene. In this work, we tackle a fundamental...

Keywords: video diffusion, world models, action binding, subject state tokens, spatial biasing, multi-agent, Melting Pot, generative video games


Generative World Renderer


9.0/10

Zheng-Hui Huang, Zhixiang Wang, Jiaming Tan, Ruihan Yu, Yidan Zhang, Bo Zheng, Yu-Lun Liu, Yung-Yu Chuang, Kaipeng Zhang 4/2/2026 arxiv

computer vision

Scaling generative inverse and forward rendering to real-world scenarios is bottlenecked by the limited realism and temporal coherence of existing synthetic datasets. To bridge this persistent domain gap, we introduce a large-scale, dynamic dataset curated from visually complex AAA games. Using a no...

Keywords: generative rendering, inverse rendering, G-buffer, dataset, AAA games, dual-screen capture, VLM evaluation, temporal coherence


Steerable Visual Representations


9.0/10

Jona Ruthardt, Manu Gaur, Deva Ramanan, Makarand Tapaswi, Yuki M. Asano 4/2/2026 arxiv

computer vision

Pretrained Vision Transformers (ViTs) such as DINOv2 and MAE provide generic image features that can be applied to a variety of downstream tasks such as retrieval, classification, and segmentation. However, such representations tend to focus on the most salient visual cues in the image, with no way ...

Keywords: steerable representations, vision transformers, early fusion, cross-attention, DINOv2, MAE, CLIP, zero-shot


Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation


9.0/10

Daiwei Chen, Zhoutong Fu, Chengming Jiang, Haichao Zhang, Ran Zhou, Tan Wang, Chunnan Yao, Guoyao Li, Rui Cai, Yihan Cao, Ruijie Jiang, Fedor Borisyuk, Jianqiang Shen, Jingwei Wu, Ramya Korlakai Vinayak 4/2/2026 arxiv

machine learning

Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tu...

Keywords: Grounded Token Initialization, GTI, token initialization, vocabulary extension, language models, generative recommendation, embedding grounding, semantic-ID tokens
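
The mean-initialization baseline mentioned in the abstract above can be sketched as follows. This illustrates only the standard practice the abstract describes, not the paper's Grounded Token Initialization method, whose details are truncated here. The function name `mean_init` and the toy two-dimensional embeddings are assumptions made for illustration.

```python
from statistics import fmean


def mean_init(embeddings, num_new_tokens):
    """Extend an embedding table with rows set to the mean of the existing rows.

    `embeddings` is a list of equal-length float vectors (hypothetical toy data,
    standing in for a real model's embedding matrix).
    """
    if not embeddings:
        raise ValueError("need at least one existing embedding")
    dim = len(embeddings[0])
    # Column-wise mean over the existing vocabulary.
    mean_vec = [fmean(row[d] for row in embeddings) for d in range(dim)]
    # Each new token gets its own copy so later updates do not alias each other.
    return embeddings + [list(mean_vec) for _ in range(num_new_tokens)]


vocab = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
extended = mean_init(vocab, num_new_tokens=2)
```

Note that every new token starts from the same vector under this scheme, so any distinction between new tokens has to be learned during fine-tuning.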


Beyond Referring Expressions: Scenario Comprehension Visual Grounding


9.0/10

Ruozhen He, Nisarg A. Shah, Qihua Dong, Zilin Xiao, Jaywon Koo, Vicente Ordonez 4/2/2026 arxiv

machine learning

Existing visual grounding benchmarks primarily evaluate alignment between image regions and literal referring expressions, where models can often succeed by matching a prominent named category. We explore a complementary and more challenging setting of scenario-based visual grounding, where the targ...

Keywords: scenario-based grounding, Referring Scenario Comprehension, visual grounding, curriculum learning, reinforcement learning, vision-and-language, dataset


Batched Contextual Reinforcement: A Task-Scaling Law for Efficient Reasoning


9.0/10

Bangji Yang, Hongbo Ma, Jiajun Fan, Ge Liu 4/2/2026 arxiv

machine learning

Large Language Models employing Chain-of-Thought reasoning achieve strong performance but suffer from excessive token consumption that inflates inference costs. Existing efficiency methods such as explicit length penalties, difficulty estimators, or multi-stage curricula either degrade reasoning qua...

Keywords: Batched Contextual Reinforcement, Chain-of-Thought, token efficiency, task-scaling law, implicit budget, inference cost, emergent efficiency, length control


Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining


9.0/10

Junxuan Li, Rawal Khirodkar, Chengan He, Zhongshi Jiang, Giljoo Nam, Lingchen Yang, Jihyun Lee, Egor Zakharov, Zhaoen Su, Rinat Abdrashitov, Yuan Dong, Julieta Martinez, Kai Li, Qingyang Tan, Takaaki Shiratori, Matthew Hu, Peihong Guo, Xuhua Huang, Ariyan Zarei, Marco Pesavento, Yichen Xu, He Wen, Teng Deng, Wyatt Borsos, Anjali Thakrar, Jean-Charles Bazin, Carsten Stoll, Ginés Hidalgo, James Booth, Lucy Wang, Xiaowen Ma, Yu Rong, Sairanjith Thalanki, Chen Cao, Christian Häne, Abhishek Kar, Sofien Bouaziz, Jason Saragih, Yaser Sheikh, Shunsuke Saito 4/2/2026 arxiv

computer vision

High-quality 3D avatar modeling faces a critical trade-off between fidelity and generalization. On the one hand, multi-view studio data enables high-fidelity modeling of humans with precise control over expressions and poses, but it struggles to generalize to real-world data due to limited scale and...

Keywords: 3D avatars, pretraining, in-the-wild, multi-view, avatar modeling, feedforward inference, relightability, codec avatars


Stop Wandering: Efficient Vision-Language Navigation via Metacognitive Reasoning


9.0/10

Xueying Li, Feng Lyu, Hao Wu, Mingliu Liu, Jia-Nan Liu, Guozi Liu 4/2/2026 arxiv

machine learning

Training-free Vision-Language Navigation (VLN) agents powered by foundation models can follow instructions and explore 3D environments. However, existing approaches rely on greedy frontier selection and passive spatial memory, leading to inefficient behaviors such as local oscillation and redundant ...

Keywords: MetaNav, metacognition, vision-language navigation, 3D semantic map, LLM-guided correction, history-aware planning, frontier selection, GOAT-Bench


A Simple Baseline for Streaming Video Understanding


9.0/10

Yujiao Shen, Shulin Tian, Jingkang Yang, Ziwei Liu 4/2/2026 arxiv

computer vision

Recent streaming video understanding methods increasingly rely on complex memory mechanisms to handle long video streams. We challenge this trend with a simple finding: a sliding-window baseline that feeds only the most recent N frames to an off-the-shelf VLM already matches or surpasses published s...

Keywords: SimpleStream, sliding-window, VLM, streaming video understanding, OVO-Bench, StreamingBench, perception-memory trade-off, backbone-dependent context
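
The sliding-window idea in the abstract above (feed only the most recent N frames to a VLM) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: `RecentFrames`, `answer`, and `stub_vlm` are hypothetical names, integers stand in for frames, and the stub replaces a real off-the-shelf VLM.

```python
from collections import deque


class RecentFrames:
    """Keeps only the most recent n frames of a stream."""

    def __init__(self, n):
        if n <= 0:
            raise ValueError("window size must be positive")
        # A bounded deque silently drops the oldest frame once it is full.
        self._frames = deque(maxlen=n)

    def push(self, frame):
        self._frames.append(frame)

    def context(self):
        # Oldest-to-newest copy of the current window.
        return list(self._frames)


def answer(window, question, vlm):
    # `vlm` is a hypothetical callable: (frames, question) -> str.
    return vlm(window.context(), question)


def stub_vlm(frames, question):
    # Stand-in for a real model; it only reports which frames it was given.
    return f"{question} | saw frames {frames[0]}..{frames[-1]}"


window = RecentFrames(n=4)
for frame_id in range(10):  # integers stand in for decoded video frames
    window.push(frame_id)
reply = answer(window, "What just happened?", stub_vlm)
```

With a window of 4 and ten pushed frames, only frames 6 through 9 reach the model, so memory and per-query cost stay constant regardless of stream length.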



Papers for April 3, 2026
