Paper Archive

Gaze Heads: How VLMs Look at What They Describe

5.0/10

Rohit Gandikota, David Bau 6/12/2026 arxiv

computer vision

How a vision-language model internally solves the task of describing an image is far from obvious. We find that the model develops a specific mechanism for this: a small set of attention heads in its language-model backbone, which we call gaze heads, whose attention tracks the image region the model...

Keywords: attention

OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains

5.0/10

Xinyue Cai, Chaoyou Fu, Yi-Fan Zhang, Ran He, Caifeng Shan 6/12/2026 arxiv

computer vision

Current automated pipelines for audio-visual Question Answering (QA) generally adopt a "video-caption-QA" paradigm. However, these methods typically segment videos into short clips and generate separate descriptions for audio and visual modalities. This decoupled processing severs inherent associa...

Keywords: fine-tuning

RATS! Patches Talk Through Registers: Emergent Parts in Register Attention Transformers

5.0/10

Timing Yang, Predrag Neskovic, Jansen Seheult, Wenchao Han, Anand Bhattad, Alan Yuille, Feng Wang 6/12/2026 arxiv

computer vision

When humans see a bird, they recognize far more than just "bird" -- they see a head, wings, and talons, a structured assembly of reusable parts that can be identified across every bird they have ever seen. We ask whether a self-supervised visual model can discover the same compositional structure on...

Keywords: transformer, attention, segmentation, classification

RepFusion: Leveraging Multimodal Priors for Denoising in Representation Space

5.0/10

Xichen Pan, Aashu Singh, Satya Narayan Shukla, Xiangjun Fan, Shlok Kumar Mishra, Saining Xie 6/12/2026 arxiv

computer vision

Large language models (LLMs) are widely used in text-to-image (T2I) systems, but they are typically limited to text encoding, while denoising is handled by newly trained generative backbones. The emergence of representation autoencoders (RAEs) shifts the generation target toward semantically structu...

Keywords: transformer

Instruct-Particulate: Scaling Feed-Forward 3D Object Articulation with Kinematic Control

5.0/10

Ruining Li, Yuxin Yao, Matt Zhou, Chuanxia Zheng, Christian Rupprecht, Joan Lasenby, Shangzhe Wu, Andrea Vedaldi 6/12/2026 arxiv

computer vision

Reconstructing articulated 3D objects is important for animation, gaming, and robotic simulations. Recent neural networks can estimate the articulated structure of 3D objects, but their generalization remains limited by the scarcity of annotated data for this task. To address this gap, we introduce ...

Keywords: neural network, segmentation

ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning

5.0/10

Sicheng Yang, Hangjie Yuan, Wenjun Zhang, Jinwang Wang, Yichen Qian, Weihua Chen, Fan Wang, Lei Zhu 6/12/2026 arxiv

computer vision

Building trustworthy medical multimodal large language models (MLLMs) is critical for reliable clinical decision support. Existing medical hallucination benchmarks mainly focus on data collection, but often ignore where hallucinations originate within the reasoning process. We find that hallucinatio...

Keywords: fine-tuning

Persona-Pruner: Sculpting Lightweight Models for Role-Playing

5.0/10

Jinsu Kim, Jihoon Tack, Noah Lee, Jongheon Jeong 6/12/2026 arxiv

natural language processing

Language Models (LMs) have shown remarkable potential as role-playing chatbots, delivering consistent, stylized interactions when given a specification of a character or user persona. However, applying these capabilities to real-world applications (e.g., ecosystems with numerous NPCs interacting sim...

AdaSR: Adaptive Streaming Reasoning with Hierarchical Relative Policy Optimization

5.0/10

Junlong Tong, Wenqi Xu, Yingqi Fan, Anhao Zhao, Xuan Lu, Yang Tan, Xiaoyu Shen 6/12/2026 arxiv

natural language processing

Large reasoning models typically follow a read-then-think paradigm: they observe the complete input, reason over a static context, and then produce the answer. Yet many real-world scenarios are inherently dynamic, such as audio and video streams, where information arrives as a continuous stream and m...

Keywords: fine-tuning

Learning Coordinated Preference for Multi-Objective Multi-Agent Reinforcement Learning

5.0/10

Pengxin Wang, Lihao Guo, Yi Xie, Bo Liu, Siyang Cao, Jingdi Chen 6/12/2026 arxiv

reinforcement learning

Cooperative multi-objective multi-agent reinforcement learning (MOMARL) models team decision making under multiple, potentially conflicting objectives. In this setting, conflicts arise not only across objectives but also across agents with different observations, roles, and contributions. We propose...

Keywords: reinforcement learning

CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment

5.0/10

Jiayue Cao, Zhicong Lu, Xuehan Sun, Wei Jia, Hongling Zheng, Changyuan Tian, Zichuan Lin, Wenqian Lv, Nayu Liu 6/12/2026 arxiv

computer vision

Reinforcement learning with verifiable rewards (RLVR) has successfully elicited the reasoning capabilities of large language models, motivating its extension to multimodal scenarios. Existing methods primarily focus on improving the visual coverage of reasoning traces and mitigating visual hallucina...

Keywords: reinforcement learning

Papers for June 15, 2026