Paper Archive

Multimodal Large Language Models as Image Classifiers

9.0/10

Nikita Kisel, Illia Volkov, Klara Janouskova, Jiri Matas 3/6/2026 arxiv

computer vision

The classification performance of Multimodal Large Language Models (MLLMs) depends critically on evaluation protocol and ground-truth quality. Studies comparing MLLMs with supervised and vision-language models report conflicting conclusions, and we show these conflicts stem from protocols that either inflat...

Keywords: Multimodal Large Language Models, MLLM, Image Classification, ImageNet-1k, ReGT, evaluation protocol, label noise, dataset curation

Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

9.0/10

Lijiang Li, Zuwei Long, Yunhang Shen, Heting Gao, Haoyu Cao, Xing Sun, Caifeng Shan, Ran He, Chaoyou Fu 3/6/2026 arxiv

machine learning

While recent multimodal large language models (MLLMs) have made impressive strides, they predominantly employ a conventional autoregressive architecture as their backbone, leaving significant room to explore effective and efficient alternatives in architectural design. Concurrently, recent studies h...

Keywords: masked discrete diffusion, multimodal, any-to-any, joint discrete tokens, generation, understanding, text, speech

BEVLM: Distilling Semantic Knowledge from LLMs into Bird's-Eye View Representations

9.0/10

Thomas Monninger, Shaoyuan Xie, Qi Alfred Chen, Sihao Ding 3/6/2026 arxiv

machine learning

The integration of Large Language Models (LLMs) into autonomous driving has attracted growing interest for their strong reasoning and semantic understanding abilities, which are essential for handling complex decision-making and long-tail scenarios. However, existing methods typically feed LLMs with...

Keywords: BEV, LLMs, semantic distillation, bird's-eye view, autonomous driving, multi-view, spatial consistency, closed-loop driving

Fly360: Omnidirectional Obstacle Avoidance within Drone View

9.0/10

Xiangkai Zhang, Dizhe Zhang, WenZhuo Cao, Zhaoliang Wan, Yingjie Niu, Lu Qi, Xu Yang, Zhiyong Liu 3/6/2026 arxiv

robotics

Obstacle avoidance in unmanned aerial vehicles (UAVs), as a fundamental capability, has gained increasing attention with the growing focus on spatial intelligence. However, current obstacle-avoidance methods mainly depend on limited field-of-view sensors and are ill-suited for UAV scenarios which re...

Keywords: omnidirectional perception, UAV, drone, panoramic RGB, depth estimation, obstacle avoidance, fixed random-yaw training, two-stage pipeline

SCOPE: Scene-Contextualized Incremental Few-Shot 3D Segmentation

9.0/10

Vishal Thengane, Zhaochong An, Tianjin Huang, Son Lam Phung, Abdesselam Bouzerdoum, Lu Yin, Na Zhao, Xiatian Zhu 3/6/2026 arxiv

computer vision

Incremental Few-Shot (IFS) segmentation aims to learn new categories over time from only a few annotations. Although widely studied in 2D, it remains underexplored for 3D point clouds. Existing methods suffer from catastrophic forgetting or fail to learn discriminative prototypes under sparse superv...

Keywords: Incremental Few-Shot, 3D segmentation, prototype enrichment, class-agnostic segmentation, pseudo-instances, ScanNet, S3DIS, continual learning

SUREON: A Benchmark and Vision-Language-Model for Surgical Reasoning

9.0/10

Alejandra Perez, Anita Rau, Lee White, Busisiwe Mlambo, Chinedu Nwoye, Muhammad Abdullah Jamal, Omid Mohareri 3/6/2026 arxiv

machine learning

Surgeons don't just see -- they interpret. When an expert observes a surgical scene, they understand not only what instrument is being used, but why it was chosen, what risk it poses, and what comes next. Current surgical AI cannot answer such questions, largely because training data that explicitly...

Keywords: SUREON, surgical reasoning, vision-language, videoQA, SureonVLM, SureonVLM-R1, Group Relative Policy Optimization, medical AI

Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders

9.0/10

Boqiang Zhang, Lei Ke, Ruihan Yang, Qi Gao, Tianyuan Qu, Rossell Chen, Dong Yu, Leoweiliang 3/6/2026 arxiv

machine learning

Vision Language Model (VLM) development has largely relied on scaling model size, which hinders deployment on compute-constrained mobile and edge devices such as smartphones and robots. In this work, we explore the performance limits of compact (e.g., 2B and 8B) VLMs. We challenge the prevailing pra...

Keywords: vision-language models, LLM-initialized encoder, contrastive learning, CLIP, compact VLMs, Penguin-VL, visual fidelity, edge AI

A recipe for scalable attention-based MLIPs: unlocking long-range accuracy with all-to-all node attention

9.0/10

Eric Qu, Brandon M. Wood, Aditi S. Krishnapriyan, Zachary W. Ulissi 3/6/2026 arxiv

machine learning

Machine-learning interatomic potentials (MLIPs) have advanced rapidly, with many top models relying on strong physics-based inductive biases. However, as models scale to larger systems like biomolecules and electrolytes, they struggle to accurately capture long-range (LR) interactions, leading curre...

Keywords: AllScAIP, MLIP, all-to-all attention, long-range interactions, energy-conserving, molecular dynamics, OMol25, OMat24

EgoReasoner: Learning Egocentric 4D Reasoning via Task-Adaptive Structured Thinking

9.0/10

Fangrui Zhu, Yunfeng Xi, Jianmo Ni, Mu Cai, Boqing Gong, Long Zhao, Chen Qu, Ian Miao, Yi Li, Cheng Zhong, Huaizu Jiang, Shwetak Patel 3/6/2026 arxiv

computer vision

Egocentric video understanding is inherently complex due to the dynamic 4D nature of the environment, where camera motion and object displacements necessitate a continuous re-evaluation of spatial relations. In this work, we target a suite of under-explored egocentric 4D reasoning tasks, including f...

Keywords: egocentric video, 4D reasoning, Chain-of-Thought, Task-Adaptive Thinking Templates, reinforcement learning, GRPO, HD-EPIC, EgoReasoner

Causal Interpretation of Neural Network Computations with Contribution Decomposition

9.0/10

Joshua Brendan Melander, Zaki Alaoui, Shenghua Liu, Surya Ganguli, Stephen A. Baccus 3/6/2026 arxiv

machine learning

Understanding how neural networks transform inputs into outputs is crucial for interpreting and manipulating their behavior. Most existing approaches analyze internal representations by identifying hidden-layer activation patterns correlated with human-interpretable concepts. Here we take a direct a...

Keywords: CODEC, contribution decomposition, sparse autoencoder, neural network interpretability, causal attribution, sparse motifs, representation analysis, retina models

Papers for March 9, 2026
