Paper Archive

DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search

9.0/10

Kartik Narayan, Yang Xu, Tian Cao, Kavya Nerella, Vishal M. Patel, Navid Shiee, Peter Grasch, Chao Jia, Yinfei Yang, Zhe Gan | 10/14/2025 | arXiv

machine learning

Multimodal Large Language Models (MLLMs) in real-world applications require access to external knowledge sources and must remain responsive to the dynamic and ever-changing real-world information in order to address information-seeking and knowledge-intensive user queries. Existing approaches, such ...

Keywords: Multimodal LLM, web search, query crafting, image crop search, DeepMMSearch-R1, DeepMMSearchVQA, reinforcement learning, multimodal VQA

Detect Anything via Next Point Prediction

9.0/10

Qing Jiang, Junan Huo, Xingyu Chen, Yuda Xiong, Zhaoyang Zeng, Yihao Chen, Tianhe Ren, Junzhi Yu, Lei Zhang | 10/14/2025 | arXiv

computer vision

Object detection has long been dominated by traditional coordinate regression-based models, such as YOLO, DETR, and Grounding DINO. Although recent efforts have attempted to leverage MLLMs to tackle this task, they face challenges like low recall rate, duplicate predictions, coordinate misalignment,...

Keywords: next point prediction, Rex-Omni, MLLM, object detection, quantized coordinates, GRPO, reinforcement learning, zero-shot

DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving

9.0/10

Yingyan Li, Shuyao Shang, Weisong Liu, Bing Zhan, Haochen Wang, Yuqi Wang, Yuntao Chen, Xiaoman Wang, Yasong An, Chufeng Tang, Lu Hou, Lue Fan, Zhaoxiang Zhang | 10/14/2025 | arXiv

machine learning

Scaling Vision-Language-Action (VLA) models on large-scale data offers a promising path to achieving a more generalized driving intelligence. However, VLA models are limited by a "supervision deficit": the vast model capacity is supervised by sparse, low-dimensional actions, leaving much of their ...

Keywords: DriveVLA-W0, world modeling, vision-language-action, data scaling law, autonomous driving, autoregressive model, diffusion model, self-supervised learning

CuMPerLay: Learning Cubical Multiparameter Persistence Vectorizations

9.0/10

Caner Korkmaz, Brighton Nuwagira, Barış Coşkunuzer, Tolga Birdal | 10/14/2025 | arXiv

machine learning

We present CuMPerLay, a novel differentiable vectorization layer that enables the integration of Cubical Multiparameter Persistence (CMP) into deep learning pipelines. While CMP presents a natural and powerful way to topologically work with images, its use is hindered by the complexity of multifiltr...

Keywords: Cubical Multiparameter Persistence, CuMPerLay, topological data analysis, differentiable vectorization, bifiltration, Wasserstein stability, Swin Transformer, medical imaging

ViCO: A Training Strategy towards Semantic Aware Dynamic High-Resolution

9.0/10

Long Cui, Weiyun Wang, Jie Shao, Zichen Wen, Gen Luo, Linfeng Zhang, Yanting Zhang, Yu Qiao, Wenhai Wang | 10/14/2025 | arXiv

machine learning

Existing Multimodal Large Language Models (MLLMs) suffer from increased inference costs due to the additional vision tokens introduced by image inputs. In this work, we propose Visual Consistency Learning (ViCO), a novel training algorithm that enables the model to represent images of varying semant...

Keywords: ViCO, Visual Consistency Learning, ViR, Visual Resolution Router, dynamic tokens, semantic complexity, MLP connector, KL divergence

UniFusion: Vision-Language Model as Unified Encoder in Image Generation

9.0/10

Kevin Li, Manuel Brack, Sudeep Katakol, Hareesh Ravi, Ajinkya Kale | 10/14/2025 | arXiv

machine learning

Although recent advances in visual generation have been remarkable, most existing architectures still depend on distinct encoders for images and text. This separation constrains diffusion models' ability to perform cross-modal reasoning and knowledge transfer. Prior attempts to bridge this gap often...

Keywords: UniFusion, Layerwise Attention Pooling, LAP, VERIFI, frozen VLM, vision-language model, diffusion, DiT

Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics

9.0/10

Marco Del Tredici, Jacob McCarran, Benjamin Breen, Javier Aspuru Mijares, Weichen Winston Yin, Jacob M. Taylor, Frank Koppens, Dirk Englund | 10/14/2025 | arXiv

machine learning

We present Ax-Prover, a multi-agent system for automated theorem proving in Lean that can solve problems across diverse scientific domains and operate either autonomously or collaboratively with human experts. To achieve this, Ax-Prover approaches scientific problem solving through formal proof gene...

Keywords: Ax-Prover, theorem proving, Lean, Model Context Protocol, LLMs, multi-agent, formal verification, automated theorem proving

MVP4D: Multi-View Portrait Video Diffusion for Animatable 4D Avatars

9.0/10

Felix Taubner, Ruihang Zhang, Mathieu Tuli, Sherwin Bahmani, David B. Lindell | 10/14/2025 | arXiv

computer vision

Digital human avatars aim to simulate the dynamic appearance of humans in virtual environments, enabling immersive experiences across gaming, film, virtual reality, and more. However, the conventional process for creating and animating photorealistic human avatars is expensive and time-consuming, re...

Keywords: multi-view, video diffusion, 4D avatar, animatable avatars, single-image capture, neural rendering, distillation, temporal consistency

SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models

9.0/10

Weiyang Jin, Yuwei Niu, Jiaqi Liao, Chengqi Duan, Aoxue Li, Shenghua Gao, Xihui Liu | 10/14/2025 | arXiv

multimodal learning

Recently, remarkable progress has been made in Unified Multimodal Models (UMMs), which integrate vision-language generation and understanding capabilities within a single framework. However, a significant gap exists where a model's strong visual understanding often fails to transfer to its visual ge...

Keywords: SRUM, self-rewarding, unified multimodal models, global-local dual reward, post-training, understanding-as-evaluator, T2I-CompBench, T2I-ReasonBench

Dr.LLM: Dynamic Layer Routing in LLMs

9.0/10

Ahmed Heakl, Martin Gubri, Salman Khan, Sangdoo Yun, Seong Joon Oh | 10/14/2025 | arXiv

natural language processing

Large Language Models (LLMs) process every token through all layers of a transformer stack, causing wasted computation on simple queries and insufficient flexibility for harder ones that need deeper reasoning. Adaptive-depth methods can improve efficiency, but prior approaches rely on costly inferen...

Keywords: dynamic routing, layer skipping, LLM efficiency, Monte Carlo Tree Search, adaptive-depth, windowed pooling, focal loss, bottleneck MLP

Papers for October 15, 2025
