Paper Archive

RynnVLA-002: A Unified Vision-Language-Action and World Model

9.0/10

Jun Cen, Siteng Huang, Yuqian Yuan, Hangjie Yuan, Chaohui Yu, Yuming Jiang, Jiayan Guo, Kehan Li, Hao Luo, Fan Wang, Xin Li, Deli Zhao, Hao Chen 11/21/2025 arxiv

robotics

We introduce RynnVLA-002, a unified Vision-Language-Action (VLA) and world model. The world model leverages action and visual inputs to predict future image states, learning the underlying physics of the environment to refine action generation. Conversely, the VLA model produces subsequent actions f...

Keywords: vision-language-action, world model, robotics, joint learning, simulation, sim2real, LIBERO, LeRobot

HALO: High-Altitude Language-Conditioned Monocular Aerial Exploration and Navigation

9.0/10

Yuezhan Tao, Dexter Ong, Fernando Cladera, Jason Hughes, Camillo J. Taylor, Pratik Chaudhari, Vijay Kumar 11/21/2025 arxiv

robotics

We demonstrate real-time high-altitude aerial metric-semantic mapping and exploration using a monocular camera paired with a global positioning system (GPS) and an inertial measurement unit (IMU). Our system, named HALO, addresses two key challenges: (i) real-time dense 3D reconstruction using visio...

Keywords: metric-semantic mapping, monocular vision, aerial exploration, language-conditioned planning, real-time 3D reconstruction, quadrotor, GPS/IMU, autonomous mapping

EvDiff: High Quality Video with an Event Camera

9.0/10

Weilun Li, Lei Sun, Ruixi Gao, Qi Jiang, Yuqin Ma, Kaiwei Wang, Ming-Hsuan Yang, Luc Van Gool, Danda Pani Paudel 11/21/2025 arxiv

computer vision

As neuromorphic sensors, event cameras asynchronously record changes in brightness as streams of sparse events with the advantages of high temporal resolution and high dynamic range. Reconstructing intensity images from events is a highly ill-posed task due to the inherent ambiguity of absolute brig...

Keywords: event cameras, diffusion models, video reconstruction, neuromorphic sensors, surrogate training, EvEncoder, high dynamic range

Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination

9.0/10

Yolo Yunlong Tang, Daiki Shimada, Hang Hua, Chao Huang, Jing Bi, Rogerio Feris, Chenliang Xu 11/21/2025 arxiv

computer vision

Understanding text-rich videos requires reading small, transient textual cues that often demand repeated inspection. Yet most video QA models rely on single-pass perception over fixed frames, leading to hallucinations and failures on fine-grained evidence. Inspired by how humans pause, zoom, and re-...

Keywords: video reasoning, text-rich video, visual rumination, large multimodal models, reinforcement learning, SFT, GRPO, Video-R4-CoT-17k

Harnessing Data from Clustered LQR Systems: Personalized and Collaborative Policy Optimization

9.0/10

Vinay Kanakeri, Shivam Bajaj, Ashwin Verma, Vijay Gupta, Aritra Mitra 11/21/2025 arxiv

machine learning

It is known that reinforcement learning (RL) is data-hungry. To improve sample-efficiency of RL, it has been proposed that the learning algorithm utilize data from 'approximately similar' processes. However, since the process models are unknown, identifying which other processes are similar poses a ...

Keywords: LQR, reinforcement learning, clustering, personalized policies, policy optimization, sequential elimination, zeroth-order, collaborative learning

Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models

9.0/10

Mark Endo, Serena Yeung-Levy 11/21/2025 arxiv

machine learning

Scaling up multimodal models has enabled remarkable advances in visual understanding and reasoning, but practical demands call for smaller, efficient systems. In this work, we conduct a principled analysis of downscaling intelligence in multimodal models, examining how reduced large language model (...

Keywords: multimodal models, LLM downscaling, visual perception, visual reasoning, visual extraction tuning, Extract+Think, efficiency, small models

An Artificial Intelligence Framework for Measuring Human Spine Aging Using MRI

9.0/10

Roozbeh Bazargani, Saqib Abdullah Basar, Daniel Daly-Grafstein, Rodrigo Solis Pompa, Soojin Lee, Saurabh Garg, Yuntong Ma, John A. Carrino, Siavash Khallaghi, Sam Hashemi 11/21/2025 arxiv

machine learning

The human spine is a complex structure composed of 33 vertebrae. It holds the body and is important for leading a healthy life. The spine is vulnerable to age-related degenerations that can be identified through magnetic resonance imaging (MRI). In this paper we propose a novel computer-vision-based ...

Keywords: spine age, spine aging, MRI, deep learning, UMAP, HDBSCAN, spine age gap, SAG

Counterfactual World Models via Digital Twin-conditioned Video Diffusion

9.0/10

Yiqing Shen, Aiza Maksutova, Chenjia Li, Mathias Unberath 11/21/2025 arxiv

computer vision

World models learn to predict the temporal evolution of visual observations given a control signal, potentially enabling agents to reason about environments through forward simulation. Because of the focus on forward simulation, current world models generate predictions based on factual observations...

Keywords: counterfactual world models, digital twin, video diffusion, large language models, structured scene representation, interventions, forward simulation

OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe

9.0/10

Unknown authors 11/24/2025 huggingface

machine learning

Recent advancements in large reasoning models have fueled growing interest in extending such capabilities to multimodal domains. However, despite notable progress in visual reasoning, the lack of transparent and reproducible data curation and training strategies remains a major barrier to scalable r...

Keywords: multimodal reasoning, supervised fine-tuning, reinforcement learning, data curation, reproducibility, OpenMMReasoner, Qwen2.5-VL-7B-Instruct, benchmarks

GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization

9.0/10

Unknown authors 11/24/2025 huggingface

computer vision

Current research on agentic visual reasoning enables deep multimodal understanding but primarily focuses on image manipulation tools, leaving a gap toward more general-purpose agentic models. In this work, we revisit the geolocalization task, which requires not only nuanced visual grounding but also...

Keywords: geolocalization, agentic reasoning, multimodal, web-augmented, GeoVista, GeoBench, reinforcement learning, hierarchical reward
