Paper Archive

DA-Flow: Degradation-Aware Optical Flow Estimation with Diffusion Models

9.0/10

3/24/2026 huggingface

computer vision

Optical flow models trained on high-quality data often degrade severely when confronted with real-world corruptions such as blur, noise, and compression artifacts. To overcome this limitation, we formulate Degradation-Aware Optical Flow, a new task targeting accurate dense correspondence estimation ...

Keywords: optical flow, diffusion models, degradation robustness, image restoration, spatio-temporal attention, DA-Flow, zero-shot correspondence

WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG

9.0/10

3/24/2026 huggingface

computer vision

Dynamical systems theory and reinforcement learning view world evolution as latent-state dynamics driven by actions, with visual observations providing partial information about the state. Recent video world models attempt to learn this action-conditioned dynamics from data. However, existing datase...

Keywords: WildWorld, WildBench, action-conditioned world modeling, dataset, Monster Hunter: Wilds, explicit state annotations, skeletons, depth maps

One View Is Enough! Monocular Training for In-the-Wild Novel View Generation

9.0/10

3/24/2026 huggingface

computer vision

Monocular novel-view synthesis has long required multi-view image pairs for supervision, limiting training data scale and diversity. We argue it is not necessary: one view is enough. We present OVIE, trained entirely on unpaired internet images. We leverage a monocular depth estimator as a geometric...

Keywords: novel view synthesis, monocular training, masked loss, monocular depth, unpaired images, zero-shot, OVIE, 3D lifting

SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning

9.0/10

3/24/2026 huggingface

machine learning

Agentic multimodal large language models (MLLMs) (e.g., OpenAI o3 and Gemini Agentic Vision) achieve remarkable reasoning capabilities through iterative visual tool invocation. However, the cascaded perception, reasoning, and tool-calling loops introduce significant sequential overhead. This overhea...

Keywords: SpecEyes, speculative planning, cognitive gating, answer separability, heterogeneous parallel funnel, agentic depth, multimodal LLMs, latency

VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs

9.0/10

3/24/2026 huggingface

robotics

Video-Action Models (VAMs) have emerged as a promising framework for embodied intelligence, learning implicit world dynamics from raw video streams to produce temporally consistent action predictions. Although such models demonstrate strong performance on long-horizon tasks through visual reasoning,...

Keywords: VTAM, Video-Action Models, tactile perception, video transformer, multimodal fusion, modality transfer finetuning, tactile regularization, contact-rich manipulation

UniFunc3D: Unified Active Spatial-Temporal Grounding for 3D Functionality Segmentation

9.0/10

3/24/2026 huggingface

machine learning

Functionality segmentation in 3D scenes requires an agent to ground implicit natural-language instructions into precise masks of fine-grained interactive elements. Existing methods rely on fragmented pipelines that suffer from visual blindness during initial task parsing. We observe that these metho...

Keywords: functionality segmentation, 3D scenes, multimodal LLM, spatial-temporal grounding, coarse-to-fine, training-free, SceneFun3D, mIoU

RealMaster: Lifting Rendered Scenes into Photorealistic Video

9.0/10

3/24/2026 huggingface

computer vision

State-of-the-art video generation models produce remarkable photorealism, but they lack the precise control required to align generated content with specific scene requirements. Furthermore, without an underlying explicit geometry, these models cannot guarantee 3D consistency. Conversely, 3D engines...

Keywords: video diffusion, sim-to-real, anchor-based propagation, IC-LoRA, photorealism, 3D consistency, GTA-V, render-to-video

SIMART: Decomposing Monolithic Meshes into Sim-ready Articulated Assets via MLLM

9.0/10

3/24/2026 huggingface

machine learning

High-quality articulated 3D assets are indispensable for embodied AI and physical simulation, yet 3D generation still focuses on static meshes, leaving a gap in "sim-ready" interactive objects. Most recent articulated object creation methods rely on multi-stage pipelines that accumulate errors acros...

Keywords: SIMART, Sparse 3D VQ-VAE, MLLM, articulated objects, PartNet-Mobility, 3D tokenization, simulation, robotics

ABot-PhysWorld: Interactive World Foundation Model for Robotic Manipulation with Physics Alignment

9.0/10

3/24/2026 huggingface

robotics

Video-based world models offer a powerful paradigm for embodied simulation and planning, yet state-of-the-art models often generate physically implausible manipulations - such as object penetration and anti-gravity motion - due to training on generic visual data and likelihood-based objectives that ...

Keywords: ABot-PhysWorld, diffusion transformer, world model, physics alignment, DPO post-training, decoupled discriminators, parallel context block, action-controllable video

FCL-COD: Weakly Supervised Camouflaged Object Detection with Frequency-aware and Contrastive Learning

9.0/10

3/24/2026 huggingface

computer vision

Existing camouflaged object detection (COD) methods typically rely on fully-supervised learning guided by mask annotations. However, obtaining mask annotations is time-consuming and labor-intensive. Compared to fully-supervised methods, existing weakly-supervised COD methods exhibit significantly poo...

Keywords: camouflaged object detection, weakly supervised learning, Segment Anything Model, SAM, frequency-aware learning, contrastive learning, low-rank adaptation, FoRA

Papers for March 25, 2026