Paper Archive

Thinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models

5.0/10

6/1/2026 huggingface

computer vision

Inverse graphics is a longstanding and highly underconstrained problem that seeks to reconstruct images as editable 3D scenes which can be rendered, relit, and manipulated. In this work, we investigate whether pretrained vision-language models (VLMs) can perform executable inverse graphics directly ...

Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling

5.0/10

6/1/2026 huggingface

computer vision

Recent multimodal large language models have demonstrated strong reasoning ability, yet their reliability as automated evaluators remains limited by a critical weakness: when visual evidence conflicts with textual cues, MLLM judges tend to reward plausible narratives over perceptually correct answer...

VISReg: Variance-Invariance-Sketching Regularization for JEPA training

5.0/10

6/1/2026 huggingface

computer vision

Self-supervised learning methods prevent embedding collapse via modeling heuristics or explicit regularization of the embedding space. Among the latter, VICReg decomposes regularization into variance and covariance objectives, offering flexibility and interpretability. However, covariance captures o...

VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization

5.0/10

6/1/2026 huggingface

computer vision

The recent "Reasoning with Video" paradigm utilizes Video Generation Models (VGMs) to generate temporally coherent visual trajectories to complete reasoning tasks. Although state-of-the-art VGMs excel at visual quality, they often struggle to understand and follow task-specific rules, leading to log...

LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation

5.0/10

6/1/2026 huggingface

natural language processing

Autoregressive (AR) video diffusion enables variable-length synthesis, but long-horizon generation often suffers from accumulated errors and identity drift. For efficiency, existing methods commonly adopt sliding-window attention during generation. This creates an irreversible generation trajectory:...

Keywords: attention

Modeling Depth Ambiguity: A Mixture-Density Representation for Flying-Point-Free Depth Estimation

5.0/10

6/1/2026 huggingface

machine learning

Despite advances in depth estimation, flying points remain a persistent failure mode: near object boundaries, depth estimators often predict spurious 3D points in the empty space between foreground and background surfaces. We trace this artifact to a standard modeling choice: assigning each pixel a ...

SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction

5.0/10

6/1/2026 huggingface

natural language processing

Agent skills occupy a privileged position in the agent workflow, as agents are expected to implicitly follow and execute them, rendering third-party skills a vulnerable attack surface. Existing studies have revealed unsafe agent behaviors induced by skill-based attacks, but they primarily evaluate p...

Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

5.0/10

6/1/2026 huggingface

computer vision

Video multimodal large language models (MLLMs) have made rapid progress on general and long-form video understanding, yet their ability to preserve brief answer-critical visual evidence remains underexplored. Many practical questions are determined by momentary visual events: localized actions or st...

X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding

5.0/10

6/1/2026 huggingface

natural language processing

While video streaming understanding has made significant strides, real-world applications, such as live sports broadcasting, autonomous driving, and multi-screen collaboration, inherently demand continuous, multi-stream interactions. However, existing benchmarks are confined to single-stream paradig...

Keywords: multi-modal

MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation

5.0/10

6/1/2026 huggingface

natural language processing

The Model Context Protocol (MCP) has emerged as a transformative standard for connecting large language models (LLMs) with external data sources and tools, and has been rapidly adopted across personal applications and development platforms. However, existing benchmarks predominantly focus on generic...

Browse by Date

Papers for June 2, 2026

Thinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models

Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling

VISReg: Variance-Invariance-Sketching Regularization for JEPA training

VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization

LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation

Modeling Depth Ambiguity: A Mixture-Density Representation for Flying-Point-Free Depth Estimation

SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction

Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding

MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation