Paper Archive

AdaState: Self-Evolving Anchors for Streaming Video Generation

5.0/10

5/28/2026 huggingface

machine learning

Autoregressive video diffusion models generate streaming video by producing frames sequentially, conditioning each chunk on previously generated content. These models are structurally anchored to the first frame: its key-value representation occupies a privileged position in the attention cache and ...

Keywords: attention, diffusion model

LLMSurgeon: Diagnosing Data Mixture of Large Language Models

5.0/10

5/28/2026 huggingface

natural language processing

The pretraining data mixture of Large Language Models (LLMs) constitutes their "digital DNA", shaping model behaviors, capabilities, and failure modes. Yet this composition is rarely disclosed, making post-hoc auditing of data combination or provenance difficult. In this work, we formalize Data Mixt...

Keywords: pretraining

NeuROK: Generative 4D Neural Object Kinematics

5.0/10

5/28/2026 huggingface

computer vision

Data-driven approaches have revolutionized 3D vision, enabling transformers to effectively reconstruct and generate static 3D objects. However, generating simulative 4D dynamics -- realistic temporal deformations of static objects under various physical conditions -- remains challenging and often ad...

Keywords: transformer

YoCausal: How Far is Video Generation from World Model? A Causality Perspective

5.0/10

5/28/2026 huggingface

computer vision

As video diffusion models (VDMs) advance toward world models, a key question arises: do they truly understand causality, or merely overfit to statistical temporal patterns? Existing benchmarks mostly rely on synthetic data, limiting real-world generalization due to the sim-to-real gap. We present Yo...

Keywords: diffusion model

Unlocking the Working Memory of Large Language Models for Latent Reasoning

5.0/10

5/28/2026 huggingface

natural language processing

To improve the reasoning capabilities of large language models, test-time compute is typically scaled by generating intermediate tokens before the final answer. However, this couples reasoning to autoregressive generation and thereby conflates internal computation with external communication. In con...

GPIC: A Giant Permissive Image Corpus for Visual Generation

5.0/10

5/28/2026 huggingface

computer vision

Studying scalable methods for visual generative modeling requires large, accessible, and stable datasets. We introduce GPIC, a Giant Permissive Image Corpus of approximately 28 trillion pixels. GPIC comprises diverse internet images captioned by a state-of-the-art vision-language model, including 10...

Colored Noise Diffusion Sampling

5.0/10

5/28/2026 huggingface

computer vision

Diffusion models achieve state-of-the-art image synthesis, with their generative trajectories fundamentally exhibiting a spectral bias, resolving low-frequency global structures early and high-frequency fine details later. Conventional stochastic differential equation (SDE) solvers fail to account f...

Keywords: diffusion model

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

5.0/10

5/28/2026 huggingface

computer vision

Embodied intelligence is often studied through specialized models for individual tasks such as manipulation or navigation, resulting in fragmented capabilities and limited generalization across tasks, environments, and robot embodiments. In this work, we study whether heterogeneous embodied decision...

Keywords: pretraining

PhyGenHOI: Physically-Aware 4D Generation of Dynamic Human-Object Interactions

5.0/10

5/28/2026 huggingface

computer vision

We address the task of generating physically accurate and visually faithful 4D Human-Object Interaction (HOI). Given a static 3D human and target object represented as 3D Gaussian Splats (3DGS), our goal is to synthesize dynamic scenes where the human actively engages with the object through actions...

Keywords: diffusion model

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

5.0/10

5/28/2026 huggingface

computer vision

Vision-Language Models (VLMs) have achieved substantial progress across a wide range of understanding and reasoning tasks, driven by large-scale image-text training aimed at multimodal fusion. Ideally, replacing a textual question with its rendered-image counterpart should leave model performance es...

Papers for May 29, 2026
