Paper Archive

Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

9.0/10

4/10/2026 huggingface

machine learning

Large language models (LLMs) undergo alignment training to avoid harmful behaviors, yet the resulting safeguards remain brittle: jailbreaks routinely bypass them, and fine-tuning on narrow domains can induce "emergent misalignment" that generalizes broadly. Whether this brittleness reflects a fund...

Keywords: large language models, alignment, harmful content, weight pruning, emergent misalignment, representation compression, causal intervention, safety diagnostics

VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images

9.0/10

4/10/2026 huggingface

machine learning

Vision-language models (VLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint recognition. One plausible contributing factor is that natural image datasets provide limited supervision for low-level visual skills. This motivates a practical question: can target...

Keywords: VisionFoundry, synthetic data, vision-language models, text-to-image, LLMs, visual perception, VQA, task-aware generation

Envisioning the Future, One Step at a Time

9.0/10

4/10/2026 huggingface

computer vision

Accurately anticipating how complex, diverse scenes will evolve requires models that represent uncertainty, simulate along extended interaction chains, and efficiently explore many plausible futures. Yet most existing approaches rely on dense video or latent-space prediction, expending substantial c...

Keywords: autoregressive diffusion, sparse point trajectories, open-set motion prediction, OWM benchmark, long-horizon forecasting, multi-modal futures, sampling efficiency

RecaLLM: Addressing the Lost-in-Thought Phenomenon with Explicit In-Context Retrieval

9.0/10

4/10/2026 huggingface

machine learning

We propose RecaLLM, a set of reasoning language models post-trained to make effective use of long-context information. In-context retrieval, which identifies relevant evidence from context, and reasoning are deeply intertwined: retrieval supports reasoning, while reasoning often determines what must...

Keywords: RecaLLM, in-context retrieval, lost-in-thought, constrained decoding, long-context, RULER, HELMET, retrieval-augmented

ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion

9.0/10

4/10/2026 huggingface

machine learning

Chest X-ray report generation (CXR-RG) has the potential to substantially alleviate radiologists' workload. However, conventional autoregressive vision-language models (VLMs) suffer from high inference latency due to sequential token decoding. Diffusion-based models offer a promising alternative th...

Keywords: ECHO, Direct Conditional Distillation, DCD, Response-Asymmetric Diffusion, RAD, one-step block diffusion, chest x-ray, report generation

Visually-Guided Policy Optimization for Multimodal Reasoning

9.0/10

4/10/2026 huggingface

machine learning

Reinforcement learning with verifiable rewards (RLVR) has significantly advanced the reasoning ability of vision-language models (VLMs). However, the inherent text-dominated nature of VLMs often leads to insufficient visual faithfulness, characterized by sparse attention activation to visual tokens....

Keywords: VGPO, Visually-Guided Policy Optimization, Visual Attention Compensation, advantage re-weighting, vision-language models, visual forgetting, reinforcement learning with verifiable rewards, multimodal reasoning

CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation

9.0/10

4/10/2026 huggingface

computer vision

Camera-controllable video generation aims to synthesize videos with flexible and physically plausible camera movements. However, existing methods either provide imprecise camera control from text prompts or rely on labor-intensive manual camera trajectory parameters, limiting their use in automated ...

Keywords: CT-1, camera-controllable video generation, vision-language, diffusion transformer, wavelet regularization, camera trajectories, CT-200K, video diffusion

ELT: Elastic Looped Transformers for Visual Generation

9.0/10

4/10/2026 huggingface

generative models

We introduce Elastic Looped Transformers (ELT), a highly parameter-efficient class of visual generative models based on a recurrent transformer architecture. While conventional generative models rely on deep stacks of unique transformer layers, our approach employs iterative, weight-shared transform...

Keywords: Elastic Looped Transformers, ELT, Intra-Loop Self Distillation, ILSD, weight sharing, recurrent transformer, Any-Time inference, parameter efficiency

EquiformerV3: Scaling Efficient, Expressive, and General SE(3)-Equivariant Graph Attention Transformers

9.0/10

4/10/2026 huggingface

machine learning

As SE(3)-equivariant graph neural networks mature as a core tool for 3D atomistic modeling, improving their efficiency, expressivity, and physical consistency has become a central challenge for large-scale applications. In this work, we introduce EquiformerV3, the third generation of the SE(3)-equiv...

Keywords: EquiformerV3, SE(3)-equivariant, graph attention transformer, SwiGLU-S^2, smooth cutoff attention, DeNS, PES, OC20

Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

9.0/10

4/10/2026 huggingface

machine learning

With the advancement of interactive video generation, diffusion models have increasingly demonstrated their potential as world models. However, existing approaches still struggle to simultaneously achieve memory-enabled long-term temporal consistency and high-resolution real-time generation, limitin...

Keywords: world models, video generation, long-term memory, diffusion models, distribution matching distillation, VAE pruning, model quantization, synthetic data

Papers for April 13, 2026

Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images

Envisioning the Future, One Step at a Time

RecaLLM: Addressing the Lost-in-Thought Phenomenon with Explicit In-Context Retrieval

ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion

Visually-Guided Policy Optimization for Multimodal Reasoning

CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation

ELT: Elastic Looped Transformers for Visual Generation

EquiformerV3: Scaling Efficient, Expressive, and General SE(3)-Equivariant Graph Attention Transformers

Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory