Browse and export your curated research paper collection
3/26/2026 huggingface
machine learning: Multi-shot video generation is crucial for long narrative storytelling, yet current bidirectional architectures suffer from limited interactivity and high latency. We propose ShotStream, a novel causal multi-shot architecture that enables interactive storytelling and efficient on-the-fly frame gener...
3/26/2026 huggingface
computer vision: Existing feed-forward 3D Gaussian Splatting methods predict pixel-aligned primitives, leading to quadratic growth in primitive count as resolution increases. This fundamentally limits their scalability, making high-resolution synthesis such as 4K intractable. We introduce LGTM (Less Gaussians, Tex...
3/26/2026 huggingface
computer vision: Vision Foundation Models (VFMs) have become the cornerstone of modern computer vision, offering robust representations across a wide array of tasks. While recent advances allow these models to handle varying input sizes during training, inference typically remains restricted to a single, fixed scale...
3/26/2026 huggingface
computer vision: Accurate estimation of large-displacement optical flow remains a critical challenge. Existing methods typically rely on iterative local search and/or domain-specific fine-tuning, which severely limits their performance in large-displacement and zero-shot generalization scenarios. To overcome this, w...
3/26/2026 huggingface
machine learning: Multimodal Large Language Models (MLLMs) have shown strong performance on Video Temporal Grounding (VTG). However, their coarse recognition capabilities are insufficient for fine-grained temporal understanding, making task-specific fine-tuning indispensable. This fine-tuning causes models to memoriz...
3/26/2026 huggingface
computer vision: Autoregressive video diffusion models have demonstrated remarkable progress, yet they remain bottlenecked by intractable linear KV-cache growth, temporal repetition, and compounding errors during long-video generation. To address these challenges, we present PackForcing, a unified framework that eff...
3/26/2026 huggingface
computer vision: Fine-grained facial expression editing has long been limited by intrinsic semantic overlap. To address this, we construct the Flex Facial Expression (FFE) dataset with continuous affective annotations and establish FFE-Bench to evaluate structural confusion, editing accuracy, linear controllability,...
3/26/2026 huggingface
computer vision: Video world models have shown immense potential in simulating the physical world, yet existing memory mechanisms primarily treat environments as static canvases. When dynamic subjects hide out of sight and later re-emerge, current methods often struggle, leading to frozen, distorted, or vanishing su...
3/26/2026 huggingface
machine learning: Block-diffusion language models offer a promising path toward faster-than-autoregressive generation by combining block-wise autoregressive decoding with within-block parallel denoising. However, in the few-step regime needed for practical acceleration, standard confidence-thresholded decoding is oft...
3/26/2026 huggingface
robotics: Action-conditioned robot world models generate future video frames of the manipulated scene given a robot action sequence, offering a promising alternative for simulating tasks that are difficult to model with traditional physics engines. However, these models are optimized for short-term prediction...