Paper Archive

MedSAM3: Delving into Segment Anything with Medical Concepts

9.0/10

Unknown authors 11/26/2025 huggingface

machine learning

Medical image segmentation is fundamental for biomedical discovery. Existing methods lack generalizability and demand extensive, time-consuming manual annotation for new clinical applications. Here, we propose MedSAM-3, a text-promptable medical segmentation model for medical image and video segmenta...

Keywords: MedSAM-3, Segment Anything Model, Promptable Concept Segmentation, medical image segmentation, multimodal LLM, agent-in-the-loop, open-vocabulary

Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward

9.0/10

Unknown authors 11/26/2025 huggingface

machine learning

Recent years have witnessed significant progress in Unified Multimodal Models, yet a fundamental question remains: Does understanding truly inform generation? To investigate this, we introduce UniSandbox, a decoupled evaluation framework paired with controlled, synthetic datasets to avoid data leaka...

Keywords: Unified Multimodal Models, UniSandbox, understanding-generation gap, Chain-of-Thought, self-training, knowledge transfer, query-based architectures, synthetic datasets

GigaWorld-0: World Models as Data Engine to Empower Embodied AI

9.0/10

Unknown authors 11/26/2025 huggingface

machine learning

World models are emerging as a foundational paradigm for scalable, data-efficient embodied AI. In this work, we present GigaWorld-0, a unified world model framework designed explicitly as a data engine for Vision-Language-Action (VLA) learning. GigaWorld-0 integrates two synergistic components: Giga...

Keywords: world models, embodied AI, Vision-Language-Action, video generation, 3D Gaussian Splatting, differentiable system identification, motion planning, GigaTrain

SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation

9.0/10

Unknown authors 11/26/2025 huggingface

computer vision

Preserving first-frame identity while ensuring precise motion control is a fundamental challenge in human image animation. The Image-to-Motion Binding process of the dominant Reference-to-Video (R2V) paradigm overlooks critical spatio-temporal misalignments common in real-world applications, leading...

Keywords: human image animation, image-to-video, first-frame preservation, condition reconciliation, pose modulation, temporal coherence, generative models, computer vision

UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers

9.0/10

Unknown authors 11/26/2025 huggingface

computer vision

Despite advances, video diffusion transformers still struggle to generalize beyond their training length, a challenge we term video length extrapolation. We identify two failure modes: model-specific periodic content repetition and a universal quality degradation. Prior works attempt to solve repeti...

Keywords: video diffusion, transformer, extrapolation, attention dispersion, positional encoding, UltraViCo, training-free, video synthesis

ReDirector: Creating Any-Length Video Retakes with Rotary Camera Encoding

9.0/10

Unknown authors 11/26/2025 huggingface

computer vision

We present ReDirector, a novel camera-controlled video retake generation method for dynamically captured variable-length videos. In particular, we rectify a common misuse of RoPE in previous works by aligning the spatiotemporal positions of the input video and the target retake. Moreover, we introdu...

Keywords: ReDirector, Rotary Camera Encoding, RoCE, RoPE, camera-controlled retake, spatiotemporal alignment, multi-view, out-of-distribution generalization

VQ-VA World: Towards High-Quality Visual Question-Visual Answering

9.0/10

Unknown authors 11/26/2025 huggingface

computer vision

This paper studies Visual Question-Visual Answering (VQ-VA): generating an image, rather than text, in response to a visual question -- an ability that has recently emerged in proprietary systems such as NanoBanana and GPT-Image. To also bring this capability to open-source models, we introduce VQ-V...

Keywords: VQ-VA, VQ-VA World, IntelligentBench, LightFusion, agentic pipeline, image generation, visual question answering, dataset construction

Soft Adaptive Policy Optimization

9.0/10

Unknown authors 11/26/2025 huggingface

reinforcement learning

Reinforcement learning (RL) plays an increasingly important role in enhancing the reasoning capabilities of large language models (LLMs), yet stable and performant policy optimization remains challenging. Token-level importance ratios often exhibit high variance, a phenomenon exacerbated in Mixture-o...

Keywords: SAPO, soft gating, policy optimization, reinforcement learning, LLM fine-tuning, Mixture-of-Experts, GSPO, GRPO

iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation

8.0/10

Unknown authors 11/26/2025 huggingface

generative models

Pre-trained video models learn powerful priors for generating high-quality, temporally coherent content. While these models excel at temporal coherence, their dynamics are often constrained by the continuous nature of their training data. We hypothesize that by injecting the rich and unconstrained c...

Keywords: iMontage, video models, image generation, many-to-many, motion priors, temporal coherence, data curation, model adaptation

Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning

6.0/10

Unknown authors 11/26/2025 huggingface

machine learning

This paper presents Agent0-VL, a self-evolving agent for tool-integrated vision-language reasoning. The full abstract is not available at this time. Please visit the paper's website for complete details about the methodology, results, and contributions.

Keywords: Agent0-VL, self-evolving, tool-integrated, vision-language, multi-modal agents
