{"papers":[{"id":"arxiv_2604.22749v1","arxiv_id":"2604.22749v1","title":"Representational Harms in LLM-Generated Narratives Against Global Majority Nationalities","abstract":"Large language models (LLMs) are increasingly used for text generation tasks from everyday use to high-stakes enterprise and government applications, including simulated interviews with asylum seekers. While many works highlight the new potential applications of LLMs, there are risks of LLMs encoding and perpetuating harmful biases about non-dominant communities across the globe. To better evaluate and mitigate such harms, more research examining how LLMs portray diverse individuals is needed. In this work, we study how national origin identities are portrayed by widely-adopted LLMs in response to open-ended narrative generation prompts. Our findings demonstrate the presence of persistent representational harms by national origin, including harmful stereotypes, erasure, and one-dimensional portrayals of Global Majority identities. Minoritized national identities are simultaneously underrepresented in power-neutral stories and overrepresented in subordinated character portrayals, which are over fifty times more likely to appear than dominant portrayals. The degree of harm is amplified when US nationality cues (e.g., ``American'') are present in input prompts. Notably, we find that the harms we identify cannot be explained away via sycophancy, as US-centric biases persist even when replacing US nationality cues with non-US national identities in the prompts. Based on our findings, we call for further exploration of cultural harms in LLMs through methodologies that center Global Majority perspectives and challenge the uncritical adoption of US-based LLMs for the classification, surveillance, and misrepresentation of the majority of our planet.","authors":["Ilana Nguyen","Harini Suresh","Thema Monroe-White","Evan Shieh"],"published":"2026-04-24T17:49:09Z","updated":"2026-04-24T17:49:09Z","category":"","source":"arxiv","original_source":"arxiv","url":"http://arxiv.org/abs/2604.22749v1","pdf_url":"http://arxiv.org/pdf/2604.22749v1.pdf","scraped_at":"2026-04-27T08:00:53.780Z","analysis":{"introduction":"🚀 Do LLMs erase or stereotype most of the world? This paper shows widely-adopted LLMs systematically produce harmful, one-dimensional portrayals of Global Majority nationalities in open-ended narratives. Why it matters: affects asylum interviews, policy, and fairness for marginalized communities.","challenges":"🎯 Key problems tackled: - LLMs encode and perpetuate harmful biases and stereotypes about non-dominant nationalities. - Lack of empirical work examining how national-origin identities are portrayed in narrative generation. - US-centric cues amplify harms, risking misuse in real-world settings (e.g., asylum simulations).","innovations":"✨ What they did differently: - Empirical study of how widely-adopted LLMs portray national origin identities using open-ended narrative prompts. - Quantified representational harms: stereotypes, erasure, and one-dimensional portrayals; measured underrepresentation in power-neutral roles and overrepresentation in subordinated roles. - Tested sycophancy by replacing US cues and showed US-centric bias persists. Novelty: centers Global Majority nationalities and directly tests whether US-centric framing explains harms.","experiments":"📊 Main quantitative finding: Subordinated character portrayals of minoritized nationalities are over fifty times more likely than dominant portrayals. 
This demonstrates that LLMs systematically marginalize Global Majority identities in generated narratives, and harms are amplified by US nationality cues.","insights":"🤔 What's next? - Research: co-design evaluation frameworks with Global Majority communities and build regionally diverse benchmarks to surface cultural harms. - Mitigation: explore targeted fine-tuning, culturally-aware prompt guards, or auditing pipelines before using LLMs in high-stakes contexts (e.g., asylum, surveillance). Could reframing LLM training and evaluation reduce these deep cultural misrepresentations?","keywords":["representational_harm","LLM_bias","national_origin","Global_Majority","stereotypes","narrative_generation","US-centric_bias","fairness"],"category":"machine_learning","relevance_score":9,"technical_depth":"intermediate","summary":"**Introduction:** 🚀 Do LLMs erase or stereotype most of the world? This paper shows widely-adopted LLMs systematically produce harmful, one-dimensional portrayals of Global Majority nationalities in open-ended narratives. Why it matters: affects asylum interviews, policy, and fairness for marginalized communities.\n\n**Challenges:** 🎯 Key problems tackled: - LLMs encode and perpetuate harmful biases and stereotypes about non-dominan...","analyzed_at":"2026-04-27T08:01:19.288Z","model":"openai/gpt-5-mini"},"views":0},{"id":"hf_2604.22586_5h4tv6m86","title":"FlowAnchor: Stabilizing the Editing Signal for Inversion-Free Video Editing","abstract":"We propose FlowAnchor, a training-free framework for stable and efficient inversion-free, flow-based video editing. Inversion-free editing methods have recently shown impressive efficiency and structure preservation in images by directly steering the sampling trajectory with an editing signal. However, extending this paradigm to videos remains challenging, often failing in multi-object scenes or with increased frame counts. We identify the root cause as the instability of the editing signal in high-dimensional video latent spaces, which arises from imprecise spatial localization and length-induced magnitude attenuation. To overcome this challenge, FlowAnchor explicitly anchors both where to edit and how strongly to edit. It introduces Spatial-aware Attention Refinement, which enforces consistent alignment between textual guidance and spatial regions, and Adaptive Magnitude Modulation, which adaptively preserves sufficient editing strength. Together, these mechanisms stabilize the editing signal and guide the flow-based evolution toward the desired target distribution. Extensive experiments demonstrate that FlowAnchor achieves more faithful, temporally coherent, and computationally efficient video editing across challenging multi-object and fast-motion scenarios. The project page is available at https://cuc-mipg.github.io/FlowAnchor.github.io/.","authors":[{"_id":"69eededa7bd3cb2932a35e91","name":"Ze Chen","hidden":false},{"_id":"69eededa7bd3cb2932a35e92","name":"Lan Chen","hidden":false},{"_id":"69eededa7bd3cb2932a35e93","name":"Yuanhang Li","hidden":false},{"_id":"69eededa7bd3cb2932a35e94","name":"Qi Mao","hidden":false}],"published":"2026-04-24T00:00:00.000Z","updated":"2026-04-27T08:00:45.673Z","category":"computer_vision","source":"huggingface","original_source":"huggingface_api","url":"https://huggingface.co/papers/2604.22586","pdf_url":"","scraped_at":"2026-04-27T08:00:45.673Z","analysis":{"introduction":"🚀 Want stable, fast video edits without costly inversion? 
FlowAnchor stabilizes inversion-free, flow-based video editing by explicitly anchoring where and how strongly to edit. Makes multi-object & long-frame edits more faithful and temporally coherent — great for creators & VFX.","challenges":"🎯 Key problems tackled: - Instability of the editing signal in high-dimensional video latent spaces. - Imprecise spatial localization causing wrong regions to be edited. - Magnitude attenuation over longer trajectories leading to weak edits, failing on multi-object or long videos.","innovations":"✨ Core ideas: - FlowAnchor: a training-free, inversion-free flow-based video editing framework. - Spatial-aware Attention Refinement: enforces consistent alignment between text guidance and spatial regions. - Adaptive Magnitude Modulation: dynamically preserves sufficient editing strength. Novelty: explicitly anchors both \"where\" and \"how strongly\" to edit, stabilizing the editing signal without retraining.","experiments":"📊 Quantitative result: Not specified in the paper. But the experiments demonstrate that FlowAnchor yields more faithful, temporally coherent, and computationally efficient video edits across challenging multi-object and fast-motion scenarios compared to prior inversion-free flow methods.","insights":"🤔 What's next? - Research directions: integrate learned anchor predictors or user-guided anchors; combine FlowAnchor with learned flow models for richer edits. - Applications: VFX & film post-production, AR/VR content creation, real-time video editing tools. Could anchoring paradigms unlock reliable, real-time semantic video editing?","keywords":["FlowAnchor","inversion-free","flow-based","video editing","Spatial-aware Attention Refinement","Adaptive Magnitude Modulation","temporal coherence","multi-object"],"category":"computer_vision","relevance_score":9,"technical_depth":"advanced","summary":"**Introduction:** 🚀 Want stable, fast video edits without costly inversion? FlowAnchor stabilizes inversion-free, flow-based video editing by explicitly anchoring where and how strongly to edit. Makes multi-object & long-frame edits more faithful and temporally coherent — great for creators & VFX.\n\n**Challenges:** 🎯 Key problems tackled: - Instability of the editing signal in high-dimensional video latent spaces. - Imprecise spati...","analyzed_at":"2026-04-27T08:03:42.026Z","model":"openai/gpt-5-mini"},"views":0},{"id":"hf_2604.21689_352xdipzm","title":"StyleID: A Perception-Aware Dataset and Metric for Stylization-Agnostic Facial Identity Recognition","abstract":"Creative face stylization aims to render portraits in diverse visual idioms such as cartoons, sketches, and paintings while retaining recognizable identity. However, current identity encoders, which are typically trained and calibrated on natural photographs, exhibit severe brittleness under stylization. They often mistake changes in texture or color palette for identity drift or fail to detect geometric exaggerations. This reveals the lack of a style-agnostic framework to evaluate and supervise identity consistency across varying styles and strengths. To address this gap, we introduce StyleID, a human perception-aware dataset and evaluation framework for facial identity under stylization. 
StyleID comprises two datasets: (i) StyleBench-H, a benchmark that captures human same-different verification judgments across diffusion- and flow-matching-based stylization at multiple style strengths, and (ii) StyleBench-S, a supervision set derived from psychometric recognition-strength curves obtained through controlled two-alternative forced-choice (2AFC) experiments. Leveraging StyleBench-S, we fine-tune existing semantic encoders to align their similarity orderings with human perception across styles and strengths. Experiments demonstrate that our calibrated models yield significantly higher correlation with human judgments and enhanced robustness for out-of-domain, artist drawn portraits. All of our datasets, code, and pretrained models are publicly available at https://kwanyun.github.io/StyleID_page/","authors":[{"_id":"69ead908a961c3f1b5f40d66","name":"Kwan Yun","hidden":false},{"_id":"69ead908a961c3f1b5f40d67","name":"Changmin Lee","hidden":false},{"_id":"69ead908a961c3f1b5f40d68","name":"Ayeong Jeong","hidden":false},{"_id":"69ead908a961c3f1b5f40d69","name":"Youngseo Kim","hidden":false},{"_id":"69ead908a961c3f1b5f40d6a","name":"Seungmi Lee","hidden":false},{"_id":"69ead908a961c3f1b5f40d6b","name":"Junyong Noh","hidden":false}],"published":"2026-04-23T00:00:00.000Z","updated":"2026-04-27T08:00:45.673Z","category":"computer_vision","source":"huggingface","original_source":"huggingface_api","url":"https://huggingface.co/papers/2604.21689","pdf_url":"","scraped_at":"2026-04-27T08:00:45.673Z","analysis":{"introduction":"🚀 Ever had your face look unrecognizable after a cartoon filter? StyleID introduces a perception-aware dataset & metric to make facial identity recognition stylization-agnostic. It aligns encoders with human judgments across styles/strengths — aiding stylization apps and researchers.","challenges":"🎯 Challenges: - Identity encoders are brittle under stylization, misreading texture/color changes as identity drift. - They often miss geometric/exaggerated alterations in stylized portraits. - No style-agnostic framework to evaluate or supervise identity consistency across varying styles & strengths.","innovations":"✨ Innovations: - StyleBench-H: benchmark of human same–different verification across diffusion- and flow-matching stylizations at multiple strengths. - StyleBench-S: supervision set from psychometric recognition-strength curves via controlled 2AFC experiments. - Fine-tuning semantic encoders to align similarity orderings with human perception across styles/strengths.","experiments":"📊 Most compelling result: Calibrated models yield significantly higher correlation with human judgments and improved robustness to out-of-domain, artist-drawn portraits. Exact quantitative gains (e.g., % improvement) are Not specified in the paper.","insights":"🤔 Insights: - Future directions: extend perception-aware calibration to stylized video (temporal consistency) and explore demographic/fairness impacts of stylization on recognition. - Applications: identity-preserving avatar generation, robust creative-authentication tools. Could perception-aligned training become standard for creative-image pipelines?","keywords":["StyleID","StyleBench","facial identity","stylization","perception-aware","2AFC","psychometric curves","encoder calibration"],"category":"computer_vision","relevance_score":9,"technical_depth":"advanced","summary":"**Introduction:** 🚀 Ever had your face look unrecognizable after a cartoon filter? 
StyleID introduces a perception-aware dataset & metric to make facial identity recognition stylization-agnostic. It aligns encoders with human judgments across styles/strengths — aiding stylization apps and researchers.\n\n**Challenges:** 🎯 Challenges: - Identity encoders are brittle under stylization, misreading texture/color changes as identity drif...","analyzed_at":"2026-04-27T08:06:34.317Z","model":"openai/gpt-5-mini"},"views":0},{"id":"hf_2604.21611_brk1rxcps","title":"Process Supervision via Verbal Critique Improves Reasoning in Large Language Models","abstract":"Inference-time scaling for LLM reasoning has focused on three axes: chain depth, sample breadth, and learned step-scorers (PRMs). We introduce a fourth axis, granularity of external verbal supervision, via Verbal Process Supervision (VPS), a training-free framework that uses structured natural-language critique from a stronger supervisor to guide an iterative generate-critique-refine loop up to a round budget R. Across GPQA Diamond, AIME 2025, and LiveCodeBench V6 (covering both closed and open models), VPS yields three key results. First, on GPQA Diamond, GPT-5.4 (High) | GPT-5.4 (Low) reaches 94.9% at R=4, surpassing the 94.1% state of the art without gradient updates. Second, on AIME 2025, VPS enables strong weak-actor rescue, boosting scores from 11.7-26.7% to 63.3-90.0% (up to +63.3 points). Third, at matched compute, VPS outperforms Reflexion by +8.5 to +12.1 points and Self-Consistency@5 by +5.0 pp (GPQA) and +8.3 pp (LiveCodeBench), isolating critique granularity as the key driver. Performance scales with the supervisor-actor capability gap (Pearson r=0.90) and degrades when errors are not linguistically expressible (e.g., code synthesis), motivating hybrid verbal-executable methods. These results establish critique granularity as a new axis of inference-time scaling.","authors":[{"_id":"69ec15837bd3cb2932a35697","name":"Hao-Yuan Chen","hidden":false}],"published":"2026-04-23T00:00:00.000Z","updated":"2026-04-27T08:00:45.673Z","category":"computer_vision","source":"huggingface","original_source":"huggingface_api","url":"https://huggingface.co/papers/2604.21611","pdf_url":"","scraped_at":"2026-04-27T08:00:45.673Z","analysis":{"introduction":"🚀 Want better LLM reasoning without retraining?  Verbal Process Supervision (VPS) uses structured natural-language critique from a stronger supervisor in a generate→critique→refine loop (budget R).  It’s a training‑free way to boost reasoning and rescue weak actors. 💡","challenges":"🎯 Problems tackled: - Inference-time scaling focused on depth, breadth, and learned step-scorers, but not critique granularity. - Weak actor models fail on hard benchmarks and need costly finetuning. - Existing iterative methods (e.g., Reflexion) don't isolate verbal critique as the key factor.","innovations":"✨ Core ideas: - Verbal Process Supervision (VPS): structured natural-language critique from a stronger supervisor. - Iterative generate→critique→refine loop up to round budget R (training-free). - Novel axis: granularity of external verbal supervision.","experiments":"📊 Standout result: GPT-5.4 (High|Low) reaches 94.9% on GPQA Diamond at R=4, beating the 94.1% state of the art — without gradient updates.  Also: rescued weak actors on AIME 2025 (from ~12–27% → 63–90%).","insights":"🤔 What’s next? - Explore hybrid verbal+executable supervision for tasks with non-linguistic errors (e.g., code synthesis). - Automate supervisor selection/aggregation and study robustness to supervisor mistakes. 
Applications: automated tutoring, code review, QA pipelines. Could VPS shift how we scale LLMs? 🔭","keywords":["Verbal Process Supervision","VPS","LLM reasoning","critique","generate-critique-refine","inference-time scaling","Reflexion","Self-Consistency","GPT-5.4"],"category":"machine_learning","relevance_score":9,"technical_depth":"advanced","summary":"**Introduction:** 🚀 Want better LLM reasoning without retraining?  Verbal Process Supervision (VPS) uses structured natural-language critique from a stronger supervisor in a generate→critique→refine loop (budget R).  It’s a training‑free way to boost reasoning and rescue weak actors. 💡\n\n**Challenges:** 🎯 Problems tackled: - Inference-time scaling focused on depth, breadth, and learned step-scorers, but not critique granularity. -...","analyzed_at":"2026-04-27T08:07:27.686Z","model":"openai/gpt-5-mini"},"views":0},{"id":"hf_2604.21764_n5pl82gj2","title":"Thinking with Reasoning Skills: Fewer Tokens, More Accuracy","abstract":"Reasoning LLMs often spend substantial tokens on long intermediate reasoning traces (e.g., chain-of-thought) when solving new problems. We propose to summarize and store reusable reasoning skills distilled from extensive deliberation and trial-and-error exploration, and to retrieve these skills at inference time to guide future reasoning. Unlike the prevailing reasoning from scratch paradigm, our approach first recalls relevant skills for each query, helping the model avoid redundant detours and focus on effective solution paths. We evaluate our method on coding and mathematical reasoning tasks, and find that it significantly reduces reasoning tokens while improving overall performance. The resulting lower per-request cost indicates strong practical and economic potential for real-world deployment.","authors":[{"_id":"69eb1ba2a961c3f1b5f40ebc","name":"Guangxiang Zhao","hidden":false},{"_id":"69eb1ba2a961c3f1b5f40ebd","name":"Qilong Shi","hidden":false},{"_id":"69eb1ba2a961c3f1b5f40ebe","name":"Xusen Xiao","hidden":false},{"_id":"69eb1ba2a961c3f1b5f40ebf","name":"Xiangzheng Zhang","hidden":false},{"_id":"69eb1ba2a961c3f1b5f40ec0","name":"Tong Yang","hidden":false},{"_id":"69eb1ba2a961c3f1b5f40ec1","name":"Lin Sun","hidden":false}],"published":"2026-04-23T00:00:00.000Z","updated":"2026-04-27T08:00:45.673Z","category":"reinforcement_learning","source":"huggingface","original_source":"huggingface_api","url":"https://huggingface.co/papers/2604.21764","pdf_url":"","scraped_at":"2026-04-27T08:00:45.673Z","analysis":{"introduction":"🚀 Tired of LLMs wasting tokens on long step-by-step traces? What if models could recall reusable reasoning skills instead of re-deriving them every time?  This paper proposes summarizing and storing distilled reasoning skills to retrieve at inference — fewer tokens, better accuracy, lower cost.","challenges":"🎯 Key problems tackled: - Existing reasoning LLMs spend many tokens on long intermediate traces (e.g., chain-of-thought), raising latency and cost. - Prevailing \"reasoning from scratch\" causes redundant detours and inefficiency. - Practical deployment is hampered by high per-request token use. (Exact metrics: Not specified in the paper.)","innovations":"✨ Novel approach in simple terms: - Distill reusable \"reasoning skills\" via extensive deliberation and trial-and-error. - Summarize and store those skills in a retrievable library. - Retrieve relevant skills at inference to guide reasoning and avoid redundant paths. 
- Novel twist: retrieving distilled skills (not full CoT traces) to reduce tokens and focus solution search.","experiments":"📊 Experimental takeaway: - Evaluated on coding and mathematical reasoning tasks; reported significant reductions in reasoning tokens while improving overall performance. - Main proof: skill recall leads to lower per-request token cost and better accuracy compared to reasoning-from-scratch baselines. - Exact quantitative results and benchmarks: Not specified in the paper.","insights":"🤔 What's next? - Future research: automated curation and ranking of skill libraries; studying cross-domain transfer of distilled skills (e.g., from math → coding). - Broader apps: cost-efficient API deployment, tutoring systems that reuse pedagogical solution fragments. Could skill-recall become a standard layer in future LLM stacks?","keywords":["reasoning skills","skill retrieval","chain-of-thought","token efficiency","LLMs","coding reasoning","mathematical reasoning","cost reduction","deliberation","trial-and-error"],"category":"machine_learning","relevance_score":9,"technical_depth":"intermediate","summary":"**Introduction:** 🚀 Tired of LLMs wasting tokens on long step-by-step traces? What if models could recall reusable reasoning skills instead of re-deriving them every time?  This paper proposes summarizing and storing distilled reasoning skills to retrieve at inference — fewer tokens, better accuracy, lower cost.","analyzed_at":"2026-04-27T08:06:27.765Z","model":"openai/gpt-5-mini"},"views":0},{"id":"hf_2604.22291_ue9rua5xp","title":"Train in Vain: Functionality-Preserving Poisoning to Prevent Unauthorized Use of Code Datasets","abstract":"The widespread availability of large-scale code datasets has accelerated the development of code large language models (CodeLLMs), raising concerns about unauthorized dataset usage. Dataset poisoning offers a proactive defense by reducing the utility of such unauthorized training. However, existing poisoning methods often require full dataset poisoning and introduce transformations that break code compilability. In this paper, we introduce FunPoison, a functionality-preserving poisoning approach that injects short, compilable weak-use fragments into executed code paths. FunPoison leverages reusable statement-level templates with automatic repair and conservative safety checking to ensure side-effect freedom, while a type-aware synthesis module suppresses static analysis warnings and enhances stealth. 
Extensive experiments show that FunPoison achieves effective poisoning by contaminating only 10% of the dataset, while maintaining 100% compilability and functional correctness, and remains robust against various advanced code sanitization techniques.","authors":[{"_id":"69ef12597bd3cb2932a35fc6","name":"Yuan Xiao","hidden":false},{"_id":"69ef12597bd3cb2932a35fc7","name":"Jiaming Wang","hidden":false},{"_id":"69ef12597bd3cb2932a35fc8","name":"Yuchen Chen","hidden":false},{"_id":"69ef12597bd3cb2932a35fc9","name":"Wei Song","hidden":false},{"_id":"69ef12597bd3cb2932a35fca","name":"Jun Sun","hidden":false},{"_id":"69ef12597bd3cb2932a35fcb","name":"Shiqing Ma","hidden":false},{"_id":"69ef12597bd3cb2932a35fcc","name":"Yanzhou Mu","hidden":false},{"_id":"69ef12597bd3cb2932a35fcd","name":"Juan Zhai","hidden":false},{"_id":"69ef12597bd3cb2932a35fce","name":"Chunrong Fang","hidden":false},{"_id":"69ef12597bd3cb2932a35fcf","name":"Jin Song Dong","hidden":false},{"_id":"69ef12597bd3cb2932a35fd0","name":"Zhenyu Chen","hidden":false}],"published":"2026-04-24T00:00:00.000Z","updated":"2026-04-27T08:00:45.673Z","category":"natural_language_processing","source":"huggingface","original_source":"huggingface_api","url":"https://huggingface.co/papers/2604.22291","pdf_url":"","scraped_at":"2026-04-27T08:00:45.673Z","analysis":{"introduction":"🚀 Ever worried your public code corpus will train an unauthorized CodeLLM? FunPoison: a functionality-preserving dataset poisoning method that injects short, compilable fragments into executed code paths to sabotage unauthorized training while keeping code working. Protects dataset owners.","challenges":"🎯 Key problems tackled: - Existing poisoning approaches often require poisoning entire datasets. - Prior transforms break code compilability or functional correctness. - Hard to stealthily poison code without static-analysis warnings or side effects.","innovations":"✨ Core novelties: - FunPoison: injects short, compilable weak-use fragments into executed code paths. - Reusable statement-level templates + automatic repair and conservative safety checks to ensure side-effect freedom. - Type-aware synthesis to suppress static-analysis warnings and boost stealth.","experiments":"📊 Main empirical result: FunPoison contaminates only 10% of a dataset while maintaining 100% compilability and functional correctness, and remains robust against various advanced code sanitization techniques — demonstrating effective, low-overhead poisoning.","insights":"🤔 Where this points next: - Research: measure long-term effects on downstream CodeLLM behavior and develop adaptive countermeasures that respect functionality. - Applications: dataset stewardship/proprietary code protection and targeted dataset watermarking. Could stealthy, functionality-preserving defenses become standard for public code releases?","keywords":["dataset poisoning","code LLM","functionality-preserving","FunPoison","compilability","type-aware synthesis","code sanitization","dataset protection","data poisoning","robustness"],"category":"machine_learning","relevance_score":9,"technical_depth":"advanced","summary":"**Introduction:** 🚀 Ever worried your public code corpus will train an unauthorized CodeLLM? FunPoison: a functionality-preserving dataset poisoning method that injects short, compilable fragments into executed code paths to sabotage unauthorized training while keeping code working. 
Protects dataset owners.\n\n**Challenges:** 🎯 Key problems tackled: - Existing poisoning approaches often require poisoning entire datasets. - Prior tra...","analyzed_at":"2026-04-27T08:04:03.985Z","model":"openai/gpt-5-mini"},"views":0},{"id":"arxiv_2604.22748v1","arxiv_id":"2604.22748v1","title":"Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond","abstract":"As AI systems move from generating text to accomplishing goals through sustained interaction, the ability to model environment dynamics becomes a central bottleneck. Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term world model carries different meanings across research communities. We introduce a \"levels x laws\" taxonomy organized along two axes. The first defines three capability levels: L1 Predictor, which learns one-step local transition operators; L2 Simulator, which composes them into multi-step, action-conditioned rollouts that respect domain laws; and L3 Evolver, which autonomously revises its own model when predictions fail against new evidence. The second identifies four governing-law regimes: physical, digital, social, and scientific. These regimes determine what constraints a world model must satisfy and where it is most likely to fail. Using this framework, we synthesize over 400 works and summarize more than 100 representative systems spanning model-based reinforcement learning, video generation, web and GUI agents, multi-agent social simulation, and AI-driven scientific discovery. We analyze methods, failure modes, and evaluation practices across level-regime pairs, propose decision-centric evaluation principles and a minimal reproducible evaluation package, and outline architectural guidance, open problems, and governance challenges. The resulting roadmap connects previously isolated communities and charts a path from passive next-step prediction toward world models that can simulate, and ultimately reshape, the environments in which agents operate.","authors":["Meng Chu","Xuan Billy Zhang","Kevin Qinghong Lin","Lingdong Kong","Jize Zhang","Teng Tu","Weijian Ma","Ziqi Huang","Senqiao Yang","Wei Huang","Yeying Jin","Zhefan Rao","Jinhui Ye","Xinyu Lin","Xichen Zhang","Qisheng Hu","Shuai Yang","Leyang Shen","Wei Chow","Yifei Dong","Fengyi Wu","Quanyu Long","Bin Xia","Shaozuo Yu","Mingkang Zhu","Wenhu Zhang","Jiehui Huang","Haokun Gui","Haoxuan Che","Long Chen","Qifeng Chen","Wenxuan Zhang","Wenya Wang","Xiaojuan Qi","Yang Deng","Yanwei Li","Mike Zheng Shou","Zhi-Qi Cheng","See-Kiong Ng","Ziwei Liu","Philip Torr","Jiaya Jia"],"published":"2026-04-24T17:48:47Z","updated":"2026-04-24T17:48:47Z","category":"","source":"arxiv","original_source":"arxiv","url":"http://arxiv.org/abs/2604.22748v1","pdf_url":"http://arxiv.org/pdf/2604.22748v1.pdf","scraped_at":"2026-04-27T08:00:53.780Z","analysis":{"introduction":"🚀 What if AI could not just predict the next step but simulate and reshape whole environments? This paper proposes a unifying \"levels x laws\" taxonomy (L1 Predictor, L2 Simulator, L3 Evolver) across four law regimes to build actionable world models — crucial for robotics, agents, and scientific discovery.","challenges":"🎯 Key problems tackled: - Ambiguity: \"world model\" means different things across communities, hindering progress. - Composition: One-step predictors fail to produce coherent, multi-step, action-conditioned simulations. 
- Adaptation & evaluation gaps: models rarely self-revise or share unified evaluation and governance standards.","innovations":"✨ Core contributions: - Introduced a two-axis \"levels x laws\" taxonomy (L1 Predictor, L2 Simulator, L3 Evolver). - Identified four governing-law regimes: physical, digital, social, scientific. - Synthesized >400 works and summarized >100 representative systems. - Proposed decision-centric evaluation principles and a minimal reproducible evaluation package, plus architectural guidance and governance analysis.","experiments":"📊 Main quantitative takeaway: the paper synthesizes over 400 papers and summarizes more than 100 representative systems, providing the largest cross-domain analysis (per the paper) to map methods, failure modes, and evaluation practices. No single benchmark SOTA improvement reported — Not specified in the paper.","insights":"🤔 What's next? - Research directions: build cross-regime benchmark suites for L2/L3 evaluation; develop autonomous model-revision methods (continual/meta-learning) tailored to law regimes. - Applications: safer robot control, automated lab experimentation, and richer multi-agent social sims. Could unified benchmarks and Evolver-capable models accelerate real-world deployment?","keywords":["world model","levels x laws","L1 Predictor","L2 Simulator","L3 Evolver","physical laws","digital laws","social laws","scientific laws","evaluation","model-based reinforcement learning","multi-agent","robotics","governance","roadmap"],"category":"machine_learning","relevance_score":9,"technical_depth":"advanced","summary":"**Introduction:** 🚀 What if AI could not just predict the next step but simulate and reshape whole environments? This paper proposes a unifying \"levels x laws\" taxonomy (L1 Predictor, L2 Simulator, L3 Evolver) across four law regimes to build actionable world models — crucial for robotics, agents, and scientific discovery.\n\n**Challenges:** 🎯 Key problems tackled: - Ambiguity: \"world model\" means different things across communities...","analyzed_at":"2026-04-27T08:01:46.770Z","model":"openai/gpt-5-mini"},"views":0},{"id":"arxiv_2604.22750v1","arxiv_id":"2604.22750v1","title":"How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks","abstract":"The wide adoption of AI agents in complex human workflows is driving rapid growth in LLM token consumption. When agents are deployed on tasks that require a significant amount of tokens, three questions naturally arise: (1) Where do AI agents spend the tokens? (2) Which models are more token-efficient? and (3) Can agents predict their token usage before task execution? In this paper, we present the first systematic study of token consumption patterns in agentic coding tasks. We analyze trajectories from eight frontier LLMs on SWE-bench Verified and evaluate models' ability to predict their own token costs before task execution. 
We find that: (1) agentic tasks are uniquely expensive, consuming 1000x more tokens than code reasoning and code chat, with input tokens rather than output tokens driving the overall cost; (2) token usage is highly variable and inherently stochastic: runs on the same task can differ by up to 30x in total tokens, and higher token usage does not translate into higher accuracy; instead, accuracy often peaks at intermediate cost and saturates at higher costs; (3) models vary substantially in token efficiency: on the same tasks, Kimi-K2 and Claude-Sonnet-4.5, on average, consume over 1.5 million more tokens than GPT-5; (4) task difficulty rated by human experts only weakly aligns with actual token costs, revealing a fundamental gap between human-perceived complexity and the computational effort agents actually expend; and (5) frontier models fail to accurately predict their own token usage (with weak-to-moderate correlations, up to 0.39) and systematically underestimate real token costs. Our study offers new insights into the economics of AI agents and can inspire future research in this direction.","authors":["Longju Bai","Zhemin Huang","Xingyao Wang","Jiao Sun","Rada Mihalcea","Erik Brynjolfsson","Alex Pentland","Jiaxin Pei"],"published":"2026-04-24T17:54:47Z","updated":"2026-04-24T17:54:47Z","category":"","source":"arxiv","original_source":"arxiv","url":"http://arxiv.org/abs/2604.22750v1","pdf_url":"http://arxiv.org/pdf/2604.22750v1.pdf","scraped_at":"2026-04-27T08:00:53.780Z","analysis":{"introduction":"🚀 Ever wondered why AI agents cost so much in API bills? This paper is the first systematic study of token consumption in agentic coding tasks — revealing where agents spend tokens, which models are token-efficient, and whether agents can predict their own costs. Crucial for cost-aware deployment.","challenges":"🎯 Problems tackled: - Where do tokens get consumed in agentic coding workflows? - Which LLMs are more token-efficient across real agent runs? - Can agents predict their token usage before executing a task?","innovations":"✨ Key methods & novelty: - Analyzed execution trajectories from eight frontier LLMs on SWE-bench Verified - Evaluated models' ability to predict their own token costs pre-execution - Novelty: first systematic, empirical breakdown of token usage in agentic coding tasks","experiments":"📊 Most compelling result: Agentic tasks consume ~1000× more tokens than code reasoning/code chat, and input tokens (not outputs) drive cost. Runs vary hugely (up to 30×) and models often underestimate costs; self-prediction correlations only up to 0.39.","insights":"🤔 Where to go next (potential): - Develop cost-aware agent planning and token-budgeting strategies to cut input-token waste - Build better self-prediction models or lightweight simulators to estimate token use pre-run Applications: cost monitoring/pricing tools for deployments and model selection policies for production.","keywords":["token consumption","agentic coding","LLMs","token-efficiency","SWE-bench","cost-aware AI"],"category":"machine_learning","relevance_score":9,"technical_depth":"advanced","summary":"**Introduction:** 🚀 Ever wondered why AI agents cost so much in API bills? This paper is the first systematic study of token consumption in agentic coding tasks — revealing where agents spend tokens, which models are token-efficient, and whether agents can predict their own costs. Crucial for cost-aware deployment.\n\n**Challenges:** 🎯 Problems tackled: - Where do tokens get consumed in agentic coding workflows? 
- Which LLMs are mor...","analyzed_at":"2026-04-27T08:01:18.373Z","model":"openai/gpt-5-mini"},"views":0},{"id":"arxiv_2604.22739v1","arxiv_id":"2604.22739v1","title":"Inter-Stance: A Dyadic Multimodal Corpus for Conversational Stance Analysis","abstract":"Social interactions dominate our perceptions of the world and shape our daily behavior by attaching social meaning to acts as simple and spontaneous as gestures, facial expressions, voice, and speech. People mimic and otherwise respond to each other's postures, facial expressions, mannerisms, and other verbal and nonverbal behavior, and form appraisals or evaluations in the process. Yet, no publicly-available dataset includes multimodal recordings and self-report measures of multiple persons in social interaction. Dyadic recordings and annotation are lacking. We present a new data corpus of multimodal dyadic interaction (45 dyads, 90 persons) that includes synchronized multi-modality behavior (2D face video, 3D face geometry, thermal spectrum dynamics, voice and speech behavior, and physiology (PPG, EDA, heart-rate, blood pressure, and respiration)), and self-reported affect of all participants in a communicative interaction scenario. Two types of dyads are included: persons with shared past history and strangers. Annotations include social signals, agreement, disagreement, and neutral stance. With a potent emotion induction, these multimodal data will enable novel modeling of multimodal interpersonal behavior. We present extensive experiments to evaluate multimodal dyadic communication of dyads with and without interpersonal history, and their affect. This new database enables multimodal modeling of social interaction that was not previously possible. The dataset includes 20TB of multimodal data to share with the research community.","authors":["Xiang Zhang","Xiaotian Li","Taoyue Wang","Nan Bi","Xin Zhou","Cody Zhou","Zoie Wang","Andrew Yang","Yuming Su","Jeff Cohn","Qiang Ji","Lijun Yin"],"published":"2026-04-24T17:37:42Z","updated":"2026-04-24T17:37:42Z","category":"","source":"arxiv","original_source":"arxiv","url":"http://arxiv.org/abs/2604.22739v1","pdf_url":"http://arxiv.org/pdf/2604.22739v1.pdf","scraped_at":"2026-04-27T08:00:53.780Z","analysis":{"introduction":"🚀 Ever wondered how real conversations sync face, voice, and physiology? Inter‑Stance introduces the first public dyadic multimodal corpus: 45 dyads (90 people) with synchronized 2D+3D+thermal video, audio, physiology, and self‑reports — ~20TB to advance social/affective AI.","challenges":"🎯 Key problems tackled: - Lack of public multimodal dyadic datasets combining behavior + self‑report. - Missing synchronized multimodal recordings across video, 3D geometry, thermal, audio, and physiology. - Sparse dyadic annotations for stance (agreement/disagreement/neutral).","innovations":"✨ What the paper introduces: - A large multimodal dyadic corpus (45 dyads, 90 subjects) with synchronized 2D face video, 3D face geometry, thermal dynamics, audio, and physiology (PPG, EDA, HR, BP, respiration). - Self‑reported affect and annotations of social signals, agreement/disagreement/neutral. - Two dyad types: shared history vs strangers; potent emotion induction; ~20TB of data. Novelty: first publicly‑available dataset combining this breadth of modalities, self‑report, and dyadic context.","experiments":"📊 Not specified in the paper. 
The abstract states they present extensive experiments evaluating multimodal dyadic communication for dyads with and without interpersonal history and affect, but no single quantitative headline result is provided in the abstract.","insights":"🤔 Directions & applications inspired by this work: - Research: cross‑dyad transfer learning for interpersonal signals; temporal models of mutual influence and mimicry. - Applications: improved telehealth/therapy analytics, socially aware robots, negotiation and training simulations. Could multimodal dyadic benchmarks reshape interactive AI that understands mutual stance?","keywords":["multimodal dataset","dyadic interaction","stance analysis","affective computing","physiology","3D face","thermal imaging","self-report","social signals"],"category":"machine_learning","relevance_score":9,"technical_depth":"advanced","summary":"**Introduction:** 🚀 Ever wondered how real conversations sync face, voice, and physiology? Inter‑Stance introduces the first public dyadic multimodal corpus: 45 dyads (90 people) with synchronized 2D+3D+thermal video, audio, physiology, and self‑reports — ~20TB to advance social/affective AI.\n\n**Challenges:** 🎯 Key problems tackled: - Lack of public multimodal dyadic datasets combining behavior + self‑report. - Missing synchronize...","analyzed_at":"2026-04-27T08:01:52.525Z","model":"openai/gpt-5-mini"},"views":0},{"id":"arxiv_2604.22736v1","arxiv_id":"2604.22736v1","title":"An Undecidability Proof for the Plan Existence Problem","abstract":"The plan existence problem asks, given a goal in the form of a formula in modal logic, an initial epistemic state (a pointed Kripke model), and a set of epistemic actions, whether there exists a sequence of actions that can be applied to reach the goal. We prove that even in the case where the preconditions of the epistemic actions have modal depth at most 1, and there are no postconditions, the plan existence problem is undecidable. The (un)decidability of this problem was previously unknown.","authors":["Antonis Achilleos"],"published":"2026-04-24T17:36:17Z","updated":"2026-04-24T17:36:17Z","category":"","source":"arxiv","original_source":"arxiv","url":"http://arxiv.org/abs/2604.22736v1","pdf_url":"http://arxiv.org/pdf/2604.22736v1.pdf","scraped_at":"2026-04-27T08:00:53.780Z","analysis":{"introduction":"🚀 Can we always decide if an epistemic plan exists? Breakthrough: proves the plan-existence problem is undecidable — even when action preconditions have modal depth ≤1 and there are no postconditions. Big result for epistemic planning theory.","challenges":"🎯 Key problems tackled: - Decidability of plan existence for epistemic actions (previously unknown). - Understanding the limits of planning when preconditions are modal. - Determining whether severe syntactic restrictions restore decidability.","innovations":"✨ Novel contributions: - A formal undecidability proof for the plan-existence problem in epistemic planning. - Shows undecidability holds under strict restrictions: preconditions of modal depth ≤1 and no postconditions. - Tightens theoretical boundaries between decidable and undecidable fragments.","experiments":"📊 Experiments / quantitative results: Not specified in the paper. This is a theoretical undecidability proof; no experimental benchmarks or numeric improvements reported.","insights":"🤔 What's next? - Search for natural decidable fragments (further restrict preconditions, action forms, or agent models). 
- Develop semi-decision procedures, approximations, or practical heuristics for planners despite undecidability. Could this reshape automated epistemic planning practice?","keywords":["epistemic planning","plan existence","undecidability","modal logic","Kripke model","epistemic actions"],"category":"machine_learning","relevance_score":9,"technical_depth":"advanced","summary":"**Introduction:** 🚀 Can we always decide if an epistemic plan exists? Breakthrough: proves the plan-existence problem is undecidable — even when action preconditions have modal depth ≤1 and there are no postconditions. Big result for epistemic planning theory.\n\n**Challenges:** 🎯 Key problems tackled: - Decidability of plan existence for epistemic actions (previously unknown). - Understanding the limits of planning when preconditio...","analyzed_at":"2026-04-27T08:02:16.577Z","model":"openai/gpt-5-mini"},"views":0}],"pagination":{"current_page":1,"total_pages":7,"total_papers":64,"has_next":true,"has_prev":false},"filters":{"category":null,"search":null,"date":null}}