40 papers across AI, ML, NLP, and CV from the last 24 hours.
Three themes dominate today's batch. First, the internal mechanisms of vision-language models are coming into sharp focus: rather than treating multimodal systems as black boxes, multiple papers trace how attention paths, register tokens, and representation-space priors mediate between seeing and describing. This is a shift from "better performance" to "understanding how performance emerges." Second, reinforcement learning with verifiable rewards is hitting real-world friction: thinking-answer inconsistencies, verifier regression on new tasks, and verifier quality that proves task-dependent rather than monotonic. The gap between a clean reward signal and reliable improvement is wider than the RLVR literature assumed. Third, there is a quiet push toward compositional structure: visual models decomposing scenes into reusable part registries, agent scaffolds treated as specifiable modules, and diffusion decoding measured token-by-token rather than assumed parallel.
The standout paper is "Neither Parallel Nor Sequential: How DiffusionGemma Actually Commits Tokens," which instruments a 26B masked-diffusion model to measure what decoding it actually performs. Shipped checkpoints marketed as parallel non-autoregressive decoders turn out to commit tokens in a sequential-ish pattern governed by confidence, not canvas geometry. It is a reminder that architectural intent and runtime behavior are not the same thing.
Notably absent: large-scale RL training breakthroughs or new foundation model announcements. The batch skews toward interpretability, analysis, and mechanism-level understanding over scaling. Taken together, these papers suggest the field is entering a consolidation phase, with fewer "bigger is better" headlines and more careful measurement of how existing models actually work and where their advertised capabilities diverge from reality.
Rohit Gandikota, David Bau · 2026-06-12
How a vision-language model internally solves the task of describing an image is far from obvious. We find that the model develops a specific mechanism for this: a small set of attention heads in its language-model backbone, which we call gaze heads, whose attention tracks the image region the model is currently describing. We find them with a simple correlation score from a few forward passes, us
Xinyue Cai, Chaoyou Fu, Yi-Fan Zhang · 2026-06-12
Current automated pipelines for audio-visual Question Answering (QA) generally adopt a "video-caption-QA" paradigm. However, these methods typically segment videos into short clips and generate separate descriptions for audio and visual modalities. This decoupled processing severs inherent associations between sounds and their visual sources, while independent clip processing often causes incons
Timing Yang, Predrag Neskovic, Jansen Seheult · 2026-06-12
When humans see a bird, they recognize far more than just "bird" -- they see a head, wings, and talons, a structured assembly of reusable parts that can be identified across every bird they have ever seen. We ask whether a self-supervised visual model can discover the same compositional structure on its own. To this end, we propose RATS (Register Attention Transformers), which decomposes the class
Xichen Pan, Aashu Singh, Satya Narayan Shukla · 2026-06-12
Large language models (LLMs) are widely used in text-to-image (T2I) systems, but they are typically limited to text encoding, while denoising is handled by newly trained generative backbones. The emergence of representation autoencoders (RAEs) shifts the generation target toward semantically structured visual representations, creating a latent space that is more compatible with pretrained LLM prio
Ruining Li, Yuxin Yao, Matt Zhou · 2026-06-12
Reconstructing articulated 3D objects is important for animation, gaming, and robotic simulations. Recent neural networks can estimate the articulated structure of 3D objects, but their generalization remains limited by the scarcity of annotated data for this task. To address this gap, we introduce Instruct-Particulate, a model that takes a 3D mesh together with a target kinematic specification, i
Sicheng Yang, Hangjie Yuan, Wenjun Zhang · 2026-06-12
Building trustworthy medical multimodal large language models (MLLMs) is critical for reliable clinical decision support. Existing medical hallucination benchmarks mainly focus on data collection, but often ignore where hallucinations originate within the reasoning process. We find that hallucination sources vary across samples: errors may arise from visual misrecognition, incorrect medical knowle
Rafi Ahamed, Md. Abir Rahman, Tasnia Tarannum Roza · 2026-06-12
Cotton is an economically important crop worldwide, as the textile industry depends heavily on it, so the precise identification and detection of cotton leaf disease is crucial for economic stability. The development goal of "CottonLeafVision" is to accurately classify and detect cotton leaf disease. With this goal, we have evaluated multiple pretrained Deep Convolutional Neural Networks,
Mohammed Arif Mainuddin, Najifa Tabassum, Omar Ibne Shahid · 2026-06-12
Real-time fire classification systems require models that are simultaneously accurate, computationally efficient, and deployable on resource-constrained hardware. This work proposes HumP-KD, a Hybrid Uncertainty-aware Multi-stage Progressive Knowledge Distillation framework for efficient fire classification. Two datasets, FlameVision and Dataset-II, containing 8,600 and 31,309 images, are
Xuan Wei, Longbin Ji, Guan Wang · 2026-06-12
Long-form video generation requires recurring subjects to remain consistent across various shots, viewpoints, motions, and scene transitions. Existing temporal decomposition methods improve scalability by generating videos shot by shot. However, they mainly focus on optimizing plausible next-shot continuations without verifying whether the historical memory preserves identity-critical subject evid
Nicole Villavicencio-Garduño, Maksim Ekin Eren, Milo Prisbrey · 2026-06-12
Artificial Intelligence (AI) is increasingly used to automate a variety of real-world computer vision (CV) applications, such as autonomous vehicle control, facial recognition, and security cameras. Recent research has shown that acoustic vibration can induce real physical motion in cameras, interfering with their internal stabilization mechanisms. Because the motion falls outside the conditions t
Yijun Liu, Jie Huang, Zeyue Xue · 2026-06-12
Reward models guide text-to-image (T2I) systems toward outputs aligned with human preferences. However, typical reward models such as HPSv3 are trained on pre-annotated data from earlier T2I models, without accounting for quality discriminative shifts arising from evolving model capabilities and reinforcement learning (RL) iterations, limiting their broader applicability. In this work, we propose
Junlong Tong, Wenqi Xu, Yingqi Fan · 2026-06-12
Large reasoning models typically follow a read-then-think paradigm: they observe the complete input, reason over a static context, and then produce the answer. Yet many real-world scenarios are inherently dynamic, such as audio and video streams, where information arrives as a continuous stream and models must reason, update, and respond under partial observations. Recent streaming reasoning method
Jiayue Cao, Zhicong Lu, Xuehan Sun · 2026-06-12
Reinforcement learning with verifiable rewards (RLVR) has successfully elicited the reasoning capabilities of large language models, motivating its extension to multimodal scenarios. Existing methods primarily focus on improving the visual coverage of reasoning traces and mitigating visual hallucinations, but underestimate the semantic inconsistency between the reasoning process and the final answ
Jixuan Chen, Jianzhi Shen, Haoqiang Kang · 2026-06-12
LLM agents are increasingly built not as single model calls, but as scaffolded systems that combine reasoning, memory, reflection, action execution, and learning. While such scaffolds often improve performance, they are often embedded in tightly coupled pipelines, making it difficult to isolate component contributions, compare alternative designs, or understand how module interactions shape agent
Jinsu Kim, Jihoon Tack, Noah Lee · 2026-06-12
Language Models (LMs) have shown remarkable potential as role-playing chatbots, delivering consistent, stylized interactions when given a specification of a character or user persona. However, applying these capabilities to real-world applications (e.g., ecosystems with numerous NPCs interacting simultaneously) exposes a critical inefficiency due to the excessive computational cost. In this paper,
Abdellah Aznag, Rachel Cummings, Adam N. Elmachtoub · 2026-06-12
We study a max-risk objective for active learning in multi-group mean estimation with $d$-armed bandits: a learner adaptively allocates a budget of $T$ samples across $d$ groups to minimize the worst-case uncertainty index $\max_{k\in[d]}σ_k^2/n_k$, where $σ_k$ is the standard deviation of the distribution of arm $k$, and $n_k$ is the number of times arm $k$ is sampled. We develop a local mini
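To make the max-risk objective above concrete, here is a minimal sketch of the quantity being minimized, together with the allocation that would minimize it if the variances were known in advance. This is not the paper's algorithm: the paper's learner must adapt without knowing the variances, and the function names `max_risk` and `oracle_allocation` are hypothetical, introduced only for illustration. The sketch also ignores integer rounding of sample counts.

```python
from typing import List, Sequence


def max_risk(variances: Sequence[float], counts: Sequence[float]) -> float:
    """Worst-case uncertainty index: max over arms k of sigma_k^2 / n_k."""
    return max(v / n for v, n in zip(variances, counts))


def oracle_allocation(variances: Sequence[float], budget: float) -> List[float]:
    """Known-variance (oracle) allocation, an assumption for illustration.

    With the variances known, the maximum of sigma_k^2 / n_k subject to
    sum(n_k) = budget is minimized by equalizing the ratios, which gives
    n_k proportional to sigma_k^2.
    """
    total = sum(variances)
    return [budget * v / total for v in variances]


if __name__ == "__main__":
    variances = [1.0, 4.0, 9.0]  # hypothetical sigma_k^2 for d = 3 arms
    budget = 140.0               # hypothetical sample budget T

    proportional = oracle_allocation(variances, budget)
    uniform = [budget / len(variances)] * len(variances)

    print("proportional allocation:", proportional)
    print("max risk, proportional:", max_risk(variances, proportional))
    print("max risk, uniform:", max_risk(variances, uniform))
```

Under these assumed numbers the proportional allocation equalizes every arm's ratio at $\sum_k σ_k^2 / T$, while a uniform split leaves the highest-variance arm as the bottleneck. An adaptive learner has to approach that balance using variance estimates gathered during sampling.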
Xiaoyu Li, Andi Han, Dai Shi · 2026-06-12
AI systems coupled to proof assistants now generate formal mathematics at scale, and the gap between what a checker can verify and what a mathematician would value has become the binding constraint. We model the generation of valuable mathematics as nested language generation in the limit: a verifiable formal language $F$, accessed through a membership oracle (the proof checker), contains an unkno
Anthony Pineci, Yunzong Xu · 2026-06-12
Online inventory optimization (OIO) is online convex optimization with physical memory: inventory carryover makes the feasible action set depend on the past. A natural principle, used in stochastic inventory learning and recently in OIO under a single linear capacity constraint, is to maintain a hidden target chosen by an online learner and implement its projection onto the currently feasible orde
Jai Bhagat, Sara Molas-Medina, Giorgi Giglemiani · 2026-06-12
We study whether the Compressed Computation (CC) toy model (Braun et al., 2025) is an instance of computation in superposition. The CC model appears to compute 100 ReLU functions with just 50 neurons, achieving a better loss than expected from only representing 50 ReLU functions. We show that the model mixes inputs via its noisy residual stream, corresponding to an unintended mixing matrix in the
Yining Huang · 2026-06-12
Knowledge editing systems must update selected facts while preserving nearby but irrelevant behavior. This paper studies this problem in a memory-assisted setting where an edit memory is retrieved at inference time and a parameter-efficient adapter corrects the model's object preference. We argue that the central design question is not only how to write an edit, but also when to suppress it. We in
Ines Nolasco, Jules Cauzinille, Marius Miron · 2026-06-12
Pretrained audio embeddings are standard in bioacoustics, yet little is known about which acoustic features these models encode, nor which are useful for a given task. This hinders transparency and limits extension to rare species or data-scarce domains. Here we reveal which speech-like features are encoded in bioacoustic representations. Using the 88 eGeMAPS features across six taxonomic groups,
Christoph Bauschmann, Setareh Maghsudi · 2026-06-12
The identification of optimal structures within vast arrays of interconnected data necessitates significant sampling- and computational effort. Learning and leveraging underlying signal dependencies can improve efficiency and predictive capabilities considerably, but the ubiquity of nonlinear statistical relations amplifies the complexity of such undertakings. In this paper, we develop novel gener
Pengxin Wang, Lihao Guo, Yi Xie · 2026-06-12
Cooperative multi-objective multi-agent reinforcement learning (MOMARL) models team decision making under multiple, potentially conflicting objectives. In this setting, conflicts arise not only across objectives but also across agents with different observations, roles, and contributions. We propose Preference Coordinated Multi-agent Policy Optimization (PCMA), which learns coordinated agent-speci
Shikun Liu, Mufei Li, Dongqi Fu · 2026-06-12
Large language models increasingly serve as execution engines for agentic systems, yet they still consume context through a sequential text interface. This creates a mismatch with modern structured agent workflows, in which independent branches explore subtasks, retrieve evidence, or generate candidate solutions before a final synthesis step. Existing systems typically merge these branches by conc
Gaurav Verma, Scott Counts · 2026-06-12
Sequential or time-stamped interaction logs provide objective records of digital application usage, yet their granularity and noise often obscure meaningful insights into people's work. Such insights are essential for improving digital products in ways grounded in real-world user interactions. Prior research has applied deep learning models to cluster user actions into high-level activities, but t
This digest is generated automatically from arXiv submissions. Not affiliated with arXiv or Cornell University.