IEEE TRANSACTIONS ON IMAGE PROCESSING

Large Visual Language Models Continual Learning with Dynamic Mixture-of-Experts
Chen Y, Huang X and Zhang W
In dynamic and evolving application scenarios, the ability of visual language models to continuously learn from new data while preserving historical knowledge is critically important. Existing continual learning methods for large visual language models (LVLMs) often restrict the number of tasks they can handle, causing performance to decline as the number of tasks grows. In this paper, we propose a novel continual learning framework that adapts to a growing number of tasks, enabling visual language models to handle a dynamic range of open-set tasks while overcoming catastrophic forgetting, i.e., learning new tasks at the expense of forgetting old ones. Our method builds on a pre-trained CLIP model and incorporates a dynamic mixture-of-experts (MoE) layer, enabling flexible adaptation to a wide range of open-set tasks. We design an elastic expert weight management strategy to effectively mitigate catastrophic forgetting. Furthermore, we optimize the LoRA experts with adaptive ranks to achieve a balanced trade-off between model complexity and representational capacity. Extensive experiments across diverse settings demonstrate that our proposed method significantly reduces the number of tunable parameters while consistently surpassing state-of-the-art methods in new-task learning capability and in maintaining performance on historical tasks.
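For intuition, below is a minimal PyTorch sketch of the general pattern the abstract describes: a frozen base projection augmented by a softmax router over LoRA experts whose ranks may differ. All names and design choices here (LoRAExpert, DynamicMoELoRA, top_k routing) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAExpert(nn.Module):
    """One low-rank adapter; the rank may differ from expert to expert."""
    def __init__(self, in_dim, out_dim, rank):
        super().__init__()
        self.down = nn.Linear(in_dim, rank, bias=False)   # project into a rank-r subspace
        self.up = nn.Linear(rank, out_dim, bias=False)    # project back to the output width
        nn.init.zeros_(self.up.weight)                    # adapters start as a zero residual

    def forward(self, x):
        return self.up(self.down(x))

class DynamicMoELoRA(nn.Module):
    """Frozen base linear layer plus a router over LoRA experts of adaptive rank."""
    def __init__(self, base_linear, ranks, top_k=2):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad = False                       # keep the pre-trained weights frozen
        d_in, d_out = base_linear.in_features, base_linear.out_features
        self.experts = nn.ModuleList(LoRAExpert(d_in, d_out, r) for r in ranks)
        self.router = nn.Linear(d_in, len(ranks))
        self.top_k = top_k

    def forward(self, x):                                 # x: (batch, tokens, d_in)
        gate = F.softmax(self.router(x), dim=-1)          # per-token routing weights
        top_w, top_i = gate.topk(self.top_k, dim=-1)
        out = self.base(x)
        for k in range(self.top_k):
            idx, w = top_i[..., k], top_w[..., k].unsqueeze(-1)
            for e, expert in enumerate(self.experts):     # simple loop for clarity, not speed
                mask = (idx == e).unsqueeze(-1)
                out = out + mask * w * expert(x)
        return out
```

Only the router and the low-rank adapters are trainable, which is what keeps the number of tunable parameters small relative to the backbone.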
TrajDiff: Trajectory Prediction with Diffusion Probabilistic Models
Yang C, Pan H, Wang J and Hong Y
Diffusion probabilistic models (DPMs) have recently achieved remarkable success in computer vision. Inspired by this success, we present TrajDiff, a model based on conditional diffusion probabilistic models for agent future-trajectory prediction, which infers the agent's future states through a series of stochastic iterative denoising processes. Specifically, we map the trajectory prediction task into the latent heatmap space, translating hard keypoint prediction into soft cluster-center learning. The core architecture is a U-shaped encoder-decoder network (U-Net) trained with a denoising objective. During inference, conditioned on the observed past-trajectory heatmaps, the reverse sampling process is initialized with pure Gaussian noise. The U-Net iteratively removes varying levels of Gaussian noise from the initialized images, resembling Langevin dynamics, and generates multi-modal predicted future-trajectory heatmaps. Furthermore, we introduce a novel residual block with a mutual attention mechanism that elegantly accounts for interactions between the agent and the surrounding environment at multiple scales, helping generate physically and socially acceptable trajectories. We verify TrajDiff on the Stanford Drone Dataset and the ETH and UCY datasets. The experimental results show that TrajDiff outperforms previous state-of-the-art methods with considerable accuracy gains, while significantly reducing computational requirements.
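As background for the reverse sampling the abstract describes, here is a generic DDPM ancestral-sampling loop conditioned on past-trajectory heatmaps. The unet denoiser and its call signature are placeholders, not TrajDiff's actual network.

```python
import torch

@torch.no_grad()
def ddpm_sample(unet, cond_heatmaps, betas, shape, device="cpu"):
    """Generic DDPM reverse sampling: `unet` is assumed to predict the added noise
    given the noisy heatmap, the timestep, and the observed-trajectory condition."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape, device=device)               # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = unet(x, t_batch, cond_heatmaps)            # predicted noise at step t
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise          # one Langevin-like denoising step
    return x                                             # predicted future-trajectory heatmaps
```

Running the loop several times with different noise seeds is what yields the multi-modal predictions mentioned above.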
S2AFormer: Strip Self-Attention for Efficient Vision Transformer
Xu G, Huang W, Jia W, Li J, Gao G and Qi GJ
The Vision Transformer (ViT) has achieved remarkable success in computer vision due to its powerful token mixer, which effectively captures global dependencies among all tokens. However, the quadratic complexity of standard self-attention with respect to the number of tokens severely hampers its computational efficiency in practical deployment. Although recent hybrid approaches have sought to combine the strengths of convolutions and self-attention to improve the performance-efficiency trade-off, the costly pairwise token interactions and heavy matrix operations in conventional self-attention remain a critical bottleneck. To overcome this limitation, we introduce S2AFormer, an efficient Vision Transformer architecture built around a novel Strip Self-Attention (SSA) mechanism. Our design incorporates lightweight yet effective Hybrid Perception Blocks (HPBs) that seamlessly fuse the local inductive biases of CNNs with the global modeling capability of Transformer-style attention. The core innovation of SSA lies in simultaneously reducing the spatial resolution of the key (K) and value (V) tensors while compressing the channel dimension of the query (Q) and key (K) tensors. This joint spatial-and-channel compression dramatically lowers computational cost without sacrificing representational power, achieving an excellent balance between accuracy and efficiency. We extensively evaluate S2AFormer on a wide range of vision tasks, including image classification (ImageNet-1K), semantic segmentation (ADE20K), and object detection/instance segmentation (COCO). Experimental results consistently show that S2AFormer delivers substantial accuracy improvements together with superior inference speed and throughput across both GPU and non-GPU platforms, establishing it as a highly competitive solution in the landscape of efficient Vision Transformers.
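The joint spatial-and-channel compression can be pictured with a short PyTorch sketch in which K and V are spatially downsampled while Q and K are channel-compressed before standard attention. This is one plausible reading of the mechanism; the pooling-based spatial reduction and all names are illustrative choices, not the paper's exact SSA.

```python
import torch
import torch.nn as nn

class StripLikeAttention(nn.Module):
    """Illustrative attention with channel-compressed Q/K and spatially reduced K/V."""
    def __init__(self, dim, qk_dim=32, sr_ratio=4):
        super().__init__()
        self.q = nn.Linear(dim, qk_dim)                  # channel-compressed query
        self.k = nn.Linear(dim, qk_dim)                  # channel-compressed key
        self.v = nn.Linear(dim, dim)                     # value keeps full channels
        self.sr = nn.AvgPool2d(sr_ratio, sr_ratio)       # spatial reduction of K/V inputs
        self.proj = nn.Linear(dim, dim)
        self.scale = qk_dim ** -0.5

    def forward(self, x, H, W):                          # x: (B, N, C) with N = H * W
        B, N, C = x.shape
        q = self.q(x)                                    # (B, N, qk_dim)
        x2d = x.transpose(1, 2).reshape(B, C, H, W)
        kv = self.sr(x2d).flatten(2).transpose(1, 2)     # (B, N / sr^2, C)
        k, v = self.k(kv), self.v(kv)
        attn = (q @ k.transpose(-2, -1)) * self.scale    # (B, N, N / sr^2)
        attn = attn.softmax(dim=-1)
        return self.proj(attn @ v)                       # (B, N, C)
```

The attention matrix shrinks in both dimensions of its inner product, which is the source of the claimed efficiency gain.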
MBGCN: Multi-view Block-wise Graph Convolutional Networks on Large-scale Graphs
Xu Z, Chen Y, Zou Y, Du S, Chen Y and Wang S
Existing methods based on graph convolutional networks often struggle with large-scale graphs due to their high computational cost and inefficiency. Although strategies such as edge sparsification and node sampling can indeed reduce complexity, they frequently result in information loss and local information bias. Furthermore, in multi-view scenarios, traditional multi-view fusion methods are unable to simultaneously account for both inter-view consistency and intra-view diversity, thus constraining model performance. In this paper, we propose a multi-view block-wise graph convolutional network (MBGCN) that effectively addresses the challenges posed by large-scale graphs while exploiting the complementary nature of multi-view information. Specifically, we implement a node segmentation module to partition nodes into view-specific subsets, thereby reducing computational complexity while preserving local structural information. To enhance feature extraction, rich subgraph representations are captured within blocks by alternating graph convolution with graph structure learning under a shared-weight strategy. Finally, the global fusion module introduces a cross-view inter-block loss that progressively aligns block representations across views, alleviates over-smoothing, and yields a consistent and comprehensive common representation. Extensive experiments on diverse large-scale graph datasets demonstrate that MBGCN not only outperforms state-of-the-art approaches in multi-view semi-supervised classification but also exhibits superior scalability and memory efficiency.
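A simplified reading of block-wise graph convolution with shared weights is sketched below in NumPy: nodes are partitioned into blocks and a standard normalized graph convolution is applied to each induced subgraph, so memory scales with the block size rather than the full graph. The partitioning scheme, normalization, and function names are illustrative assumptions, not MBGCN's actual modules.

```python
import numpy as np

def normalize_adj(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2}."""
    A = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def blockwise_gcn_layer(X, A, blocks, W):
    """Apply ReLU(Â_block X_block W) per block with a single shared weight matrix W."""
    out = np.zeros((X.shape[0], W.shape[1]))
    for idx in blocks:                                 # idx: array of node indices in one block
        A_sub = normalize_adj(A[np.ix_(idx, idx)])     # induced subgraph adjacency
        out[idx] = np.maximum(A_sub @ X[idx] @ W, 0)   # shared-weight graph convolution
    return out
```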
Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation
Bai S, Liu Y, Han Y, Zhang H, Tang Y, Zhou J and Lu J
Recent advancements in pre-trained vision-language models like CLIP have enabled the task of open-vocabulary segmentation. CLIP demonstrates impressive zero-shot capabilities in various downstream tasks that require holistic image understanding. However, due to image-level contrastive learning and fully global feature interaction, ViT-based CLIP struggles to capture local details, resulting in poor performance on segmentation tasks. Our analysis of ViT-based CLIP reveals that anomaly tokens emerge during the forward process, attracting disproportionate attention from normal patch tokens and thereby diminishing spatial awareness. To address this issue, we propose Self-Calibrated CLIP (SC-CLIP), a training-free method that calibrates CLIP to generate finer representations while preserving its original generalization ability, without introducing new parameters or relying on additional backbones. Specifically, we mitigate the negative impact of anomaly tokens from two complementary perspectives. First, we explicitly identify the anomaly tokens and replace them based on local context. Second, we reduce their influence on normal tokens by enhancing feature discriminability and attention correlation, leveraging the inherent semantic consistency within CLIP's mid-level features. In addition, we introduce a two-pass strategy that effectively integrates multi-level features to enrich local details under the training-free setting. Together, these strategies enhance CLIP's feature representations with improved granularity and semantic coherence. Experimental results demonstrate the effectiveness of SC-CLIP, achieving state-of-the-art results across all datasets and surpassing previous methods by 9.5%. Notably, SC-CLIP boosts the performance of vanilla CLIP ViT-L/14 by 6.8 times. Furthermore, we discuss our method's applicability to other vision-language models and tasks for a comprehensive evaluation. Our source code is available at https://github.com/SuleBai/SC-CLIP.
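The first step (identify anomaly tokens and replace them from local context) can be illustrated with a simple stand-in: flag patch tokens whose norm is a statistical outlier and substitute the average of their spatial neighbors. The thresholding rule and 3x3 neighborhood below are assumptions for illustration, not SC-CLIP's exact procedure.

```python
import torch
import torch.nn.functional as F

def replace_anomaly_tokens(tokens, H, W, z_thresh=3.0):
    """Flag outlier patch tokens by norm and replace them with a local average.
    tokens: (B, N, C) patch tokens (no CLS token), laid out on an H x W grid."""
    B, N, C = tokens.shape
    norms = tokens.norm(dim=-1)                                   # (B, N)
    z = (norms - norms.mean(dim=1, keepdim=True)) / (norms.std(dim=1, keepdim=True) + 1e-6)
    grid = tokens.reshape(B, H, W, C).permute(0, 3, 1, 2)         # (B, C, H, W)
    neighbors = F.avg_pool2d(grid, kernel_size=3, stride=1, padding=1)
    neighbors = neighbors.permute(0, 2, 3, 1).reshape(B, N, C)    # 3x3 local-context average
    mask = (z > z_thresh).unsqueeze(-1)                           # anomalous tokens
    return torch.where(mask, neighbors, tokens)
```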
M2Restore: Mixture-of-Experts-based Mamba-CNN Fusion Framework for All-in-One Image Restoration
Wang Y, Li Y, Zheng Z, Zhang XP and Wei M
Natural images are often degraded by complex, composite degradations such as rain, snow, and haze, which adversely impact downstream vision applications. While existing image restoration efforts have achieved notable success, they are still hindered by two critical challenges: limited generalization across dynamically varying degradation scenarios and a suboptimal balance between preserving local details and modeling global dependencies. To overcome these challenges, we propose M2Restore, a novel Mixture-of-Experts (MoE)-based Mamba-CNN fusion framework for efficient and robust all-in-one image restoration. M2Restore introduces three key contributions: First, to boost the model's generalization across diverse degradation conditions, we exploit a CLIP-guided MoE gating mechanism that fuses task-conditioned prompts with CLIP-derived semantic priors. This mechanism is further refined via cross-modal feature calibration, which enables precise expert selection for various degradation types. Second, to jointly capture global contextual dependencies and fine-grained local details, we design a dual-stream architecture that integrates the localized representational strength of CNNs with the long-range modeling efficiency of Mamba. This integration enables collaborative optimization of global semantic relationships and local structural fidelity, preserving global coherence while enhancing detail restoration. Third, we introduce an edge-aware dynamic gating mechanism that adaptively balances global modeling and local enhancement by reallocating computational attention to degradation-sensitive regions. This targeted focus leads to more efficient and precise restoration. Extensive experiments across multiple image restoration benchmarks validate the superiority of M2Restore in both visual quality and quantitative performance.
Content-Adaptive Unfolding Wavelet Transformer for Hyperspectral Image Super-Resolution
Fang Y, Liu Y, Long Z, Chi CY and Zhu C
In recent years, fusing high-resolution multispectral images (HR-MSIs) and low-resolution hyperspectral images (LR-HSIs) has become a widely used approach for hyperspectral image super-resolution (HSI-SR). The deep unfolding framework has attracted significant attention thanks to its ability to formulate the problem into a data module and a prior module. However, two critical issues still hinder the performance of existing methods: 1) parameters in the data module are fixed (though learnable) at each iteration, i.e., they lack adaptivity to the data; 2) the transformer in the prior module cannot effectively capture high-frequency information. To resolve these issues, we propose a Content-Adaptive Unfolding Wavelet Transformer (CAUWT) for HSI-SR, where the parameters are adaptively learned from the reconstructed HSI at each iteration. Moreover, we propose a novel Wavelet-Assisted Transformer (WAT) that integrates the Discrete Wavelet Transform (DWT) and the Hybrid Spectral-Spatial Attention Block (HSSAB) to further improve the high-frequency content of the HSI without extra branch structures: the former captures multi-scale and multi-frequency details, and the latter models correlations between and within sub-band components. Extensive experiments on both simulated and real datasets demonstrate the effectiveness of the proposed method. Compared with mainstream HSI-SR methods, our method exhibits superior performance and lower computational overhead.
PH-Mamba: Enhancing Mamba with Position Encoding and Harmonized Attention for Image Deraining and Beyond
Jiang K, Jiang J, Liu X, Yao H and Lin CW
Mamba and its variants excel at modeling long-range dependencies with linear computational complexity, making them effective for diverse vision tasks. However, Mamba's reliance on unfolded 1D sequential representations necessitates multiple directional scans to recover lost spatial dependencies. This introduces significant computational overhead, redundant token traversal, and inefficiencies that compromise accuracy in real-world applications. To this end, we propose PH-Mamba, a novel framework integrating position encoding and harmonized attention for image deraining and beyond. PH-Mamba transforms Mamba's scanning process into a position-guided, unidirectional scan that selectively prioritizes degradation-relevant tokens. Specifically, we devise a position-guided hybrid Mamba module (PHMM) that jointly encodes perturbation features, their spatial coordinates, and a harmonized representation to model consistent degradation patterns. Within PHMM, a harmonized Transformer is developed to focus on uncertain regions while suppressing noise interference, thereby improving spatial modeling fidelity. Additionally, we employ a vector decomposition and synthesis strategy that allows the directional scan to form a unified representation of global degradation while minimizing redundancy. By cascading multiple PHMM blocks, PH-Mamba combines global positional guidance with local differential features to strengthen contextual learning. Extensive experiments demonstrate the superiority of PH-Mamba across low-level image restoration benchmarks. For example, compared to NeRD, PH-Mamba achieves a 0.60 dB PSNR improvement while requiring 88.9% fewer parameters, 36.2% less computation, and 63.0% faster inference.
Stokes Simplex Modeling for Polarization Image Denoising
Raffoul J, LeMaster D, Ratliff B and Hirakawa K
In passive polarization imaging, the degree and angle of linear polarization images represent the polarization content of the scene and can be used to detect small polarized objects in largely randomly polarized surroundings. The polarized signal is often near the noise limit of a photon detector (as in CCD and CMOS cameras), and sensitivity to polarization deteriorates further when the source imagery is under-exposed. This work aims to increase robustness to sensor noise by estimating the Cartesian coordinates of the degree and angle of linear polarization, a notion we refer to as the "Stokes simplex." The proposed Stokes Simplex Polarimetric Image Denoising (SSPID) algorithm is the minimum mean squared error estimator of the noise-free Stokes simplex vectors in the wavelet domain from the Poisson-corrupted analyzer images. Benchmarking against state-of-the-art polarization image denoising methods on a newly acquired division-of-time (DoT) polarimetric dataset shows superior performance.
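One plausible reading of the "Stokes simplex" is the pair of Cartesian coordinates obtained from the degree and angle of linear polarization, which coincides with the normalized linear Stokes components. The sketch below computes these quantities from four analyzer images; the exact definition used in the paper may differ.

```python
import numpy as np

def stokes_simplex(i0, i45, i90, i135, eps=1e-8):
    """Compute linear Stokes parameters from four analyzer images and return the
    Cartesian coordinates of the degree/angle of linear polarization."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)        # total intensity
    s1 = i0 - i90                             # horizontal/vertical preference
    s2 = i45 - i135                           # diagonal preference
    dolp = np.sqrt(s1**2 + s2**2) / (s0 + eps)
    aolp = 0.5 * np.arctan2(s2, s1)
    # Cartesian coordinates of (DoLP, AoLP); identical to the normalized Stokes components
    x = dolp * np.cos(2 * aolp)               # = s1 / s0
    y = dolp * np.sin(2 * aolp)               # = s2 / s0
    return x, y, dolp, aolp
```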
InfoARD: Enhancing Adversarial Robustness Distillation with Attack-Strength Adaptation and Mutual-Information Maximization
Liu R, Cai J, Liu Y, Cai S, Chen B, Guo Y and Bennamoun M
Adversarial distillation (AD) aims to mitigate deep neural networks' inherent vulnerability to adversarial attacks, providing robust protection for compact models through teacher-student interactions. Despite recent advances, existing AD studies still suffer from insufficient robustness due to the limitations of fixed attack strength and attention-region shifts. To address these challenges, we propose a strength-adaptive Info-maximizing Adversarial Robustness Distillation paradigm, namely "InfoARD", which strategically incorporates Attack-Strength Adaptation (ASA) and Mutual-Information Maximization (MIM) to enhance robustness against adversarial attacks and perturbations. Unlike previous adversarial training (AT) methods that use a fixed attack strength, the ASA mechanism is designed to capture smoother and more generalized classification boundaries by dynamically tailoring the attack strength to the characteristics of individual instances. Benefiting from mutual information constraints, our MIM strategy ensures that the student model effectively learns from various levels of feature representations and attention patterns, thereby deepening the student model's understanding of the teacher model's decision-making processes. Furthermore, a comprehensive multi-granularity distillation is conducted to capture knowledge across multiple dimensions, enabling a more effective transfer of knowledge from the teacher model to the student model. Note that InfoARD can be seamlessly integrated into existing AD frameworks, further boosting the adversarial robustness of deep learning models. Extensive experiments on various challenging datasets consistently demonstrate the effectiveness and robustness of InfoARD, which surpasses previous state-of-the-art methods.
LPATR-Net: Learnable Piecewise Affine Transformation Regression Assisted Data-Driven Dehazing Framework
Li Y, Chen F, Liu Z, Zang T and Wang J
Data-driven deep neural networks (DNNs) are currently the dominant state-of-the-art framework for image dehazing. Their core driving force is learning to faithfully reproduce the underlying hazy-to-clear mapping implied by massive paired training data. In real-world collection, however, it is extremely hard to guarantee that all ground-truth (GT) haze-free images are fully qualified: natural weather can hardly be controlled, and many conditions lie in an ambiguous state between foggy and fog-free. Unlike most supervised learning problems, image dehazing is therefore inherently burdened with partially faulty haze-free ground truth, so blindly trusting the training data and solely pursuing ever more powerful data-driven fitting may not be a wise strategy. To cope with this thorny challenge, instead of pursuing greater fitting capacity, we deliberately restrict fitting flexibility to achieve higher robustness. The result is LPATR-Net, a novel dehazing framework equipped with a fitting-power suppression mechanism to resist the intrinsic faulty GT, without requiring any extra manual labeling. Specifically, the LPATR-Net architecture is built around an elaborately designed, fitting-restrained learnable piecewise affine transformation regression. Because such a low-order linear regression structure can inherently only fit the majority of the data, the interference of the minority of unqualified GT samples is effectively suppressed. Coupled with a highly customized, multi-concern, high-accuracy dehazing companion component, All-Mattering, LPATR-Net achieves a seamless integration of traditional majority-determined fixed-form regression and modern fully flexible data-driven deep learning. Extensive experiments on five commonly used public datasets verify its effectiveness, and the broad transplantability of the proposed core regression structure has also been experimentally confirmed. Source code is available at https://github.com/****/.
Better Image Filter for Pansharpening
Guo A, Dian R, Wang N and Li S
The modulation transfer function tailored image filter (MTF-TIF) has long been regarded as the optimal filter for multispectral image pansharpening. It excels at simulating the camera's frequency response, thereby capturing finer image details and significantly improving pansharpening performance. However, we question whether the pre-measured MTF is sufficient to describe the characteristics of the actually acquired panchromatic image (PAN) and multispectral image (MSI). For example, any image resampling operation in geometric correction or image registration inevitably changes the sharpness of the acquired PAN and MSI, and the processed images no longer conform to the camera's MTF. Furthermore, following the Wald protocol, using the MTF-TIF to downsample images when constructing training data for deep learning (DL) methods breaks the consistency between training and testing and thus harms generalization. To support this point, we propose a pair of symmetric DL-based frameworks to find better image filters suited to both traditional and DL pansharpening methods. We embed two learnable filters into the frameworks to approximate the optimal image filter, namely an anisotropic Gaussian image filter and an arbitrary image filter. Furthermore, the proposed frameworks can capture subtle offsets between images and maintain the smoothness of the global deformation field. Extensive experiments on various satellite datasets demonstrate that the proposed frameworks find better image filters than MTF-TIFs, achieving better pansharpening performance with stronger generalization ability.
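The anisotropic Gaussian variant can be pictured as a depthwise blur whose two standard deviations and orientation are trainable. The PyTorch sketch below is an illustrative construction under that assumption, not the authors' framework; the fixed kernel size and parameterization are simplifications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnisotropicGaussianFilter(nn.Module):
    """Learnable anisotropic Gaussian blur: sigma_x, sigma_y, and the orientation are
    trainable, so the downsampling filter can adapt to the observed PAN/MSI sharpness."""
    def __init__(self, channels, ksize=11):
        super().__init__()
        self.log_sigma = nn.Parameter(torch.zeros(2))      # log sigma_x, log sigma_y
        self.theta = nn.Parameter(torch.zeros(1))          # orientation in radians
        self.channels, self.ksize = channels, ksize

    def kernel(self):
        r = self.ksize // 2
        y, x = torch.meshgrid(
            torch.arange(-r, r + 1, dtype=torch.float32),
            torch.arange(-r, r + 1, dtype=torch.float32),
            indexing="ij",
        )
        sx, sy = torch.exp(self.log_sigma)
        c, s = torch.cos(self.theta), torch.sin(self.theta)
        xr, yr = c * x + s * y, -s * x + c * y             # rotated coordinates
        g = torch.exp(-0.5 * ((xr / sx) ** 2 + (yr / sy) ** 2))
        g = (g / g.sum()).unsqueeze(0).unsqueeze(0)        # (1, 1, K, K), normalized
        return g.repeat(self.channels, 1, 1, 1)            # one kernel per channel

    def forward(self, img):                                # img: (B, C, H, W)
        k = self.kernel().to(img.device)
        return F.conv2d(img, k, padding=self.ksize // 2, groups=self.channels)
```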
A Perception CNN for Facial Expression Recognition
Tian C, Xie J, Li L, Zuo W, Zhang Y and Zhang D
Convolutional neural networks (CNNs) can automatically learn data patterns to represent face images for facial expression recognition (FER). However, they may ignore the effect of facial segmentation on FER. In this paper, we propose a perception CNN for FER, termed PCNN. Firstly, PCNN uses five parallel networks to simultaneously learn local facial features based on the eyes, cheeks, and mouth, sensitively capturing the subtle changes relevant to FER. Secondly, we utilize a multi-domain interaction mechanism to register and fuse local facial-organ features with global facial structural features to better represent face images for FER. Finally, we design a two-phase loss function that constrains the accuracy of the learned organ information and the reconstructed face images, guaranteeing the performance of PCNN in FER. Experimental results show that our PCNN achieves superior results on several lab and real-world FER benchmarks: CK+, JAFFE, FER2013, FERPlus, RAF-DB, and the Occlusion and Pose Variant Dataset. Its code is available at https://github.com/hellloxiaotian/PCNN.
Neural Compression System for Point Cloud Video Streaming
Zhang J, Chen T, Ding D and Ma Z
Point cloud video streaming is promising for immersive media applications, which urges the development of efficient compression methods. However, existing approaches either suffer from poor performance or lack effective coder-control mechanisms, making them impractical for networked point cloud services, where bandwidth is often constrained and fluctuates over time. Therefore, this paper proposes a system-level solution, a layered point cloud compressor called Yak, to address these issues. Yak offers comprehensive support for both intra- and inter-frame coding of the geometry and attribute components of point cloud sequences. It consists of three layers: the Base Layer uses the standard G-PCC to encode a thumbnail counterpart downscaled from the input point cloud; the Enhancement Layer devises an end-to-end variational autoencoder to compress the original input conditioned on the base-layer reconstruction; and the Dynamic Layer generates feature-space predictions as the temporal prior for conditional inter-frame coding. In addition, Yak devises a Content Analysis module to dynamically determine the optimal encoding parameters of each frame, by which the bit budget is intelligently allocated between the geometry and attribute components to maximize the overall rate-distortion (R-D) performance. Such accurate rate control relies on parametric rate/distortion models whose parameters are initialized through one-pass template matching and refined by frame-wise delta updating constrained by R-D optimization. Following standard evaluation guidelines, Yak notably outperforms traditional rule-based methods such as MPEG G-PCC and V-PCC, as well as other learning-based approaches, while offering flexible networked adaptation and affordable complexity.
High-Precision Camera Distortion Correction: A Decoupled Approach with Rational Functions
Yu J, Sun H, Zhou Y and Jiang X
This paper presents a robust, decoupled approach to camera distortion correction using a rational function model (RFM), designed to address challenges in accuracy and flexibility within precision-critical applications. Camera distortion is a pervasive issue in fields such as medical imaging, robotics, and 3D reconstruction, where high fidelity and geometric accuracy are crucial. Traditional distortion correction methods rely on radial-symmetry-based models, which have limited precision under tangential distortion and require nonlinear optimization. In contrast, general models do not rely on radial symmetry and are theoretically generalizable to various sources of distortion. There exists a gap between the theoretical precision advantage of the RFM and its practical applicability in real-world scenarios. This gap arises from uncertainties regarding the model's robustness to noise, the impact of sparse sample distributions, and its generalizability beyond the training sample range. In this paper, we provide a mathematical interpretation of why the RFM is suitable for the distortion correction problem through sensitivity analysis. The precision and robustness of the RFM are evaluated through synthetic and real-world experiments, considering distortion level, noise level, and sample distribution. Moreover, a practical and accurate decoupled distortion correction method is proposed using just a single captured image of a chessboard pattern. The correction performance is compared with the current state of the art using camera calibration, and experimental results indicate that more precise distortion correction can enhance the overall accuracy of camera calibration. In summary, this decoupled RFM-based distortion correction approach provides a flexible, high-precision solution for applications requiring minimal calibration steps and reliable geometric accuracy, establishing a foundation for distortion-free imaging and simplified camera models in precision-driven computer vision tasks.
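For concreteness, a rational function model with a shared denominator can be fitted by linear least squares after clearing the denominator. The degree-2 monomial basis and function names below are illustrative choices, not the paper's estimator.

```python
import numpy as np

def poly_terms(x, y):
    """Second-degree bivariate monomials [1, x, y, x^2, xy, y^2]."""
    return np.stack([np.ones_like(x), x, y, x**2, x * y, y**2], axis=-1)

def fit_rfm(xd, yd, xu, yu):
    """Fit x_u = P1(x_d, y_d)/Q, y_u = P2(x_d, y_d)/Q with a shared denominator Q,
    linearized by fixing Q's constant term to 1 and solving least squares."""
    T = poly_terms(xd, yd)                                   # (N, 6)
    n = T.shape[1]
    Z = np.zeros_like(T)
    # x_u * (1 + q . T[:, 1:]) = T . p1  =>  T . p1 - x_u * T[:, 1:] . q = x_u
    A = np.vstack([
        np.hstack([T, Z, -xu[:, None] * T[:, 1:]]),
        np.hstack([Z, T, -yu[:, None] * T[:, 1:]]),
    ])
    b = np.concatenate([xu, yu])
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coef[:n], coef[n:2 * n], coef[2 * n:]             # p1, p2, q

def apply_rfm(xd, yd, p1, p2, q):
    T = poly_terms(xd, yd)
    Q = 1.0 + T[:, 1:] @ q
    return (T @ p1) / Q, (T @ p2) / Q
```

Given matched distorted and undistorted point pairs (for example from a chessboard), fit_rfm recovers the model and apply_rfm maps new distorted coordinates.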
DA-Net: A Double Alignment Multimodal Learning Network for Point Cloud Quality Assessment
Wu X, He Z, Luo T, Jiang G, Zhou W, Zhu L and Lin W
Existing multimodal point cloud quality assessment (PCQA) methods usually integrate 3D and 2D information to simulate human visual perception of distortions. However, due to the lack of consideration of spatial correspondence, they have difficulty learning consistent distortion representations from different modalities over the same region of the point cloud. In addition, they also ignore the heterogeneity of modalities and rely on complex fusion mechanisms (e.g., attention) to integrate multimodal features. Both lead to limited performance and increased computational complexity. To address these limitations, we propose a novel double alignment multimodal learning network (DA-Net), which introduces two key alignment strategies. The first is a spatial pre-alignment strategy that generates an informative 2D patch for each 3D patch via an adaptive patch projection module (APPM), ensuring accurate spatial correspondence between modalities prior to feature extraction. The second is a uniform feature alignment strategy, comprising a feature disentanglement module (FDM) and a feature mapping module (FMM), that relieves modality heterogeneity and guides the optimization of the 2D and 3D encoders. Finally, the multimodal features are simply integrated and regressed to obtain the quality score. Experimental results demonstrate that DA-Net exhibits outstanding performance and generalization ability, while achieving lower computational complexity than other multimodal PCQA methods. The source code of DA-Net will be available at https://github.com/Rphone/DA-Net.
AI-Generated Image Quality Assessment Based on Task-Specific Prompt and Multi-Granularity Similarity
Xia J, He L, Deng C, Li L and Gao X
Recently, AI-generated images (AIGIs), synthesized based on initial textual prompts, have attracted widespread attention. However, due to limitations in current generation techniques, these images often exhibit degraded perceptual quality and semantic misalignment with the guiding prompts. Therefore, evaluating both perceptual quality and text-to-image alignment is essential for optimizing the performance of generative models. Existing methods design textual prompts solely based on the initial prompt for both perceptual and alignment quality tasks, and compute only coarse-grained similarity between the designed prompt and the generated image. However, such task-agnostic prompts overlook the distinctions between the perceptual and alignment quality tasks, and coarse-level similarity fails to capture semantic details, leading to suboptimal evaluation performance. To address these challenges, we propose a novel AIGI quality assessment framework, termed TPMS, which incorporates task-specific prompt and multi-granularity similarity computation. The task-specific prompt constructs dedicated prompts for perceptual and alignment quality respectively, allowing the model to capture distinct quality cues tailored to each evaluation task. Multi-granularity similarity measures the coarse-level similarity between the generated image and task-specific prompts to capture global quality characteristics, and the fine-level similarity between the generated image and the initial prompt to enhance semantic detail awareness. By integrating these two complementary similarities, TPMS enables precise and robust quality prediction. Extensive experiments on four widely-used AIGI quality benchmarks validate the effectiveness and superiority of the proposed framework.
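The two similarity levels can be illustrated over pre-extracted CLIP-style embeddings: a coarse cosine similarity between global image and prompt embeddings, and a fine-grained score that matches each word token to its most similar image patch. This is a generic sketch over assumed inputs, not the TPMS formulation itself.

```python
import torch
import torch.nn.functional as F

def multi_granularity_similarity(img_global, img_patches, prompt_global, word_tokens):
    """img_global/prompt_global: (B, D) global embeddings;
    img_patches: (B, Np, D) patch tokens; word_tokens: (B, Nw, D) word tokens."""
    coarse = F.cosine_similarity(img_global, prompt_global, dim=-1)   # global quality cue, (B,)
    p = F.normalize(img_patches, dim=-1)
    w = F.normalize(word_tokens, dim=-1)
    sim = w @ p.transpose(1, 2)                                       # (B, Nw, Np)
    fine = sim.max(dim=-1).values.mean(dim=-1)                        # word-to-best-patch match, (B,)
    return coarse, fine
```

A regressor over the concatenated coarse and fine scores would then produce the final quality prediction.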
URDM: Hyperspectral Unmixing Regularized by Diffusion Models
Zhao M, Tang L, Chen J and Huang B
Hyperspectral unmixing aims to decompose mixed pixels into pure spectra and estimate their corresponding fractional abundances, and it holds a critical position in hyperspectral image processing. Traditional model-based unmixing methods use convex optimization to iteratively solve the unmixing problem with hand-crafted regularizers, but their performance is limited by these manually designed constraints, which may not fully capture the structural information of the data. Recently, deep learning-based unmixing methods have shown remarkable capability for this task; however, they have limited generalizability and lack interpretability. In this paper, we propose a novel hyperspectral unmixing method regularized by a diffusion model (URDM) to overcome these shortcomings. Our method leverages the advantages of both conventional optimization algorithms and deep generative models. Specifically, we formulate the unmixing objective function from a variational perspective and integrate it into a diffusion sampling process to introduce generative priors from a denoising diffusion probabilistic model (DDPM). Since the original objective function is challenging to optimize, we introduce a splitting-based strategy to decouple it into simpler subproblems. Extensive experiments on both synthetic and real datasets demonstrate the efficiency and superior performance of our proposed method.
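The splitting idea can be illustrated with a half-quadratic-style loop that alternates a least-squares data step for the abundances with a plug-in denoising step standing in for the diffusion prior. The diffusion_denoiser callable and the simple simplex projection below are hypothetical simplifications, not URDM's actual sampler.

```python
import numpy as np

def split_unmix(Y, E, diffusion_denoiser, rho=1.0, iters=20):
    """Y: (bands, pixels) observations; E: (bands, endmembers) spectra.
    Alternates a data-fidelity solve for the abundances with a generative-prior step."""
    P, N = E.shape[1], Y.shape[1]
    A = np.full((P, N), 1.0 / P)                   # abundances, initialized uniformly
    Z = A.copy()                                   # auxiliary (split) variable
    EtE, EtY = E.T @ E, E.T @ Y
    for _ in range(iters):
        # data subproblem: (E^T E + rho I) A = E^T Y + rho Z
        A = np.linalg.solve(EtE + rho * np.eye(P), EtY + rho * Z)
        # prior subproblem: denoise the current estimate with the generative model
        Z = diffusion_denoiser(A)
        Z = np.clip(Z, 0, None)
        Z /= Z.sum(axis=0, keepdims=True) + 1e-8   # enforce nonnegativity and sum-to-one
    return Z
```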
WMRNet: Wavelet Mamba With Reversible Structure for Infrared Small Target Detection
Zhang M, Li X, Guo J, Li Y and Gao X
Infrared small target detection (IRSTD) is of great practical significance in many real-world applications, such as maritime rescue and early warning systems, benefiting from the unique and excellent infrared imaging ability in adverse weather and low-light conditions. Nevertheless, segmenting small targets from the background remains a challenge. When the subsampling frequency during image processing does not satisfy the Nyquist criterion, the aliasing effect occurs, which makes it extremely difficult to identify small targets. To address this challenge, we propose a novel Wavelet Mamba with Reversible Structure Network (WMRNet) for infrared small target detection in this paper. Specifically, WMRNet consists of a Discrete Wavelet Mamba (DW-Mamba) module and a Third-order Difference Equation guided Reversible (TDE-Rev) structure. DW-Mamba employs the Discrete Wavelet Transform to decompose images into multiple subbands, integrating this information into the state equations of a state space model. This method minimizes frequency interference while preserving a global perspective, thereby effectively reducing background aliasing. The TDE-Rev aims to suppress edge aliasing effects by refining the target edges, which first processes features with an explicit neural structure derived from the second-order difference equations and then promotes feature interactions through a reversible structure. Extensive experiments on the public IRSTD-1k and SIRST datasets demonstrate that the proposed WMRNet outperforms the state-of-the-art methods.
Self-Adaptive Vision-Language Tracking With Context Prompting
Zhao J, Chen X, Li S, Bo C, Wang D and Lu H
Due to the substantial gap between vision and language modalities, along with the mismatch problem between fixed language descriptions and dynamic visual information, existing vision-language tracking methods exhibit performance on par with or slightly worse than vision-only tracking. Effectively exploiting the rich semantics of language to enhance tracking robustness remains an open challenge. To address these issues, we propose a self-adaptive vision-language tracking framework that leverages the pre-trained multi-modal CLIP model to obtain well-aligned visual-language representations. A novel context-aware prompting mechanism is introduced to dynamically adapt linguistic cues based on the evolving visual context during tracking. Specifically, our context prompter extracts dynamic visual features from the current search image and integrates them into the text encoding process, enabling self-updating language embeddings. Furthermore, our framework employs a unified one-stream Transformer architecture, supporting joint training for both vision-only and vision-language tracking scenarios. Our method not only bridges the modality gap but also enhances robustness by allowing language features to evolve with visual context. Extensive experiments on four vision-language tracking benchmarks demonstrate that our method effectively leverages the advantages of language to enhance visual tracking. Our large model can obtain 55.0% AUC on $\text {LaSOT}_{\text {EXT}}$ and 69.0% AUC on TNL2K. Additionally, our language-only tracking model achieves performance comparable to that of state-of-the-art vision-only tracking methods on TNL2K. Code is available at https://github.com/zj5559/SAVLT.
Enhancing Descriptive Image Quality Assessment With a Large-Scale Multi-Modal Dataset
You Z, Gu J, Cai X, Li Z, Zhu K, Dong C and Xue T
With the rapid advancement of Vision Language Models (VLMs), VLM-based Image Quality Assessment (IQA) seeks to describe image quality linguistically to align with human expression and capture the multifaceted nature of IQA tasks. However, current methods are still far from practical usage. First, prior works focus narrowly on specific sub-tasks or settings, which do not align with diverse real-world applications. Second, their performance is sub-optimal due to limitations in dataset coverage, scale, and quality. To overcome these challenges, we introduce the Enhanced Depicted image Quality Assessment model (EDQA). Our method includes a multi-functional IQA task paradigm that encompasses assessment and comparison tasks, brief and detailed responses, and both full-reference and non-reference scenarios. We introduce a ground-truth-informed dataset construction approach to enhance data quality, and scale up the dataset to 495K samples under the brief-detail joint framework. Consequently, we construct a comprehensive, large-scale, and high-quality dataset, named EDQA-495K. We also retain image resolution during training to better handle resolution-related quality issues, and estimate a confidence score that helps filter out low-quality responses. Experimental results demonstrate that EDQA significantly outperforms traditional score-based methods, prior VLM-based IQA models, and proprietary GPT-4V in distortion identification, instant rating, and reasoning tasks. Our advantages are further confirmed by real-world applications, including assessing web-downloaded images and ranking model-processed images. Codes, datasets, and model weights have been released at https://depictqa.github.io/.