Smooth Tensor Product for Tensor Completion
Low-rank tensor completion (LRTC) has shown promise in processing incomplete visual data, yet it often overlooks the inherent local smooth structures in images and videos. Recent advances in LRTC, integrating total variation regularization to capitalize on the local smoothness, have yielded notable improvements. Nonetheless, these methods are limited to exploiting local smoothness within the original data space, neglecting the latent factor space of tensors. More seriously, there is a lack of theoretical backing for the role of local smoothness in enhancing recovery performance. In response, this paper introduces an innovative tensor completion model that concurrently leverages the global low-rank structure of the original tensor and the local smooth structure of its factor tensors. Our objective is to learn a low-rank tensor that decomposes into two factor tensors, each exhibiting sufficient local smoothness. We propose an efficient alternating direction method of multipliers to optimize our model. Further, we establish generalization error bounds for smooth factor-based tensor completion methods across various decomposition frameworks. These bounds are significantly tighter than existing baselines. We conduct extensive inpainting experiments on color images, multispectral images, and videos, which demonstrate the efficacy and superiority of our method. Additionally, our approach shows a low sensitivity to hyper-parameter settings, enhancing its convenience and reliability for practical applications.
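As a rough illustration of the kind of regularization this abstract describes, the following PyTorch sketch assumes a slice-wise product of two factor tensors A and B and an anisotropic total-variation penalty on their spatial modes; the paper's exact decomposition, penalty, and ADMM updates are not reproduced here, and all names are illustrative.

```python
import torch

def tv_smoothness(factor: torch.Tensor) -> torch.Tensor:
    """Anisotropic total-variation penalty on the first two modes of a factor
    tensor, used as a stand-in for the local smoothness regularizer (the exact
    form used in the paper is an assumption)."""
    dh = factor[:, 1:, :] - factor[:, :-1, :]   # differences along mode-2
    dv = factor[1:, :, :] - factor[:-1, :, :]   # differences along mode-1
    return dh.abs().sum() + dv.abs().sum()

def objective(A, B, X_obs, mask, lam=0.1):
    """Hypothetical completion objective: data fit on observed entries plus
    smoothness of both factor tensors (X ~ slice-wise product of A and B)."""
    X_hat = torch.einsum('irk,rjk->ijk', A, B)   # simple stand-in for the tensor product
    fit = ((X_hat - X_obs) * mask).pow(2).sum()
    return fit + lam * (tv_smoothness(A) + tv_smoothness(B))
```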
Cost Volume Aggregation in Stereo Matching Revisited: A Disparity Classification Perspective
Cost aggregation plays a critical role in existing stereo matching methods. In this paper, we revisit cost aggregation in stereo matching from a disparity classification perspective and propose a generic yet efficient Disparity Context Aggregation (DCA) module to improve the performance of CNN-based methods. Our approach is based on the insight that a coarse disparity class prior is beneficial to disparity regression. To obtain such a prior, we first classify pixels in an image into several disparity classes and treat pixels within the same class as homogeneous regions. We then generate homogeneous region representations and incorporate these representations into the cost volume to suppress irrelevant information while enhancing the matching ability for cost aggregation. With the help of homogeneous region representations, efficient and informative cost aggregation can be achieved with only a shallow 3D CNN. Our DCA module is fully differentiable and compatible with different network architectures; it can be seamlessly plugged into existing networks to improve performance with a small additional overhead. It is demonstrated that our DCA module can effectively exploit disparity class priors to improve the performance of cost aggregation. Based on our DCA, we design a highly accurate network named DCANet, which achieves state-of-the-art performance on several benchmarks.
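A minimal sketch of the disparity-classification idea, under assumed details: soft-argmax regression from the cost volume yields a coarse disparity map, pixels are quantized into a few disparity classes, and per-class mean features serve as "homogeneous region representations". The function names, number of classes, and pooling scheme are illustrative, not the paper's.

```python
import torch
import torch.nn.functional as F

def soft_argmax_disparity(cost_volume: torch.Tensor) -> torch.Tensor:
    """cost_volume: (B, D, H, W) matching costs; returns a (B, H, W) disparity map."""
    prob = F.softmax(-cost_volume, dim=1)                      # lower cost -> higher probability
    disp_values = torch.arange(cost_volume.shape[1], device=cost_volume.device).float()
    return (prob * disp_values.view(1, -1, 1, 1)).sum(dim=1)

def homogeneous_region_means(features: torch.Tensor, disparity: torch.Tensor, num_classes: int = 8):
    """Quantize disparity into coarse classes and average features inside each class.
    features: (B, C, H, W). Returns per-class mean features of shape (B, num_classes, C)."""
    d_min = disparity.amin(dim=(1, 2), keepdim=True)
    d_max = disparity.amax(dim=(1, 2), keepdim=True)
    cls = ((disparity - d_min) / (d_max - d_min + 1e-6) * (num_classes - 1)).long()  # (B, H, W)
    onehot = F.one_hot(cls, num_classes).permute(0, 3, 1, 2).float()                 # (B, K, H, W)
    sums = torch.einsum('bkhw,bchw->bkc', onehot, features)
    counts = onehot.sum(dim=(2, 3)).clamp(min=1).unsqueeze(-1)
    return sums / counts
```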
CFOR: Character-First Open-Set Text Recognition via Context-Free Learning
The open-set text recognition task is a generalized form of the (close-set) text recognition task, where the model is further challenged to spot and incrementally recognize novel characters in new languages, which also indicates a shift in the language model. In this work, we propose to alleviate the confounding effect of such biases under an open-set setup by learning individual character representations out of their context. We propose a Character-First Open-Set Text Recognition framework that treats individual characters as first-class citizens by co-training the feature extractor with two context-free learning tasks. First, a Context Isolation Learning task is proposed to wipe out the context for each character, utilizing a character mask learned in a weakly supervised manner. Second, the framework adopts an Individual Character Learning task, which is a single-character classification task with synthetic samples. Our framework can reliably spot unseen characters in Japanese with an F1-score of over 64%, and can adapt to recognize unseen characters in Japanese, Korean, Greek, and other languages without retraining. The framework also shows decent many-shot performance on close-set text benchmarks, with 91.5% line accuracy on IIIT5k and a single-batched speed of over 69 FPS, making it a feasible universal lightweight OCR solution.
SegHSI: Semantic Segmentation of Hyperspectral Images with Limited Labeled Pixels
Hyperspectral images (HSIs), with hundreds of narrow spectral bands, are increasingly used for ground object classification in remote sensing. However, many HSI classification models operate pixel-by-pixel, limiting the utilization of spatial information and resulting in increased inference time for the whole image. This paper proposes SegHSI, an effective and efficient end-to-end HSI segmentation model, alongside a novel training strategy. SegHSI adopts a head-free structure with cluster attention modules and spatial-aware feedforward networks (SA-FFN) for multiscale spatial encoding. Cluster attention encodes pixels through constructed clusters within the HSI, while SA-FFN integrates depth-wise convolution to enhance spatial context. Our training strategy utilizes a student-teacher model framework that combines labeled pixel class information with consistency learning on unlabeled pixels. Experiments on three public HSI datasets demonstrate that SegHSI not only surpasses other state-of-the-art models in segmentation accuracy but also achieves inference time at the scale of seconds, even reaching sub-second speeds for full-image classification. Code is available at https://github.com/huanliu233/SegHSI.
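A plausible instantiation of the student-teacher training strategy described above: cross-entropy on the few labeled pixels, a consistency term on unlabeled pixels, and an EMA teacher. The exact losses and update rule used by SegHSI are assumptions here; only the general recipe is sketched.

```python
import torch
import torch.nn.functional as F

def seghsi_style_loss(student_logits, teacher_logits, labels, labeled_mask, alpha=1.0):
    """student_logits/teacher_logits: (B, K, H, W); labels: (B, H, W) class indices
    (arbitrary where unlabeled); labeled_mask: (B, H, W) bool."""
    safe_labels = torch.where(labeled_mask, labels, torch.zeros_like(labels))
    ce = F.cross_entropy(student_logits, safe_labels, reduction='none')      # (B, H, W)
    sup = (ce * labeled_mask).sum() / labeled_mask.sum().clamp(min=1)

    log_p_s = F.log_softmax(student_logits, dim=1)
    p_t = F.softmax(teacher_logits.detach(), dim=1)
    kl = F.kl_div(log_p_s, p_t, reduction='none').sum(dim=1)                 # (B, H, W)
    cons = (kl * ~labeled_mask).sum() / (~labeled_mask).sum().clamp(min=1)
    return sup + alpha * cons

@torch.no_grad()
def ema_update(teacher, student, momentum=0.99):
    """Teacher weights track an exponential moving average of the student's."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)
```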
Noisy-Aware Unsupervised Domain Adaptation for Scene Text Recognition
Unsupervised Domain Adaptation (UDA) has shown promise in Scene Text Recognition (STR) by facilitating knowledge transfer from labeled synthetic text (source) to more challenging unlabeled real scene text (target). However, existing UDA-based STR methods rely fully on the pseudo-labels of target samples, ignoring the impact of domain gaps (inter-domain noise) and varied natural environments (intra-domain noise), which results in poor pseudo-label quality. In this paper, we propose a novel noisy-aware unsupervised domain adaptation framework tailored for STR, which aims to enhance model robustness against both inter- and intra-domain noise, thereby providing more precise pseudo-labels for target samples. Concretely, we propose to reweight target pseudo-labels by estimating the entropy of refined probability distributions, which mitigates the impact of domain gaps on pseudo-labels. Additionally, a decoupled triple-P-N consistency matching module is proposed, which leverages data augmentation to increase data diversity, enhancing model robustness in diverse natural environments. Within this module, we design low-confidence-based character negative learning, which is decoupled from high-confidence-based positive learning, thus improving sample utilization under scarce target samples. Furthermore, we extend our framework to the more challenging Source-Free UDA (SFUDA) setting, where only a pre-trained source model is available for adaptation, with no access to source data. Experimental results on benchmark datasets demonstrate the effectiveness of our framework. Under the SFUDA setting, our method exhibits faster convergence and superior performance with less training data than previous UDA-based STR methods. Our method surpasses representative STR methods, establishing new state-of-the-art results across multiple datasets.
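One plausible form of the entropy-based reweighting described above: pseudo-labels whose refined distributions have high predictive entropy receive small weights in the pseudo-label loss. How the distributions are refined and the exact weighting function are assumptions.

```python
import math
import torch
import torch.nn.functional as F

def entropy_weighted_pseudo_label_loss(student_logits, refined_logits):
    """student_logits: (B, T, V) per-character predictions on target images;
    refined_logits: (B, T, V) refined distributions that supply the pseudo-labels."""
    probs = F.softmax(refined_logits.detach(), dim=-1)
    pseudo = probs.argmax(dim=-1)                                    # (B, T) pseudo-labels
    entropy = -(probs * probs.clamp(min=1e-8).log()).sum(dim=-1)     # (B, T)
    weight = 1.0 - entropy / math.log(refined_logits.size(-1))       # normalized to [0, 1]
    ce = F.cross_entropy(student_logits.flatten(0, 1), pseudo.flatten(), reduction='none')
    return (weight.flatten() * ce).mean()
```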
Constructing Diverse Inlier Consistency for Partial Point Cloud Registration
Partial point cloud registration aims to align partial scans into a shared coordinate system. While learning-based partial point cloud registration methods have achieved remarkable progress, they often fail to take full advantage of the relative positional relationships both within (intra-) and between (inter-) point clouds. This oversight hampers their ability to accurately identify overlapping regions and search for reliable correspondences. To address these limitations, we propose a diverse inlier consistency (DIC) method that adaptively embeds the positional information of a reliable correspondence within and between point clouds. Firstly, a diverse inlier consistency-driven region perception (DICdRP) module is devised, which encodes the positional information of the selected correspondence within each point cloud. This module enhances the sensitivity of all points to overlapping regions by recognizing the position of the selected correspondence. Secondly, a diverse inlier consistency-aware correspondence search (DICaCS) module is developed, which leverages relative positions between point clouds. This module learns an inter-point-cloud DIC weight to supervise correspondence compatibility, allowing for precise identification of correspondences and effective outlier filtration. Thirdly, diverse information is integrated throughout our framework to achieve a more holistic and detailed registration process. Extensive experiments on object-level and scene-level datasets demonstrate the superior performance of the proposed algorithm. The code is available at https://github.com/yxzhang15/DIC.
Learning a Cross-modality Anomaly Detector for Remote Sensing Imagery
A remote sensing anomaly detector identifies objects that deviate from the background as potential targets for Earth monitoring. Given the diversity of Earth anomaly types, a transferable model with cross-modality detection ability would be cost-effective and flexible toward new Earth observation sources and anomaly types. However, current anomaly detectors aim to learn a particular background distribution, so the trained model cannot be transferred to unseen images. Inspired by the fact that the deviation metric used for score ranking is consistent and independent of the image distribution, this study converts the learning target from the varying background distribution to the consistent deviation metric. We theoretically prove that a large-margin condition on labeled samples ensures the transferability of the learned deviation metric. To satisfy this condition, two large-margin losses for pixel-level and feature-level deviation ranking are proposed, respectively. Since real anomalies are difficult to acquire, anomaly simulation strategies are designed to compute the model loss. With large-margin learning of the deviation metric, the trained model achieves cross-modality detection across five modalities (hyperspectral, visible light, synthetic aperture radar (SAR), infrared, and low-light) in a zero-shot manner.
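A simplified sketch of a pixel-level large-margin deviation-ranking loss: simulated anomaly pixels are pushed to score at least a margin above background pixels. The actual pixel- and feature-level losses used in the paper may differ from this form.

```python
import torch

def pixel_margin_ranking_loss(anomaly_scores: torch.Tensor,
                              background_scores: torch.Tensor,
                              margin: float = 1.0) -> torch.Tensor:
    """Hinge-style ranking: every (simulated) anomaly pixel should score at least
    `margin` above every background pixel."""
    # Pairwise differences between background and anomaly deviation scores.
    diff = background_scores.view(-1, 1) - anomaly_scores.view(1, -1)     # (Nb, Na)
    return torch.clamp(diff + margin, min=0).mean()
```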
Laplacian Gradient Consistency Prior for Flash Guided Non-Flash Image Denoising
For flash guided non-flash image denoising, the main challenge is to explore the consistency prior between the two modalities. Most existing methods attempt to model the flash/non-flash consistency at the pixel level, which may easily lead to blurred edges. Different from these methods, we report an important finding in this paper: the modality gap between flash and non-flash images conforms to a Laplacian distribution in the gradient domain. Based on this finding, we establish a Laplacian gradient consistency (LGC) model for flash guided non-flash image denoising. This model is demonstrated to have faster convergence and higher denoising accuracy than the traditional pixel consistency model. By solving the LGC model, we further design a deep network named LGCNet. Different from existing image denoising networks, each component of LGCNet strictly matches the solution of the LGC model, giving the network good interpretability. The performance of the proposed LGCNet is evaluated on three different flash/non-flash image datasets, demonstrating its superior denoising performance over many state-of-the-art methods both quantitatively and qualitatively. The intermediate features are also visualized to verify the effectiveness of the Laplacian gradient consistency prior. The source codes are available at https://github.com/JingyiXu404/LGCNet.
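Under a Laplacian prior on the gradient-domain residual, the MAP data term becomes an L1 penalty on gradient differences. The sketch below computes such a term between a denoised non-flash image and the flash guide; it illustrates the prior only, not the full LGC model or LGCNet.

```python
import torch

def image_gradients(x: torch.Tensor):
    """Forward differences of an image batch (B, C, H, W)."""
    gx = x[..., :, 1:] - x[..., :, :-1]
    gy = x[..., 1:, :] - x[..., :-1, :]
    return gx, gy

def laplacian_gradient_consistency(denoised: torch.Tensor, flash: torch.Tensor) -> torch.Tensor:
    """L1 penalty on the gradient-domain difference between the denoised non-flash
    image and the flash guide; the L1 norm is the MAP counterpart of a Laplacian
    residual prior (a sketch, not the paper's exact LGC formulation)."""
    dgx, dgy = image_gradients(denoised)
    fgx, fgy = image_gradients(flash)
    return (dgx - fgx).abs().mean() + (dgy - fgy).abs().mean()
```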
Explainability Enhanced Object Detection Transformer with Feature Disentanglement
Explainability is a pivotal factor in determining whether a deep learning model can be authorized in critical applications. To enhance the explainability of end-to-end object DEtection with TRansformer (DETR) models, we introduce a disentanglement method that constrains the feature learning process, following a divide-and-conquer decoupling paradigm, similar to how people understand complex real-world problems. We first demonstrate the entangled property of the features between the extractor and detector and find that the regression function is a key factor contributing to the deterioration of disentangled feature activation. These highly entangled features always activate local characteristics, making it difficult to cover the semantic information of an object, which also reduces the interpretability of single-backbone object detection models. Thus, an Explainability Enhanced object detection Transformer with feature Disentanglement (DETD) model is proposed, in which the Tensor Singular Value Decomposition (T-SVD) is used to produce feature bases and the Batch averaged Feature Spectral Penalization (BFSP) loss is introduced to constrain the disentanglement of the features and balance the semantic activation. The proposed method is applied across three prominent backbones, two DETR variants, and a CNN-based model. Extensive experiments on two datasets consistently demonstrate that, by combining the two optimization techniques, the DETD model outperforms its counterparts in terms of object detection performance and feature disentanglement. Grad-CAM visualizations demonstrate the enhancement of feature learning explainability from the disentanglement view.
PVPUFormer: Probabilistic Visual Prompt Unified Transformer for Interactive Image Segmentation
Integration of diverse visual prompts, such as clicks, scribbles, and boxes, in interactive image segmentation significantly facilitates users' interaction and improves interaction efficiency. However, existing studies primarily encode the position or pixel regions of prompts without considering the contextual areas around them, resulting in insufficient prompt feedback, which hinders performance improvement. To tackle this problem, this paper proposes a simple yet effective Probabilistic Visual Prompt Unified Transformer (PVPUFormer) for interactive image segmentation, which allows users to flexibly input diverse visual prompts and uses probabilistic prompt encoding and feature post-processing to extract sufficient and robust prompt features for performance boosting. Specifically, we first propose a Probabilistic Prompt-unified Encoder (PPuE) to generate a unified one-dimensional vector by exploring both prompt and non-prompt contextual information, offering richer feedback cues to accelerate performance improvement. On this basis, we further present a Prompt-to-Pixel Contrastive (PC) loss to accurately align prompt and pixel features, bridging the representation gap between them to offer consistent feature representations for mask prediction. Moreover, our approach designs a Dual-cross Merging Attention (DMA) module to implement bidirectional feature interaction between image and prompt features, generating notable features for performance improvement. Comprehensive experiments on several challenging datasets demonstrate that the proposed components achieve consistent improvements, yielding state-of-the-art interactive segmentation performance. Our code is available at https://github.com/XuZhang1211/PVPUFormer.
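One plausible instantiation of a prompt-to-pixel contrastive loss: an InfoNCE-style objective that pulls the prompt embedding toward pixel features of the target object and away from the rest. The temperature, normalization, and exact formulation of the PC loss are assumptions.

```python
import torch
import torch.nn.functional as F

def prompt_pixel_contrastive_loss(prompt_emb, pixel_feats, target_mask, tau=0.1):
    """prompt_emb: (B, C); pixel_feats: (B, C, H, W); target_mask: (B, H, W) in {0, 1}."""
    p = F.normalize(prompt_emb, dim=1)                              # (B, C)
    f = F.normalize(pixel_feats.flatten(2), dim=1)                  # (B, C, H*W)
    sim = torch.einsum('bc,bcn->bn', p, f) / tau                    # (B, H*W)
    m = target_mask.flatten(1).float()                              # (B, H*W)
    pos = (sim * m).sum(dim=1) / m.sum(dim=1).clamp(min=1)          # mean similarity to target pixels
    # log-sum-exp over all pixels acts as the denominator of an InfoNCE-style loss
    return (torch.logsumexp(sim, dim=1) - pos).mean()
```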
Image Copy-Move Forgery Detection via Deep PatchMatch and Pairwise Ranking Learning
Recent advances in deep learning algorithms have shown impressive progress in image copy-move forgery detection (CMFD). However, these algorithms lack generalizability in practical scenarios where the copied regions are not present in the training images, or the cloned regions are part of the background. Additionally, these algorithms utilize convolution operations to distinguish source and target regions, leading to unsatisfactory results when the target regions blend well with the background. To address these limitations, this study proposes a novel end-to-end CMFD framework that integrates the strengths of conventional and deep learning methods. Specifically, the study develops a deep cross-scale PatchMatch (PM) method that is customized for CMFD to locate copy-move regions. Unlike existing deep models, our approach utilizes features extracted from high-resolution scales to seek explicit and reliable point-to-point matching between source and target regions. Furthermore, we propose a novel pairwise rank learning framework to separate source and target regions. By leveraging the strong prior of point-to-point matches, the framework can identify subtle differences and effectively discriminate between source and target regions, even when the target regions blend well with the background. Our framework is fully differentiable and can be trained end-to-end. Comprehensive experimental results highlight the remarkable generalizability of our scheme across various copy-move scenarios, significantly outperforming existing methods.
Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection
We explore multi-modal contextual knowledge learned through multi-modal masked language modeling to provide explicit localization guidance for novel classes in open-vocabulary object detection (OVD). Intuitively, a well-modeled and correctly predicted masked concept word should effectively capture the textual contexts, visual contexts, and the cross-modal correspondence between texts and regions, thereby automatically activating high attention on corresponding regions. In light of this, we propose a multi-modal contextual knowledge distillation framework, MMC-Det, to explicitly supervise a student detector with the context-aware attention of the masked concept words in a teacher fusion transformer. The teacher fusion transformer is trained with our newly proposed diverse multi-modal masked language modeling (D-MLM) strategy, which significantly enhances the fine-grained region-level visual context modeling in the fusion transformer. The proposed distillation process provides additional contextual guidance to the concept-region matching of the detector, thereby further improving the OVD performance. Extensive experiments performed upon various detection datasets show the effectiveness of our multi-modal context learning strategy.
Rethinking Noise Sampling in Class-Imbalanced Diffusion Models
In the practical application of image generation, dealing with long-tailed data distributions is a common challenge for diffusion-based generative models. To tackle this issue, we investigate the head-class accumulation effect in diffusion models' latent space, particularly focusing on its correlation with the noise sampling strategy. Our experimental analysis indicates that employing a consistent sampling distribution for the noise prior across all classes leads to a significant bias towards head classes in the noise sampling distribution, which results in poor quality and diversity of the generated images. Motivated by this observation, we propose a novel sampling strategy named Bias-aware Prior Adjusting (BPA) to debias diffusion models in the class-imbalanced scenario. With BPA, each class is automatically assigned an adaptive noise sampling distribution prior during training, effectively mitigating the influence of class imbalance on the generation process. Extensive experiments on several benchmarks demonstrate that images generated using our proposed BPA exhibit elevated diversity and superior quality.
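A minimal sketch of a class-conditional noise prior: each class draws its diffusion noise from its own Gaussian instead of a shared N(0, I). How BPA actually adapts these per-class parameters during training is not reproduced; the learnable placeholders below are purely illustrative.

```python
import torch

class ClassConditionalNoisePrior(torch.nn.Module):
    """Per-class Gaussian noise prior N(mu_c, sigma_c^2 I) replacing a shared N(0, I).
    The adaptation mechanism used by BPA is an assumption not reproduced here."""
    def __init__(self, num_classes: int, shape):
        super().__init__()
        self.mu = torch.nn.Parameter(torch.zeros(num_classes, *shape))
        self.log_sigma = torch.nn.Parameter(torch.zeros(num_classes, *shape))

    def sample(self, labels: torch.Tensor) -> torch.Tensor:
        """labels: (B,) class indices; returns noise of shape (B, *shape)."""
        mu = self.mu[labels]
        sigma = self.log_sigma[labels].exp()
        return mu + sigma * torch.randn_like(mu)
```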
Salient Object Detection From Arbitrary Modalities
Toward desirable saliency prediction, the types and numbers of inputs for a salient object detection (SOD) algorithm may dynamically change in many real-life applications. However, existing SOD algorithms are mainly designed or trained for one particular type of input, failing to generalize to other types of inputs. Consequently, more types of SOD algorithms need to be prepared in advance for handling different types of inputs, raising huge hardware and research costs. Differently, in this paper, we propose a new type of SOD task, termed Arbitrary Modality SOD (AM SOD). The most prominent characteristic of AM SOD is that the modality types and modality numbers are arbitrary or dynamically changing. The former means that the inputs to an AM SOD algorithm may be arbitrary modalities such as RGB, depth, or even any combination of them, while the latter indicates that the inputs may have an arbitrary number of modalities as the input type changes, e.g., a single-modality RGB image, dual-modality RGB-Depth (RGB-D) images, or triple-modality RGB-Depth-Thermal (RGB-D-T) images. Accordingly, a preliminary solution to the above challenges, i.e., a modality switch network (MSN), is proposed in this paper. In particular, a modality switch feature extractor (MSFE) is first designed to extract discriminative features from each modality effectively by introducing modality indicators, which generate weights for modality switching. Subsequently, a dynamic fusion module (DFM) is proposed to adaptively fuse features from a variable number of modalities based on a novel Transformer structure. Finally, a new dataset, named AM-XD, is constructed to facilitate research on AM SOD. Extensive experiments demonstrate that our AM SOD method can effectively cope with changes in the type and number of input modalities for robust salient object detection. Our code and AM-XD dataset will be released at https://github.com/nexiakele/AMSODFirst.
Learning Cross-Attention Point Transformer With Global Porous Sampling
In this paper, we propose a point-based cross-attention transformer named CrossPoints with a parametric Global Porous Sampling (GPS) strategy. The attention module is crucial for capturing the correlations between different tokens in transformers. Most existing point-based transformers design multi-scale self-attention operations with point clouds down-sampled by the widely used Farthest Point Sampling (FPS) strategy. However, FPS only generates sub-clouds with holistic structures, which fails to fully exploit the flexibility of points to generate diversified tokens for the attention module. To address this, we design a cross-attention module with parametric GPS and Complementary GPS (C-GPS) strategies to generate a series of diversified tokens through controllable parameters. We show that FPS is a degenerate case of GPS, and the network learns more abundant relational information about structure and geometry when we perform consecutive cross-attention over the tokens generated from GPS- and C-GPS-sampled points. More specifically, we set evenly-sampled points as queries and design our cross-attention layers with GPS- and C-GPS-sampled points as keys and values. To further improve the diversity of tokens, we design a deformable operation over points to adaptively adjust the points according to the input. Extensive experimental results on both shape classification and indoor scene segmentation tasks indicate promising boosts over recent point cloud transformers. We also conduct ablation studies to show the effectiveness of our proposed cross-attention module with the GPS strategy.
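A minimal sketch of cross-attention with one sampled point subset as queries and another as keys/values, echoing the design described above. Random index sets stand in for the GPS and C-GPS samplers, whose construction is not reproduced here.

```python
import torch
import torch.nn as nn

class PointCrossAttention(nn.Module):
    """Cross-attention between two sampled subsets of a point cloud's features."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats: torch.Tensor, q_idx: torch.Tensor, kv_idx: torch.Tensor):
        """feats: (B, N, C) per-point features; q_idx/kv_idx: (B, M) index tensors."""
        q = torch.gather(feats, 1, q_idx.unsqueeze(-1).expand(-1, -1, feats.size(-1)))
        kv = torch.gather(feats, 1, kv_idx.unsqueeze(-1).expand(-1, -1, feats.size(-1)))
        out, _ = self.attn(q, kv, kv)
        return out

# Example with stand-in samplers (random subsets in place of GPS / C-GPS):
B, N, C = 2, 1024, 64
feats = torch.randn(B, N, C)
q_idx = torch.randint(0, N, (B, 256))
kv_idx = torch.randint(0, N, (B, 256))
out = PointCrossAttention(C)(feats, q_idx, kv_idx)   # (B, 256, C)
```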
GSSF: Generalized Structural Sparse Function for Deep Cross-Modal Metric Learning
Cross-modal metric learning is a prominent research topic that bridges the semantic heterogeneity between vision and language. Existing methods frequently utilize simple cosine or complex distance metrics to transform pairwise features into a similarity score, which suffers from either inadequate or inefficient distance measurement. Consequently, we propose a Generalized Structural Sparse Function to dynamically capture thorough and powerful relationships across modalities for pairwise similarity learning while remaining concise yet efficient. Specifically, the distance metric delicately encapsulates two formats of diagonal and block-diagonal terms, automatically distinguishing and highlighting cross-channel relevancy and dependency inside a structured and organized topology. It thereby adapts to the optimal matching patterns between paired features and reaches a sweet spot between model complexity and capability. Extensive experiments on cross-modal and two extra uni-modal retrieval tasks (image-text retrieval, person re-identification, fine-grained image retrieval) have validated its superiority and flexibility over various popular retrieval frameworks. More importantly, we further discover that it can be seamlessly incorporated into multiple application scenarios, demonstrating promising prospects from attention mechanisms to knowledge distillation in a plug-and-play manner.
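A sketch of a distance whose metric matrix combines a learnable diagonal with block-diagonal terms, echoing the structure described above. The actual GSSF parameterization and its sparsity mechanism are assumptions and are not reproduced.

```python
import torch
import torch.nn as nn

class StructuredSparseDistance(nn.Module):
    """Mahalanobis-style distance with a diagonal plus block-diagonal metric matrix."""
    def __init__(self, dim: int, block: int):
        super().__init__()
        assert dim % block == 0
        self.diag = nn.Parameter(torch.ones(dim))
        self.blocks = nn.Parameter(torch.stack([torch.eye(block)] * (dim // block)))

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        d = x - y                                                     # (B, dim)
        diag_term = (self.diag * d * d).sum(dim=-1)
        db = d.view(d.size(0), self.blocks.size(0), -1)               # (B, num_blocks, block)
        block_term = torch.einsum('bnk,nkl,bnl->b', db, self.blocks, db)
        return diag_term + block_term
```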
AnlightenDiff: Anchoring Diffusion Probabilistic Model on Low Light Image Enhancement
Low-light image enhancement aims to improve the visual quality of images captured under poor illumination. However, enhancing low-light images often introduces artifacts, color bias, and low SNR. In this work, we propose AnlightenDiff, an anchoring diffusion model for low-light image enhancement. Diffusion models can enhance a low-light image to a well-exposed image by iterative refinement, but require anchoring to ensure that the enhanced results remain faithful to the input. We propose a Dynamical Regulated Diffusion Anchoring mechanism and Sampler to anchor the enhancement process. We also propose a Diffusion Feature Perceptual Loss tailored for diffusion-based models to utilize different loss functions in the image domain. AnlightenDiff demonstrates the effectiveness of diffusion models for low-light enhancement, achieving high perceptual quality results. Our techniques show a promising future direction for applying diffusion models to image enhancement.
Multi-Dimensional Visual Data Restoration: Uncovering the Global Discrepancy in Transformed High-Order Tensor Singular Values
The recently proposed high-order tensor algebraic framework generalizes the tensor singular value decomposition (t-SVD) induced by an invertible linear transform from order-3 to order-d tensors. However, the derived order-d t-SVD rank essentially ignores the implicit global discrepancy in the quantity distribution of non-zero transformed high-order singular values across the higher modes of tensors. This oversight leads to suboptimal restoration when processing real-world multi-dimensional visual datasets. To address this challenge, in this study we look in depth at the intrinsic properties of practical visual data tensors and put our efforts into faithfully measuring their high-order low-rank nature. Technically, we first present a novel order-d tensor rank definition. This rank function effectively captures the aforementioned discrepancy property observed in real visual data tensors and is thus called the discrepant t-SVD rank. Subsequently, we introduce a nonconvex regularizer to facilitate the construction of the corresponding discrepant t-SVD rank minimization regime. The results show that the investigated low-rank approximation has a closed-form solution and avoids the dilemmas caused by previous convex optimization approaches. Based on this new regime, we meticulously develop two models for typical restoration tasks: high-order tensor completion and high-order tensor robust principal component analysis. Numerical examples on order-4 hyperspectral videos, order-4 color videos, and order-5 light field images substantiate that our methods outperform state-of-the-art tensor-represented competitors. Finally, taking a fundamental order-3 hyperspectral tensor restoration task as an example, we further demonstrate the effectiveness of our new rank minimization regime for more practical applications. The source codes of the proposed methods are available at https://github.com/CX-He/DTSVD.git.
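For context, the sketch below computes the standard FFT-induced t-SVD quantities for an order-3 tensor: transform along the third mode, then take the SVD of each frontal slice. Counting the non-zero transformed singular values per slice is the quantity whose distribution across modes the discrepant rank is designed to capture; the rank definition itself is not reproduced, and higher orders and other transforms follow the same pattern.

```python
import numpy as np

def transformed_singular_values(X: np.ndarray) -> np.ndarray:
    """Transformed singular values of an order-3 tensor under the FFT-based t-SVD:
    FFT along mode 3, then the SVD of every frontal slice. Returns (min(n1, n2), n3)."""
    Xf = np.fft.fft(X, axis=2)                          # invertible linear transform
    n1, n2, n3 = X.shape
    svals = np.empty((min(n1, n2), n3))
    for k in range(n3):
        svals[:, k] = np.linalg.svd(Xf[:, :, k], compute_uv=False)
    return svals

# Example: a tensor built by the t-product of two thin factors has at most 5
# non-zero transformed singular values in every frontal slice.
A, B = np.random.randn(30, 5, 20), np.random.randn(5, 30, 20)
Af, Bf = np.fft.fft(A, axis=2), np.fft.fft(B, axis=2)
X = np.fft.ifft(np.einsum('irk,rjk->ijk', Af, Bf), axis=2).real   # t-product of A and B
s = transformed_singular_values(X)
print((s > 1e-8 * s.max()).sum(axis=0))                 # ~5 active values per slice here
```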
λ-Domain Rate Control via Wavelet-Based Residual Neural Network for VVC HDR Intra Coding
High dynamic range (HDR) video offers a more realistic visual experience than standard dynamic range (SDR) video, while introducing new challenges to both compression and transmission. Rate control is an effective technology for overcoming these challenges and ensuring optimal HDR video delivery. However, the rate control algorithm in the latest video coding standard, versatile video coding (VVC), is tailored to SDR videos and does not produce good coding results when encoding HDR videos. To address this problem, a data-driven λ-domain rate control algorithm is proposed for VVC HDR intra frames in this paper. First, the coding characteristics of HDR intra coding are analyzed, and a piecewise R-λ model is proposed to accurately determine the correlation between the rate (R) and the Lagrange parameter λ for HDR intra frames. Second, to optimize bit allocation at the coding tree unit (CTU) level, a wavelet-based residual neural network (WRNN) is developed to accurately predict the parameters of the piecewise R-λ model for each CTU. Third, a large-scale HDR dataset is established for training WRNN, which facilitates the application of deep learning in HDR intra coding. Extensive experimental results show that our proposed HDR intra frame rate control algorithm achieves superior coding results compared with state-of-the-art algorithms. The source code of this work will be released at https://github.com/TJU-Videocoding/WRNN.git.
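For intuition, a λ-domain rate control model typically follows the hyperbolic relation λ = α·bpp^β; a piecewise variant applies separate (α, β) on each bpp segment. The sketch below illustrates that idea only: the segment boundaries and parameter values are placeholders, and in the paper these parameters are predicted per CTU by the WRNN rather than fixed.

```python
import bisect

def piecewise_lambda(bpp: float, breakpoints, alphas, betas) -> float:
    """Piecewise hyperbolic lambda-domain model: lambda = alpha_i * bpp ** beta_i,
    where segment i is chosen by the bits-per-pixel value (illustrative only)."""
    i = bisect.bisect_right(breakpoints, bpp)
    return alphas[i] * (bpp ** betas[i])

# Hypothetical two-segment example (all numbers are illustrative):
lam = piecewise_lambda(0.05, breakpoints=[0.1], alphas=[3.2, 4.1], betas=[-1.37, -1.45])
```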
Energy-Based Domain Adaptation Without Intermediate Domain Dataset for Foggy Scene Segmentation
Robust segmentation performance under dense fog is crucial for autonomous driving, but collecting labeled real foggy scene datasets is burdensome in the real world. To this end, existing methods have adapted models trained on labeled clear weather images to the unlabeled real foggy domain. However, these approaches require intermediate domain datasets (e.g. synthetic fog) and involve multi-stage training, making them cumbersome and less practical for real-world applications. In addition, the issue of overconfident pseudo-labels by a confidence score remains less explored in self-training for foggy scene adaptation. To resolve these issues, we propose a new framework, named DAEN, which Directly Adapts without additional datasets or multi-stage training and leverages an ENergy score in self-training. Notably, we integrate a High-order Style Matching (HSM) module into the network to match high-order statistics between clear weather features and real foggy features. HSM enables the network to implicitly learn complex fog distributions without relying on intermediate domain datasets or multi-stage training. Furthermore, we introduce Energy Score-based Pseudo-Labeling (ESPL) to mitigate the overconfidence issue of the confidence score in self-training. ESPL generates more reliable pseudo-labels through a pixel-wise energy score, thereby alleviating bias and preventing the model from assigning pseudo-labels exclusively to head classes. Extensive experiments demonstrate that DAEN achieves state-of-the-art performance on three real foggy scene datasets and exhibits a generalization ability to other adverse weather conditions. Code is available at https://github.com/jdg900/daen.
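A minimal sketch of energy-score-based pseudo-labeling: the pixel-wise free energy E = -T·logsumexp(logits / T) is used as a reliability measure, and pseudo-labels are assigned only to low-energy pixels. The thresholding rule here is a simple stand-in for ESPL's selection mechanism.

```python
import torch

def energy_based_pseudo_labels(logits: torch.Tensor, threshold: float, T: float = 1.0):
    """logits: (B, K, H, W). Lower energy indicates a more reliable prediction;
    pixels whose energy exceeds the threshold are ignored (label -1)."""
    energy = -T * torch.logsumexp(logits / T, dim=1)        # (B, H, W) pixel-wise energy
    pseudo = logits.argmax(dim=1)
    pseudo[energy > threshold] = -1                          # -1 == ignore index
    return pseudo, energy
```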
Efficient Swin Transformer for Remote Sensing Image Super-Resolution
Remote sensing super-resolution (SR), which aims to generate a high-resolution image with rich spatial details from its low-resolution counterpart, plays a vital role in many applications. Recently, more and more studies have attempted to explore the application of Transformers in the remote sensing field. However, they suffer from high computational burden and memory consumption for remote sensing super-resolution. In this paper, we propose an efficient Swin Transformer (ESTNet) via channel attention for SR of remote sensing images, which is composed of three components. First, a three-layer convolutional operation is utilized to extract shallow features of the input low-resolution image. Then, a residual group-wise attention module is proposed to extract deep features, which contains an efficient channel attention block (ECAB) and a group-wise attention block (GAB). Finally, the extracted deep features are reconstructed to generate high-resolution remote sensing images. Extensive experimental results demonstrate that the proposed ESTNet can obtain better super-resolution results with a low computational burden. Compared to a recently proposed Transformer-based remote sensing super-resolution method, the number of parameters is reduced by 82.68% while the computational cost is reduced by 87.84%. The code of the proposed ESTNet will be available at https://github.com/PuhongDuan/ESTNet for reproducibility.
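A channel attention block in the spirit of ECA (global average pooling, a cheap 1-D convolution across channels, and a sigmoid gate) is sketched below. Whether ESTNet's ECAB takes exactly this form, and the kernel size, are assumptions.

```python
import torch
import torch.nn as nn

class EfficientChannelAttention(nn.Module):
    """ECA-style channel attention: pool, 1-D conv across channels, sigmoid gate."""
    def __init__(self, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                                # (B, C) global average pool
        w = self.conv(w.unsqueeze(1)).squeeze(1)              # 1-D conv over the channel axis
        return x * torch.sigmoid(w).unsqueeze(-1).unsqueeze(-1)
```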