SCTransNet: Spatial-channel Cross Transformer Network for Infrared Small Target Detection (2024)

Shuai Yuan, Hanlin Qin, Xiang Yan, Naveed Akhtar, Ajmal Mian

This work was supported in part by the Shaanxi Province Key Research and Development Plan Project under Grant 2022JBGS2-09, in part by the 111 Project under Grant B17035, in part by the Shaanxi Province Science and Technology Plan Project under Grant 2023KXJ-170, in part by the Xi'an City Science and Technology Plan Project under Grant 21JBGSZ-QCY9-0004, Grant 23ZDCYJSGG0011-2023, Grant 22JBGS-QCY4-0006, and Grant 23GBGS0001, in part by the Aeronautical Science Foundation of China under Grant 20230024081027, in part by the Natural Science Foundation Explore of Zhejiang Province under Grant LTGG24F010001, in part by the Natural Science Foundation of Ningbo under Grant 2022J185, in part by the China Scholarship Council under Grant 202306960052, in part by the Technology Area Foundation of China under Grants 2021-JJ-1244, 2021-JJ-0471, and 2023-JJ-0148, and in part by the Xidian Graduate Student Innovation Fund under Grant YJSJ23010. (Corresponding authors: Hanlin Qin; Xiang Yan.)

Shuai Yuan, Hanlin Qin, and Xiang Yan are with the School of Optoelectronic Engineering, Xidian University, Xi'an 710071, China (e-mail: yuansy@stu.xidian.edu.cn; hlqin@mail.xidian.edu.cn; xyan@xidian.edu.cn).

Naveed Akhtar is with the School of Computing and Information Systems, Faculty of Engineering and IT, The University of Melbourne, Parkville, VIC 3052, Australia (e-mail: naveed.akhtar1@unimelb.edu.au).

Ajmal Mian is with the Department of Computer Science and Software Engineering, The University of Western Australia, Perth, WA 6009, Australia (e-mail: ajmal.mian@uwa.edu.au).

Abstract

This is the pre-acceptance version; to read the final version, please go to IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING on IEEE Xplore.

Infrared small target detection (IRSTD) has recently benefited greatly from U-shaped neural models. However, largely overlooking effective global information modeling, existing techniques struggle when the target has high similarity with the background. We present a Spatial-channel Cross Transformer Network (SCTransNet) that leverages spatial-channel cross transformer blocks (SCTBs) on top of long-range skip connections to address this challenge. In the proposed SCTBs, the outputs of all encoders interact through a cross transformer to generate mixed features, which are redistributed to all decoders to effectively reinforce semantic differences between the target and clutter at full levels. Specifically, an SCTB contains the following two key elements: (a) a spatial-embedded single-head channel-cross attention (SSCA) for exchanging local spatial features and full-level global channel information to eliminate ambiguity among the encoders and facilitate high-level semantic associations of the images, and (b) a complementary feed-forward network (CFN) for enhancing the feature discriminability via a multi-scale strategy and cross spatial-channel information interaction to promote beneficial information transfer. Our SCTransNet effectively encodes the semantic differences between targets and backgrounds to boost its internal representation for detecting small infrared targets accurately. Extensive experiments on three public datasets, NUDT-SIRST, NUAA-SIRST, and IRSTD-1K, demonstrate that the proposed SCTransNet outperforms existing IRSTD methods. Our code will be made public at https://github.com/xdFai/SCTransNet.

Index Terms:

Infrared small target detection, transformer, cross attention, CNN, deep learning.

I Introduction

Infrared small target detection (IRSTD) plays an important role in traffic monitoring[1], maritime rescue[2], and target warning[3], where separating small targets from complex scene backgrounds is required. The challenges emerging from the dynamic nature of scenes have attracted considerable research attention in single-frame IRSTD[4]. Early methods in this direction employed image filtering[5],[6], human visual system (HVS)[7],[8], and low-rank approximation[9],[10] techniques while relying on complex handcrafted feature designs, empirical observations, and model parameter fine-tuning. However, suffering from the absence of a reliable high-level understanding of the holistic scene, these methods exhibit poor robustness.

Recently, learning-based methods have become more popular due to their strong data-driven feature mining abilities[11]. To capture the target's outline and mitigate performance degradation caused by its small size, these methods approach the IRSTD problem as a semantic segmentation task instead of a traditional object detection issue. Unlike general object segmentation in autonomous driving[12], the imaging mechanism of IR detection systems in remote sensing applications[13] leads to small targets in images exhibiting the following characteristics. 1) Dim and small: Due to remote imaging, IR targets are small and usually exhibit a low signal-to-clutter ratio, making them susceptible to immersion in heavy noise and background clutter. 2) Characterless: Thermal images lack color and texture information in targets, and imprecise camera focus can cause target blurring. These factors pose peculiar challenges in designing feature extraction techniques for IRSTD. 3) Uncertain shapes: The scales and shapes of IR targets vary significantly across different scenes, which makes the detection problem considerably challenging.

[Figure 1]

To identify small IR targets in complex backgrounds, numerous learning-based methods have been proposed, among which neural networks with U-shaped architectures have gained prominence. Benefiting from these frameworks of encoders, decoders, and long-range skip connections, the asymmetric contextual modulation (ACM) network[14] initially demonstrated the effectiveness of cross-layer feature fusion for retaining IR target features. This is achieved through bidirectional aggregation of high-level semantic information and low-level details using asymmetric top-down and bottom-up structures. Subsequently, feature fusion strategies have been widely adopted in the IRSTD task[18],[19],[20],[21]. A few recent methods facilitate the transfer of beneficial features to the decoder component by improving the skip connections[22],[23]. Inspired by the nested structure[24], DNA-Net[15] developed a densely nested interactive module to facilitate gradual interaction between high- and low-level features and adaptively enhance features. Moreover, there are also approaches that focus on developing more effective encoders and decoders[25],[26]. For instance, UIU-Net[16] embeds smaller U-Nets in the U-Net to learn the local contrast information of the target and performs interactive-cross attention (IC-A) for feature fusion.

Despite achieving satisfactory results, the aforementioned CNN-based approaches lack the ability to encode comprehensive attributes of the target, missing its discriminative features. To address that, MTU-Net[17] employs a multilevel Vision Transformer (ViT)-CNN hybrid encoder to exploit the spatial correlation among all encoded features for contextual information aggregation. However, a simple spatial ViT-CNN hybrid module is insufficient for understanding the global semantics of images, which leads to high false alarms. To further dissect the issue, we illustrate the frameworks of ACM[14], DNA-Net[15], UIU-Net[16], and MTU-Net[17] separately, along with visualizations of the attention maps from different decoder levels, in Fig.1(c)-(f). Given the input image in Fig.1(b), we observe that false alarms occur when existing models direct their attention to localized regions of background clutter in high-level features. In other words, false alarms are often caused by discontinuous modeling of backgrounds in the deeper layers. We attribute this problem to the following three main reasons:

1) Semantic interaction across feature levels is not established well. As shown in Fig.1(a)①, IR small targets exhibit limited features owing to their diminutive size. Multiple downsampling processes inevitably result in the loss of spatial information. This considerably affects the level-to-level feature interactions in the network, eventually leading to poor encoding of comprehensive global semantic information.

2) Feature enhancement fails to bridge the information gap between encoders and decoders. As shown in Fig.1(a)②, there exists a semantic gap between the output features of the encoders and the input features of the decoders. Simple skip connections and densely nested modules are insufficient to enhance the advantageous responses of the features passed to the decoder, thereby making it challenging to establish a mapping relationship from the IR image to the segmentation space.

3) Inaccurate long-range contextual perception of targets and backgrounds in deeper layers. IR small targets can be highly similar to the scene background. As shown in Fig.1(a)③, a powerful detector not only has to sense the local saliency of the target but also needs to model the continuity of the background. Convolutional neural networks (CNNs) and vanilla ViTs are not fully equipped to achieve this.


Inspired by the success of the channel-wise cross fusion transformer in image segmentation[27],[28],[29] and local spatial embedding in image restoration[30],[31],[32], we propose a spatial-channel cross transformer network (SCTransNet) for IRSTD to address the above challenges, aiming to distinguish small targets from background clutter in the deeper layers. As illustrated in Fig.1(g), our framework adds multiple spatial-channel cross transformer blocks (SCTBs) (Sec.III-B) on the original skip connections to establish an explicit association between all encoders and decoders. Specifically, the SCTB consists of two components: spatial-embedded single-head channel-cross attention (SSCA) (Sec.III-B1) and a complementary feed-forward network (CFN) (Sec.III-B2).

The SSCA applies channel cross-attention along the feature dimension at all levels to learn global information. Besides, depth-wise convolutions are used for local spatial context mixing before the feature covariance computation. This strategy provides two advantages. Firstly, it highlights the context of local space with a small computational overhead using the convolution's local connectivity, thereby increasing the saliency of IR small targets. Secondly, it ensures that contextualized global relationships among full-level feature pixels are implicitly captured during the attention matrix computation, thereby reinforcing the continuity of the background.

After the SSCA completes the cross-level information interaction, the CFN performs feature enhancement at every level in two complementary stages. Initially, it utilizes multi-scale depth-wise convolutions to enhance the target neighborhood spatial response and aggregates the cross-channel nonlinear information pixel-wise. Subsequently, it estimates the total spatial information on a channel-by-channel basis using global average pooling and creates local cross-channel interactions between distinct semantic patterns as an attention map. The above strategy has two advantages. (1) Multi-scale spatial modeling can emphasize semantic differences between the target and background. (2) Establishing the complementary correlation of the local-space global-channel (LSGC) and global-space local-channel (GSLC) representations can facilitate the mapping from infrared images to semantic maps.

Benefiting from the above structure (Fig.1(g)), our SCTransNet can perceive the image semantics better than other methods, leading to reduced false alarms. Our main contributions are as follows:

  • We propose SCTransNet, which leverages multiple spatial-channel cross transformer blocks (SCTB) connecting all encoders and decoders to predict the context of targets and backgrounds in the deeper network layers.

  • We propose a spatial-embedded single-head channel-cross attention (SSCA) module to foster semantic interactions across all feature levels and learn the long-range context correlation of the image.

  • We devise a novel complementary feed-forward network (CFN) by crossing spatial-channel information to enhance the semantic difference between the target and background, bridging the semantic gap between encoders and decoders.

II RELATED WORK

We first briefly review the CNN- and transformer-based techniques in IRSTD. Following that, we discuss the application of channel-wise cross transformer in image processing.

II-A CNN-based IRSTD methods

Owing to the local saliency of IR small targets coinciding with the local connectivity of convolutional neural networks (CNNs), CNNs have demonstrated remarkable performance in the IRSTD task. To effectively preserve the semantic patterns of small targets, diverse feature fusion strategies have been proposed. One common strategy is cross-layer feature fusion[33],[34],[35], which can address the loss of target information when fusing the encoded and decoded features. Additionally, densely nested interactive feature fusion[15],[36] is used to repetitively fuse and enhance the features of different levels, maintaining the information of IR small targets in the deeper layers. Considering variations in target scales, multi-scale feature fusion[37],[38] has been proposed to enhance the low-resolution feature maps. Besides feature fusion, incorporating prior information about the target into CNNs is also an effective strategy. For instance, Sun et al.[39] exploited the gray-gradient change property of small targets using a receptive-field and direction-induced attention network (RDIAN), which addresses the imbalance between the target and background classes. Zhang et al.[40] used Taylor's finite difference for complex edge feature extraction of a target to enhance the grayscale difference between the target and background.

Although satisfactory results are achieved by CNN-based techniques, the inherent inductive bias of CNNs makes it difficult to unambiguously establish long-range contextual information for the IRSTD task.Unlike the aforementioned methods, we incorporate transformer blocks into the backbone of CNNs as a core unit to capture non-local information for the entire image.

[Figure 2]

II-B Transformer-based IRSTD methods

The Vision Transformer (ViT)[41] decomposes an image/features into a series of patches and computes their correlation. This computational paradigm can stably establish long-distance dependence among different patches, leading to its widespread usage in IRSTD tasks for global image modeling[42],[43],[44]. Inspired by TransUnet[45], IRSTFormer[46] embedded the spatial transformer within multiple encoder stages in a U-Net. Motivated by the Swin transformer[47], FTC-Net[48] establishes a robust feature representation of the target using a two-branch structure combining the local feature extraction of CNNs and the global feature extraction capability of the Swin transformer. Recently, Meng et al.[49] modeled the local gradient information of the target using central difference convolution and employed criss-cross multi-attention[50] to acquire contextual information. Note that the above methods use spatial self-attention (SA) to calculate covariance-based attention maps, which has two problems: 1) The computational complexity is proportional to the square of the number of tokens, which limits the multiple nesting of the spatial transformer and its fine-grained representation of high-resolution images[30]. 2) The SA only constructs long-distance dependency for a single feature map, whereas it is more critical to establish contextual connections among all levels.

Different from previous works, we present the channel-wise cross transformer on the long-range skip connections for the first time in the IRSTD task.This allows establishing cross-channel semantic patterns across all levels with an acceptable computational overhead.

II-C Channel-wise Cross Transformer on Image Processing

Unlike spatial transformers, channel-wise transformers (CT)[30] treat each channel as a patch. Since every channel represents a unique semantic pattern, CT essentially establishes correlations between multiple semantic patterns. Considering that not every skip connection is effective, Wang et al.[27] proposed UCTransNet, utilizing a channel-wise cross fusion transformer (CCT) to address the semantic difference for precise medical image segmentation. The CCT's powerful global semantic modeling capability facilitates its widespread application in tasks such as metal surface defect detection[29], remote sensing image segmentation[51], and building edge detection[28]. This inspires us to introduce this model to separate IR targets and backgrounds in the deeper layers effectively. However, IR small targets differ significantly from the usual large-size targets not only in size but also in terms of effective features and sample balance. The attention matrix computation, the positional encoding, and the pure channel modeling in the vanilla CCT are detrimental to detecting targets with limited pixels. Therefore, we propose a spatial-channel cross transformer block. Its launching point is leveraging the target's local spatial saliency and the global background continuity to separate the target in the deep layers.

III METHOD

This section elaborates on the proposed Spatial-channel Cross Transformer Network (SCTransNet) for infrared small target detection. We begin by presenting the overall structure of the proposed SCTransNet in Section III-A. Then, we present the technical details of the spatial-channel cross transformer block (SCTB) and its internal structure: spatial-embedded single-head channel-cross attention (SSCA) and the complementary feed-forward network (CFN) in Section III-B.

III-A Overall pipeline

As shown in Fig.2, given an infrared image, SCTransNet initially employs four groups of residual blocks (RBs)[52] and max-pooling layers to acquire high-level features $\mathbf{E_i}\in\mathbb{R}^{C_i\times\frac{H}{i}\times\frac{W}{i}}$ $(i=1,2,3,4)$, where $C_i$ are the channel dimensions, with $C_1=32$, $C_2=64$, $C_3=128$, and $C_4=256$. Next, we perform patch embedding on $\mathbf{E_i}$ using convolutions with kernel and stride sizes of $P$, $P/2$, $P/4$, and $P/8$ to obtain the embedded layers $\mathbf{I_i}\in\mathbb{R}^{C_i\times\frac{H}{16}\times\frac{W}{16}}$ $(i=1,2,3,4)$, respectively. These layers are then fed into the SCTB for full-level semantic feature blending, yielding the outputs $\mathbf{O_i}\in\mathbb{R}^{C_i\times\frac{H}{16}\times\frac{W}{16}}$ $(i=1,2,3,4)$, which have the same size as $\mathbf{I_i}$. Details of the SCTB are provided in the next section. The $\mathbf{O_i}$ are recovered to the sizes originally processed by the encoders using feature mapping (FM), which consists of bilinear interpolation, convolution, batch normalization, and ReLU activation. Meanwhile, we employ a residual connection to merge the features between the encoders and decoders. The process described above can be expressed mathematically as

$$\mathbf{O_i}=\mathbf{E_i}+\mathrm{FM}_i(\mathrm{SCTB}(\mathbf{I_1},\mathbf{I_2},\mathbf{I_3},\mathbf{I_4}))\quad(i=1,2,3,4).\tag{1}$$

Finally, the Channel-wise Cross Attention (CCA)[27] is employed to fuse the high- and low-level features, followed by decoding using two CBL blocks.
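To make the data flow of Eq. (1) concrete, the following is a minimal PyTorch sketch of the skip-connection path: multi-level encoder features are patch-embedded to a common $\frac{H}{16}\times\frac{W}{16}$ grid, blended by the SCTB (treated as an opaque callable here), restored by FM, and merged residually with the encoder features. All module and variable names are illustrative assumptions, not the released implementation.

```python
# Hedged sketch of the skip-connection path in Eq. (1); not the released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipFusionSketch(nn.Module):
    def __init__(self, channels=(32, 64, 128, 256), patch=16, sctb=None):
        super().__init__()
        strides = [patch, patch // 2, patch // 4, patch // 8]            # P, P/2, P/4, P/8
        self.embeds = nn.ModuleList(                                     # patch embedding per level
            nn.Conv2d(c, c, kernel_size=s, stride=s) for c, s in zip(channels, strides))
        self.sctb = sctb if sctb is not None else (lambda feats: feats)  # placeholder SCTB
        self.fms = nn.ModuleList(                                        # FM: conv + BN + ReLU
            nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.ReLU(inplace=True))
            for c in channels)

    def forward(self, encoder_feats):                                    # [E1, E2, E3, E4]
        tokens = [emb(e) for emb, e in zip(self.embeds, encoder_feats)]  # all at H/16 x W/16
        mixed = self.sctb(tokens)                                        # full-level blending
        outs = []
        for e, m, fm in zip(encoder_feats, mixed, self.fms):
            up = F.interpolate(m, size=e.shape[-2:], mode="bilinear", align_corners=False)
            outs.append(e + fm(up))                                      # O_i = E_i + FM_i(...)
        return outs

feats = [torch.randn(1, c, 256 // 2 ** i, 256 // 2 ** i) for i, c in enumerate((32, 64, 128, 256))]
print([o.shape for o in SkipFusionSketch()(feats)])
```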

To enhance the gradient propagation efficiency and feature representation, we utilize a multi-scale deeply supervised fusion strategy to optimize SCTransNet. Specifically, a $1\times 1$ convolution and a sigmoid function are applied to each decoder output $\mathbf{F_i}$, acquiring the saliency map $\mathbf{M_i}$, which is denoted as

$$\mathbf{M_i}=\mathrm{Sigmoid}(f_{1\times 1}(\mathbf{F_i}))\quad(i=1,2,3,4,5).\tag{2}$$

Next, we upsample the low-resolution saliency maps $\mathbf{M_i}$ $(i=2,3,4,5)$ to the original image size and fuse all the saliency maps to obtain $\mathbf{M_{\sum}}$ as

$$\mathbf{M_{\sum}}=\mathrm{Sigmoid}(f_{1\times 1}[\mathbf{M_1},\mathcal{B}(\mathbf{M_2}),\mathcal{B}(\mathbf{M_3}),\mathcal{B}(\mathbf{M_4}),\mathcal{B}(\mathbf{M_5})]),\tag{3}$$

where $[\cdot]$ denotes channel-wise concatenation and $\mathcal{B}$ denotes bilinear interpolation. Finally, we calculate the Binary Cross Entropy (BCE)[16] loss between the saliency maps and the ground truth (GT) $\mathbf{Y}$ as below, and combine the losses:

$$l_1=\mathcal{L}_{BCE}(\mathbf{M_1},\mathbf{Y}),\tag{4}$$
$$l_i=\mathcal{L}_{BCE}(\mathcal{B}(\mathbf{M_i}),\mathbf{Y})\quad(i=2,3,4,5),\tag{5}$$
$$l_{\sum}=\mathcal{L}_{BCE}(\mathbf{M_{\sum}},\mathbf{Y}),\tag{6}$$
$$L=\lambda_1 l_1+\lambda_2 l_2+\lambda_3 l_3+\lambda_4 l_4+\lambda_5 l_5+\lambda_{\sum} l_{\sum},\tag{7}$$

in which $\lambda_i$ $(i=1,2,3,4,5)$ and $\lambda_{\sum}$ represent the weights of the corresponding loss terms. In this work, $\lambda_i$ and $\lambda_{\sum}$ are all set to 1 empirically.
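For illustration, the snippet below sketches how the deeply supervised loss of Eqs. (2)-(7) could be assembled from the five decoder outputs. The $1\times 1$ prediction heads, channel widths, and toy data are assumptions made only to keep the example self-contained.

```python
# Hedged sketch of the multi-scale deeply supervised loss in Eqs. (2)-(7).
import torch
import torch.nn as nn
import torch.nn.functional as F

def deep_supervision_loss(decoder_feats, heads, fuse_head, target):
    """decoder_feats: [F1..F5], finest first; target: (B, 1, H, W) binary ground truth."""
    bce = nn.BCELoss()
    H, W = target.shape[-2:]
    maps, losses = [], []
    for i, (feat, head) in enumerate(zip(decoder_feats, heads)):
        m = torch.sigmoid(head(feat))                                  # Eq. (2): M_i
        if i > 0:                                                      # upsample M_2..M_5
            m = F.interpolate(m, size=(H, W), mode="bilinear", align_corners=False)
        maps.append(m)
        losses.append(bce(m, target))                                  # Eqs. (4)-(5)
    m_sum = torch.sigmoid(fuse_head(torch.cat(maps, dim=1)))           # Eq. (3)
    losses.append(bce(m_sum, target))                                  # Eq. (6)
    return sum(losses)                                                 # Eq. (7), all lambdas = 1

# Toy usage with random tensors standing in for the decoder outputs F1..F5.
chs = [32, 32, 64, 128, 256]
feats = [torch.randn(2, c, 256 // 2 ** i, 256 // 2 ** i) for i, c in enumerate(chs)]
heads = nn.ModuleList(nn.Conv2d(c, 1, 1) for c in chs)
fuse_head = nn.Conv2d(5, 1, 1)
gt = (torch.rand(2, 1, 256, 256) > 0.99).float()
print(deep_supervision_loss(feats, heads, fuse_head, gt).item())
```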

[Figure 3]

III-B Spatial-channel Cross Transformer Block

Recently, successful architectures such as MLP-Mixer[53] and PoolFormer[54] have both considered the interaction between spatial and channel information in constructing contextual information. However, the vanilla CCT focuses excessively on establishing channel information and overlooks the crucial role of spatial information in neighborhood modeling. To address this, we develop a spatial-channel cross transformer block (SCTB) as a spatial-channel blending unit to mix full-level encoded features. As shown in Fig.3, given the $i$-th level features $\mathbf{I_i}\in\mathbb{R}^{C_i\times h\times w}$ $(i=1,2,3,4)$, in which $h=\frac{H}{16}$ and $w=\frac{W}{16}$, the procedure of SCTB can be defined as

$$\mathbf{J_{\sum}}=\mathrm{LN}([\mathbf{I_1},\mathbf{I_2},\mathbf{I_3},\mathbf{I_4}]),\tag{8}$$
$$\mathbf{J_i}=\mathrm{LN}(\mathbf{I_i}),\tag{9}$$
$$\mathbf{P_i}=\mathrm{SSCA}(\mathbf{J_1},\mathbf{J_2},\mathbf{J_3},\mathbf{J_4},\mathbf{J_{\sum}})+\mathbf{I_i},\tag{10}$$
$$\mathbf{O_i}=\mathrm{CFN}_i(\mathbf{P_i}),\tag{11}$$

where LN denotes layer normalization, $\mathbf{J_i}\in\mathbb{R}^{C_i\times h\times w}$ $(i=1,2,3,4)$ and the concatenated tokens $\mathbf{J_{\sum}}\in\mathbb{R}^{C_{\sum}\times h\times w}$ are the five inputs of the SSCA, $\mathbf{P_i}$ represents the outputs of the SSCA, and $\mathbf{O_i}$ stands for the outputs of the SCTB. The spatial-embedded single-head channel-cross attention (SSCA) and the complementary feed-forward network (CFN) are described separately below.
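The wiring of Eqs. (8)-(11) can be summarized with the short sketch below, where SSCA and CFN are passed in as opaque modules (their internals are sketched in the following subsections). GroupNorm with a single group is used here merely as a convenient normalization stand-in for LN on $(B, C, H, W)$ tensors; the interface is our assumption, not the authors' exact code.

```python
# Hedged sketch of the SCTB wiring in Eqs. (8)-(11).
import torch
import torch.nn as nn

class SCTBSketch(nn.Module):
    def __init__(self, channels, ssca, cfns):
        super().__init__()
        self.norms = nn.ModuleList(nn.GroupNorm(1, c) for c in channels)   # stand-in for LN
        self.norm_sum = nn.GroupNorm(1, sum(channels))
        self.ssca = ssca                                                   # cross-level attention
        self.cfns = nn.ModuleList(cfns)                                    # one CFN per level

    def forward(self, feats):                                  # feats = [I1, I2, I3, I4], same h x w
        j = [n(x) for n, x in zip(self.norms, feats)]          # Eq. (9): J_i
        j_sum = self.norm_sum(torch.cat(feats, dim=1))         # Eq. (8): J_sum
        p = [a + x for a, x in zip(self.ssca(j, j_sum), feats)]  # Eq. (10): SSCA + residual
        return [cfn(x) for cfn, x in zip(self.cfns, p)]        # Eq. (11): per-level CFN
```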

TABLE I (for each dataset, the columns list mIoU / nIoU / F-measure / Pd / Fa):

Method | NUAA-SIRST[14] | NUDT-SIRST[15] | IRSTD-1K[40]
Top-Hat[5] | 7.143 / 18.27 / 14.63 / 79.84 / 1012 | 20.72 / 28.98 / 33.52 / 78.41 / 166.7 | 10.06 / 7.438 / 16.02 / 75.11 / 1432
Max-Median[55] | 4.172 / 12.31 / 10.67 / 69.20 / 55.33 | 4.197 / 3.674 / 7.635 / 58.41 / 36.89 | 6.998 / 3.051 / 8.152 / 65.21 / 59.73
WSLCM[56] | 1.158 / 6.835 / 4.812 / 77.95 / 5446 | 2.283 / 3.865 / 5.987 / 56.82 / 1309 | 3.452 / 0.678 / 2.125 / 72.44 / 6619
TTLCM[57] | 1.029 / 4.099 / 4.995 / 79.09 / 5899 | 2.176 / 4.315 / 7.225 / 62.01 / 1608 | 3.311 / 0.784 / 2.186 / 77.39 / 6738
IPI[9] | 25.67 / 50.17 / 43.65 / 84.63 / 16.67 | 17.76 / 15.42 / 26.94 / 74.49 / 41.23 | 27.92 / 20.46 / 35.68 / 81.37 / 16.18
PSTNN[58] | 30.30 / 33.67 / 39.16 / 72.80 / 48.99 | 14.85 / 23.57 / 35.63 / 66.13 / 44.17 | 24.57 / 17.93 / 37.18 / 71.99 / 35.26
MSLSTIPT[59] | 10.30 / 15.93 / 18.83 / 82.13 / 1131 | 8.342 / 10.06 / 18.26 / 47.40 / 888.1 | 11.43 / 5.932 / 12.23 / 79.03 / 1524
ACM[14] | 68.93 / 69.18 / 80.87 / 91.63 / 15.23 | 61.12 / 64.40 / 75.87 / 93.12 / 55.22 | 59.23 / 57.03 / 74.38 / 93.27 / 65.28
ALCNet[18] | 70.83 / 71.05 / 82.92 / 94.30 / 36.15 | 64.74 / 67.20 / 78.59 / 94.18 / 34.61 | 60.60 / 57.14 / 75.47 / 92.98 / 58.80
RDIAN[39] | 68.72 / 75.39 / 81.46 / 93.54 / 43.29 | 76.28 / 79.14 / 86.54 / 95.77 / 34.56 | 56.45 / 59.72 / 72.14 / 88.55 / 26.63
ISTDU[22] | 75.52 / 79.73 / 86.06 / 96.58 / 14.54 | 89.55 / 90.48 / 94.49 / 97.67 / 13.44 | 66.36 / 63.86 / 79.58 / 93.60 / 53.10
MTU-Net[17] | 74.78 / 78.27 / 85.37 / 93.54 / 22.36 | 74.85 / 77.54 / 84.47 / 93.97 / 46.95 | 66.11 / 63.24 / 79.26 / 93.27 / 36.80
IAANet[60] | 74.22 / 75.58 / 85.02 / 93.53 / 22.70 | 90.22 / 92.04 / 94.88 / 97.26 / 8.32 | 66.25 / 65.77 / 78.34 / 93.15 / 14.20
AGPCNet[19] | 75.69 / 76.60 / 85.26 / 96.48 / 14.99 | 88.87 / 90.64 / 93.88 / 97.20 / 10.02 | 66.29 / 65.23 / 79.58 / 92.83 / 13.12
DNA-Net[15] | 75.80 / 79.20 / 86.24 / 95.82 / 8.78 | 88.19 / 88.58 / 93.73 / 98.83 / 9.00 | 65.90 / 66.38 / 79.44 / 90.91 / 12.24
UIU-Net[16] | 76.91 / 79.99 / 86.95 / 95.82 / 14.13 | 93.48 / 93.89 / 96.63 / 98.31 / 7.79 | 66.15 / 66.66 / 79.63 / 93.98 / 22.07
SCTransNet | 77.50 / 81.08 / 87.32 / 96.95 / 13.92 | 94.09 / 94.38 / 96.95 / 98.62 / 4.29 | 68.03 / 68.15 / 80.96 / 93.27 / 10.74

III-B1 Spatial-embedded single-head channel-cross attention

In Fig.3(a), given the five input tokens $\mathbf{J_i}$ and $\mathbf{J_{\sum}}$ on which LN is performed, the launching point of SSCA is to calculate the local-spatial channel similarity between single-level features and full-level concatenated features to establish global semantics. Therefore, our SSCA employs the four input tokens $\mathbf{J_i}$ as queries and the concatenated token $\mathbf{J_{\sum}}$ as the key and value. This is accomplished by utilizing $1\times 1$ convolutions to consolidate pixel-wise cross-channel context and then applying $3\times 3$ depth-wise convolutions to capture local spatial context. Mathematically,

$$\mathbf{Q_i}=W^{Q}_{di}W^{Q}_{pi}\mathbf{J_i},\quad \mathbf{K}=W^{K}_{d}W^{K}_{p}\mathbf{J_{\sum}},\quad \mathbf{V}=W^{V}_{d}W^{V}_{p}\mathbf{J_{\sum}},\tag{12}$$

where $W^{(\cdot)}_{pi}\in\mathbb{R}^{C_i\times 1\times 1}$ and $W^{(\cdot)}_{p}\in\mathbb{R}^{C_{\sum}\times 1\times 1}$ are $1\times 1$ point-wise convolutions, and $W^{(\cdot)}_{di}\in\mathbb{R}^{C_i\times 3\times 3}$ and $W^{(\cdot)}_{d}\in\mathbb{R}^{C_{\sum}\times 3\times 3}$ are $3\times 3$ depth-wise convolutions. Next, we reshape $\mathbf{Q_i}\in\mathbb{R}^{C_i\times h\times w}$, $\mathbf{K}\in\mathbb{R}^{C_{\sum}\times h\times w}$, and $\mathbf{V}\in\mathbb{R}^{C_{\sum}\times h\times w}$ to $\mathbb{R}^{C_i\times hw}$, $\mathbb{R}^{C_{\sum}\times hw}$, and $\mathbb{R}^{C_{\sum}\times hw}$, respectively. Our SSCA process is defined as

$$\mathbf{CA_i}=W_{pi}\,\mathrm{CrossAtt}(\mathbf{Q_i},\mathbf{K},\mathbf{V}),\tag{13}$$
$$\mathrm{CrossAtt}(\mathbf{Q_i},\mathbf{K},\mathbf{V})=\mathbf{A_i}\mathbf{V}=\mathrm{Softmax}\left\{\mathcal{I}\left(\frac{\mathbf{Q_i}\mathbf{K}^{T}}{\lambda}\right)\right\}\mathbf{V},\tag{14}$$

where $\mathbf{CA_i}\in\mathbb{R}^{C_i\times h\times w}$ is the output of SSCA, $\mathbf{A_i}\in\mathbb{R}^{C_i\times C_{\sum}}$ represents the level-specific covariance-based attention map, $\mathcal{I}$ denotes the instance normalization operation[61], and $\lambda$ is an optional temperature factor defined by $\lambda=\sqrt{C_{\sum}}$. Notably, we differ from common channel-cross attention in two further aspects: our patches are without positional encoding, and we use a single head to learn the attention matrix. The efficacy of these strategies is compared in detail in the ablation study (Sec.IV-E2).
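A hedged PyTorch sketch of SSCA (Eqs. (12)-(14)) is given below: each level's locally embedded tokens act as queries against the concatenated full-level key/value tokens, and the $C_i\times C_{\sum}$ attention matrix is computed over channels rather than spatial positions, so the cost scales with the channel counts instead of $(hw)^2$. Module names and the exact normalization placement are our assumptions.

```python
# Hedged sketch of spatial-embedded single-head channel-cross attention (Eqs. 12-14).
import torch
import torch.nn as nn

def pw_dw(c):
    # 1x1 point-wise conv followed by 3x3 depth-wise conv (local spatial embedding).
    return nn.Sequential(nn.Conv2d(c, c, 1), nn.Conv2d(c, c, 3, padding=1, groups=c))

class SSCASketch(nn.Module):
    def __init__(self, channels=(32, 64, 128, 256)):
        super().__init__()
        c_sum = sum(channels)
        self.to_q = nn.ModuleList(pw_dw(c) for c in channels)
        self.to_k, self.to_v = pw_dw(c_sum), pw_dw(c_sum)
        self.inorm = nn.InstanceNorm2d(1)                                 # I(.) in Eq. (14)
        self.proj = nn.ModuleList(nn.Conv2d(c, c, 1) for c in channels)   # W_pi in Eq. (13)
        self.scale = c_sum ** 0.5                                         # temperature lambda

    def forward(self, j_levels, j_sum):
        b, _, h, w = j_sum.shape
        k = self.to_k(j_sum).flatten(2)                           # (B, C_sum, hw)
        v = self.to_v(j_sum).flatten(2)                           # (B, C_sum, hw)
        outs = []
        for jl, to_q, proj in zip(j_levels, self.to_q, self.proj):
            q = to_q(jl).flatten(2)                               # (B, C_i, hw)
            attn = q @ k.transpose(1, 2) / self.scale             # (B, C_i, C_sum) channel map
            attn = self.inorm(attn.unsqueeze(1)).squeeze(1).softmax(dim=-1)
            ca = (attn @ v).view(b, -1, h, w)                     # (B, C_i, h, w)
            outs.append(proj(ca))                                 # single head, no pos. encoding
        return outs

j_levels = [torch.randn(1, c, 16, 16) for c in (32, 64, 128, 256)]
j_sum = torch.cat(j_levels, dim=1)
print([o.shape for o in SSCASketch()(j_levels, j_sum)])
```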

[Figure 4]

III-B2 Complementary Feed-forward Network

As shown in Fig.4(a), previous studies[41],[32],[30] always incorporate single-scale depth-wise convolutions into the standard feed-forward network to enhance local focus. More recently, the state-of-the-art MSFN[31] incorporates two paths with depth-wise convolutions of different kernel sizes to enhance the multi-scale representation. However, the above approaches are limited to a local-spatial global-channel paradigm of feature representation. In fact, global-spatial and local-channel information (Fig.4(b)) is equally important[62]. Hence, we design a CFN, which combines the advantages of both feature representations.

In Fig.3(b), given an input tensor $\mathbf{X_i}\in\mathbb{R}^{C_i\times h\times w}$, CFN first models multi-scale LSGC information. Specifically, after layer normalization, CFN utilizes a $1\times 1$ convolution to increase the channel dimension in the ratio of $\eta$ and splits the feature map equally into two branches. Subsequently, $3\times 3$ and $5\times 5$ depth-wise convolutions are employed to enhance the local spatial information. This is followed by channel-concatenating the multi-scale features and restoring them to their original dimensions. The above process can be defined as

$$\mathbf{X_{3\times 3}},\mathbf{X_{5\times 5}}=\mathrm{Chunk}(f^{c}_{1\times 1}(\mathrm{LN}(\mathbf{X_i}))),\tag{15}$$
$$\mathbf{X_{sc}}=f^{c}_{1\times 1}[\delta(f^{dwc}_{3\times 3}(\mathbf{X_{3\times 3}})),\delta(f^{dwc}_{5\times 5}(\mathbf{X_{5\times 5}}))],\tag{16}$$

where $f^{c}_{1\times 1}$ denotes $1\times 1$ convolution, and $f^{dwc}_{3\times 3}$ and $f^{dwc}_{5\times 5}$ represent $3\times 3$ and $5\times 5$ depth-wise convolutions. Here, Chunk($\cdot$) denotes dividing the feature vector into two equal parts along the channel dimension.

TABLE II (average metrics over the three datasets):

Model | Params (M) | Flops (G) | IoU | nIoU | F-measure
DNA-Net[15] | 4.697 | 14.26 | 80.23 | 82.59 | 88.60
UIU-Net[16] | 50.54 | 54.42 | 82.40 | 86.12 | 90.35
SCTransNet | 11.19 | 20.24 | 83.43 | 86.86 | 90.96

Next, CFN constructs the GSLC information. Because the resolution of the input images varies at the test stage, we first use global average pooling (GAP) over the spatial dimensions to approximate the total spatial information of the features, instead of using computationally intensive spatial MLPs to precisely compute the global spatial information[63]. We then employ a one-dimensional convolution with a kernel size of 3 to capture the local channel information of the spatially compressed feature as follows

$$\mathbf{X_o}=f^{1D}_{3}(\mathrm{GAP}_{2D}(\mathbf{X_{sc}}))\odot\mathbf{X_{sc}}+\mathbf{X_i},\tag{17}$$

where $\odot$ is the broadcasted Hadamard product. By incorporating complementary spatial and channel information, CFN enriches the representation of features in terms of the target's localization and the background's global continuity.
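The two stages of the CFN can be sketched as follows: a multi-scale LSGC branch (Eqs. (15)-(16)) followed by the GSLC interaction of Eq. (17). The activation $\delta$ is assumed to be GELU here, the layer norm is approximated with single-group GroupNorm, and Eq. (17) is followed literally (an ECA-style sigmoid gate could equally be used); these choices are illustrative, not confirmed details of the released code.

```python
# Hedged sketch of the complementary feed-forward network (Eqs. 15-17).
import torch
import torch.nn as nn

class CFNSketch(nn.Module):
    def __init__(self, c, eta=2.66):
        super().__init__()
        hidden = int(c * eta)
        hidden += hidden % 2                                     # keep the width splittable
        half = hidden // 2
        self.norm = nn.GroupNorm(1, c)                           # stand-in for LN on (B,C,H,W)
        self.expand = nn.Conv2d(c, hidden, 1)                    # f^c_1x1 (channel expansion)
        self.dw3 = nn.Conv2d(half, half, 3, padding=1, groups=half)
        self.dw5 = nn.Conv2d(half, half, 5, padding=2, groups=half)
        self.act = nn.GELU()                                     # delta in Eq. (16), assumed GELU
        self.restore = nn.Conv2d(hidden, c, 1)                   # back to C_i channels
        self.conv1d = nn.Conv1d(1, 1, kernel_size=3, padding=1)  # f^{1D}_3 in Eq. (17)

    def forward(self, x):
        x3, x5 = self.expand(self.norm(x)).chunk(2, dim=1)       # Eq. (15): LSGC split
        xsc = self.restore(torch.cat([self.act(self.dw3(x3)),
                                      self.act(self.dw5(x5))], dim=1))   # Eq. (16)
        w = xsc.mean(dim=(2, 3))                                 # GAP_2D -> (B, C): GSLC stage
        w = self.conv1d(w.unsqueeze(1)).squeeze(1)               # local cross-channel interaction
        return w.unsqueeze(-1).unsqueeze(-1) * xsc + x           # Eq. (17), broadcasted product

print(CFNSketch(64)(torch.randn(2, 64, 16, 16)).shape)
```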

IV Experiments and Analysis

IV-A Evaluation metrics

We compare the proposed SCTransNet with the state-of-the-art (SOTA) methods using several standard metrics.

1) Intersection over Union (IoU): IoU is a pixel-level evaluation metric defined as

$$IoU=\frac{A_i}{A_u}=\frac{\sum_{i=1}^{N}TP[i]}{\sum_{i=1}^{N}(T[i]+P[i]-TP[i])},\tag{18}$$

where $A_i$ and $A_u$ denote the size of the intersection region and the union region, respectively, $N$ is the number of samples, $TP[\cdot]$ denotes the number of true positive pixels, and $T[\cdot]$ and $P[\cdot]$ represent the number of ground-truth and predicted positive pixels, respectively.

2) Normalized Intersection over Union (nIoU): nIoU is the normalized version of IoU[14], given as

$$nIoU=\frac{1}{N}\sum_{i=1}^{N}\frac{TP[i]}{T[i]+P[i]-TP[i]}.\tag{19}$$

3) F-measure (F): It evaluates the miss detection and false alarms at pixel-level, given as

$$F=\frac{2\times Prec\times Rec}{Prec+Rec},\tag{20}$$

where $Prec$ and $Rec$ denote the precision rate and recall rate, respectively.
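As a concrete reference, the snippet below sketches how IoU, nIoU, and the F-measure of Eqs. (18)-(20) could be computed from binarized predictions; precision and recall are aggregated over the whole test set here, which is one common convention and may differ in minor details from the authors' evaluation code.

```python
# Hedged sketch of the pixel-level metrics in Eqs. (18)-(20), computed from
# binarized predictions and ground-truth masks (numpy arrays in {0, 1}).
import numpy as np

def pixel_metrics(preds, gts, eps=1e-10):
    """preds, gts: lists of (H, W) binary arrays; returns IoU, nIoU, F-measure."""
    tp = np.array([np.logical_and(p, g).sum() for p, g in zip(preds, gts)], dtype=float)
    t = np.array([g.sum() for g in gts], dtype=float)          # ground-truth positive pixels
    p = np.array([pr.sum() for pr in preds], dtype=float)      # predicted positive pixels
    union = t + p - tp
    iou = tp.sum() / (union.sum() + eps)                       # Eq. (18): dataset-level IoU
    niou = float(np.mean(tp / (union + eps)))                  # Eq. (19): per-image, then averaged
    prec, rec = tp.sum() / (p.sum() + eps), tp.sum() / (t.sum() + eps)
    f = 2 * prec * rec / (prec + rec + eps)                    # Eq. (20)
    return iou, niou, f

preds = [(np.random.rand(256, 256) > 0.99) for _ in range(4)]
gts = [(np.random.rand(256, 256) > 0.99) for _ in range(4)]
print(pixel_metrics(preds, gts))
```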

[Figure 5]
TABLE III (AUC values of the ROC curves in Fig. 5):

Dataset | Index | ACM | ALCNet | RDIAN | ISTDU | MTU-Net | IAANet | AGPCNet | DNA-Net | UIU-Net | SCTransNet
NUAA-SIRST[14] | AUC (Fa=0.5) | 0.7223 | 0.8618 | 0.5461 | 0.7515 | 0.7457 | 0.8081 | 0.6953 | 0.6582 | 0.4854 | 0.9539
NUAA-SIRST[14] | AUC (Fa=1) | 0.8180 | 0.9025 | 0.7321 | 0.8579 | 0.8437 | 0.8614 | 0.8262 | 0.8098 | 0.7197 | 0.9589
NUDT-SIRST[15] | AUC (Fa=0.5) | 0.4392 | 0.6321 | 0.4630 | 0.8635 | 0.4640 | 0.7569 | 0.5038 | 0.6300 | 0.8275 | 0.9853
NUDT-SIRST[15] | AUC (Fa=1) | 0.5865 | 0.7716 | 0.6695 | 0.9211 | 0.6064 | 0.8463 | 0.7306 | 0.8072 | 0.9013 | 0.9863
IRSTD-1K[40] | AUC (Fa=0.5) | 0.5374 | 0.6606 | 0.4545 | 0.6014 | 0.5018 | 0.7862 | 0.6211 | 0.6162 | 0.4749 | 0.9107
IRSTD-1K[40] | AUC (Fa=1) | 0.7366 | 0.8006 | 0.6480 | 0.7687 | 0.7198 | 0.8456 | 0.7752 | 0.7684 | 0.7099 | 0.9200

4) Probability of Detection ($P_d$): $P_d$ is the ratio of the number of correctly predicted targets $N_{pred}$ to the number of all targets $N_{all}$, given as

$$P_d=\frac{N_{pred}}{N_{all}}.\tag{21}$$

Following [15], if the deviation of the target centroid is less than 3 pixels, we consider the target correctly predicted.

5) False-Alarm Rate ($F_a$): $F_a$ is the ratio of the number of falsely predicted target pixels $N_{false}$ to all pixels in the image $P_{all}$, given as

$$F_a=\frac{N_{false}}{P_{all}}.\tag{22}$$

In addition to the fixed-threshold evaluation, we also utilize Receiver Operating Characteristic (ROC) curves to comprehensively evaluate the models. The ROC curve describes the changing trend of $P_d$ under varying $F_a$.
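The target-level metrics of Eqs. (21)-(22) can be sketched as below, using connected-component labeling and the centroid-matching rule of [15]; the exact bookkeeping of false-alarm pixels varies between open-source implementations, so this is an illustrative convention rather than the authors' exact script.

```python
# Hedged sketch of Pd and Fa (Eqs. 21-22): a predicted connected component counts
# as a hit if its centroid lies within 3 pixels of a ground-truth target centroid.
import numpy as np
from scipy import ndimage

def pd_fa(preds, gts, dist_thresh=3.0):
    n_pred_hit, n_all, n_false_px, n_px = 0, 0, 0, 0
    for p, g in zip(preds, gts):
        g_lab, g_num = ndimage.label(g)
        p_lab, p_num = ndimage.label(p)
        g_cents = ndimage.center_of_mass(g, g_lab, range(1, g_num + 1))
        p_cents = ndimage.center_of_mass(p, p_lab, range(1, p_num + 1))
        hit_px, matched = 0, set()
        for pi, pc in enumerate(p_cents, start=1):
            d = [np.hypot(pc[0] - gc[0], pc[1] - gc[1]) for gc in g_cents]
            if d and min(d) < dist_thresh and int(np.argmin(d)) not in matched:
                matched.add(int(np.argmin(d)))
                hit_px += (p_lab == pi).sum()        # pixels of correctly detected targets
        n_pred_hit += len(matched)                   # N_pred
        n_all += g_num                               # N_all
        n_false_px += p.sum() - hit_px               # N_false: predicted pixels off matched targets
        n_px += p.size                               # P_all
    return n_pred_hit / max(n_all, 1), n_false_px / max(n_px, 1)

preds = [(np.random.rand(256, 256) > 0.999) for _ in range(2)]
gts = [(np.random.rand(256, 256) > 0.999) for _ in range(2)]
print(pd_fa(preds, gts))
```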

[Figure 6]

IV-B Experiment settings

Datasets: In our experiments, we utilize three public datasets, namely NUAA-SIRST[14], NUDT-SIRST[15], and IRSTD-1K[40], which consist of 427, 1327, and 1000 images, respectively. We adopt the method used by [15] to partition the training and test sets of NUAA-SIRST and NUDT-SIRST, and [40] for splitting IRSTD-1K. Hence, all splits are standard.

Implementation Details: We employ a U-Net with four RBs as our detection backbone[17]; the number of downsampling layers is 4, and the basic width is set to 32. The kernel size and stride size $P$ for patch embedding is 16, the number of SCTBs is 4, and the channel expansion factor $\eta$ in CFN is 2.66. Our SCTransNet does not use any pre-trained weights for training; every image undergoes normalization and random cropping into 256×256 patches. To avoid over-fitting, we augment the training data through random flipping and rotation. We initialize the weights and biases of our model using the Kaiming initialization method[64]. The model is trained using the BCE loss function and optimized by the Adam optimizer with an initial learning rate of 0.001, and the learning rate is gradually decreased to $1\times 10^{-5}$ using the cosine annealing strategy. The batch size and number of epochs are set to 16 and 1000, respectively. Following[14],[18],[15], the fixed threshold to segment the saliency map is set to 0.5. The proposed SCTransNet is implemented with PyTorch on a single Nvidia GeForce 3090 GPU, an Intel Core i7-12700KF CPU, and 32 GB of memory. The training process took approximately 24 hours.
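For reproducibility, the snippet below sketches the optimization setup described above (Adam, initial learning rate $10^{-3}$, cosine annealing to $10^{-5}$, BCE loss, batch size 16, 1000 epochs, and a fixed 0.5 segmentation threshold at test time). The stand-in model and random data are placeholders; this is not the released training script.

```python
# Hedged sketch of the training configuration; the model and data are placeholders.
import torch
import torch.nn as nn

def build_optim(model, epochs=1000, lr=1e-3, eta_min=1e-5):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs, eta_min=eta_min)
    return optimizer, scheduler

model = nn.Sequential(nn.Conv2d(1, 1, 3, padding=1), nn.Sigmoid())   # stand-in for SCTransNet
optimizer, scheduler = build_optim(model)
criterion = nn.BCELoss()

for epoch in range(2):                                   # 1000 epochs in the paper
    imgs = torch.rand(16, 1, 256, 256)                   # batch of normalized 256x256 crops
    masks = (torch.rand(16, 1, 256, 256) > 0.99).float()
    optimizer.zero_grad()
    loss = criterion(model(imgs), masks)
    loss.backward()
    optimizer.step()
    scheduler.step()

# At test time, the saliency map is binarized with the fixed threshold of 0.5.
pred = (model(torch.rand(1, 1, 256, 256)) > 0.5).float()
```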

Baselines: To evaluate the performance of our method, we compare SCTransNet with SOTA IRSTD methods, specifically seven well-established traditional methods (Top-Hat[5], Max-Median[55], WSLCM[56], TLLCM[57], IPI[9], PSTNN[58], and MSLSTIPT[59]) and nine learning-based methods (ACM[14], ALCNet[18], RDIAN[39], ISTDU[22], IAANet[60], AGPCNet[19], DNA-Net[15], UIU-Net[16], and MTU-Net[17]), on the NUAA-SIRST, NUDT-SIRST, and IRSTD-1K datasets. To guarantee an equitable comparison, we retrained all the learning-based methods on the same training datasets as our SCTransNet and, following the original papers, adopted their fixed thresholds. Open-source implementations of most techniques can be found at https://github.com/XinyiYing/BasicIRSTD and https://github.com/xdFai/SCTransNet.

[Figure 7]

IV-C Quantitative Results

Quantitative results are shown in Table I. In general, the learning-based methods significantly outperform the conventional algorithms in terms of both target detection accuracy and contour prediction of targets. Meanwhile, our method outperforms all other algorithms. In the three metrics of IoU, nIoU, and F-measure, SCTransNet stands considerably ahead on all three public datasets. This indicates that our algorithm possesses a strong ability to retain target contours and can discern pixel-level information differences between the target and the background. We also note that even though SCTransNet does not obtain the optimal $P_d$ and $F_a$ everywhere, e.g., DNA-Net's $P_d$ is higher than ours by only 0.2 on NUDT-SIRST, our false-alarm rate is less than half of DNA-Net's. This demonstrates that our algorithm achieves a superior balance between false alarms and detection accuracy, as indicated by the remarkably high composite metric, the F-measure. Next, we comprehensively compare the present algorithm with the most competitive deep learning methods, DNA-Net and UIU-Net. Table II gives the average metrics of the different algorithms on the three datasets, and we can observe that SCTransNet achieves the highest performance with an acceptable number of parameters, outperforming the powerful UIU-Net.

Fig.5 displays the ROC curves of various competitive learning-based algorithms. It is evident that the ROC curve of SCTransNet outperforms those of all other algorithms. For instance, by appropriately selecting a segmentation threshold, SCTransNet achieves the highest detection accuracy while maintaining the lowest false alarms on the NUAA-SIRST and NUDT-SIRST datasets.

Table III presents the Area Under the Curve (AUC) of Fig.5 at two different thresholds: $F_a=0.5\times 10^{-6}$ and $F_a=1\times 10^{-6}$. It can be seen that our method consistently achieves optimal detection performance across various false-alarm rates. Meanwhile, under the same continuous threshold change, the curve of our method is more continuous and rounded compared to those of other methods. This observation suggests that SCTransNet showcases exceptional tunable adaptability.

IV-D Visual Results

The qualitative results of seven representative algorithms on the NUAA-SIRST, NUDT-SIRST, and IRSTD-1K datasets are given in Fig. 6 and Fig. 7. Conventional algorithms such as Top-Hat and TLLCM frequently yield a high number of false alarms and missed detections. Furthermore, even when the target is detected, its contour is often unclear, hindering further accurate identification of the target type. Among the learning-based algorithms, our method achieves precise target detection and effective contour segmentation. As illustrated in Fig. 6(2), our method successfully distinguishes two closely located targets, whereas the other deep learning methods tend to merge them into a single target. This suggests that our method discriminates each element in the image accurately. In Fig. 6(4), only our method accurately separates the shape of the unmanned aerial vehicle (UAV) from the mountain range. This is because our method not only learns the target's features but also constructs high-level semantic information about the background, thereby accurately capturing its overall continuity. In Fig. 6(6), all methods except ours and DNA-Net produce false alarms on the stone in the grass, which can be attributed to their reliance on local contrast information and their lack of long-distance dependencies across the image.

TABLE IV: Ablation on the U-Net baseline with modules added incrementally
Configuration                      IoU     nIoU    F-measure
U-Net                              75.29   78.60   86.36
+ RBs                              77.07   80.13   87.05
+ RBs + DS                         77.73   80.78   87.47
+ RBs + DS + SSCA                  82.39   85.71   90.34
+ RBs + DS + SSCA + CFN            82.89   86.28   90.66
+ RBs + DS + SSCA + CFN + CCA      83.43   86.86   90.96
TABLE V: Ablation on the UCTransNet baseline with modules added incrementally
Configuration                            IoU     nIoU    F-measure
UCTransNet                               78.78   81.56   87.80
+ RBs                                    79.95   82.97   88.45
+ RBs + DS                               81.47   83.89   88.92
+ RBs + DS + SKs                         82.03   84.98   89.54
+ RBs + DS + SKs + SCTB (replacing CCT)  83.43   86.86   90.66

IV-E Ablation Study

In this section, we first employ two baselines to demonstrate the effectiveness of SCTransNet.

  • U-Net: We incrementally incorporate residual blocks (RBs), deep supervision (DS), SSCA, CFN, and CCA into the baseline U-Net to validate the effectiveness of these modules for infrared small target detection. The results are presented in Table IV. We observe that the algorithm's performance improves consistently as the modules are added. In particular, the SSCA module significantly enhances the IoU, nIoU, and F-measure of the algorithm by 4.66%, 4.93%, and 2.87%, respectively, demonstrating the effectiveness of full-level information modeling for IR small targets.

  • UCTransNet: We incrementally incorporate the RBs, DS, and skip connections (SKs), and replace the CCT in the baseline UCTransNet with the proposed SCTB to validate the effectiveness of these modules. As shown in Table V, these modules consistently enhance the algorithm's performance. In particular, the proposed SCTB improves the IoU, nIoU, and F-measure of the algorithm by 1.40%, 1.88%, and 1.12%, respectively, compared to the primitive CCT, demonstrating that SCTB enhances the semantic difference between IR small targets and backgrounds more effectively than CCT.

Next, we discuss the proposed SCTB, SSCA, and CFN in detail, and compare the adopted CCA block with other feature fusion approaches used in IRSTD.


IV-E1 The Spatial-channel Cross Transformer Block

In the proposed SCTransNet, a primary idea is to utilize the SCTB to mix and redistribute the output features of the full-stage encoders to predict contextual information about the small target and the background. Since the network is encoded four times, the number of queries (Q) is set to 4, and both keys (K) and values (V) are formed by mapping the concatenated features (J) of the complete 4-level features. In this section, we discuss different levels of Q and compositions of J to illustrate the importance of full-level feature modeling.

Fig. 8 presents the ablation results for the level of Q and the composition of J across the three datasets. Note that when changing Q, J is composed of full-level features, and likewise, Q is the full-level feature input when varying J. The experimental results for Q indicate significant differences in the information learned by the neural network from different feature levels. Queries with higher and more comprehensive levels (Q123, Q234, Q34) encompass rich image semantics and thus achieve higher performance. The model performs best when fed with full-level Q inputs (SCTransNet), validating our motivation. Similarly, the experimental results for J suggest that selecting complete channel information allows queries to capture more accurate key features, thereby improving the performance of IRSTD.
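
To make the roles of Q, K, V, and J concrete, here is a minimal PyTorch-style sketch of a single-head channel-cross attention in the spirit of SSCA: the query comes from one encoder level, the keys/values from the concatenated full-level feature J (assumed to be resampled to the query's resolution), and attention is computed across channels. The module name, projections, and normalization choices are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelCrossAttention(nn.Module):
    """Illustrative single-head channel-cross attention: the query is one encoder
    level, keys/values come from the concatenated full-level feature J, and the
    attention map is (C_q x C_j), independent of spatial size."""

    def __init__(self, c_q, c_j):
        super().__init__()
        # depth-wise convolutions inject local spatial context before attention
        self.dw_q = nn.Conv2d(c_q, c_q, 3, padding=1, groups=c_q)
        self.dw_kv = nn.Conv2d(c_j, c_j, 3, padding=1, groups=c_j)
        self.proj_out = nn.Conv2d(c_q, c_q, 1)

    def forward(self, x_q, x_j):
        # x_q: one encoder level (B, C_q, H, W)
        # x_j: concatenated full-level feature J, assumed resampled to (B, C_j, H, W)
        b, c_q, h, w = x_q.shape
        q = F.normalize(self.dw_q(x_q).flatten(2), dim=-1)    # (B, C_q, HW)
        kv = self.dw_kv(x_j).flatten(2)                       # (B, C_j, HW)
        k = F.normalize(kv, dim=-1)
        attn = torch.softmax(q @ k.transpose(1, 2), dim=-1)   # (B, C_q, C_j)
        out = attn @ kv                                        # (B, C_q, HW)
        return self.proj_out(out.view(b, c_q, h, w)) + x_q    # residual to the query level
```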

IV-E2 The Spatial-embedded Single-head Channel-cross Attention

To demonstrate the efficacy of the proposed SSCA, we compare it against multi-head cross-attention[27] (MCA, a typical full-level information interaction structure in UCTransNet for medical image segmentation) and three variants of our network structure: SSCA with positional encoding (SSCA w PE), SSCA with multi-head attention (SSCA w MH), and SSCA without spatial embedding (SSCA w/o SE).

  • SSCA w PE: We incorporate positional encoding during the patch embedding stage. To accommodate test images of different sizes, we employ interpolation to scale the position-coding matrix, ensuring the proper functioning of the algorithm.

  • SSCA w MH: We use a typical multi-head cross-attention mechanism to replace the single-head cross-attention mechanism in SSCA to verify the effectiveness of the single-head strategy for extracting limited features from the IR small targets.

  • SSCA w/o SE: To validate the effectiveness of local spatial information coding, we eliminate the depth-wise convolution in the QKV matrix generation process in SCTB.

TABLE VI: Comparison of MCA and SSCA variants (IoU/nIoU/F-measure)
Model          NUAA-SIRST           NUDT-SIRST           IRSTD-1K
MCA[27]        74.72/78.35/85.53    93.07/93.61/96.41    65.60/66.57/79.22
SSCA w PE      77.10/79.88/87.07    94.03/94.25/96.93    66.01/65.29/79.52
SSCA w MH      76.35/79.56/86.59    93.72/94.13/96.76    67.08/67.55/80.30
SSCA w/o SE    76.40/79.19/86.62    93.23/93.49/96.50    66.10/65.48/79.59
SSCA           77.50/81.08/87.32    94.09/94.38/96.95    68.03/68.15/80.96

As illustrated in Table VI, our SSCA achieves higher IoU, nIoU, and F-measure values than MCA and the variant SSCA w PE on all three datasets. This suggests that, through comprehensive information interaction, SCTransNet perceives the difference between small targets and complex backgrounds better than MCA. It also shows that absolute positional encoding is not suitable for the IRSTD task: scaling the position-embedding matrix for variable-size image inputs leads to inaccurate positional codes for small targets, which in turn degrades the prediction of target pixels.

Compared to our SSCA, SSCA w MH suffers decreases of 1.15%, 1.52%, and 0.73% in IoU, nIoU, and F-measure on the NUAA-SIRST dataset. This is because the multi-head strategy complicates the feature mapping space of IR small targets, which is unfavorable for extracting information from targets with limited features. Therefore, SCTransNet uses single-head attention for IRSTD.

Comparing SSCA and the variant SSCA w/o SE, we find that local spatial embedding significantly improves infrared small target detection on the three public datasets. The visualization maps in Fig. 9 further illustrate the effectiveness of this strategy: local spatial embedding captures both specific details of the target and potential spatial correlations in the background within the deep layers. As a result, this approach reduces missed detections and improves the confidence of the detection process.

IV-E3 The Complementary Feed-forward Network

Feed-forward networks (FFNs) strengthen the information correlation within features and introduce nonlinearity to enrich the feature representation. In this section, we compare the proposed CFN against five alternative FFN designs embedded in SCTransNet. As shown in Fig. 10, we use the typical FFN[41] (ViT, for image classification), LeFF[32] (Uformer, for image restoration) with local spatial embedding, GDFN[30] (Restormer, for image restoration) based on gated convolution, MSFN[31] (sparse transformer, for image deraining) based on multi-scale depth-wise convolution, and the variant CFN without the global-spatial and local-channel module (CFN w/o GSLC).
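
As a concrete reference point for the multi-scale idea, the sketch below shows a generic multi-scale depth-wise feed-forward block. It illustrates the strategy shared by MSFN and our CFN but deliberately omits the global-spatial/local-channel (GSLC) complement, so it should be read as an assumption-laden illustration rather than the exact CFN.

```python
import torch.nn as nn

class MultiScaleDWFeedForward(nn.Module):
    """Generic multi-scale depth-wise feed-forward block: parallel 3x3 and 5x5
    depth-wise branches aggregate spatial context at two scales inside the FFN."""

    def __init__(self, dim, expansion=2.66):
        super().__init__()
        hidden = int(dim * expansion)
        self.expand = nn.Conv2d(dim, hidden, 1)
        self.dw3 = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.dw5 = nn.Conv2d(hidden, hidden, 5, padding=2, groups=hidden)
        self.act = nn.GELU()
        self.project = nn.Conv2d(hidden, dim, 1)

    def forward(self, x):
        h = self.act(self.expand(x))
        h = self.act(self.dw3(h) + self.dw5(h))   # fuse two receptive-field scales
        return self.project(h) + x                # residual connection
```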

TABLE VII: Comparison of feed-forward designs (IoU/nIoU)
Model          Params(M)   FLOPs(G)   NUAA-SIRST     NUDT-SIRST
FFN[41]        11.0292     20.1474    76.87/80.08    93.58/93.85
LeFF[32]       11.1312     20.1944    76.49/80.21    93.92/94.07
GDFN[30]       10.1841     19.7210    75.48/79.32    93.40/93.64
MSFN[31]       11.7107     20.5026    77.35/79.89    93.88/94.24
CFN w/o GSLC   11.1905     20.2362    76.54/80.56    93.95/94.18
CFN            11.1905     20.2372    77.50/81.08    94.09/94.38

As shown in Table VII, LeFF exhibits a slight improvement over FFN, indicating that the local spatial information aggregation employed in feed-forward networks is effective for IRSTD. Gated convolution tends to treat IR small targets as noise and filter them out, which results in the low detection accuracy of GDFN. We also find that MSFN outperforms all methods except our CFN, illustrating that multi-scale structures interact with spatial information better than single-scale structures. Finally, we observe that the variant CFN w/o GSLC is inferior to MSFN. However, once the GSLC module is incorporated, our CFN achieves the best IoU and nIoU values on the NUAA-SIRST and NUDT-SIRST datasets, while the network's parameters and computational complexity remain almost unchanged. This demonstrates the validity and utility of the complementary mechanism proposed in this paper for the IRSTD task. As illustrated in Fig. 11, with the help of the complementary mechanism, the network enhances infrared small targets and suppresses clutter in building and jungle backgrounds more effectively, leading to improved detection accuracy.

TABLE VIII: Comparison of cross-layer feature fusion structures (IoU/nIoU)
Model          Params(M)   FLOPs(G)   NUAA-SIRST     NUDT-SIRST
C.ACM[14]      13.0627     30.9862    75.68/79.52    93.92/94.24
C.AGPC[19]     11.7581     22.9647    77.39/79.96    94.01/94.22
C.AFFPN[33]    11.7171     22.7291    76.12/79.33    93.53/93.69
SCTransNet     11.1905     20.2372    77.50/81.08    94.09/94.38

IV-E4 The Impact of CCA Block

As mentioned in Sec. II-A, cross-layer feature fusion facilitates the preservation of enhanced target information. In this section, we replace the CCA module in SCTransNet with three cross-layer feature fusion structures, namely ACM[14], AGPC[19], and AFFPN[33], taken from different IRSTD methods. These substitutions yield the variants C.ACM, C.AGPC, and C.AFFPN, respectively. As shown in Table VIII, our SCTransNet obtains the highest IoU and nIoU values on the NUAA-SIRST and NUDT-SIRST datasets with the lowest model parameters and computational complexity, illustrating the effectiveness of the adopted CCA.
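
For intuition, a rough sketch of a CCA-style fusion in the spirit of the channel-wise cross attention used in UCTransNet[27] is given below: the decoder feature produces a channel gate that re-weights the transformer-refined skip feature before the two streams are combined. The layer sizes and the residual form are our assumptions, not the exact module used in SCTransNet.

```python
import torch.nn as nn

class ChannelCrossFusion(nn.Module):
    """Rough sketch of a CCA-style cross-layer fusion: the decoder feature
    gates the channels of the refined skip feature before it is passed on."""

    def __init__(self, c_skip, c_dec):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(c_dec, c_skip),
            nn.ReLU(inplace=True),
            nn.Linear(c_skip, c_skip),
            nn.Sigmoid(),
        )

    def forward(self, skip, dec):
        # skip: refined skip feature (B, C_skip, H, W); dec: decoder feature (B, C_dec, H, W)
        w = self.gate(dec.mean(dim=(2, 3)))        # global pooling -> (B, C_skip) channel gate
        return skip * w[:, :, None, None] + skip   # gated skip kept residual
```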

IV-F Core Hyper-parameter Analysis

We analyze the depth of the RBs, the number of SCTBs, the channel expansion factor of the CFN, and the base width of the model to validate the hyper-parameter settings of SCTransNet. As shown in Table IX, the embedding depth of the RBs increases row by row. We observe that as the residual block depth increases, the number of parameters and FLOPs increase only slightly while the IRSTD performance improves significantly. This improvement can be attributed to the residual connections facilitating gradient propagation and mitigating feature degradation. Therefore, our SCTransNet uses four residual blocks for information encoding. Table X presents the hyper-parameter study on the number of SCTBs, the channel expansion factor of the CFN, and the base width of the model. As the number of SCTBs increases, the model's performance steadily improves, reaffirming the effectiveness of the SCTB module. While the performance with 6 SCTBs is slightly better than with 4 SCTBs, it incurs excessive computational complexity. When the channel expansion factor η = 2.66, the model attains the best performance. Additionally, setting the base width W = 48 results in a slight degradation compared with W = 32, which can be attributed to the excessive model parameters reducing the algorithm's generalization ability. Therefore, in our proposed SCTransNet, the number of SCTBs, the channel expansion factor of the CFN, and the base width of the model are set to 4, 2.66, and 32, respectively.

TABLE IX: Ablation on the embedding depth of the RBs (depth increases from top to bottom)
IoU     nIoU    F-measure   Params(M)   FLOPs(G)
82.29   85.77   90.26       11.1462     20.0212
82.33   85.89   90.31       11.1484     20.0967
82.49   86.11   90.40       11.1569     20.1680
82.95   86.27   90.68       11.1905     20.2372
83.43   86.86   90.96       11.1905     20.2372
TABLE X: Core hyper-parameter analysis
Hyper-param   IoU     nIoU    F-measure   Params(M)   FLOPs(G)
The number of SCTBs
N = 1         82.33   85.86   90.28       6.3295      17.7408
N = 2         82.53   86.05   90.43       7.9498      18.5729
N = 3         82.97   86.46   90.58       9.5702      19.4051
N = 4         83.43   86.86   90.96       11.1905     20.2372
N = 5         83.40   86.84   90.95       12.8108     21.0694
N = 6         83.45   86.86   90.97       14.4312     21.9015
The channel expansion factor of CFNs
η = 1.33      82.80   86.18   90.59       9.2539      19.2457
η = 2.00      82.75   86.32   90.56       10.2338     19.7474
η = 2.66      83.43   86.86   90.96       11.1905     20.2372
η = 3.00      83.24   86.69   90.84       11.6917     20.4938
η = 3.99      83.10   86.60   90.77       13.1307     21.2306
The basic width of the model
W = 8         77.52   80.55   87.33       0.7468      1.3321
W = 16        81.02   84.50   89.51       2.8609      5.1488
W = 32        83.43   86.86   90.96       11.1905     20.2372
W = 48        82.95   86.48   90.60       24.9940     45.2687

IV-G Robustness of SCTransNet

In an actual IR detection system, the non-uniform response of the focal plane array (FPA) can cause stripe noise in IR images[65], which challenges the noise immunity and generalization ability of IRSTD methods. Fig. 12 shows the results of various detection methods on IR images with real stripe noise. It is evident that the noise destroys the local neighborhood information of the targets. In Fig. 12(1), only our SCTransNet accurately detects both targets, while the other methods exhibit missed detections and false alarms. In Fig. 12(2), the striped image additionally contains a patch of blind pixels, which interferes with the semantic understanding of the building; as a result, ACM, RDIAN, and MTU-Net generate false alarms around the blind pixels. The ability to explicitly establish full-level contextual information about the target and the background is what makes our approach more robust.
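
To give a sense of how such degradations could be synthesized when probing robustness, the snippet below adds simple column-wise gain/offset stripes to a normalized single-channel image. Note that the experiments in Fig. 12 use real striped images, so this synthetic model is only an illustrative assumption.

```python
import numpy as np

def add_stripe_noise(img, std=0.03, rng=None):
    """Column-wise gain/offset stripes on a normalized ([0, 1]) single-channel image,
    mimicking FPA non-uniformity; a synthetic stand-in for real striped data."""
    rng = np.random.default_rng() if rng is None else rng
    gains = 1.0 + rng.normal(0.0, std, size=img.shape[1])   # one gain per column
    offsets = rng.normal(0.0, std, size=img.shape[1])       # one offset per column
    return np.clip(img * gains[None, :] + offsets[None, :], 0.0, 1.0)
```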


V CONCLUSION

In this paper, we presented a Spatial-channel Cross Transformer Network (SCTransNet) for IR small target detection. SCTransNet utilizes spatial-channel cross transformer blocks to establish associations between encoder and decoder features, predicting the contextual difference between targets and backgrounds in the deeper network layers. We introduced a spatial-embedded single-head channel-cross attention module, which establishes the semantic relevance between targets and backgrounds by interacting local spatial features with global full-level channel information. We also devised a complementary feed-forward network, which employs a multi-scale strategy and cross spatial-channel information interaction to enhance feature differences between the target and background, thereby facilitating effective mapping of IR images to the segmentation space. A comprehensive evaluation on three public datasets demonstrates the effectiveness and superiority of the proposed technique.

References

  • [1]Y.Sun, J.Yang, and W.An, “Infrared dim and small target detection via multiple subspace learning and spatial-temporal patch-tensor model,” IEEE Trans. Geosci. Remote Sens., vol.59, no.5, pp. 3737–3752, 2020.
  • [2]P.Wu, H.Huang, H.Qian, S.Su, B.Sun, and Z.Zuo, “SRCANet: Stacked residual coordinate attention network for infrared ship detection,” IEEE Trans. Geosci. Remote Sens., vol.60, pp. 1–14, 2022.
  • [3]P.Yan, R.Hou, X.Duan, C.Yue, X.Wang, and X.Cao, “STDMANet: Spatio-temporal differential multiscale attention network for small moving infrared target detection,” IEEE Trans. Geosci. Remote Sens., vol.61, pp. 1–16, 2023.
  • [4]X.Ying, L.Liu, Y.Wang, R.Li, N.Chen, Z.Lin, W.Sheng, and S.Zhou, “Mapping degeneration meets label evolution: Learning infrared small target detection with single point supervision,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2023, pp. 15 528–15 538.
  • [5]X.Bai and F.Zhou, “Analysis of new top-hat transformation and the application for infrared dim small target detection,” Pattern Recognit., vol.43, no.6, pp. 2145–2156, 2010.
  • [6]J.-F. Rivest and R.Fortin, “Detection of dim targets in digital infrared imagery by morphological image processing,” Opt. Eng., vol.35, no.7, pp. 1886–1893, 1996.
  • [7]C.P. Chen, H.Li, Y.Wei, T.Xia, and Y.Y. Tang, “A local contrast method for small infrared target detection,” IEEE Trans. Geosci. Remote Sens., vol.52, no.1, pp. 574–581, 2013.
  • [8]S.Kim and J.Lee, “Scale invariant small target detection by optimizing signal-to-clutter ratio in heterogeneous background for infrared search and track,” Pattern Recognit., vol.45, no.1, pp. 393–406, 2012.
  • [9]C.Gao, D.Meng, Y.Yang, Y.Wang, X.Zhou, and A.G. Hauptmann, “Infrared patch-image model for small target detection in a single image,” IEEE Trans. Image Process., vol.22, no.12, pp. 4996–5009, 2013.
  • [10]H.Zhu, S.Liu, L.Deng, Y.Li, and F.Xiao, “Infrared small target detection via low-rank tensor completion with top-hat regularization,” IEEE Trans. Geosci. Remote Sens., vol.58, no.2, pp. 1004–1016, 2019.
  • [11]H.Wang, L.Zhou, and L.Wang, “Miss detection vs. false alarm: Adversarial learning for small object segmentation in infrared images,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2019, pp. 8509–8518.
  • [12]Z.Huang, X.Wang, L.Huang, C.Huang, Y.Wei, and W.Liu, “Ccnet: Criss-cross attention for semantic segmentation,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2019, pp. 603–612.
  • [13]X.He, Y.Zhou, J.Zhao, D.Zhang, R.Yao, and Y.Xue, “Swin transformer embedding unet for remote sens. image semantic segmentation,” IEEE Trans. Geosci. Remote Sens., vol.60, pp. 1–15, 2022.
  • [14]Y.Dai, Y.Wu, F.Zhou, and K.Barnard, “Asymmetric contextual modulation for infrared small target detection,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 950–959.
  • [15]B.Li, C.Xiao, L.Wang, Y.Wang, Z.Lin, M.Li, W.An, and Y.Guo, “Dense nested attention network for infrared small target detection,” IEEE Trans. Image Process., vol.32, pp. 1745–1758, 2022.
  • [16]X.Wu, D.Hong, and J.Chanussot, “UIU-Net: U-net in u-net for infrared small object detection,” IEEE Trans. Image Process., vol.32, pp. 364–376, 2022.
  • [17]T.Wu, B.Li, Y.Luo, Y.Wang, C.Xiao, T.Liu, J.Yang, W.An, and Y.Guo, “MTU-Net: Multilevel transunet for space-based infrared tiny ship detection,” IEEE Trans. Geosci. Remote Sens., vol.61, pp. 1–15, 2023.
  • [18]Y.Dai, Y.Wu, F.Zhou, and K.Barnard, “Attentional local contrast networks for infrared small target detection,” IEEE Trans. Geosci. Remote Sens., vol.59, no.11, pp. 9813–9824, 2021.
  • [19]T.Zhang, L.Li, S.Cao, T.Pu, and Z.Peng, “Attention-guided pyramid context networks for detecting infrared small target under complex background,” IEEE Trans. Aerosp. Electron. Syst., 2023.
  • [20]X.Tong, S.Su, P.Wu, R.Guo, J.Wei, Z.Zuo, and B.Sun, “MSAFFNet: A multi-scale label-supervised attention feature fusion network for infrared small target detection,” IEEE Trans. Geosci. Remote Sens., 2023.
  • [21]M.Zhang, K.Yue, J.Zhang, Y.Li, and X.Gao, “Exploring feature compensation and cross-level correlation for infrared small target detection,” in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 1857–1865.
  • [22]Q.Hou, L.Zhang, F.Tan, Y.Xi, H.Zheng, and N.Li, “ISTDU-Net: Infrared small-target detection u-net,” IEEE Geosci. Remote Sens. Lett., vol.19, pp. 1–5, 2022.
  • [23]X.He, Q.Ling, Y.Zhang, Z.Lin, and S.Zhou, “Detecting dim small target in infrared images via subpixel sampling cuneate network,” IEEE Geosci. Remote Sens. Lett., vol.19, pp. 1–5, 2022.
  • [24]Z.Zhou, M.M. RahmanSiddiquee, N.Tajbakhsh, and J.Liang, “Unet++: A nested u-net architecture for medical image segmentation,” in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings 4.Springer, 2018, pp. 3–11.
  • [25]R.Kou, C.Wang, Y.Yu, Z.Peng, M.Yang, F.Huang, and Q.Fu, “LW-IRSTnet: Lightweight infrared small target segmentation network and application deployment,” IEEE Trans. Geosci. Remote Sens., 2023.
  • [26]J.Lin, K.Zhang, X.Yang, X.Cheng, and C.Li, “Infrared dim and small target detection based on u-transformer,” J. Vis. Commun. Image Represent., vol.89, p. 103684, 2022.
  • [27]H.Wang, P.Cao, J.Wang, and O.R. Zaiane, “UCTransnet: rethinking the skip connections in u-net from a channel-wise perspective with transformer,” in Proceedings of the AAAI conference on artificial intelligence, vol.36, no.3, 2022, pp. 2441–2449.
  • [28]Y.Li, Z.Cheng, C.Wang, J.Zhao, and L.Huang, “RCCT-ASPPNet: Dual-encoder remote image segmentation based on transformer and ASPP,” Remote Sens., vol.15, no.2, p. 379, 2023.
  • [29]Q.Luo, J.Su, C.Yang, W.Gui, O.Silven, and L.Liu, “CAT-EDNet: Cross-attention transformer-based encoder–decoder network for salient defect detection of strip steel surface,” IEEE Trans. Instrum. Meas., vol.71, pp. 1–13, 2022.
  • [30]S.W. Zamir, A.Arora, S.Khan, M.Hayat, F.S. Khan, and M.-H. Yang, “Restormer: Efficient transformer for high-resolution image restoration,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2022, pp. 5728–5739.
  • [31]X.Chen, H.Li, M.Li, and J.Pan, “Learning a sparse transformer network for effective image deraining,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2023, pp. 5896–5905.
  • [32]Z.Wang, X.Cun, J.Bao, W.Zhou, J.Liu, and H.Li, “Uformer: A general u-shaped transformer for image restoration,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2022, pp. 17 683–17 693.
  • [33]Z.Zuo, X.Tong, J.Wei, S.Su, P.Wu, R.Guo, and B.Sun, “AFFPN: attention fusion feature pyramid network for small infrared target detection,” Remote Sens., vol.14, no.14, p. 3412, 2022.
  • [34]C.Yu, Y.Liu, S.Wu, X.Xia, Z.Hu, D.Lan, and X.Liu, “Pay attention to local contrast learning networks for infrared small target detection,” IEEE Geosci. Remote Sens. Lett., vol.19, pp. 1–5, 2022.
  • [35]X.Tong, B.Sun, J.Wei, Z.Zuo, and S.Su, “EAAU-Net: Enhanced asymmetric attention u-net for infrared small target detection,” Remote Sens., vol.13, no.16, p. 3200, 2021.
  • [36]S.Liu, P.Chen, and M.Woźniak, “Image enhancement-based detection with small infrared targets,” Remote Sens., vol.14, no.13, p. 3232, 2022.
  • [37]L.Huang, S.Dai, T.Huang, X.Huang, and H.Wang, “Infrared small target segmentation with multiscale feature representation,” Infr. Phys. Technol., vol. 116, p. 103755, 2021.
  • [38]Y.Chen, L.Li, X.Liu, and X.Su, “A multi-task framework for infrared small target detection and segmentation,” IEEE Trans. Geosci. Remote Sens., vol.60, pp. 1–9, 2022.
  • [39]H.Sun, J.Bai, F.Yang, and X.Bai, “Receptive-field and direction induced attention network for infrared dim small target detection with a large-scale dataset irdst,” IEEE Trans. Geosci. Remote Sens., vol.61, pp. 1–13, 2023.
  • [40]M.Zhang, R.Zhang, Y.Yang, H.Bai, J.Zhang, and J.Guo, “ISNet: Shape matters for infrared small target detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2022, pp. 877–886.
  • [41]A.Dosovitskiy, L.Beyer, A.Kolesnikov, D.Weissenborn, X.Zhai, T.Unterthiner, M.Dehghani, M.Minderer, G.Heigold, S.Gelly etal., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
  • [42]M.Zhang, H.Bai, J.Zhang, R.Zhang, C.Wang, J.Guo, and X.Gao, “Rkformer: Runge-kutta transformer with random-connection attention for infrared small target detection,” in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 1730–1738.
  • [43]P.Pan, H.Wang, C.Wang, and C.Nie, “ABC: Attention with bilinear correlation for infrared small target detection,” arXiv preprint arXiv:2303.10321, 2023.
  • [44]F.Liu, C.Gao, F.Chen, D.Meng, W.Zuo, and X.Gao, “Infrared small-dim target detection with transformer under complex backgrounds,” arXiv preprint arXiv:2109.14379, 2021.
  • [45]J.Chen, Y.Lu, Q.Yu, X.Luo, E.Adeli, Y.Wang, L.Lu, A.L. Yuille, and Y.Zhou, “Transunet: Transformers make strong encoders for medical image segmentation,” arXiv preprint arXiv:2102.04306, 2021.
  • [46]G.Chen, W.Wang, and S.Tan, “Irstformer: A hierarchical vision transformer for infrared small target detection,” Remote Sens., vol.14, no.14, p. 3258, 2022.
  • [47]Z.Liu, Y.Lin, Y.Cao, H.Hu, Y.Wei, Z.Zhang, S.Lin, and B.Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2021, pp. 10 012–10 022.
  • [48]M.Qi, L.Liu, S.Zhuang, Y.Liu, K.Li, Y.Yang, and X.Li, “FTC-net: fusion of transformer and cnn features for infrared small target detection,” IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens., vol.15, pp. 8613–8623, 2022.
  • [49]S.Meng, C.Zhang, Q.Shi, Z.Chen, W.Hu, and F.Lu, “A robust infrared small target detection method jointing multiple information and noise prediction: Algorithm and benchmark,” IEEE Trans. Geosci. Remote Sens., 2023.
  • [50]Z.Huang, X.Wang, L.Huang, C.Huang, Y.Wei, and W.Liu, “Ccnet: Criss-cross attention for semantic segmentation,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2019, pp. 603–612.
  • [51]C.Xu, Z.Ye, L.Mei, S.Shen, Q.Zhang, H.Sui, W.Yang, and S.Sun, “SCAD: A siamese cross-attention discrimination network for bitemporal building change detection,” Remote Sens., vol.14, no.24, p. 6213, 2022.
  • [52]K.He, X.Zhang, S.Ren, and J.Sun, “Deep residual learning for image recognition,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 770–778.
  • [53]I.O. Tolstikhin, N.Houlsby, A.Kolesnikov, L.Beyer, X.Zhai, T.Unterthiner, J.Yung, A.Steiner, D.Keysers, J.Uszkoreit etal., “Mlp-mixer: An all-mlp architecture for vision,” Adv. Neural Inf. Process. Syst. (NeurIPS), vol.34, pp. 24 261–24 272, 2021.
  • [54]W.Yu, M.Luo, P.Zhou, C.Si, Y.Zhou, X.Wang, J.Feng, and S.Yan, “Metaformer is actually what you need for vision,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2022, pp. 10 819–10 829.
  • [55]S.D. Deshpande, M.H. Er, R.Venkateswarlu, and P.Chan, “Max-mean and max-median filters for detection of small targets,” in Signal and Data Processing of Small Targets 1999, vol. 3809.SPIE, 1999, pp. 74–83.
  • [56]J.Han, S.Moradi, I.Faramarzi, H.Zhang, Q.Zhao, X.Zhang, and N.Li, “Infrared small target detection based on the weighted strengthened local contrast measure,” IEEE Geosci. Remote Sens. Lett., vol.18, no.9, pp. 1670–1674, 2020.
  • [57]J.Han, S.Moradi, I.Faramarzi, C.Liu, H.Zhang, and Q.Zhao, “A local contrast method for infrared small-target detection utilizing a tri-layer window,” IEEE Geosci. Remote Sens. Lett., vol.17, no.10, pp. 1822–1826, 2019.
  • [58]L.Zhang and Z.Peng, “Infrared small target detection based on partial sum of the tensor nuclear norm,” Remote Sens., vol.11, no.4, p. 382, 2019.
  • [59]Y.Sun, J.Yang, and W.An, “Infrared dim and small target detection via multiple subspace learning and spatial-temporal patch-tensor model,” IEEE Trans. Geosci. Remote Sens., vol.59, no.5, pp. 3737–3752, 2020.
  • [60]K.Wang, S.Du, C.Liu, and Z.Cao, “Interior attention-aware network for infrared small target detection,” IEEE Trans. Geosci. Remote Sens., vol.60, pp. 1–13, 2022.
  • [61]D.Ulyanov, A.Vedaldi, and V.Lempitsky, “Instance normalization: The missing ingredient for fast stylization,” arXiv preprint arXiv:1607.08022, 2016.
  • [62]Q.Wang, B.Wu, P.Zhu, P.Li, W.Zuo, and Q.Hu, “ECA-Net: Efficient channel attention for deep convolutional neural networks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2020, pp. 11 534–11 542.
  • [63]H.Touvron, P.Bojanowski, M.Caron, M.Cord, A.El-Nouby, E.Grave, G.Izacard, A.Joulin, G.Synnaeve, J.Verbeek etal., “Resmlp: Feedforward networks for image classification with data-efficient training,” IEEE Trans. Pattern Anal. Mach. Intell., vol.45, no.4, pp. 5314–5321, 2022.
  • [64]K.He, X.Zhang, S.Ren, and J.Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2015, pp. 1026–1034.
  • [65]S.Yuan, H.Qin, X.Yan, N.Akhtar, S.Yang, and S.Yang, “ARCNet: An asymmetric residual wavelet column correction network for infrared image destriping,” arXiv preprint arXiv:2401.15578, 2024.
Shuai Yuan received the B.S. degree from Xi'an Technological University, Xi'an, China, in 2019. He is currently pursuing a Ph.D. degree at Xidian University, Xi'an, China. He is currently studying at the University of Melbourne as a visiting student, working closely with Dr. Naveed Akhtar. His research interests include infrared image understanding, remote sensing, and deep learning.
Hanlin Qin received the B.S. and Ph.D. degrees from Xidian University, Xi'an, China, in 2004 and 2010. He is currently a full professor at the School of Optoelectronic Engineering, Xidian University. He has authored or co-authored more than 100 scientific articles. His research interests include electro-optical cognition, advanced intelligent computing, and autonomous collaboration.
Xiang Yan received the B.S. and Ph.D. degrees from Xidian University, Xi'an, China, in 2012 and 2018. He was a visiting Ph.D. student with the School of Computer Science and Software Engineering, The University of Western Australia, from 2016 to 2018, working closely with Prof. Ajmal Mian. He is currently an associate professor at Xidian University, Xi'an, China. His current research interests include image processing, computer vision, and deep learning.
Naveed Akhtar is a Senior Lecturer at the University of Melbourne. He received his Ph.D. in Computer Science from the University of Western Australia and his Master degree from Hochschule Bonn-Rhein-Sieg, Germany. He is a recipient of the Discovery Early Career Researcher Award from the Australian Research Council. He is a Universal Scientific Education and Research Network Laureate in Formal Sciences. He was a finalist of the Western Australia's Early Career Scientist of the Year 2021. He is an ACM Distinguished Speaker and serves as an Associate Editor of IEEE Transactions on Neural Networks and Learning Systems.
Ajmal Mian is a Professor of Computer Science at The University of Western Australia. He is the recipient of three esteemed national fellowships from the Australian Research Council (ARC), including the recent Future Fellowship Award 2022. He is a Fellow of the International Association for Pattern Recognition and recipient of several awards, including the West Australian Early Career Scientist of the Year Award 2012, the HBF Mid-Career Scientist of the Year Award 2022, the Excellence in Research Supervision Award, the EH Thompson Award, the ASPIRE Professional Development Award, the Vice-Chancellor's Mid-Career Research Award, the Outstanding Young Investigator Award, and the Australasian Distinguished Doctoral Dissertation Award. Ajmal Mian has secured research funding from the ARC, NHMRC, DARPA, and the Australian Department of Defence. He has served as a Senior Editor for IEEE Transactions on Neural Networks and Learning Systems and Associate Editor for IEEE Transactions on Image Processing and the Pattern Recognition journal. His research interests include computer vision, machine learning, remote sensing, and 3D point cloud analysis.