Synthesizer Based Efficient Self-Attention for Vision Tasks

Authors

  • Guangyang Zhu
  • Jianfeng Zhang
  • Yuanzhi Feng
  • Hai Lan

DOI:

https://doi.org/10.6911/WSRJ.202501_11(1).0002

Keywords:

Synthesizer; Self-Attention; Efficient Visual Transformer; Image Classification; Image Captioning.

Abstract

The attention mechanism was first designed for natural language processing (NLP) and has since been widely applied in computer vision, where it shows notable competence in capturing long-range relationships. However, the dot-product multiplication among query-key-value features within the self-attention module results in exhaustive and redundant computation. It is impractical for a self-attention module to directly handle raw image data with millions of pixels; as a result, an image is usually partitioned into a sequence of small patches or processed by a Convolutional Neural Network backbone to make the computation tractable before being fed into a self-attention module. Furthermore, the dimension alignment among query-key-value features within the self-attention module may destroy the internal structure of the visual feature maps. To address these problems, this paper proposes a plug-in module for self-attention named Synthesizing Tensor Transformations (STT), together with its variants, which directly processes pixel-level image features. Instead of computing the dot product among query, key, and value, the basic STT learns synthetic attention weights by transforming the input visual tensor. The effectiveness of the STT series is validated on image classification and image captioning. Experiments show that the proposed STT achieves competitive performance while remaining robust compared to basic self-attention.
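For intuition, the sketch below shows a dense-Synthesizer-style attention layer in PyTorch: the attention weights are synthesized directly from the input features by a small per-token MLP, with no query-key dot product. This is only an illustrative approximation of the general synthesizer idea, not the paper's STT module or its tensor-transformation variants; the class and parameter names are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseSynthesizerAttention(nn.Module):
    """Illustrative dense-synthesizer attention: weights come from the input itself."""
    def __init__(self, dim: int, seq_len: int):
        super().__init__()
        # Two-layer MLP maps each token's features to a full row of attention logits.
        self.w1 = nn.Linear(dim, dim)
        self.w2 = nn.Linear(dim, seq_len)
        self.value = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        logits = self.w2(F.relu(self.w1(x)))   # (batch, seq_len, seq_len) synthetic logits
        attn = logits.softmax(dim=-1)          # synthetic attention weights, no Q·K^T
        return attn @ self.value(x)            # aggregate values with synthesized weights

# Toy usage: a 4x4 patch grid flattened into a 16-token sequence of 64-d features.
x = torch.randn(2, 16, 64)
out = DenseSynthesizerAttention(dim=64, seq_len=16)(x)
print(out.shape)  # torch.Size([2, 16, 64])

Note that the synthesized weight matrix is tied to a fixed sequence length, which is one reason the paper instead transforms the visual tensor itself to obtain attention weights at the pixel level.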




Published

2024-12-18


How to Cite

Zhu, Guangyang, Jianfeng Zhang, Yuanzhi Feng, and Hai Lan. 2024. “Synthesizer Based Efficient Self-Attention for Vision Tasks”. World Scientific Research Journal 11 (1): 9-26. https://doi.org/10.6911/WSRJ.202501_11(1).0002.