MNTN: Deep Modular N-shape Transformer Networks for Image Captioning

Conference: ISCTT 2021 - 6th International Conference on Information Science, Computer Technology and Transportation
11/26/2021 - 11/28/2021 at Xishuangbanna, China

Proceedings: ISCTT 2021

Pages: 6
Language: English
Type: PDF


Authors:
Yang, You (National Center for Applied Mathematics in Chongqing, Chongqing, China)
Fang, Xiaolong; Deng, Yi; Wu, Chunyan (School of Computer and Information Science, Chongqing Normal University, Chongqing, China)

Abstract:
Image captioning requires a computer to automatically generate natural-language captions from an input image. Recent progress in image captioning uses multiple features as model inputs to improve performance. Nevertheless, these features have not been utilized sufficiently. In this paper, we introduce a Modular N-shape Transformer (MNT), composed of two basic attention transformer units, to fully model the high-order intra interaction within a single feature and the high-order guided interaction across multiple features. Furthermore, we present a deep Modular N-shape Transformer Network (MNTN) that integrates MNT into the image-encoder part of the captioning model, not only to fully leverage the spatial and location information of the image, but also to help the features better localize image content. Experiments show that MNTN outperforms most previously published methods and expresses the semantic content of an image with high accuracy.
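The abstract does not give implementation details, but the two basic attention transformer units it mentions are commonly built on scaled dot-product attention: a self-attention unit for the intra interaction of a single feature set, and a guided (cross) attention unit in which one feature set attends to another. The following is a minimal NumPy sketch under that assumption; the feature shapes, names, and composition order are illustrative, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention: softmax(QK^T / sqrt(d)) V
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def self_attention_unit(x):
    # intra interaction: queries, keys, and values all come from one feature set
    return attention(x, x, x)

def guided_attention_unit(x, y):
    # guided interaction: feature set x attends to feature set y
    return attention(x, y, y)

# toy multi-feature example (e.g. two kinds of visual features, hypothetical sizes)
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))   # 5 features of dimension 8
y = rng.standard_normal((7, 8))   # 7 features of dimension 8

x_intra = self_attention_unit(x)              # intra interaction of x
x_guided = guided_attention_unit(x_intra, y)  # interaction of x guided by y
print(x_guided.shape)  # (5, 8): same shape as x, enriched by y
```

Stacking such units in depth, as the deep MNTN presumably does, would repeat this intra-then-guided pattern across layers in the image encoder.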