Comparison of the potential between transformer and CNN in image classification
Conference: ICMLCA 2021 - 2nd International Conference on Machine Learning and Computer Application
12/17/2021 - 12/19/2021 at Shenyang, China
Proceedings: ICMLCA 2021
Pages: 6
Language: English
Type: PDF
Authors:
Lu, Kangrui (Fuqua School of Business, Duke University, Durham, NC, USA)
Xu, Yuanrun (College of Computer Science, Sichuan University, Chengdu, Sichuan, China)
Yang, Yige (Department of Computer Science, University of Surrey, Guildford, Surrey, UK)
Abstract:
Convolutional Neural Network (CNN) based algorithms have long dominated image classification tasks, while Transformer based methods have gained popularity and adoption in recent years. To obtain a clear comparison of the two families of methods, this study evaluates the efficiency of CNN's Inception-ResNetV2 model and the Vision Transformer (ViT) on a butterfly dataset of 10,000 images. For each method, we also compare performance internally across different dataset sizes. Examining both validation accuracy and training time, we conclude that the ViT model's accuracy is much more sensitive to large-scale datasets, and that ViT training requires relatively higher cost and longer duration. Meanwhile, the ViT model displays a relatively stable loss throughout the training process, suggesting feasible industry-level applications and opportunities for further refinement.