Estimating the Execution Time of CNN Inference on GPUs

Conference: MBMV 2024 - 27. Workshop
02/14/2024 - 02/15/2024 at Kaiserslautern

Proceedings: ITG-Fb. 314: MBMV 2024

Pages: 10
Language: English
Type: PDF

Authors:
Groth, Stefan; Teich, Juergen; Hannig, Frank (Hardware/Software Co-Design, Department of Computer Science, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany)
Schmid, Moritz (Siemens Healthineers, Forchheim, Germany)

Abstract:
Performance modeling of Convolutional Neural Network (CNN) inference on GPU targets is a crucial task when searching for the best network model for a certain application while also adhering to latency requirements. We propose an estimation strategy based on execution time measurements of just the basic building blocks of a network, so-called cells, and reusing these measurements for the evaluation of full networks and, in particular, of different networks. To further improve the estimation accuracy of our performance model, we incorporate layer fusion, an important optimization technique used in many inference frameworks, in a way that is even back-end agnostic. Our performance model is able to consider even uncommon layer types by dynamically measuring cell execution times when necessary, thus allowing us to estimate a wide range of neural networks. We assess the quality of our performance model for four different common CNNs and three different Nvidia GPUs, a discrete RTX A6000, an embedded Xavier AGX, and an embedded Jetson TX2, using TensorRT. For all considered networks, we achieve highly accurate execution time estimates, particularly when compared to state-of-the-art approaches. Moreover, a root mean square percentage error (RMSPE) of 2%, 1.1%, and 0.7% is achieved for the RTX A6000, Xavier AGX, and Jetson TX2, respectively. Finally, the generality of the performance model and its ability to offer fast and highly accurate estimation results is shown in the context of Neural Architecture Search (NAS) by estimating the latency of neural networks that are created as instances of a commonly used NAS benchmark. Here, we achieve an RMSPE of 4.1%.
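The core idea of the abstract, summing cached per-cell execution time measurements, falling back to a dynamic measurement for unseen cell types, and accounting for layer fusion, can be sketched as follows. All cell names, timing values, and the simple max-based fusion rule are illustrative assumptions, not details taken from the paper:

```python
# Hypothetical sketch of cell-based latency estimation as described above.
# Timings and the fusion rule are illustrative placeholders.

# Cache of measured per-cell execution times (milliseconds), filled by
# benchmarking each cell type once on the target GPU.
cell_times_ms = {
    ("conv3x3", 64): 0.42,
    ("conv1x1", 64): 0.11,
    ("relu", 64): 0.03,
}

def measure_cell(cell):
    """Placeholder for an on-demand benchmark of an uncommon cell type."""
    return 0.05  # illustrative constant instead of a real GPU measurement

def estimate_latency_ms(cells, fused_pairs=()):
    """Sum cached cell times; a fused pair is counted once (slower member)."""
    total = 0.0
    skip = set()
    for i, cell in enumerate(cells):
        if i in skip:
            continue
        t = cell_times_ms.get(cell)
        if t is None:  # uncommon layer: measure dynamically, then cache
            t = cell_times_ms[cell] = measure_cell(cell)
        if (i, i + 1) in fused_pairs and i + 1 < len(cells):
            nxt = cells[i + 1]
            t = max(t, cell_times_ms.get(nxt, measure_cell(nxt)))
            skip.add(i + 1)  # the fused successor adds no extra time
        total += t
    return total

net = [("conv3x3", 64), ("relu", 64), ("conv1x1", 64)]
print(round(estimate_latency_ms(net, fused_pairs={(0, 1)}), 2))  # fused estimate
```

Reusing the cache across different networks is what makes the approach attractive for NAS-style searches: each cell type is benchmarked once, after which estimating a new candidate network is a cheap lookup-and-sum.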