TY - GEN
T1 - Maximizing CNN accelerator efficiency through resource partitioning
AU - Shen, Yongming
AU - Ferdman, Michael
AU - Milder, Peter
N1 - Publisher Copyright: © 2017 Association for Computing Machinery.
PY - 2017/6/24
Y1 - 2017/6/24
N2 - Convolutional neural networks (CNNs) are revolutionizing machine learning, but they present significant computational challenges. Recently, many FPGA-based accelerators have been proposed to improve the performance and efficiency of CNNs. Current approaches construct a single processor that computes the CNN layers one at a time; the processor is optimized to maximize the throughput at which the collection of layers is computed. However, this approach leads to inefficient designs because the same processor structure is used to compute CNN layers of radically varying dimensions. We present a new CNN accelerator paradigm and an accompanying automated design methodology that partitions the available FPGA resources into multiple processors, each of which is tailored for a different subset of the CNN convolutional layers. Using the same FPGA resources as a single large processor, multiple smaller specialized processors increase computational efficiency and lead to a higher overall throughput. Our design methodology achieves 3.8x higher throughput than the state-of-the-art approach when evaluating the popular AlexNet CNN on a Xilinx Virtex-7 FPGA. For the more recent SqueezeNet and GoogLeNet, the speedups are 2.2x and 2.0x, respectively.
AB - Convolutional neural networks (CNNs) are revolutionizing machine learning, but they present significant computational challenges. Recently, many FPGA-based accelerators have been proposed to improve the performance and efficiency of CNNs. Current approaches construct a single processor that computes the CNN layers one at a time; the processor is optimized to maximize the throughput at which the collection of layers is computed. However, this approach leads to inefficient designs because the same processor structure is used to compute CNN layers of radically varying dimensions. We present a new CNN accelerator paradigm and an accompanying automated design methodology that partitions the available FPGA resources into multiple processors, each of which is tailored for a different subset of the CNN convolutional layers. Using the same FPGA resources as a single large processor, multiple smaller specialized processors increase computational efficiency and lead to a higher overall throughput. Our design methodology achieves 3.8x higher throughput than the state-of-the-art approach when evaluating the popular AlexNet CNN on a Xilinx Virtex-7 FPGA. For the more recent SqueezeNet and GoogLeNet, the speedups are 2.2x and 2.0x, respectively.
KW - Accelerator
KW - Convolutional Neural Network
KW - FPGA
UR - https://www.scopus.com/pages/publications/85025700588
U2 - 10.1145/3079856.3080221
DO - 10.1145/3079856.3080221
M3 - Conference contribution
T3 - Proceedings - International Symposium on Computer Architecture
SP - 535
EP - 547
BT - ISCA 2017 - 44th Annual International Symposium on Computer Architecture - Conference Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 44th Annual International Symposium on Computer Architecture - ISCA 2017
Y2 - 24 June 2017 through 28 June 2017
ER -