Graduation Project (Thesis) Foreign Literature Translation
PVANET: Deep but Lightweight Neural Networks for
Real-time Object Detection
Kye-Hyeon Kim*, Sanghoon Hong*, Byungseok Roh*, Yeongjae Cheon, and Minje Park
Intel Imaging and Camera Technology
21 Teheran-ro 52-gil, Gangnam-gu, Seoul 06212, Korea
{kye-hyeon.kim, sanghoon.hong, peter.roh,
yeongjae.cheon, minje.park}@intel.com
Abstract
This paper presents how we can achieve state-of-the-art accuracy in the multi-category object detection task while minimizing the computational cost by adapting and combining recent technical innovations. Following the common pipeline of "CNN feature extraction → region proposal → RoI classification", we mainly redesign the feature extraction part, since the region proposal part is not computationally expensive and the classification part can be efficiently compressed with common techniques like truncated SVD. Our design principle is "less channels with more layers", together with the adoption of some building blocks including concatenated ReLU, Inception, and HyperNet. The designed network is deep and thin, and is trained with the help of batch normalization, residual connections, and learning rate scheduling based on plateau detection. We obtained solid results on well-known object detection benchmarks: 83.8% mAP (mean average precision) on VOC2007 and 82.5% mAP on VOC2012 (2nd place), while taking only 750 ms/image on an Intel i7-6700K CPU with a single core and 46 ms/image on an NVIDIA Titan X GPU. Theoretically, our network requires only 12.3% of the computational cost of ResNet-101, the winner on VOC2012.
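As a concrete illustration of the classifier compression mentioned above, truncated SVD factorizes a fully-connected weight matrix into two thinner layers. The sketch below uses NumPy with illustrative dimensions and rank; these numbers are ours, not the paper's actual layer sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fully-connected weight matrix (output_dim x input_dim);
# the dimensions and rank k are illustrative, not taken from the paper.
m, n, k = 1024, 1024, 128
W = rng.standard_normal((m, n))

# Truncated SVD: W ~= (U_k S_k) V_k^T, so one m x n layer becomes
# two thinner layers with cost k*n + m*k instead of m*n.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
W1 = Vt[:k, :]            # first (bottleneck) layer: k x n
W2 = U[:, :k] * S[:k]     # second layer: m x k

x = rng.standard_normal(n)
y_approx = W2 @ (W1 @ x)  # compressed forward pass
```

With these illustrative sizes the parameter count drops from m·n = 1,048,576 to k·(m+n) = 262,144, a 4× reduction in multiply-accumulates for that layer.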
1 Introduction
Convolutional neural networks (CNNs) have made impressive improvements in object detection for several years. Thanks to much innovative work, recent object detection systems have reached accuracies acceptable for commercialization in a broad range of markets, such as automotive and surveillance. In terms of detection speed, however, even the best algorithms still suffer from heavy computational cost. Although recent work on network compression and quantization shows promising results, it is also important to reduce the computational cost at the network design stage.
This paper presents our lightweight feature extraction network architecture for object detection, named PVANET¹, which achieves real-time object detection performance without losing accuracy compared to other state-of-the-art systems:
• Computational cost: 7.9 GMAC for feature extraction with a 1065×640 input (cf. ResNet-101 [1]: 80.5 GMAC²)

*These authors contributed equally. Corresponding author: Sanghoon Hong
¹The code and the trained models are available at
²ResNet-101 used multi-scale testing without mentioning the additional computational cost. If we take this into account, ours requires only <7% of the computational cost of ResNet-101.
[Figure 1: block diagram of Convolution → Negation → Concatenation → Scale/Shift → ReLU]
Figure 1: Our C.ReLU building block. Negation simply multiplies −1 to the output of Convolution. Scale/Shift applies a trainable weight and bias to each channel, allowing activations in the negated part to be adaptive.
• Runtime performance: 750 ms/image (1.3 FPS) on an Intel i7-6700K CPU with a single core; 46 ms/image (21.7 FPS) on an NVIDIA Titan X GPU
• Accuracy: 83.8% mAP on VOC-2007; 82.5% mAP on VOC-2012 (2nd place)
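Following Figure 1, a C.ReLU block negates the convolution output, concatenates it with the original along the channel axis, applies a per-channel scale and shift, and then a ReLU. A minimal NumPy sketch; the function and parameter names are ours, not from the paper:

```python
import numpy as np

def crelu(conv_out, scale, shift):
    """C.ReLU sketch for a feature map of shape (C, H, W).

    Concatenates the convolution output with its negation (doubling
    the channels from C to 2C), applies a trainable per-channel
    scale/shift of length 2C, then a ReLU."""
    x = np.concatenate([conv_out, -conv_out], axis=0)    # Negation + Concatenation
    x = x * scale[:, None, None] + shift[:, None, None]  # Scale / Shift
    return np.maximum(x, 0.0)                            # ReLU
```

The point of the construction: only C convolution filters are computed while 2C activation channels are produced, which is where the roughly 2× saving in the early layers comes from.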
The key design principle is "less channels with more layers". Additionally, our networks adopt some recent building blocks, although the effectiveness of some of them had not yet been verified on object detection tasks:
• Concatenated rectified linear unit (C.ReLU) [2] is applied to the early stage of our CNNs (i.e., the first several layers from the network input) to reduce the number of computations by half without losing accuracy.
• Inception [3] is applied to the rest of our feature generation sub-network. An Inception module produces output activations with different receptive field sizes, thereby increasing the variety of receptive field sizes over the previous layer. We observed that stacking up Inception modules can capture objects of widely varying sizes more effectively than a linear chain of convolutions.
• We adopted the idea of multi-scale representation, as in HyperNet [4], which combines several intermediate outputs so that multiple levels of detail and non-linearity can be considered simultaneously.
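The multi-scale combination in the last bullet can be sketched as follows: bring feature maps from three stages to a common resolution (downscale the fine map, upscale the coarse map) and concatenate them along the channel axis. The shapes and the pooling/upsampling choices here are illustrative, not the paper's exact design:

```python
import numpy as np

def hyper_feature(fine, mid, coarse):
    """Combine three feature maps at the middle resolution.

    fine:   (C1, 2H, 2W)   -> 2x2 average pooling down to (C1, H, W)
    mid:    (C2, H, W)     -> kept as-is
    coarse: (C3, H/2, W/2) -> nearest-neighbor upscaling to (C3, H, W)
    Returns a hyper-feature of shape (C1 + C2 + C3, H, W)."""
    _, H, W = mid.shape
    down = fine.reshape(fine.shape[0], H, 2, W, 2).mean(axis=(2, 4))
    up = coarse.repeat(2, axis=1).repeat(2, axis=2)
    return np.concatenate([down, mid, up], axis=0)
```

The concatenated map then feeds both the region proposal and the classification parts, so fine details and high-level semantics are available at once.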
We will show that our thin but deep network can be trained effectively with batch normalization [5],
residual connections [1], and learning rate scheduling based on plateau detection [1].
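Plateau-based learning rate scheduling can be sketched as follows: track the best loss seen so far, and decay the learning rate whenever no sufficient improvement has been observed for a fixed number of checks. All parameter names and default values below are illustrative, not the paper's actual training settings:

```python
class PlateauLRScheduler:
    """Sketch of plateau-based learning rate scheduling.

    If the monitored loss has not improved by more than `min_delta`
    for `patience` consecutive checks, multiply the learning rate
    by `factor`."""

    def __init__(self, lr, factor=0.1, patience=3, min_delta=1e-4):
        self.lr, self.factor = lr, factor
        self.patience, self.min_delta = patience, min_delta
        self.best, self.bad = float("inf"), 0

    def step(self, loss):
        if loss < self.best - self.min_delta:
            self.best, self.bad = loss, 0   # improvement: reset counter
        else:
            self.bad += 1                   # plateau check failed
            if self.bad >= self.patience:
                self.lr *= self.factor      # decay on detected plateau
                self.bad = 0
        return self.lr
```

Compared with a fixed step schedule, this lets training stay at a given rate exactly as long as the loss keeps improving.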
In the remainder of the paper, we briefly describe our network design (Section 2) and summarize the detailed structure of PVANET (Section 3). Finally, we provide experimental results on the VOC-2007 and VOC-2012 benchmarks, with detailed settings for training and testing (Section 4).
2 Details on Network Design