PVANET: Deep but Lightweight Neural Networks for Real-time Object Detection (foreign-language translation material)

 2023-01-20 10:32:05


PVANET: Deep but Lightweight Neural Networks for Real-time Object Detection

Kye-Hyeon Kim∗, Sanghoon Hong∗, Byungseok Roh∗, Yeongjae Cheon, and Minje Park

Intel Imaging and Camera Technology

21 Teheran-ro 52-gil, Gangnam-gu, Seoul 06212, Korea

{kye-hyeon.kim, sanghoon.hong, peter.roh, yeongjae.cheon, minje.park}@intel.com


This paper presents how we can achieve state-of-the-art accuracy in the multi-category object detection task while minimizing the computational cost by adapting and combining recent technical innovations. Following the common pipeline of “CNN feature extraction + region proposal + RoI classification”, we mainly redesign the feature extraction part, since the region proposal part is not computationally expensive and the classification part can be efficiently compressed with common techniques like truncated SVD. Our design principle is “less channels with more layers”, together with the adoption of some building blocks including concatenated ReLU, Inception, and HyperNet. The designed network is deep and thin and is trained with the help of batch normalization, residual connections, and learning rate scheduling based on plateau detection. We obtained solid results on well-known object detection benchmarks: 83.8% mAP (mean average precision) on VOC2007 and 82.5% mAP on VOC2012 (2nd place), while taking only 750ms/image on an Intel i7-6700K CPU with a single core and 46ms/image on an NVIDIA Titan X GPU. Theoretically, our network requires only 12.3% of the computational cost compared to ResNet-101, the winner on VOC2012.
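The remark that the classification part can be compressed with truncated SVD can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the layer sizes, the rank `k`, and the helper name `truncated_svd_layer` are hypothetical, and the weight matrix is constructed to be exactly rank `k` so that the truncation happens to be lossless in this toy case.

```python
import numpy as np

def truncated_svd_layer(W, k):
    """Factor an (out, in) weight matrix into two thin matrices of rank k."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * s[:k]   # (out, k): left factor with singular values folded in
    B = Vt[:k, :]          # (k, in): right factor
    return A, B

rng = np.random.default_rng(0)
# Hypothetical layer sizes, chosen only for illustration.
out_dim, in_dim, k = 256, 512, 32
# Build a matrix that is exactly rank k, so dropping the rest loses nothing here.
W = rng.standard_normal((out_dim, k)) @ rng.standard_normal((k, in_dim))

A, B = truncated_svd_layer(W, k)
x = rng.standard_normal(in_dim)

full_macs = out_dim * in_dim               # one dense layer: W @ x
compressed_macs = k * (in_dim + out_dim)   # two thin layers: B @ x, then A @ (...)
max_abs_error = float(np.max(np.abs(W @ x - A @ (B @ x))))
```

With these assumed sizes the single dense layer costs 131072 multiply-accumulates and the factored pair costs 24576. For a real trained layer the weight matrix is not exactly low rank, so the choice of `k` trades accuracy against cost.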

1 Introduction

Convolutional neural networks (CNNs) have made impressive improvements in object detection for several years. Thanks to much innovative work, recent object detection systems have reached accuracies acceptable for commercialization in a broad range of markets such as automotive and surveillance. In terms of detection speed, however, even the best algorithms still suffer from heavy computational cost. Although recent work on network compression and quantization shows promising results, it is important to reduce the computational cost at the network design stage.

This paper presents our lightweight feature extraction network architecture for object detection, named PVANET¹, which achieves real-time object detection performance without losing accuracy compared to other state-of-the-art systems:

• Computational cost: 7.9GMAC for feature extraction with 1065x640 input (cf. ResNet-101 [1]: 80.5GMAC)²

• Runtime performance: 750ms/image (1.3FPS) on an Intel i7-6700K CPU with a single core; 46ms/image (21.7FPS) on an NVIDIA Titan X GPU

• Accuracy: 83.8% mAP on VOC-2007; 82.5% mAP on VOC-2012 (2nd place)

∗ These authors contributed equally. Corresponding author: Sanghoon Hong

¹ The code and the trained models are available at

² ResNet-101 used multi-scale testing without mentioning the additional computation cost. If we take this into account, ours requires only <7% of the computational cost compared to ResNet-101.

Figure 1: Our C.ReLU building block. Negation simply multiplies the output of Convolution by −1. Scale / Shift applies a trainable weight and bias to each channel, allowing activations in the negated part to be adaptive.
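The C.ReLU block described in the Figure 1 caption can be sketched in a few lines of NumPy. This is a minimal illustration that assumes a channel-first `(C, H, W)` layout and takes the convolution output as given; the function name `crelu` and the toy values are not from the paper.

```python
import numpy as np

def crelu(conv_out, scale, shift):
    """conv_out: (C, H, W) convolution output; scale, shift: (2C,) per-channel parameters."""
    negated = -conv_out                                     # Negation: multiply by -1
    stacked = np.concatenate([conv_out, negated], axis=0)   # Concatenation: (2C, H, W)
    adapted = stacked * scale[:, None, None] + shift[:, None, None]  # Scale / Shift
    return np.maximum(adapted, 0.0)                         # ReLU

# Toy input with C=1, H=2, W=2 and identity Scale / Shift (assumed values).
conv_out = np.array([[[1.0, -2.0], [0.5, -0.5]]])
scale = np.ones(2)
shift = np.zeros(2)
out = crelu(conv_out, scale, shift)
```

In this sketch the convolution only has to produce C channels while the block emits 2C. With identity Scale / Shift, the first output channel keeps the positive part of the input and the second keeps the magnitude of the negative part.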

The key design principle is “less channels with more layers”. Additionally, our networks adopt some recent building blocks, although the effectiveness of some of them has not yet been verified on object detection tasks:

• Concatenated rectified linear unit (C.ReLU) [2] is applied to the early stage of our CNNs (i.e., the first several layers from the network input) to reduce the number of computations by half without losing accuracy.

• Inception [3] is applied to the remainder of our feature generation sub-network. An Inception module produces output activations with receptive fields of different sizes, and thus increases the variety of receptive field sizes in the previous layer. We observed that stacking up Inception modules can capture widely varying-sized objects more effectively than a linear chain of convolutions.

• We adopted the idea of multi-scale representation like HyperNet [4] that combines several intermediate outputs so that multiple levels of detail and non-linearity can be considered.
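The claim that stacked Inception modules cover more receptive field sizes than a linear chain of convolutions can be checked with simple arithmetic. The sketch below assumes stride-1 branches with effective kernel sizes 1, 3, and 5 and ignores pooling; these branch sizes and the helper `receptive_fields` are illustrative assumptions, not the exact PVANET configuration.

```python
def receptive_fields(branch_kernels, depth):
    """Return the set of receptive field sizes reachable after `depth`
    stacked modules, each offering the given stride-1 branch kernel sizes."""
    sizes = {1}  # start from a single input pixel
    for _ in range(depth):
        # A stride-1 k x k convolution grows the receptive field by k - 1.
        sizes = {r + (k - 1) for r in sizes for k in branch_kernels}
    return sizes

chain = receptive_fields([3], depth=3)                 # linear chain of 3x3 convolutions
inception_like = receptive_fields([1, 3, 5], depth=3)  # three stacked multi-branch modules
```

Under these assumptions, three stacked 3x3 convolutions yield a single receptive field size of 7, whereas three stacked multi-branch modules yield every odd size from 1 to 13.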


We will show that our thin but deep network can be trained effectively with batch normalization [5],

residual connections [1], and learning rate scheduling based on plateau detection [1].
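Learning rate scheduling based on plateau detection can be sketched as follows. The text above does not give the detection rule, so the `patience`, `min_delta`, and `factor` parameters and the loss values are hypothetical; the sketch only shows the general idea of lowering the rate once the loss stops improving.

```python
def plateau_schedule(losses, base_lr=0.1, factor=0.5, patience=3, min_delta=1e-3):
    """Return the learning rate used at each step of the given loss sequence."""
    lr = base_lr
    best = float("inf")
    stale = 0   # consecutive steps without a meaningful improvement
    lrs = []
    for loss in losses:
        lrs.append(lr)
        if loss < best - min_delta:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                lr *= factor   # plateau detected: decay the learning rate
                stale = 0
    return lrs

# Made-up loss curve with two flat stretches.
losses = [1.0, 0.8, 0.7, 0.7, 0.7, 0.7, 0.6, 0.6, 0.6, 0.6]
lrs = plateau_schedule(losses)
```

Here the rate stays at 0.1 for the first six steps and drops to 0.05 after three steps without improvement. In practice the loss is usually smoothed before the comparison, since raw minibatch losses are noisy.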

In the remainder of the paper, we describe our network design briefly (Section 2) and summarize the detailed structure of PVANET (Section 3). Finally, we provide some experimental results on the VOC-2007 and VOC-2012 benchmarks, with detailed settings for training and testing (Section 4).

2 Details on Network D



