Keypoint Inference Benchmark

Benchmark on Server

We ran benchmarks in different runtime environments. See the tables below for details.

Model	CPU + MKLDNN (thread=1)	CPU + MKLDNN (thread=4)	GPU	TensorRT (FP32)	TensorRT (FP16)
LiteHRNet-18-256x192	88.8 ms	40.7 ms	4.4 ms	2.0 ms	1.8 ms
LiteHRNet-18-384x288	188.0 ms	79.3 ms	4.8 ms	3.6 ms	3.2 ms
LiteHRNet-30-256x192	148.4 ms	69.0 ms	7.1 ms	3.1 ms	2.8 ms
LiteHRNet-30-384x288	309.8 ms	133.5 ms	8.2 ms	6.0 ms	5.3 ms
PP-TinyPose-128x96	25.2 ms	14.1 ms	2.7 ms	0.9 ms	0.8 ms
PP-TinyPose-256x192	82.4 ms	36.1 ms	3.0 ms	1.5 ms	1.1 ms

Notes:

The tests above are based on Python deployment.
The environment is NVIDIA T4 / PaddlePaddle (commit: 7df301f2fc) / CUDA10.1 / CUDNN7 / Python3.7 / TensorRT6.
The test is based on deploy/python/det_keypoint_unite_infer.py with the image demo/000000014439.jpg. The input batch size for the keypoint model is set to 8.
The time only includes inference time.
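
A minimal sketch of how inference-only latency could be measured. It assumes the model call is wrapped in a hypothetical zero-argument callable `run_inference` that blocks until the outputs are ready (on GPU this usually means synchronizing or copying outputs to the host). The name `time_inference`, the warm-up count, and the repeat count are illustrative assumptions, not the settings used by the benchmark script. Preprocessing and postprocessing stay outside the timed region.

```python
import time
from statistics import mean


def time_inference(run_inference, warmup=10, repeats=100):
    """Return the mean latency of run_inference() in milliseconds.

    run_inference is a hypothetical callable that performs one forward
    pass and returns only after the outputs are available.
    """
    # Warm-up runs are executed but not timed, so one-off costs such as
    # memory allocation or engine building do not skew the result.
    for _ in range(warmup):
        run_inference()

    samples_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_inference()  # only the forward pass is inside the timed region
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return mean(samples_ms)
```

Image loading, resizing, and keypoint decoding would be done before or after the call passed in as `run_inference`, so they do not contribute to the reported number.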

Model	CPU + MKLDNN (thread=1)	CPU + MKLDNN (thread=4)	GPU	TensorRT (FP32)	TensorRT (FP16)
DARK_HRNet_w32-256x192	363.93 ms	97.38 ms	4.13 ms	3.74 ms	1.75 ms
DARK_HRNet_w32-384x288	823.71 ms	218.55 ms	9.44 ms	8.91 ms	2.96 ms
HRNet_w32-256x192	363.67 ms	97.64 ms	4.11 ms	3.71 ms	1.72 ms
HRNet_w32-256x256_mpii	485.56 ms	131.48 ms	4.81 ms	4.26 ms	2.00 ms
HRNet_w32-384x288	822.73 ms	215.48 ms	9.40 ms	8.81 ms	2.97 ms
PP-TinyPose-128x96	24.06 ms	13.05 ms	2.43 ms	0.75 ms	0.72 ms
PP-TinyPose-256x192	82.73 ms	36.25 ms	2.57 ms	1.38 ms	1.15 ms

Notes:

The tests above are based on C++ deployment.
The environment is NVIDIA T4 / PaddlePaddle (commit: 7df301f2fc) / CUDA10.1 / CUDNN7 / Python3.7 / TensorRT6.
The test is based on deploy/python/det_keypoint_unite_infer.py with the image demo/000000014439.jpg. The input batch size for the keypoint model is set to 8.
The time only includes inference time.
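
To make the table easier to compare, the sketch below derives speedup ratios from a few of the GPU and TensorRT latencies listed in the C++ table above. The helper `speedup` and the dictionary layout are illustrative assumptions; the numbers are copied from the table, not measured again.

```python
def speedup(baseline_ms, optimized_ms):
    """Return how many times faster optimized_ms is than baseline_ms."""
    if baseline_ms <= 0 or optimized_ms <= 0:
        raise ValueError("latencies must be positive")
    return baseline_ms / optimized_ms


# Latencies in ms copied from the C++ table: (GPU, TensorRT FP32, TensorRT FP16)
cpp_latency_ms = {
    "HRNet_w32-256x192": (4.11, 3.71, 1.72),
    "HRNet_w32-384x288": (9.40, 8.81, 2.97),
    "PP-TinyPose-128x96": (2.43, 0.75, 0.72),
}

for name, (gpu, trt_fp32, trt_fp16) in cpp_latency_ms.items():
    print(
        f"{name}: TensorRT FP32 {speedup(gpu, trt_fp32):.2f}x, "
        f"TensorRT FP16 {speedup(gpu, trt_fp16):.2f}x over plain GPU"
    )
```

The same helper can be applied to the CPU columns to see how latency changes between 1 and 4 threads.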

Benchmark on Mobile

We ran benchmarks on Kirin and Qualcomm Snapdragon devices. See the table below for details.

Model	Kirin 980 (1-thread)	Kirin 980 (4-threads)	Qualcomm Snapdragon 845 (1-thread)	Qualcomm Snapdragon 845 (4-threads)	Qualcomm Snapdragon 660 (1-thread)	Qualcomm Snapdragon 660 (4-threads)
PicoDet-s-192x192 (det)	14.85 ms	5.45 ms	17.50 ms	7.56 ms	80.08 ms	27.36 ms
PicoDet-s-320x320 (det)	38.09 ms	12.00 ms	45.26 ms	17.07 ms	232.81 ms	58.68 ms
PP-TinyPose-128x96 (pose)	12.03 ms	5.09 ms	13.14 ms	6.73 ms	71.87 ms	20.04 ms

Notes:

The tests above are based on Paddle Lite deployment, version v2.10-rc.
The time only includes inference time.
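
Because the table lists the detector and the pose model separately, a rough combined figure can be sketched as below. This is an estimate under stated assumptions, not a measured result: it assumes the detector runs once per image, the pose model runs once per detected person, the two run sequentially, and preprocessing and postprocessing are ignored. The function `estimate_pipeline_ms` is hypothetical.

```python
def estimate_pipeline_ms(det_ms, pose_ms, num_people=1):
    """Estimate detector + pose latency in ms for one image.

    Assumes one detector pass per image and one pose pass per person,
    executed sequentially (an assumption, not a benchmarked setup).
    """
    if num_people < 0:
        raise ValueError("num_people must be non-negative")
    return det_ms + num_people * pose_ms


# Kirin 980, 4 threads, values copied from the table above.
det_ms = 5.45   # PicoDet-s-192x192 (det)
pose_ms = 5.09  # PP-TinyPose-128x96 (pose)

for people in (1, 3):
    total = estimate_pipeline_ms(det_ms, pose_ms, num_people=people)
    print(f"{people} person(s): about {total:.2f} ms per image")
```

Real end-to-end latency will differ, since image decoding, resizing, cropping, and keypoint decoding are excluded from the table.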