Benchmarks

Some benchmarks are given here, mainly to help with back-of-the-envelope calculations of the resources that a pipeline built from these stages might need on your own video corpus.

The benchmarks were run on the following machine:

  • 7 cores of an Intel Xeon Gold 5120 CPU @ 2.20 GHz
  • 1 NVIDIA P100 GPU with 16 GB RAM
  • 32 GB system RAM

They were run on the 480p version of a typical talking-heads YouTube video. The video is just over 2 minutes long, so use that to derive an approximate scaling factor for your own corpus (see the sketch after the table below). All times include the script/Singularity startup time, so take this into account when scaling.

| System | Time (m:s) |
|---|---|
| **Shot segmenters** | |
| PySceneDetect | 0:10.4 |
| ffprobe | 0:10.7 |
| **Pose estimators** | |
| OpenPose Body25 | 4:23.8 |
| OpenPose Body25+Face | 8:53.9 |
| OpenPose Body25+Hands | 10:11.2 |
| OpenPose Body25+Face+Hands | 14:49.7 |
| **Within-shot pose trackers (run on the output of Body25)** | |
| lighttrackish | 0:33.5 |
| opt_lighttrack | 0:17.7 |
| deepsortlike | 0:21.8 |
| **Face detection + embedding (times include embedding all faces found)** | |
| dlib-hog-face5 | 2:50.1 |
| dlib-hog-face68 | 2:53.1 |
| dlib-cnn-face5 | 0:44.5 |
| dlib-cnn-face68 | 0:43.7 |
| **Embedding faces from existing OpenPose output (face3 requires only Body25, whereas face68 requires Body25+Face)** | |
| openpose-face68 | 0:17.3 |
| openpose-face3 | 0:21.0 |
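As a rough guide to the scaling described above, here is a minimal Python sketch that converts a table entry into an estimated runtime for a corpus of a given length. The 2-minute benchmark video length and the `startup_seconds` parameter are assumptions to adjust for your own setup; the startup overhead is included in the table times but not reported separately.

```python
# Back-of-the-envelope runtime estimate for one pipeline stage, scaled
# from the benchmark table above. The benchmark video length and the
# startup overhead are assumptions; adjust both to your measurements.

BENCHMARK_VIDEO_MINUTES = 2.0  # "just over 2 minutes"; treat as approximate


def parse_time(entry: str) -> float:
    """Convert a table entry like '14:49.7' or '0:33.5' to seconds."""
    if ":" in entry:
        minutes, seconds = entry.split(":")
        return int(minutes) * 60 + float(seconds)
    return float(entry)


def estimate_minutes(stage_time: str, corpus_minutes: float,
                     startup_seconds: float = 0.0) -> float:
    """Estimate total stage runtime (in minutes) for a corpus.

    startup_seconds is a hypothetical per-run script/Singularity startup
    cost to subtract before scaling, since the table times include it.
    """
    per_video_seconds = parse_time(stage_time) - startup_seconds
    scaling_factor = corpus_minutes / BENCHMARK_VIDEO_MINUTES
    return per_video_seconds * scaling_factor / 60


# Example: OpenPose Body25+Face+Hands over a 10-hour (600-minute) corpus.
print(f"{estimate_minutes('14:49.7', corpus_minutes=600):.1f} minutes")
```

This assumes processing time scales linearly with video length, which is only approximately true for stages whose cost also depends on content (for example, the face-embedding times depend on how many faces are found).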

Accuracy comparisons

In many cases, accuracy comparisons can be found on the websites or in the papers of the software behind the different pipeline stages. However, some additional comparisons are available elsewhere; note that these may refer to older versions: