Multi-Object Semantic Video Detection and Indexing using a 3D Deep Learning Model
Abstract
Raw data production is growing exponentially, driven in no small part by social media platforms such as Facebook
and YouTube. Video is proving to be the most important data type because of the substantial amount of raw data
it contains, and it must be understood, organized, structured, and stored efficiently for ease of retrieval.
An efficient video indexing architecture is therefore crucial for video datasets. This paper proposes an efficient Multi-Object
Semantic Video Detection (MOSD) network that leverages the power of deep learning to achieve effective indexing at the
semantic concept level. MOSD is a multi-detection network for video semantics across multiple frames. MOSD exploits a
3D convolution operation to perform multiple detections across multiple frames with higher performance. The detected
semantics are then structured and used to index the video segments. MOSD has been trained and evaluated on
the ImageNet VID dataset and has been compared with peer methods. MOSD proved efficient at exploiting the temporal context
of a video to perform simultaneous detections on consecutive frames, which speeds up the detection of semantic objects.
MOSD also achieved a mean average precision (mAP) of 85.2%.
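
As a rough illustration of the indexing step mentioned above, the following minimal sketch builds an inverted index from semantic labels to the video segments in which they were detected. It is not the MOSD implementation: the `(segment_id, label, score)` tuple layout, the function names `build_index` and `query`, the 0.5 confidence threshold, and the sample detections are all assumptions made for this example.

```python
from collections import defaultdict


def build_index(detections, min_score=0.5):
    """Map each semantic label to the sorted segment ids where it was detected.

    `detections` is an iterable of (segment_id, label, score) tuples. This
    layout and the `min_score` threshold are hypothetical, chosen only to
    illustrate the idea of indexing at the semantic concept level.
    """
    index = defaultdict(set)
    for segment_id, label, score in detections:
        if score >= min_score:  # drop low-confidence detections
            index[label].add(segment_id)
    return {label: sorted(segments) for label, segments in index.items()}


def query(index, *labels):
    """Return the segment ids that contain every requested label."""
    if not labels:
        return []
    segment_sets = [set(index.get(label, ())) for label in labels]
    return sorted(set.intersection(*segment_sets))


# Made-up detections for three video segments (illustrative only).
detections = [
    (0, "dog", 0.91), (0, "car", 0.40),
    (1, "dog", 0.88), (1, "car", 0.75),
    (2, "car", 0.95),
]
index = build_index(detections)
```

With such an index, retrieving the segments that contain several concepts at once reduces to a set intersection, for example `query(index, "dog", "car")`.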