DSTC-Sum: A Supervised Video Summarization Model Using Depthwise Separable Temporal Convolutional Networks
Abstract
The exponential growth of video content has created a critical need for efficient video summarization techniques that enable faster and more accurate information retrieval. Video summarization can greatly simplify the analysis of large video databases in application areas such as surveillance, education, entertainment, and research. DSTC-Sum, a novel supervised video summarization model, is proposed based on a Depthwise Separable Temporal Convolutional Network (DSTCN). Leveraging the superior representational efficiency of DSTCN, the model addresses the computational challenges and training inefficiencies of traditional recurrent architectures such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks, while also reducing computational overhead and memory usage. DSTC-Sum achieved state-of-the-art performance on two commonly used benchmark datasets, TVSum and SumMe, outperforming previous methods by 1.8% and 3.33% in F-score, respectively. To validate the model's generality and robustness, it was further evaluated on the YouTube and Open Video Project (OVP) datasets, where it slightly outperformed several popular techniques, achieving F-scores of 60.3 and 58.5, respectively. These findings confirm that the model captures long-term temporal dependencies and produces high-quality summaries across diverse video content.
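To make the core building block concrete, the following is a minimal sketch of a depthwise separable temporal (1D) convolution in PyTorch. The layer sizes, activation, and framework choice are illustrative assumptions, not the exact DSTC-Sum configuration reported in the paper.

```python
# Minimal sketch (assumed configuration, not the authors' exact architecture):
# a depthwise separable temporal convolution block, the building block behind DSTCN.
import torch
import torch.nn as nn


class DepthwiseSeparableTemporalConv(nn.Module):
    """Depthwise conv over time (one filter per channel) followed by a pointwise 1x1 conv."""

    def __init__(self, in_channels: int, out_channels: int,
                 kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        padding = (kernel_size - 1) * dilation // 2  # preserve temporal length
        # Depthwise: groups=in_channels, so each channel is convolved independently over time.
        self.depthwise = nn.Conv1d(
            in_channels, in_channels, kernel_size,
            padding=padding, dilation=dilation, groups=in_channels,
        )
        # Pointwise: mixes channels with a kernel of size 1.
        self.pointwise = nn.Conv1d(in_channels, out_channels, kernel_size=1)
        self.activation = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time), e.g. per-frame CNN features of a video.
        return self.activation(self.pointwise(self.depthwise(x)))


if __name__ == "__main__":
    frames = torch.randn(2, 1024, 240)  # illustrative: 1024-dim features, 240 frames
    block = DepthwiseSeparableTemporalConv(1024, 512, kernel_size=3)
    print(block(frames).shape)  # torch.Size([2, 512, 240])
```

With the illustrative sizes above, a standard temporal convolution would need 1024 × 512 × 3 ≈ 1.57M weights, whereas the separable version needs 1024 × 3 + 1024 × 512 ≈ 0.53M, roughly a threefold reduction. This parameter and compute saving is the kind of efficiency the abstract attributes to the DSTCN backbone relative to recurrent alternatives.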