SwinVid: Enhancing Video Object Detection Using Swin Transformer

Author name : AMR ABDELWAHED MAHMOUD ABOZEID
Publication Date : 2024-03-19
Journal Name : Computer Systems Science & Engineering

Abstract

Why is object detection less accurate in video than in still images? Some video frames
are degraded in appearance by fast movement, out-of-focus camera shots, and changes in posture. These
challenges have made video object detection (VID) a growing area of research in recent years. Video object detection
can be used for various healthcare applications, such as detecting and tracking tumors in medical imaging,
monitoring the movement of patients in hospitals and long-term care facilities, and analyzing videos of surgeries to
improve technique and training. Additionally, it can be used in telemedicine to help diagnose and monitor patients
remotely. Existing VID techniques rely on recurrent neural networks or optical flow for feature aggregation
to produce reliable features that can be used for detection. Some of those methods aggregate features at the
full-sequence level or from nearby frames. To create feature maps, existing VID techniques frequently use Convolutional
Neural Networks (CNNs) as the backbone network. On the other hand, Vision Transformers have outperformed
CNNs in various vision tasks, including object detection in still images and image classification. We propose
in this research to use Swin-Transformer, a state-of-the-art Vision Transformer, as an alternative to CNN-based
backbone networks for object detection in videos. The proposed architecture enhances the accuracy of existing
VID methods. We evaluate the proposed method on the ImageNet VID and EPIC KITCHENS datasets.
It achieves 84.3% mean average precision (mAP) on ImageNet VID while using less memory
than other leading VID techniques. The source code is available
at https://github.com/amaharek/SwinVid
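
The abstract notes that existing VID methods aggregate features from nearby frames so that a degraded frame can borrow evidence from cleaner ones. The sketch below illustrates that general idea only; it is not the SwinVid implementation and is not taken from the linked repository. It assumes each frame has already been reduced to a plain feature vector by some backbone, and the names `cosine`, `aggregate`, and `window` are hypothetical, chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def aggregate(frames, key, window=2):
    """Similarity-weighted average of the features around frame `key`.

    Assumption: `frames` is a list of per-frame feature vectors (lists of
    floats) produced by some backbone; `window` is a hypothetical parameter
    giving how many frames on each side of `key` take part.
    """
    start = max(0, key - window)
    stop = min(len(frames), key + window + 1)
    neighbours = frames[start:stop]  # includes the key frame itself

    # Softmax over cosine similarities, so frames that resemble the key
    # frame contribute more to the aggregated feature.
    sims = [cosine(frames[key], f) for f in neighbours]
    peak = max(sims)
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(frames[key])
    return [
        sum(w * f[d] for w, f in zip(weights, neighbours))
        for d in range(dim)
    ]


if __name__ == "__main__":
    # Toy features: the middle frame is "degraded" relative to its neighbours.
    toy = [[1.0, 0.0], [0.6, 0.4], [1.0, 0.0]]
    print(aggregate(toy, key=1, window=1))
```

In this toy run the aggregated feature for the middle frame is pulled toward its two cleaner neighbours. Published methods typically apply this kind of weighting to full feature maps or object proposals rather than to single vectors.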

Keywords

Video object detection; vision transformers; convolutional neural networks; deep learning

Publication Link

https://cdn.techscience.cn/files/csse/2024/TSP_CSSE-48-2/TSP_CSSE_39436/TSP_CSSE_39436.pdf

