Advancing Arabic Word Embeddings: A Multi-Corpora Approach with Optimized Hyperparameters and Custom Evaluation

Author name: AZZAH KHALAF ZAYED ALLAHIM

Publication Date: 2024-11-28

Journal Name: Applied Sciences

Abstract

The expanding Arabic user base presents a unique opportunity for researchers to tap into vast online Arabic resources. However, the lack of reliable Arabic word embedding models and the limited availability of Arabic corpora pose significant challenges. This paper addresses these gaps by developing and evaluating Arabic word embedding models trained on diverse Arabic corpora, investigating how varying hyperparameter values affects model performance across different NLP tasks. To train our models, we collected data from three distinct sources: Wikipedia, newspapers, and 32 Arabic books, each selected to capture specific linguistic and contextual features of Arabic. Using Word2Vec and FastText, we experimented with different hyperparameter configurations, varying the vector size, window size, and training algorithm (CBOW or skip-gram), to analyze their impact on model quality. Our models were evaluated on a range of NLP tasks, including sentiment analysis, similarity tests, and an adapted analogy test designed specifically for Arabic. The findings revealed that both corpus size and hyperparameter settings had notable effects on performance. For instance, in the analogy test, a larger vocabulary size significantly improved outcomes, with the FastText skip-gram models excelling at solving analogy questions. For sentiment analysis, vocabulary size was critical, while in similarity scoring, the FastText models achieved the highest scores, particularly with smaller window and vector sizes. Overall, our models demonstrated strong performance, achieving accuracies of 99% in sentiment analysis and 90% in the analogy test, along with a similarity score of 8 out of 10. These results underscore the value of our models as a robust tool for Arabic NLP research, addressing a pressing need for high-quality Arabic word embeddings.
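
The hyperparameter comparison described in the abstract can be pictured as a grid over model type, training algorithm, vector size, and window size. The following is a minimal sketch of how such a grid might be enumerated, not the author's actual setup: the abstract does not list the values that were tried, so the numbers below are placeholders, and the names `build_grid` and `config_name` are hypothetical. If a library such as gensim (4.x) were used for training, these settings would correspond to the `vector_size`, `window`, and `sg` arguments of its `Word2Vec` and `FastText` classes; that choice of library is an assumption as well.

```python
from itertools import product

# Placeholder search space (hypothetical values; the abstract names the
# hyperparameters but does not list the values that were tried).
MODEL_TYPES = ("word2vec", "fasttext")
ALGORITHMS = ("cbow", "skipgram")
VECTOR_SIZES = (100, 300)
WINDOW_SIZES = (3, 5)


def build_grid():
    """Return one configuration dict per combination of the settings above."""
    return [
        {
            "model": model,
            "algorithm": algorithm,
            "vector_size": vector_size,
            "window": window,
        }
        for model, algorithm, vector_size, window in product(
            MODEL_TYPES, ALGORITHMS, VECTOR_SIZES, WINDOW_SIZES
        )
    ]


def config_name(config):
    """Build a readable label for a configuration, e.g. for naming saved models."""
    return "{model}-{algorithm}-d{vector_size}-w{window}".format(**config)


configs = build_grid()
for config in configs:
    print(config_name(config))
```

Each resulting configuration would then be trained and scored on every evaluation task (sentiment analysis, similarity, analogy), so that the effect of a single setting can be read off by comparing configurations that differ only in that setting.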

Keywords

word embedding; Word2vec; FastText; Arabic embedding; Arabic corpus

Publication Link

https://www.mdpi.com/2076-3417/14/23/11104