University of Basrah Organizes a Lecture on the Role of Arabic Pre-Trained Transformer Models in Topic Detection and Classification of Arabic Textual Data

The College of Computer Science and Information Technology at the University of Basrah organized a scientific lecture entitled "Leveraging Arabic Pre-trained Transformer Models for Topic Detection and Classification in Arabic Textual Data." The lecture summarized the key findings of a master's thesis that addresses the challenge of automatically classifying unstructured Arabic texts available online. The task is highly complex because of the morphological and syntactic characteristics of the Arabic language, such as derivation, diacritics, and multiple roots.

The lecture was presented by researcher Noor Salman Dawood. The research proposes an integrated hybrid system that begins with a pre-processing stage designed specifically for Arabic text. This stage includes normalization, diacritics handling, stop-word removal, and root extraction. The system then proceeds through three sequential phases: topic detection using the BERTopic model, which relies on deep contextual representations; automatic topic classification using the Arabic NAMAA model; and, finally, measurement of the level of agreement with human classification using Cohen's Kappa coefficient.
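The pre-processing steps named above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the character mappings, and the five-word stop list are assumptions chosen for illustration, and root extraction is left out because it needs a dedicated Arabic stemmer.

```python
import re

# Arabic diacritics (U+064B to U+0652) plus the superscript alef (U+0670).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"  # elongation character, carries no meaning

# Assumed normalization rules: alef variants -> bare alef, alef maqsura -> ya.
LETTER_MAP = str.maketrans({
    "\u0622": "\u0627",  # alef with madda
    "\u0623": "\u0627",  # alef with hamza above
    "\u0625": "\u0627",  # alef with hamza below
    "\u0649": "\u064A",  # alef maqsura
})


def normalize(text):
    """Strip diacritics and tatweel, then unify letter variants."""
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    return text.translate(LETTER_MAP)


# Tiny illustrative stop list (hypothetical); it is normalized the same way
# as the input so that both sides of the comparison match.
STOP_WORDS = {normalize(w) for w in ("في", "من", "على", "إلى", "عن")}


def preprocess(text):
    """Return normalized Arabic tokens with stop words removed."""
    tokens = re.findall(r"[\u0621-\u064A]+", normalize(text))
    return [t for t in tokens if t not in STOP_WORDS]
```

Normalizing the stop list with the same function as the input matters here: once the letter variants are unified, a stop word written with a hamza or an alef maqsura would otherwise no longer match its normalized form in the text.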

The system was tested on four Arabic language models, including Asafaya, AraBERTv2, and QARiB, using two Arabic news datasets covering sports and economics. The datasets differed in size and category distribution, thus mimicking real-world publishing environments. The system demonstrated a high ability to detect topics with strong semantic coherence, achieving a classification accuracy of 98.2% and near-perfect agreement with human classification.
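For readers unfamiliar with the agreement measure, Cohen's Kappa compares the observed agreement between two annotators, p_o, with the agreement expected by chance, p_e, as kappa = (p_o - p_e) / (1 - p_e). The sketch below is a plain-Python illustration of that formula, with hypothetical labels; it does not reproduce the thesis's data or results.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two equal-length sequences of category labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)

    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's label proportions,
    # summed over all categories.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical example: model output versus a human annotator.
model_labels = ["sports", "sports", "economy", "economy"]
human_labels = ["sports", "sports", "economy", "sports"]
kappa = cohens_kappa(model_labels, human_labels)  # 0.5 for this toy data
```

In the toy example the two annotators agree on three of four items (p_o = 0.75) while chance agreement is 0.5, which gives a Kappa of 0.5. Values above roughly 0.8 are conventionally described as near-perfect agreement.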