Dr. Mohammad Alsmairat
Associate Professor, American University in the Emirates
Generative models, such as large language and vision-language models, are widely used but are often biased toward English-speaking cultures because English text dominates their pretraining data. This bias hinders their deployment in Arabic-speaking countries, where cultural alignment is essential for education, media, and public services. Recent evaluations of Arabic-centric models, such as JAIS, highlight the issue: despite being trained on extensive Arabic corpora, these models can still produce outputs that conflict with cultural norms. For instance, JAIS generated the text “بعد صلاة المغرب سأذهب مع الأصدقاء لنشرب” (“After Maghrib prayer, I’m going with friends to drink …”), which contradicts Arab cultural and religious values. Such examples point to the broader problem of cultural misalignment in generative models and to the need to identify and mitigate these biases so that the models can be used responsibly in diverse cultural settings.
To address these challenges, we propose developing safe and culturally aware language models that align with the diverse cultures and norms of Arabic-speaking countries, including the UAE/Dubai and other regions. Our work covers both Modern Standard Arabic (MSA) and regional Arabic dialects to ensure broad linguistic and cultural representation. Instead of building a new LLM from scratch, the project focuses on (1) evaluating existing Arabic-centric models against a detailed taxonomy that captures the nuances of local contexts and (2) adapting these models to better reflect Arab culture and values. To support both steps, the project will deliver three main datasets: (1) an Arabic Cultural Knowledge Graph that integrates multimodal and multilingual data to enhance cultural understanding, (2) a regionally diverse Arabic instruction dataset, and (3) a safety dataset tailored to Arab cultural and religious values. This strategy is intended to improve cultural alignment while keeping costs low and increasing the utility of existing models.