Grant Initiatives - Localization of Large Language Models to Arabic Norms, Culture, and Values - Dubai Research, Development and Innovation Program

Localization of Large Language Models to Arabic Norms, Culture, and Values

Dr. Mohammad Alsmairat

Associate Professor, American University in the Emirates

“I am excited to embark on this research journey, leveraging AI and Machine Learning to unlock new possibilities in intelligent automation and decision-making. This grant fuels my passion for pushing the boundaries of AI-driven innovation, transforming data into actionable insights that shape the future. With a deep commitment to bridging research and real-world impact, I look forward to developing solutions that redefine efficiency, adaptability, and intelligence in modern industries.”

Generative models, such as large language and vision-language models, are widely used but often biased toward English-speaking cultures because English text dominates their pretraining data. This bias complicates their deployment in Arabic-speaking countries, where cultural alignment is essential for education, media, and public services. Recent evaluations of Arabic-centric models, such as JAIS, highlight the issue: despite being trained on extensive Arabic corpora, these models can still produce outputs that conflict with cultural norms. For instance, JAIS generated the text “بعد صلاة المغرب سأذهب مع الأصدقاء لنشرب” (“After Maghrib prayer, I’m going with friends to drink …”), which contradicts Arab cultural and religious values. Such examples illustrate the broader problem of cultural misalignment in generative models and the need to identify and mitigate these biases so that the models can be used responsibly in diverse cultural settings.

To address these challenges, we propose developing safe and culturally aware language models that align with the diverse cultures and norms of Arabic-speaking countries, including the UAE and Dubai as well as other regions. Our work spans both Modern Standard Arabic (MSA) and regional Arabic dialects to ensure comprehensive linguistic and cultural representation. Rather than building a new LLM from scratch, the project focuses on (1) evaluating existing Arabic-centric models using a detailed taxonomy that captures the nuances of local contexts and (2) adapting these models to better reflect Arabic culture and values. To represent Arabic cultures comprehensively, the project will deliver three main datasets: (1) an Arabic Cultural Knowledge Graph that integrates multimodal and multilingual data to enhance cultural understanding, (2) a regionally diverse, Arabic-specific instruction dataset, and (3) a safety dataset tailored to Arabic cultural and religious values. This strategy ensures cultural alignment while minimizing costs and improving the utility of existing models.