We recently conducted a short study to evaluate the translation quality of five AI models tasked with translating an “Introductory guide to Artificial Intelligence” from English into Swahili and Zulu. The assessment focused on four key dimensions:
- Accuracy of meaning
- Terminology and technical precision
- Cultural and contextual relevance
- Overall quality and usability
The results showed significant variation in performance across models and languages, with implications for the deployment of AI translation tools in educational contexts across Africa.

Methodology
We used the same prompt for both languages to evaluate the five AI models’ ability to translate educational content into Swahili and Zulu; a sketch of how such a request might look programmatically appears after the list below. The translation prompt specifically instructed the models to:
- Follow standard grammar, syntax, and vocabulary suitable for a general audience
- Maintain accessibility and clarity while preserving linguistic standards
- Adapt cultural references, idioms, and metaphors using culturally appropriate equivalents
- Ensure accuracy with no loss of meaning or nuance from the original English text
- Preserve the document’s formatting and structure
- Maintain the original tone, register, and purpose
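For illustration, the sketch below shows one way such a prompt could be issued programmatically, using OpenAI’s Python SDK as an example. The `translate` function, model name, and prompt template are assumptions for illustration; only the bulleted instructions come from the study’s prompt.

```python
# A minimal sketch of how the same translation prompt could be sent to one
# model (here via OpenAI's Python SDK). The function name, model choice, and
# prompt template are illustrative assumptions, not the study's exact setup;
# the bulleted instructions mirror those listed above.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = """Translate the following document from English into {language}.
- Follow standard grammar, syntax, and vocabulary suitable for a general audience.
- Maintain accessibility and clarity while preserving linguistic standards.
- Adapt cultural references, idioms, and metaphors using culturally appropriate equivalents.
- Ensure accuracy with no loss of meaning or nuance from the original English text.
- Preserve the document's formatting and structure.
- Maintain the original tone, register, and purpose.

Document:
{document}"""

def translate(document: str, language: str, model: str = "gpt-4o") -> str:
    """Send the shared translation prompt to the chosen model."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(
            language=language, document=document)}],
    )
    return response.choices[0].message.content

# Usage (illustrative): swahili = translate(guide_text, "Swahili")
```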
The five AI models we evaluated were:
- 4o (GPT-4o, the latest available model at the time) – ChatGPT, a large language model by OpenAI
- Flash (Gemini Flash) – Google’s efficient large language model
- Sonnet (Claude Sonnet 3.7, the latest available model at the time) – Anthropic’s large language model
- DeepSeek – a Chinese large language model
- Inkubane – a small language model developed by Lelapa and specifically designed for African languages
We also tried to use Vambo, another model with a specific focus on African languages, but could not get its API to work.
A Note on Development Resources
It’s important to note the significant disparity in development resources between some of these models. The large language models (4o, Flash, Sonnet) benefit from substantially larger budgets for training and refinement, computational resources, and datasets compared to Lelapa’s Inkubane, which operates as a small language model (SLM) with more constrained resources but a focused specialisation on African languages.
Evaluation Process
Qualified Zulu and Swahili language practitioners (who are also mother-tongue speakers of the languages) conducted independent assessments of each translation, assigning ratings from 1 (lowest) to 5 (highest) across the four criteria: accuracy of meaning; terminology and technical precision; cultural and contextual relevance; and overall quality and usability.
These evaluators provided both quantitative ratings and detailed qualitative feedback, highlighting real-world usability. The practitioners were asked to assess each translation holistically, considering whether it would be suitable for use in the context of general public education.
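As an illustration of the scoring arithmetic, the sketch below shows one way the per-model averages reported in the next section could be computed from the four criterion ratings. The rating values here are placeholders, not the study’s raw data.

```python
# Hypothetical aggregation of evaluator ratings (1-5) into per-model averages.
# The scores below are placeholder values, not the study's actual data.
from statistics import mean

CRITERIA = [
    "accuracy of meaning",
    "terminology and technical precision",
    "cultural and contextual relevance",
    "overall quality and usability",
]

# ratings[model][criterion] -> the evaluator's 1-5 score
ratings = {
    "4o": dict.fromkeys(CRITERIA, 4.5),
    "Flash": dict.fromkeys(CRITERIA, 4.0),
}

def model_average(scores):
    """Average one model's scores across the four criteria."""
    return round(mean(scores[c] for c in CRITERIA), 2)

for model, scores in ratings.items():
    print(f"{model}: {model_average(scores)}/5")  # e.g. "4o: 4.5/5"
```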
Key Findings
Below is a summary of the evaluators’ scoring and comments on the performance of the different models, per language.
Swahili Translations
- 4o (GPT-4o): 4.5/5 average – Demonstrated the highest overall quality.
- Flash: 4.1/5 average – Strong performance with notable grammatical issues.
- Sonnet: 3.9/5 average – Good readability but showed inconsistencies.
- DeepSeek: 3.5/5 average – Adequate meaning preservation with significant awkwardness in language.
- Lelapa: 1.25/5 average – Poor performance despite African language specialisation.

Zulu Translations
- 4o: 4.1/5 average – Slightly outperformed Sonnet and Lelapa.
- Sonnet: 3.9/5 average – Very good translation with minor terminology issues.
- Lelapa: 3.75/5 average – Strong performance.
- Flash: 2.5/5 average – Significant issues with terminology and consistency.
- DeepSeek: 1.5/5 average – Poor translation quality; only the bullet points were coherent.

Cross-Language Performance Analysis
The results reveal striking differences in model performance across languages:
- Consistency Across Models: 4o and Sonnet maintained relatively high performance in both languages, with 4o demonstrating the most consistent cross-language performance.
- Language-Specific Challenges: DeepSeek underperformed in both languages, while Flash showed more pronounced difficulties with Zulu than with Swahili.
- Lelapa’s Dramatic Improvement in Zulu: While Lelapa performed poorly in Swahili (1.25/5), it showed substantially better results in Zulu (3.75/5), suggesting stronger training data or optimisation for certain African languages. A member of Lelapa’s technical team explained that they were still in the process of evaluating and fine-tuning Swahili, which was not its best-performing language.

Performance by Dimension
- Accuracy of Meaning
  - Swahili: 4o excelled (4.5/5), while Lelapa showed significant issues (1.5/5)
  - Zulu: Multiple models achieved strong scores (4/5), with only DeepSeek and Flash struggling
- Terminology and Technical Precision
  - Swahili: Technical term handling varied significantly, with 4o performing best (4.5/5)
  - Zulu: More moderate performance across models, with 4o leading (4/5)
- Cultural and Contextual Relevance
  - Swahili: Lowest-scoring dimension overall, with even top models struggling
  - Zulu: Generally stronger performance, with Lelapa and Sonnet achieving 4/5
- Overall Quality and Usability
  - Swahili: Clear hierarchy, with 4o leading and Lelapa trailing
  - Zulu: Closer competition, with Sonnet and 4o both achieving 4.5/5 and Lelapa close behind at 4/5
Issues
Language-Specific Challenges
Our language practitioners identified specific problems with the translations, which offer more nuanced insight into the challenges the different models faced:
- Swahili Translations:
  - Grammatical errors making the text sound “awkward and unnatural”
  - Incorrect verb conjugations and poor sentence structure
  - Technical term inconsistencies (e.g., “tokens” left untranslated, “open-source” incorrectly translated)
  - Significant content omissions in some models
- Zulu Translations:
  - Terminological inconsistencies within single translations (e.g., switching between “iselula,” “foni,” and “ucingo” for “phone”)
  - Minor grammatical issues requiring editing (e.g., “qalisa” vs. “qala”)
  - Use of English words instead of Zulu equivalents
Technical Term Management
Both languages showed challenges with:
- Deciding which terms to translate and which to retain in English
- Maintaining consistency throughout documents
- Providing appropriate cultural adaptation of technical concepts
Content Completeness and Accuracy
- Swahili: Lelapa showed content omissions
- Zulu: DeepSeek’s translations were largely incoherent except for bullet points
Conclusion
This comparative study evaluated translation quality across two African languages and revealed significant language-dependent performance variations. The results suggest that multiple factors influence model outcomes, including budget, resource availability, technical approaches, and training methodologies.
The most striking finding is the dramatic difference in Lelapa’s performance between languages, from 1.25/5 in Swahili to 3.75/5 in Zulu. This likely reflects the resource constraints faced by smaller AI companies. With limited budgets, Lelapa must make strategic choices about which languages to prioritise for training data collection, model optimisation, and testing. As a South African company, it likely had greater access to Zulu language datasets. Unlike well-funded LLMs that can afford broad language coverage, resource-constrained models like Inkubane must focus their efforts more selectively, resulting in stronger performance in some African languages than others.
While Lelapa struggled in Swahili, its strong Zulu performance indicates that specialised African language models can be competitive when properly trained. General-purpose models like 4o and Sonnet delivered more consistent cross-language performance. While larger budgets enable more comprehensive language coverage, strategic focus and specialisation may offer a viable path for resource-constrained African language AI development.