We recently conducted a short study to evaluate the translation quality of five AI models tasked with translating an “Introductory guide to Artificial Intelligence” from English into Swahili and Zulu. The assessment focused on four key dimensions:
- Accuracy of meaning
- Terminology and technical precision
- Cultural and contextual relevance
- Overall quality and usability
The results showed significant variation in performance across models and languages, with implications for the deployment of AI translation tools in educational contexts across Africa.

Methodology
We used the same prompt for both languages to evaluate the five AI models’ ability to translate educational content into Swahili and Zulu; a sketch of how such a request might look programmatically appears after the list below. The translation prompt specifically instructed the models to:
- Follow standard grammar, syntax, and vocabulary suitable for a general audience
- Maintain accessibility and clarity while preserving linguistic standards
- Adapt cultural references, idioms, and metaphors using culturally appropriate equivalents
- Ensure accuracy with no loss of meaning or nuance from the original English text
- Preserve the document’s formatting and structure
- Maintain the original tone, register, and purpose
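For illustration, the sketch below shows one way such a prompt could be issued programmatically, using OpenAI’s Python SDK as an example. The `translate` function, model name, and prompt template are assumptions for illustration; only the bulleted instructions come from the study’s prompt.

```python
# A minimal sketch of how the same translation prompt could be sent to one
# model (here via OpenAI's Python SDK). The function name, model choice, and
# prompt template are illustrative assumptions, not the study's exact setup;
# the bulleted instructions mirror those listed above.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = """Translate the following document from English into {language}.
- Follow standard grammar, syntax, and vocabulary suitable for a general audience.
- Maintain accessibility and clarity while preserving linguistic standards.
- Adapt cultural references, idioms, and metaphors using culturally appropriate equivalents.
- Ensure accuracy with no loss of meaning or nuance from the original English text.
- Preserve the document's formatting and structure.
- Maintain the original tone, register, and purpose.

Document:
{document}"""

def translate(document: str, language: str, model: str = "gpt-4o") -> str:
    """Send the shared translation prompt to the chosen model."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(
            language=language, document=document)}],
    )
    return response.choices[0].message.content

# Usage (illustrative): swahili = translate(guide_text, "Swahili")
```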
The five AI models we evaluated were:
- 4o (GPT-4o, the latest available model at the time) – ChatGPT, a large language model by OpenAI
- Flash (Gemini Flash) – Google’s efficient large language model
- Sonnet (Claude Sonnet 3.7, the latest available model at the time) – Anthropic’s large language model
- DeepSeek – a Chinese large language model
- Inkubane – a small language model developed by Lelapa and specifically designed for African languages
We also tried to use Vambo, another model with a specific focus on African languages, but could not get its API to work.
A Note on Development Resources
It’s important to note the significant disparity in development resources between some of these models. The large language models (4o, Flash, Sonnet) benefit from substantially larger budgets for training and refinement, computational resources, and datasets compared to Lelapa’s Inkubane, which operates as a small language model (SLM) with more constrained resources but a focused specialisation on African languages.
Evaluation Process
Qualified Zulu and Swahili language practitioners (who are also mother-tongue speakers of the languages) conducted independent assessments of each translation, assigning ratings from 1 (lowest) to 5 (highest) across the four criteria: accuracy of meaning; terminology and technical precision; cultural and contextual relevance; and overall quality and usability.
These evaluators provided both quantitative ratings and detailed qualitative feedback, highlighting real-world usability. The practitioners were asked to assess each translation holistically, considering whether it would be suitable for use in the context of general public education.
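As an illustration of the scoring arithmetic, the sketch below shows one way the per-model averages reported in the next section could be computed from the four criterion ratings. The rating values here are placeholders, not the study’s raw data.

```python
# Hypothetical aggregation of evaluator ratings (1-5) into per-model averages.
# The scores below are placeholder values, not the study's actual data.
from statistics import mean

CRITERIA = [
    "accuracy of meaning",
    "terminology and technical precision",
    "cultural and contextual relevance",
    "overall quality and usability",
]

# ratings[model][criterion] -> the evaluator's 1-5 score
ratings = {
    "4o": dict.fromkeys(CRITERIA, 4.5),
    "Flash": dict.fromkeys(CRITERIA, 4.0),
}

def model_average(scores):
    """Average one model's scores across the four criteria."""
    return round(mean(scores[c] for c in CRITERIA), 2)

for model, scores in ratings.items():
    print(f"{model}: {model_average(scores)}/5")  # e.g. "4o: 4.5/5"
```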
Key Findings
Below is a summary of the evaluators’ scoring and comments on the performance of the different models, per language.
Swahili Translations
- 4o (GPT-4o): 4.5/5 average – Demonstrated the highest overall quality.
- Flash: 4.1/5 average – Strong performance with notable grammatical issues.
- Sonnet: 3.9/5 average – Good readability but showed inconsistencies.
- DeepSeek: 3.5/5 average – Adequate meaning preservation with significant awkwardness in language.
- Lelapa: 1.25/5 average – Poor performance despite African language specialisation.

Zulu Translations
- 4o: 4.1/5 average – Slightly outperformed Sonnet and Lelapa.
- Sonnet: 3.9/5 average – Very good translation with minor terminology issues.
- Lelapa: 3.75/5 average – Strong performance.
- Flash: 2.5/5 average – Significant issues with terminology and consistency.
- DeepSeek: 1.5/5 average – Poor translation quality; only the bullet points were coherent.

Cross-Language Performance Analysis
The results reveal striking differences in model performance across languages:
- Consistency Across Models: 4o and Sonnet maintained relatively high performance in both languages, with 4o demonstrating the most consistent cross-language performance.
- Language-Specific Challenges: DeepSeek underperformed in both languages, while Flash showed more pronounced difficulties with Zulu than with Swahili.
- Lelapa’s Dramatic Improvement in Zulu: While Lelapa performed poorly in Swahili (1.25/5), it showed substantially better results in Zulu (3.75/5), suggesting stronger training data or optimisation for certain African languages. A member of Lelapa’s technical team explained that they were still in the process of evaluating and fine-tuning Swahili, which was not its best-performing language.

Performance by Dimension
- Accuracy of Meaning
  - Swahili: 4o excelled (4.5/5), while Lelapa showed significant issues (1.5/5)
  - Zulu: Multiple models achieved strong scores (4/5), with only DeepSeek and Flash struggling
- Terminology and Technical Precision
  - Swahili: Technical term handling varied significantly, with 4o performing best (4.5/5)
  - Zulu: More moderate performance across models, with 4o leading (4/5)
- Cultural and Contextual Relevance
  - Swahili: Lowest-scoring dimension overall, with even top models struggling
  - Zulu: Generally stronger performance, with Lelapa and Sonnet achieving 4/5
- Overall Quality and Usability
  - Swahili: Clear hierarchy, with 4o leading and Lelapa trailing
  - Zulu: Closer competition, with Sonnet and 4o both achieving 4.5/5 and Lelapa close behind at 4/5
Issues
Language-Specific Challenges
Our language practitioners identified specific problems with the translations, which offer more nuanced insight into the challenges the different models faced:
- Swahili Translations:
  - Grammatical errors making the text sound “awkward and unnatural”
  - Incorrect verb conjugations and poor sentence structure
  - Technical term inconsistencies (e.g., “tokens” left untranslated, “open-source” incorrectly translated)
  - Significant content omissions in some models
- Zulu Translations:
  - Terminological inconsistencies within single translations (e.g., switching between “iselula,” “foni,” and “ucingo” for “phone”)
  - Minor grammatical issues requiring editing (e.g., “qalisa” vs. “qala”)
  - Use of English words instead of Zulu equivalents
Technical Term Management
Both languages showed challenges with:
- Deciding which terms to translate and which to retain in English
- Maintaining consistency throughout documents
- Providing appropriate cultural adaptation of technical concepts
Content Completeness and Accuracy
- Swahili: Lelapa showed content omissions
- Zulu: DeepSeek’s translations were largely incoherent except for bullet points
Conclusion
This comparative study evaluated translation quality across two African languages and revealed significant language-dependent performance variations. The results suggest that multiple factors influence model outcomes, including budget, resource availability, technical approaches, and training methodologies.
The most striking finding is the dramatic difference in Lelapa’s performance between languages, from 1.25/5 in Swahili to 3.75/5 in Zulu. This likely reflects the resource constraints faced by smaller AI companies. With limited budgets, Lelapa must make strategic choices about which languages to prioritise for training data collection, model optimisation, and testing. As a South African company, it likely had greater access to Zulu language datasets. Unlike well-funded LLMs that can afford broad language coverage, resource-constrained models like Inkubane must focus their efforts more selectively, resulting in stronger performance in some African languages than others.
While Lelapa struggled in Swahili, its strong Zulu performance indicates that specialised African language models can be competitive when properly trained. General-purpose models like 4o and Sonnet delivered more consistent cross-language performance. While larger budgets enable more comprehensive language coverage, strategic focus and specialisation may offer a viable path for resource-constrained African language AI development.