Engineering

Optimizing Large Language Models for Sinhala & Niche Languages

Senuth Dilshan
ML Engineer
Nov 28, 2025 · 8 min read

Language is the operating system of culture. Yet the foundation models powering the AI revolution (GPT-5, Llama-4, Claude) are overwhelmingly English-centric. For languages like Sinhala, with a unique script and rich morphology, out-of-the-box performance is often poor. SIVONX is changing that.

The Tokenization Tax

Most standard tokenizers (such as byte-level BPE) fragment Sinhala words into meaningless byte sequences. A sentence that costs a handful of tokens in English can cost 3x-4x as many in Sinhala, inflating both inference cost and latency. We tackled this by training a custom SentencePiece tokenizer on a curated corpus spanning 50 years of Sri Lankan literature and legal texts.
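
As a rough illustration, here is a minimal sketch of training a SentencePiece tokenizer on a Sinhala corpus and inspecting how it segments a sentence. The corpus path, vocabulary size, and coverage value are illustrative assumptions, not our production configuration:

```python
import sentencepiece as spm

# Train a unigram SentencePiece model on a plain-text corpus
# (one sentence per line). Raising character_coverage close to 1.0
# keeps the full Sinhala script as whole pieces instead of letting
# rare characters fall back to fragments.
spm.SentencePieceTrainer.train(
    input="sinhala_corpus.txt",   # hypothetical corpus file
    model_prefix="sinhala_sp",
    vocab_size=32000,
    model_type="unigram",
    character_coverage=0.9995,
)

# Load the trained model and check how a sentence is segmented.
sp = spm.SentencePieceProcessor(model_file="sinhala_sp.model")
sentence = "ශ්‍රී ලංකාව ලස්සන දූපතකි."  # "Sri Lanka is a beautiful island."
pieces = sp.encode(sentence, out_type=str)
print(len(pieces), pieces)  # fewer, word-like pieces vs. byte-level BPE
```

Running the same sentence through a byte-level BPE tokenizer and comparing the two counts makes the "tokenization tax" concrete.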

Fine-Tuning Llama-3-70B

We didn't stop at tokenization. Using Low-Rank Adaptation (LoRA), which trains small adapter matrices while leaving the base weights frozen, we fine-tuned Llama-3-70B for local government applications. The result is a model that captures not just literal meaning, but the honorifics and contextual nuance required for official communication in Sri Lanka.
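
For readers curious about the mechanics, below is a minimal sketch of attaching LoRA adapters to a causal LM with the Hugging Face PEFT library. The rank, dropout, and target modules are illustrative assumptions, not our actual hyperparameters:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Meta-Llama-3-70B"  # gated model; requires HF access

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# LoRA injects small trainable rank-r matrices into the attention
# projections; the 70B base weights stay frozen throughout training.
config = LoraConfig(
    r=16,                     # illustrative rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of total
```

Because only the adapters are trained, a 70B model can be specialized on domain data at a fraction of the memory and compute cost of full fine-tuning.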

Impact

This localized model now powers the customer support systems of three major banks in Colombo, processing queries with near-native fluency and a 60% reduction in hallucination rate compared to base GPT-4.