A proprietary streaming Bangla text-to-speech engine built on a fine-tuned language model and a neural audio codec — 200ms to first audio, running on commodity GPU.
Commercial TTS systems from Google, Amazon, and Microsoft either don't support Bangla at all, or produce robotic, unnatural output that fails on real-world text — news articles, mixed numerals, conjunct consonants, and the full range of Bengali orthography.
We built BanglaTTS to close that gap: a proprietary system that produces natural-sounding, low-latency Bangla speech from arbitrary input text, running on commodity GPU hardware without cloud API dependency.
Fully proprietary — model, training data pipeline, normalisation layer, and inference server are all built and owned by Iterosys.
What made this hard
Modern high-quality TTS (Meta, OpenAI) uses language models to generate discrete audio tokens rather than mel-spectrograms. Adapting this architecture to Bangla meant working through Bengali-specific orthography challenges (conjunct consonants, zero-width joiners, vowel carriers, and number systems) that generic text normalisation pipelines handle poorly or not at all. We built a dedicated normalisation layer on top of bkit that runs before any token reaches the model.
The neural audio codec (SNAC) decodes in batches. Waiting for a full batch means the user hears nothing for seconds. Decoding every token individually is too slow. The answer is two different chunk sizes: fire the first SNAC decode at just 21 codes (3 groups), getting audio to the user in ~200ms, then switch to 210-code chunks for the rest to amortise decoding overhead. The difference in felt responsiveness is dramatic.
Model
Generates sequences of discrete SNAC audio token IDs (128,266+) instead of mel-spectrograms — the same approach used by modern commercial TTS systems.
Prompt format + output
SNAC (hubertsiuzdak/snac_24khz)
Hierarchical neural audio codec with 3 quantization layers. Every 7 LLM tokens decode into one frame of audio: the 7 codes of each frame are redistributed across the 3 layers before decoding.
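The redistribution step can be sketched as follows. This is a minimal illustration, not the production code: it assumes the flat stream has already been mapped from LLM token IDs to codebook indices, and it assumes a 1/2/4 split per frame (coarse, mid, fine). The exact position order inside a frame (`0 | 1, 4 | 2, 3, 5, 6`) is an assumption, and `redistribute` is a hypothetical helper name.

```python
from typing import List, Sequence, Tuple

CODES_PER_FRAME = 7  # from the text: every 7 LLM tokens form one audio frame


def redistribute(codes: Sequence[int]) -> Tuple[List[int], List[int], List[int]]:
    """Split a flat code stream into three layers (1, 2 and 4 codes per frame).

    Assumption: within each frame, position 0 feeds the coarse layer,
    positions 1 and 4 the middle layer, and positions 2, 3, 5, 6 the fine layer.
    A trailing partial frame is dropped because it cannot be decoded yet.
    """
    usable = len(codes) - len(codes) % CODES_PER_FRAME
    coarse: List[int] = []
    middle: List[int] = []
    fine: List[int] = []
    for start in range(0, usable, CODES_PER_FRAME):
        frame = codes[start:start + CODES_PER_FRAME]
        coarse.append(frame[0])
        middle.extend([frame[1], frame[4]])
        fine.extend([frame[2], frame[3], frame[5], frame[6]])
    return coarse, middle, fine
```

The three lists would then be turned into tensors and handed to the codec's decoder; that step is omitted here because it depends on the codec library in use.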
Adaptive chunk strategy (the 200ms trick)
Classical TTS (Tacotron2, FastSpeech2) generates mel-spectrograms that a separate vocoder converts to audio. The modern approach — used by Meta, OpenAI, and now BanglaTTS — trains a language model to generate discrete audio token IDs, which a neural codec (SNAC) decodes directly to waveform. This unifies the text and audio spaces under a single model, leverages LLM pre-training for prosody and naturalness, and produces higher quality output from far less training data.
SNAC decodes in batches. A fixed chunk size forces a trade-off: small chunks for low latency waste decoding overhead; large chunks for efficiency feel slow to start. The solution: start with 21 codes (3 SNAC groups) to get audio in front of the user within 200ms, then switch to 210-code chunks for the rest of the stream.
Initial: 21 codes → ~200ms first audio · Subsequent: 210 codes → amortised overhead
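A minimal sketch of the two-size chunking described above, assuming codes arrive one at a time from the model stream. The 21 and 210 thresholds come from the text; the function name `adaptive_chunks` and the end-of-stream handling (flush whole 7-code frames, drop any partial frame) are assumptions made for illustration.

```python
from typing import Iterable, Iterator, List

INITIAL_CHUNK = 21    # 3 SNAC groups: small, so the first audio arrives quickly
STEADY_CHUNK = 210    # 30 SNAC groups: larger, to amortise decode overhead
CODES_PER_FRAME = 7


def adaptive_chunks(
    codes: Iterable[int],
    initial: int = INITIAL_CHUNK,
    steady: int = STEADY_CHUNK,
) -> Iterator[List[int]]:
    """Yield one small first chunk, then larger chunks for the rest of the stream."""
    buffer: List[int] = []
    target = initial
    for code in codes:
        buffer.append(code)
        if len(buffer) == target:
            yield buffer
            buffer = []
            target = steady  # every chunk after the first uses the larger size
    # End of stream: emit whatever whole frames remain (assumed policy).
    usable = len(buffer) - len(buffer) % CODES_PER_FRAME
    if usable:
        yield buffer[:usable]
```

Each yielded chunk would be decoded to PCM and written to the response immediately, so the listener starts hearing audio after the first 21 codes while later chunks are still being generated.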
Generic Unicode normalisation doesn't handle Bengali orthography correctly. Conjunct consonants (হ্যালো), zero-width joiners, vowel carriers (অ), halant marks, and Bengali numeral spelling (২৫ → পঁচিশ) must all be resolved before any token hits the LLM — otherwise the model produces broken or unnatural speech on real-world text.
Uses bkit: halant · vowel-kar · ZW characters · number spelling · punctuation
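To make two of these steps concrete, here is a small standard-library sketch. It does not use bkit and is not the production normaliser: it only shows Unicode composition, blanket removal of zero-width characters (the simplest possible policy; a real layer would likely be more selective, since joiners can be meaningful in Bengali), and detection of Bengali numeral runs so a later stage could spell them out. The names `pre_normalise` and `numeral_runs` are hypothetical.

```python
import re
import unicodedata
from typing import List, Tuple

# ZWSP, ZWNJ, ZWJ and BOM mapped to None so str.translate deletes them.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))

# Runs of Bengali digits, U+09E6 (zero) through U+09EF (nine).
BENGALI_DIGITS = re.compile(r"[\u09e6-\u09ef]+")


def pre_normalise(text: str) -> str:
    """Apply NFC composition, then strip zero-width characters."""
    return unicodedata.normalize("NFC", text).translate(ZERO_WIDTH)


def numeral_runs(text: str) -> List[Tuple[str, int]]:
    """Return each Bengali numeral run with its integer value.

    Python's int() accepts Unicode decimal digits, so no manual mapping is needed.
    Spelling the number out in words is left to a later stage.
    """
    return [(m.group(), int(m.group())) for m in BENGALI_DIGITS.finditer(text)]
```

Spelling the detected values out in Bengali words needs a lookup table rather than digit-by-digit reading, because Bengali number words below one hundred are largely irregular.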
The language model runs in a dedicated vLLM server on port 8000 and is accessed via an OpenAI-compatible streaming API. The FastAPI server (port 8001) handles text normalisation, chunking, SNAC decoding, and WAV streaming. Separating them lets each scale independently and keeps the model loaded in VRAM across concurrent requests.
Nginx (port 8030) → FastAPI → vLLM · each service can be restarted independently
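On the FastAPI side, consuming the model stream comes down to reading server-sent events from the OpenAI-compatible endpoint. The sketch below shows only the parsing, under stated assumptions: it assumes the completions-style endpoint (`/v1/completions`) is the one in use, and the model name `bangla-tts` and the `max_tokens` value are placeholders, not values from the deployment.

```python
import json
from typing import Dict, Iterable, Iterator


def completion_request(prompt: str, model: str = "bangla-tts") -> Dict[str, object]:
    """Build a streaming completion request body (model name is a placeholder)."""
    return {"model": model, "prompt": prompt, "stream": True, "max_tokens": 2048}


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text pieces from OpenAI-style streaming lines.

    Each event is a line of the form 'data: {json}'; the stream ends with
    'data: [DONE]'. Blank lines and non-data lines are skipped.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        event = json.loads(payload)
        for choice in event.get("choices", []):
            piece = choice.get("text")
            if piece:
                yield piece
```

In a running service the `lines` argument would be the line iterator of a streaming HTTP response from port 8000; the pieces it yields are then mapped to audio codes and passed to the chunking stage.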
Seven audio samples across short sentences, paragraphs, and full news articles — including politics, sports, and international coverage. All generated by the same model, streamed in real time.
We build custom TTS and ASR systems, fine-tuned for your language, domain, and latency requirements. On-premise or cloud.