BanglaTTS

The Problem

Bangla is spoken by 230 million people. Good TTS barely exists.

Commercial TTS systems from Google, Amazon, and Microsoft either don't support Bangla at all, or produce robotic, unnatural output that fails on real-world text — news articles, mixed numerals, conjunct consonants, and the full range of Bengali orthography.

We built BanglaTTS to close that gap: a proprietary system that produces natural-sounding, low-latency Bangla speech from arbitrary input text, running on commodity GPU hardware without cloud API dependency.

What We Built

End-to-end neural TTS pipeline
LLM-based audio code generation
SNAC neural audio codec for decoding
Two-stage streaming for 200ms first audio
Bengali-specific text normalisation
vLLM inference server for production
FastAPI streaming endpoint (WAV over HTTP)
Full data pipeline — collection to training

Fully proprietary — model, training data pipeline, normalisation layer, and inference server are all built and owned by Iterosys.

What made this hard

LLM-based TTS over a severely underserved language

Modern high-quality TTS (Meta, OpenAI) uses language models to generate discrete audio tokens rather than mel-spectrograms. Adapting this architecture to Bangla meant working through Bengali-specific orthography challenges (conjunct consonants, zero-width joiners, vowel carriers, and number systems) that generic text normalisation pipelines handle poorly or not at all. We built a dedicated normalisation layer on top of bkit that runs before any token reaches the model.

200ms first-audio latency via adaptive chunk sizing

The neural audio codec (SNAC) decodes in batches. Waiting for a full batch means the user hears nothing for seconds, while decoding every token individually is too slow. The answer is two chunk sizes: fire the first SNAC decode at just 21 codes (3 groups of 7), getting audio to the user in ~200ms, then switch to 210-code chunks (30 groups) for the rest to amortise decoding overhead. The stream feels far more responsive as a result.

System Architecture

Text in. Audio out. 200ms.

Stage 1 — Text Normalisation (bkit)

Unicode normalisation

Standardise Bengali characters

Halant normalisation

Conjunct consonant handling

Vowel carrier normalisation

Vowel diacritic correction

Number spelling

২৫ → পঁচিশ (Bengali words)

Punctuation spacing

Clean sentence boundaries

Text chunking

~150 chars per LLM call
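The chunking step can be sketched as a greedy sentence packer. This is a minimal illustration using only the standard library, not bkit's API; the function name `chunk_text`, the sentence-boundary rule, and the choice to keep over-long sentences whole are all assumptions.

```python
import re

# Split after sentence-ending punctuation, including the Bengali danda (।).
_SENTENCE_END = re.compile(r"(?<=[।!?.])\s+")


def chunk_text(text: str, max_chars: int = 150) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars."""
    chunks: list[str] = []
    current = ""
    for sentence in _SENTENCE_END.split(text.strip()):
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars or not current:
            # A single over-long sentence is kept whole (assumption),
            # rather than being cut in the middle of a word.
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk would then become one LLM call. Splitting on sentence boundaries rather than at a fixed character offset keeps prosody intact at chunk edges.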

Stage 2 — Language Model (vLLM server)

Model

Custom fine-tuned LLM · Parameter-efficient fine-tuning

Generates sequences of discrete SNAC audio token IDs (128,266+) instead of mel-spectrograms — the same approach used by modern commercial TTS systems.

Prompt format + output

# input
[SOH] voice_name: normalised_text [EOT][EOH]

# output
[SOS] 128,271 128,394 128,511 … [EOS]
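A small sketch of how the prompt format and output above might be handled in code. It assumes the bracketed markers are written as literal strings and that audio codes are recovered by subtracting the 128,266 base from each token ID; both are illustrative assumptions, and the function names are hypothetical.

```python
AUDIO_TOKEN_BASE = 128_266  # from the text: audio token IDs start at 128,266


def build_prompt(voice_name: str, normalised_text: str) -> str:
    """Wrap already-normalised text in the prompt format shown above."""
    return f"[SOH] {voice_name}: {normalised_text} [EOT][EOH]"


def extract_audio_codes(token_ids: list[int]) -> list[int]:
    """Keep only audio tokens and shift them to zero-based offsets.

    Anything below the base (text or control tokens) is dropped.
    The plain subtraction is an assumption, not the confirmed mapping.
    """
    return [t - AUDIO_TOKEN_BASE for t in token_ids if t >= AUDIO_TOKEN_BASE]
```

For example, `extract_audio_codes([128271, 128394, 128511])` gives offsets `[5, 128, 245]` under this assumption.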

Stage 3 — SNAC Neural Codec + Two-Stage Streaming

SNAC (hubertsiuzdak/snac_24khz)

Hierarchical neural audio codec. Every 7 LLM tokens decode into one frame of audio: the audio codes are redistributed across SNAC's 3 quantization layers before decoding.

24 kHz output · 16-bit PCM · Mono channel · GPU/CPU
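The redistribution step can be illustrated in plain Python. The text states only that 7 tokens form one frame and that there are 3 layers; the 1 + 2 + 4 split and the interleaving order used below are assumptions for illustration, and `redistribute` is a hypothetical name.

```python
def redistribute(codes: list[int]) -> tuple[list[int], list[int], list[int]]:
    """Split a flat code stream into three layers, 7 codes per frame.

    Assumed layout per frame: 1 coarse code, 2 mid codes, 4 fine codes.
    """
    usable = len(codes) - len(codes) % 7  # ignore a trailing partial frame
    coarse: list[int] = []
    mid: list[int] = []
    fine: list[int] = []
    for start in range(0, usable, 7):
        frame = codes[start:start + 7]
        coarse.append(frame[0])
        mid.extend([frame[1], frame[4]])
        fine.extend([frame[2], frame[3], frame[5], frame[6]])
    return coarse, mid, fine
```

The three lists would then be converted to tensors and handed to the codec's decoder. A partial trailing frame is dropped here because it cannot be decoded on its own.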

Adaptive chunk strategy (the 200ms trick)

Initial chunk — 21 codes (3 groups)

Fire SNAC decode immediately → user hears first audio in ~200ms

Streaming chunks — 210 codes (30 groups)

10× larger batches amortise decoding cost for remaining audio
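The two-stage strategy can be sketched as a generator that buffers incoming codes and changes its target size after the first flush. The chunk sizes come from the text; the function name and the handling of leftover codes at the end of the stream are assumptions.

```python
from typing import Iterable, Iterator

INITIAL_CHUNK = 21   # 3 groups of 7 codes: first decode, for low latency
STREAM_CHUNK = 210   # 30 groups of 7 codes: later decodes, for efficiency


def adaptive_chunks(
    codes: Iterable[int],
    initial: int = INITIAL_CHUNK,
    subsequent: int = STREAM_CHUNK,
) -> Iterator[list[int]]:
    """Yield one small chunk first, then larger chunks for the rest."""
    buffer: list[int] = []
    target = initial
    for code in codes:
        buffer.append(code)
        if len(buffer) == target:
            yield buffer
            buffer = []
            target = subsequent  # switch to the larger size after the first flush
    # At end of stream, flush whatever whole groups of 7 remain (assumption).
    usable = len(buffer) - len(buffer) % 7
    if usable:
        yield buffer[:usable]
```

Each yielded chunk would go to one SNAC decode call, so a stream of 301 codes produces decodes of 21, 210, and 70 codes.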

Stage 4 — Delivery

Chunked WAV stream over HTTP

FastAPI StreamingResponse · 5–50ms fade-in/out · Nginx reverse proxy · Async cancellation
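A standard-library sketch of the delivery step: a WAV header for a stream of unknown length, followed by PCM chunks with a short fade-in on the first one. Writing the maximum value into the size fields is a common convention for open-ended streams, assumed here rather than confirmed. The fade is applied to the first chunk only and uses a 5 ms default, which are also assumptions.

```python
import struct
from typing import Iterable, Iterator

SAMPLE_RATE = 24_000  # Hz
CHANNELS = 1          # mono
BITS = 16             # bits per sample


def streaming_wav_header() -> bytes:
    """44-byte PCM WAV header with size fields maxed out (length unknown)."""
    block_align = CHANNELS * BITS // 8
    byte_rate = SAMPLE_RATE * block_align
    return struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", 0xFFFFFFFF, b"WAVE",
        b"fmt ", 16, 1, CHANNELS, SAMPLE_RATE, byte_rate, block_align, BITS,
        b"data", 0xFFFFFFFF,
    )


def fade_in(pcm: bytes, fade_ms: int = 5) -> bytes:
    """Linearly ramp the first fade_ms of 16-bit little-endian PCM from silence."""
    count = len(pcm) // 2
    samples = list(struct.unpack(f"<{count}h", pcm[:count * 2]))
    ramp = min(count, SAMPLE_RATE * fade_ms // 1000)
    for i in range(ramp):
        samples[i] = int(samples[i] * i / ramp)
    return struct.pack(f"<{count}h", *samples)


def wav_stream(pcm_chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Yield the header, then each PCM chunk, fading in the first one."""
    yield streaming_wav_header()
    for index, chunk in enumerate(pcm_chunks):
        yield fade_in(chunk) if index == 0 else chunk
```

A generator like `wav_stream` can be passed directly to FastAPI's `StreamingResponse` with `media_type="audio/wav"`.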

Engineering Decisions

Four choices that define the system

LLM + neural codec over traditional mel-spectrogram pipelines

Classical TTS (Tacotron2, FastSpeech2) generates mel-spectrograms that a separate vocoder converts to audio. The modern approach — used by Meta, OpenAI, and now BanglaTTS — trains a language model to generate discrete audio token IDs, which a neural codec (SNAC) decodes directly to waveform. This unifies the text and audio spaces under a single model, leverages LLM pre-training for prosody and naturalness, and produces higher quality output from far less training data.

No mel-spectrogram intermediate

Prosody from LLM pretraining

Fine-tuned on minimal audio data

Scales with LLM capability

Two-stage adaptive chunk sizing

SNAC decodes in batches. A fixed chunk size forces a trade-off: small chunks give low latency but pay the per-call decoding overhead many times over, while large chunks are efficient but slow to start. The solution: start with 21 codes (3 SNAC groups) to get audio in front of the user within 200ms, then switch to 210-code chunks (30 groups) for the rest of the stream.

Initial: 21 codes → ~200ms first audio · Subsequent: 210 codes → amortised overhead

Bengali-specific normalisation before tokenisation

Generic Unicode normalisation doesn't handle Bengali orthography correctly. Conjunct consonants (হ্যালো), zero-width joiners, vowel carriers (অ), halant marks, and Bengali numeral spelling (২৫ → পঁচিশ) must all be resolved before any token hits the LLM — otherwise the model produces broken or unnatural speech on real-world text.

Uses bkit: halant · vowel-kar · ZW characters · number spelling · punctuation
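To make the idea concrete, here is a minimal standard-library sketch of three of these steps: Unicode composition, zero-width character removal, and punctuation spacing. It does not use or reproduce bkit's API, and stripping zero-width characters outright is an assumption; a production layer may need to keep some joiners.

```python
import re
import unicodedata

# ZWSP, ZWNJ, ZWJ and BOM, mapped to None so str.translate deletes them.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))


def basic_normalise(text: str) -> str:
    """Illustrative pre-tokenisation clean-up for Bengali text."""
    # NFC composes split vowel signs, e.g. U+09C7 + U+09BE becomes U+09CB.
    text = unicodedata.normalize("NFC", text)
    text = text.translate(ZERO_WIDTH)
    # No whitespace before sentence punctuation, exactly one space after it.
    text = re.sub(r"\s+([।!?])", r"\1", text)
    text = re.sub(r"([।!?])(?=\S)", r"\1 ", text)
    return re.sub(r"\s+", " ", text).strip()
```

Number spelling (২৫ → পঁচিশ) is deliberately left out: Bengali number words are irregular and need a full lookup table, which is the kind of work a dedicated library handles.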

vLLM for inference, separated from the API server

The language model runs in a dedicated vLLM server on port 8000 and is accessed via an OpenAI-compatible streaming API. The FastAPI server (port 8001) handles text normalisation, chunking, SNAC decoding, and WAV streaming. Separating them lets each scale independently and keeps the model loaded in VRAM across concurrent requests.

Nginx (port 8030) → FastAPI (port 8001) → vLLM (port 8000) · each service can be restarted independently
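On the FastAPI side, consuming the vLLM stream comes down to building a streaming completion request and reading server-sent events. The sketch below assumes the standard OpenAI-style completions stream (`data: {...}` lines ending with `data: [DONE]`). The model name, `max_tokens` value, and function names are hypothetical.

```python
import json
from typing import Iterable, Iterator

VLLM_URL = "http://localhost:8000/v1/completions"  # port 8000 as described above


def completion_request(prompt: str, model: str = "bangla-tts") -> dict:
    """Build a streaming request body (model name is a placeholder)."""
    return {"model": model, "prompt": prompt, "stream": True, "max_tokens": 1200}


def iter_completion_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from an OpenAI-style server-sent-event stream."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        text = json.loads(payload)["choices"][0].get("text") or ""
        if text:
            yield text
```

Because the parser only needs an iterable of lines, it can be tested without a running server and then fed from any HTTP client's line iterator in production.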

Tech Stack

What we shipped with

Models

Custom fine-tuned Bangla TTS LLM · SNAC 24kHz (neural codec) · Parameter-efficient fine-tuning

ML / Training

PyTorch · Transformers (HuggingFace) · PEFT · Demucs (voice separation) · Whisper (transcription)

Inference

vLLM · SNAC decoder · bkit (Bengali normalisation) · asyncio streaming

Backend

FastAPI · Uvicorn · Nginx · BeautifulSoup (article fetch)

Data Pipeline

PostgreSQL · S3 / boto3 · SQLAlchemy · Loguru

Live

Hear it yourself

Seven audio samples across short sentences, paragraphs, and full news articles — including politics, sports, and international coverage. All generated by the same model, streamed in real time.

Open audio demo page

Short sentence

আমার সোনার বাংলা…

Historical paragraph

১৯৭১ সালের ১৬ই ডিসেম্বর…

News — Politics

Full article sample

News — International