Speech AI · Bangla NLP

BanglaTTS

A proprietary streaming Bangla text-to-speech engine built on a fine-tuned language model and a neural audio codec, delivering first audio in 200ms on commodity GPU hardware.

200ms · First-token latency
24 kHz · Audio output sample rate
Proprietary · In-house R&D
Listen to live audio demos
Type: Internal R&D
Domain: Bangla NLP · Speech AI
Architecture: LLM + Neural Codec
Status: Live demo
The Problem

Bangla is spoken by 230 million people. Good TTS barely exists.

Commercial TTS systems from Google, Amazon, and Microsoft either don't support Bangla at all, or produce robotic, unnatural output that fails on real-world text — news articles, mixed numerals, conjunct consonants, and the full range of Bengali orthography.

We built BanglaTTS to close that gap: a proprietary system that produces natural-sounding, low-latency Bangla speech from arbitrary input text, running on commodity GPU hardware without cloud API dependency.

What We Built
  • End-to-end neural TTS pipeline
  • LLM-based audio code generation
  • SNAC neural audio codec for decoding
  • Two-stage streaming for 200ms first audio
  • Bengali-specific text normalisation
  • vLLM inference server for production
  • FastAPI streaming endpoint (WAV over HTTP)
  • Full data pipeline — collection to training

Fully proprietary — model, training data pipeline, normalisation layer, and inference server are all built and owned by Iterosys.

What made this hard

LLM-based TTS over a severely underserved language

Modern high-quality TTS (Meta, OpenAI) uses language models to generate discrete audio tokens rather than mel-spectrograms. Adapting this architecture to Bangla meant working through Bengali-specific orthography challenges (conjunct consonants, zero-width joiners, vowel carriers, and number systems) that generic text normalisation pipelines handle poorly or not at all. We built a dedicated normalisation layer on bkit that runs before any token reaches the model.

200ms first-token latency via adaptive chunk sizing

The neural audio codec (SNAC) decodes in batches. Waiting for a full batch means the user hears nothing for seconds. Decoding every token individually is too slow. The answer is two different chunk sizes: fire the first SNAC decode at just 21 codes (3 groups), getting audio to the user in ~200ms, then switch to 210-code chunks for the rest to amortise decoding overhead. The difference in felt responsiveness is dramatic.

System Architecture

Text in. Audio out. 200ms.

Stage 1 — Text Normalisation (bkit)
Unicode normalisation
Standardise Bengali characters
Halant normalisation
Conjunct consonant handling
Vowel carrier normalisation
Vowel diacritic correction
Number spelling
২৫ → পঁচিশ (Bengali words)
Punctuation spacing
Clean sentence boundaries
Text chunking
~150 chars per LLM call
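
The text chunking step can be sketched as below. This is a minimal illustration, not the production code: the function name `chunk_text`, the greedy packing strategy, and the choice of sentence-ending characters are all assumptions. Only the ~150-character budget comes from the pipeline description above.

```python
import re


def chunk_text(text: str, max_chars: int = 150) -> list[str]:
    """Greedily pack sentences into chunks of at most max_chars code points.

    Hypothetical helper: length is counted in Unicode code points, not in
    grapheme clusters, which is a simplification for Bengali text.
    """
    # Split after the danda, question mark, or exclamation mark.
    sentences = [s.strip() for s in re.split(r"(?<=[।?!])\s+", text) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A sentence longer than the budget falls back to word-level pieces.
        pieces = [sentence] if len(sentence) <= max_chars else sentence.split()
        for piece in pieces:
            candidate = f"{current} {piece}".strip()
            if len(candidate) <= max_chars:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                # A single word longer than the budget becomes its own chunk.
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk would then become one LLM call, so sentence boundaries are preferred as split points to keep prosody intact within a call.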
Stage 2 — Language Model (vLLM server)

Model

Custom fine-tuned LLM · Parameter-efficient fine-tuning

Generates sequences of discrete SNAC audio token IDs (128,266+) instead of mel-spectrograms — the same approach used by modern commercial TTS systems.

Prompt format + output

# input
[SOH] voice_name: normalised_text [EOT][EOH]

# output
[SOS] 128,271 128,394 128,511 … [EOS]
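
A rough sketch of how the prompt could be assembled and the audio tokens pulled out of the model output. The bracketed markers are copied literally from the format above and stand in for whatever special tokens the model actually uses. The helper names are hypothetical, and whether any per-position offset is applied on top of the 128,266 base is not stated here, so only the base is subtracted.

```python
AUDIO_TOKEN_BASE = 128_266  # from the text: audio token IDs start at 128,266


def build_prompt(voice_name: str, normalised_text: str) -> str:
    """Fill the input template; the markers are placeholders, not real token strings."""
    return f"[SOH] {voice_name}: {normalised_text} [EOT][EOH]"


def audio_codes(token_ids: list[int]) -> list[int]:
    """Keep only audio tokens and shift them to start at zero.

    Assumption: every ID at or above AUDIO_TOKEN_BASE is an audio token.
    """
    return [t - AUDIO_TOKEN_BASE for t in token_ids if t >= AUDIO_TOKEN_BASE]
```

Anything below the base (text tokens, start and end markers) is dropped before the codes reach the codec.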
Stage 3 — SNAC Neural Codec + Two-Stage Streaming

SNAC (hubertsiuzdak/snac_24khz)

Hierarchical neural audio codec. Every 7 LLM tokens decode into one frame of audio. Before decoding, the audio codes are redistributed across SNAC's 3 quantization layers.

24 kHz output · 16-bit PCM · Mono channel · GPU/CPU
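
One way the redistribution of 7 codes per frame across 3 layers might look. The 1 + 2 + 4 split and the exact position-to-layer mapping below are assumptions, borrowed from a layout seen in some open-source LLM-plus-SNAC pipelines. The mapping BanglaTTS actually uses is not documented here.

```python
def redistribute(codes: list[int]) -> tuple[list[int], list[int], list[int]]:
    """Split a flat code stream into three layers, 7 codes per frame.

    Assumed layout per frame: position 0 goes to the coarse layer,
    positions 1 and 4 to the middle layer, positions 2, 3, 5, 6 to the fine layer.
    """
    usable = len(codes) - len(codes) % 7  # ignore a trailing partial frame
    coarse: list[int] = []
    middle: list[int] = []
    fine: list[int] = []
    for start in range(0, usable, 7):
        frame = codes[start:start + 7]
        coarse.append(frame[0])
        middle.extend([frame[1], frame[4]])
        fine.extend([frame[2], frame[3], frame[5], frame[6]])
    return coarse, middle, fine
```

The three lists would then be converted to tensors and handed to the codec's decoder, one tensor per quantization layer.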

Adaptive chunk strategy (the 200ms trick)

Initial chunk — 21 codes (3 groups)
Fire SNAC decode immediately → user hears first audio in ~200ms
Streaming chunks — 210 codes (30 groups)
10× larger batches amortise decoding cost for remaining audio
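
The two-stage strategy above can be sketched as a generator. This is a simplified, synchronous illustration under stated assumptions: the name `adaptive_chunks` is hypothetical, and the real pipeline presumably consumes an async token stream. Only the 21 and 210 chunk sizes and the group size of 7 come from the description.

```python
from typing import Iterable, Iterator

GROUP = 7  # codes per SNAC frame, as described above


def adaptive_chunks(
    codes: Iterable[int], first: int = 21, rest: int = 210
) -> Iterator[list[int]]:
    """Yield one small chunk first, then larger chunks for the rest of the stream."""
    buffer: list[int] = []
    target = first
    for code in codes:
        buffer.append(code)
        if len(buffer) == target:
            yield buffer
            buffer = []
            target = rest  # switch to the larger size after the first chunk
    # At end of stream, flush whatever whole groups remain.
    usable = len(buffer) - len(buffer) % GROUP
    if usable:
        yield buffer[:usable]
```

For a 500-code stream this yields chunks of 21, 210, 210, and 56 codes: the first decode fires after 3 groups, and the 3 leftover codes that do not fill a group are dropped.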
Stage 4 — Delivery
Chunked WAV stream over HTTP
FastAPI StreamingResponse · 5–50ms fade-in/out · Nginx reverse proxy · Async cancellation
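
A standard-library sketch of the delivery step: a WAV header written before the total length is known, followed by PCM chunks, with a short linear fade-in on the first chunk. The helper names, the 5ms default, and applying the fade only to the first chunk are assumptions. A generator like `stream_wav` is the kind of iterable that could be passed to FastAPI's `StreamingResponse` with `media_type="audio/wav"`.

```python
import struct
import sys
from array import array
from typing import Iterable, Iterator


def wav_header(sample_rate: int = 24_000, channels: int = 1, bits: int = 16) -> bytes:
    """Build a 44-byte PCM WAV header with placeholder sizes for streaming."""
    block_align = channels * bits // 8
    byte_rate = sample_rate * block_align
    unknown = 0xFFFFFFFF  # total length is not known when the stream starts
    fmt = struct.pack("<IHHIIHH", 16, 1, channels, sample_rate, byte_rate, block_align, bits)
    return b"RIFF" + struct.pack("<I", unknown) + b"WAVE" + b"fmt " + fmt + b"data" + struct.pack("<I", unknown)


def fade_in(pcm: bytes, sample_rate: int = 24_000, fade_ms: int = 5) -> bytes:
    """Linearly ramp the first fade_ms of 16-bit little-endian mono PCM."""
    samples = array("h")
    samples.frombytes(pcm)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    n = min(len(samples), sample_rate * fade_ms // 1000)
    for i in range(n):
        samples[i] = int(samples[i] * i / n)
    if sys.byteorder == "big":
        samples.byteswap()
    return samples.tobytes()


def stream_wav(pcm_chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Yield the header, then each PCM chunk, fading in the first one."""
    yield wav_header()
    for index, chunk in enumerate(pcm_chunks):
        yield fade_in(chunk) if index == 0 else chunk
```

The fade removes the click that can occur when playback starts on a non-zero sample.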
Engineering Decisions

Four choices that define the system

01

LLM + neural codec over traditional mel-spectrogram pipelines

Classical TTS (Tacotron2, FastSpeech2) generates mel-spectrograms that a separate vocoder converts to audio. The modern approach — used by Meta, OpenAI, and now BanglaTTS — trains a language model to generate discrete audio token IDs, which a neural codec (SNAC) decodes directly to waveform. This unifies the text and audio spaces under a single model, leverages LLM pre-training for prosody and naturalness, and produces higher quality output from far less training data.

No mel-spectrogram intermediate
Prosody from LLM pretraining
Fine-tuned on minimal audio data
Scales with LLM capability
02

Two-stage adaptive chunk sizing

SNAC decodes in batches. A fixed chunk size forces a trade-off: small chunks give low latency but pay the per-call decoding overhead many times over, while large chunks are efficient but slow to start. The solution: start with 21 codes (3 SNAC groups) to get audio in front of the user within 200ms, then switch to 210-code chunks for the rest of the stream.

Initial: 21 codes → ~200ms first audio · Subsequent: 210 codes → amortised overhead
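
To make the trade-off concrete, the number of decoder calls for a stream can be counted under each strategy. This is illustrative arithmetic only: the 2,100-code stream length in the note below is an arbitrary example, and `decode_calls` is a hypothetical helper.

```python
import math


def decode_calls(total_codes: int, first: int, rest: int) -> int:
    """Count decoder invocations for one first-size chunk followed by rest-size chunks."""
    if total_codes <= first:
        return 1
    return 1 + math.ceil((total_codes - first) / rest)
```

For 2,100 codes, fixed 21-code chunks need `decode_calls(2100, 21, 21)` = 100 calls, and fixed 210-code chunks need 10 calls but make the listener wait for the first 210 codes. The adaptive scheme, `decode_calls(2100, 21, 210)`, needs 11 calls while still starting after 21 codes.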

03

Bengali-specific normalisation before tokenisation

Generic Unicode normalisation doesn't handle Bengali orthography correctly. Conjunct consonants (হ্যালো), zero-width joiners, vowel carriers (অ), halant marks, and Bengali numeral spelling (২৫ → পঁচিশ) must all be resolved before any token hits the LLM — otherwise the model produces broken or unnatural speech on real-world text.

Uses bkit: halant · vowel-kar · ZW characters · number spelling · punctuation
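
The kind of work this layer does can be illustrated with the standard library alone. The sketch below does not use bkit or reproduce its API. The rules are deliberately simplified, the function names are hypothetical, and `NUMBER_WORDS` holds only the single example from the text, where a real system needs a full Bengali number lexicon.

```python
import re
import unicodedata

HALANT = "\u09cd"
ZERO_WIDTH = "\u200b\u200c\u200d\ufeff"
NUMBER_WORDS = {25: "পঁচিশ"}  # illustrative single entry, not a real lexicon


def spell_numbers(text: str) -> str:
    """Replace runs of Bengali digits with words when the value is in the table."""
    def replace(match: re.Match) -> str:
        # int() accepts Bengali decimal digits directly.
        return NUMBER_WORDS.get(int(match.group()), match.group())
    return re.sub("[\u09e6-\u09ef]+", replace, text)


def normalise(text: str) -> str:
    text = unicodedata.normalize("NFC", text)
    # Collapse repeated halant marks into one.
    text = re.sub(f"{HALANT}+", HALANT, text)
    # Drop zero-width characters unless they directly follow a halant,
    # where they can change how a conjunct is rendered.
    text = re.sub(f"(?<!{HALANT})[{ZERO_WIDTH}]", "", text)
    text = spell_numbers(text)
    # No space before sentence punctuation, exactly one after it.
    text = re.sub(r"\s*([।,?!])\s*", r"\1 ", text)
    return text.strip()
```

Numbers are spelled out after Unicode normalisation so that digit runs are matched on clean input, and punctuation spacing runs last so inserted words are spaced consistently.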

04

vLLM for inference, separated from the API server

The language model runs in a dedicated vLLM server on port 8000 and is accessed via an OpenAI-compatible streaming API. The FastAPI server (port 8001) handles text normalisation, chunking, SNAC decoding, and WAV streaming. Separating them lets each scale independently and keeps the model loaded in VRAM across concurrent requests.

Nginx (port 8030) → FastAPI → vLLM · each service can be restarted independently
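
A sketch of the client side of that split, assuming the FastAPI process talks to vLLM's OpenAI-compatible completions route with streaming enabled. The model name `bangla-tts` and the `max_tokens` value are placeholders. The parsing follows the server-sent-events framing used by OpenAI-compatible streaming responses, and no network call is made here.

```python
import json
from typing import Iterable, Iterator

# Port 8000 comes from the description above; the route is vLLM's
# OpenAI-compatible completions endpoint.
VLLM_URL = "http://localhost:8000/v1/completions"


def completion_payload(prompt: str, model: str = "bangla-tts", max_tokens: int = 2048) -> dict:
    """Request body for a streaming completion; model and max_tokens are placeholders."""
    return {"model": model, "prompt": prompt, "stream": True, "max_tokens": max_tokens}


def iter_text_deltas(sse_lines: Iterable[str]) -> Iterator[str]:
    """Yield the text of each streamed chunk until the [DONE] sentinel."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        yield json.loads(data)["choices"][0]["text"]
```

In the running service the lines would come from an async HTTP client reading the response body, and each delta would feed the chunking and SNAC decoding stages.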

Tech Stack

What we shipped with

Models
Custom fine-tuned Bangla TTS LLM · SNAC 24kHz (neural codec) · Parameter-efficient fine-tuning
ML / Training
PyTorch · Transformers (HuggingFace) · PEFT · Demucs (voice separation) · Whisper (transcription)
Inference
vLLM · SNAC decoder · bkit (Bengali normalisation) · asyncio streaming
Backend
FastAPI · Uvicorn · Nginx · BeautifulSoup (article fetch)
Data Pipeline
PostgreSQL · S3 / boto3 · SQLAlchemy · Loguru
Live

Hear it yourself

Seven audio samples across short sentences, paragraphs, and full news articles — including politics, sports, and international coverage. All generated by the same model, streamed in real time.

Open audio demo page
Short sentence
আমার সোনার বাংলা…
Historical paragraph
১৯৭১ সালের ১৬ই ডিসেম্বর…
News — Politics
Full article sample
News — International
Full article sample
Work With Us

Need speech AI for your product?

We build custom TTS and ASR systems — fine-tuned for your language, domain, and latency requirements. On-premise or cloud.