Production backend for a multi-provider LLM aggregation platform — four AI models unified under one credit system, shipped from spec to live in six weeks.
Everything shipped in those 6 weeks
The Lipika AI team needed a production backend that could route requests across Claude, GPT-4, Gemini, and Grok — with per-model credit metering, subscription billing, and a growing suite of AI tools including image generation, video generation (Gemini Veo), long-horizon deep research, and real-time speech-to-text.
The system also had to handle real-time SSE streaming, long-running async jobs up to 30 minutes, team workspaces with seat-based billing, SSLCommerz payment integration, and a Kubernetes deployment on GCP that could scale without multiplying database connection counts.
Iterosys owns the entire backend surface — API design through to Kubernetes deployment and observability. The frontend was built by the client's own team.
What made this hard
Most platforms bill at the request boundary — charge when the call comes in, reconcile later. We built the billing logic directly into each LangGraph ToolNode: check quota, execute, deduct atomically on success, roll back on failure. Users are never charged for infrastructure errors or failed API calls. Getting this right across four LLM providers plus image gen, video gen, and deep research — each with a different cost model — required financial-grade correctness at the orchestration layer.
Deep research tasks run for up to 30 minutes. The standard answer is Celery, a message broker, and a separate worker fleet, all of which have to be operated for the life of the product. We handled it with native asyncio tasks, Redis status tracking, webhook callbacks from the research provider, and reconciliation cron jobs. No broker, no extra pods, nothing new to operate. The result is simpler, cheaper, and consistent with the async-first architecture.
LLM Providers (factory pattern)
Tool Nodes (credit-metered)
Credits are deducted inside each LangGraph ToolNode — not at the request boundary. The middleware checks per-model quotas and daily limits before execution, then decrements atomically on success and rolls back on failure. Users are never charged for failed API calls, provider timeouts, or infrastructure errors. This works across four LLM providers plus image gen, video gen, deep research, and STT — each with independent cost models and rate configurations.
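The check, execute, deduct-or-roll-back flow described above can be sketched roughly as follows. This is a minimal illustration, not the production code: `CreditLedger`, `QuotaExceeded`, and `metered_call` are hypothetical names, and the in-memory ledger stands in for whatever transactional store actually holds balances. A real deployment would need the reserve and commit steps to be atomic in the database or cache.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable


class QuotaExceeded(Exception):
    """Raised when the user cannot cover the cost of a tool call."""


@dataclass
class CreditLedger:
    """In-memory stand-in for the real credit store (hypothetical)."""
    balance: int
    reserved: int = 0

    def reserve(self, cost: int) -> None:
        # Quota check before execution: only unreserved credits count.
        available = self.balance - self.reserved
        if available < cost:
            raise QuotaExceeded(f"need {cost}, have {available}")
        self.reserved += cost

    def commit(self, cost: int) -> None:
        # Success path: turn the reservation into a real deduction.
        self.reserved -= cost
        self.balance -= cost

    def release(self, cost: int) -> None:
        # Failure path: drop the reservation; the balance is untouched.
        self.reserved -= cost


async def metered_call(
    ledger: CreditLedger,
    cost: int,
    tool: Callable[..., Awaitable[Any]],
    *args: Any,
) -> Any:
    """Run `tool` and charge `cost` credits only if it succeeds."""
    ledger.reserve(cost)
    try:
        result = await tool(*args)
    except BaseException:
        # Covers provider errors, timeouts, and task cancellation.
        ledger.release(cost)
        raise
    ledger.commit(cost)
    return result
```

Reserving before the call, rather than only checking the balance, keeps two concurrent tool calls from both passing the quota check against the same credits.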
Image generation, video generation (Gemini Veo), and deep research (Valyu) run as native asyncio tasks — with Redis status tracking and reconciliation cron jobs. No Celery broker, no separate worker pods.
Fewer moving parts · fits the async-first FastAPI design · no broker to operate
LangGraph writes a checkpoint row on every agent step. We give it a dedicated PostgreSQL database — separate from the app DB — to avoid Alembic migration conflicts and allow independent backup schedules and connection pool sizing.
15 app conns + 6 checkpoint conns per pod — predictable ceiling as the fleet scales
The sse-starlette library raises ExceptionGroup errors in the test suite. We use Starlette's StreamingResponse directly with data: {json}\n\n formatting — more control, clean test isolation, and GZip bypass on the /stream path.
Timeout override: 600s for SSE · GZip middleware bypassed on /stream
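The `data: {json}\n\n` framing is simple enough to produce by hand. The sketch below shows one way to do it with an async generator; the payload fields (`index`, `token`, `done`) and the function names are illustrative assumptions, not the platform's actual event schema.

```python
import asyncio
import json
from typing import AsyncIterator, Iterable


def sse_frame(payload: dict) -> bytes:
    """Encode one event as a single SSE data frame.

    json.dumps without `indent` never emits raw newlines, so the
    payload always fits on one `data:` line.
    """
    return f"data: {json.dumps(payload)}\n\n".encode("utf-8")


async def event_stream(tokens: Iterable[str]) -> AsyncIterator[bytes]:
    """Yield one frame per token, then a final completion frame."""
    for index, token in enumerate(tokens):
        yield sse_frame({"index": index, "token": token})
        # Hand control back to the event loop between frames,
        # as a real model stream would while awaiting the provider.
        await asyncio.sleep(0)
    yield sse_frame({"done": True})  # completion marker (assumed convention)
```

In a Starlette or FastAPI route, a generator like this would be passed to `StreamingResponse` with `media_type="text/event-stream"`. Because the frames are plain bytes, tests can consume the generator directly without an HTTP client.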
Adding workers multiplies connection pools. Instead, each pod runs a single Gunicorn worker — async I/O handles concurrency — and HPA adds pods when traffic increases. Connection counts stay predictable at scale.
1 worker per pod — HPA adds pods at load, not workers per pod
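The arithmetic behind this choice is worth spelling out. The sketch below uses the pool sizes quoted earlier (15 app connections and 6 checkpoint connections per worker); the pod and worker counts in the example are hypothetical.

```python
APP_POOL = 15         # app DB connections per worker (figure from the text)
CHECKPOINT_POOL = 6   # checkpoint DB connections per worker (figure from the text)


def connection_ceiling(pods: int, workers_per_pod: int = 1) -> tuple[int, int]:
    """Return the maximum (app, checkpoint) connections the fleet can open.

    Each worker process owns its own pools, so the ceiling is
    pods * workers_per_pod * pool size for each database.
    """
    workers = pods * workers_per_pod
    return workers * APP_POOL, workers * CHECKPOINT_POOL


# Ten pods with one worker each.
print(connection_ceiling(10))      # (150, 60)
# The same ten pods with four workers each.
print(connection_ceiling(10, 4))   # (600, 240)
```

With one worker per pod, the pod count is the only variable, so the autoscaler's maximum replica setting translates directly into a hard upper bound on connections to each database.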
Kubernetes HPA scales the pod count automatically under load. Each pod runs a single worker, so connection counts grow linearly with the number of pods and stay predictable. Because the design is async-first, a single Uvicorn worker handles hundreds of concurrent SSE streams without thread overhead. The deployment targets 80–100k daily active users.
We specialise in production AI backends — LLM orchestration, billing systems, async pipelines, and Kubernetes deployments on GCP or AWS.