Backend Engineering

Lipika AI

Production backend for a multi-provider LLM aggregation platform — four AI models unified under one credit system, shipped from spec to live in six weeks.

6 weeks: MVP to production
4 providers: LLM models unified
Backend only: our scope

Everything shipped in those 6 weeks

Email/password + Google OAuth
Multi-LLM routing (4 providers)
Credit billing + subscription plans
Per-seat team workspaces
SSLCommerz payment gateway
Image & video generation pipelines
30-min deep research (Valyu)
Real-time SSE streaming
Speech-to-text via WebSocket
PDF & DOCX export
Kubernetes deployment on GCP
Langfuse + Sentry observability

Client: Lipika AI
Our scope: Backend Engineering
Category: AI Platform · SaaS
Status: Live in production

The Challenge

One product.
Four AI models.
Zero compromise.

The Lipika AI team needed a production backend that could route requests across Claude, GPT-4, Gemini, and Grok — with per-model credit metering, subscription billing, and a growing suite of AI tools including image generation, video generation (Gemini Veo), long-horizon deep research, and real-time speech-to-text.

The system also had to handle real-time SSE streaming, long-running async jobs up to 30 minutes, team workspaces with seat-based billing, SSLCommerz payment integration, and a Kubernetes deployment on GCP that could scale without multiplying database connection counts.

Our Role
  • Full backend engineering
  • API design & architecture
  • Auth — email/password & Google OAuth
  • LLM orchestration (LangGraph)
  • Billing & payments integration
  • Async job infrastructure
  • Kubernetes deployment on GCP
  • Frontend engineering (out of scope: built by the client's own team)

Iterosys owns the entire backend surface — API design through to Kubernetes deployment and observability. The frontend was built by the client's own team.

What made this hard

Credit metering inside the agent, not around it

Most platforms bill at the request boundary — charge when the call comes in, reconcile later. We built the billing logic directly into each LangGraph ToolNode: check quota, execute, deduct atomically on success, roll back on failure. Users are never charged for infrastructure errors or failed API calls. Getting this right across four LLM providers plus image gen, video gen, and deep research — each with a different cost model — required financial-grade correctness at the orchestration layer.

30-minute async jobs with zero extra infrastructure

Deep research tasks run for up to 30 minutes. The standard answer is Celery, a message broker, and a separate worker fleet, which together form an ops burden that never goes away. We handled it with native asyncio tasks, Redis status tracking, webhook callbacks from the research provider, and reconciliation crons. No broker, no extra pods, nothing new to operate. The result is simpler and cheaper, and it fits the async-first architecture.

System Architecture

Four layers, one coherent system

Client
Web Application (Next.js)
REST / JSON · SSE Streaming · WebSocket (STT)
API Layer — FastAPI (ASGI)
Security Headers: CSP · X-Frame-Options · X-Content-Type-Options
JWT Auth: access + refresh tokens, Google OAuth 2.0
Rate Limiting: SlowAPI, per-user + IP fallback
Request Tracing: UUID X-Request-ID on every response
Timeout Guard: 60s default · 600s for SSE streams
GZip Bypass: GZip disabled on /stream to keep SSE stream-safe
Intelligence Layer — LangGraph Agent

LLM Providers (factory pattern)

Claude · GPT-4 · Gemini · Grok

Tool Nodes (credit-metered)

Image Gen · Video Gen (Veo) · Deep Research · Web Search · Speech-to-Text · PDF / DOCX
Data Layer
PostgreSQL: app data, billing, users
PostgreSQL (checkpoint): LangGraph state, isolated DB
Redis: cache · job queue · rate limits
Cloud Storage: file uploads · media assets
Infrastructure & Observability
GCP · Kubernetes · Horizontal Pod Autoscaler · Cloud SQL · Docker multi-stage · Langfuse · Sentry · Bandit (security scan)
Engineering Decisions

Five choices that shaped the system

01

Credit metering at the tool-execution layer, with rollback

Credits are deducted inside each LangGraph ToolNode — not at the request boundary. The middleware checks per-model quotas and daily limits before execution, then decrements atomically on success and rolls back on failure. Users are never charged for failed API calls, provider timeouts, or infrastructure errors. This works across four LLM providers plus image gen, video gen, deep research, and STT — each with independent cost models and rate configurations.

Quota check before execution
Atomic deduction on success
Rollback on any exception
Per-model + per-tool rate config
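
The four steps above can be sketched in a few lines. This is a minimal illustration, not the production code: `CreditLedger`, `InsufficientCredits`, and `run_metered` are hypothetical names, the ledger is an in-memory stand-in for the real billing store, and "rollback" here simply means the deduction is never committed when the tool raises. A real implementation would presumably wrap the deduction in a database transaction.

```python
from dataclasses import dataclass
from typing import Any, Callable


class InsufficientCredits(Exception):
    """Raised by the quota check, before any tool work is attempted."""


@dataclass
class CreditLedger:
    # Hypothetical in-memory stand-in for the real billing store.
    balance: int

    def check(self, cost: int) -> None:
        # Step 1: quota check before execution.
        if cost > self.balance:
            raise InsufficientCredits(f"need {cost}, have {self.balance}")

    def deduct(self, cost: int) -> None:
        # Step 3: a single deduction, applied only after success.
        self.balance -= cost


def run_metered(ledger: CreditLedger, cost: int, tool: Callable[..., Any], *args: Any) -> Any:
    """Check quota, run the tool, and charge only if the tool succeeds."""
    ledger.check(cost)
    # Step 2: execute. Any exception propagates before deduct() runs,
    # so a failed call leaves the balance untouched (step 4).
    result = tool(*args)
    ledger.deduct(cost)
    return result
```

A per-tool cost table (image generation, video generation, deep research, and so on) would supply the `cost` argument; the values used in any example are placeholders, not Lipika AI's actual rates.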
02

Celery-free background jobs

Image generation, video generation (Gemini Veo), and deep research (Valyu) run as native asyncio tasks — with Redis status tracking and reconciliation cron jobs. No Celery broker, no separate worker pods.

Fewer moving parts · fits the async-first FastAPI design · no broker to operate

03

Isolated LangGraph checkpoint DB

LangGraph writes a checkpoint row on every agent step. We give it a dedicated PostgreSQL database — separate from the app DB — to avoid Alembic migration conflicts and allow independent backup schedules and connection pool sizing.

15 app conns + 6 checkpoint conns per pod — predictable ceiling as the fleet scales
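
To make the ceiling concrete, the per-pod figures can be multiplied out for a given fleet size. The pool sizes come from the numbers above; the function name and the pod count in the usage note are illustrative assumptions.

```python
APP_POOL_PER_POD = 15         # app DB connections per pod (figure quoted above)
CHECKPOINT_POOL_PER_POD = 6   # checkpoint DB connections per pod (figure quoted above)


def connection_ceiling(pods: int) -> dict[str, int]:
    """Worst-case connection counts when every pod fills both pools."""
    app = pods * APP_POOL_PER_POD
    checkpoint = pods * CHECKPOINT_POOL_PER_POD
    return {"app_db": app, "checkpoint_db": checkpoint, "total": app + checkpoint}
```

For a hypothetical fleet of 10 pods, `connection_ceiling(10)` gives 150 app connections and 60 checkpoint connections, 210 in total. Each database's connection limit can then be sized against its own line.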

04

Manual SSE, not sse-starlette

The sse-starlette library raises ExceptionGroup errors in the test suite. We use Starlette's StreamingResponse directly with data: {json}\n\n formatting — more control, clean test isolation, and GZip bypass on the /stream path.

Timeout override: 600s for SSE · GZip middleware bypassed on /stream
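
The wire format is simple enough to sketch with the standard library alone. The payload keys (`delta`, `done`) and the function names are hypothetical; only the `data: {json}\n\n` framing comes from the description above.

```python
import asyncio
import json
from typing import AsyncIterator, Iterable


def format_sse(payload: dict) -> str:
    """Frame one JSON payload as a single SSE event.

    json.dumps escapes newlines inside strings, so the payload always
    fits on one data: line, terminated by a blank line.
    """
    return f"data: {json.dumps(payload)}\n\n"


async def event_stream(chunks: Iterable[str]) -> AsyncIterator[str]:
    """Yield one SSE event per chunk, then a final completion event."""
    for chunk in chunks:
        yield format_sse({"delta": chunk})
        await asyncio.sleep(0)  # hand control back to the event loop between events
    yield format_sse({"done": True})
```

In a Starlette or FastAPI handler, a generator like this would typically be passed to `StreamingResponse(event_stream(...), media_type="text/event-stream")`. Because it is a plain async generator, it can be tested without spinning up an application.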

05

One worker per pod, scale horizontally

Adding workers multiplies connection pools. Instead, each pod runs a single Gunicorn worker — async I/O handles concurrency — and HPA adds pods when traffic increases. Connection counts stay predictable at scale.

1 worker per pod — HPA adds pods at load, not workers per pod

Tech Stack

What we shipped with

AI / Orchestration
LangGraph 1.1.6 · LangChain · langchain-anthropic · langchain-openai · langchain-google-genai · langchain-xai · Tavily Search · OpenAI Whisper (STT)
Backend
Python 3.13 · FastAPI 0.129 · Pydantic v2 · SQLAlchemy 2 (async) · Alembic · asyncpg · Uvicorn + Gunicorn
Auth & Security
JWT (access + refresh) · Google OAuth 2.0 · Bcrypt (password hashing) · SecurityHeadersMiddleware · Bandit (static analysis)
Data & Storage
PostgreSQL (Cloud SQL) · Redis 7.2 · Google Cloud Storage · LangGraph Checkpoint DB
Infrastructure
GCP · Kubernetes + HPA · Docker (multi-stage) · SSLCommerz (payments)
Observability
Langfuse 4.0 · Sentry SDK · Loguru (structured JSON) · pytest + asyncio
Outcomes

From spec to production

6 weeks: MVP to live production
Full backend: auth, billing, LLM routing, streaming, Kubernetes deployment

4 models: unified under one API
Claude · GPT-4 · Gemini · Grok, swappable via LangGraph factory pattern

0 Celery background job workers
Async-native pipelines reduced infra footprint and eliminated broker ops overhead

Scale target: 80–100k daily active users

Kubernetes HPA scales the pod count automatically under load — each pod runs a single worker, so connection counts scale linearly and predictably. The async-first design means a single Uvicorn worker handles hundreds of concurrent SSE streams without thread overhead, targeting 80–100k daily active users.
