Backend Engineering

Lipika AI

Production backend for a multi-provider LLM aggregation platform — four AI models unified under one credit system, shipped from spec to live in six weeks.

6 weeks

MVP to production

4 providers

LLM models unified

Backend only

Our scope

Everything shipped in those 6 weeks

Email/password + Google OAuth

Multi-LLM routing (4 providers)

Credit billing + subscription plans

Per-seat team workspaces

SSLCommerz payment gateway

Image & video generation pipelines

30-min deep research (Valyu)

Real-time SSE streaming

Speech-to-text via WebSocket

PDF & DOCX export

Kubernetes deployment on GCP

Langfuse + Sentry observability

The Challenge

One product.
Four AI models.
Zero compromise.

The Lipika AI team needed a production backend that could route requests across Claude, GPT-4, Gemini, and Grok — with per-model credit metering, subscription billing, and a growing suite of AI tools including image generation, video generation (Gemini Veo), long-horizon deep research, and real-time speech-to-text.

The system also had to handle real-time SSE streaming, long-running async jobs up to 30 minutes, team workspaces with seat-based billing, SSLCommerz payment integration, and a Kubernetes deployment on GCP that could scale without multiplying database connection counts.

Our Role

Full backend engineering
API design & architecture
Auth — email/password & Google OAuth
LLM orchestration (LangGraph)
Billing & payments integration
Async job infrastructure
Kubernetes deployment on GCP
Frontend engineering (out of scope; built by the client's team)

Iterosys owns the entire backend surface — API design through to Kubernetes deployment and observability. The frontend was built by the client's own team.

What made this hard

Credit metering inside the agent, not around it

Most platforms bill at the request boundary — charge when the call comes in, reconcile later. We built the billing logic directly into each LangGraph ToolNode: check quota, execute, deduct atomically on success, roll back on failure. Users are never charged for infrastructure errors or failed API calls. Getting this right across four LLM providers plus image gen, video gen, and deep research — each with a different cost model — required financial-grade correctness at the orchestration layer.

30-minute async jobs with zero extra infrastructure

Deep research tasks run for up to 30 minutes. The standard answer is Celery, a message broker, and a separate worker fleet: an ops burden that lives forever. We handled it with native asyncio tasks, Redis status tracking, webhook callbacks from the research provider, and reconciliation crons. No broker, no extra pods, nothing new to operate. The result is simpler and cheaper to run, and it fits the async-first architecture.

System Architecture

Four layers, one coherent system

Client

Web Application (Next.js)

REST / JSON · SSE Streaming · WebSocket (STT)

API Layer — FastAPI (ASGI)

Security Headers

CSP · X-Frame-Options · X-Content-Type-Options

JWT Auth

Access + refresh tokens, Google OAuth 2.0

Rate Limiting

SlowAPI, per-user + IP fallback

Request Tracing

UUID X-Request-ID on every response

Timeout Guard

60s default · 600s for SSE streams

GZip Bypass

Disabled on /stream — SSE stream-safe

Intelligence Layer — LangGraph Agent

LLM Providers (factory pattern)

Claude · GPT-4 · Gemini · Grok

Tool Nodes (credit-metered)

Image Gen · Video Gen (Veo) · Deep Research · Web Search · Speech-to-Text · PDF / DOCX

Data Layer

PostgreSQL

App data, billing, users

PostgreSQL (checkpoint)

LangGraph state — isolated DB

Redis

Cache · job queue · rate limits

Cloud Storage

File uploads · media assets

Infrastructure & Observability

GCP · Kubernetes · Horizontal Pod Autoscaler · Cloud SQL · Docker multi-stage · Langfuse · Sentry · Bandit (security scan)
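
The "factory pattern" named for the LLM providers in the intelligence layer could be sketched roughly as follows. This is an illustration under assumptions, not the project's code: `ChatModel`, `EchoModel`, `PROVIDERS`, and `get_model` are hypothetical names, and `EchoModel` stands in for the real chat-model classes that the langchain-anthropic, langchain-openai, langchain-google-genai, and langchain-xai packages would supply.

```python
from typing import Callable, Protocol


class ChatModel(Protocol):
    """Minimal interface the rest of the agent would depend on (assumed)."""

    def invoke(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in for a real provider client so the sketch runs offline."""

    def __init__(self, name: str) -> None:
        self.name = name

    def invoke(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


# One zero-argument factory per provider. In a real system each entry would
# construct that provider's client with its own model name and API key.
PROVIDERS: dict[str, Callable[[], ChatModel]] = {
    "claude": lambda: EchoModel("claude"),
    "gpt-4": lambda: EchoModel("gpt-4"),
    "gemini": lambda: EchoModel("gemini"),
    "grok": lambda: EchoModel("grok"),
}


def get_model(provider: str) -> ChatModel:
    """Return a chat model for the requested provider, or fail loudly."""
    try:
        factory = PROVIDERS[provider.lower()]
    except KeyError:
        raise ValueError(f"unknown provider: {provider!r}") from None
    return factory()
```

Because callers only see `get_model` and the `ChatModel` interface, swapping or adding a provider is a one-line change to the registry.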

Engineering Decisions

Five choices that shaped the system

Credit metering at the tool-execution layer, with rollback

Credits are deducted inside each LangGraph ToolNode — not at the request boundary. The middleware checks per-model quotas and daily limits before execution, then decrements atomically on success and rolls back on failure. Users are never charged for failed API calls, provider timeouts, or infrastructure errors. This works across four LLM providers plus image gen, video gen, deep research, and STT — each with independent cost models and rate configurations.

Quota check before execution

Atomic deduction on success

Rollback on any exception

Per-model + per-tool rate config
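
A minimal sketch of the check, execute, deduct, roll back flow described above. It is an illustration under assumptions rather than the production code: `CreditLedger`, `QuotaExceeded`, and `metered_call` are hypothetical names, the ledger is an in-memory dict standing in for the billing tables, and the sketch uses a reserve-then-refund variant (hold the credits up front, return them on failure), which has the same user-visible effect as deducting on success.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Awaitable, Callable


class QuotaExceeded(Exception):
    """Raised when the user cannot afford the requested tool call."""


@dataclass
class CreditLedger:
    # In-memory stand-in; a real system would use a database transaction.
    balances: dict[str, int] = field(default_factory=dict)

    def reserve(self, user_id: str, cost: int) -> None:
        # No await between the check and the decrement, so this is atomic
        # with respect to other coroutines on the same event loop.
        if self.balances.get(user_id, 0) < cost:
            raise QuotaExceeded(f"{user_id} needs {cost} credits")
        self.balances[user_id] -= cost

    def refund(self, user_id: str, cost: int) -> None:
        self.balances[user_id] += cost


async def metered_call(
    ledger: CreditLedger,
    user_id: str,
    cost: int,
    tool: Callable[..., Awaitable[Any]],
    *args: Any,
) -> Any:
    """Run `tool` and charge `cost` credits only if it succeeds."""
    ledger.reserve(user_id, cost)  # quota check and hold
    try:
        return await tool(*args)
    except BaseException:
        # Covers provider errors, timeouts, and task cancellation.
        ledger.refund(user_id, cost)
        raise
```

A tool that raises leaves the balance unchanged, while a successful call leaves it reduced by exactly `cost`. Per-model and per-tool pricing would plug in by looking up `cost` from a rate table before the call.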

Celery-free background jobs

Image generation, video generation (Gemini Veo), and deep research (Valyu) run as native asyncio tasks — with Redis status tracking and reconciliation cron jobs. No Celery broker, no separate worker pods.

Fewer moving parts · fits the async-first FastAPI design · no broker to operate
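
The general shape of a broker-free background job can be sketched with the standard library alone. The names here (`JOB_STATUS`, `submit_job`, `_run_job`) are hypothetical, and a plain dict stands in for the Redis status store; this shows the pattern, not the shipped pipeline.

```python
import asyncio
import uuid
from typing import Any, Awaitable, Callable

# Stand-in for Redis: job_id -> latest status record.
JOB_STATUS: dict[str, dict[str, Any]] = {}

# Strong references so running tasks are not garbage-collected mid-flight.
_TASKS: set[asyncio.Task] = set()


async def _run_job(job_id: str, work: Callable[[], Awaitable[Any]]) -> None:
    """Run the job and record its outcome in the status store."""
    JOB_STATUS[job_id] = {"state": "running"}
    try:
        result = await work()
    except Exception as exc:
        JOB_STATUS[job_id] = {"state": "failed", "error": str(exc)}
    else:
        JOB_STATUS[job_id] = {"state": "done", "result": result}


def submit_job(work: Callable[[], Awaitable[Any]]) -> str:
    """Schedule `work` on the running event loop and return a job id."""
    job_id = uuid.uuid4().hex
    JOB_STATUS[job_id] = {"state": "queued"}
    task = asyncio.create_task(_run_job(job_id, work))
    _TASKS.add(task)
    task.add_done_callback(_TASKS.discard)
    return job_id
```

A request handler would call `submit_job(...)` and return the id immediately, and the client would poll a status endpoint that reads the store. A reconciliation cron would then sweep for records stuck in `running` after a pod restart, since in-process tasks do not survive one.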

Isolated LangGraph checkpoint DB

LangGraph writes a checkpoint row on every agent step. We give it a dedicated PostgreSQL database — separate from the app DB — to avoid Alembic migration conflicts and allow independent backup schedules and connection pool sizing.

15 app conns + 6 checkpoint conns per pod — predictable ceiling as the fleet scales
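
As a worked example of that ceiling, the per-pod figures above give 21 connections per pod, so the fleet-wide total is a simple multiple of the pod count. The function name and the pod count of 10 are illustrative assumptions, not figures from the deployment.

```python
APP_CONNS_PER_POD = 15         # app database pool size per pod
CHECKPOINT_CONNS_PER_POD = 6   # LangGraph checkpoint pool size per pod


def connection_ceiling(pods: int) -> dict[str, int]:
    """Upper bound on open PostgreSQL connections for a given pod count."""
    app = pods * APP_CONNS_PER_POD
    checkpoint = pods * CHECKPOINT_CONNS_PER_POD
    return {"app": app, "checkpoint": checkpoint, "total": app + checkpoint}


print(connection_ceiling(10))
# {'app': 150, 'checkpoint': 60, 'total': 210}
```

Because the two pools point at separate databases, each database's connection limit only needs to cover its own column.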

Manual SSE, not sse-starlette

The sse-starlette library raises ExceptionGroup errors in the test suite. We use Starlette's StreamingResponse directly with data: {json}\n\n formatting — more control, clean test isolation, and GZip bypass on the /stream path.

Timeout override: 600s for SSE · GZip middleware bypassed on /stream
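
The `data: {json}\n\n` framing is small enough to sketch with the standard library. `format_sse` and `event_stream` are hypothetical names, and the `token` / `done` payload shape is an assumption for illustration.

```python
import json
from typing import AsyncIterator, Iterable


def format_sse(payload: dict) -> str:
    """Encode one event as a single SSE `data:` line plus a blank line."""
    # json.dumps escapes newlines, so the payload always fits on one line.
    return f"data: {json.dumps(payload)}\n\n"


async def event_stream(tokens: Iterable[str]) -> AsyncIterator[str]:
    """Yield one SSE frame per token, then a final completion frame."""
    for token in tokens:
        yield format_sse({"token": token})
    yield format_sse({"done": True})
```

In a FastAPI route, a generator like this would typically be returned as `StreamingResponse(event_stream(tokens), media_type="text/event-stream")`. Because the generator is plain Python, tests can iterate it directly without starting a server.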

One worker per pod, scale horizontally

Adding workers multiplies connection pools. Instead, each pod runs a single Gunicorn worker — async I/O handles concurrency — and HPA adds pods when traffic increases. Connection counts stay predictable at scale.

1 worker per pod — HPA adds pods at load, not workers per pod
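
One way to express the single-worker rule is a Gunicorn config file, which is itself plain Python. This is a sketch under assumptions: the file name, bind address, and port are illustrative, and only the worker count comes from the text above.

```python
# gunicorn.conf.py (hypothetical file for illustration)

# Exactly one worker process per pod; concurrency comes from async I/O
# inside that worker, and extra capacity comes from HPA adding pods.
workers = 1

# Run the ASGI app under Uvicorn's Gunicorn worker class.
worker_class = "uvicorn.workers.UvicornWorker"

# Assumed container port; adjust to match the Kubernetes Service.
bind = "0.0.0.0:8000"
```

A file like this would be passed with `gunicorn -c gunicorn.conf.py <module>:<app>`, where the module and app names depend on the project layout.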

Tech Stack

What we shipped with

AI / Orchestration

LangGraph 1.1.6 · LangChain · langchain-anthropic · langchain-openai · langchain-google-genai · langchain-xai · Tavily Search · OpenAI Whisper (STT)

Backend

Python 3.13 · FastAPI 0.129 · Pydantic v2 · SQLAlchemy 2 (async) · Alembic · asyncpg · Uvicorn + Gunicorn

Auth & Security

JWT (access + refresh) · Google OAuth 2.0 · Bcrypt (password hashing) · SecurityHeadersMiddleware · Bandit (static analysis)

Data & Storage

PostgreSQL (Cloud SQL) · Redis 7.2 · Google Cloud Storage · LangGraph Checkpoint DB

Infrastructure

GCP · Kubernetes + HPA · Docker (multi-stage) · SSLCommerz (payments)

Observability

Langfuse 4.0 · Sentry SDK · Loguru (structured JSON) · pytest + asyncio

Outcomes

From spec to production

6 weeks

MVP to live production

Full backend — auth, billing, LLM routing, streaming, Kubernetes deployment

4 models

Unified under one API

Claude · GPT-4 · Gemini · Grok — swappable via LangGraph factory pattern

0 Celery

Background job workers

Async-native pipelines reduced infra footprint and eliminated broker ops overhead

Scale Target

80–100k

daily active users

Kubernetes HPA scales the pod count automatically under load. Each pod runs a single worker, so database connection counts grow linearly and predictably with the fleet. The async-first design lets that single Uvicorn worker (managed by Gunicorn) hold hundreds of concurrent SSE streams without thread overhead, with a design target of 80–100k daily active users.