LangGraph credit metering: how to charge users correctly for AI tool usage

When you wire up billing for an AI product, the obvious place to put it is the HTTP handler. A request comes in, you check the user’s balance, you let it through, you charge them, and you reconcile the details later. This works fine for a plain chat completion. It falls apart the moment your agent can call tools.

This post walks through the pattern we use in production to meter credits inside a LangGraph agent, not around it. The short version: don’t bill at the request boundary, bill where tools execute. The longer version is that “where tools execute” isn’t one place, and the right billing strategy depends on what each tool costs and how it runs.

The wrong default: billing at the request boundary

Charging when the request arrives has two failure modes that surface immediately in a real product.

You overcharge for failures. A deep-research tool times out after twenty seconds. An image provider returns a 500. A web search comes back empty. If you debited the user when the request came in, they paid for nothing, and now you owe them a refund flow, a support queue, and an apology. Multiply that by every transient provider hiccup.

You can’t price a turn. A single user message (“research this market and make me a chart”) fans out into one LLM call, two web searches, and an image generation. That’s four billable events with four different cost models, and the model decides which ones happen at runtime. The request boundary knows that a request happened. It has no idea what it will cost. That information only exists inside the agent, after the model has chosen its tools and those tools have either succeeded or failed.

So move the meter inward.

Two stacked flows. At the request boundary, the user is charged before tools run, so a failed tool leaves you owing a refund. At the tool layer, charging happens inside the agent's ToolNodes, only on success, with rollback on failure.

Where the money actually moves

A LangGraph agent is a loop between two kinds of nodes: a model node that calls the LLM and emits tool calls, and a tool node that executes them and returns the results to the model.

from langgraph.graph import StateGraph, START, END
from langgraph.prebuilt import ToolNode

graph = StateGraph(AgentState)
graph.add_node("model", call_model)
graph.add_node("tools", ToolNode(tools))
graph.add_edge(START, "model")
graph.add_conditional_edges("model", route_tools)  # -> "tools" or END
graph.add_edge("tools", "model")

Every billable event happens inside that loop: the model node burns input/output tokens, and the tool node runs tools that each cost something. That’s where the meter belongs.

And “something” is never the same something twice. Each model prices its own way: input, output, and cached tokens all bill at different rates within a single call. Each tool sets its own price too. A web search costs a fraction of a cent, an image is priced per output, a deep-research job runs orders of magnitude past both. The meter isn’t one number. It’s a per-model, per-tool rate table the loop reads as each event fires.
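A minimal sketch of what such a rate table could look like. The model name, tool names, and every rate below are hypothetical placeholders, not real provider prices; the only point is that each event is priced by lookup and everything stays in Decimal.

```python
from decimal import Decimal

# Hypothetical per-token rates in credits, keyed by model and token kind.
MODEL_RATES = {
    "model-a": {
        "input": Decimal("0.000003"),
        "output": Decimal("0.000015"),
        "cached": Decimal("0.0000003"),
    },
}

# Hypothetical flat per-call rates in credits, keyed by tool name.
TOOL_RATES = {
    "web_search": Decimal("0.005"),
    "generate_image": Decimal("0.04"),
}


def price_tokens(model_id: str, tokens: dict[str, int]) -> Decimal:
    """Price one LLM call; raises KeyError on an unknown model or token kind."""
    rates = MODEL_RATES[model_id]
    return sum((rates[kind] * count for kind, count in tokens.items()), Decimal("0"))


def price_tools(tool_names: list[str]) -> Decimal:
    """Price a list of successful tool invocations, one entry per call."""
    return sum((TOOL_RATES[name] for name in tool_names), Decimal("0"))
```

Raising on an unknown model or tool is a deliberate choice in this sketch: a missing rate should surface as an error you can alert on, not silently bill zero.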

A LangGraph agent loop. START flows into the model node, labelled "LLM tokens metered," which emits tool calls to the tool node, labelled "tool calls metered." Results flow back to the model, looping until the model stops calling tools and routes to END. Both nodes are metered; the request boundary is not.

It’s not one pattern, it’s three

The instinct is to write a single billing wrapper (check quota, run the tool, deduct on success, roll back on failure) and apply it everywhere. That’s the right idea at the wrong granularity. Tools differ along two axes that change how you bill:

Cost. A web search is fractions of a cent. A thirty-minute deep-research job is real money you can’t hand out on spec.
Shape. Most tools are synchronous and idempotent. Some are fire-and-forget: you submit a job and get a webhook back minutes later.

Those two axes give three strategies.

Strategy 1: cheap, idempotent tools (wrap the ToolNode)

For the long tail of cheap tools, you don’t want a database transaction per call. You want a fast counter (we use Redis) enforcing a daily limit. LangGraph’s ToolNode takes an awrap_tool_call hook that lets you wrap every tool execution: check before, roll back the counter if the tool throws.

from langchain_core.messages import ToolMessage

async def metered_tool_call(request, execute):
    cfg = request.runtime.config["configurable"]
    user_id, tool = cfg["user_id"], request.tool_call["name"]

    limit = DAILY_LIMITS.get(tool)
    if limit is not None:
        try:
            await increment_daily_counter(user_id, tool, limit)
        except DailyLimitExceeded:
            return ToolMessage(
                content=f"Daily limit reached for {tool}. Try again tomorrow.",
                tool_call_id=request.tool_call["id"],
            )

    try:
        return await execute(request)
    except Exception:
        if limit is not None:  # only undo a counter we actually incremented
            await rollback_daily_counter(user_id, tool)  # the user didn't get a result
        raise

graph.add_node("tools", ToolNode(tools, awrap_tool_call=metered_tool_call))

Two things to notice. The limit-exceeded case returns a ToolMessage rather than raising: the model sees “limit reached,” tells the user gracefully, and the turn keeps going. And the rollback only fires on a real exception, so a user is never billed against their daily quota for a tool that errored out.
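The two counter helpers are not shown above. Here is a minimal in-memory sketch of the contract they need to satisfy, assuming a dict keyed by user, tool, and day in place of Redis. The key layout and return value are our assumptions for illustration; a Redis version would use an atomic increment with a daily expiry instead.

```python
import asyncio
from collections import defaultdict
from datetime import date


class DailyLimitExceeded(Exception):
    """Raised when a user has used up today's allowance for a tool."""


# In-memory stand-in for the Redis counter (assumption: one process, one event loop).
_counters: dict[tuple[str, str, str], int] = defaultdict(int)


def _key(user_id: str, tool: str) -> tuple[str, str, str]:
    return (user_id, tool, date.today().isoformat())


async def increment_daily_counter(user_id: str, tool: str, limit: int) -> int:
    key = _key(user_id, tool)
    # Increment first, then check: there is no await in between, so two
    # concurrent callers cannot both slip under the limit.
    _counters[key] += 1
    if _counters[key] > limit:
        _counters[key] -= 1  # undo our own increment before refusing
        raise DailyLimitExceeded(tool)
    return _counters[key]


async def rollback_daily_counter(user_id: str, tool: str) -> None:
    key = _key(user_id, tool)
    if _counters[key] > 0:  # never go negative on a stray rollback
        _counters[key] -= 1
```

The increment-then-check order matters more than the storage: a check-then-increment version reopens the race the counter is supposed to close.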

Strategy 2: the chat turn itself (deduct at end-of-turn in a nested transaction)

The dominant cost of most turns is LLM tokens, and you don’t know the token count until the turn is over. So don’t meter mid-flight. Tally everything once the turn completes (input, output and cached tokens priced per model, plus each tool invocation priced per tool) and deduct in a single operation.

The subtlety is that deduction must never be able to corrupt the conversation. By the time you bill, the assistant message is already written to the database. If your pricing code throws (a missing rate, a bad cache entry) you must not lose the message. Wrap the deduction in a SAVEPOINT (a nested transaction):

async def finalize_turn(session, ctx, message_id, model_id, tokens, tool_calls):
    try:
        # Price inside the try: a missing rate must not prevent the commit below.
        llm_credits = price_tokens(model_id, tokens)     # Decimal, per-model rates
        tool_credits = price_tools(tool_calls)           # Decimal, per-tool rates
        total = llm_credits + tool_credits

        async with session.begin_nested():           # SAVEPOINT
            await debit_balance(session, ctx.balance_id, total)
            await write_usage_log(session, ctx, message_id, llm_credits, tool_credits)
    except Exception:
        # the assistant message is already persisted; never lose it to a billing bug
        logger.exception("credit deduction failed for message %s", message_id)
        capture_exception()
    await session.commit()
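To see the savepoint behaviour in isolation, here is a small sketch using the standard library's sqlite3 in place of Postgres and SQLAlchemy. The table names, the micro-credit integer column, and the two debit functions are hypothetical; what carries over is that rolling back to the savepoint discards the billing writes and keeps the message.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    # isolation_level=None means autocommit mode, so BEGIN/COMMIT are issued explicitly.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT NOT NULL)")
    conn.execute("CREATE TABLE balance (id INTEGER PRIMARY KEY, used_micro INTEGER NOT NULL)")
    conn.execute("INSERT INTO balance (id, used_micro) VALUES (1, 0)")
    return conn


def good_debit(conn: sqlite3.Connection) -> None:
    conn.execute("UPDATE balance SET used_micro = used_micro + 1500 WHERE id = 1")


def buggy_debit(conn: sqlite3.Connection) -> None:
    conn.execute("UPDATE balance SET used_micro = used_micro + 1500 WHERE id = 1")
    raise KeyError("missing rate")  # simulated billing bug after a partial write


def finalize_turn_demo(conn: sqlite3.Connection, body: str, debit) -> None:
    conn.execute("BEGIN")
    conn.execute("INSERT INTO messages (body) VALUES (?)", (body,))
    conn.execute("SAVEPOINT billing")
    try:
        debit(conn)
    except Exception:
        # Undo only the billing writes; the message insert sits outside the savepoint.
        conn.execute("ROLLBACK TO SAVEPOINT billing")
    conn.execute("RELEASE SAVEPOINT billing")
    conn.execute("COMMIT")


def snapshot(conn: sqlite3.Connection) -> tuple[int, int]:
    messages = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
    used = conn.execute("SELECT used_micro FROM balance WHERE id = 1").fetchone()[0]
    return messages, used
```

Running finalize_turn_demo once with good_debit and once with buggy_debit leaves two messages and a single debit: the failed turn keeps its message, and its half-finished billing write is gone.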

debit_balance is a single atomic UPDATE. The arithmetic happens in the database, not in Python, so two concurrent turns can’t read-modify-write over each other:

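from sqlalchemy import update

async def debit_balance(session, balance_id, delta):
    # delta is a Decimal; the addition runs inside the database, not in Python
    await session.execute(
        update(CreditBalance)
        .where(CreditBalance.id == balance_id)
        .values(used_credits=CreditBalance.used_credits + delta)
    )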
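The difference is easy to reproduce. This sketch uses sqlite3 and a hypothetical credit_balance table holding integer micro-credits, and replays the interleaving by hand: two workers both read the old value, then both write. It's an illustration of the lost update, not the production schema.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE credit_balance (id INTEGER PRIMARY KEY, used_micro INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO credit_balance (id, used_micro) VALUES (1, 0)")
    conn.commit()
    return conn


def read_used(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT used_micro FROM credit_balance WHERE id = 1").fetchone()[0]


def read_modify_write(conn: sqlite3.Connection, delta_a: int, delta_b: int) -> int:
    # Both workers read before either writes: the classic lost update.
    seen_a = read_used(conn)
    seen_b = read_used(conn)
    conn.execute("UPDATE credit_balance SET used_micro = ? WHERE id = 1", (seen_a + delta_a,))
    conn.execute("UPDATE credit_balance SET used_micro = ? WHERE id = 1", (seen_b + delta_b,))
    conn.commit()
    return read_used(conn)


def atomic_update(conn: sqlite3.Connection, delta_a: int, delta_b: int) -> int:
    # The database does the addition, so each delta applies to the current value.
    for delta in (delta_a, delta_b):
        conn.execute(
            "UPDATE credit_balance SET used_micro = used_micro + ? WHERE id = 1", (delta,)
        )
    conn.commit()
    return read_used(conn)
```

With deltas of 300 and 500, the read-modify-write version ends at 500 (the first charge is overwritten), while the atomic version ends at 800.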

Strategy 3: expensive async tools (reserve up front, refund on failure)

A deep-research job runs for up to thirty minutes. You submit it, you get a webhook later. End-of-turn deduction can’t work: the turn ends long before the job does. And the job is expensive enough that you can’t let a user kick off ten of them on the credits for one.

So reserve before you submit. And because two requests could race to spend the last of a balance, lock the row while you check and hold:

from sqlalchemy import select

async def reserve(session, balance_id, amount):
    balance = (await session.execute(
        select(CreditBalance)
        .where(CreditBalance.id == balance_id)
        .with_for_update()                # SELECT ... FOR UPDATE
    )).scalar_one()

    if balance.remaining < amount:
        raise InsufficientCredits
    balance.used_credits += amount        # row stays locked until the caller commits
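Here is a sketch of the race that lock prevents, with an asyncio.Lock standing in for the row lock and a plain in-memory Balance class (both are assumptions for illustration, not the real model). Two requests each try to reserve 80 credits from a balance of 100.

```python
import asyncio
from dataclasses import dataclass, field
from decimal import Decimal


class InsufficientCredits(Exception):
    pass


@dataclass
class Balance:
    total: Decimal
    used: Decimal = Decimal("0")
    lock: asyncio.Lock = field(default_factory=asyncio.Lock)

    @property
    def remaining(self) -> Decimal:
        return self.total - self.used


async def reserve(balance: Balance, amount: Decimal) -> None:
    async with balance.lock:  # stands in for SELECT ... FOR UPDATE
        if balance.remaining < amount:
            raise InsufficientCredits
        await asyncio.sleep(0)  # yield control, as a real database round trip would
        balance.used += amount


async def race() -> tuple[Balance, list]:
    balance = Balance(total=Decimal("100"))
    results = await asyncio.gather(
        reserve(balance, Decimal("80")),
        reserve(balance, Decimal("80")),
        return_exceptions=True,
    )
    return balance, results
```

With the lock, exactly one reservation succeeds and the other gets InsufficientCredits. Remove the async with line and both pass the check before either writes, leaving 160 credits reserved against a balance of 100.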

If submitting the job fails after you’ve reserved, refund the credits and make noise. This is real money, and a silent failure leaves a user short credits they never spent:

try:
    await reserve(session, balance_id, cost)
    await session.commit()
    await submit_deep_research(job)
except SubmissionError as exc:
    await refund(balance_id, cost)        # atomic UPDATE, -cost
    capture_exception(exc)                # explicit: don't let log levels bury it
    raise

When the webhook eventually reports the job’s real cost, you reconcile the reservation against the actual usage and adjust.
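The reconcile step can be as small as computing a signed adjustment. The sketch below assumes a hypothetical Reservation record with a settled flag; guarding against a webhook being delivered twice is our addition, since the source doesn't describe how duplicates are handled.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Reservation:
    job_id: str
    reserved: Decimal
    settled: bool = False


def reconcile(reservation: Reservation, actual_cost: Decimal) -> Decimal:
    """Return the delta to apply to used_credits.

    Negative means refund the unused part of the hold; positive means
    the job cost more than was reserved.
    """
    if reservation.settled:
        # Assumed duplicate webhook delivery: already settled, change nothing.
        return Decimal("0")
    reservation.settled = True
    return actual_cost - reservation.reserved
```

The returned delta can then be applied with the same kind of atomic UPDATE used for refunds, so reconciliation never needs to read the balance first.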

Lifecycle of an expensive async tool. Reserve credits with a row lock, submit the job, wait around thirty minutes, receive a webhook with the real cost, then reconcile. A failure branch off submission refunds the reserved credits and fires an explicit alert.

How to choose

A cost-by-shape decision matrix. Cheap, idempotent synchronous tools use a counter-gate; expensive synchronous cost is deducted at end-of-turn in a nested transaction; expensive asynchronous tools reserve credits up front and refund on failure.

The rule of thumb: if a tool is cheap, gate it with a counter. If it’s part of the synchronous turn, fold it into the end-of-turn deduction. If it’s expensive and async, reserve and refund.

These aren’t mutually exclusive, and in practice they compose. A cheap tool can earn two of them at once. A web search, say, gets counter-gated for its daily limit and folded into the end-of-turn deduction: the counter caps abuse, and the ledger still captures the cost. Pick the primary strategy by a tool’s cost and shape, then layer the others on where a tool actually needs them.
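One way to encode the rule of thumb, as a sketch. The enum and the two boolean flags are hypothetical names; in a real system, where "expensive" starts is a threshold you'd set per product.

```python
from enum import Enum


class Strategy(Enum):
    COUNTER_GATE = "counter_gate"
    END_OF_TURN = "end_of_turn"
    RESERVE_AND_REFUND = "reserve_and_refund"


def strategies_for(*, expensive: bool, runs_async: bool) -> set[Strategy]:
    """Map a tool's cost and shape to the billing strategies it should get."""
    chosen: set[Strategy] = set()
    if not expensive:
        chosen.add(Strategy.COUNTER_GATE)        # cheap: cap abuse with a daily counter
    if not runs_async:
        chosen.add(Strategy.END_OF_TURN)         # finishes within the turn: bill with the turn
    if expensive and runs_async:
        chosen.add(Strategy.RESERVE_AND_REFUND)  # outlives the turn: hold credits up front
    return chosen
```

A web search (cheap, synchronous) comes back with both the counter gate and the end-of-turn deduction, which is the composition described above; a deep-research job (expensive, async) comes back with reserve-and-refund only.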

Correctness details that bite

Use Decimal, never float. Credits are money; binary floating point will drift and you’ll ship a balance that’s off by a hundredth of a cent and impossible to reconcile. Numeric(16, 6) in Postgres, Decimal in Python, end to end.
Lock the row where it’s worth it; tolerate a little drift where it isn’t. Any “check the balance, then deduct” sequence is a race, and reads are stale the instant you have them. But the answer isn’t to lock every path; it’s to lock the one that matters. Reserve an expensive async job behind a SELECT ... FOR UPDATE: there, a lost race hands a user tens of thousands of credits’ worth of work twice, so you serialize it. For a cheap synchronous turn, skip the lock: push the arithmetic into a single atomic UPDATE and accept that two concurrent turns might overshoot a balance by a fraction of a cent. Locking the hot path on every chat message buys you contention you’ll feel long before you’d ever notice the overdraft. Knowing which races are worth preventing is the actual engineering.
Decide fail-open vs. fail-closed per store. If the Redis daily counter is down, fail open: don’t block a paying customer over a cache blip on a near-free tool. If the Postgres balance is unreachable, fail closed: never hand out paid work you can’t record.
Separate the ledger from the balance. The balance is one mutable number for “how much is left this cycle.” The usage log is an append-only list of line items: one per LLM call, one per tool invocation. The balance makes quota checks fast; the ledger lets you audit, reconcile, and answer “why was I charged this?”
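Two of these points fit in a few lines. The sketch below first shows the float drift, then a hypothetical in-memory Account that keeps a mutable balance next to an append-only ledger so the two can be checked against each other. Class and field names are ours for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal

# Ten charges of 0.1 credits: binary floats drift, Decimal does not.
float_total = sum([0.1] * 10)
decimal_total = sum([Decimal("0.1")] * 10, Decimal("0"))


@dataclass(frozen=True)
class LedgerEntry:
    kind: str        # "llm" or "tool"
    name: str        # model id or tool name
    credits: Decimal


class Account:
    def __init__(self) -> None:
        self.used = Decimal("0")             # the fast, mutable number for quota checks
        self.ledger: list[LedgerEntry] = []  # append-only line items for audits

    def charge(self, kind: str, name: str, credits: Decimal) -> None:
        self.ledger.append(LedgerEntry(kind, name, credits))
        self.used += credits

    def audit(self) -> bool:
        """True when the ledger's line items add up to the balance."""
        return sum((e.credits for e in self.ledger), Decimal("0")) == self.used
```

In a database the same check becomes a periodic job comparing the summed usage log with the balance row; any mismatch points to a deduction that skipped one of the two writes.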

Takeaways

Don’t bill at the request boundary. Bill where tools execute. It’s the only place that knows what a turn actually cost.
One wrapper isn’t enough. Pick the strategy by the tool’s cost and shape: counter-gate the cheap ones, fold synchronous cost into an end-of-turn deduction, reserve-and-refund the expensive async ones.
Money code is transaction code. Decimal, row locks, nested transactions, and loud failures aren’t optional polish. They’re the difference between a billing system you trust and one you babysit.

This is the pattern behind the multi-provider credit system we built for Lipika AI: four LLM providers plus image generation, video generation, deep research, and speech-to-text, each metered independently inside the agent. If you’re building something similar and want it to be correct the first time, that’s exactly the kind of work we do.