Cost guardrail for autonomous agents

Evaluate whether an action is worth the cost before it executes.

Every agent should ask one question before spending:
is this worth the cost?

API endpoint

POST /v1/should-spend

Live demo

request payload

{
  "action": "call_model",
  "provider": "openai",
  "model": "gpt-4",
  "estimated_cost": 0.02,
  "expected_value": 0.05,
  "confidence": 0.7,
  "risk_tolerance": "medium",
  "budget_remaining": 5.00,
  "historical_success_rate": 0.82
}


Why AI agents need a cost layer

Autonomous agents make decisions continuously — calling models, running searches, invoking tools — each with a real dollar cost. Without guardrails, a single runaway loop or an unnecessarily expensive model call can burn through budget in seconds. Most agent frameworks focus on what to do next, not whether doing it is economically justified.

ShouldSpend adds a cost-awareness checkpoint before any expensive action executes. Pass what the action costs, what you expect to gain, how confident you are, and — optionally — how risk-averse the agent is and how much budget remains. The API returns a binary verdict, a risk level, and a cheaper alternative when one exists.

Request fields

estimated_cost Projected spend for this action in USD. Compared directly against expected_value to anchor the base score.

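expected_value Anticipated return or benefit in USD equivalent. Each full point of value-to-cost ratio above 1 adds 0.12 to the base score of 0.5, capped at +0.30. A 2.5× ratio therefore scores 0.68 before confidence adjustments, and the cap is reached at 3.5×. Ratios closer to 1 leave the score near or below the decision threshold.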

confidence Your certainty that the expected value will materialise, as a probability (0–1). Below 0.5 penalises the score; above 0.75 provides a small boost.

risk_tolerance (new) "low" raises the spend threshold to 0.72, so only clearly justified actions pass. "medium" (default) uses 0.60. "high" lowers it to 0.45 for exploratory agents.

budget_remaining (new) Hard budget gate in USD. If estimated_cost exceeds this value, the API returns should_spend: false immediately, bypassing the scoring model entirely.

historical_success_rate (new) Past success rate for this action type (0–1). When provided, the engine weights it at 60% against subjective confidence (40%), reducing reliance on estimates and improving decision quality over time.
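The field constraints above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the ShouldSpend API: SpendRequest and to_payload are hypothetical names, the range checks follow the field descriptions (probabilities in 0–1, three risk tolerance levels), and the non-negative cost and value check is an added assumption.

```python
from dataclasses import dataclass, asdict
from typing import Optional

RISK_LEVELS = ("low", "medium", "high")


@dataclass
class SpendRequest:
    """Hypothetical client-side container for a /v1/should-spend payload."""
    action: str
    estimated_cost: float
    expected_value: float
    confidence: float
    risk_tolerance: str = "medium"          # documented default
    budget_remaining: Optional[float] = None
    historical_success_rate: Optional[float] = None

    def __post_init__(self) -> None:
        # Assumption: negative dollar amounts are never meaningful here.
        if self.estimated_cost < 0 or self.expected_value < 0:
            raise ValueError("estimated_cost and expected_value must be non-negative")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        if self.risk_tolerance not in RISK_LEVELS:
            raise ValueError(f"risk_tolerance must be one of {RISK_LEVELS}")
        rate = self.historical_success_rate
        if rate is not None and not 0.0 <= rate <= 1.0:
            raise ValueError("historical_success_rate must be between 0 and 1")

    def to_payload(self) -> dict:
        # Omit optional fields that were not supplied.
        return {k: v for k, v in asdict(self).items() if v is not None}


payload = SpendRequest(
    action="call_model",
    estimated_cost=0.02,
    expected_value=0.05,
    confidence=0.7,
).to_payload()
```

Leaving unset optional fields out of the payload, rather than sending null, keeps the request aligned with the "when provided" wording used for historical_success_rate.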

Open decision logic

The scoring model is intentionally transparent — no black box. Here is the complete algorithm the API runs on every request:

scoring algorithm — pseudocode

# 1. Hard budget gate (checked before scoring)
if budget_remaining is provided and estimated_cost > budget_remaining:
    return { should_spend: false, reason: "Exceeds remaining budget" }

# 2. Base score from value ratio
score = 0.5
if expected_value > estimated_cost:
    ratio = expected_value / estimated_cost
    score += min(0.30, (ratio - 1) * 0.12)
else:
    shortfall = estimated_cost - expected_value
    score -= min(0.40, shortfall * 5)

# 3. Effective confidence (blends historical data when available)
if historical_success_rate is provided:
    eff_conf = historical_success_rate * 0.6 + confidence * 0.4
else:
    eff_conf = confidence

if eff_conf < 0.5: score -= (0.5 - eff_conf) * 0.4
if eff_conf > 0.75: score += (eff_conf - 0.75) * 0.2
score = clamp(score, 0, 1)

# 4. Spend threshold from risk_tolerance
threshold = { "low": 0.72, "medium": 0.60, "high": 0.45 }[risk_tolerance]
should_spend = score > threshold
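For local testing, the pseudocode can be ported to Python fairly directly. This is a sketch under stated assumptions rather than the service's actual implementation: the function names score_action and should_spend are illustrative, a budget_remaining of 0 is treated as a supplied value (so the gate still fires), and a non-positive estimated_cost is rejected because the pseudocode does not say how it is handled.

```python
from typing import Optional

THRESHOLDS = {"low": 0.72, "medium": 0.60, "high": 0.45}


def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def score_action(estimated_cost: float, expected_value: float, confidence: float,
                 historical_success_rate: Optional[float] = None) -> float:
    """Steps 2 and 3 of the pseudocode: base score plus confidence adjustment."""
    if estimated_cost <= 0:
        # Assumption: zero or negative cost is outside the pseudocode's scope.
        raise ValueError("estimated_cost must be positive")

    score = 0.5
    if expected_value > estimated_cost:
        ratio = expected_value / estimated_cost
        score += min(0.30, (ratio - 1) * 0.12)
    else:
        shortfall = estimated_cost - expected_value
        score -= min(0.40, shortfall * 5)

    if historical_success_rate is not None:
        eff_conf = historical_success_rate * 0.6 + confidence * 0.4
    else:
        eff_conf = confidence

    if eff_conf < 0.5:
        score -= (0.5 - eff_conf) * 0.4
    if eff_conf > 0.75:
        score += (eff_conf - 0.75) * 0.2
    return clamp(score, 0.0, 1.0)


def should_spend(estimated_cost: float, expected_value: float, confidence: float,
                 risk_tolerance: str = "medium",
                 budget_remaining: Optional[float] = None,
                 historical_success_rate: Optional[float] = None) -> bool:
    """Steps 1 and 4: hard budget gate, then threshold comparison."""
    if budget_remaining is not None and estimated_cost > budget_remaining:
        return False
    score = score_action(estimated_cost, expected_value, confidence,
                         historical_success_rate)
    return score > THRESHOLDS[risk_tolerance]


# Demo payload from this page: ratio 2.5 adds 0.18, and the blended
# confidence of 0.772 adds a further 0.0044.
demo_score = score_action(0.02, 0.05, 0.7, historical_success_rate=0.82)
print(round(demo_score, 4))  # 0.6844
```

With the demo payload the score clears the "medium" threshold of 0.60 but not the "low" threshold of 0.72, so the same request would be blocked for a risk-averse agent.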

Case studies

Research agent switching from GPT-4 to GPT-4o-mini (model selection)

A document research agent was using GPT-4 for every query regardless of complexity. After routing calls through ShouldSpend with risk_tolerance: "medium", 71% of requests were flagged with a GPT-4o-mini alternative. Replacing those calls produced statistically indistinguishable output quality (measured by downstream task completion) at a fraction of the cost.
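One way to act on a flagged alternative is to swap the model before the call is made. The sketch below assumes the alternative field has the shape shown in the sample response on this page (a model name followed by a parenthesised note); pick_model is a hypothetical helper, and the real response format should be confirmed before relying on this parsing.

```python
def pick_model(verdict: dict, requested_model: str) -> str:
    """Return the suggested cheaper model if one is present, else the requested one."""
    alternative = verdict.get("alternative")
    if not isinstance(alternative, str) or not alternative.strip():
        return requested_model
    # Keep only the text before the parenthesised note, e.g.
    # "gpt-4o-mini (~97% cheaper, ...)" becomes "gpt-4o-mini".
    name = alternative.split("(", 1)[0].strip()
    return name or requested_model


verdict = {
    "should_spend": True,
    "alternative": "gpt-4o-mini (~97% cheaper, suitable for most tasks)",
}
model = pick_model(verdict, "gpt-4")  # "gpt-4o-mini"
```

Falling back to the requested model whenever the field is missing or unparseable keeps the agent working if the response shape changes.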

$2,340 saved over 30 days — same output quality

Data pipeline hitting daily budget cap by noon (budget control)

An extraction pipeline processing variable-length documents had no per-call cost awareness. Setting budget_remaining on each call (decremented after each approved spend) meant the hard gate fired automatically when the daily limit was reached — no developer intervention, no alert fatigue, no overrun.
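The decrement-after-approval pattern described here could be kept in a small in-process tracker. This is a minimal sketch: BudgetTracker and its method names are hypothetical, it is not thread-safe, and it uses floats for brevity where a production system might prefer integer cents or Decimal.

```python
class BudgetTracker:
    """Hypothetical per-day budget counter, decremented after each approved spend."""

    def __init__(self, daily_limit: float) -> None:
        if daily_limit < 0:
            raise ValueError("daily_limit must be non-negative")
        self._remaining = daily_limit

    def remaining(self) -> float:
        # Value to send as budget_remaining on the next request.
        return self._remaining

    def deduct(self, cost: float) -> None:
        # Call only after the API has returned should_spend: true.
        if cost < 0:
            raise ValueError("cost must be non-negative")
        self._remaining = max(0.0, self._remaining - cost)

    def allows(self, cost: float) -> bool:
        # Local mirror of the hard gate: blocked when cost exceeds the remainder.
        return cost <= self._remaining


tracker = BudgetTracker(daily_limit=1.00)
tracker.deduct(0.40)
print(tracker.allows(0.70))  # False
```

Checking allows() locally before calling the API can save a round trip once the budget is exhausted, while the server-side gate remains the source of truth.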

100% of budget overruns eliminated across 61 pipeline runs

Code review agent with low-confidence inputs (historical scoring)

A code review agent processing ambiguous diffs had a historical success rate of 0.34 for files over 500 lines. Passing historical_success_rate: 0.34 caused the engine to weight past outcomes over subjective confidence, returning should_spend: false for 28% of calls that would otherwise have proceeded. False positive rate dropped from 19% to 5%.
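A worked example shows how a low historical rate can flip a borderline verdict under the published scoring rules. Only the 0.34 rate comes from the case study; the cost (0.10), value (0.18), self-reported confidence (0.80), and "medium" threshold are illustrative assumptions.

```latex
\[
\begin{aligned}
\text{base} &= 0.5 + \min\bigl(0.30,\ (0.18/0.10 - 1)\times 0.12\bigr) = 0.5 + 0.096 = 0.596 \\[4pt]
\text{without history:}\quad \mathit{eff\_conf} &= 0.80 \\
\text{score} &= 0.596 + (0.80 - 0.75)\times 0.2 = 0.606 > 0.60 \;\Rightarrow\; \text{spend} \\[4pt]
\text{with history:}\quad \mathit{eff\_conf} &= 0.6\times 0.34 + 0.4\times 0.80 = 0.524 \\
\text{score} &= 0.596 \quad (\text{no adjustment for } 0.5 \le \mathit{eff\_conf} \le 0.75) \\
0.596 &\not> 0.60 \;\Rightarrow\; \text{do not spend}
\end{aligned}
\]
```

In this case the historical rate removes the small high-confidence boost, which is enough to drop the score below the threshold.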

74% reduction in wasted calls on ambiguous inputs

Framework integrations

LangChain

import requests
from langchain.tools import tool

@tool
def should_spend_guard(action: str, model: str, cost: float,
                       value: float, confidence: float) -> dict:
    """Check if an LLM call is cost-justified before executing."""
    r = requests.post(
        "https://shouldspend.replit.app/v1/should-spend",
        json={
            "action": action,
            "model": model,
            "estimated_cost": cost,
            "expected_value": value,
            "confidence": confidence,
            "risk_tolerance": "medium",
        },
        timeout=10,  # avoid blocking the agent if the API is slow
    )
    r.raise_for_status()  # surface HTTP errors instead of parsing an error body
    return r.json()

# Use in a chain before any expensive tool call
verdict = should_spend_guard.run({
    "action": "summarise_document",
    "model": "gpt-4",
    "cost": 0.03,
    "value": 0.08,
    "confidence": 0.75,
})
if not verdict["should_spend"]:
    raise RuntimeError(verdict["reason"])

AutoGen

import requests
from autogen import AssistantAgent, UserProxyAgent

# budget_tracker is assumed to be an application-supplied object that
# exposes remaining() and deduct(cost); it is not defined in this snippet.

def cost_check(model: str, cost: float, value: float, confidence: float) -> bool:
    r = requests.post(
        "https://shouldspend.replit.app/v1/should-spend",
        json={
            "model": model,
            "estimated_cost": cost,
            "expected_value": value,
            "confidence": confidence,
            "risk_tolerance": "low",
            "budget_remaining": budget_tracker.remaining(),
        },
        timeout=10,
    )
    r.raise_for_status()
    result = r.json()
    if result["should_spend"]:
        # Only approved spends reduce the remaining budget
        budget_tracker.deduct(cost)
    return result["should_spend"]

# Wrap the model call in your agent's reply function
def guarded_reply(messages, sender, config):
    if not cost_check("gpt-4o", 0.015, 0.04, 0.8):
        return True, "Skipped: cost not justified."
    # proceed with normal reply logic...

CrewAI

import requests
from crewai import Agent, Task, Crew
from crewai.tools import BaseTool

class CostGuardTool(BaseTool):
    name: str = "cost_guard"
    description: str = "Checks if an action is cost-justified."

    def _run(self, model: str, cost: float, value: float, confidence: float) -> str:
        r = requests.post(
            "https://shouldspend.replit.app/v1/should-spend",
            json={
                "model": model,
                "estimated_cost": cost,
                "expected_value": value,
                "confidence": confidence,
                "risk_tolerance": "medium",
            },
            timeout=10,
        )
        r.raise_for_status()
        d = r.json()
        verdict = "APPROVED" if d["should_spend"] else "BLOCKED"
        return f"{verdict}: {d['reason']}. Alt: {d.get('alternative', 'none')}"

# CrewAI agents also expect a goal and backstory; the values below are placeholders.
agent = Agent(
    role="Cost-aware researcher",
    goal="Complete research tasks without unjustified spend",
    backstory="Checks each expensive action against the cost guard first.",
    tools=[CostGuardTool()],
)

cURL

curl -X POST https://shouldspend.replit.app/v1/should-spend \
  -H "Content-Type: application/json" \
  -d '{
    "action": "call_model",
    "provider": "openai",
    "model": "gpt-4",
    "estimated_cost": 0.02,
    "expected_value": 0.05,
    "confidence": 0.7,
    "risk_tolerance": "medium",
    "budget_remaining": 5.00,
    "historical_success_rate": 0.82
  }'

# Response
{
  "should_spend": true,
  "confidence": 0.74,
  "reason": "Expected value exceeds cost",
  "cost_efficiency": 0.6,
  "risk": "low",
  "alternative": "gpt-4o-mini (~97% cheaper, suitable for most tasks)",
  "budget_impact": {
    "budget_remaining_after": 4.98
  }
}

Pricing

Free

$0 / month

For solo developers and experiments.

500 decisions / month
Core scoring engine
Risk classification
Model alternatives

Get started

Pro

$49 / month

For teams running agents in production.

Unlimited decisions
Budget controls + hard gates
Risk tolerance profiles
Historical outcome tracking
Usage analytics dashboard
Custom spend thresholds
Priority support

Join waitlist
