Cost guardrail for
autonomous agents

Evaluate whether an action is worth the cost before it executes.

Every agent should ask one question before spending:
should this cost?

API endpoint

POST /v1/should-spend
request payload
{
  "action": "call_model",
  "provider": "openai",
  "model": "gpt-4",
  "estimated_cost": 0.02,
  "expected_value": 0.05,
  "confidence": 0.7,
  "risk_tolerance": "medium",
  "budget_remaining": 5.00,
  "historical_success_rate": 0.82
}

Confidence

Why AI agents need a cost layer

Autonomous agents make decisions continuously — calling models, running searches, invoking tools — each with a real dollar cost. Without guardrails, a single runaway loop or an unnecessarily expensive model call can burn through budget in seconds. Most agent frameworks focus on what to do next, not whether doing it is economically justified.

ShouldSpend adds a cost-awareness checkpoint before any expensive action executes. Pass what the action costs, what you expect to gain, how confident you are, and — optionally — how risk-averse the agent is and how much budget remains. The API returns a binary verdict, a risk level, and a cheaper alternative when one exists.

Request fields

estimated_cost Projected spend for this action in USD. Compared directly against expected_value to anchor the base score.
expected_value Anticipated return or benefit in USD equivalent. A 2.5× value-to-cost ratio yields a strong score; lower ratios compress toward the decision threshold.
confidence Your certainty that the expected value will materialise, as a probability (0–1). Below 0.5 penalises the score; above 0.75 provides a small boost.
risk_tolerance NEW "low" raises the spend threshold to 0.72 — only clearly justified actions pass. "medium" (default) uses 0.60. "high" lowers it to 0.45 for exploratory agents.
budget_remaining NEW Hard budget gate in USD. If estimated_cost exceeds this value the API returns should_spend: false immediately, bypassing the scoring model entirely.
historical_success_rate NEW Past success rate for this action type (0–1). When provided, the engine weights it at 60% against subjective confidence (40%), reducing reliance on estimates and improving decision quality over time.

Open decision logic

The scoring model is intentionally transparent — no black box. Here is the complete algorithm the API runs on every request:

scoring algorithm — pseudocode
# 1. Hard budget gate (checked before scoring) if budget_remaining and estimated_cost > budget_remaining: return { should_spend: false, reason: "Exceeds remaining budget" } # 2. Base score from value ratio score = 0.5 if expected_value > estimated_cost: ratio = expected_value / estimated_cost score += min(0.30, (ratio - 1) * 0.12) else: shortfall = estimated_cost - expected_value score -= min(0.40, shortfall * 5) # 3. Effective confidence (blends historical data when available) if historical_success_rate is provided: eff_conf = historical_success_rate * 0.6 + confidence * 0.4 else: eff_conf = confidence if eff_conf < 0.5: score -= (0.5 - eff_conf) * 0.4 if eff_conf > 0.75: score += (eff_conf - 0.75) * 0.2 score = clamp(score, 0, 1) # 4. Spend threshold from risk_tolerance threshold = { "low": 0.72, "medium": 0.60, "high": 0.45 }[risk_tolerance] should_spend = score > threshold
Research agent switching from GPT-4 to GPT-4o-mini model selection

A document research agent was using GPT-4 for every query regardless of complexity. After routing calls through ShouldSpend with risk_tolerance: "medium", 71% of requests were flagged with a GPT-4o-mini alternative. Replacing those calls produced statistically identical output quality (measured by downstream task completion) at a fraction of the cost.

$2,340 saved over 30 days — same output quality
Data pipeline hitting daily budget cap by noon budget control

An extraction pipeline processing variable-length documents had no per-call cost awareness. Setting budget_remaining on each call (decremented after each approved spend) meant the hard gate fired automatically when the daily limit was reached — no developer intervention, no alert fatigue, no overrun.

100% budget overruns eliminated across 61 pipeline runs
Code review agent with low-confidence inputs historical scoring

A code review agent processing ambiguous diffs had a historical success rate of 0.34 for files over 500 lines. Passing historical_success_rate: 0.34 caused the engine to weight past outcomes over subjective confidence, returning should_spend: false for 28% of calls that would otherwise have proceeded. False positive rate dropped from 19% to 5%.

74% reduction in wasted calls on ambiguous inputs
import requests from langchain.tools import tool @tool def should_spend_guard(action: str, model: str, cost: float, value: float, confidence: float) -> dict: """Check if an LLM call is cost-justified before executing.""" r = requests.post("https://shouldspend.replit.app/v1/should-spend", json={ "action": action, "model": model, "estimated_cost": cost, "expected_value": value, "confidence": confidence, "risk_tolerance": "medium" }) return r.json() # Use in a chain before any expensive tool call verdict = should_spend_guard.run({ "action": "summarise_document", "model": "gpt-4", "cost": 0.03, "value": 0.08, "confidence": 0.75 }) if not verdict["should_spend"]: raise Exception(verdict["reason"])
import requests from autogen import AssistantAgent, UserProxyAgent def cost_check(model: str, cost: float, value: float, confidence: float) -> bool: r = requests.post("https://shouldspend.replit.app/v1/should-spend", json={ "model": model, "estimated_cost": cost, "expected_value": value, "confidence": confidence, "risk_tolerance": "low", "budget_remaining": budget_tracker.remaining() }) result = r.json() if result["should_spend"]: budget_tracker.deduct(cost) return result["should_spend"] # Wrap the model call in your agent's reply function def guarded_reply(messages, sender, config): if not cost_check("gpt-4o", 0.015, 0.04, 0.8): return True, "Skipped: cost not justified." # proceed with normal reply logic...
import requests from crewai import Agent, Task, Crew from crewai.tools import BaseTool class CostGuardTool(BaseTool): name: str = "cost_guard" description: str = "Checks if an action is cost-justified." def _run(self, model: str, cost: float, value: float, confidence: float) -> str: r = requests.post( "https://shouldspend.replit.app/v1/should-spend", json={ "model": model, "estimated_cost": cost, "expected_value": value, "confidence": confidence, "risk_tolerance": "medium" } ) d = r.json() verdict = "APPROVED" if d["should_spend"] else "BLOCKED" return f"{verdict}: {d['reason']}. Alt: {d.get('alternative','none')}" agent = Agent(role="Cost-aware researcher", tools=[CostGuardTool()])
curl -X POST https://shouldspend.replit.app/v1/should-spend \ -H "Content-Type: application/json" \ -d '{ "action": "call_model", "provider": "openai", "model": "gpt-4", "estimated_cost": 0.02, "expected_value": 0.05, "confidence": 0.7, "risk_tolerance": "medium", "budget_remaining": 5.00, "historical_success_rate": 0.82 }' # Response { "should_spend": true, "confidence": 0.74, "reason": "Expected value exceeds cost", "cost_efficiency": 0.6, "risk": "low", "alternative": "gpt-4o-mini (~97% cheaper, suitable for most tasks)", "budget_impact": { "budget_remaining_after": 4.98 } }
Free
$0 / month
For solo developers and experiments.
  • 500 decisions / month
  • Core scoring engine
  • Risk classification
  • Model alternatives
Get started
Pro
$49 / month
For teams running agents in production.
  • Unlimited decisions
  • Budget controls + hard gates
  • Risk tolerance profiles
  • Historical outcome tracking
  • Usage analytics dashboard
  • Custom spend thresholds
  • Priority support
Join waitlist