Why we’re building rendfly

When production AI agents fail, they usually fail silently.

Sentry sees HTTP 200. Datadog sees a healthy latency curve. New Relic sees the database queries and the cache hit rate. All green. Meanwhile the agent on the other end of the WhatsApp number is recommending a competitor’s product, escalating an angry customer to a deleted Slack channel, and quietly inventing a refund policy that doesn’t exist.

By some estimates, 82% of AI bugs in production are silent failures of this kind. Existing observability tools cannot see them — they were built when “did the request finish?” was the interesting question. For a chatbot that always finishes, that question is meaningless.

The system message is the contract

Every team that operates an AI agent has already written down the rules. They live in the system message: “You are a customer support agent for X. Don’t give medical advice. Always escalate billing disputes to a human. Never quote prices outside the published catalog. If you don’t know, say so.”

That paragraph is the contract between the company and its agent. It’s also the answer key for whether each conversation passed or failed.

rendfly turns the system message into the rubric. We extract the rules, classify them by type and severity, and run an LLM-as-judge against every conversation in production. When the agent’s behavior drifts from the contract, you get an alert with the exact span that triggered the verdict — not a vague “performance dropped” but a concrete “rule 3 (no medical advice) failed in conversation 7f3a, here’s the message.”
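To make the idea concrete, here is a minimal sketch of a per-rule check. It is an illustration under stated assumptions, not rendfly's actual API: the names Rule, Verdict, evaluate, and keyword_judge are hypothetical, the "kind" and "severity" labels are made up for the example, and a trivial keyword matcher stands in for the LLM judge so the snippet runs without a model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Rule:
    rule_id: int
    text: str       # the rule as written in the system message
    kind: str       # hypothetical label, e.g. "prohibition" or "escalation"
    severity: str   # hypothetical label, e.g. "high" or "medium"


@dataclass(frozen=True)
class Verdict:
    rule_id: int
    passed: bool
    message_index: Optional[int]  # first offending agent message, if any


# A judge receives a rule and one agent message and returns True when the
# message complies. In a real system this would wrap an LLM call.
Judge = Callable[[Rule, str], bool]


def evaluate(rules: List[Rule], agent_messages: List[str], judge: Judge) -> List[Verdict]:
    """Check every rule against every agent message in one conversation."""
    verdicts = []
    for rule in rules:
        failing = next(
            (i for i, msg in enumerate(agent_messages) if not judge(rule, msg)),
            None,
        )
        verdicts.append(Verdict(rule.rule_id, failing is None, failing))
    return verdicts


def keyword_judge(rule: Rule, message: str) -> bool:
    """Stand-in judge (assumption): flags rule 3 on a few medical keywords."""
    banned = {3: ["ibuprofen", "dosage"]}
    lowered = message.lower()
    return not any(word in lowered for word in banned.get(rule.rule_id, []))


RULES = [
    Rule(3, "Don't give medical advice.", "prohibition", "high"),
    Rule(4, "Always escalate billing disputes to a human.", "escalation", "high"),
]
MESSAGES = [
    "Hi, how can I help you today?",
    "You should take 400mg of ibuprofen for that.",
]

VERDICTS = evaluate(RULES, MESSAGES, keyword_judge)
for v in VERDICTS:
    if not v.passed:
        print(f"rule {v.rule_id} failed at message {v.message_index}: "
              f"{MESSAGES[v.message_index]!r}")
```

The point of the shape is that a failed verdict carries the rule and the index of the offending message, which is what lets an alert point at a specific message rather than an aggregate score.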

Why post-deploy, not pre-deploy

Tools like Braintrust, LangSmith, and Promptfoo are great at what they do — they run before deploy, against curated prompts, to catch regressions in the lab. But drift happens in production. The model gets a silent quality update. The system message gets edited. A customer asks something nobody anticipated. The lab can’t see any of that.

rendfly runs after deploy, on real conversations. Every reply gets a per-rule scorecard and a drift signal computed against a 7-day rolling baseline. When a quiet regression starts, you see it on the dashboard, usually before customers do.
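One simple way to read "drift against a 7-day rolling baseline" is sketched below. This is an assumption about how such a signal could be computed, not a description of rendfly's implementation: the function names, the use of a daily pass rate per rule, and the 0.05 threshold are all hypothetical.

```python
from statistics import mean
from typing import List


def drift_signal(daily_pass_rates: List[float], window: int = 7) -> float:
    """Today's pass rate minus the mean of the preceding `window` days.

    `daily_pass_rates` is ordered oldest to newest; the last entry is today.
    A negative value means the rule is passing less often than its baseline.
    """
    if len(daily_pass_rates) < window + 1:
        raise ValueError(f"need at least {window + 1} days of data")
    *history, today = daily_pass_rates
    baseline = mean(history[-window:])
    return today - baseline


def is_drifting(daily_pass_rates: List[float],
                threshold: float = 0.05,
                window: int = 7) -> bool:
    """Flag a drop larger than `threshold` (an assumed, tunable value)."""
    return drift_signal(daily_pass_rates, window) < -threshold


# Seven stable days for one rule, then a drop on day eight.
RATES = [0.98, 0.97, 0.98, 0.99, 0.97, 0.98, 0.98, 0.90]

if is_drifting(RATES):
    print(f"drift detected: {drift_signal(RATES):+.3f} vs 7-day baseline")
```

Comparing against the rule's own recent history, rather than a fixed target, means a rule that has always passed 90% of the time does not alert, while one that slips from 98% to 90% does.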

What rendfly is not

We are not a general APM tool. Sentry and Datadog cover infrastructure beautifully; we don’t try to compete with that. We are not a dev-time eval framework. The pre-deploy tools cover that surface.

What we are is the conversational-quality layer that nobody covers today: the thing that watches whether the agent is honoring its own contract once it’s in front of real users.

Coming soon

We’re shipping early access in waves. If you operate an AI agent in production and the silent-failure problem sounds familiar, drop your email at rendfly.com and we’ll get in touch.