Most AI evaluation requests are underspecified

Teams often ask for “RAG evaluation,” “LLM evals,” or an “AI evaluation consultant” before they can name the exact failure they need to measure. The model may hallucinate, cite the wrong source, choose the wrong tool, miss a graph relation, or route a user to the wrong workflow. Those failures need different evidence.

A vague eval project can become a week of dashboards that nobody trusts. A useful first diagnostic starts with the operating question: which decisions are expensive when wrong, what logs already exist, what labels are trusted, and who is allowed to approve changes to prompts, retrieval, tools, or model routing?

Package the eval as a workflow

The practical shape is a small Flev workflow: collect representative queries or traces, define oracle labels, compare baseline and candidate behavior, keep citations and tool calls attached to each case, and return a short report that separates confirmed failures from missing evidence.
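
As a minimal sketch of that workflow, the Python below keeps each case's label and retrieved sources together and sorts results into confirmed failures versus missing evidence. The names `EvalCase`, `classify`, and `build_report`, and the "every expected source must be retrieved" pass rule, are illustrative assumptions, not part of any Flev API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvalCase:
    """One representative query with its label and observed behavior (hypothetical schema)."""
    case_id: str
    query: str
    # None (or an empty set) means nobody has supplied a trusted oracle label yet.
    expected_sources: Optional[set[str]]
    baseline_sources: set[str] = field(default_factory=set)
    candidate_sources: set[str] = field(default_factory=set)


def classify(case: EvalCase, retrieved: set[str]) -> str:
    """Separate confirmed failures from cases that cannot be judged."""
    if not case.expected_sources:
        return "missing_evidence"
    # Assumed pass rule: every expected source must appear in the retrieved set.
    if case.expected_sources <= retrieved:
        return "pass"
    return "confirmed_failure"


def build_report(cases: list[EvalCase]) -> list[dict[str, str]]:
    """One row per case, comparing baseline and candidate on the same label."""
    return [
        {
            "case_id": c.case_id,
            "baseline": classify(c, c.baseline_sources),
            "candidate": classify(c, c.candidate_sources),
        }
        for c in cases
    ]
```

A case with no trusted label is reported as `missing_evidence` for both systems instead of being counted as a pass or a failure, which is what keeps the report honest about what the data can actually show.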

Teams that already have a concrete eval path can scope a fixed diagnostic through the same intake used for Flev DevOps: submit one RAG, agent-routing, retrieval, or tool-selection workflow. Payment is requested only once the scope fits. The first message should not include private credentials; it can rely on redacted logs or public repo evidence instead.

The evaluation flow

A buyer-ready eval keeps the test set, labels, traces, and approval decisions connected.

flowchart TB
  Question[Business failure question] --> Cases[Representative queries or traces]
  Cases --> Labels[Oracle labels and expected sources]
  Cases --> Baseline[Baseline RAG or agent behavior]
  Baseline --> Metrics[Precision, recall, tool choice, latency]
  Labels --> Metrics
  Metrics --> Evidence[Reviewable evidence table]
  Evidence --> Boundary[Approval boundary for prompt, retrieval, model, or tool changes]
  Boundary --> Report[Diagnostic report and next workflow]
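
One way to picture the Metrics, Evidence, and Boundary steps is a small evidence table that flags which candidate metrics got worse, so the person who owns the approval boundary sees them before a change ships. This is a hedged sketch: `MetricRow`, `regressions`, and the `tolerance` parameter are hypothetical names, and the rule "any regression needs sign-off" is an assumption rather than a stated policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricRow:
    """One line of the reviewable evidence table (hypothetical schema)."""
    name: str
    baseline: float
    candidate: float
    higher_is_better: bool = True  # set False for latency or cost


def regressions(rows: list[MetricRow], tolerance: float = 0.0) -> list[str]:
    """Return the metrics where the candidate is worse than baseline by more than tolerance."""
    worse = []
    for row in rows:
        delta = row.candidate - row.baseline
        if not row.higher_is_better:
            delta = -delta  # for latency or cost, an increase is a regression
        if delta < -tolerance:
            worse.append(row.name)
    return worse


def needs_approval(rows: list[MetricRow], tolerance: float = 0.0) -> bool:
    """Assumed policy: any regression routes the change to a human approver."""
    return bool(regressions(rows, tolerance))
```

The point of the design is that the table never approves anything by itself; it only decides whether a change to prompts, retrieval, model, or tools can proceed without review.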

What to send before asking for help

Send 10 to 30 representative questions, the expected source IDs or accepted answers, sample traces if available, a description of the current retrieval or routing stack, the failure that hurts the business, and the metric that would change a decision. Good starter diagnostics include source precision, source recall, answer faithfulness, tool-selection accuracy, graph retrieval coverage, and latency or cost per run. If the only evidence so far is a GitHub issue, treat it as a technical need rather than a paying buyer until someone confirms budget or ownership.
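
Three of those starter diagnostics are simple enough to sketch directly. The functions below assume that sources are identified by string IDs and that each tool-selection case records one expected and one chosen tool name; the function names are illustrative, not taken from any particular library.

```python
def source_precision(retrieved: list[str], expected: set[str]) -> float:
    """Share of distinct retrieved sources that are in the expected set."""
    unique = set(retrieved)
    if not unique:
        return 0.0  # nothing retrieved, so nothing was correct
    return len(unique & expected) / len(unique)


def source_recall(retrieved: list[str], expected: set[str]) -> float:
    """Share of expected sources that were actually retrieved."""
    if not expected:
        # Without a label the metric is undefined; report missing evidence instead.
        raise ValueError("recall needs at least one expected source")
    return len(set(retrieved) & expected) / len(expected)


def tool_selection_accuracy(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (expected_tool, chosen_tool) pairs that match exactly."""
    if not pairs:
        raise ValueError("accuracy needs at least one labeled case")
    return sum(expected == chosen for expected, chosen in pairs) / len(pairs)
```

For example, retrieving `["a", "b", "c", "d"]` against expected sources `{"a", "b", "x"}` gives a precision of 2/4 and a recall of 2/3. Raising an error on unlabeled cases, rather than returning zero, keeps missing labels from being read as model failures.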

The commercial lesson is simple: sell the smallest reviewable eval workflow first. Do not sell a giant AI quality platform before the buyer can see which failures are real and which data is missing.