Measuring AI SDR Performance and Attribution
The metrics and instrumentation a GTM Engineer wires up to prove an AI SDR drives revenue, not just sends volume.
An AI SDR will happily report that it sent 12,000 emails and made 4,000 dials last month. None of that is evidence it worked. The agent's job is pipeline, and the GTM Engineer's job is to instrument the system so leadership can see exactly how much pipeline the agent created and how much of it would have shown up anyway. That second half is where most teams fail. They measure activity, declare victory, and can't answer the one question the CRO asks: what did this thing move?
This is a measurement problem, and measurement is engineering work. If you own the AI SDR, you own the proof. Here's the metric stack, the deliverability gates, the attribution models that hold up under scrutiny, and the trace-event backbone that makes all of it auditable. For the operational side of running the agent day to day, see managing AI SDRs.
The Metrics That Matter (And What Good Looks Like)
Build the funnel top to bottom and measure conversion at every step. Volume metrics sit at the top and matter least. Outcome metrics sit at the bottom and matter most.
Connect rate (calls): the share of dials that reach a live human. Healthy is 8-12%. Most teams in the current market see 3-10%. Verified direct-dial numbers connect at 3-5x the rate of switchboard numbers, so data quality moves this number more than dial volume does.
Reply rate (email): a good cold reply rate is 6-8%, above the 5.8% average. Static cold lists land near 5%. Lists triggered by intent signals hit 12-15%. If your AI SDR is starved for replies, the fix is usually targeting, not copy.
Positive reply rate: replies that express interest, not "unsubscribe" or "wrong person." This is the first metric that filters out noise. An agent can manufacture a high reply rate with provocative subject lines and produce zero positive replies. Track positive replies as a percentage of total sent, and watch the ratio of positive to negative replies over time.
Meetings booked: median is 8-10 per month per equivalent rep, top quartile hits 12-15, elite books 18+. For an AI SDR, normalize this against cost, not headcount. Meetings per dollar of spend is the honest unit.
Show rate: booked meetings that happen. 75-85% is healthy, 80%+ is the target. A low show rate after a high book rate signals the agent is booking weak meetings, often by over-promising or qualifying loosely.
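Meeting-to-opportunity rate: 50% or higher is the bar, meaning at least one in two held meetings becomes a sales-accepted opportunity. The 1:3 to 1:5 ratio often quoted for strong SDRs (one sales-accepted opportunity for every three to five meetings) works out to 20-33%, well under that bar, so check whether a benchmark counts booked or held meetings before comparing against it. Below the bar, the meetings aren't qualified.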
Pipeline created and opportunities: the bottom of the funnel and the only metrics leadership cares about. Pipeline coverage should sit at a minimum of 3x quota, with strong teams holding 4-5x. Attribute pipeline to the agent at the opportunity level, not the lead level, because leads inflate and opportunities don't.
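The funnel above reduces to a handful of ratios. Here's a minimal sketch, assuming you can pull raw counts per period from your CRM and sequencer. The field names are hypothetical, and the sample counts are illustrative only (the meeting and opportunity numbers mirror the FAQ example at the end of this article; the reply and spend figures are made up). Meeting-to-opportunity is computed over held meetings, which is an assumption you should match to your own definition.

```python
from dataclasses import dataclass


@dataclass
class FunnelCounts:
    """Raw counts for one reporting period (field names are illustrative)."""
    sent: int
    replies: int
    positive_replies: int
    meetings_booked: int
    meetings_held: int
    opportunities: int
    spend_usd: float


def rate(numerator, denominator):
    """Safe ratio: returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def funnel_metrics(c):
    """Conversion at each funnel step, plus meetings normalized against cost."""
    return {
        "reply_rate": rate(c.replies, c.sent),
        # Positive replies are tracked against total sent, not total replies.
        "positive_reply_rate": rate(c.positive_replies, c.sent),
        "show_rate": rate(c.meetings_held, c.meetings_booked),
        # Assumption: opportunities are divided by held meetings.
        "meeting_to_opp_rate": rate(c.opportunities, c.meetings_held),
        "meetings_per_1k_usd": rate(c.meetings_booked, c.spend_usd) * 1000,
    }


# Illustrative counts only, not benchmark data.
example = FunnelCounts(
    sent=10_000, replies=600, positive_replies=150,
    meetings_booked=40, meetings_held=30,
    opportunities=15, spend_usd=5_000.0,
)
metrics = funnel_metrics(example)
```

Computing every step from the same counts object keeps the denominators honest: nobody can quietly switch show rate from "booked" to "invited" between reports.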
Deliverability Health Metrics
An AI SDR can quietly destroy a sending domain in a week. High volume plus templated copy plus stale lists equals spam folder, and once you're flagged the recovery takes months. These gates are non-negotiable, and the GTM Engineer enforces them as hard thresholds in the pipeline.
Spam complaint rate under 0.10%. Gmail's sender guidelines treat 0.30% as the line you should never reach. Cross it and expect Gmail to start filtering or rejecting your mail. Keep a buffer and alert at 0.08%.
Hard bounce rate under 2%. Above 2% and mailbox providers downgrade your reputation, which drags down every future send. Verify every address before the agent touches it.
Authentication: SPF, DKIM, and DMARC all passing. Since 2024, Gmail and Yahoo require all three for senders above 5,000 messages a day, plus one-click unsubscribe honored within two days. These aren't best practices anymore, they're entry requirements.
Inbox placement, not just delivery. "Delivered" means the server accepted the message. It doesn't mean a human saw it. Run seed-list tests across Gmail, Outlook, and Yahoo to measure what share lands in the inbox versus spam. A 7% reply rate on 40% inbox placement works out to roughly 17.5% among the messages that actually reached an inbox, with deliverability throttling the rest. Fix placement and the agent's numbers jump without changing a word of copy.
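The gates above can be enforced as a pre-send check. This is a sketch under stated assumptions: the function names and return shape are hypothetical, and rates are computed over total sent for simplicity, whereas mailbox providers calculate complaint rate their own way. The second function is the placement arithmetic from the last gate: reply rate divided by inbox placement.

```python
# Thresholds taken from the gates above.
MAX_COMPLAINT_RATE = 0.0010    # 0.10% spam complaints
ALERT_COMPLAINT_RATE = 0.0008  # 0.08% early-warning buffer
MAX_HARD_BOUNCE_RATE = 0.02    # 2% hard bounces


def deliverability_gate(sent, complaints, hard_bounces, spf_ok, dkim_ok, dmarc_ok):
    """Return a dict with ok/failures/warnings. Any failure should block sending."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    failures, warnings = [], []
    complaint_rate = complaints / sent
    bounce_rate = hard_bounces / sent

    if complaint_rate >= MAX_COMPLAINT_RATE:
        failures.append(f"complaint rate {complaint_rate:.4%} at or above 0.10%")
    elif complaint_rate >= ALERT_COMPLAINT_RATE:
        warnings.append(f"complaint rate {complaint_rate:.4%} inside alert buffer")

    if bounce_rate >= MAX_HARD_BOUNCE_RATE:
        failures.append(f"hard bounce rate {bounce_rate:.2%} at or above 2%")

    if not (spf_ok and dkim_ok and dmarc_ok):
        failures.append("SPF, DKIM, and DMARC must all pass")

    return {"ok": not failures, "failures": failures, "warnings": warnings}


def placement_adjusted_reply_rate(reply_rate, inbox_placement):
    """Reply rate among messages that actually reached an inbox."""
    if not 0 < inbox_placement <= 1:
        raise ValueError("inbox_placement must be in (0, 1]")
    return reply_rate / inbox_placement


healthy = deliverability_gate(10_000, 6, 110, True, True, True)
adjusted = placement_adjusted_reply_rate(0.07, 0.40)  # about 0.175
```

Wire the gate in front of sequence enrollment so a failed check stops the agent, rather than running it as a report someone reads after the damage is done.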
Attribution Models
Once the agent is producing meetings and pipeline, leadership wants to know how much of the revenue to credit to it. There's no single clean answer, so you stack methods and triangulate.
First-touch and last-touch are simple and wrong on their own. First-touch over-credits the opener. Last-touch over-credits the closer. Both ignore the middle of the journey. Use them as sanity checks, never as the system of record.
Multi-touch attribution (MTA) distributes fractional credit across every touch in the path: the AI SDR's email, the rep's follow-up call, the demo, the ABM ad. Modern platforms compute this across paid media, SDR and AE outreach, events, and partner activity in near real time. MTA tells you which touches appear in winning paths. It does not tell you which touches caused the win. That's its ceiling.
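To make "fractional credit" concrete, here's a sketch of two common rules-based schemes: linear (equal split) and position-based. The 40/20/40 weighting is a widely used convention, not something this article prescribes, and the touch labels and deal amount are hypothetical.

```python
from collections import defaultdict


def linear_credit(path, amount):
    """Split the amount equally across every touch in the path."""
    credit = defaultdict(float)
    for touch in path:
        credit[touch] += amount / len(path)
    return dict(credit)


def position_based_credit(path, amount, first_weight=0.4, last_weight=0.4):
    """Weight the first and last touch; spread the remainder over the middle."""
    credit = defaultdict(float)
    if not path:
        return {}
    if len(path) == 1:
        credit[path[0]] += amount
    elif len(path) == 2:
        # No middle touches: renormalize the two endpoint weights.
        total = first_weight + last_weight
        credit[path[0]] += amount * first_weight / total
        credit[path[1]] += amount * last_weight / total
    else:
        middle_weight = (1 - first_weight - last_weight) / (len(path) - 2)
        credit[path[0]] += amount * first_weight
        credit[path[-1]] += amount * last_weight
        for touch in path[1:-1]:
            credit[touch] += amount * middle_weight
    return dict(credit)


# Hypothetical path for one won opportunity.
path = ["ai_sdr_email", "rep_call", "demo", "abm_ad"]
linear = linear_credit(path, 100_000)
u_shaped = position_based_credit(path, 100_000)
```

Notice that both functions hand out the full amount no matter what. Neither one can return "this touch deserved zero," which is exactly the ceiling described above.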
Shapley-value attribution borrows from cooperative game theory to allocate credit based on each touch's average marginal contribution across all the combinations of touches it appears in. It captures interaction effects that rules-based MTA misses and handles the "this touch only mattered alongside that one" problem. The standard formulation scores sets of touches rather than sequences, so strict ordering effects need an order-aware variant. It's heavier to compute but closer to causal than fixed-weight models.
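Here's a minimal exact Shapley computation for a small number of channels. Treat it as a sketch: the value function (a coalition is worth the conversions from paths whose touches all fall inside it) is one simple choice among several, the channel names and counts are hypothetical, and exact enumeration only scales to a handful of channels. It scores sets of touches, so it ignores the order in which they occurred.

```python
from itertools import combinations
from math import factorial


def coalition_value(coalition, conversions_by_set):
    """Conversions from paths whose touch set is fully contained in the coalition."""
    return sum(n for touches, n in conversions_by_set.items() if touches <= coalition)


def shapley_credit(conversions_by_set):
    """Exact Shapley value per channel; exponential in the number of channels."""
    channels = sorted(set().union(*conversions_by_set))
    n = len(channels)
    credit = {}
    for channel in channels:
        others = [c for c in channels if c != channel]
        total = 0.0
        for k in range(n):
            # Weight for coalitions of size k that the channel joins.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                without = frozenset(subset)
                with_channel = without | {channel}
                total += weight * (
                    coalition_value(with_channel, conversions_by_set)
                    - coalition_value(without, conversions_by_set)
                )
        credit[channel] = total
    return credit


# Hypothetical conversion counts keyed by the set of channels in the path.
conversions = {
    frozenset({"ai_sdr"}): 4,
    frozenset({"rep_call"}): 2,
    frozenset({"ai_sdr", "rep_call"}): 6,
}
credit = shapley_credit(conversions)
```

The credits always sum to the total conversions, which is what makes Shapley attractive for reporting. It still only divides up wins that happened, so it inherits the same blind spot as every other model in this section.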
The trap with all of these: they distribute credit among touches that happened, and assume every touch mattered. They can't see the deals that would have closed without the agent. For that, you need incrementality.
Isolating Agent Contribution
The honest way to prove the AI SDR drives incremental revenue is a holdout test. Take your target account list, randomly hold back a matched set from AI outreach for a fixed window, and let the agent work the rest. At the end, compare pipeline created in the treated group against the holdout group. The gap is the agent's incremental contribution, the pipeline that wouldn't exist otherwise.
This is the same logic as incrementality testing in paid media, where you configure holdouts to measure true causal lift rather than coincidental credit. It's the one method that survives a skeptical CFO. Multi-touch attribution says "the agent touched 60% of won deals." A holdout says "accounts the agent worked produced 22% more pipeline than accounts it didn't." The second statement is causal. The first is correlation wearing a suit.
Practical notes. Match the groups on firmographics and prior engagement so you're comparing like with like. Size the holdout so the difference clears statistical noise, usually 15-20% of the list for a meaningful read. Hold it for at least one full sales cycle. And accept the cost: you're deliberately not working some accounts, which feels like leaving money on the table. It's the price of knowing the truth instead of guessing.
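The mechanics of a holdout fit in a few functions. This sketch assumes each account carries a single stratum label (for example, a segment built from firmographics and prior engagement) and a per-account pipeline figure at the end of the window. The function names, the 20% default, and the use of a one-sided permutation test are my choices for illustration, not a prescribed method.

```python
import random
from collections import defaultdict


def assign_holdout(accounts, holdout_share=0.2, seed=42):
    """accounts: iterable of (account_id, stratum). Randomize within each stratum."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for account_id, stratum in accounts:
        by_stratum[stratum].append(account_id)
    treated, holdout = [], []
    for ids in by_stratum.values():
        rng.shuffle(ids)
        cut = round(len(ids) * holdout_share)
        holdout.extend(ids[:cut])
        treated.extend(ids[cut:])
    return treated, holdout


def mean(values):
    return sum(values) / len(values)


def relative_lift(treated_pipeline, holdout_pipeline):
    """Lift in mean pipeline per account, relative to the holdout baseline."""
    base = mean(holdout_pipeline)
    return (mean(treated_pipeline) - base) / base


def permutation_p_value(treated, holdout, n_iter=5000, seed=0):
    """One-sided test: how often does random relabeling match the observed gap?"""
    rng = random.Random(seed)
    observed = mean(treated) - mean(holdout)
    pooled = list(treated) + list(holdout)
    n_treated = len(treated)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        if mean(pooled[:n_treated]) - mean(pooled[n_treated:]) >= observed:
            hits += 1
    return (hits + 1) / (n_iter + 1)
```

Compare means per account rather than group totals, since the treated group is several times larger than the holdout. Fix the seed and store the assignment so the split can be audited later.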
Instrumentation and Trace Events
None of the metrics above are trustworthy if you can't reconstruct what the agent did and why. AI SDRs are non-deterministic. The same input can trigger different tool calls, retrieve different records, and produce different messages each run. Activity logs that only count sends won't cut it. You need a trace.
Instrument the agent so every run emits a structured trace: a root span for the request and child spans for each step, such as the research call to Apollo, the LLM API call that drafted the message, the enrichment lookup, the CRM write, and the sequence enrollment. Each span carries inputs, outputs, timing, token counts, and cost. Open one trace and you can see the full chain from "account entered the list" to "meeting booked," with the prompt version and the data the agent acted on at every step.
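To show the shape of a trace, here's a standard-library stand-in. It is deliberately not the OpenTelemetry API or any vendor SDK: the `Tracer` and `Span` classes, span names, attribute keys, and values are all hypothetical. In production you'd emit the same structure through whichever tracing library you pick.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    trace_id: str
    parent_id: Optional[str]
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    attributes: dict = field(default_factory=dict)
    start: float = 0.0
    end: float = 0.0


class Tracer:
    """Collects finished spans in memory; a real setup would export them."""

    def __init__(self):
        self.spans = []
        self._stack = []

    @contextmanager
    def span(self, name, **attributes):
        parent = self._stack[-1] if self._stack else None
        span = Span(
            name=name,
            trace_id=parent.trace_id if parent else uuid.uuid4().hex,
            parent_id=parent.span_id if parent else None,
            attributes=dict(attributes),
        )
        span.start = time.time()
        self._stack.append(span)
        try:
            yield span
        finally:
            span.end = time.time()
            self._stack.pop()
            self.spans.append(span)  # children finish before their parent


# Hypothetical run: one root span, one child span per agent step.
tracer = Tracer()
with tracer.span("ai_sdr.run", account_id="acct_123", prompt_version="v7") as root:
    with tracer.span("research.lookup", source="apollo"):
        pass
    with tracer.span("llm.draft_message") as llm:
        llm.attributes.update(input_tokens=1200, output_tokens=180, cost_usd=0.04)
    with tracer.span("crm.write"):
        pass
```

The two fields doing the work are `trace_id` and `parent_id`. Everything else is payload you can add later; without those two you can't reassemble a run.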
OpenTelemetry has become the default standard for this, with semantic conventions for generative AI systems. Tools like Langfuse, LangSmith, MLflow, and Arize Phoenix capture agent traces out of the box. Pick one and wire it in from day one, not after the agent is already in production and you're trying to explain a deliverability spike with no record of what changed.
The trace backbone pays off in three places. First, debugging: when reply rate drops, you open the traces and find the prompt change or the bad data source that caused it. Second, cost control: agent cost per run is highly variable. One run might cost five cents and the next two dollars, depending on reasoning steps, so you can only compute cost per opportunity if you capture cost per run. Third, attribution itself: the trace is the ground truth that ties a specific AI-generated touch to the opportunity it opened. Without it, your attribution model is inferring connections the data can't support. The same instrumentation discipline underpins AI SDR guardrails, where trace events catch off-policy behavior before it reaches a prospect.
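Cost per opportunity falls out of the trace data once every span records its cost. A sketch, assuming spans are exported as dictionaries with `trace_id` and `cost_usd` keys and that your CRM write stores the trace ID on the opportunity. Both are assumptions about your setup, and the records below are invented for illustration.

```python
from collections import defaultdict


def cost_by_run(spans):
    """Sum span costs per trace; spans without a cost count as zero."""
    totals = defaultdict(float)
    for span in spans:
        totals[span["trace_id"]] += span.get("cost_usd", 0.0)
    return dict(totals)


def cost_per_opportunity(spans, opportunity_by_trace):
    """Total agent cost across ALL runs divided by distinct opportunities opened."""
    opportunities = set(opportunity_by_trace.values())
    if not opportunities:
        return None  # no opportunities yet, so the ratio is undefined
    return sum(cost_by_run(spans).values()) / len(opportunities)


# Hypothetical span records from three runs.
spans = [
    {"trace_id": "t1", "name": "llm.draft_message", "cost_usd": 0.02},
    {"trace_id": "t1", "name": "research.lookup", "cost_usd": 0.03},
    {"trace_id": "t2", "name": "llm.draft_message", "cost_usd": 2.00},
    {"trace_id": "t3", "name": "llm.draft_message", "cost_usd": 0.45},
    {"trace_id": "t3", "name": "crm.write"},
]
# Only run t2 opened an opportunity.
opportunity_by_trace = {"t2": "opp_001"}

run_costs = cost_by_run(spans)
cpo = cost_per_opportunity(spans, opportunity_by_trace)
```

The numerator includes runs that produced nothing. Dividing only the successful runs' cost by opportunities would flatter the agent and hide exactly the variance you're trying to control.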
Reporting to Leadership
The CRO does not want a dashboard of 14 metrics. They want one number with a credible method behind it. Lead with incremental pipeline created, sourced from the holdout, and back it with the funnel that produced it.
A clean report reads: the AI SDR worked 1,200 accounts this quarter. Versus a matched holdout, it produced 22% more pipeline, roughly $640K in net-new incremental opportunities. It booked 84 meetings at a 78% show rate and a 52% meeting-to-opportunity rate. Fully loaded cost was $14K, so the agent generated about $45 of incremental pipeline for every dollar spent, before applying win rate. Deliverability held at a 0.06% complaint rate and 1.1% bounce rate all quarter.
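The arithmetic behind that sample report is worth scripting so every number regenerates from the same inputs. A sketch using the figures above; the function name and output keys are hypothetical, and the baseline figure assumes the $640K is the full 22% gap over what the treated accounts would have produced anyway.

```python
def report_figures(incremental_pipeline_usd, cost_usd, meetings_booked,
                   show_rate, meeting_to_opp_rate, lift):
    """Derive the supporting numbers behind the headline incremental pipeline."""
    meetings_held = meetings_booked * show_rate
    opportunities = meetings_held * meeting_to_opp_rate
    # Assumption: incremental = baseline * lift, so baseline = incremental / lift.
    baseline_pipeline = incremental_pipeline_usd / lift
    return {
        "meetings_held": meetings_held,
        "opportunities": opportunities,
        "pipeline_per_dollar": incremental_pipeline_usd / cost_usd,
        "baseline_pipeline_usd": baseline_pipeline,
        "treated_pipeline_usd": baseline_pipeline + incremental_pipeline_usd,
    }


def revenue_per_dollar(pipeline_per_dollar, win_rate):
    """Convert the pipeline multiple into an expected revenue multiple."""
    return pipeline_per_dollar * win_rate


figures = report_figures(
    incremental_pipeline_usd=640_000,
    cost_usd=14_000,
    meetings_booked=84,
    show_rate=0.78,
    meeting_to_opp_rate=0.52,
    lift=0.22,
)
# figures["pipeline_per_dollar"] is about 45.7
# figures["meetings_held"] is about 65.5, figures["opportunities"] about 34.1
```

Keep the pipeline multiple and the revenue multiple separate in the report. The first is measured; the second depends on a win rate that the agent's opportunities may or may not share with the rest of your pipeline.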
That's a number leadership can act on. It separates the agent's contribution from the rest of the motion, it shows the funnel is healthy rather than gamed, and it proves the sending infrastructure is safe to scale. Pair it with the trace data on request so any skeptic can drill from the headline number down to an individual email the agent sent. Leadership will want that audit trail for governance anyway, which is covered in AI SDR governance.
One more discipline. Hold the ROI claim until you have six to twelve months of data. The first quarter is setup and warmup, and pipeline created early is noise. Model payback over a full sales cycle plus a buffer, and resist the pressure to declare a 90-day win. The GTM Engineers who get trusted with bigger automation budgets are the ones whose numbers held up the second time someone checked them. If you're weighing whether to hire the function or build it, the GTM Engineer vs AI SDR comparison breaks down the tradeoff, and the coding premium in GTM salaries shows why the instrumentation skill commands a pay bump.
Frequently Asked Questions
What metrics prove an AI SDR is working?
Pipeline created and opportunities accepted, not activity volume. An AI SDR that sends 10,000 emails and books 40 meetings looks busy. If those meetings produce a 75% show rate, a 50% meeting-to-opportunity rate, and 15 sales-accepted opportunities, it's working. Activity counts are vanity. Track the funnel down to accepted pipeline, and tie every accepted opportunity back to the specific AI-driven touches that opened it.
How do you separate the AI SDR's contribution from everything else that touched a deal?
Run a holdout. Hold back a randomized set of matched accounts from AI outreach for a fixed window, then compare pipeline created in the treated group against the holdout. The difference is the incremental contribution. Multi-touch attribution alone can't do this because it splits credit across touches that may have happened anyway. A holdout measures what wouldn't have happened without the agent. It's the only method that answers the incrementality question honestly.
What deliverability numbers should I watch before scaling an AI SDR?
Spam complaint rate under 0.10% (Gmail's guidelines say never to reach 0.30%), hard bounce rate under 2%, and SPF, DKIM, and DMARC all passing. If you send more than 5,000 messages a day to Gmail or Yahoo addresses, those authentication records are mandatory and one-click unsubscribe has to work within two days. A high reply rate means nothing if half the volume lands in spam. Check inbox placement, not just send count, before you turn up volume.
How long before AI SDR ROI is real?
Six to twelve months. The first quarter is dominated by setup, list cleanup, deliverability warmup, and prompt tuning. Pipeline created in month one is mostly noise. Opportunities take weeks to qualify and deals take months to close, so judging the agent on a 90-day window measures your onboarding, not the agent. Model the payback over a full sales cycle plus a buffer.
Source: State of GTM Engineering Report 2026 (n=228). Salary data combines survey responses from 228 GTM Engineers across 32 countries with analysis of 3,342 job postings.