How-To Guide

Build a Lead Scoring Model With Claude Code in 2026

The RevOps and GTM Engineer's guide to building rule-based lead scoring in Claude Code. Six dimensions, the YAML tuning pattern, and the iteration loop.

By Rome Thorndike | June 2026

Why Build Lead Scoring in Claude Code

Most CRMs have a built-in scoring feature. Most of them are bad. HubSpot's predictive lead scoring and Salesforce Einstein both work in theory but treat scoring as a black box and require enough labeled data to make the ML meaningful, which most B2B teams don't have. Building a weighted rule-based scoring model in Claude Code gives you a transparent, tunable, debuggable scoring system that fits your actual buying motion.

This guide is for the RevOps lead or GTM Engineer designing a lead scoring model from scratch. The pattern: define dimensions, weight them, code the model, write to the CRM, iterate based on conversion data.

The Setup

1. Install Claude Code and wire up your CRM. Install the CLI (for example, npm install -g @anthropic-ai/claude-code), then connect a HubSpot or Salesforce MCP server. Pull existing leads as test data.

2. Write a CLAUDE.md with your scoring framework. The ICP definition, the dimensions, the weights, the tier boundaries. The CLAUDE.md becomes the spec the agent codes against.

3. Plan the iteration cadence. Build v1, ship it, review conversion data in 4 weeks, tune v2.

The Six Scoring Dimensions That Matter

Most B2B lead scoring models reduce to six dimensions. Different weights per business, but the dimensions are stable.

1. Company fit (firmographic). Employee count, industry, revenue, geography, tech stack. The static-account dimension.

2. Contact fit (persona). Title, seniority, function, department. Whether this specific person is the decision-maker or champion.

3. Intent signal. Recent activity that suggests buying interest. Job posts indicating growth, technology adoptions, page visits.

4. Engagement. What the lead has done with you. Email opens, page visits, content downloads, demo requests.

5. Recency. When the most recent activity happened. A hot signal from a month ago is different from a hot signal from yesterday.

6. Source. Where the lead came from. An inbound demo request scores higher than a contact from a scraped TAM list.

Building the Model: A 200-Line Python Script

The prompt to Claude Code: "Build a Python script that scores each contact in the contacts table. The score is the weighted sum of six dimensions defined in scoring.yaml. For each contact, calculate dimension scores, sum with weights, output a total score (0 to 100) and a tier (A: 80+, B: 60-79, C: 40-59, D: under 40). Write tier and score back to HubSpot custom fields. Log every scoring decision to a CSV for audit."
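The aggregation step that prompt describes can be sketched in a few lines. This is a minimal illustration, not the script Claude Code will generate. It assumes each dimension's points are already on the scale of its weight (so the weights add up to 100), and every weight except company_fit: 25 is a placeholder chosen for the example.

```python
# Weights per dimension. Only company_fit: 25 comes from the guide;
# the other five values are illustrative assumptions that sum to 100.
WEIGHTS = {
    "company_fit": 25,
    "contact_fit": 20,
    "intent": 20,
    "engagement": 20,
    "recency": 10,
    "source": 5,
}


def total_score(dimension_points: dict) -> int:
    """Sum per-dimension points, clamping each to the range 0..weight."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        points = dimension_points.get(name, 0)
        total += max(0, min(points, weight))
    return round(total)


def tier_for(score: int) -> str:
    """Map a 0-100 score to the tiers from the prompt: A 80+, B 60-79, C 40-59, D under 40."""
    if score >= 80:
        return "A"
    if score >= 60:
        return "B"
    if score >= 40:
        return "C"
    return "D"
```

Clamping each dimension to its weight keeps a single overeager rule from pushing a lead past 100 or letting one dimension dominate the total.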

Claude Code scaffolds the script. The scoring.yaml file holds the dimension weights and the rules per dimension. For example:

company_fit:
  weight: 25
  rules:
    - employee_count: 200-2000 -> 25
    - employee_count: 50-199 -> 15
    - employee_count: under_50 -> 0

You edit the YAML to tune. The Python script reads the YAML and re-scores everyone. Tuning is a 30-second edit, not a model retraining job.
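One way the script could interpret those rules is sketched below. The CONFIG dict mirrors what a YAML loader should return for the company_fit example, hard-coded here so the sketch runs without a YAML dependency. The rule grammar ("low-high -> points" and "under_N -> points"), the first-match-wins behavior, and the function names are all assumptions made for illustration.

```python
import re

# Mirrors the company_fit example as a YAML loader should return it
# (hard-coded so this sketch has no third-party dependency).
CONFIG = {
    "company_fit": {
        "weight": 25,
        "rules": [
            {"employee_count": "200-2000 -> 25"},
            {"employee_count": "50-199 -> 15"},
            {"employee_count": "under_50 -> 0"},
        ],
    }
}

RANGE_RULE = re.compile(r"^(\d+)-(\d+)\s*->\s*(\d+)$")
UNDER_RULE = re.compile(r"^under_(\d+)\s*->\s*(\d+)$")


def parse_rule(text):
    """Turn a rule string into (low, high, points) with inclusive bounds."""
    text = text.strip()
    match = RANGE_RULE.match(text)
    if match:
        low, high, points = map(int, match.groups())
        return low, high, points
    match = UNDER_RULE.match(text)
    if match:
        limit, points = map(int, match.groups())
        return 0, limit - 1, points
    raise ValueError(f"Unrecognized rule: {text!r}")


def score_dimension(dimension, contact):
    """Return the points of the first matching rule, capped at the weight; 0 if none match."""
    for rule in dimension["rules"]:
        for field, text in rule.items():
            value = contact.get(field)
            if value is None:
                continue
            low, high, points = parse_rule(text)
            if low <= value <= high:
                return min(points, dimension["weight"])
    return 0
```

With this sketch, a contact at a 500-person company gets 25 points for company fit and one at a 120-person company gets 15. A 5,000-person company scores 0 because the example YAML defines no rule above 2,000, which is the kind of gap the audit log should surface.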

The Engagement Dimension: Where Most Teams Get It Wrong

Engagement scoring is the trickiest dimension because the signals are noisy. Three patterns that fail and one that works.

Fails: counting events. "5 page visits this week" scores high. Problem: a competitor researcher and a real buyer both score the same.

Fails: time-weighted events. "Activity in last 24 hours is worth 3x activity from a month ago." Problem: still doesn't distinguish quality.

Fails: weighted events. "Demo request is worth 100, page visit is worth 5." Problem: misses progression. A demo request from someone with no prior activity is different from a demo request from someone with 5 prior visits.

Works: stage progression. Track which stage of the buying journey each engagement maps to. Following 3 employees on LinkedIn means deep awareness. A pricing page visit means consideration. A demo request means evaluation. Score the highest stage the lead has reached.
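Stage progression is simple to express in code. The sketch below scores only the furthest stage a lead has reached, so repeated low-stage events do not inflate the score. The event names and the point values per stage are hypothetical; only the three stages and their example signals come from the text.

```python
# Stages ordered from earliest to latest in the buying journey.
STAGE_ORDER = ["awareness", "consideration", "evaluation"]

# Points per stage are illustrative assumptions, not values from the guide.
STAGE_POINTS = {"awareness": 5, "consideration": 12, "evaluation": 20}

# Hypothetical event names mapped to the stages described in the text.
EVENT_STAGE = {
    "linkedin_follow": "awareness",
    "pricing_page_view": "consideration",
    "demo_request": "evaluation",
}


def highest_stage(events):
    """Return the furthest stage reached, or None if no event maps to a stage."""
    reached = [EVENT_STAGE[event] for event in events if event in EVENT_STAGE]
    if not reached:
        return None
    return max(reached, key=STAGE_ORDER.index)


def engagement_score(events):
    """Score the highest stage reached; event counts do not matter."""
    stage = highest_stage(events)
    return STAGE_POINTS[stage] if stage else 0
```

Five pricing page views score the same as one here, which is the point: the model rewards movement through the journey, not volume.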

Writing Back to the CRM

The output of the scoring run is two custom fields per contact: gtme_score (integer 0-100) and gtme_tier (A/B/C/D). Alongside them, keep the audit log (the CSV from the scoring run, stored in S3 or loaded into a warehouse table) showing exactly which rules fired for which contacts.
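A sketch of the two outputs follows. It builds the update body for the gtme_score and gtme_tier fields and writes audit rows as CSV, without making any network call. The properties wrapper follows the shape of HubSpot's CRM v3 contact update body as this sketch understands it; check the current API reference before relying on it. The audit columns and function names are assumptions.

```python
import csv
import io

VALID_TIERS = {"A", "B", "C", "D"}

# Assumed audit schema: one row per rule that fired for a contact.
AUDIT_COLUMNS = ["contact_id", "dimension", "rule", "points"]


def contact_update_body(score, tier):
    """Build the JSON-serializable body for writing score and tier to a contact."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if tier not in VALID_TIERS:
        raise ValueError(f"unknown tier: {tier!r}")
    return {"properties": {"gtme_score": int(score), "gtme_tier": tier}}


def write_audit_rows(handle, contact_id, fired_rules, write_header=False):
    """Append one CSV row per fired rule; fired_rules holds (dimension, rule, points)."""
    writer = csv.DictWriter(handle, fieldnames=AUDIT_COLUMNS)
    if write_header:
        writer.writeheader()
    for dimension, rule, points in fired_rules:
        writer.writerow(
            {
                "contact_id": contact_id,
                "dimension": dimension,
                "rule": rule,
                "points": points,
            }
        )


# Usage: write to an in-memory buffer; a real run would open a file instead.
buffer = io.StringIO()
write_audit_rows(
    buffer,
    "contact-123",  # hypothetical contact ID
    [("company_fit", "employee_count: 200-2000", 25)],
    write_header=True,
)
```

Keeping one audit row per fired rule, rather than one row per contact, makes it possible to answer "why did this lead score 72?" with a simple filter.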

Routing is built on the tier. Tier A leads are routed to an SDR within 5 minutes via Slack alert. Tier B goes into a slower outbound sequence. Tier C goes into nurture. Tier D doesn't get worked.
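The routing rules reduce to a lookup table. In the sketch below the action names are hypothetical labels; the actual Slack alert, sequence enrollment, and nurture enrollment would live in whatever tools handle those steps.

```python
# Tier-to-action table; action names are hypothetical labels, not real integrations.
ROUTES = {
    "A": "alert_sdr_slack",
    "B": "outbound_sequence",
    "C": "nurture",
    "D": "no_action",
}


def plan_routing(tiers_by_contact):
    """Group contact IDs by the action their tier maps to."""
    plan = {action: [] for action in ROUTES.values()}
    for contact_id, tier in tiers_by_contact.items():
        if tier not in ROUTES:
            raise ValueError(f"unknown tier for {contact_id}: {tier!r}")
        plan[ROUTES[tier]].append(contact_id)
    return plan
```

Producing a plan first and executing it second makes each nightly run easy to inspect before anything reaches a rep.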

The Iteration Loop

The first scoring model is wrong. The second is less wrong. The third is good enough. Plan for iteration from day one.

Every 4 weeks: pull the conversion data. For each tier, calculate the meeting-book rate, the opportunity-create rate, and the close-won rate. If Tier A has lower conversion than Tier B, your dimension weights are wrong. Adjust the YAML and re-score.
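The four-week review can be sketched as a small function over exported lead records. The record shape (a tier plus three boolean outcome flags) is an assumption about how you export conversion data, and the field names are hypothetical.

```python
from collections import defaultdict

TIERS = ("A", "B", "C", "D")
STAGES = ("meeting_booked", "opportunity_created", "closed_won")


def conversion_by_tier(leads):
    """Return {tier: {stage: rate}} where rate is converted leads / total leads in the tier."""
    counts = defaultdict(lambda: {"leads": 0, **{stage: 0 for stage in STAGES}})
    for lead in leads:
        row = counts[lead["tier"]]
        row["leads"] += 1
        for stage in STAGES:
            if lead.get(stage):
                row[stage] += 1
    return {
        tier: {stage: row[stage] / row["leads"] for stage in STAGES}
        for tier, row in counts.items()
    }


def inversions(rates, stage="meeting_booked"):
    """List adjacent tier pairs where the higher tier converts worse than the lower tier."""
    present = [tier for tier in TIERS if tier in rates]
    return [
        (higher, lower)
        for higher, lower in zip(present, present[1:])
        if rates[higher][stage] < rates[lower][stage]
    ]
```

Any pair returned by inversions, such as ("A", "B"), is the signal described above that the weights need another pass. With small tier counts the rates are noisy, so treat a single inversion on a handful of leads with caution.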

The audit log helps. For deals that closed, look at which dimensions scored highest. For deals that were lost, do the same. The patterns tell you what to weight differently.
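One way to read those patterns is to compare average points per dimension between won and lost deals. The sketch assumes the audit rows have been joined to deal outcomes, so each row carries an outcome of "won" or "lost"; that join and the field names are assumptions.

```python
from collections import defaultdict


def mean_points_by_outcome(audit_rows):
    """Average points per (outcome, dimension) pair."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in audit_rows:
        key = (row["outcome"], row["dimension"])
        sums[key] += row["points"]
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def dimension_gaps(audit_rows):
    """Mean points on won deals minus mean points on lost deals, per dimension."""
    means = mean_points_by_outcome(audit_rows)
    dimensions = {dimension for _, dimension in means}
    return {
        dimension: means.get(("won", dimension), 0.0)
        - means.get(("lost", dimension), 0.0)
        for dimension in dimensions
    }
```

A dimension with a large positive gap separates winners from losers and may deserve more weight. A gap near zero suggests the dimension is not discriminating and its rules need rework.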

What to Avoid

Don't build the ML version first. Predictive scoring with training data sounds good and works badly. Stick with rule-based scoring until you have 1,000+ closed-won deals and a labeled outcome per deal. Most B2B teams never reach this volume.

Don't hide the scoring logic. The CRM tier and score must be transparent to reps. They need to know why Lead X scored higher than Lead Y. The audit log is what makes this possible.

Don't update scores in real time. Score once a day on a nightly run. Real-time scoring creates flapping tiers that confuse reps and waste cycles.

Don't forget recency decay. An engagement signal from 6 months ago shouldn't score the same as one from yesterday. Build a decay function so old signals drop off naturally.
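A decay function can be as small as the sketch below, which uses exponential decay with a half-life. Exponential decay is one reasonable choice among several, and the 30-day half-life is an assumption to tune against your own sales cycle, not a value from this guide.

```python
from datetime import datetime, timedelta, timezone

# Assumed half-life: a signal loses half its value every 30 days.
HALF_LIFE_DAYS = 30.0


def decayed_points(points, event_time, now, half_life_days=HALF_LIFE_DAYS):
    """Scale points by 0.5 for every half-life elapsed since the event."""
    age_days = max(0.0, (now - event_time).total_seconds() / 86400)
    return points * 0.5 ** (age_days / half_life_days)


# Usage: the same 10-point signal at different ages.
now = datetime(2026, 6, 1, tzinfo=timezone.utc)
fresh = decayed_points(10, now, now)
month_old = decayed_points(10, now - timedelta(days=30), now)
half_year_old = decayed_points(10, now - timedelta(days=180), now)
```

With a 30-day half-life, a signal from 30 days ago keeps half its points and one from 180 days ago keeps under 2 percent, so old activity fades without needing a hard cutoff.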

The Verdict

Rule-based lead scoring built in Claude Code is the right pattern for most B2B teams in 2026. Transparent, tunable, debuggable, fast to ship. ML-based scoring is the next step only for teams with the data volume and the ML capacity to support it.

For deeper patterns, see the account scoring model guide and the ICP definition framework.

Authoritative References

For Claude Code's CLI and scripting patterns, see Anthropic's Claude Code documentation.

Frequently Asked Questions

Is rule-based scoring as good as ML-based scoring?

For most B2B teams, yes. ML-based scoring requires hundreds to thousands of labeled examples per outcome and data science capacity to maintain the model. Most B2B teams don't have either. Rule-based scoring built with explicit weights and transparent rules delivers enough of the conversion lift, and you can iterate on it in minutes without retraining.

How often should I re-score leads?

Nightly is the right default. Real-time scoring sounds appealing but creates flapping tiers that confuse reps. Once-per-day scoring is enough granularity for B2B sales cycles and stable enough for the routing rules built on top of the scores.

Should I use Claude Code or a tool like 6sense for scoring?

Different tools, different jobs. 6sense ships intent data and account-level scoring as a product. Claude Code lets you build custom scoring that combines 6sense data with your CRM, engagement, and other signals. Most teams that have 6sense feed its intent data into a custom model rather than relying on 6sense's tier output alone.

What's the right number of scoring dimensions?

Six is the sweet spot for most teams. Fewer and you miss signal. More and the model gets noisy and hard to tune. Start with company fit, contact fit, intent, engagement, recency, and source. Add a seventh dimension only if you have a clear hypothesis it should change scores meaningfully.

Can a non-engineer maintain a Claude Code scoring model?

Yes, with the YAML config pattern. The Python script reads scoring.yaml for the weights and rules. A RevOps person can edit the YAML to tune weights without touching Python. The engineering work is only the initial build (40 hours) and occasional debugging when CRM schemas change.

Source: State of GTM Engineering Report 2026 (n=228). Combines survey responses from 228 GTM Engineers with analysis of 3,342 job postings.