How-To Guide

Build a Web Scraper With Claude Code in One Hour

The GTM Engineer's guide to building scrapers for hiring signals, pricing changes, and competitive intel. With the polite-scraping rules baked in.

Why Build a Scraper With Claude Code

Most GTM Engineers need a scraper at some point. Pull job postings to detect hiring signals. Pull tech-stack pages from BuiltWith. Pull press releases to flag funding events. Pull pricing pages from competitor sites. The data is on the open web. Vendors charge for it. Building a scraper saves the vendor cost and gives you control over what you collect.

Claude Code makes scraping accessible. You describe what you want, the agent writes the scraper, you review the code, you run it. No Selenium tutorials. No BeautifulSoup syntax memorization. The agent handles the boilerplate.

This guide is for the GTM Engineer who needs a working scraper today. Plain Python. Playwright for sites that need JavaScript rendering. Polite scraping. Real legal guardrails.

The Setup

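1. Install Claude Code and Playwright. Install Claude Code by following Anthropic's documentation (the npm route is npm install -g @anthropic-ai/claude-code), then pip install playwright && playwright install chromium.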

2. Create a scraper project. mkdir ~/scrapers && cd ~/scrapers, then start claude and run /init to generate a starter CLAUDE.md. Add your scraping rules to it (rate limits, allowed sources, output format).

3. Set up your data store. SQLite for solo projects, Postgres for team work. Claude Code can scaffold the schema for you.
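As a rough idea of what a scaffolded SQLite store might look like, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, the scrapers.db filename, and the choice of (account, title, location) as the identity of a posting are all assumptions for illustration, not something Claude Code is guaranteed to produce.

```python
import sqlite3

# Hypothetical schema: one row per posting per account.
SCHEMA = """
CREATE TABLE IF NOT EXISTS postings (
    account     TEXT NOT NULL,
    title       TEXT NOT NULL,
    location    TEXT NOT NULL DEFAULT '',
    department  TEXT NOT NULL DEFAULT '',
    posted_date TEXT NOT NULL DEFAULT '',
    first_seen  TEXT NOT NULL,
    last_seen   TEXT NOT NULL,
    PRIMARY KEY (account, title, location)
);
"""


def open_store(path="scrapers.db"):
    """Open (or create) the SQLite file and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_posting(conn, account, role, seen_on):
    """Insert a posting, or bump last_seen if it was already recorded."""
    conn.execute(
        """
        INSERT INTO postings
            (account, title, location, department, posted_date,
             first_seen, last_seen)
        VALUES (?, ?, ?, ?, ?, ?, ?)
        ON CONFLICT(account, title, location)
        DO UPDATE SET last_seen = excluded.last_seen
        """,
        (
            account,
            role["title"],
            role.get("location", ""),
            role.get("department", ""),
            role.get("posted_date", ""),
            seen_on,
            seen_on,
        ),
    )
    conn.commit()
```

Keeping first_seen and last_seen on each row means "new" and "closed" postings can later be answered with a date query instead of comparing files.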

The First Scraper: Job Postings From a Career Page

The task. Pull job postings from a target account's career page nightly. Flag new postings. Output a JSON file of current open roles.

The prompt: "Build a scraper that fetches the career page at companies/acme/careers, extracts each open role (title, location, department, posted date), and writes to acme-jobs.json. Use Playwright for JavaScript rendering. Set User-Agent to identify the scraper. Wait 3 seconds between requests. If the page structure breaks the scraper, log the error and continue."

Claude Code writes the scraper. You review the code, focusing on: the User-Agent string, the rate limit, the error handling, the data shape. Run it once manually. Verify the output. Schedule it nightly.
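For a sense of what to expect during review, the generated code might resemble the sketch below. It uses Playwright's synchronous Python API. The CSS selectors are passed in because every career page differs, and the contact address in the User-Agent is a placeholder. The function names (clean_role, scrape_roles, run_once) are hypothetical. Treat it as one plausible shape under those assumptions, not the code Claude Code will actually write.

```python
import json
import logging

# Placeholder identity: swap in your own bot name and contact address.
USER_AGENT = "Mozilla/5.0 GTMEPulse-Bot (+mailto:bot-contact@example.com)"
ROLE_FIELDS = ("title", "location", "department", "posted_date")


def clean_role(raw):
    """Trim whitespace on every field; drop entries with no title."""
    role = {field: (raw.get(field) or "").strip() for field in ROLE_FIELDS}
    return role if role["title"] else None


def scrape_roles(url, card_selector, field_selectors):
    """Render the page in headless Chromium and extract one dict per role.

    card_selector matches each job card; field_selectors maps a field name
    to a selector relative to the card. Both are site-specific assumptions.
    """
    # Imported here so the pure helpers above work without Playwright.
    from playwright.sync_api import sync_playwright

    roles = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_context(user_agent=USER_AGENT).new_page()
            page.goto(url, wait_until="networkidle")
            for card in page.locator(card_selector).all():
                raw = {}
                for field, selector in field_selectors.items():
                    node = card.locator(selector)
                    # count() avoids a long timeout when a field is absent.
                    raw[field] = node.first.text_content() if node.count() else ""
                role = clean_role(raw)
                if role:
                    roles.append(role)
        finally:
            browser.close()
    return roles


def run_once(url, out_path, card_selector, field_selectors):
    """Scrape one page; log and return False instead of crashing on failure."""
    try:
        roles = scrape_roles(url, card_selector, field_selectors)
    except Exception:
        logging.exception("scrape failed for %s", url)
        return False
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(roles, fh, indent=2)
    return True
```

Because this version fetches a single page, the 3-second wait from the prompt does not appear; it matters once the scraper follows pagination or visits several URLs on the same domain.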

The Diff Step: Detecting New Postings

One scrape isn't useful. What matters is the change. Add a diff step.

The prompt: "Extend the scraper to keep a record of all postings seen. On each run, compare the current scrape to the previous run. Output a 'new-postings.json' file with only the postings that appeared today. Output a 'closed-postings.json' file for postings that disappeared."

Claude Code adds the comparison logic and the snapshot storage. You now have a hiring signal feed. New SDR roles suggest the account is scaling outbound. New engineering roles suggest it is growing technical capacity.
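The comparison itself is small. A minimal sketch, assuming each posting is a dict and that title plus location is enough to identify a role (an assumption: sites that expose a stable job ID give you a better key). The snapshot path and function names are hypothetical.

```python
import json
from pathlib import Path


def posting_key(role):
    """Identity of a posting: case-insensitive title plus location."""
    return (
        role["title"].strip().lower(),
        role.get("location", "").strip().lower(),
    )


def diff_postings(previous, current):
    """Return (new, closed) postings between two scrapes."""
    prev = {posting_key(r): r for r in previous}
    curr = {posting_key(r): r for r in current}
    new = [r for key, r in curr.items() if key not in prev]
    closed = [r for key, r in prev.items() if key not in curr]
    return new, closed


def run_diff(current, snapshot_path, out_dir):
    """Compare against the last snapshot, write both files, then update it."""
    snapshot_path = Path(snapshot_path)
    out_dir = Path(out_dir)
    previous = []
    if snapshot_path.exists():
        previous = json.loads(snapshot_path.read_text(encoding="utf-8"))
    new, closed = diff_postings(previous, current)
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "new-postings.json").write_text(json.dumps(new, indent=2), encoding="utf-8")
    (out_dir / "closed-postings.json").write_text(json.dumps(closed, indent=2), encoding="utf-8")
    snapshot_path.write_text(json.dumps(current, indent=2), encoding="utf-8")
    return new, closed
```

Two behaviors are worth checking in review: on the first run every posting counts as new, and a scrape that fails and returns an empty list would mark every role as closed, so skip the diff when the scrape itself reported an error.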

Scaling to Many Target Accounts

The next move. Instead of one account, run the scraper against your TAM of 200 target accounts.

The prompt: "Refactor the scraper to accept a list of target career page URLs from targets.csv. For each one, run the same scrape and diff logic. Write outputs into per-account files in output/jobs/{slug}.json. Throttle to 5 concurrent scrapes max. If any single target fails, log and continue with the rest."

Claude Code refactors. You now have a hiring signal feed for 200 target accounts. New postings flagged per account become buying signals you route to SDRs.
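One way the refactor could be structured is a thread pool capped at five workers, with per-target error isolation. In this sketch the targets.csv column names (account, url) and the function names are assumptions, and scrape stands in for whatever single-page function you already have.

```python
import csv
import json
import logging
import re
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

MAX_CONCURRENT = 5  # throttle from the prompt


def slugify(name):
    """Turn an account name into a filesystem-safe slug."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def load_targets(csv_path):
    """Read (account, url) pairs; the column names are an assumption."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        return [(row["account"], row["url"]) for row in csv.DictReader(fh)]


def run_all(targets, scrape, out_dir="output/jobs"):
    """Scrape every target with at most MAX_CONCURRENT in flight.

    scrape(url) must return a JSON-serializable list of roles.
    Returns the accounts that failed so the caller can alert on them.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failed = []
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
        futures = {pool.submit(scrape, url): account for account, url in targets}
        for future in as_completed(futures):
            account = futures[future]
            try:
                roles = future.result()
            except Exception:
                # One broken career page should not stop the other targets.
                logging.exception("scrape failed for %s", account)
                failed.append(account)
                continue
            path = out / f"{slugify(account)}.json"
            path.write_text(json.dumps(roles, indent=2), encoding="utf-8")
    return failed
```

The concurrency cap limits how many sites are fetched at once. It does not replace the per-domain delay, which still matters when several targets share a hosted careers platform.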

The Polite Scraping Rules

Scraping has technical and legal limits. Build the limits into your scraper from day one.

1. Respect robots.txt. Read it before scraping any new site. If it disallows the path, don't scrape that path. Period.

2. Set a real User-Agent. Identify yourself. Include a contact email in the User-Agent. "Mozilla/5.0 GTMEPulse-Bot [email protected]". Many sites that block spoofed browser strings leave polite, identified bots alone.

3. Rate-limit aggressively. Default to 1 request per 3 seconds per domain. Slower if the site is small. The goal is a load the site never notices. A scraper that takes down a target site is a scraper that gets banned.

4. Cache results. Once you have the data, don't re-scrape it for at least 24 hours. The site doesn't change that often. Your cache is the right source of truth between scrapes.

5. Handle errors gracefully. Sites change. The scraper that worked yesterday may break today. Log the error, alert yourself, continue with other targets. Don't crash the whole job because one site changed.
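Rules 1, 3, and 4 translate into a few small helpers. The sketch below uses only the standard library. The function and class names are hypothetical, the bot token is the one from the User-Agent example above, and the limiter is not thread-safe, so wrap it in a lock if you call it from concurrent workers.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

BOT_NAME = "GTMEPulse-Bot"
CACHE_MAX_AGE = 24 * 3600  # seconds; rule 4


def robots_allows(robots_txt, url, agent=BOT_NAME):
    """Rule 1: check a URL against the text of the site's robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


class DomainRateLimiter:
    """Rule 3: keep at least min_interval seconds between hits per domain."""

    def __init__(self, min_interval=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = {}  # domain -> time of the most recent request

    def wait(self, url):
        """Sleep if needed before requesting url; return the delay used."""
        domain = urlparse(url).netloc
        now = self._clock()
        last = self._last.get(domain)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (now - last))
        if delay:
            self._sleep(delay)
        self._last[domain] = now + delay
        return delay


def is_fresh(fetched_at, now, max_age=CACHE_MAX_AGE):
    """Rule 4: True while a cached result is younger than max_age seconds."""
    return (now - fetched_at) < max_age
```

A typical call order per URL would be: skip it if is_fresh says the cached copy is still good, skip it if robots_allows returns False, otherwise call limiter.wait(url) and then fetch.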

The Legal Reality

Web scraping legality varies by jurisdiction and target. A few rules that hold broadly.

Public data is generally OK to read. Job postings on a public career page, press releases, public pricing pages. You can scrape this for internal analysis.

Republishing scraped data is risky. Even with public data, putting it on your own site as a derivative product can run afoul of terms of service and copyright. Use scraped data for internal signal. Don't ship it as your own product.

Account-protected data is off-limits. If you have to log in to see it, scraping it breaks the site's terms of service and possibly the Computer Fraud and Abuse Act. Don't.

When in doubt, ask. Many companies have data sharing programs. Reaching out is sometimes faster than scraping and gives you a relationship.

What to Avoid

Don't scrape LinkedIn through automation. LinkedIn's terms forbid it, they detect it, they sue. Use the LinkedIn API or paid data providers for LinkedIn data, not a scraper.

Don't scrape sites behind logins. Even if it works technically, the legal exposure is real.

Don't aggregate scraped data for resale. The legal exposure is significant. The data brokerage business is heavily regulated. Use scraped data internally, not as a product.

Don't ignore the rate limit. A scraper that hammers a target site is a scraper that gets your IP banned and possibly earns you a cease-and-desist letter. Be polite.

The Verdict

For internal GTM signal generation, building scrapers with Claude Code is fast, cheap, and effective. Job postings, press releases, pricing changes, tech stack changes from public pages are all fair game.

For systematic data needs (full TAM enrichment, contact data), buy from vendors who handle the data acquisition legally. The build-vs-buy line for scraping is: build for signal, buy for completeness.

For broader Claude Code patterns, see the lead enrichment workflow guide and the buying signal scanner guide.

Authoritative References

For Playwright's Python API, see Playwright documentation. For Claude Code's CLI, see Anthropic's Claude Code documentation.

Frequently Asked Questions

Is web scraping legal for GTM use?

Public data scraping for internal analysis is generally OK. Scraping account-protected data typically breaks terms of service and may violate the Computer Fraud and Abuse Act. Republishing scraped data as your own product is high risk. The safe lane: scrape public pages (job listings, press releases, pricing) for internal signal generation. Avoid LinkedIn and anything behind a login, and don't build a product on top of scraped data without a clear legal review.

Can Claude Code build the scraper without me writing Python?

Yes. You describe what you want in plain English. Claude Code writes the scraper, sets up Playwright if needed, handles the error cases, and outputs the code. You review the code to confirm it does what you asked. The first scraper takes 30 minutes including review. After that the pattern is reusable: describe, review, run.

What's the best scraping framework with Claude Code?

Playwright for sites that need JavaScript rendering. Plain requests + BeautifulSoup for static HTML sites. Scrapy for high-volume crawl-style scraping. Claude Code picks the right tool based on your prompt. For most GTM scraping (career pages, pricing pages, press feeds), Playwright is the default because most modern sites render JavaScript.

How do I avoid getting blocked when scraping?

Rate-limit aggressively (1 request per 3 seconds per domain), set a real User-Agent that identifies your bot, respect robots.txt, and cache results so you don't re-scrape. If you do all four, most sites won't block you. The sites that do block legitimate identified bots are sites you shouldn't be scraping anyway.

Should I scrape LinkedIn for contact data?

No. LinkedIn's terms of service explicitly forbid automated scraping, they have detection systems, and they have a track record of suing scrapers. Use the LinkedIn API or buy from paid data providers (Apollo, Cognism, ZoomInfo) for LinkedIn-derived contact data. Building a LinkedIn scraper is a fast path to a banned account and potential legal exposure.

Source: State of GTM Engineering Report 2026 (n=228). Combines survey responses from 228 GTM Engineers with analysis of 3,342 job postings.
