Finding qualified B2B leads is time-consuming and expensive. Chamber of Commerce directories contain thousands of local businesses, but manually extracting this data is impractical. Traditional web scrapers fail because each chamber website has a unique structure.
I built lead-gen-pipeline, an AI-powered data extraction system that solves this problem by using a local LLM (Qwen2-7B) to intelligently navigate and scrape Chamber of Commerce directories.
GitHub Repository: lead-gen-pipeline
The Challenge
Chamber of Commerce directories present unique scraping challenges:
- No consistent structure - Each chamber uses different layouts (categorical, alphabetical, paginated)
- Dynamic content - JavaScript-rendered pages and infinite scroll
- Data quality - Missing fields, inconsistent formatting, duplicate entries
- Rate limiting - Need to respect server resources while maintaining speed
Traditional scrapers require custom code for each website. I needed a solution that could adapt to any directory structure automatically.
The Solution: Local AI + Smart Data Pipeline
The pipeline uses a local 7B parameter language model (Qwen2) to analyze page structure and extract business data. Running locally means:
- Complete data privacy - No external API calls
- No usage costs - Unlimited scraping
- Full control - Customize behavior for specific needs
Testing on the Palo Alto Chamber of Commerce directory:
- 296 businesses extracted from 26 categories
- 9 minutes total runtime (33 businesses/minute)
- 100% capture rate for names and phone numbers
- 90% email capture, 85% website capture
- 4-8GB RAM usage with LLM loaded
- 500+ records/second bulk database operations
Pipeline Architecture
The system follows a classic ETL (Extract, Transform, Load) pattern with AI-powered extraction:
Chamber URL → LLM Analysis → Navigation → Extraction → Validation → SQLite Database
Core Components
1. LLM Processor (llm_processor.py)
- Analyzes HTML structure
- Identifies directory patterns
- Extracts business information as JSON
- Handles malformed output with JSON repair
2. Chamber Parser (chamber_parser.py)
- Navigates category pages
- Handles pagination
- Follows business detail links
- Manages rate limiting
3. Web Crawler (crawler.py)
- Fetches pages with retry logic
- Respects robots.txt
- Implements exponential backoff
- Configurable timeouts and concurrency
4. Database Layer (bulk_database.py)
- Efficient bulk operations
- Deduplication based on website/name
- SQLite with optimized indexes
- CSV export functionality
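The post doesn't reproduce the schema itself, but the deduplication and INSERT OR IGNORE behavior described below imply something along these lines. Treat this as a plausible sketch rather than the repository's actual DDL; the column names mirror the insert statement shown later, the rest is an assumption:
import sqlite3

# Plausible schema sketch (not the repository's actual DDL): a UNIQUE
# constraint is what lets INSERT OR IGNORE skip duplicates, and indexes
# cover the fields queried most often.
def init_db(path: str = "leads.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS businesses (
            id         INTEGER PRIMARY KEY AUTOINCREMENT,
            name       TEXT NOT NULL,
            website    TEXT,
            phone      TEXT,
            email      TEXT,
            address    TEXT,
            categories TEXT,           -- JSON-encoded list of category names
            UNIQUE (name, website)
        );
        CREATE INDEX IF NOT EXISTS idx_businesses_name  ON businesses (name);
        CREATE INDEX IF NOT EXISTS idx_businesses_email ON businesses (email);
    """)
    conn.commit()
    return conn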
Technical Implementation
Intelligent Web Scraping with LLM
Traditional scrapers use CSS selectors that break when websites change. My approach asks the LLM to understand the page:
def extract_business_data(self, html_content: str) -> dict:
    """Use LLM to extract structured business data from HTML."""
    prompt = """
    Analyze this HTML and extract business information.
    Return JSON with: name, website, phone, email, address, categories.
    If a field is missing, use null.
    """
    response = self.llm.generate(
        prompt + html_content,
        max_tokens=1000,
        temperature=0.1  # Low temperature for consistent extraction
    )

    # Parse JSON with fallback repair
    try:
        return json.loads(response)
    except json.JSONDecodeError:
        return self._repair_json(response)
Key insight: Setting temperature to 0.1 ensures consistent JSON structure while allowing the LLM to adapt to different HTML formats.
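The _repair_json fallback isn't shown in the excerpt above. A minimal sketch of what such a repair step might look like, assuming the usual failure modes (markdown fences, commentary around the JSON, trailing commas):
# Assumes `import json` and `import re` at module level.
def _repair_json(self, response: str) -> dict:
    """Best-effort recovery of JSON from a noisy LLM response (sketch only)."""
    # Strip markdown code fences if the model wrapped its output
    cleaned = re.sub(r"```(?:json)?", "", response).strip()

    # Grab the first {...} block in case the model added commentary around it
    match = re.search(r"\{.*\}", cleaned, re.DOTALL)
    if match:
        cleaned = match.group(0)

    # Remove trailing commas before closing braces/brackets
    cleaned = re.sub(r",\s*([}\]])", r"\1", cleaned)

    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return {}  # give up gracefully; the caller treats an empty dict as "no data"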
Handling Multiple Directory Layouts
Chambers organize directories in three main patterns:
1. Category-based:
/directory → [Categories] → [Businesses in Category]
2. Alphabetical:
/directory → [A-Z Letters] → [Businesses starting with Letter]
3. Paginated:
/directory?page=1 → /directory?page=2 → ...
The LLM identifies which pattern is in use:
def detect_directory_structure(self, main_page_html: str) -> str:
    """Ask LLM to identify directory organization pattern."""
    prompt = """
    Analyze this Chamber of Commerce directory page.
    Identify the structure: 'categorical', 'alphabetical', or 'paginated'.
    Return only the structure type.
    """
    structure = self.llm.generate(prompt + main_page_html, max_tokens=20)
    return structure.strip().lower()
This approach eliminates hardcoded navigation logic.
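The post doesn't show how the detected pattern drives navigation, but conceptually it becomes a dispatch table instead of per-site code. A rough sketch, with hypothetical handler names:
def crawl_directory(self, base_url: str) -> list:
    """Pick a navigation strategy based on the LLM-detected layout.
    Sketch only; the handler methods named here are hypothetical."""
    html = self.crawler.fetch(base_url)
    structure = self.detect_directory_structure(html)

    handlers = {
        'categorical': self._crawl_categories,     # follow each category link
        'alphabetical': self._crawl_letter_index,  # walk the A-Z index pages
        'paginated': self._crawl_pages,            # follow ?page=N links
    }
    # Fall back to the paginated handler if the LLM returns something unexpected
    return handlers.get(structure, self._crawl_pages)(base_url, html)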
Data Quality Pipeline
Raw scraped data requires cleaning and validation:
1. Deduplication
def deduplicate_businesses(self, businesses: List[dict]) -> List[dict]:
    """Remove duplicates based on website or name."""
    seen = set()
    unique = []
    for biz in businesses:
        # Create composite key
        key = (
            biz.get('website', '').lower().strip(),
            biz.get('name', '').lower().strip()
        )
        if key not in seen and any(key):
            seen.add(key)
            unique.append(biz)
    return unique
2. Data Validation
def validate_business(self, business: dict) -> bool:
    """Ensure minimum required fields are present."""
    required = ['name']
    has_contact = any([
        business.get('phone'),
        business.get('email'),
        business.get('website')
    ])
    return all(business.get(field) for field in required) and has_contact
3. Field Normalization
def normalize_phone(self, phone: str) -> str:
    """Standardize phone number format."""
    if not phone:
        return None
    # Extract digits only
    digits = ''.join(c for c in phone if c.isdigit())
    if len(digits) == 10:
        return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
    elif len(digits) == 11 and digits[0] == '1':
        return f"+1 ({digits[1:4]}) {digits[4:7]}-{digits[7:]}"
    return phone  # Return original if format unknown
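Chained together, the cleaning stage is just these three steps in sequence. A short sketch, assuming the methods above live on the same class:
def clean_businesses(self, raw: List[dict]) -> List[dict]:
    """Normalize, validate, then deduplicate scraped records (sketch)."""
    cleaned = []
    for biz in raw:
        biz['phone'] = self.normalize_phone(biz.get('phone'))
        if self.validate_business(biz):
            cleaned.append(biz)
    return self.deduplicate_businesses(cleaned)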
Optimized Database Operations
Bulk inserts dramatically improve performance:
def bulk_insert_businesses(self, businesses: List[dict]):
    """Insert multiple businesses efficiently."""
    # Prepare data for executemany()
    records = [
        (
            biz['name'],
            biz.get('website'),
            biz.get('phone'),
            biz.get('email'),
            biz.get('address'),
            json.dumps(biz.get('categories', []))
        )
        for biz in businesses
    ]

    # Single transaction for all inserts
    self.cursor.executemany('''
        INSERT OR IGNORE INTO businesses
        (name, website, phone, email, address, categories)
        VALUES (?, ?, ?, ?, ?, ?)
    ''', records)
    self.conn.commit()
Performance improvement: 500+ records/second vs 10-20 with individual inserts.
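For contrast, the slow path this is measured against is roughly one statement and one commit per record; committing inside the loop forces a full SQLite transaction (and disk sync) every time. A simplified sketch of that anti-pattern:
def insert_businesses_one_by_one(self, businesses: List[dict]):
    """Anti-pattern sketch: per-record INSERT and commit (the 10-20 records/s path)."""
    for biz in businesses:
        self.cursor.execute('''
            INSERT OR IGNORE INTO businesses
            (name, website, phone, email, address, categories)
            VALUES (?, ?, ?, ?, ?, ?)
        ''', (
            biz['name'],
            biz.get('website'),
            biz.get('phone'),
            biz.get('email'),
            biz.get('address'),
            json.dumps(biz.get('categories', []))
        ))
        self.conn.commit()  # committing inside the loop is where the time goes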
Installation and Usage
# Clone and setup
git clone https://github.com/Burton-David/lead-gen-pipeline
cd lead-gen-pipeline
./setup.sh
# Initialize database
python cli.py init
# Extract from chamber
python cli.py chambers --url https://www.paloaltochamber.com
# Export results
python cli.py export --output leads.csv
Results and Insights
After processing the Palo Alto Chamber of Commerce directory:
Data Completeness:
- 296 total businesses extracted
- 100% had names and phone numbers
- 90% had email addresses (266 businesses)
- 85% had websites (252 businesses)
- 26 distinct business categories
Most Common Categories:
- Professional Services (42 businesses)
- Technology (38 businesses)
- Restaurants & Food (34 businesses)
- Retail (29 businesses)
- Healthcare (23 businesses)
Performance Characteristics:
- Average page processing: 2.1 seconds
- LLM inference per page: 0.8 seconds
- Network latency: 0.9 seconds
- Data validation: 0.4 seconds
Key Insight: The LLM accounts for only 38% of processing time. Network latency (43%) is the actual bottleneck, meaning concurrent processing of multiple chambers would scale nearly linearly.
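Multi-chamber concurrency isn't implemented yet (see Future Enhancements), but because the work is network-bound, it could be as simple as an asyncio wrapper around the existing per-chamber pipeline. A sketch, assuming a hypothetical process_chamber() coroutine that runs crawl, extract, and load for one chamber URL:
import asyncio

async def process_chambers(urls: list[str], max_concurrent: int = 4) -> None:
    """Run several chamber pipelines concurrently, capped by a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrent)  # limit simultaneous crawls

    async def run_one(url: str) -> None:
        async with semaphore:
            await process_chamber(url)  # hypothetical: the existing single-chamber pipeline

    await asyncio.gather(*(run_one(u) for u in urls))

# Usage sketch: asyncio.run(process_chambers(chamber_urls))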
Lessons Learned
1. Local LLMs Are Production-Ready
Qwen2-7B performs remarkably well for structured data extraction:
- Accurate: 98%+ correct field extraction
- Fast: 0.8s per page on M3 Max
- Consistent: Low temperature ensures reliable JSON output
- Private: No data leaves your machine
2. Bulk Operations Matter
Switching from individual inserts to bulk operations improved database performance by 25x. When building data pipelines, always:
- Batch operations where possible
- Use transactions appropriately
- Index frequently queried fields
- Denormalize for read performance
3. Design for Failure
Web scraping is inherently unreliable. The pipeline includes:
- Exponential backoff for rate limiting
- JSON repair for malformed LLM output
- Validation at multiple stages
- Detailed logging for debugging
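The crawler's actual retry implementation isn't reproduced in this post, and it may use a different HTTP client, but exponential backoff with jitter boils down to something like this:
import random
import time

import requests  # assumed here for brevity; the real crawler may use another client

def fetch_with_backoff(url: str, max_retries: int = 5, base_delay: float = 1.0) -> str:
    """Fetch a URL, backing off exponentially on errors or rate limiting."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=30)
            if response.status_code == 429:  # rate limited: treat as retryable
                raise requests.HTTPError("429 Too Many Requests")
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error to the caller
            # Wait 1s, 2s, 4s, ... plus jitter so retries don't synchronize
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))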
Future Enhancements
Potential improvements to explore:
- Multi-chamber parallelization - Process multiple chambers concurrently
- Incremental updates - Re-scrape only changed businesses
- Enrichment pipeline - Add company size, revenue, social media
- Entity resolution - Match businesses across multiple directories
- Smaller models - Test 3B parameter models for lower resource usage
Conclusion
AI-powered data pipelines represent a paradigm shift in web scraping. Instead of brittle CSS selectors, we can use language models to understand and extract data like humans do.
This approach scales to any directory structure without custom code per site. The 9-minute extraction time for 296 businesses proves the concept works at practical speeds.
For B2B lead generation, business intelligence, or market research, combining local LLMs with solid pipeline engineering creates powerful, privacy-respecting data collection systems.
Try it yourself:
- Explore the GitHub repository
- Clone and run on your local machine
- Contribute improvements or new directory adapters
- Share your results
Build data pipelines that adapt and scale. Your leads are waiting to be discovered.
All code and performance metrics are from the lead-gen-pipeline GitHub repository.