Finding qualified B2B leads is time-consuming and expensive. Chamber of Commerce directories contain thousands of local businesses, but manually extracting this data is impractical. Traditional web scrapers fail because each chamber website has a unique structure.
I built lead-gen-pipeline, an AI-powered data extraction system that solves this problem by using a local LLM (Qwen2-7B) to intelligently navigate and scrape Chamber of Commerce directories.
GitHub Repository: lead-gen-pipeline
The Challenge
Chamber of Commerce directories present unique scraping challenges:
- No consistent structure - Each chamber uses different layouts (categorical, alphabetical, paginated)
- Dynamic content - JavaScript-rendered pages and infinite scroll
- Data quality - Missing fields, inconsistent formatting, duplicate entries
- Rate limiting - Need to respect server resources while maintaining speed
Traditional scrapers require custom code for each website. I needed a solution that could adapt to any directory structure automatically.
The Solution: Local AI + Smart Data Pipeline
The pipeline uses a local 7B parameter language model (Qwen2) to analyze page structure and extract business data. Running locally means:
- Complete data privacy - No external API calls
- No usage costs - Unlimited scraping
- Full control - Customize behavior for specific needs
Testing on the Palo Alto Chamber of Commerce directory:
- 296 businesses extracted from 26 categories
- 9 minutes total runtime (33 businesses/minute)
- 100% capture rate for names and phone numbers
- 90% email capture, 85% website capture
- 4-8GB RAM usage with LLM loaded
- 500+ records/second bulk database operations
Pipeline Architecture
The system follows a classic ETL (Extract, Transform, Load) pattern with AI-powered extraction:
Chamber URL → LLM Analysis → Navigation → Extraction → Validation → SQLite Database
Core Components
1. LLM Processor (llm_processor.py)
- Analyzes HTML structure
- Identifies directory patterns
- Extracts business information as JSON
- Handles malformed output with JSON repair
2. Chamber Parser (chamber_parser.py)
- Navigates category pages
- Handles pagination
- Follows business detail links
- Manages rate limiting
3. Web Crawler (crawler.py)
- Fetches pages with retry logic
- Respects robots.txt
- Implements exponential backoff
- Configurable timeouts and concurrency
4. Database Layer (bulk_database.py)
- Efficient bulk operations
- Deduplication based on website/name
- SQLite with optimized indexes
- CSV export functionality
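The post doesn't reproduce the schema itself, but the deduplication and INSERT OR IGNORE behavior described below imply something along these lines. Treat this as a plausible sketch rather than the repository's actual DDL; the column names mirror the insert statement shown later, the rest is an assumption:
import sqlite3

# Plausible schema sketch (not the repository's actual DDL): a UNIQUE
# constraint is what lets INSERT OR IGNORE skip duplicates, and indexes
# cover the fields queried most often.
def init_db(path: str = "leads.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS businesses (
            id         INTEGER PRIMARY KEY AUTOINCREMENT,
            name       TEXT NOT NULL,
            website    TEXT,
            phone      TEXT,
            email      TEXT,
            address    TEXT,
            categories TEXT,           -- JSON-encoded list of category names
            UNIQUE (name, website)
        );
        CREATE INDEX IF NOT EXISTS idx_businesses_name  ON businesses (name);
        CREATE INDEX IF NOT EXISTS idx_businesses_email ON businesses (email);
    """)
    conn.commit()
    return conn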
Technical Implementation
Intelligent Web Scraping with LLM
Traditional scrapers use CSS selectors that break when websites change. My approach asks the LLM to understand the page:
def extract_business_data(self, html_content: str) -> dict:
    """Use LLM to extract structured business data from HTML."""
    prompt = """
    Analyze this HTML and extract business information.
    Return JSON with: name, website, phone, email, address, categories.
    If a field is missing, use null.
    """
    response = self.llm.generate(
        prompt + html_content,
        max_tokens=1000,
        temperature=0.1  # Low temperature for consistent extraction
    )

    # Parse JSON with fallback repair
    try:
        return json.loads(response)
    except json.JSONDecodeError:
        return self._repair_json(response)
Key insight: Setting temperature to 0.1 ensures consistent JSON structure while allowing the LLM to adapt to different HTML formats.
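The _repair_json fallback isn't shown in the excerpt above. A minimal sketch of what such a repair step might look like, assuming the usual failure modes (markdown fences, commentary around the JSON, trailing commas):
# Assumes `import json` and `import re` at module level.
def _repair_json(self, response: str) -> dict:
    """Best-effort recovery of JSON from a noisy LLM response (sketch only)."""
    # Strip markdown code fences if the model wrapped its output
    cleaned = re.sub(r"```(?:json)?", "", response).strip()

    # Grab the first {...} block in case the model added commentary around it
    match = re.search(r"\{.*\}", cleaned, re.DOTALL)
    if match:
        cleaned = match.group(0)

    # Remove trailing commas before closing braces/brackets
    cleaned = re.sub(r",\s*([}\]])", r"\1", cleaned)

    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return {}  # give up gracefully; the caller treats an empty dict as "no data"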
Handling Multiple Directory Layouts
Chambers organize directories in three main patterns:
1. Category-based:
/directory → [Categories] → [Businesses in Category]
2. Alphabetical:
/directory → [A-Z Letters] → [Businesses starting with Letter]
3. Paginated:
/directory?page=1 → /directory?page=2 → ...
The LLM identifies which pattern is in use:
def detect_directory_structure(self, main_page_html: str) -> str:
    """Ask LLM to identify directory organization pattern."""
    prompt = """
    Analyze this Chamber of Commerce directory page.
    Identify the structure: 'categorical', 'alphabetical', or 'paginated'.
    Return only the structure type.
    """
    structure = self.llm.generate(prompt + main_page_html, max_tokens=20)
    return structure.strip().lower()
This approach eliminates hardcoded navigation logic.
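The post doesn't show how the detected pattern drives navigation, but conceptually it becomes a dispatch table instead of per-site code. A rough sketch, with hypothetical handler names:
def crawl_directory(self, base_url: str) -> list:
    """Pick a navigation strategy based on the LLM-detected layout.
    Sketch only; the handler methods named here are hypothetical."""
    html = self.crawler.fetch(base_url)
    structure = self.detect_directory_structure(html)

    handlers = {
        'categorical': self._crawl_categories,     # follow each category link
        'alphabetical': self._crawl_letter_index,  # walk the A-Z index pages
        'paginated': self._crawl_pages,            # follow ?page=N links
    }
    # Fall back to the paginated handler if the LLM returns something unexpected
    return handlers.get(structure, self._crawl_pages)(base_url, html)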
Data Quality Pipeline
Raw scraped data requires cleaning and validation:
1. Deduplication
def deduplicate_businesses(self, businesses: List[dict]) -> List[dict]:
    """Remove duplicates based on website or name."""
    seen = set()
    unique = []
    for biz in businesses:
        # Create composite key
        key = (
            biz.get('website', '').lower().strip(),
            biz.get('name', '').lower().strip()
        )
        if key not in seen and any(key):
            seen.add(key)
            unique.append(biz)
    return unique
2. Data Validation
def validate_business(self, business: dict) -> bool:
    """Ensure minimum required fields are present."""
    required = ['name']
    has_contact = any([
        business.get('phone'),
        business.get('email'),
        business.get('website')
    ])
    return all(business.get(field) for field in required) and has_contact
3. Field Normalization
def normalize_phone(self, phone: str) -> str:
    """Standardize phone number format."""
    if not phone:
        return None
    # Extract digits only
    digits = ''.join(c for c in phone if c.isdigit())
    if len(digits) == 10:
        return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
    elif len(digits) == 11 and digits[0] == '1':
        return f"+1 ({digits[1:4]}) {digits[4:7]}-{digits[7:]}"
    return phone  # Return original if format unknown
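Chained together, the cleaning stage is just these three steps in sequence. A short sketch, assuming the methods above live on the same class:
def clean_businesses(self, raw: List[dict]) -> List[dict]:
    """Normalize, validate, then deduplicate scraped records (sketch)."""
    cleaned = []
    for biz in raw:
        biz['phone'] = self.normalize_phone(biz.get('phone'))
        if self.validate_business(biz):
            cleaned.append(biz)
    return self.deduplicate_businesses(cleaned)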
Optimized Database Operations
Bulk inserts dramatically improve performance:
def bulk_insert_businesses(self, businesses: List[dict]):
    """Insert multiple businesses efficiently."""
    # Prepare data for executemany()
    records = [
        (
            biz['name'],
            biz.get('website'),
            biz.get('phone'),
            biz.get('email'),
            biz.get('address'),
            json.dumps(biz.get('categories', []))
        )
        for biz in businesses
    ]

    # Single transaction for all inserts
    self.cursor.executemany('''
        INSERT OR IGNORE INTO businesses
        (name, website, phone, email, address, categories)
        VALUES (?, ?, ?, ?, ?, ?)
    ''', records)
    self.conn.commit()
Performance improvement: 500+ records/second vs 10-20 with individual inserts.
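For contrast, the slow path this is measured against is roughly one statement and one commit per record; committing inside the loop forces a full SQLite transaction (and disk sync) every time. A simplified sketch of that anti-pattern:
def insert_businesses_one_by_one(self, businesses: List[dict]):
    """Anti-pattern sketch: per-record INSERT and commit (the 10-20 records/s path)."""
    for biz in businesses:
        self.cursor.execute('''
            INSERT OR IGNORE INTO businesses
            (name, website, phone, email, address, categories)
            VALUES (?, ?, ?, ?, ?, ?)
        ''', (
            biz['name'],
            biz.get('website'),
            biz.get('phone'),
            biz.get('email'),
            biz.get('address'),
            json.dumps(biz.get('categories', []))
        ))
        self.conn.commit()  # committing inside the loop is where the time goes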
Installation and Usage
# Clone and setup
git clone https://github.com/Burton-David/lead-gen-pipeline
cd lead-gen-pipeline
./setup.sh
# Initialize database
python cli.py init
# Extract from chamber
python cli.py chambers --url https://www.paloaltochamber.com
# Export results
python cli.py export --output leads.csv
Results and Insights
After processing the Palo Alto Chamber of Commerce directory:
Data Completeness:
- 296 total businesses extracted
- 100% had names and phone numbers
- 90% had email addresses (266 businesses)
- 85% had websites (252 businesses)
- 26 distinct business categories
Most Common Categories:
- Professional Services (42 businesses)
- Technology (38 businesses)
- Restaurants & Food (34 businesses)
- Retail (29 businesses)
- Healthcare (23 businesses)
Performance Characteristics:
- Average page processing: 2.1 seconds
- LLM inference per page: 0.8 seconds
- Network latency: 0.9 seconds
- Data validation: 0.4 seconds
Key Insight: The LLM accounts for only 38% of processing time. Network latency (43%) is the actual bottleneck, meaning concurrent processing of multiple chambers would scale nearly linearly.
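Multi-chamber concurrency isn't implemented yet (see Future Enhancements), but because the work is network-bound, it could be as simple as an asyncio wrapper around the existing per-chamber pipeline. A sketch, assuming a hypothetical process_chamber() coroutine that runs crawl, extract, and load for one chamber URL:
import asyncio

async def process_chambers(urls: list[str], max_concurrent: int = 4) -> None:
    """Run several chamber pipelines concurrently, capped by a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrent)  # limit simultaneous crawls

    async def run_one(url: str) -> None:
        async with semaphore:
            await process_chamber(url)  # hypothetical: the existing single-chamber pipeline

    await asyncio.gather(*(run_one(u) for u in urls))

# Usage sketch: asyncio.run(process_chambers(chamber_urls))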
Lessons Learned
1. Local LLMs Are Production-Ready
Qwen2-7B performs remarkably well for structured data extraction:
- Accurate: 98%+ correct field extraction
- Fast: 0.8s per page on M3 Max
- Consistent: Low temperature ensures reliable JSON output
- Private: No data leaves your machine
2. Bulk Operations Matter
Switching from individual inserts to bulk operations improved database performance by 25x. When building data pipelines, always:
- Batch operations where possible
- Use transactions appropriately
- Index frequently queried fields
- Denormalize for read performance
3. Design for Failure
Web scraping is inherently unreliable. The pipeline includes:
- Exponential backoff for rate limiting
- JSON repair for malformed LLM output
- Validation at multiple stages
- Detailed logging for debugging
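The crawler's actual retry implementation isn't reproduced in this post, and it may use a different HTTP client, but exponential backoff with jitter boils down to something like this:
import random
import time

import requests  # assumed here for brevity; the real crawler may use another client

def fetch_with_backoff(url: str, max_retries: int = 5, base_delay: float = 1.0) -> str:
    """Fetch a URL, backing off exponentially on errors or rate limiting."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=30)
            if response.status_code == 429:  # rate limited: treat as retryable
                raise requests.HTTPError("429 Too Many Requests")
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error to the caller
            # Wait 1s, 2s, 4s, ... plus jitter so retries don't synchronize
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))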
Future Enhancements
Potential improvements to explore:
- Multi-chamber parallelization - Process multiple chambers concurrently
- Incremental updates - Re-scrape only changed businesses
- Enrichment pipeline - Add company size, revenue, social media
- Entity resolution - Match businesses across multiple directories
- Smaller models - Test 3B parameter models for lower resource usage
Conclusion
AI-powered data pipelines represent a paradigm shift in web scraping. Instead of brittle CSS selectors, we can use language models to understand and extract data like humans do.
This approach scales to any directory structure without custom code per site. The 9-minute extraction time for 296 businesses proves the concept works at practical speeds.
For B2B lead generation, business intelligence, or market research, combining local LLMs with solid pipeline engineering creates powerful, privacy-respecting data collection systems.
Try it yourself:
- Explore the GitHub repository
- Clone and run on your local machine
- Contribute improvements or new directory adapters
- Share your results
Build data pipelines that adapt and scale. Your leads are waiting to be discovered.
All code and performance metrics are from the lead-gen-pipeline GitHub repository.