Lead Generation Automation Platform

September 29, 2025 ⚙️ Engineering
Technologies & Tools
Python
Google Cloud Platform
Selenium
NLTK
spaCy
Tesseract OCR
Google Cloud SQL
MongoDB

Project Overview

An automated lead-generation tool built for BTL Industries (Team Florida) that collects decision-maker contact information from Google Business Profiles. Running on Google Cloud Platform and written in Python, it gathers data filtered by company size, geo-location, industry, product offerings, and website technologies, and applies OSINT techniques for data enrichment.

Technical Implementation

Architecture

  • Data Collection: GCP-hosted Python scripts calling Google Business Profiles API
  • Web Scraping: Selenium automation for company website data extraction
  • NLP Processing: NLTK and spaCy for data extraction and verification
  • OCR Pipeline: Tesseract OCR on webpage screenshots to fill missing fields
  • Storage: Google Cloud SQL (structured data) + MongoDB (unstructured/screenshots)
  • User Interface: SmartSheets API integration for team access

Technology Stack

  • Languages: Python
  • Cloud: Google Cloud Platform (Compute Engine, Cloud SQL, Cloud Storage)
  • Libraries: Selenium, NLTK, spaCy, Tesseract OCR
  • Databases: Google Cloud SQL, MongoDB
  • APIs: Google Business Profiles, SmartSheets
  • Visualization: Tableau, Power BI

Project Timeline

Duration: 3 months
Methodology: Agile with two-week sprints

Sprint Breakdown

  1. Weeks 1-2: Requirements gathering, search parameter definition
  2. Weeks 3-4: GCP infrastructure setup, initial data collection framework
  3. Weeks 5-6: Web scraping implementation, NLP pipeline development
  4. Weeks 7-8: OCR integration, data validation systems
  5. Weeks 9-10: Database configuration (Cloud SQL + MongoDB)
  6. Weeks 11-12: Testing, debugging, optimization
  7. Week 13: Deployment, documentation, team training

Technical Components

1. Data Collection Module

  • Automated API calls to Google Business Profiles
  • Parameterized search: company size, location, industry, technologies
  • Batch processing with error recovery
  • Rate limiting and quota management (see the sketch after this list)
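
The batching logic followed roughly the pattern below. This is a minimal sketch: the fetch_page callable stands in for the actual Google Business Profiles API request, with an assumed shape of fetch_page(params, page_token) -> (records, next_page_token).

# Sketch: batched collection with rate limiting and exponential backoff
import time

def collect_profiles(fetch_page, search_params, max_retries=3, delay_s=1.0):
    """Yield profile records page by page, backing off on transient errors."""
    page_token = None
    while True:
        for attempt in range(max_retries):
            try:
                records, page_token = fetch_page(search_params, page_token)
                break
            except Exception:                       # transient API/network error
                time.sleep(delay_s * 2 ** attempt)  # exponential backoff
        else:
            raise RuntimeError("page fetch failed after retries")
        yield from records
        if page_token is None:
            return
        time.sleep(delay_s)                         # stay under API rate limits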

2. Web Scraping System

  • Selenium WebDriver for dynamic content handling
  • Parallel processing across multiple browser instances
  • Adaptive scraping for various website structures
  • Robust error handling for site changes/downtime (see the retry sketch after this list)
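
Error handling around individual page loads looked roughly like the retry wrapper below; this is a simplified sketch, and timeouts, retry counts, and per-site selectors varied in practice.

# Sketch: retrying a page load on timeouts or driver errors
import time
from selenium.common.exceptions import TimeoutException, WebDriverException

def load_with_retries(driver, url, retries=3, timeout_s=30, backoff_s=5):
    """Load a company page, retrying on timeouts or transient driver errors."""
    driver.set_page_load_timeout(timeout_s)
    for attempt in range(1, retries + 1):
        try:
            driver.get(url)
            return True                        # page loaded; caller extracts fields
        except (TimeoutException, WebDriverException):
            if attempt == retries:
                return False                   # give up on this site for now
            time.sleep(backoff_s * attempt)    # back off before the next attempt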

3. NLP Pipeline

  • Entity extraction from unstructured text
  • Contact information pattern matching
  • Data standardization and normalization
  • Confidence scoring for extracted data (see the sketch after this list)
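
A simplified sketch of the extraction step is below: spaCy NER for people and organizations plus regex matching for emails and phone numbers. The confidence heuristic shown is illustrative only, not the production scoring logic.

# Sketch: entity and contact extraction with a naive confidence score
import re
import spacy

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")

nlp = spacy.load("en_core_web_sm")

def extract_contacts(text):
    doc = nlp(text)
    people = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
    orgs = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
    emails = EMAIL_RE.findall(text)
    phones = PHONE_RE.findall(text)
    # Naive confidence: more corroborating field types -> higher score
    confidence = 0.25 * sum(bool(v) for v in (people, orgs, emails, phones))
    return {"people": people, "orgs": orgs, "emails": emails,
            "phones": phones, "confidence": confidence}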

4. OCR Processing

  • Screenshot capture of web pages
  • Text extraction from images
  • Field mapping to dataset schema (see the sketch after this list)
  • Quality validation of OCR output
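
The OCR fallback roughly follows the sketch below, assuming pytesseract as the Tesseract wrapper: capture the rendered page as an image, run OCR, then search the recognized text for the fields still missing from the record.

# Sketch: OCR a page screenshot and map text back to missing fields
import io
from PIL import Image
import pytesseract

def ocr_missing_fields(driver, field_patterns):
    """OCR the current page; field_patterns maps field name -> compiled regex."""
    png = driver.get_screenshot_as_png()      # raw PNG bytes from Selenium
    image = Image.open(io.BytesIO(png))
    text = pytesseract.image_to_string(image)
    return {name: pattern.findall(text) for name, pattern in field_patterns.items()}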

5. Data Storage Architecture

Structured Data (Google Cloud SQL):

  • Normalized relational schema
  • Indexed for fast queries
  • Automated backups
  • Read replicas for scaling

Unstructured Data (MongoDB):

  • JSON documents from web scraping
  • Binary storage for screenshots
  • Flexible schema for varied data
  • GridFS for large file handling (see the sketch after this list)
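
A minimal sketch of the unstructured side is below: scraped JSON documents go into a regular collection and screenshots into GridFS. The connection URI, database, and collection names are placeholders.

# Sketch: storing scraped documents and screenshots in MongoDB
import gridfs
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder URI
db = client["lead_gen"]                             # placeholder database name
fs = gridfs.GridFS(db)

def store_scrape(company_id, scraped_doc, screenshot_png=None):
    """Persist one company's scraped document and optional screenshot."""
    db.scrapes.insert_one({"company_id": company_id, **scraped_doc})
    if screenshot_png:
        fs.put(screenshot_png, filename=f"{company_id}.png",
               metadata={"company_id": company_id})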

6. SmartSheets Integration

  • Real-time data synchronization
  • Automated updates via API (see the sketch after this list)
  • Role-based access control
  • Audit logging of data changes
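
Pushing new leads into the sheet worked roughly as sketched below, using the SmartSheets REST API (POST /sheets/{sheetId}/rows). The token, sheet ID, and column IDs are placeholders, and the real integration also handled updates and audit logging.

# Sketch: appending lead rows to a sheet over the REST API
import requests

API_BASE = "https://api.smartsheet.com/2.0"

def push_leads(token, sheet_id, column_ids, leads):
    """Append one row per lead; column_ids maps field name -> SmartSheets column ID."""
    rows = [{
        "toBottom": True,
        "cells": [{"columnId": column_ids[field], "value": value}
                  for field, value in lead.items() if field in column_ids],
    } for lead in leads]
    resp = requests.post(
        f"{API_BASE}/sheets/{sheet_id}/rows",
        headers={"Authorization": f"Bearer {token}"},
        json=rows,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()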

7. Reporting Dashboard

  • Tableau/Power BI connections
  • Key metrics: leads collected, data quality, processing status
  • Trend analysis and forecasting
  • Export capabilities for stakeholders

Results

Quantitative Outcomes

  • Processed 10,000+ companies
  • Zero manual data entry required
  • 50% increase in qualified leads
  • Data accuracy rate: 94%

Business Impact

  • President's Club 2023: Team Florida won the company's top sales award
  • Sales Performance: $30 million in revenue, double the second-place team's total
  • Efficiency Gains: Sales team focused on selling rather than data collection
  • Scalability: System handles increasing data volume without additional resources

Technical Achievements

  • Fully automated pipeline from search to storage
  • Handles multiple data formats and sources
  • Self-healing with automatic error recovery
  • Production uptime: 99.8%

Code Sample

# Example: Parallel web scraping with Selenium
from selenium import webdriver
from concurrent.futures import ThreadPoolExecutor
import queue

class ParallelScraper:
    def __init__(self, max_workers=5):
        self.executor = ThreadPoolExecutor(max_workers=max_workers)
        # Pre-create one headless browser per worker and share them via a pool
        self.driver_pool = queue.Queue()
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        for _ in range(max_workers):
            self.driver_pool.put(webdriver.Chrome(options=options))

    def scrape_company(self, company_url):
        driver = self.driver_pool.get()       # borrow a browser from the pool
        try:
            driver.get(company_url)
            extracted_data = {"url": company_url, "title": driver.title}
            # ...additional field extraction logic here...
            return extracted_data
        finally:
            self.driver_pool.put(driver)      # always return the browser

    def scrape_all(self, company_urls):
        # Fan the URLs out across worker threads and collect the results
        return list(self.executor.map(self.scrape_company, company_urls))

Lessons Learned

Technical

  • Importance of robust error handling for web scraping
  • Benefits of polyglot persistence (SQL + NoSQL)
  • Value of comprehensive logging and monitoring

Process

  • Agile methodology enabled quick iterations
  • Regular stakeholder demos ensured alignment
  • Documentation critical for team adoption

Future Enhancements

  • Machine learning for improved data extraction
  • Real-time streaming pipeline
  • Advanced duplicate detection
  • Predictive lead scoring

This project demonstrates end-to-end data engineering: from API integration and web scraping to NLP processing and cloud storage, resulting in measurable business impact.