Project Overview
Automated tool for BTL Industries (Team Florida) that collects decision-maker contact information from Google Business Profiles. Built on Google Cloud Platform with Python, it gathers data filtered by company size, geolocation, industry, product offerings, and website technologies, and applies OSINT techniques for data enrichment.
Technical Implementation
Architecture
- Data Collection: GCP-hosted Python scripts calling Google Business Profiles API
- Web Scraping: Selenium automation for company website data extraction
- NLP Processing: NLTK and spaCy for data extraction and verification
- OCR Pipeline: Tesseract OCR on webpage screenshots to fill missing fields
- Storage: Google Cloud SQL (structured data) + MongoDB (unstructured/screenshots)
- User Interface: SmartSheets API integration for team access
Technology Stack
- Languages: Python
- Cloud: Google Cloud Platform (Compute Engine, Cloud SQL, Cloud Storage)
- Libraries: Selenium, NLTK, spaCy, Tesseract OCR
- Databases: Google Cloud SQL, MongoDB
- APIs: Google Business Profiles, SmartSheets
- Visualization: Tableau, Power BI
Project Timeline
Duration: 3 months
Methodology: Agile with two-week sprints
Sprint Breakdown
- Weeks 1-2: Requirements gathering, search parameter definition
- Weeks 3-4: GCP infrastructure setup, initial data collection framework
- Weeks 5-6: Web scraping implementation, NLP pipeline development
- Weeks 7-8: OCR integration, data validation systems
- Weeks 9-10: Database configuration (Cloud SQL + MongoDB)
- Weeks 11-12: Testing, debugging, optimization
- Week 13: Deployment, documentation, team training
Technical Components
1. Data Collection Module
- Automated API calls to Google Business Profiles
- Parameterized search: company size, location, industry, technologies
- Batch processing with error recovery
- Rate limiting and quota management
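A minimal sketch of the batch collection loop described above, written against a generic REST endpoint with requests. The endpoint URL, quota value, and parameter handling are placeholders for illustration, not the actual Business Profiles API contract.

# Sketch: rate-limited batch collection (placeholder endpoint and quota)
import time
import requests

API_URL = "https://example.com/api/business-profiles/search"  # placeholder URL
MAX_REQUESTS_PER_MINUTE = 60

def fetch_batch(queries, api_key):
    """Run parameterized searches sequentially, respecting a simple per-minute quota."""
    results, min_interval = [], 60.0 / MAX_REQUESTS_PER_MINUTE
    for params in queries:
        started = time.monotonic()
        try:
            resp = requests.get(API_URL, params={**params, "key": api_key}, timeout=30)
            if resp.status_code == 429:  # quota exceeded: back off and retry once
                time.sleep(30)
                resp = requests.get(API_URL, params={**params, "key": api_key}, timeout=30)
            resp.raise_for_status()
            results.append(resp.json())
        except requests.RequestException as exc:  # record the error and continue the batch
            results.append({"error": str(exc), "params": params})
        # Client-side rate limit: never exceed the per-minute request budget.
        elapsed = time.monotonic() - started
        if elapsed < min_interval:
            time.sleep(min_interval - elapsed)
    return results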
2. Web Scraping System
- Selenium WebDriver for dynamic content handling
- Parallel processing across multiple browser instances
- Adaptive scraping for various website structures
- Robust error handling for site changes/downtime
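The error handling for site changes and downtime is sketched below as one hedged approach: retrying flaky page loads with Selenium. The retry count and backoff values are illustrative defaults, not production settings.

# Sketch: retrying flaky page loads (illustrative retry/backoff values)
import time
from selenium.common.exceptions import TimeoutException, WebDriverException

def load_with_retries(driver, url, attempts=3, backoff=5):
    """Load a URL, retrying with linear backoff on timeouts or driver errors."""
    for attempt in range(1, attempts + 1):
        try:
            driver.get(url)
            return True
        except (TimeoutException, WebDriverException):
            if attempt == attempts:
                return False  # give up; the caller logs the failure and moves on
            time.sleep(backoff * attempt)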
3. NLP Pipeline
- Entity extraction from unstructured text
- Contact information pattern matching
- Data standardization and normalization
- Confidence scoring for extracted data
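A simplified sketch of the extraction step, assuming spaCy's small English model plus regex pattern matching for emails and phone numbers. The confidence heuristic shown is illustrative, not the project's actual scoring formula.

# Sketch: entity extraction, contact pattern matching, and a naive confidence score
import re
import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")

def extract_contacts(text):
    """Pull person names via NER and contact details via regex from raw page text."""
    doc = nlp(text)
    people = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
    emails = EMAIL_RE.findall(text)
    phones = PHONE_RE.findall(text)
    # Naive confidence: more corroborating fields -> higher score.
    confidence = min(1.0, 0.3 * bool(people) + 0.4 * bool(emails) + 0.3 * bool(phones))
    return {"people": people, "emails": emails, "phones": phones, "confidence": confidence}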
4. OCR Processing
- Screenshot capture of web pages
- Text extraction from images
- Field mapping to dataset schema
- Quality validation of OCR output
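A minimal sketch of the screenshot-to-text step, assuming headless Chrome, Pillow, and the pytesseract wrapper around Tesseract; field mapping and quality validation are omitted here.

# Sketch: capture a page screenshot and OCR it with Tesseract
import pytesseract
from PIL import Image
from selenium import webdriver

def ocr_page(url, screenshot_path="page.png"):
    """Render a page headlessly, save a screenshot, and return the OCR'd text."""
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        driver.save_screenshot(screenshot_path)
    finally:
        driver.quit()
    return pytesseract.image_to_string(Image.open(screenshot_path))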
5. Data Storage Architecture
Structured Data (Google Cloud SQL):
- Normalized relational schema
- Indexed for fast queries
- Automated backups
- Read replicas for scaling
Unstructured Data (MongoDB):
- JSON documents from web scraping
- Binary storage for screenshots
- Flexible schema for varied data
- GridFS for large file handling
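A minimal sketch of the unstructured-data path, assuming pymongo and GridFS. The connection URI, database name, and collection name are placeholders.

# Sketch: store a scraped JSON document and its screenshot in MongoDB via GridFS
import gridfs
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
db = client["lead_collection"]                     # placeholder database name
fs = gridfs.GridFS(db)                             # GridFS handles large binaries

def store_company(doc, screenshot_path):
    """Insert the scraped document and attach its screenshot through GridFS."""
    with open(screenshot_path, "rb") as fh:
        screenshot_id = fs.put(fh, filename=screenshot_path)
    doc["screenshot_id"] = screenshot_id
    return db.companies.insert_one(doc).inserted_id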
6. SmartSheets Integration
- Real-time data synchronization
- Automated updates via API
- Role-based access control
- Audit logging of data changes
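A minimal sketch of pushing a lead row to a sheet through the Smartsheet REST API with requests. The sheet ID, column IDs, and token are placeholders, and the production integration may use the official SDK instead.

# Sketch: append one lead row via the Smartsheet REST API (placeholder IDs and token)
import requests

API_BASE = "https://api.smartsheet.com/2.0"
TOKEN = "YOUR_ACCESS_TOKEN"                                   # placeholder token
SHEET_ID = 1234567890                                         # placeholder sheet ID
COLUMN_IDS = {"company": 111, "contact": 222, "email": 333}   # placeholder column IDs

def push_lead(lead):
    """Append one row containing the lead's fields to the bottom of the sheet."""
    row = {
        "toBottom": True,
        "cells": [{"columnId": COLUMN_IDS[k], "value": lead[k]} for k in COLUMN_IDS],
    }
    resp = requests.post(
        f"{API_BASE}/sheets/{SHEET_ID}/rows",
        headers={"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"},
        json=[row],
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()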
7. Reporting Dashboard
- Tableau/Power BI connections
- Key metrics: leads collected, data quality, processing status
- Trend analysis and forecasting
- Export capabilities for stakeholders
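A minimal sketch of one way to feed the dashboards: aggregating counts from the MongoDB store and writing a CSV that Tableau or Power BI can connect to. Field names and the quality threshold are illustrative, not the project's actual metric definitions.

# Sketch: export daily key metrics to CSV for the BI tools (illustrative fields)
import csv
from datetime import date
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["lead_collection"]  # placeholder URI

def export_daily_metrics(path="daily_metrics.csv"):
    """Write leads-collected and data-quality figures to a CSV file."""
    total = db.companies.count_documents({})
    verified = db.companies.count_documents({"confidence": {"$gte": 0.8}})
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["date", "leads_collected", "verified_leads", "quality_rate"])
        writer.writerow([date.today().isoformat(), total, verified,
                         round(verified / total, 3) if total else 0.0])
    return path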
Results
Quantitative Outcomes
- Processed 10,000+ companies
- Zero manual data entry required
- 50% increase in qualified leads
- Data accuracy rate: 94%
Business Impact
- President’s Club 2023: Team Florida won the company’s top sales award
- Sales Performance: $30 million in revenue, double the second-place team’s total
- Efficiency Gains: Sales team focused on selling vs. data collection
- Scalability: System handles increasing data volume without additional resources
Technical Achievements
- Fully automated pipeline from search to storage
- Handles multiple data formats and sources
- Self-healing with automatic error recovery
- Production uptime: 99.8%
Code Sample
# Example: Parallel web scraping with Selenium
import queue
from concurrent.futures import ThreadPoolExecutor

from selenium import webdriver

class ParallelScraper:
    def __init__(self, max_workers=5):
        self.executor = ThreadPoolExecutor(max_workers=max_workers)
        # Pre-create one headless browser per worker and keep them in a pool.
        self.driver_pool = queue.Queue()
        for _ in range(max_workers):
            options = webdriver.ChromeOptions()
            options.add_argument("--headless=new")
            self.driver_pool.put(webdriver.Chrome(options=options))

    def scrape_company(self, company_url):
        # Borrow a driver from the pool and always return it when done.
        driver = self.driver_pool.get()
        try:
            driver.get(company_url)
            # Extraction logic goes here; the page title stands in for it.
            return {"url": company_url, "title": driver.title}
        finally:
            self.driver_pool.put(driver)

    def scrape_all(self, urls):
        # Fan the company URLs out across the worker threads.
        return list(self.executor.map(self.scrape_company, urls))
Lessons Learned
Technical
- Importance of robust error handling for web scraping
- Benefits of polyglot persistence (SQL + NoSQL)
- Value of comprehensive logging and monitoring
Process
- Agile methodology enabled quick iterations
- Regular stakeholder demos ensured alignment
- Documentation critical for team adoption
Future Enhancements
- Machine learning for improved data extraction
- Real-time streaming pipeline
- Advanced duplicate detection
- Predictive lead scoring
This project demonstrates end-to-end data engineering: from API integration and web scraping to NLP processing and cloud storage, resulting in measurable business impact.