Project Overview
Automated tool for BTL Industries (Team Florida) that collects decision-maker contact information from Google Business Profiles. Built on Google Cloud Platform with Python, it gathers data filtered by company size, geolocation, industry, product offerings, and website technologies, and applies OSINT techniques for data enrichment.
Technical Implementation
Architecture
- Data Collection: GCP-hosted Python scripts calling Google Business Profiles API
- Web Scraping: Selenium automation for company website data extraction
- NLP Processing: NLTK and spaCy for data extraction and verification
- OCR Pipeline: Tesseract OCR on webpage screenshots to fill missing fields
- Storage: Google Cloud SQL (structured data) + MongoDB (unstructured/screenshots)
- User Interface: SmartSheets API integration for team access
Technology Stack
- Languages: Python
- Cloud: Google Cloud Platform (Compute Engine, Cloud SQL, Cloud Storage)
- Libraries: Selenium, NLTK, spaCy, Tesseract OCR
- Databases: Google Cloud SQL, MongoDB
- APIs: Google Business Profiles, SmartSheets
- Visualization: Tableau, Power BI
Project Timeline
Duration: 3 months
Methodology: Agile with two-week sprints
Sprint Breakdown
- Weeks 1-2: Requirements gathering, search parameter definition
- Weeks 3-4: GCP infrastructure setup, initial data collection framework
- Weeks 5-6: Web scraping implementation, NLP pipeline development
- Weeks 7-8: OCR integration, data validation systems
- Weeks 9-10: Database configuration (Cloud SQL + MongoDB)
- Weeks 11-12: Testing, debugging, optimization
- Week 13: Deployment, documentation, team training
Technical Components
1. Data Collection Module
- Automated API calls to Google Business Profiles
- Parameterized search: company size, location, industry, technologies
- Batch processing with error recovery
- Rate limiting and quota management
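A minimal sketch of the batch collection loop described above, written against a generic REST endpoint with requests. The endpoint URL, quota value, and parameter handling are placeholders for illustration, not the actual Business Profiles API contract.

# Sketch: rate-limited batch collection (placeholder endpoint and quota)
import time
import requests

API_URL = "https://example.com/api/business-profiles/search"  # placeholder URL
MAX_REQUESTS_PER_MINUTE = 60

def fetch_batch(queries, api_key):
    """Run parameterized searches sequentially, respecting a simple per-minute quota."""
    results, min_interval = [], 60.0 / MAX_REQUESTS_PER_MINUTE
    for params in queries:
        started = time.monotonic()
        try:
            resp = requests.get(API_URL, params={**params, "key": api_key}, timeout=30)
            if resp.status_code == 429:  # quota exceeded: back off and retry once
                time.sleep(30)
                resp = requests.get(API_URL, params={**params, "key": api_key}, timeout=30)
            resp.raise_for_status()
            results.append(resp.json())
        except requests.RequestException as exc:  # record the error and continue the batch
            results.append({"error": str(exc), "params": params})
        # Client-side rate limit: never exceed the per-minute request budget.
        elapsed = time.monotonic() - started
        if elapsed < min_interval:
            time.sleep(min_interval - elapsed)
    return results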
2. Web Scraping System
- Selenium WebDriver for dynamic content handling
- Parallel processing across multiple browser instances
- Adaptive scraping for various website structures
- Robust error handling for site changes/downtime
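The error handling for site changes and downtime is sketched below as one hedged approach: retrying flaky page loads with Selenium. The retry count and backoff values are illustrative defaults, not production settings.

# Sketch: retrying flaky page loads (illustrative retry/backoff values)
import time
from selenium.common.exceptions import TimeoutException, WebDriverException

def load_with_retries(driver, url, attempts=3, backoff=5):
    """Load a URL, retrying with linear backoff on timeouts or driver errors."""
    for attempt in range(1, attempts + 1):
        try:
            driver.get(url)
            return True
        except (TimeoutException, WebDriverException):
            if attempt == attempts:
                return False  # give up; the caller logs the failure and moves on
            time.sleep(backoff * attempt)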
3. NLP Pipeline
- Entity extraction from unstructured text
- Contact information pattern matching
- Data standardization and normalization
- Confidence scoring for extracted data
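A simplified sketch of the extraction step, assuming spaCy's small English model plus regex pattern matching for emails and phone numbers. The confidence heuristic shown is illustrative, not the project's actual scoring formula.

# Sketch: entity extraction, contact pattern matching, and a naive confidence score
import re
import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")

def extract_contacts(text):
    """Pull person names via NER and contact details via regex from raw page text."""
    doc = nlp(text)
    people = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
    emails = EMAIL_RE.findall(text)
    phones = PHONE_RE.findall(text)
    # Naive confidence: more corroborating fields -> higher score.
    confidence = min(1.0, 0.3 * bool(people) + 0.4 * bool(emails) + 0.3 * bool(phones))
    return {"people": people, "emails": emails, "phones": phones, "confidence": confidence}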
4. OCR Processing
- Screenshot capture of web pages
- Text extraction from images
- Field mapping to dataset schema
- Quality validation of OCR output
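A minimal sketch of the screenshot-to-text step, assuming headless Chrome, Pillow, and the pytesseract wrapper around Tesseract; field mapping and quality validation are omitted here.

# Sketch: capture a page screenshot and OCR it with Tesseract
import pytesseract
from PIL import Image
from selenium import webdriver

def ocr_page(url, screenshot_path="page.png"):
    """Render a page headlessly, save a screenshot, and return the OCR'd text."""
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        driver.save_screenshot(screenshot_path)
    finally:
        driver.quit()
    return pytesseract.image_to_string(Image.open(screenshot_path))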
5. Data Storage Architecture
Structured Data (Google Cloud SQL):
- Normalized relational schema
- Indexed for fast queries
- Automated backups
- Read replicas for scaling
Unstructured Data (MongoDB):
- JSON documents from web scraping
- Binary storage for screenshots
- Flexible schema for varied data
- GridFS for large file handling
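A minimal sketch of the unstructured-data path, assuming pymongo and GridFS. The connection URI, database name, and collection name are placeholders.

# Sketch: store a scraped JSON document and its screenshot in MongoDB via GridFS
import gridfs
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
db = client["lead_collection"]                     # placeholder database name
fs = gridfs.GridFS(db)                             # GridFS handles large binaries

def store_company(doc, screenshot_path):
    """Insert the scraped document and attach its screenshot through GridFS."""
    with open(screenshot_path, "rb") as fh:
        screenshot_id = fs.put(fh, filename=screenshot_path)
    doc["screenshot_id"] = screenshot_id
    return db.companies.insert_one(doc).inserted_id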
6. SmartSheets Integration
- Real-time data synchronization
- Automated updates via API
- Role-based access control
- Audit logging of data changes
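A minimal sketch of pushing a lead row to a sheet through the Smartsheet REST API with requests. The sheet ID, column IDs, and token are placeholders, and the production integration may use the official SDK instead.

# Sketch: append one lead row via the Smartsheet REST API (placeholder IDs and token)
import requests

API_BASE = "https://api.smartsheet.com/2.0"
TOKEN = "YOUR_ACCESS_TOKEN"                                   # placeholder token
SHEET_ID = 1234567890                                         # placeholder sheet ID
COLUMN_IDS = {"company": 111, "contact": 222, "email": 333}   # placeholder column IDs

def push_lead(lead):
    """Append one row containing the lead's fields to the bottom of the sheet."""
    row = {
        "toBottom": True,
        "cells": [{"columnId": COLUMN_IDS[k], "value": lead[k]} for k in COLUMN_IDS],
    }
    resp = requests.post(
        f"{API_BASE}/sheets/{SHEET_ID}/rows",
        headers={"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"},
        json=[row],
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()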
7. Reporting Dashboard
- Tableau/Power BI connections
- Key metrics: leads collected, data quality, processing status
- Trend analysis and forecasting
- Export capabilities for stakeholders
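A minimal sketch of one way to feed the dashboards: aggregating counts from the MongoDB store and writing a CSV that Tableau or Power BI can connect to. Field names and the quality threshold are illustrative, not the project's actual metric definitions.

# Sketch: export daily key metrics to CSV for the BI tools (illustrative fields)
import csv
from datetime import date
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["lead_collection"]  # placeholder URI

def export_daily_metrics(path="daily_metrics.csv"):
    """Write leads-collected and data-quality figures to a CSV file."""
    total = db.companies.count_documents({})
    verified = db.companies.count_documents({"confidence": {"$gte": 0.8}})
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["date", "leads_collected", "verified_leads", "quality_rate"])
        writer.writerow([date.today().isoformat(), total, verified,
                         round(verified / total, 3) if total else 0.0])
    return path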
Results
Quantitative Outcomes
- Processed 10,000+ companies
- Zero manual data entry required
- 50% increase in qualified leads
- Data accuracy rate: 94%
Business Impact
- President’s Club 2023: Team Florida won the company’s top sales award
- Sales Performance: $30 million in revenue, double the second-place team’s total
- Efficiency Gains: Sales team focused on selling vs. data collection
- Scalability: System handles increasing data volume without additional resources
Technical Achievements
- Fully automated pipeline from search to storage
- Handles multiple data formats and sources
- Self-healing with automatic error recovery
- Production uptime: 99.8%
Code Sample
# Example: Parallel web scraping with Selenium
import queue
from concurrent.futures import ThreadPoolExecutor

from selenium import webdriver

class ParallelScraper:
    def __init__(self, max_workers=5):
        self.executor = ThreadPoolExecutor(max_workers=max_workers)
        # Pre-create one headless browser per worker and keep them in a pool.
        self.driver_pool = queue.Queue()
        for _ in range(max_workers):
            options = webdriver.ChromeOptions()
            options.add_argument("--headless=new")
            self.driver_pool.put(webdriver.Chrome(options=options))

    def scrape_company(self, company_url):
        # Borrow a driver from the pool and always return it when done.
        driver = self.driver_pool.get()
        try:
            driver.get(company_url)
            # Extraction logic goes here; the page title stands in for it.
            return {"url": company_url, "title": driver.title}
        finally:
            self.driver_pool.put(driver)

    def scrape_all(self, urls):
        # Fan the company URLs out across the worker threads.
        return list(self.executor.map(self.scrape_company, urls))
Lessons Learned
Technical
- Importance of robust error handling for web scraping
- Benefits of polyglot persistence (SQL + NoSQL)
- Value of comprehensive logging and monitoring
Process
- Agile methodology enabled quick iterations
- Regular stakeholder demos ensured alignment
- Documentation critical for team adoption
Future Enhancements
- Machine learning for improved data extraction
- Real-time streaming pipeline
- Advanced duplicate detection
- Predictive lead scoring
This project demonstrates end-to-end data engineering: from API integration and web scraping to NLP processing and cloud storage, resulting in measurable business impact.