S&P Boosts SME Data Collection 5X with Deep Web & AI Tech

Summary
– The investing world lacks accessible data on small and medium-sized enterprises (SMEs), making creditworthiness assessments difficult.
– S&P Global Market Intelligence developed RiskGauge, an AI-powered platform that scrapes data from 200M+ websites to generate SME risk scores.
– RiskGauge runs on Snowflake's cloud data platform and uses machine learning to expand SME coverage by 5X, improving accuracy for institutional clients.
– The platform employs multi-layer web scraping, data cleaning, and ensemble algorithms to automate credit scoring without human intervention.
– Challenges included processing terabytes of web data, handling inconsistently structured websites, and balancing algorithm speed against accuracy.
S&P Global is using AI and deep web scraping to expand its SME credit risk coverage fivefold. The financial data giant's new RiskGauge platform addresses a critical industry gap: the lack of transparent financial information for small and medium businesses, which don't face the same disclosure requirements as public companies.
Traditional credit analysis struggled with SMEs because their financial data simply wasn’t publicly available. S&P’s solution combines web scraping technology with machine learning algorithms to extract and analyze data from over 200 million websites, transforming unstructured information into actionable credit scores. The system now covers approximately 10 million U.S. SMEs compared to just 2 million previously.
Moody Hadi, who leads new product development for S&P’s risk solutions, explained how the platform works. “Large institutions need reliable credit assessments when lending to suppliers or partners,” he said. “Our technology provides those insights at scale where manual processes would be impossible.”
The platform is built as a multi-stage data pipeline. Web crawlers first extract information from company websites. Algorithms then clean and structure this data before feeding it into Snowflake's cloud data platform. Finally, machine learning models analyze the information to generate risk profiles.
Key innovations include ensemble algorithms that cross-validate company details and sentiment analysis that evaluates business announcements. The system automatically detects website updates through hash key comparisons, ensuring ongoing data freshness without unnecessary processing.
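The article does not say how RiskGauge computes or stores its hash keys. As a rough illustration of the general idea, the Python sketch below hashes a page's extracted text and skips reprocessing when the hash matches the one recorded on the previous crawl. The function names, the choice of SHA-256, the whitespace normalization, and the in-memory dictionary are all assumptions made for the example, not details of S&P's system.

```python
import hashlib
from typing import Dict


def content_hash(text: str) -> str:
    """Return a SHA-256 hex digest of the text after collapsing whitespace."""
    # Collapsing whitespace keeps purely cosmetic changes from altering the hash.
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def needs_reprocessing(url: str, text: str, seen: Dict[str, str]) -> bool:
    """Return True if the page is new or changed, recording its latest hash."""
    new_hash = content_hash(text)
    if seen.get(url) == new_hash:
        return False  # unchanged since the last crawl, so skip it
    seen[url] = new_hash
    return True


# Hypothetical usage: `seen` stands in for whatever store a real pipeline uses.
seen: Dict[str, str] = {}
first_visit = needs_reprocessing("https://example.com", "Acme Corp  sells widgets.", seen)
second_visit = needs_reprocessing("https://example.com", "Acme Corp sells widgets.", seen)
after_update = needs_reprocessing("https://example.com", "Acme Corp sells gadgets.", seen)
```

In this sketch the second visit differs only in spacing, so it is treated as unchanged, while the edited text on the third visit triggers reprocessing.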
Developing the platform presented substantial technical hurdles. Processing terabytes of web data required constant optimization to balance accuracy with performance. The team also had to account for the messy reality of website structures – few follow standard formats or sitemap protocols.
“Websites by design aren’t clean,” Hadi noted. “We focused on extracting meaningful text while ignoring irrelevant code elements.” This approach allowed the system to adapt to diverse website architectures without relying on rigid templates or robotic process automation.
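To make the idea of keeping meaningful text and dropping code elements concrete, here is a minimal Python sketch built on the standard library's html.parser module. It collects visible text and skips the contents of script and style tags. The class name, the helper function, and the list of skipped tags are illustrative assumptions; the article does not describe how S&P's extractor is actually implemented.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything nested inside non-content tags."""

    # Assumed set of tags whose contents are code rather than readable text.
    SKIP_TAGS = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits outside the skipped tags.
        cleaned = " ".join(data.split())
        if self._skip_depth == 0 and cleaned:
            self._chunks.append(cleaned)

    def text(self) -> str:
        return " ".join(self._chunks)


def extract_text(html: str) -> str:
    """Return the visible text of an HTML document as a single string."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p{color:red}</style>"
    "<script>var x = 1;</script></head>"
    "<body><h1>Acme Corp</h1><p>We sell   widgets.</p></body></html>"
)
result = extract_text(sample)
```

Because the parser reacts to whatever tags it encounters rather than expecting a fixed layout, the same code works across differently structured pages, which is the property the quote points to.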
For financial institutions, the platform delivers detailed reports including financial analytics, historical performance metrics, and peer comparisons. Each SME receives a risk score from 1 (lowest risk) to 100 (highest risk), giving lenders standardized metrics for decision-making.
The solution represents a major leap forward in financial transparency for the SME sector. By applying AI to alternative data sources, S&P has created a scalable method to assess creditworthiness where traditional financial statements aren’t available – potentially unlocking new lending opportunities across the economy.
(Source: VentureBeat)

