
An Engineer’s Quest for a Better Search Engine

▼ Summary

New search engine targets SEO spam – A software engineer built a custom search engine to combat irrelevant results and SEO spam, improving relevance.
Neural embeddings enhance accuracy – Wilson Lin used neural embeddings and sentence-level chunking for semantic understanding, with a classifier for contextual references.
Crawling challenges addressed – Identifying main content via HTML tags and filtering URLs were key hurdles, with DNS and URL protocols causing frequent issues.
Storage optimized for scale – Started on Oracle Cloud for cheap egress, then moved from PostgreSQL to RocksDB for high ingestion rates, and shifted GPU embedding work to Runpod for cost efficiency.
Key insights on search limitations – Highlights include index size importance, crawling difficulties, small-scale engine constraints, and trust evaluation challenges.

A software engineer from New York, frustrated by irrelevant results and SEO spam prevalent in mainstream search engines, embarked on an ambitious journey to create a more refined search tool. Within just two months, Wilson Lin had a functional demo ready to showcase. His endeavor sheds light on the inherent challenges and insights gained in developing a search engine free from the usual clutter.

The Inspiration Behind the Project

Wilson’s main motivation for building the search engine was the growing volume of SEO spam infiltrating mainstream platforms. He was pleased with how sharply his engine reduced it, sharing, “What’s great is the comparable lack of SEO spam.”

Harnessing Neural Embeddings

The key to Wilson’s approach was the use of neural embeddings. Through a small-scale test, he validated this method’s effectiveness in providing quality search results. His focus on embeddings allowed the search engine to interpret and rank content with greater accuracy.
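The article doesn’t show Lin’s implementation, but the core mechanic of embedding-based search — represent the query and each document as vectors, then sort by similarity — can be sketched with toy vectors (in a real system, a neural model produces the vectors):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def rank(query_vec: list[float], doc_vecs: list[list[float]]) -> list[int]:
    """Return document indices ordered by semantic closeness to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    return [i for _, i in sorted(scored, reverse=True)]
```

The ranking step is the same regardless of which embedding model is used; the model only determines how well vector closeness tracks semantic relevance.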

Refining Content Processing

Wilson faced the challenge of processing data efficiently, deciding between paragraph-level or sentence-level content blocks. Opting for sentences offered the most granular level of relevance while still allowing the synthesis of paragraph-level context. To tackle ambiguities with indirect references, Wilson trained a DistilBERT classifier model. This tool helped maintain contextual integrity by linking dependent sentences, ensuring that every piece of information had its proper context.
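A minimal sketch of the chunking idea follows. The sentence splitter is deliberately naive, and the dependency check is a hypothetical heuristic stand-in for Lin’s trained DistilBERT classifier, shown only to illustrate how dependent sentences can be re-linked to their context:

```python
import re

def split_sentences(paragraph: str) -> list[str]:
    """Naive sentence splitter; the pipeline described chunks at sentence level."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]

# Hypothetical stand-in for the classifier: a sentence opening with a pronoun
# or connective likely depends on the sentence before it.
_DEPENDENT_OPENERS = ("it", "this", "that", "they", "however", "therefore")

def attach_context(sentences: list[str]) -> list[str]:
    """Prepend the previous sentence when a sentence looks context-dependent."""
    chunks = []
    for i, s in enumerate(sentences):
        first_word = s.split()[0].lower().strip(",")
        if i > 0 and first_word in _DEPENDENT_OPENERS:
            chunks.append(sentences[i - 1] + " " + s)
        else:
            chunks.append(s)
    return chunks
```

The payoff is that a chunk like “It needed more disk.” is embedded together with its antecedent, so its vector reflects what “it” refers to.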

Pinpointing Main Content

One of the key hurdles in crawling was differentiating between valuable content and non-content parts of a webpage. With varying HTML markup styles across websites, this task was complex. Wilson’s strategy involved relying on several HTML elements to identify main content areas, such as paragraphs, quotes, and lists.
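A rough take on this approach, using only the standard library: keep text that sits inside content-bearing tags and discard everything else. The tag set here matches the elements mentioned above; a production extractor would handle many more cases:

```python
from html.parser import HTMLParser

class MainContentExtractor(HTMLParser):
    """Collect text only from content-bearing tags (p, blockquote, li),
    skipping navigation, footers, and other boilerplate."""
    CONTENT_TAGS = {"p", "blockquote", "li"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting level inside content tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.CONTENT_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.CONTENT_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())

def extract_main_content(html: str) -> list[str]:
    parser = MainContentExtractor()
    parser.feed(html)
    return parser.chunks
```

Text inside a `<nav>` or `<footer>` never increments the depth counter, so it is silently dropped.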

Tackling Crawling Challenges

Crawling presented its own set of challenges. Wilson ran into unexpected issues such as DNS resolution failures, and restrictions on protocols and URL types complicated matters further. His solutions included enforcing HTTPS and canonicalizing URLs to manage duplicate content effectively. Unusually long URLs and odd characters added complexity of their own, requiring careful handling at the database layer.
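The exact canonicalization rules Lin applied aren’t published, but the transformations below — forcing HTTPS, lowercasing the host, dropping fragments, and sorting query parameters — are common choices that collapse many duplicate URLs onto one index key:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def canonicalize(url: str) -> str:
    """Normalize a URL so duplicate pages map to one index key."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    if scheme == "http":
        scheme = "https"   # assumption: the site serves the same page over TLS
    netloc = netloc.lower()
    # sort parameters so ?a=1&b=2 and ?b=2&a=1 become identical
    query = urlencode(sorted(parse_qsl(query, keep_blank_values=True)))
    if not path:
        path = "/"
    return urlunsplit((scheme, netloc, path, query, ""))
```

Fragments are dropped because they never reach the server, so two URLs differing only in `#fragment` always fetch the same page.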

Navigating Storage Solutions

Initially, Wilson chose Oracle Cloud for its affordable egress costs, vital when moving terabytes of data. However, scalability issues led him to explore other options. After trying PostgreSQL, he settled on RocksDB for its operational simplicity and its ability to be distributed across machines, enabling high data ingestion rates.
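One reason an ordered key-value store like RocksDB suits this workload is that composite byte keys let you fetch all chunks of a page with a single prefix seek. The sketch below is illustration only — a plain Python dict stands in for RocksDB, and the key layout is a hypothetical example, not Lin’s actual schema:

```python
# Illustration: RocksDB stores keys in byte order, so a composite key like
# b"page/<url-hash>/chunk/<n>" groups all chunks of one page contiguously.

def chunk_key(url_hash: str, chunk_no: int) -> bytes:
    # zero-pad the chunk number so byte order matches numeric order
    return f"page/{url_hash}/chunk/{chunk_no:06d}".encode()

def prefix_scan(store: dict, prefix: bytes) -> list[bytes]:
    """Return values whose keys start with `prefix`, in key order —
    analogous to a RocksDB iterator seek over a key prefix."""
    return [store[k] for k in sorted(store) if k.startswith(prefix)]
```

In real RocksDB the scan is a cheap iterator seek rather than a full sort, which is what makes high ingestion and retrieval rates practical.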

Leveraging GPU for Inference

To generate semantic vector embeddings, Wilson utilized GPU-powered inference, initially relying on OpenAI. The project’s growth necessitated a shift to a more cost-effective solution. Runpod’s high-performance GPUs proved economical, supporting the search engine’s expanding needs.
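Much of the cost advantage of self-hosted GPU inference comes from batching: large batches keep the GPU saturated and amortize per-call overhead. A minimal sketch, where `embed_fn` is a hypothetical function that sends one batch to the GPU worker and returns one vector per input:

```python
def batched(texts: list[str], batch_size: int):
    """Yield fixed-size batches of texts for GPU inference."""
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]

def embed_corpus(texts: list[str], embed_fn, batch_size: int = 64) -> list:
    """Embed all texts by feeding `embed_fn` one batch at a time."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors
```

The right batch size depends on the model and GPU memory; too small wastes throughput, too large risks out-of-memory errors.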

A Cleaner Search Experience

Wilson proudly highlighted his search engine’s success in minimizing spam and enhancing query understanding. He demonstrated its prowess with complex queries, showcasing its ability to generate meaningful content suggestions from detailed inputs.

Key Learnings from the Endeavor

Wilson shared several insights, crucial for digital marketers and publishers:

The Importance of Index Size: The search engine’s quality heavily depends on the index’s size, where coverage equates to quality.
Challenges in Crawling and Filtering: Striking a balance between quantity and quality in crawled content remains daunting, echoing the fundamental challenges PageRank was originally designed to address.
Limitations of Small-Scale Search Engines: The inability to crawl the entire web limits coverage and highlights the constraints of smaller players in this domain.
Complexity of Judging Content Trust and Authenticity: Automatically assessing originality and accuracy is intricate. Wilson believes that newer technologies could simplify and enhance this process, providing more accurate evaluations.

For those interested in exploring Wilson’s innovative search engine, it is accessible online, along with comprehensive technical details of its development.

(Source: Search Engine Journal)
