
Wikimedia Opens Its Data for Easier AI and User Search

Summary

– Wikidata is creating a new AI-friendly database that converts nearly 19 million of its entries into vectors for easier use by large language models.
– The project aims to level the playing field for smaller AI developers by giving them free access to curated data that large tech companies have the resources to process themselves.
– This vectorized format helps AI systems better understand the context and relationships between data points, such as connecting Douglas Adams to his works.
– The initiative could lead to AI chatbots that better represent niche topics not widely covered online, improving information diversity.
– The current database covers data up to September 18, 2024 and is hosted free of charge by DataStax, with future updates planned based on developer feedback.

The Wikimedia Foundation is making its vast repository of structured data more accessible to artificial intelligence systems, a move aimed at empowering smaller developers and enriching the knowledge base available to AI. This initiative, known as the Wikidata Embedding Project, transforms the information within Wikidata into a format that large language models can more easily understand and use. By converting nearly 19 million entries into mathematical vectors, the project captures the context and relationships between data points, moving beyond simple keywords to represent meaning.

Consider the example of author Douglas Adams. While his Wikipedia page offers a biography, Wikidata stores a wealth of associated information, from his birth sign to the universal library classification numbers for his books. This data is now being processed into a vectorized format, which can be visualized as a network of interconnected points. In this graph, Adams would be directly linked to concepts like “human” and the titles of his literary works.
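
As a rough illustration of that network of interconnected points, the sketch below stores a handful of statements as subject, property, value triples and looks up what an item is linked to. The triples, the `build_graph` function, and the `neighbors` helper are all made up for this example; it is not how Wikidata or the embedding project actually stores its data.

```python
from collections import defaultdict

# Illustrative statements only. Plain labels are used here instead of
# real Wikidata identifiers, and the list is nowhere near complete.
TRIPLES = [
    ("Douglas Adams", "instance of", "human"),
    ("Douglas Adams", "notable work", "The Hitchhiker's Guide to the Galaxy"),
    ("Douglas Adams", "notable work", "Dirk Gently's Holistic Detective Agency"),
    ("The Hitchhiker's Guide to the Galaxy", "author", "Douglas Adams"),
]


def build_graph(triples):
    """Group (property, value) pairs under each subject."""
    graph = defaultdict(list)
    for subject, prop, value in triples:
        graph[subject].append((prop, value))
    return graph


def neighbors(graph, item):
    """Return the sorted, de-duplicated values an item points to directly."""
    return sorted({value for _, value in graph.get(item, [])})


graph = build_graph(TRIPLES)
print(neighbors(graph, "Douglas Adams"))
```

Here "human" and the two book titles come back as direct neighbors of the Douglas Adams entry, which mirrors the links described above.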

The primary goal is to level the playing field for AI developers outside the monied core of Big Tech. Large corporations like OpenAI and Anthropic possess the resources to undertake such data vectorization themselves. This project, however, provides smaller companies and independent researchers with ready-made, high-quality data, sparing them the cost of doing that work on their own. As Lydia Pintscher, the Wikidata portfolio lead, explained, it’s about offering these smaller outfits a chance to compete and innovate.

For the average Wikipedia user, the front-end experience remains unchanged; the encyclopedia is not transforming into a chatbot. The enhancements are happening behind the scenes, making it simpler for developers to build AI applications, such as specialized chatbots, that can tap into Wikidata’s verified information. A real-world example of this data’s potential is Govdirectory, a platform that uses Wikidata to help users find contact information for public officials worldwide.

Another key benefit involves improving the diversity of knowledge within AI systems. Most chatbots are trained on the most popular content from the broader internet, which can overlook niche subjects. By providing straightforward access to Wikidata’s meticulously curated data, the project hopes to foster AI that better represents these specialized topics. This offers a more direct and reliable method for incorporating accurate information into systems like ChatGPT, bypassing the need to generate vast amounts of web content and hope it gets included in a future training cycle.

In practical terms, these vectors enable AI to grasp the context surrounding a piece of information, not just the fact itself. The technical work was carried out by a team at Wikimedia Deutschland, which used a model from Jina AI to process the data. The infrastructure for storing this new vector database is being provided free of charge by DataStax, a company owned by IBM.
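
To make the idea of matching by meaning more concrete, here is a minimal sketch of similarity search over vectors, assuming cosine similarity as the comparison measure. The three-number vectors, the `ITEMS` table, and the `rank` helper are invented for illustration; real embeddings have hundreds of dimensions, and nothing here reflects the project's actual model or API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy vectors (an assumption for this example, not real embeddings).
ITEMS = {
    "Douglas Adams": [0.9, 0.8, 0.1],
    "The Hitchhiker's Guide to the Galaxy": [0.8, 0.9, 0.2],
    "Mount Everest": [0.1, 0.0, 0.9],
}


def rank(query_vector, items):
    """Sort items from most to least similar to the query vector."""
    scored = [
        (name, cosine_similarity(query_vector, vector))
        for name, vector in items.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Stands in for the embedding of a query about a science fiction author.
query = [0.85, 0.85, 0.1]
results = rank(query, ITEMS)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

The query shares no keywords with any entry, yet the author and his book rank far above the unrelated mountain because their vectors point in a similar direction.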

The current database snapshot includes all data up to September 18th, 2024. The project team is now seeking feedback from the developer community before updating the database with information added over the past year. According to Philippe Saadé, the Wikidata AI project manager, minor edits to existing entries do not significantly impact the database’s overall utility, as the vectors represent the general concept of an item. The core value lies in the structured, contextual relationships that have now been unlocked for the AI ecosystem.

(Source: The Verge)

