DataStax’s AI platform enables Wikimedia Deutschland to ingest and embed 10 million Wikidata entries in less than 3 days

DataStax, whose leading AI platform helps companies and developers build more accurate artificial intelligence (AI) applications with 60% less development time, today announced that Wikimedia Deutschland, the organisation that supports German Wikipedia and develops Wikidata and Wikibase, is leveraging the DataStax AI Platform, built with NVIDIA AI, including NVIDIA NeMo Retriever and NIM microservices, to make Wikidata available to developers as a vectorized database.

Wikidata serves all Wikipedia language versions as a central linked open data platform and is the largest collaborative knowledge graph of openly editable, openly accessible data, covering over 300 languages. The global community of more than 24,000 volunteers has contributed over 114 million entries to date, and these entries are used by thousands of software developers across the open source landscape. The shared goal of Wikimedia Deutschland and DataStax is to provide this data as an openly accessible dataset of the world’s knowledge for the open source AI/ML community. One of the key technical challenges was vector embedding such a large and constantly changing dataset so that it is always up to date for developers to use.

“WMDE plans to make Wikidata’s data easily accessible for the Open Source AI/ML Community via an advanced vector search by expanding the functionality with fully multilingual models, such as Jina AI through DataStax’s API portal, to semantically search up to 100 of the languages represented on Wikidata. To vector embed a large, massively multilingual, multicultural, and dynamic dataset is a hard challenge, especially for low-resource, low-capacity open source developers. With DataStax’s collaboration, there is a chance that the world can soon access large subsets of Wikidata’s data for their AI/ML applications through an easier-to-access method. Although only available in English for now, DataStax’s solution provided a valuable initial experiment ~10x faster than our previous, on-premise GPU solution. This near-real-time speed will permit us to experiment at scale and speed by testing the integration of large subsets in a vector database aligned with the frequent updates of Wikidata,” said Dr. Jonathan Fraine, Chief Technology Officer, Wikimedia Deutschland.

Developer efficiency is also key for Wikimedia Deutschland. Wikidata is one of the world’s largest open source knowledge graphs, and with the DataStax AI Platform on AWS it was possible to ingest, process, and vector embed over 10 million entries in under 3 days. The vectorized data remains available under the free CC0 licence.

Vectorizing such an extensive dataset is highly complex, as each document requires resource-intensive embedding processes to support real-time search and accessibility. Traditional linear read/write operations cannot keep pace with the scale and speed Wikimedia Deutschland needs to make the global community’s hundreds of thousands of daily updates instantly accessible to millions of users. As the world’s foremost open source knowledge graph, Wikidata demands high-quality, real-time results for hundreds of updates each minute. With Astra DB’s serverless Vectorize offering, hosted on AWS, and NVIDIA NeMo, the DataStax AI Platform provides the near-zero latency and scalability needed to keep Wikidata’s vector database always up to date, maintaining the reliability essential for serving Wikimedia’s global audience.
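In practice, the serverless Vectorize pattern means embeddings are generated inside Astra DB at write and query time rather than in the client. The snippet below is a minimal sketch of that workflow using the astrapy Python client; it assumes a collection has already been created with a vectorize embedding provider enabled, and the endpoint, token, collection name, and sample document are placeholders rather than details from this announcement.

```python
from astrapy import DataAPIClient

# Connect to an Astra DB database (token and API endpoint are placeholders).
client = DataAPIClient("APPLICATION_TOKEN")
db = client.get_database("https://<db-id>-<region>.apps.astra.datastax.com")

# Assumes this collection was created with a vectorize embedding provider,
# so Astra DB generates embeddings server-side from the "$vectorize" field.
collection = db.get_collection("wikidata_entries")

# Ingest: the entry's text is embedded on write, with no client-side model.
collection.insert_one({
    "_id": "Q64",
    "label": "Berlin",
    "$vectorize": "Berlin: capital and largest city of Germany",
})

# Semantic search: the query string is embedded with the same model at read time.
results = collection.find(
    sort={"$vectorize": "Which city is the capital of Germany?"},
    limit=5,
)
for doc in results:
    print(doc["label"])
```

Because both the document text and the query pass through the same server-side embedding service, updates written by the community become searchable as soon as they are inserted, without a separate embedding pipeline to keep in sync.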

“Our cooperation with DataStax and their approach has unlocked new capabilities and streamlined our processes, which will allow us to deliver faster and more accurate insights to our community,” said Lydia Pintscher, Portfolio Lead for Wikidata, Wikimedia Deutschland. “DataStax offers a combination of scalability, ease of use, and advanced embedding models that supports and encourages the development of AI applications for the public good with open and high-quality data.”

“We’re thrilled to see Wikimedia Deutschland improving accessibility to the world’s largest knowledge graph with our AI platform. The open source community is crucial as it can bring more common good and many new ideas and innovations to the digital world,” said Ed Anuff, Chief Product Officer, DataStax.

Wikimedia Deutschland and DataStax plan to expand on these initial projects, exploring capabilities such as GraphRAG to further enhance search reliability and supporting up to hundreds of languages to improve accessibility. Astra DB’s serverless model, powered by AWS, ensures that Wikimedia Deutschland’s infrastructure can grow flexibly with its data demands, solidifying its position as a global leader in open source, AI-driven knowledge.

DataStax continues to offer AWS customers the latest innovations through its end-to-end AI development platform, supporting developers from idea to production. Astra Vectorize simplifies and accelerates vector embedding by handling embedding generation directly on Astra DB running on AWS, with full support for Amazon Bedrock. Amazon Bedrock is also supported in DataStax Langflow, which offers AWS developers a drag-and-drop experience for testing foundation models with real data. Support for Amazon Q is coming to Langflow, giving users a low-code way to integrate with AWS’s AI-powered assistant. DataStax also delivers cost savings to AWS users by leveraging AWS Graviton processors, lowering operating costs and helping manage total cost of ownership (TCO).
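For developers who prefer to generate embeddings with Amazon Bedrock and store precomputed vectors, the sketch below shows one possible path under stated assumptions: the Titan embeddings model ID, region, field names, and collection are illustrative choices, not details confirmed by this announcement, and the target collection is assumed to have been created with a matching vector dimension.

```python
import json

import boto3
from astrapy import DataAPIClient

# Generate an embedding with Amazon Bedrock (Titan text embeddings as an example model).
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
response = bedrock.invoke_model(
    modelId="amazon.titan-embed-text-v2:0",
    body=json.dumps({"inputText": "Berlin: capital and largest city of Germany"}),
)
embedding = json.loads(response["body"].read())["embedding"]

# Store the document with its precomputed vector in Astra DB
# (token, endpoint, and collection name are placeholders).
db = DataAPIClient("APPLICATION_TOKEN").get_database(
    "https://<db-id>-<region>.apps.astra.datastax.com"
)
collection = db.get_collection("wikidata_entries_bedrock")
collection.insert_one({"label": "Berlin", "$vector": embedding})
```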