OpenCollections Manual

Authors

Affiliation

Daniel Antal, CFA

Ádám Lázár

Andor Kornél Barát

Anna Márta Mester

Published

September 13, 2024

Doi

10.5281/zenodo.13757359

Introduction

Reprex’s new OpenCollections system wants to help small and large enterprises work with big data without huge investments into data infrastructure. OpenCollections is an element of our collaborative toolkits that enables owners of small, local databases to remain competitive in training AI in the age of big data. It helps to fill your databases with up-to-date information, find and correct errors, and connect your database entries to new information as you need them without further IT and data investments.

The OpenCollections component of our solutions aims to interconnect inventories, collections, and repertoires. We want to enable private entities, like music distributors, rights management agencies, and film producers, to synchronise their IT systems with public GLAM memory institutions: archives, libraries, museums, and statistical agencies. We want to enable the enrichment of your inventory or repertoire management from interconnected databases to improve automated sales processes and the training or sales, inventory management or other AI algorithms.

Like many applications in the European open data field, OpenCollections is built around Wikibase. This open-source software system has built one of the world’s most extensive knowledge graphs and knowledge bases, Wikidata, which synchronises the knowledge base of the 329 versions of Wikipedia with global databases, libraries, statistical authorities, company houses and other digital infrastructure.

This manual is not aimed at IT professionals or engineers. Wikibase has many thousands users with a simple and intuitive user interface. With this manual we are aiming for data stewards, data curators, librarians, archivists, inventory managers, who are responsible for documenting, updating repertoires, intellectual property assets, rights claims, webshop inventories, inventory management, and want to automate their processes, or train AI algorithms to do a better job for them.

1 Inspiration will need to be rewritten; it is currently taken from our observatory handbook, which deals with data collection programs, not broader collecting programs.

2 Tidy work is a very brief introduction to tidy data and text. It is a very brief introduction to keeping information tidy for automated computer use and easy database import.

3 Collections offers a typology of collections and the most prevalent problems with collections: ambiguous names, hard-to-translate descriptions, mismatched names and titles. Such problems appear in all large-scale applications and can negatively affect business, sales, legal or research integrity. We give some tips on how to work with our systems to prevent such problems or to resolve existing collection management problems with automated data improvement, enrichment or updating.

4 Wikidata and Other Open Knowledge Graphs introduces Wikidata and other Open Knowledge Graphs. Using Wikidata, Wikipedia’s document database, as an example, we show how to organise knowledge into a graph database and connect it with other knowledge graphs on the Internet of Data.

5 Wikibase and Enterprise Knowledge Graphs introduces the adaptability of Wikibase and enterprise knowledge graphs that are tailored to your needs, and can handle highly confidential data.

6 Reprex’s Sandbox shows how to get familiar with the system in our Sandbox.

The creation of OpenCollections accounts is explained step-by-step in 6.1 Create an Account.

Wikibase has been open source for a long time, but it is in its infancy as a supported open-source product. ReprexBase, our distribution, is enhanced with know-how, and our software libraries help you manage this knowledge system to be tailored to your needs. Wikibase has been successfully used in many EU projects, including the creation of the EU Knowledge Graph↗ (see: 5.5 The EU Knowledge Graph, (Diefenbach, Wilde, and Alipio 2021)). It also has training material on the EU Academy. While Wikibase is fully open-source and accessible, it is a complicated system that requires many extensions and adoptions to support a data-sharing space or a public-private knowledge base like ours. Reprex’s extensions aim to make data importing and enrichment easier and less costly and make data export more reusable.

Using Wikibase allows coordination with Wikidata, which evolved into a central hub on the web of data and it is one of the largest existing knowledge graphs, and perhaps the best known open one. It is synchronised with knowledge from respected public institutions like Eurostat, the German National Library or BBC, and it is one of the backbones of many web services like Google Search. Wikibase is scalable to very big graphs.