Written by Jack Hardinges and Gavin Starks
“Every government agency, everywhere, is working on a new system that’ll solve all data problems and will be ready to use in 18-24 months… Except it will always be ready in 18-24 months.” – Whong’s Law
In December, we submitted a response to the Wellcome Trust and ESRC’s National Data Library Technical White Paper Challenge.
The Challenge was designed to surface concrete implementation options for the proposed National Data Library. We responded in order to:
- Support UK Government to look beyond the abstract ideas and aspirations projected onto the National Data Library thus far.
- Propose a feasible implementation option, based on how we approach data infrastructure for net zero at Icebreaker One.
- Describe significant challenges to delivery regardless of technical role/architecture, and ways to address them.
We’re delighted that our white paper, Delivering an effective National Data Library, was selected for publication. And, as we present our paper this week, alongside four other submissions, we wanted to publish a summary along with some further thoughts.
Building from a strong foundation
In our response, we argued that there’s an urgent need for UK Government – as well as the wider pool of organisations seeking to influence its design – to:
- Clarify the technical problem the National Data Library should address. We decided to focus our work on a National Data Library aimed at improving the discovery of government-held data for research. We described the fork in the road between serving research and pursuing improved operational data access, on the basis that the needs of researchers and operational data users are unlikely to be met by a single intervention. We feel this choice has been vindicated by the release of the UK Government’s blueprint for modern digital government, which commits public sector organisations to a range of measures to improve the way data is used to deliver public services (including using standard APIs to exchange data).
- Take inspiration from existing research data infrastructures, rather than the library metaphor. We pointed to 11 types and 39 examples of data infrastructures that already serve researchers’ needs, such as by generating new datasets for research (UK Biobank), unlocking access to data held by the private sector (Smart Data Research UK) and providing access to linked or combined datasets from multiple organisations (ONS Integrated Data Service). The National Data Library won’t exist in isolation and must be built with an understanding that it’ll be one node in a network of research data infrastructures.
- Ensure the National Data Library adds something new, or improves or replaces what already exists. In addition to these existing research data infrastructures, there are numerous pan-government initiatives working to improve public sector data access. We argued for these initiatives – plus the Library’s intended users – to be engaged early, in order to understand the needs not met by existing efforts.
Our recommendation for the National Data Library
We recommended a simple, decentralised National Data Library to improve discovery of public sector data for research.
In the spirit of the Challenge, we laid out how it could maintain a searchable catalogue (or ‘index’, ‘registry’, ‘portal’; the name doesn’t really matter) of metadata harvested from across many public sector organisations. It wouldn’t copy or store any of the research data those organisations hold, but provide a service to help researchers find relevant datasets wherever they are on the web. It would look similar to Open Net Zero, which we’ve built to make net-zero data discoverable, accessible and usable. Open Net Zero currently indexes metadata on nearly 60,000 datasets from more than 400 organisations.
To do this, this National Data Library would harvest existing metadata from sources like data.gov.uk, gov.uk API Catalogue and Administrative Data Research UK Data Catalogue. It would also collaborate with public sector organisations to make available and harvest new metadata. It would point to datasets that are openly available, as well as datasets that researchers can work with under more restricted technical, legal and commercial conditions. While we described various technical challenges – including varying quality and machine-readability of metadata, and unclear data licensing terms – we think this represents a feasible version of the National Data Library.
There are also some interesting things happening with metadata and data discovery that the Library could build on. These include: new metadata formats emerging from the AI community (Croissant); ways to tag datasets with information about restrictions on their use (Data Use Ontology); and new tools that enable users to search across multiple data catalogues and within the datasets on them (Open Data Deep Search, HerdingCats).
Data curation must begin with users and use cases
At Icebreaker One, we start with use cases and deliver data infrastructure to enable them. The starting point for Perseus, for example, was to automate high-quality sustainability reporting for every SME in the UK to enable them to access over $100bn of green finance. We’re focusing on unlocking half-hourly emissions data to do this, before we add further use cases and data types.
This is how complex data infrastructures are built. As John Wilbanks of the Astera Institute recently described, “You build a complex data system by answering five questions at a time, using a standards based approach. And then when you’ve answered twenty, you’ll have a functioning complex data system”. We must design data infrastructures for specific primary use and general secondary use.
We recommend the National Data Library takes a similar approach. While original research is needed to clarify its intended users and their needs, we pointed to existing evidence on use cases that the Library could address. This includes: 52 societal challenges hindered by a lack of coordinated data (DARE UK); high level areas of research interest (HM Treasury); and high-value datasets for reuse (European Commission).
The Library can’t be a vehicle for everything
The ambiguous language used around the National Data Library has caused confusion. The UK Government’s AI Opportunities Action Plan has recently described the National Data Library as “an enormous opportunity”. It says that ‘alongside’ the National Data Library, the UK Government should:
- “Run open calls to receive proposals from researchers and industry to propose new data sets”.
- “Rapidly identify at least 5 high-impact public datasets it will seek to make available to AI researchers and innovators”.
- “Establish a copyright-cleared British media asset training data set”.
- “Finance the creation of new high-value datasets that meet public sector, academia and startup needs”.
These could well be useful interventions to make to support the UK’s AI sector. But in order to give the National Data Library a necessary focus, the UK Government should be clear about the wide set of interventions it plans to make for the data economy vs the subset that will be delivered by the Library itself.
The limits of our recommendation
While it’d help improve data discovery, we’re conscious that our recommended execution of the National Data Library wouldn’t move the dial when it comes to streamlining access to research data drawn/linked from multiple public sector organisations.
We compiled evidence that this is the significant challenge holding researchers back from working with public sector data in the UK, including from Administrative Data Research UK, the Public Administration and Constitutional Affairs Committee and the Office for Statistics Regulation.
It’s a very difficult problem to address. We pointed to Ben Goldacre’s 2022 review of the UK’s health data ecosystem, which identified a wide range of barriers to more effective linkage or combination of data for research in health alone. It described how individual data holders operate in silos, developing their own bespoke approvals processes that make secure data linkage and access nigh on impossible.
The review made 30 detailed recommendations. We suggested that a similarly broad and deep set of interventions will be required to harmonise access to data for research across the whole of the public sector. We think this work is broader and deeper than is possible for the National Data Library to deliver, and that progress on streamlining access to research data will instead be driven by other actors. At a recent event held by Health Data Research UK, Sir Robert Chote, Chair of the UK Statistics Authority, described progress being made on a number of fronts. This included: the Office for National Statistics’s Integrated Data Service; reexamining the Five Safes framework to consider accrediting Safe Programmes rather than only Safe Projects; and the work of the Pan UK Data Governance Group.
Our view and recommendation have also been shaped by budget considerations. We anticipate that investment in the National Data Library will be modest and less permanent in comparison with other public data infrastructure: the ESRC alone has spent more than £200m on data collection, creation, curation and delivery.
Moving forward
Our recommendations for delivering an effective National Data Library have been shaped by our use-case-driven approach to delivering data infrastructure for net zero. As we present our paper this week, we hope the UK Government will consider our suggestions as it develops further plans.