Icebreaker One’s Open Net Zero service is a tool for helping people access net-zero data. But, as a catalogue of data catalogues, it depends on indexing other organisations’ datasets, which can prove problematic, as we found out talking to Chris Pointon, Product Manager, Data Services at Icebreaker One. We also discussed Open Net Zero’s use-case-driven approach to tagging and its future goals to raise the bar when it comes to net-zero data.
Ross: Hi Chris, in order to lay the foundations for our chat, can you define what we mean by a data catalogue?
Chris: A data catalogue can be an individual organisation making the datasets that it wants to publish available through a portal or website. It can also be an index of other organisations’ datasets. Our Open Net Zero service, for example, is a data catalogue that indexes many other data catalogues. We don’t copy or store any of the data they hold; we just provide a service to help find their datasets, wherever they are on the web.
Ross: What is the importance of tagging in data catalogues?
Chris: Tagging is a mechanism for organising or labelling your datasets. You want people to find the datasets and understand them when they get there. Tagging should reflect your best idea of what will help people focus in on the data they’re looking for, but within this is the challenge of trying to work out what people think when they are looking for information. One person’s organisational structure can be quite different from the next.
Ross: What is Icebreaker One’s approach to tagging in Open Net Zero?
Chris: We take a use-case-driven approach to tagging. This means taking a real situation where someone needs data, working out what that person needs and, consequently, how we should structure the data so that it’s useful to them. This leads to a much more purpose-led approach to labelling data. And, over time, we hope to have enough use cases that we end up with a general set of tags, formed around use cases rather than in a vacuum. An example we’re working on at the moment is a use case focused on finding datasets that can influence impact investing in the built environment. We believe this kind of use case will also help other types of searches in the future.
Adding to this, with Open Net Zero, we’re trying to make data discoverable from a very large number of sources. We store the tags that the originators use because we want to pass that information on to other people who get data from us. But at the same time, we add our own tags in an attempt to map them into categories that are useful, specifically around the way the data is categorised and how it might be applied in getting to net zero, since that’s what we’re all about.
Eventually, we aim to build a subset of common tags from all of the data we’re seeing so we can start to encourage some standards. Icebreaker One doesn’t intend to become a standards body, but we can see patterns and we want people to join us in contributing to the conversation around the tags net-zero datasets should be labelled with.
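The two-layer tagging Chris describes, keeping the publisher's tags while adding mapped categories and then looking for tags that recur across sources, could be sketched roughly as follows. The mapping table, category names and sample records are all hypothetical and are not Open Net Zero's actual scheme.

```python
from collections import Counter

# Hypothetical mapping from publishers' own tags to index-level categories.
# These names are illustrative only, not Open Net Zero's real categories.
TAG_TO_CATEGORY = {
    "epc": "built-environment",
    "building energy": "built-environment",
    "smart meter": "energy-demand",
    "electricity consumption": "energy-demand",
}


def enrich(dataset):
    """Return a copy of the record with mapped categories added.

    The publisher's original tags are left untouched so they can be passed on.
    """
    categories = sorted(
        {TAG_TO_CATEGORY[t.lower()] for t in dataset["tags"] if t.lower() in TAG_TO_CATEGORY}
    )
    return {**dataset, "categories": categories}


def common_tags(datasets, min_count=2):
    """List publisher tags seen at least min_count times across all records."""
    counts = Counter(t.lower() for d in datasets for t in d["tags"])
    return [tag for tag, n in counts.most_common() if n >= min_count]


# Made-up sample records standing in for entries harvested from two publishers.
records = [
    {"title": "Domestic EPC ratings", "tags": ["EPC", "housing"]},
    {"title": "Half-hourly demand", "tags": ["Smart meter", "housing"]},
]
enriched = [enrich(r) for r in records]
```

Here `common_tags(records)` surfaces tags that appear in more than one record, which is one simple way a shared subset might start to emerge from what publishers already use.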
Ross: You mentioned standards. What pre-existing standards are there for datasets?
Chris: There are many standards that you can adhere to when you’re talking about properties of data. For example, there are standards for file formats. There are also ‘controlled vocabularies’. So, for a field such as ‘unit of measure’, there’s a controlled vocabulary that defines how you can state the unit of measure: kilograms, cubic inches and so on.
Overall, I think the main problem with adopting a standard is that organisations, whether in the US, EU or elsewhere, will design a vocabulary for one particular purpose, and it can become problematic when you try to generalise beyond that purpose.
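As a rough illustration of the controlled-vocabulary idea for a ‘unit of measure’ field, the sketch below accepts only terms from a fixed list and maps each to one canonical code. The vocabulary and the codes are invented for the example and do not come from any published standard.

```python
# Hypothetical controlled vocabulary: accepted terms mapped to one canonical code.
UNIT_VOCABULARY = {
    "kg": "kg",
    "kilogram": "kg",
    "kilograms": "kg",
    "kwh": "kWh",
    "kilowatt hour": "kWh",
    "cubic inch": "in3",
    "cubic inches": "in3",
}


def normalise_unit(raw):
    """Return the canonical unit code, or None if the term is outside the vocabulary."""
    return UNIT_VOCABULARY.get(raw.strip().lower())
```

A publisher's free-text value such as " Kilograms " would resolve to a single code, while a term outside the list returns `None` and can be flagged for review rather than guessed at.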
Ross: What are some of the main problems you’re finding when working with data catalogues for Open Net Zero?
Chris: Machine-readability: Some data catalogues lack a machine-readable interface, which makes our work with Open Net Zero, as a catalogue of catalogues, much harder. We can add 20,000 datasets from a machine-readable catalogue to our index in five minutes, but that isn’t always possible. In many cases, we find that the data publishers have concentrated on web portals where people manually navigate to find and download a data file, but there’s no way to navigate using software.
Custom APIs: We have a backlog of several catalogues that can only be accessed through a custom API (application programming interface). They have an API to list the datasets and retrieve information about them, which is better than nothing, but it means that each API has to be handled separately: we have to work out how they organised it, how it maps to our data structure and so on. In effect, an organisation has done all the work to create an API, and then everybody who wants to use it has to do their own work to understand it. There are established standards for publishing catalogues, such as DCAT, CSW and INSPIRE, that make including the data in other tools much easier.
Licensing: Licensing can be inconsistent, and a question data users have to ask is: Does the dataset’s licence permit me to use it for my intended purpose? Many data portals have no information about what you can do with the data they publish. Some assume it’s implicitly open because they’ve published it on the web, but without a licence you can’t know that for sure. It could be covered by copyright, which means that you’re not allowed to use it.
Recourse: It’s good practice to make sure that you as a publisher are contactable. We have two forms on Open Net Zero. One lets you ask whether a dataset exists for a particular thing, in which case we’ll try to help if we can. The other lets you suggest a dataset that we’ve missed, either one that you’ve published or one that you’ve seen and think should be in our list. It’s really important that you don’t just publish without recourse, because you won’t learn whether your data is useful or not. A lot of organisations put a lot of effort into creating, collating, describing and publishing datasets, as well as all the legal stuff, but then aren’t contactable.
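To make the machine-readability and licensing points concrete, here is a minimal sketch of what indexing from a machine-readable catalogue can look like. The sample document is hypothetical: its keys borrow DCAT property names (`dct:title`, `dcat:keyword`, `dct:license`), but the JSON shape is simplified for illustration. Real DCAT catalogues are serialised as RDF and would normally be read with an RDF library.

```python
import json

# Hypothetical catalogue document. The keys borrow DCAT property names, but the
# JSON shape is simplified for illustration and is not a real publisher's feed.
CATALOGUE_JSON = """
{
  "dcat:dataset": [
    {"dct:title": "Grid carbon intensity",
     "dcat:keyword": ["electricity", "emissions"],
     "dct:license": "https://creativecommons.org/licenses/by/4.0/"},
    {"dct:title": "Building stock survey",
     "dcat:keyword": ["buildings"]}
  ]
}
"""


def harvest(document):
    """Turn a catalogue document into flat index entries, flagging missing licences."""
    catalogue = json.loads(document)
    entries = []
    for ds in catalogue.get("dcat:dataset", []):
        licence = ds.get("dct:license")
        entries.append({
            "title": ds.get("dct:title", ""),
            "keywords": ds.get("dcat:keyword", []),
            "licence": licence,
            # Publication on the web is not treated as an implicit open licence.
            "licence_stated": licence is not None,
        })
    return entries


entries = harvest(CATALOGUE_JSON)
unlicensed = [e["title"] for e in entries if not e["licence_stated"]]
```

Because every record follows the same structure, one loop covers the whole catalogue, and datasets with no stated licence show up in `unlicensed` instead of being assumed open. A custom API would need its own version of `harvest` written for each publisher.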
Ross: What countries or organisations are publishing catalogues well/poorly?
Chris: At a national data level, we’ve found France and Ireland very easy to index. They’ve got a sensible categorisation and tagging scheme. As a net-zero index, we don’t want every bit of data those governments publish about every part of their operations. With a good categorisation system, we are able to index only the relevant datasets.
For a recent bad example, I would point to the Intergovernmental Panel on Climate Change (IPCC). They have published a widely-used official database of emission factors, yet the full dataset cannot be downloaded, there’s no catalogue of what’s in there and there’s no API. It’s just a website you have to fish through to download CSV files, or you can download a desktop app that does the same thing.
As a middle ground between these examples, you’ve got data catalogues like the RTE (France) API. It’s a good interactive data portal with 40 or 50 APIs and datasets, but again the list of what’s available isn’t machine-readable. The only way you can see the list is by going to the website and navigating through it. So we haven’t indexed their data, not because it isn’t good data but because 50 APIs is too many to index and maintain manually.
The value of good data publication
Open Net Zero and the Icebreaker One team ultimately want to raise the bar when it comes to sharing net-zero data. Part of this is highlighting what good data publication looks like. If organisations are providing good licensing, clear access conditions and well-structured descriptive metadata, this should be praised. It also means having conversations with those looking to improve, like our work to make net-zero data more discoverable for impact investors. An Icebreaker One membership allows organisations to be part of this conversation and part of a collaborative process of shaping net-zero data.