Lesson 4.3 Introduction to Data Interoperability
Learning outcomes
At the end of this lesson, learners will be able to:
Understand the basics of data interoperability
Identify the different types and layers of interoperability
Understand the basics of semantic interoperability
Choose the vocabularies that best fit their needs
1. Introduction
Data interoperability is the ability of a dataset to be reused by any system without special effort.
There are different layers to data interoperability. Data can be technically interoperable thanks to a machine-readable format (e.g. CSV) and an easy-to-parse structure (e.g. JSON). But in order for an external system to perform more operations on a dataset, the “meaning” of the data in the structure has to be made explicit. This is achieved through semantic interoperability: using metadata and values that have previously been assigned an unambiguous meaning and an identifier that can be used across systems.
This lesson introduces the different layers of data interoperability: “foundational” or infrastructural (exposure protocols), structural (file formats and content structure) and semantic (use of metadata and values from agreed vocabularies). It only provides an overview and some examples: mastering the technologies for data interoperability would require a dedicated course and is something that is expected only of developers and data engineers.
However, a good understanding of the possible technological choices is useful both for data reusers, who should be able to assess the level of interoperability of the data they want to reuse, and for agricultural data/service providers who want to make sure their data is interoperable with other systems.
A longer section is devoted to semantic interoperability, as it is the most sector-specific aspect of data interoperability: while protocols and formats/structures are sector-agnostic, semantics are created and agreed upon in specific domains and in the case of agricultural data it is very important to know which semantic structures have been published in the domain and for which sub-domains (agronomy, food and nutrition, natural resources…) or types of data (soil data, weather data, crop growth data…).
The next lesson will focus on examples of interoperable data using vocabularies created for farm data.
2. Data interoperability
The most widely used definition of ‘interoperability’ on the web is: ‘the ability of a system or a product to work with other systems or products without special effort on the part of the user’. Wikipedia defines it as ‘a characteristic of a product or system, whose interfaces are completely understood, to work with other products or systems, at present or future, in either implementation or access, without any restrictions’.
When it comes to data interoperability, considering that data are serialised in datasets, the definitions above can be applied easily to a dataset as a product.
Some definitions describe interoperability as holding ‘between two systems’; however, it is a common view that something is really interoperable (or more interoperable) when as many systems as possible can interoperate with it. Moreover, by using certain data formats and applying certain data standards, data can be made ‘interoperable by design’ without necessarily knowing with which systems they will interoperate: planned interoperability with specific systems means that the data will be ‘tightly coupled’ with those systems, while maximised interoperability aims at loose coupling with as many systems as possible.
2.1. Levels of interoperability
‘Foundational’ interoperability allows data exchanged by one information technology system to be received by another; it does not require the receiving system to be able to interpret the data.
This level is about infrastructural interoperability and mostly about transmission protocols, which will not be addressed here as they are not of interest to the data manager; but it also covers some higher-level exchange protocols, mostly working on top of the common HTTP protocol (for instance special REST APIs like OAI-PMH, SPARQL or the Linked Data API, or content negotiation based on HTTP Accept headers). These may be of interest to data managers because, before being read and understood, data has to be transmitted, and there are different and more convenient ways to do this besides FTP downloads.
‘Structural’ interoperability is an intermediate level that defines the structure or format of data exchange (i.e. the message format standards): it defines the syntax of the data exchange and ensures that data exchanged between information technology systems can be interpreted at the data field level.
This is the level where file formats and data formats play the most important role, and it is the level where (meta)data become ‘machine-readable’ and can be parsed by machines: the easier it is for a machine to parse the (meta)data format/syntax (XML, JSON, CSV…), the more structurally interoperable the data are.
‘Semantic’ interoperability provides interoperability at the highest level, which is the ability of two or more systems or elements to exchange information and to use the information that has been exchanged. Semantic interoperability takes advantage of both the structuring of the data exchange and the codification of the data including vocabulary so that the receiving information technology systems can interpret the data.
While with structural interoperability machines understand what the different elements are (and their reciprocal structural relations), with semantic interoperability they also understand the meaning of those elements and can process them with semantics-aware tools and do some advanced reasoning. Data formats alone are not enough for this: semantic technologies make it possible to embed machine-readable semantic elements from agreed vocabularies in (meta)data serialised in most existing data formats, although the most suitable formats for this so far are the various flavours of RDF (RDF/XML, Turtle, N-Triples) and JSON-LD.
2.2. Interoperability of data and metadata
Data normally come with metadata: without metadata, data cannot be understood by machines. (Meta)data can be seen as key-value pairs, where the key is a metadata element (e.g. ‘air temperature’) and the value is the actual data (e.g. ‘23.5’).
Both parts of the key-value pair can present interoperability issues, so one can deal with interoperability of the metadata (e.g. agreed schemas for metadata elements, agreed variable names) and interoperability of the data/value (e.g. agreed controlled lists/ranges from which the value has to be taken, or syntax issues). However, since it is also common to consider controlled values and specifications of syntax or unit of measure as ‘metadata’ (especially because they should ideally be defined by separate metadata elements and not within the value itself), one can also say that interoperability is mostly about metadata. The interoperability of data is achieved through the interoperability of metadata: for instance, one wants the metadata element ‘air temperature’ to be interoperable in order to then get the actual value (the data) that one needs – the number that expresses the value will never be findable or understandable per se without the associated metadata.
This lesson will follow the same convention adopted in the FAIR guiding principles mentioned in earlier lessons, using the term (meta)data when something applies indifferently to data and metadata.
3. Infrastructural interoperability: exchange protocols
The most popular protocols for exposing data as a service are the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and SPARQL. Both are technically RESTful[1] APIs: like all APIs, they expose ‘methods’ which accept a number of parameters and can be called via an HTTP request, returning a machine-readable response.
Besides OAI-PMH and SPARQL, it is very common to have data exposed through custom APIs, with standardized documentation about parameters, data types and return values.
OAI-PMH was born in the library environment and was conceived mainly for metadata, but it can be used to expose any type of ‘record’.
The methods of the OAI-PMH API are designed to enable a series of calls that allow the caller (an application) to preliminarily check some metadata about the repository (the name, what metadata schemas are supported, which subsets can be retrieved…) and then retrieve records filtered according to metadata format, date ranges, subsets. Records can be retrieved in smaller batches using a ‘resumption token’.
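For instance, a typical harvesting session is a short series of HTTP GET requests like the following (the endpoint URL and token value are invented for illustration):

```
GET http://example.org/oai?verb=Identify               (name and description of the repository)
GET http://example.org/oai?verb=ListMetadataFormats    (which metadata schemas are supported)
GET http://example.org/oai?verb=ListSets               (which subsets can be retrieved)
GET http://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc&from=2021-01-01
GET http://example.org/oai?verb=ListRecords&resumptionToken=batch2   (next batch of records)
```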
Compared to OAI-PMH, SPARQL allows for much more complex queries built using the powerful RDF triple grammar and can expose triples with any type of subject (a dataset, an organisation, a place, a topic…) all from the same interface. This makes it possible to filter data very granularly and look up other properties (e.g. labels) of resources referenced by URIs. SPARQL clients can also send the same query to several SPARQL engines and combine the responses: the use of URIs allows the client to merge duplicates and consolidate properties for the same entity coming from different systems.
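For example, the following sketch of a SPARQL query retrieves all datasets annotated with a given concept, together with their titles, assuming the metadata are modelled with DCAT and Dublin Core (the AGROVOC-style concept URI is illustrative):

```sparql
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct:  <http://purl.org/dc/terms/>

# Find all datasets about a given concept, with their titles
SELECT ?dataset ?title
WHERE {
  ?dataset a dcat:Dataset ;
           dct:subject <http://aims.fao.org/aos/agrovoc/c_6599> ;
           dct:title ?title .
}
```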
Other protocols have been developed in scientific communities, building on common scientific data exchange practices, focusing on efficiency of data transfer (catering for large amounts of data) and leveraging traditional exchange formats like NetCDF and HDF5 (see next section). An example is OPeNDAP.
The choice of which protocols to implement to share one’s data should be influenced primarily by the community with which the data are expected to be shared, and then by technical considerations such as ease of implementation, support for specific formats and general support for the data sharing principles.
For example, OPeNDAP is widely used by governmental agencies such as NASA and NOAA to serve satellite, weather and other observed earth science data, so it could be a good choice for institutions sharing these or similar types of data. On the other hand, if wider sharing across communities is desired, a custom API or well-known APIs like OAI-PMH or SPARQL may be preferable.
4. Structural interoperability: formats and structures
This is the level where (meta)data become machine-readable. However, ‘machine-readable’ in its most literal sense is not enough for structural interoperability: in order for machines to understand the structure of the (meta)data, i.e. which parts are the labels/metadata elements/column names/variables and which are the values, and whether there is a hierarchy, the (meta)data have to be structured enough for a machine to extract individual values.
In other words, (meta)data have to be not just readable but ‘parsable’ by machines. Since ‘parsing’ means ‘splitting a file or other input into pieces of data that can be easily stored or manipulated’, the more regular and rigorous a format is, the easier it is to parse. Ideally, parsable formats have a published specification, so that developers know how the structure is built and can write parsers accordingly. So, this level of interoperability is achieved through the choice of an easy-to-parse data format.
There are many file formats that are parsable by machines. What makes a file format more easily parsable and therefore more interoperable is:
The simplicity of the structure: the fewer the elements of the structure and the fewer the possible constructs, the easier it is for a machine to parse the file without overly complex reasoning. On the other hand, the format should also cater for complex data structures: the most suitable file format is the one that combines simplicity with enough flexibility to represent complex data structures.
The rigorousness and ‘regularity’ of the structure: the fewer the options to serialise the same thing in a different way, the easier it is for a machine to parse the file and identify the role of each element.
The existence of a clear and open specification for the format, which makes it easy for any developer to write parsers. (Many formats are tied to a software product and can be correctly and fully parsed only by that product. This makes their interoperability level very low.)
An additional help is the existence of software libraries and APIs that can parse the format.
The formats that are considered the most interoperable against the criteria above are CSV, XML and JSON.
CSV (comma-separated values) files have a simple tabular format, with a header record containing the column names, commas marking field boundaries and line feeds marking record boundaries.
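For instance, a minimal CSV file of weather observations (column names and values invented for illustration):

```csv
date,station_id,air_temperature_c
2021-06-01,WS-042,23.5
2021-06-02,WS-042,24.1
```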
The most interesting aspect of XML in terms of interoperability is the support of the definition of ‘schemas’ to which a specific XML document can declare to conform. Schemas are machine-readable documents that specify how to name and organise the elements in an XML document and can define hierarchies, constraints etc.
Dataset metadata in XML can reach a high level of complexity.
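As a minimal sketch, this is how an XML document can declare which schema it conforms to, using the standard xsi:schemaLocation mechanism (the namespace and schema location here are hypothetical):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- xsi:schemaLocation pairs a namespace with the location of the
     XML Schema (.xsd) this document claims to conform to -->
<observation xmlns="http://example.org/obs"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://example.org/obs http://example.org/obs/observation.xsd">
  <station>WS-042</station>
  <airTemperature unit="Cel">23.5</airTemperature>
</observation>
```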
Of the three, XML is traditionally considered the best combination of (a) simplicity; (b) flexibility for complex data structures; and (c) support for schema definitions that set constraints and help validation and parsing. However, considering that JSON also supports schemas and that both XML and JSON can represent RDF (see below), JSON can be considered as suitable as XML.
Other highly interoperable formats are the main native RDF serialisations, Turtle and N-Triples.
The RDF grammar is based on statements made of subject – predicate – object; each statement is a ‘triple’, and the assumption is that combinations of such triples can represent and describe everything. Ideally, all three components of the statement should be represented by Uniform Resource Identifiers (URIs) that always refer unambiguously to the same entity.
The triples syntax is very simple.
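For instance, a single statement in N-Triples (the dataset URI is hypothetical; the predicate is the Dublin Core ‘title’ property):

```turtle
# subject                         predicate                         object
<http://example.org/dataset/42> <http://purl.org/dc/terms/title> "Daily weather observations" .
```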
Using URIs also for the predicates makes it possible to look up the meaning of the property in a published “vocabulary” (see next section on semantic interoperability).
5. Semantic interoperability: vocabularies
All the data formats seen in the previous section define only data structures (how to encode fields/variables and values), and the only thing a machine can do is parse the structure and extract variables and values, without knowing how to treat each of them. Variables and values have a meaning which in many cases can only be understood by humans (and in some cases only by humans who speak the same language and know the conventions of the same discipline).
Human beings can interpret data through human-readable semantics that have always been used in (meta)data in different ways: a string to identify the topic of something or the colour of a thing (e.g. in germplasm phenotypical descriptions), codes taken from a code list of authoritative values (e.g. type of soil) or conventional variable names. But as already mentioned, interoperability is all about being understood by computer software: strings can be different in each dataset and in different languages, and even codes without a reference system behind them do not mean anything to computers and do not allow them to make decisions on how to treat the values.
If on the other hand the metadata contained information on the reference system (a ‘semantic structure’ like a thesaurus or a code list) from which each variable and each value came, and that semantic structure were machine-readable and provided some stable identifiers that computer programs could use as stable values to design their behaviour (e.g. using the values as common search values across different datasets), one would have achieved semantic interoperability.
So, on the one hand the metadata have to embed information on the reference semantic structures and point to the exact elements they are using from that structure; on the other hand these semantic structures, like the data, have to be ‘serialised’ in such a way that machines can read and process them, and use them to interpret the data.
Details on how to publish a semantic structure, or a ‘vocabulary’, in machine-readable format are beyond the scope of this lesson. In short, for our purposes in this lesson, let us say that such vocabularies are published as datasets, whose records are terms/concepts and their related descriptions, codes and ideally URIs, in a machine-readable format – for the moment let us assume XML or RDF/XML.
5.1. Semantic structures or ‘vocabularies’
Vocabularies are agreed sets of terms, possibly with defined relationships between them. This includes both terms used for description metadata, like metadata element names, properties, predicates (so terms in description vocabularies: metadata schemas, ontologies…) and terms used to categorise, annotate, classify (so terms in value vocabularies: thesauri, code lists, classifications, authority lists…, also sometimes called ‘Knowledge Organisation Systems’ (KOS)).
There is no formal classification of types of vocabularies (which in itself could be a useful example of a value vocabulary), but the types of KOS most commonly recognised are the following:
categorisation scheme: loosely formed grouping scheme
classification scheme: schedule of concepts and pre-coordinated combinations of concepts, arranged by classification
dictionary: reference source containing words usually alphabetically arranged along with information about their forms, pronunciations, functions, etymologies, meanings, and syntactical and idiomatic uses
gazetteer: geospatial dictionary of named and typed places
glossary: collection of textual glosses or of specialised terms with their meanings
list: a limited set of terms arranged as a simple alphabetical list or in some other logically evident way; containing no relationships of any kind
name authority list or authority file: controlled vocabulary for use in naming particular entities consistently
ontology: formal model that allows knowledge to be represented for a specific domain; an ontology describes the types of things that exist (classes), the relationships between them (properties) and the logical ways those classes and properties can be used together (axioms) [see below a note on how an ontology can be seen as a KOS but also as a description vocabulary, an extended schema]
semantic network: set of terms representing concepts, modelled as the nodes in a network of variable relationship types
subject heading scheme: structured vocabulary comprising terms available for subject indexing, plus rules for combining them into pre-coordinated strings of terms where necessary
synonym ring: set of synonymous or almost synonymous terms, any of which can be used to refer to a particular concept
taxonomy: scheme of categories and subcategories that can be used to sort and otherwise organise items of knowledge or information
terminology: set of designations belonging to one special language
thesaurus: controlled and structured vocabulary in which concepts are represented by terms, organised so that relationships between concepts are made explicit, and preferred terms are accompanied by lead-in entries for synonyms or quasi-synonyms.
As for description/modelling vocabularies, there is no authoritative list of types either, but the most commonly used are:
schema (or metadata element set): any set of metadata elements, like XML schemas, RDF schemas or less formalised set of descriptors
application profile: a schema which consists of metadata elements drawn from one or more namespaces, combined together by implementors and optimised for a particular local application
messaging standard: a standard which describes how to format, syntactically (and sometimes semantically), a message usually describing some event- or time-related information; messages are triggered by an event and transmitted in some way
ontology, which can define complex schemas with constraints and rules for reasoning.
Examples of metadata vocabularies are Dublin Core (the DCMI Metadata Terms), the W3C Data Catalog Vocabulary (DCAT) and the DataCite Metadata Schema.
Examples of value vocabularies, or KOS, are AGROVOC (the FAO multilingual agricultural thesaurus), the CAB Thesaurus and the NAL Thesaurus.
Ontologies are a combination of metadata vocabulary and value vocabulary: they define the complete set of metadata elements and controlled values/entities that should be used to describe and classify specific “things”. Examples of ontologies are the Crop Ontology, the Agronomy Ontology (AgrO) and the FAO Geopolitical Ontology.
Besides “vocabularies”, other terms used to refer to these resources are ‘semantic resources’ or ‘semantic structures’.
5.2. How to identify the most suitable published vocabularies
To discover what published vocabularies are available, the most useful sources of information are of course catalogues dedicated to the agri-food domain, but general catalogues searchable by domain can also help. An important distinction is to be drawn between catalogues/directories/registries on the one hand and repositories on the other: registries are conceived as metadata catalogues, providing descriptions and categorisation of vocabularies and linking to the original website and the original serialisation of the standard, while repositories host the full content of the vocabulary, so that the terms themselves can be browsed. Below is an overview of some existing catalogues/repositories of data standards and vocabularies.
Agri-food domain: for example, the VEST/AgroPortal map of data standards (a registry describing and categorising vocabularies and standards for agri-food data) and AgroPortal (a repository of agri-food ontologies and thesauri, where the terms themselves can be browsed).
General: for example, BARTOC (the Basel Register of Thesauri, Ontologies and Classifications), Linked Open Vocabularies (LOV) and FAIRsharing (a registry of standards, databases and policies for research data).
Besides choosing a vocabulary suitable for the data you manage, some other criteria are useful for selecting the most appropriate vocabulary:
It is preferable to choose vocabularies that are widely used, so that your data is interoperable with more datasets and more systems.
If existing vocabularies do not meet your needs, you may decide to create your own vocabulary and publish it (ideally, in collaboration with other partners in your community so as to ensure wide adoption).
5.3. Embedding semantics in the (meta)data
Semantic interoperability can be achieved at different levels and the implementation is different depending on the data format you’re using and the use you want to make of the vocabulary.
5.3.1. Using a schema for your metadata
If you identify a metadata vocabulary/schema that has the classes and properties that you need to describe your data, you can reuse it to model and represent your data.
An important thing to note about using an existing published schema is that this alone already makes your data more semantically interoperable: instead of local metadata element names, which are meaningless to a computer, you will use element names from a published vocabulary, and software tools that are aware of that vocabulary will be able to do something with your data, e.g. match the values against values from other datasets that use the same schema.
The adopted schema becomes the “language” of your data structure: for instance, instead of using a custom XML structure with local element names, by adopting an existing XML schema you declare that you are using elements from it, and for each element you use the element name of the selected schema instead of a local one, with a prefix that indicates which schema the element comes from.
The agreed way to show in your metadata that an element is coming from a published vocabulary is to add a prefix that identifies the vocabulary. In XML and in RDF-compliant formats like JSON-LD, prefixes can be associated with the URI of the vocabulary, so that the reference to the vocabulary is explicit. In other formats, like CSV, there may be no other way than just using conventional prefixes before column names (or the full URI of the vocabulary before the property name, but this is not very practical for column names).
In the example below, an XML file defines the prefixes for all the vocabularies (schemas) it will use; the om: prefix refers to the ISO data standard for Observations and Measurements (O&M).
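A minimal sketch of such a declaration, assuming a weather observation (the om: and gml: namespace URIs are the real O&M 2.0 and GML 3.2 namespaces; the rest is illustrative):

```xml
<om:OM_Observation
    xmlns:om="http://www.opengis.net/om/2.0"
    xmlns:gml="http://www.opengis.net/gml/3.2"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    gml:id="obs1">
  <!-- ... the observation itself ... -->
</om:OM_Observation>
```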
After the definition of the prefixes, every time the om: prefix is used, machines know that the following property name comes from the O&M vocabulary.
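For instance (a sketch with property names from the O&M 2.0 schema and invented values):

```xml
<om:phenomenonTime>
  <gml:TimeInstant gml:id="t1">
    <gml:timePosition>2021-06-01T09:00:00Z</gml:timePosition>
  </gml:TimeInstant>
</om:phenomenonTime>
<om:result xsi:type="gml:MeasureType" uom="Cel">23.5</om:result>
```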
5.3.2. Using value vocabularies for your data
A slightly different case is when you want to use values from an existing vocabulary as values (not metadata/properties) in your dataset, for instance if you want to use the AGROVOC term for Oryza sativa or the term identifying a country from the FAO Geopolitical Ontology, in order to use unambiguous values across systems.
As a minimum, once a suitable vocabulary has been identified, if the vocabulary doesn’t use URIs and/or the URIs cannot be used in the dataset, at least the literal values of the terms can be used in the dataset: systems that are aware of the vocabulary and can match the literal against the URI can already do something with this.
Preferably, you should use the URI of the term you want to refer to.
Again, you can do this in different ways depending on the data format you are using. In non-RDF XML, in CSV or in JSON, you can use the URI as the value of the element/column/label (e.g. in XML you can use the URI of the Geopolitical Ontology country as the value of the dc:spatial element, perhaps specifying the scheme=”URI” attribute to make it clear it’s a URI).
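A sketch of the XML variant (the country URI follows the pattern of the FAO Geopolitical Ontology):

```xml
<dc:spatial scheme="URI">http://aims.fao.org/aos/geopolitical.owl#Kenya</dc:spatial>
```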
Ideally, semantic interoperability is fully achieved using an RDF-enabled serialisation format (RDF/XML, Turtle, N3, JSON-LD). The advantage of using RDF is that RDF parsers and crawlers will normally look up additional properties at the URI address, so you may not need to include redundant values in your data.
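A minimal sketch in JSON-LD, describing a hypothetical dataset whose spatial coverage points to a country URI:

```json
{
  "@context": {
    "dct": "http://purl.org/dc/terms/",
    "dcat": "http://www.w3.org/ns/dcat#"
  },
  "@id": "http://example.org/dataset/42",
  "@type": "dcat:Dataset",
  "dct:title": "Daily weather observations",
  "dct:spatial": { "@id": "http://aims.fao.org/aos/geopolitical.owl#Kenya" }
}
```

An RDF-aware client reading this can dereference the URI under dct:spatial and retrieve further properties of the country, such as its names in several languages.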
Even if you don’t find the ideal vocabulary that meets your needs and resort to using your own terms, you can still link your local terms to similar or broader terms in existing vocabularies, for instance using the SKOS mapping properties, as sketched below.
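Assuming a hypothetical local term and an illustrative AGROVOC-style concept URI:

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

# A local concept declared as a close match of a published concept
<http://example.org/vocab/paddy-rice>
    skos:closeMatch <http://aims.fao.org/aos/agrovoc/c_6599> .
```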
Using again the same XML example as above, the name of the dimension (Temperature) is taken from a NASA controlled vocabulary of property names, using the full URI, while the fact that the value is a measure is indicated using the MeasureType value from the Open Geospatial Consortium GML controlled vocabulary.
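One way this could look, reusing the O&M fragment above (the URI follows the pattern of NASA’s SWEET ontology of Earth-science terminology; the exact concept URI is illustrative):

```xml
<om:observedProperty
    xlink:href="https://sweet.jpl.nasa.gov/2.3/propTemperature.owl#Temperature"/>
<om:result xsi:type="gml:MeasureType" uom="Cel">23.5</om:result>
```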
The next lesson will show more examples, focusing on interoperability of farm data.
Summary
This lesson illustrated the three layers of interoperability: infrastructural, structural and semantic.
For infrastructural interoperability, a quick overview of common protocols for data sharing like OAI-PMH, SPARQL and OPeNDAP was provided.
For structural interoperability, the features of the main file formats were highlighted especially in terms of ease of parsing by machines and suitability for semantic technologies.
Finally, for semantic interoperability, the lesson described in more depth the various types of vocabularies that can be used to semantically enrich data: metadata/description vocabularies, like schemas and metadata element sets, and value vocabularies, like thesauri or code lists.
The next lesson will provide some examples of data interoperability in practice, focusing on farm data.
Footnotes
[1] REST (Representational State Transfer) is an architectural style for web services based on stateless requests to URLs using the standard HTTP methods; APIs that follow this style are called ‘RESTful’.