library(dataset)
library(rdflib)
#> Warning: package 'rdflib' was built under R version 4.4.1

RDF (Resource Description Framework) annotation improves the interoperability and exchangeability of datasets in data repositories by describing and linking data in a standardised, machine-readable format. This vignette shows how to combine the dataset package with rdflib, an R-user-friendly rOpenSci wrapper around the redland package, which binds to the Redland C library. rdflib covers common tasks on RDF data: parsing and converting between formats including rdfxml, turtle, nquads, ntriples, and trig, creating RDF graphs, and performing SPARQL queries.

Standardised Semantic Framework

RDF provides a common framework to describe resources and their relationships using triples (subject-predicate-object). This standardisation ensures that data from different systems can be understood in a unified way, regardless of the original source or format. Notice that this format is a stricter version of the tidy dataset concept: not only does every row hold one observed value, but there are always exactly three columns.

head(iris)[1:2, ]
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa

Instead of placing the relevant measurement of an observed flower into the intersection of columns and rows, in the triple format we put them next to each other:

  • the first flower’s sepal length is 5.1
  • the second flower’s sepal length is 4.9
dataset_to_triples(iris[1:2, ])[1:10, ]
#>    s            p      o
#> 1  1 Sepal.Length    5.1
#> 2  2 Sepal.Length    4.9
#> 3  1  Sepal.Width    3.5
#> 4  2  Sepal.Width      3
#> 5  1 Petal.Length    1.4
#> 6  2 Petal.Length    1.4
#> 7  1  Petal.Width    0.2
#> 8  2  Petal.Width    0.2
#> 9  1      Species setosa
#> 10 2      Species setosa

We describe dataset_df datasets with such triples. Each triple is a semantic statement: it connects a single observation unit with a single measurement.
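
The reshaping itself needs no special tooling. As a minimal base R sketch (the helper name to_triples is hypothetical and is not the package's dataset_to_triples implementation), each column is stacked below the previous one, with the row number as the subject and the column name as the predicate:

```r
# Hypothetical helper: reshape a data frame into subject-predicate-object rows.
to_triples <- function(df) {
  data.frame(
    # the row number identifies the observation unit
    s = rep(seq_len(nrow(df)), times = ncol(df)),
    # the column name becomes the predicate
    p = rep(names(df), each = nrow(df)),
    # every observed value is stored as text, column by column
    o = unlist(lapply(df, as.character), use.names = FALSE),
    stringsAsFactors = FALSE
  )
}

triples <- to_triples(iris[1:2, ])
triples
```

Two flowers with five variables each yield ten rows, one per statement.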

Enhanced Interoperability

RDF uses globally unique identifiers (URIs) for resources, ensuring that different datasets can reference the same entities unambiguously. This allows seamless data integration and querying across repositories, even if the datasets come from diverse domains.

Our defined class supports this enhanced interoperability. In the example below, an application can look up that the numeric values in your table conform to the statistical definition of GDP and are expressed in millions of dollars. This means that you have to multiply them by 1000 before joining them with data expressed in thousands of dollars.

gdp_vector <- defined(
  c(3897, 7365, 6753),
  label = "Gross Domestic Product",
  unit = "https://rdf.vegdata.no/V440/v440-doc/v440-brudata-owl-doc/unit_MillionUSD.html",
  definition = "http://data.europa.eu/83i/aa/GDP"
)
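
The rescaling mentioned above can be sketched in base R. The variable names are illustrative, and the sketch works on a plain numeric vector rather than on the defined class:

```r
# Values from the example above, expressed in millions of US dollars.
gdp_million_usd <- c(3897, 7365, 6753)

# One million equals 1000 thousands, so the conversion multiplies by 1000.
gdp_thousand_usd <- gdp_million_usd * 1000
gdp_thousand_usd
```

The unit annotation is what tells an application that this step is needed before a join.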

There are several ways to add permanent identifiers to observational units, variable definitions, and specific observed values. The simplest standard format for writing them into a plain text file that you can share online (though certainly not the easiest for a human to read) is the RDF 1.1 N-Triples format. The N-Triples format uses URIs (formatted similarly to URLs) for definitions that can be looked up in an online resource. These can be combined with literal strings, which may carry a datatype that tells a system whether to read them back as strings, doubles, integers, dates or date-time values.

n_triple(
  s = "https://doi.org/10.5281/zenodo.10396807", # permanent, global ID of the dataset
  p = "http://purl.org/dc/terms/description", # library definition of 'description'
  o = "The famous (Fisher's or Anderson's) iris data set."
) # literal string
#> [1] "<https://doi.org/10.5281/zenodo.10396807> <http://purl.org/dc/terms/description> \"The famous (Fisher's or Anderson's) iris data set.\"^^<http://www.w3.org/2001/XMLSchema#string> ."
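
To make the structure of such a line explicit, here is a minimal base R sketch of how one statement could be assembled. The helper make_ntriple is hypothetical and is not the package's n_triple implementation; it assumes scalar inputs, treats any object starting with http:// or https:// as a URI, and types every other object as an XML Schema string.

```r
# Hypothetical helper: build one N-Triples line from scalar character inputs.
make_ntriple <- function(s, p, o,
                         datatype = "http://www.w3.org/2001/XMLSchema#string") {
  if (grepl("^https?://", o)) {
    # URIs are wrapped in angle brackets
    object <- sprintf("<%s>", o)
  } else {
    # literals: escape backslashes first, then double quotes
    o_escaped <- gsub("\\", "\\\\", o, fixed = TRUE)
    o_escaped <- gsub("\"", "\\\"", o_escaped, fixed = TRUE)
    object <- sprintf("\"%s\"^^<%s>", o_escaped, datatype)
  }
  # a statement is subject, predicate, object, terminated by " ."
  sprintf("<%s> <%s> %s .", s, p, object)
}

line <- make_ntriple(
  s = "https://doi.org/10.5281/zenodo.10396807",
  p = "http://purl.org/dc/terms/description",
  o = "The famous (Fisher's or Anderson's) iris data set."
)
line
```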

Richer metadata

RDF supports linking datasets through shared URIs, enabling the creation of interconnected knowledge graphs. Linked Data principles help relate datasets in meaningful ways, making it easier to discover, navigate, and integrate information. RDF annotations allow datasets to include detailed metadata about their structure, provenance, usage rights, and content. This metadata provides critical context, enabling automated tools to interpret and process the data effectively.

Most scientific researchers are familiar with the FAIR principles: data findability, accessibility, interoperability, and reuse. Your dataset does better on all four if you add standard metadata as used by libraries globally (the Dublin Core standards) or by data repositories (the DataCite standards). Such standards provide globally shared definitions of how to add a title or a subtitle to your dataset, or how to add keywords as IRIs that any user in the world interprets the same way, even if they do not speak English or your language.

RDF supports the use of ontologies and controlled vocabularies (e.g., DataCite, Dublin Core, Schema.org), allowing datasets to be described consistently within and across domains.

The as_dublincore function exports your dataset’s metadata in the Dublin Core format, and as_datacite in the DataCite format. Some of the metadata are generated behind the scenes, for example, timestamps or size measurements.

as_dublincore(iris_dataset, type = "ntriples")
#>  [1] "<https://doi.org/10.5281/zenodo.10396807> <http://purl.org/dc/terms/title> \"Iris Dataset\"^^<http://www.w3.org/2001/XMLSchema#string> ."                                            
#>  [2] "<https://doi.org/10.5281/zenodo.10396807> <http://purl.org/dc/terms/description> \"The famous (Fisher's or Anderson's) iris data set.\"^^<http://www.w3.org/2001/XMLSchema#string> ."
#>  [3] "<https://doi.org/10.5281/zenodo.10396807> <http://purl.org/dc/terms/creator> <https://viaf.org/viaf/http://viaf.org/viaf/6440526> ."                                                 
#>  [4] "<https://doi.org/10.5281/zenodo.10396807> <http://purl.org/dc/terms/publisher> \"American Iris Society\"^^<http://www.w3.org/2001/XMLSchema#string> ."                               
#>  [5] "<https://doi.org/10.5281/zenodo.10396807> <http://purl.org/dc/terms/identifier> <https://doi.org/10.5281/zenodo.10396807> ."                                                         
#>  [6] "<https://doi.org/10.5281/zenodo.10396807> <http://purl.org/dc/terms/subject> \"\"^^<http://www.w3.org/2001/XMLSchema#string> ."                                                      
#>  [7] "<https://doi.org/10.5281/zenodo.10396807> <http://purl.org/dc/terms/type> <http://purl.org/dc/terms/DCMITypeDataset> ."                                                              
#>  [8] "<https://doi.org/10.5281/zenodo.10396807> <http://purl.org/dc/terms/contributor> \"Antal Daniel [dtm]\"^^<http://www.w3.org/2001/XMLSchema#string> ."                                
#>  [9] "<https://doi.org/10.5281/zenodo.10396807> <http://purl.org/dc/terms/language> \"en\"^^<http://www.w3.org/2001/XMLSchema#string> ."                                                   
#> [10] "<https://doi.org/10.5281/zenodo.10396807> <http://purl.org/dc/terms/source> <https://doi.org/10.1111/j.1469-1809.1936.tb02137.x> ."                                                  
#> [11] "<https://doi.org/10.5281/zenodo.10396807> <http://purl.org/dc/terms/coverage> \":unas\"^^<http://www.w3.org/2001/XMLSchema#string> ."
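
Because each N-Triples statement sits on its own line, exported metadata can be inspected with plain string tools. The following base R sketch uses two lines copied from the output above, extracts the Dublin Core property names, and flags properties whose literal value is empty. The variable names are illustrative.

```r
# Two statements copied from the Dublin Core export above.
dc_lines <- c(
  "<https://doi.org/10.5281/zenodo.10396807> <http://purl.org/dc/terms/title> \"Iris Dataset\"^^<http://www.w3.org/2001/XMLSchema#string> .",
  "<https://doi.org/10.5281/zenodo.10396807> <http://purl.org/dc/terms/subject> \"\"^^<http://www.w3.org/2001/XMLSchema#string> ."
)

# The predicate is the second term in angle brackets.
predicate <- sub("^<[^>]+> <([^>]+)> .*$", "\\1", dc_lines)

# The last path segment of the predicate URI is the property name.
property <- basename(predicate)

# An empty literal is written as two adjacent double quotes before the datatype.
has_empty_value <- grepl("\"\"^^", dc_lines, fixed = TRUE)

property[has_empty_value]
```

A check of this kind shows that the subject property in the export above still needs a value.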

Interoperability and reusability increase further if the next user can trust your dataset and has to perform fewer checks on it, or can reproduce what you did. Data provenance is the metadata that records the origins, history, and transformations of data throughout its lifecycle. Our provenance function records some of this information automatically and allows you to add more, for example about your data sources, the R packages used, the persons involved in the creation and review process, or the statistical transformations carried out.

provenance(iris_dataset)
#> [1] "<http://example.com/dataset_prov.nt> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Bundle> ."                  
#> [2] "<http://example.com/dataset#> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Entity> ."                         
#> [3] "<http://example.com/dataset#> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/linked-data/cube#DataSet> ."                 
#> [4] "<http://viaf.org/viaf/6440526> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Agent> ."                         
#> [5] "<https://doi.org/10.32614/CRAN.package.dataset> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#SoftwareAgent> ."
#> [6] "<http://example.com/creation> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Activity> ."                       
#> [7] "<http://example.com/creation> <http://www.w3.org/ns/prov#generatedAtTime> \"2024-12-25T12:26:25Z\"^^<xs:dateTime> ."
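
The timestamp in the last statement follows the ISO 8601 form used by xsd:dateTime. A minimal base R sketch of producing such a statement is shown below. It is not the package's internal code, and it spells out the full XML Schema datatype IRI as an assumption, whereas the output above prints the datatype as xs:dateTime.

```r
# A fixed time keeps the example reproducible; Sys.time() would be the usual choice.
created <- as.POSIXct("2024-12-25 12:26:25", tz = "UTC")

# Format as an ISO 8601 UTC timestamp; the trailing Z marks UTC.
stamp <- format(created, "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

# Assemble a PROV generatedAtTime statement for the placeholder activity URI.
prov_line <- sprintf(
  "<http://example.com/creation> <http://www.w3.org/ns/prov#generatedAtTime> \"%s\"^^<http://www.w3.org/2001/XMLSchema#dateTime> .",
  stamp
)
prov_line
```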

Adding your dataset into an RDF triplestore

RDF data can be stored in triplestores and queried using SPARQL, a powerful query language. This makes it easier to retrieve specific subsets of data or infer new information based on existing annotations.

# initialise an rdf triplestore:
dataset_describe <- rdf()

# open a temporary file:
temp_prov <- tempfile()

# describe the dataset in temporary file:
describe(iris_dataset, temp_prov)

# parse temporary file into the RDF triplestore:
rdf_parse(rdf = dataset_describe, doc = temp_prov, format = "ntriples")
#> Total of 18 triples, stored in hashes
#> -------------------------------
#> <http://example.com/creation> <http://www.w3.org/ns/prov#generatedAtTime> "2024-12-25T12:26:25Z"^^<xs:dateTime> .
#> <https://doi.org/10.32614/CRAN.package.dataset> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#SoftwareAgent> .
#> <http://viaf.org/viaf/6440526> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Agent> .
#> <https://doi.org/10.5281/zenodo.10396807> <http://purl.org/dc/terms/creator> <https://viaf.org/viaf/http://viaf.org/viaf/6440526> .
#> <https://doi.org/10.5281/zenodo.10396807> <http://purl.org/dc/terms/identifier> <https://doi.org/10.5281/zenodo.10396807> .
#> <http://example.com/dataset_prov.nt> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Bundle> .
#> <https://doi.org/10.5281/zenodo.10396807> <http://purl.org/dc/terms/publisher> "American Iris Society"^^<http://www.w3.org/2001/XMLSchema#string> .
#> <http://example.com/dataset#> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/linked-data/cube#DataSet> .
#> <http://example.com/dataset#> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Entity> .
#> <https://doi.org/10.5281/zenodo.10396807> <http://purl.org/dc/terms/coverage> ":unas"^^<http://www.w3.org/2001/XMLSchema#string> .
#> 
#> ... with 8 more triples

# show RDF triples:
dataset_describe
#> Total of 18 triples, stored in hashes
#> -------------------------------
#> <http://example.com/creation> <http://www.w3.org/ns/prov#generatedAtTime> "2024-12-25T12:26:25Z"^^<xs:dateTime> .
#> <https://doi.org/10.32614/CRAN.package.dataset> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#SoftwareAgent> .
#> <http://viaf.org/viaf/6440526> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Agent> .
#> <https://doi.org/10.5281/zenodo.10396807> <http://purl.org/dc/terms/creator> <https://viaf.org/viaf/http://viaf.org/viaf/6440526> .
#> <https://doi.org/10.5281/zenodo.10396807> <http://purl.org/dc/terms/identifier> <https://doi.org/10.5281/zenodo.10396807> .
#> <http://example.com/dataset_prov.nt> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Bundle> .
#> <https://doi.org/10.5281/zenodo.10396807> <http://purl.org/dc/terms/publisher> "American Iris Society"^^<http://www.w3.org/2001/XMLSchema#string> .
#> <http://example.com/dataset#> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/linked-data/cube#DataSet> .
#> <http://example.com/dataset#> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Entity> .
#> <https://doi.org/10.5281/zenodo.10396807> <http://purl.org/dc/terms/coverage> ":unas"^^<http://www.w3.org/2001/XMLSchema#string> .
#> 
#> ... with 8 more triples
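
Once the statements are in a triplestore, they can be queried with SPARQL. The sketch below assumes that rdflib is installed. To stay self-contained it builds a one-statement store with rdf_add() instead of reusing dataset_describe, and the object names graph, query and publishers are illustrative.

```r
library(rdflib)

# A small store holding one statement from the description above.
graph <- rdf()
rdf_add(
  graph,
  subject = "https://doi.org/10.5281/zenodo.10396807",
  predicate = "http://purl.org/dc/terms/publisher",
  object = "American Iris Society"
)

# Ask for the publisher of the dataset identified by its DOI.
query <- "
SELECT ?publisher
WHERE {
  <https://doi.org/10.5281/zenodo.10396807> <http://purl.org/dc/terms/publisher> ?publisher .
}"

# rdf_query() returns the matches as a data frame, one column per variable.
publishers <- rdf_query(graph, query)
publishers
```

The same query should work against dataset_describe, since it contains the same publisher statement.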

By using RDF, datasets can be exchanged as interoperable graphs (e.g., in formats like RDF/XML, Turtle, or JSON-LD).

options(rdf_print_format = "jsonld")
dataset_describe
#> Total of 18 triples, stored in hashes
#> -------------------------------
#> {
#>   "@graph": [
#>     {
#>       "@id": "http://example.com/creation",
#>       "@type": "http://www.w3.org/ns/prov#Activity",
#>       "http://www.w3.org/ns/prov#generatedAtTime": {
#>         "@type": "xs:dateTime",
#>         "@value": "2024-12-25T12:26:25Z"
#>       }
#>     },
#> 
#> ... with 8 more triples
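
Changing the print format only affects the display. To exchange the graph as a file, rdflib provides rdf_serialize(). The sketch below assumes that rdflib is installed, uses a small illustrative store rather than dataset_describe, and writes Turtle to a temporary file.

```r
library(rdflib)

# A small store with a single statement, used for illustration.
graph <- rdf()
rdf_add(
  graph,
  subject = "https://doi.org/10.5281/zenodo.10396807",
  predicate = "http://purl.org/dc/terms/title",
  object = "Iris Dataset"
)

# Write the graph to a temporary Turtle file.
turtle_file <- tempfile(fileext = ".ttl")
rdf_serialize(graph, turtle_file, format = "turtle")

# Inspect the serialised text.
turtle_text <- readLines(turtle_file)
cat(turtle_text, sep = "\n")
```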

Make the entire dataset interoperable

Eventually you can make the entire dataset interoperable by making every observation and every statement independent of R, your computer, your OS, and to a large extent the natural language that you use. This functionality will be developed further until an entire dataset can be expressed automatically in a semantically correct way.

n_triples(dataset_to_triples(iris[1:4, ]))
#>    s            p      o
#> 1  1 Sepal.Length    5.1
#> 2  2 Sepal.Length    4.9
#> 3  3 Sepal.Length    4.7
#> 4  4 Sepal.Length    4.6
#> 5  1  Sepal.Width    3.5
#> 6  2  Sepal.Width      3
#> 7  3  Sepal.Width    3.2
#> 8  4  Sepal.Width    3.1
#> 9  1 Petal.Length    1.4
#> 10 2 Petal.Length    1.4
#> 11 3 Petal.Length    1.3
#> 12 4 Petal.Length    1.5
#> 13 1  Petal.Width    0.2
#> 14 2  Petal.Width    0.2
#> 15 3  Petal.Width    0.2
#> 16 4  Petal.Width    0.2
#> 17 1      Species setosa
#> 18 2      Species setosa
#> 19 3      Species setosa
#> 20 4      Species setosa
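
As an illustration of where this is heading, the following base R sketch turns a few rows of such a triple table into N-Triples lines. It is not the package's implementation. The base URI http://example.com/dataset# and the obs prefix are placeholders, numeric-looking values are typed as xsd:double by assumption, and escaping of quotes in literals is omitted because the iris values contain none.

```r
# Three rows in the subject-predicate-object layout shown above.
triples <- data.frame(
  s = c("1", "2", "1"),
  p = c("Sepal.Length", "Sepal.Length", "Species"),
  o = c("5.1", "4.9", "setosa"),
  stringsAsFactors = FALSE
)

base_uri <- "http://example.com/dataset#"   # placeholder namespace
xsd <- "http://www.w3.org/2001/XMLSchema#"

# Values that parse as numbers are typed as doubles, everything else as strings.
is_number <- !is.na(suppressWarnings(as.numeric(triples$o)))
datatype <- ifelse(is_number, paste0(xsd, "double"), paste0(xsd, "string"))

# Mint URIs for subjects and predicates and attach a typed literal as the object.
nt_lines <- sprintf(
  "<%sobs%s> <%s%s> \"%s\"^^<%s> .",
  base_uri, triples$s, base_uri, triples$p, triples$o, datatype
)
writeLines(nt_lines)
```

In a real dataset the predicate URIs would point to shared variable definitions, such as those attached with defined(), instead of a local placeholder namespace.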