Overview

The dataset package extends tidyverse workflows with lightweight semantic metadata, provenance tracking, and interoperable dataset structures.

It supports gradual semantic stabilization, from lightweight semantic mappings, through formally defined variables, to semantically enriched datasets suitable for FAIR, machine-readable, and standards-aligned data exchange.

The package draws inspiration from:

  • SDMX and statistical data cube models
  • Dublin Core and DataCite
  • FAIR and reproducible research workflows

The goal is to preserve metadata when reusing statistical and repository datasets, improve interoperability, and make it easy to turn tidy data frames into web-ready, publishable datasets that comply with ISO and W3C standards.

Installation

You can install the latest released version of dataset from CRAN with:

install.packages("dataset")

To install the development version from GitHub, use pak or remotes:

# install.packages("pak")
pak::pak("dataobservatory-eu/dataset")

# install.packages("remotes")
remotes::install_github("dataobservatory-eu/dataset")

Minimal Example

Real-world datasets rarely begin with fully standardized values. Early in a project, inconsistencies may be easy to spot, such as mixing AD and Andorra for the same country. As datasets are combined from multiple sources, however, additional variants often appear: the ISO 3166-1 alpha-2 code AD, the country name Andorra, and the ISO 3166-1 alpha-3 code AND.

The prelabel() constructor provides a lightweight way to stabilize such values before committing to a formal semantic definition.

library(dataset)

x <- prelabel(
  c("AD", "Andorra", "AND", "LI", "Liechtenstein"),
  labels = c(
    Andorra = "AD",
    AND = "AD",
    Liechtenstein = "LI"
  )
)

as.character(x)
#> [1] "AD" "AD" "AD" "LI" "LI"
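
For intuition, the recoding shown above can be approximated in base R with a named lookup vector. This is only a sketch of the idea, not how prelabel() is implemented; the names raw, lookup, and stable are made up for illustration, and unlike a prelabelled vector the result carries no metadata.

```r
# Hypothetical base-R sketch: provisional recoding with a named lookup vector.
raw <- c("AD", "Andorra", "AND", "LI", "Liechtenstein")
lookup <- c(Andorra = "AD", AND = "AD", Liechtenstein = "LI")

# Values that appear among the lookup names are replaced;
# all other values pass through unchanged.
matched <- raw %in% names(lookup)
stable <- raw
stable[matched] <- unname(lookup[raw[matched]])

stable
#> [1] "AD" "AD" "AD" "LI" "LI"
```

A plain lookup like this discards the mapping once it has been applied, which is the gap a prelabelled vector is meant to close.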

Unlike a formal semantic definition, a prelabelled vector records provisional mappings that may still evolve during data integration. The original observational values remain available alongside the current semantic assumptions:

attr(x, "prelabel")
#>       Andorra           AND Liechtenstein            AD            LI 
#>          "AD"          "AD"          "LI"          "AD"          "LI"

When semantic assumptions become sufficiently stable, variables can be formalized with defined() and combined into a semantically enriched dataset_df() object:

library(dataset)

df <- dataset_df(
  country = defined(
    c("AD", "LI"),
    label = "Country",
    namespace = "https://www.geonames.org/countries/$1/"
  ),
  gdp = defined(
    c(3897, 7365),
    label = "GDP",
    unit = "million euros"
  ),
  dataset_bibentry = dublincore(
    title = "GDP Dataset",
    creator = person("Jane", "Doe", role = "aut"),
    publisher = "Small Repository"
  )
)

print(df)
#> Doe (2026): GDP Dataset [dataset]
#>   rowid country   gdp 
#>   <chr> <chr>   <dbl>
#> 1 obs1  AD       3897
#> 2 obs2  LI       7365

This illustrates the semantic lifecycle supported by the package:

raw values
    ↓
prelabelled
    ↓
defined
    ↓
dataset_df
    ↓
RDF and FAIR publication

Because semantic assumptions and provenance are preserved explicitly, semantically enriched datasets can be exported as interoperable RDF triples without manually reconstructing metadata at publication time.

Export as RDF triples:

dataset_to_triples(df, format = "nt")
#> [1] "<http://example.com/dataset#obsobs1> <http://example.com/prop/country> <https://www.geonames.org/countries/AD/> ."
#> [2] "<http://example.com/dataset#obsobs2> <http://example.com/prop/country> <https://www.geonames.org/countries/LI/> ."
#> [3] "<http://example.com/dataset#obsobs1> <http://example.com/prop/gdp> \"3897\"^^<xsd:decimal> ."                     
#> [4] "<http://example.com/dataset#obsobs2> <http://example.com/prop/gdp> \"7365\"^^<xsd:decimal> ."
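
The country statements above suggest that the $1 placeholder in namespace is replaced by each observed value to form the object URI. The base R sketch below reproduces the shape of those two statements under that assumption; expand_namespace() is a hypothetical helper, the subject and predicate URIs are copied from the output above, and the package's internal logic may differ.

```r
# Illustrative only: rebuild the two country statements by hand in base R.
namespace <- "https://www.geonames.org/countries/$1/"
codes <- c("AD", "LI")

# Hypothetical helper: replace the literal "$1" placeholder with one code.
expand_namespace <- function(code, template) {
  sub("$1", code, template, fixed = TRUE)
}
uris <- vapply(codes, expand_namespace, character(1),
               template = namespace, USE.NAMES = FALSE)

# Subject and predicate URIs are copied from the printed triples.
subjects <- paste0("http://example.com/dataset#obs", c("obs1", "obs2"))
predicate <- "http://example.com/prop/country"

# An N-Triples statement is "<subject> <predicate> <object> ." on one line.
statements <- sprintf("<%s> <%s> <%s> .", subjects, predicate, uris)
statements
```

The two resulting strings should match lines [1] and [2] of the dataset_to_triples() output above.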

Retain automatically recorded provenance:

provenance(df)

#> [1] "<http://example.com/dataset_prov.nt> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Bundle> ."                  
#> [2] "<http://example.com/dataset#> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Entity> ."                         
#> [3] "<http://example.com/dataset#> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/linked-data/cube#DataSet> ."                 
#> [4] "_:doejane <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Agent> ."                                              
#> [5] "<https://doi.org/10.32614/CRAN.package.dataset> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#SoftwareAgent> ."
#> [6] "<http://example.com/creation> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Activity> ."                       
#> [7] "<http://example.com/creation> <http://www.w3.org/ns/prov#generatedAtTime> \"2026-06-03T06:29:26Z\"^^<xsd:dateTime> ."
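
Because the exported statements are plain character strings, they can be inspected with base R before being handed to a triple store. The sketch below splits two statements copied from the outputs above into subject, predicate, and object columns. It assumes one statement per string and no spaces inside the subject or predicate, so it is not a general N-Triples parser; nt, parts, and triples are illustrative names.

```r
# Two statements copied from the example outputs above.
nt <- c(
  "<http://example.com/dataset#obsobs1> <http://example.com/prop/gdp> \"3897\"^^<xsd:decimal> .",
  "<http://example.com/creation> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Activity> ."
)

# Capture subject, predicate, and object; the object may contain spaces,
# so it takes everything up to the terminating " .".
parts <- regmatches(nt, regexec("^([^ ]+) ([^ ]+) (.+) \\.$", nt))

# Each element of parts holds the full match followed by the three groups.
triples <- do.call(rbind, lapply(parts, function(p) {
  data.frame(subject = p[2], predicate = p[3], object = p[4],
             stringsAsFactors = FALSE)
}))

triples$predicate
```

For real workloads a dedicated RDF parser is the safer choice, since literals may contain escaped characters that this pattern does not handle.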

Contributing

The package does not attempt automatic ontology alignment, entity reconciliation, or rule-based semantic inference. It focuses on preserving semantic assumptions made by the analyst in a transparent and reproducible form.

We welcome contributions and discussion!

Please refer to this package as:

Daniel Antal. (2026). dataset: Create Data Frames that are Easier to Exchange and Reuse (0.4.4). The Comprehensive R Archive Network. https://zenodo.org/records/17621464, DOI: 10.32614/CRAN.package.dataset

See contributors on the website and in the DESCRIPTION file.

Code of Conduct

This project follows the rOpenSci Code of Conduct. By participating, you are expected to uphold these guidelines.