Motivation: Make Tidy Datasets Easier to Release Exchange and Reuse
The aim of the ‘dataset’ package is to make tidy datasets easier to release, exchange and reuse. It organizes and formats data frame ‘R’ objects into well-referenced, well-described, interoperable datasets into release and reuse ready form. It applies a subjective, R-language compatible interpretation of the W3C Data Cube Vocabulary based on the statistical SDMX data cube model1. We enrich these statistical data exchange definition with similar definitions from information science and the exchange of scientific resources on open science repositories by applying the Dublin Core and DataCite metadata specifications for R attributes to improve the findability, accessibility, interoperability and reusability of the datasets.
We organize and format standard tabular data R objects (data.fame, data.table, tibble, or even well-structured lists like json and json-ld) become highly interoperable and can be placed into relational databases, semantic web applications, archives, repositories. This puts the FAIR principles into the practice of an R environment workflow: they make these tabular data objects more findable, more accessible, interoperable and reusable.
Open science repositories and analyst computers are full with datasets that have no provenance, structural or referential data. We believe that whenever possible, metadata should be machine-recorded when possible, and should not be detached from an R object.
There are several R packages that have overalapping goals or
dataset, but they use a different
philosophy. When exporting to different files, they should be written as
exported, but no sooner, and preferably into the file that contains the
The dataspice package aims to create well-defined and referenced datasets, but follows a different schema and a different publication strategy. The dataset package follows the more restrictive W3C/SDMX “DataSet” definition within the datacube model, which is better suited to synchronize with statistical data sources. Unlike dataset, it uses a manual metadata entry from CSV files. (See the documentation of the dataspice package.)
dataset package aims for a higher level of
reproducibility, and does not detach the metadata from the R object’s
attributes (it is aimed to be used in other reproducible research
pacakges that will directly record provenance and other transactional
metadata into the attributes.) We aim to bind together
dataset by creating export
functions to csv files that contain the same metadata that dataspice
records. Generally, dataspice seems to be better suited to raw,
observational data, while dataset for statistically processed data.
The intended use of
dataset is to start correctly record
referential, structural and provenance metadata retrieved by various
reproducible science packages that interact with statistical data (such
as the rOpenGov packages eurostat and iotables, or the
dataspice are very
suitable of or documenting social sciences survey data, which are
usually held in datasets. Our aim is to connect
dataset, declared and DDIwR to create such
datasets with DDI codebook metadata. They will create a stable new
foundation of the retroharmonize
package to create new, well-documented and harmonized statistical
datasets from the observational datasets of social sciences surveys.
package provides reproducible export functionality to the zenodo open
science repository. Interacting with
zen4R may be
intimidating for the casual R user as it uses R6 classes. Our aim to
provide an export function that completely wraps the workings of
zen4R when releasing the dataset.
In our experience, while the tidy data standards make reuse more efficient by eliminating unnecessary data processing steps before analysis or placement in a relational database, the application of DataSet definition and the datacube model with the information science metadata standards make reuse more efficient with exchanging and combining the data with other data in different datasets.
Another use case we are having in mind is the releasing of the data on a knowledge graph, or in an API that uses a well-defined schema. This is the primary motivation for choosing the datacube model, because it is used by most statistical APIs. The datacube model allows easy “slicing”, or dimension reduction, that allows itself to an easy modeling into RDF(S). (See later, and see a more comprehensive From dataset To RDF vignette, which in turn heavily borrows from A tidyverse lover’s intro to RDF by Carl Boettiger.)
The current development dilemma of
datasetif it should develop the
dataset()into an s3 class, and if so, should it be inherited from data.frame or tibble, keeping in mind the desire to work well with data.table objects, too.
If not, then the
dataset()functions should be clearly seen as helper functions to augment data.frame, tibble and data.table objects for easier use in data publication, release, placement in databases and on knowledge graphs.
In the first step, we organize tidy datasets according to the data cube model originally developed by the SDMX for exchanging statistical data, by clearly stating which columns are to be used for grouping, filtering and for calculations. In this step, we clearly state which tidy columns are to be used as “measurements”, “dimensions” and “attributes” of the dataset. This step will greatly simplify how we obtain tidy subsets from a dataset, or how we reduce the dimensions of a dataset to a triad or a quad to be placed on a knowledge graph.
In the second step, we add metadata is attributes to the R object (regardless if it is a data.frame or an inherited tibble/data.table object) which make the finding, understanding and the exchange of data (including refreshing data from source), its placement in a database or on a knowledge graph easier. This step will make the dataset easier to understand both for human analysis and for computers, and greatly increase their reuse potential, or the ability to review and replicate them in reproducible research.
A mapping of R objects into these models has numerous advantages: it makes data importing easier and less error-prone; it leaves plenty of room for documentation automation; and it makes the publication of results from R following the FAIR principles much easier.
The tidy data principles are seemingly very simple: each variable is a column, each observation is a row, and place all new units of measurement into a new table. However, the steps of tidying up a messy dataset require practice, like solving a Rubik cube––take a long time to do quicky.
The tidy data principles frame Codd’s relational algebra, particularly the third normal form (3NF) into statistical language, and make the storage and reuse of data in relational databases optional. It suggests a format for tabular data and data semantics that makes a very important, and usually the most time (resource) consuming part of data analysis far more efficient, because the analyst does not have to reinvent the wheel all the time.The data cube model can be seen as an extension of the tidy model that makes the storage, reuse, and exchange of data among different database, or their placement on a knowledge graph optimal.
The wording of the tidy data principles is not sufficiently exact for the data cube model. THe 3NF form stipulates that the (e.g. table columns) are functionally dependent on solely the primary key, which means that we must have a primary key that is locally unique in the table. The primary key must be a unique combination of the dimension and attribute columns, including the row identifier (rowid), if it exists. This means a lot more than the uniqueness of the measurement unit.
A usual dimension is the reference time period and the reference geographical area. Consider a dataset that contains population growth numbers from a county of a country, collected in the time period of 2010..2015 quarterly, and in 2016:2020 annually, with the county borders changing in 2014, and the unit of measure changed from the number of people in thousands to the number of people in millions. The meausurement of the reference time period and the aggregation area has changed, too.
library(dataset) #> #> Attaching package: 'dataset' #> The following object is masked from 'package:base': #> #> as.data.frame my_dataset <- dataset (x = data.frame ( time = rep(c(2019:2022),4), geo = c(rep("NL",8), rep("BE",8)), sex = c(rep("F", 4), rep("M", 4), rep("F", 4), rep("M", 4)), value = c(1,3,2,4,2,3,1,5, NA_real_, 4,3,2,1, NA_real_, 2,5), unit = rep("NR",8), freq = rep("A",8)), Dimensions = c("time", "geo", "sex"), Measures = "value", Attributes = c("unit", "freq"), sdmx_attributes = c("sex", "time", "freq"), Title = "Example dataset", Creator = person("Jane", "Doe") ) my_dataset <- dataset_local_id(my_dataset) my_dataset #> local_id time geo sex value unit freq #> 1 geo=NL_sex=F_time=2019 2019 NL F 1 NR A #> 2 geo=NL_sex=F_time=2020 2020 NL F 3 NR A #> 3 geo=NL_sex=F_time=2021 2021 NL F 2 NR A #> 4 geo=NL_sex=F_time=2022 2022 NL F 4 NR A #> 5 geo=NL_sex=M_time=2019 2019 NL M 2 NR A #> 6 geo=NL_sex=M_time=2020 2020 NL M 3 NR A #> 7 geo=NL_sex=M_time=2021 2021 NL M 1 NR A #> 8 geo=NL_sex=M_time=2022 2022 NL M 5 NR A #> 9 geo=BE_sex=F_time=2019 2019 BE F NA NR A #> 10 geo=BE_sex=F_time=2020 2020 BE F 4 NR A #> 11 geo=BE_sex=F_time=2021 2021 BE F 3 NR A #> 12 geo=BE_sex=F_time=2022 2022 BE F 2 NR A #> 13 geo=BE_sex=M_time=2019 2019 BE M 1 NR A #> 14 geo=BE_sex=M_time=2020 2020 BE M NA NR A #> 15 geo=BE_sex=M_time=2021 2021 BE M 2 NR A #> 16 geo=BE_sex=M_time=2022 2022 BE M 5 NR A
The tidy data is designed to be easily handled in relational databases and it is an ideal format for analysis in a closed system – for example, when all your data for the analysis is stored in a well-designed relational database. This semantics is however insufficient for exchanging data with several data sources, a problem that has been attracting solutions for a long time by statistical agencies and designers of hyperlinked, web-based products.
The dataset package is inspired by the way statisticians have been for a long-time organizing data for publications: the data cube model. The data cube model has additional, non-normative prescriptions for the structure of the (tidy) dataset: columns must be grouped into measurements, dimensions, and attributes. The dimensions are non-measured variables, such as reference geographical areas, reference time periods that are usually used to filter out subsets (slices) from a datacube. Attributes provide information about the measurements, such as unit of information, estimation method, and usually can be seen as metadata for modeling purposes: they contain observations (measurements) level metadata that is also often used for filtering (selecting actual, imputed or missing cases, for example.) Measurements are in fact the observations that we want to analyze.
The data cube model has been long the preferred format to organize data for analytical use by SDMX. Statistical agencies exchange with each other and publish data in datasets that are conforming the data cube model. By organizing your R tabular data like statisticians, you make your work easier to synchronize with statistical data sources: you can download or exchange date in this format. Following the data cube model helps you to understand more easily the dataset and bring your data to be analyzed to a tidy form in one or two steps — usually using attributes for subsetting, dimensions for grouping, and measurements to carry out arithmetic calculations like computing the group average values.
While tidying aims and preventing reinventing the wheel and many times repeating the work of bringing the data from a data storage, such as a database, into a form that an analyst can work with, reproducible research has further aims that add to the quality and inefficiency of the analysis: it aims to make the tidy datasets and the research output easier to reuse, and to review and replicate for quality control and peer review. Data scientists know that however simple the tidy data rules are, it takes a long time to get a feeling to easily put any messy, stored data into a tidy format. To put the data into a format that is easy to review, and which offers an easy replication is even harder. As the famous article by Monya Baker, 1,500 scientists lift the lid on reproducibility shows, the vast majority of scientists have experienced the problem that seemingly replicable, published scientific data cannot be replicated, and more worryingly, the majority of them expressed doubt about their ability to replicate their own findings. Datasets that are not easily replicated are also not easily used as building blocks for further joined, more composite data.
The lack of interoperability is often connected to the barrier of understanding a dataset. Having a simple term in the column heading, and knowing that observations are in rows, variables are in columns is not always a sufficient help to understand the meaning of the dataset. The data cube model, which explicitly states dimensions, attributes for the measurements greatly increase the understandability.
“Data is only potential information, raw and unprocessed, prior to anyone actually being informed by it.” —Jeffrey Pomerantz, Metadata
Understanding data (stored in a tabular or other format) requires metadata, information by which the data can be represented in a simpler form. The simplest and most useful metadata for this purpose is a title or a label in the RDF model, a single, human readable sentence that describes the data.
The data cube model describes attributes, which can be seen as metadata about observations, to be placed into the tabular format, and dataset level metadata, such as the title. What is missing from a tidy dataset in a data.frame, tibble or data.table format is the metadata.
In R, attributes of an object are designed to record metadata about the object – if it is a tabular data, then about the dataset itself. For a tidy dataset, at least the names of the columns, the names of the rows, and the name of the object in memory are recorded. R places very few restrictions on how to add metadata in the attribute fields of an object, or place new fields – you can add far more metadata than the size of the actual data frame if you like.
The dataset package defines a pseudo-class that enriches data frames with standardized metadata used in the RDF and SDMX datacube model, and the Dublin Core and DataCite metadata standards of libraries that make datasets easier to find, attribute, and more interoperable and reuseable by providing more information about their contents.
The dataset class (see:
dataset())is a pseudo-class.
Although it is defined as a class inherited from the base R data.frame
S3 class, in fact, it does not alter the way you interact with the data,
and for analytical purposes the object remains a data.frame. It is also
a pseudo-class because the functions built in the dataset constructor
work exactly as well on tibbles or data.table objects, because they only
touch metadata attributes of the R object.
dublincore() function and its helper functions adds
metadata as attributes to any R object The Dublin Core Elements are
originally library standards that facilitate finding resources (files)
in libraries. The RDF datacube definition, which aims to enable machines
to connect datasets automatically, refers to a part of the Dublin Core
definition —however, it is a good practice to add all Dublin Core
elements to a dataset.
datacite() functions and its helper functions adds
mandatory and optional metadata from the DataCite standard, which is
often preferred by open science data repositories aimed to a public
storage of datasets. The DataCite standard is newer than Dublin Core,
and was specifically designed to facilitate finding, accessing, and
reusing data. It greatly overlaps with Dublin Core but contains further
information that is particularly useful for datasets (for example,
estimated dataset object size in memory or data storage.)
Both the Dublin Core Metadata Elements definition 1.1 and DataCite
Metadata Scheme 4.4 refer to standard information elements, properties,
for example, the ISO standard of describing the size of an object in
computers, or the ISO standard of describing reference time or reference
geographical areas for the data collection. Some of the helper functions
validate the correct formatting of this information, some will be added
in later versions.
iris_datacite <- datacite_add( x = dataset(iris, Measures = c(1:4), Attributes = "Species"), Title = "Iris Dataset", Creator = person("Anderson", "Edgar", role = "aut"), Publisher = "American Iris Society", Identifier = "https://doi.org/10.1111/j.1469-1809.1936.tb02137.x", PublicationYear = 1935, Description = "This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.", Language = "en")
In R, objects can have arbitrary attributes. For example, a data.frame has a class attribute that tells functions to treat the object as a data.frame.Under the hood, you still keep your data frame object of choice, the good old base r ?data.frame, or the more modern data.table or tibble.
We add descriptive metadata conforming the Dublin Core and DataCite standard as data frame attributes, because they must clearly describe the dataset. The dataset-level attributes do not interfere with the tidy data concept, because the tidy data concept relates to the contents of the data frame.
datacite(iris_datacite) #> $names #>  "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species" #> #> $dimensions #>  names class isDefinedBy codeList #> <0 rows> (or 0-length row.names) #> #> $measures #> names class isDefinedBy #> Sepal.Length Sepal.Length numeric https://purl.org/linked-data/cube #> Sepal.Width Sepal.Width numeric https://purl.org/linked-data/cube #> Petal.Length Petal.Length numeric https://purl.org/linked-data/cube #> Petal.Width Petal.Width numeric https://purl.org/linked-data/cube #> codeListe #> Sepal.Length not yet defined #> Sepal.Width not yet defined #> Petal.Length not yet defined #> Petal.Width not yet defined #> #> $attributes #> names class #> Species Species factor #> isDefinedBy #> Species https://purl.org/linked-data/cube|https://raw.githubusercontent.com/UKGovLD/publishing-statistical-data/master/specs/src/main/vocab/sdmx-attribute.ttl #> codeListe #> Species not yet defined #> #> $Type #> $Type$resourceType #>  "Dataset" #> #> $Type$resourceTypeGeneral #>  "Dataset" #> #> #> $Description #>  "This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica." #> #> $Size #>  "17.42 kB [17.01 KiB]" #> #> $Date #>  "2022-12-15" #> #> $Title #> $Title$Title #>  "Iris Dataset" #> #> #> $Creator #>  "Anderson Edgar [aut]" #> #> $Identifier #>  "https://doi.org/10.1111/j.1469-1809.1936.tb02137.x" #> #> $Publisher #>  "American Iris Society" #> #> $Issued #>  1935 #> #> $publication_year #>  1935 #> #> $Geolocation #>  NA #> #> $Language #>  "eng" #> #> $Rights #>  NA
The assumption of third normal form (3NF), or a more precise form of tidiness, is that each observation has a primary key locally, i.e. in the dataset, is insufficient. For updating a dataset, i.e. joining new observations, or new columns, we need to work with globally universal observation identifiers, or URIs. (Otherwise, we must map the various locally unique keys to each other, i.e. we need to know various database schemas.) The idea of linked data, or RDF, is to reduce the data to a triple, and move all metadata, including referential or structural metadata into this triple.
my_dataset_uri <- dataset_uri(my_dataset, prefix = "https:://example.org/my_iris", keep_local_id = FALSE) my_dataset_uri #> URI time geo sex value unit freq #> 1 https:://example.org/my_iristime=2019 2019 NL F 1 NR A #> 2 https:://example.org/my_iristime=2020 2020 NL F 3 NR A #> 3 https:://example.org/my_iristime=2021 2021 NL F 2 NR A #> 4 https:://example.org/my_iristime=2022 2022 NL F 4 NR A #> 5 https:://example.org/my_iristime=2019 2019 NL M 2 NR A #> 6 https:://example.org/my_iristime=2020 2020 NL M 3 NR A #> 7 https:://example.org/my_iristime=2021 2021 NL M 1 NR A #> 8 https:://example.org/my_iristime=2022 2022 NL M 5 NR A #> 9 https:://example.org/my_iristime=2019 2019 BE F NA NR A #> 10 https:://example.org/my_iristime=2020 2020 BE F 4 NR A #> 11 https:://example.org/my_iristime=2021 2021 BE F 3 NR A #> 12 https:://example.org/my_iristime=2022 2022 BE F 2 NR A #> 13 https:://example.org/my_iristime=2019 2019 BE M 1 NR A #> 14 https:://example.org/my_iristime=2020 2020 BE M NA NR A #> 15 https:://example.org/my_iristime=2021 2021 BE M 2 NR A #> 16 https:://example.org/my_iristime=2022 2022 BE M 5 NR A
rdf_parse(nq_file) #> Total of 16 triples, stored in hashes #> ------------------------------- #> _:r1671090702r51275r9 <https:://example.org/my_iristime=2019> _:r1671090702r51275r10 . #> _:r1671090702r51275r5 <https:://example.org/my_iristime=2019> "2"^^<http://www.w3.org/2001/XMLSchema#decimal> . #> _:r1671090702r51275r13 <https:://example.org/my_iristime=2022> "2"^^<http://www.w3.org/2001/XMLSchema#decimal> . #> _:r1671090702r51275r6 <https:://example.org/my_iristime=2020> "3"^^<http://www.w3.org/2001/XMLSchema#decimal> . #> _:r1671090702r51275r15 <https:://example.org/my_iristime=2020> _:r1671090702r51275r16 . #> _:r1671090702r51275r18 <https:://example.org/my_iristime=2022> "5"^^<http://www.w3.org/2001/XMLSchema#decimal> . #> _:r1671090702r51275r4 <https:://example.org/my_iristime=2022> "4"^^<http://www.w3.org/2001/XMLSchema#decimal> . #> _:r1671090702r51275r14 <https:://example.org/my_iristime=2019> "1"^^<http://www.w3.org/2001/XMLSchema#decimal> . #> _:r1671090702r51275r3 <https:://example.org/my_iristime=2021> "2"^^<http://www.w3.org/2001/XMLSchema#decimal> . #> _:r1671090702r51275r7 <https:://example.org/my_iristime=2021> "1"^^<http://www.w3.org/2001/XMLSchema#decimal> . #> #> ... with 6 more triples
The dataset package is aiming to make tidy datasets easier to release, exchange and reuse. The motivation of the ecosystem of the dataset, statcodelists, and the dataobservatory packages is to provide a comprehensive toolkit that helps many reproducible research tasks.
dataset package already has an offspring, the
surveydataset package, which is aimed to connect the
dataset functionality with some elements of DDI, a metadata standard to
work with social sciences survey data, and to connect the dataset
package with the retroharmonize
package that harmonizes data from different surveys.
Our real goal is to facilitate efficient reproducible research workflows that perform flawlessly tasks that most R users, even scientific researchers, are unfamiliar with—or if they are familiar with them, they find it boring, because they will usually not get credited for it as analyst or researchers.
- contain Dublin Core or DataCite (or both) metadata that makes the findable and easier accessible via online libraries
- are easily reduced to linked open data formats to be serialized to, or synchronized with semantic web applications
- follow the datacube model of the Statistical Data and Metadata eXchange, therefore allowing easy refreshing with new data from the source of the analytical work
- contain processing metadata that greatly enhance the reproducibility of the results, and the reviewability of the contents of the dataset, including metadata defined by the [DDI]
- relatively lightweight in dependencies and easily works with data.frame, tibble or data.table R objects.
The datacube model is used by the W3C consortium for describing datasets, tabular data resources that can be easily connected via the internet infrastructure. RDF , Because the SDMX standards pre-date the definition of RDF, it took a long time sufficiently harmonize RDF and SDMX to become a standard for statistical data that can be easily exchanged by humans and computers alike, therefore making it ideal for reproducible research.
We also aim to replace the
survey in the retroharmonize
survey harmonization package to an inherited dataset that is optimized
to contain social sciences survey data.
The connecting statcodelists packages facilitates further reproducibility with standardized, natural language independent codelists for categorical variables. Such categorical variables are correctly interpreted in a wide array of statistical applications and can be easily joined with data from many countries irrespective of the primary sources’ language.
See: Linked SDMX Data
Baker, Monya. 2016. “1,500 Scientists Lift the Lid on reproducibility.” Nature 533 (7604): 452–54. https://doi.org/10.1038/533452a.
Capadisli, Sarven, Sören Auer, and Axel-Cyrille Ngonga Ngomo. 2015. “Linked SDMX Data: Path to High Fidelity Statistical Linked Data.” Semantic Web 6 (2): 105–12. https://doi.org/10.3233/SW-130123.
Core, Dublin. 2020. “DCMI Metadata Terms.” http://dublincore.org/specifications/dublin-core/dcmi-terms/2020-01-20/.
Group, DataCite Metadata Working. 2021. “DataCite Metadata Schema Documentation for the Publication and Citation of Research Data and Other Research Outputs. Version 4.4.” DataCite e.V. https://doi.org/10.14454/3w3z-sa82.
Pomerantz, Jeffrey. 2015. Metadata. The MIT Press Essential Knowledge Series. Cambridge, MA, USA: MIT Press.
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23. https://doi.org/10.18637/jss.v059.i10.