Richer Semantics for the Dataset's Variables • dataset

There are only two hard things in Computer Science: cache invalidation and naming things. – Phil Karlton

In a tidy dataset, variables are in columns. An R data frame (in its form as data.frame, tibble, tsibble, data.table) allows using column names that do not contain characters reserved with a special meaning in the R language itself. Choosing good variable names is an art for the data scientist: it improves code readability, review, and later reuse. However, the use of often used (and standardised) variable names like GEO, SEX, UNIT is not enough for a third-party review or reuse; the same analyst will likely have problems recalling the meaning of these variables without further documentation.

Let’s take a look at the famous iris dataset. What does Sepal.Length or Species means? What happens if we make a mutating transformation of Sepal.Width? What is the unit of measure if we take the average of Petal.Width?

head(iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa

One good practice for increasing reviewability and reproducibility is to include this information in a notebook or similar document. While this is undoubtedly a good practice, often the reuse or review is hampered by the fact that the data and the documentation are separated, and for example, the dataset is available without the documentation, or worse still, the dataset is modified after the documentation was closed, but this is not visible for the third party (re-)user.

Labelling variables

There are some R packages that help to add semantic information about the variables with a free-text variable label that allows normal punctuation and more words than the restricted col.names attribute of the data.frame. Our dataset package also implements variable labelling.

library(dataset)

dsd_iris <- DataStructure(iris_dataset)
dsd_iris$Sepal.Length$label <- "The sepal length measured in centimeters."
dsd_iris$Sepal.Width$label  <- "The sepal width measured in centimeters."
dsd_iris$Petal.Length$label <- "The petal length measured in centimeters."
dsd_iris$Petal.Width$label  <- "The petal width measured in centimeters."
dsd_iris$Species$label      <- "The name of the Iris species in the observation."

iris_dataset_labelled <- DataStructure_update(iris_dataset, dsd_iris)

DataStructure(iris_dataset_labelled)[["Sepal.Length"]]$label
#> [1] "The sepal length measured in centimeters."

Variable labels vary in quality, and they are natural language and context-dependent. They are not machine-actionable and require substantial time investment for interpretation or validation for reuse.

Increased interoperability

Interoperability is enhanced if we know what is the valid range of the variable. Until you save the iris_dataset_labelled R object in native .rds form, basic interoperability is ensured: the sepal length variable will be read as a numeric variable. But even in a CSV serialisation, it is only sometimes clear to the following user’s application if the variable should be read as an integer, double or a string. The dataset package allows simpler or richer range definitions to be added to each variable’s structural information set.

DataStructure(iris_dataset_labelled)[["Sepal.Length"]]$range
#> [1] "xsd:decimal"

Conceptualisation of variables

Professional dataset producers, for example, statistical agencies, therefore encourage using controlled vocabularies, such as taxonomies or formal ontologies, to describe the meaning of the variables. Our main goal with regards to increase the usability, reviewability or knowledge-expanding mutation capability of variables by adding standardised, machine-actionable information to the datasets.

The following example is taken from the RDF Data Cube Vocabulary definition [@rdf_data_cube_2014]. Using perhaps the most human-friendly textual serialisation of the Resource Description Framework, i.e., Turtle [@rdf_turtle_1_1],the example below shows machine-actionable or standardised dataset-, variable-, and observation level semantic information. This serialisation should be possible to obtain from a reprocessing of the data frame itself and its metadata attributes; in other words, our dataset() S3 class objects must be able to carry the following semantic information.

# Example 9 of the RDF Data Cube Vocabulary definition

eg:dataset-le1 a qb:DataSet;
    rdfs:label "Life expectancy"@en;
    rdfs:comment "Life expectancy within Welsh Unitary authorities - extracted from Stats Wales"@en;
    qb:structure eg:dsd-le ;
    .  

eg:o1 a qb:Observation;
    qb:dataSet  eg:dataset-le1 ;
    eg:refArea                 ex-geo:newport_00pr ;                  
    eg:refPeriod               <http://reference.data.gov.uk/id/gregorian-interval/2004-01-01T00:00:00/P3Y> ;
    sdmx-dimension:sex         sdmx-code:sex-M ;
    sdmx-attribute:unitMeasure <http://dbpedia.org/resource/Year> ;
    eg:lifeExpectancy          76.7 ;
    .