8 Tidy Data

Our reproducible research practice follows the tidy data principle, which has very complex computer science and information management consequences. Still, for the lay user of data, it boils down to simplicity.

Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types.

Following three rules makes a dataset tidy: variables are in columns, observations are in rows, and values are in cells. From [R For Data Science - 12. Tidy Data](https://r4ds.had.co.nz/tidy-data.html)

Figure 8.1: Following three rules makes a dataset tidy: variables are in columns, observations are in rows, and values are in cells. From R For Data Science - 12. Tidy Data

In tidy data: - Every column is a variable. We do not use colours (our machine-to-machine pipelines is colourblind). If we need comments or specifications, we add a new column. - Every row is an observation. Every variable belonging to Bulgaria is in the Bulgaria row, and there is one and only Bulgaria row. - Every cell is a single value. We never merge cells! A tidy dataset has no divided columns and no divided rows.

This is often far more easier to write than to do, but still, if you can make it that simple, then you already mastered Codd’s 3rd normal formframed in statistical language.

Tidy data is data in Codd’s 3rd normal form, but with the constraints framed in statistical language, and the focus put on a single dataset rather than the many connected datasets common in relational databases.

Tidy data always has two forms: wide and long. From [R For Data Science - 12. Tidy Data](https://r4ds.had.co.nz/tidy-data.html).

Figure 8.2: Tidy data always has two forms: wide and long. From R For Data Science - 12. Tidy Data.

Our task in WP4 is to help tidying some datasets that are commonly used in music, and surveying, and which will be used in WP1 (linking royalty accounts with SDMX compatible statistical data), WP2 (linking various, special purpose music data resources), and WP3 (DDI survey datasets, DDI survey codebooks). See more: dataset: Create Data Frames that are Easier to Exchange and Reuse.