Messy-desk

### Data Management

by [César Herrera](https://github.com/CexyNature) [![Twitter](https://img.shields.io/twitter/follow/CexyNature?style=social)](https://twitter.com/cexynature?lang=en) [![GitHub](https://img.shields.io/github/followers/CexyNature?style=social)](https://github.com/CexyNature) [![Open Source? Yes!](https://badgen.net/badge/Open%20Source%20%3F/Yes%21/blue?icon=github)](https://github.com/Naereen/badges/)

Made with ♥ for CoderTSV
## What is Data?

> "facts or information, especially when examined and used to find out things or to make decisions"
> -- Oxford dictionary \
> accessed Oct 4th, 2020

Why is managing data important?

Data is an essential aspect of modern business models.
"Data management aims to unlock the data, and make it easy to capture, organize, analyze, visualize and generate knowledge."
paraphrasing Gray et al. 2015
Q1: How many of us have received training in Data Management?

No

Yes

Q2: How many of us have organized DM training for fellow junior colleagues/lab-mates?

No

Yes

“Any fool can write code that a computer can understand. Good programmers write code that humans can understand.” ― Martin Fowler
“Any scientist can collect data for a scientific publication. Good scientists collect data and make it available for the human collective.”

The life cycle of data

Classic approach: concept/theory driven

vs

Modern approach: data driven

Theories and hypotheses are constantly evolving

"If more scientists make their data available in such a way that you can find and reuse it, then you can get data at the same speed that you can think about problems or new hypotheses. This is an enormous asset for Science going forward." paraphrasing Quintana & Heathers 2016

Science can be obscure

An example from Ecology

As time progresses, data entropy increases. That is, data become more difficult to find, reuse, or make sense of.

Ideally we should attempt to extend the life cycle of data

“A lot of what is published is incorrect.”

"... something has gone fundamentally wrong with one of our greatest human creations (Science)" The Lancet, 2015

The opinion of 1500 scientists

"A survey of Nature readers revealed a high level of concern about the problem of irreproducible results. Researchers, funders and journals need to work together to make research more reliable" Nature, 2015

Must try harder

"Too many sloppy mistakes are creeping into scientific papers. Lab heads must look more rigorously at the data — and at themselves." Nature, 2012

#### Recognize and celebrate good data management practices (including your own practices, teach others) ![](dm_assets/tmp-9.gif)
#### Implement and encourage a set of norms about how to use software and how to manage data ![](dm_assets/tmp-10.gif)

Understand that policing is not the best strategy

"Punishing individuals for failure to replicate their original results is unlikely to be effective at stopping the evolution of bad science." Smaldino and McElreath 2016

Data Management outlook

- How to organize data
- How to curate data
- How to test data and procedures
- How to manage distributed contributions

Effective data management

Integrity, Intelligibility, and Interoperability

Instrument data

Ensure data are preserved in a durable, future-proof format, including their meta-information.

Metadata

Metadata is an indispensable part of the data.
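One lightweight way to keep metadata attached to the data is a plain-text data dictionary saved next to the data file. The sketch below reuses the column names of the crab occurrence table shown in this talk; the descriptions, units, and the file name `crab_occurrence_metadata.csv` are hypothetical placeholders, not the actual metadata of that data set.

```r
# Hypothetical data dictionary: one row per column of the data file.
# Descriptions and units are illustrative assumptions.
data_dictionary <- data.frame(
  column      = c("ID", "lat", "long", "Taxa"),
  description = c("Record identifier",
                  "Latitude of the observation",
                  "Longitude of the observation",
                  "Taxon code"),
  unit        = c(NA, "decimal degrees", "decimal degrees", NA),
  stringsAsFactors = FALSE
)

# Save the dictionary as plain text beside the data file
# (a temporary directory is used here so the example runs anywhere).
metadata_file <- file.path(tempdir(), "crab_occurrence_metadata.csv")
write.csv(data_dictionary, metadata_file, row.names = FALSE)

# Reading it back confirms the metadata survives a round trip.
restored <- read.csv(metadata_file, stringsAsFactors = FALSE)
print(restored)
```

Because the dictionary is itself a CSV file, it can be versioned, backed up, and checked with the same tools as the data.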

Digitize, Convert and Backup

Select the most long-lasting data format available. In my case, I avoid proprietary file formats.
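As a rough illustration of the convert and backup steps, the sketch below writes a table to CSV, copies it to a second location, and compares checksums to confirm the copy is intact. The toy records, folder names, and file names are assumptions made for the example. A real backup would live on a separate device or service rather than in the same directory tree.

```r
# Toy table standing in for data exported from a proprietary format.
records <- data.frame(ID = c("a", "b"), lat = c(-19.65, -19.70))

# Convert: write the table to an open, plain-text format (CSV).
work_dir <- file.path(tempdir(), "dm_example")
dir.create(work_dir, showWarnings = FALSE)
csv_file <- file.path(work_dir, "records.csv")
write.csv(records, csv_file, row.names = FALSE)

# Backup: copy the file to a second location.
backup_dir <- file.path(work_dir, "backup")
dir.create(backup_dir, showWarnings = FALSE)
backup_file <- file.path(backup_dir, "records.csv")
file.copy(csv_file, backup_file, overwrite = TRUE)

# Verify: identical checksums mean the backup matches the original.
checksums <- tools::md5sum(c(csv_file, backup_file))
backup_ok <- checksums[[1]] == checksums[[2]]
print(backup_ok)
```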

                            
    # Read the occurrence records and preview the first rows
    crabs_occurrence <- read.csv("crab_occurrence_annandale.csv")
    head(crabs_occurrence)

    # |ID    | lat    | long   | Taxa    | ...|
    # | name | -19.65 | 142.43 | Tpolita | ...|
    # | ...  | ...    | ...    | ...     | ...|

    test_data <- function(data) {
        # Define expected result: one entry per column (remaining columns elided)
        check1_expect <- c(ID = "character", lat = "numeric", long = "numeric")  # , ...
        # Expected min, max, and na.values ("Yes" is assumed to mean NA is allowed)
        check2_expect <- list(
            ID  = list(min = NA,  max = NA,  na.values = NA),
            lat = list(min = -20, max = -18, na.values = "Yes")
        )  # , ...
        # Define checks
        columns <- names(check1_expect)
        check1 <- vapply(columns, function(col) class(data[[col]])[1], character(1))
        check2 <- vapply(names(check2_expect), function(col) {
            lim <- check2_expect[[col]]
            if (is.na(lim$min)) return(TRUE)  # no range defined for this column
            values <- data[[col]]
            na_ok <- identical(lim$na.values, "Yes") || !anyNA(values)
            na_ok && all(values >= lim$min & values <= lim$max, na.rm = TRUE)
        }, logical(1))
        # Compare the checks against the expected results
        test1 <- ifelse(check1 == check1_expect, "Pass", "Fail")
        test2 <- ifelse(check2, "Pass", "Fail")
        list(test1 = test1, test2 = test2)
    }

    test_data(crabs_occurrence)
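For checks that should halt a script as soon as the data look wrong, base R's `stopifnot()` is a compact alternative to collecting "Pass"/"Fail" labels. The sketch below uses a small made-up data frame with the same column names as the table above. The values, the latitude limits, and the function name `validate_occurrence` are assumptions for illustration only.

```r
# Toy records mimicking the structure of the crab occurrence table.
toy_occurrence <- data.frame(
  ID   = c("rec1", "rec2"),
  lat  = c(-19.65, -19.20),
  long = c(142.43, 142.50),
  Taxa = c("Tpolita", "Tpolita"),
  stringsAsFactors = FALSE
)

# Each condition must be TRUE, otherwise stopifnot() raises an error
# naming the first condition that failed.
validate_occurrence <- function(data) {
  stopifnot(
    is.character(data$ID),
    is.numeric(data$lat),
    is.numeric(data$long),
    all(data$lat >= -20 & data$lat <= -18, na.rm = TRUE)
  )
  invisible(TRUE)
}

validate_occurrence(toy_occurrence)

# A record outside the expected latitude range triggers an error,
# captured here with try() so the example keeps running.
bad_occurrence <- toy_occurrence
bad_occurrence$lat[1] <- 45
result <- try(validate_occurrence(bad_occurrence), silent = TRUE)
print(inherits(result, "try-error"))
```

Failing loudly suits automated pipelines, while a Pass/Fail report is easier to read when reviewing a data set by hand.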

                          
                      

If you have obscure names

You must document well

Include a vignette

Explain the naming convention, the functions and variables you defined, and the general purpose of the script.
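A short header comment plus roxygen-style comments on each function already covers much of this. The sketch below is only one possible layout. The naming convention, the function `count_records()`, and the taxon label "Other" are hypothetical.

```r
# ------------------------------------------------------------------
# Purpose: count occurrence records per taxon (illustrative script).
# Naming convention (hypothetical):
#   - objects use snake_case, e.g. crab_counts
#   - functions start with a verb, e.g. count_records()
# ------------------------------------------------------------------

#' Count records per taxon
#'
#' @param data A data frame with a `Taxa` column.
#' @return A named integer vector: number of records for each taxon.
count_records <- function(data) {
  counts <- table(data$Taxa)
  # Convert the table to a plain named integer vector
  setNames(as.integer(counts), names(counts))
}

crab_counts <- count_records(
  data.frame(Taxa = c("Tpolita", "Tpolita", "Other"))
)
print(crab_counts)
```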

Organize your code

It will make things easier for you and for others.

Describe the files/directories structure in your project

It may require some thought at the beginning, but once you have set it up you can keep reusing it in future projects.
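A project skeleton can be created and described in a few lines. The folder names below are a hypothetical layout, not a prescribed standard, and the example builds everything inside a temporary directory.

```r
# Hypothetical project layout; adapt the folder names to your own needs.
project_dirs <- c("data/raw", "data/processed", "scripts", "output", "docs")

# Build the skeleton inside a temporary directory so the example is harmless.
project_root <- file.path(tempdir(), "my_project")
for (dir_name in project_dirs) {
  dir.create(file.path(project_root, dir_name),
             recursive = TRUE, showWarnings = FALSE)
}

# Describe the structure in a README at the project root.
readme_lines <- c(
  "data/raw        original files, never edited by hand",
  "data/processed  cleaned data produced by scripts",
  "scripts         code, one short script per task",
  "output          figures and tables",
  "docs            notes and documentation"
)
writeLines(readme_lines, file.path(project_root, "README.txt"))

# List what was created.
created <- list.dirs(project_root, full.names = FALSE, recursive = TRUE)
print(created)
```

Keeping this as a script means the same structure can be recreated for every new project.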

Short scripts for specific tasks

Compartmentalize your code

Avoid repeating code across scripts

Instead, create functions that you can call on demand
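As a small sketch of this idea, one function can replace the same range check copied into several scripts. The function name `flag_out_of_range()`, the toy values, and the limits are illustrative assumptions.

```r
# A reusable function replaces the same lines copied into several scripts.
flag_out_of_range <- function(values, min_value, max_value) {
  # TRUE where a value is missing or falls outside [min_value, max_value]
  is.na(values) | values < min_value | values > max_value
}

# The same function serves different columns and different data sets.
lat  <- c(-19.65, -25.00, NA)
long <- c(142.43, 142.50, 200.00)

lat_flags  <- flag_out_of_range(lat,  min_value = -20,  max_value = -18)
long_flags <- flag_out_of_range(long, min_value = -180, max_value = 180)

print(lat_flags)
print(long_flags)
```

Saved in its own file, such a function can be loaded into any script with `source()`, so a fix made once applies everywhere it is used.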