### Data Management
by [César Herrera](https://github.com/CexyNature) [![Twitter](https://img.shields.io/twitter/follow/CexyNature?style=social)](https://twitter.com/cexynature?lang=en) [![GitHub](https://img.shields.io/github/followers/CexyNature?style=social)](https://github.com/CexyNature) [![Open Source? Yes!](https://badgen.net/badge/Open%20Source%20%3F/Yes%21/blue?icon=github)](https://github.com/Naereen/badges/)
Classic approach: concept/theory driven
vs
Modern approach: data driven
"If more scientists make their data available in such a way that you can find it and reuse it, then you can get data at the same speed that you can think about problems or new hypotheses. This is an enormous asset for science going forward." (paraphrasing Quintana & Heathers 2016)
As time progresses, data entropy increases. That is, data becomes more difficult to find, reuse, or make sense of.
Ideally, we should attempt to extend the life cycle of data.
"... something has gone fundamentally wrong with one of our greatest human creations (Science)" The Lancet, 2015
"A survey of Nature readers revealed a high level of concern about the problem of irreproducible results. Researchers, funders and journals need to work together to make research more reliable" Nature, 2015
"Too many sloppy mistakes are creeping into scientific papers. Lab heads must look more rigorously at the data — and at themselves." Nature, 2012
"Punishing individuals for failure to replicate their original results is unlikely to be effective at stopping the evolution of bad science." Smaldino and McElreath 2016
- How to organize data
- How to curate data
- How to test data and procedures
- How to manage distributed contributions
Integrity, Intelligibility, and Interoperability
Ensure data is preserved in a time-proof format, together with its meta-information.
Metadata is an indispensable part of the data.
Select the most durable data format available. In my case, I avoid proprietary file formats.
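One way to act on both points is to store the table as plain-text CSV and keep its metadata in a plain-text file next to it. The sketch below uses only base R; the file names, the toy values, and the metadata fields are assumptions made up for illustration, not part of the original dataset.

```r
# Toy occurrence table (values are invented for illustration)
occurrences <- data.frame(
  ID   = c("rec_001", "rec_002"),
  lat  = c(-19.65, -19.66),
  long = c(142.43, 142.44),
  stringsAsFactors = FALSE
)

# Hypothetical output location; a temporary directory keeps the example tidy
out_dir   <- tempdir()
data_file <- file.path(out_dir, "occurrence_example.csv")
meta_file <- file.path(out_dir, "occurrence_example_metadata.txt")

# Save the data in an open, plain-text format
write.csv(occurrences, data_file, row.names = FALSE)

# Save the metadata alongside it: what each column means and its units
metadata <- c(
  "file: occurrence_example.csv",
  "ID: record identifier (text)",
  "lat: latitude in decimal degrees",
  "long: longitude in decimal degrees",
  paste("created:", format(Sys.Date(), "%Y-%m-%d"))
)
writeLines(metadata, meta_file)

# Reading the file back should give the same table
restored <- read.csv(data_file, stringsAsFactors = FALSE)
```

Because both files are plain text, they can be opened decades later with any editor, and the metadata travels with the data instead of living in someone's memory.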
```r
# Load the occurrence records and preview the first rows
crabs_occurrence <- read.csv("crab_occurrence_annandale.csv")
head(crabs_occurrence)
```

| ID | lat | long | Taxa | ... |
|---|---|---|---|---|
| name | -19.65 | 142.43 | Tpolita | ... |
| ... | ... | ... | ... | ... |
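```r
test_data <- function(data) {
  # Define expected results for ID, lat and long
  # (the remaining columns are elided in the original slides)
  check1_expect <- c(ID = "character", lat = "numeric", long = "numeric")
  check2_expect <- list(lat = c(min = -20, max = -18))  # NA values allowed

  # Run the checks: column data types, then value ranges
  cols <- names(check1_expect)
  check1 <- vapply(data[cols], function(x) class(x)[1], character(1))
  check2 <- vapply(names(check2_expect), function(col) {
    rng <- check2_expect[[col]]
    all(data[[col]] >= rng["min"] & data[[col]] <= rng["max"], na.rm = TRUE)
  }, logical(1))

  # Compare the results against the expectations
  test1 <- ifelse(check1 == check1_expect, "Pass", "Fail")
  test2 <- ifelse(check2, "Pass", "Fail")
  list(column_types = test1, column_ranges = test2)
}

test_data(crabs_occurrence)
```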
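To see how a range check of this kind behaves, the sketch below runs one on a small made-up data frame and reports which records fall outside the expected latitude range. The helper name `find_out_of_range()` and the toy values are assumptions for illustration only.

```r
# Return the row indices whose values fall outside [min_value, max_value].
# Missing values are not counted as failures here; they are listed separately.
find_out_of_range <- function(values, min_value, max_value) {
  which(!is.na(values) & (values < min_value | values > max_value))
}

# Toy data: one latitude out of range, one missing
toy <- data.frame(
  ID  = c("a", "b", "c", "d"),
  lat = c(-19.65, -25.00, NA, -18.50),
  stringsAsFactors = FALSE
)

bad_rows     <- find_out_of_range(toy$lat, min_value = -20, max_value = -18)
missing_rows <- which(is.na(toy$lat))

toy$ID[bad_rows]      # IDs of records with a latitude outside the range
toy$ID[missing_rows]  # IDs of records with no latitude
```

Reporting the offending records, rather than a single pass or fail, makes it quicker to trace a problem back to the raw data.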
You must document well
Explain the naming convention, the functions and variables you defined, and the general purpose of the script.
This makes things easier for you and for others.
It may require some thinking at the beginning, but once you have done it you can keep using it in the future.
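As an illustration of this kind of documentation, the sketch below opens with a header stating the purpose and naming convention, then documents one function with roxygen-style comments. The script name, the convention, and the function `convert_dd_to_radians()` are all invented for the example.

```r
# ------------------------------------------------------------------
# Script:  01_prepare_coordinates.R   (hypothetical name)
# Purpose: Convert coordinates from decimal degrees to radians.
# Naming:  functions start with a verb and use snake_case;
#          columns keep the names used in the raw file.
# ------------------------------------------------------------------

#' Convert decimal degrees to radians
#'
#' @param degrees Numeric vector of angles in decimal degrees.
#' @return Numeric vector of the same length, in radians.
convert_dd_to_radians <- function(degrees) {
  stopifnot(is.numeric(degrees))
  degrees * pi / 180
}

lat_radians <- convert_dd_to_radians(c(-19.65, 0, 180))
```

The header answers "what is this file for?" and the function comments answer "what goes in and what comes out?", which is usually what a future reader needs first.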
Compartmentalize your code
Instead of one long script, create functions that you can call on request.
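A minimal sketch of this idea: each step lives in its own small function, and the analysis calls them only when needed. The function names, the toy data, and the latitude limits are assumptions chosen for illustration.

```r
# Step 1: keep only records with complete coordinates
drop_incomplete <- function(data) {
  data[!is.na(data$lat) & !is.na(data$long), , drop = FALSE]
}

# Step 2: keep only records inside a latitude band
filter_latitude <- function(data, min_lat, max_lat) {
  data[data$lat >= min_lat & data$lat <= max_lat, , drop = FALSE]
}

# Step 3: count records per taxon
count_by_taxon <- function(data) {
  table(data$Taxa)
}

# Toy data (invented values)
toy <- data.frame(
  Taxa = c("sp_a", "sp_a", "sp_b", "sp_b"),
  lat  = c(-19.6, -19.7, NA, -25.0),
  long = c(142.4, 142.5, 142.6, 142.7),
  stringsAsFactors = FALSE
)

# Each function can be called on its own or chained with the others
complete <- drop_incomplete(toy)
in_band  <- filter_latitude(complete, min_lat = -20, max_lat = -18)
counts   <- count_by_taxon(in_band)
```

Small functions like these can be tested one at a time and reused across projects, which is harder to do with a single long script.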