why reproducibility

The fundamental idea behind a robust, reproducible analysis is a clean, repeatable script-based workflow (i.e. the sequence of tasks from the start to the end of a project) that links raw data through to clean data and to final analysis outputs.

principles of a good analysis workflow

  1. Any cleaning, merging, transforming, etc. of data should be done in scripts, not manually.
  2. Split your workflow (scripts) into logical thematic units. For example, you might separate your code into scripts that
      1. load, merge and clean data
      1. analyse data
      1. produce outputs like figures and tables
  3. Eliminate code duplication by packaging useful code into custom functions (see Programming: write a function). Comment your functions thoroughly, explaining their expected inputs and outputs, what they do, and why.

  4. Document your code and data as comments in your scripts or by producing separate documentation (see Programming and Reproducible reports).

  5. Any intermediary outputs generated by your workflow should be kept separate from raw data.
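
As a minimal sketch of principles 3 and 5, the snippet below defines one small documented function and writes a cleaned table to a location separate from the raw data. The function name `fahrenheit_to_celsius`, the example data, and the file name `temperatures_clean.csv` are all hypothetical, and a temporary directory stands in for a real output folder.

```r
#' Convert temperatures from Fahrenheit to Celsius.
#'
#' @param temp_f Numeric vector of temperatures in degrees Fahrenheit.
#' @return Numeric vector of the same length, in degrees Celsius.
fahrenheit_to_celsius <- function(temp_f) {
  # Fail early on unexpected input rather than silently returning nonsense.
  stopifnot(is.numeric(temp_f))
  (temp_f - 32) * 5 / 9
}

# Hypothetical raw data; in a real project this would be read from the data folder.
raw <- data.frame(site = c("a", "b"), temp_f = c(32, 212))

# Cleaning happens in code, and the raw object is left untouched.
clean <- transform(raw, temp_c = fahrenheit_to_celsius(temp_f))

# Write the intermediate result away from the raw data
# (a temporary directory stands in for an output folder here).
out_dir <- file.path(tempdir(), "output_data")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
write.csv(clean, file.path(out_dir, "temperatures_clean.csv"), row.names = FALSE)
```

Because the conversion lives in one function, any script that needs it can call it instead of repeating the formula, and rerunning the script regenerates the cleaned file from the raw data.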

file system structure

  1. The data folder contains all input data (and metadata) used in the analysis.
  2. The docs folder contains the manuscript.
  3. The figs directory contains figures generated by the analysis.
  4. The output_data folder contains any type of intermediate or output files (e.g. simulation outputs, models, processed datasets, etc.). You might separate this and also have a cleaned-data folder.
  5. The scripts folder contains R scripts with function definitions.
  6. The rmd folder contains R Markdown files and reports that document the analysis or report on results.
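
The layout above could be set up from R in a few lines. This is only a sketch: the project name `my-analysis` is hypothetical, and the folders are created under a temporary directory so the code can be run without touching a real project.

```r
# Folder names taken from the list above.
project_dirs <- c("data", "docs", "figs", "output_data", "scripts", "rmd")

# A temporary directory is used so the sketch does not alter a real project.
project_root <- file.path(tempdir(), "my-analysis")

# Create each folder; recursive = TRUE also creates the project root if needed.
for (d in project_dirs) {
  dir.create(file.path(project_root, d), recursive = TRUE, showWarnings = FALSE)
}

# List what was created.
list.files(project_root)
```

In practice you would replace `project_root` with the path to your own project directory.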

principles of good file naming

  1. machine readable
  • Use delimiters deliberately to separate important pieces of metadata.
  • Avoid spaces, punctuation, accented characters and case sensitivity.
  • Use “_” to separate metadata fields that will be extracted as strings later on.
  • Use “-” in place of spaces within a field. The two roles can be swapped, but do not mix conventions.
  2. human readable
  • Ensure file names also include an informative description of the file contents.
  • Adopt the concept of the slug to link outputs with the scripts in which they are generated.
  3. easy to order by default
  • Starting file names with a number helps.
  • For data, this might be a date, allowing chronological ordering.
  • Make sure to use ISO 8601 format (YYYY-MM-DD) to avoid confusion between differing local dating conventions.
  • For scripts, you could use a number indicating the position of the script in the analysis sequence, e.g. 01_download-data.R
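
To illustrate why these conventions pay off, here is a small sketch in base R that sorts file names and recovers their metadata. All data file names are made up for the example, as are the script names other than 01_download-data.R.

```r
# Hypothetical file names: ISO 8601 date, then metadata fields separated by "_",
# with "-" joining words inside a field.
files <- c(
  "2021-03-15_site-a_soil-moisture.csv",
  "2020-11-02_site-b_soil-moisture.csv",
  "2021-01-09_site-a_air-temp.csv"
)

# With YYYY-MM-DD dates, default alphabetical sorting is also chronological.
files <- sort(files)

# Drop the extension, then split on "_" to recover the metadata fields.
fields <- strsplit(tools::file_path_sans_ext(files), "_", fixed = TRUE)
meta <- data.frame(
  date     = as.Date(vapply(fields, `[`, character(1), 1)),
  site     = vapply(fields, `[`, character(1), 2),
  variable = vapply(fields, `[`, character(1), 3)
)

# Zero-padded numeric prefixes keep scripts in run order.
script_names <- sprintf(
  "%02d_%s.R", 1:3,
  c("download-data", "clean-data", "analyse-data")
)
```

Because “_” only ever separates fields and “-” only ever joins words within a field, a single `strsplit()` call is enough to turn file names into a metadata table.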