Reproducible Research
The reproducible research stage consists of two or more pieces: a report and the supporting workbooks. The purpose of this work is to
- Provide a document containing sufficient details to allow others to reproduce your work.
Reproducible research is extremely important to a large portion of the data science world as well as the scientific community in general. While papers are written to describe results, reproducible research documents are written to describe how the results were obtained and to allow others to recreate these results.
While true reproducible research is probably not possible in Excel, we should at least attempt to follow the practice. The following is based on Ten Simple Rules for Reproducible Research in Jupyter Notebooks. That article addresses research done in Python and stored in Jupyter notebooks, so not all of the rules apply to Excel, but it is still worth a quick read.
As you work on your project you should
- Prepare a document that
- Describes what you are doing and why you are doing it. (Read rule 1.)
- Describes each experiment and why you made the choices you did. (Read rule 2.)
- Work in small, understandable portions. (Rules 3 and 4).
- In our case, this means a separate worksheet for each task.
- If the data needs to be cleaned
- Document how data errors were detected.
- Document how errors were corrected.
- Document why the replacement values were chosen.
- In Excel, it is important that you document this explicitly.
- Do as little as possible by "hand".
- Never change a cell value directly; use tools in Excel to do this.
- Document how values were changed.
All workbooks, including the raw data, should be included in this submission.
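For comparison, the cleaning guidance above (detect errors with a rule, avoid hand edits, record how and why each value changed, keep the raw data) can be sketched in Python, the language used in the article cited earlier. This is only an illustration under stated assumptions: the `Correction` record, the `clean_ages` function, the 0 to 120 age rule, and the sample data are all made up for this example and are not part of the assignment. In Excel, the same information would go in your report or in a dedicated change-log worksheet.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Correction:
    """One documented change to the data (all field names are illustrative)."""
    row: int          # zero-based index of the record in the raw data
    column: str
    old_value: str
    new_value: str
    detected_by: str  # how the error was found
    reason: str       # why the new value was chosen


def is_valid_age(text):
    """Hypothetical detection rule: a whole number between 0 and 120."""
    try:
        return 0 <= int(text) <= 120
    except ValueError:
        return False


def clean_ages(rows, fallback="NA"):
    """Return a cleaned copy of rows plus a log of every change made."""
    cleaned, log = [], []
    for i, row in enumerate(rows):
        new_row = dict(row)  # copy, so the raw record is never modified
        if not is_valid_age(row["age"]):
            new_row["age"] = fallback
            log.append(Correction(
                row=i,
                column="age",
                old_value=row["age"],
                new_value=fallback,
                detected_by="range check: whole number in 0..120",
                reason="true value unknown; marked as missing",
            ))
        cleaned.append(new_row)
    return cleaned, log


# Made-up sample data standing in for a raw worksheet exported as CSV.
RAW_CSV = "name,age\nAna,34\nBen,-5\nCara,abc\nDev,61\n"
raw_rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
cleaned_rows, change_log = clean_ages(raw_rows)

# The change log is what belongs in the reproducibility report.
for entry in change_log:
    print(entry)
```

The design choice worth noting is that the raw rows are left untouched and every change is produced by a rule rather than typed in by hand, so someone else can rerun the cleaning and get the same result.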
In addition, you should produce a data description document. This document should
- Document the source of the dataset.
- Provide an overview of the dataset.
- Provide detailed information about all fields in the dataset
- A description of the field.
- For numeric data
- A five-number summary for most fields.
- A graph depicting the values of the data (histogram, column chart, circle graph, ...)
- For text data (when applicable)
- Count of the number of distinct items.
- Distribution of values.
- Graphical representation of values.
- If a field is not used, simply state this.
- Provide detailed information about all derived fields used in the a