Top 5 Best Practices for Data Collection and Storage in Microsoft Excel

As a biostatistician, my preferred way to receive data from a client is a csv extract from a well-designed database that included validation with the data-entry process. However, 95% of the data I see comes as an Excel file.

I have just sent an email to a client with some tips for using Excel to collect and store their research data. These 5 steps are my recommendations that save me a lot of headache when importing data from .xlsx files into R for analysis:

  1. (basic) Only enter a single piece of information into a cell. For example, if the column is “number”, then 365 is the only thing in the cell, do not do this: “365 (children only)”. If you need to leave a comment, then put it in another column.
  2. (basic) Keep column names simple, my preference is up to a maximum of 3 words, separated by underscores, all lower caps. If you can, keep column names meaningful to a human reader. Also, only use Row 1 for column names/information. Extra information about the data contained within a specific column should be placed in the data dictionary, which is in a different worksheet.
  3. (basic) Include a data dictionary that should have at least 5 columns: 1| column name (point 2), 2| Description, 3| data type (e.g. dichotomous (yes/no, true/false), categorical (list the categories), numeric, text), 4| valid ranges of data (for example, valid entries for age could be between 0 and 110 years, if 200 was entered then I know it’s likely an error), and 5| if the value is numeric, provide the unit of measurement (e.g. years, days, cm, kg, …).
  4. (advanced) Use Excel’s data validation functions – especially to make drop-down lists for categorical responses.
  5. (intermediate) If you have any date columns, change the cell data type to “Text” (the same for phone numbers).

Note: Always, any information that can be used to identify an individual should be removed from the file before it gets sent to a statistician.

If everyone followed these 5 steps, then 80% of my work (and time) would be freed up for analysis and communicating results. I know there must be other simple pieces of advice. What have I missed? What other simple advice do you give your colleagues/clients about data entry into Microsoft Excel?

Photo by Scott Graham on Unsplash

How Do You Organise Your R Project? This Is What We Do.

The Biometrics group at Telethon Kids Institute uses a standardised template project directory to manage our biostatistical consultation projects. This approach allows us to streamline our workflow, initiate projects, and produce professional looking reports directly from the statistical analysis platform minimising the time spent on the non-analytical aspects of our projects. This project structure is identical and successful for both simple and large-scale projects.

Although this workflow has been developed in R, the broader principles discussed here are applicable to any scripting language and non-coding projects alike. This blog is based on recent presentations that I made at the UserR!2019, Toulouse, France (lightning talk) and the 40th Annual Conference of the International Society for Clinical Biostatistics (ISCB40), Leuven, Belgium (winner of best poster award) conferences in July 2019.

Project Directory Structure

We have developed and refined the skeleton of a template project that contains a series of sub-directories with distinct objectives that cover all steps of a project from initial import of raw data to reporting. Our template is pre-populated with directories and starter-files that has enabled our group to save time in project initiation, preliminary analysis, analysis, and reporting.

Figure 1. Schematic of our template project. Sub-directories are indicated by squares and template files are shown by circles.

This structure is based on what is described by the “ProjectTemplate” R package ( We have found this structure to be attractive thanks to its automated project initiation, executing data wrangling scripts, and loading library packages and data. Without going into excessive detail, we have several directories including:

  • Directories to house data at various stages of rawness, cleanliness, and wrangling
  • admin directories for project meta data
  • 2 scripting directories (ProjectTemplate can be configured to automatically run scripts in the munge directory; the R directory can be reproduced/changed for different scripting languages or changed to a generic “src” folder)
  • and the vignettes folder for our reports.

This structure is also useful as it allows us to build our projects as R packages that are easily distributed to our collaborators once the analysis is complete. We chose to build our reports in the vignettes directory rather than locate them in a “reports” sub‑directory since we can share the final work as an installable package – enabling us to make a collection of project reports available to our collaborators by browsing the package vignettes. Packaging projects is also a useful way to share the cleaned data as it is available as an included dataset along with any documentation that was created specifically for the analysis.

options(ProjectTemplate.templatedir = “path/to/templates")
create.project(paste(“path/to/projects”, “00_project”, sep = “/”), template = "biometrics_project")

Box 1. It only takes 2 lines of code to initiate a new project. A template directory can contain multiple project templates.

Reproducible Research

It is important that our analysis is traceable from raw data to final report and that all changes that were made to the analysis throughout the project life cycle are tracked. I suggest that no modifications are made to the raw data once it is received from the researcher (which often comes as a .xlsx or a .csv). The first thing to do once data is received is to prefix the file name with the current date (YYYYMMDD_) then make it read-only. Any further cleaning is then performed in-script where all changes are documented and can be verified and audited. Subsequent changes to the analysis are tracked via. GitLab, which we have installed on our secure servers, or If the data isn’t sensitive then services such as GitHub or Bit Bucket.

Reporting with R Markdown

R Markdown is an excellent tool that allows analysts to compose the project narration and data analysis output in a single document. Markdown also helps to maintain the quality of a report; by keeping the analysis data frames and the report commentary all within in the R environment. Inserting the analysis outputs directly into the report removes the possibility of transcription errors. Markdown is also great for updating reports when minor changes have been made to the underlying data.

To streamline our analysis plans and reporting we have developed a series of R markdown templates that produce documents that conform to the Telethon Kids Institute’s style guide and online branding. These reports produce beautiful stand-alone HTML documents that we distribute to our collaborators and are built on the bootstrap CSS libraries which allow for dynamic responsive pages that can be viewed on a range of devices.

Figure 2. The default reporting template that we have packaged in; when the package is installed we can create new reports by clicking in R Studio: “File > New File > R Markdown… > From Template”. More information about our templates can be found here.


The following table lists some packages that we use to simplify our reporting. These packages are useful as they allow us to focus on the data without wasting time on the non-analytical parts of the project such as package citations, caption numbering, and tabulating and visualising model output. A brief description is provided about why we use each of these packages, you should visit the official documentation for further details.

Table 1. Summary of useful packages that we use during our professional biostatistical consultations.

Package Description
Tidyverse collection Data wrangling/summaries/visualisation
Captioner Cross-reference tables/figures/models
KableExtra Nicely format data frames for reporting
Stargazer Create well-formatted regression tables
Broom Extract a model’s estimates and statistics
Repmis Create a bibliography of loaded packages
Devtools::build_vignettes() Knit all vignette .Rmd files
jtools::plot_summs() Visualise a model(s) effect estimates and CIs
Gggally::ggpairs() Look at your data with a plot pair matrix
roxygen2 Documenting code by writing .Rd files in the man/ directory

The default project directory structure that we have developed can be seen in action as part of the Telethon Kids rstudio GitHub repository; this repo is an implementation of R Studio within a Docker container (see here and here). You can navigate through the repo on GitHub, or clone it to your local machine. The template project is found in the projects/00_next_project sub-directory. Each of the sub-directories in this template contain a README that briefly describes its purpose.


I doubt there will be anything in this article that can be called “new”, but unless someone has worked in a place with a clearly defined project structure then it is unlikely they have thought about an efficient way to organise their myriad of relates files (documents/data/scripts).

Conversations that I had with data scientists, statisticians, and analysts at both the UseR!2019 and ISCB40 conferences indicate that organisations are becoming increasingly aware of the importance of well-structured data projects. This workflow has come from reading many articles and trying out several packages – very few of which I recorded; thus, unfortunately, I am using other people’s ideas/concepts without proper acknowledgement. I don’t claim anything in this article as my original work and if you know of any authoritative sources on this content then please leave a comment.

There is a plethora of other sub-directories that could be included in a template project. For example, a figures directory for high resolution publication-ready images is a worthy inclusion. I am interested in how do YOU structure your projects; leave a comment and let me know what tools you use and how you increase throughput to ease your workload.

Use RStudio Server in a Virtual Environment with Docker in Minutes!

A fundamental aspect of the reproducible research framework is that (statistical) analysis can be reproduced; that is, given a set of instructions (or a script file) the exact results can be achieved by another analyst with the same raw data. This idea may seem intuitive, but in practice it can be difficult to achieve in an analytical environment that is always evolving and changing.

For example, Tidyverse users will have recently seen the following warning in R:

Warning message:
'data_frame()' is depreciated, use 'tibble()'.

data_frame(a = 1:3)

This is because dplyr has changed, hopefully for the better, and tibble() is now the preferred route. It is conceivable that in a future release the data_frame() function will no longer be a part of dplyr at all.

So, what does this have to do with reproducible research? Say you want to come back to your analysis in 5 years and re-run your old code. Your future installation of R with the latest Tidyverse installation may not be backwards compatible with your now ancient old code and the analysis will crash.

There are a couple of strategies that we could use to deal with updating dependencies. The first, and possibly the easiest, is to use devtools to install older package versions (see here for further details). But what if the version of the package you used is not archived on CRAN? For example, the analysis could have used a package from GitHub that has since changed and the maintainer hasn’t used systematic releases for you to pull from. Thus, this strategy is quite likely to fail.

Another solution is to put your entire project inside a static virtual environment that is isolated from the rest of your machine. The benefit of project isolation is that any dependencies that are installed within the environment can be made persistent, but are also separated from any dependency upgrades that might be applied to your host machine or to other projects.

Docker isn’t a new topic for regular R-Bloggers readers, but for those of you that are unfamiliar: Docker is a program that uses virtual containers, which isolate and bundle applications. The containers are defined by a set of instructions detailed in a docker-compose.yml or Dockerfile and can effectively be stacked to call upon the capabilities of base images. Furthermore, one of the fundamental tenets of Docker is portability – if a container will work on your local machine then it will work anywhere. And because base images are versioned, they will work for all time. This is also great for scalability onto servers and distribution to colleagues and collaborators who will see exactly what you have prepared regardless of their host operating system.

Our dockerised RStudio environment.

Our group at the Telethon Kids Institute has prepared source code to launch a dockerised RStudio Server instance via a NGINX reverse proxy enabled for HTTPS connections. Our GitHub repository provides everything you need to quickly establish a new project (just clone the repository) with the features of RStudio Web Server with the added benefit of SSL encryption (this will need some local configuration); even though RStudio Server is password protected, web encryption is still important for us as we routinely deal with individual level health data.

(Important note, even with data security measures in place, our policy is for patient data to be de-identified prior to analysis as per good clinical practice norms and, except for exceptional circumstances, we would use our intranet to further restrict access).

The defining docker-compose.yml that we use can be found at out our GitHub site via this link: We used an RStudio base image with Tidyverse pre-installed with R 3.5.3, which is maintained by rocker at As updates are made to the base-image, the repository will be updated with new releases that provide an opportunity to re-install a specific version of the virtual environment. It has also been set up with persistent volumes for the projects, rstudio, and lib/R directories to keep any changes made to the virtual environment for back-up and further version control.

By combining tools like Docker and Git we believe we can refine and make common place, within our institute and those we collaborate with, a culture of reproducible research as we conduct world class research to improve the lives of sick children.

Automating R-markdown Tables With Hooks

KableExtra was chosen as the primary way to format tabulated data in the HTML R markdown templates used at Telethon Kids Institute, see the Biometrics R package and this article. I like the tables produced by kableExtra: they look tidy and the package has a feature that highlights table rows that the pointer is hovering over. Moreover, it is very easy to use the package, which is invoked by adding the following 2 lines of code (via. dplyr syntax with the R iris data set):

Note: this article was first posted on the Telethon Kids blog in January, 2019, but has been re-posted as a contribution to R-bloggers.

head(iris) %>%
kable("html") %>%
kable_styling("hover", full_width = F)

This is all well and good, and I would be mostly happy to leave the template here and move on. BUT. The motivation behind these R markdown templates was to streamline as much of the formatting code as is possible; this is to help to maintain a constant theme between the reports/communications produced by Telethon Kids’ researchers, staff and students. I also like my code to be as DRY as possible.

The requirement of adding these 2 lines of code at the end of each table chunk must be fixed!

As a starting point, the default output of a data.frame() in an R markdown document is like this:

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

I approached the problem by using code suggested in this old stack overflow answer by Konrad Rudolph. Konrad used the the knitr::knit_hooks() function to re-parse the chunk output and prepare the string for further manipulation.

The difficulty in formatting the chunk output is that character data is passed to the hook. The workflow first needed to remove the ##’s from each line with gsub() then re-parse the table with read.table(), before adding in the formatting code. This feels messy to me and it fails for wide tables that have rows broken into multiple lines (which was acknowledged in the original answer from 2013).

With a night to dwell on my problem, I realised/remembered that R markdown also sends the chunk source code to knitr for display, which can also be picked up and manipulated by a hook!

My solution, which is very similar to that on stack overflow (but better), takes the source code and uses eval() to evaluate the code and append the formatting lines from kableExtra. This is my knitr::knit_hooks() code:

default_source_hook <- knit_hooks$get("source")
source = function(x, options) {
default_source_hook(x, options)
else {
eval(parse(text = x)) %>%
kable("html") %>%
kable_styling("hover", full_width = F)

which is invoked by:

    ```{r table_format, results = "hide", table = T, eval = F}


and produces a table identical to the first one illustrated in this article.

Enabled via the chunk option table = T, the hook applies formatting only to the chunks that I want formatted with kableExtra. Since the source code is re-evaluated in the knitr hook, the option eval = F should be included in the chunk header so the chunk isn’t evaluated twice; however, this double evaluation wouldn’t be a problem for small tables.

Finally, because the table is displayed using source, and output isn’t touched, the option results = "hide" should be included so that the default table is excluded from the final knit.

Mission accomplished!? No… Unfortunately there are still a couple of problems with this solution:

  1. You are unable to display the code used to generate the table in knitted document (echo = T), because the source code string was changed, not the output string.
  2. I’m replacing 2 lines of code with 3 chunk options – although not a deal breaker (as you may have set these options anyway) it’s not quite as auto-magical as I would have liked.

At the time of writing I still haven’t got an accepted answer to this problem, which I posted on stack overflow. If you have a better solution – or want to up-vote my question – then you know what to do.

This post has been shared with

Full example .Rmd file

title: "Untitled"
author: "Paul"
date: "27 September 2018"
output: html_document

```{r setup, include = F}


default_source_hook <- knit_hooks$get('source')

  source = function(x, options) {
      default_source_hook(x, options)
    else {
      eval(parse(text = x)) %>%
        kable("html") %>%
        kable_styling("hover", full_width = F)



## Normal
With no chunk options:

```{r normal}


## The desired ouptut
With chunk options `results = "hide"` and `table = T`:

```{r table_format, results = "hide", table = T, eval = F}


## It still work as normal for other other output types
With no chunk options:

```{r image}
iris %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, group = Species)) +


Deploying an R Shiny App With Docker

If you haven’t heard of Docker, it is a system that allows projects to be split into discrete units (i.e. containers) that each operate within their own virtual environment. Each container has a blueprint written in its Dockerfile that describes all of the operating parameters including operating system and package dependencies/requirements. Docker images are easily distributed and, because they are self-contained, will operate on any other system that has Docker installed, include servers.

When multiple instances/users attempt to start a Shiny App at the same time, only a single R session is initiated on the serving machine. This is problematic. For example, if one user starts a process that takes 10 seconds to complete, all other users will need to wait until that process has completed before any other tasks can be processed.

One of the benefits of deploying a containerised Shiny App is that each new instance will run in its own R session.

In this article, I will go through the steps that I took to deploy an app developed in Shiny onto a fully-functional web server. All the code here, plus some basic web-server configuration, can be found at this Telethon Kids GitHub repository.

Getting Started

Docker can be installed following the instructions here, making sure to also follow the post-installation instructions here. Docker-compose will also need to be installed by following these instructions here. For the example presented here, Docker was installed on an Ubuntu 18.04 OS in a Virtual Box; the same approach will be used for other apps running on an AWS EC2 instance.

I will be using the words image and container throughout this article. A Docker image is a functioning snapshot of the blueprint. A running image is called a container, which is operating by receiving, processing and sending information to the client or other containers.


The Dockerfile contains the schematics of a Docker container, that is it is used to call a base image and define and customisations that need to be made for the specific application to run correctly.

This is the Dockerfile that describes our Shiny App (i.e. the Default app when a new “Shiny Web App …” is created in RStudio):

FROM rocker/shiny:3.5.1

RUN apt-get update && apt-get install libcurl4-openssl-dev libv8-3.14-dev -y &&\
mkdir -p /var/lib/shiny-server/bookmarks/shiny

# Download and install library
RUN R -e "install.packages(c('shinydashboard', 'shinyjs', 'V8'))"

# copy the app to the image COPY shinyapps /srv/shiny-server/
# make all app files readable (solves issue when dev in Windows, but building in Ubuntu)
RUN chmod -R 755 /srv/shiny-server/


CMD ["/usr/bin/"]

This blueprint is using the base image rocker/shiny built on R version 3.5.1. For some packages to run properly on an Ubuntu OS (the container’s base operating system) I needed to install libcurl4 and libv8 by adding the following lines to our Dockerfile.

RUN apt-get update &&\
apt-get install libcurl4-openssl-dev libv8-3.14-dev -y &&\
mkdir -p /var/lib/shiny-server/bookmarks/shiny

The packages that the app depends on are also installed via. the Dockerfile with the RUN statement:

RUN R -e "install.packages(c('shinydashboard', 'shinyjs', 'V8'))"

Our app.R file (i.e. the Shiny App) is copied from the shinyapps directory on the host PC to a folder inside the container image with the COPY statement. See the GitHub repo for an example of the project directory structure.

I had some file permission issues while building on Ubuntu via. a Windows VM. To overcome this, all copied file permissions were changed with chmod -R 755 to ensure they were readable and executable inside the container.

# Copy the app to the image
COPY shinyapps /srv/shiny-server/

# Make all app files readable
RUN chmod -R +r /srv/shiny-server/

The final two lines expose port 3838 to receive incoming traffic and tell docker how to start the app. With the Dockerfile complete, images are easily built with the following one-liner:

docker build -t telethonkids/new_shiny_app ./path/to/Dockerfile/directory

and can be run as a standalone container with:

docker run --name=shiny_app --user shiny --rm -p 80:3838 telethonkids/new_shiny_app

and accessed in a browser at


The story doesn’t end there. ShinyProxy can be easily set up in another container to facilitate container creation. An image for a ShinyProxy container is defined by the following Dockerfile, this is an example from ShinyProxy’s GitHub repository for a containerised deployment:

FROM openjdk:8-jre

RUN mkdir -p /opt/shinyproxy/
RUN wget -O /opt/shinyproxy/shinyproxy.jar
COPY application.yml /opt/shinyproxy/application.yml

WORKDIR /opt/shinyproxy/
CMD ["java", "-jar", "/opt/shinyproxy/shinyproxy.jar"]

The apps that are to be hosted are defined in the application.yml file. This contains the information for ShinyProxy to launch Shiny Apps and needs to be copied into the container from the host with COPY application.yml /opt/shinyproxy/application.yml.

The following code snippet has only one Shiny App, but multiple app configurations can be added to specs.

title: Telethon Kids Institute Shiny Apps
hide-navbar: false
landing-page: /
heartbeat-rate: 10000
heartbeat-timeout: 600000
port: 8080
internal-networking: true
- id: shiny_app
display-name: New Shiny App
description: The default app when initiating a new shiny app via. RStudio
container-cmd: ["/usr/bin/"]
container-image: telethonkids/new_shiny_app
container-network: tki-net
user: "shiny"

This ShinyProxy set up listens for traffic on port 8080 and will time-out after a period of inactivity, which has been set to 10 minutes.

heartbeat-rate: 10000
heartbeat-timeout: 600000

internal-networking: true must be set because ShinyProxy is being run on a container on the same host as the Shiny App. The configuration for the each app is listed under specs, which give instructions on how to launch the container, what Docker image to use to start the container and what network it should be listening on (note that the container-network must be the same as the network listed in the docker-compose.yaml file, more to come soon).

container-cmd: ["/usr/bin/"]
container-image: telethonkids/new_shiny_app
container-network: tki-net

We also have some environment variables that we want to set so the R session is being run for the user shiny and logs are not recorded.

user: 'shiny'


The docker-compose.yml file contains a series of instructions that can be used to build and launch all the containers needed for deployment in a single or multi-container project. The benefit of this is that a complex network of containers and networks can be version controlled, and launched with one line of code.

version: "3.6"
      - nginx
      - influxdb
    image: telethonkids/shinyproxy
    container_name: tki_shinyproxy
    restart: on-failure
      - tki-net
      - ./shinyproxy/application.yml:/opt/shinyproxy/application.yml
      - /var/run/docker.sock:/var/run/docker.sock
      - 8080:8080

    name: tki-net

Docker uses networks to enable communication between containers. For ShinyProxy to communicate properly with the Shiny App, the network specified in docker-compose.yml must be the same as the same as that listed in application.yml. (Tip, if you’re using a docker-compose file to launch the app, don’t set up the docker network manually, see here for why).

Each container listed in services: is named in container_name, with the Dockerfile being pointed to via. the context and Dockerfile variables.
If the container was terminated with an error code, the restart line instructs Docker what action is to be taken . The networks variables lists the Docker network that the container has access to. Finally, the app is exposed on port 80.

The network used by all applications in this docker-compose must match the network specified in the ShinyProxy application.yml.

name: telethonkids-net

With everything in place, the ShinyProxy image is built and the container(s) started with docker-compose build and docker-compose up -d. (Important, the app defined in application.yml also be built). Containers are stopped and removed with docker-compose down.

With the configurations listed here, a full-functioning Shiny App can be deployed to the internet with little extra configuration. Even simpler, if this setup can be cloned from GitHub and the files in “webapp/shinyapp” replaced by your Shiny App’s app.R.

Bonus Tip

Another app that I made used CSS to do some custom formatting (only 4 tags). For some reason the style sheet was ignored by Chrome when run with ShinyProxy, but not when running standalone. The problem did not exist when tested on Firefox and Internet Explorer browsers. I think this had something to do with the iframe used by ShinyProxy. This was resolved by putting the style in app.R within a head tag and removing the reference to the external style.css.