Top 5 Best Practices for Data Collection and Storage in Microsoft Excel

As a biostatistician, my preferred way to receive data from a client is a csv extract from a well-designed database that included validation with the data-entry process. However, 95% of the data I see comes as an Excel file.

I have just sent an email to a client with some tips for using Excel to collect and store their research data. These 5 steps are my recommendations that save me a lot of headache when importing data from .xlsx files into R for analysis:

  1. (basic) Only enter a single piece of information into a cell. For example, if the column is “number”, then 365 is the only thing in the cell, do not do this: “365 (children only)”. If you need to leave a comment, then put it in another column.
  2. (basic) Keep column names simple, my preference is up to a maximum of 3 words, separated by underscores, all lower caps. If you can, keep column names meaningful to a human reader. Also, only use Row 1 for column names/information. Extra information about the data contained within a specific column should be placed in the data dictionary, which is in a different worksheet.
  3. (basic) Include a data dictionary that should have at least 5 columns: 1| column name (point 2), 2| Description, 3| data type (e.g. dichotomous (yes/no, true/false), categorical (list the categories), numeric, text), 4| valid ranges of data (for example, valid entries for age could be between 0 and 110 years, if 200 was entered then I know it’s likely an error), and 5| if the value is numeric, provide the unit of measurement (e.g. years, days, cm, kg, …).
  4. (advanced) Use Excel’s data validation functions – especially to make drop-down lists for categorical responses.
  5. (intermediate) If you have any date columns, change the cell data type to “Text” (the same for phone numbers).

Note: Always, any information that can be used to identify an individual should be removed from the file before it gets sent to a statistician.

If everyone followed these 5 steps, then 80% of my work (and time) would be freed up for analysis and communicating results. I know there must be other simple pieces of advice. What have I missed? What other simple advice do you give your colleagues/clients about data entry into Microsoft Excel?

Photo by Scott Graham on Unsplash

2 thoughts on “Top 5 Best Practices for Data Collection and Storage in Microsoft Excel

  1. 1. Avoid using special characters (except underscore) for column names
    2. Every data should have a unique identifier (preferably a number, which is unique to each row / respondent). This helps to co-ordinate, with the “data supplier”, if we have any queries regarding any rows.
    3. Good practice to tell the “receiver” about how many rows and columns are there in the data file.

    Liked by 1 person

  2. Another recommendation could be to organize all data in a single table rather having separate tables for each sub-population. For example, rather than a table for females and another for males, have a single table with gender column.

    Liked by 1 person

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s