DATA TO INSIGHT

This is the official link to the course. The course requires 3 hours per week over 8 weeks.

This model shows the process of abstracting and solving a statistical problem to help solve a larger real problem. A knowledge-based solution to the real problem requires a better understanding of how the things involved actually work.

“Garbage in, garbage out” is a standard catch-phrase in information technology, and it certainly applies to statistical data analysis today.

Useful data means more than just a set of answers from a questionnaire. Before we spend time analysing data in the real world, we need to answer these questions about its relevance and credibility:

  • Who collected the data, where did they collect it and why?
  • How did they measure or categorise the various factors the data addresses?
  • How were the people (or other entities) selected?
  • What did they do when people refused to answer?
  • What did they do about missing values?
  • If questions were asked over a period of years, how did they treat the data on people they lost contact with or who died?
  • What allowances were made for periods when data could not be collected?

 

Some general data preparation rules

  1. You may wish to eliminate data that is:
    * Incomplete, e.g. critical pages of a questionnaire are missing.
    * Not collected according to instructions, e.g. the data was collected after the cut-off date or the respondent was not properly qualified.
    * Not of interest, e.g. there is no variation (all values identical).
  2. Correct illegible, incomplete, inconsistent or ambiguous answers, e.g. impossible dates of birth or membership of non-existent categories.
  3. All numeric data should use the same units, e.g. all minutes rather than a mix of seconds and minutes.
  4. Manipulate the data where it requires weighting or scale transformations.
  5. Assign codes to answers to assist with ordering or other data manipulations.
  6. Reformat the data (e.g. names, scales, ordering) so as to make it accessible.
  7. Give each variable a unique name (preferably with letters and numbers only).
  8. Allow for annotations in the data e.g. total rows, documentation or comments.
  9. Verify that blanks are actually blank and not spaces.
  10. Remove units from data, e.g. 5 years/5 yrs/5yr should all be just 5 and $10 is just 10.
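
A few of the rules above are mechanical enough to sketch in code. The Python below is a minimal illustration of rules 1, 3, 9 and 10 using only the standard library. It is not part of the course material: the helper names (clean_blank, strip_units, to_minutes, has_variation) and the sample values are hypothetical, and the sketch assumes each cell holds at most one number.

```python
import re

# Matches the first number in a cell, e.g. "5" in "5yr" or "10" in "$10".
# Assumption: each cell holds at most one number.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

# Seconds per unit, used to express every duration in minutes (rule 3).
_SECONDS_PER_UNIT = {"seconds": 1, "minutes": 60, "hours": 3600}


def clean_blank(value):
    """Rule 9: treat empty or whitespace-only cells as missing (None)."""
    if value is None:
        return None
    stripped = value.strip()
    return stripped if stripped else None


def strip_units(value):
    """Rule 10: keep only the numeric part of a cell, dropping the units."""
    value = clean_blank(value)
    if value is None:
        return None
    match = _NUMBER.search(value)
    return float(match.group()) if match else None


def to_minutes(amount, unit):
    """Rule 3: convert a duration to minutes so all values share one unit."""
    return amount * _SECONDS_PER_UNIT[unit] / 60


def has_variation(values):
    """Rule 1: a column whose non-missing values are all identical
    carries no information and may be dropped."""
    present = [v for v in values if v is not None]
    return len(set(present)) > 1


# Hypothetical raw cells, as they might arrive from a questionnaire.
raw = ["5 years", "5 yrs", "5yr", "   ", "$10"]
cleaned = [strip_units(cell) for cell in raw]
print(cleaned)
```

Here the whitespace-only cell becomes None instead of being mistaken for a value, and the three spellings of "5 years" collapse to the same number. In real data the unit itself often needs to be recorded somewhere (for example in the variable name) before it is stripped from the cells.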

 
