DATA TO INSIGHT
This is the official link to the course. The course requires 3 hours per week over 8 weeks.
This model shows the process of abstracting and solving a statistical problem to help solve a larger real problem. A knowledge-based solution to the real problem requires better understanding of how some things work.
“Garbage in, garbage out” is a standard catch-phrase in information technology, and it certainly applies to statistical data analysis today.
Useful data means more than just a set of answers from a questionnaire. Before we spend time analysing data in the real world, we need to answer these questions about its relevance and credibility:
- Who collected the data, where did they collect it and why?
- How did they measure or categorise the various factors the data addresses?
- How were the people (or other entities) selected?
- What did they do when people refused to answer?
- What did they do about missing values?
- If questions were asked over a period of years, how did they treat the data on people they lost contact with or who died?
- What allowances were made for periods when data could not be collected?
Some general data preparation rules
- You may wish to eliminate data which is:
* Incomplete e.g. critical pages of a questionnaire are missing.
* Not collected according to instructions e.g. the data was collected after the cut-off date or the respondent was not properly qualified.
* Not of interest e.g. there is no variation (all values identical).
- Make corrections to illegible, incomplete, inconsistent or ambiguous answers e.g. impossible dates of birth, membership of non-existent categories.
- All numeric values for a variable should use the same units, e.g. all minutes, not a mix of seconds and minutes.
- Manipulate the data where it requires weighting or scale transformations.
- Assign codes to answers to assist with ordering or other data manipulations.
- Reformat the data (e.g. names, scales, ordering) so as to make it accessible.
- Give each variable a unique name (preferably with letters and numbers only).
- Allow for annotations in the data e.g. total rows, documentation or comments.
- Verify that blanks are actually blank and not spaces.
- Remove units from data, e.g. 5 years/5 yrs/5yr should all be just 5 and $10 is just 10.
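A few of the rules above (blanks that are really spaces, units mixed into values, inconsistent units, columns with no variation) can be sketched in Python. This is only a rough illustration using the standard library. The function names, the sample rows and the list of supported time units are all made up for the example and do not come from the course. The number extraction is deliberately naive: it takes the first number it finds and does not handle thousands separators or ranges.

```python
import re

# Assumption: a value is "a number" if it looks like 5, -5 or 12.50.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def normalise_blank(value):
    """Return None for empty or whitespace-only text, else the stripped text."""
    text = value.strip()
    return text if text else None


def strip_units(value):
    """Pull the first number out of text such as '5 yrs' or '$10'."""
    match = _NUMBER.search(value)
    return float(match.group()) if match else None


def to_minutes(amount, unit):
    """Convert a duration to minutes (hypothetical, limited unit list)."""
    seconds_per_unit = {"seconds": 1, "minutes": 60, "hours": 3600}
    return amount * seconds_per_unit[unit] / 60


def constant_columns(rows):
    """Names of columns whose value is identical in every row."""
    if not rows:
        return []
    return [name for name in rows[0] if len({row[name] for row in rows}) == 1]


# Invented sample data for illustration only.
rows = [
    {"age": "5 years", "fee": "$10", "note": "   ", "site": "A"},
    {"age": "5yr", "fee": "$12.50", "note": "late", "site": "A"},
]

cleaned = [
    {
        "age": strip_units(row["age"]),
        "fee": strip_units(row["fee"]),
        "note": normalise_blank(row["note"]),
        "site": row["site"],
    }
    for row in rows
]

no_variation = constant_columns(cleaned)
```

Here `no_variation` lists the columns you might consider dropping as "not of interest". Whether a constant column is really uninteresting is still a judgment call that depends on the question being asked.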