15.1 Types of Data
Data can take many different forms, and recognizing the form is a core part of the data analysis process. The data we collect or analyze is usually made up of cases, the objects in the collection that are the intended unit of analysis. These might be the students in a classroom, the houses in a neighborhood, or the individuals who filled out a survey. Cases are often labeled with a name or number to distinguish them from one another. Sometimes the data has been summarized to give the number of cases with a certain property, rather than listing each case individually. For example, we might know that in our classroom the favorite color of 8 students is red, 7 students like green, 2 students like blue, and 4 students prefer purple.
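The two forms of the classroom data, one entry per case versus counts per category, can be sketched in R. The counts come from the example above; the object names (`favorite_color`, `color_counts`) are made up for illustration.

```r
# One entry per case (student); the counts match the classroom example.
favorite_color <- rep(c("red", "green", "blue", "purple"), times = c(8, 7, 2, 4))

# Number of cases (students) in the data set
n_cases <- length(favorite_color)

# Collapse the case-by-case listing into the number of cases per category
color_counts <- table(favorite_color)
print(color_counts)
```

Note that `table()` lists the categories alphabetically, not in the order they were entered.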
For each of the cases in our data set, there corresponds one or more attributes, called variables. These variables can be verbal or numerical descriptions of some property that varies among the cases studied. Using a code book is extremely useful to keep track of the cases in a study and the variables included.
Example 15.1 At various points in this text we will use the mosaic package in R, along with the mosaic data sets in the mosaicData package. Here is the code book for one of the data sets that includes SAT scores. While the type of variable is not listed, it can easily be inferred from the variable description.
State by State SAT data
Description: SAT data assembled for a statistics education journal article on the link between SAT scores and measures of educational expenditures
Format: A data frame with 50 observations on the following variables.
- state: a factor with names of each state
- expend: expenditure per pupil in average daily attendance in public elementary and secondary schools, 1994-95 (in thousands of US dollars)
- ratio: average pupil/teacher ratio in public elementary and secondary schools, Fall 1994
- salary: estimated average annual salary of teachers in public elementary and secondary schools, 1994-95 (in thousands of US dollars)
- frac: percentage of all eligible students taking the SAT, 1994-95
- verbal: average verbal SAT score, 1994-95
- math: average math SAT score, 1994-95
- sat: average total SAT score, 1994-95
The code book is usually written in some type of text document (often a .txt file) that can be read by anyone that may later want to use the data set. The data for an investigation is usually cleaned and stored in a spreadsheet (often in a .csv format to make it easier to transfer between software for analysis). By having the code book stored with the spreadsheet, the spreadsheet can have all of the extra information removed to make it easier to use for analysis.
In the code book, we often describe the cases, including the number of cases, and the techniques used to collect the data. The variables for the cases are then each listed with their variable name. This is usually a single word or a string of words connected by underscores, e.g. arm_span. A descriptive name, rather than a single letter, helps clarify what each variable represents during the analysis process, and most statistical software does not allow spaces in a variable name. The code book then has a description of the property described by the variable, the possible values for the variable and their meanings (for example, when describing the educational level of parents we sometimes use 0 = did not graduate high school, 1 = high school diploma, 2 = some college, 3 = college degree), and the units of measurement (if applicable). You would also want to include any additional information about the variable that you may need to remember as you work through the data analysis process. Remember that the context and information about the cases and variables need to be at the forefront of your mind throughout the entire investigative process.
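As a minimal sketch of how a code book entry gets used during analysis, the parent education codes above can be turned back into their meanings in R. The raw values and the names `parent_educ_code` and `educ_labels` are hypothetical; only the code-to-meaning correspondence comes from the text.

```r
# Hypothetical raw values as they might appear in the spreadsheet
parent_educ_code <- c(2, 0, 3, 1, 1, 3)

# Code book entry: the meaning of codes 0, 1, 2, 3, in that order
educ_labels <- c("did not graduate high school", "high school diploma",
                 "some college", "college degree")

# Attach the meanings to the codes so output is readable without the code book
parent_educ <- factor(parent_educ_code, levels = 0:3, labels = educ_labels)

print(table(parent_educ))
```

Keeping the labels in one place, here the vector `educ_labels`, mirrors the role of the code book: the spreadsheet stores only the compact codes.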
A primary purpose of the code book is to help others understand your data in future studies. This means that you need to define any abbreviations that may not be obvious to others and to explain things in much more detail than you naturally would. You will find that this process also helps you when you return to your data and have forgotten many more details than you expected.
Related Content Standards
(6.SPB.5) Summarize numerical data sets in relation to their context, such as by:
- Reporting the number of observations.
- Describing the nature of the attribute under investigation, including how it was measured and its units of measurement.
A key aspect of the variables that should be included in the code book is the type of data involved, as the type of data impacts the types of statistical techniques that can be applied.
15.1.1 Categorical
A categorical variable is one whose values fall into a finite number of categories, such as color, blood type, political party affiliation, zip code, or gender. Each category is often replaced with a single word or number in the corresponding data spreadsheet so that statistical software can run analyses. For example, in a study involving a sample from across the United States we may mark different regions with numbers (Southeast is a 1, Mid-Atlantic is a 2, etc.). The correspondence between these abbreviations and the original descriptions is included in the code book for the data set.
Since categorical data does not have an intrinsic ordering, it is called nominal. So if we have replaced each category with a number, we need to remember that the numbers are only labels and do not imply an order.
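A short R sketch shows why numeric codes for a nominal variable should not be treated as numbers. The codes 1 = Southeast and 2 = Mid-Atlantic follow the text; the code 3 = Midwest and the data values are invented for illustration.

```r
# Hypothetical region codes for six cases
region_code <- c(1, 2, 2, 3, 1, 2)

# Treated as numbers, the codes produce an "average region" with no meaning
mean_code <- mean(region_code)

# Stored as a factor, the codes become unordered labels
region <- factor(region_code, levels = 1:3,
                 labels = c("Southeast", "Mid-Atlantic", "Midwest"))

# Counting cases per category is a summary that does make sense
region_counts <- table(region)
print(region_counts)
print(is.ordered(region))
```

Here `is.ordered(region)` is `FALSE`, which matches the idea that a nominal variable has no intrinsic order.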
15.1.2 Ordinal
An ordinal variable is one whose possible values have some type of order, but where the differences between successive values are imprecise. A common example is the ranking of college football teams. While there is an order in the ranking of the teams, the difference between the second and third ranked teams is not likely to be the same as the difference between the third and fourth ranked teams. Another example is the order in which runners finish a race. There may be only 1 second between the first and second place finishers, while the difference between the seventh and eighth place finishers may be 20 seconds.
When analyzing ordinal data, the order is usually entered as an integer value, so it is important to label the variable as ordinal, rather than as a count or continuous, because the analyses run on ordinal data are very different from those run on other types of variables. Such analyses usually use non-parametric statistical techniques, since ordinal data cannot be assumed to follow a normal distribution.
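The race example can be sketched in R to show how integer-coded places hide unequal gaps. The finishing times are hypothetical, chosen so that the first and last gaps match the 1 second and 20 seconds mentioned above.

```r
# Hypothetical finishing times in seconds for eight runners
finish_time <- c(612, 613, 620, 631, 640, 655, 660, 680)

# Finishing place is the rank of the time: an ordinal variable stored as numbers
place <- rank(finish_time)

# Successive places always differ by exactly 1 ...
place_gaps <- diff(place)

# ... but the time gaps behind those places are far from equal
time_gaps <- diff(finish_time)

print(rbind(place_gaps, time_gaps))
```

Because the place values carry only order, arithmetic that relies on equal spacing (such as averaging places) can be misleading.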
15.1.3 Binary
A binary variable is one that has only two possible values. These are often 'yes/no', 'true/false', or 'correct/incorrect' type values. When stored in the corresponding spreadsheet, these are usually replaced with '0/1' options, with the correspondence between the number and the actual value of the variable described in the code book.
Sometimes binary variables have an order (like correct or incorrect on a test question), in which case they are a type of ordinal variable. Other times there is no order to the values of the variable and so it can be thought of as a categorical variable.
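As a small illustration of 0/1 coding in R, the sketch below recodes hypothetical answers to one test question. The responses and the choice of 1 = correct, 0 = incorrect are assumptions that would be recorded in the code book.

```r
# Hypothetical responses to one test question
response <- c("correct", "incorrect", "correct", "correct", "incorrect")

# Assumed code book entry for this variable: 1 = correct, 0 = incorrect
is_correct <- ifelse(response == "correct", 1, 0)

# With 0/1 coding, the sum counts the 1s and the mean is their proportion
n_correct <- sum(is_correct)
prop_correct <- mean(is_correct)

print(c(n_correct = n_correct, prop_correct = prop_correct))
```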
15.1.4 Binomial
A variable that records the number of successes out of \(N\) possible is binomial. A common example of a binomial variable is the number of heads obtained when flipping a coin 10 times. A more complex example is the number of patients with a disease in a sample of 15 patients randomly chosen at each of several hospitals. In this situation, the hospital is the case and the number of patients with the disease (out of 15) is the binomial variable.
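Both examples can be simulated in R with `rbinom()`, as a rough sketch. The number of hospitals (five), their labels, and the 0.2 chance that a sampled patient has the disease are assumed values for illustration only.

```r
set.seed(1)  # makes the simulated values reproducible

# Number of heads in 10 flips of a fair coin: one binomial observation, N = 10
heads <- rbinom(n = 1, size = 10, prob = 0.5)

# Five hypothetical hospitals (the cases), with 15 patients sampled at each.
# The 0.2 probability of disease is an assumption, not a real rate.
hospitals <- data.frame(
  hospital  = paste0("H", 1:5),
  n_sampled = 15,
  n_disease = rbinom(n = 5, size = 15, prob = 0.2)
)

print(heads)
print(hospitals)
```

Each row of `hospitals` is one case, and `n_disease` can never exceed `n_sampled`, which is what distinguishes a binomial variable from an open-ended count.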
15.1.5 Count
A count is very similar to a binomial variable, but it has no fixed maximum value. It could be something like the number of people standing in line at different check-out stations in a supermarket or the number of kids in each classroom of a school.
15.1.6 Continuous
When the value of a variable can range over a large number of possible values, and the differences between values have meaning, the variable is treated as continuous, or real-valued. Examples include temperatures, test scores, heights, and speeds. Variables are rarely continuous in the mathematical sense because of measurement restrictions. For instance, a thermometer may only read to the nearest degree, so mathematically the recorded values are discrete data. However, statistically we consider the data continuous if there could theoretically be a value between any two values.
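The thermometer example can be sketched in R. The "true" temperatures are hypothetical; `round()` stands in for an instrument that reads to the nearest degree.

```r
# Hypothetical true temperatures in degrees
true_temp <- c(20.14, 20.49, 21.73, 19.88, 20.51)

# A thermometer that reads to the nearest degree records only whole numbers
recorded_temp <- round(true_temp)

# The recorded values are discrete, but a value between any two of them
# is still possible in principle, so the variable is treated as continuous
print(recorded_temp)
print(diff(range(recorded_temp)))  # differences between values keep their meaning
```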
15.1.7 Exercise
For each of the following descriptions, determine a variable name, the type(s) of variable, the units of measurement, and the possible values. These do not all have a single correct answer, as not all of them are well defined, so you will likely need to make a case for some of your choices.
- age
- amount of time spent interacting with a screen each day
- number of views of a YouTube video
- height of students
- calories in a hamburger
- if a person is voting for a certain candidate
- number of wins for a football team during a season
Consider the following variables from the Census at School United States data:
- Region: Identifies the state the participant lives in. (50 possible values, 1 for each state)
- Planned education level: Indicates the highest degree a student intends to earn. (6 possible values: less than high school, high school, some college, undergraduate degree, graduate degree, other.)
- Reaction time: The amount of time, in seconds, it takes the participant to click the mouse after an image appears on a screen. (The range is theoretically from 0, not inclusive, to infinity.)
- Memory game score: Score on a memory assessment, with the score corresponding to the number of moves it takes to solve a memory puzzle. (Minimum score = 20, no theoretical maximum.)
- Favorite season: Name of student’s favorite season. (4 possible values.)
- School work pressure: The amount of pressure the students identify as experiencing in response to the question “How much pressure do you feel because of the schoolwork you have to do?” (4 possible values: none, very little, some, a lot)
Determine the types of variables for each of these.