14.1 Statistics, Data Analysis, and Mathematics
Most people think of statistics and data analysis as part of the field of mathematics. While statistics and data analysis use a great many mathematical ideas, they differ from mathematics in many ways. The development of statistics and data analysis out of mathematics in the nineteenth and twentieth centuries parallels the development of physics in the seventeenth and eighteenth centuries and of computer science in the nineteenth and twentieth centuries. Each of these disciplines includes vast applications of mathematics (physics uses fields such as calculus, differential geometry, and Hilbert spaces; computer science often uses set theory, abstract algebra, and numerical analysis; and statistics and data analysis make use of topics such as probability, real analysis, and linear algebra). However, the language, goals, methods, and culture of these disciplines are distinct from one another.
Since the middle of the nineteenth century, the field of mathematics has moved from the study of quantities to the study of abstract structures, built upon the foundation of logic and set theory. Statistics as a discipline, on the other hand, is grounded in the study of quantities within their original context. George Cobb and David Moore (1997) describe the differences well.
Although mathematicians often rely on applied context both for motivation and as a source of problems for research, the ultimate focus in mathematical thinking is on abstract patterns: the context is part of the irrelevant detail that must be boiled off over the flame of abstraction in order to reveal the previously hidden crystal of pure structure. In mathematics, context obscures structure. Like mathematicians, data analysts also look for patterns, but ultimately, in data analysis, whether the patterns have meaning, and whether they have any value, depends on how the threads of those patterns interweave with the complementary threads of the story line. In data analysis, context provides meaning. (p. 803)
In many ways, one can think of data analysis as a process that involves the study of quantitative patterns within a situated context. Just as cooking takes many different raw ingredients and combines them in unique ways, using many different tools, to create something to eat, data analysis takes raw data from a particular context and cleans it, transforms it, analyzes it, and reconfigures the resulting information so that it can be consumed to better understand that context and to make decisions about it. Hence, data analysis is an inherently interdisciplinary, process-focused activity.
A statistic is a quantity computed from a collection of values. Some examples are a mean, median, or standard deviation computed from a set of numbers. A statistic could also be the percentage of people who have a driver’s license, or the number of people in a building at a certain time of day. It could even be the probability that someone will be diagnosed with cancer based upon other measurable health, physical, or sociological variables. The science of statistics uses these various quantities, together with ideas from the theory of probability distributions, to process data and represent it in different ways. We can therefore think of statistics as a methodological discipline comprising a vast set of tools used to analyze and interpret data.
A descriptive statistic is a summary statistic that quantitatively describes a collection of data. Descriptive statistics is the process of using such statistics to understand data and the context from which the data is derived. This could include describing properties of the data using graphs or tables, or describing how two variables are related using scatter plots and correlation coefficients. A key aspect of descriptive statistics is that they describe the data at hand; they do not, by themselves, tell us anything about a larger population.
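To make the idea concrete, the following is a minimal sketch, in Python, of computing a few descriptive statistics. The data set (exam scores and hours studied) is entirely made up for illustration, and the availability of NumPy is assumed.

```python
# A minimal sketch of descriptive statistics using made-up data;
# the numbers here are purely illustrative.
import numpy as np

# Hypothetical exam scores for one class
scores = np.array([72, 85, 91, 68, 77, 88, 95, 81, 74, 89])

print("mean:  ", np.mean(scores))         # average score
print("median:", np.median(scores))       # middle value when sorted
print("std:   ", np.std(scores, ddof=1))  # sample standard deviation

# Describing how two variables are related: hours studied vs. score
hours = np.array([5, 8, 10, 3, 6, 9, 12, 7, 4, 10])
r = np.corrcoef(hours, scores)[0, 1]      # Pearson correlation coefficient
print("correlation between hours studied and score:", round(r, 2))
```

Each printed value summarizes only this particular collection of numbers; nothing in the sketch says anything yet about students beyond those in the data.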
Inferential statistics uses samples to infer properties of a larger population, while descriptive statistics focuses on properties of the observed data. As such, inferential statistics takes a descriptive statistic computed from a sample (a sample statistic), along with assumptions about the underlying probability distribution of the population of interest, and uses it to estimate a population parameter.
Example 14.1 Assume we are interested in exploring how height varies between boys and girls in the United States as they age. We ask our fellow teachers to collect data on age (in months) and height (in inches) for us, so we end up with data from 429 students between the ages of 11 and 17 from a single school district in the United States.
(Population versus sample:) The population is the universe of things that fit the criteria of what you want to study; it is also described as the population of interest or the population of inference. The sample is the set of objects on which you actually have measurements. In our example, the population of interest is the set of students in the United States. Our sample is the students between the ages of 11 and 17 in the single school district in the United States from which we received data.
(Descriptive statistics versus inferential statistics:) The average height of the 12-year-old boys in the sample, along with its standard deviation, are descriptive statistics of the sample. They are also sample statistics used to estimate the population parameters: the mean and standard deviation of the heights of all 12-year-old boys in the United States.
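As a minimal sketch of this distinction, the following Python code computes descriptive statistics from a small set of invented heights for 12-year-old boys and then, under an assumed rough normality of heights, uses a t-based confidence interval to estimate the population mean. The data values, the sample size, and the use of NumPy and SciPy are assumptions for illustration; these are not the 429 measurements described above.

```python
# A minimal sketch of moving from a sample statistic to an estimate of a
# population parameter, using invented heights (in inches) for a handful
# of 12-year-old boys; the numbers and sample size are illustrative only.
import numpy as np
from scipy import stats

heights = np.array([58.5, 60.0, 61.2, 59.4, 62.1, 57.8, 60.7, 63.0, 59.9, 61.5])

# Descriptive statistics: these summarize the sample itself.
x_bar = np.mean(heights)        # sample mean
s = np.std(heights, ddof=1)     # sample standard deviation
n = len(heights)

# Inferential statistics: assuming heights are roughly normally
# distributed, a 95% t-interval estimates the population mean.
t_star = stats.t.ppf(0.975, df=n - 1)
margin = t_star * s / np.sqrt(n)
print(f"sample mean: {x_bar:.1f} in")
print(f"95% CI for the population mean: ({x_bar - margin:.1f}, {x_bar + margin:.1f}) in")
```

The sample mean and standard deviation describe only the sample; the interval is the inferential step, an estimate of the population parameter that depends on the stated distributional assumption.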
Since data analysis is a process discipline and statistics is a methodological discipline, both embedded in context, working within these disciplines is not a practice of solving problems, proving theorems, or getting results. It is instead a process of making arguments for certain conclusions based upon observing and analyzing data within the context from which the data is derived. This means that there are no ‘right’ answers, only strong or weak arguments.
The next few chapters focus on using statistical techniques in the data analysis process, so the distinction between these two terms will blur as we work in the overlap between them.
14.1.1 The Centrality of Variability
Variability underlies everything around us, particularly in quantitative situations. As such, variability is the foundation of statistics.
Individuals vary. Repeated measurements on the same individual vary. In some circumstances, we want to find unusual individuals in an overwhelming mass of data. In others, the focus is on the variation of measurements. In yet others, we want to detect systematic effects against the background noise of individual variation. Statistics provides means for dealing with data that take into account the omnipresence of variability. (Cobb & Moore, 1997, p. 801)
Certain situations lend themselves to a deterministic model using mathematical functions, such as computing the volume of a swimming pool based on certain assumptions about its dimensions. Estimating the cost of building a swimming pool, however, requires probabilistic models involving statistics. For example, one cannot know exactly how much concrete will be needed for the pool: there is variability in the mixing of the concrete, in the effects of temperature and humidity on its application, in the imperfect shaping of the forms used to pour it, and in many other factors. So when ordering the concrete, the contractor needs to take this variability into account and order an amount for which he is sufficiently confident he can complete the job, while not ordering so much excess that the cost of the project rises unnecessarily.
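As a minimal sketch of this kind of reasoning, the following Python simulation estimates how much concrete to order when both the pool's true volume and the waste rate vary. The dimensions, distributions, and confidence level are all made-up assumptions for illustration, not construction guidance.

```python
# A minimal Monte Carlo sketch of ordering concrete under variability,
# with hypothetical volumes, waste rates, and distributions.
import numpy as np

rng = np.random.default_rng(0)
n_sim = 100_000

# Nominal pool shell volume (cubic yards), with variability from imperfect
# forms and excavation, modeled here as a normal distribution.
shell_volume = rng.normal(loc=45.0, scale=2.0, size=n_sim)

# Waste from mixing, spillage, and weather, modeled as a percentage.
waste_fraction = rng.uniform(0.03, 0.10, size=n_sim)

needed = shell_volume * (1 + waste_fraction)

# Order enough to cover, say, 95% of simulated scenarios
# without grossly over-ordering.
order = np.percentile(needed, 95)
print(f"order about {order:.1f} cubic yards (covers ~95% of simulated jobs)")
print(f"average simulated need was {needed.mean():.1f} cubic yards")
```

The design choice in the sketch is to order at a high percentile of simulated need rather than at the average, trading a small amount of expected excess for confidence that the job can be finished.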
The more we learn about the world, the more we understand how probabilistic models do a better job of describing it than deterministic models. For that reason, we predict hurricane paths with a cone of uncertainty, and we can understand how the better sports team lost a game even though it had an 80% chance of winning.
The goal of statistics is to better understand and quantify the variability in a certain context. We can then interpret and apply this information as we study situations and improve our abilities to make decisions.
14.1.2 Exercises
1. Investigate the origins of the fields of data analysis and statistics and compare them to the history of the field of mathematics.
2. Find three unrelated careers that use data analysis as a key aspect of their daily work and describe them.
3. What are some possible contributors to variability in scores on a class exam?
4. In most news reports about the stock market, the rise or fall of a stock index is usually attributed to one or two key news events of the day. This represents a deterministic way of thinking. What would a similar report about the stock market look like if it reflected a more probabilistic way of thinking?
5. Write a short paragraph describing a scenario in which you would use a sample statistic to infer something about a population parameter. Clearly identify the sample, population, statistic, and parameter in your example. Be as specific as possible, and do not use any example discussed in the book or in class.