
What do the statistics tell me? Statistical concepts, inference & analysis


This chapter is intended as an introduction to quantitative research. It first describes a range of methods that can be used to summarise and interpret statistical data. This is then followed by a description of more complex analytical methods, such as hypothesis testing and regression analysis.

More detailed guidance is contained in the Background Document (pdf - 673kb)

Descriptive statistics

Data in their raw form usually consist of many rows of information and it is therefore not possible to obtain any useful conclusions from simply inspecting the raw data. It is one of the tasks of the data analyst to make meaningful sense of such data, by employing techniques that summarise the data and show relationships within the data.

Inferential statistics

Inferential data analysis goes beyond describing and summarising data, by exploring and determining relationships between variables (or sub-groups). Rarely are we just interested in the relationship within the sample, but more often we want to know if such relationships can be generalised to the whole population.

Inferential data analyses involve trying to answer questions, such as, 'are differences between men's and women's incomes significant?' or, 'does smoking cause lung cancer?', etc.

Levels of measurement

Conventionally, variables can be measured at four different levels of measurement although in practice social scientists use only three of them.

Measures of central tendency

A range of measures is available to identify the centre (or average) of a set of values.

The arithmetic mean
The most common measure is the arithmetic mean, partly because it has useful statistical properties. The mean is calculated as the sum of all the values divided by the number of values. A simple example is given below in Figure 4.1:

Figure 4.1

The mean of the values (2,2,4,7,10) is:

mean = (2+2+4+7+10)/5 = 25/5 = 5

The median
The median is the value which is at the middle of the distribution and is defined so that half the observations are smaller, and half are larger than it.

The median is relatively unaffected by extreme values, which makes it a suitable measure of the 'average' for heavily skewed data, although it is more sensitive to sampling variability than the arithmetic mean. For the data in Figure 4.1, the median is the third of the five ordered values, which is 4.

The mode
The mode is the most frequent value of a distribution. In the above example the modal value is 2.

Sometimes a distribution has more than one value with a similarly large number of observations. Such a distribution is called bimodal if there are two modal values, or multimodal if there are more.
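The three measures of central tendency can be checked for the values in Figure 4.1 with a short sketch. It uses Python's standard `statistics` module; the variable names are illustrative only.

```python
import statistics

values = [2, 2, 4, 7, 10]  # the values from Figure 4.1

mean = statistics.mean(values)        # sum of the values divided by their number
median = statistics.median(values)    # middle value of the ordered data
modes = statistics.multimode(values)  # list of the most frequent value(s)

print("mean:", mean)
print("median:", median)
print("mode(s):", modes)
```

`multimode` returns a list, so a bimodal or multimodal set of values would simply produce more than one entry.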

Measures of variability

In addition to estimates of the average value, a measure of the spread of the values is also often reported.

Variance and standard deviation
The measure most commonly used to summarise the spread of continuous data (such as height) is the variance, as it has the most useful statistical properties. It is defined as the average of the squared distances of each value from the mean.

The square root of the variance is called the standard deviation (s or SD). A value of the variance (or standard deviation) that is close to zero indicates that all the values in the sample are approximately the same and a large value implies that the observations vary widely.

Range
The range is the difference between the largest and smallest values and hence is likely to depend on the sample size (it will increase as the sample size increases) and be sensitive to extreme values. This is one of the weaknesses of using the range as a measure of variation.

Inter-quartile range
The inter-quartile range is a more stable measure of the spread, being the difference between the 75th and 25th percentiles. It is often used as an alternative measure of 'range' as it is largely unaffected by extreme values. However, this measure does not share the useful statistical properties of the variance and so is less frequently used.
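As a rough illustration, the measures of variability described above can be computed for the same five values used in Figure 4.1. The sketch uses Python's standard `statistics` module. Two points are assumptions rather than part of the text: the variance is computed by dividing by the number of values (matching the definition given above), and the quartiles use the module's default interpolation rule, which is only one of several conventions.

```python
import statistics

values = [2, 2, 4, 7, 10]

# Variance as the average squared distance from the mean (divide by n).
variance = statistics.pvariance(values)
sd = statistics.pstdev(values)  # square root of the variance

# Many packages instead divide by n - 1 when the data are a sample.
sample_variance = statistics.variance(values)

value_range = max(values) - min(values)

# Quartiles: the cut points that split the ordered data into four parts.
q1, _median, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

print(variance, sd, sample_variance, value_range, iqr)
```

Because quartile conventions differ between packages, the inter-quartile range reported for a small data set can vary slightly from one tool to another.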

The Normal Distribution

The normal distribution is an important distribution in statistics as many 'natural' phenomena (e.g. height and weight) are normally distributed. The normal distribution has a distinctive 'bell' shape (Figure 4.2).

It is determined by two parameters, the mean (mu) and the standard deviation (sigma). Once we know these values, then we know all we need about that distribution.

Figure 4.2

The standardised normal distribution (diagram showing the distinctive 'bell' shape of the normal distribution).

Using the properties of the normal distribution, it is possible to calculate the probability of obtaining a measure above or below any given value. It is usual to standardise the variable of interest to have zero mean and variance equal to one (done by subtracting the mean and then dividing by the standard deviation) so that the well-known properties of the standard normal distribution can be utilised. The standard normal distribution is the normal distribution with mean equal to zero and variance equal to one.

Probabilities associated with this distribution are available in published tables but it is worth noting that in the standard normal distribution the area under the normal curve takes a particular form:

Z-scores
There are many different normal curves each with different means and standard deviations. The Standard Normal Distribution has a mean of 0 and a standard deviation of 1. Standardising normal distributions to the Standard Normal Distribution facilitates comparisons.

Z-scores are a useful way of standardising variables so that they can be compared. Standardisation allows us to compare variables with different means and/or standard deviations and scores expressed in differing original units. To standardise the values of a variable, we need to take the difference between each value and the variable mean and divide each difference by the variable's standard deviation. Statistical tables of the Standard Normal Distribution can then be used to calculate the probability of getting a smaller (or larger) z-score than the one actually obtained. Thus z-scores can also help us to identify extreme values (outliers).
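The standardisation step can be sketched as follows. The example reuses the five values from Figure 4.1 purely for illustration and relies on `NormalDist` from Python's standard `statistics` module in place of published tables.

```python
from statistics import NormalDist, mean, pstdev

values = [2, 2, 4, 7, 10]  # illustrative data only

mu = mean(values)
sigma = pstdev(values)

# z-score: (value - mean) / standard deviation
z_scores = [(v - mu) / sigma for v in values]

std_normal = NormalDist()  # mean 0, standard deviation 1

# Probability of obtaining a smaller z-score than each observed one.
p_smaller = [std_normal.cdf(z) for z in z_scores]

# Area under the standard normal curve between -1.96 and +1.96.
central_area = std_normal.cdf(1.96) - std_normal.cdf(-1.96)

for v, z, p in zip(values, z_scores, p_smaller):
    print(f"value={v:>2}  z={z:+.2f}  P(Z < z)={p:.3f}")
print(f"area within +/-1.96: {central_area:.3f}")
```

After standardisation the z-scores have mean 0 and standard deviation 1, whatever the original units were.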

The t-distribution

Associated with the normal distribution is the t-distribution (often called Student's t). The most important difference between the two is that the shape of the t-distribution depends on the sample size. The t-distribution is therefore defined by the mean, the variance and the sample size (expressed as the degrees of freedom = sample size - 1).

Because the t-distribution tends to the normal distribution as the sample size increases, it only makes a difference in practice when the sample size is relatively small (e.g. n < 100).


Confidence intervals

A confidence interval gives an estimated range of values which is likely to include an unknown population parameter, the estimated range being calculated from a given set of sample data.

In other words because the estimate is based on a sample rather than the full population, it deviates from the population values by an amount that varies according to the particular sample selected. This variation of a sample estimate from the true population value implies that it is not possible to report the exact population value based on a sample of the population.

However, through sampling theory, it is possible to state a range of values within which one is fairly sure that the true population value lies. This range is the confidence interval.
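A minimal sketch of one common way to build such a range is a 95% confidence interval for a mean. The data are hypothetical, and the function name `mean_confidence_interval` is invented for this example. The sketch assumes a normal approximation; for small samples a t-distribution multiplier would normally be used instead, giving a slightly wider interval.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, level=0.95):
    """Approximate confidence interval for a mean (normal approximation)."""
    n = len(sample)
    estimate = mean(sample)
    standard_error = stdev(sample) / sqrt(n)
    # Multiplier that leaves (1 - level) / 2 in each tail; about 1.96 for 95%.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * standard_error, estimate + z * standard_error

# Hypothetical sample of heights in centimetres.
sample = [172, 168, 181, 175, 169, 177, 174, 170, 179, 173]
low, high = mean_confidence_interval(sample)
print(f"mean = {mean(sample):.1f}, 95% CI = ({low:.1f}, {high:.1f})")
```

Asking for a higher level of confidence, such as 99%, widens the interval.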

Graphs - Scatterplots

One way of visualising the relationship between two continuous variables is to create a scatterplot. Scatterplots provide graphical tools for exploring the distributions and relationships of the two continuous variables.

The question we try to answer is whether or not the variation in one variable (dependent variable Y) can be explained by the variation of the other (independent variable X). For example Figure 4.3 shows the scatterplot for female illiteracy rate by infant mortality (source: World Bank, 1992):

Figure 4.3

Scatterplot of female illiteracy rate by infant mortality

Covariance
Scatterplots are useful tools for exploring the data by providing a visual presentation of the relationship between X and Y, but do not provide a measure of the strength of the relationship. An appropriate numerical measure (i.e. statistic) is needed in order to draw any conclusions about relationships between variables. Covariance is a measure that reflects the strength of the (linear) association between two variables.

However, the problem with the covariance is that it is dependent on the scale of measurement of either variable.

The correlation coefficient
In order to get round the problem of differing units and to get a measure that can be used to compare between different pairs of variables, we need to standardise our measure. Correlation is a measure of association derived from standardised variables, or in other words, a standardised covariance. Each variable is standardised by subtracting the mean and dividing by the standard deviation.
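The scale dependence of the covariance, and the way the correlation removes it, can be seen in a small sketch. The data are hypothetical and the helper functions are written out by hand for illustration; they are not taken from the source.

```python
from math import sqrt

def covariance(xs, ys):
    """Sample covariance: average product of deviations (divisor n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

x = [1, 2, 3, 4, 5]  # hypothetical data
y = [2, 4, 5, 4, 5]
x_rescaled = [100 * v for v in x]  # same variable in different units

print(covariance(x, y), covariance(x_rescaled, y))    # changes with the units
print(correlation(x, y), correlation(x_rescaled, y))  # does not change
```

Rescaling X multiplies the covariance by the same factor, while the correlation is unchanged.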

Correlation does not imply causation. It is just a mathematical measure of the strength of the linear relationship between the two variables. Correlation coefficients range between -1 and +1: values close to either extreme indicate a strong relationship and values close to 0 a weak one. A high correlation between X and Y could be because:

Another misconception is that a low correlation coefficient suggests that the relationship between X and Y is weak or low. This interpretation is true only for a linear association. It is possible for two variables to have a very strong relationship that is non-linear but the correlation coefficient would not be able to pick this up.
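A small constructed example illustrates the point. Here Y is determined exactly by X (Y is the square of X), yet the correlation coefficient is zero because the relationship is not linear. The data and the helper `pearson_r` are made up for this sketch.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]  # perfect, but non-linear, dependence on x

r = pearson_r(x, y)
print("correlation:", r)
```

A scatterplot of these values would show a clear U-shaped pattern, which is why it is worth plotting the data before relying on the coefficient alone.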

A correlation, whether linear or non-linear, can be taken as evidence of a causal relationship between two variables if:

Regression

If we are interested in trying to 'explain' the behaviour of one variable (the dependent variable) using the predictive power of another variable (the independent or predictor variable), we need to use simple regression analysis. If we have two or more independent variables we would use multiple regression analysis.

Simple linear regression - the equation of a straight line
The relationship between two variables can be measured more precisely by drawing a straight line through a scatter plot, as in Figure 4.4 below. The equation of a straight line (in the population) can be written as:

Figure 4.4

Y = alpha + (beta x X)

where,

Y is the continuous dependent variable.
X is the continuous independent variable.
alpha is the intercept, which is the value of the dependent variable for X = 0 (or the value where the line crosses the Y-axis).
beta is the slope which describes how much the value of the dependent variable changes when the independent variable increases by 1 unit (for the above example 1 unit = 1%).
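In practice the intercept and slope are estimated from sample data. The source does not name an estimation method, so the sketch below assumes ordinary least squares, the usual choice. The data are invented so that they lie exactly on a known line, and `fit_line` is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Ordinary least squares estimates of the intercept and slope."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx                 # change in Y per one-unit change in X
    alpha = mean_y - beta * mean_x   # value of Y when X = 0
    return alpha, beta

# Invented data lying exactly on the line Y = 3 + 2X.
x = [1, 2, 3, 4, 5]
y = [5, 7, 9, 11, 13]

alpha, beta = fit_line(x, y)
print(f"Y = {alpha} + {beta} x X")
```

With real data the points scatter around the line, and the fitted values of alpha and beta are estimates subject to sampling error.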

How to interpret the regression equation: a hypothetical example
Imagine we have run a simple linear regression analysis on house price data. We are interested in the effect of investing money (measured in £000s) in home improvements on property values. The value of the property (measured in £000s) is therefore the dependent variable and the value of home improvements is the independent variable. In other words we are using regression analysis to predict property values from the value of home improvements.

The results of the analysis give us alpha = 150 and beta = 1.5.
beta indicates the average change in the dependent variable corresponding to a one-unit change in the independent variable. The regression formula would be…

Predicted property value = 150 + (1.5 x value of home improvements)

This indicates that on average a £1,000 investment in home improvements is accompanied by a £1,500 increase in house value. The intercept (alpha) suggests that if there was no investment in the house, we would expect it to be worth £150,000.

However, we can hypothesise that house value depends on more than just the amount spent on home improvements. To test this hypothesis we could add more independent variables, such as the number of bedrooms (beta2) and conduct a multiple regression. This time the analysis gives us alpha=130, beta1=1.2 and beta2=50, which translates into the following equation…

Predicted house value = 130 + (1.2 x value of home improvements) + (50 x number of bedrooms)

From this we can calculate the expected house value. For example, we would expect a studio flat with no investment in home improvements to be worth on average £130,000 [130 + (1.2x0) + (50x0)], whereas we would expect a five-bedroom home where the owner(s) have invested £50,000 to be worth £440,000 [130 + (1.2x50) + (50x5)].
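The multiple regression equation can be written as a small function to check such predictions. The sketch uses the hypothetical coefficients quoted for the multiple regression (alpha = 130, beta1 = 1.2, beta2 = 50), with all money values in £000s. The function name is invented for illustration.

```python
def predicted_house_value(improvements, bedrooms,
                          alpha=130.0, beta1=1.2, beta2=50.0):
    """Predicted value in thousands of pounds, from the hypothetical model."""
    return alpha + beta1 * improvements + beta2 * bedrooms

# Studio flat (counted as 0 bedrooms) with no home improvements.
studio = predicted_house_value(improvements=0, bedrooms=0)

# Five-bedroom home with 50 (i.e. £50,000) spent on improvements.
five_bed = predicted_house_value(improvements=50, bedrooms=5)

print(f"studio: £{studio * 1000:,.0f}")
print(f"five-bedroom: £{five_bed * 1000:,.0f}")
```

Each coefficient is read holding the other variable fixed: an extra bedroom adds 50 (£50,000) to the predicted value for any given level of spending on improvements.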

Logistic regression
Another form of regression technique is logistic regression. This is used when the dependent variable is binary (i.e. has only two possible outcomes). Logistic regression estimates the probability of group membership on the dependent variable.
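To give a feel for how a fitted logistic regression turns a predictor into a probability, the sketch below applies the logistic function to a linear predictor. The coefficients are hypothetical and chosen only for illustration; no model is actually fitted here.

```python
from math import exp

def logistic_probability(x, alpha, beta):
    """Probability of the outcome, given a linear predictor alpha + beta * x."""
    return 1.0 / (1.0 + exp(-(alpha + beta * x)))

# Hypothetical coefficients, not estimated from any data.
alpha, beta = -4.0, 0.1

for x in (20, 40, 60):
    p = logistic_probability(x, alpha, beta)
    print(f"x = {x}: predicted probability = {p:.3f}")
```

Whatever the value of the predictor, the result always lies between 0 and 1, which is what makes this form suitable for a binary dependent variable.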

Statistical hypotheses

Hypothesis testing
Often the difference between two estimates, or between an estimate and a specific fixed value, will be reported as statistically significant. This implies that the difference is large enough that it is unlikely to have been observed simply because of sampling error (or, put the other way, that it is likely to have been observed because of a real difference in the population).

The test to determine whether a difference is significant or not involves (often implicitly) the notion of a hypothesis test. In statistical terms, a hypothesis test is undertaken to ascertain if there is enough evidence to reject one hypothesis about the population (the null hypothesis) in favour of another (the alternative hypothesis) using estimates from the sample.

In most cases, the null hypothesis is the 'default' state - e.g. that a value is zero or that the difference between two values is zero (although there are exceptions to this). The alternative hypothesis tends to be the opposite of the null hypothesis - e.g. that a value is not zero or that the difference between two values is not zero.

The level of significance
The level of significance is the threshold that is used to decide if an observed difference in the sample was unlikely to have been observed by chance and hence to reject the null hypothesis.

The level of significance is expressed as a probability and is often taken to be 0.05. This may also be described as significant at the 5% level (or, equivalently, at the 95% confidence level), or displayed as p<0.05. A significance level of 0.05 means that, when the null hypothesis is actually true, a difference extreme enough to reject it will be observed by chance one time in twenty. It is also possible to test significance at the 10% (p<0.10) or 1% (p<0.01) levels.
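The decision rule can be sketched for a simple case: testing whether an estimate differs from a value stated in the null hypothesis. The numbers are hypothetical, the helper name is invented, and the sketch assumes the estimate is approximately normally distributed so that a two-sided z-test applies.

```python
from statistics import NormalDist

def two_sided_p_value(estimate, null_value, standard_error):
    """Two-sided p-value for a z-test (normal approximation assumed)."""
    z = (estimate - null_value) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical estimate of a difference, with its standard error.
p_value = two_sided_p_value(estimate=2.5, null_value=0.0, standard_error=1.0)

for level in (0.10, 0.05, 0.01):
    decision = "reject" if p_value < level else "do not reject"
    print(f"level {level:.2f}: p = {p_value:.4f}, {decision} the null hypothesis")
```

The same p-value can lead to different decisions depending on the level of significance chosen, which is why the level should be fixed before the data are examined.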

Hypothesis testing for categorical data
When the analysis variables are categorical, hypothesis testing can be used to compare two proportions (or percentages) or measure the association between the variables. Suppose we were interested in whether or not there was a relationship between gender and ownership of a car (using the health survey data). There are a number of ways to check this:

i) Calculate a confidence interval for the difference in the proportion owning a car between men and women. If the confidence interval does not include 0 then there is a significant difference in the proportions.

ii) Use the t-test for proportions.

iii) Use the chi-square test.
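Option (i) can be sketched as follows. The counts are hypothetical (they are not taken from the health survey data), the function name is invented, and the interval assumes the usual large-sample normal approximation.

```python
from math import sqrt
from statistics import NormalDist

def difference_in_proportions_ci(owners1, n1, owners2, n2, level=0.95):
    """Approximate confidence interval for p1 - p2 (normal approximation)."""
    p1, p2 = owners1 / n1, owners2 / n2
    standard_error = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    difference = p1 - p2
    return difference - z * standard_error, difference + z * standard_error

# Hypothetical counts: 300 of 400 men and 240 of 400 women own a car.
low, high = difference_in_proportions_ci(300, 400, 240, 400)
excludes_zero = low > 0 or high < 0

print(f"95% CI for the difference: ({low:.3f}, {high:.3f})")
print("significant difference" if excludes_zero else "no significant difference")
```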

The chi-square test (summary)
The rationale behind the chi-square (χ²) test is that if the two variables are not related (i.e. gender is not related to car ownership) then we should have the same proportion of men and women owning a car. There may be a difference in the proportions due to pure chance, but (depending on the sample size) this difference should be small. Consequently, if the difference between the two proportions is very large, then we would be led to conclude that there is some association between gender and car ownership.
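A minimal sketch of the calculation for a 2x2 table is given below, using hypothetical counts and an invented helper name. It computes the Pearson chi-square statistic without a continuity correction. For a 2x2 table there is one degree of freedom, in which case the p-value can be obtained from the standard normal distribution and the 5% critical value is 3.84.

```python
from math import sqrt
from statistics import NormalDist

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 table (1 d.f.)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected if the two variables were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # With 1 d.f., chi-square is the square of a standard normal variable.
    p_value = 2 * (1 - NormalDist().cdf(sqrt(statistic)))
    return statistic, p_value

# Hypothetical counts: rows are men and women; columns are owner, non-owner.
table = [[300, 100],
         [240, 160]]
statistic, p_value = chi_square_2x2(table)
print(f"chi-square = {statistic:.2f}, p = {p_value:.5f}")
```

If the proportions owning a car were identical in the two rows, every observed count would equal its expected count and the statistic would be zero.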

For a more detailed explanation of the chi-square test, and an example, see section 4.13.3 in the background paper.

The odds ratio
Rather than compare the difference between two proportions it can sometimes be useful to compare the odds.

If P1 is the proportion of men owning a car and P2 is the corresponding proportion for women, then the odds of owning a car for men are given by P1 / (1 - P1), while the odds for women are P2 / (1 - P2).

The odds ratio (OR), which is the ratio of these two odds, measures the association between owning a car and gender. If the same proportions of men and women own cars, then the odds ratio will equal 1.
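The calculation can be sketched in a few lines. The odds for a group with proportion p are p / (1 - p), and the odds ratio divides the odds for one group by the odds for the other. The proportions used are hypothetical and the function names are illustrative.

```python
def odds(proportion):
    """Odds corresponding to a proportion: p / (1 - p)."""
    return proportion / (1 - proportion)

def odds_ratio(p1, p2):
    """Odds for group 1 divided by odds for group 2."""
    return odds(p1) / odds(p2)

# Hypothetical proportions of car owners among men (p1) and women (p2).
p1, p2 = 0.75, 0.60

print(f"odds for men: {odds(p1):.2f}")
print(f"odds for women: {odds(p2):.2f}")
print(f"odds ratio: {odds_ratio(p1, p2):.2f}")
```

An odds ratio above 1 indicates higher odds of ownership in the first group, and a value below 1 indicates lower odds.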
