This chapter is intended as an introduction to quantitative research. It first describes a range of methods that can be used to summarise and interpret statistical data. This is then followed by a description of more complex analytical methods, such as hypothesis testing and regression analysis.
More detailed guidance is contained in the Background Document.
Data in their raw form usually consist of many rows of information, so it is rarely possible to draw useful conclusions simply by inspecting them. It is one of the tasks of the data analyst to make meaningful sense of such data, by employing techniques that summarise the data and show relationships within the data.
Inferential data analysis goes beyond describing and summarising data, by exploring and determining relationships between variables (or sub-groups). Rarely are we just interested in the relationship within the sample, but more often we want to know if such relationships can be generalised to the whole population.
Inferential data analyses involve trying to answer questions, such as, 'are differences between men's and women's incomes significant?' or, 'does smoking cause lung cancer?', etc.
Conventionally, variables can be measured at four different levels of measurement (nominal, ordinal, interval and ratio), although in practice social scientists use only three of them.
A range of measures is available to identify the centre (or average) of a set of values.
The arithmetic mean
The most common measure is the arithmetic mean, partly because it
has useful statistical properties. The mean is calculated as the
sum of all the values divided by the number of values.
A simple example is given below in Figure 4.1:
Figure 4.1: The mean of the values (2, 2, 4, 7, 10) is:
$$\bar{x} = \frac{2 + 2 + 4 + 7 + 10}{5} = \frac{25}{5} = 5$$
The median
The median is the value which is at the middle
of the distribution and is defined so that half the observations
are smaller, and half are larger than it.
The median is relatively unaffected by extreme values and is thus well suited as a measure of the 'average' for heavily skewed data, but it is more sensitive to sampling variability than the arithmetic mean. For the above data, the median is the third of the five ordered values, which is 4.
The mode
The mode is the most frequent value of a distribution.
In the above example the modal value is 2.
Sometimes a distribution has more than one value with a similarly large number of observations. This is called a bimodal distribution if there are two modal values, or multimodal if there are more.
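The three measures of the centre can be checked for the example values with a minimal sketch using Python's standard `statistics` module (the variable names are illustrative only):

```python
import statistics

values = [2, 2, 4, 7, 10]  # the example values from Figure 4.1

mean = statistics.mean(values)      # sum of the values / number of values
median = statistics.median(values)  # middle value of the ordered data
mode = statistics.mode(values)      # most frequent value

print(mean, median, mode)  # 5 4 2
```

The results agree with the values quoted in the text: a mean of 5, a median of 4 and a mode of 2.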
In addition to estimates of the average value, a measure of the spread of the values is also often reported.
Variance and standard deviation
The measure most commonly used to summarise the spread of data (for example, a set of heights) is the variance, as this has the most useful statistical properties. It is defined as the average of the squared distances of each value from the mean value.
The square root of the variance is called the standard deviation (s or SD). A value of the variance (or standard deviation) that is close to zero indicates that all the values in the sample are approximately the same and a large value implies that the observations vary widely.
Range
The range is the difference between the largest and smallest values. It therefore tends to increase as the sample size increases and is sensitive to extreme values. These are weaknesses of the range as a measure of variation.
Inter-quartile range
The inter-quartile range is a more stable measure of the spread, being the difference between the 75th and 25th percentiles. It is often used as an alternative measure of 'range' as it is largely unaffected by extreme values. However, this measure does not share the useful statistical properties of the variance and so is less frequently used.
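The measures of spread described above can be sketched for the same five example values. This is an illustration under stated assumptions, not a prescribed method: it uses the 'average squared distance' definition of the variance given in the text (dividing by n), whereas many packages divide by n - 1 when working with a sample, and it uses one of several common conventions for computing percentiles.

```python
import statistics

values = [2, 2, 4, 7, 10]

# Variance as defined in the text: the average squared distance from the mean.
variance = statistics.pvariance(values)
std_dev = statistics.pstdev(values)  # square root of the variance

# Range: largest value minus smallest value.
value_range = max(values) - min(values)

# Quartile conventions differ between packages; "inclusive" is one choice.
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

print(f"variance={variance}, sd={std_dev:.2f}, range={value_range}, IQR={iqr}")
```

Replacing the largest value (10) with a much larger one changes the variance and the range considerably but leaves the inter-quartile range unchanged under this convention, which illustrates why the latter is described as more stable.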
The normal distribution is an important distribution in statistics as many 'natural' phenomena (e.g. height and weight) are normally distributed. The normal distribution has a distinctive 'bell' shape (Figure 4.2).
It is determined by two parameters, the mean ($\mu$) and the standard deviation ($\sigma$).
Once we know these values, then we know all we need about that distribution.
Figure 4.2: The standardised normal distribution
Using the properties of the normal distribution, it is possible to calculate the probability of obtaining a measure above or below any given value. It is usual to standardise the variable of interest to have zero mean and variance equal to one (done by subtracting the mean and then dividing by the standard deviation) so that the well-known properties of the standard normal distribution can be utilised. The standard normal distribution is the normal distribution with mean equal to zero and variance equal to one.
Probabilities associated with this distribution are available in published tables but it is worth noting that in the standard normal distribution the area under the normal curve takes a particular form:
Z-scores
There are many different normal curves each with different means
and standard deviations. The Standard Normal Distribution
has a mean of 0 and a standard deviation of 1. Standardising normal
distributions to the Standard Normal Distribution facilitates comparisons.
Z-scores are a useful way of standardising variables so that they can be compared. Standardisation allows us to compare variables with different means and/or standard deviations and scores expressed in differing original units. To standardise the values of a variable, we need to take the difference between each value and the variable mean and divide each difference by the variable's standard deviation. Statistical tables of the Standard Normal Distribution can then be used to calculate the probability of getting a smaller (or larger) z-score than the one actually obtained. Thus z-scores can also help us to identify extreme values (outliers).
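As a minimal sketch of standardisation, the following computes z-scores for a small set of heights and looks up the probability of obtaining a smaller z-score. The height values are hypothetical and chosen only for illustration; `NormalDist` from Python's standard library stands in for the published tables.

```python
import statistics
from statistics import NormalDist

heights_cm = [158, 163, 167, 170, 172, 175, 181, 198]  # hypothetical sample

mean = statistics.mean(heights_cm)
sd = statistics.pstdev(heights_cm)

# Standardise: subtract the mean, then divide by the standard deviation.
z_scores = [(x - mean) / sd for x in heights_cm]

# The standard normal distribution has mean 0 and standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)
for x, z in zip(heights_cm, z_scores):
    p_smaller = standard_normal.cdf(z)  # P(Z < z), as read from tables
    print(f"{x} cm: z = {z:+.2f}, P(Z < z) = {p_smaller:.3f}")
```

After standardisation the z-scores have a mean of zero and a standard deviation of one, and an unusually large positive or negative z-score flags a potential outlier.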
Associated with the normal distribution is the t-distribution (often called Student's t). The most important difference between the normal distribution and the t-distribution is that the shape of the t-distribution depends on the sample size. The t-distribution is therefore defined by the mean, variance and the sample size (expressed as the degrees of freedom = sample size - 1).
Because the t-distribution tends to the normal distribution as the sample size increases, it only makes a difference in practice when the sample size is relatively small (e.g. n < 100).
A confidence interval gives an estimated range of values which is likely to include an unknown population parameter, the estimated range being calculated from a given set of sample data.
In other words, because the estimate is based on a sample rather than the full population, it deviates from the population value by an amount that varies according to the particular sample selected. This variation of a sample estimate from the true population value means that it is not possible to report the exact population value based on a sample of the population.
However, through sampling theory, it is possible to state a range of values within which one is fairly sure that the true population value lies. This range is the confidence interval.
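The text does not give a formula, so the following is only a sketch of one common approach: an approximate confidence interval for a mean, calculated as the sample mean plus or minus a multiple of its standard error, using the normal distribution. The income figures are hypothetical, and for small samples the multiplier would normally be taken from the t-distribution instead.

```python
import math
import statistics
from statistics import NormalDist

def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    n = len(sample)
    mean = statistics.mean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    # Multiplier from the standard normal distribution (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * standard_error, mean + z * standard_error

# Hypothetical annual incomes in thousands of pounds.
incomes = [18.5, 22.0, 24.3, 25.1, 27.8, 30.2, 31.5, 35.0, 41.2, 46.9]

lower, upper = mean_confidence_interval(incomes)
print(f"95% confidence interval for the mean: {lower:.1f} to {upper:.1f}")
```

Asking for a higher level of confidence (for example `confidence=0.99`) produces a wider interval.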
One way of visualising the relationship between two continuous variables is to create a scatterplot. Scatterplots provide graphical tools for exploring the distributions and relationships of the two continuous variables.
The question we try to answer is whether or not the variation in one variable (dependent variable Y) can be explained by the variation of the other (independent variable X). For example Figure 4.3 shows the scatterplot for female illiteracy rate by infant mortality (source: World Bank, 1992):
Figure 4.3: Scatterplot of female illiteracy rate by infant mortality
Covariance
Scatterplots are useful tools for exploring the data by providing
a visual presentation of the relationship between X and Y, but do
not provide a measure of the strength of the relationship. An appropriate
numerical measure (i.e. statistic) is needed in order to draw any
conclusions about relationships between variables. Covariance is a measure that reflects the strength of the (linear) association between two variables.
However, the problem with the covariance is that it is dependent
on the scale of measurement of either variable.
The correlation coefficient
In order to get round the problem of differing units and to get
a measure that can be used to compare between different pairs of
variables, we would need to standardise our measure. Correlation is a measure of association derived from standardised variables or, in other words, a standardised covariance. Each variable is standardised by subtracting the mean and dividing by the standard deviation.
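The scale dependence of the covariance, and the way standardisation removes it, can be illustrated with a short sketch. The height and weight figures are hypothetical, and the helper functions are written out by hand rather than taken from any particular package.

```python
import math

def covariance(xs, ys):
    """Average product of the deviations of x and y from their means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    sd_x = math.sqrt(covariance(xs, xs))
    sd_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sd_x * sd_y)

heights_m = [1.55, 1.62, 1.70, 1.78, 1.85]   # hypothetical heights in metres
weights_kg = [52, 60, 66, 75, 84]            # hypothetical weights
heights_cm = [h * 100 for h in heights_m]    # same heights, different unit

cov_m = covariance(heights_m, weights_kg)
cov_cm = covariance(heights_cm, weights_kg)
r_m = correlation(heights_m, weights_kg)
r_cm = correlation(heights_cm, weights_kg)

print(f"covariance:  {cov_m:.3f} (metres) vs {cov_cm:.3f} (centimetres)")
print(f"correlation: {r_m:.3f} (metres) vs {r_cm:.3f} (centimetres)")
```

Changing the unit of height from metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged.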
Correlation does not imply causation. It is just a mathematical measure of the strength of the linear relationship between the two variables. Correlation coefficients range from -1 to +1: the sign indicates the direction of the relationship, and values close to 0 indicate little or no linear association. A high correlation between X and Y could be because:
Another misconception is that a low correlation coefficient suggests that the relationship between X and Y is weak or low. This interpretation is true only for a linear association. It is possible for two variables to have a very strong relationship that is non-linear but the correlation coefficient would not be able to pick this up.
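A small constructed example makes this concrete. Here Y is completely determined by X (Y = X squared), yet the correlation coefficient is zero because the relationship is not linear. The data are artificial and chosen purely for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient, written out from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]  # a perfect, but non-linear, relationship

r = pearson_r(xs, ys)
print(f"r = {r:.3f}")
```

This is one reason why it is advisable to inspect a scatterplot as well as the correlation coefficient.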
A correlation, whether linear or non-linear, can support the inference of a causal relationship between two variables if:
If we are interested in trying to 'explain' the behaviour of one variable (the dependent variable) using the predictive power of another variable (the independent or predictor variable), we need to use simple regression analysis. If we have two or more independent variables we would use multiple regression analysis.
Simple linear regression - the equation of a straight line
The relationship between two variables can be measured more precisely
by drawing a straight line through a scatter plot, as in Figure
4.4 below. The equation of a straight line (in the population) can
be written as:
$$Y = \alpha + \beta X$$
Figure 4.4
where,
Y is the continuous dependent variable.
X is the continuous independent variable.
$\alpha$ is the intercept, which is the value of the dependent variable for X = 0 (or the value where the line crosses the Y-axis).
$\beta$ is the slope, which describes how much the value of the dependent variable changes when the independent variable increases by 1 unit (for the above example 1 unit = 1%).
How to interpret the regression equation: a hypothetical example
Imagine we have run a simple linear regression analysis on house
price data. We are interested in the effect of investing money (measured
in £000s) in home improvements on property values. The value
of the property (measured in £000s) is therefore the dependent
variable and the value of home improvements is the independent variable.
In other words we are using regression analysis to predict property
values from the value of home improvements.
The results of the analysis give us $\alpha$ = 150 and $\beta$ = 1.5. The slope $\beta$ indicates the average change in the dependent variable corresponding to a one-unit change in the independent variable. The regression formula would be
Predicted property value = 150 + (1.5 x value of home improvements)
This indicates that on average a £1,000 investment in home improvements is accompanied by a £1,500 increase in house value. The intercept ($\alpha$) suggests that if there was no investment in the house, we would expect it to be worth £150,000.
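The text does not say how the intercept and slope are estimated. The usual method is ordinary least squares, which is assumed in the sketch below. The five data points are hypothetical and have been constructed to lie exactly on the line from the example, so that the fitted coefficients reproduce 150 and 1.5; real data would scatter around the line.

```python
def fit_line(xs, ys):
    """Ordinary least squares estimates of the intercept and slope."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data, both variables measured in thousands of pounds.
improvements = [0, 10, 20, 30, 40]
property_values = [150, 165, 180, 195, 210]

intercept, slope = fit_line(improvements, property_values)
print(f"Predicted property value = {intercept:.0f} + ({slope:.1f} x improvements)")
```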
However, we can hypothesise that house value depends on more than
just the amount spent on home improvements. To test this
hypothesis we could add more independent variables, such
as the number of bedrooms ($X_2$) and conduct a multiple regression. This time the analysis gives us $\alpha$ = 130, $\beta_1$ = 1.2 and $\beta_2$ = 50, which translates into the following equation
Predicted house value = 130 + (1.2 x value of home improvements) + (50 x number of bedrooms)
From this we can calculate the expected house value. For example, we would expect a studio flat with no investment in home improvements to be worth on average £130,000 [130 + (1.2x0) + (50x0)], whereas we would expect a five-bedroom home where the owner(s) have invested £50,000 to be worth £440,000 [130 + (1.2x50) + (50x5)].
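The arithmetic can be checked by writing the fitted multiple regression equation (intercept 130, coefficients 1.2 and 50) as a small function. The function and parameter names are illustrative only.

```python
def predicted_house_value(improvements_k, bedrooms,
                          intercept=130.0, b_improvements=1.2, b_bedrooms=50.0):
    """Predicted house value in thousands of pounds from the fitted equation."""
    return intercept + b_improvements * improvements_k + b_bedrooms * bedrooms

# Studio flat (no bedrooms), no investment in home improvements.
studio = predicted_house_value(improvements_k=0, bedrooms=0)

# Five-bedroom home with 50 thousand pounds invested in improvements.
five_bed = predicted_house_value(improvements_k=50, bedrooms=5)

print(f"Studio flat: {studio:.0f}, five-bedroom home: {five_bed:.0f}")
```

With these coefficients the studio flat is predicted at 130 (thousand pounds) and the five-bedroom home at 440 (thousand pounds).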
Logistic regression
Another regression technique is logistic regression. This is used when the dependent variable is binary (i.e. has only two possible outcomes). Logistic regression estimates the probability of group membership on the dependent variable.
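As a sketch of how a fitted logistic regression turns a value of the independent variable into a probability, the linear predictor is passed through the logistic function, which always returns a value between 0 and 1. The intercept and slope below are hypothetical and are not estimated from any data; estimating them is normally left to statistical software.

```python
import math

def logistic(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predicted_probability(x, intercept, slope):
    """Probability of the outcome for a given x under a fitted logistic model."""
    return logistic(intercept + slope * x)

# Hypothetical coefficients for 'owns a car' (yes/no) as a function of age.
intercept, slope = -4.0, 0.1

for age in (20, 40, 60):
    p = predicted_probability(age, intercept, slope)
    print(f"age {age}: predicted probability = {p:.2f}")
```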
Hypothesis testing
Often the difference between two estimates, or between an estimate
and a specific fixed value, will be reported as statistically
significant. This implies that the difference is large enough
that it is unlikely to have been observed simply because of sampling
error (or, put the other way, that it is likely to have been observed
because of a real difference in the population).
The test to determine whether a difference is significant or not
involves (often implicitly) the notion of a hypothesis test.
In statistical terms, a hypothesis test is undertaken to ascertain
if there is enough evidence to reject one hypothesis about the population
(the null hypothesis) in favour of another (the alternative
hypothesis) using estimates from the sample.
In most cases, the null hypothesis is the 'default' state - e.g. that a value is zero or that the difference between two values is zero (although there are exceptions to this). The alternative hypothesis tends to be the opposite of the null hypothesis - e.g. that a value is not zero or that the difference between two values is not zero.
The level of significance
The level of significance is the threshold that is used to decide
if an observed difference in the sample was unlikely to have
been observed by chance and hence to reject the null hypothesis.
The level of significance is expressed as a probability and is often taken to be 0.05. This may also be described as significant at the 5% level (equivalently, at the 95% confidence level), or displayed as p<0.05. A significance level of 0.05 implies that, when the null hypothesis is actually true, a difference extreme enough to reject it will be observed by chance one time in twenty. It is also possible to test the significance at the 10% (p<0.10) or 1% (p<0.01) levels.
Hypothesis testing for categorical data
When the analysis variables are categorical, hypothesis testing
can be used to compare two proportions (or percentages) or measure
the association between the variables. Suppose we were interested
in whether or not there was a relationship between gender and ownership
of a car (using the health survey data). There are a number of ways
to check this:
i) Calculate a confidence interval for the difference in the proportion owning a car between men and women. If the confidence interval does not include 0 then there is a significant difference in the proportions.
ii) Use the t-test for proportions.
iii) Use the chi-square test.
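Option (i) can be sketched as follows, using the usual large-sample (normal approximation) interval for a difference between two independent proportions. The counts are hypothetical and are not taken from the health survey data.

```python
import math
from statistics import NormalDist

def proportion_difference_ci(successes_1, n_1, successes_2, n_2, confidence=0.95):
    """Approximate confidence interval for p1 - p2 (normal approximation)."""
    p1 = successes_1 / n_1
    p2 = successes_2 / n_2
    # Standard error of the difference between two independent proportions.
    standard_error = math.sqrt(p1 * (1 - p1) / n_1 + p2 * (1 - p2) / n_2)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    difference = p1 - p2
    return difference - z * standard_error, difference + z * standard_error

# Hypothetical counts: 420 of 600 men and 390 of 650 women own a car.
lower, upper = proportion_difference_ci(420, 600, 390, 650)
significant = not (lower <= 0 <= upper)

print(f"95% CI for the difference: {lower:.3f} to {upper:.3f}")
print("Significant at the 5% level" if significant else "Not significant")
```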
The chi-square test (summary)
The rationale behind the $\chi^2$ test is that if the two variables are not
related (i.e. gender is not related to car ownership) then we should
have the same proportion of men and women owning a car. There may
be a difference in the proportions due to pure chance, but (depending
on the sample size) this difference must be small. Consequently,
if the difference between the two proportions is very large, then
we would be led to conclude that there is some association between
gender and car ownership.
For a more detailed explanation of the chi-square test, and an example, see section 4.13.3 in the background paper.
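For a two-by-two table, the test can be sketched by comparing the observed counts with the counts expected if gender and car ownership were unrelated. The counts below are hypothetical, no continuity correction is applied, and the p-value formula used is specific to one degree of freedom (the case of a two-by-two table).

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical counts. Rows: men, women. Columns: owns a car, does not.
table = [[420, 180], [390, 260]]

statistic, p_value = chi_square_2x2(table)
print(f"chi-square = {statistic:.2f}, p = {p_value:.4f}")
```

A p-value below the chosen level of significance (for example 0.05) would lead us to reject the null hypothesis that gender and car ownership are unrelated.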
The odds ratio
Rather than compare the difference between two proportions it can
sometimes be useful to compare the odds.
If $P_1$ is the proportion of men owning a car and $P_2$ is the corresponding proportion for women, then the odds of owning a car for men are given by $P_1 / (1 - P_1)$, while those for women are $P_2 / (1 - P_2)$.
The odds ratio, $\frac{P_1 / (1 - P_1)}{P_2 / (1 - P_2)}$, measures the association between owning a car and gender. If the same proportions of men and women own cars, then the odds ratio will equal 1.
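A short sketch of the calculation, using hypothetical proportions of 0.7 for men and 0.6 for women:

```python
def odds(p):
    """Odds corresponding to a proportion p (0 < p < 1)."""
    return p / (1 - p)

def odds_ratio(p1, p2):
    """Ratio of the odds for group 1 to the odds for group 2."""
    return odds(p1) / odds(p2)

p_men, p_women = 0.7, 0.6  # hypothetical proportions owning a car

ratio = odds_ratio(p_men, p_women)
print(f"odds (men) = {odds(p_men):.2f}, odds (women) = {odds(p_women):.2f}")
print(f"odds ratio = {ratio:.2f}")
```

An odds ratio above 1 here indicates that the odds of owning a car are higher for men than for women. Equal proportions give an odds ratio of exactly 1.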