Wednesday, October 21, 2009

Correspondence analysis of raw data

Abstract

Correspondence analysis has found extensive use in the social and environmental sciences as a method for visualizing the patterns of association in a table of frequencies. Inherent to the method is the expression of the frequencies in each row or each column relative to their respective totals, and it is these sets of relative frequencies (called profiles) that are visualized. This “relativization” of the frequencies makes perfect sense in social science applications where sample sizes vary across different demographic groups, and so the frequencies need to be expressed relative to these different bases in order to make these groups comparable. But in ecological applications sampling is usually performed on equal areas or equal volumes so that the absolute abundances of the different species are of relevance, in which case relativization is optional. In this paper we define the correspondence analysis of raw abundance data and discuss its properties, comparing these with the regular correspondence analysis based on relative abundances.

Correspondence analysis (CA) and its variants – multiple, joint, subset and canonical correspondence analysis – have found acceptance and application by a wide variety of researchers in different disciplines, notably the social and environmental sciences (for an up-to-date account, see Greenacre, 2007). The method is available in the major statistical software packages, for example SPSS, Minitab, Stata, SAS, Statistica and XLSTAT, and it is freely available in several implementations in R (R Development Core Team, 2007) – for example, the ca package by Nenadić and Greenacre (2007) and the vegan package by Oksanen et al. (2006). The method is routinely applied to a table of non-negative data to obtain a spatial map of the important dimensions in the data, where proximities between points and other geometric features of the map indicate associations between rows, between columns, and between rows and columns.

In the social science context where the method originated, CA is typically applied to cross-tabulations between two or more categorical variables, for example a demographic variable such as education level cross-tabulated against responses to a question in an opinion survey. Because the sample sizes in the demographic groups are different, a valid comparison between these groups is achieved by expressing the response frequencies relative to their respective sample sizes, a process which we call relativization. These vectors of relative frequencies are called profiles in CA, and it is the profiles that are visualized in the resulting maps.
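
As a minimal sketch of relativization, the snippet below divides each row of a small frequency table by its row total to obtain row profiles. The function name `row_profiles` and the counts are hypothetical and chosen only for illustration; they do not come from the paper.

```python
def row_profiles(table):
    """Return the row profiles of a frequency table.

    Each row is divided by its own total, so every profile sums to 1.
    """
    profiles = []
    for row in table:
        total = sum(row)
        if total == 0:
            raise ValueError("row total is zero; profile is undefined")
        profiles.append([count / total for count in row])
    return profiles


# Hypothetical survey counts (illustrative only):
# rows are two demographic groups, columns are three response categories.
counts = [
    [20, 30, 50],  # group A, sample size 100
    [10, 15, 25],  # group B, sample size 50
]

profiles = row_profiles(counts)
for profile in profiles:
    print(profile)
```

In this made-up table the two groups differ in sample size but respond in the same proportions, so their profiles coincide and the groups become directly comparable.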

This technology has been transferred ‘as is’ to ecological applications, but it is frequently the case that ecological sampling is conducted on physically equal-sized samples, either fixed areas or fixed volumes. The Bray-Curtis index, for example, which is used to measure similarity between samples in terms of their species abundances, does not relativize the data, but aggregates absolute differences between raw abundances. Bray-Curtis measures of similarity or dissimilarity could be computed on relative abundance data, however, if it were deemed important to do so – the point is that the relativization step is optional in this ecological context, and not compulsory. CA inherently analyzes profiles, so the question arises how CA functions when the raw abundance levels are of interest (i.e., the size of the data) as well as the relative abundances (i.e., the shape). In this report we define the CA of raw ‘unrelativized’ abundance data and compare its properties to those of regular CA, with an illustration on a benthos data set from the North Sea.
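
The distinction between size and shape can be illustrated with a short sketch, assuming the usual form of the Bray-Curtis dissimilarity (the sum of absolute differences divided by the sum of all abundances in the two samples). The helper names and the abundance values are hypothetical and are not taken from the benthos data set.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    denominator = sum(a + b for a, b in zip(x, y))
    if denominator == 0:
        raise ValueError("both samples are empty; dissimilarity is undefined")
    return numerator / denominator


def relativize(x):
    """Express abundances relative to the sample total (the profile)."""
    total = sum(x)
    if total == 0:
        raise ValueError("sample total is zero; profile is undefined")
    return [a / total for a in x]


# Hypothetical abundances of three species in two equal-sized samples.
# Sample 2 has twice the abundance of sample 1 for every species.
sample_1 = [10, 20, 30]
sample_2 = [20, 40, 60]

raw_d = bray_curtis(sample_1, sample_2)
rel_d = bray_curtis(relativize(sample_1), relativize(sample_2))

print("raw abundances:     ", raw_d)
print("relative abundances:", rel_d)
```

On raw abundances the two samples are clearly dissimilar because they differ in size, whereas after relativization the dissimilarity vanishes because their shapes are identical. This is the information that a purely profile-based analysis leaves out.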


