LINQ Style Variance and Standard Deviation Operators
In statistics, the variance and standard deviation for a set of data indicate how spread out the individual values are. Small values indicate that the elements of a set are close to the average value, whereas larger values suggest a greater spread.
Variance and Standard Deviation
Language-Integrated Query (LINQ) includes the Average standard query operator, which calculates the arithmetic mean for a sequence of numeric values. This is simply the sum of all of the values in the set divided by the number of elements. The mean helps you understand the central tendency of a data set but gives no information about the overall spread of values.
To gain more understanding of a data set we can combine the mean with either the variance or the standard deviation. Both give low values for collections where the numbers are generally close together and higher values for data sets with wider variations. The variance is calculated by finding the difference between the set's mean and each value, squaring that difference for every element in the set and determining the average of those squared values.
The formula for the variance of a set of values is as follows. In the formula, x_i represents each item in the set, N is the number of elements and μ is the arithmetic mean of the full population.

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$
The standard deviation is the square root of the variance. For a normal distribution of values, you would expect to see around two thirds of the values (approximately 68%) falling within the range between one standard deviation below the mean and one standard deviation above. Approximately 95% of values would normally appear within two standard deviations of the mean.
Let's consider an example. If we have the values 98, 100 and 105, the mean is 101. The differences between the values and the mean are 3, 1 and 4 respectively. When squared, the values are 9, 1 and 16. The mean of these three values, which is the variance, is 8.67. The standard deviation is, therefore, 2.94.
Another set of three values with a mean of 101 is 5, 25 and 273. These values are much more diverse than those in the previous set, but the mean does not tell us this. However, if we calculate the variance, which is approximately 14858.67, or the standard deviation, which is approximately 121.90, we gain a deeper understanding of the data.
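The two worked examples above can be checked with a few lines of C#. This is only a minimal sketch that uses the Average operator twice; the class and method names (PopulationStatistics, PopulationVariance, PopulationStandardDeviation) are hypothetical and are not the extension methods built later in the article.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper class, used for illustration only.
public static class PopulationStatistics
{
    // Population variance: the mean of the squared differences from the mean.
    public static double PopulationVariance(IEnumerable<double> values)
    {
        // Copy the sequence to an array so that it can safely be read twice.
        double[] data = values.ToArray();

        // Average throws InvalidOperationException for an empty sequence.
        double mean = data.Average();

        return data.Average(x => (x - mean) * (x - mean));
    }

    // Population standard deviation: the square root of the variance.
    public static double PopulationStandardDeviation(IEnumerable<double> values)
    {
        return Math.Sqrt(PopulationVariance(values));
    }
}
```

Passing the values 98, 100 and 105 to PopulationVariance should give approximately 8.67, and passing 5, 25 and 273 should give approximately 14858.67, matching the figures calculated by hand.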
Sample Variance and Standard Deviation
The above formulae apply when we know the entire population of our data. For example, if we gathered the ages of every person in a country via a census we could calculate the variance and standard deviation of the ages in the above manner.
In many cases we cannot reasonably obtain a full set of data and instead must take a sample and use it to estimate the results for the complete population. In this case the formula for the variance is slightly different. When determining the average of the squared differences we sum them and divide not by the number of items but by one less than this value. This adjustment, known as Bessel's correction, compensates for the tendency of a variance calculated from a sample to underestimate the variance of the population.
The revised formula is shown below, where x̄ is the mean of the sample. If our set of values containing 98, 100 and 105 is a sample, the revised variance is 13 instead of 8.67.

$$s^2 = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2$$
The standard deviation of a sample is affected in the same manner. It is still the square root of the variance but uses the sample variance calculation. The revised standard deviation for the sample containing 98, 100 and 105 is 3.61, instead of 2.94. The formula, with x̄ denoting the sample mean, is:

$$s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2}$$
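To check the sample figures in code, the calculation can be sketched as follows. The only change from the population version is the division by one less than the number of items. The names SampleStatistics, SampleVariance and SampleStandardDeviation are hypothetical, and rejecting inputs with fewer than two values is a design assumption of this sketch rather than something the article specifies.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper class, used for illustration only.
public static class SampleStatistics
{
    // Sample variance: the sum of squared differences divided by (N - 1).
    public static double SampleVariance(IEnumerable<double> values)
    {
        // Copy the sequence to an array so that it can safely be read twice.
        double[] data = values.ToArray();

        // Assumption: with fewer than two values, N - 1 would be zero or negative.
        if (data.Length < 2)
        {
            throw new InvalidOperationException(
                "A sample variance requires at least two values.");
        }

        double mean = data.Average();
        double sumOfSquares = data.Sum(x => (x - mean) * (x - mean));

        return sumOfSquares / (data.Length - 1);
    }

    // Sample standard deviation: the square root of the sample variance.
    public static double SampleStandardDeviation(IEnumerable<double> values)
    {
        return Math.Sqrt(SampleVariance(values));
    }
}
```

For the sample 98, 100 and 105, SampleVariance should return 13 and SampleStandardDeviation approximately 3.61.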
Implementing Variance and Standard Deviation
In the remainder of this article we'll implement the variance and standard deviation calculations. We'll create extension methods that provide a syntax similar to that of the Average method of LINQ. We'll only deal with input sequences that implement IEnumerable<double>, though it's relatively trivial to add variants for other data types or overloads that include Func delegates to extract numeric values from other data types.
We'll implement the extension methods twice. The first versions will be simple but naive. Although they produce accurate results, they require that the source sequence be enumerated several times. This will cause errors for sequences that can only be read once. The second implementation will require a single pass over the input data, at the cost of potential rounding errors due to extra multiplication and division.
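Before building the full implementations, the single-pass trade-off can be previewed with a rough sketch. It assumes one common formulation, in which a running count, sum and sum of squares are accumulated and the population variance is derived as (Σx² − (Σx)² / N) / N. Whether the article's second implementation uses exactly this identity is an assumption, and the names SinglePassStatistics and VarianceSinglePass are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper class, used for illustration only.
public static class SinglePassStatistics
{
    // Population variance computed while reading the source exactly once.
    public static double VarianceSinglePass(IEnumerable<double> source)
    {
        if (source == null) throw new ArgumentNullException("source");

        long count = 0;
        double sum = 0;
        double sumOfSquares = 0;

        // The only enumeration of the source sequence.
        foreach (double value in source)
        {
            count++;
            sum += value;
            sumOfSquares += value * value;
        }

        if (count == 0)
        {
            throw new InvalidOperationException("Sequence contains no elements.");
        }

        // Subtracting two large, similar totals is where rounding errors can appear.
        return (sumOfSquares - sum * sum / count) / count;
    }
}
```

For 98, 100 and 105 the totals are 303 and 30629, which gives (30629 − 30603) / 3, or approximately 8.67, the same result as the hand calculation. With very large values or very long sequences the two totals can grow large enough that their difference loses precision, which is the cost mentioned above.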
To begin, create a new console application. We'll be using the Main method for testing later.
2 April 2013