 SQL 2005+SQL Server Grouping
The twenty-eighth part of the SQL Server Programming Fundamentals tutorial adds to the description of the use of aggregate functions within queries. In this article the GROUP BY and HAVING clauses are used to summarise groups of similar result rows.
Grouping Results
Is the previous article in this tutorial we examined some of the aggregate functions provided by SQL Server. These functions allow us to count rows, calculate totals and averages, and find the maximum and minimum values across a series of rows returned by a query. In each case, we performed an aggregation over a query's entire result set. This allowed us to answer questions such as "What is the entire sales income for the business?"
A common requirement is to perform similar functions over groups of related rows in a result set, calculating the aggregates for each group to provide a summary. This allows us to refine the questions asked into, for example, "What is the sales income by region for the business?" The results of this query would include one row per group with totals in each.
This type of query can be created using the GROUP BY clause within a SELECT statement. The GROUP BY clause defines one or more columns or expressions that should be used to determine which rows belong to each group. The groups can then be summarised using the appropriate aggregate functions.
JoBS Database
The examples in this article use the JoBS database. This is the database that has been created and populated with information throughout the course of the tutorial. If you do not have an up-to-date copy of the database you should create one with the script that may be downloaded using the link at the top of this page.
Using GROUP BY
The GROUP BY clause is added to the end of a query's SELECT statement. The clause is followed by a comma-separated list of the expressions for which you wish to create grouped sets. Often each item in the list will be a simple column name. However, you can create expressions that combine column names, literal values and functions to create groups.
GROUP BY expression1, expression2, ..., expressionN
As an example, we may wish to calculate the number of units of stock held by each engineer in the JoBS database. To calculate the total number of stocked parts for all of the engineers combined, we would use the following statement:
SELECT sum(UnitsHeld) FROM EngineerStock
To modify the statement to show a subtotal for each engineer individually we can add the engineer ID to the list of returned columns and to a GROUP BY clause. NB: The EngineerId is not required in the column list but without it you will not be able to see the link between each total and the engineer.
SELECT EngineerId, sum(UnitsHeld) FROM EngineerStock GROUP BY EngineerId
As with previous aggregation operations you will see that the calculated column in the results has no column name assigned. As before, we can add a column alias if desired:
SELECT EngineerId, sum(UnitsHeld) AS Units FROM EngineerStock GROUP BY EngineerId
When creating queries that use grouping, the list of columns to be returned in the results is more restricted than for a standard query. The column list may only include aggregated values and columns or expressions that appear in the GROUP BY clause's list. The following query attempts to select the PartNumber column, which meets neither criterion, so causes an error when executed.
SELECT
EngineerId,
PartNumber,
sum(UnitsHeld)
FROM
EngineerStock
GROUP BY
EngineerId
Linking Tables
A GROUP BY clause operates on any data that is gathered by the preceding query. This makes it appropriate for queries that gather information from more than one linked table. However, there are some common mistakes that can occur when aggregating data from joined tables with grouping. When using joins for one-to-many or many-to-many relationships, queries often return some duplicated data from the joined tables. If the duplicated data includes values that are to be included within sum or avg functions they can return unexpected results due to double counting. Similarly, when using the count function, duplicates can be counted twice unless the DISTINCT clause is included.
When using small amounts of data for testing purposes it is possible that you will not notice such calculation errors. You should always pay extra attention when testing aggregates in grouped queries. It can be useful to execute queries without the grouping to help identify areas of potential duplication that are likely to cause miscalculations.
In the following query we join the Engineers and EngineerStock tables to improve the previous query. In this example we include the engineer's name in addition to their unique ID. As both the ID and name are displayed, they are both included in the GROUP BY clause. This does not generate extra groups because there is a one-to-one relationship between an engineer's name and ID.
SELECT
E.EngineerName,
S.EngineerId,
sum(S.UnitsHeld)
FROM
EngineerStock S
INNER JOIN
Engineers E
ON
S.EngineerId = E.EngineerId
GROUP BY
S.EngineerId,
E.EngineerName
GROUP BY ALL
The behaviour of the GROUP BY clause can be modified using the ALL keyword. This keyword causes all of the possible groups from a table to be returned, even if the query's WHERE clause filters out all of the results for a particular group. In this case, the group will be displayed but all aggregated values in the row will be null.
GROUP BY ALL has some limitations. Key to these is that it should never be used with remote tables. Additionally, the clause is not supported by Microsoft when used with a WHERE clause. Without such a WHERE clause the addition of the ALL keyword has no meaning. Generally you should not use the ALL variation; it is included in this article only for completeness. Microsoft has indicated that GROUP BY ALL will be removed from future versions of SQL Server.
Filtering Aggregated Results
The results of a query containing grouping can be filtered in two ways. Firstly, a WHERE clause may be used before the GROUP BY clause to filter the results before any aggregate calculations are made. Secondly, you can use the HAVING clause to add criteria that are considered after the grouping and aggregation operations.
The HAVING clause uses a similar syntax to the WHERE clause. The key difference is the ability to apply filtering to calculated columns. For example, if we wish to amend the previous query to only return engineers that have a total stock of more than twenty units we can use the following query:
SELECT
E.EngineerName,
S.EngineerId,
sum(S.UnitsHeld)
FROM
EngineerStock S
INNER JOIN
Engineers E
ON
S.EngineerId = E.EngineerId
GROUP BY
S.EngineerId,
E.EngineerName
HAVING
sum(S.UnitsHeld) > 20
Often both a WHERE clause and a HAVING clause will be used in the same query. For example, the following query lists the engineers who have more than twenty units of stock of 15mm Copper Pipe (15COPIPE). The WHERE clause limits the initial query to the copper pipe item before the grouping occurs. The HAVING clause removes any rows where the total stock calculation gives an answer of twenty or less.
SELECT
E.EngineerName,
S.EngineerId,
sum(S.UnitsHeld)
FROM
EngineerStock S
INNER JOIN
Engineers E
ON
S.EngineerId = E.EngineerId
WHERE
S.PartNumber = '15COPIPE'
GROUP BY
S.EngineerId,
E.EngineerName
HAVING
sum(S.UnitsHeld) > 20
Sorting Aggregated Results
For the final example we will add a sort order to the above query. When grouping results, ordering may be applied to any of the columns in the query, including those that are calculated by aggregate functions. The query below demonstrates this by ordering the results by the calculated stock total, with the highest stock level first and the lowest last.
SELECT
E.EngineerName,
S.EngineerId,
sum(S.UnitsHeld)
FROM
EngineerStock S
INNER JOIN
Engineers E
ON
S.EngineerId = E.EngineerId
WHERE
S.PartNumber = '15COPIPE'
GROUP BY
S.EngineerId,
E.EngineerName
HAVING
sum(S.UnitsHeld) > 20
ORDER BY
sum(S.UnitsHeld) DESC
|