BlackWasp
Programming Concepts

Database Normalisation

The seventeenth part of the SQL Server Programming Fundamentals tutorial discusses the concept of database normalisation. Normalisation is a database design technique that minimises duplication of information, reducing the risk of introducing data errors.

Data Anomalies

When information is held in a list format, the data is likely to include some duplication of information. This is often the case when creating a large list with many columns, such as in a spreadsheet format. Duplication can often occur when a single database table is used to hold lots of related information.

Duplication of information causes two key problems in a relational database. Firstly, the additional information can often lead to increased storage requirements. Secondly, and more importantly, you can experience data integrity problems, the majority of which can lead to one or more types of anomaly. Three anomalies are described in the following sections.

Update Anomalies

An update anomaly occurs when all copies of a single piece of information are not updated correctly. For example, consider the following table:

NameJob TitleSkill
BobDeveloperC#
BobDeveloperSQL
JimDeveloperC#
SueDeveloperVB
PatProject ManagerPRINCE2

To allow the table to hold the details of multiple employees, each with a job title and one or more skills, this table includes duplication in every column. This gives many opportunities for update anomalies. For example, if Bob's information is updated so that his job title is "Programmer", this should probably also update all other developer rows to indicate that the new job title replaces the old one for all developers. If not, the update could leave the table in the following state:

NameJob TitleSkill
BobProgrammerC#
BobProgrammerSQL
JimDeveloperC#
SueDeveloperVB
PatProject ManagerPRINCE2

The information in the table may still be considered accurate. However, a query of the database that intends to return all of the developers has become more difficult. A worse scenario will occur if the next update to change Bob's title back to "Developer" is only executed against one of his two entries, especially if the new job title is misspelled:

NameJob TitleSkill
BobDeveloprC#
BobProgrammerSQL
JimDeveloperC#
SueDeveloperVB
PatProject ManagerPRINCE2

Insert Anomalies

Insert anomalies occur when the structure of a database table makes it difficult to insert information. This can happen when two or more distinct areas of information (or domains) exist within a table. In our previous example table, we may want to create a new employee who does not have a specific technical skill. This forces us to use a null value or dummy value for the skill:

NameJob TitleSkill
BobDeveloprC#
BobProgrammerSQL
JimDeveloperC#
SueDeveloperVB
PatProject ManagerPRINCE2
LeeTrainee Programmernull)

In this situation, the information in the table seems reasonable. However, when Lee obtains his first skill, simply adding a row, as would be done for other developers, causes an anomaly:

NameJob TitleSkill
BobDeveloprC#
BobProgrammerSQL
JimDeveloperC#
SueDeveloperVB
PatProject ManagerPRINCE2
LeeTrainee Programmer(null)
LeeTrainee ProgrammerC#

Delete Anomalies

A delete anomaly occurs when the deletion of information from a table has unintended consequences, generally because the structure of the table makes deleting data difficult. Consider the situation where the company using the sample table has decided that they will no longer require the C# skill. If you were to delete all references to C#, Jim would disappear from the database entirely. Similarly, if Pat were to leave the company, removing her entry would erase the PRINCE2 skill from the database.

Normalisation

Database normalisation (US: normalization) reduces or removes the issues caused by duplicated data by extracting the replicated information into separate tables. Foreign key relationships are then added between the related tables to maintain referential integrity. Once this normalisation has occurred, the key information in each database table is held uniquely.

As an example, we may decide to move the job title values in the previous example to their own table with a primary key. Each of the employee rows would include a reference to that primary key. When querying employees, we would join the two tables together, ensuring that every employee with a particular title will be linked to the same job title row and would therefore return exactly the same information. This means that an update to a job title need only occur in one place to affect every linked employee.

Normalisation should be applied to databases that are updated frequently but seldom queried. Adding tables to a database generally increases the number of tables that must be joined to execute a query. As each join adds an overhead to the query, decreasing performance, it is possible to over-normalise a database. This is particularly noticeable for databases that are used only for reporting purposes. In these situations it is common to denormalise (US: denormalize) the database, introducing duplication that increases query performance.

Normal Forms

The normal forms of relational databases describe the level of vulnerability that a particular design has to data integrity problems. They define a set of rules that should be applied when designing a database. The forms each have an ordinal number, with "first normal form" being the least normalised and most vulnerable to anomalies. Higher ordinals indicate progressively stricter rules. Additionally, there are some named normal forms, such as Boyce-Codd Normal Form, that apply additional rules.

Normal forms should be considered to be guidelines for database design, rather than absolute rules. There are some situations where it is necessary to break with the normal forms. This is generally to improve the performance of queries that otherwise would require a large number of join operations.

Originally, Edgar F. Codd described the first, second and third normal forms. These three forms are those that are the most commonly adhered to and that are always achievable when designing databases. They will be described in the remaining sections of this article. There are further normal forms, including Boyce-Codd Normal Form and fourth, fifth and sixth normal form. However, these stricter forms are not always achievable.

First Normal Form

For a table to be compliant with first normal form (1NF), it must meet certain conditions. The key rules are:

  • The columns of the table must not require any specific ordering.
  • The rows of the table must not require any specific ordering.
  • It must not be possible to create duplicate rows.
  • Every column of every row must contain exactly one element of data.
  • Rows must not contain hidden information such as row identifiers or hidden timestamps.

These rules can be illustrated with some example tables.

Example 1

MessageType
Opened text file.Information
Outputted 15 lines to text file.Information
Failed to write line 16.Error

This table holds information, warning and error messages that are logged by a computer program. In order to read the log correctly, the details must be retrieved in the order in which they were written to the database. This breaks the second rule, which states that rows of the table must not require any specific ordering to be understood. It also breaks the third rule, as there is no possible unique candidate key for the table.

To resolve the ordering issue, the date and time of every log entry can be recorded. The candidate key problem can be fixed by adding a surrogate identifier to the table, shown underlined. In the following table, this column has been added and is using an identity column to generate values.

Log IDLog TimeMessageType
115/12/2008 12:01:23.123456Opened text file.Information
215/12/2008 12:01:23.123472Outputted 15 lines to text file.Information
315/12/2008 12:01:23.130120Failed to write line 16.Error

Example 2

NameJob TitleSkills
BobDeveloperC#, SQL
JimDeveloperC#, VB, SQL
SueDeveloperVB

In the second example we return to the employee list. In this case, the database designer has created a Skills column to contain the list of skills that each employee has. Where an employee has several skills, they are comma-separated. This breaks the rule that every column of every row contains exactly one element of data.

This issue can be resolved by creating two tables instead of one. The first table contains employees' names and job titles with the name as the primary key. The second table contains one row for each skill for each employee with both columns being included in a composite key. On performing a query that requires both sets of data, a join is made based upon the employee name.

Employees Table
NameJob Title
BobDeveloper
JimDeveloper
SueDeveloper
Employee Skills Table
EmployeeSkills
BobC#
BobSQL
JimC#
JimVB
JimSQL
SueVB

Second Normal Form

For a table to comply with second normal form (2NF), it must meet all of the requirements of 1NF. Additionally, all non-prime columns, ie. columns that are not part of a candidate key, must depend upon entire candidate keys, not just parts of those keys. This rule only applies to tables that contain composite keys. If all possible candidate keys are composed only of single columns, this rule has no effect.

Example

EmployeeSkillsSalary
BobC#£30,000
BobSQL£30,000
JimC#£35,000
JimVB£35,000
JimSQL£35,000
SueVB£30,000

In this example, a column has been added to the Employee Skills table to hold the employee's salary. This column breaks the 2NF rule, as the salary is dependant upon the employee but not the skill. This is only one column of the two-column composite key.

To modify the design to achieve 2NF, the Salary column should be removed from the Employee Skills table and added to the Employees table instead. This also removes the duplication of salary data that could cause update anomalies.

Employees Table
NameJob TitleSalary
BobDeveloper£30,000
JimDeveloper£35,000
SueDeveloper£30,000
Employee Skills Table
EmployeeSkills
BobC#
BobSQL
JimC#
JimVB
JimSQL
SueVB

Third Normal Form

Third normal form (3NF) includes all of the requirements of 2NF. In addition, 3NF states that all non-prime columns must be directly dependant upon the candidate keys. Non-key columns may not be transitively dependant upon prime columns. ie. A column should not be dependant upon a non-key column that is dependant upon a key column.

Example

EmployeePrimary OfficeOffice Phone
BobLondon020 123456
JimLeeds0113 234567
SueManchester0161 987654

In the above sample, the designer has created a new table that links each employee to an office location and that office's main telephone number. This table meets all of the requirements for 2NF. The only candidate key for the table is the Employee column. The Primary Office column is directly dependant upon this key. However, the Office Phone column is not directly related to the employee. It is transitively related to the employee via the primary office. This table does not therefore meet the requirements for 3NF.

To achieve 3NF, the table must be normalised into two tables. The first table will link the employee to their location. The second will hold the details of each location, including the telephone number. To perform a query that returns an employee's name and telephone number, you must join the tables. This solution removes the risk of update anomalies where the same office appears for multiple employees but the telephone number is changed for only one of them.

Employee Offices Table
EmployeePrimary Office
BobLondon
JimLeeds
SueManchester
Offices Table
OfficeOffice Phone
London020 123456
Leeds0113 234567
Manchester0161 987654

Limitations

Third normal form can be overly complex for some business scenarios. You may decide not to fully normalise your tables to achieve 3NF for performance or other reasons. An example of this can be seen in the following table. This table holds details of customer locations including the customer ID, address and postcode. Technically, the address can be though of as dependent upon the postcode, which itself is related to the customer ID. To achieve 3NF, you could create a table of possible locations with the postal code as its primary key. In many situations, this would be deemed to be unnecessary.

CustomerIDAddressPostcode
123020 123456W1 1WW
4560113 234567LS2 2LS
7890161 987654M3 3MM
Link to this Page15 December 2008
TwitterTwitter RSS Feed RSS