The Soundex Algorithm
Cultural differences and input errors can lead to words being spelled differently to a user's expectations. This makes it difficult to locate information quickly. The Soundex algorithm can alleviate this by assigning codes based upon the sound of words.
The Soundex algorithm generates four-character codes based upon the pronunciation of English words. These codes can be used to compare two words to determine whether they sound alike. This can be very useful when searching for information in a database or text file, particularly when looking for names that are commonly misspelled.
An example of the use of Soundex is the search function of a customer database. A plain text search for the surname "Smith" would not find people named "Smythe". However, as the Soundex code for both surnames is "S530", a phonetic Soundex-based search would find both customers. The codes and data could also be used to ask the user, "Did you mean Smythe?".
The Soundex algorithm applies a series of rules to a string to generate the four-character code. The encoding steps are as follows:
- Ignore all characters in the string being encoded except for the English letters, A to Z.
- The first letter of the Soundex code is the first letter of the string being encoded.
- After the first letter in the string, do not encode vowels or the letters H, W and Y. These letters are never given a digit of their own, but their presence between two other letters can still change the code, as the rule on adjacent digits explains.
- Assign a numeric digit between one and six to all letters after the first using the following mappings:
  - 1: B, F, P, V
  - 2: C, G, J, K, Q, S, X, Z
  - 3: D, T
  - 4: L
  - 5: M, N
  - 6: R
- Where adjacent digits are the same, remove all but one of those digits unless a vowel, H, W or Y was found between them in the original text. The C# code described below uses a temporary placeholder for these non-encodable letters to avoid incorrect removal of adjacent, matching digits. NB: The "Variations" section at the end of the article describes an alternative version of this rule.
- Force the code to be four characters in length by padding with zero characters or by truncation.
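The rules above can be sketched in C# as a single function. This is a minimal illustration, not the implementation built in the rest of the article: the names SoundexSketch, Encode and DigitFor are hypothetical, and instead of a placeholder character it simply remembers the previous letter's digit. It reads the rules literally, so digits are assigned only to letters after the first and the first letter never suppresses the digit that follows it. Some Soundex variants differ on this point.

```csharp
using System;
using System.Text;

public static class SoundexSketch
{
    // Returns the digit for an encodable letter, or '.' for vowels, H, W and Y.
    private static char DigitFor(char letter)
    {
        switch (letter)
        {
            case 'B': case 'F': case 'P': case 'V': return '1';
            case 'C': case 'G': case 'J': case 'K':
            case 'Q': case 'S': case 'X': case 'Z': return '2';
            case 'D': case 'T': return '3';
            case 'L': return '4';
            case 'M': case 'N': return '5';
            case 'R': return '6';
            default: return '.';
        }
    }

    public static string Encode(string text)
    {
        var code = new StringBuilder();
        char previous = '\0'; // digit (or '.') of the previous letter after the first

        foreach (char raw in text ?? string.Empty)
        {
            char letter = char.ToUpperInvariant(raw);
            if (letter < 'A' || letter > 'Z') continue; // ignore everything except A to Z

            if (code.Length == 0)
            {
                code.Append(letter); // the first letter is kept as it is
                continue;
            }

            char digit = DigitFor(letter);

            // Append a digit only if it differs from the previous letter's digit.
            // A vowel, H, W or Y resets 'previous', so a repeated digit survives.
            if (digit != '.' && digit != previous && code.Length < 4)
                code.Append(digit);

            previous = digit;
        }

        // Pad to four characters; input with no letters therefore gives "0000".
        return code.ToString().PadRight(4, '0');
    }
}
```

With this sketch, SoundexSketch.Encode("Smith") and SoundexSketch.Encode("Smythe") both return "S530", matching the customer database example.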
Implementing Soundex in C#
In this article we will implement the Soundex algorithm in C#. We will create a class containing two public methods. The first method will generate the Soundex code for a string. The second will compare two strings and return a value that indicates how similar they sound.
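To give a feel for what a similarity value could look like, the sketch below scores two existing Soundex codes by counting the positions at which they agree, giving a result from zero to four. This scoring rule is an assumption chosen for simplicity, and the names SoundexCompareSketch and MatchingPositions are hypothetical; the comparison method developed in the article may work differently.

```csharp
using System;

public static class SoundexCompareSketch
{
    // Counts the positions at which two four-character codes agree (0 to 4).
    public static int MatchingPositions(string codeA, string codeB)
    {
        if (codeA == null || codeB == null || codeA.Length != 4 || codeB.Length != 4)
            throw new ArgumentException("Both arguments must be four-character Soundex codes.");

        int matches = 0;
        for (int i = 0; i < 4; i++)
        {
            if (codeA[i] == codeB[i]) matches++;
        }
        return matches;
    }
}
```

Under this rule two identical codes score four, while codes that share only their first letter score one.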
To begin, create a new console application project and add a class named "Soundex".
Calculating a Soundex Code
The first step is to add the GetSoundex method signature. This public method accepts a single parameter containing the string to encode. It returns the Soundex code as a string. To create the method, add the following code to the Soundex class:
public string GetSoundex(string value)
When checking the characters of the input string, it is easier to work in one case only. As the initial letter of a Soundex code is usually presented in upper case, we will begin by converting the whole string to upper case. We will also create a new StringBuilder to hold the Soundex code as it is constructed. StringBuilder is in the System.Text namespace, so the file needs a using System.Text; directive. The first two lines of the GetSoundex method are therefore:
value = value.ToUpper();
StringBuilder soundex = new StringBuilder();
With the algorithm initialised we can process the individual letters in the string by looping through the characters and calling a method named "AddCharacter" when a letter is found. AddCharacter is responsible for adding the Soundex letter and digits to the StringBuilder object. We will create the AddCharacter method later.
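foreach (char ch in value)
    if (ch >= 'A' && ch <= 'Z') AddCharacter(soundex, ch); // assumed signature: AddCharacter(StringBuilder, char)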
At the end of the loop, the StringBuilder will contain the converted characters but is unlikely to be exactly four characters long. It may also include placeholder characters, as we will see shortly. The final part of the GetSoundex method rectifies these problems before converting the StringBuilder into a string and returning the result. Add the final three lines to the method:
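One way to write this clean-up step is sketched below as a standalone helper, so that it can be run on its own. The '.' placeholder and the names SoundexFinishSketch and Finish are assumptions made for illustration. Inside GetSoundex, the same three statements would operate directly on the soundex variable.

```csharp
using System.Text;

public static class SoundexFinishSketch
{
    // Assumption: '.' is the placeholder that marks vowels, H, W and Y.
    public static string Finish(StringBuilder soundex)
    {
        soundex.Replace(".", "");                        // remove the placeholders
        while (soundex.Length < 4) soundex.Append('0');  // pad short codes with zeros
        return soundex.ToString(0, 4);                   // truncate long codes to four characters
    }
}
```

For example, a builder holding "S5.3." becomes "S53" once the placeholders are removed and is then padded to "S530".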
12 February 2010