
.NET 2.0+
The Soundex Algorithm (2)
Cultural differences and input errors can lead to words being spelled differently to a user's expectations. This makes it difficult to locate information quickly. The Soundex algorithm can alleviate this by assigning codes based upon the sound of words.
Calculating a Soundex Code
The first step is to add the GetSoundex method signature. This public method accepts a single parameter containing the string to encode. It returns the Soundex code as a string. To create the method, add the following code to the Soundex class:
public string GetSoundex(string value)
{
}
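When checking the characters of the input string it is easier to work in one case only. As the initial letter of a Soundex code is usually presented in upper case, we will begin by capitalising the string, using the invariant culture so that the result does not depend on the user's regional settings. We will also create a new StringBuilder to hold the Soundex code as it is constructed; this class lives in the System.Text namespace, so the file needs a using System.Text directive if it does not already have one. The first two lines of the GetSoundex method are therefore:
value = value.ToUpperInvariant();
StringBuilder soundex = new StringBuilder();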
With the algorithm initialised we can process the individual letters in the string by looping through the characters and calling a method named "AddCharacter" when a letter is found. AddCharacter is responsible for adding the Soundex letter and digits to the StringBuilder object. We will create the AddCharacter method later.
foreach (char ch in value)
{
    if (char.IsLetter(ch))
        AddCharacter(soundex, ch);
}
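To see what the loop passes on to AddCharacter, the following sketch collects the characters that survive the char.IsLetter test. The LetterFilterSketch class and its LettersOnly method are hypothetical names used only for illustration and are not part of the Soundex class. ToUpperInvariant is used here so that the output does not vary with the current culture.

```csharp
using System.Text;

public static class LetterFilterSketch
{
    // Hypothetical helper: returns the upper-cased letters that the
    // GetSoundex loop would hand to AddCharacter, in their original order.
    public static string LettersOnly(string value)
    {
        StringBuilder letters = new StringBuilder();
        foreach (char ch in value.ToUpperInvariant())
        {
            // Apostrophes, hyphens, spaces and digits fail IsLetter and are skipped.
            if (char.IsLetter(ch))
                letters.Append(ch);
        }
        return letters.ToString();
    }
}
```

Calling LetterFilterSketch.LettersOnly("O'Brien-Smith 3rd") returns "OBRIENSMITHRD", so punctuation and digits never reach the encoding step.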
At the end of the loop, the StringBuilder will contain the converted characters but is unlikely to be exactly four characters long. It may also include placeholder characters, as we will see shortly. The final part of the GetSoundex method rectifies both problems before converting the StringBuilder into a string and returning the result. Add the final three lines to the method:
RemovePlaceholders(soundex);
FixLength(soundex);
return soundex.ToString();
Adding Soundex Character Codes
The GetSoundex method calls several private methods that have yet to be defined. The first is the AddCharacter method, which encodes a letter as a Soundex character and appends it to the code. The first letter is copied to the Soundex string unchanged; each subsequent letter is converted to a digit or placeholder first and appended only if it differs from the character most recently added to the StringBuilder.
private void AddCharacter(StringBuilder soundex, char ch)
{
    // The first letter is kept as it is.
    if (soundex.Length == 0)
        soundex.Append(ch);
    else
    {
        // Later letters are encoded and appended unless they repeat the last character.
        string code = GetSoundexDigit(ch);
        if (code != soundex[soundex.Length - 1].ToString())
            soundex.Append(code);
    }
}
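A trace of the StringBuilder after each letter can make the duplicate check easier to follow. The sketch below mirrors AddCharacter and uses the letter-to-digit mapping described in the next section; the AddCharacterTrace class and its Trace, Add and Digit members are hypothetical names introduced for this illustration, and the input is assumed to contain upper case letters only.

```csharp
using System.Text;

public static class AddCharacterTrace
{
    // Compact form of the article's letter-to-digit mapping;
    // "." is the placeholder for letters that have no digit.
    static string Digit(char ch)
    {
        if ("BFPV".IndexOf(ch) >= 0) return "1";
        if ("CGJKQSXZ".IndexOf(ch) >= 0) return "2";
        if ("DT".IndexOf(ch) >= 0) return "3";
        if (ch == 'L') return "4";
        if ("MN".IndexOf(ch) >= 0) return "5";
        if (ch == 'R') return "6";
        return ".";
    }

    // Mirrors AddCharacter: keep the first letter, then append a code
    // only when it differs from the last character in the builder.
    static void Add(StringBuilder soundex, char ch)
    {
        if (soundex.Length == 0)
        {
            soundex.Append(ch);
            return;
        }
        string code = Digit(ch);
        if (code != soundex[soundex.Length - 1].ToString())
            soundex.Append(code);
    }

    // Hypothetical helper: returns the builder contents after each letter.
    public static string[] Trace(string upperCaseLetters)
    {
        StringBuilder soundex = new StringBuilder();
        string[] steps = new string[upperCaseLetters.Length];
        for (int i = 0; i < upperCaseLetters.Length; i++)
        {
            Add(soundex, upperCaseLetters[i]);
            steps[i] = soundex.ToString();
        }
        return steps;
    }
}
```

For "JACKSON" the steps are "J", "J.", "J.2", "J.2", "J.2", "J.2." and "J.2.5": the K and S add nothing because they share the code of the preceding C. Note that the first character is stored as a letter rather than a digit, so the comparison never suppresses the second letter. "PFISTER" therefore starts "P1", whereas Soundex as commonly specified also drops a second letter that shares the first letter's code.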
Determining Soundex Digits
The GetSoundexDigit method encodes letters as digits. The letter is converted to a digit between one and six according to the algorithm rules. If the letter has no code, as is the case for vowels and for H, W and Y, a full stop (period) character is returned as a placeholder. Because a placeholder sits between them in the StringBuilder, two letters with the same code are both kept when they are separated by a vowel, H, W or Y.
private string GetSoundexDigit(char ch)
{
    string chString = ch.ToString();
    if ("BFPV".Contains(chString))
        return "1";
    else if ("CGJKQSXZ".Contains(chString))
        return "2";
    else if ("DT".Contains(chString))
        return "3";
    else if (ch == 'L')
        return "4";
    else if ("MN".Contains(chString))
        return "5";
    else if (ch == 'R')
        return "6";
    else
        return ".";
}
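The effect of the placeholders can be checked by encoding the same word with and without them. The sketch below is an illustration only, not the article's implementation: the PlaceholderSketch class, the Codes lookup table and the usePlaceholders flag are assumptions made for the comparison, and the table is simply the mapping above written as one character per letter from A to Z. The input is assumed to contain upper case letters only.

```csharp
using System.Text;

public static class PlaceholderSketch
{
    // One code per letter, A to Z; '.' marks letters without a digit.
    const string Codes = ".123.12..22455.12623.1.2.2";

    // Table-driven equivalent of the if/else chain (an alternative form,
    // not the article's code). Anything outside A-Z gets the placeholder.
    public static char Digit(char ch)
    {
        return (ch >= 'A' && ch <= 'Z') ? Codes[ch - 'A'] : '.';
    }

    // Encodes a word, dropping adjacent duplicate codes. With
    // usePlaceholders set to false, unencoded letters are skipped before
    // the duplicate check, so they no longer separate repeated digits.
    public static string Encode(string upperCaseLetters, bool usePlaceholders)
    {
        StringBuilder soundex = new StringBuilder();
        foreach (char ch in upperCaseLetters)
        {
            if (soundex.Length == 0)
            {
                soundex.Append(ch);   // first letter is kept as it is
                continue;
            }
            char code = Digit(ch);
            if (code == '.' && !usePlaceholders)
                continue;
            if (code != soundex[soundex.Length - 1])
                soundex.Append(code);
        }
        soundex.Replace(".", "");     // placeholders are not part of the result
        return soundex.ToString();
    }
}
```

Encode("BANANA", true) returns "B55", because the placeholder for each A keeps the two Ns apart. Encode("BANANA", false) returns "B5", because the second N is then treated as a duplicate of the first.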
Removing Placeholder Characters
The next method removes placeholder characters from the StringBuilder object, leaving only encoded characters. This is achieved with a call to the StringBuilder's Replace method:
private void RemovePlaceholders(StringBuilder soundex)
{
    soundex.Replace(".", "");
}
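The two finishing steps can be tried on the intermediate strings that the loop produces. In the sketch below the placeholder removal is the same Replace call as above, but the length fix is an assumption: FixLength has not been defined at this point, so the code simply truncates to four characters and pads with zeros, as Soundex codes conventionally are. FinishSketch and Finish are hypothetical names.

```csharp
using System.Text;

public static class FinishSketch
{
    // Hypothetical helper: applies the two finishing steps to the
    // intermediate contents of the StringBuilder.
    public static string Finish(string intermediate)
    {
        StringBuilder soundex = new StringBuilder(intermediate);

        // Same call as RemovePlaceholders: drop every full stop.
        soundex.Replace(".", "");

        // ASSUMED stand-in for FixLength (not the article's code):
        // cut anything beyond four characters, then pad with '0'.
        if (soundex.Length > 4)
            soundex.Length = 4;
        while (soundex.Length < 4)
            soundex.Append('0');

        return soundex.ToString();
    }
}
```

Under these assumptions Finish("R.1.63") gives "R163", Finish("J.2.5") gives "J250" and Finish("P1.23.6") gives "P123". The placeholders can only be removed once the loop has finished, since the duplicate check in AddCharacter relies on them while the code is being built.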
12 February 2010