Free Excel Spreadsheet to Help You Find Duplicate Data

by bobroan on June 12, 2013


Finding duplicate records is one of the most common, and gnarly, jobs.

One of the difficulties is that information can be recorded differently and computers are very literal.  So 123 Main Street is not the same as 123 Main St. to a computer.

Here’s an Excel spreadsheet you can use to somewhat “normalize” names, addresses and phone numbers.

Download the spreadsheet “purgeduplicates” from here and save it to your computer.  You may have to enable the content to get it to work.

The general idea is to strip out all the phrases that could be expressed some way else, or not at all.

For instance, “street” could also be writing as “st” or even “road.”  And if we’re looking for potential duplicates, “123 Main Street” won’t seem the same to the computer as “123 Main Rd.”  But if we remove “street” and “rd” before comparing, then we are just comparing “123 Main” to “123 Main” and we’ve identified a potential duplicate.

We could create a column in excel for each phrase we want removed, but that would be a lot of columns and awkward to add new phrases.

So I’ve created what’s called a user defined function do remove them all without creating those extra columns.  And if there’s another term you want removed or a term you don’t want removed, it’s easy to change.

For example, consider the “datasheet” worksheet of the purgeDuplicates spreadsheet you just downloaded.

DataPass01

 

 

 

 

 

 

The data to be cleaned up is in the first three columns and the “cleaned up” data is in the last three columns.

Take a look at the “cleaned up name” on row 9.  We should take out the “Mister”

That will be easy.  There are three worksheets solely for the purpose of listing the phrases we want to strip out of the address, name, or phone.

Open the worksheet “NamePhrases” and you’ll see:

Names01

 

This is a list of all the phrases we want removed from the names.

The phrases are in the first column and their lengths are in the second column (you don’t need to update the length.  The computer does that.)

 

The length is important because we want to search for the longest phrase first or else we could have problems.  For example, suppose we’re removing both “parkway” and “way” from addresses.  If it removes the shorter “way” first, “parkway” will become “park” instead of being removed completely.

 

 

 

 

 

 

 

 

 

 

So let’s add “mister” right under “dr” (there can’t be any empty lines) so it looks like this:

Names02

 

 

 

 

 

 

 

 

 

 

There is no length.

Now save the workbook.  Saving the workbook is what causes everything to be updated.

The “namephrases” worksheet, has been updated:

Names03

 

 

 

The length of “mister” has been calculated to be 6 and it’s been moved to row 7, just under “doctor” because we want to check for these phrases longest first.

 

 

 

 

 

 

 

And now, if you look at the data, you’ll see that the “mister” has been removed from the name

DataPass02

 

 

 

 

 

 

Look at the “cleaned up Address” for Jane Smith on row 6.  It has a “#” in it.  Go to the “addresphrases” worksheet, where you’ll see

addresses01

 

 

 

 

 

 

 

 

Add “#” under “st” and save the workbook.

The “addressPhrases” worksheet now looks like this:

addresses02

 

 

 

 

 

 

 

The length of “#” has been calculated to be 1, so it stays as the last thing to be checked.

 

 

And the data worksheet now looks like this:

DataPass03

 

 

 

 

 

The “#” is gone from the “cleaned up Address” on row 6.

These functions also remove all spaces and capitalize everything because those can also throw you off.

You can add, insert and move around the columns as much as you want.

The functions to use are “cleanaddress(),” “cleanname()”, and “cleanphone().”  Use them just as you would any excel function.  Here’s how it looks in the formula bar:

DataPass04

 

 

 

 

 

 

 

This shows that the formula in cell D2 is “cleanname()) and it cleans the name from cell A2.

Once you’ve cleaned up these fields, you can sort on them to identify possible duplicates.

Here’s how you might use this spreadsheet:

  • Copy into it the information from which you want to purge duplicates
  • Add 3 columns for the “cleaned up” information and set them to the three functions (cleanaddress, cleanname, and cleanphone)
  • After those columns have been calculated (when in doubt, save the worksheet, close it and reopen it) copy them and do a “paste values” to remove the reference to the functions and replace it with the cleaned up value.
  • Copy the information back into its original source and proceed with your clean up process with these three new columns to help you better identify possible duplicates.

 

Leave a Comment

Previous post:

Next post: