Wednesday, March 16, 2011

Feeling Fuzzy

SimMetrics
"SimMetrics is an open source extensible library of Similarity or Distance Metrics, e.g. Levenshtein Distance, L2 Distance, Cosine Similarity, Jaccard Similarity etc etc. SimMetrics provides a library of float based similarity measures between String Data as well as the typical unnormalised metric output."

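To make "float based similarity" versus "unnormalised metric output" a bit more concrete, here is a rough Python sketch of two of the measures named in the quote. This is not SimMetrics code: the function names are mine, and both the length-based normalisation for Levenshtein and the whitespace tokenisation for Jaccard are assumptions, so the library may well define them differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Unnormalised edit distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1.0 for identical strings, 0.0 for maximally different."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity over lower-cased, whitespace-separated tokens (an assumption)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
```

For example, `levenshtein("kitten", "sitting")` gives the raw distance 3, while `levenshtein_similarity` maps the same pair onto a score between 0 and 1.
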
Python: difflib
"This module provides classes and functions for comparing sequences. It can be used for example, for comparing files, and can produce difference information in various formats..."

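A quick sketch of both halves of that description, using only the standard library: `SequenceMatcher.ratio()` for a fuzzy similarity score and `unified_diff` for difference output. The helper name `similarity` and the sample data and file names are made up for illustration.

```python
import difflib


def similarity(a: str, b: str) -> float:
    """Return difflib's similarity ratio in [0, 1]; 1.0 means identical sequences."""
    return difflib.SequenceMatcher(None, a, b).ratio()


# Two versions of a small "file", as lists of lines (illustrative data).
old = ["apple\n", "banana\n", "cherry\n"]
new = ["apple\n", "bananna\n", "cherry\n"]

# unified_diff yields the lines of a unified diff; the file names are labels only.
diff = list(difflib.unified_diff(old, new, fromfile="old.txt", tofile="new.txt"))

if __name__ == "__main__":
    print(similarity("fuzzy", "fuzzi"))
    print("".join(diff), end="")
```

The ratio is computed as twice the number of matching characters divided by the combined length of both strings, so it behaves like a similarity score rather than an edit distance.
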
SSIS: Fuzzy Lookup Transformation
"The Fuzzy Lookup transformation performs data cleaning tasks such as standardizing data, correcting data, and providing missing values."