Friday 2 December 2011

Handling NULL values in SSIS fuzzy lookup

I'm using the SSIS Fuzzy Lookup data flow transformation in my current ETL package to find account name matches between source and target. So I compare the similarity of a number of fields using a conditional split transformation like:

(_Similarity > .70 || _Confidence > .70)
|| (_Similarity_CleanedCompanyName > .70 && _Similarity_CleanedABN > .70 && !ISNULL(CleanedABN))
|| (_Similarity_TradingNames__c > .70 && !ISNULL(TradingNames__c) && _Similarity_CleanedABN > .70 && !ISNULL(CleanedABN))

But SSIS gives a score of 1 to fields that are NULL in both the source and reference tables. E.g. if the phone number is null in the source table and the reference table phone number is also null then SSIS marks this as an exact match.

After searching without success for an easy way to handle null value matches, I decided to simply change all the null values in my reference table to 'unknown'. Now when SSIS compares the fields it returns a 0 similarity value for that particular column.

This is important because I am relying on the overall _Similarity and _Confidence values to automatically match records. For instance, if the overall _Similarity value is greater than 0.8 then the two records are an exact match and I insert the matched Id from source to target. Null values matches skew the overall _Similarity score, and in extreme cases a record pair might be matched because all the fields contain null values.

No comments:

Post a Comment