Any organisation that deals with customer, prospect, supplier, distributor, product and service information uses all kinds of data in its day-to-day business processes. Identifying a customer or a product within an automated system, using a specific ID number, the name or any other identifying feature, is a key issue in these processes. Furthermore, it is a task that needs considerable attention, since the collection and management of data is inherently error-prone. People make mistakes, names are understood incorrectly, numbers are typed in the wrong order; there are just too many reasons for defective data and poor information quality.
The collective term ‘business data’ is often used without a precise notion of what business data actually contain. It is not just the customer identification numbers and product codes. Naturally, the sort and the importance of data used in a business process will differ from organization to organization. However, a closer look at the seemingly endless variation will show that names and addresses of persons and organizations are as detailed and complicated as they are identifying. The following classification will show the details of names, addresses and complementary data.
Defining the data groups as precisely and in as much detail as possible is the first step towards useful interpretation. People, applying their natural language processing capabilities, structure the information as they interpret it. They use their frame of reference, which includes their knowledge dictionary, their linguistic repository, statistical information and mathematical information.
Knowledge-based interpretation, incorporated in an automated system to solve data quality issues, must work in exactly the same way. Consider the following examples:
Peter Arnold Frank
If you had to interpret this name, you would probably (assuming you are of European or American origin) designate Peter as a given name, Arnold as a middle name (or second given name) and Frank as a surname. Of course, all three names are very common given names, and all three also exist as surnames. But the signification [given name-given name-surname] is clearly the most probable one in this particular context.
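The reasoning above can be sketched in a few lines of Python. This is only an illustration of combining a knowledge dictionary with statistical information, not the method of any particular product. The frequency table ROLE_FREQ, the pattern table PATTERN_PRIOR and the function interpret are hypothetical, and all numbers are invented for the example.

```python
# Hypothetical relative frequencies of each token per role.
# The values are illustrative assumptions, not measured data.
ROLE_FREQ = {
    "peter":  {"given": 0.95, "surname": 0.05},
    "arnold": {"given": 0.70, "surname": 0.30},
    "frank":  {"given": 0.60, "surname": 0.40},
}

# Hypothetical prior probabilities of three-token name patterns
# in a European or American frame of reference.
PATTERN_PRIOR = {
    ("given", "given", "surname"): 0.80,
    ("given", "surname", "surname"): 0.15,
    ("surname", "given", "given"): 0.05,
}

# Fallback frequency for a token or role missing from the dictionary.
UNKNOWN = 0.01


def interpret(name):
    """Return the most probable (token, role) assignment for a name."""
    tokens = name.split()
    best_pattern, best_score = None, 0.0
    for pattern, prior in PATTERN_PRIOR.items():
        if len(pattern) != len(tokens):
            continue
        # Score = pattern prior times the frequency of each token in its role.
        score = prior
        for token, role in zip(tokens, pattern):
            score *= ROLE_FREQ.get(token.lower(), {}).get(role, UNKNOWN)
        if score > best_score:
            best_pattern, best_score = pattern, score
    if best_pattern is None:
        return []
    return list(zip(tokens, best_pattern))


print(interpret("Peter Arnold Frank"))
```

With these assumed numbers, the pattern [given name-given name-surname] scores highest, although every token could in principle fill either role.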
Mohammad Ouazzani Benhaddou
This name seems to have a structure similar to that of Peter Arnold Frank. However, we will probably (and often unconsciously) interpret it differently. This happens because our frame of reference tells us that the name is most likely of Arab origin and that names from that region of the world follow different naming conventions. Although the name Mohammad Ouazzani Benhaddou does not carry an identification mark such as “attention, this is a name of Arab origin!”, we will consider precisely this origin in the interpretation of the name.
Chr. London Int. Transp. Co.
This example may seem puzzling at first, since most of the words are abbreviations (which are very common in organization names) and the one word that is not an abbreviation, London, is ambiguous. In this case, London is probably a surname. Chr. is most likely the abbreviation of a given name such as Christopher. The abbreviations Int. Transp. Co. most likely stand for International Transport Company.
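A knowledge dictionary for abbreviations can be sketched as a simple lookup. The table ABBREVIATIONS and the function expand are hypothetical and cover only this one example. In practice an abbreviation such as Chr. has several possible expansions, and choosing between them would need the kind of contextual knowledge described above.

```python
# Hypothetical abbreviation dictionary. Each entry holds only the most
# likely expansion assumed for this example; real data would need several
# candidates per abbreviation and a way to rank them.
ABBREVIATIONS = {
    "chr.": "Christopher",
    "int.": "International",
    "transp.": "Transport",
    "co.": "Company",
}


def expand(name):
    """Replace known abbreviations and leave all other tokens unchanged."""
    return " ".join(
        ABBREVIATIONS.get(token.lower(), token) for token in name.split()
    )


print(expand("Chr. London Int. Transp. Co."))
```

The token London is not in the dictionary and passes through unchanged. Deciding whether it is a surname or a city is a separate interpretation step.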
The examples above show that a knowledge repository can be very useful in interpretation (recall the use of your own frame of reference and knowledge dictionary mentioned above). Of course, automated interpretation based on natural language will need additional help to perform as well as we humans do. But the creation of the knowledge universe is the starting point for answering that short question: Who is who and what is what in my database?
Holger Wandt was Principal Advisor at Human Inference. He joined Human Inference in 1991. As a linguist, he was one of the pioneers of the interpretation and matching technology in the data quality product suite. In his position as Principal Advisor he was responsible for conveying vision to customers and partners and for promoting ideas and vision to industry boards, thought communities, universities and analyst firms.