Mess

Mess Tolerance

A data set is always messy and always will be, no matter how many attempts are made to clean it. New information comes in constantly, and information that is current one day becomes obsolete the next, often without notice.

Rather than ignoring this mess, it is possible to use it to your advantage. Specific queries can be issued at any time against the topic graph to select the information that matches given criteria. The results of these queries can then be used to refactor the data set and perform some cleaning.
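As a minimal sketch of such a query, assume the topic graph is held in memory as a dictionary of topic records. The Topic class, its field names, the sample topics and the two criteria (staleness and absence of links) are all hypothetical and chosen only for illustration; a real topic graph would likely expose its own query language.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Topic:
    # Hypothetical topic record; the field names are assumptions for illustration.
    name: str
    last_updated: date
    links: set[str] = field(default_factory=set)  # names of related topics


def query(graph: dict[str, Topic], predicate) -> list[Topic]:
    """Return every topic in the graph that satisfies the given criterion."""
    return [topic for topic in graph.values() if predicate(topic)]


graph = {
    "Invoices": Topic("Invoices", date(2024, 3, 1), {"Customers"}),
    "Customers": Topic("Customers", date(2019, 6, 12), {"Invoices"}),
    "Fax numbers": Topic("Fax numbers", date(2011, 1, 5)),
}

# Two example criteria: topics untouched since a cutoff, and topics with no links.
cutoff = date(2020, 1, 1)
stale = query(graph, lambda t: t.last_updated < cutoff)
orphans = query(graph, lambda t: not t.links)

print([t.name for t in stale])    # ['Customers', 'Fax numbers']
print([t.name for t in orphans])  # ['Fax numbers']
```

The query only reports candidates; the graph itself is left untouched, so the decision to clean, and when, stays with whoever maintains it.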

An interconnected graph of topics may contain topics that are clean and precise alongside topics that are messy and out of date. The graph will keep working either way. Cleaning the mess is therefore not a must; it is a nice-to-have.

This tolerance for messy information can translate into significant cost savings. A topic network with some messy parts will still work, and phases of local, partial cleaning can be initiated whenever desirable, which removes the need to rebuild the information system from scratch when some mess starts to develop.

There are different kinds of mess. One kind arises when several topics exist under similar names and should be merged. Another arises when one topic combines several units of meaning and should therefore be split into separate topics. Multilingual data sets can get messy too, if several topic networks exist in parallel to describe the same information in various languages. Grouping the various language versions into one multilingual graph can also yield significant cost savings.
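The first kind of mess, near-duplicate topic names, can be sketched as follows. This is only an illustration under stated assumptions: the graph is a plain dictionary mapping each topic name to the set of names it links to, "similar" is taken to mean identical after ignoring case and extra whitespace, and the helper names normalize and merge_similar are hypothetical. Splitting a topic is not shown, since it depends on a human judgment about units of meaning.

```python
def normalize(name: str) -> str:
    # Assumed notion of "similar": same name after ignoring case and extra spaces.
    return " ".join(name.casefold().split())


def merge_similar(graph: dict[str, set[str]]) -> dict[str, set[str]]:
    """Merge topics whose normalized names match, keeping the first name seen."""
    canonical: dict[str, str] = {}  # normalized name -> surviving topic name
    for name in graph:
        canonical.setdefault(normalize(name), name)

    merged: dict[str, set[str]] = {}
    for name, links in graph.items():
        survivor = canonical[normalize(name)]
        bucket = merged.setdefault(survivor, set())
        for target in links:
            # Redirect each link to the surviving name of its target.
            target_survivor = canonical.get(normalize(target), target)
            if target_survivor != survivor:  # drop self-links created by the merge
                bucket.add(target_survivor)
    return merged


graph = {
    "Customer": {"Invoice"},
    "customer ": {"Address"},    # duplicate of "Customer" under a similar name
    "Invoice": {"customer "},    # points at the duplicate
    "Address": set(),
}

result = merge_similar(graph)
print(sorted(result))              # ['Address', 'Customer', 'Invoice']
print(sorted(result["Customer"]))  # ['Address', 'Invoice']
```

The merge runs on whatever part of the graph is passed in, which fits the idea of local, partial cleaning: the rest of the network keeps working while one corner is tidied.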