PSI Blog

Duplicate Detection With AI - Triple Boost for Data Consistency and Increased User-Friendliness (3/3)

03 Feb 2021 - Artificial Intelligence, Logistics, Production, Technology

© Tommy Lee Walker/shutterstock and alphaspirit/iStock (edited by PSI)

Do you know the situation: you come across multiple entries for one and the same fact in your database? We are talking about duplicates - multiple entries for the same data record, caused, for example, by different spellings. The first two articles in this series describe how to use auto-completion and input validation to keep newly entered records consistent. This third part shows how to conveniently clean up inconsistencies in an existing database using AI-based duplicate detection.

What's the Challenge?

In almost every business process today, data is the basis for acting efficiently and effectively. Maintaining a consistently high level of data quality is a major challenge for both the editors and the administrators of such databases.

With partially unmonitored data collection - for example, without auto-completion or automated input validation - inconsistencies accumulate over time. These can disrupt both the process itself and those that follow, often forcing manual rework or even leading to planning errors.

Case Study: 8 Records for the Same Supplier

For many years, addresses of suppliers operating worldwide were collected in a database. The entries were always made manually and by many different editors. Whenever an address was supposedly not found, a new one was created. Over time, duplicates of the same supplier accumulated due to different spellings. One example is a supplier in Italy whose street name can be entered in many different ways: in the local language as "Via delle Fabbriche" or in German translation as "Fabrikstr.", "Fabrikstrasse" or "Fabrikstraße". The company name can likewise be entered in the local language or as a German translation. Four street spellings combined with two name variants alone yield eight possible entries for the same information - and variants in upper and lower case multiply the count further.
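To make the combinatorics concrete, the following sketch simply enumerates these variants. The company names are invented for illustration; only the street spellings come from the case study:

```python
from itertools import product

# Street spellings from the case study; company names are invented examples.
streets = ["Via delle Fabbriche", "Fabrikstr.", "Fabrikstrasse", "Fabrikstraße"]
names = ["Officina Esempio S.r.l.", "Beispielwerk GmbH"]  # local vs. German

# To a naive exact-match check, every combination looks like a distinct record.
records = [f"{name}, {street}" for name, street in product(names, streets)]
print(len(records))  # 8 entries for one and the same supplier
```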

Consistency in address management thus steadily decreases over time, which degrades user-friendliness as well as the process itself.

What Is the Conventional Approach With Similarity Metrics?

For a constantly growing database that has existed for years, manually searching for duplicates to maintain consistency is out of the question because of the time involved. A first approach is to use similarity metrics: the contents of data records are interpreted as text objects - sequences of letters - and the distances between them are calculated. If this distance does not exceed a specified threshold, the two objects are treated as duplicates. However, this approach only searches for well-defined anomalies. In essence, it is a threshold check on a similarity comparison, and the outcome also depends on the length of the word. In addition, such methods have poor runtime behavior for large amounts of data - in principle, every record must be compared with every other - which limits their applicability in the context of Big Data. Moreover, similarity metrics are sometimes unstable with regard to semantics when processes change over time.
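As a minimal sketch of this conventional approach - not any particular product implementation - the following snippet treats two strings as duplicates when their length-normalized edit distance stays below an assumed threshold:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def is_duplicate(a: str, b: str, threshold: float = 0.2) -> bool:
    """Normalize by length so the threshold is comparable across word lengths."""
    distance = levenshtein(a.lower(), b.lower())
    return distance / max(len(a), len(b), 1) <= threshold

print(is_duplicate("Fabrikstrasse", "Fabrikstraße"))        # True: close spelling
print(is_duplicate("Fabrikstrasse", "Via delle Fabbriche")) # False: same meaning, far apart
```

The second call shows the semantic gap: the Italian and German spellings denote the same street, but no purely syntactic metric can relate them.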

A mechanism is needed that automatically detects anomalies in the structure of record comparisons and can continuously adapt to the current conditions.

How Do I Detect Duplicates Based on Data?

In most business processes, a broad base of historicized data already exists. Through Qualitative Labeling combined with Machine Learning, the structures of an entire database can be learned from past data in a process-specific manner.
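How this structure learning works in detail is specific to the framework. As a generic, heavily simplified stand-in for the data-driven idea, one could vectorize the historical records with character n-grams and flag suspiciously close nearest neighbors as duplicate candidates; the records and the cutoff below are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Invented records standing in for years of historicized address data.
history = [
    "Beispielwerk GmbH, Fabrikstr. 1, Genova",
    "Beispielwerk GmbH, Fabrikstrasse 1, Genova",
    "Officina Esempio S.r.l., Via delle Fabbriche 1, Genova",
    "Alpha Logistics Ltd., Harbour Road 7, Dover",
]

# Character n-grams capture spelling variants better than whole words.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X = vectorizer.fit_transform(history)

# For each record, inspect its nearest neighbor in the learned space
# (the closest hit is the record itself, hence n_neighbors=2).
nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(X)
distances, indices = nn.kneighbors(X)

for i, (dist, j) in enumerate(zip(distances[:, 1], indices[:, 1])):
    if dist < 0.5:  # assumed cutoff; in practice tuned or learned from labeled pairs
        print(f"duplicate candidate: {history[i]!r} ~ {history[j]!r}")
```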

Data-driven methods offer many advantages, especially for detecting multilevel relationships and complex similarities in data - such as finding a supplier that is listed with several entries in address management.

How Does Duplicate Detection Fit Into the Overall AI System?

The foundation for duplicate detection based on the Deep Qualicision AI Framework is the combination of Qualitative Labeling with a knowledge base of historicized data trained by Machine Learning. In addition, similarity metrics are used to compare text objects. The framework also enables decision support by simply giving preference to different evaluation KPIs. In this way, not only syntactic similarities but also semantic analogies - as with different spellings of street names or company names - can be included in detecting duplicates.
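A simplified sketch of how semantic analogies could complement a purely syntactic check: known translations and abbreviations are first normalized to a canonical form before comparing. The mapping table here is invented for illustration; in the framework, such knowledge would come from the learned knowledge base:

```python
from difflib import SequenceMatcher

# Invented mapping for illustration; the framework would derive such
# semantic analogies from the historicized knowledge base instead.
SEMANTIC_MAP = {
    "fabrikstr.": "via delle fabbriche",
    "fabrikstrasse": "via delle fabbriche",
    "fabrikstraße": "via delle fabbriche",
}

def canonical(text: str) -> str:
    """Map known translations and abbreviations to one canonical spelling."""
    t = text.strip().lower()
    return SEMANTIC_MAP.get(t, t)

def is_semantic_duplicate(a: str, b: str, threshold: float = 0.85) -> bool:
    """Semantic normalization first, then a syntactic similarity check."""
    ca, cb = canonical(a), canonical(b)
    return ca == cb or SequenceMatcher(None, ca, cb).ratio() >= threshold

print(is_semantic_duplicate("Fabrikstraße", "Via delle Fabbriche"))  # True
print(is_semantic_duplicate("Fabrikstrasse", "Fabrikstr."))          # True
```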

This kind of KPI-based, self-learning inspection mechanism can thus provide an automated way of continuously detecting duplicate data entries.

For the process itself and those that follow, this ensures that planning can be carried out with consistent data to reduce manual rework and avoid errors.

Benefits of Duplicate Detection

  • Detection of duplicates as anomalies in an entire database
  • Automated detection of duplicate data sets
  • Significant time savings and planning reliability in downstream processes
  • Consistency across the entire database
  • Qualitative standardization and plausibility analyses
  • Continuous relearning of the knowledge base to maintain a current data status

Triple Boost for Data Consistency and Usability

The modular linking of the Auto Complete, Data Entry Validation and Duplicate Detection modules - each of which can also be operated individually - creates a knowledge base that is constantly expanded through machine self-learning to provide automated support for data entry, verification and storage.

Auto-completion, data entry validation, and duplicate detection combine to deliver the triple boost for data consistency and usability.

Learn more about Qualitative Labeling and optimizing business process data with AI.

Download

Series: Triple Boost for Data Consistency and Usability

What is your opinion on this topic?

Dr. Jonas Ostmeyer

Consultant Supply Chain Optimization
PSI FLS Fuzzy Logik & Neuro Systeme GmbH