
AI’s Achilles’ Heel: The Data Quality Dilemma

As AI has gained prominence, all of the data quality issues we’ve faced historically remain relevant. However, additional complexities arise when dealing with the nontraditional data that AI often makes use of.

AI Data Has Different Quality Needs

When AI makes use of traditional structured data, all the same data cleansing processes and protocols that have been developed over the years can be used as-is. To the extent an organization already has confidence in its traditional data sources, the use of AI shouldn’t require any special data quality work.

The catch, however, is that AI often makes use of nontraditional data that can’t be cleansed in the same way as traditional structured data. Think of images, text, video, and audio. When using AI models with this type of data, quality is as important as ever. But unfortunately, the traditional methods utilized for cleansing structured data simply don’t apply. New approaches are required.

AI’s Different Needs: Input And Training

First, let’s use an example of image data quality from the input and model training perspective. Typically, each image has been given tags summarizing what it contains. For example, “hot dog” or “sports car” or “cat.” This tagging, typically done by humans, can contain outright errors as well as cases where different people interpret the same image differently. How can we identify and handle such situations?

It isn’t easy! With numerical data, bad values can be identified via mathematical formulas or business rules. For example, if the price of a candy bar is listed as $125, we can be confident it is wrong because it is so far above expectation. Similarly, a person recorded as age 200 clearly doesn’t make sense. There is, however, no effective way today to mathematically check whether the tags on an image are accurate. The best way to validate a tag is to have a second person assess the image.
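For traditional structured data, such checks are straightforward to codify. Here is a minimal sketch of rule-based validation; the field names and thresholds are illustrative assumptions, not part of any particular system:

```python
# A minimal sketch of rule-based validation for structured data.
# Field names and thresholds here are illustrative assumptions.

def validate_record(record):
    """Return a list of rule violations found in a structured record."""
    violations = []
    # Business rule: a candy bar priced far above expectation is suspect.
    if record.get("candy_bar_price", 0) > 10.00:
        violations.append("price far above expected range")
    # Sanity rule: no person is 200 years old.
    if not 0 <= record.get("age", 0) <= 120:
        violations.append("age outside plausible range")
    return violations

# Both of the article's examples trip a rule.
print(validate_record({"candy_bar_price": 125.00, "age": 200}))
```

No equivalent formula exists for deciding whether the pixels of an image really show a “hot dog.”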

An alternative is to develop a process that uses other AI models to scan the image and check whether the tags applied appear to be correct. In other words, we can use existing image models to help validate the data being fed into future models. While this introduces the potential for circular logic, models are becoming strong enough that, pragmatically, it shouldn’t be a problem.
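The cross-checking logic might look like the sketch below. Here `classify_image` is a stand-in for a real pretrained classifier (one returning a label and a confidence score); the stub, dataset fields, and confidence threshold are all assumptions for illustration:

```python
# Sketch: use an existing image model to flag human tags that a
# confident model disagrees with. `classify_image` is a stub standing
# in for a real pretrained classifier -- an assumption, not a real API.

def classify_image(image):
    # Placeholder: a real implementation would run a pretrained model
    # and return its predicted label plus a confidence score.
    return image["true_label"], 0.92

def flag_suspect_tags(dataset, min_confidence=0.8):
    """Return ids of images whose human tag disagrees with a confident model."""
    suspects = []
    for image in dataset:
        predicted, confidence = classify_image(image)
        if confidence >= min_confidence and predicted != image["tag"]:
            suspects.append(image["id"])
    return suspects

dataset = [
    {"id": 1, "tag": "hot dog", "true_label": "hot dog"},
    {"id": 2, "tag": "cat", "true_label": "sports car"},  # mistagged
]
print(flag_suspect_tags(dataset))  # only the mistagged image is flagged
```

Flagged images would then go to a human for a second look rather than being relabeled automatically, which limits the damage if the validating model itself is wrong.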

AI’s Different Needs: Output And Scoring

Next, let’s use an example of image data quality from the model output and scoring perspective. Once we have an image model that we have confidence in, we feed the model new images so that it can assess the images. For instance, does the image contain a hot dog, or a sports car, or a cat? How can we assess if an image provided for assessment is “clean enough” for the model? What if the image is blurry or pixelated or otherwise not clear? Is there a way to “clean” the image?

The confidence we can have in what an AI model tells us about an image depends directly on how clean that image is. Consider a blurry image: how do we know whether it shows a blurred view of trees or something else entirely? Even for humans, this assessment is subjective, and there is no clear path to an automated, algorithmic way of declaring an image “clean enough” or not. Here, manual review might be best. In the absence of that, we can again use an algorithm that scores the clarity of the input image, along with processes to rate the confidence in the descriptions generated by the model’s assessment. Many AI applications do this today, but there is surely room for improvement.
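One common heuristic for such a clarity score is the variance of the Laplacian of a grayscale image: blurry images have few sharp edges and therefore low variance. The sketch below implements it in plain Python; the threshold is an assumption that would need tuning per application:

```python
# A rough clarity score: variance of the Laplacian of a grayscale image.
# Low variance suggests few sharp edges, i.e., a blurry input.
# The threshold of 100 is an assumption to be tuned per application.

def laplacian_variance(img):
    """img: 2D list of grayscale pixel values (at least 3x3)."""
    rows, cols = len(img), len(img[0])
    values = []
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Discrete Laplacian at the interior pixel (r, c).
            lap = (img[r - 1][c] + img[r + 1][c] + img[r][c - 1]
                   + img[r][c + 1] - 4 * img[r][c])
            values.append(lap)
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def clean_enough(img, threshold=100.0):
    return laplacian_variance(img) >= threshold

sharp = [[0, 0, 255, 255]] * 4      # a hard vertical edge
flat = [[10, 10, 10, 10]] * 4       # nearly uniform, "blurry"
print(clean_enough(sharp), clean_enough(flat))  # True False
```

A score like this only gates obviously degraded inputs; it says nothing about whether the model’s description of a passing image is correct, which is why the confidence rating on the output side is still needed.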

Rising To The Challenge

The examples provided illustrate that classic data quality approaches like missing value imputation and outlier detection can’t be applied directly to data such as images or audio. These new data types, which AI is heavily dependent on, will require new and novel methodologies for assessing quality both on the input and the output end of the models. Given it took us many years to develop our approaches for traditional data, it should come as no surprise that we have not yet achieved similar standards for the unstructured data which AI uses.

Until those standards arise, it is necessary to:

  1. Constantly scan industry blogs, papers, and code repositories to keep tabs on newly developed approaches
  2. Make your data quality processes modular so that it is easy to alter or add procedures to use the latest advances
  3. Be diligent in studying identified errors so that you can spot patterns in where your cleansing processes and models perform better and worse
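The second point, modularity, can be as simple as keeping each quality check a pluggable function. A minimal sketch, with check names and fields invented for illustration:

```python
# Sketch of a modular quality pipeline: each check is a pluggable
# function, so new techniques can be swapped in without rewriting the
# flow. Check names and record fields are illustrative assumptions.

def check_tag_agreement(item):
    return item.get("tag") == item.get("model_label")

def check_clarity(item):
    return item.get("clarity_score", 0) >= 100

# Adding a newly published technique means appending one function here.
QUALITY_CHECKS = [check_tag_agreement, check_clarity]

def run_pipeline(item):
    """Return names of failed checks, useful for error-pattern analysis."""
    return [chk.__name__ for chk in QUALITY_CHECKS if not chk(item)]

item = {"tag": "cat", "model_label": "cat", "clarity_score": 40}
print(run_pipeline(item))  # → ['check_clarity']
```

Logging the names of failed checks per item also supports the third point: over time those logs reveal where the cleansing processes perform better and worse.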

Data quality has always been a thorn in the side of data and analytics practitioners. Not only do the traditional issues remain as AI is deployed, but the different data that AI uses introduces all sorts of novel and difficult data quality challenges to address. Those working in the data quality realm should have job security for some time to come!

Originally posted in the Analytics Matters newsletter on LinkedIn

The post AI’s Achilles’ Heel: The Data Quality Dilemma appeared first on Datafloq.

