Before we can talk about the new AI corpus, we need to look backward.
For decades, data + AI teams have been trained to look downstream towards their analysts or business users for requirements.
This is in part because data quality is specific to the use case. For example, a machine learning application may require fresh but only directionally accurate data, while a finance report might need to be accurate down to the penny but updated only once per day.
But it wasn’t all pragmatic. It was also responsive.
The truth is, even if you wanted to look upstream, most upstream data sources wouldn’t talk to you. They were either third-party sources pumping data into the void, or internal software engineers creating a web of microservices… that were also pumping data into the void.
New number who dis?
In response, we’d even begun to play middleman, bringing requirements from downstream consumers to our data producers upstream.
And this approach (flawed as it was) really worked for a time. The challenge we’re facing in the wake of the AI race is that, while it’s not obsolete, it’s no longer sufficient.
So, what’s the latest?
The Data + AI Team’s New Best Friend: Knowledge Managers?
With unstructured RAG pipelines, the data source is no longer a messy database… it’s a messy knowledge base, doc repo, wiki, SharePoint site etc.
And guess what?
These data sources are just as opaque as their structured foils, with the added complication of being less predictable.
BUT there’s a silver lining.
Unlike those structured stalwarts that ruled before the AI enlightenment, unstructured data sources are (almost always) owned by a subject matter expert – or “knowledge manager” – with a clear understanding of what good looks like.
This AI corpus was created and cultivated for a reason, likely to answer the same types of questions and solve the same problems that your AI chatbot or agent is looking to solve.
And where those third parties and software engineers might be unwilling to dialogue about the minutiae of their data, these knowledge managers are more than happy to guide you through their painstakingly curated and managed repository.
“And they said, what do you mean version control?”
And that means these knowledge managers are the perfect partner to define what quality looks like.
Managing Unstructured Data Quality Upstream
When it comes to the unpredictability of unstructured data + AI pipelines, the best defense is a good offense. That means shifting left to build requirements alongside the knowledge managers who understand their data the best.
If you want to get to the beating heart of your AI corpus, start with questions like:
- What canonical documents should always be there? (completeness)
- What is the process for updating documents, and how often does it happen? (freshness)
- How stable are the file structures? Are there headings, sections, etc.? (chunking strategy, validity)
- What are the most critical metadata filters? How often do they change? (schema)
- Is it all in one language? Does it contain code or HTML? (validity)
- Are there file naming conventions? Any jargon or shorthand or contradictory terms? (validity)
- Who are the most common users? What are the most common questions? (eval strategy)
Once you understand who maintains that data source and what questions you need them to answer, you’re just a conversation away from gathering the requirements you need to create reliable data + AI systems.
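Several of the questions above translate directly into automated checks. Here's a minimal sketch of a corpus audit covering completeness (canonical documents present) and freshness (documents updated recently) — the directory name, canonical document list, and freshness threshold are hypothetical placeholders you'd fill in from your conversation with the knowledge manager:

```python
import time
from pathlib import Path

# Hypothetical requirements -- replace with what your knowledge manager tells you.
CANONICAL_DOCS = {"onboarding.md", "pricing_policy.md"}  # completeness: must always exist
MAX_AGE_DAYS = 30  # freshness: docs older than this are flagged as stale

def audit_corpus(corpus_dir: Path) -> dict:
    """Run simple completeness and freshness checks over a document corpus."""
    docs = list(corpus_dir.glob("*.md"))
    present = {p.name for p in docs}
    # Completeness: canonical documents that should always be there
    missing = CANONICAL_DOCS - present
    # Freshness: documents not modified within the agreed window
    cutoff = time.time() - MAX_AGE_DAYS * 86400
    stale = [p.name for p in docs if p.stat().st_mtime < cutoff]
    return {"missing": sorted(missing), "stale": sorted(stale)}
```

A check like this won't replace the conversation, but it turns the knowledge manager's answers into something you can run on a schedule and alert on.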
Don’t Let Your AI Corpus Become a Crisis
An AI response can be relevant, grounded, and absolutely wrong. And if you aren’t as intimately familiar with your AI corpus (and its administrators) as you are with your pipelines and your models, you will fail.
The most practical way to get ahead of this silent failure is to ensure your AI is always receiving the most accurate and up-to-date content.
And the good news is, you probably have a resource in your organization who’s ready and willing to help.
One of the best ways to keep that content accurate is to maintain corpus-embedding alignment – which means data + AI team and knowledge manager alignment.
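One common way to detect corpus-embedding drift is to fingerprint each document at embedding time and re-check those fingerprints later: any document whose content hash no longer matches was edited after it was embedded and needs re-embedding. A minimal sketch, assuming a directory of markdown files and a hypothetical `embedded_hashes` record kept alongside your vector store:

```python
import hashlib
from pathlib import Path

def content_hash(path: Path) -> str:
    """Stable fingerprint of a document's current content."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def find_stale_embeddings(corpus_dir: Path, embedded_hashes: dict[str, str]) -> list[str]:
    """Return docs whose content changed since they were last embedded,
    plus docs that were never embedded at all."""
    stale = []
    for doc in sorted(corpus_dir.glob("*.md")):
        if embedded_hashes.get(doc.name) != content_hash(doc):
            stale.append(doc.name)
    return stale
```

Run this whenever the knowledge manager ships an update, and your retrieval layer stops silently serving embeddings of documents that no longer exist in that form.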
Once upon a time, downstream alignment was enough to create effective requirements. But no longer. If you’re building data + AI systems, you HAVE to cast an eye both downstream and upstream.
Outputs are only HALF the story. If your AI is wrong, the problem is just as likely to be upstream with your inputs (or lack of inputs) as it is in the model itself.
Remember that lesson – and operationalize a data + AI observability solution – and you’ll be one step ahead of the AI reliability game.