Data archaeology: why your “agentic future” depends on your past

Data is my obsession. It all started with a nervous question at the age of twelve: “Mom, who is my dad?”

That question kicked off years of searching. It didn’t help that the name was one of the most common in American records, and I only had a rough idea of his age and where he lived. I was working with a mystery and very little data to anchor it.

It kicked off my need to figure out my genealogy. I spent years in libraries poring over family bibles, history books, newspapers, and government records, hoping to uncover any hint that would lead me to my father. Along the way, I quickly discovered a “consistent inconsistency”: humans. Records were modified to hide secrets; names were misspelled by tired clerks or census takers; misfiled records pointed to the wrong lineage. This “clutter” led to years of wrong conclusions and confusion. Only when the data began to clear up did the truth emerge. After eight years of searching, I finally found my biological father… and a lifelong obsession with data.

The data layers of “good enough”

Business data is no different. Your CRM and ERP aren’t just databases; they are digital strata—layers of human shortcuts, legacy “family secrets,” and abandoned business logic. Fresh out of college and in a brand-new company, I found myself staring at what I perceived as “a massive data mess” in our home-grown system. I’m cringing admitting this, but I publicly and loudly announced how “garbage” the data was to anyone who would listen. My mentor (who was also my boss) quietly pulled me aside and leveled me with a single sentence: “I’m the one who made that data, Ian.” He was driven by the demands of the day. Eek. Horrified. Humbled.

No one did anything wrong. “Life happened.” My mentor at the time is the last person on earth I would equate with messy. He was a human who chose to recognize my good intent rather than judge me on the apparent data in front of him.

These layers of “good enough” logic accumulated over the years, and now, as we lead organizations into the era of agentic AI, we are hitting a sobering reality: an Agent is only as smart as the records it reads. It unfortunately lacks the context to see the “pressure of the day” — it only sees the sediment.

Look, everyone has a day job. Data is usually messy, not because people are trying to hide things, but because they are pressured by the “now.” In the heat of the moment, people will circumvent the restrictive nature of a system (like those well-meaning validation rules) to simply jam results into a field. They are trying to survive their Tuesday, not sabotage the company.

Probably my favorite example of this, one that always makes me smile, comes from Salesforce itself. If you’ve spent any time in the schema, you’ve seen the Product2 object. When Salesforce was just starting out, it had an object called “Product.” But it needed to add more complex objects like “Pricebooks” and “Opportunity Line Items.” Unfortunately, the existing “Product” object was too rigid, but it was already the foundation for thousands of customers. Salesforce couldn’t move the mountain, so it just built a second one next to it and called it “Product2.” It was likely intended to be temporary, but in tech, “temporary” is often the first step toward “permanent.”

And what about at the field level? You’ve probably seen some version of this:
• API name = Field X / Label = Full Name
• API name = Field X_V2 / Label = Full Name
• API name = Field X_V3 / Label = Full Name

Or maybe you know about, or are quietly living with, an organization that has actually maxed out the number of custom fields allowed on an object. I am curious: does your current business actually need 900 custom fields on the Account object? These are artifacts of a business operating for today without considering “what does future-me need?” To an AI agent, these are conflicting values. The agent processes the facts in front of it but lacks common sense. It doesn’t know which of those three fields is relevant. How many times have we seen different iterations of a field on different layouts where, in context, they mean entirely different things? That nuance is lost on an AI. You will get an expensive, high-speed hallucination — and you can’t blame the AI. It’s just doing its job. The data is the problem.
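This kind of sediment is easy to surface mechanically. Here is a minimal sketch in Python that flags fields whose labels collide — the classic “Field X / Field X_V2 / Field X_V3” pattern. The field list is illustrative stand-in data, not a real org’s schema; in practice you would feed it the output of a metadata describe call.

```python
# Hypothetical audit sketch: find fields that share the same label,
# a common symptom of versioned-field sediment. The input data below
# is made up for illustration.
from collections import defaultdict

def find_label_collisions(fields):
    """Group field API names by normalized label; keep labels with 2+ names."""
    by_label = defaultdict(list)
    for f in fields:
        by_label[f["label"].strip().lower()].append(f["apiName"])
    return {label: names for label, names in by_label.items() if len(names) > 1}

account_fields = [  # illustrative records, not a real schema
    {"apiName": "Field_X__c", "label": "Full Name"},
    {"apiName": "Field_X_V2__c", "label": "Full Name"},
    {"apiName": "Field_X_V3__c", "label": "Full Name"},
    {"apiName": "Industry", "label": "Industry"},
]

print(find_label_collisions(account_fields))
# {'full name': ['Field_X__c', 'Field_X_V2__c', 'Field_X_V3__c']}
```

A report like this doesn’t tell you which field is the truth — that still takes a human with context — but it tells you exactly where the archaeology needs to happen.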

The methodology of the overburden

Okay, you might have noticed me leaning a bit into archaeology terms here. I should admit, I’ve always had a soft spot for Indiana Jones, and there was definitely a time I imagined myself as a globetrotting archaeologist. That’s the loose inspiration behind this theme. In archaeology, “overburden” is the valueless soil and rock that must be removed to reach the site of value. In enterprise systems, this is ROT data (redundant, obsolete, or trivial).

Most legacy organizations are layered systems. What we describe as “complexity” is often just sediment: years of accumulated configuration shaped by constraints that no longer exist. If you don’t clear the overburden first, you aren’t “powering your AI,” you are just automating the dirt.

From storage fees to “cognitive tax”

For over 30 years, we’ve been incentivized to keep everything. In the 1990s, we deleted records because storage was expensive (and it probably helped keep things moving quickly by accident). Today, storage is cheap, but the cost hasn’t disappeared; it has simply shifted from dollars per gigabyte to noise per insight.

We have traded storage fees for a “cognitive tax” on our AI. Every obsolete field and ghost record becomes a financial drain on an agent’s ability to reason effectively. You are actually paying compute costs for your AI to “think” about data that should have been deleted a decade ago. (Editor’s note: I may also be using this analogy because my wife wants me to get rid of stuff in the attic, but the logic holds. I’m certain Indiana Jones has a huge, albeit highly organized, attic… err, museum.)

The “project mindset” hangover

Why do we have so much clutter? Because for decades, we’ve operated with a project mindset rather than a product mindset.

In a project mindset, the goal is completion. We build “Field X v2” to hit a deadline, we check the box, and the project team moves on. We focus on the “inside-out” perspective: what we need to do to finish the task today.

But the agentic world requires a product mindset. We have to stop viewing data as a checklist to be completed and start viewing it as a product to be consumed. In a product mindset, the focus is “outside-in” – how is the agent actually going to use this? It’s iterative, continuous, and focused on the quality of the “fuel,” not just the existence of the “tank.”

The three paths forward

So you’ve accepted the diagnosis. Your data has sediment, your AI will amplify it, and something has to change. The good news is that you have options.

Path one: clean the org
My late friend and finance VP “Frank” had a rule I’ve never forgotten: money, time, and resources. Pick any two, but never all three. I’ve seen exceptions with my own eyes — it can happen. So if you’re one of the lucky ones with all three right now, stop reading and do this.

Go clean the org. Audit the fields. Deprecate the legacy objects. Deduplicate the contacts. Rewrite the validation rules that everyone has been working around for six years. Test every assumption you thought you knew about your data against cold hard facts.
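To make “deduplicate the contacts” concrete, here is a minimal sketch of one deduplication pass: group contact records on a normalized email key and keep the most recently modified record in each group. The record shape and field names are illustrative, not a real schema, and blank keys are deliberately left for a human to review.

```python
# Hedged sketch of a survivor-selection pass for contact deduplication.
# Records are plain dicts standing in for CRM rows.
def normalize_email(email):
    """Lowercase and trim so 'Jane.Doe@X.com ' and 'jane.doe@x.com' match."""
    return (email or "").strip().lower()

def dedupe_contacts(contacts):
    """Return one survivor per email key, preferring the latest modified date."""
    survivors = {}
    for c in contacts:
        key = normalize_email(c["Email"])
        if not key:
            continue  # blank emails need a human decision, not a script
        if key not in survivors or c["LastModifiedDate"] > survivors[key]["LastModifiedDate"]:
            survivors[key] = c
    return list(survivors.values())

contacts = [  # illustrative data only
    {"Id": "003A", "Email": "Jane.Doe@example.com ", "LastModifiedDate": "2021-04-01"},
    {"Id": "003B", "Email": "jane.doe@example.com", "LastModifiedDate": "2024-11-15"},
    {"Id": "003C", "Email": "frank@example.com", "LastModifiedDate": "2019-06-30"},
]

print([c["Id"] for c in dedupe_contacts(contacts)])
# ['003B', '003C']
```

The real work, of course, is the matching rule itself — email alone is rarely enough — but the pattern of normalize, group, and pick a survivor holds at any scale.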

It is unglamorous. It can be slow and feel like a slog. Importantly, it may surface uncomfortable conversations about process and ownership that people have been avoiding for years. Do it anyway.

The payoff is not just AI readiness. A clean org means reporting you can trust implicitly, without a pre-meeting argument about which number is right. It means faster rep onboarding because new hires aren’t inheriting someone else’s workarounds. It means field labels that actually correspond to their API names and not just a hack because someone needed to shoehorn something in under pressure. It means compliance and audit cycles that don’t induce panic. And critically — every agent, automation, and insight you build from that point forward inherits the benefit. You stop paying the cognitive tax on every new initiative, forever.

Look, I’m a data nerd. This is my version of a clean house. But more than aesthetics, it’s about being able to stand in front of an executive and defend every number without flinching. You don’t need Data Cloud to make your agents work well. You don’t need a DMO (Data Model Object) layer or a two-track architecture. Clean data queried directly is clean data. Full stop.

Path two: clean the org and add Data Cloud
You’ve done the hard work of Path One, and now you want more. Maybe you’re pulling data from multiple sources beyond Salesforce. Maybe you have some very large attics of data that you want to pull relevant information from to enhance profiles. Maybe you need cross-cloud harmonization, segmentation, or activation at scale. Maybe your AI use cases have grown complex enough that a dedicated grounding layer genuinely earns its keep.

This is the premium option, and it’s genuinely powerful when the complexity justifies it. The distinction worth making here is that Data Cloud works best as an amplifier of clean data, not a substitute for it. If you’re considering this path, Path One is still the foundation; do the cleanup first if at all possible.

Path three: DMO-first — the pragmatic workaround

As Frank would caution me (and I can hear him today, as authoritative as ever), most organizations don’t have all three — mandate, budget, and patience — at once. The business won’t pause while you renovate the foundation. Leadership wants AI results in quarters; that is just reality.

This is where the DMO-first approach earns its place. Rather than cleansing 20 years of accumulated sediment before you can move, you become a data surgeon instead of a data janitor. You curate the truth in the Data Cloud layer — selectively, surgically — covering only what the agent needs to function. The rest of the attic stays as cluttered as the business requires, for now.

Go in clear-eyed, though. This path comes with trade-offs. Every DMO mapping is a transformation you own, maintain, and debug. When source objects change (and they will), someone has to update the mapping. You are not eliminating the data problem; you are containing it. That is a legitimate and often necessary choice. Just don’t confuse it with Path One.

The two-track architecture

To achieve path three fully, we advocate for a two-track retrieval architecture:
• The agentic data store (DMOs): We use Salesforce Data Cloud to create Data Model Objects. This is the “cleaned” site. We perform the archaeology during the mapping phase, ensuring the agent is grounded only in high-fidelity, analyzed versions of the truth. This helps us eliminate the cognitive tax.
• Transactional verification (actions): When the agent needs to perform a task (like checking a live status) it uses direct actions (Flow or Apex) to talk to the live CRM record. This bridges the gap between the “historical truth” and the “active record.”
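The routing idea behind the two tracks can be sketched in a few lines. Both backends are stubbed with dicts here; in a real build the first would be a Data Cloud query against curated DMOs and the second a Flow or Apex action against the live record. All names and the data are illustrative assumptions, not a real API.

```python
# Hedged sketch of two-track retrieval: grounding questions read the
# curated (DMO) layer; live checks hit the transactional record.
DMO_STORE = {  # curated, "cleaned" historical truth (stubbed)
    "acct-001": {"segment": "Enterprise", "lifetime_value": 1_200_000},
}

CRM_LIVE = {  # live transactional record (stubbed)
    "acct-001": {"open_case_count": 3, "status": "Active"},
}

def answer(account_id, question_type):
    """Route grounding reads to the DMO layer and live checks to the CRM."""
    if question_type == "grounding":
        return DMO_STORE.get(account_id, {})
    if question_type == "live_status":
        return CRM_LIVE.get(account_id, {})
    raise ValueError(f"unknown question type: {question_type}")

print(answer("acct-001", "grounding")["segment"])    # Enterprise
print(answer("acct-001", "live_status")["status"])   # Active
```

The design point is the separation itself: the agent never reasons over raw sediment, and the curated layer never pretends to be the live system of record.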

The bottom line

I alluded to it earlier, but it’s worth stating plainly: data is the fuel. And if you take away just one thought, let it be this:

Not all fuel is the same. You can’t pour diesel into a Lamborghini and expect supercar performance. It might sputter forward for a moment, but the damage is inevitable and inescapable. AI is no different.

AI maturity is a data problem. It is not a model problem. Large language models and agents are high-definition amplifiers. Whatever you feed them — clean signal or legacy sediment — gets magnified. If you point them at a foundation of overburden, they won’t fix the mess. Unlike a human who pauses and interprets conflicting signals, the model will synthesize a conclusion without context, automating that error across your entire ecosystem at a scale and speed no human could ever catch.

In the agentic era, the competitive advantage goes to the organizations that can look at their past and finally decide what to delete. We don’t need to fix the past to power the future; we just need to stop the past from lying to our agents. We clear the dirt to find the gold.

Technical note: The architectural patterns described here — specifically the two-track retrieval and DMO-first approach — align with emerging hybrid RAG patterns and Salesforce’s Data Cloud for AI methodology. “ROT data” is a standard term in the information governance industry (AIIM).
