What is the Dirty Data Tax?

The Dirty Data Tax is the compounding cost of bad data — mislabeled, stale, duplicated, or unreconciled records — that gets silently priced into every decision an organization makes on top of it. The worse the data, the higher the tax.

Why does the Dirty Data Tax matter more with AI?

AI removes the analyst-time bottleneck, so data quality becomes the binding constraint. An AI system trained on, or simply fed, dirty data makes confident mistakes faster and at greater scale, presenting bad answers behind a clean interface.

How do you pay down the Dirty Data Tax?

You pay it down by standardizing definitions, reconciling records to a source of truth, deduplicating, and versioning assumptions before layering analytics or AI on top of the data.
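As a rough illustration of the standardizing and deduplicating steps, here is a minimal Python sketch. Everything in it is an assumption made for the example: the field names (address, sale_date, price), the tiny abbreviation map, and the sample comps are all made up, and real address matching usually needs far more than this.

```python
import re

# Hypothetical abbreviation map; a real one would be much longer.
ABBREVIATIONS = {"street": "st", "avenue": "ave", "suite": "ste", "north": "n"}


def normalize_address(raw: str) -> str:
    """Lowercase, strip punctuation, and collapse common abbreviations."""
    tokens = re.sub(r"[^\w\s]", " ", raw.lower()).split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)


def dedupe_comps(comps: list[dict]) -> list[dict]:
    """Keep the first record for each (normalized address, sale date, price)."""
    seen = set()
    unique = []
    for comp in comps:
        key = (normalize_address(comp["address"]), comp["sale_date"], comp["price"])
        if key not in seen:
            seen.add(key)
            unique.append(comp)
    return unique


# Made-up sample data: the first two rows are the same sale entered two ways.
comps = [
    {"address": "12 North Main Street", "sale_date": "2024-03-01", "price": 1_250_000},
    {"address": "12 N. Main St", "sale_date": "2024-03-01", "price": 1_250_000},
    {"address": "40 Oak Avenue", "sale_date": "2024-02-15", "price": 980_000},
]
unique_comps = dedupe_comps(comps)
print(len(comps), "->", len(unique_comps))
```

The design choice worth noting is that the duplicate key is built from the standardized address, not the raw string, so standardizing definitions comes before deduplicating.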


Dirty Data Tax

Coined by Ross Blankenship

Illustration: messy data filtered through a funnel with a cost coin

Definition

The Dirty Data Tax is the compounding cost of bad data — mislabeled, stale, duplicated, or unreconciled records — that is silently priced into every decision built on top of it. You never see a line item for it, but you pay it in wrong calls, missed opportunities, and the slow erosion of trust in your own numbers.

Origin

I named this in Dirt, Data and Decisions to capture why so many real estate and operating businesses fail to get value from analytics and AI. They invest in dashboards and models while the underlying data is a swamp — rent rolls that do not reconcile to the bank, comps that are mislabeled, the same property entered three different ways. The tools then produce precise-looking answers built on rotten inputs.

It is a tax because it is unavoidable until you fix the source, it scales with how much you rely on the data, and it is regressive in a specific way: the more advanced your tooling, the faster dirty data converts into confident errors.

Worked example

A firm deploys an AI underwriting assistant to speed up deals. The model is excellent. But the rent roll it reads counts three concession-burdened units as full-rent, an expense category is double-booked, and two comps are duplicates of the same sale entered under different addresses. The AI does exactly what it is told and returns a clean, confident value that is roughly 8% too high.

Nobody re-checks it, because the output looks authoritative. The firm overpays, and the loss never gets attributed to data quality; it gets attributed to “a bad deal.” That misattributed loss is the Dirty Data Tax. The fix was never a better model — it was reconciling the inputs first.
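To make "reconciling the inputs first" concrete, here is a minimal Python sketch of two checks that could run before the rent roll reaches a model. It is only an illustration under stated assumptions: the Unit fields, the tolerances, and the sample figures are hypothetical and are not the numbers behind the example above.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    unit_id: str
    contract_rent: float   # face rent on the lease (hypothetical field)
    concession: float      # monthly concession, 0 if none (hypothetical field)
    rent_roll_rent: float  # the figure the model would actually read


def overstated_units(units, tolerance=0.01):
    """Return (unit_id, overstatement) where the rent roll exceeds rent net of concessions."""
    flagged = []
    for u in units:
        effective = u.contract_rent - u.concession
        gap = u.rent_roll_rent - effective
        if gap > tolerance:
            flagged.append((u.unit_id, gap))
    return flagged


def reconciles_to_bank(units, bank_deposits, tolerance_pct=0.01):
    """True if the rent-roll total is within tolerance_pct of bank deposits."""
    roll_total = sum(u.rent_roll_rent for u in units)
    return abs(roll_total - bank_deposits) <= tolerance_pct * bank_deposits


# Made-up sample data: unit 102 carries a concession the rent roll ignores.
units = [
    Unit("101", 2000.0, 0.0, 2000.0),
    Unit("102", 2000.0, 250.0, 2000.0),
    Unit("103", 1800.0, 0.0, 1800.0),
]
flagged = overstated_units(units)
ok = reconciles_to_bank(units, bank_deposits=5550.0)
print(flagged, ok)
```

If either check fails, the deal would stop for a human review instead of flowing into the underwriting model.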
