Services Process Blog Demo

Get in touch

hello@sovont.com
Back to blog
· Sovont · 3 min read

The Join Key That Changed Halfway Through

Source systems quietly change their primary keys and your pipelines keep running — producing wrong answers instead of errors. That's the worst kind of failure.

Data Engineering

The pipeline didn’t break. That’s the problem.

At some point, the upstream team migrated from integer IDs to UUIDs. They sent an email. It was forwarded to someone. The ticket was closed. The pipeline kept running, because the column name didn’t change — just the values in it.

Now your join is a cartesian product dressed up as a lookup. And it’s been that way for six weeks.


Why this is so dangerous.

A broken pipeline announces itself. You get a null pointer, a schema mismatch, a row count of zero, an alert at 3 AM. You fix it.

A silently incorrect pipeline is invisible. Rows still flow. Row counts look normal. Dashboards update. No alert fires because nothing technically failed. The wrong data just quietly becomes the truth your organization makes decisions from.

Join key changes are almost always in this second category. The column exists on both sides. The type might even coerce silently. You get matches — just the wrong ones, or too few, or too many — and you’d need to actually inspect the data to know.


How join keys change without you noticing.

It’s rarely a big bang migration. More often it’s incremental: new records start using UUIDs while old ones keep integer IDs. Or the upstream team starts populating a new column and stops populating the old one. Or a SaaS tool switches identity providers and the user_id in your warehouse no longer matches the user_id in your CRM.

In each case, the column is there. The join compiles. The data is wrong.


What you can do about it.

Assert referential integrity, don’t assume it. After every join that matters, check your match rate. If you expected 98% of rows to join and you’re seeing 71%, something changed upstream. This is a five-line dbt test. Write it.

Track cardinality and uniqueness on your join keys. A primary key that was unique last month should still be unique this month. If it isn’t, something changed. Alert on it before it reaches a model that depends on it.

Build a source change log. When upstream systems change — schema, keys, anything — that change should be documented and the downstream owner should be notified. This is a process problem as much as a technical one. If you don’t know when your sources change, you’re flying blind.

Test your joins for fanout. An unexpected many-to-one or many-to-many result is a red flag. Most pipelines don’t check for this. Add a row count assertion before and after the join. If the output has more rows than the left table, you have a fanout problem. Find out why.


The join is not a formality.

A join is a claim: these two rows describe the same thing. That claim needs to be validated, not just assumed. Join keys drift. Systems evolve. People migrate databases without coordinating with the data team.

The pipeline runs either way. The question is whether it’s running correctly.


If your most important joins have never failed, it might be because you’re not checking them.