
DevOps for data: why your data isn't versioned

Your dbt code is in Git, but your data isn't. Here's what DevOps for data really means, and the versioning gap most teams miss.
Author
Kristy Broekmans
Data Engineer

A pipeline broke in production. Wrong numbers landed in the dashboard. Someone is asking what the rollback plan is.

For most data teams, the honest answer is: there isn't one.

Software engineers solved this decades ago. They version their code, branch their environments, run automated tests before anything reaches production, and deploy with confidence because they can reverse course. Data teams have adopted some of these habits - dbt models in Git, pull requests, CI jobs - but the actual data, the environments, the deployment process? Still managed by hand, by convention, and by hope.

This post is about closing that gap. Not by adopting every DevOps tool that exists, but by applying the same mindset to the data stack that engineering teams have applied to software for years - and showing what that looks like with Snowflake and dbt.

What DevOps for data actually means

DevOps, stripped of its buzzword layer, is about shortening the feedback loop between writing something and knowing if it works in production. Catch problems early, in isolated environments, before they reach the people who depend on your outputs.

For a data team, that translates to a few concrete things: 

  • Isolated development environments so engineers can work without touching production data.
  • Automated testing before any change is deployed.
  • Reproducible deployments so the team knows exactly what changed and when.
  • The ability to roll back when something goes wrong.

None of this is radical. All of it is harder than it sounds in a data context.

Why Git alone is not enough

Most data teams use Git. That is good. But there is a common misconception that versioning code is the same as versioning the data stack. It is not.

Git is designed for text files that change incrementally. It handles merge conflicts at the line level. It has no concept of a database schema, a table relationship, or a row-level change.

When data engineers store their dbt models in Git, they are versioning their transformation logic. That is valuable. But Git knows nothing about the actual data sitting in Snowflake, the state of the tables your dashboards read from. A model can pass code review on its Git diff and still silently drop 30% of rows because of a join condition nobody caught.
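A silent row drop like this is the kind of failure a simple row-count guard can surface. What follows is a minimal sketch in Python, using the standard-library sqlite3 module as a stand-in for a warehouse. The table names (orders, customers, orders_enriched) and the 95% retention threshold are hypothetical, chosen only for illustration.

```python
import sqlite3

def row_retention(conn, source_table, model_table):
    """Return the fraction of source rows that survive into the model."""
    # Table names are interpolated directly; pass trusted identifiers only.
    source_rows = conn.execute(f"SELECT COUNT(*) FROM {source_table}").fetchone()[0]
    model_rows = conn.execute(f"SELECT COUNT(*) FROM {model_table}").fetchone()[0]
    return model_rows / source_rows if source_rows else 1.0

def check_retention(conn, source_table, model_table, min_retention=0.95):
    """Raise if the model kept fewer rows than the (hypothetical) threshold allows."""
    retention = row_retention(conn, source_table, model_table)
    if retention < min_retention:
        raise ValueError(
            f"{model_table} kept only {retention:.0%} of rows from {source_table}"
        )
    return retention

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
    CREATE TABLE customers (customer_id INTEGER);
    INSERT INTO orders VALUES (1, 10), (2, 10), (3, 11), (4, 12), (5, 12),
                              (6, 13), (7, 13), (8, NULL), (9, NULL), (10, 99);
    INSERT INTO customers VALUES (10), (11), (12), (13);
    -- An inner join silently drops orders with no matching customer.
    CREATE TABLE orders_enriched AS
        SELECT o.order_id, c.customer_id
        FROM orders o JOIN customers c ON o.customer_id = c.customer_id;
""")

try:
    check_retention(conn, "orders", "orders_enriched")
except ValueError as error:
    print(f"blocked: {error}")
```

The diff for that join would look perfectly reasonable in a pull request. Only a check that looks at the resulting rows can tell that three of the ten orders never made it through.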

There are two distinct versioning problems at play here, and most teams only solve the first.

Pipeline versioning covers the code, the logic, the transformations. This is where Git helps and where dbt CI/CD plugs in. Models live in repositories. Pull requests enforce review. CI runs tests before merge.

This part is solved.

Data state versioning is the hard part. This is the actual rows and columns in Snowflake after dbt run completes. When a model produces wrong output, reverting the dbt commit does not undo the damage. The bad data is already materialised. Downstream dashboards are already serving it. The Slack messages have already started.

Solving only the first problem and calling it done is like a software team versioning their application code but deploying directly to production with no staging environment and no way to revert a database migration.

The real cost of skipping it

The consequences of operating without proper DevOps practices on the data side are usually invisible, until they are very visible.

Fear of deploying. Engineers who work without isolated environments and automated tests learn quickly that changes are risky. The rational response is to batch changes together and deploy less frequently. Less frequent deployment means bigger changes, which means higher risk. It is a cycle that slows everything down, and it is the exact opposite of what high-performing teams do.

Debugging in production. Without environment isolation, the only way to test a complex transformation is to run it against real data. That often means running it in production, or against a stale development copy that does not reflect current reality. Neither gives confidence. A pipeline that works on a 1,000-row dev sample might fall over at ten million rows because of a skew in the join key or a type mismatch the sample did not contain.

No accountability when things go wrong. When a pipeline fails and corrupts data, the investigation becomes archaeology. Teams piece together what changed, when, and why, across Git history, Slack messages, and tribal knowledge. A proper deployment process creates an audit trail by default.

Analyst distrust. Numbers that were right last week are wrong this week, and nobody is sure why. This erodes confidence in the data team faster than almost anything else. Trust, once lost, takes a long time to rebuild. Automated testing is the cheapest insurance against it, and it pairs naturally with a broader approach to improving data quality.

The mindset shift before the tooling

Before any tool or process can help, data teams need to accept one uncomfortable truth: the data stack is a software product, and it should be treated like one.

Changes should flow through an approval process. A dbt model that goes directly from a developer's laptop to production, without review, without tests, without a staging run, is a liability. The same engineer writing application code would never merge to main without a pull request.

Environments are not optional. "We only have one Snowflake database" is a cost concern, and it is a legitimate one. But running development work in production is far more expensive when something goes wrong. The cost of environment isolation is predictable. The cost of a data incident is not.

Automation is not a nice-to-have. Manual tests get skipped under pressure. Automated checks do not. The discipline that seems unnecessary on a calm Tuesday is the thing that saves the team on a chaotic Friday afternoon before a board presentation.

Tests are a form of documentation. A dbt test that asserts a column is never null is also a statement about what the business expects to be true. It communicates intent. It catches regressions. It makes onboarding faster, because new engineers can understand what invariants the models are supposed to maintain.
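To make the "test as documentation" idea concrete, here is a minimal sketch of what a not-null check amounts to: a query that counts the offending rows, and a test that passes only when there are none. It is written in Python against an in-memory SQLite database rather than in dbt itself, and the function name and the invoices table are hypothetical.

```python
import sqlite3

def null_violations(conn, table, column):
    """Count rows where `column` is NULL; zero means the invariant holds."""
    # Identifiers are interpolated directly; pass trusted names only.
    query = f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    return conn.execute(query).fetchone()[0]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE invoices (invoice_id INTEGER, amount REAL);
    INSERT INTO invoices VALUES (1, 120.0), (2, NULL), (3, 80.5);
""")

# The business expectation, stated as executable checks.
failures = {
    "invoice_id": null_violations(conn, "invoices", "invoice_id"),
    "amount": null_violations(conn, "invoices", "amount"),
}
for column, count in failures.items():
    status = "ok" if count == 0 else f"{count} null row(s)"
    print(f"invoices.{column}: {status}")
```

A new engineer reading these two checks learns that every invoice is expected to have an identifier and an amount, without having to ask anyone.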

What good looks like with dbt and Snowflake

A data team operating with proper DevOps practices looks something like this.

An engineer picks up a task. They create a feature branch in Git and a corresponding isolated environment in Snowflake: a zero-copy clone of production that costs almost nothing to create and gets discarded when they are done. They build and test their changes against real data, without touching production. When they are ready, they open a pull request. An automated CI pipeline runs dbt tests against their branch. A colleague reviews the logic.
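The branch-per-environment step can be sketched as a small helper that derives a Snowflake database name from a Git branch and emits the statements to create and discard a clone. This is a sketch under assumptions: the ANALYTICS production database, the DEV_ prefix, and the naming rules are hypothetical, and the statements are only generated here, not executed against Snowflake.

```python
import re

PROD_DATABASE = "ANALYTICS"  # hypothetical production database name

def clone_name(branch):
    """Turn a Git branch name into a safe, unquoted database identifier."""
    slug = re.sub(r"[^A-Za-z0-9]+", "_", branch).strip("_").upper()
    if not slug:
        raise ValueError(f"cannot derive a database name from {branch!r}")
    return f"DEV_{slug}"  # hypothetical prefix for development clones

def create_clone_sql(branch):
    """Statement that creates a clone of production for the branch."""
    return f"CREATE DATABASE {clone_name(branch)} CLONE {PROD_DATABASE};"

def drop_clone_sql(branch):
    """Statement that discards the branch environment once the work is done."""
    return f"DROP DATABASE IF EXISTS {clone_name(branch)};"

print(create_clone_sql("feature/fix-revenue-join"))
print(drop_clone_sql("feature/fix-revenue-join"))
```

Deriving the environment name from the branch keeps the two tied together, so a CI job could create the clone when a branch is opened and drop it when the branch is merged or deleted.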

The change merges, and an automated deployment pushes it to production. If something goes wrong, there is a known-good state to return to.
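Returning to a known-good state can be illustrated with a snapshot taken just before the deployment. The sketch below uses Python and an in-memory SQLite database, where the snapshot is a plain table copy. In Snowflake the same role would more likely be played by a clone or by Time Travel. The revenue table, the faulty deployment, and the validation rule are all hypothetical.

```python
import sqlite3

def deploy_with_rollback(conn, table, deploy, validate):
    """Snapshot `table`, run `deploy`, and restore the snapshot if `validate` fails."""
    snapshot = f"{table}_snapshot"
    conn.execute(f"DROP TABLE IF EXISTS {snapshot}")
    conn.execute(f"CREATE TABLE {snapshot} AS SELECT * FROM {table}")
    deploy(conn)
    if validate(conn):
        return "deployed"
    # Validation failed: put the known-good rows back.
    conn.execute(f"DELETE FROM {table}")
    conn.execute(f"INSERT INTO {table} SELECT * FROM {snapshot}")
    return "rolled back"

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE revenue (day TEXT, amount REAL);
    INSERT INTO revenue VALUES ('2024-01-01', 100.0), ('2024-01-02', 150.0);
""")

def bad_deploy(conn):
    # Hypothetical faulty change: wipes out the amounts.
    conn.execute("UPDATE revenue SET amount = NULL")

def no_null_amounts(conn):
    # Hypothetical validation rule: every row must have an amount.
    query = "SELECT COUNT(*) FROM revenue WHERE amount IS NULL"
    return conn.execute(query).fetchone()[0] == 0

result = deploy_with_rollback(conn, "revenue", bad_deploy, no_null_amounts)
total = conn.execute("SELECT SUM(amount) FROM revenue").fetchone()[0]
print(result, total)
```

The point of the sketch is the ordering: the snapshot exists before the change runs, so the rollback does not depend on anyone reconstructing what the table used to contain.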

This is not a fantasy. Teams run this way today. Flibco built exactly this kind of modern data stack on Snowflake and dbt, turning manual transformation work into a repeatable, DevOps-ready process. The tooling exists. What is usually missing is the decision to prioritise it.
