Git Workflows for Research Teams Without a DevOps Person
2025-08-19
Most Git workflow advice is written for software teams: continuous integration, mandatory code review, a dedicated person who thinks about deployment. Research teams usually have none of that. Three to eight people, a shared repository that grew without a plan, and a recurring low-level anxiety that something important is going to get overwritten. The advice written for tech companies doesn't translate cleanly.
Research teams don't need sophisticated workflows. They need a small number of rules that everyone on the team actually follows. The complexity that tools like Gitflow were designed to handle — multiple release branches, hotfix tracks, release candidates — exists to solve problems that research teams don't have. You're not shipping to customers. You're producing analyses and papers. That's a simpler problem and it deserves a simpler solution.
The structure that works: keep `main` (or `master`, if that's what you inherited) stable and runnable. Do exploratory work on branches. Merge when you're confident the code does what you think it does. Atlassian's comparison of Git workflows covers the full spectrum if you want to understand your options, but for most research groups, a basic feature branch model without any particular ceremony works fine. Pull requests make sense if the team has a culture of reading each other's code. Most research groups don't, and the overhead isn't worth forcing it.
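The whole cycle fits in a handful of commands. A sketch you can run in a throwaway directory (requires Git 2.23+ for `switch` and 2.28+ for `init -b`); the repo name, branch name, and commit messages are made up for illustration:

```shell
# Create a disposable repo so this is safe to experiment with.
cd "$(mktemp -d)"
git init -q -b main demo && cd demo
git config user.name "Demo" && git config user.email "demo@example.com"
git commit -q --allow-empty -m "Initial commit"

# Exploratory work happens on a branch, not on main.
git switch -q -c analysis-mixed-model
git commit -q --allow-empty -m "Fit mixed model with random intercept per site"

# Merge back only once you trust the result; --no-ff keeps a
# visible merge commit marking when the work landed.
git switch -q main
git merge -q --no-ff analysis-mixed-model -m "Merge mixed-model analysis"
git log --oneline
```

The `--no-ff` flag is a preference, not a requirement: it makes the branch's existence visible in the history, which helps when you later want to know what arrived as a unit.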
One thing research teams need to get comfortable with: most branches are dead ends. You try a new analysis approach, it doesn't support the hypothesis, you move on. That's fine — it's documented experimentation, not wasted work. The mistake is treating every branch like something that eventually has to merge to main. Some branches exist to record 'what I tried in October that didn't work,' and that's a legitimate use. Just name them so that three months later nobody wonders whether `rfactor-attempt-v2` is still relevant or can be deleted.
Tags are underused and genuinely valuable. Tag commits at meaningful checkpoints: when you submit a paper, when you run the analysis for a grant report, before a major refactor. `git tag submission-jasa-2025-03` takes five seconds and makes it trivial to return to exactly where you were. A lot of teams figure out the value of tagging after the first time they can't identify which commit generated the figures in a published paper and have to spend two days reconstructing it.
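Annotated tags (`-a`) are worth the extra flag over lightweight ones: they record who tagged, when, and a message. A runnable sketch in a throwaway repo, with made-up names:

```shell
# Disposable repo for demonstration.
cd "$(mktemp -d)" && git init -q -b main demo && cd demo
git config user.name "Demo" && git config user.email "demo@example.com"
git commit -q --allow-empty -m "Final analysis for JASA submission"

# Annotated tag at the checkpoint.
git tag -a submission-jasa-2025-03 -m "State of the code at JASA submission"

# Find checkpoints later by pattern.
git tag --list 'submission-*'

# To return to exactly that state (read-only, detached HEAD):
#   git switch --detach submission-jasa-2025-03
```

One caveat worth knowing: tags aren't pushed by default, so share them explicitly with `git push origin submission-jasa-2025-03` (or `git push --tags`).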
The hardest part of Git for research teams is usually not branching at all — it's large files. Datasets, trained model weights, high-resolution images, raw instrument output. Git handles binary files badly by default. A repository that accumulates them gets slow to clone, slow to push, and eventually runs into hosting platform limits. The two main options are git-annex and Git LFS. git-annex stores file contents outside the repository and tracks only metadata and checksums in Git, so a repository can nominally contain hundreds of gigabytes while only downloading what you actually need on a given machine. Git LFS is simpler to set up but requires a central server and can get expensive as storage grows.
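For reference, Git LFS is configured through `.gitattributes`: running `git lfs track "*.h5"` appends a rule of the form below, which tells Git to route matching files through the LFS filter instead of storing them directly. The `*.h5` and `*.ckpt` patterns here are just examples; substitute whatever large formats your group produces:

```
*.h5 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
```

The `.gitattributes` file itself is small and should be committed, so everyone who clones the repository gets the same tracking rules.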
For teams working with large datasets spread across multiple machines — a laptop, a university cluster, an external drive — git-annex fits the decentralized reality of academic computing better. Git LFS assumes you have a reliable central server that everyone can reach, which is fine until someone needs to work somewhere that server isn't reachable. The setup cost for git-annex is real, but it's a one-time investment that tends to pay back.
Commit messages are where research teams consistently cut corners. 'Updated analysis.' 'Fixed bug.' 'Changes.' The team is small, everyone knows what they've been working on, it feels redundant to write it down. Six months later, nobody remembers why they changed a threshold from 0.05 to 0.03, or what was wrong with the version before the 'fixed bug' commit. The commit message is the only place where that context can reliably travel with the code — to collaborators who weren't there, to your future self, to anyone trying to understand why the code looks the way it does. Two sentences: what you did and why you did it. That's enough.
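In practice that means a short subject line plus a body sentence, which Git supports directly via repeated `-m` flags. A runnable sketch in a throwaway repo; the threshold values and the reason given are invented for illustration:

```shell
# Disposable repo for demonstration.
cd "$(mktemp -d)" && git init -q -b main demo && cd demo
git config user.name "Demo" && git config user.email "demo@example.com"

# First -m is the subject (what changed); second -m is the body (why).
git commit -q --allow-empty \
  -m "Lower outlier threshold from 0.05 to 0.03" \
  -m "The 0.05 cutoff kept two sessions with known sensor drift; 0.03 excludes them without touching clean data."

# Show the full message as future readers will see it.
git log -1 --format=%B
```

Six months later, `git log` answers the "why 0.03?" question without anyone having to remember.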
The Turing Way's section on version control for data covers the large file problem and more, and it's written for researchers rather than engineers. Worth reading before making decisions about tooling if you're setting up version control for a group for the first time.