How do I merge a small part of a Git repo into another repo?

At my new day job, one of the things we want to do is migrate a portion of a really large Git repository (over 20GB), which I’ll call LargeRepo, into a much smaller repository (< 2GB), which I’ll call SmallRepo. When we commit new files to the larger one, the build process takes a very long time and affects productivity. Builds for commits to the smaller repository, by comparison, can run in under ten minutes.

A plan was developed whereby a small section of LargeRepo that we work with will be moved over to SmallRepo so that we can leverage the build times there instead. The plan does not include moving the history from LargeRepo, which is a good idea because there’s about 19.5GB of stuff in there we don’t want. Bringing all the history over would bloat SmallRepo and make it larger than LargeRepo.

After chatting with my teammates though, we agreed that having at least two or three years of history for our small section of LargeRepo (which I’ll call SubsetRepo) would be valuable to avoid having to go back to the large repository all the time to see how files had changed.

So, I did an experiment which I’d like to share here. It’s not a fully working example as I do hand-wave over some steps, but I wanted to write it down in the hope that someone else might find it useful.

Note: there is assumed knowledge about Git and distributed source control in this post. If you would like to read a primer on Git, you can start at Kendra Little’s blog.

The goal is to:

  • Make a copy of LargeRepo that only goes back three years, using a shallow git clone
  • Remove any content and history that isn’t the part we work with, using git-filter-repo
  • Merge this SubsetRepo into SmallRepo, making it only a little bigger, using git merge

No problem at all, right? Except Git prevents us from mixing unrelated histories because it can break things badly with merge conflicts galore. We must create a brand-new repo and merge in both SmallRepo and SubsetRepo as distinct sources separately. To do that, we use the --allow-unrelated-histories flag on the git merge command.

It is worth mentioning that we will not get merge conflicts in this process because the directory structure of the two repositories is different.
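
To see the refusal and the flag in action without touching a real repository, here is a minimal sketch using two throwaway repos in a temp folder. The make_repo helper, the folder names a and b, and the file names are all made up for illustration.

```shell
demo=$(mktemp -d)
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical helper: create a repo at $1 with a single commit adding file $2.
make_repo() {
  git init -q "$1"
  git -C "$1" symbolic-ref HEAD refs/heads/main
  mkdir -p "$1/$(dirname "$2")"
  echo "content" > "$1/$2"
  git -C "$1" add .
  git -C "$1" commit -q -m "Add $2"
}

make_repo "$demo/a" "app/readme.txt"
make_repo "$demo/b" "lib/notes.txt"

# Fetch b's main branch into a; the fetched tip is recorded as FETCH_HEAD.
git -C "$demo/a" fetch -q "$demo/b" main

# Without the flag, Git refuses because the two histories share no commit.
if git -C "$demo/a" merge -q --no-edit FETCH_HEAD 2>/dev/null; then
  plain_merge=accepted
else
  plain_merge=refused
fi

# With the flag the merge goes through, and the different folder
# structures mean there is nothing to conflict.
git -C "$demo/a" merge -q --no-edit --allow-unrelated-histories FETCH_HEAD
echo "plain merge was $plain_merge"
git -C "$demo/a" ls-files
```

The temp folder in $demo can be deleted afterwards.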

Step 1: Create an empty Git repository

This one’s easy. Create a new folder, run git init on it, and we’re ready to start. Let’s call this one CleanRepo. If Git’s defaults haven’t been changed, we should also make sure our main branch is named correctly with git branch -m master main.
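
As a sketch of this step, the following creates CleanRepo inside a temp folder (an assumption so it can be run anywhere). Instead of renaming the branch, it points HEAD at main directly, which has the same effect on a repository with no commits and does not depend on which Git version is installed.

```shell
demo=$(mktemp -d)
git init -q "$demo/CleanRepo"

# No commits exist yet, so we just tell HEAD which branch name to use.
git -C "$demo/CleanRepo" symbolic-ref HEAD refs/heads/main

branch=$(git -C "$demo/CleanRepo" symbolic-ref --short HEAD)
echo "$branch"
```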

Step 2: Merge SmallRepo into CleanRepo

We’re first going to add the SmallRepo remote server as a source for CleanRepo, fetch from it, then merge that in with the --allow-unrelated-histories flag. We don’t want to merge in the larger content until it’s just got the pieces we want, but we must start somewhere. When we’re finished with this step we remove the remote server, so we don’t accidentally push back our unfinished changes.

  • git remote add origin <SmallRepo URL>
  • git fetch origin
  • git merge --allow-unrelated-histories origin/main
  • git remote rm origin
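
Here is a rough, self-contained version of this step. A local folder stands in for the SmallRepo remote server (that path, and the app.txt file, are assumptions for the demo). Because CleanRepo has no commits yet, the merge simply lands it on the same commit as SmallRepo.

```shell
demo=$(mktemp -d)
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# Stand-in for the SmallRepo remote: a local repo with one commit.
git init -q "$demo/SmallRepo"
git -C "$demo/SmallRepo" symbolic-ref HEAD refs/heads/main
echo "hello" > "$demo/SmallRepo/app.txt"
git -C "$demo/SmallRepo" add app.txt
git -C "$demo/SmallRepo" commit -q -m "Initial SmallRepo commit"

# The empty CleanRepo from step 1.
git init -q "$demo/CleanRepo"
git -C "$demo/CleanRepo" symbolic-ref HEAD refs/heads/main

# Add the remote, fetch it, merge it, then drop the remote again.
git -C "$demo/CleanRepo" remote add origin "$demo/SmallRepo"
git -C "$demo/CleanRepo" fetch -q origin
git -C "$demo/CleanRepo" merge -q --allow-unrelated-histories origin/main
git -C "$demo/CleanRepo" remote rm origin

small_head=$(git -C "$demo/SmallRepo" rev-parse HEAD)
clean_head=$(git -C "$demo/CleanRepo" rev-parse HEAD)
remotes=$(git -C "$demo/CleanRepo" remote)
echo "SmallRepo: $small_head"
echo "CleanRepo: $clean_head"
```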

Step 3: Make a shallow clone of LargeRepo with limited history

In a new folder, use git clone to bring down only the previous three years of history. This repo will be the start of SubsetRepo, but we still need to trim it down further to the folders we want to keep.

  • git clone --shallow-since="3 years" <LargeRepo URL>
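
To get a feel for what the date cutoff does, this sketch builds a tiny stand-in for LargeRepo with one backdated commit and one recent commit, then shallow-clones it. The repo contents and the 2010 date are invented for the demo. It clones through a file:// URL because Git ignores shallow options when cloning from a plain local path.

```shell
demo=$(mktemp -d)
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q "$demo/LargeRepo"
git -C "$demo/LargeRepo" symbolic-ref HEAD refs/heads/main
echo one > "$demo/LargeRepo/file.txt"
git -C "$demo/LargeRepo" add file.txt

# Backdate the first commit so it falls well outside the three-year window.
GIT_AUTHOR_DATE="2010-01-01T12:00:00" GIT_COMMITTER_DATE="2010-01-01T12:00:00" \
  git -C "$demo/LargeRepo" commit -q -m "Old commit"

# The second commit uses the current date, so it is inside the window.
echo two >> "$demo/LargeRepo/file.txt"
git -C "$demo/LargeRepo" commit -q -am "Recent commit"

git clone -q --shallow-since="3 years ago" "file://$demo/LargeRepo" "$demo/SubsetRepo"

full_count=$(git -C "$demo/LargeRepo" rev-list --count HEAD)
shallow_count=$(git -C "$demo/SubsetRepo" rev-list --count HEAD)
echo "commits in LargeRepo: $full_count, commits in shallow clone: $shallow_count"
```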

Step 4: Remove all the folders in SubsetRepo that aren’t what we want

This is the part of the exercise that would take days or weeks (I am not exaggerating) to run and is prone to many errors. Enter the magical Python script called git-filter-repo.

Normally you might be tempted to use the incredibly slow and error-prone git filter-branch, or its slightly faster (but not really) --index-filter option. Maybe you have experimented with fast-export and fast-import. Perhaps you have become comfortable with the Java-based BFG Repo Cleaner.

Nay, friend, stop right there and gaze upon this marvel of scalable Git repository manipulation that is git-filter-repo. In this one single step, we will identify four folders in the shallow clone (only three years of history) of LargeRepo and keep just those folders and related history. Anything else will be removed.

  • git filter-repo --path folder1 --path folder2 --path folder3 --path folder4 --tag-rename '':'merged-'

This bad boy of a command goes through the repository’s history and preserves the four folders I want. It also rewrites the tags with a prefix, so that when I merge the SubsetRepo into the CleanRepo, it won’t cause a name collision with an existing tag. We don’t want any conflict. When it’s done, it cleans up after itself too.

When I ran this, it took under ten minutes to go through five years of history, 600,000 commits, and tens of thousands of files. If that doesn’t impress you, I know you haven’t had to use git filter-branch before. After these 500 or so seconds, it ran git gc all by itself. The final SubsetRepo folder including the entire .git hidden directory is 450MB. That is a 97.8% reduction in size.[1]
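
If you want to try the filter on something harmless first, here is a small sketch that runs it against a throwaway repo with two folders to keep, one to drop, and one tag. The folder names, file names and the v1.0 tag are made up. It assumes git-filter-repo is installed and skips the filtering if it isn’t. The --force flag is only there because this demo repo is not a fresh clone, which git-filter-repo normally insists on.

```shell
demo=$(mktemp -d)
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

repo="$demo/SubsetRepo"
git init -q "$repo"
git -C "$repo" symbolic-ref HEAD refs/heads/main
mkdir -p "$repo/folder1" "$repo/folder2" "$repo/unwanted"
echo a > "$repo/folder1/a.txt"
echo b > "$repo/folder2/b.txt"
echo c > "$repo/unwanted/c.txt"
git -C "$repo" add .
git -C "$repo" commit -q -m "Add three folders"
git -C "$repo" tag v1.0

if git filter-repo --version >/dev/null 2>&1; then
  # Keep only folder1 and folder2, and prefix every tag with "merged-".
  git -C "$repo" filter-repo --force \
    --path folder1 --path folder2 --tag-rename '':'merged-'
  kept_files=$(git -C "$repo" ls-files)
  tags=$(git -C "$repo" tag)
  printf '%s\n' "$kept_files" "$tags"
else
  echo "git-filter-repo is not installed; skipping the filter step"
fi
```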

Step 5: Merge SubsetRepo into CleanRepo

Now our SubsetRepo is small enough to be merged into the CleanRepo (which is an exact copy of SmallRepo). Since we only kept four folders, we know there won’t be a merge conflict, so let’s just do it locally by setting up the SubsetRepo folder as a remote server. Yes, you can do that with Git.

We are going to change back to the CleanRepo folder and add a new remote which points to the SubsetRepo folder, then merge it. Once we’re done, we remove the old remote.

  • git remote add origin ../SubsetRepo
  • git fetch origin
  • git merge --allow-unrelated-histories origin/main
  • git remote rm origin
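
Sketched out with two throwaway repos, the local-folder-as-remote trick looks something like this. The make_repo helper and the file names are invented for the demo, with CleanRepo already holding a commit (as it would after step 2) and SubsetRepo holding the filtered folders.

```shell
demo=$(mktemp -d)
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical helper: create a repo at $1 with a single commit adding file $2.
make_repo() {
  git init -q "$1"
  git -C "$1" symbolic-ref HEAD refs/heads/main
  mkdir -p "$1/$(dirname "$2")"
  echo "content" > "$1/$2"
  git -C "$1" add .
  git -C "$1" commit -q -m "Add $2"
}

make_repo "$demo/CleanRepo" "app/main.txt"     # stand-in for the SmallRepo copy
make_repo "$demo/SubsetRepo" "folder1/a.txt"   # stand-in for the filtered repo

# A plain folder path works as a remote.
git -C "$demo/CleanRepo" remote add origin "$demo/SubsetRepo"
git -C "$demo/CleanRepo" fetch -q origin
git -C "$demo/CleanRepo" merge -q --no-edit --allow-unrelated-histories origin/main
git -C "$demo/CleanRepo" remote rm origin

# The merge commit plus its two parents gives three hashes on one line.
hashes=$(git -C "$demo/CleanRepo" rev-list --parents -n 1 HEAD | wc -w | tr -d ' ')
echo "hashes on the merge commit line: $hashes"
git -C "$demo/CleanRepo" ls-files
```

Both histories stay reachable from the merge commit, so git log on a file from either side still shows its own past.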

If all goes to plan, there should be no merge conflicts and we’re ready to push this new repo to the SmallRepo remote server.

Step 6: Force push CleanRepo to the SmallRepo remote server

For this we need to force the push because we’re rewriting history, so definitely make sure there’s a backup of SmallRepo before doing this step.

  • git remote add origin <SmallRepo URL>
  • git push origin main --force
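
Here is one way the backup and the force push could look, tried out against a local bare repository standing in for the SmallRepo remote server. The paths, file names and the mirror-clone backup are assumptions for the demo; your backup approach may well differ.

```shell
demo=$(mktemp -d)
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# Seed a bare repo that plays the part of the SmallRepo remote.
git init -q "$demo/seed"
git -C "$demo/seed" symbolic-ref HEAD refs/heads/main
echo old > "$demo/seed/old.txt"
git -C "$demo/seed" add old.txt
git -C "$demo/seed" commit -q -m "Existing SmallRepo history"
git clone -q --bare "$demo/seed" "$demo/SmallRepo.git"

# A CleanRepo whose history the remote has never seen.
git init -q "$demo/CleanRepo"
git -C "$demo/CleanRepo" symbolic-ref HEAD refs/heads/main
echo new > "$demo/CleanRepo/new.txt"
git -C "$demo/CleanRepo" add new.txt
git -C "$demo/CleanRepo" commit -q -m "Merged result"

# Back up the remote before overwriting anything on it.
git clone -q --mirror "$demo/SmallRepo.git" "$demo/SmallRepo-backup.git"

git -C "$demo/CleanRepo" remote add origin "$demo/SmallRepo.git"
git -C "$demo/CleanRepo" push -q --force origin main

local_head=$(git -C "$demo/CleanRepo" rev-parse main)
remote_head=$(git -C "$demo/SmallRepo.git" rev-parse refs/heads/main)
backup_head=$(git -C "$demo/SmallRepo-backup.git" rev-parse refs/heads/main)
seed_head=$(git -C "$demo/seed" rev-parse main)
echo "remote now at $remote_head, backup still at $backup_head"
```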

And that’s it! Share your Git tips in the comments.

  • 1
    Rewriting the repository’s history necessarily creates new commit hashes. What git-filter-repo lets you turn off, with --preserve-commit-hashes, is its default behaviour of updating references to the old hashes inside commit messages.
