A blazingly fast shell one-liner
This shell script displays all blob objects in the repository, sorted from smallest to largest.
For my sample repo, it ran about 100 times faster than the other ones found here.
On my trusty Athlon II X4 system, it handles the Linux Kernel repository with its 5.6 million objects in just over a minute.
The Base Script
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  sed -n 's/^blob //p' |
  sort --numeric-sort --key=2 |
  cut -c 1-12,41- |
  $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest
When you run the above code, you will get nice human-readable output like this:
...
0d99bb931299  530KiB path/to/some-image.jpg
2ba44098e28f   12MiB path/to/hires-image.png
bd1741ddce0d   63MiB path/to/some-video-1080p.mp4
macOS users: Since numfmt is not available on macOS, you can either omit the last line and deal with raw byte sizes, or brew install coreutils.
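If you only care about the few largest entries, one simple variation (not part of the original script; the count of 10 is an arbitrary choice) is to append a tail between the sort and the formatting steps:

# same pipeline as above, but keep only the 10 largest blobs
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  sed -n 's/^blob //p' |
  sort --numeric-sort --key=2 |
  tail -n 10 |
  cut -c 1-12,41- |
  $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest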
Filtering
To achieve further filtering, insert any of the following lines before the sort line of the base script. A complete example is shown after these two filters.
To exclude files that are present in HEAD, insert the following line:
grep -vF --file=<(git ls-tree -r HEAD | awk '{print $3}') |
To show only files exceeding a given size (e.g. 1 MiB = 2^20 B), insert the following line:
awk '$2 >= 2^20' |
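To make the insertion point explicit, here is one way the full pipeline could look with the size filter spliced in before the sort line (using the same 1 MiB threshold as above):

# base script with the size filter inserted before the sort line
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  sed -n 's/^blob //p' |
  awk '$2 >= 2^20' |
  sort --numeric-sort --key=2 |
  cut -c 1-12,41- |
  $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest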
Output for Computers
To generate output that's more suitable for further processing by computers, omit the last two lines of the base script. They do all the formatting. This will leave you with something like this:
...
0d99bb93129939b72069df14af0d0dbda7eb6dba 542455 path/to/some-image.jpg
2ba44098e28f8f66bac5e21210c2774085d2319b 12446815 path/to/hires-image.png
bd1741ddce0d07b72ccf69ed281e09bf8a2d0b2f 65183843 path/to/some-video-1080p.mp4
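As one hypothetical way to consume that raw output, the following sketch drops the formatting steps entirely and sums up the uncompressed size of all blobs with awk:

# sum up the uncompressed size (in bytes) of all blobs in the repository
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  sed -n 's/^blob //p' |
  awk '{ total += $2 } END { printf "%d bytes in %d blobs\n", total, NR }'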
Appendix
File Removal
For the actual file removal, check out this SO question on the topic.
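If you end up using git filter-repo for the removal, a typical invocation might look like the sketch below; the path is merely the example from the output above, and rewriting history is destructive, so read the linked question and the filter-repo documentation before running anything like this:

# rewrite history so that the given path is removed from every commit
# (destructive: work on a fresh clone and force-push afterwards)
git filter-repo --invert-paths --path path/to/some-video-1080p.mp4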
Understanding the meaning of the displayed file size
What this script displays is the size each file would have in the working directory. If you want to see how much space a file occupies if not checked out, you can use %(objectsize:disk) instead of %(objectsize). However, mind that this metric also has its caveats, as is mentioned in the documentation.
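For instance, a variant of the base script that sorts by on-disk size instead of checkout size would just swap that placeholder:

# same as the base script, but using the size each blob occupies in the object database
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize:disk) %(rest)' |
  sed -n 's/^blob //p' |
  sort --numeric-sort --key=2 |
  cut -c 1-12,41- |
  $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest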
More sophisticated size statistics
Sometimes a list of big files is just not enough to find out what the problem is. You would not spot directories or branches containing humongous numbers of small files, for example.
So if the script here does not cut it for you (and you have a decently recent version of git), look into git-filter-repo --analyze or git rev-list --disk-usage (examples).
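As a taste of the latter, the following calls report aggregate on-disk sizes; --disk-usage requires Git 2.31 or later, and the branch name below is just a placeholder:

# total on-disk size of all objects reachable from any ref
git rev-list --disk-usage --objects --all

# on-disk size of the objects a branch adds on top of main
git rev-list --disk-usage --objects main..some-feature-branch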