Updating file copyright information using git to find the files
Update: I originally missed off --author="Your Name"; it is included in the command below.
All source files should technically have a copyright section at the top. However, updating this every time a file is changed is tiresome and tends to get missed. So after a long period of development you find yourself asking the question “Which files have I modified sufficiently that I need to add myself to the list of copyright holders on the file?”
Of course you are using a version control system, so the information you need about which files you modified, and by how much, is available; it just needs to be extracted. The shell is powerful and a solution is just one (long) line, shown here with long options and split over multiple lines for readability.
$ find . -type f -exec bash -c 'git log --ignore-all-space --unified=0 --oneline \
    --author="Your Name" -- "$1" | grep "^[+-]" | \
    grep --invert-match "^\(+++\|---\)" | wc --lines | \
    xargs test 20 -le ' _ {} \; -print > update_copyright.txt \
  && wc --lines update_copyright.txt
In words: find all files, run a command on each of them, and if that command returns true (0) print that file name to the ‘update_copyright.txt’ file; then count how many lines are in that file.
Where the command is: use git log to find all changes by the given author which changed things other than whitespace, and minimise everything other than the changes themselves (--oneline to reduce the commit message etc. and --unified=0 to remove context). Then keep only the lines which start with + or -, strip out the lines which start with +++ or --- (the file headers), count how many lines are left and test whether that count is at least 20. If so return true (zero), else return false (non-zero).
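The filtering and counting stages can be tried on their own. The following is only a minimal sketch, not the exact command above: the function name count_changed_lines is made up for illustration, and it reads patch text on standard input instead of calling git itself, so it can be fed any saved output of git log --unified=0. Like the original, it would also discard a changed line whose own content begins with ++ or --.

```shell
#!/bin/sh
# Hypothetical helper (the name is an assumption, not part of git):
# reads patch text, such as the output of `git log --unified=0 --oneline`,
# on standard input and prints how many lines were added or removed.
count_changed_lines() {
    # Keep only lines starting with + or -, then drop the +++ / --- file
    # headers. grep -c prints the number of lines that remain.
    grep '^[+-]' | grep -c -v -e '^+++' -e '^---'
}

# Usage with a tiny hand-written patch: one removal and one addition.
printf '%s\n' \
    'abc1234 Fix parser' \
    'diff --git a/f.c b/f.c' \
    '--- a/f.c' \
    '+++ b/f.c' \
    '@@ -3 +3 @@' \
    '-old line' \
    '+new line' | count_changed_lines   # prints 2
```

In the real pipeline the input would come from git log for a single file, and the printed number is what gets compared against the threshold.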
This should result in an output like:
284 update_copyright.txt
I picked 20 for the ‘number of lines changed’ value because 10 lines of changes is generally the size at which copyright information should be updated (I think GNU states this), and since we are counting both additions and removals we want to double that.
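To make that arithmetic explicit, here is a small sketch of the threshold check on its own. The function name needs_copyright_update and the two variable names are illustrative assumptions; only the value 10, the doubling, and the comparison come from the reasoning above.

```shell
#!/bin/sh
# Assumption taken from the post: about 10 changed lines is the point at
# which the copyright notice should be updated.
MIN_CHANGED_LINES=10
# A modified line shows up twice in a diff (one "-" line and one "+" line),
# so the threshold on counted diff lines is doubled.
THRESHOLD=$((MIN_CHANGED_LINES * 2))

# Hypothetical helper: succeeds (exit 0) when the count of +/- diff lines
# given as $1 reaches the threshold, mirroring `test 20 -le COUNT`.
needs_copyright_update() {
    test "$THRESHOLD" -le "$1"
}

for count in 5 20 31; do
    if needs_copyright_update "$count"; then
        echo "$count: update copyright"
    else
        echo "$count: leave as is"
    fi
done
```

A file with purely added lines has to reach 20 additions under this rule, so the doubling is a rough heuristic rather than an exact measure of ten modified lines.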
Now I could go from there to a script which then automatically updated the copyright rather than going through manually and updating it myself… however the output I have contains lots of files which I should not update. Then there are files in different languages which use different types of comments etc. so such a script would be much more difficult to write.
Apologies for the poor quality of English in this post and to those of you who have no idea what I am on about.
Tags: code, copyright, find, git, GNU, gnuprologjava, GSoC, script, source files, VCS, version control
September 1st, 2010 at 13:50
I guess that is a good start, but I know that a lot of the changes I make barely modify the files at all, so you could get thirty lines of diff from a tiny refactor with some whitespace changes (like changing one if and de-indenting a block). I guess that is a good place to start though.
On a slightly related note, my latest awesome VCS discovery a few weeks ago was when working on some projects using Hg. HgQueue is an awesome extension, and seems thoroughly worth investigating if you find yourself ever needing to manage a lot of changes on top of a tree where you will build up a long queue before you can commit.
September 1st, 2010 at 13:57
The purpose of “--ignore-all-space” is to ignore all whitespace changes made to a file. In my project there were quite a lot of line ending encoding changes which I particularly wanted to ignore, but this should also catch de-indenting a block.
I have heard good things about Hg and HgQueue but have not really used either yet.
September 1st, 2010 at 14:25
Also, this doesn’t deal neatly with files which were moved; the diff from the move should probably be excluded.