Daniel Thomas' Blog

Posts Tagged ‘script’

tidy_vig: Automatically reformatting generated HTML into something cleaner

Friday, February 4th, 2011

As webmaster and secretary of various things, I regularly need to upload minutes to websites, and hence want to upload HTML files. While Open/LibreOffice’s export to HTML works, it doesn’t produce nice HTML. tidy is a useful tool for finding flaws in HTML and making it correct and nicer, but it is not sufficient to accomplish this task on its own. Hence I have finally scriptified the various automatable parts of turning generated HTML into something publishable. Note that this loses all style definitions, so the result won’t look the same; run just the tidy_up command defined in the script if you want to avoid that.

#!/bin/bash


set -e #bail if something goes wrong
tidy_up='tidy -indent -modify -clean -bare -asxml -utf8 -wrap 80 -access 3 --logical-emphasis yes'

$tidy_up "$1" #Normalise to lowercase and remove most rubbish
$tidy_up "$1"
$tidy_up "$1" #Repeat until it stabilises - this happens the third time
# Get sed to select the range of lines to apply the replacement on first.
# No I don't know what is going on here.
sed -i '/<style[^>]*>/,/<\/style>/ {
:ack
N
/<\/style>/! b ack
s/<style[^>]*>.*<\/style>//g
}' "$1"
sed -i 's/ class="[^"]*"//g' "$1"
sed -i 's/<\/*span>//g' "$1"
$tidy_up "$1" #Reformat now that remaining cruft removed
sed -i 's/ class="[^"]*"//g' "$1" #Remove any class attributes that were previously split across lines
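For anyone puzzled by the sed part, here is a minimal sketch of what those commands appear to do, written as filters on standard input so they can be tried without tidy. It assumes GNU sed, and it assumes the pattern the script matches is `<style[^>]*>`. The function names strip_style and strip_classes_and_spans are made up for this illustration and are not part of the script above.

```shell
#!/bin/bash

# strip_style (hypothetical name): delete <style ...>...</style> blocks.
# Once a line matches the opening tag, N keeps appending the next input
# line to the pattern space until it contains </style>; the substitution
# then removes the whole block in one go ("." also matches the embedded
# newlines in sed's pattern space).
strip_style() {
    sed '/<style[^>]*>/,/<\/style>/ {
:ack
N
/<\/style>/! b ack
s/<style[^>]*>.*<\/style>//g
}'
}

# strip_classes_and_spans (hypothetical name): drop class attributes first,
# so that <span class="x"> becomes a bare <span>, which the second
# expression (matching <span> and </span>) can then remove.
strip_classes_and_spans() {
    sed -e 's/ class="[^"]*"//g' -e 's/<\/*span>//g'
}
```

The order matters: the span expression only matches tags without attributes, so a span that carries some other attribute (not just a class) would survive both steps.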

Unfortunately, some manual work may still be needed. For example, if the person who wrote the original file didn’t mark headers as headers, some sections will need converting by hand.

It is probably possible to do this in a cleaner, more logical way, I have probably missed edge cases, and this probably counts as being a little hacky. However, hopefully someone will find it useful.

Tags: html, libreoffice, minutes, openoffice.org, script, secretary, webmaster

Updating file copyright information using git to find the files

Wednesday, September 1st, 2010

Update: I missed off --author="Your Name".
All source files should technically have a copyright section at the top. However, updating this every time the file is changed is tiresome and tends to get missed. So after a long period of development you find yourself asking: “Which files have I modified sufficiently that I need to add myself to the list of copyright holders on the file?”
Of course you are using a version control system, so the information about which files you modified, and by how much, is available; it just needs to be extracted. The shell is powerful, and a solution is just one (long) line (shown here using long options and split over multiple lines for readability).

$ find -type f -exec bash -c 'git log --ignore-all-space --unified=0 --oneline  \
    --author="Your Name" {} | grep "^[+-]" | \
    grep --invert-match "^\(+++\|---\)" | wc --lines | \
    xargs test 20 -le ' \; -print > update_copyright.txt \
    && wc --lines update_copyright.txt

In words: find all files and run a command on each; if that command returns true (0), print the file name to the ‘update_copyright.txt’ file; finally, count how many lines are in that file. The command itself does the following: use git log to find all changes other than whitespace-only ones, trimming everything except the changes themselves (--oneline to reduce each commit message to one line and --unified=0 to remove context); keep only the lines which start with + or -; strip out the lines which start with +++ or ---; count the lines that remain; and test whether that count is at least 20. If so, return true (zero), else return false (non-zero).
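The filtering and threshold steps described above can be sketched as a pair of shell functions. This is only an illustration under assumptions: the names count_changed_lines and needs_copyright_update are invented for this sketch, grep -E is used in place of the GNU-specific `\|` syntax, and the threshold of 20 is just a default parameter.

```shell
#!/bin/bash

# count_changed_lines (hypothetical name): read `git log -p` style output
# on stdin and print how many lines were added or removed, ignoring the
# "+++" / "---" file header lines of each diff.
count_changed_lines() {
    grep '^[+-]' | grep -E -v '^(\+\+\+|---)' | wc -l | tr -d ' '
}

# needs_copyright_update (hypothetical name): succeed (return 0) when the
# given author has changed at least THRESHOLD lines of FILE.
#   $1 = file, $2 = author name, $3 = threshold (defaults to 20)
needs_copyright_update() {
    local file=$1 author=$2 threshold=${3:-20}
    local changed
    changed=$(git log --ignore-all-space --unified=0 --oneline \
        --author="$author" -- "$file" | count_changed_lines)
    [ "$changed" -ge "$threshold" ]
}

# Possible usage, run from inside a git repository. Note this variation
# lists tracked files with `git ls-files` rather than using find:
#
#   git ls-files -z | while IFS= read -r -d '' f; do
#       needs_copyright_update "$f" "Your Name" && printf '%s\n' "$f"
#   done > update_copyright.txt
```

Splitting the counting out from the git call means the filter can be checked on a saved diff without touching a repository.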

This should result in an output like:

 284 update_copyright.txt

I picked 20 for the ‘number of lines changed’ value because 10 lines of changes is generally the size at which copyright information should be updated (I think GNU states this), and since we are counting both additions and removals, we want to double that.

Now I could go from there to a script which automatically updated the copyright, rather than going through and updating it by hand. However, the output I have contains lots of files which I should not update, and there are files in different languages which use different types of comments, so such a script would be much more difficult to write.
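To give a flavour of the comment-syntax problem, here is a rough sketch of the very first step such a script would need: guessing the comment leader from the file extension. The function name comment_prefix and the list of extensions are assumptions made purely for illustration, and a real script would need far more than this.

```shell
#!/bin/bash

# comment_prefix (hypothetical name): print the line-comment leader
# conventionally used for a file, judged from its extension alone.
# Returns 1 when the extension is unknown or ambiguous.
comment_prefix() {
    case "$1" in
        *.java|*.c|*.h) printf '%s\n' '//' ;;
        *.sh|*.py)      printf '%s\n' '#' ;;
        *.pro|*.tex)    printf '%s\n' '%' ;;
        # .pl is deliberately left out: it is used for both Perl (#)
        # and Prolog (%), so the extension alone cannot decide.
        *)              return 1 ;;
    esac
}
```

Even this small table has a case that cannot be settled without looking inside the file, which is part of why a fully automatic update is harder than it sounds.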

Apologies for the poor quality of English in this post and to those of you who have no idea what I am on about.

Tags: code, copyright, find, git, GNU, gnuprologjava, GSoC, script, source files, VCS, version control