tidy_vig: Automatically reformatting generated HTML into something cleaner

Friday, February 4th, 2011

As webmaster and secretary of various things, I regularly need to upload minutes to websites, which means uploading HTML files. Open/LibreOffice’s export-to-HTML functionality works, but it doesn’t produce nice HTML. tidy is a useful tool for finding flaws in HTML and making it correct and cleaner, but it is not sufficient for this task on its own. So I have finally scripted the automatable parts of turning generated HTML into something publishable. Note that this strips all style definitions, so the result won’t look the same as the original; run only the tidy_up command defined in the script if you want to avoid that.


#!/bin/bash

set -e #bail if something goes wrong

tidy_up='tidy -indent -modify -clean -bare -asxml -utf8 -wrap 80 -access 3 --logical-emphasis yes'

$tidy_up "$1" #Normalise to lowercase and remove most rubbish
$tidy_up "$1"
$tidy_up "$1" #Repeat until the output stabilises - this happens on the third run
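# Strip the whole <style>...</style> block. The address range selects the lines
# to work on; the :ack loop appends lines (N) until the closing tag is in the
# pattern space, then the substitution deletes the block in one go.
sed -i -e '/<style[^>]*>/,/<\/style>/ {' -e ':ack' -e 'N' -e '/<\/style>/! b ack' -e 's/<style[^>]*>.*<\/style>//g' -e '}' "$1"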
sed -i 's/ class="[^"]*"//g' "$1" #Remove class attributes
sed -i 's/<\/*span>//g' "$1" #Remove bare <span> and </span> tags
$tidy_up "$1" #Reformat now that remaining cruft removed
sed -i 's/ class="[^"]*"//g' "$1" #Remove any classes that are back on one line after reformatting
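
To see what the sed steps do without involving tidy, here is a minimal sketch that runs them on a made-up sample file. The function name strip_cruft and the sample input are hypothetical, the style pattern assumes the opening tag is `<style`, and the commands rely on GNU sed behaviour (`-i` without a suffix, `N` printing the pattern space at end of input).

```shell
#!/bin/bash
# strip_cruft is a hypothetical name; it bundles the three sed steps of the
# script, with "<style" assumed as the opening tag of the style block.
# Relies on GNU sed (-i without a suffix, N printing at end of input).
strip_cruft() {
    sed -i -e '/<style[^>]*>/,/<\/style>/ {' -e ':ack' -e 'N' \
        -e '/<\/style>/! b ack' -e 's/<style[^>]*>.*<\/style>//g' -e '}' "$1"
    sed -i 's/ class="[^"]*"//g' "$1"   # drop every class attribute
    sed -i 's/<\/*span>//g' "$1"        # drop bare <span> and </span> tags
}

# Made-up sample input standing in for an exported document.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<style type="text/css">
  p.western { font-size: 12pt }
</style>
<p class="western"><span>Minutes of the meeting</span></p>
EOF

strip_cruft "$sample"
# Ignore the blank line left behind where the style block used to be.
result=$(grep -v '^$' "$sample")
rm -f "$sample"
printf '%s\n' "$result"
```

On this sample the only line left is `<p>Minutes of the meeting</p>`. Spans that carry attributes are not matched by the span pattern, so they would survive this step.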

Unfortunately, some manual work may still be needed. For example, if the original author didn’t mark headers up as headers, those sections will need converting by hand.
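
One way to speed up that manual pass might be to list the paragraphs that look like headers. The sketch below is not part of tidy_vig: list_header_candidates is a hypothetical helper, and it assumes that a would-be header ends up as a paragraph on a single line containing only one `<strong>` element, which may not hold for every export.

```shell
#!/bin/bash
# list_header_candidates is a hypothetical helper, not part of tidy_vig.
# It prints "line-number:line" for each line that holds a paragraph whose
# whole content is a single <strong> element.
list_header_candidates() {
    grep -n '<p><strong>[^<]*</strong></p>' "$1"
}

# Made-up sample input for illustration only.
demo=$(mktemp)
cat > "$demo" <<'EOF'
<p><strong>Apologies</strong></p>
<p>None were received.</p>
<p><strong>Matters arising</strong></p>
EOF

candidates=$(list_header_candidates "$demo")
rm -f "$demo"
printf '%s\n' "$candidates"
```

The output only points at lines to review; deciding which of them really are headers, and at what level, is still a judgment call.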

It is probably possible to do this in a cleaner, more logical way, I have probably missed edge cases, and this probably counts as a little hacky. However, hopefully someone will find it useful.