tidy_vig: Automatically reformatting generated HTML into something cleaner

As webmaster and secretary of various things, I regularly need to upload minutes to websites, and hence want to upload HTML files. While Open/LibreOffice’s export-to-HTML functionality works, it doesn’t produce nice HTML. tidy is a useful tool for finding flaws in HTML and making it correct and nicer, but it is not sufficient to accomplish this task on its own. Hence I have finally scriptified the various automatable parts of turning generated HTML into something publishable. Note that this loses all style definitions, so the result won’t look the same as the original; if you want to avoid that, run only the tidy_up command defined in the script.


#!/bin/bash

set -e #bail if something goes wrong

tidy_up='tidy -indent -modify -clean -bare -asxml -utf8 -wrap 80 -access 3 --logical-emphasis yes'

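# tidy exits with status 1 when it only has warnings, so accept that under set -e
$tidy_up "$1" || [ $? -eq 1 ] #Normalise to lowercase and remove most rubbish
$tidy_up "$1" || [ $? -eq 1 ]
$tidy_up "$1" || [ $? -eq 1 ] #Repeat until it stabilises - this happens the third time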
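# Remove <style>...</style> blocks, which usually span several lines (GNU sed).
# The range address selects the lines from the opening <style ...> tag onwards;
# the :ack/N/b loop keeps appending lines to the pattern space until it
# contains </style>, so that a single substitution can delete the whole block.
sed -i '/<style[^>]*>/,/<\/style>/ { :ack; N; /<\/style>/! b ack; s/<style[^>]*>.*<\/style>//g; }' "$1"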
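sed -i 's/ class="[^"]*"//g' "$1" #Remove class attributes
sed -i 's/<\/*span>//g' "$1" #Remove the now attribute-less span tags
$tidy_up "$1" || [ $? -eq 1 ] #Reformat now that remaining cruft removed (exit status 1 only means warnings)
sed -i 's/ class="[^"]*"//g' "$1" #Remove any class attributes that were previously split across lines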
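
Since the style-stripping sed line is the least obvious part of the script, here is a spelled-out sketch of the same idea, one sed command per line. It is not a drop-in replacement: the function name strip_style_blocks is made up for illustration, it works on stdin/stdout rather than editing in place, it drops the range address, and it assumes GNU sed.

```shell
#!/bin/bash
# strip_style_blocks is a hypothetical helper name, not part of the script above.
# It reads HTML on stdin and writes it to stdout without <style> blocks.
strip_style_blocks() {
    sed '
        # Only act on a line containing an opening <style ...> tag.
        /<style[^>]*>/ {
            :ack
            # While the pattern space lacks the closing tag, append the
            # next input line (N) and jump back to the label (b ack).
            /<\/style>/! {
                N
                b ack
            }
            # The whole block is now in the pattern space: delete it.
            s/<style[^>]*>.*<\/style>//g
        }
    '
}

# Example: the three style lines collapse into a single empty line.
printf '%s\n' '<head>' '<style type="text/css">' 'p { color: red; }' '</style>' '</head>' |
    strip_style_blocks
```

The key point is that sed normally sees one line at a time, so a pattern like `<style>.*</style>` can only match a multi-line block after N has glued those lines together in the pattern space.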

Unfortunately, some manual work may still be needed. For example, if the person who wrote the original file didn’t mark headings as headings, those sections will need converting by hand.
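
Some of that heading work might be automatable in simple cases. The following is only a sketch under strong assumptions: that a heading was typed as a paragraph containing nothing but bold text, that tidy has turned the bold into `<strong>` and left the paragraph on one line, and that `<h2>` is the right level. The function name promote_bold_paragraphs is made up for illustration.

```shell
#!/bin/bash
# promote_bold_paragraphs is a hypothetical helper name. It assumes a heading
# was typed as a paragraph containing nothing but bold text, and that the
# paragraph sits on a single line. The choice of <h2> is also an assumption.
promote_bold_paragraphs() {
    # [^<]* refuses to match if there is any nested markup inside the bold
    # text, so paragraphs that merely contain a bold word are left alone.
    sed 's|<p>[[:space:]]*<strong>\([^<]*\)</strong>[[:space:]]*</p>|<h2>\1</h2>|'
}

# Example: only the first line is rewritten.
printf '%s\n' '<p><strong>Apologies</strong></p>' '<p>Some <strong>bold</strong> text.</p>' |
    promote_bold_paragraphs
```

Anything this misses, or any document where bold paragraphs are not headings, still needs checking by eye.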

It is probably possible to do this in a cleaner, more logical way; I have probably missed edge cases, and this probably counts as being a little hacky. However, hopefully someone will find it useful.
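
One small step towards "cleaner" would be to stop hard-coding three tidy passes and instead repeat until the file stops changing. A minimal sketch follows; run_until_stable and max_passes are hypothetical names, and the command is assumed to edit the file in place, as tidy does with -modify.

```shell
#!/bin/bash
# run_until_stable FILE MAX_PASSES COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND ARGS FILE repeatedly until FILE's
# checksum no longer changes, giving up after MAX_PASSES passes.
run_until_stable() {
    local file=$1 max_passes=$2
    shift 2
    local pass=1 before after
    while [ "$pass" -le "$max_passes" ]; do
        before=$(cksum < "$file")
        # Treat exit status 1 as success: tidy uses it for "warnings only".
        "$@" "$file" || [ $? -eq 1 ] || return 2
        after=$(cksum < "$file")
        if [ "$before" = "$after" ]; then
            return 0    # nothing changed on this pass
        fi
        pass=$((pass + 1))
    done
    echo "$file still changing after $max_passes passes" >&2
    return 1
}
```

In the script above, the three repeated lines could then become something like `run_until_stable "$1" 5 $tidy_up`, where the limit of 5 is an arbitrary guard against a file that never settles.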


One Response to “tidy_vig: Automatically reformatting generated HTML into something cleaner”

  1. Nicholas Wilson Says:

    Funky; I needed to man some of those sed commands. It’s pretty grunky though. GIGO. Much easier would be using DocBook: download XXE for free, edit with more joy and efficiency than OO.o, and then select the ‘export to HTML’ button; all your neat, structured editing will be XSLT’ed to neat, structured HTML. I have some post-processing XSLT if you want to further polish the default output up to suit post-1990s taste in markup. The result that the DocBook XSL sheets give is fundamentally ‘nice’ and tractable, though.

    Expect me to neaten up and publish my paean to structured editing sometime soon.
