Archive for the ‘Part II Project’ Category

Vertical labels on gnuplot LaTeX graphs

Thursday, April 28th, 2011

This post exists because doing this was far harder than it should have been and hopefully this will help someone else in the future.

When creating a bar chart/histogram in gnuplot using the latex driver if there are a lot of bars and the labels for the bars are over a certain length then the labels overlap horribly. The solution to this would be to rotate them and the following LaTeX and gnuplot code allows that to happen and deals with various fallout that results.

The following defines a length for storing offsets of the labels in \verticallabeloffset as this offset will depend on the length of the text. It also stores a length which holds the maximum of those values \maxverticallabeloffset. It provides a command \verticallabel which does hte following: calculates the length of the label, moves the start position across by 1.4em, moves the label down by 2em (to provide space for the x axis label) and then down by the length of the text. It then uses a sideways environment to make the text vertical. Then it uses the ifthen package to work out if the \verticallabeloffset is bigger than any previously seen and if so it sets \globaldefs=1 so that the effect of the \setlength command will be global rather than being restricted to the local scope and then sets it back to 0 (this is a nasty hack).
It also provides the \xaxislabel command which shifts the x axis title up into the space between the x axis and the labels.

\providecommand{\verticallabel}[1]{\settowidth{\verticallabeloffset}{#1}\hspace{1.4em}\vspace{-2em}\vspace{-\verticallabeloffset}\begin{sideways} #1 \end{sideways}\ifthenelse{\lengthtest{\verticallabeloffset>\maxverticallabeloffset}}{\globaldefs=1\setlength\maxverticallabeloffset{\verticallabeloffset}\globaldefs=0}{}}

Having defined that the following allows gnuplot histograms generated with the LaTeX driver to be included in a LaTeX document. It first resets the maximum offset and then ensures there is sufficient space for the labels

\caption{foo bar chart}

The following gnuplot code generates a histogram which uses this to get the labels to display correctly.

set terminal latex size 16cm, 7.5cm
set style histogram errorbars

set ylabel "\\begin\{sideways\}Mean absolute error\\end\{sideways\}"
set xlabel "\\xaxislabel\{Data set\}"

set output "comparisonSummary_data.tex"
plot "comparisonSummary.dat" index 0 using 2:($3*1.96):xtic("\\verticallabel\{" . stringcolumn(1) . "\}") with histogram title "Mean absolute error for different data sets"

I hope this helps someone. I should get around to actually patching the gnuplot latex driver so that it works properly – but that will have to wait until post exams.

LaTeX user search path

Monday, November 1st, 2010

Because this took far too long to find out:
If you have .sty files you have written want to be able to easily reuse in multiple documents then put them in ~/texmf/tex
Then you should be able to use them with \usepackage{foo} as normal.

Models of human sampling and interpolating regular data

Saturday, October 23rd, 2010

On Thursday I submitted my project proposal for my Part II project. A HTML version of it (generated using hevea and tidy from LaTeX with all styling stripped out) follows. (With regard to the work schedule – I appear to be one week behind already. Oops.)

Part II Computer Science Project Proposal

Models of human sampling and interpolating regular data

D. Thomas, Peterhouse

Originator: Dr A. Rice

Special Resources Required

The use of my own laptop (for development)
The use of the PWF (backup, backup development)
The use of the SRCF (backup, backup development)
The use of zeus (backup)

Project Supervisor: Dr A. Rice

Director of Studies: Dr A. Norman

Project Overseers: Alan Blackwell + Cecilia Mascolo


When humans record information they do not usually do so in the same
regular manner that a machine does as the rate at which they sample depends
on factors such as how interested in the data they are and whether they have
developed a habit of collecting the data on a particular schedule. They are
also likely to have other commitments which prevent them recording at precise
half hour intervals for years. In order to be able to test methods of
interpolating from human recorded data to a more regular data stream such as
that which would be created by a machine we need models of how humans collect
data. contains data collected by humans which can be used
to evaluate these models. Using these models we can then create test data
sets from high resolution machine recorded data sets1 and then try to interpolate back to the
original data set and evaluate how good different machine learning techniques
are at doing this. This could then be extended with pluggable models for
different data sets which could then use the human recorded data set to do
parameter estimation. Interpolating to a higher resolution regular data set
allows for comparison between different data sets for example those collected
by different people or relating to different readings such as gas and

Work that has to be done

The project breaks down into the following main sections:-

  1. Investigating the distribution of recordings in
    the data set.
  2. Constructing hypotheses of how the human recording
    of data can be modelled and evaluating these models against the data set.
  3. Using these models to construct test data sets by
    sampling the regular machine recorded data sets2 to produce pseudo-human read test data sets
    which can be used to be learnt from as the results can be compared with the
    reality of the machine read data sets.
  4. Using machine learning interpolation techniques to
    try and interpolate back to the original data sets from the test data sets
    and evaluating success of different methods in achieving this.

    • Polynomial fit
    • Locally weighted linear regression
    • Gaussian process regression (see Chapter 2 of
      Gaussian Processes for Machine Learning by Rasmussen &
    • Neural Networks (possibly using java-fann)
    • Hidden Markov Models (possibly using jahmm)
  5. If time allows then using parameter estimation on
    a known model of a system to interpolate from a test data set back to the
    original data set and evaluating how well this compares with the machine
    learning techniques which have no prior knowledge of the system.
  6. Writing the Dissertation.

Difficulties to Overcome

The following main learning tasks will have to be undertaken before the
project can be started:

  • To find a suitable method for comparing different
    sampling patterns to enable hypothesises of human behaviour to be
  • Research into existing models for related human

Starting Point

I have a good working knowledge of Java and of queries in SQL.
I have read “Machine Learning” by Tom Mitchell.
Andrew Rice has written some Java code which does some basic linear
interpolation it was written for use in producing a particular paper but
should form a good starting point at least providing ideas on how to go
forwards. It can also be used for requirement sampling. database

I have worked with the database before (summer 2009) and
with large data sets of sensor readings (spring 2008).
For the purpose of this project the relevant data can be viewed as a table
with three columns: “meter_id, timestamp, value“.
There are 99 meters with over 30 readings, 39 with over 50, 12 with over 100
and 5 with over 200. This data is to be used for the purpose of constructing
and evaluating models of how humans record data.

Evaluation data sets

There are several data sets to be used for the purpose of training and
evaluating the machine learning interpolation techniques. These are to be
sampled using the model constructed in the first part of the project for how
humans record data. This then allows the data interpolated from this sampled
data to be compared with the actual data which was sampled from.
The data sets are:

  • half hourly electricity readings for the WGB from
    2001-2010 (131416 records in “timestamp, usage rate
  • monthly gas readings for the WGB from 2002-2010 (71
    records in “date, total usage” format)
  • half hourly weather data from the DTG weather
    station from 1995-2010 (263026 records)


This project should mainly developed on my laptop which has sufficient
resources to deal with the anticipated workload.
The project will be kept in version control using GIT. The SRCF, PWF and zeus
will be set to clone this and fetch regularly. Simple backups will be taken
at weekly intervals to SRCF/PWF and to an external disk.

Success criterion

  1. Models of human behaviour in recording data must
    be constructed which emulate real behaviour in the
  2. The machine learning methods must produce better
    approximations of the underlying data than linear interpolation and these
    different methods should be compared to determine their relative merits on
    different data sets.
  3. The machine once trained should be able apply this
    to unseen data of a similar class and produce better results than linear
  4. A library should be produced which is well
    designed and documented to allow users – particularly researchers – to be
    able to easily combine various functions on the input data.
  5. The dissertation should be written.

Work Plan

Planned starting date is 2010-10-15.

Dates in general indicate start dates or deadlines and this is clearly
indicated. Work items should usually be finished before the next one starts
except where indicated (extensions run concurrently with dissertation

Monday, October 18
Start: Investigating the distribution of
recordings in the data set
Monday, October 25
Start: Constructing hypotheses of how the human
recording of data can be modelled and evaluating these models against the data set.
This involves examining the distributions and modes of recording found in
the previous section and constructing parametrised models which can
encapsulate this. For example a hypothesis might be that some humans record
data in three phases, first frequently (e.g. several times a day) and then
trailing off irregularly until some more regular but less frequent mode is
entered where data is recorded once a week/month. This would then be
parametrised by the length and frequency in each stage and within that
stage details such as the time of day would probably need to be
characterised by probability distributions which can be calculated from the dataset.
Monday, November 8
Start: Using these models to construct test data
sets by sampling a regular machine recorded data sets.
Monday, November 15
Start: Using machine learning interpolation techniques to try and
interpolate back to the original data sets from the test data sets and
evaluating success of different methods in achieving this.

Monday, November 15
Start: Polynomial fit
Monday, November 22
Start: Locally weighted linear
Monday, November 29
Start: Gaussian process regression
Monday, December 13
Start: Neural Networks
Monday, December 27
Start: Hidden Markov Models
Monday, January 3, 2011
Start: Introduction chapter
Monday, January 10, 2011
Start: Preparation chapter
Monday, January 17, 2011
Start: Progress report
Monday, January 24, 2011
Start: If time allows then using parameter
estimation on a known model of a system to interpolate from a test data set
back to the original data set. This continues on until 17th
March and can be expanded or shrunk depending on available time.
Friday, January 28, 2011
Deadline: Draft progress
Wednesday, February 2,
Deadline: Final progress report
printed and handed in. By this point the core of the project should be
completed with only extension components and polishing remaining.
Friday, February 4, 2011,
Deadline: Progress Report
Monday, February 7, 2011
Start: Implementation Chapter
Monday, February 21, 2011
Start: Evaluation Chapter
Monday, March 7, 2011
Start: Conclusions chapter
Thursday, March 17, 2011
Deadline: First Draft of
Dissertation (by this point revision for the exams will be in full swing
limiting time available for the project and time is required between drafts
to allow people to read and comment on it)
Friday, April 1, 2011
Deadline: Second draft
Friday, April 22, 2011
Deadline: Third draft
Friday, May 6, 2011
Deadline: Final version of
dissertation produced
Monday, May 16, 2011
Deadline: Print, bind and
submit dissertation
Friday, May 20, 2011,
Deadline: Dissertation
submission deadline

Such as the WGB’s Energy usage, see ┬žStarting
Point for more details.
These are detailed in ┬žStarting Point

Proving Java Terminates (or runs forever) – Avoiding deadlock automatically with compile time checking

Tuesday, October 12th, 2010

This is one of the two projects I was considering doing for my Part II project and the one which I decided not to do.
This project would involve using annotations in Java to check that the thread safety properties which the programmer believes that the code has actually hold.
For example @DeadlockFree or @ThreadSafe.
Now you might think that this would be both incredibly useful and far too difficult/impossible however implementations of this exist at least in part: The MIT licensed CheckThread project and another implementation forms part of the much larger JSR-305 project to use annotations in java to detect software defects. JSR-305 was going to be part of Java 7 but development appears to be slow (over a year between the second most recent commit and the most recent commit (4 days ago)).

Essentially if you want to use annotations in Java to find your thread safety/deadlock bugs and prevent them from reoccurring the technology is out there (but needs some polishing). In the end I decided that since I could probably hack something working together and get it packaged in Debian in a couple of days/weeks by fixing problems in the existing implementations then it would not be particularly useful to spend several months reimplementing it as a Part II project.

If a library which allows you to use annotations in Java to prevent thread safety/deadlock problems is not in Debian (including new queue) by the end of next summer throw things at me until it is.