Models of human sampling and interpolating regular data

Saturday, October 23rd, 2010

On Thursday I submitted my project proposal for my Part II project. An HTML version of it (generated using hevea and tidy from LaTeX with all styling stripped out) follows. (With regard to the work schedule – I appear to be one week behind already. Oops.)

Part II Computer Science Project Proposal

Models of human sampling and interpolating regular data

D. Thomas, Peterhouse

Originator: Dr A. Rice

Special Resources Required

The use of my own laptop (for development)
The use of the PWF (backup, backup development)
The use of the SRCF (backup, backup development)
The use of zeus (backup)

Project Supervisor: Dr A. Rice

Director of Studies: Dr A. Norman

Project Overseers: Alan Blackwell + Cecilia Mascolo (AFB/CM)

Introduction

When humans record information they do not usually do so in the same
regular manner as a machine: the rate at which they sample depends on
factors such as how interested in the data they are and whether they have
developed a habit of collecting the data on a particular schedule. They are
also likely to have other commitments which prevent them recording at precise
half-hour intervals for years. In order to test methods of interpolating
from human-recorded data to a more regular data stream, such as that which
would be created by a machine, we need models of how humans collect data.
ReadYourMeter.org contains data collected by humans which can be used
to evaluate these models. Using these models we can then create test data
sets from high-resolution machine-recorded data sets [1], try to interpolate
back to the original data sets, and evaluate how good different machine
learning techniques are at doing this. This could then be extended with
pluggable models for different data sets, which could use the human-recorded
data set for parameter estimation. Interpolating to a higher-resolution
regular data set allows comparison between different data sets, for example
those collected by different people or relating to different readings such
as gas and electricity.
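The evaluation loop described above can be sketched as follows. This is a minimal illustration with invented names (`EvaluationSketch`, `Interpolator`, `rmse`), not the project's actual code: an interpolation method fitted to a pseudo-human sample is queried at the original machine timestamps and scored against the true dense series. The interface is the "pluggable" seam where different techniques would be swapped in.

```java
// Sketch of the proposed evaluation loop (illustrative names, not the
// project's actual code): a fitted interpolator is queried at the original
// machine timestamps and scored with root-mean-square error.
public class EvaluationSketch {
    /** A fitted interpolation model: maps a timestamp to an estimated value. */
    interface Interpolator {
        double valueAt(double t);
    }

    /** Root-mean-square error of an interpolator against the dense series. */
    static double rmse(Interpolator f, double[] ts, double[] vs) {
        double sum = 0;
        for (int i = 0; i < ts.length; i++) {
            double e = f.valueAt(ts[i]) - vs[i];
            sum += e * e;
        }
        return Math.sqrt(sum / ts.length);
    }

    public static void main(String[] args) {
        double[] ts = {0, 1, 2, 3};
        double[] vs = {0, 1, 4, 9};        // dense "machine" readings
        Interpolator constant = t -> 3.5;  // a trivial stand-in model
        System.out.println(rmse(constant, ts, vs)); // prints 3.5
    }
}
```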

Work that has to be done

The project breaks down into the following main sections:

  1. Investigating the distribution of recordings in
    the ReadYourMeter.org data set.
  2. Constructing hypotheses of how the human recording
    of data can be modelled and evaluating these models against the
    ReadYourMeter.org data set.
  3. Using these models to construct test data sets by
    sampling the regular machine-recorded data sets [2] to produce
    pseudo-human-read test data sets to learn from, whose results can be
    compared with the actual machine-read data sets.
  4. Using machine learning interpolation techniques to
    try to interpolate back to the original data sets from the test data sets
    and evaluating the success of different methods in achieving this.

    • Polynomial fit
    • Locally weighted linear regression
    • Gaussian process regression (see Chapter 2 of
      Gaussian Processes for Machine Learning by Rasmussen &
      Williams)
    • Neural Networks (possibly using java-fann)
    • Hidden Markov Models (possibly using jahmm)
  5. If time allows, using parameter estimation on
    a known model of a system to interpolate from a test data set back to the
    original data set, and evaluating how well this compares with the machine
    learning techniques, which have no prior knowledge of the system.
  6. Writing the Dissertation.
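As an illustration of the first technique in the list, a least-squares polynomial fit can be sketched as below. This is only a sketch using a naive normal-equations solver with Gaussian elimination; the project would more likely use an established library routine, since the normal equations become ill-conditioned at higher degrees.

```java
import java.util.Arrays;

// Minimal least-squares polynomial fit: build the normal equations from the
// sample points and solve them with Gaussian elimination (partial pivoting).
// Illustrative only; a QR- or SVD-based solver is more numerically robust.
public class PolyFit {
    /** Returns coefficients c[0] + c[1]*x + ... + c[degree]*x^degree. */
    static double[] fit(double[] xs, double[] ys, int degree) {
        int n = degree + 1;
        double[][] a = new double[n][n + 1]; // augmented normal equations
        for (int r = 0; r < n; r++) {
            for (int c = 0; c < n; c++)
                for (double x : xs) a[r][c] += Math.pow(x, r + c);
            for (int i = 0; i < xs.length; i++)
                a[r][n] += ys[i] * Math.pow(xs[i], r);
        }
        // Forward elimination with partial pivoting.
        for (int col = 0; col < n; col++) {
            int piv = col;
            for (int r = col + 1; r < n; r++)
                if (Math.abs(a[r][col]) > Math.abs(a[piv][col])) piv = r;
            double[] tmp = a[col]; a[col] = a[piv]; a[piv] = tmp;
            for (int r = col + 1; r < n; r++) {
                double f = a[r][col] / a[col][col];
                for (int c = col; c <= n; c++) a[r][c] -= f * a[col][c];
            }
        }
        // Back substitution.
        double[] coeffs = new double[n];
        for (int r = n - 1; r >= 0; r--) {
            double s = a[r][n];
            for (int c = r + 1; c < n; c++) s -= a[r][c] * coeffs[c];
            coeffs[r] = s / a[r][r];
        }
        return coeffs;
    }

    public static void main(String[] args) {
        // Points taken from y = x^2 should be recovered almost exactly.
        double[] xs = {0, 1, 2, 3};
        double[] ys = {0, 1, 4, 9};
        System.out.println(Arrays.toString(fit(xs, ys, 2)));
    }
}
```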

Difficulties to Overcome

The following main learning tasks will have to be undertaken before the
project can be started:

  • To find a suitable method for comparing different
    sampling patterns, enabling hypotheses of human behaviour to be
    evaluated.
  • Research into existing models for related human
    behaviour.

Starting Point

I have a good working knowledge of Java and of queries in SQL.
I have read “Machine Learning” by Tom Mitchell.
Andrew Rice has written some Java code which does some basic linear
interpolation. It was written for use in producing a particular paper, but
should form a good starting point, at least providing ideas on how to go
forwards. It can also be used for requirement sampling.
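That existing code is not reproduced here, but the basic linear-interpolation baseline it implements (and against which the machine learning methods are to be judged) might look roughly like this sketch:

```java
// Generic sketch of the linear-interpolation baseline: estimate the value at
// x by interpolating along the straight line between the two surrounding
// samples, clamping to the end values outside the sampled range.
public class LinearInterp {
    /**
     * Linearly interpolates at x given samples (xs[i], ys[i]) with xs
     * strictly increasing.
     */
    static double at(double[] xs, double[] ys, double x) {
        if (x <= xs[0]) return ys[0];
        if (x >= xs[xs.length - 1]) return ys[ys.length - 1];
        int i = 1;
        while (xs[i] < x) i++;             // find the bracketing segment
        double f = (x - xs[i - 1]) / (xs[i] - xs[i - 1]);
        return ys[i - 1] + f * (ys[i] - ys[i - 1]);
    }

    public static void main(String[] args) {
        double[] xs = {0, 2, 4};
        double[] ys = {10, 30, 20};
        System.out.println(at(xs, ys, 3)); // halfway along the second segment: 25.0
    }
}
```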

ReadYourMeter.org database

I have worked with the ReadYourMeter.org database before (summer 2009) and
with large data sets of sensor readings (spring 2008).
For the purpose of this project the relevant data can be viewed as a table
with three columns: “meter_id, timestamp, value”.
There are 99 meters with over 30 readings, 39 with over 50, 12 with over 100
and 5 with over 200. This data is to be used for the purpose of constructing
and evaluating models of how humans record data.
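For illustration, this three-column view might be represented and queried in Java as follows. The rows below are made up, not real ReadYourMeter.org data, and the class and method names are invented:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative representation of the three-column table and the kind of
// per-meter count query behind the figures quoted above (rows are made up).
public class ReadingCounts {
    record MeterReading(long meterId, long timestamp, double value) {}

    /** Number of meters with at least n readings. */
    static long metersWithAtLeast(List<MeterReading> rows, int n) {
        Map<Long, Long> counts = rows.stream()
            .collect(Collectors.groupingBy(MeterReading::meterId,
                                           Collectors.counting()));
        return counts.values().stream().filter(c -> c >= n).count();
    }

    public static void main(String[] args) {
        List<MeterReading> rows = List.of(
            new MeterReading(1, 100, 5.0),
            new MeterReading(1, 200, 5.5),
            new MeterReading(2, 100, 7.0));
        System.out.println(metersWithAtLeast(rows, 2)); // only meter 1: prints 1
    }
}
```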

Evaluation data sets

There are several data sets to be used for training and evaluating the
machine learning interpolation techniques. These are to be sampled using the
models constructed in the first part of the project of how humans record
data. This allows the data interpolated from the sampled data to be compared
with the actual data from which it was sampled.
The data sets are:

  • half-hourly electricity readings for the WGB from
    2001-2010 (131416 records in “timestamp, usage rate”
    format)
  • monthly gas readings for the WGB from 2002-2010 (71
    records in “date, total usage” format)
  • half-hourly weather data from the DTG weather
    station from 1995-2010 (263026 records)

Resources

This project will mainly be developed on my laptop, which has sufficient
resources to deal with the anticipated workload.
The project will be kept in version control using Git. The SRCF, PWF and zeus
will be set to clone this and fetch regularly. Simple backups will be taken
at weekly intervals to the SRCF/PWF and to an external disk.

Success criteria

  1. Models of human behaviour in recording data must
    be constructed which emulate real behaviour in the ReadYourMeter.org
    dataset.
  2. The machine learning methods must produce better
    approximations of the underlying data than linear interpolation, and these
    methods should be compared to determine their relative merits on
    different data sets.
  3. Once trained, the machine should be able to apply this
    to unseen data of a similar class and produce better results than linear
    interpolation.
  4. A library should be produced which is well
    designed and documented, allowing users – particularly researchers – to
    easily combine various functions on the input data.
  5. The dissertation should be written.

Work Plan

Planned starting date is 2010-10-15.

Dates indicate start dates or deadlines, as labelled. Work items should
usually be finished before the next one starts, except where indicated
(extensions run concurrently with dissertation writing).

Monday, October 18
Start: Investigating the distribution of recordings in the
ReadYourMeter.org data set.
Monday, October 25
Start: Constructing hypotheses of how the human recording of data can be
modelled and evaluating these models against the ReadYourMeter.org data set.
This involves examining the distributions and modes of recording found in
the previous section and constructing parametrised models which can
encapsulate this. For example, a hypothesis might be that some humans record
data in three phases: first frequently (e.g. several times a day), then
trailing off irregularly until some more regular but less frequent mode is
entered where data is recorded once a week/month. This would then be
parametrised by the length and frequency of each stage; within each stage,
details such as the time of day would probably need to be characterised by
probability distributions which can be calculated from the
ReadYourMeter.org dataset.
Monday, November 8
Start: Using these models to construct test data sets by sampling the
regular machine-recorded data sets.
Monday, November 15
Start: Using machine learning interpolation techniques to try to
interpolate back to the original data sets from the test data sets and
evaluating the success of different methods in achieving this.

Monday, November 15
Start: Polynomial fit
Monday, November 22
Start: Locally weighted linear regression
Monday, November 29
Start: Gaussian process regression
Monday, December 13
Start: Neural Networks
Monday, December 27
Start: Hidden Markov Models
Monday, January 3, 2011
Start: Introduction chapter
Monday, January 10, 2011
Start: Preparation chapter
Monday, January 17, 2011
Start: Progress report
Monday, January 24, 2011
Start: If time allows, using parameter estimation on a known model of a
system to interpolate from a test data set back to the original data set.
This continues until 17th March and can be expanded or shrunk depending on
available time.
Friday, January 28, 2011
Deadline: Draft progress report
Wednesday, February 2, 2011
Deadline: Final progress report printed and handed in. By this point the
core of the project should be completed, with only extension components and
polishing remaining.
Friday, February 4, 2011, 12:00
Deadline: Progress report
Monday, February 7, 2011
Start: Implementation chapter
Monday, February 21, 2011
Start: Evaluation chapter
Monday, March 7, 2011
Start: Conclusions chapter
Thursday, March 17, 2011
Deadline: First draft of dissertation (by this point revision for the exams
will be in full swing, limiting the time available for the project, and time
is required between drafts to allow people to read and comment on it)
Friday, April 1, 2011
Deadline: Second draft of dissertation
Friday, April 22, 2011
Deadline: Third draft of dissertation
Friday, May 6, 2011
Deadline: Final version of dissertation produced
Monday, May 16, 2011
Deadline: Print, bind and submit dissertation
Friday, May 20, 2011, 11:00
Deadline: Dissertation submission deadline
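The three-phase recording hypothesis described in the work plan above could be prototyped roughly as below. All phase lengths and rates here are invented placeholders, to be replaced by parameters estimated from the ReadYourMeter.org data set:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of the three-phase recording hypothesis: frequent readings at
// first, an irregular tail-off, then a settled weekly habit. Every phase
// length and rate below is an invented placeholder; the real values would
// be estimated from the ReadYourMeter.org data set.
public class ThreePhaseSampler {
    /** Generates reading times, in hours from the first reading. */
    static List<Double> sample(long seed) {
        Random rng = new Random(seed);
        List<Double> times = new ArrayList<>();
        double t = 0;
        // Phase 1: several readings a day, for about two weeks.
        while (t < 14 * 24) {
            times.add(t);
            t += 4 + 8 * rng.nextDouble();         // every 4-12 hours
        }
        // Phase 2: irregular tail-off, lasting until about week ten.
        while (t < 10 * 7 * 24) {
            times.add(t);
            t += 24 * (1 + 13 * rng.nextDouble()); // every 1-14 days
        }
        // Phase 3: settled weekly habit with a little jitter, for a year.
        while (t < 52 * 7 * 24) {
            times.add(t);
            t += 7 * 24 + 12 * (rng.nextDouble() - 0.5);
        }
        return times;
    }

    public static void main(String[] args) {
        System.out.println(sample(42).size() + " readings in the first year");
    }
}
```

Fitting such a model to a real meter would amount to estimating the phase boundaries and inter-reading distributions from that meter's timestamps, which is the parameter-estimation step described above.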

[1]
Such as the WGB’s energy usage; see §Evaluation data
sets for more details.
[2]
These are detailed in §Evaluation data sets.