Archive for October, 2010

Firesheep as applied to Cambridge

Tuesday, October 26th, 2010

Many of you will have already heard about Firesheep, which is essentially a Firefox extension that allows you to log in to other people’s Facebook, Amazon etc. accounts if they are on the same (unsecured) network as you. This post gives my initial thoughts on what this means for people on Cambridge University networks.

Essentially this whole thing is nothing new – in one sense, people who know anything about security already knew that this was possible and that programs for doing this existed. The only innovation is an easy-to-use user interface, and because Human Computer Interaction (HCI) is hard, this means that Eric Butler has won.

In Cambridge we have unsecured wireless networks such as Lapwing and the CL’s shared key networks, and I think that Firesheep should work fine on these. So, for example, in lectures where lots of students are checking Facebook et al. (especially in the CL) there is great potential for “pwned with Firesheep” becoming the status of many people. However, this would be morally wrong and would violate the Terms of Service of the CUDN/JANET etc. If that isn’t sufficient – the UCS has magic scripts that watch network traffic, they know where you live, and if you do something really bad they can probably stop you graduating. So, while it would be amusing, I don’t think that a sudden epidemic of breaking into people’s accounts would be sensible.

So what does that mean for the users of Cambridge networks? Use Eduroam. Eduroam is wonderful and actually provides security in this case (at least as long as you trust the UCS, but we have to do that anyway). If you are using Lapwing and you use a site listed on the handlers page for Firesheep (though don’t visit that link on an unsecured network, as GitHub is on that list) then you have to accept the risk that someone may steal your cookies and pretend to be you.

What does this mean for people running websites for Cambridge people? Use SSL. If you are using the SRCF then you win, as we provide free SSL and it is simply a matter of using a .htaccess file to turn it on. It should also be pointed out that if you are using Raven for authentication (which you should be) then you still need to use SSL for all the pages on which you are authenticated, or you lose[0]. If you are not using the SRCF – then why not? The SRCF is wonderful![1] If you are within *.cam.ac.uk and not using the SRCF then you can also obtain free SSL certificates from the UCS (though I doubt anyone likely to read this is).
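As a rough illustration of the secure-cookie point (this is not the Raven or SRCF mechanism, which this post does not describe), here is a minimal Python sketch that builds a Set-Cookie header value with the Secure flag using the standard library's http.cookies module. The function name, cookie name and cookie value are hypothetical placeholders.

```python
from http.cookies import SimpleCookie


def build_session_cookie(name: str, value: str) -> str:
    """Return a Set-Cookie header value flagged Secure and HttpOnly.

    The name and value passed in are hypothetical placeholders.
    """
    cookie = SimpleCookie()
    cookie[name] = value
    cookie[name]["secure"] = True    # browser only sends it over HTTPS
    cookie[name]["httponly"] = True  # not readable from page scripts
    cookie[name]["path"] = "/"
    return cookie[name].OutputString()


if __name__ == "__main__":
    print("Set-Cookie: " + build_session_cookie("session", "opaque-token"))
```

A cookie marked Secure is not sent on plain HTTP requests, so visiting a non-SSL page on the same host should not expose it to someone listening on the network.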

So do I fail on this count? Yes: I think I have multiple websites on the SRCF which don’t use SSL everywhere they should, and I don’t think any of them uses secure cookies. I also feel slightly responsible for another website which both uses poorly designed cookies and lacks SSL.

Users – know the risks. Developers – someone is telling us to wake up again, even though I knew I was sleeping.

[0]: Unfortunately I think that until the SRCF rolls out per-user and per-society subdomains (which will be happening RSN), if you use Raven to log in to one site on the SRCF and then visit any non-SSL page on the SRCF, your Raven cookie for the SRCF has just leaked to anyone listening. Oops. Using secure cookies would fix this, though I haven’t worked out how to do this yet – I will post a HOWTO later. Update: if the original authentication is done to an SSL protected site then the Raven cookie will be set to be secure.
[1]: I may be wearing my SRCF Chairman hat while writing that – though that doesn’t mean it isn’t true.

Models of human sampling and interpolating regular data

Saturday, October 23rd, 2010

On Thursday I submitted my project proposal for my Part II project. An HTML version of it (generated using hevea and tidy from LaTeX with all styling stripped out) follows. (With regard to the work schedule – I appear to be one week behind already. Oops.)

Part II Computer Science Project Proposal

Models of human sampling and interpolating regular data

D. Thomas, Peterhouse

Originator: Dr A. Rice

Special Resources Required

The use of my own laptop (for development)
The use of the PWF (backup, backup development)
The use of the SRCF (backup, backup development)
The use of zeus (backup)

Project Supervisor: Dr A. Rice

Director of Studies: Dr A. Norman

Project Overseers: Alan Blackwell + Cecilia Mascolo
(AFB/CM)

Introduction

When humans record information they do not usually do so in the same
regular manner that a machine does as the rate at which they sample depends
on factors such as how interested in the data they are and whether they have
developed a habit of collecting the data on a particular schedule. They are
also likely to have other commitments which prevent them recording at precise
half hour intervals for years. In order to be able to test methods of
interpolating from human recorded data to a more regular data stream such as
that which would be created by a machine we need models of how humans collect
data. ReadYourMeter.org contains data collected by humans which can be used
to evaluate these models. Using these models we can then create test data
sets from high resolution machine recorded data sets[1] and then try to interpolate back to the
original data set and evaluate how good different machine learning techniques
are at doing this. This could then be extended with pluggable models for
different data sets which could then use the human recorded data set to do
parameter estimation. Interpolating to a higher resolution regular data set
allows for comparison between different data sets for example those collected
by different people or relating to different readings such as gas and
electricity.

Work that has to be done

The project breaks down into the following main sections:-

  1. Investigating the distribution of recordings in
    the ReadYourMeter.org data set.
  2. Constructing hypotheses of how the human recording
    of data can be modelled and evaluating these models against the
    ReadYourMeter.org data set.
  3. Using these models to construct test data sets by
    sampling the regular machine recorded data sets[2] to produce pseudo-human read test data sets.
    These can be learnt from, as the results can be compared with the
    reality of the machine read data sets.
  4. Using machine learning interpolation techniques to
    try and interpolate back to the original data sets from the test data sets
    and evaluating success of different methods in achieving this.

    • Polynomial fit
    • Locally weighted linear regression
    • Gaussian process regression (see Chapter 2 of
      Gaussian Processes for Machine Learning by Rasmussen &
      Williams)
    • Neural Networks (possibly using java-fann)
    • Hidden Markov Models (possibly using jahmm)
  5. If time allows then using parameter estimation on
    a known model of a system to interpolate from a test data set back to the
    original data set and evaluating how well this compares with the machine
    learning techniques which have no prior knowledge of the system.
  6. Writing the Dissertation.
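Step 3 above could be sketched roughly as follows. This is only an illustration under made-up assumptions: the three-phase shape (keen daily readings, then an irregular fade, then roughly weekly readings), the phase lengths, the time-of-day window and the function name are all hypothetical, and none of them has been fitted to the ReadYourMeter.org data. Python is used purely for brevity.

```python
import random


def three_phase_sample_indices(n_points, per_day=48, seed=0,
                               keen_days=7, fade_days=21):
    """Pick the indices of a regular series that a 'human' might record.

    Hypothetical model: one reading a day while keen, then irregular
    gaps of 1-6 days, then roughly weekly readings. With per_day=48 the
    series is assumed to be half hourly.
    """
    rng = random.Random(seed)
    indices = []
    total_days = n_points // per_day
    day = 0
    while day < total_days:
        # Choose a half-hour slot between 07:00 (slot 14) and 22:00 (slot 44).
        slot = rng.randint(14, 44)
        indices.append(day * per_day + slot)
        if day < keen_days:
            gap = 1                              # phase 1: daily
        elif day < keen_days + fade_days:
            gap = rng.randint(1, 6)              # phase 2: trailing off
        else:
            gap = 7 + rng.choice([-1, 0, 1])     # phase 3: about weekly
        day += gap
    return indices
```

Thinning a half hourly list `series` would then be `[series[i] for i in three_phase_sample_indices(len(series))]`, and the discarded points remain available as ground truth.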

Difficulties to Overcome

The following main learning tasks will have to be undertaken before the
project can be started:

  • To find a suitable method for comparing different
    sampling patterns to enable hypotheses of human behaviour to be
    evaluated.
  • Research into existing models for related human
    behaviour.

Starting Point

I have a good working knowledge of Java and of queries in SQL.
I have read “Machine Learning” by Tom Mitchell.
Andrew Rice has written some Java code which does some basic linear
interpolation. It was written for use in producing a particular paper but
should form a good starting point, at least providing ideas on how to go
forwards. It can also be used for requirement sampling.

ReadYourMeter.org database

I have worked with the ReadYourMeter.org database before (summer 2009) and
with large data sets of sensor readings (spring 2008).
For the purpose of this project the relevant data can be viewed as a table
with three columns: “meter_id, timestamp, value”.
There are 99 meters with over 30 readings, 39 with over 50, 12 with over 100
and 5 with over 200. This data is to be used for the purpose of constructing
and evaluating models of how humans record data.

Evaluation data sets

There are several data sets to be used for the purpose of training and
evaluating the machine learning interpolation techniques. These are to be
sampled using the model constructed in the first part of the project for how
humans record data. This then allows the data interpolated from this sampled
data to be compared with the actual data which was sampled from.
The data sets are:

  • half hourly electricity readings for the WGB from
    2001-2010 (131416 records in “timestamp, usage rate”
    format).
  • monthly gas readings for the WGB from 2002-2010 (71
    records in “date, total usage” format)
  • half hourly weather data from the DTG weather
    station from 1995-2010 (263026 records)

Resources

This project should mainly be developed on my laptop which has sufficient
resources to deal with the anticipated workload.
The project will be kept in version control using Git. The SRCF, PWF and zeus
will be set to clone this and fetch regularly. Simple backups will be taken
at weekly intervals to SRCF/PWF and to an external disk.

Success criterion

  1. Models of human behaviour in recording data must
    be constructed which emulate real behaviour in the ReadYourMeter.org
    dataset.
  2. The machine learning methods must produce better
    approximations of the underlying data than linear interpolation and these
    different methods should be compared to determine their relative merits on
    different data sets.
  3. The machine once trained should be able to apply this
    to unseen data of a similar class and produce better results than linear
    interpolation.
  4. A library should be produced which is well
    designed and documented to allow users – particularly researchers – to be
    able to easily combine various functions on the input data.
  5. The dissertation should be written.
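The linear interpolation baseline named in criteria 2 and 3 could look something like the sketch below. It is an assumption-laden illustration rather than Andrew Rice's existing Java code: the function names are hypothetical, samples are assumed to be given as strictly increasing integer indices into the regular series, values outside the sampled range are simply held constant, and root mean squared error is just one plausible choice of comparison measure.

```python
import math


def linear_interpolate(sample_idx, sample_val, n_points):
    """Rebuild a regular series of n_points from sparse samples.

    sample_idx must be strictly increasing indices into the series.
    Points before the first or after the last sample are held constant.
    """
    if not sample_idx:
        raise ValueError("need at least one sample")
    out = []
    j = 0  # index of the last sample at or before position i
    for i in range(n_points):
        while j + 1 < len(sample_idx) and sample_idx[j + 1] <= i:
            j += 1
        if i <= sample_idx[0]:
            out.append(sample_val[0])
        elif j + 1 == len(sample_idx):
            out.append(sample_val[-1])
        else:
            x0, x1 = sample_idx[j], sample_idx[j + 1]
            y0, y1 = sample_val[j], sample_val[j + 1]
            out.append(y0 + (y1 - y0) * (i - x0) / (x1 - x0))
    return out


def rmse(truth, estimate):
    """Root mean squared error between two equal-length series."""
    total = sum((t - e) ** 2 for t, e in zip(truth, estimate))
    return math.sqrt(total / len(truth))
```

Any other interpolation method would then have to achieve a lower `rmse` against the original machine recorded series than this baseline does on the same samples.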

Work Plan

Planned starting date is 2010-10-15.

Dates in general indicate start dates or deadlines and this is clearly
indicated. Work items should usually be finished before the next one starts
except where indicated (extensions run concurrently with dissertation
writing).

Monday, October 18
Start: Investigating the distribution of
recordings in the ReadYourMeter.org data set
Monday, October 25
Start: Constructing hypotheses of how the human
recording of data can be modelled and evaluating these models against the
ReadYourMeter.org data set.
This involves examining the distributions and modes of recording found in
the previous section and constructing parametrised models which can
encapsulate this. For example a hypothesis might be that some humans record
data in three phases, first frequently (e.g. several times a day) and then
trailing off irregularly until some more regular but less frequent mode is
entered where data is recorded once a week/month. This would then be
parametrised by the length and frequency in each stage and within that
stage details such as the time of day would probably need to be
characterised by probability distributions which can be calculated from the
ReadYourMeter.org dataset.
Monday, November 8
Start: Using these models to construct test data
sets by sampling the regular machine recorded data sets.
Monday, November 15
Start: Using machine learning interpolation techniques to try and
interpolate back to the original data sets from the test data sets and
evaluating success of different methods in achieving this.

Monday, November 15
Start: Polynomial fit
Monday, November 22
Start: Locally weighted linear
regression
Monday, November 29
Start: Gaussian process regression
Monday, December 13
Start: Neural Networks
Monday, December 27
Start: Hidden Markov Models
Monday, January 3, 2011
Start: Introduction chapter
Monday, January 10, 2011
Start: Preparation chapter
Monday, January 17, 2011
Start: Progress report
Monday, January 24, 2011
Start: If time allows then using parameter
estimation on a known model of a system to interpolate from a test data set
back to the original data set. This continues on until 17th
March and can be expanded or shrunk depending on available time.
Friday, January 28, 2011
Deadline: Draft progress report
Wednesday, February 2, 2011
Deadline: Final progress report printed and handed in. By this point the
core of the project should be completed with only extension components and
polishing remaining.
Friday, February 4, 2011, 12:00
Deadline: Progress Report Deadline
Monday, February 7, 2011
Start: Implementation Chapter
Monday, February 21, 2011
Start: Evaluation Chapter
Monday, March 7, 2011
Start: Conclusions chapter
Thursday, March 17, 2011
Deadline: First Draft of Dissertation (by this point revision for the exams
will be in full swing limiting time available for the project and time is
required between drafts to allow people to read and comment on it)
Friday, April 1, 2011
Deadline: Second draft dissertation
Friday, April 22, 2011
Deadline: Third draft dissertation
Friday, May 6, 2011
Deadline: Final version of dissertation produced
Monday, May 16, 2011
Deadline: Print, bind and submit dissertation
Friday, May 20, 2011, 11:00
Deadline: Dissertation submission deadline

[1]: Such as the WGB’s energy usage; see §Starting Point for more details.
[2]: These are detailed in §Starting Point.
“How do you think higher education should be funded?”

Saturday, October 16th, 2010

I am currently considering this question as the Peterhouse JCR is in the process of running a referendum, and this is the first and most important question on it. The purpose of the referendum is to determine how Peterhouse should vote at the next CUSU Council meeting.
The possible options are:

  1. Raised tuition fees
  2. A graduate tax
  3. Offer fewer university places / close down less well performing Universities
  4. Higher universal taxation
  5. Cuts to other public services instead
  6. Other / Abstain

However there are more fundamental underlying questions which need to be considered:
What are the purposes of University?
Why are those good purposes?
How well does University achieve those purposes?
What value do we place on outcomes beyond the simple increase in potential earnings, such as producing better adjusted individuals with improved support networks who are better able to play their part in society?
Should ‘Universities’ which are ‘rubbish’ and don’t actually provide ‘proper’ degrees be called Universities? (No clearly not: they should be called polytechnics or similar and not offer degrees but rather more flexible qualifications which actually fit the useful things they are there to teach)
Should these polytechnics exist? Should they receive government funding in the way that Universities do?
Is University the best way of teaching people the skills they need for work in areas such as Engineering and Computer Science? Does that matter?

Clearly a graduate tax is a stupid idea because it would mean that anyone we educated who then left the country to work abroad would not pay for the cost of their education, and many people would do this, particularly among the highest earners. It also does not provide the money directly to the universities which educated them; the money would instead go to some general pot and so would not reward universities for how good they were at educating their students (from the point of view of earning potential).

Offering fewer university places / closing down less well performing Universities… well, to Cambridge students that seems like a rather appealing option (and it is the favourite to win the JCR vote). However it is important to ensure that we are not thinking that this is a good plan simply because it means that University funding becomes an issue affecting other people at other Universities rather than us, which is easy to do on a subconscious level and to then justify on a conscious one. One justification is that we know that our friends and fellow pupils at school did not always work as hard as we did in order to get where we have got, and so why should they be supported at our expense? Clearly we put more work in than they did. However the question of what the value of University is to both society and individuals, even if the University doesn’t manage to teach the individual anything, is one for which I don’t have an answer. Putting concrete values on externalities is not something which we are particularly good at as a society. I should probably study some more economics in order to get better at doing so.
The problem with this point then is that while it seems appealing on a superficial level I worry that in the grander scheme of things it might not be such a good idea. For example how would reducing the number of university places be managed? Remove the same proportion from all universities? Clearly that would be a stupid idea as it places no value on the relative quality of teaching at different universities. We don’t want those who should go to University missing out due to lack of places in good universities while those who probably shouldn’t get in to the lower quality ones. How about making the number of places available on a course dependent on how many people applied for it? So that, for example, if 200 people apply then a maximum of 100 places can be funded. However there might be problems with that if there are good courses which only appeal to, or accept candidates from, a small pool of potential applicants, so that most of those who apply should get a place as they are sufficiently brilliant.

Higher universal taxation? Well here we have to consider whether the benefit of university is more to society as a whole than to the individuals directly, as otherwise it is perhaps not fair to make everyone pay more. Here again I think we struggle to make good decisions on what proportion of university funding for teaching should come from the students and what proportion from general taxation, due to the lack of a function for determining the value of university and apportioning that to individuals and society as a whole.

Raised tuition fees? Clearly this is controversial for students as it affects us most directly and does cause real problems for students. It is thus perfectly understandable that many students and their representatives vehemently oppose tuition fees in general and their increase in particular. As per one of the CUSU motions “Education is a public good” which is true but to be able to weigh its value against that of other government expenditure we need some way of measuring relative worth of different public goods which I don’t think we have. At least not in a clear manner which allows decisions to be reached which don’t appear to be simply arbitrary. Instead long discussions are had and long articles written which skirt around the edges of issues and are dissatisfying in not being able to deal with these issues directly.[0]
However here it is perhaps useful to consider that compared with private secondary education University is still cheap even with increased tuition fees to £7,000. A private day secondary school could easily be charging in excess of £9,000 a year and, at least in comparison to Cambridge, not be providing nearly as high a quality of education. A private boarding school could easily be charging £26,000 a year per student. The cost of my going to University per year is ~£10,000 including tuition fees, rent etc. This is significantly less than what my parents were paying for my sixth form education even with the 20% scholarship. My parents could still pay for the full costs of my university education if it was ~£14,000 instead and then I would walk out with a degree and no debt… This only applies to a small minority of students though, and somewhere around University children need to become adults and stop relying on parents for all supplies of funding. I suppose the point I am trying to make here is that there are students who have parents who could easily pay the higher fees (or even higher still fees) and not really be affected by doing so; however it is unfortunately probably not feasible to identify who these students are. Higher levels of debt are likely to put off students, particularly those from disadvantaged backgrounds, from applying, which is a serious concern as it is very important to find those people from disadvantaged backgrounds who have the ability to perform and give them a helping hand to make sure that they can perform to the best of that ability.

Of the CUSU motions, a and c seem reasonable, b is poorly worded and says things which are blatantly wrong, and d makes some good points but also some silly ones, and some of its action points seem unrelated to solving the issues identified. Motion e, which the JCR as a whole is not voting on, also appears to be reasonable.

Peterhouse JCR people: Vote. Everyone else: vote early, vote often.

Apologies for the unsystematic and poorly written brain dump, really I should go back through this and rewrite it…

[0]: Here I am thinking back to discussions I had last night relating to the difficulty of expressing and discussing truly important things compared to the ease and simplicity of discussing trivialities.

Proving Java Terminates (or runs forever) – Avoiding deadlock automatically with compile time checking

Tuesday, October 12th, 2010

This is one of the two projects I was considering doing for my Part II project and the one which I decided not to do.
This project would involve using annotations in Java, for example @DeadlockFree or @ThreadSafe, to check that the thread safety properties which the programmer believes the code has actually hold.
Now you might think that this would be both incredibly useful and far too difficult/impossible; however, implementations of this exist at least in part: the MIT licensed CheckThread project, and another implementation which forms part of the much larger JSR-305 project to use annotations in Java to detect software defects. JSR-305 was going to be part of Java 7 but development appears to be slow: there was over a year between the second most recent commit and the most recent commit (4 days ago).

Essentially, if you want to use annotations in Java to find your thread safety/deadlock bugs and prevent them from reoccurring, the technology is out there (but needs some polishing). In the end I decided that since I could probably hack something together and get it packaged in Debian in a couple of days/weeks by fixing problems in the existing implementations, it would not be particularly useful to spend several months reimplementing it as a Part II project.

If a library which allows you to use annotations in Java to prevent thread safety/deadlock problems is not in Debian (including the NEW queue) by the end of next summer, throw things at me until it is.