Daniel Thomas' Blog

Models of human sampling and interpolating regular data

October 23rd, 2010

On Thursday I submitted my project proposal for my Part II project. An HTML version of it (generated using hevea and tidy from LaTeX with all styling stripped out) follows. (With regard to the work schedule – I appear to be one week behind already. Oops.)

Part II Computer Science Project Proposal

Models of human sampling and interpolating regular data

D. Thomas, Peterhouse

Originator: Dr A. Rice

Special Resources Required

The use of my own laptop (for development)
The use of the PWF (backup, backup development)
The use of the SRCF (backup, backup development)
The use of zeus (backup)

Project Supervisor: Dr A. Rice

Director of Studies: Dr A. Norman

Project Overseers: Alan Blackwell + Cecilia Mascolo
(AFB/CM)

Introduction

When humans record information they do not usually do so in the same
regular manner that a machine does, as the rate at which they sample depends
on factors such as how interested in the data they are and whether they have
developed a habit of collecting the data on a particular schedule. They are
also likely to have other commitments which prevent them from recording at
precise half hour intervals for years. In order to test methods of
interpolating from human recorded data to a more regular data stream, such as
that which would be created by a machine, we need models of how humans collect
data. ReadYourMeter.org contains data collected by humans which can be used
to evaluate these models. Using these models we can then create test data
sets from high resolution machine recorded data sets¹, try to interpolate back
to the original data set, and evaluate how good different machine learning
techniques are at doing this. This could then be extended with pluggable
models for different data sets, which could then use the human recorded data
set to do parameter estimation. Interpolating to a higher resolution regular
data set allows for comparison between different data sets, for example those
collected by different people or relating to different readings such as gas
and electricity.

Work that has to be done

The project breaks down into the following main sections:

- Investigating the distribution of recordings in
  the ReadYourMeter.org data set.
- Constructing hypotheses of how the human recording
  of data can be modelled and evaluating these models against the
  ReadYourMeter.org data set.
- Using these models to construct test data sets by
  sampling the regular machine recorded data sets² to produce pseudo-human
  read test data sets. These can be learnt from because the results can be
  compared with the reality of the machine read data sets.
- Using machine learning interpolation techniques to
  try to interpolate back to the original data sets from the test data sets
  and evaluating the success of different methods in achieving this:
  - Polynomial fit
  - Locally weighted linear regression
  - Gaussian process regression (see Chapter 2 of
    Gaussian Processes for Machine Learning by Rasmussen &
    Williams)
  - Neural Networks (possibly using java-fann)
  - Hidden Markov Models (possibly using jahmm)
- If time allows, using parameter estimation on
  a known model of a system to interpolate from a test data set back to the
  original data set and evaluating how well this compares with the machine
  learning techniques which have no prior knowledge of the system.
- Writing the Dissertation.
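
The sample-then-interpolate loop described above can be sketched as follows. This is only an illustration in Python (the project itself is planned in Java): the "human" model here is a naive coin flip per reading, the machine data is a made-up sine wave, and the names `sample_irregularly`, `linear_interpolate` and `rmse` are hypothetical. Linear interpolation stands in for the baseline that the machine learning methods are meant to beat.

```python
import bisect
import math
import random


def sample_irregularly(series, keep_probability, rng):
    """Keep each (time, value) point with the given probability.

    A deliberately naive stand-in for a model of human recording; the first
    and last points are always kept so every time can be interpolated.
    """
    last = len(series) - 1
    return [point for i, point in enumerate(series)
            if i in (0, last) or rng.random() < keep_probability]


def linear_interpolate(samples, t):
    """Linearly interpolate the value at time t from sorted (time, value) samples."""
    times = [time for time, _ in samples]
    i = bisect.bisect_left(times, t)
    if i < len(times) and times[i] == t:
        return samples[i][1]
    if i == 0 or i == len(times):
        raise ValueError("t lies outside the sampled range")
    (t0, v0), (t1, v1) = samples[i - 1], samples[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def rmse(series, samples):
    """Root mean squared error of the interpolation against the original series."""
    errors = [(linear_interpolate(samples, t) - v) ** 2 for t, v in series]
    return math.sqrt(sum(errors) / len(errors))


# Hypothetical "machine" data: one reading every half hour for a week.
machine = [(0.5 * i, 10 + 3 * math.sin(0.5 * i / 24 * 2 * math.pi))
           for i in range(48 * 7)]
human = sample_irregularly(machine, keep_probability=0.05, rng=random.Random(0))
print(len(human), "of", len(machine), "readings kept; RMSE =", rmse(machine, human))
```

Swapping `sample_irregularly` for a fitted model of human behaviour, and `linear_interpolate` for one of the techniques listed above, would give the comparison the project is after.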

Difficulties to Overcome

The following main learning tasks will have to be undertaken before the
project can be started:

- To find a suitable method for comparing different
  sampling patterns to enable hypotheses of human behaviour to be
  evaluated.
- Research into existing models for related human
  behaviour.

Starting Point

I have a good working knowledge of Java and of queries in SQL.
I have read “Machine Learning” by Tom Mitchell.
Andrew Rice has written some Java code which does some basic linear
interpolation. It was written for use in producing a particular paper but
should form a good starting point, at least providing ideas on how to go
forwards. It can also be used for requirement sampling.

ReadYourMeter.org database

I have worked with the ReadYourMeter.org database before (summer 2009) and
with large data sets of sensor readings (spring 2008).
For the purpose of this project the relevant data can be viewed as a table
with three columns: “meter_id, timestamp, value”.
There are 99 meters with over 30 readings, 39 with over 50, 12 with over 100
and 5 with over 200. This data is to be used for the purpose of constructing
and evaluating models of how humans record data.
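
As a rough idea of what investigating the distribution of recordings could look like on a table of that shape, here is a small Python sketch that computes the gaps between consecutive readings for each meter. The rows are made up for illustration and `intervals_by_meter` is a hypothetical helper, not part of any existing code.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def intervals_by_meter(rows):
    """Map each meter_id to the gaps (in hours) between its consecutive readings."""
    times = defaultdict(list)
    for meter_id, timestamp, _value in rows:
        times[meter_id].append(timestamp)
    gaps = {}
    for meter_id, stamps in times.items():
        stamps.sort()
        gaps[meter_id] = [(later - earlier).total_seconds() / 3600
                          for earlier, later in zip(stamps, stamps[1:])]
    return gaps


# Made-up rows in the "meter_id, timestamp, value" shape described above.
rows = [
    (1, datetime(2010, 1, 1, 8, 0), 100.0),
    (1, datetime(2010, 1, 1, 20, 0), 104.5),
    (1, datetime(2010, 1, 8, 8, 0), 160.0),
    (2, datetime(2010, 1, 3, 9, 30), 20.0),
    (2, datetime(2010, 1, 4, 9, 30), 21.0),
]
for meter_id, gaps in intervals_by_meter(rows).items():
    print(meter_id, gaps, "median:", median(gaps))
```

Histograms of these gaps, and of the time of day of each reading, would be a natural first thing to look at before proposing any model.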

Evaluation data sets

There are several data sets to be used for the purpose of training and
evaluating the machine learning interpolation techniques. These are to be
sampled using the model constructed in the first part of the project for how
humans record data. This then allows the data interpolated from this sampled
data to be compared with the actual data which was sampled from.
The data sets are:

- half hourly electricity readings for the WGB from
  2001-2010 (131416 records in “timestamp, usage rate”
  format)
- monthly gas readings for the WGB from 2002-2010 (71
  records in “date, total usage” format)
- half hourly weather data from the DTG weather
  station from 1995-2010 (263026 records)

Resources

This project should mainly be developed on my laptop which has sufficient
resources to deal with the anticipated workload.
The project will be kept in version control using GIT. The SRCF, PWF and zeus
will be set to clone this and fetch regularly. Simple backups will be taken
at weekly intervals to SRCF/PWF and to an external disk.

Success criterion

- Models of human behaviour in recording data must
  be constructed which emulate real behaviour in the ReadYourMeter.org
  dataset.
- The machine learning methods must produce better
  approximations of the underlying data than linear interpolation and these
  different methods should be compared to determine their relative merits on
  different data sets.
- The machine once trained should be able to apply this
  to unseen data of a similar class and produce better results than linear
  interpolation.
- A library should be produced which is well
  designed and documented to allow users – particularly researchers – to be
  able to easily combine various functions on the input data.
- The dissertation should be written.

Work Plan

Planned starting date is 2010-10-15.

Dates in general indicate start dates or deadlines and this is clearly
indicated. Work items should usually be finished before the next one starts
except where indicated (extensions run concurrently with dissertation
writing).

Monday, October 18

Start: Investigating the distribution of
recordings in the ReadYourMeter.org data set

Monday, October 25

Start: Constructing hypotheses of how the human
recording of data can be modelled and evaluating these models against the
ReadYourMeter.org data set.
This involves examining the distributions and modes of recording found in
the previous section and constructing parametrised models which can
encapsulate this. For example a hypothesis might be that some humans record
data in three phases, first frequently (e.g. several times a day) and then
trailing off irregularly until some more regular but less frequent mode is
entered where data is recorded once a week/month. This would then be
parametrised by the length and frequency in each stage and within that
stage details such as the time of day would probably need to be
characterised by probability distributions which can be calculated from the
ReadYourMeter.org dataset.
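
The three-phase hypothesis could be parametrised along these lines. This Python sketch is only illustrative: the `Phase` class, the parameter values and the choice of exponentially distributed gaps are all assumptions made for the example rather than anything fitted to the ReadYourMeter.org data, and time of day is not modelled at all.

```python
import random
from dataclasses import dataclass


@dataclass
class Phase:
    duration_days: float   # how long the phase lasts
    mean_gap_days: float   # average time between readings within the phase


def simulate_reading_times(phases, rng):
    """Return reading times (in days from the start) for a sequence of phases.

    Gaps within a phase are drawn from an exponential distribution, which is
    purely an illustrative assumption, not a fitted model.
    """
    times = []
    phase_start = 0.0
    for phase in phases:
        phase_end = phase_start + phase.duration_days
        t = phase_start + rng.expovariate(1.0 / phase.mean_gap_days)
        while t < phase_end:
            times.append(t)
            t += rng.expovariate(1.0 / phase.mean_gap_days)
        phase_start = phase_end
    return times


# Hypothetical parameters: keen at first, then trailing off, then roughly weekly.
phases = [Phase(14, 0.3), Phase(30, 3.0), Phase(180, 7.0)]
times = simulate_reading_times(phases, random.Random(1))
print(len(times), "simulated readings over",
      sum(p.duration_days for p in phases), "days")
```

The simulated times could then be compared with real meters using whatever method for comparing sampling patterns the project settles on.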

Monday, November 8

Start: Using these models to construct test data
sets by sampling the regular machine recorded data sets.

Monday, November 15

Start: Using machine learning interpolation techniques to try and
interpolate back to the original data sets from the test data sets and
evaluating success of different methods in achieving this.

Monday, November 15: Start: Polynomial fit
Monday, November 22: Start: Locally weighted linear
regression
Monday, November 29: Start: Gaussian process regression
Monday, December 13: Start: Neural Networks
Monday, December 27: Start: Hidden Markov Models
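
To give a flavour of one of the techniques in the list, here is a minimal Python sketch of locally weighted linear regression in one dimension: for each query point a straight line is fitted by weighted least squares, with Gaussian weights centred on the query. The function name `lwlr_predict` and the `bandwidth` parameter are hypothetical, and the data is made up.

```python
import math


def lwlr_predict(xs, ys, query, bandwidth):
    """Predict y at `query` by fitting a Gaussian-weighted straight line."""
    weights = [math.exp(-((x - query) ** 2) / (2 * bandwidth ** 2)) for x in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("query is too far from every sample for this bandwidth")
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    var_x = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    if var_x == 0.0:
        return mean_y  # all samples share one x value: fall back to the weighted mean
    cov_xy = sum(w * (x - mean_x) * (y - mean_y)
                 for w, x, y in zip(weights, xs, ys))
    slope = cov_xy / var_x
    return mean_y + slope * (query - mean_x)


# Irregularly spaced samples of the line y = 2x + 1.
xs = [0.0, 1.0, 2.5, 4.0, 7.0, 7.5, 10.0]
ys = [2 * x + 1 for x in xs]
print(lwlr_predict(xs, ys, 5.0, bandwidth=2.0))
```

On exactly linear data like this the prediction at 5.0 should come out very close to 11; the interesting behaviour, and the choice of bandwidth, only matters once the underlying data curves.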

Monday, January 3, 2011

Start: Introduction chapter

Monday, January 10, 2011

Start: Preparation chapter

Monday, January 17, 2011

Start: Progress report

Monday, January 24, 2011

Start: If time allows then using parameter
estimation on a known model of a system to interpolate from a test data set
back to the original data set. This continues on until 17th
March and can be expanded or shrunk depending on available time.

Friday, January 28, 2011

Deadline: Draft progress
report

Wednesday, February 2, 2011

Deadline: Final progress report
printed and handed in. By this point the core of the project should be
completed with only extension components and polishing remaining.

Friday, February 4, 2011, 12:00

Deadline: Progress Report
Deadline

Monday, February 7, 2011

Start: Implementation Chapter

Monday, February 21, 2011

Start: Evaluation Chapter

Monday, March 7, 2011

Start: Conclusions chapter

Thursday, March 17, 2011

Deadline: First Draft of
Dissertation (by this point revision for the exams will be in full swing
limiting time available for the project and time is required between drafts
to allow people to read and comment on it)

Friday, April 1, 2011

Deadline: Second draft
dissertation

Friday, April 22, 2011

Deadline: Third draft
dissertation

Friday, May 6, 2011

Deadline: Final version of
dissertation produced

Monday, May 16, 2011

Deadline: Print, bind and
submit dissertation

Friday, May 20, 2011, 11:00

Deadline: Dissertation
submission deadline

1: Such as the WGB’s Energy usage, see §Starting
Point for more details.
2: These are detailed in §Starting Point

Tags: CompSci, dissertation, dplumb, project, project proposal
Posted in CompSci, Part II Project | No Comments »

“How do you think higher education should be funded?”

October 16th, 2010

I am currently considering this question as the Peterhouse JCR is in the process of running a referendum and this is the first and most important question on that referendum the purpose of which is to determine how Peterhouse should vote at the next CUSU Council meeting.
The possible options are:

Raised tuition fees
A graduate tax
Offer fewer university places / close down less well performing Universities
Higher universal taxation
Cuts to other public services instead
Other / Abstain

However there are more fundamental underlying questions which need to be considered:
What are the purposes of University?
Why are those good purposes?
How well does University achieve those purposes?
What value do we place on outcomes beyond the simple increase in potential earnings such as on producing better adjusted individuals with improved support networks who are better able to play their part in society?
Should ‘Universities’ which are ‘rubbish’ and don’t actually provide ‘proper’ degrees be called Universities? (No clearly not: they should be called polytechnics or similar and not offer degrees but rather more flexible qualifications which actually fit the useful things they are there to teach)
Should these polytechnics exist? Should they receive government funding in the way that Universities do?
Is University the best way of teaching people the skills they need for work in areas such as Engineering and Computer Science? Does that matter?

Clearly a graduate tax is a stupid idea because it would mean that anyone we educated and who then left the country to work abroad would not pay for the cost of their education – and that many people would do this, particularly among the highest earners. It also does not provide the money directly to the universities which educated them and would instead go to some general pot and so not reward universities for how good they were at educating their students (from the point of view of earning potential).

Offering fewer university places / closing down less well performing Universities… well to Cambridge students that seems like a rather appealing option (and it is the favourite to win the JCR vote). However it is important to ensure that we are not thinking that this is a good plan simply because it means that University funding becomes an issue affecting other people at other Universities rather than us, which is easy to do on a subconscious level and to then justify on a conscious one. One justification is that we know that our friends and fellow pupils at school did not always work as hard as we did in order to get where we have got and so why should they be supported at our expense? Clearly we put more work in than they did. However the question of what the value of University is to both society and individuals even if the University doesn’t manage to teach the individual anything is one for which I don’t have an answer. Putting concrete values on externalities is not something which we are particularly good at as a society. I should probably study some more economics in order to get better at doing so.
The problem with this point then is that while it seems appealing on a superficial level I worry that in the grander scheme of things it might not be such a good idea. For example how would reducing the number of university places be managed? Remove the same proportion from all universities? Clearly that would be a stupid idea as it places no value on the relative quality of teaching at different universities. We don’t want those who should go to University missing out due to lack of places in good universities while those who probably shouldn’t get in to the lower quality ones. How about making the number of places available on a course be dependent on how many people applied for it? So that for example if 200 people apply then a maximum of 100 places can be funded. However there might be problems with that if there are good courses which only appeal/accept candidates from a small pool of potential applicants and so most of those who apply should get a place as they are sufficiently brilliant.

Higher universal taxation? Well here we have to consider whether the benefit of university is more to society as a whole than to the individuals directly, as otherwise it is perhaps not fair to make everyone pay more. Here again I think we struggle to be able to make good decisions on what proportion of university funding for teaching should come from the students and what proportion from general taxation due to the lack of a function for determining the value of university and apportioning that to individuals and society as a whole.

Raised tuition fees? Clearly this is controversial for students as it affects us most directly and does cause real problems for students. It is thus perfectly understandable that many students and their representatives vehemently oppose tuition fees in general and their increase in particular. As per one of the CUSU motions “Education is a public good” which is true but to be able to weigh its value against that of other government expenditure we need some way of measuring relative worth of different public goods which I don’t think we have. At least not in a clear manner which allows decisions to be reached which don’t appear to be simply arbitrary. Instead long discussions are had and long articles written which skirt around the edges of issues and are dissatisfying in not being able to deal with these issues directly.[0]
However here it is perhaps useful to consider that compared with private secondary education University is still cheap even with increased tuition fees to £7,000. A private day secondary school could easily be charging in excess of £9,000 a year and at least in comparison to Cambridge not be providing nearly as high a quality of education. A private boarding school could easily be charging £26,000 a year per student. The cost of my going to University per year is ~£10,000 including tuition fees, rent etc.; this is significantly less than what my parents were paying for my sixth form education even with the 20% scholarship. My parents could still pay for the full costs of my university education if it was ~£14,000 instead and then I walk out with a degree and no debt… This only applies to a small minority of students though and somewhere around University children need to become adults and stop relying on parents for all supplies of funding. I suppose the point I am trying to make here is that there are students who have parents who could easily pay the higher fees (or even higher still fees) and not really be affected by doing so, however it is unfortunately probably not feasible to identify who these students are. Higher levels of debt are likely to put off students, particularly those from disadvantaged backgrounds, from applying which is a serious concern as it is very important to find those people from disadvantaged backgrounds who have the ability to perform and give them a helping hand to make sure that they can perform to the best of that ability.

Of the CUSU motions a and c seem reasonable, b is poorly worded and says things which are blatantly wrong and d makes some good points but also some silly ones and some of its action points seem unrelated to solving the issues identified. E which the JCR as a whole is not voting on also appears to be reasonable.

Peterhouse JCR people: Vote. Everyone else: vote early, vote often.

Apologies for the unsystematic and poorly written brain dump, really I should go back through this and rewrite it…

[0]: Here I am thinking back to discussions I had last night relating to the difficulty of expressing and discussing truly important things compared to the ease and simplicity of discussing trivialities.

Tags: elections, funding, policy, politics, referendum, student, tax
Posted in politics, University | 4 Comments »

Proving Java Terminates (or runs forever) – Avoiding deadlock automatically with compile time checking

October 12th, 2010

This is one of the two projects I was considering doing for my Part II project and the one which I decided not to do.
This project would involve using annotations in Java to check that the thread safety properties which the programmer believes that the code has actually hold.
For example @DeadlockFree or @ThreadSafe.
Now you might think that this would be both incredibly useful and far too difficult/impossible however implementations of this exist at least in part: The MIT licensed CheckThread project and another implementation forms part of the much larger JSR-305 project to use annotations in java to detect software defects. JSR-305 was going to be part of Java 7 but development appears to be slow (over a year between the second most recent commit and the most recent commit (4 days ago)).

Essentially if you want to use annotations in Java to find your thread safety/deadlock bugs and prevent them from reoccurring the technology is out there (but needs some polishing). In the end I decided that since I could probably hack something working together and get it packaged in Debian in a couple of days/weeks by fixing problems in the existing implementations then it would not be particularly useful to spend several months reimplementing it as a Part II project.

If a library which allows you to use annotations in Java to prevent thread safety/deadlock problems is not in Debian (including new queue) by the end of next summer throw things at me until it is.

Tags: annotation, code, deadlock, debian, java, library, thread safety
Posted in CompSci, Part II Project | 1 Comment »

Updating file copyright information using git to find the files

September 1st, 2010

Update: I missed off --author="Your Name".
All source files should technically have a copyright section at the top. However updating this every time the file is changed is tiresome and tends to be missed out. So after a long period of development you find yourself asking the question “Which files have I modified sufficiently that I need to add myself to the list of copyright holders on the file?”.
Of course you are using a version control system so the information you need about which files you modified and how much is available; it just needs to be extracted. The shell is powerful and a solution is just one (long) line (using long options and splitting over multiple lines for readability).

$ find -type f -exec bash -c 'git log --ignore-all-space --unified=0 --oneline  \
    --author="Your Name" -- "$1" | grep "^[+-]" | \
    grep --invert-match "^\(+++\|---\)" | wc --lines | \
    xargs test 20 -le ' _ {} \; -print > update_copyright.txt \
    && wc --lines update_copyright.txt

In words: find all files and run a command on them and if that command returns true (0) then print out that file name to the ‘update_copyright.txt’ file and then count how many lines are in the file. Where the command is: use git log to find all changes which changed things other than space and minimise things other than the changes themselves (--oneline to reduce commit message etc and --unified=0 to remove context) then strip out all lines which don’t start with + or - and then strip out all the lines which start with +++ or --- then count how many lines we get from that and test whether this is at least 20. If so return true (zero) else return false (non zero).

This should result in an output like:

 284 update_copyright.txt

I picked 20 for the ‘number of lines changed’ value because 10 lines of changes is generally the size at which copyright information should be updated (I think GNU states this) and we are including both additions and removals so we want to double that.
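
The filtering and counting step can also be written out in a few lines of Python, which may be easier to adapt than the pipeline. This is a sketch under assumptions: `count_changed_lines` and `needs_copyright_update` are hypothetical names, and the sample text is a hand-written imitation of `git log --unified=0 --oneline` output rather than real output from a repository.

```python
def count_changed_lines(log_output):
    """Count added and removed lines in `git log --unified=0` style output."""
    count = 0
    for line in log_output.splitlines():
        if line.startswith(("+++", "---")):
            continue  # file headers, not changes
        if line.startswith(("+", "-")):
            count += 1
    return count


def needs_copyright_update(log_output, threshold=20):
    """True when at least `threshold` lines were added or removed."""
    return count_changed_lines(log_output) >= threshold


# Hand-written imitation of the log output for one small commit.
sample = """\
abc1234 Fix parser
diff --git a/Foo.java b/Foo.java
--- a/Foo.java
+++ b/Foo.java
@@ -3 +3 @@
-int x = 1;
+int x = 2;
@@ -10,0 +11 @@
+int y = 3;
"""
print(count_changed_lines(sample), needs_copyright_update(sample))
```

The text for each file would come from running the same git log command as above, for example through the standard `subprocess` module.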

Now I could go from there to a script which then automatically updated the copyright rather than going through manually and updating it myself… however the output I have contains lots of files which I should not update. Then there are files in different languages which use different types of comments etc. so such a script would be much more difficult to write.

Apologies for the poor quality of English in this post and to those of you who have no idea what I am on about.

Tags: code, copyright, find, git, GNU, gnuprologjava, GSoC, script, source files, VCS, version control
Posted in CompSci | 3 Comments »

Packaging a java library for Debian as an upstream

August 17th, 2010

I have just finished my GSoC project working on GNU Prolog for Java which is a Java library for doing Prolog. I got as far as the beta release of version 0.2.5. One of the things I want to do for the final release is to integrate the making of .deb files to distribute the binary, source and documentation into the ant build system. This post exists to chronicle how I go about doing that.

First some background reading: There is a page on the debian wiki on Java Packaging. You will want to install javahelper as well as debian-policy and debhelper. You will then have the javahelper documentation in /usr/share/doc/javahelper/. The debian policy on java will probably also be useful.

Now most debian packages will be created by people working for debian using the releases made by upstreams but I want to be a good upstream and do it myself.

So I ran:

jh_makepkg --package=gnuprologjava --maintainer="Daniel Thomas" \
    --email="drt24-debian@srcf.ucam.org" --upstream="0.2.5" --library --ant --default

I then reported a bug in jh_makepkg where it fails to support the --default and --email options properly because I got “Invalid option: default”, which might be fixed by the time you read this.
So I ran:

jh_makepkg --package=gnuprologjava --maintainer="Daniel Thomas" \
    --email="drt24-debian@srcf.ucam.org" --upstream="0.2.5" --library --ant

And selected ‘default’ when prompted by pressing F.
I got:

dch warning: Recognised distributions are:
{dapper,hardy,intrepid,jaunty,karmic,lucid,maverick}{,-updates,-security,-proposed,-backports} and UNRELEASED.
Using your request anyway.
dch: Did you see that warning?  Press RETURN to continue...

Which I think stems from the fact that javahelper is assuming debian but javahelper uses debhelper or similar which Ubuntu has configured for Ubuntu. However this doesn’t matter to me as I am trying to make a .deb for Debian (as it will then end up in Ubuntu and many other places – push everything as far upstream as possible).
Now because of the previously mentioned bug the --email option is not correctly handled so we need to fix changelog, control and copyright in the created debian folder to use the correct email address rather than user@host in the case of changelog and NAME in the case of control and copyright (two places in copyright).

copyright needs updating to have Upstream Author, Copyright list and licence statement and homepage.
For licence statement I used:

    This library is free software; you can redistribute it and/or
    modify it under the terms of the GNU Library General Public
    License as published by the Free Software Foundation; either
    version 3 of the License, or (at your option) any later version.

    This library is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
    Library General Public License for more details.

    You should have received a copy of the GNU Library General Public
    License along with this library; if not, write to the Free Software
    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston MA  02110-1301 USA

On Debian systems the full text of the GNU Library Public License, can be
found in the file /usr/share/common-licenses/LGPL-3.

control needs updating to add homepage, Short Description and Long Description (which needs to be indented with one space). I also added texinfo and libgetopt-java to Build-Depends and Suggests: libgetopt-java.
GNU Prolog for Java has a build dependency on gnu.getopt as gnu.prolog.test.GoalRunner uses it so I added
export CLASSPATH=/usr/share/java/gnu-getopt.jar to debian/rules. libgnuprologjava-java.javadoc needed build/api put in it so that it could find the javadoc.

So at this point I have a debian folder with the right files in it but I don’t have an ant task which will do everything neatly. Since debian package building really wants to happen on an unpacked release tarball of the source I modified my existing dist-src target to copy the debian folder across as well and added the following to my init target:

<!-- Key id to use for signing the .deb files -->
<property name="dist.keyid" value="" />
<property name="dist.name-version" value="${ant.project.name}-${version}"/>
<property name="dist.debdir" value="dist/${dist.name-version}/" />

I could then add the following dist-deb target to do all the work.

<target name="dist-deb" depends="dist-src" description="Produce a .deb to install with">
	<mkdir dir="${dist.debdir}"/>
	<unzip src="dist/${dist.name-version}-src.zip" dest="${dist.debdir}"/>
	<!-- Delete the getopt code as we want to use the libgetopt-java package for that -->
	<delete dir="${dist.debdir}src/gnu/getopt"/>
	<exec executable="dpkg-buildpackage" dir="${dist.debdir}">
		<arg value="-k${dist.keyid}"/>
	</exec>
</target>

So this gives me binary and source files for distribution and includes the javadoc documentation but lacks the manual.
To fix that I added libgnuprologjava-java.docs containing

build/manual/
docs/readme.txt

, libgnuprologjava-java.info containing build/gnuprologjava.info and libgnuprologjava-java.doc-base.manual containing:

Document: libgnuprologjava-java-manual
Title: Manual for libgnuprologjava-java
Author: Daniel Thomas
Abstract: This is the Manual for libgnuprologjava-java
Section: Programming

Format: HTML
Index: /usr/share/doc/libgnuprologjava-java/manual
Files: /usr/share/doc/libgnuprologjava-java/manual/*.html

Now all of this might result in the creation of the relevant files needed to get GNU Prolog for Java into Debian but then again it might not :-). The .deb file does install (and uninstall) correctly on my machine but I might have violated some debian-policy.

Tags: .deb, ant, apt, build systems, code, debian, distributions, dpkg, GNU Prolog for Java, gnuprologjava, GSoC, java, packaging, prolog
Posted in CompSci | 2 Comments »

Compiling git from git fails if core.autocrlf is set to true

August 16th, 2010

If you get the following error when compiling git from a git clone of git:

: command not foundline 2: 
: command not foundline 5: 
: command not foundline 8: 
./GIT-VERSION-GEN: line 14: syntax error near unexpected token `elif'
'/GIT-VERSION-GEN: line 14: `elif test -d .git -o -f .git &&
make: *** No rule to make target `GIT-VERSION-FILE', needed by `git-am'. Stop.
make: *** Waiting for unfinished jobs....

and you have core.autocrlf set to true in your git config then consider the following “it’s probably not a good idea to convert newlines in the git repo” courtesy of wereHamster on #git.

So having core.autocrlf set to true may result in bad things happening in odd ways because line endings are far more complicated than they have any reason to be (thanks to decisions made before I was born). Bugs due to white space errors are irritating and happen to me far too often :-).
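
The odd-looking “: command not foundline 2:” messages are what the shell prints when a line ends in a carriage return: the stray character is treated as part of the command and then moves the cursor back to the start of the line when the error is displayed. A quick way to check a script for this is sketched below in Python; `crlf_line_numbers` is a hypothetical helper and the script bytes are made up for illustration, not the real contents of GIT-VERSION-GEN.

```python
def crlf_line_numbers(data):
    """Return the 1-based numbers of lines in `data` (bytes) that end in CRLF."""
    return [number for number, line in enumerate(data.split(b"\n"), start=1)
            if line.endswith(b"\r")]


# Made-up script content with Windows line endings on the first three lines.
script = b"#!/bin/sh\r\n\r\nGVF=GIT-VERSION-FILE\r\necho ok\n"
print(crlf_line_numbers(script))
```

To check a real file, read it in binary mode (`open(path, "rb").read()`) so that Python does not translate the line endings itself.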

Today I used CVS – it was horrible. I tried to use git’s cvsimport and cvsexportcommit to deal with the fact that CVS is horrible; unfortunately this did not work ;-(.

This post exists so that the next time someone is silly like me they might get a helpful response from Google.

Tags: code, compiling, error, git, make, white space
Posted in CompSci | No Comments »

Major Multi-VCS surgery

June 25th, 2010

This summer I am working on the Google Summer of Code project “Revive GNU Prolog for Java”. It now has a project page and a Git repository which resulted in a rather entertaining screen shot.

Yesterday I found out that last year someone did a lot of the work I was intending to do this summer and they gave me a svn dump of the changes that they made. I had previously found two other people that had made some changes over the last 10 years while the project was dormant.

So I was faced with the task of taking the SVN dump of the changes made by Michiel Hendriks (elmuerte), splicing them onto the old CVS history of the code from which he took the source .zip, and then converting the result into Git (which is what we are now using as our VCS). I was kind of hoping that with luck the two development histories would then share a common root which could be used to help with merging the two development histories back together; that hope was in vain (though I have another idea I might try later).

Anyway this whole splicing thing was non-trivial so I thought I would document how I did it (partly so that I can find the instructions later).

So there were problems with the SVN dumpfile I was given: it couldn’t be applied to a bare repository as all the Node-Paths had an additional extension ‘trese/ample/trunk’ on the beginning. Now one fix I tried was to add

Node-path: trese
Node-kind: dir
Node-action: add

Node-path: trese/ample
Node-kind: dir
Node-action: add

Node-path: trese/ample/trunk
Node-kind: dir
Node-action: add

into the first commit using vim: now this worked but it meant that I still had the extra ‘trese/ample/trunk/’ which I didn’t want.
So I removed that using vim and “:%s/trese\/ample\/trunk\///g”. Unfortunately there were a couple of instances where trese/ample/trunk was referred to directly when files were being added to svn:ignore, and I didn’t find out how to refer to the top level directory in an svn dump file, so I just edited those bits out (there were only 2 commits which were affected). So now I had a working svn dumpfile. To do the splicing I used svndumptool.py to remove the first commit resulting in a dumpfile I called gnuprolog-mod.mod2-86.svn.dump (see below for instructions on how to do this).

I got a copy of the CVS repository, so that the current working directory contained CVSROOT and also a folder called gnuprolog which contained the VCed code.

# This makes a dumpfile 'svndump' of the code in the gnuprolog module; I only care about the trunk.
cvs2svn --trunk-only --dumpfile=svndump gnuprolog
# Then we use svndumptool to remove the first commit, as cvs2svn adds one at the beginning where it makes various directories.
svndumptool.py split svndump 2 8 cvs-1.svn.dump
# Then we use vim to edit the dump file and do :%s/trunk\///g to strip off the leading 'trunk/' from Node-paths
vim cvs-1.svn.dump
# Create a SVN repository to import into
svnadmin create gnuprolog-mod.plaster.svn
# Import the CVS history
svnadmin load gnuprolog-mod.plaster.svn < cvs-1.svn.dump
# Import the SVN history
svnadmin load gnuprolog-mod.plaster.svn < gnuprolog-mod.mod2-86.svn.dump
# Make a git repository from the SVN repository
git svn clone file:///home/daniel/dev/gnuprolog/gnuprolog-mod.plaster.svn gnuprolog-mod.plaster.git
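
The revision range given to svndumptool.py split depends on which revisions the dumpfile actually contains. As a rough sketch, they can be listed by pulling out the Revision-number headers of the dumpfile. The helper name dump_revisions is hypothetical, and a file body that happens to contain a matching line would show up as a false positive.

```shell
# Hypothetical helper: print the revision numbers found in an SVN dumpfile,
# one per line, by reading its "Revision-number: N" headers.
dump_revisions() {
    # -a makes grep treat the (possibly binary) dumpfile as text.
    LC_ALL=C grep -a '^Revision-number: ' "$1" | cut -d' ' -f2
}

# Example use on the dumpfile produced by cvs2svn above:
#   dump_revisions svndump | head -n 1   # first revision in the dump
#   dump_revisions svndump | tail -n 1   # last revision in the dump
```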

Things that I found which are useful

The SVN dump file format
svn-to-git got me the closest to being able to import the svn dumpfile into git.
These instructions on how to fix svn dumpfiles.

Sorry this post is rather rambly, stuck halfway between a howto and an anecdote, but hopefully someone will find it useful.

Tags: code, cvs, cvs2svn, git, gnuprologjava, GSoC, history, howto, import, subversion, svn, svndumpfilter, svndumptool
Posted in CompSci | No Comments »

Phone scammers

June 24th, 2010

Today I received a call at about 10:05 to my home landline. I rapidly realised it was some kind of computer based scam and decided to have some fun seeing what they would try and do.
I had great fun doing this but I think that someone who does not understand computers could easily have been taken in.

As in many such scams they claimed to be a company working for Microsoft and offering this free service of finding out what is wrong with my computer as they detected that it was downloading lots of junk files from the internet which were slowing it down. Now our old Windows XP desktop is indeed old and slow and this is quite possibly due to junk. However it was obvious that they were making all this up. So they wanted me to turn my computer on – now obviously I wasn’t going to risk following any instructions on the real computer so I booted my XP VM on my laptop instead (which I will subsequently need to wipe).

Having booted the XP VM, and possibly having been passed on to a different call centre person, I was given a series of instructions, the purpose of which was to prove that the computer had a problem. This involved going to the event viewer in computer administration (Start -> right click on “My Computer” -> Manage -> Event viewer) and then to both Application and System. With a little sorting for effect we get a screen something like the following:
I suppose many people might find that quite scary but I have previously looked at such screens and it was what I expected to see.

Having ‘proved’ that there was something wrong with my computer they then proceeded to try and get me to provide greater access to them. This was done by getting me to visit www.logmein123.com and use the code 807932 (which they really didn’t want me to reveal to anyone).

They then got remote access to my computer and went and installed a fake scanner from http://majorgeeks.com/Advanced_WindowsCare_v2_Personal_d4991.html which proceeded to produce some fake results:

They then wanted to see if my “software warranty” had expired as this would be why my computer was “downloading junk files which can’t be removed by anti-virus”.
This was done by opening cmd and doing

cd \
tree

and while tree was running typing “expired.” so that it would appear at the bottom.

At this point they went in for the kill and opened up a form and claimed that “it is a timed http form so we can’t look at it” and that it would “automatically go in 8 minutes so you need to fill it in quickly”.
Obviously I wasn’t going to fill this form in so at that point I revealed that I knew that they were scammers. They denied this and got progressively more angry and incoherent and when I asked to be put through to their supervisor they hung up.

Follow up

Now obviously it is my duty to try and prevent this kind of thing from happening again.
So my first step was to try to find out the number which was used to call me using 1471, but unfortunately this did not work. I then tried the local police, but they could not be of any help and advised me to contact BT. Unfortunately BT could not help either, as it was an international call with no number given.
I then reported the relevant URLs to Google and the exe to StopBadware.

I contacted the company behind logmein123.com, which seems to be a legitimate company, telling them that their services are being abused and requesting comment from them about this. I received a very positive response: “Thanks for the heads up on this. We take this stuff very seriously and will investigate immediately. Any misuse of the product or trials for the purpose you describe is a violation of our terms and immediate grounds for termination of the service. Thank you for sending the PIN as it helps us not only track this down to end their service, it also gives us information we need should we decide to press legal action. …”

Now looking to see whether anyone else has discovered this gogreenpc.net scam I found that they have. So gogreenpc.net is a big scam site. Now I need to work out how to take them out. :-D

The people on #cl on irc.srcf.ucam.org were helpful in providing advice on follow up.

Tags: cold calling, CompSci, con, fraud, gogreenpc.net, logmein123.com, phishing, phone, remote pc access, scam, tech support, virus
Posted in CompSci | 14 Comments »

Who should I vote for?

May 4th, 2010

Now I am a floating voter and my final decision will be made in the ballot box on Thursday (though I have a fair idea who it will be for). It is my duty and privilege to vote and to vote for both the best candidate(s) for my constituency and the best party for the country.
As someone who likes to think that they have half a brain I want to be making my decision based not on irrelevant details such as who my parents/friends support. I want to be voting based on the merits of the beliefs, skills and policies of the candidates and on the beliefs and policies of the parties they represent.
Now obviously it is necessary to do some tactical voting under our current first-past-the-post voting system (though hopefully we will have something better before the next election) and so that is another thing to take into account.

The Peterhouse Politics Society held a hustings which I attended and which allowed me to assess the MP candidates in person, which was quite useful. I found it ruled out Daniel Zeichner (Labour); I wasn’t that impressed with Nick Hillman (Conservatives) either, though he did a better job as a candidate than Zeichner: he was constrained by the policies of his party from doing well in my eyes :-).

I have watched the first election debate and the second election debate and I have downloaded the third debate which I will watch later.

This afternoon I have been experimenting with various websites which claim to be able to help you decide who to vote for. I have found the experience interesting (though it didn’t really tell me much I didn’t already know).

A comparison of the various websites I have tried in order of preference
Website	Pros	Cons
They Work For You’s Election website: Asks how much you agree or disagree with a series of questions and then shows you your candidates’ answers.	By far the best interface (and data) for determining what the candidates I can vote for think. Weights all the candidates based on how close they are to what I agree with. Very transparent on how it is working out which party I agree with. Produced by MySociety who have produced some pretty cool stuff.	Not quite as good at dealing with national policies – it is focusing on local politics. It could be extended to also cover councillors and party leaders to give it both national and even more local coverage.
Vote Match: Similar to the above in that it asks you whether you agree/disagree with a series of statements and then tells you how this compares with national party policies.	Possibly better at national politics as that is its focus.	Lacks transparency on how it is calculating the result. Doesn’t tell you what each party thinks about each statement as you go along.
Vote for Policies: Gives you a selection of 4 policy areas (to pick the ones you care most about) and then presents you with a set of policy statements from each party showing their policies in that area.	Uses actual party manifesto data to help people determine who to vote for.	I was initially confused by the interface and filled the first page in wrong and had to go back and correct it. It would be greatly improved by better granularity on policies within the same sub-area of policy. It only allows the selection of the one you like the best from the available options and doesn’t allow any credit to be given to the policies which would have come second (or any pain to be served out to parties who have a policy which means that I would never vote for them in a million years (e.g. “we don’t believe in global warming” (RAGE))).
Who should you vote for?: Another how much do you agree/disagree with the following statements quiz.	Has a few other political quizzes which among other things determined that I am an idealistic lefty :-) (but then I knew that).	Doesn’t say how much each party agrees with each statement as clearly as They Work For You (though this information is available in the onhover text).
Active History: Presents choices between policies from the three top parties.	Uses actual manifesto data.	Only chooses between the top 3 parties and so isn’t so useful in Cambridge where the Greens have a good chance. It also feels too simple (like Vote for Policies in that it only lets you choose the policy you like the most but doesn’t give any weight to policies you would have put second). People like me who know what policies parties have can guess which is which reasonably easily.

Vote For Policies’ constituency results are also quite interesting, as they indicate that the Lib Dems are wrong in their two horse race (between them and Labour) claims for Cambridge. Rather it is a three horse race: Labour, Greens, Lib Dem. Of course this data isn’t that reliable.

The Guardian’s poll of polls indicates that the Lib Dems have failed to make the breakthrough I might have hoped for.

But in answer to the question, it is for me a toss up between the Lib Dems and the Greens, who both have a reasonable chance of winning in Cambridge (though the Lib Dems are more likely to win). I think on average I agree with Lib Dem policies more frequently than Green policies (but I consider the environment to be very important). However, Tony Juniper is standing for the Greens in Cambridge and he is the most qualified candidate standing. If I vote Green and they lose then I am fairly sure that the Lib Dems will win instead, and that is another result I quite like. I suspect that this is a fairly rare situation for voters to find themselves in. Hopefully we will get STV before the next elections and then everyone will have a better chance of their vote counting.

Tags: analysis, choice, data, elections, environment, Green, Labour, Lib Dem, links, local, policy, politics
Posted in politics | No Comments »

Fix my street

March 26th, 2010

Today someone (AJ) pasted a link to FixMyStreet into #srcf. It actually looks rather good and could be an important way of fixing the current pot hole problem resulting from the snow last winter and in general improving the quality of our local environment.

So far I have reported some pot holes and an abandoned car. :-) Maybe they will now magically disappear.
I also noticed that other people had reported most of the problems I had been thinking I should report in Cambridge :-).

See, Web 2.0 can do something useful :-).

Tags: action, community action, crowd sourcing, fix, fix my street, making things better, tidy
Posted in environment, politics | 1 Comment »