Available Imagination

Thoughts and notes

Gyms Near Me


I’ve been very busy working on Superfastmatch, but I wanted to spend a day or two trying out a few ideas with the Fusion Tables API and static hosting on App Engine. One of the interesting things with App Engine is that you get 1GB a day of bandwidth for free. If you use static files, then no CPU cost is incurred, effectively giving you a high performance CDN for zero pence!

Fusion Tables are intriguing because they offer spatial query processing and very thorough Google Maps integration. What this means is that you can completely avoid setting up any database infrastructure such as PostGIS and any tile-generation service like TileStache.

I employed these two services to come up with a simple app called Gyms Near Me which does what it says on the tin! It uses Google Maps with a Fusion Tables Layer to show colour-coded pins for all the gyms in your vicinity. When you select a gym, a JSONP call finds the five nearest gyms by ordering by ST_DISTANCE from the selected latitude/longitude with a LIMIT of 5. The really cool thing about Fusion Tables Layers is that they are delivered as PNG tiles, which gives the potential to have a huge number of points on a map, avoiding the issues you might encounter if you were creating individual markers via JavaScript.

The app is now live, so I’ll soon see if it gets picked up for any Google search keywords. Looking at autocomplete it seems like Gyms Near Me is the top term so maybe it will get picked up! In the future I might also look at feeding in gyms for global locations and saving map positions like Gyms in London.

Unordered Map With Custom Structures


I encountered the problem of trying to use a custom data structure as the key of an unordered map and found different answers all over the Internet. The same question came up on the sparsehash mailing list, so I thought I’d put an example up here:
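This is just a minimal sketch of the idea, using a 3D integer vector as the custom key (the struct, names and values are illustrative):

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <unordered_map>

// The custom structure we want to use as a key.
struct Vec3 {
  int x, y, z;
};

// Hash functor: each component is multiplied by a large prime (values
// commonly used for 3D spatial hashing) and the results are XORed together.
struct Vec3Hash {
  std::size_t operator()(const Vec3& v) const {
    return (static_cast<std::size_t>(v.x) * 73856093u) ^
           (static_cast<std::size_t>(v.y) * 19349663u) ^
           (static_cast<std::size_t>(v.z) * 83492791u);
  }
};

// Equality functor: two keys are the same if all components match.
struct Vec3Equal {
  bool operator()(const Vec3& a, const Vec3& b) const {
    return a.x == b.x && a.y == b.y && a.z == b.z;
  }
};

int main() {
  // Both the hash and the equality function are passed as template
  // parameters; the same pair of functors works for sparsehash's
  // dense_hash_map as well.
  std::unordered_map<Vec3, std::string, Vec3Hash, Vec3Equal> cells;
  cells[Vec3{1, 2, 3}] = "first";
  cells[Vec3{4, 5, 6}] = "second";

  for (const auto& entry : cells) {
    const Vec3& k = entry.first;
    std::cout << "(" << k.x << "," << k.y << "," << k.z << ") -> "
              << entry.second << std::endl;
  }
  return 0;
}
```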

The trick is to define both a hash and an equality function for the custom data structure. The prime numbers in the 3D vector hash function come from this paper. The “correct” hash function depends a lot on the distribution of the keys and only real world benchmarking will give a satisfactory answer as to what works best for your data.

The output of the above code would be something like:
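```
(4,5,6) -> second
(1,2,3) -> first
```

(The two lines may appear in either order, since an unordered map makes no guarantee about iteration order.)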

Developer Diary


SuperFastMatch is a project which began life as a piece of research and development for the Media Standards Trust in 2010/2011, during the production of Churnalism.com. A means was needed of comparing a given text with over 3 million news articles in the Journalisted.com archive. The idea was to find the articles which had the longest sections of common text and then to rank them according to their respective percentages of reuse.

Initially, we looked at existing search engines like Xapian and Lucene and, although they are very effective at finding articles containing the same words, they weren’t particularly good at finding the articles with the longest common strings. More info on this can be found here.

Besides spotting churnalism, the software seemed to have other potential uses, and the Sunlight Foundation believes it could be applied to spotting duplicated federal law. To facilitate this application they have generously offered to fund the open sourcing of this software, and the results should hopefully soon be tested against the OpenStates project. I’d like to thank Martin Moore, Ben Campbell, Tom Lee and everyone else at Sunlight for the opportunity to work on this great project!

Week 1

This is my first attempt at a truly open-source project, so I’ve had a lot to learn about making software accessible to more than just the development team I am working with. SuperFastMatch originally started as the algorithm, developed over many sleepless nights, for comparing press releases with newspaper articles on Churnalism.com. This was a single use-case, but as the site became known to the public, other use-cases became apparent. One is to track legislation for lobbyist influence. Others might include tracking plagiarism in fiction or making sure academic texts have correct citations.

The point is that when more than one use-case comes up, a piece of code needs to develop into a framework which is easy to understand, is well documented and is easily integrated and installable in concert with other software. Seeing as the majority of the existing code is written in Python it made sense to look at how some other Python projects made use of existing infrastructure to meet these ends. I examined Celery, a distributed job engine and tracker, as I was familiar with it from previous work, and saw how it made use of tools to assist its open source development. Below are some ideas and tools that I borrowed from it:

  • GitHub has become the de facto platform for sharing and distributing code. It’s free for open source projects and is built on top of Git, which has many advantages over previous generations of source code control software. It also offers static file hosting, from where you are probably reading this documentation!

  • PyPI is a centralised index of Python packages that allows for the versioning and sharing of Python software without the need to remember lots of URLs, and is reliable enough to be referred to in bootstrap scripts.

  • Sphinx is a documentation generator that works well with Python packages as well as C++ (and hopefully Lua!). This page and the rest of the docs are written in reStructuredText and then passed through a batch process which converts them into HTML, after which they are uploaded to the static GitHub hosting.

  • Paver is a collection of tools that allows many development workflows, such as bootstrapping, documentation generation, packaging and test execution, to be scripted in Python. I’m used to using Fabric for deployment, and Paver seems to offer a similar experience for development. It will be interesting to see if I integrate some of the C++ build process into the pavement.py file.

  • virtualenv is an essential tool for isolating Python packages into an application-specific environment. This ensures you always know which versions of software are present and can accurately recreate the environment when you deploy to either a production or staging server. It comes with pip installed in the created environment for easy integration with PyPI.

  • Django is the platform on which Churnalism.com is built and is widely used so it seemed like the natural choice for building an example usage of SuperFastMatch.

Week 2

Now that all the open source infrastructure is in place it’s time to start learning a bit more about the use-case in hand, US Congress and state legislation.

Legislature

The US Congress has two chambers, both elected: the House of Representatives and the Senate. Either chamber can introduce bills, and a bill can take one of many paths through both chambers. The passage of a bill through Congress appears to be quite loosely defined, with a whole series of possible routes and referrals before an Act is signed by the President.

The end result of acts of Congress is the US Code, which is an up-to-date record of all enrolled bills. States in the US are also capable of passing their own local laws.

Data Access

The US has a well-maintained digital archive of all House of Representatives Bills from 2004 onwards, and a more recent archive of Senate Bills. The data is accessible in XML form, which allows for metadata extraction of Bill information and easy manipulation of the Bill text itself, ideal for the purposes of this project. Text-only bills are available from 1993 onwards.

Week 3

Framework design

The aim of the project is to design a reusable tool for bulk text comparison and analysis. To succeed in becoming reusable, at least two use-cases have to be considered and a clean, easy-to-understand interface for the user has to be designed. Django offers many ways of designing models for data storage, and this flexibility is useful, but there are a number of caveats that can obstruct progress further down the line.

Model inheritance allows for the subclassing of a Model, which itself permits further extension. This is an ideal pattern for different types of Document. Say that you have Press Releases and News Articles, or State Law and Congress Law, all with potentially similar content. When you search, you might not know what to expect as results and would like to search all Document types simultaneously. By defining a Document base model, this is possible because it can be assumed that the necessary content in its indexable form is present.

With Django, there is a choice of either abstract or multi-table inheritance (and also proxy…). Multi-table inheritance results in an extra table, while abstract inheritance just adds fields to the subclasses’ tables. The simplest choice is abstract inheritance, and I went for this. However, the disadvantage is that you cannot define a ForeignKey to the abstract base class itself, yet the related clean indexed content needs to be stored somewhere as a cache so that updates can be detected. To solve this issue I turned to content types, which make it possible to relate any model instance to any other model instance, regardless of type. This means that the cleaned Content can be related to any of Document’s subclasses. At this point it’s probably worth looking at the model definition.

Faceting

One of the limitations of the original Churnalism code is the inability to search for news articles, only press releases. However, both document types have very similar features, such as a date published, an author, a publisher and of course the content itself. Therefore it makes sense to also be able to search for news articles with either a news article, a press release or other text as the input. This would let journalists check for plagiarism and would be useful in lots of other contexts.

The challenge is that the script that builds the search results needs a limit on the number of results to return, to keep the data and server load manageable. However, imagine searching for both news articles and press releases with a limit of 20 results. It might be the case that 20 news articles rank higher than any of the matching press releases. The press releases might be the more interesting results though, so excluding them could omit valuable data. The solution to this is faceted search, where extra data, in this case the document type, is stored along with the document id in the index. The number of results returned is per document type, and therefore useful data should not be missed. This has implications for total index disk space usage, but is a truly useful addition to the capabilities of SuperFastMatch.
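To picture the per-type limit, here is a rough sketch (illustrative only; the type codes and names are made up, and this is not the actual SuperFastMatch code):

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// A single ranked search result (hypothetical shape).
struct Result {
  uint32_t doc_type;  // e.g. 1 = news article, 2 = press release
  uint32_t doc_id;
  double score;
};

// Keep at most `limit` results per document type, rather than `limit`
// overall, so that high-scoring news articles cannot crowd out every
// matching press release (or vice versa).
std::map<uint32_t, std::vector<Result>> FacetByType(
    const std::vector<Result>& ranked,  // assumed sorted by score, best first
    std::size_t limit) {
  std::map<uint32_t, std::vector<Result>> faceted;
  for (const Result& r : ranked) {
    std::vector<Result>& bucket = faceted[r.doc_type];
    if (bucket.size() < limit) {
      bucket.push_back(r);
    }
  }
  return faceted;
}
```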

Other fields, such as published date, could be used as filters, but then we are entering the territory of advanced search engines like Xapian and Solr, which are already good at that type of thing!

Week 4

Kyoto Cabinet

Storing and processing data is a challenging task with many, many options! Often the performance characteristics and ceilings of a platform only become apparent after a large amount of implementation for that platform has occurred. These evaluations can take a lot of time and definitely help form an opinion on which platform is good at a particular task. SuperFastMatch works on the simple concept of storing every hash of every window of text in a document with the associated document id (and now document type). This is done for every document and yields an inverted index of hashes to documents. Thus, for a search document or text, hashing its windows allows for a fast search of matching documents.
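By way of illustration, here is a very rough sketch of that idea (an in-memory toy, not the actual SuperFastMatch implementation; the window hashing and posting format are simplified assumptions):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// A posting records which document (and which type of document)
// contained a given window of text.
struct Posting {
  uint32_t doc_type;
  uint32_t doc_id;
};

// hash of window -> postings. The real index is disk-backed; an
// in-memory map is used here purely for illustration.
using InvertedIndex = std::unordered_map<std::size_t, std::vector<Posting>>;

// Hash every window of `window_size` characters and record a posting
// for each one. (A toy hash; SuperFastMatch uses its own hashing.)
void IndexDocument(const std::string& text, uint32_t doc_type,
                   uint32_t doc_id, std::size_t window_size,
                   InvertedIndex* index) {
  if (text.size() < window_size) return;
  std::hash<std::string> hasher;
  for (std::size_t i = 0; i + window_size <= text.size(); ++i) {
    (*index)[hasher(text.substr(i, window_size))].push_back(
        Posting{doc_type, doc_id});
  }
}

// Searching hashes the query's windows in exactly the same way and
// collects every document that shares at least one window with it.
std::vector<Posting> FindCandidates(const std::string& query,
                                    std::size_t window_size,
                                    const InvertedIndex& index) {
  std::vector<Posting> candidates;
  if (query.size() < window_size) return candidates;
  std::hash<std::string> hasher;
  for (std::size_t i = 0; i + window_size <= query.size(); ++i) {
    auto it = index.find(hasher(query.substr(i, window_size)));
    if (it != index.end()) {
      candidates.insert(candidates.end(), it->second.begin(),
                        it->second.end());
    }
  }
  return candidates;
}
```

An in-memory map like this obviously cannot hold postings for millions of articles, which is what the platform evaluation below is about.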

The storage requirement for this is very high - typically 5x the original document size. Access speed is very important, given that a search of a document n characters long requires (n - window_size) lookups. Index creation speed is also vital: if it took two days to index one day’s worth of news, Churnalism.com would fall behind current affairs very quickly!

The obvious starting candidate for a platform is SQL, and I experimented with Postgres, testing out such features as the intarray data type for storing document ids and partitioned tables as a means of bulk loading daily data whilst maintaining insert speed. However, even with 24-bit hashing (i.e. 16,777,216 possible hashes), lookup and insert speeds proved to be poor.

The next candidate was the sexy new key-value store on the block! Redis has lots of nice features ideal for storing the index: high-speed inserts and the sets command set, great for ensuring no duplicate id is saved for a hash. However, again a performance ceiling reared its skull-damaging surface! Redis operates great when the data set fits entirely in memory, but as soon as that limit is surpassed, the virtual memory paging kicks in - very slowly. The release notes and commit history showed that this was a known issue, but it made Redis impractical for use.

A lateral solution was to consider using Hadoop, in the form of Elastic MapReduce, as a way of solving the daily index creation problem. It was interesting to stray into the realms of Java land (with its verbosity quite overwhelming at times!). Issues included the need to write a simple CSVReader, which is not included in the vanilla distribution; the difficulty of getting output from Hadoop in binary rather than text file form; and the misleading and, in the end, very expensive pricing of both the EC2 service and the S3 charges for downloading the processed data to an external server. I can see that if you are Yahoo and have thousands of servers, Hadoop is a great way to distribute jobs across many machines. However, if you have a continuous data processing task that could be run on one very powerful machine, the economics of it don’t quite make sense.

At this point, despair was near! A final gambit seemed to be the service offered by the search giant themselves. App Engine had a lot of ticks in its favour. The datastore is capable of massive scaling with no configuration required. API calls to the datastore are very performant (and can now run in parallel). The task queue is fantastic at allowing extra instances to be spun up as and when needed for processing the index. All in all, App Engine was the preferred choice and I had a great working prototype. The one major headache was the billing cost for Datastore API writes. Because the data was incoming in unsorted form, each hash for each document required an API write, which added up to about £6,000 just to index 3.5 million news articles! This meant any mistakes in indexing would surely kill the project’s budget. A postscript to this might be that the recent addition of the shuffle and reduce phases of the map-reduce project might make the insertion costs considerably lower. Also, the recently announced full-text search API has a numeric data type that could be misused to simulate the inverted index.

So after a very large amount of time spent evaluating performance, cost and practical implementation, it became clear that there was a definite advantage to investing in some serious hardware to negate the higher-than-expected cost of high-performance cloud infrastructure. A server configured by Pete at Mythic Beasts with 64GB of RAM and an Intel X25-M facilitated the speedy operation of the final solution that we decided upon.

And that final solution was to use a combination of Kyoto Cabinet and Kyoto Tycoon written by the talented Mikio Hirabayashi (now gainfully employed by Google) who kindly incorporated some feature requests to do with the bit length of Murmur hashing exposed to Lua and gave very useful implementation advice. The pros of Kyoto Cabinet are numerous and include:

  • Very fast insertion and lookup speed, whether accessing from disk or memory.
  • High tunability of indexes in terms of memory usage and algorithms employed in sorting of keys.
  • Embedded Lua scripts can be run in a multithreaded, HTTP-aware server.
  • Designed to be used both as a toolkit and a framework for development, allowing tighter integration as bottlenecks are encountered, i.e. the library and header files allow for everything to be used in a custom C++ project if desired (a minimal usage sketch follows this list).
  • Great documentation.
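To give a flavour of that last point, here is a minimal sketch of using the Kyoto Cabinet C++ API to hold an inverted index entry (the file name, key and posting encoding are made-up examples, not what SuperFastMatch actually stores):

```cpp
#include <iostream>
#include <string>
#include <kchashdb.h>  // Kyoto Cabinet HashDB

int main() {
  kyotocabinet::HashDB db;
  // Open (or create) an on-disk hash database to hold the index.
  if (!db.open("index.kch",
               kyotocabinet::HashDB::OWRITER | kyotocabinet::HashDB::OCREATE)) {
    std::cerr << "open error: " << db.error().name() << std::endl;
    return 1;
  }

  // Append a "doc_type:doc_id" posting to the record for a window hash;
  // append() creates the record if it does not exist yet.
  const std::string hash_key = "1234567890";  // window hash as a string
  if (!db.append(hash_key, "2:42,")) {
    std::cerr << "append error: " << db.error().name() << std::endl;
  }

  // Read the postings back.
  std::string postings;
  if (db.get(hash_key, &postings)) {
    std::cout << hash_key << " -> " << postings << std::endl;
  }

  db.close();
  return 0;
}
```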

Filtering Junk

After examining the data for US Congress bills, it has become clear that this is a far less heterogeneous data set than that found with Churnalism.com. For instance, phrases like “is amended by adding at the end the following” and “Notwithstanding any other provision of” appear in nearly every bill. These phrases in themselves do not indicate similarity with another bill, but they might if they are part of a longer chunk. So how do we ignore the stock phrases when they are just stock phrases, but include them when they are part of something more unique?

A natural by-product of the index creation, especially as the corpus becomes larger, is that for every window, or in fact hash, there is a sequence of document ids and document types. These are themselves vital for search, but when accumulated, the count per hash gives an indication of the originality, or cliché, of that window. Say that this sentence is present in both the search text and a match candidate:

“The Tobacco Smoker’s compensation bill is amended by adding at the end the following”

The last half of the sentence consists of very frequently occurring windows, whilst the start is very likely less common. We want the whole sentence to be given prominence in the fragment results. We can do this by averaging the counts of each hash, perhaps with a high cut-off, say 100, for the high counts. This would allow us to filter out the very common phrases, which would tend to score around 100. This is yet to be implemented, but should be easy to achieve and is scheduled for version 0.4.
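That check is scheduled for 0.4 rather than written yet, but a rough sketch of what I have in mind might look like this (the cut-off handling and the final threshold are illustrative guesses):

```cpp
#include <cstdint>
#include <vector>

// Decide whether a matched fragment is interesting, given the document
// count of each window hash inside it. Counts are capped at `cutoff`
// (the 100 mentioned above) so a few extremely common windows cannot
// dominate the average.
bool IsInterestingFragment(const std::vector<uint32_t>& hash_counts,
                           uint32_t cutoff = 100) {
  if (hash_counts.empty()) return false;
  uint64_t total = 0;
  for (uint32_t count : hash_counts) {
    total += (count < cutoff) ? count : cutoff;
  }
  const double average = static_cast<double>(total) / hash_counts.size();
  // Fragments made up almost entirely of stock phrases will average
  // close to the cutoff; anything noticeably below it contains rarer,
  // more original windows and is kept. The 0.9 factor is arbitrary.
  return average < cutoff * 0.9;
}
```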

Another possible use of this sort of “heatmap” might be to visualise the originality of texts, with common phrases coloured red perhaps and more original phrases in a cooler tone!

Back to Blog With a Bang!



Well, not for the first time, there has been somewhat of a gap in the continuity of posts on this blog. I could say sorry, but then I’d risk ending up here as Jack has noted.

There is a rite of passage in web programming whereby you write your own blog software in whatever framework you are learning; you write a few posts and then stop, because each new post has some CSS requirement or weird formatting that makes you give up. I did that.

I write code. I often read other people’s code. I’d like to share my code. Stack Overflow is good for this if you have a question, or know the solution to another person’s problem. But what if you have found a nice solution to a problem in your own code? Sharing code on the previous two incarnations of this blog was a pain. Moving posts between different blog platforms is a pain. Logging in to whatever website to make a post is a pain. Lots of pain!!!

Content is generally written in Markdown at a point in time on a subject. Jekyll aims to keep that simplicity and Octopress builds on that to make sharing code easy. So here goes with this (yet another) new framework. Hopefully the big advantage will be that all the content is now much easier to play with using text editors and grep!

Experian’s Rise in Local Council Administration


I’ve had a recent experience with Liverpool City Council that has raised my awareness of the increasing role of the credit agency in the decisions made by local government officials. It has also highlighted the detachment and lack of accountability that a public-private-partnership run call centre affords these officials. The credit reference agency in question is Experian and the product that they sell is called Residency Checker. The PPP call centre is called Liverpool Direct and is 80% owned by BT. The third party who brings an added bonus to the formula is The Royal Mail.

First, a chronology (personal details included for clarity!):

  • I live with my New Zealand girlfriend in a flat in Liverpool City Centre.
  • We split up in August 2009 and she moves out from the flat by the end of that month.
  • In September 2009 I post this form to the Revenue Services applying for the Single Occupancy Discount of 25%. I do not know the forwarding address of my girlfriend at this point - such is the nature of ended relationships!
  • In October 2009 I get the same form resent to me asking for the forwarding address. My ex-girlfriend now gives me this information. I post the form again.
  • I hear nothing for the next two months, so ring up Liverpool Direct in January 2010. They inform me that they sent a letter refusing the discount on the grounds that they cannot confirm my ex-girlfriend’s new address and that fiscal records indicate she still lives at this address. This letter, dated the 24th November 2009, never reached me.
  • I request that Experian explain how their data indicates that my ex-girlfriend still lives at my flat. They explain the concept of a financial association. On the 5th January 2010 I request that this information is removed as it is incorrect. On the 7th January 2010 I receive acknowledgment that this has taken place.
  • I appeal the decision in person at Dale Street’s One Stop Shop on the 12th January 2010 and inquire as to the nature of this fiscal information.
  • No reply. I phone Liverpool Direct and they inform me of another unreceived letter, dated 20th January 2010. Again, another refusal, saying that my ex is still in occupation, according to an unnamed “Credit Reference Agency”. I request copies of these letters. Mysteriously, they all turn up at the start of February 2010.
  • On the 4th February 2010, I phone Experian again and discover the information has not been removed, as they had previously said had been the case. I request that the information is removed again. It turns out the specific information relates to a joint personal loan application in April 2009 (that was turned down) to help my ex-girlfriend consolidate her considerable debt. Interestingly, the association is based around the fact that we lived together in London before we lived together in Liverpool. Experian removed the Liverpool association, but not the London association. A key fact in this horribly bungled and failed process.

And so we go on ….

Kafka would have been proud of the communication cycle that occurs with Liverpool Direct’s council tax department. You phone the call centre and supply your reference number. You give the operator information about your case. They make a “service request” that goes to the revenues service in a separate building. A randomly-selected revenues officer writes a vague reply without details of which Credit Reference Agency is used and what alternative evidence can be accepted. This reply is sent in the post. The Royal Mail have a deal with someone, whereby they don’t deliver the letter. You phone Liverpool Direct, they make a “service request” for letters to be resent. The letter does not help explain what next steps need to be taken.

Revenues Officers refuse to take phone calls explaining and justifying their decisions.

The question that needs to be up for debate is why the council can use a failed personal loan application from April 2009 as proof that two people share a flat between September 2009 and February 2010. Also, why do the council even have access to this information? The answer to that question seems to be a market opportunity spotted by Experian to reuse credit data as something that can be assumed to be the truth of where people reside at certain points in time. My case study is an example of where Experian’s data can be proved to be incorrect. There seems to be no indication on the Liverpool City Council website or in any of their communications on how to correct the wrong data. For the record, the request form is here.

Another question up for debate is why council officials can have a one-way dialogue with borough residents. The answer to this is the outsourcing of the help desk to Liverpool Direct. In exceptional circumstances, process must be overridden and direct conversation permitted.

How could this whole problem have been avoided? Very simply and cheaply. A letter could have been sent to the forwarding address I provided, asking for confirmation of residence. This was not done. Shocking.

Two questions for MPs and councillors. How much is being spent on the Residency Checker licensing? How many single occupancy discount requests have been wrongly declined based on bad data from Experian?

Archimedes Memories


Almost brought tears to my eyes …

Brings back memories of trying to do animation in Render Bender and writing parallax star fields in Assembler!

Linq to SQL Wish List


I really like using Linq; it is an elegant query language that can be applied to in-memory objects as well as translated into SQL dialects behind the scenes. What this gives is a common means of expressing filters, orderings, groupings and aggregations throughout your code. I have battled with massive reports using the CROSS JOIN with LEFT OUTER JOIN pattern in SQL, passing stacks of parameters into the stored procedure to get the report results out. Admittedly, it could be tweaked to make it run pretty fast, but it was sheer hell to maintain.

Linq allows queries to be composed at runtime and built to arbitrary levels of complexity, as needed by the consumer of the code. This is a win situation compared to versioning strings of SQL that are not type-checked and prone to error. However, if we go down the Linq route we also need an ORM solution to go with it. Currently there are four choices:

  • Subsonic
  • Linq to SQL
  • The Entity Framework V1
  • Nhibernate with Linq to Nhibernate

I like Linq to SQL because it is lightweight, the documentation is fairly complete and I’ve invested time learning it. It has some missing features, though, and now Microsoft are deprecating it in favour of the Entity Framework V2.

Please don’t do this! Improve what is already a good product with some simple additions:

  • Add a many-to-many association
  • Add the two other patterns of inheritance: Class Table Inheritance and Concrete Class Table Inheritance
  • Improve the CreateDatabase() method and extend the MetaModel to include database default values (newid() for example) and a PostCreateDatabase hook for inserting any custom SQL
  • Add a versioning feature to the mapping source which can deal with migrations from one version to another. Currently Linq to SQL works well for DB-first design, as long as you automate the process as follows:
  • Introspect the db with sqlmetal to create a .dbml file
  • Patch the dbml file with changes using MSBuild Community Tasks for instance
  • Run the second half of the sqlmetal process to generate the code

This means that you don’t end up with nice clean POCO code, however. I would prefer, in a DDD way, to start with some POCOs, create mappings using a fluent interface, and then have the SQL DDL generated. Maybe this is possible now. I just don’t want to plough through the horrid NHibernate documentation, and the EF looks like a mess that won’t be sorted out until VS 2010 hits the streets. Subsonic is an option, but Rob Conery seems awfully busy at the moment!

Thurstaston Day Trip


Had fun taking photographs around the church where the owner of the White Star Line ended up being buried.

Gothic Angel

Other Worldly Transport


There’s something unreal about the way this Russian Sea Plane floats just above the water. Like something from a Philip K Dick novel. That would be some way to cross an ocean!

Fun With Libyan Domain Names


Heard here that domains ending in .ly were available from the Libyan registrar, so I thought I might investigate. Libyan Spider seemed like the most respectable agent, although at $149.50/year, they are not cheap.

It was quite a lot of fun searching all the short words ending in ‘ly’, and believe me, there are quite a few left! I ended up going for:

and

If you are interested in lightening me of one of these domains, then feel free to comment!