SuperFastMatch is a project which began life as a piece of research and development in the production of Churnalism.com in 2010/2011 for the Media Standards Trust. A means was required of comparing a given text with over 3 million news articles in the Journalisted.com archive. The idea was to find the articles which had the longest sections of common text and to then rank them according to their respective percentages of reuse.
Initially, we looked at existing search engines like Xapian and Lucene and although being very effective at finding articles containing the same words, they weren't particulary good at finding the articles with the longest common strings. More info on this can be found here.
Besides spotting churnalism, the software seemed to have other potential uses and the Sunlight Foundation believes it could be applied to spotting duplicated federal law. To facilitate this application they have generously offered to fund the open sourcing of this software, and the results should hopefully be soon tested against the OpenStates project. I'd like to thank Martin Moore, Ben Campbell, Tom Lee and everyone else at Sunlight for the opportunity to work on this great project!
This is my first attempt at a truly open-source project so I've had a lot to learn about making software accessible by more than just the development team I am working with. SuperFastMatch originally started as the algorithm, developed over many sleepless nights, for comparing press releases with newspaper articles used by Churnalism.com. This was a single use-case, but as the site became known to the public, other use-cases became apparent. One is to track legislation for lobbyist influence. Others might include tracking plagiarism in fiction or making sure academic texts have correct citations.
The point is that when more than one use-case comes up, a piece of code needs to develop into a framework which is easy to understand, is well documented and is easily integrated and installable in concert with other software. Seeing as the majority of the existing code is written in Python it made sense to look at how some other Python projects made use of existing infrastructure to meet these ends. I examined Celery, a distributed job engine and tracker, as I was familiar with it from previous work, and saw how it made use of tools to assist its open source development. Below are some ideas and tools that I borrowed from it:
Github has become the de facto platform for sharing and distributing code. It's free for open source projects and is built on top of git, which has many advantages over previous generations of source code control software. It also offers static file hosting, from where you are probably reading this documentation!
Pypi is a centralised index of Python packages that allows for the versioning and sharing of Python software without the need to remember lots of urls and is reliable enough to be referred to in bootstrap scripts.
Sphinx is a documentation generator that works well with Python packages as well as C++ (and hopefully Lua!). This page and the rest of the docs are written in reStructured Text and then passed throught a batch process which converts them into HTML, after which they are uploaded to the static Github hosting.
Paver is a combination of tools which allows for the scripting in Python of many development workflows such as bootstrapping, documentation generation, packaging and test execution. I'm used to using Fabric for deployment, and Paver seems to offer a similar experience for development. It will be interesting to see if I integrate some of the C++ build process into the pavement.py file.
VirtualEnv is an essential tool for isolating Python packages into an application-specific environment. This ensures you always know which versions of software are present and can accurately recreate this when you deploy to either a production or staging server. It comes with Pip installed in the created environment for easy integration with Pypi.
Now that all the open source infrastructure is in place it's time to start learning a bit more about the use-case in hand, US Congress and state legislation.
The US Congress has two chambers, both elected, the House of Representatives and the Senate. Both chambers can introduce bills and the bill can take one of many paths through both chambers. The passage of a bill through Congress appears to be quite undefined with a whole series of possible routes and referrals being possible before an Act is signed by the President.
The end result of acts of Congress is the US Code which is an up to date record of all enrolled bills. States in the US are also capable of passing their own local laws.
The US has a well maintained digital archive of all House of Representative Bills from 2004, and a more recent archive of Senate Bills. The data is accessible in XML form which allows for metadata extraction of Bill information and easy manipulation of the Bill text itself, ideal for the purposes of this project. Text-only bills are available from 1993 onwards.
The aim of the project is to design a reusable tool for bulk text comparison and analysis. To succeed in becoming reusable, at least two use-cases have to be considered and a clean, easy to understand interface for the user has to be designed. Django offers many ways of designing models for data storage, and this flexibility is useful - but there can be a number of caveats that can obstruct progress further down the line.
Model Inheritance allows for the subclassing of a Model which itsef permits further extension. This is an ideal pattern for different types of Document. Say that you have Press Releases and News Articles, or State Law and Congress Law, all with potentially similar content. When you search, you might not know what to expect as results and would like to search all Document types simultaneously. By defining a Document base model, this is possible because it can be assumed that the necessary content in it's indexable form is present.
With Django, there is a choice of either abstract or multi-table inheritance (and also proxy...). Multi-table inheritance results in an extra table, while abstract inheritance just adds fields to the subclasses' tables. The simplest choice is abstract inheritance and I went for this. However, the disadvantage is that you cannot define a ForeignKey to the abstract base class itself but the related clean indexed content needs to be stored somewhere as a cache so that updates can be detected. To solve this issue I turned to content types, which makes it possible to relate any model instance to any other model instance, regardless of type. This means that the cleaned Content can be related to any of Document's subclasses. At this point it's probably worth looking at the model definition.
One of the limitations of the original Churnalism code is the inability to search for news articles, only press releases. However, both document types have very similar features, such as date published, an author, a publisher and of course the content itself. Therefore it makes sense to also be able to search for news articles with either a news article, press release or other text as the input text. This would be useful for journalists to check for plagiarism and is useful in lots of other contexts.
The challenge is that the script that builds the search results needs a limit on the number of results to return to make the data and server load manageable. However, imagine searching for both news articles and press releases with a limit of 20 results. It might be the case that 20 news articles have higher ranking than matching press releases. The press releases might be a more interesting result though, so excluding them could omit valuable data. The solution to this is faceted search where extra data, in this case the document type is stored along with the document id in the index. The number of results returned is per document type, and therefore useful data should not be missed. This has implications for total index disk space usage, but is a truly useful addition to the capabilities of SuperFastMatch.
Storing and processing data is a challenging task with many, many options! Often the performance characteristics and ceilings of a platform only become apparent after a large amount of implementation for that platform has occurred. These evaluations can take a lot of time and definitely help form an opinion on which platform is good at a particular task. SuperFastMatch works on the simple concept of storing every hash of every window of text in a document with the associated document id (and now document type). This is done for every document and yields an inverted index of hashes to documents. Thus, for a search document or text, hashing its windows allows for a fast search of matching documents.
The storage requirement for this is very high - typically 5x the original document size. Access speed is very important, given that a search of a document n characters long requires (n-window_size) lookups. Also index creation speed is vital. If it takes 2 days to index 1 days worth of news, Churnalism.com would be behind the current affairs very quickly!
The obvious starting candidate for a platform is SQL, and I experimented with Postgres, testing out such features as the intarray data type for storing document ids and partitioned tables as a means for bulk loading daily data whilst maintaining insert speed. However, even with 24-bit hashing (ie. 16,777,216 possible hashes), lookup and insert speeds proved to be poor.
The next candidate was the sexy new key value store on the block! Redis has lots of nice features ideal for storing the index. High speed inserts and the sets command set, great for ensuring no duplicate id is saved for a hash. However, again a performance ceiling reared its skull-damaging surface! Redis operates great when the data set fits totally in memory, but as soon as that limit is surpassed, the paging virtual memory kicks in - but very slowly. The release notes and commit history showed that this was a known issue, but it made Redis impractical for use.
A lateral solution was to consider using Hadoop in the form of Elastic Map Reduce as a way of solving the daily index creation problem. It was interesting to stray into the realms of Java land (with its verbosity quite overwhelming at times!). Issues found included the need to code a simple CSVReader - not included in the vanilla distribution. The difficulty of getting something out of Hadoop in binary rather than text file form. The misleading and, in the end, very expensive pricing of both the EC2 service and the S3 costs for downloading the processed data to an external server. I can see that if you are Yahoo and have 1000's of servers, Hadoop is a great way to distribute jobs across many machines. However, if you have a continuous data processing task that could be run on one very powerful machine the economics of it don't quite make sense.
At this point, despair was near! A final gambit seemed to be the service offered by the search giant themselves. App Engine had a lot of ticks in it's favour. The datastore is capable of massive scaling with no configuration required. API calls to the datastore are very performant (and now can run in parallel). The task queue is fantastic at allowing extra instances to be pulled up as and when needed for processing the index. All in all, Appengine was the preferred choice and I had a great working prototype. The one major headache was the billing cost for Datastore API writes. Because the data was incoming in unsorted form, each hash for each document required an API write, which added up to about £6,000 just to index 3.5 million news articles! This meant any mistakes in indexing would surely kill the project's budget. A postscript to this might be that the recent addition of the shuffle and reduce phases of the map-reduce project might make the insertion costs considerably less. Also the recently announced full-text search API has a numeric data type that could be misused to simulate the inverted index.
So after a very large amount of time spent evaluating performance, cost and practical implementation it became clear that was a definite advantage to investing in some serious hardware to negate the higher than expected cost of high performance cloud infrastructure. A server configured by Pete at Mythic Beasts with 64GB of RAM and an Intel X25-M faciliated the speedy operation of the final solution that we decided upon.
And that final solution was to use a combination of Kyoto Cabinet and Kyoto Tycoon written by the talented Mikio Hirabayashi (now gainfully employed by Google) who kindly incorporated some feature requests to do with the bit length of Murmur hashing exposed to Lua and gave very useful implementation advice. The pros of Kyoto Cabinet are numerous and include:
- Very fast insertion and lookup speed, whether accessing from disk or memory.
- High tunability of indexes in terms of memory usage and algorithms employed in sorting of keys.
- Embedded Lua scripts can be run in multithreaded HTTP aware server.
- Designed to be used both as a toolkit and a framework for development allowing tighter integration as bottlenecks are encountered, ie.e the library and header files allow for everything to be used in a custom C++ project if desired.
- Great documentation.
After examining the data for US Congress bills it has become clear that this a far less heterogenous data set than that found with Churnalism.com. For instance phrases like "is amended by adding at the end the following" and "Notwithstanding any other provision of" appear in nearly every bill. These phrases in themselves do not indicate similarity with another bill, but they might if they are part of a longer chunk. So how to ignore the stock phrases when they they are just stock phrases, but include them when they are part of something more unique?
A natural by-product of the index creation, especially as the corpora becomes larger, is that for every window, or in fact hash, there is a sequence of document ids and document types, which are themselves vital for search, but when accumulated together the count per hash gives an indication of the originality or cliché of that window. Say that this sentence is present in both search text and a match candidate:
"The Tobacco Smoker's compensation bill is amended by adding at the end the following"
The last half of the sentence consists of very frequently occurring windows, whilst the start is very likely less common. We want the whole sentence to be given prominence in the fragment results. We can do this by averaging the counts of each hash, perhaps with a high cut-off, say 100, for the high counts. This would allow us to filter out the very common phrases, which would tend to score around 100. This is yet to be implemented, but should be easy to achieve and is scheduled for version 0.4.
Another possible use of this sort of "heatmap" might be to visualise the originality of texts, with common phrases coloured red perhaps and more original phrases in a cooler tone!