Thar's gels in them thar hills

Text mining is a rather unkind metaphor. In the life sciences, it refers to how we can sort through all the published data out there and extract broad conclusions from the aggregate. This does make the rhetorical assumption that most of that text is some kind of stone to be blasted away until the truth emerges, gleaming in the sun.

In practice, things are never that simple, of course. I think I've mentioned here before how some of the major science publishers have only just begun opening their archives to text mining. There's also the issue of images: current software can usually handle text without serious issues, but extracting meaningful data from figures is conceptually problematic. Scientists just aren't consistent when it comes to presenting their data. That's a good thing, really, as they often have to focus on different aspects of their results; nature (or Nature, for that matter) just isn't always as consistent as we'd like it to be.

A paper recently posted to arXiv by Kuhn et al. proposes one strategy for extracting data from images in scientific papers. They focused on gel images. These figures are great candidates because they're generally just photos with sets of horizontal bands in them. The placement of the bands reflects the relative size of whatever is in them, so as long as a size standard is present, there's one bit of easy data already. The tricky part is knowing what's actually present on the gel. Even when we know that, gel images are often too data-rich to tease apart every apparent result. As the authors say, "...the text rarely mentions all these relations in an explicit way, and the image is therefore the only accessible source."
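To make that "easy data" bit concrete: band position maps to size through the usual log-linear calibration against the ladder. Here's a minimal sketch of that idea in Python. This isn't code from the paper, and every number in it is invented for illustration.

```python
# Minimal sketch (not from Kuhn et al.): given the migration distances of a
# size ladder, estimate the sizes of unknown bands by fitting the usual
# log-linear relationship between size and distance travelled.
import numpy as np

# Ladder: known sizes and their measured migration distances in pixels
# (all values made up for this example).
ladder_sizes = np.array([250, 150, 100, 75, 50, 37, 25])
ladder_dist = np.array([40, 75, 110, 140, 185, 220, 270])

# Fit log10(size) as a linear function of migration distance.
slope, intercept = np.polyfit(ladder_dist, np.log10(ladder_sizes), 1)

def estimate_size(distance_px):
    """Estimate an unknown band's size from its migration distance."""
    return 10 ** (slope * distance_px + intercept)

print(round(estimate_size(160)))  # rough size of a band at 160 px
```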

These folks used a straightforward approach to break gels down into usable data - check out the paper for details. They're fond of the optical character recognition in MS Office 2003 for handling text. The gel segment recognition called for a machine-learning approach built on random-forest classifiers. Assigning relations to those gel images is much trickier, so the authors had to work the ol' Human Touch into their code.
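For the curious, here's roughly what a random-forest classifier for "is this patch part of a gel?" looks like in code. To be clear, this is a schematic stand-in rather than the authors' pipeline: the features and data below are fabricated, and the real feature set is described in the paper.

```python
# Schematic only: random-forest classification of image patches into
# gel vs. non-gel, using made-up per-patch features in place of whatever
# the authors actually computed.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Fake feature matrix: 200 image patches x 3 hypothetical features
# (e.g. darkness, edge density, aspect ratio).
X = rng.random((200, 3))
# Fake labels: 1 = patch belongs to a gel, 0 = it doesn't.
y = (X[:, 0] + 0.5 * X[:, 1] > 0.8).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
```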

So how well does it work? Not terribly well yet, at least in part because it's still incomplete. The issue of inconsistent labeling remains: the authors' approach works passably as long as figures are neatly labeled with gene or protein names. This kind of approach, and others like it, may eventually mean authors have to consider image mining when designing figures. They could, of course, just have some fun with the image miners and write out all their labels by hand.