In practice, things are never that simple, of course. I think I've mentioned here before how some of the major science publishers have only just begun opening their archives to text mining. There's also the issue of images: current software can usually handle text without serious issues, but extracting meaningful data from figures is conceptually problematic. Scientists just aren't consistent when it comes to presenting their data. That's a good thing, really, as they often have to focus on different aspects of their results; nature (or Nature, for that matter) just isn't always as consistent as we'd like it to be.
A paper posted to arXiv recently by Kuhn et al. proposes one strategy for extracting data from images in scientific papers. They focused on gel images. These kinds of figures are great because they're generally just photos with sets of horizontal bands in them. The position of each band reflects the relative size of whatever is in it, so as long as a size standard is present, there's one bit of easy data already. The tricky part is knowing what's actually present on the gel. Even when we know that, the gel images are often too data-rich to tease apart every apparent result. As the authors say, "...the text rarely mentions all these relations in an explicit way, and the image is therefore the only accessible source."
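To make the size-standard point concrete, here's a rough sketch of how a band's size might be read off a gel once the ladder positions are known. This is not the method from the paper, just the usual textbook approximation that the logarithm of size falls off roughly linearly with migration distance. The function name `estimate_size`, the `(distance, size)` ladder format, and the numbers in the usage note are all made up for illustration.

```python
import math


def estimate_size(ladder, distance):
    """Estimate the size of a band from its migration distance.

    ladder:   list of (migration_distance, known_size) pairs from the
              size standard; distances are assumed to be distinct.
    distance: migration distance of the band of interest, in the same
              units as the ladder distances.

    Interpolates log10(size) linearly between the two ladder bands that
    bracket the distance. Outside the ladder range, the nearest segment
    is extended, which is less trustworthy.
    """
    points = sorted(ladder)
    if len(points) < 2:
        raise ValueError("need at least two ladder bands")

    # Walk the ladder segments and stop at the first one whose far end
    # lies at or beyond the band; otherwise the last segment is kept.
    for (d0, s0), (d1, s1) in zip(points, points[1:]):
        if distance <= d1:
            break

    fraction = (distance - d0) / (d1 - d0)
    log_size = math.log10(s0) + fraction * (math.log10(s1) - math.log10(s0))
    return 10 ** log_size
```

With a hypothetical two-band ladder of `[(10, 1000), (20, 100)]`, a band halfway between the two comes out at about 316 rather than 550, which is the whole reason for working on a log scale.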
So how well does it work? Not terribly well yet, at least partly because it's still incomplete. The issue of inconsistent labeling remains; the authors' approach works passably as long as figures are neatly labeled with gene or protein names. This kind of approach and others like it may eventually mean authors have to consider image mining when designing figures. They could, of course, just have fun with the image miners and write out all their labels by hand.