Your health data, rapidly disappearing into the distance on a hijacked stagecoach

When you generate health data, where does it go? I mean, if you visit the doctor, and they collect all their requisite information in their electronic health records, where do the records go? Who may access them? Despite all the regulations in place regarding patient privacy, these questions aren’t easy to answer, especially in circumstances where data breaches may have left sensitive data open to access from unintended parties. This is the ground covered by theDataMap, a project by Prof. Latanya Sweeney and Harvard’s Data Privacy Lab.

The map itself.


The map is essentially an index of known data sharing arrangements between parties, irrespective of whether any single person or group may participate in those relationships. Most of its health data is from state-level discharge records, i.e., partially structured records describing individual details of a patient and hospital visit, including payment details. While these records don’t include names or other personal identifiers, the project’s creators note that discharge records provide enough detail to link patients to news stories and thereby identify patients. (In theory, some could be linked to clinical case reports as well.) These records also don’t match HIPAA standards as they’re governed by state regulations instead.

So, the answer to “where does my health data go” is essentially “to whoever buys it or finds it after a data breach”. Click on any of the nodes on the project site and you’ll get a list of organizations known to handle health data, along with any instances of data going missing. I think this is the most interesting aspect of the project: with a more comprehensive graph representation and/or a simple API, theDataMap could be a way to automatically trace paths between known data leaks and specific patient groups. If a Florida real estate company suffers a data breach and is known to have purchased discharge records, the impacted parties (i.e., patients of Florida hospitals) should know ASAP. Then again, sometimes it can take nearly a decade for health data breaches to become public.
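
To make the path-tracing idea a little more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the party names and data flows are invented, and theDataMap does not (as far as I know) expose its data in this form. It just treats sharing arrangements as a directed graph and walks it backwards from a breached party.

```python
from collections import deque

# Hypothetical data flows: each pair means "source shares records with recipient".
FLOWS = [
    ("Florida hospitals", "State discharge database"),
    ("State discharge database", "Real estate company"),
    ("State discharge database", "Analytics firm"),
    ("Texas hospitals", "Analytics firm"),
]


def upstream_sources(flows, breached):
    """Return every party whose data could have reached the breached party."""
    # Reverse index: recipient -> set of parties that send data to it.
    senders = {}
    for source, recipient in flows:
        senders.setdefault(recipient, set()).add(source)

    # Breadth-first search backwards along the sharing arrangements.
    seen = set()
    queue = deque([breached])
    while queue:
        node = queue.popleft()
        for source in senders.get(node, ()):
            if source not in seen:
                seen.add(source)
                queue.append(source)
    return seen


affected = upstream_sources(FLOWS, "Real estate company")
print(sorted(affected))
```

In this toy graph, a breach at the real estate company traces back to the state discharge database and the Florida hospitals, but not to the Texas hospitals, which only share with the other buyer.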

How to make straight lines and arrows in PowerPoint

A very short entry here for an ever-present issue with PowerPoint.

When drawing lines or arrows in the software, and particularly when letting said lines or arrows snap into position, they aren’t quite aligned right. They’re visibly off-center. In short, they look terrible.

A slightly off-center arrow. Not the worst example.


This problem has existed for years. The solution is to ensure that either the width or height (for vertical or horizontal lines, respectively) is exactly zero. In PowerPoint, right-click the offending line and select “Size and Position”, then set the corresponding height or width value to zero.


I don’t like posting about PowerPoint problems and solutions, but these issues and the software itself are still all too common in academic circles.

Annotation and conversion with brat: a technical note

Quick technical fix if you’re interested in trying out some of the tools developed for use with the brat annotation platform. I wanted to be able to convert brat annotations into BioC format. There’s a tool developed by Antonio Jimeno Yepes et al. for that purpose, called Brat2BioC. It depends on brateval, developed by the same group. I tried installing brateval first via Maven as instructed, and it built just fine, but Brat2BioC refused to build.

Image not explicitly related.


The solution? Turns out Brat2BioC is just looking for the wrong version of brateval. Edit pom.xml such that the line




matches the actual version name of the brateval jar file. Then the build should work.

But what about running the thing? I have a set of annotated documents in brat standoff format (i.e., a set of .txt docs and corresponding .ann files) in their own folder named “input”. After at least an hour of troubleshooting I still couldn’t get it to work. Part of the issue is Maven: it doesn’t seem to like loading local jar packages anymore (see this Stack Overflow post). Even avoiding Maven doesn’t seem to help, though. Java just never seems to find the main class, which could happen for a variety of reasons, but in this case it needed some very explicit CLASSPATH definitions. Having built BRATEval already as requested by the Brat2BioC README, I copied its jar into the Brat2BioC lib folder, then ran the following:

java -cp ./target/classes:BRAT2BioCConverter-0.0.1-SNAPSHOT.jar:./lib/BRATEval-0.1.0-SNAPSHOT.jar:./lib/bioc.jar:xstream-1.4.4.jar:xmpull- input output

This works just fine.
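
If typing out every jar by hand gets tedious, the classpath string can be assembled with a few lines of Python. This is only a sketch under assumptions: the helper name `build_classpath` is mine, and the `target/classes` and `lib` locations simply mirror the layout described above. It does not pick the main class or arguments for you; those still have to be supplied when calling `java -cp`.

```python
import os
from pathlib import Path


def build_classpath(class_dirs, jar_dirs):
    """Join class directories and every .jar found in jar_dirs into one classpath string."""
    entries = [str(d) for d in class_dirs]
    for jar_dir in jar_dirs:
        # Sorted so the resulting string is stable between runs.
        entries.extend(str(p) for p in sorted(Path(jar_dir).glob("*.jar")))
    # os.pathsep is ":" on Linux/macOS and ";" on Windows.
    return os.pathsep.join(entries)


# Assumed layout: compiled classes in target/classes, dependency jars in lib.
print(build_classpath(["target/classes"], ["lib"]))
```

The printed string can be pasted after `java -cp`, which at least rules out typos in jar names as a reason for Java failing to find the main class.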

Lessons learned: even relatively simple format-conversion tools can be a headache to get working if you have to troubleshoot things like file locations.

ME, ME, ME: mutual exclusivity in understanding biomedical text

I’ve been reading and thinking about this paper by Gandhi and Lake on mutual exclusivity bias, or ME bias, lately, especially in terms of what it means for understanding biomedical text and other communications. ME bias is the tendency for an individual or a model, given a set of objects with known names along with an unknown name and novel object, to assign the new name to the new object. This bias works under the assumption that every object has one name. If that seems childlike, you’re right: this is one of the biases used by children when they’re learning language. They don’t often grasp the complexity of hierarchical relationships while they’re still learning, but if you show them a novel object, they’ll readily attach a newly provided name to it.
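
As a toy illustration of the bias itself (not of anything in the Gandhi and Lake paper), here is a minimal Python sketch. The function name, the object labels, and the made-up word “dax” are all assumptions for the example. Given a vocabulary of known name-to-object pairs and a scene with exactly one unnamed object, the heuristic hands the new name to that object.

```python
def assign_novel_name(known_names, objects_in_view, novel_name):
    """Apply a mutual exclusivity heuristic: map the new name to the one unnamed object.

    known_names maps each known name to the object it refers to.
    Returns the chosen object, or None when the heuristic gives no unique answer.
    """
    named_objects = set(known_names.values())
    unnamed = [obj for obj in objects_in_view if obj not in named_objects]
    # The heuristic only applies to a genuinely new name and a single unnamed object.
    if novel_name in known_names or len(unnamed) != 1:
        return None
    return unnamed[0]


vocabulary = {"ball": "red ball", "cup": "blue cup"}
scene = ["red ball", "blue cup", "odd whisk-like thing"]
print(assign_novel_name(vocabulary, scene, "dax"))
```

The one-name-per-object assumption is doing all the work here: the ball and cup are ruled out only because they already have names.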

What kind of bird is that? I’ve seen birds before, and could even tell you the species of some types of birds, but I couldn’t tell you what the species of this one is. If you told me it was a Green Violetear I would have no evidence to dispute the identification. Maybe it’s enough to just call it “bird”. Image credit: me.


Gandhi and Lake were curious about whether neural networks (NNs) operate using the same bias. It would be convenient if they did, not only because it would allow them to learn relationships in a way mirroring that of humans, but because the data they may need to learn from is often replete with infrequently occurring concepts. This is, in fact, a known limitation of NNs: they often encounter difficulties in assigning meaning to objects or sequences when few or zero training examples are available. The authors refer to recent work by Cohn-Gordon and Goodman demonstrating how machine translation models often produce ambiguity through many-to-one semantic relationships (i.e., two sentences in a given language may be translated to the same output sentence, even if those two sentences have different meanings), but implementing a model incorporating a bias resembling ME can preserve more of those direct, meaningful relationships.

Through experiments with synthetic data, the authors show that:

  • None of 400 different varieties of NN classification model demonstrate ME bias. In fact, they default to the opposite bias: “…trained models strongly predict that a novel input symbol will correspond to a known rather than unknown output symbol”.

  • This anti-ME bias holds regardless of the size of the training data.

  • The same appears to be true for sequence-to-sequence models: “The networks achieve a perfect score on the training set, but cannot extrapolate the one-to-one mappings to unseen symbols”.

This tendency may be true for machine learning models of other architectures and not NNs alone, as the authors concede. They extensively discuss how including ME bias may improve applications of machine translation and image classification, with the caveat that continuing the metaphor of human-style learning may be untenable in machine learning. As humans, we need mechanisms to learn about novel phenomena for our entire lives, so we remain open to the idea that a newly-encountered word or object may have a new meaning or name. Training machine learning models requires some degree of artificial limitation, however. It does provide a level of control over learning that few actively learning children will ever experience (and, on the subject of active learning, children receive constant feedback from parents, teachers, and their environment; it’s challenging to give any machine model that amount of careful human guidance).

So what’s the relevance to understanding biomedical text? One of the challenges in understanding any experimental or clinical document is its vocabulary. We can expect that some words in the document will be novel, whether because we haven’t encountered them before, because we learned them in a different context (perhaps even one with a slightly different meaning, like how a myocardial infarction and a cerebral infarction are physiologically similar but certainly not identical, not least because they’re in different organs), or because of authorial creativity. Here’s a recent paper with a novel title: “Barbie-cueing weight perception”. As a reader, I can parse that pun on “barbecue”, and that’s not even technical terminology. What would, say, a biomedical named entity recognition model do with it? I don’t think ME bias can solve pun recognition, but could it assist with recognizing when a term is genuinely new and meaningful?

Results by Gandhi and Lake suggest that, at least for machine translation models, a novel output should be expected given a novel input. In entity recognition, it’s trivial to have this expectation, but perhaps not useful to assume that all novel words or phrases are unique entities. Typing is the real challenge, especially if there are numerous possible types. Should all newly encountered words get added to new types, then processed further in some manner? Perhaps this would make the most sense in a continuous learning scenario where types are aligned to a fixed ontology but there is some room for ambiguity. I’m not sure if it’s quite the same as ME bias to have a bias toward ambiguity, but it seems like half of the idea. There’s likely some of the idea of learning to learn involved. A model would need to have some ability to recognize contexts appropriate for assigning new or ambiguous relationships, much like how children learn about being prompted to connect a new object with a name.
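
Here is one way the “new words go to a new type, then get processed further” idea could look, as a minimal Python sketch. It is a thought experiment rather than a working entity recognizer: the tiny lexicon, the `UNRESOLVED` label, and the function name are all assumptions for illustration.

```python
def type_terms(terms, lexicon, novel_type="UNRESOLVED"):
    """Assign each term a type from the lexicon, or a placeholder type if it is new.

    Returns the typed terms plus the list of new terms queued for later review.
    """
    typed = []
    pending = []
    for term in terms:
        key = term.lower()
        if key in lexicon:
            typed.append((term, lexicon[key]))
        else:
            # Novel term: keep it, mark it as ambiguous, and defer the decision.
            typed.append((term, novel_type))
            pending.append(term)
    return typed, pending


# Hypothetical lexicon aligned to a fixed set of types.
lexicon = {"myocardial infarction": "Disease", "aspirin": "Chemical"}
typed, pending = type_terms(["Aspirin", "Barbie-cueing"], lexicon)
print(typed)
print(pending)
```

The known drug gets its usual type, while the pun lands in the placeholder bucket instead of being forced into an existing type, which is roughly the bias toward ambiguity described above.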

Limitless PowerPoint: alternatives to the usual slide presentations

I’ve had a fair amount of musing/complaining about posters here lately, but what about the ubiquitous PowerPoint-style slide presentation? It’s just as much a lingua franca of scientific communication as anything else, despite being conceptually identical to the overhead transparencies of old. I’m not going to get into the inherent limitations of slide presentations, especially now that we’ve progressed beyond the age of laser sound effects and transitions. These discussions have been conveyed elsewhere, years ago. I believe much of the argument comes down to “the focus of the presentation should be the presenter, not the slides”. It’s certainly a compelling position but not one I’m going to dissect at the moment.

Instead, let’s look at a few alternatives to PowerPoint and its ilk. These are tools for in-person talks or webinars, as opposed to pre-recorded presentations. They seem to mainly address one of PowerPoint’s biggest limitations: it never satisfactorily engineered a way to integrate all the types of audiovisual media a presenter may want to show. The standard protocol for demonstrating a live web site, for instance, is to open it in a browser. Google Slides gets better all the time but has similar limitations. Keynote and LibreOffice Impress still follow the same PowerPoint philosophy. How about something entirely new? A new paradigm, perhaps? Or maybe you just want to include small animations, like in this acknowledgements slide? Here are a few.



Ludus is a slide tool built around “smart blocks” and integration with a whole bunch of web services. Want to add GIFs on a whim? Have a collection of design mockups on Figma you’d like to show off? Ludus will do both of those. Plus, their About page does, in fact, refer to the tool as a “new paradigm”. It does most of the things PowerPoint et al. do, though it costs about $15 to $20/month for a single user, depending on payment frequency. I have not tried it and likely will not in the future for this reason (the above image is captured from the demo video on their site). There is a 30-day free trial if you are intrigued. Its target audience seems to be designers rather than researchers. The integrated services don’t seem to align with the usual science needs: support for things like Dropbox may help if that’s part of an existing workflow, but there isn’t integration with NCBI resources or arXiv, for example. Looks neat otherwise.



So much of a counterpoint to Ludus that it bills itself as “for people who aren’t designers”, Prezi is built around zooming in and out on a map of visuals. The platform has been around for about a decade now and has had plenty of time to smooth out its rough edges, though the interface still requires some acclimation. A big plus: it has a fairly basic but free-of-any-monetary-cost option. The cheapest paid option is $7/month. Prezi’s new interface makes all the style details easy to play around with. But does Prezi meet the needs of research presentations? Does it make it easier to convey multifaceted questions, methods, and results? Here’s one example of something Prezi does very well: it allows the presentation to zoom in on detailed figures without having them continuously occupy a lot of screen space. This is another fun example. I think the zoom effects need to be handled with care, as they can become distracting as the view zooms past other slides. It can feel a bit like trying to navigate with Google Maps in an area you don’t know well. If you don’t think your audience will mind, Prezi may be worthwhile, but expect to feel constrained by the free account limitations.



Swipe is a bit closer to traditional slide presentations, with a few notable features. It supports building slides out of pure Markdown notation, a format many programmers and GitHub devotees are familiar with. It’s a great way to control formatting without getting too distracted by the exact placement of every text box or visual element. Swipe also allows presentations to include multiple choice polls. This seems entirely appropriate for academic lectures and, given the right setup, could even be a simple way to increase audience engagement. The audience would have to be expecting it from the beginning, but it appears quite easy to direct them to a short link providing the live presentation, complete with polls and real-time results. This is my favorite of the three, and the option I’m most likely to use in the future, particularly as it has a decent free option.

This is just a selection of the non-PowerPoint presentation platforms in existence. In the end, there’s no replacement for delivering a message confidently and authentically, except perhaps for getting on a stage and screaming for 20 to 30 minutes. That can go pretty far, too.