Draw your own lines

Maps have always been interesting to me. It’s less about the aesthetics of maps or their level of detail (though those are interesting, and really require a staggering amount of coordinated effort) and more about how areas end up with consistent definitions. Where are the explicit and implicit borders between neighborhoods? How do I tell someone I live in a particular area? How does the language we use in describing geographic regions change depending on who we’re talking with?

I live in LA, a city rich in examples of these types of questions. It’s famously a patchwork of neighborhoods, independent cities, and other sociopolitical niches, some as small as a block or two. There’s often debate about where one neighborhood begins and another ends. Luckily, some map creators are willing to wade into that active debate.

This LAist article features a very detailed map by Eric Brightwell. You can find the map on its own here.

[Image: Eric Brightwell's map of Los Angeles neighborhoods (LA_map_1.png)]

There's also a previous, searchable version produced by the LA Times. I believe it first appeared in 2009 and has undergone changes since then. It's not bad, though then again, I didn't grow up in LA.

For context, here's a map of gentrification in the same area over the last few decades.

Your health data, rapidly disappearing into the distance on a hijacked stagecoach

When you generate health data, where does it go? I mean, if you visit the doctor, and they collect all the requisite information in their electronic health records, where do those records go? Who may access them? Despite all the regulations in place regarding patient privacy, these questions aren't easy to answer, especially where data breaches may have left sensitive data open to access by unintended parties. This is the ground covered by theDataMap, a project by Prof. Latanya Sweeney and Harvard's Data Privacy Lab.

The map itself.

The map is essentially an index of known data-sharing arrangements between parties, irrespective of whether any single person or group participates in those relationships. Most of its health data comes from state-level discharge records, i.e., partially structured records describing the details of an individual patient's hospital visit, including payment details. While these records don't include names or other personal identifiers, the project's creators note that they provide enough detail to link patients to news stories and thereby identify them. (In theory, some could be linked to clinical case reports as well.) These records also aren't subject to HIPAA standards, as they're governed by state regulations instead.

So, the answer to “where does my health data go” is essentially “to whoever buys it or finds it after a data breach”. Click on any of the nodes on the project site and you’ll get a list of organizations known to handle health data, along with any instances of data going missing. I think this is the most interesting aspect of the project: with a more comprehensive graph representation and/or a simple API, theDataMap could be a way to automatically trace paths between known data leaks and specific patient groups. If a Florida real estate company suffers a data breach and is known to have purchased discharge records, the impacted parties (i.e., patients of Florida hospitals) should know ASAP. Then again, sometimes it can take nearly a decade for health data breaches to become public.
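
If that kind of tracing were possible, it wouldn't take much code. Here's a minimal sketch of the idea, assuming the sharing arrangements were available as a directed graph; the organizations, edges, and breach below are all invented for illustration, and theDataMap doesn't currently offer an API like this.

from collections import deque

# Hypothetical directed graph: data flows from each source to its recipients.
shares = {
    "FL hospitals": ["FL state discharge records"],
    "FL state discharge records": ["analytics vendor", "real estate co."],
    "analytics vendor": [],
    "real estate co.": [],
}

def upstream_sources(graph, breached):
    """Return every node whose data could have reached the breached party."""
    reverse = {node: [] for node in graph}
    for src, dsts in graph.items():
        for dst in dsts:
            reverse[dst].append(src)
    seen, queue = set(), deque([breached])
    while queue:
        for src in reverse[queue.popleft()]:
            if src not in seen:
                seen.add(src)
                queue.append(src)
    return seen

print(upstream_sources(shares, "real estate co."))
# {'FL state discharge records', 'FL hospitals'}: the groups to notify

With real data behind it, the same reachability query run at breach-disclosure time could produce notification lists automatically.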

How to make straight lines and arrows in PowerPoint

A very short entry here for an ever-present issue with PowerPoint.

When you draw lines or arrows in the software, and particularly when you let them snap into position, they often aren't quite aligned right. They're visibly off-center. In short, they look terrible.

A slightly off-center arrow. Not the worst example.

This problem has existed for years. The solution is to ensure that either the width or the height (for vertical or horizontal lines, respectively) is exactly zero. In PowerPoint, right-click the offending line, select "Size and Position", then set the corresponding height or width value to zero.

[Image: the Format Shape pane with the Size fields visible (format_shape.png)]
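
If you'd rather fix a whole deck at once, the same idea works programmatically. This is a rough sketch, assuming the python-pptx library; the file names and the snapping tolerance are made up, and you'd want to check the results by eye.

from pptx import Presentation
from pptx.enum.shapes import MSO_SHAPE_TYPE
from pptx.util import Emu

TOLERANCE = Emu(2)  # anything within 2 EMUs of straight gets snapped flat

prs = Presentation("deck.pptx")  # hypothetical input file
for slide in prs.slides:
    for shape in slide.shapes:
        if shape.shape_type != MSO_SHAPE_TYPE.LINE:
            continue  # only touch plain lines/connectors
        if 0 < shape.height <= TOLERANCE:
            shape.height = Emu(0)  # nearly horizontal: zero the height
        elif 0 < shape.width <= TOLERANCE:
            shape.width = Emu(0)   # nearly vertical: zero the width
prs.save("deck_fixed.pptx")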

I don’t like posting about Powerpoint problems and solutions, but these issues and the software itself are still all too common in academic circles.

Annotation and conversion with brat: a technical note

A quick technical fix if you're interested in trying out some of the tools developed for use with the brat annotation platform. I wanted to be able to convert brat annotations into BioC format. There's a tool developed by Antonio Jimeno Yepes et al. for that purpose, called Brat2BioC. It depends on brateval, developed by the same group. I tried installing brateval first via Maven as instructed, and it built just fine, but Brat2BioC refused to do so.

Image not explicitly related.

The solution? Turns out Brat2BioC is just looking for the wrong version of brateval. Edit pom.xml such that the line

<version>0.0.1-SNAPSHOT</version>

under

<artifactId>BRATEval</artifactId>

matches the actual version name of the brateval jar file. Then the build should work.
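
Assembled, the relevant dependency entry in pom.xml should look something like this; the groupId is a guess based on the tool's package name, and the version shown simply matches the jar my brateval build produced, so use whatever yours is named:

<dependency>
  <groupId>au.com.nicta</groupId>            <!-- assumed; keep whatever the file already has -->
  <artifactId>BRATEval</artifactId>
  <version>0.1.0-SNAPSHOT</version>          <!-- must match the built brateval jar -->
</dependency>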

But what about running the thing? I have a set of annotated documents in brat standoff format (i.e., a set of .txt docs and corresponding .ann files), now in their own folder named "input". After at least an hour of troubleshooting, I still couldn't get it to work. Part of the issue is Maven: it doesn't seem to like loading local jar packages anymore (see this Stack Overflow post). Even avoiding Maven doesn't seem to help, though: Java can just never seem to find the main class. That can happen for a variety of reasons, but in this case it just needed some very explicit CLASSPATH definitions. Having built BRATEval already as requested by the Brat2BioC README, I copied its jar into the Brat2BioC lib folder, then ran the following:

java -cp ./target/classes:BRAT2BioCConverter-0.0.1-SNAPSHOT.jar:./lib/BRATEval-0.1.0-SNAPSHOT.jar:./lib/bioc.jar:xstream-1.4.4.jar:xmlpull-1.1.3.1.jar:xpp3_min-1.1.4c.jar au.com.nicta.csp.bbc.BRAT2BioC input output

This works just fine.

Lessons learned: even relatively simple format-conversion tools can be a headache to get working when the troubleshooting comes down to mundane details like file locations and classpaths.

ME, ME, ME: mutual exclusivity in understanding biomedical text

I’ve been reading and thinking about this paper by Gandhi and Lake on mutual exclusivity bias, or ME bias, lately, especially in terms of what it means for understanding biomedical text and other communications. ME bias is the tendency of an individual or a model, when given a set of objects with known names plus a novel object and an unknown name, to assign the new name to the new object. The bias rests on the assumption that every object has exactly one name. If that seems childlike, you’re right: this is one of the biases children use when learning language. They don’t often grasp the complexity of hierarchical relationships while they’re still learning, but if you show them a novel object, they’ll readily attach a newly provided name to it.

What kind of bird is that? I’ve seen birds before, and could even tell you the species of some types of birds, but I couldn’t tell you what the species of this one is. If you told me it was a Green Violetear I would have no evidence to dispute the identification. Maybe it’s enough to just call it “bird”. Image credit: me.

Gandhi and Lake were curious about whether neural networks (NNs) operate using the same bias. It would be convenient if they did, not only because it would allow them to learn relationships in a way mirroring that of humans, but because the data they need to learn from is often replete with infrequently occurring concepts. This is, in fact, a known limitation of NNs: they often have difficulty assigning meaning to objects or sequences when few or zero training examples are available. The authors refer to recent work by Cohn-Gordon and Goodman demonstrating how machine translation models often produce ambiguity through many-to-one semantic relationships (i.e., two sentences in a given language may be translated to the same output sentence even if they have different meanings), but that implementing a model with a bias resembling ME can preserve more of those direct, meaningful relationships.

Through experiments with synthetic data (a toy version of the setup is sketched after this list), the authors show that:

  • None of 400 different varieties of NN classification model demonstrate ME bias. In fact, they default to the opposite bias: “…trained models strongly predict that a novel input symbol will correspond to a known rather than unknown output symbol”.

  • This anti-ME bias holds regardless of the size of the training data.

  • The same appears to be true for sequence-to-sequence models: “The networks achieve a perfect score on the training set, but cannot extrapolate the one-to-one mappings to unseen symbols”.
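
To make the classification result concrete, here's a minimal sketch of that kind of test, assuming a plain numpy softmax classifier stands in for the authors' much larger pool of networks. Five symbols are trained with a one-to-one mapping; a sixth input symbol and a sixth output class are never seen during training.

import numpy as np

rng = np.random.default_rng(0)
n = 6                             # six symbols; pair 5 is held out entirely
X = np.eye(n)[:5]                 # training inputs: one-hot symbols 0..4
T = np.eye(n)[np.arange(5)]       # targets: symbol i maps to label i

W = rng.normal(scale=0.1, size=(n, n))
b = np.zeros(n)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

for _ in range(500):              # gradient descent on cross-entropy
    p = softmax(X @ W + b)
    W -= 0.5 * (X.T @ (p - T))
    b -= 0.5 * (p - T).sum(axis=0)

probs = softmax(np.eye(n)[5] @ W + b)   # query with the never-seen symbol
print("P(known labels):", probs[:5].round(3))
print("P(unseen label):", probs[5].round(3))

Because the unseen class never appears as a target, its bias term is only ever pushed down, so the model ends up spreading the novel symbol's probability over the known labels; an ME-biased learner would do the opposite. That mirrors the anti-ME behavior the authors report, though a linear model is of course only a caricature of their experiments.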

This tendency may be true for machine learning models of other architectures and not NNs alone, as the authors concede. They extensively discuss how including ME bias may improve applications of machine translation and image classification, with the caveat that continuing the metaphor of human-style learning may be untenable in machine learning. As humans, we need mechanisms to learn about novel phenomena for our entire lives, so we remain open to the idea that a newly-encountered word or object may have a new meaning or name. Training machine learning models requires some degree of artificial limitation, however. It does provide a level of control over learning that few actively learning children will ever experience (and, on the subject of active learning, children receive constant feedback from parents, teachers, and their environment; it’s challenging to give any machine model that amount of careful human guidance).

So what’s the relevance to understanding biomedical text? One of the challenges in understanding any experimental or clinical document is its vocabulary. We can expect that some words in the document will be novel to us, whether because we haven’t encountered them before, because we learned them in a different context (perhaps even one with a slightly different meaning, the way a myocardial infarction and a cerebral infarction are physiologically similar but certainly not identical, not least because they occur in different organs), or because of authorial creativity. Here’s a recent paper with a novel title: “Barbie-cueing weight perception”. As a reader, I can parse that pun on “barbecue”, and that’s not even technical terminology. What would, say, a biomedical named entity recognition model do with it? I don’t think ME bias can solve pun recognition, but could it assist with recognizing when a term is genuinely new and meaningful?

Results by Gandhi and Lake suggest that, at least for machine translation models, a novel output should be expected given a novel input. In entity recognition, it’s trivial to have this expectation, but perhaps not useful to assume that all novel words or phrases are unique entities. Typing is the real challenge, especially when there are numerous possible types. Should all newly encountered words get added to new types, then processed further in some manner? Perhaps this would make the most sense in a continuous learning scenario where types are aligned to a fixed ontology but there is some room for ambiguity, as in the sketch below. I’m not sure a bias toward ambiguity is quite the same as ME bias, but it seems like half of the idea. There’s likely some of the idea of learning to learn involved as well: a model would need some ability to recognize contexts appropriate for assigning new or ambiguous relationships, much like how children learn about being prompted to connect a new object with a name.
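
Purely as a thought experiment, the ontology-plus-ambiguity idea could be as simple as routing unrecognized mentions to a provisional type rather than forcing them into an existing category. Everything here is invented for illustration: the ontology entries, the type names, and the mentions.

ONTOLOGY = {
    "myocardial infarction": "Disease",
    "cerebral infarction": "Disease",
    "aspirin": "Drug",
}
PROVISIONAL = "NovelOrAmbiguous"   # hypothetical holding type for later review

def assign_type(mention: str) -> str:
    """Known mentions keep their ontology type; unknown ones stay open."""
    return ONTOLOGY.get(mention.lower(), PROVISIONAL)

for m in ["Aspirin", "Barbie-cueing", "cerebral infarction"]:
    print(m, "->", assign_type(m))
# Aspirin -> Drug
# Barbie-cueing -> NovelOrAmbiguous
# cerebral infarction -> Disease

A continuous-learning system would then need the learning-to-learn piece: deciding, from context, when a provisional mention deserves promotion to a real type.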