So much to say: novel generation with birchcanoe

Harry Caufield

I'm trying something this month, partially for NaNoGenMo (essentially a computational, generative, and generally absurd version of National Novel Writing Month), and partially to address an idea that's been bouncing around in my head for a while now. I'm writing a program to assemble stories (perhaps more like loosely-connected prose, and slightly like novels) from existing collections of short, somehow homogeneous sentences. This will start with the Harvard Sentences - a set of carefully-engineered sentences originally intended for testing speech quality in telephone systems. They all read like this:

 1. Slide the box into that empty space.
 2. The plant grew large and green in the window.
 3. The beam dropped down on the workman's head.
 4. Pink clouds floated with the breeze.
 5. She danced like a swan, tall and graceful.
 6. The tube was blown and the tire flat and useless.
 7. It is late morning on the old wall clock.
 8. Let's all join as we sing the last chorus.
 9. The last switch cannot be turned off.
10. The fight will end in just six minutes.

There's a flat but poetic quality to all these sentences, like their edges have been sanded down, and most of that quality - unsurprisingly - only arises when they're spoken aloud. So, my code will need to generate stories with a similar quality.

The project is called birchcanoe, after the first sentence in the full list of Harvard Sentences ("The birch canoe slid on the smooth planks."). There was a related magnetic poetry-style software toy a few years back, but I'm going for a larger scope here, though the exact extent remains to be seen. Updates below and on GitHub.

Nov 1 - Started the project. Not much code yet.

Nov 2 - Code is now at the point where it produces randomly-ordered sentences, so it's comparable to very early efforts at book generation. Still too basic to consider innovative.
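
For the curious, the random-ordering stage can be sketched in a few lines. This is a minimal illustration and not the actual birchcanoe code: the choice of Python, the function name `shuffled_story`, and its `seed` parameter are all assumptions made for the example. The sentences come from the list above.

```python
import random

# A few of the Harvard Sentences quoted earlier in this post.
SENTENCES = [
    "Slide the box into that empty space.",
    "The plant grew large and green in the window.",
    "The beam dropped down on the workman's head.",
    "Pink clouds floated with the breeze.",
    "She danced like a swan, tall and graceful.",
]


def shuffled_story(sentences, seed=None):
    """Return the sentences joined into one paragraph, in random order.

    Hypothetical helper for illustration; `seed` makes the order repeatable.
    """
    rng = random.Random(seed)
    ordered = list(sentences)  # copy so the input list is left untouched
    rng.shuffle(ordered)
    return " ".join(ordered)


if __name__ == "__main__":
    print(shuffled_story(SENTENCES, seed=1))
```

Every run with the same `seed` gives the same ordering, which helps when comparing outputs from one day to the next.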

Nov 8 - Got sick for a few days, and that tends to fog the coding process. Still, I've elected to use seed sentences and retrieve new, similarly-structured text to populate each chapter rather than composing each entirely from the initial seeds. I've updated the code accordingly. Output is filler for now, like this:

She saw a cat in the neighbor's house. Saw neighbor's she neighbor's the saw saw she house. In house. Cat house. A in in cat the neighbor's the a cat the saw house. Cat the in cat a house. She a saw house. A cat house. The in cat cat the house. In the saw saw she saw in house. Cat neighbor's saw in cat the she a a neighbor's in a house. The in in in house. In house. The neighbor's saw house. House. The she house. Saw house. Cat house. She a house.
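
Filler in this style could be produced along the following lines. This is a guess at the effect, written in Python, and not the actual birchcanoe code: the sampling scheme, the function names `filler_sentence` and `filler_paragraph`, and the `max_extra_words` limit are assumptions made for illustration.

```python
import random


def filler_sentence(seed_sentence, rng, max_extra_words=12):
    """Build one filler sentence from the words of `seed_sentence`.

    Assumed scheme (not taken from birchcanoe): draw a random number of
    words, with replacement, from every word except the last, then close
    with the seed's final word so each sentence ends the same way.
    """
    words = seed_sentence.rstrip(".").lower().split()
    body, last = words[:-1], words[-1]
    # With a one-word seed there is nothing to sample from.
    count = rng.randint(0, max_extra_words) if body else 0
    chosen = [rng.choice(body) for _ in range(count)] + [last]
    text = " ".join(chosen) + "."
    return text[0].upper() + text[1:]


def filler_paragraph(seed_sentence, sentence_count=10, seed=None):
    """Return the seed sentence followed by `sentence_count` filler sentences."""
    rng = random.Random(seed)
    fillers = [filler_sentence(seed_sentence, rng) for _ in range(sentence_count)]
    return " ".join([seed_sentence] + fillers)


if __name__ == "__main__":
    print(filler_paragraph("She saw a cat in the neighbor's house.", seed=8))
```

Because every filler sentence closes with the seed's last word, the paragraph keeps the same repetitive rhythm as the sample above.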

Dec 5 - Obviously I've passed the deadline on this one. That's how life goes sometimes. The project is not over, however: the plan is to record audio of my readings of each of the input sentences, use the DeepSpeech engine to build a text-to-speech model with the audio as training data, then use that model to produce audio books of the generated text. It will sound strange, and that's kind of the point (otherwise, I could use one of the many extant text-to-speech systems or their APIs).

Undiscovered ice floes

Harry Caufield

I've been reading some Don Swanson papers recently in an attempt to:

  1. Approach information retrieval (IR) from a philosophical perspective, as IR is a major part of what I do now but is far too broad a field to easily comprehend without years of experience
  2. Gain a historical perspective
  3. Remember material from my undergraduate Philosophy of Mind course (something about P-zombies...or I guess that was something else entirely)
Not this guy. No relation. Some relation?

My progress on those fronts continues, but in the meantime, I noticed an interesting point in the 1986 paper, "Undiscovered Public Knowledge":

To verify that all relevant pieces of recorded information do in fact fit the description specified by a given search function, one would have to examine directly every piece of information that has ever been published. Moreover, such a task would never end, for, during the time it would take to examine even a small fraction of the material, more information would have been created. The above-stated hypothesis about a search function, in short, can never be verified. In that sense, an information search is essentially incomplete, or, if it were complete, we could never know it. Information retrieval therefore is necessarily uncertain and forever open-ended.

OK, so that's less of a point and more of a critical element of searching information: we can never know everything because we can never search everything. Swanson is specifically discussing scientific literature here, but even if he wasn't, do we now have access to technology rendering that issue somewhat less of a concern? We can't search everything, but between heavily optimized database structures, carefully engineered indexing schemes, and deep learning approaches (though I'd rather avoid seeing any type of machine learning as a universal, hammers-and-nails solution) can't we get very close?

At the very least, focusing on scientific literature alone, the modern issue is less a matter of how rapidly new information becomes available than of how rapidly it is lost. I'd suspect that this is more of a problem for supplementary data than for manuscripts; data tables are much more difficult to index and are essentially useless without documentation, so every data set available only in a single supplementary Excel spreadsheet has the potential to be "lost" data. I'm curious about how much of this information disappears every day, like melting glaciers or permafrost, never to be seen again, except perhaps with luck or coincidence (for the scientific data, at least - that probably won't work for the ice).


New orbits

Harry Caufield


I've always found astronauts inspiring. No, that's not a controversial statement. These are people who are so far beyond everyone in terms of skill, determination, and often literal distance that they're almost superhuman, yet obviously vulnerable due to their exposure to the universe's natural hazards. I mean, how could that not be inspiring?

People living in space need to be resourceful. They need to have some degree of improvisational skill, and I'm not just talking about Chris Hadfield-like performance skills and sci-comm virtuosity (a term I believe is appropriate as the guy really captured how to answer What Everyone Wants to Know while balancing it with The Practical Stuff). I'm focused on the fact that both science and life in space depend upon resourcefulness, whether it's hacking together a carbon dioxide filter for Apollo 13* or the necessary responses to ISS maintenance issues. The same holds on the ground, if to a different degree: science can be dangerous and researchers must maintain their health.

I'm in a new place now, beginning a new mission. I'm not an astronaut and I'm not in space. It's just LA, and it's strange in its own way, but I like it. Hopefully I'll have numerous opportunities for improvisation.

*This isn't purely the work of the astronauts themselves, of course, but they were the ones who had to implement an emergency plan while running out of breathable air. 

A refreshing dip in the data pool

Harry Caufield

A tiger, enjoying a swim. At least I'm assuming it's enjoying itself. Photo by Ber'Zophus on Wikimedia Commons.

Feeling overheated by all the Big Data breathing down your neck? Cool off with some toy data sets. Here, I'm using "toy" to mean "anything you don't have to be responsible for and can just have some fun with."

R users are familiar with mtcars, a set of data concerning 32 different automobile models from the early 1970s. It's an old standard. Additional R data sets can be listed using data() and more can be loaded from packages like MASS (which ships with standard R installations, so don't worry about installing it). If you'd prefer to use these data sets in Python, there's a package called PyDataset to make it easy.

Not happy with that data? Try - it's currently the home of nearly 186,000 data sets across numerous disciplines. They vary in format as well: some are nice, clean CSVs while others may just be collections of spreadsheets. Still others may require some navigation to get to the useful material.

Here are some examples, found through and other sources:

Kaggle has some fun data sets to work with too, as does Amazon Web Services.

Or you can just give up and make a small synthetic data frame in R:

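# 50 rows x 10 columns of random integers from 0 to 100
syn <- data.frame(replicate(10, sample(0:100, 50, replace = TRUE)))
# Random 4-character row names (digits and capitals); a rare duplicate would raise an error
rownames(syn) <- replicate(50, paste(sample(c(0:9, LETTERS), 4, replace = TRUE), collapse = ""))
# Random 4-letter column names
colnames(syn) <- replicate(10, paste(sample(LETTERS, 4, replace = TRUE), collapse = ""))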