A refreshing dip in the data pool

A tiger, enjoying a swim. At least I'm assuming it's enjoying itself. Photo by Ber'Zophus on Wikimedia Commons.

Feeling overheated by all the Big Data breathing down your neck? Cool off with some toy data sets. Here, I'm using "toy" to mean "anything you don't have to be responsible for and can just have some fun with."

R users are familiar with mtcars, a data set describing 32 automobile models from the early 1970s. It's an old standard. Additional R data sets can be listed using data(), and more can be loaded from packages like MASS (a recommended package that ships with standard R installations, so don't worry about installing it). If you'd prefer to use these data sets in Python, there's a package called PyDataset to make it easy.
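Here's a quick sketch of poking around those built-in data sets. It assumes a standard R installation where MASS is available; Boston is just one data set I picked from MASS for illustration.

```r
# mtcars lives in the datasets package, which is attached by default
dim(mtcars)        # 32 rows (car models) by 11 columns
head(mtcars, 3)

# List the data sets in a specific package without attaching it
mass_sets <- data(package = "MASS")$results[, "Item"]
head(mass_sets)

# Load a single data set from MASS into the workspace
data("Boston", package = "MASS")
str(Boston)
```

Calling data() with no arguments lists the data sets in every currently attached package instead.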

Not happy with that data? Try Data.gov: it's currently the home of nearly 186,000 data sets across numerous disciplines. They vary in format as well: some are nice, clean CSVs, while others may just be collections of spreadsheets. Still others may require some navigation to get to the useful material.

Here are some examples, found through Data.gov and other sources:

Kaggle has some fun data sets to work with too, as does Amazon Web Services.

Or you can just give up and make a small synthetic data frame in R:

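# 50 rows x 10 columns of random integers from 0 to 100
syn <- data.frame(replicate(10, sample(0:100, 50, replace = TRUE)))
# Random 4-character row IDs; make.unique() guards against the rare duplicate
rownames(syn) <- make.unique(replicate(50, paste(sample(c(0:9, LETTERS), 4, replace = TRUE), collapse = "")))
# Random 4-letter column names
colnames(syn) <- make.unique(replicate(10, paste(sample(LETTERS, 4, replace = TRUE), collapse = "")))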
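Since sample() draws new random values on every run, a synthetic frame like this changes each time. If you want the same toy data twice, one option is to set a seed first. This is a minimal sketch: the name toy and the seed value 42 are arbitrary choices of mine, not anything special.

```r
set.seed(42)  # any fixed integer works; 42 is an arbitrary choice

# 50 rows by 10 columns of integers drawn uniformly from 0 to 100
toy <- data.frame(replicate(10, sample(0:100, 50, replace = TRUE)))

dim(toy)          # 50 rows, 10 columns
summary(toy$X1)   # columns get default names X1 through X10
```

Rerunning the whole block, including the set.seed() call, reproduces the same values.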

J. Harry Caufield
