Tumblr Corpus

The neat thing about making a tumblr search engine is that you end up with a huge database of tumblr posts. This database forms the backbone of the remote associates "creativity AI" I whipped up a while back, but now you too can access the corpus for your own devious research!

Currently, the corpus is provided in .csv format. It contains raw text from ~4.9 million tumblr posts indexed up until October 13th, 2015 -- including post_id and reblog_key (for easy removal of duplicates, or perhaps virality studies, who knows). The file size is 1.3GB.

Do note that it includes a number of text-less posts (generally these are from photo or video posts with no user caption) but they are easy enough to filter out if your requirements so demand. 
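If you'd rather not do that filtering by hand, here is a minimal sketch of one way to go about it in Python. It assumes the file has a header row and that the post body lives in a column called `text`; that column name is my guess for illustration, so swap in whatever the header actually says. `post_id` and `reblog_key` are the columns mentioned above, and the sample rows are made up.

```python
import csv
import io

def iter_text_posts(rows, text_field="text"):
    """Yield rows that have text, skipping any reblog_key already seen.

    `text_field` is an assumed column name; adjust it to the real header.
    """
    seen = set()
    for row in rows:
        text = (row.get(text_field) or "").strip()
        if not text:
            continue  # photo/video post with no caption
        key = row.get("reblog_key")
        if key:
            if key in seen:
                continue  # duplicate of a post we already kept
            seen.add(key)
        yield row

# Tiny invented sample standing in for the real file.
sample = io.StringIO(
    "post_id,reblog_key,text\n"
    "1,abc,hello world\n"
    "2,def,\n"
    "3,abc,hello world\n"
)
kept = list(iter_text_posts(csv.DictReader(sample)))
```

For the real corpus you would pass `csv.DictReader(open(path, newline="", encoding="utf-8"))` instead of the sample, assuming the file is UTF-8. Because the function is a generator, it never holds more than the set of seen keys in memory.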

Also note that 4.9 million posts substantially exceeds the row limit of OpenOffice and LibreOffice. Excel has the same problem, since its worksheets top out at 1,048,576 rows, so whether or not you're one of the FOSS buffs, you'll want to open the file in Gnumeric.
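If you just want to poke at the file without a spreadsheet at all, streaming it row by row sidesteps the row limits entirely. The following is a rough sketch using only Python's standard library; the UTF-8 encoding and the file path you pass in are assumptions on my part, not something the corpus guarantees.

```python
import csv
from itertools import islice

def peek(path, n=5, encoding="utf-8"):
    """Return the first n rows (header included, if there is one)."""
    with open(path, newline="", encoding=encoding) as f:
        return list(islice(csv.reader(f), n))

def count_rows(path, encoding="utf-8"):
    """Count CSV records without loading the whole file into memory."""
    with open(path, newline="", encoding=encoding) as f:
        return sum(1 for _ in csv.reader(f))
```

Counting with `csv.reader` rather than counting raw lines matters here, since a post with line breaks inside a quoted field spans several lines but is still one record. If a very long post trips the csv module's default field size limit, `csv.field_size_limit` can raise it.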

If there's enough demand for it, I can upload a lexemetized corpus as well, but you are probably better off lexemetizing the raw text yourself in whatever format suits your needs.

Let me know if you run into any problems, and have fun!