Hi all! 2019 has been pretty busy so far, so I thought it might be good to take a breath and compile a list of all the tools, updates, and examples I've created in the past few months. As you'll see, most of my effort has gone into developing the GLAM Workbench, but the thing that has probably garnered the most attention is the collection of editorial cartoons from The Bulletin that I harvested and compiled.
Items on the list below are basically just copied and pasted from my updates feed – keep an eye on it for the latest news. Of course you can also follow me on Twitter, or like the DHHacks Facebook page.
New notebook added to the #GLAMWorkbench RecordSearch repository — get the basic details of agencies associated with all government functions used in @naagovau’s RecordSearch and save to a single JSON data file. View code and data.
The Australian version of ‘Who’s responsible?’ is up! Just select a government function and explore the different agencies associated with it over time. It’s built with data from @naagovau’s RecordSearch. Try it live!
Exploring some of the adjectives attached to ‘alien’ in @TroveAustralia newspapers… You can create these sorts of comparisons yourself using this app.
TroveHarvester 0.2.1 — updated to work with version 2 of the @TroveAustralia API. Now on pypi! More details shortly…
And version 0.2.2 of TroveHarvester quickly follows 0.2.1 as I squash a bug when downloading PDFs… Also managed to get the README displaying properly on PyPI. pypi.org/project/t…
Want an easy way to download @TroveAustralia newspaper articles in bulk? No installation? Point and click? I’ve created a simple web app version of my TroveHarvester using a Jupyter notebook & running on @mybinderteam. Try it live!
You want big data? I just harvested 213,340 newspaper articles (including full OCRd text) from @TroveAustralia in 82 minutes, at about 40 articles a second. https://mybinder.org/v2/gh/GLAM-Workbench/trove-newspaper-harvester/master?urlpath=%2Fapps%2Fnewspaper_harvester_app.ipynb
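As a quick sanity check on that rate, the arithmetic can be sketched in a few lines of Python. The variable names are just for illustration, and the figures are the ones quoted above; this is not part of the harvester itself.

```python
# Figures quoted in the post above.
articles = 213_340
minutes = 82

# Convert the elapsed time to seconds, then divide.
seconds = minutes * 60
rate = articles / seconds

print(f"{rate:.1f} articles per second")
```

That works out to a little over 43 articles a second, which fits with 'about 40'.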
Just updated my harvest of metadata and full text from The Bulletin in @TroveAustralia. There’s about 2gb of OCRd text from 4,534 issues (1880-1968). Full text for about 60 issues has been added since my last harvest. 111 issues have no OCRd text. Download it all from GitHub.
If there are APIs or other data sources you’d like me to add to my GLAM Workbench, feel free to create an issue. You could also describe what sorts of tools or examples using that data source would be useful.
Slowly working my way through the documentation for my GLAM Workbench. Still lots to do, but I think the page for @naagovau’s RecordSearch is now up-to-date.
Ok, more documentation for you — page for the @DigitalNZ API in GLAM Workbench updated!
So here are some fun things to do with @TroveAustralia newspapers… (via GLAM Workbench)
Added a page for @ArchivesNZ’s Archway to the GLAM Workbench docs…
And now a GLAM Workbench page for @Te_Papa…
Added a ‘data’ section to the GLAM Workbench docs, with info on harvests from government data portals, as well as series from @naagovau relating to ASIO and the White Australia Policy.
I’ve finished putting details of all the current GLAM Workbench repositories into the new documentation site. Still a few notebooks to migrate from the original workbench, but getting there! There are about 50 Jupyter notebooks so far.
One more and I’m done for the night… New GLAM Workbench page for the ‘Trove API introduction’ notebooks.
30,000+ occurrences of the word ‘Chinese’ in the OCRd full text of The Bulletin, 1880-1968.
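A count like this could be produced in several ways. Here is a minimal sketch of one, assuming the OCRd text is already loaded as a plain Python string. The function name `count_word` and the sample sentence are hypothetical, and this is not necessarily how the figure above was calculated.

```python
import re


def count_word(text: str, word: str) -> int:
    """Count whole-word, case-insensitive occurrences of `word` in `text`."""
    # \b anchors the match to word boundaries, so the word will not match
    # inside a longer token; re.escape guards against special characters.
    pattern = re.compile(rf"\b{re.escape(word)}\b", flags=re.IGNORECASE)
    return len(pattern.findall(text))


# Hypothetical sample text, not taken from The Bulletin.
sample = "Chinese gardeners sold produce. The chinese market was busy."
print(count_word(sample, "Chinese"))
```

Bear in mind that OCR errors can hide or mangle words, so any count over OCRd text is best treated as approximate.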
I’m giving a talk in a week or so (eep!) that looks at some of the changing contexts in which the word ‘aliens’ has been used in Australia. I thought, by way of comparison, it would be useful to do the same for ‘immigrants’. While I was playing around with the data last night, I came across something interesting, so here’s a sneak preview…
Pleased and proud to see the chapter @baibi & I wrote on the Real Face of White Australia now published as part of an awesome collection. Buy now or read the CC-BY version online!
I’ve added a ‘save chart’ option to the QueryPic app in my GLAM Workbench. Visualise your searches in @TroveAustralia newspapers, then save the results as HTML for easy download.
My talk for #text2data at the National Library of Sweden looks at occurrences of the words ‘aliens’ & ‘immigrants’ in @TroveAustralia newspapers, The Bulletin, & Hansard. The slides, code & data are online.
The full text of ‘Who belongs? Reading identity, ownership, and legitimacy’, my talk for #text2data last week, is now online. Includes slides, code, data & more…
I’ve added a section for Library and Archives Canada to my GLAM workbench. The first notebook extracts records of people from a specific country from their naturalisations database and saves the results as a CSV file.
Here’s an example dataset harvested from Library and Archives Canada’s naturalisation database. It’s all the people with ‘China’ as their country of origin, supplemented with wives and children (who are not included in a country search).
NSW naturalisations 1834 to 1903. The sudden rise in Chinese naturalisations followed the introduction of the poll tax. More restrictions soon followed… Using (deduped) naturalisations data from @nswarchives.
Whoops. Here’s the actual full list of countries of origin from the @nswarchives NSW naturalisations data (and not just the screenshot!).
New section added to my GLAM Workbench for the Queensland State Archives (@qsarchives). Includes a notebook to add series information into their Naturalisations 1851-1904 index.
I’ve updated the notebook for harvesting records from @archivesnz’s Archway database in my GLAM Workbench. I just used it to harvest more than 8,000 records from series 8333 relating to naturalisation.
Lots of exciting new stuff has been added to @TroveAustralia’s digitised journals in the last few months. To explore it all, head here and click on the ‘New titles’ button.
I’ve finally updated the @TroveAustralia API Console to use version 2 of the API & https by default. (Also updated to Python3 & latest Heroku stack.) More examples coming soon…
So right around now I think I'm talking (via video) about my adventures with #HistoricHansard for the ‘Between Cyberutopia and Cyberphobia’ workshop at @witswiser in South Africa. You can follow along: vimeo.com/321657685
Fun day talking to the @dhpanu team at ANU about digital history possibilities. Slides/links are all online.
Quick notebook to harvest GLAM datasets via the new(ish) @datagovau API. Includes 447 CSVs from 19 institutions.
Sneak preview of my GLAM CSV Explorer now live on @MyBinderTeam! Select one of 447 GLAM-related CSVs from @datagovau for analysis, or load your own. Coming soon to @TDHASSN.
So, I’ve finally figured out a way to automatically generate nice-looking thumbnails from @TroveAustralia newspaper articles. Demo notebook here.
Here's my 'Introducing APIs' slides for #VALATechCamp...
TIL that the web pages for digitised works (like books and journal issues) on @TroveAustralia embed a lot of useful metadata that you can't get through the API. Here's how to extract it.
After talking to @PrimahadiWijaya today about work at @MonashLing, I started harvesting metadata & full text from digitised books in @TroveAustralia. OCRd text from about 2,000 books downloaded so far. More soon...
I’m looking for books in @TroveAustralia, but there’s lots of ephemera (pamphlets, posters etc) in the book zone. So I tried grabbing the images of ‘books’ with one page & found some nice stuff including this collection of playbills.
The final tally – after much tweaking I’ve downloaded OCRd text from 9,738 works in the @TroveAustralia books zone. This includes ephemera such as pamphlets and posters as well as more booky books. Here’s the full metadata, all the text files, & harvesting code.
So @TroveAustralia includes more than 370,000 press releases, speeches, and interview transcripts issued by Aust federal politicians & saved by the Parliamentary Library. Learn how to harvest metadata & full text to create your own datasets in this notebook.
All 9,738 OCRd text files harvested from books, pamphlets and leaflets in @TroveAustralia's ‘book’ zone have been uploaded to @aarnet’s CloudStor for easy browsing/download. There's also a 400mb zip file if you want the whole lot. The harvesting method and code are available in this notebook. All this and more will be documented soon in my GLAM Workbench.
Ok, so I’ve downloaded the OCRd text from 27,426 issues of 358 digitised journals/series in @TroveAustralia. That's 6.6gb of full text. Tune in tomorrow for full details…
I’ve added a section for the @TroveAustralia ‘book’ zone to the GLAM Workbench.
I've been busy lately harvesting LOTS of full text data from @TroveAustralia's digitised journals – so many opportunities for research! You should be able to get to all the code & data from the new Trove journals section of my GLAM Workbench.
Australian pilots, aviators, airmen, and flyers — 4,950 thumbnails from a search in @TroveAustralia's newspapers combined into one very big, zoomable image.
If you'd like to make your own big, composite images from lots of @TroveAustralia newspaper thumbnails, here's a notebook that shows you how.
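The linked notebook has the real code. Purely as an illustration of the layout arithmetic involved, here is a sketch of how tiles could be arranged in a near-square grid. The function names and the 200 by 200 pixel tile size are assumptions for the example, not details taken from the notebook.

```python
import math


def grid_shape(n_tiles: int) -> tuple[int, int]:
    """Return (columns, rows) for a near-square grid holding n_tiles."""
    if n_tiles <= 0:
        return (0, 0)
    # Smallest integer whose square is >= n_tiles (a ceiling square root).
    cols = math.isqrt(n_tiles - 1) + 1
    # Enough rows to hold every tile (ceiling division).
    rows = -(-n_tiles // cols)
    return (cols, rows)


def tile_origin(index: int, cols: int, tile_w: int, tile_h: int) -> tuple[int, int]:
    """Top-left pixel of tile `index` when tiles fill the grid row by row."""
    return ((index % cols) * tile_w, (index // cols) * tile_h)


# 4,950 thumbnails, as in the aviators example above.
cols, rows = grid_shape(4950)
print(cols, rows)

# Hypothetical tile size of 200 x 200 pixels.
print(tile_origin(4949, cols, 200, 200))
```

With 4,950 thumbnails this gives a grid of 71 columns by 70 rows. An imaging library could then paste each thumbnail at its computed origin.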
The other night @OpenGLAM was sharing collections of high-res images from GLAM orgs that are free to download. That got me thinking about @TroveAustralia's digitised maps because there's lots of them, most are out of copyright, and the images are BIG.
And now my GLAM Workbench has a 'Trove Maps' section to document examples and explorations using data from @TroveAustralia's 'map' zone: glam-workbench.github.io/trove-map... Includes a list of 20,158 maps with high-res downloads.
I've reharvested Commonwealth Hansard from 1901 to 1980 and updated my repository of XML files. This should pick up the work of @ParlLibrary staff over recent years to improve the XML output.
Here's the notebook-ified version of the code I used to harvest all the Australian Commonwealth Hansard XML files from 1901 to 1980: nbviewer.jupyter.org/github/GL...
Over the last week I've been downloading editorial cartoons published in The Bulletin from @TroveAustralia. There are 3,471 cartoons – at least one from every issue published between 4 Sep 1886 and 17 Sep 1952. And you can browse them all... To make it easier to explore the images, I've compiled them into a series of PDFs – one PDF for each decade. The PDFs include lower resolution versions of the images together with their publication details and a link to Trove. They're all available from Dropbox:
- 1886 to 1889 (45mb PDF)
- 1890 to 1899 (139mb PDF)
- 1900 to 1909 (147mb PDF)
- 1910 to 1919 (153mb PDF)
- 1920 to 1929 (159mb PDF)
- 1930 to 1939 (151mb PDF)
- 1940 to 1949 (146mb PDF)
- 1950 to 1952 (42mb PDF)
Some overdue updates to the GLAM Workbench. First, here are details, data, and code from a harvest of GLAM datasets on @datagovau. Includes details of more than 400 CSV datasets.
And what can you do with 400 CSV files? Well, you could explore their contents using my GLAM CSV Explorer. Select one of the files to peek inside, or upload your own CSV.
If you're researching foreign policy using @naagovau you might find this little tool useful – it tries to find the file containing a specific numbered cable. Good for tracking down rogue references! Try it live!
I've updated the data that sits behind my Trove Places app and added more than 140 newspaper titles. To find @TroveAustralia newspapers published in any region of Australia simply click on the map!
Here's how you can get the text of Australian books in @TroveAustralia from the Internet Archive (via the Open Library).