Historical newspaper content is increasingly being made available online, both as scanned images and as XML files produced by optical character recognition (OCR). This availability is promising for scholarship in the digital humanities. However, historical newspaper content, and OCR-generated XML in particular, is difficult to use. OCR errors produce low-quality text and so rule out methods that require reliable sequence data. In addition, the items relevant to a specific research question typically make up only a tiny fraction of a newspaper dataset, so methods such as topic modelling and co-occurrence analysis are unlikely to provide insight when applied to the full dataset.
This paper presents a study of philosophical content in early New Zealand English-language newspapers to illustrate a general method for overcoming these problems. The method produces a corpus relevant to the research question by labelling articles of interest, training a naive Bayes classifier on the labelled articles, evaluating the resulting corpus, and, if necessary, feeding the corpus back into the labelling stage. In this study, two iterations are sufficient to generate a corpus that provides some insight into philosophical discourse in New Zealand newspapers.
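The label, train and filter steps of this loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the bag-of-words features, the add-one smoothing, the label names "relevant" and "other", the function names, and the toy article strings are all invented for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lower-case bag-of-words tokens; word order is deliberately ignored."""
    return [tok for tok in text.lower().split() if tok.isalpha()]


def train_naive_bayes(labelled):
    """Fit a multinomial naive Bayes model from (text, label) pairs."""
    doc_counts = Counter()   # number of training documents per label
    word_counts = {}         # label -> Counter of token frequencies
    vocab = set()
    for text, label in labelled:
        doc_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for tok in tokenize(text):
            counts[tok] += 1
            vocab.add(tok)
    return {"doc_counts": doc_counts, "word_counts": word_counts, "vocab": vocab}


def predict(model, text):
    """Return the label with the highest log posterior (add-one smoothing)."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][label]
        denom = sum(counts.values()) + vocab_size
        score = math.log(n_docs / total_docs)
        for tok in tokenize(text):
            if tok in model["vocab"]:  # skip tokens never seen in training
                score += math.log((counts[tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def build_corpus(labelled, unlabelled):
    """One iteration: train on the labelled items, keep predicted-relevant ones."""
    model = train_naive_bayes(labelled)
    return [doc for doc in unlabelled if predict(model, doc) == "relevant"]


# Toy data, invented for illustration only.
labelled = [
    ("the philosophy of mind and the ethics of belief", "relevant"),
    ("a lecture on metaphysics and moral philosophy", "relevant"),
    ("sheep prices at the wool sale", "other"),
    ("shipping news and the arrival of the steamer", "other"),
]
unlabelled = [
    "lecture on ethics and philosophy",
    "wool prices and shipping news",
]

corpus = build_corpus(labelled, unlabelled)
print(corpus)
```

In a second iteration, items in `corpus` would be reviewed by hand, added to `labelled` with corrected labels, and `build_corpus` run again. How the corpus is evaluated between iterations is not specified here, so that step is left out of the sketch.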
The study is motivated by a lack of scholarship on philosophy in New Zealand before the emergence of more-or-less contemporary academic philosophy there in the middle of the 20th century. Early philosophical activity in New Zealand is sometimes dismissed on the basis that it did not produce monographs or publications in major philosophical journals (e.g. Davies and Helgeby 2014, p. 24). However, as Ballantyne has argued in the case of colonial Otago, newspapers were “the fundamental infrastructure for intellectual life” (2012, p. 57). The resources for running intellectual periodicals or publishing monographs were not present in New Zealand, and international journals were too far away. We might therefore expect early New Zealand academic philosophy to be present in the newspapers of the time. Moreover, a turn to newspapers allows us to extend our view to non-academic philosophy, that is, philosophy by and for the general (English-language) newspaper-reading public.
The investigation of early New Zealand English-language newspapers was enabled by the National Library of New Zealand's release of a large dataset as part of its newspaper Open Data Pilot. The dataset contained OCR data in METS/ALTO XML format, derived from microfilm scans and covering 1839 to 1899. To enable the general method used in this study to be applied to other projects, a series of Jupyter notebooks, one for each stage of the method, is currently being developed. These will be made public before the presentation of this paper.
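As a rough indication of what working with such files involves, the following sketch extracts plain text from an ALTO fragment using only the Python standard library. It relies on the ALTO convention of storing each recognised word in the `CONTENT` attribute of a `String` element nested inside `TextLine` and `TextBlock` elements. The sample fragment, its namespace version, and the function names are invented for illustration; the actual layout of the National Library dataset may differ.

```python
import xml.etree.ElementTree as ET

# Invented sample fragment; real files are far larger and carry
# position and confidence attributes that are omitted here.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="B1">
          <TextLine>
            <String CONTENT="Moral"/>
            <SP/>
            <String CONTENT="philosophy"/>
          </TextLine>
          <TextLine>
            <String CONTENT="lecture"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix so any ALTO schema version matches."""
    return tag.rsplit("}", 1)[-1]


def alto_text_blocks(xml_string):
    """Return one plain-text string per TextBlock in an ALTO document."""
    root = ET.fromstring(xml_string)
    blocks = []
    for block in root.iter():
        if local_name(block.tag) != "TextBlock":
            continue
        lines = []
        for line in block.iter():
            if local_name(line.tag) != "TextLine":
                continue
            words = [
                s.get("CONTENT", "")
                for s in line.iter()
                if local_name(s.tag) == "String"
            ]
            lines.append(" ".join(words))
        blocks.append(" ".join(lines))
    return blocks


blocks = alto_text_blocks(SAMPLE_ALTO)
print(blocks)
```

Matching on local element names avoids hard-coding a single ALTO namespace. Grouping text blocks into articles, which would draw on the accompanying METS files, is not attempted in this sketch.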