Together with a number of my colleagues at Humlab, Umeå University (and at Uppsala University and Linköping University), I have been working hard on a research application over the last couple of weeks. Today, we finally managed to finish it all and send it to the Swedish Research Council. We have called it Welfare State Analytics. Text Mining and Modeling Swedish Politics, Media & Culture, 1945-1989 (WeStAc), and the first passage regarding purpose and aims runs as follows:
In what ways can the analysis of massive textual datasets from the heyday of Swedish welfare society in the 1950s, 60s and 70s generate new historical knowledge about national culture and society? Welfare State Analytics. Text Mining and Modeling Swedish Politics, Media & Culture, 1945-1989 (WeStAc) is an ambitious multidisciplinary digital humanities (DH) research proposal that seeks to digitise literature, curate already digitised collections, and make use of probabilistic methods and text mining models for research. WeStAc will digitise and curate three massive textual datasets (in all, Big Data of almost four billion tokens) gleaned from the domains of Swedish politics, news media and literary culture during the second half of the 20th century. WeStAc asks the deceptively simple question of what socio-cultural computational analyses of massive textual sources from the Swedish welfare state decades could reveal. The 60s and 70s, for example, are often said to have been a period of radical transformations, invoked by notions such as emancipation, globalisation and individualisation. But if one were to analyse massive datasets from the domains of politics, media and culture, do such assertions hold true? WeStAc will trace discursive changes on a scale hitherto unexplored by Swedish scholars. Given the possibility of processing large amounts of data through methods such as probabilistic topic models or word embeddings, how can societal transformations be empirically measured, for example by distant reading the notion of globalisation, or by data modeling ideas of emancipation and individualisation? Importantly, what kind of data curation is needed to make decades of textual data ready for digital historical research?
WeStAc is a collaboration between Umeå University and the National Library of Sweden that will establish a scholarly ecosystem of digitisation, curation and research (Benardou et al. 2018). The objective is twofold: (A.) to develop digital curation work, including a robust library workflow capable of digitising, assembling, curating and preparing datasets for large-scale text mining research at the National Library, and (B.) to develop digital history scholarship and perform DH-inspired textual research. In close collaboration with the National Library, WeStAc will digitise and curate, explore, mine and analyse three massive datasets: (1.) “Politics”: 3,100 already digitised Swedish Governmental Official Reports (SOU) and an abundance of recently digitised political propositions, proposals and debates from the Swedish Parliament, (2.) “Media”: two already digitised Swedish newspapers, Aftonbladet and Dagens Nyheter, (3.) “Culture”: the digitisation of Sweden’s most emblematic cultural journal during the welfare state decades, Bonniers Litterära Magasin (BLM), and all Swedish novels from 1945 to 1989, in all some 22,000 titles.
The usage of probabilistic methods and text mining models follows the rule that more data is better data. Even if some uncertainties prevail (Crossley, Dascalu & McNamara 2017), size matters, and consequently computational analyses need to be supported by (A.) very large text corpora, and (B.) digitised text material with good OCR quality. Departing from previous (bad) experiences with computational research on 19th century text (Jarlbrink & Snickars 2017), WeStAc will therefore focus on modern textual materials from the second half of the 20th century. The three datasets that WeStAc will assemble are prodigious enough and display an OCR quality and auto-segmentation of content that are superior to those of older material. One point of departure for WeStAc is thus the way in which digital history research today needs appropriate data, as well as curation, in order to produce datasets suited to DH methods (Poole 2013). Another is the systematic intertwining of research questions, digital materials, and tools (Drucker & Svensson 2016). Accordingly, WeStAc will develop, and launch during the project, a generic multi-user online environment for open JupyterLab notebooks with the ability to research and interpret essentially any textual dataset. WeStAc thus embraces open science and the reusability of digitised heritage data as research data. Since both the National Library and Humlab have applied to become co-operating partners within DARIAH, WeStAc will also connect to a pan-European infrastructural network, currently taking form, for humanities scholars working with computational methods. Ultimately, the goals of WeStAc are to foster new historical knowledge about Sweden’s welfare state decades through big data analyses, develop research methods in computer-assisted scholarship, and lay the foundation for an open research infrastructure to facilitate digital scholarship at the National Library of Sweden.