Quantitative heterodoxology

Last week Science published an article introducing the term “culturomics” – the quantitative study of cultural trends. By constructing a database out of the 15 million books that Google has digitized over the past years, a Harvard-based research team led by Jean-Baptiste Michel has created a powerful searchable tool that makes it possible to generate quantitative data for analysing cultural trends. As they state in the abstract:

We constructed a corpus of digitized texts containing about 4% of all books ever printed. Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of ‘culturomics,’ focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. Culturomics extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities.

In short, this is a tool which has the potential to revolutionise research methods in a vast number of fields. The best part: Google Labs have made the tool (the Ngram viewer) publicly available. Before I had even started reading the article, I found myself thinking about a number of applications for my own research and field. Below follow some rough examples and preliminary results, which already seem to challenge established knowledge in the history of esotericism.

As the authors of the Science article note, the tool can be used to do a number of interesting things. The basic idea is this:

  • Together with Google, libraries and publishers across the world have digitized a total of 15 million books (about 12 % of everything that has been published since the invention of the printing press), using optical character recognition (OCR) technology;
  • These books have been provided with metadata;
  • From this pool, the researchers created a data set of 5 million books (ca. 4 % of all books ever published), selected on the basis of the quality of the metadata and character detection. Due to these criteria, the data set is much more complete and representative for books published after 1800, and for books published in English (although German, French, Chinese, Russian and Hebrew books have also been given their own corpora);
  • Finally, they developed a powerful tool for computational analysis, which can create frequency ratios for any specific word or cluster of words, or “n-grams” (one word = 1-gram; two words, 2-gram; etc.). The tool can divide the number of occurrences of a certain n-gram by the total number of words published in any given year, and thus find out how frequent it is.
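The frequency ratio described in the last point can be sketched in a few lines of Python. This is only a toy illustration, not the researchers' actual pipeline: the function names, the whitespace tokenisation, and the three-sentence "corpus" are all assumptions made for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams in a token list, each as a space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_frequency(texts_by_year, phrase):
    """Map each year to (occurrences of phrase) / (total words in that year)."""
    target = phrase.lower()
    n = len(target.split())
    result = {}
    for year, texts in texts_by_year.items():
        hits = 0
        total_words = 0
        for text in texts:
            # Naive tokenisation (assumption): lowercase and split on whitespace.
            tokens = text.lower().split()
            total_words += len(tokens)
            hits += Counter(ngrams(tokens, n))[target]
        result[year] = hits / total_words if total_words else 0.0
    return result


# Invented toy corpus: year -> texts "published" in that year.
toy_corpus = {
    1840: ["the esotericism of the art", "religion and the art of gardening"],
    1990: ["esotericism and esotericism again"],
}

one_gram = relative_frequency(toy_corpus, "esotericism")
two_gram = relative_frequency(toy_corpus, "the art")
```

In the toy corpus, 1840 contains 11 words in total, so the 1-gram “esotericism” (one occurrence) gets a ratio of 1/11 and the 2-gram “the art” (two occurrences) gets 2/11. Dividing by the total number of words per year is what makes different years comparable even though far more was printed in 1990 than in 1840.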

And that’s when the fun begins. Running multiple searches allows for comparisons, and choosing the time frame makes it possible to analyse trends.

Obviously this can be used for interesting research, for fun, or both. What better way to illustrate secularisation, for example, than the graph below?

Sex vs. religion

From being a very frequent word in the 1810s and 1830s, “religion” has gradually lost popularity in the world of books, until, in the 1980s, it was finally surpassed by the word “sex”. Or is it quite so final? Apparently, from the year 2000, religion is again on the increase, while sex is getting less frequent (the word, that is). Is secularisation giving way to desecularisation (as Peter Berger had it in 1999) and de-sexualisation (whatever that means)? Perhaps a more likely explanation is that more of the sexual discourse has moved away from print media, towards online forums – although that should also be the case for religion, which is still increasing in print (the most significant increase since the 1940s, at least).

And this was just a warm-up. What about applying the tool somewhat more seriously to my own sub-field, esotericism? One can do lots of interesting things just by checking the frequencies of the word itself.

Search for "esotericism", all books published in the English language between 1830 and 2008.

Quite expectedly, the word “esotericism” has never been more frequent than during the past two decades. This may be explained by two parallel trends: the professionalisation of the academic study of esotericism (from ca. 1990), which has led to the production of many books and articles on the subject; secondly, by the still growing “alternative” and “spiritual” publishing business, which often makes use of the word.

It is a little more surprising to notice that there was a drastic fall just before the year 2000, and that the frequency peak so far was in 1995. At least as far as the academic literature is concerned this is surprising, since the field has certainly grown and become much more productive, visible and established from that moment until today. In other words, we should expect that something else is going on. Perhaps the rise of the academic study of esotericism coincided with a turn away from that term (towards alternatives, of which there are plenty) on the part of new agers, occultists, pagans, and alternativites? I don’t know, but searches for the corresponding German and French words show the same pattern, with the German dropping only a few years later (figures below).

If we look backwards, still other relevant questions arise. For example, the genealogy of the term “esotericism” has been the object of research over the years (mentioned here previously), and the tool is well equipped to help researchers in the future. Indeed, it is especially in this kind of painstakingly precise work that the “culturomic” tool may revolutionise the way we work. Research that could previously fill a four-year PhD project can now be done in 20 minutes on a laptop.

Frequencies for the word "ésotérisme" in the French corpus (1800-2008).

It was long thought that the first instance of a modern noun for esotericism was the French “ésotérisme”, appearing in Jacques Matter’s Histoire critique du gnosticisme from 1828 (Laurant 1992: 19; cf. Hanegraaff 2010). Last year it was shown that earlier instances did indeed exist in the German “Esoterik” and the corresponding “Esoteriker” (Neugebauer-Wölk 2010). In English, the term is known to have been popularised with Theosophy as late as the 1870s and 1880s, but not much systematic philological scholarship exists.

Three searches and three minutes is all it takes to get a better picture.

In the French corpus, it turns out that Jacques Matter is predated by two other references to ésotérisme. Both are (coincidentally, it would seem) from 1811: the second volume of Pierre Leroux’s De l’humanité, de son principe et de son avenir, and volume 9 of Henri Martin’s Histoire de France. Both references use esotericism dismissively about features of religion the authors don’t like: the esotericism of the Essenes and Pharisees in the case of Leroux, and that of the Papacy in the case of Martin (although the latter is more ambiguous, distinguishing between the esotericism of the “ancient Orient” and the “negative esotericism” of the “sceptical philosophers”).

The word "Esoterik" in German publications (1800-2008).

When we search for the German “Esoterik”, we find that the word is not only much more frequent in the German data set than in the French during this period (i.e. the early 1800s), but also that it appears in a number of different sources already in the 1780s, confirming Monika Neugebauer-Wölk’s recent findings. The earliest German reference that we find is “Esoteriker” (“esotericist”) rather than Esoterik, appearing in association with Pythagoras (and apparently as synonymous with “Mathematiker”), in the first volume of Christoph Meiners, Geschichte des Ursprungs, Fortgangs und Verfalls der Wissenschaften in Griechenland und Rom (1781). The second reference, in vol. 2 of Archiv für Freimäurer und Rosenkreuzer, by Konrad Friedrich Uden, seems to derive from the former, because here as well we find “Esoteriker oder Mathematiker” connected to Pythagoras. And that is it. Similar references to esotericists and esotericism keep popping up in German histories of philosophy, being joined by theological literature in the early 19th century, until, by 1840, there is even a dictionary entry for “Esoteriker” in Vollständiges Wörterbuch der deutschen Sprache.

Here, however, we encounter a weakness, because Neugebauer-Wölk actually found references that were even earlier, references that do not show up in the database search. This reminds us that the tool is not yet perfect; in particular, we should recall that the Harvard researchers had already warned that the corpora are less accurate and complete before 1800, and especially so in the non-English corpora. Among other things, this has to do with the difficulty of applying OCR technology to these older prints – particularly, it seems that the letter recognition runs into trouble with the Gothic script used in German publications of this era. Indeed, making a separate search for the titles which Neugebauer-Wölk had found (e.g. Meiners’ Revision der Philosophie, 1772) shows that the books have been digitised and registered, but that they are not yet searchable. When the technology has improved, and the number of pre-1800 books increased, we can only imagine what kinds of findings may be achieved.

Returning now to English, we find no reference to the noun “esotericism” before 1838, when it shows up in a letter to the editor of The Christian Observer. The author connects esotericism to exotericism and uses it as a derogatory term. Yet another sense is found in the surprising second reference to “esotericism” in the English language. It is found in an article in the Quarterly Review in 1842, on the highly esoteric topic of gardening:

“To produce new seedling varieties of one’s own, by hybridizing and other mysteries of the priests of Flora, is indeed the highest pleasure and the deepest esotericism of the art.”

After this there are several other references, mostly in political history (“the esotericism of the High Whig Party”) and yet more theological polemical literature, until the occultist literature starts growing in the 1880s.

This quick analysis should already make it clear that central research questions in the field of esotericism can benefit greatly from the new culturomic tool, brought to you by Google Inc.

I will be back with more examples of relevant uses in a later installment.

Full references:

UPDATE: It has surfaced that the two French references are in fact later than 1811. For a full update and correction, see the new errata post.

 

Creative Commons License
This work by Egil Asprem was first published on Heterodoxology. It is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.


The URI to TrackBack this entry is: https://heterodoxology.com/2011/01/26/quantitative-heterodoxology/trackback/


11 Comments

  1. Nice work! I, too, was pleased when I found out about this innovation. It’s a great corpus of texts with a lot of possibilities for the future.

    As you already show, however, there are things to be aware of if you intend to use it for serious research.
    First of all, Google themselves admit that the OCR doesn’t work as well on older texts, which builds in some margin of error.

    Far more important, however, are the linguistic limitations that, as far as I know, are still there. A few examples:

    Orthography: you have to know the spelling of the words you’re looking for. Certainly in older texts (e.g. <1800) there will be little or no standardised spelling, deflating the proportion of the spelling you are looking for in older periods. You can remedy this by specifying as many orthographic variants as you can think of and adding them together, but the program can't do it for you, and more importantly, it doesn't know how to.
    [cf. an alchemical example: http://ngrams.googlelabs.com/graph?content=alchemy%2Calchymy%2Calchymie&year_start=1700&year_end=1900&corpus=0&smoothing=0 ]
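    The "add them together" step can be sketched as follows. This is a minimal illustration only: the function name is hypothetical and the per-year figures are invented, not taken from the Ngram viewer.

```python
def combined_frequency(series_by_spelling):
    """Sum per-year relative frequencies across spelling variants."""
    combined = {}
    for series in series_by_spelling.values():
        for year, freq in series.items():
            # A variant that is absent in a given year simply contributes nothing.
            combined[year] = combined.get(year, 0.0) + freq
    return combined


# Invented per-year relative frequencies, for illustration only.
variants = {
    "alchemy": {1700: 0.25, 1800: 0.5},
    "alchymy": {1700: 0.5, 1800: 0.125},
    "alchymie": {1700: 0.125},
}

total = combined_frequency(variants)
```

    Simple addition is valid here because every variant's ratio for a given year shares the same denominator (the total number of words printed that year). The hard part, knowing which variants to list, still has to be done by hand.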

    Synonyms/Semantic history: We don't always use the same words for the same concepts, and the meanings may or may not overlap. Here, even more so, there is room for confusion, and for gaps in the findings that the program cannot fill for you.

    [cf. another example: http://ngrams.googlelabs.com/graph?content=esotericism%2Coccultism%2C+hermeticism&year_start=1500&year_end=2000&corpus=0&smoothing=0 ]

    More frustratingly, both issues can combine, obviously.

    Nevertheless, with the right tools, these problems might be ameliorated some time in the future. Be aware, though, that in its current form, I'd say the Ngram viewer cannot be used for proper research without a firm linguistic study of the words you are going to research.

    • Oscar,
      Thanks for this very good overview of some serious linguistic difficulties. These obviously impose limitations on what one can and cannot do with it. For the time being I think one should stick to the advice of Michel et al. that it is only really good enough for English texts after 1800 at the moment. And there, too, of course, with the limitations that always come with interpreting such data. I might add that the authors discuss several of these limitations and problems, as well as the opportunities, in their article.

      • Yes, indeed, though with some proper work here and there, the model will be usable for larger periods of written history as well, I believe.

        I know there are programs and models being written to compensate for misspellings and spelling variation, for example. I am sure such features will be integrated in the future.

        By the way, a good one near the end of the article: “‘God’ is not dead but needs a new publicist.”

  2. Oh, and it’s completely useless for some kinds of linguistics research.

    If, like me, you are interested in studying the historical evolution of the past tense of the verb ‘dive’, you would compare ‘dived’ and ‘dove’. Problem is, the program can’t tell the difference between verbs and birds. D’oh.

    • That is certainly a problem. I’ve encountered it already too, for example when looking for the popularity of a certain name, which is also identical to a common noun.

  3. Note that not everybody likes the term “culturomics”, or, more importantly, that the Google NGram viewer is only a part of a larger trend to use digital tools for the humanities:

    http://www.dancohen.org/2010/12/19/initial-thoughts-on-the-google-books-ngram-viewer-and-datasets/

    • Duly noted. Thanks for the link.

  4. PLEASE NOTE:
    I have become aware of a critical error concerning the publication date of the French material mentioned in this post. An update will follow shortly, but for now: It would seem that Google *cannot* help us to find a French reference to “ésotérisme” prior to Jacques Matter.

  5. […] and a lesson of caution for “culturomics” In the previous post I shared my enthusiasm about possible applications for digital, quantitative tools for studying […]

  6. […] the previous post I shared my enthusiasm about possible applications for digital, quantitative tools for studying […]

  7. Homeopathy: a “culturomic” approach…

    Where does the term “homeopathy” come from, and when was it introduced into the lexicon?……

