Research

Each literary character in a book - at least if the author is good - should speak somewhat differently from all the other characters. In his pioneering study of Jane Austen (1987), Prof. John Burrows showed that this difference relies on the occurrence of "non-meaningful" yet very frequent words like "I," "can," or "or" as well as on that of "keywords" like "love," "brother," or "body." Burrows also showed that similar characters in different books by Austen (Elizabeth and Elinor, for instance) may speak in a similar way.

I have tried to apply the same approach to the all-time Polish classic, Henryk Sienkiewicz's Trilogy. A series of novels that share some of their characters is an especially interesting field for a statistical comparison of this kind. I then went a step further and compared the speech of corresponding characters not only across the three parts of the series in Polish but also in the Trilogy's two English translations. The aim was to investigate whether differences in the characters' "idiolects" travel across languages.

The computer is used first to find the most frequent words of each text and then to create matrices of their relative frequencies. These are then processed in a statistical package using a procedure called "multidimensional scaling" to produce two-dimensional diagrams, or "maps," showing the relative "distances" between the individual languages of major characters.
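For readers who like to see such a procedure spelled out, here is a minimal sketch in Python of the two steps just described. It is an illustration under stated assumptions, not the software used in this study: the function names are hypothetical, the distance between characters is assumed to be plain Euclidean distance between their rows of relative frequencies, and the scaling is the "classical" (Torgerson) variant, which may differ from the one a given statistical package applies.

```python
import math
import re
from collections import Counter


def relative_frequencies(speeches, n_words=30):
    """Map {character: text} to (names, words, matrix); one matrix row per character."""
    tokens = {name: re.findall(r"\w+", text.lower()) for name, text in speeches.items()}
    totals = Counter()
    for toks in tokens.values():
        totals.update(toks)
    # The most frequent words of the whole corpus define the matrix columns.
    words = [word for word, _ in totals.most_common(n_words)]
    names = list(speeches)
    matrix = []
    for name in names:
        counts = Counter(tokens[name])
        size = len(tokens[name]) or 1  # guard against an empty speech
        matrix.append([counts[word] / size for word in words])
    return names, words, matrix


def euclidean_distances(matrix):
    """Pairwise distances between rows (assumed measure, see the note above)."""
    return [[math.dist(a, b) for b in matrix] for a in matrix]


def classical_mds(dist, dims=2, iterations=500):
    """Classical scaling of a symmetric distance matrix into `dims` coordinates.

    Assumes Euclidean-like distances, so the centred matrix has no dominant
    negative eigenvalue. Eigenvectors are found by simple power iteration.
    """
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # Double centring: B = -1/2 * J * D^2 * J
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    coords = [[] for _ in range(n)]
    for axis in range(dims):
        v = [1.0 / (i + 1) for i in range(n)]  # fixed, deterministic start vector
        value = 0.0
        for step in range(iterations):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract on this axis
                value = 0.0
                break
            v = [x / norm for x in w]
            value = norm
        for i in range(n):
            coords[i].append(math.sqrt(value) * v[i])
        # Deflation: remove the component just found before the next axis.
        for i in range(n):
            for j in range(n):
                b[i][j] -= value * v[i] * v[j]
    return coords
```

Given a dictionary of each character's collected utterances, one would call relative_frequencies, pass the resulting matrix to euclidean_distances, and then plot the pairs returned by classical_mds to obtain a "map" of the kind described above.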

Still interested?


© Jan Rybicki 2003