WRITTEN BY DOSTOEVSKY?
Geir Kjetsaa, Oslo University
It is well known that Dostoevsky was not only a great novelist but also a prolific journalist. His first article was written in the same year as his first novel (1845), and the last one was published only after his death (1881). Most of his articles, comprising three big volumes in his collected works, were published during his editorship of the journals "Time" (1861-63), "Epoch" (1864-65), "The Citizen" (1873-74), and "The Diary of a Writer" (1876-77, 1880-81), all of which had a profound influence on Russian thought.
No wonder then that scholars have been much concerned with Dostoevsky's journalism. A substantial amount of criticism has been devoted to questions such as the political and ideological orientation of his journals, and the influence of Dostoevsky the journalist on Dostoevsky the novelist. However, no final answer has yet been given to the fundamental question of exactly which articles published under Dostoevsky's editorship were actually written by him.
Unlike the articles in "The Diary of a Writer", most of the contributions in "Time", "Epoch" and "The Citizen" were published anonymously. An interesting discussion of the possibility that Dostoevsky wrote some of the unsigned articles in "The Citizen" has been undertaken by V. V. Vinogradov,1 but less attention has been paid to the attribution of the articles in "Time" and "Epoch".
In December 1977, when I visited Leningrad to deliver a lecture on the authorship of "The Quiet Don",2 I was asked by the Soviet Academy of Sciences to undertake a computer-oriented investigation of 12 articles published in "Time" and "Epoch". The immediate reason for the request was to assist the editors in deciding whether or not these articles ought to be included in the forthcoming volumes of the Academy edition of Dostoevsky's works, which is currently in progress.
A generous grant from The Norwegian Research Council for Science and the Humanities allowed me to start the project in March 1979. Six months later, thanks to the brilliant effort of my consultant Ivar Fonnes and my assistant Trygve Ulf Helgaker, the material, 120,000 words of undisputed Dostoevsky texts and 58,000 words of disputed texts, had been transferred to a DEC 10 computer, after which the answers sought were rapidly found.3
Even if most of the articles in "Time" and "Epoch" were published anonymously, we do possess a contemporary source for the attribution of some of them to Dostoevsky. I am referring to a list of 23 articles made by Nikolaj Strakhov and later used by Dostoevsky's widow for the first posthumous edition of his works. In the early 1860s Strakhov was a close friend and collaborator of Dostoevsky. As a regular contributor to "Time" and "Epoch" he had inside information about what Dostoevsky actually contributed, and his attributions have never been questioned by scholars.
However, during the mere two years he spent working on the journal "Time" Dostoevsky, according to his own admission, wrote up to a hundred printed sheets.5 Although this statement must be regarded as an exaggeration, there is reason to believe that he actually wrote more than the articles in Strakhov's list.
Nevertheless, attempts to extend Strakhov's list have had little success. Some sixty years ago two such attempts were made, independently of each other, by Leonid Grossman and Oskar von Schoultz.6 On the basis of a number of ideological and lexical parallels with undisputed Dostoevsky articles, von Schoultz made a list of sixteen more articles which he claimed to be either certainly or most likely written by the novelist. It later appeared, however, that several of these attributions were risky to say the least. For example, one of the articles allegedly written by Dostoevsky was found by a reviewer in a collection of Strakhov's critical articles.7
The conclusion to be drawn from the investigation made by Oskar von Schoultz is that the method of external evidence is a very dangerous way of trying to solve authorship problems. In ideologically homogeneous, "party" journals such as "Time" and "Epoch", where the contributors largely subscribed to the same views, and where the articles often were edited by Dostoevsky himself, thematic and even lexical parallels with the editor's own works would seem inevitable, providing an unreliable and far too subjective basis for any attribution.
More objective is the method of internal evidence involving the comparison of style and language only. Assuming that style may more or less be regarded as the writer's fingerprints, one has to compare the style in the anonymous articles to the style of articles known to have been written by Dostoevsky at about the same time. Then, the claim that there is no significant difference between the disputed and the undisputed articles must be set up as a null hypothesis to be either rejected or not rejected. If the comparison, conducted on the basis of a pool of parameters, shows a substantial difference between the disputed and the undisputed texts, then the assumption of their having the same origin in Dostoevsky must be rejected. On the other hand, if no statistically significant difference can be found and the null hypothesis cannot be rejected, this does not necessarily mean that the disputed article is written by Dostoevsky. There will always remain at least a theoretical possibility of two authors using styles and languages that cannot be sufficiently discriminated by quantitative methods. Exclusion, then, ought to be regarded as the keyword in all studies of disputed authorship. One has to approach the problem bearing in mind the maxim of Sherlock Holmes that truth can only be found by the exclusion of the impossible.
In order to detect the stylistic "fingerprints" of Dostoevsky and try to discriminate between disputed and undisputed texts, a pool of the following 15 parameters was used:
For the first 8 parameters I used manual coding, which is not so laborious as it sounds: some 6,000 sentences were coded in less than three weeks. The coding procedure is a two-numbered one. The first number is used to designate position in the sentence: 1 for first position, 2 for second position, 3 for third position from the end, 4 for second position from the end, and 5 for final position. The second number is used to designate category or part of speech: 0 stands for adjective, 1 for preposition, 2 for adverb, 3 for conjunction, 4 for pronoun, 5 for noun (in all functions except as subject), 6 for noun in the function of subject, 7 for verb, 8 for terminator (. ! ?), and 9 for comma. Sentences consisting of one word were omitted from registration, whereas in sentences of two words without a comma, the terminator was regarded as a part of speech in the final position of the sentence.
The resulting "telephone numbers" were easily subjected to data processing, providing information about the distribution and combinations of parts of speech. Tables 1 and 2 show the results for "Petersburg Dreams in Verse and Prose", a text undisputedly written by Dostoevsky. In Table 2 the demonstration of the results has been restricted to the top twenty combinations only.
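The two-number coding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program that ran on the DEC 10: the category names, the function names `encode_sentence` and `opener_percentages`, and the rule of skipping sentences with fewer than five coded units are all hypothetical simplifications. The category digits themselves are taken from the text.

```python
from collections import Counter

# Category digits as listed in the article.
CATEGORY = {
    "adjective": 0, "preposition": 1, "adverb": 2, "conjunction": 3,
    "pronoun": 4, "noun": 5, "subject_noun": 6, "verb": 7,
    "terminator": 8, "comma": 9,
}

def encode_sentence(tags):
    """Return the five two-digit codes (position digit + category digit).

    `tags` is the manually assigned sequence of category names for one
    sentence. Positions 1 and 2 count from the start; positions 3, 4 and 5
    are the last three slots.
    """
    if len(tags) < 5:
        # Assumption: short sentences are skipped in this sketch; the
        # article handles one- and two-word sentences by special rules.
        return []
    slots = {1: tags[0], 2: tags[1], 3: tags[-3], 4: tags[-2], 5: tags[-1]}
    return [f"{pos}{CATEGORY[tag]}" for pos, tag in slots.items()]

def opener_percentages(sentences):
    """Percentage of sentences opening with each (first, second) category pair."""
    counts = Counter(
        (CATEGORY[s[0]], CATEGORY[s[1]]) for s in sentences if len(s) >= 2
    )
    total = sum(counts.values())
    return {pair: 100.0 * c / total for pair, c in counts.items()}
```

With this representation, the opener "pronoun + verb" discussed in the article corresponds to the key `(4, 7)` in the dictionary returned by `opener_percentages`.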
The discriminating power of one of the parts-of-speech parameters is demonstrated in Figure 1, where the total distribution of parts of speech in one undisputed and one disputed article is compared to the total distribution of parts of speech in the whole Dostoevsky corpus.
As will be seen from Figure 1, the undisputed text "Petersburg Dreams in Verse and Prose" is much closer to Dostoevsky than is the disputed text "The Exhibition in the Academy of Arts: 1860-1861". A typical feature of the Dostoevsky populations, as compared to the disputed text, is the high proportion of pronouns and verbs and the low proportion of nouns, confirming Dostoevsky as a more "dynamic" and less "object-oriented" writer than the author of "The Exhibition". Significantly, the most popular sentence opener in Dostoevsky is "pronoun + verb" (4 - 7), which is found on average in 7.15 % of the sentences (as will be seen from Table 2a, in "Petersburg Dreams" this opener has an even higher percentage: 9.32). In "The Exhibition", on the other hand, the opener 4 - 7 is found at the bottom of the list, being used in only 0.54 % of the sentences.
That the author of "The Exhibition", contrary to numerous statements made by scholars, must have been a person other than Dostoevsky is clearly borne out by a χ²-test of the total distribution of parts of speech (in %). Using the formula

$$\chi^2 = \sum_{i=1}^{10} \frac{(f_i - n p_i)^2}{n p_i}$$

where "n" is the size of the text, and "f" and "p" are the observed and relative frequencies of the 10 different categories, and comparing "The Exhibition" to the total Dostoevsky corpus, we get an empirical χ² value of 222.31, whereas the same comparison for "Petersburg Dreams" yields an empirical χ² value of only 5.21. With 9 degrees of freedom, the critical χ² value at a 0.01 (1 %) significance level is 21.67, which gives us the right to exclude the possibility of "The Exhibition" originating from Dostoevsky.
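As a rough illustration of the test just described, the following Python sketch computes a goodness-of-fit χ² statistic for a text against corpus proportions and compares it with the critical value quoted above. It assumes the standard Pearson form of the statistic. The function names are hypothetical, and any counts passed in would have to come from the coded material, which is not reproduced here.

```python
# Critical chi-square value for 9 degrees of freedom at the 0.01 level,
# as quoted in the article.
CRITICAL_9DF_001 = 21.67

def chi_square(observed_counts, corpus_proportions):
    """Pearson chi-square of observed category counts against corpus proportions.

    observed_counts    -- counts f_i of each category in the text under test
    corpus_proportions -- relative frequencies p_i of the same categories in
                          the reference corpus (must sum to 1)
    """
    n = sum(observed_counts)  # size of the text
    return sum(
        (f - n * p) ** 2 / (n * p)
        for f, p in zip(observed_counts, corpus_proportions)
    )

def can_exclude(statistic, critical=CRITICAL_9DF_001):
    """True when the statistic exceeds the critical value (null hypothesis rejected)."""
    return statistic > critical
```

Applied to the two values reported in the article, `can_exclude(222.31)` is true for "The Exhibition" and `can_exclude(5.21)` is false for "Petersburg Dreams".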
Lack of space prevents me from going into all the parameters used for this investigation.8 However, I should like to say a few words about the parameter that was found to have the greatest discriminating power, viz. the type-token ratio.
Many readers of Dostoevsky will probably have noticed his tendency to repeat his words over and over again. Thus in "A Gentle Creature" the hero emphatically exclaims: "Glupo, glupo, glupo i glupo!" (Foolish, foolish, foolish, and foolish!) In the articles, where Dostoevsky often seems to play the part of an orator, trying to mesmerize his readers with the magic of words, this tendency is even more obvious. Very common, in particular, is the use of anaphora, i.e. the beginning of a number of successive sentences with the same words, forming either chains of arguments (Znajut... Znajut... Znajut...; They know... They know... They know...) or rhetorical questions (Neuzheli... Neuzheli... Neuzheli...; Is it really... Is it really... Is it really...).
The suspicion that Dostoevsky has a limited vocabulary, so comforting for less brilliant writers, was thoroughly confirmed by the computer. In order to use the type-token ratio as a basis for statistical tests, the computer was asked to print out the number of different word-forms per 500 tokens. Even if the procedure was an expensive one, since the computing had to be done anew for every 500 tokens throughout the texts, the effort was richly rewarded. It turned out that the Dostoevsky corpus, consisting of 225 samples, had an average of only 307 different word-forms per 500 tokens. By way of comparison, parts of "The Quiet Don" were found to have an average of 380. Clearly, the difference is enormous, showing Sholokhov and Dostoevsky to have quite different stylistic fingerprints in terms of richness of vocabulary.
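The count of different word-forms per 500 tokens can be sketched as follows. This is a minimal illustration, assuming consecutive non-overlapping samples, a simple lower-casing tokenizer, and that an incomplete final sample is dropped. None of these details is specified in the article, and the function names are hypothetical.

```python
import re

def tokenize(text):
    """Split text into lower-cased word tokens (works for Cyrillic as well)."""
    return re.findall(r"\w+", text.lower())

def word_forms_per_sample(tokens, sample_size=500):
    """Number of different word-forms in each consecutive sample of tokens.

    Assumption: samples do not overlap, and a trailing sample shorter
    than `sample_size` is discarded.
    """
    counts = []
    for start in range(0, len(tokens) - sample_size + 1, sample_size):
        sample = tokens[start:start + sample_size]
        counts.append(len(set(sample)))
    return counts
```

Averaging the returned list over all samples of a corpus gives a figure comparable to the 307 and 380 word-forms per 500 tokens quoted above.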
The approximately normal distribution of different word-forms per 500 tokens in Dostoevsky's texts permitted the use of Student's t-test to exclude the possibility of a number of the disputed texts having been written by Dostoevsky. The deviations of the different texts from the Dostoevsky corpus can best be demonstrated by the use of confidence intervals. We then compute the absolute error of the mean using the formula

$$\varepsilon = t \cdot \frac{s}{\sqrt{n}}$$

where "s" is the standard deviation of the samples, "n" is the number of samples, and "t" is given by Student's t-distribution for the chosen level of confidence and degrees of freedom (ν = n − 1). If we choose a confidence level of 0.99 (99 %), we get for "The Exhibition" ε = 14.555.

The confidence interval is then established by m ± ε = 334.214 ± 14.555, i.e. by the confidence limits 320...349. This means that in 99 out of 100 experiments the mean of the different word-forms per 500 tokens of running text will lie between 320 and 349. Obviously, this is much too high for "The Exhibition" to have been written by Dostoevsky, whose texts, taken together, were found to have confidence limits of only 304...310. The following graph gives a representation of the confidence intervals in our texts (Figure 2).
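The confidence limits can be reproduced with a short Python sketch. It assumes the usual form of the absolute error, t · s / √n, with s the sample standard deviation. The critical t value is passed in by the caller (taken from a t-table for n − 1 degrees of freedom) rather than computed. The function names are hypothetical, and the overlap check is only a reading aid for a graph such as Figure 2, not the t-test the article uses for exclusion.

```python
import math
import statistics

def confidence_limits(sample_counts, t_value):
    """Return (lower, upper) confidence limits for the mean of the samples.

    sample_counts -- different word-forms per 500 tokens, one value per sample
    t_value       -- critical value of Student's t for len(sample_counts) - 1
                     degrees of freedom at the chosen confidence level
    """
    n = len(sample_counts)
    m = statistics.mean(sample_counts)
    s = statistics.stdev(sample_counts)   # sample standard deviation
    eps = t_value * s / math.sqrt(n)      # absolute error of the mean
    return m - eps, m + eps

def intervals_overlap(a, b):
    """True when two (lower, upper) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]
```

With the figures given in the article, the interval for "The Exhibition", (334.214 − 14.555, 334.214 + 14.555), does not overlap the corpus interval (304, 310).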
In texts consisting of fewer than three samples (9, 11, 22-25, VI, VIII) the confidence intervals, if established, will always tend to become too wide to allow exclusion. As for the other texts, the likelihood of Dostoevsky's authorship can be measured by their distance from the column formed by the whole Dostoevsky-corpus interval (304-310). It may be seen that all the undisputed texts are more or less covered by the column, with the interesting exception of text 13, which is not included in Strakhov's list, but has been attributed to Dostoevsky both by Leonid Grossman and Oskar von Schoultz. However, a much less Dostoevskian profile is demonstrated by a number of the disputed texts, in particular by texts IV, X and XII, which are definitely uncharacteristic of Dostoevsky as far as richness of vocabulary is concerned. As a matter of fact, using Student's t-test, we can exclude the possibility of text XI having been written by Dostoevsky as well.
On the whole, the greater part of the 15 parameters proved to have a high discriminating power. The main problem was caused by lack of satisfactory consistency within the Dostoevsky texts, some of which are rather heterogeneous in style and genre, ranging from polemical pamphlets to philosophical essays.
Now the problem of consistency is well known to any investigator of disputed authorship. The best way of fighting it is probably to be very selective in picking out the sentences for comparison. Thus, in our investigation of the charge of plagiarism against Sholokhov, to ensure that the sentences would be as independent as possible of their context, we excluded paragraphs containing direct speech, a report of some character's thoughts, and questions. This time, the small size of some of the disputed articles ruled out such an approach and called for the maximum exploitation of the material. (However, poems and quotations exceeding 10 words were excluded from excerption.) Instead, the problem of insufficient consistency was met by lowering the significance level for some of the parameters. But even at a 0.001 level, some of Dostoevsky's texts, especially those not included in Strakhov's list, showed a statistically significant deviation from the Dostoevsky population as a whole. This circumstance, of course, had to be taken into account when attempts were made to exclude disputed texts. No disputed text could be excluded unless it had a higher statistical value than the "least Dostoevskian" text included in the main corpus. However, a number of the disputed texts, going through the different parameters, rapidly formed a group of their own, showing a night-and-day difference from the Dostoevsky corpus. In Table 3 such instances, allowing us to reject the null hypothesis of no significant difference between the disputed text and the Dostoevsky corpus, are indicated by a minus sign (-), whereas plus (+) means that no exclusion can be made on the basis of the parameter used.
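The exclusion rule stated above can be written as a small Python sketch. It assumes that "statistical value" means the test statistic for a given parameter, and that a disputed text must exceed both the critical value and the largest statistic found among the undisputed texts. The function names are hypothetical, and the marks follow the plus and minus convention described for Table 3.

```python
def may_exclude(disputed_stat, undisputed_stats, critical_value):
    """Apply the exclusion rule for a single parameter.

    disputed_stat    -- test statistic of the disputed text against the corpus
    undisputed_stats -- statistics of the individual undisputed texts
    critical_value   -- critical value at the chosen significance level
    """
    # The "least Dostoevskian" undisputed text sets the bar whenever it
    # deviates more strongly than the critical value would require.
    threshold = max(critical_value, max(undisputed_stats))
    return disputed_stat > threshold

def table_mark(disputed_stat, undisputed_stats, critical_value):
    """Return '-' when the text can be excluded on this parameter, else '+'."""
    if may_exclude(disputed_stat, undisputed_stats, critical_value):
        return "-"
    return "+"
```

Under this rule a disputed text that merely exceeds the critical value is still marked with a plus if some undisputed text deviates even more strongly on the same parameter.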
In authorship studies, the maxim "never take 'no' for an answer" is definitely out of place. Of course, no anonymous text should be included in the collected works of a writer without a careful stylistic and linguistic examination. Thus, texts X and XII, with a score of 15 minuses, should not be attributed to Dostoevsky, and neither should texts IV and XI, where the requirements of only two parameters are met. On the other hand, texts I, III, VII, and IX (15 pluses), text II (14 pluses), and text V (15 pluses) might well be included, if only under the rubric dubia. Texts VI and VIII, too, reveal stylistic traits identical with those of Dostoevsky's texts. However, because of the small size of these texts (261 and 816 words respectively) any definite conclusion would seem unjustified. As indicated by 0 in Table 3, for some of the parameters no test can be undertaken at all, owing to insufficient text length.9
An interesting observation to be made in this investigation is that the parameters work together, deciding for exclusion or non-exclusion in their complexity. Even if some of the parameters may be regarded as more or less mutually related, this is hardly the only explanation. Experience indicates that once a tendency has been established it will be confirmed by any reasonable parameter. Take, for instance, the distribution of the synonymous conjunctions "chtoby" and "chtob" ('in order to'). A distinctive feature of Dostoevsky's linguistic fingerprint is the preference for "chtob", which is used 7-8 times more often than "chtoby". The same pattern is found in most of the disputed texts where, judging from the 15 parameters, the possibility of Dostoevsky's authorship cannot be excluded, whereas "chtoby" prevails in most of the minus articles. A collection of high-frequency function words found to be very typical of Dostoevsky points to the same result: while frequent in the plus articles, they are comparatively rarely used in the minus articles. It would therefore seem that, given sufficient material, the computer-oriented parameters used in this investigation make an adequate and powerful tool for approaching authorship problems.
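The preference for one conjunction variant over the other is easy to measure. The Python sketch below assumes a transliterated text in which the two forms appear as "chtob" and "chtoby". The function name and the tokenization are hypothetical, and for Cyrillic input the two arguments would have to be replaced by the corresponding Cyrillic forms.

```python
import re
from collections import Counter

def variant_ratio(text, preferred="chtob", alternative="chtoby"):
    """Ratio of occurrences of `preferred` to `alternative` in the text.

    Whole-word matching is used, so "chtoby" is never counted as "chtob".
    Returns None when the alternative form does not occur at all.
    """
    counts = Counter(re.findall(r"\w+", text.lower()))
    if counts[alternative] == 0:
        return None
    return counts[preferred] / counts[alternative]
```

A ratio of roughly 7 to 8 would match the pattern reported for the Dostoevsky corpus, while a ratio below 1 would indicate that "chtoby" prevails.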
A survey of the articles used in this investigation
(a) Texts written by Dostoevsky (119,107 words):
(b) Disputed texts (58,039 words):