
Revised 4 January 2005

Why Computer Authorship Studies Continue to Be Inconclusive

Like any new tool, the computer opened new and unexpected possibilities. One of those has been hope, for those interested in questions of authorship, that computers might prove better at attribution than their human programmers.

Perhaps, it was thought, with the computer's ability to reduce a text quickly to word lists and statistics, it would be possible to settle longstanding debates over the authorship of specific texts or even whole canons.

Who wrote the books of the Bible? Who wrote its letters? Who wrote the Federalist Papers?

Yet now, three full decades into the personal computer age, authorship questions seem, for the most part, as puzzling as they ever were.

There are a number of reasons for this, but the foremost is that language is a volitional tool. It is always at the command of the writer.  In a real sense every word in a work is there because the author wanted it there.

He or she has "automatically" considered many alternative words and turns of expression.

In that last sentence, for example, I considered several possibilities for "he or she"; I thought about leaving out the "has"; for "automatically" I weighed "intentionally," "subconsciously," and "routinely." Likewise for "alternative." For "words," which might have been "phrases" or "expressions" or even "banter," I settled for the ordinary.

This is why the T/tr (type/token ratio: the number of distinct words, or types, divided by the total word count, or tokens) of a work is "unique" or "characteristic" of a given author. It is why the average word length of a work is a good indication of who wrote it. So is the average sentence length.
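To make the measure concrete, here is a minimal sketch of a T/tr calculation in Python. The lowercasing and the simple pattern for pulling out words are my assumptions; a real study would have to settle spelling variants, hyphenation, and the like.

```python
# A minimal sketch of a type/token ratio (T/tr), assuming naive
# lowercased tokenization; not any published program's exact method.
import re

def type_token_ratio(text: str) -> float:
    """Return distinct words (types) divided by total words (tokens)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

sample = "The rain in Spain stays mainly in the plain."
print(type_token_ratio(sample))  # 7 types / 9 tokens, about 0.778
```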

The rub is that we cannot imagine a writer who could not, if he or she desired, change the frequency of "and" and "the" within any given text. Or the average length of words. A writer might resolve not to use the word "the" in a text, and could do it.

Nor can we imagine a writer with a vocabulary as large as Shakespeare's being unable to work the word "ocular" or "unsought" into any twenty- or thirty-thousand-word text more than a dozen times without us noticing it. This is the flaw in those authorship tests that rely on "rare" words. If an author rarely uses a word and we have two works in which it appears just once each, does that mean they have the same author? Or if we know in advance that two texts share an author and the word appears in only one of them, does that mean the other text isn't that author's after all? It's just silly, I'm sorry to say.

It makes much more sense to take all the words and no samples. That's easy using a computer. We then ask how all the words relate to one another. Is there a core vocabulary involved? We can test this, as I will show below, by combining two texts into one and making the computer read them as a single text. Do we then find more or fewer types among the tokens than we would expect? If fewer, the two texts are related by a core vocabulary; if more, they aren't. And that's a fact. A numerical fact. It cannot be disputed. Ule and I established these facts, because we were the first to do this sort of work.
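As a rough illustration of this combine-and-count idea, and not the original programs Ule and I used, the following sketch reads two texts as one and compares the pooled type count with the sum of the separate counts. A shared core vocabulary pulls the pooled count well below the sum; the function names and tokenization are mine.

```python
# A hedged sketch of combining two texts and counting types, under the
# same naive tokenization assumed above.
import re

def types(text: str) -> set[str]:
    return set(re.findall(r"[a-z']+", text.lower()))

def vocabulary_overlap(text_a: str, text_b: str) -> float:
    """Fraction of the pooled vocabulary that the two texts share."""
    a, b = types(text_a), types(text_b)
    combined = a | b                      # reading the two texts as one
    shared = (len(a) + len(b)) - len(combined)
    return shared / len(combined) if combined else 0.0
```

On this reading, two texts built from one core vocabulary would show a high overlap fraction, while texts by vocabularily distant writers would show a low one.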

The computer, of course, will notice it, because a computer makes a list of all the words, as well as "addresses" of where they appear, so it can "access" them at the command of the programmer or user.

Apart from this fundamental problem, language, as opposed to a code, is openly shared among numerous speakers or readers. If it weren’t, a writer couldn’t be understood.

This means that language isn't like our genetic code or our fingerprints. It's something we have in common with thousands, perhaps billions, of people.

So while many writers would like to imagine, as Agathon did in Plato's Symposium, that our intellectual offspring are more uniquely ours than our children, this simply isn't the case.

Anyone who speaks our language could mimic our style, and even our very words, if he or she set out to do so.

This doesn’t mean that anyone could write the same novel, play or poem that we would write, but it does make it very difficult, if not impossible, to establish authorship based on words and word usage.

To give a broader analogy, photographs and handwriting are not relied upon, forensically, to identify anyone with certainty. Likewise, writing style is simply not individual enough to be used as a reliable tool for identification.

So we should ask what computers can do, if they can't identify authors with any more certainty than we can, and perhaps with far less.

Computers are quite good at making word lists and concordances, and at tracking the average length of words, sentences, and paragraphs.

If we agree on which words are prepositions and conjunctions, a properly programmed computer can tell us precisely how many appear in any given text. A problem here is that context changes word use, so a word may change not simply its meaning but its part of speech depending on how it is used.

Properly programmed, a computer can also quickly tell us which words are used most frequently, and a list of words ranked by frequency of appearance can be generated in moments.
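A frequency list of this kind takes only a few lines; the tokenization is again the naive scheme assumed above, and collections.Counter does the ranking.

```python
# A minimal frequency list of the kind described.
import re
from collections import Counter

def frequency_list(text: str, top: int = 20) -> list[tuple[str, int]]:
    """Return the `top` most frequent words with their counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(top)
```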

Computers can find a given word or quotation nearly instantaneously, certainly far more quickly than a reader can, even if the reader knows where to look.

So computers have their place.

But consider this thought experiment. Suppose a computer study of two texts showed them to contain precisely the same number of words, and that their average word lengths were the same, as well as their average sentence and paragraph lengths.

Now a computer could do this for two texts in different languages, and it would then foolishly come to us and say, "Look, Boss, these two texts seem to have the same authorship characteristics." We, however, would know better at a glance: one is in French, the other in English.

Or suppose they were in the same language. Imagine we organized a computer study of a group of anonymous diaries and a match was produced, but upon inspection one text was clearly about modern times and another about a much earlier time, making it impossible for them to share an author.

No stylistic program could make those kinds of calls unaided.

Now let's take the works of Shakespeare, Marlowe, Bacon, Oxford, and Jonson, in whom we are all so keenly interested. What would we use as a base for Shakespeare? Would we use A Yorkshire Tragedy? Or just the plays in the First Folio? Do we exclude the poems or count them in?

Since Shakespeare’s canon is fraught with questions, it may be better to consider a more certain canon, like Jonson’s.

But suppose for a moment we had all the plays written during the period to serve as a database. There may be many readers who have read all of these plays, and a computer can now easily handle such a task.

Paradoxically, this wouldn't be helpful for the computer, because the first match for any play would be with the play itself. So that match would have to be ignored.

Suppose we take the group of all plays and ask the computer to arrange them by average word length. Suppose several plays so arranged form a "cluster," but we know in advance that the cluster contains several authors, not one. Do we now doubt the attributions? Or do we conclude the method is not precise enough to determine authorship? What would our control be? We have no time machine.
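The sorting experiment itself is trivial to set up. A sketch, assuming each play is available as a plain-text string keyed by title:

```python
# Sort a group of plays by average word length; plays whose averages
# fall close together form the "clusters" in question. The tokenization
# is the naive scheme assumed throughout.
import re

def average_word_length(text: str) -> float:
    words = re.findall(r"[a-z']+", text.lower())
    return sum(len(w) for w in words) / len(words) if words else 0.0

def sort_by_word_length(plays: dict[str, str]) -> list[tuple[str, float]]:
    """Return (title, average word length) pairs, shortest words first."""
    return sorted(((title, average_word_length(body))
                   for title, body in plays.items()),
                  key=lambda pair: pair[1])
```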

Mendenhall first used word length in an attempt to show that Bacon might have written Shakespeare's works, but soon proved that only Marlowe could have written them based on this single criterion. Is the method valid? How can we prove it?

Science works on experiments that can be duplicated, and it frequently uses "double blind" studies. It would be quite simple to take any number of texts whose authorship is pre-established, have a computer process them, and cluster them by word length.

If it obtained, over and over, clusters of similar authors we would know the technique is sound. And it surely seems to be.

But we can always imagine or suppose an author might have used words of different lengths for literary or intellectual reasons.

We can do the same thing with sentence length, T/trs or whatever, with the same results.

Now it may be fun to discover that a certain play has similar stylometric characteristics to another...but fun isn’t proof. And proof was what we wanted.

I’ve pointed out that Don Foster’s Shaxicon program has not successfully handled known texts or completed double blind experiments. I’ve also pointed out its popularity comes from the fact that it supports the consensus opinion, in most cases, about the texts in question.

It's the kind of program that uses "samples," and it thus lacks the power of a program like Ule's, or a method like Pace, which take no samples.

Had Foster, for example, "proven" that all of Shakespeare's texts were written by Bacon or Oxford or Marlowe, his popularity would, I dare say, be quite low.

Mendenhall and Ule, using vastly different methods, have "proven" that the works of Marlowe and Shakespeare are often "closer" than works within their own canons. But these methods are not popular among the consensus holders, who continue to devise new tests for authorship which are more in line with the consensus opinion.

These programs, in my opinion, are fads and lack scientific value.

My own program, or rather method, called Pace, which simply ranked works according to their T/tr, or, to be a bit more precise, treated it as a "rate" rather than as a ratio, also suggests that Marlowe and Shakespeare were the same author.

Consider the question of the authorship of Hero and Leander, much debated. The computer proved that H&L has three virtually equal parts. Marlowe is said to have completed the first part and Chapman the final two. Pace shows the parts within a few words of one another in tokens. This is odd enough to suggest a common plan and thus a common author. But the program did something else: it proved that the actual vocabularies of all three parts were within 50 words or so of each other, i.e., in types. Now imagine that. Two authors working independently, one years after the other died, both write essentially 6,000-word texts (one does this twice), and both use essentially 1,800 words of vocabulary, or types, to do so. That's odd. Really odd.

I then did something no one else had bothered to do: I gave the entire work to the computer to study. And guess what happened? Its T/tr placed it between three other works said to be Marlowe's: Dido, Ovid's Elegies, and Faustus (1604). All four sit shoulder to shoulder in a list of over a hundred Elizabethan texts. Now that's really odd. The a priori expectation would have been that Chapman would have used another core vocabulary (his own), so the entire text (all three parts) would have had a higher T/tr than anything written solely by Marlowe. But this didn't happen. The computer suggests, strongly, that Hero and Leander has but one author, and that author was Marlowe. What's happening is that the "two" vocabularies are blending, not clashing.

Indeed, Pace has the added value of being able to sort a canon and to suggest where a given work falls within it.

The suggestion arises from a well established fact.

A writer's vocabulary, when viewed over his or her lifetime, resembles a bell curve. It starts out low and climbs, and then declines into what Shakespeare called "hateful silence."

When we have an entire canon to review, T/trs thus provide us with an indication as to where, on that bell curve, a particular play belongs.

However, because the curve has the same T/trs at both ends, a question will always persist as to which end a given work belongs on. Luckily, scholars often have enough data to know whether a work was early or late.

When we have a group of plays of unknown dates, it's nice to be able to sort them "objectively." Pace sorted the two parts of Henry IV, for example, and placed them side by side. But contrary to the consensus opinion, Pace would place The Merry Wives of Windsor with Romeo and Juliet, far too early for the consensus.

For those interested in Pace, see my essay "Pace: A Test for Authorship Based on the Rate at Which New Words Enter an Author's Text," Literary and Linguistic Computing, Vol. 3, No. 1, 1988.

It was reviewed by Darrel Ince in The Independent, 14 August 1989, p. 14. Ince, whom I have never met nor corresponded with, wrote that Pace "has at least three advantages over past work. First, it is computationally very simple and does not require very much programming or advanced statistics. [I'd say none at all.] Second, it can be applied to works of different genres such as essays, plays and novels. Third, it matches our perceptions of how great writers grow in stature as they get older."

Professor Ince even went beyond my paper and noticed, as I had suggested in an endnote, that authorship might be discovered by "bisecting the anonymous work and combining it with a work whose attribution is certain. If the pace of the combined piece changes, then the author whose work has been used has not written the anonymous work. However, if it stays constant, then there is a high probability that the author has been found."

Ince proved very much correct here. The process is quite simple. One merely "bisects" an anonymous text and a text which is not anonymous, recombines the halves, and notes the T/tr. When two authors are combined, the computer will detect a second vocabulary. For example, if one text is French and the other English, the T/tr will double.

If the two writers share a language, the effect will not be so pronounced, but a second vocabulary should still surface, and the result is an elevation in the T/tr.
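Here is one way the bisect-and-recombine test might be coded. The essay gives only the outline, so the details below, splicing the first half of one text onto the second half of the other and comparing T/trs, are my assumptions rather than the published procedure:

```python
# A hedged sketch of the bisect-and-recombine test Ince described.
import re

def ttr(text: str) -> float:
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def recombined_ttr(anonymous: str, known: str) -> float:
    """Splice the first half of the anonymous text onto the second
    half of the known text and return the hybrid's T/tr."""
    a_words = anonymous.split()
    k_words = known.split()
    hybrid = " ".join(a_words[: len(a_words) // 2]
                      + k_words[len(k_words) // 2:])
    return ttr(hybrid)

# Compare recombined_ttr(anon, known) against ttr(known): a noticeably
# elevated hybrid T/tr means a second vocabulary has surfaced, and the
# known author probably did not write the anonymous text.
```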

I’ve tried it over and over and it has never failed to work. (As I did with Hero and Leander, above.)

Just for the record, my study contains a peculiar error that was noticed later. It arose during an exchange with the unnamed peer reviewer for LLC, who suggested and then required me to convert my table from a simple T/tr to a percentage, apparently for clarity.

Then, a year or so after publication, perhaps the same reviewer, no longer anonymous, argued that the table wasn't kosher because it was actually a simple T/tr converted to a percentage. This critic correctly noted that T/trs, in large studies, have proven worthless.

This reviewer, Louis T. Milic, seems to have missed the point that Pace does not take samples, nor does it blend texts into huge extracanonical aggregates.

Pace was devised to study the works in a canon and to organize them according to their T/trs. It does not take samples and does not use statistics. In this it works quite well. Whether the T/tr is expressed as a ratio or as a percentage is meaningless, since the relative placement of a text does not change.

When a work is under question and may be part of a canon, Pace also tells us the most likely time it was written, which is a considerable help to scholars.

Moreover, as Professor Ince noted, Pace can be used to suggest authorship when bisected texts are recombined and studied.

However Pace is not foolproof.

One of the entirely unexpected things Pace discovered was that a great writer has the ability to generate types among tokens at a fairly steady rate, regardless of the length of the text (within obvious limits).

Thus, for example, Two Gentlemen of Verona and Henry IV, Pt. II are within .06% of the same "pace," even though Henry IV, Pt. II is nearly half again as long as Two Gentlemen.

I see this ability as best expressed as a rate, rather than as a fixed ratio.
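Read as a rate, the measure becomes a running curve rather than a single number. A sketch of that reading, which is my gloss rather than the exact definition in the 1988 paper:

```python
# Track how many new words (types) have entered the text after each
# token, rather than taking one ratio at the end.
import re

def pace_curve(text: str) -> list[float]:
    """Cumulative types per token at each point in the text."""
    seen: set[str] = set()
    curve: list[float] = []
    for i, token in enumerate(re.findall(r"[a-z']+", text.lower()),
                              start=1):
        seen.add(token)
        curve.append(len(seen) / i)
    return curve

# A steady writer's curve flattens to a characteristic level and holds
# it; comparing where two curves settle is comparing their "pace".
```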

In any case, my interest in these matters, at the highest levels, has a documented history and publication record. I am quite content to let history judge whether Ince, a professor of mathematics, or Milic was right.

To view the table as it was published: Scan One, Scan Two, Scan Three.
