Raw Thought

by Aaron Swartz

False Outliers

So far my Wikipedia script has churned through about 200 articles, calculating who wrote what in each. This morning I went through the results to see if any didn’t match my theory. The script printed out a couple, and I decided to investigate.
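The script itself isn't shown in this post, but the kind of check described above could be sketched roughly like this. Everything here is an assumption made for illustration: the function names, the 50% threshold, the idea of counting surviving characters per editor, and the sample numbers (which are made up, not taken from the study).

```python
def top_contributor_share(chars_by_editor):
    """Return (editor, share) for the editor with the most surviving text."""
    total = sum(chars_by_editor.values())
    if total == 0:
        return None, 0.0
    editor = max(chars_by_editor, key=chars_by_editor.get)
    return editor, chars_by_editor[editor] / total


def flag_outliers(articles, threshold=0.5):
    """List (title, editor, share) for articles dominated by one editor.

    The threshold is a hypothetical cutoff, not the one used in the study.
    """
    flagged = []
    for title, chars_by_editor in articles.items():
        editor, share = top_contributor_share(chars_by_editor)
        if share >= threshold:
            flagged.append((title, editor, share))
    return flagged


# Made-up character counts, for illustration only.
articles = {
    "Article A": {"EditorX": 9000, "EditorY": 500, "1.2.3.4": 500},
    "Article B": {"EditorX": 300, "EditorY": 400, "5.6.7.8": 300},
}

for title, editor, share in flag_outliers(articles):
    print(f"{title}: {editor} wrote {share:.0%}")
```

A flag from a check like this is only a prompt to go read the edit history; it can't tell translation, merging, or copying apart from original writing.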

The first it found was Alkane, a long technical article about acyclic saturated hydrocarbons that it said was largely written by Physchim62. Yesterday a good friend was telling me that he thought long technical articles were likely written by a single person, so I immediately thought that here was the proof that he was right. But, just to check, I decided to look in the edit history to make sure my script hadn’t made an error.

It hadn’t, I found, but once again simply looking at the numbers missed the larger point. Physchim62 had indeed contributed most of the article, but according to the edit comments, it was by translating the German version! I don’t have the German data, but presumably it was written in the same incremental way as most of the articles in my study.

The next serious case was Characters in Atlas Shrugged, which the script said was written by CatherineMunro. Again, it seemed plausible that one person could have written all those character bios. But again, an investigation into the actual edit history found that Munro hadn’t written them; instead, she’d copied them from a bunch of subpages, merging them into one bigger page.

The final serious example was Anchorage, Alaska, which appeared to have been written by JeffreyAllen1975. Here the contributions seemed quite genuine; JeffreyAllen1975 made tons of edits, each contributing a paragraph at a time. The work seemed to take quite a toll on him; at his user page he noted “I just got burned-out and tired of the online encyclopedia. My time is being taken away from me by being with Wikipedia.” He lasted about four months.

Still, something seemed fishy about JeffreyAllen1975, so I decided to investigate further. Currently, the Anchorage page has a tag noting that “The current version of the article or section reads like an advertisement.” A bit of Googling revealed why: JeffreyAllen1975’s contributions had been copied-and-pasted from other websites, like the Anchorage Chamber of Commerce (“Anchorage’s public school system is ranked among the best in the nation. … The district’s average SAT and ACT College entrance exam scores are consistently above the national average and Advanced Placement courses are offered at each of the district’s larger high schools.”).

I suspect JeffreyAllen1975 didn’t know what he was doing; his writing style suggests he’s just a kid: “In my free time, I am very proud of my-self by how much I’ve learned by making good edits on Wikipedia articles.” I’m pretty sure he just thought he was helping the project: “Wikipedia is like the real encyclopedia books (A thru Z) that you see in the library, but better.” But his plagiarism will still have to be removed.

When I started, just looking at the numbers, there seemed to be several cases that strongly contradicted my theory. And had I just stuck to looking at the numbers, I would have believed that to be the case as well. But, once again, investigation shows the picture to be far more interesting: translation, reorganization, and plagiarism. Exciting stuff!


September 5, 2006


“The world’s always more interesting than the numbers suggest.”

“The purpose of computing is insight, not numbers” - Richard W. Hamming

The original basic insight is still standing - there’s a “staff”, which does copyediting, and subject-matter writers.

What you’ve found, at a more refined level of investigation, is that some of the people who might have been thought to be the original writers are in fact simply relays (translators or plagiarists) from the true original writers.

Which is indeed interesting. Especially since it’s more likely these primary writers were in fact paid.

posted by Seth Finkelstein on September 5, 2006 #

Interesting. I hope you keep a record of the metric used to identify “outliers”, what happened on further inspection of each outlier, and any interesting results from further inspection of an equal number of non-outliers…


posted by Sj on September 5, 2006 #

That Anchorage article was a wacky case (I just deleted all revisions since late 2005); the poor guy must have put as much effort into finding all those different websites to copy from, and mixing and matching the content like that, as he would have into writing an original article. Weird…

Your observations in this whole series, by the way, tally with what’s been my anecdotal experience with the articles I’ve worked on; most raw content is added by IP editors or occasional contributors, most formatting is done by regulars. I would be interested, however, in seeing you run this on articles with good references, since I think that’s what we need to be focusing on now.

But, in any event, you got my vote with this…


posted by Robth on September 15, 2006 #

You might be interested in this paper, which is a more thorough examination of who contributes to Wikipedia.

posted by Ksero on May 27, 2007 #
