Capstone

kathleen wang
2 min read · Mar 15, 2021

Some fans and experts believe that Agatha Christie had undiagnosed dementia in her later years. A paper published in 2009 analyzed 14 of Agatha Christie’s novels and found that, as she aged, her works showed an increase in repeated phrase types, an increase in indefinite word usage, and a decrease in unique word types. The goal of my capstone project is first to replicate the paper’s findings using 60 of her detective novels instead of 14, and then to dig deeper into her works. Using features that I engineer from her texts with NLP techniques, I will then build a model that predicts the age at which she wrote a piece of text.

Agatha Christie’s novels were downloaded in EPUB format from the Internet Archive and converted to TXT files using the program Calibre.
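For anyone repeating this step on a whole folder of books, here is a minimal sketch of a batch conversion. It assumes Calibre's command-line tool `ebook-convert` is installed and on the PATH; I am not claiming this is how the conversion was done here (the Calibre GUI works just as well), and the function names and folder names are hypothetical.

```python
from pathlib import Path
import subprocess


def build_convert_command(epub_path, out_dir):
    """Build the ebook-convert command that turns one EPUB into a TXT file."""
    epub_path = Path(epub_path)
    txt_path = Path(out_dir) / (epub_path.stem + ".txt")
    # ebook-convert infers the output format from the file extension.
    return ["ebook-convert", str(epub_path), str(txt_path)]


def convert_all(epub_dir, out_dir):
    """Convert every .epub file in epub_dir, writing .txt files to out_dir."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for epub in sorted(Path(epub_dir).glob("*.epub")):
        # check=True raises an error if Calibre fails on a file.
        subprocess.run(build_convert_command(epub, out_dir), check=True)


# Example call (hypothetical folder names):
# convert_all("christie_epub", "christie_txt")
```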

I was able to replicate the paper’s findings: as Agatha Christie aged, the number of unique words she used decreased, the number of repeated phrases increased, and her usage of the “indefinite words” something and anything increased. Furthermore, I found that her usage of vague words such as lot and very also increased.
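To make these three measures concrete, here is a rough sketch of how they could be computed for one book. It is an illustration under my own assumptions, not the exact pipeline from the paper or from this project: the tokenizer is a simple regex, a "repeated phrase" is taken to be any word trigram that occurs more than once, and `sample_size` is a hypothetical parameter that truncates each text so books of different lengths stay comparable.

```python
import re
from collections import Counter

# The two indefinite words named above; extend the set as needed.
INDEFINITE_WORDS = {"something", "anything"}


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def text_features(text, sample_size=50_000, n=3):
    """Compute vocabulary and repetition features on the first sample_size tokens."""
    tokens = tokenize(text)[:sample_size]
    if not tokens:
        raise ValueError("text contains no word tokens")

    # Count every n-gram (a sliding window of n consecutive tokens).
    ngrams = Counter(zip(*(tokens[i:] for i in range(n))))
    repeated = sum(1 for count in ngrams.values() if count > 1)
    indefinite = sum(1 for token in tokens if token in INDEFINITE_WORDS)

    return {
        "word_types": len(set(tokens)),        # unique words
        "repeated_phrase_types": repeated,     # distinct n-grams seen 2+ times
        "indefinite_rate": indefinite / len(tokens),
    }
```

Running this on each novel and plotting every feature against her age at the time of writing would show the trends described above.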

I did not include any of her short stories, and I also excluded Passenger to Frankfurt, the thriller she wrote in her late 70s. The authors of the original paper removed this book as well, identifying it as an outlier because:

“Passenger to Frankfurt has the largest vocabulary of all the works we analyzed … it is a thriller, not a detective mystery, conceived, written, and researched in her early to mid 70s… it draws on books by political thinkers that she requested of her publishers … Much of the vocabulary in Passenger to Frankfurt comes from her reliance on these sources.”

I find this outlier extremely interesting because it demonstrates that, though Agatha Christie’s vocabulary recall may have declined over the years, her ability to recognize and use a large vocabulary in her writing remained intact in her later years.

I tried out various regression models and found that the Ridge regression model, with an R² of 0.7, performed better than the mean-predicting baseline, which had an R² of 0.
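For readers wondering why the baseline scores exactly 0, here is a small sketch of R² and of ridge regression with a single feature. Everything in it is illustrative: the numbers are made-up toy values, not my data, the function names are hypothetical, and a real project would more likely use a library such as scikit-learn with many features and a held-out test set.

```python
def r2_score(y_true, y_pred):
    """R² = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def fit_ridge_1d(x, y, alpha=1.0):
    """Closed-form ridge fit for one feature; the intercept is not penalized."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / (s_xx + alpha)  # alpha shrinks the slope toward zero
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy values only: one made-up feature and made-up ages.
feature = [1, 2, 3, 4, 5, 6]
age = [32, 41, 49, 62, 69, 81]

# Baseline: always predict the mean age, so the residuals equal the
# total variation and R² comes out as 0.
mean_age = sum(age) / len(age)
baseline_r2 = r2_score(age, [mean_age] * len(age))

slope, intercept = fit_ridge_1d(feature, age, alpha=1.0)
ridge_r2 = r2_score(age, [slope * xi + intercept for xi in feature])

print(f"baseline R²: {baseline_r2:.2f}, ridge R²: {ridge_r2:.2f}")
```

An R² above 0 therefore means the model explains some of the variation in age that simply guessing the average cannot.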
