Like millions of other people, the first thing Mark Humphries did with ChatGPT when it was released in late 2022 was ask it to perform parlor tricks, like writing poetry in the style of Bob Dylan — which, while very impressive, did not seem particularly useful to him, a historian studying the 18th-century fur trade. But Humphries, a 43-year-old professor at Wilfrid Laurier University in Waterloo, Canada, had long been interested in applying artificial intelligence to his work. He was already using a specialized text recognition tool designed to transcribe antiquated scripts and typefaces, though it made frequent errors that took time to correct. Curious, he pasted the tool’s garbled interpretation of a handwritten French letter into ChatGPT. AI corrected the text, fixing all the Fs that had been misread as an S and even adding missing accents. Then Humphries asked ChatGPT to translate it to English. It did that, too. Maybe, he thought, this thing would be useful after all.
For Humphries, AI tools held a tantalizing promise. Over the last decade, millions of documents in archives and libraries have been scanned and digitized — Humphries was involved in one such effort himself — but because their wide variety of formats, fonts, and vocabulary rendered them impenetrable to automated search, working with them required stupendous amounts of manual research. For a previous project, Humphries pieced together biographies for several hundred shellshocked World War I soldiers from assorted medical records, war diaries, newspapers, personnel files, and other ephemera. It had taken years and a team of research assistants to read, tag, and cross-reference the material for each individual. If new language models were as powerful as they seemed, he thought, it might be possible to simply upload all this material and ask the model to extract all the documents related to every soldier diagnosed with shell shock.
“That’s a lifetime’s work right there, or at least a decade,” said Humphries. “And you can imagine scaling that up. You could get an AI to figure out if a soldier was wounded on X date, what was happening with that unit on X date, and then access information about the members of that unit, that as historians, you’d never have the time to chase down on an individual basis,” he said. “It might open up new ways of understanding the past.”
Improved database management may be a far cry from the world-conquering superintelligence some predict, but it’s characteristic of the way language models are filtering the real world. From law to programming to journalism, professionals are trying to figure out whether and how to integrate this promising, risky, and very weird technology into their work. For historians, a technology capable of synthesizing entire archives that also has a penchant for fabricating facts is as appealing as it is terrifying, and the field, like so many others, is just beginning to grapple with the implications of such a potentially powerful but slippery tool.
AI seemed to be everywhere at the 137th annual meeting of the American Historical Association last month, according to Cindy Ermus, an associate professor of history at the University of Texas at San Antonio. She chaired one of several panels on the topic. Ermus described her and many of her colleagues’ relationship to AI as that of “curious children,” wondering with both excitement and wariness what aspects of their work it will change and how. “It’s going to transform every part of historical research, from collection, to curation, to writing, and of course, teaching,” she said. She was particularly impressed by Lancaster University lecturer Katherine McDonough’s presentation of a machine learning program capable of searching historic maps, initially trained on ordnance surveys of 19th-century Britain.
“It’s going to transform every part of historical research, from collection, to curation, to writing, and of course, teaching.”
“She searched the word ‘restaurant,’ and it pulled up the word ‘restaurant’ in tons of historical maps through the years,” Ermus said. “To the non-historian, that might not sound like a big deal, but we’ve never been able to do that before, and now it’s at our fingertips.”
Another attendee, Lauren Tilton, professor of liberal arts and digital humanities at the University of Richmond, had been working with machine learning for over a decade and recently worked with the Library of Congress to apply computer vision to the institution’s vast troves of minimally labeled photos and films. All archives are biased — in what material is saved to begin with and in how it is organized. The promise of AI, she said, is that it can open up archives at scale and make them searchable for things the archivists of the past didn’t value enough to label.
“The most described materials in the archive are usually the sort of voices we’ve heard before — the famous politicians, famous authors,” she said. “But we know that there are many stories by people of minoritized communities, communities of color, LGBTQ communities that have been hard to tell, not because people haven’t wanted to, but because of the challenges of how to search the archive.”
AI systems have their own biases, however. They have the well-documented tendency to reflect the gender, racial, and other biases of their training data — the fact that, as Ermus pointed out, when she asked GPT-4 to create an image of a history professor, it drew an elderly white man with elbow patches on his blazer — but they also display a bias that Tilton calls “presentism.” Because the vast preponderance of training data is scraped from the contemporary internet, models reflect a contemporary worldview. Tilton encountered this phenomenon when she found image recognition systems struggled to make sense of older photos, for example, labeling typewriters as computers and their paperweights as their mice. These were image recognition systems, but language models have a similar problem.
Impressed with ChatGPT, Humphries signed up for the OpenAI API and set out to make an AI research assistant. He was trying to track 18th-century fur traders through a morass of letters, journals, marriage certificates, legal documents, parish records, and contracts in which they appear only fleetingly. His goal was to design a system that could automate the process.
One of the first challenges he encountered was that 18th-century fur traders do not sound anything like a language model assumes
One of the first challenges he encountered was that 18th-century fur traders do not sound anything like a language model assumes. Ask GPT-4 to write a sample entry, as I did, and it will produce lengthy reflections on the sublime loneliness of the wilderness, saying things like, “This morn, the skies did open with a persistent drizzle, cloaking the forest in a veil of mist and melancholy,” and “Bruno, who had faced every hardship with the stoicism of a seasoned woodsman, now lay still beneath the shelter of our makeshift tent, a silent testament to the fragility of life in these untamed lands.”
Whereas an actual fur trader would be far more concise. For example, “Fine Weather. This morning the young man that died Yesterday was buried and his Grave was surrounded with Pickets. 9 Men went to gather Gum of which they brought wherewith to Gum 3 Canoes, the others were employed as yesterday,” as one wrote in 1806, referring to gathering tree sap to seal the seams of their bark canoes.
“The problem is that the language model wouldn’t pick up on a record like that, because it doesn’t contain the type of reflective writing that it’s trained to see as being representative of an event like that,” said Humphries. Trained on contemporary blog posts and essays, it would expect the death of a companion to be followed by lengthy emotional remembrances, not an inventory of sap supplies.
By fine-tuning the model on hundreds of examples of fur trader prose, Humphries got it to pull out journal entries in response to questions, but not always relevant ones. The antiquated vocabulary still posed a problem — words like varangue, a French term for the rib of a canoe that would rarely appear in the model’s training data, if ever.
After much trial and error, he ended up with an AI assembly line using multiple models to sort documents, search them for keywords and meaning, and synthesize answers to queries. It took a lot of time and a lot of tinkering, but GPT helped teach him the Python he needed. He named the system HistoryPearl, after his smartest cat.
He tested his system against edge cases, like the Norwegian trader Ferdinand Wentzel, who wrote about himself in the third person and deployed an odd sense of humor, for example, writing about the birth of his son by speculating about his paternity and making self-deprecating jokes about his own height — “F. W.’s Girl was safely delivered of a boy. – I almost believe it is his Son for his features seem to bear some resemblance of him & his short legs seem to determine this opinion beyond doubt.” This sort of writing stymied earlier models, but HistoryPearl could pull it up in response to a vaguely phrased question about Wentzel’s humor, along with other examples of Wentzel’s wit Humphries hadn’t been looking for.
The tool still missed some things, but it performed better than the average graduate student Humphries would normally hire to do this sort of work. And faster. And much, much cheaper. Last November, after OpenAI dropped prices for API calls, he did some rough math. What he would pay a grad student around $16,000 to do over the course of an entire summer, GPT-4 could do for about $70 in around an hour.
“They’re still talking about the technology as if it is a theoretical thing without the full understanding that it poses a very real, existential threat to our whole raison d’être as higher educators.”
“That was the moment where I realized, ‘Okay, this begins to change everything,’” he said. As a researcher, it was exciting. As a teacher, it was frightening. Organizing fur trading records may be a niche application, but a huge number of white collar jobs consist of similar information management tasks. His students were supposed to be learning the sorts of research and thinking skills that would allow them to be successful in just these sorts of jobs. In November, he published a newsletter imploring his peers in academia to take the rapid development of AI seriously. “AI is simply starting to outrun many people’s imaginations,” he wrote. “They’re still talking about the technology as if it is a theoretical thing without the full understanding that it poses a very real, existential threat to our whole raison d’être as higher educators.”
In the meantime, though, he was pleased that his tinkering had resulted in what he calls a “proof of concept”: reliable enough to be potentially useful, though not yet enough to fully trust. Humphries and his research partner, the historian Lianne Leddy, submitted a grant to scale their research up to all 30,000 voyageurs in their database. In a way, he found the labor required to develop this labor-saving system comforting. The largest improvements in the model came from feeding it the right data, something he was able to do only because of his expertise in the material. Lately, he has been thinking that there may actually be more demand for domain experts with the sort of research and critical assessment skills the humanities teach. This year he will teach an applied generative AI program he designed, run out of the Faculty of Arts.
“In some ways this is old wine in new bottles, right?” he said. In the mid 20th century, he pointed out, companies had vast corporate archives staffed by researchers who were experts, not just in storing and organizing documents, but in the material itself. “In order to make a lot of this data useful, people are needed who have both the ability to figure out how to train models, but more importantly, who understand what is good content and what’s not. I think that’s reassuring,” he said. “Whether I’m just deluding myself, that’s another question.”
https://www.theverge.com/24068716/ai-historians-academia-llm-chatgpt