Ethical and legal disputes are running between AI companies and content publishers, but nothing comes close to a flashpoint like Forbes’ recent accusation against Perplexity AI. This is a unique moment in 2024 because the spat is about journalistic labor and form. It began earlier this month with Forbes complaining that Perplexity AI stole language from its paywalled scoop on Eric Schmidt’s stealth drone startup. Articles have since been flying back and forth as multiple news outlets – Axios, AP, Semafor, Wired, Short Cut and Fast Company – have all chronicled the dispute.
What’s missing in all the back and forth about the adequacy, or not, of citations, traffic referrals, and rights is a discussion on a fundamental question: What is the future of the journalistic form? Paywalled or not, are we headed to a future where journalists are merely laboring to supply tokens (pun intended) to large language models and their applications?
AI engineers eager to produce the next extractive content aggregator are missing a core emotional point. Journalists who care about digging up realities that lurk beneath the surface also view their work as a labor of love. Their final product has a form associated with the work, presented on a web page that stands for the work.
A fantastic example of journalistic form and labor at work is 40 Acres And A Lie, a new investigative series published this month by Mother Jones in collaboration with the Center for Public Integrity and Reveal. It examines one of the oldest broken promises of reparations in American history. “A government program gave formerly enslaved people land after the Civil War, only to take nearly all of it back a year and a half later,” goes the intro. The team of named journalists themselves marshaled artificial intelligence “to track down the people, places, and stories that had long been misunderstood and forgotten, then asked their descendants about what’s owed now,” says the series introduction.
Journalistic form has value for democracy. It represents journalists’ intervention in democracy’s discourse. It lets ethical journalists defend their work in public against questions about their story selection, sourcing, verification, corroboration, and so on. The form is key for people in a democracy to try to hold journalists accountable to ethical standards. There will be contestations over facts, representation, unaccounted-for stakeholders, the amplification of or salience given to illegitimate controversies, and so forth.
In sum, this form that manifests on journalistic pages is the face of journalism and journalists who undertake truth-telling. Despite the field’s continuous grappling with ethics, it is this form that makes journalists and journalism visible to the public.
Large Language Model-curated news needs scrutiny
Perplexity AI’s Pages feature is designed to look like news articles, not your personal Q&A with your preferred LLM chatbot. It explicitly lets users “curate” new articles with headlines using generative AI summarizations of other articles. Each page automatically includes numerical citations (like Wikipedia’s) and logo-based citations to original sources. If Perplexity AI’s Pages feature succeeds, competitors will follow. And we will have “curated generated content,” or CGC (a term I’m coining here), for the generative AI era. This is akin to the user-generated content, or UGC, that emerged with social media companies.
Perplexity AI has also taken on a publisher-like role in pushing newsy content to its audience with a newsletter. As more startups and services follow, streams of newsy summary articles answering contemporary questions will enter the online sphere every day. They will be “curated” by “users” like you and me. What happens when recycling begins? How will we differentiate between the human journalists who invest time and money in digging up real stories and the mix of humans and machines who come later?
Most worrisome is Perplexity AI’s news podcast Discover Daily, which produces 5-minute generated audio summaries of the news of the day. The audio has no spoken inline credits to the original publishers or journalists it relies on. Credits are only made in text in the show notes. If you only listened to the audio and did not see the credits, you would have no idea who reported the stories and how. Compare this with Apple News Today. As that show’s hosts summarize the latest stories and the questions answered in the news reports, they usually name the media outlets each time. You can hear which publisher reported or exposed what, or which outlet explained something.
Generative AI companies have an opportunity
Genuine news publishers cannot be faulted if they are wary about promises of audience expansion from links through generative AI answers.
On the economic front, the good news is that newer revenue-sharing or royalty models are emerging to let AI answer engines reward original sources to the degree their material is used for a given subscriber query. Tim O’Reilly made a case for a method to apportion payments to original sources in a recent piece on how to fix AI’s original sin. “By creating an algorithm that attributes the content referenced for an ‘answer,’ we’re able to allocate a royalty to the original creators. Which is an important advancement in bringing AI models to market,” wrote O’Reilly. Semafor reported Perplexity AI’s claims that it is working on revenue sharing. Revenue is indeed a key component of recognizing authentic creative work, but not the only one.
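The apportionment idea O’Reilly describes can be sketched in code. The following is a minimal illustration only: it assumes a fixed per-query royalty pool and pre-computed attribution weights per source, and the outlet names, weights, and pool size are all hypothetical, not a description of any company’s actual system.

```python
# Minimal sketch: pro-rata royalty apportionment for one generated answer.
# All names and numbers below are illustrative assumptions.

def apportion_royalties(attribution_weights, royalty_pool):
    """Split a fixed royalty pool among sources in proportion to how
    heavily each source was relied on for a generated answer."""
    total = sum(attribution_weights.values())
    if total == 0:
        # No attributable sources: nothing to pay out.
        return {source: 0.0 for source in attribution_weights}
    return {
        source: royalty_pool * weight / total
        for source, weight in attribution_weights.items()
    }

# Example: three sources contribute unequally to one answer,
# with a hypothetical $0.05 royalty pool per subscriber query.
weights = {"outlet_a": 0.6, "outlet_b": 0.3, "outlet_c": 0.1}
payouts = apportion_royalties(weights, royalty_pool=0.05)
```

The hard part, of course, is not the division but producing trustworthy attribution weights in the first place.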
But the critical challenge and opportunity for AI companies in handling journalism goes beyond revenue sharing. The designers of large language models and their applications need to invest more time in distinguishing the original, value-added, and recycled journalism that often runs in the same news cycle. They need to find ways to distinguish the news publishers or journalists who actually broke the original story from those who publish rapid derivative rewrites within minutes. Original reporting brings new knowledge into the world, relative to secondary and follow-on writing.
In line with that, generative AI-curated current affairs content needs a standard for richer credits that respect journalistic form. Numerical, reductive, and tiny logo-based citations will not cut it. This is a design problem that needs to be worked on together by news publishers and generative AI companies.
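As one way to frame that design problem, here is a minimal sketch of what a richer, machine-readable credit record could carry alongside a generated summary: the outlet’s full name, the named reporters, and the role the source played (original reporting versus follow-on coverage). All field names and values are hypothetical assumptions, not an existing standard.

```python
# Hypothetical credit record for generative AI-curated content.
# Field names are illustrative, not part of any published standard.
from dataclasses import dataclass, asdict, field

@dataclass
class Credit:
    outlet: str                      # publisher's full name, not just a logo
    reporters: list = field(default_factory=list)  # named journalists
    role: str = "original_reporting" # or "follow_on", "analysis"
    headline: str = ""
    url: str = ""

credit = Credit(
    outlet="Example Gazette",
    reporters=["A. Reporter"],
    role="original_reporting",
    headline="Example scoop headline",
    url="https://example.com/scoop",
)
record = asdict(credit)  # serializable form a generated page could embed
```

A record like this would let an audio summary speak the outlet and reporter names aloud, and let readers see at a glance who did the original work.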
We need to anticipate the risks
I am not arguing that journalistic form has already been done to death by LLMs and generative AI applications. But if the current direction of generative AI-powered news applications solidifies in the name of “extracting knowledge,” we must anticipate the risk of journalistic form losing its force altogether. With the Internet revolution, publishers lost control of news distribution as search, social, and digital aggregators took over. But the journalistic form largely remained. With generative AI-powered answer engines relying extractively and reductively on journalism, we have to ask ourselves, with caution, where this road is headed.
We need a way for human reporters and investigators to be visible as the people who are doing the real work so that they can defend it in the public domain, whether or not machines aggregate their findings as knowledge.