
Read On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜

Read On the Dangers of Stochastic Parrots | Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency

In this paper, we take a step back and ask: How big is too big? What are the possible risks associated with this technology and what paths are available for mitigating those risks? We provide recommendations including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, carrying out pre-development exercises evaluating how the planned approach fits into research and development goals and supports stakeholder values, and encouraging research directions beyond ever larger language models.

LLMs reinforce existing structures and values, and overrepresent certain (privileged) viewpoints.

The net result is that a limited set of subpopulations can continue to easily add data, sharing their thoughts and developing platforms that are inclusive of their worldviews; this systemic pattern in turn worsens diversity and inclusion within Internet-based communication, creating a feedback loop that lessens the impact of data from underrepresented populations.

Even if populations who feel unwelcome in mainstream sites set up different fora for communication, these may be less likely to be included in training data for language models.

The Colossal Clean Crawled Corpus, used to train a trillion-parameter LM, is cleaned, inter alia, by discarding any page containing one of a list of about 400 “Dirty, Naughty, Obscene or Otherwise Bad Words”. This list is overwhelmingly words related to sex, with a handful of racial slurs and words related to white supremacy (e.g. swastika, white power) included. While possibly effective at removing documents containing pornography (and the associated problematic stereotypes encoded in the language of such sites) and certain kinds of hate speech, this approach will also undoubtedly attenuate, by suppressing such words as twink, the influence of online spaces built by and for LGBTQ people. If we filter out the discourse of marginalized populations, we fail to provide training data that reclaims slurs and otherwise describes marginalized identities in a positive light. Thus at each step, from initial participation in Internet fora, to continued presence there, to the collection and finally the filtering of training data, current practice privileges the hegemonic viewpoint. In accepting large amounts of web text as ‘representative’ of ‘all’ of humanity we risk perpetuating dominant viewpoints, increasing power imbalances, and further reifying inequality.

Language — and thus values — can become fixed when shifts in language are overcome by the inertia of LLMs trained on old and biased information.

A central aspect of social movement formation involves using language strategically to destabilize dominant narratives and call attention to underrepresented social perspectives. Social movements produce new norms, language, and ways of communicating. This adds challenges to the deployment of LMs, as methodologies reliant on LMs run the risk of ‘value-lock’, where the LM-reliant technology reifies older, less-inclusive understandings.


As people in positions of privilege with respect to a society’s racism, misogyny, ableism, etc., tend to be overrepresented in training data for LMs…, this training data thus includes encoded biases, many already recognized as harmful.

Media coverage can fail to cover protest events and social movements and can distort events that challenge state power. This is exemplified by media outlets that tend to ignore peaceful protest activity and instead focus on dramatic or violent events that make for good television but nearly always result in critical coverage. As a result, the data underpinning LMs stands to misrepresent social movements and disproportionately align with existing regimes of power.

See also: Distortion and distraction

We say seemingly coherent because coherence is in fact in the eye of the beholder.

The problem is, if one side of the communication does not have meaning, then the comprehension of the implicit meaning is an illusion arising from our singular human understanding of language (independent of the model).

We interpret meaning that is not intended or present.

The ersatz fluency and coherence of LMs raise several risks, precisely because humans are prepared to interpret strings belonging to languages they speak as meaningful and corresponding to the communicative intent of some individual or group of individuals who have accountability for what is said.

[Another] category of risk involves bad actors taking advantage of the ability of large LMs to produce large quantities of seemingly coherent texts on specific topics on demand in cases where those deploying the LM have no investment in the truth of the generated text.

This is my biggest worry: that generative AI can be used as an amplification tool for misinformation and disinformation. Even without tools like this, the Russians already interfered in our elections, and Cambridge Analytica pushed public sentiment on Facebook.

We recall again Birhane and Prabhu’s words (inspired by Ruha Benjamin): “Feeding AI systems on the world’s beauty, ugliness, and cruelty, but expecting it to reflect only the beauty is a fantasy.”

They call for a more considered, intentional approach to training models: to decide on the project’s goals beforehand, to weigh the value of diminishing gains in the model against environmental costs, and to select data intentionally rather than try to filter the junk after the fact.

All of these approaches take time and are most valuable when applied early in the development process as part of a conceptual investigation of values and harms rather than as a post-hoc discovery of risks. These conceptual investigations should come before researchers become deeply committed to their ideas and therefore less likely to change course when confronted with evidence of possible harms. This brings us again to the idea we began this section with: that research and development of language technology, at once concerned with deeply human data (language) and creating systems which humans interact with in immediate and vivid ways, should be done with forethought and care.

Basically, to slow down; stop “moving fast and breaking things” and take a little time to figure out what you’re trying to do before you do it, and who you might hurt by rushing.

By Tracy Durnell

Writer and designer in the Seattle area. Freelance sustainability consultant. She/her.
