How to do social science research without really trying

04-14-2019, 01:14 PM
Or, How to Make a Racist AI Without Really Trying.

I love this blog post because I think it clearly and empirically captures something essential about the very theoretical argument put forth by Berger and Luckmann in The Social Construction of Reality (cf. The Sociological Imagination). That is, I think people generally find it easy to imagine social control when it involves formal organizations with codified rules: churches, the police, schools, employers, and so on. But the concept of social control embedded in informal cultural norms, values, and beliefs is more slippery. How, for example, can an individual's vague racial bias -- which seems isolated and impotent -- contribute to a larger social problem of racial discrimination?

The researchers building ConceptNet were not trying to answer that question; they were trying to build useful tools to allow computers to analyze written communication (Natural Language Processing, or NLP). But they inadvertently demonstrated the idea in a very rigorous way, all the more rigorous because they weren't intending to do so.


The blog post is technical and aimed at an audience that is familiar with programming NLP tools (something I've dabbled with, including doing sentiment analysis, although by a different approach). But here's the gist:

1) Choose a standard NLP word embeddings dataset.
NLP researchers have created large word-embedding datasets by scraping enormous volumes of text from the internet and training embedding models on that text. Word embeddings give computers a way to understand how different words are related to each other (whether they have similar meanings or are often used together) simply by comparing the statistical frequency with which words appear in the same contexts. See for example the explanation here:

Quote:
This vector representation provides convenient properties for comparing words or phrases. For example, if "salt" and "seasoning" appear within the same context, the model will indicate that "salt" is conceptually closer to "seasoning," than, say, "chair."
Two commonly used word-embedding databases are word2vec (trained on Google News data) and GloVe (trained on web-crawl data). In essence, data scientists have used algorithmic tools to create datasets that are similar in spirit to the kinds of content analysis traditionally done by humans in social science research. That is, in a content analysis a group of people will read (view, listen to...) a large body of content and produce a classification scheme from it, to understand the dominant themes, find patterns, and so on.
2) Create or find a database that associates emotional connotations with individual words.
This is generally straightforward: the word "sad" has a negative emotional connotation, the word "joy" a positive one, and so on. There are various ways of producing these lists, but they are generally uncontroversial; people rarely disagree about which basic words connote positive or negative emotions.
3) Combine (1) and (2) to build a sentiment analyzer
With only the word list from (2), you could rate the "overall sentiment" of a text to some extent, as long as some of those connotation-bearing words actually appear in it, but in many texts they will be sparse. If you combine the list from (2) with the word embeddings from (1), however, you can infer emotional connotations for almost any text, on the logic that words which frequently appear in the same contexts as negative emotional terms, or which have similar meanings, also carry negative connotations (and likewise for positive terms). A minimal sketch of this pipeline appears after this list.
4) Observe that ostensibly neutral words associated with race or ethnicity are given different sentiment scores.
Your sentiment analyzer, built in a very reasonable way from ostensibly neutral and unbiased sources, has learned to be racist, because negative emotions are more closely associated with certain races and ethnicities in the training data (Google News, or just a large set of web pages).
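To make the mechanics concrete, here is a minimal sketch of steps (1)-(4) in Python. It is not the code from the blog post: I'm assuming gensim's downloadable "glove-wiki-gigaword-100" vectors as a stand-in for the embeddings in (1), a tiny hand-written word list as a stand-in for a real sentiment lexicon in (2), a logistic regression over word vectors as the analyzer in (3), and purely illustrative sentences in (4).

Code:
# Minimal sketch of steps (1)-(4) above. Assumptions that are NOT from the
# original post: gensim's downloadable "glove-wiki-gigaword-100" vectors
# stand in for the embeddings, a tiny inline word list stands in for a real
# sentiment lexicon, and the test sentences at the end are illustrative only.
import numpy as np
import gensim.downloader as api
from sklearn.linear_model import LogisticRegression

# (1) A standard word-embedding dataset (~400k words, 100 dimensions each).
vectors = api.load("glove-wiki-gigaword-100")

# The "salt"/"seasoning"/"chair" intuition from the quote, read off the vectors:
print(vectors.similarity("salt", "seasoning"), vectors.similarity("salt", "chair"))

# (2) A toy lexicon of words with known emotional connotations.
positive = ["joy", "love", "excellent", "wonderful", "happy", "delightful"]
negative = ["sad", "hate", "terrible", "awful", "miserable", "disgusting"]

# (3) Train a classifier that maps a word's embedding to a sentiment label,
# so it can generalize from the lexicon to words it was never told about.
X = np.vstack([vectors[w] for w in positive + negative])
y = [1] * len(positive) + [0] * len(negative)
clf = LogisticRegression(max_iter=1000).fit(X, y)

def sentiment(text):
    # Average the vectors of the words the model knows, then return the
    # classifier's signed distance from its decision boundary (positive
    # values lean toward "positive sentiment", negative values the opposite).
    words = [w for w in text.lower().split() if w in vectors]
    if not words:
        return 0.0
    mean_vec = np.mean([vectors[w] for w in words], axis=0)
    return float(clf.decision_function([mean_vec])[0])

# (4) Ostensibly neutral sentences that differ only by a name or an
# ethnicity term can come back with different scores.
for text in ["my name is emily", "my name is shaniqua",
             "let us go get italian food", "let us go get mexican food"]:
    print(f"{text!r}: {sentiment(text):+.2f}")

A lexicon this small only shows the mechanics; the blog post's point is that with a full lexicon and web-scale embeddings, the scores for sentences like these split along racial and ethnic lines even though nothing in the pipeline ever mentions race.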


You can stop here and recognize that this is already an interesting social science project. Through a novel NLP-based methodology, the authors of this blog post have demonstrated that racial bias exists and is measurable in a very large sample of data. That data is, in one sense, just an aggregate of individual biases expressed in writing. In isolation, each of those individual biases would very likely seem small enough to be negligible; yet because they tend to run in the same direction, they add up to something very significant. That the "whole" of cognitive culture amounts to more than a simple aggregate of individual thinking is a central idea in the social sciences.

Beyond this, though, the creators of ConceptNet are also worried that AI tools -- by absorbing culture -- will reproduce these existing biases in real-world applications where people will naively assume that those AI tools can't possibly be prejudiced in the way that an individual person can be. In fact, there are already examples of this happening.
04-14-2019, 02:31 PM
Yes, one of the things sociology does is bring forth our taken-for-granted, invisible assumptions, and make them visible. This "biased AI" is a perfect example.