The Big Five are word vectors
Lexical studies in psychology and Latent Semantic Analysis in computer science were introduced a half century apart to solve different problems and yet are mathematically equivalent. This isn’t a metaphor that works on a certain level of abstraction; the Big Five are dimensions of word vectors.
But first, some background. The Lexical Hypothesis claims that personality structure is written into language, since speakers must describe the most salient attributes of those around them. The beauty of this idea is that, instead of a single person suggesting a model of personality, language records what millions of people implicitly agree to be useful. The psychometrician’s job is simply to identify this structure. This has typically been accomplished by inviting psychology students to rate themselves on lists of adjectives and performing factor analysis on the correlation matrix. Back in 1933 L. L. Thurstone administered a survey of 60 adjectives to 1,300 people. In his seminal The Vectors of Mind he reports that “five factors are sufficient” to explain the data. In subsequent decades such studies, more or less, resulted in five principal components: Agreeableness, Extraversion, Conscientiousness, Neuroticism, and Openness/Intellect. (For an excellent treatment of the subject see Lexical Foundations of the Big Five.)
Latent Semantic Analysis was introduced as an information retrieval technique in 1988. Words can be represented as vectors, and documents or sentences as the mean of their word vectors. If you want to search a large database (e.g., Wikipedia), keywords for each page can only get you so far. One way around this is to represent both documents (wiki articles) and queries (search terms) as the mean of their word vectors. Finding relevant documents can then be accomplished with a simple dot product. (In this post I treat LSA and word vectors as synonymous. There are other ways to vectorize language, and more specifically to make word vectors, but those are beyond the scope for now.)
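As a toy sketch of that retrieval scheme (the vocabulary and 2-d vectors here are made up; real LSA vectors are learned from a corpus and have hundreds of dimensions):

```python
import numpy as np

# Hypothetical word vectors; real ones come from decomposing a count matrix.
word_vecs = {
    "cat":    np.array([0.9, 0.1]),
    "dog":    np.array([0.8, 0.2]),
    "stock":  np.array([0.1, 0.9]),
    "market": np.array([0.2, 0.8]),
}

def embed(text):
    """Represent a document or query as the mean of its word vectors."""
    vecs = [word_vecs[w] for w in text.split() if w in word_vecs]
    return np.mean(vecs, axis=0)

docs = ["cat dog", "stock market"]
doc_vecs = np.array([embed(d) for d in docs])

query = embed("dog")
scores = doc_vecs @ query           # dot product = relevance
best = docs[int(np.argmax(scores))]  # the "cat dog" document wins
```

The same trick scales to millions of documents, since scoring is a single matrix-vector product.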
Despite their different uses and histories, the steps are the same:
1. Collect a word x document count matrix
2. Apply a nonlinear transform to the counts
3. Decompose the resulting matrix
4. Rotate the factors (optional)
The result is a set of word vectors that succinctly describe each word. These can be used for a host of downstream tasks from sentiment analysis to narcissism prediction from student essays. In the case of personality adjectives, the dimensions of the word vectors were analyzed, named, and debated for decades. What follows is a discussion of the differences in each step.
Count matrix. LSA usually involves a large number of varied documents (e.g., millions of Amazon product reviews). These are transformed into a word x document matrix by counting how often each word appears in each document. In psychology, a document is the set of words a subject agrees describe them. This extends to Likert scales as well: if someone says a word describes them 5/7, simply repeat the word five times in their document.
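The Likert trick makes the psychological count matrix trivial to build. A minimal sketch, with hypothetical adjectives and ratings:

```python
import numpy as np

# Hypothetical 1-7 Likert ratings: rows are subjects, columns are adjectives.
adjectives = ["kind", "tidy", "anxious"]
ratings = np.array([
    [7, 2, 5],   # subject 1
    [3, 6, 1],   # subject 2
])

# A rating of 5/7 is treated as the word appearing five times in that
# subject's "document", so the count matrix is just the ratings,
# transposed to word x document (word x subject).
counts = ratings.T   # shape: (n_words, n_subjects)
```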
Nonlinear Transform. Lexical studies often ipsatize the data (z-score along the subject axis) and then compute a Pearson correlation. Thurstone used the now-archaic tetrachoric correlation in his study. In LSA the most common transform is TF-IDF followed by a logarithm, which ensures the matrix is not dominated by common terms. Often the transform results in a word x word affinity matrix (e.g., a correlation matrix). This step is practically very important but not all that theoretical: the transform to pick is the one that gives you a reasonable result in the end.
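Both transforms can be sketched on synthetic counts. The TF-IDF variant below (log-damped term frequency times inverse document frequency) is one common recipe; implementations differ in the details:

```python
import numpy as np

rng = np.random.default_rng(0)
counts = rng.integers(0, 8, size=(6, 40))   # toy word x document counts

# LSA-style transform: TF-IDF with a log to dampen frequent terms.
tf = np.log1p(counts)
df = (counts > 0).sum(axis=1)               # document frequency per word
idf = np.log(counts.shape[1] / df)
tfidf = tf * idf[:, None]

# Psychology-style transform: a square word x word correlation matrix.
affinity = np.corrcoef(counts)
```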
Matrix Decomposition. There are many methods of matrix decomposition. Some, such as PCA, require a square matrix; others are robust to missing data. With personality data the choice doesn’t matter much; results are very similar. General word vectors require ~300 dimensions to represent a word’s meaning, part of speech, slangness, and much else that gives language texture. Because surveys are designed to hold much of that constant, only ~5 dimensions are needed. Thurstone justified his choice of five by looking at the reconstruction error, which he reports as a histogram. Later psychologists justified five by the reconstruction error (measured with eigenvalues), as well as by interpretability and reproducibility.
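A sketch of the decomposition step on a synthetic correlation matrix, assuming the PCA-style route (eigendecomposition of a square word x word matrix) and five retained dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a word x word correlation matrix (60 "adjectives").
X = rng.standard_normal((60, 300))
R = np.corrcoef(X)

# Eigendecomposition; eigh returns ascending order, so flip to descending.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 5
word_vectors = eigvecs[:, :k] * np.sqrt(eigvals[:k])   # factor loadings
explained = eigvals[:k].sum() / eigvals.sum()          # variance retained
```

The eigenvalue spectrum is exactly what later psychologists eyeballed (scree plots) to justify stopping at five.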
Rotation. Have you ever heard of component over-extraction? It’s not a story the psychologists would tell you. It’s when a researcher extracts too many principal components, then rotates variance from the earlier, valid PCs onto the later, marginal PCs. This is what happened with the Big Five, believe it or not! What is now Agreeableness was once a much more robust and theoretically satisfying ‘Socialization’ factor, which was spread out over PCs 3-5 to make Conscientiousness, Neuroticism, and Openness. Rotation can be justified to produce interpretable factors. But if you ever find yourself rotating and then arguing about the correct number of factors, check yourself!
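Varimax is the usual rotation in lexical studies. A plain implementation of the standard Kaiser algorithm (a sketch; library versions add refinements) shows the key property: rotation is orthogonal, so it redistributes variance across factors without changing how well each word is represented:

```python
import numpy as np

def varimax(loadings, n_iter=100, tol=1e-8):
    """Rotate a factor-loading matrix to maximize the varimax criterion."""
    L = loadings.copy()
    n, k = L.shape
    R = np.eye(k)
    var = 0.0
    for _ in range(n_iter):
        basis = L @ R
        # Gradient-style update via SVD (Kaiser's varimax, gamma = 1).
        u, s, vt = np.linalg.svd(
            L.T @ (basis**3 - basis @ np.diag((basis**2).sum(axis=0)) / n)
        )
        R = u @ vt
        new_var = s.sum()
        if new_var - var < tol:
            break
        var = new_var
    return L @ R   # R is orthogonal, so row norms (communalities) survive
```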
Why did it take so long?
OG psychometricians designed personality studies to interrogate the Lexical Hypothesis, constrained by 1930s compute. For later personality spelunkers, surveys reigned supreme. The Lexical Hypothesis was on the radar but was engaged with primarily as justification for nomothetic claims based on survey results. Once personality psychologists coalesced around the Big Five and agreed on how to measure it, they moved on to other questions; the Big Five was either good enough for government work or handed down from God (depending on one’s pedigree). This milieu allowed personality psychologists to use LSA for 30 years without appreciating the implications for the Lexical Hypothesis.
So close, yet so far
There is a lot of personality research that uses word vectors. I’ll focus on one recent example that gets close enough to ask the right questions. The final study in the 2021 dissertation, Validating Word Embedding as a Tool for the Psychological Sciences, starts:
“I hypothesize that a five-cluster solution applied to WE [Word Embedding] vectors representing Big Five trait descriptive adjectives will produce a structure clearly reflective of the Big Five traits. This hypothesis follows from the fact that the Big Five structure has been recovered from semantic memory in previous research (e.g., Edwards & Collins, 2008), and that my preceding studies suggest that WE vectors accurately encode the organization of semantic memory.”
So the plan is to run k-means on the word vectors (with k = 5); the clusters should recover their Big Five assignments. Words that load negatively on a factor (e.g., no introversion words) or that do not load purely on one factor are removed. The supporting evidence could have been “because the Big Five were originally defined by the word vectors of this same set of words”. Instead, the citation is a statistically significant correlation between memory and word vectors. On the choice of clustering over factor analysis:
“The Big Five model has been traditionally derived by applying dimension reduction techniques such as principal components analysis and factor analysis to self-report personality scales (John et al., 2008). Such dimension reduction methods are not appropriate for WE vectors as WE dimensions are inherently meaningless (Luo et al., 2015). While meaningful dimensions can be imposed in WE space (Grand et al., 2018), or trained into this space (Luo et al., 2015), these procedures necessarily introduce researcher bias, as the content of dimensions must be explicitly specified. Thus, while the Big Five model may potentially be contingent on dimension reduction procedures, cluster analysis appears to be the most appropriate means to assess whether WE vectors encode the Big Five model.”
“Meaningless” should be “not necessarily interpretable”, which is true in the same sense that PCA dimensions are not interpretable. Language models do have a sentiment neuron, for example. This is evidence of how far apart psychology and language processing have grown. Even those studying the relationship of word vectors to the Big Five don’t realize their common mathematical heritage.
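For concreteness, the study’s clustering test amounts to something like the following, here run on synthetic stand-in vectors rather than real embeddings (the blobs play the role of adjectives for each trait):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for embedding vectors of trait adjectives:
# five tight blobs in 50-d space, 20 "words" per blob.
blob_centers = rng.standard_normal((5, 50))
vectors = np.vstack(
    [c + 0.05 * rng.standard_normal((20, 50)) for c in blob_centers]
)

def kmeans(X, k, n_iter=50):
    """Minimal Lloyd's algorithm (real k-means would use k-means++ init)."""
    cents = X[:: len(X) // k][:k]   # naive spread-out initialization
    for _ in range(n_iter):
        dists = ((X[:, None] - cents[None]) ** 2).sum(-1)
        labels = dists.argmin(1)    # assign each point to nearest centroid
        cents = np.array([X[labels == j].mean(0) for j in range(k)])
    return labels

labels = kmeans(vectors, k=5)       # should recover the five blobs
```

When the vectors really do cluster by trait, the five labels line up with the five blobs, which is the structure the dissertation hypothesizes for Big Five adjectives.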
I know of three works that propose personality models built using natural language word vectors.
The first uses Facebook statuses, where all of a single person’s statuses comprise a document. Unfortunately, the resulting factors mostly have to do with non-personality considerations like age, misspellings, and whether you repost chain statuses. The highest correlation with the Big Five is between their Little Hellion factor (fucking/kill/dead/music vs family/wonderful/blessed/morning) and Conscientiousness, at -0.29. For reference, that’s about the correlation between Germanic-vs-French word origin and Agreeableness. (The Norman invasion lives on in our language.) But the experiments are great (k-fold cross-validation) and they situate themselves correctly in the Lexical framework: “Our approach suggests value in new constructs of personality derived from everyday human language use.”
The second is a dissertation that attempts to construct a Brazilian personality model from Twitter, using tweets that contain “I am” (“Sou” in Portuguese). This makes it a kind of unprompted self-report, rather than the prompted self-report found in surveys. Unfortunately the matrix is too sparse for factor analysis, so topic modeling is used instead, with the number of topics matching the number of dimensions (3, 5, 6, 7, 14, 15) in various models from psychology.
The Big Three from word vectors
I started my PhD predicting Big Five traits from Facebook statuses. After reading how the personality sausage was made, I realized the project used word vectors (of Facebook statuses) to predict noisy approximations of where individuals lived in Big Five space, which was itself originally defined by word vectors. It seemed more interesting to cut to the chase and learn something fundamental about personality from word vectors. (Also, the dataset I was using became toxic after Cambridge Analytica.) The rest of my PhD was spent constraining word vectors in order to reproduce the Big Five. This involved using transformers rather than LSA (more on that in future posts). The resulting correlations between factors from word vectors (DeBERTa) and from surveys are below. As you can see, there is very close agreement for the first three factors. Where the results diverge, it’s not clear which method is in error. Maybe surveys are right and all the correlations will go to 1 when we get GPT-5. Maybe surveys are just biased and noisy and too many PCs were extracted. Maybe they are measuring different things and we need to refine our interpretation of both. At any rate, it’s not obvious to me that surveys should be considered the gold standard. The Lexical Hypothesis is about language structure, after all, and psychology is the only field that uses surveys to analyze natural language.
Thurstone pioneered methods in statistics and linear algebra to probe the Lexical Hypothesis in the 1930s. It is remarkable that he developed a way to represent words that was later rediscovered for information retrieval and now powers the information age. Computation forced Thurstone to flatten the rich landscape of language to survey responses. In the past 30 years NLP has experienced multiple revolutions. If Thurstone invented a telescope with which to view language structure, we now have Hubble. Many insights await!