8 Cognitive Dissonance
As described in the Global Brain Overview, the goal of the global brain algorithm is to focus users’ attention on posts that reduce cognitive dissonance – differences of belief that exist only because people have been exposed to different information. When a note on a post changes the probability that users upvote the post, there is cognitive dissonance in proportion to the number of people who voted on the post without being shown the note. Information theory lets us easily quantify this cognitive dissonance using the concept of entropy.
8.1 Key Concepts from Information Theory
Here’s a quick summary of the relevant concepts of information theory:
- surprisal: how surprised I am when I learn that the value of X is x:
\[Surprisal(x) = -{\lg P(X=x)}\]
- entropy: how surprised I expect to be:
\[ \begin{aligned} H(P) &= 𝔼 Surprisal(X) \\ &= 𝔼 -{\lg P(X)} \\ &= ∑_x P(X=x) × -{\lg P(X=x)} \\ \end{aligned} \]
- cross-entropy: how surprised I expect Bob to be (if Bob’s beliefs are \(Q\) instead of \(P\)):
\[ \begin{aligned} H(P,Q) &= 𝔼 Surprisal_Q(X) \\ &= 𝔼 -{\lg Q(X)} \\ &= ∑_x P(X=x) × -{\lg Q(X=x)} \end{aligned} \]
- relative entropy or KL divergence: how much more surprised I expect Bob to be than me:
\[ \begin{aligned} D_{KL}(P || Q) &= H(P,Q) - H(P) \\ &= ∑_x P(X=x) × {\lg \frac{P(X=x)}{Q(X=x)}} \end{aligned} \]
When dealing with binary variables, these formulas can be written as follows (a short code sketch of these quantities appears after the list):
- entropy:
\[ H(p) = - p × {\lg p} - (1-p) × {\lg (1-p)} \]
- cross-entropy:
\[ H(p,q) = - p × {\lg q} - (1-p) × {\lg (1-q)} \]
- relative entropy or KL-divergence:
\[ D_{KL}(p||q) = p × {\lg \frac{p}{q}} + (1-p) × {\lg \frac{1-p}{1-q}} \]
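To make the binary formulas concrete, here is a minimal Python sketch (the function names are ours, chosen for illustration, and not taken from any existing global brain codebase); results are in bits because we use base-2 logarithms:

```python
from math import log2

def surprisal(prob: float) -> float:
    """Surprisal, in bits, of an event we assigned probability `prob`."""
    return -log2(prob)

def entropy(p: float) -> float:
    """Entropy H(p) of a binary variable with upvote probability p."""
    return -p * log2(p) - (1 - p) * log2(1 - p)

def cross_entropy(p: float, q: float) -> float:
    """Cross entropy H(p,q): expected surprisal of someone who believes q
    when the true upvote probability is p."""
    return -p * log2(q) - (1 - p) * log2(1 - q)

def kl_divergence(p: float, q: float) -> float:
    """Relative entropy D_KL(p || q) = H(p,q) - H(p)."""
    return p * log2(p / q) + (1 - p) * log2((1 - p) / (1 - q))

# Sanity check: D_KL(p || q) == H(p,q) - H(p)
p, q = 0.2, 0.9
assert abs(kl_divergence(p, q) - (cross_entropy(p, q) - entropy(p))) < 1e-12
```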
8.2 Surprisal as a Measure of Error
Surprisal can be thought of, for our purposes, as a measure of the “error” of a prediction. If we predict that something has a 1% chance of happening, and it happens, surprisal is \(-{\lg .01} = 6.64~bits\), whereas if we had predicted it had a 99% chance of happening, surprisal is only \(-{\lg .99} = 0.0145~bits\), which is much smaller. If we thought there was a 50/50 chance, surprisal is \(-{\lg .5} = 1~bit\).
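A quick check of those numbers (plain Python, purely illustrative):

```python
from math import log2

for prob in (0.01, 0.99, 0.5):
    print(f"predicted P = {prob} -> surprisal = {-log2(prob):.4f} bits")
# predicted P = 0.01 -> surprisal = 6.6439 bits
# predicted P = 0.99 -> surprisal = 0.0145 bits
# predicted P = 0.5 -> surprisal = 1.0000 bits
```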
Suppose we predict that the probability of a user upvoting a post is \(p\). Then suppose there are actually \(upvotes\) upvotes and \(downvotes\) downvotes. What is the total error of our prediction?
Every time there is an upvote, surprisal is \(-{\lg p}\). The probability of a downvote is just \(1-p\), so whenever there is a downvote surprisal is \(-{\lg (1-p)}\).
So our total error is:
\[ upvotes × -{\lg p} + downvotes × -{\lg (1-p)} \]
Since \(upvotes ≈ votesTotal×p\) and \(downvotes ≈ votesTotal×(1-p)\), our total error is approximately:
\[ \begin{aligned} & votesTotal × p × -{\lg p} + votesTotal × (1-p) × -{\lg (1-p)} \\ & = votesTotal × H(p) \end{aligned} \]
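A small simulation makes this concrete. This is a sketch under the vote model above, with arbitrary illustrative values for \(p\) and \(votesTotal\):

```python
import random
from math import log2

random.seed(42)

p = 0.9              # predicted upvote probability
votes_total = 1000
votes = [random.random() < p for _ in range(votes_total)]  # True = upvote

upvotes = sum(votes)
downvotes = votes_total - upvotes

total_error = upvotes * -log2(p) + downvotes * -log2(1 - p)
entropy_p = -p * log2(p) - (1 - p) * log2(1 - p)

print(total_error)              # total surprisal of the simulated votes
print(votes_total * entropy_p)  # ≈ the same value: votesTotal × H(p)
```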
In other words, our error is roughly entropy times the number of votes. That’s why entropy is often described as the average surprisal, or expected value of surprisal – it is the surprisal per vote. However, for our purposes, we will define cognitive dissonance in terms of total surprisal – specifically, as total relative entropy – which we define below. But first we need to discuss the closely related concept of cross entropy.
8.3 Total Cross Entropy
Let’s say Alice estimates the probability of an upvote to be \(q\), but Bob thinks the probability of an upvote is \(p\). So Bob’s measure of Alice’s error will differ from Alice’s measure of her own error! Bob’s measure of Alice’s error is:
\[ \begin{aligned} &upvotes × -{\lg q} + downvotes × -{\lg (1-q)} \\ &≈ votesTotal×p × -{\lg q} + votesTotal×(1-p) × -{\lg (1-q)} \\ &= votesTotal×( p×-{\lg q} + (1-p)×-{\lg (1-q)} ) \\ &= votesTotal×H(p,q) \\ \end{aligned} \]
\(H(p,q)\) is the cross entropy between Bob’s and Alice’s estimates: Bob’s estimate of Alice’s average error.
In our case, Alice represents the average uninformed user (users who haven’t seen the note on a post), and Bob represents the informed user. There are \(votesTotal\) “Alices”, or uninformed users. Bob expects the total error of all the uninformed users to be \(votesTotal×H(p,q)\).
\(H(p,q)\) will always be greater than \(H(p)\) if \(p≠q\). That is, Bob expects Alice’s error to be greater than his own error as long as his beliefs differ from Alice’s. This makes sense: Bob obviously thinks his beliefs are more accurate than Alice’s; if he didn’t, he would update his beliefs to be equal to Alice’s. As more “Alices” are exposed to the note, and consequently the average Alice’s beliefs \(q\) approach Bob’s beliefs \(p\), \(H(p,q)\) will decrease, until \(H(p,q)\) = \(H(p,p)\) = \(H(p)\).
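The claim that \(H(p,q) ≥ H(p)\), with the gap shrinking as \(q\) approaches \(p\), is easy to verify numerically. The probabilities below are arbitrary illustrative values (they happen to match the example in the next section):

```python
from math import log2

def entropy(p):
    return -p * log2(p) - (1 - p) * log2(1 - p)

def cross_entropy(p, q):
    return -p * log2(q) - (1 - p) * log2(1 - q)

p = 0.2  # Bob's (informed) upvote probability
for q in (0.9, 0.7, 0.5, 0.3, 0.2):  # Alice's belief drifting toward Bob's
    print(f"q = {q:.1f}: H(p,q) = {cross_entropy(p, q):.4f} bits")
# H(p,q) falls as q approaches p, reaching H(p) = entropy(0.2) ≈ 0.7219 bits at q = p
```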
8.4 Total Relative Entropy = Cognitive Dissonance
The difference between \(H(p,q)\) and \(H(p)\) is relative entropy, also known as the Kullback–Leibler divergence or KL-divergence. It can be thought of as “how much more surprised Bob expects Alice to be than himself”. Since \(H(p,q) ≥ H(p)\), relative entropy is never negative, regardless of whether \(p\) is greater or less than \(q\).
If we take Bob’s measure of Alice’s total error, minus his measure of his own total error, we get the total relative entropy: how much more error Bob expects for Alice than he would expect if Alice knew what he knows.
The total relative entropy is our measure of cognitive dissonance.
\[ \begin{aligned} cognitiveDissonance &= votesTotal × H(p,q) - votesTotal × H(p) \\ &= votesTotal × ( H(p,q) - H(p) ) \\ &= votesTotal × D_{KL}(p || q) \end{aligned} \]
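In code, this definition is a thin wrapper around the binary KL divergence. The names `cognitive_dissonance` and `kl_divergence` below are ours, chosen for illustration:

```python
from math import log2

def kl_divergence(p: float, q: float) -> float:
    return p * log2(p / q) + (1 - p) * log2((1 - p) / (1 - q))

def cognitive_dissonance(votes_total: int, p: float, q: float) -> float:
    """Total relative entropy, in bits: votesTotal × D_KL(p || q).

    p: informed upvote probability (users who have seen the note)
    q: uninformed upvote probability (users who voted without the note)
    """
    return votes_total * kl_divergence(p, q)
```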
8.5 Detailed Example
Suppose a post without a note receives 900 upvotes and 100 downvotes. But a certain note, when shown along with the post, reduces the upvote probability to 20%.
The relative entropy is
\[ \begin{aligned} D_{KL}(p || q) &= p × {\lg \frac{p}{q}} + (1-p) × {\lg \frac{1-p}{1-q}} \\ &= .2 × {\lg \frac{.2}{.9}} + .8 × {\lg \frac{.8}{.1}} \\ &= 1.966~bits \end{aligned} \]
Total cognitive dissonance is thus:
\[ \begin{aligned} cognitiveDissonance &= votesTotal × D_{KL}(p || q) \\ &= 1000 × 1.966~bits \\ &= 1966~bits \end{aligned} \]
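Plugging the example numbers into the binary KL formula reproduces both figures (a quick Python check):

```python
from math import log2

upvotes, downvotes = 900, 100
votes_total = upvotes + downvotes
q = upvotes / votes_total   # uninformed upvote probability: 0.9
p = 0.2                     # informed upvote probability (with the note)

d_kl = p * log2(p / q) + (1 - p) * log2((1 - p) / (1 - q))
print(f"D_KL(p || q) = {d_kl:.3f} bits")                        # 1.966 bits
print(f"cognitive dissonance = {votes_total * d_kl:.0f} bits")  # 1966 bits
```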
We can reduce cognitive dissonance by bringing the note to the attention of users who voted without seeing it. Some fraction of these users will change their vote after seeing the note. But some users are unlikely to change their vote, even if exposed to the note many times.
Thus, as we expose more and more of these users to the note, the upvoteProbability for these users will approach \(p\) but never reach it. We may be able to estimate some global fraction of users who do or don’t change their vote, and use this estimate to decide which notes to direct attention to in order to maximize the expected rate of reduction of cognitive dissonance, which is the overall goal of the global brain algorithm. See the next document on information value.
8.6 Discussion
8.6.1 Parallel to Machine Learning
Cross entropy is commonly used as the cost function in many machine learning algorithms. A neural network, for example, takes inputs with labels (e.g. images of cats and dogs) and outputs an estimated probability (e.g. that the image is a cat). The cost quantifies the error of these probability estimates with respect to the correct labels, and the neural network is trained by minimizing the cost function.
If \(ŷ_i\) is the machine’s predicted probability for training example \(i\), and \(y_i\) is the correct output (1 or 0), then the total cross entropy cost is:
\[ \sum_i H(y_i, ŷ_i) = \sum_i y_i × -{\lg ŷ_i} + (1-y_i) × -{\lg(1 - ŷ_i)} \]
In our case, if we say that \(y_i\) are users’ votes, and \(ŷ_i\) is always equal to the uninformed users’ prediction \(q\), then our cost function is identical to the cost function used when training a neural network:
\[ \begin{aligned} \sum_i H(y_i, q) &= \sum_i y_i × -{\lg ŷ_i} + (1-y_i) × -{\lg(1 - ŷ_i)} \\ &= \sum_i y_i × -{\lg q} + (1-y_i) × -{\lg(1 - q)} \\ &= upvotes × -{\lg q} + downvotes × -{\lg(1 - q)} \\ &≈ votesTotal×H(p,q) \\ \end{aligned} \]
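The correspondence can be checked directly. The sketch below sums the standard binary cross-entropy cost over per-vote labels with a constant prediction \(q\), and confirms it collapses to the grouped upvote/downvote expression (the labels are hypothetical informed votes with \(p = 0.2\), matching the example above; the variable names are ours):

```python
from math import log2

q = 0.9                        # the uninformed prediction, the same for every vote
votes = [1] * 200 + [0] * 800  # y_i: hypothetical informed votes (1 = upvote), p = 0.2

# Standard binary cross-entropy cost, summed over "training examples"
cost = sum(y * -log2(q) + (1 - y) * -log2(1 - q) for y in votes)

# The same quantity grouped into upvotes and downvotes
upvotes, downvotes = sum(votes), len(votes) - sum(votes)
grouped = upvotes * -log2(q) + downvotes * -log2(1 - q)

assert abs(cost - grouped) < 1e-6
print(cost)  # ≈ votesTotal × H(p,q) with p = 0.2, q = 0.9, i.e. ≈ 2688 bits
```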
So both neural networks and the global brain “learn” by reducing cross entropy. The difference is that the global brain reduces cross entropy not by learning to make better predictions, but by, in a sense, teaching users to make better predictions of how a fully-informed user would vote.
8.6.2 Subtle Point 1
Our measure of cognitive dissonance is relative entropy, not cross entropy. But minimizing relative entropy is the same as minimizing cross entropy, since relative entropy \(H(P,Q) - H(P)\) is just cross entropy minus a constant \(H(P)\) (\(H(P)\) is a constant since there’s nothing we can do to change the true probability \(P\)).
8.6.3 Subtle Point 2
Note that cross-entropy in our case is a measure of the total surprise of uninformed users at the hypothetical future votes of informed users. That is to say, it is:
\[ \begin{aligned} &votesTotal×p × -{\lg q} \\ &+ votesTotal×(1-p) × -{\lg (1-q)} \\ &≈ hypotheticalInformedUpvotes × -{\lg q} \\ &+ hypotheticalInformedDownvotes × -{\lg (1-q)} \\ &= votesTotal×H(p,q) \end{aligned} \]
And not
\[ \begin{aligned} & votesTotal×q × -{\lg q} \\ &+ votesTotal×(1-q) × -{\lg (1-q)} \\ &≈ actualUpvotes × -{\lg q} \\ &+ actualDownvotes × -{\lg (1-q)} \\ &= votesTotal×H(q) \end{aligned} \]
This is potentially confusing (it caused me a great deal of confusion initially) because we are measuring the error of Alice’s estimated probability \(q\) with respect to hypothetical events that have not yet actually occurred. But shouldn’t we be measuring the error with respect to the votes that have actually occurred?
No, because surprisal is a measure of the error of a probability estimate, not the error of the event. The events are “what actually happens” or, in the case of uncertainty, “what we think will actually happen” given our best current estimate \(p\).
Measuring error against the actual vote events just gives us \(votesTotal×H(q)\), which doesn’t tell us how much the actual votes differ from what they would be if users were more informed.
\(votesTotal×H(p,q)\) makes the most sense as a measure of the tension between the current state of users’ beliefs and what that state should be: a hypothetical future where all users have updated their votes because of the information in the note, and thus \(actualUpvotes = hypotheticalInformedUpvotes\) and \(actualDownvotes = hypotheticalInformedDownvotes\).
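With the numbers from the detailed example, the difference between these two quantities is easy to see (a quick check; the variable names are ours and follow the equations above):

```python
from math import log2

def cross_entropy(a, b):
    return -a * log2(b) - (1 - a) * log2(1 - b)

votes_total = 1000
q = 0.9   # uninformed upvote probability (actual votes: 900 up, 100 down)
p = 0.2   # informed upvote probability (hypothetical votes: 200 up, 800 down)

# Error of q measured against the hypothetical informed votes
against_informed = votes_total * cross_entropy(p, q)  # votesTotal × H(p,q) ≈ 2688 bits

# Error of q measured against the actual (uninformed) votes
against_actual = votes_total * cross_entropy(q, q)    # votesTotal × H(q) ≈ 469 bits

print(against_informed, against_actual)
```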