Sol Messing

Interactive Web Replication & Update of State Media Influence on LLMs

Sol Messing — Sun, 24 May 2026 00:00:00 GMT

Hannah Waight, Eddie Yang, Yin Yuan, Molly Roberts, Brandon Stewart, Josh Tucker and I published a paper in Nature (2026) showing that state-controlled media in LLM training data influences how those models talk about politics.

But we ran those audits in 2023-2024; it took a long time to get the paper published!

We wanted to know what happens with the current generation of models, especially how they memorize state media talking points. I’d also recently seen a lot of criticism on Twitter of LLM papers that relied on legacy models. And I wanted to show what being more pro-regime looks like in the actual text.

And you don’t land a Nature paper every day!

Now in the past, this is the kind of thing that I would get excited about but never actually execute because there’s a lot of slow and boring scaffolding work outside my expertise required to set this up. But I started using Claude Code late last year and of course CLI-AI tools are great for stuff like this.

In fact, Josh and I recently wrote a piece in Brookings about how agentic AI might make it possible to do more public outreach like this.

So I built an interactive companion site that replicates the core studies from the paper on current-generation models. The whole team gave feedback and what came out was pretty cool. It looked great and a few new and important findings emerged from the effort.

By and large, the core findings hold. In 38 countries, where more than 70% of language speakers reside, there’s a strong negative correlation between press freedom and pro-government LLM valence (-0.89) relative to English in current-generation models. Every current-generation model still produces more pro-government answers in Chinese than in English about Chinese leaders and institutions. Memorization rates for state-coordinated media phrases continue to be at or above rates for general web text.

Two years of capability improvements and safety work have not changed the underlying issue.

The highlights

Memorization effects are far larger for new models. As expected, newer, larger models memorize state media-aligned text at much higher rates than do the models we tested in the paper. We prompted models with the first half of 2,000 distinctive phrases and measured how often each model completes the second half of the phrase nearly perfectly. Half of the phrases are from Chinese state media talking points (red) and half from general Chinese web text (green/blue).
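To make the measurement concrete, here is a minimal sketch of a completion-based memorization rate. It is not the code from the replication repo: the function names, the character-level split, and the 0.9 similarity cutoff for "nearly perfectly" are all assumptions for illustration, and `complete` stands in for whatever call returns a model's continuation of a prompt.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable


def split_phrase(phrase: str) -> tuple[str, str]:
    """Split a phrase into a prompt (first half) and a held-out target (second half)."""
    mid = len(phrase) // 2
    return phrase[:mid], phrase[mid:]


def is_near_perfect(completion: str, target: str, threshold: float = 0.9) -> bool:
    """Compare the start of the completion with the held-out target.

    The 0.9 threshold is an illustrative assumption, not the paper's cutoff.
    """
    candidate = completion[: len(target)]
    return SequenceMatcher(None, candidate, target).ratio() >= threshold


def memorization_rate(phrases: Iterable[str], complete: Callable[[str], str]) -> float:
    """Share of phrases whose held-out half the model reproduces nearly perfectly."""
    phrases = list(phrases)
    hits = 0
    for phrase in phrases:
        prompt, target = split_phrase(phrase)
        if is_near_perfect(complete(prompt), target):
            hits += 1
    return hits / len(phrases) if phrases else 0.0


# Toy stand-in for a model: it has "memorized" exactly one phrase.
known = "the quick brown fox jumps over the lazy dog"


def toy_model(prompt: str) -> str:
    return known[len(prompt):] if known.startswith(prompt) else "no idea"


rate = memorization_rate([known, "an entirely different sentence here"], toy_model)
```

Running the same function separately on the state media phrases and the general web phrases gives the two rates that the red and green bars compare.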

Memorization rates across paper-era and current-generation LLMs. State-coordinated media phrases in red, general CulturaX web text in green. Newer models complete the held-out half of each phrase at substantially higher rates than the paper-era models.
Newer models tend to be even more positive toward China in Chinese.
DeepSeek V4 Pro overwhelmingly pro-China.

DeepSeek V4 Pro is overwhelmingly pro-China in both languages. Spot-checking suggests it’s spouting state media talking points in English: “principles of socialism with Chinese characteristics” and “whole-process people’s democracy.” To examine DeepSeek’s pro-China valence relative to other models, I ran pairwise LLM-as-judge comparisons across nine current-generation models, holding language constant, and fit a Bradley-Terry model. DeepSeek V4 Pro ranks first on China-favorability in both English and Chinese.
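For readers unfamiliar with Bradley-Terry, here is a rough sketch of how pairwise judgments turn into a ranking. It uses the standard minorization-maximization updates in plain Python. The model names and win counts are made up, and this is not the fitting code used for the companion site.

```python
from collections import defaultdict


def fit_bradley_terry(comparisons, iterations=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Returns a dict of strengths that sum to 1; a higher value means the item
    was judged more favorable more often. Assumes every item wins at least once.
    """
    wins = defaultdict(float)
    pair_counts = defaultdict(float)
    items = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        items.update((winner, loser))

    strengths = {item: 1.0 / len(items) for item in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            denom = 0.0
            for j in items:
                pair = frozenset((i, j))
                if j != i and pair in pair_counts:
                    # Guard against a zero denominator for degenerate inputs.
                    denom += pair_counts[pair] / max(strengths[i] + strengths[j], 1e-12)
            updated[i] = wins[i] / denom if denom > 0 else strengths[i]
        total = sum(updated.values())
        strengths = {i: s / total for i, s in updated.items()}
    return strengths


# Hypothetical judge outcomes: which model's answer was rated more favorable.
toy_comparisons = (
    [("model_a", "model_b")] * 8 + [("model_b", "model_a")] * 2
    + [("model_b", "model_c")] * 7 + [("model_c", "model_b")] * 3
    + [("model_a", "model_c")] * 9 + [("model_c", "model_a")] * 1
)
toy_strengths = fit_bradley_terry(toy_comparisons)
toy_ranking = sorted(toy_strengths, key=toy_strengths.get, reverse=True)
```

Fitting one model per language, as described above, keeps the comparison from mixing language effects with model effects.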

Code and data: github.com/state-media-influence-llm/replication

Paper: Waight et al. 2026, Nature (complimentary copy)

Companion site: state-media-influence-llm.github.io

An Early Election 2024 Forecast

Sol Messing — Thu, 11 Jan 2024 00:00:00 GMT

Early projections for 2024 based on previous Presidential and House returns slightly favor Republicans. These projections are completely unrelated to Biden’s recent polling numbers.

Here’s the story behind this approach: In early 2020, I ran battleground state election forecasts for Acronym. The results suggested Georgia would be extremely competitive—and Acronym spent more $ there than many other non-profit actors. After the election, we could see that those projections had much lower forecasting error than polling data https://solomonmg.github.io/post/what-the-polls-got-wrong-in-2020/.

Because this approach does not use polling data, it’s not susceptible to any of the potential problems with polls I talk about in that post: undecided voters breaking late, low education non-response, bad likely voter modeling, partisan non-response, shy Trumpers, etc.

The core idea behind this approach is a fact not emphasized enough in most stats/ML courses: if you’re going to try to predict something, it’s very hard to do better than using the same variable at t - 1 if you can. And we can. This approach goes one step further: it looks at the direction that variable has been moving and assumes that things are likely to keep moving in that same direction.

What that means for presidential election forecasts: for each state, estimate the “swing” from 2016 to 2020 for president and from 2018 to 2022 for the U.S. House; then simply add that to 2020 presidential returns. Those estimates of state-level swing are regularized (mathematically “nudged” toward national trends), which you’ll like if you believe “uniform swing” is particularly important. The projected state-level swing is weighted 60-40 toward presidential results.
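The arithmetic can be sketched in a few lines. This is a simplified illustration rather than the code in the linked repository: the state names and margins are made up, the cross-state mean stands in for "national trends," and the fixed `shrinkage` parameter is an assumption (the actual method uses a James-Stein adjustment, described under Methodological Details).

```python
PRES_WEIGHT = 0.6   # weight on presidential swing (from the post)
HOUSE_WEIGHT = 0.4  # weight on House swing (from the post)


def shrink_toward_mean(swings, shrinkage):
    """Pull each state's swing toward the cross-state mean.

    shrinkage=0 leaves swings untouched; shrinkage=1 replaces them with the mean.
    """
    mean = sum(swings.values()) / len(swings)
    return {s: mean + (1 - shrinkage) * (v - mean) for s, v in swings.items()}


def project_2024(pres_2016, pres_2020, house_2018, house_2022, shrinkage=0.5):
    """Project 2024 margins as 2020 presidential margin plus weighted, shrunk swing."""
    pres_swing = {s: pres_2020[s] - pres_2016[s] for s in pres_2020}
    house_swing = {s: house_2022[s] - house_2018[s] for s in pres_2020}
    pres_swing = shrink_toward_mean(pres_swing, shrinkage)
    house_swing = shrink_toward_mean(house_swing, shrinkage)
    return {
        s: pres_2020[s] + PRES_WEIGHT * pres_swing[s] + HOUSE_WEIGHT * house_swing[s]
        for s in pres_2020
    }


# Made-up Democratic margins (percentage points) for two hypothetical states.
pres_2016 = {"X": -1.0, "Y": 5.0}
pres_2020 = {"X": 1.0, "Y": 4.0}
house_2018 = {"X": 4.0, "Y": 8.0}
house_2022 = {"X": 0.0, "Y": 2.0}

raw = project_2024(pres_2016, pres_2020, house_2018, house_2022, shrinkage=0.0)
pooled = project_2024(pres_2016, pres_2020, house_2018, house_2022, shrinkage=1.0)
```

With `shrinkage=0.0` each state keeps its own swing. With `shrinkage=1.0` every state gets the same swing, which is the pure "uniform swing" case.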

Here’s a cleaner plot showing the actual forecast values in potential 2024 battleground states:

I should now point to a link to the data and code: https://github.com/SolomonMg/election_projection_regularized_swing, and thank the MIT Election Data + Science Lab for curating these data.

Electoral Math:

I’m going to rely on www.270towin.com to translate these projections into an electoral map. A better way to do this might be to come up with conservative estimates of error and simulate a few thousand elections, but I’m not estimating an extremely rigorous Bayesian model nor including enough extant data to really justify a FiveThirtyEight style forecast.
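For what it’s worth, the simulation idea mentioned above might look something like the sketch below. It is not something run for this post. The error sizes (`national_sd`, `state_sd`), the state names, and the electoral vote counts are all placeholder assumptions.

```python
import random


def simulate_elections(projected_margin, electoral_votes, n_sims=5000,
                       national_sd=2.0, state_sd=3.0, seed=0):
    """Estimate the share of simulated elections won by the party whose margin is positive.

    Each simulation draws one shared national error plus an independent error
    per state, then awards each state's electoral votes to the side ahead.
    """
    rng = random.Random(seed)
    needed = sum(electoral_votes.values()) // 2 + 1
    wins = 0
    for _ in range(n_sims):
        national_error = rng.gauss(0, national_sd)
        ev = 0
        for state, margin in projected_margin.items():
            simulated = margin + national_error + rng.gauss(0, state_sd)
            if simulated > 0:
                ev += electoral_votes[state]
        if ev >= needed:
            wins += 1
    return wins / n_sims


# Hypothetical three-state example with made-up margins and vote counts.
toy_margins = {"X": 1.5, "Y": -0.5, "Z": 4.0}
toy_votes = {"X": 10, "Y": 16, "Z": 11}
win_share = simulate_elections(toy_margins, toy_votes)
```

The shared national error is what makes state outcomes move together; dropping it would understate the chance of a sweep in either direction.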

If you call anything lower than a 3 point margin either way a “tossup,” here’s what the electoral map looks like:

That looks OK for Biden, but if you really trust this approach, you might want to say anything lower than 2% is a tossup. Then the electoral math looks very bad for Biden:

Observation: Polarization and Accuracy

These projections essentially assume party identification, demographic trends, and voting behavior will mostly continue in the same general direction as in the past. They should have a lot of appeal if you think polarization means most people have already made up their minds about who to vote for for President, that Presidential campaign effects are relatively small (in equilibrium at least), and/or that “demographics are destiny.” What’s more, the results are regularized toward national trends, which you’ll like if you believe that local politics has been “nationalized,” as Dan Hopkins argues, and thus that “uniform swing” in the electorate is an increasingly important factor explaining state-level election results, despite the fact that Florida bucked the national trend in 2020.

In fact, over time, as polarization seems to worsen, this approach improves in accuracy:

However, these projections do not account for events since 2022. Older voters pass away and younger voters become eligible to vote, changing the makeup of the electorate. Public opinion/sentiment may change in response to economic conditions (inflation/income/unemployment/etc), policy developments (e.g., related to abortion), international affairs like the Gaza conflict, or candidate attributes like Biden’s age or Trump’s legal troubles.

Observation: Patterns in U.S. Elections

These projections also do not explicitly model well-known voting patterns, instead relying on change from one cycle to another to get reasonable estimates. The most notable trend is that the president’s party almost always loses seats in the House in midterm elections. https://www.jstor.org/stable/2130810 https://fivethirtyeight.com/features/why-the-presidents-party-almost-always-has-a-bad-midterm/

Because the model only looks at the state-level difference between the last two midterm cycles, these projections capture how midterm returns change in each state. That goes a ways toward correcting for the president’s party’s consistently lower performance in midterms, and in part may reflect changes in sentiment toward the president.

A less reliable trend that’s held since FDR’s time is that incumbent presidents have tended to get a higher percent of the popular vote in their election for a second term (Obama in 2012 was a notable exception). https://www.presidency.ucsb.edu/statistics/data/presidential-election-mandates What’s more, House midterm results seem to be particularly bad just before an incumbent is voted out of office. It’s not clear if this is a bug or a feature, or how reliably this would be picked up using this method, but it’s worth pointing out.

Observation: Accuracy over Previous Election Results

If it’s hard to do better than election returns at t - 1, does this approach actually do better? Yes, by a little. Including all states, these projections have lower mean absolute error (MAE). For some reason these projections miss badly in 2004, the earliest year for which I can compute them using the MIT data. Excluding that year shows they do in fact do quite a bit better than simply relying on previous presidential election results alone.

> bt_dat %>% group_by(proj_type) %>% summarise(mean(mae))
# A tibble: 2 × 2
  proj_type `mean(mae)`
  <chr>           <dbl>
1 prev pres        3.11
2 proj             2.65

> bt_dat %>% filter(years != 2004) %>% group_by(proj_type) %>% summarise(mean(mae))
# A tibble: 2 × 2
  proj_type `mean(mae)`
  <chr>           <dbl>
1 prev pres        3.36
2 proj             2.51

Results are more subtle if we restrict to battleground states:

> bt_dat %>% group_by(proj_type) %>% summarise(mean(mae))
# A tibble: 2 × 2
  proj_type `mean(mae)`
  <chr>           <dbl>
1 prev pres        1.19
2 proj             1.16
> bt_dat %>% filter(years != 2004) %>% group_by(proj_type) %>% summarise(mean(mae))
# A tibble: 2 × 2
  proj_type `mean(mae)`
  <chr>           <dbl>
1 prev pres       1.28 
2 proj            0.995
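For readers who want to see what goes into those MAE numbers, here is a small sketch of the comparison in Python. The margins are invented for two hypothetical states and the function is not taken from the backtesting scripts. It only shows how a "previous presidential result" baseline and a projection are scored against the actual outcome.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute gap between predicted and actual margins over shared states."""
    states = sorted(predicted.keys() & actual.keys())
    return sum(abs(predicted[s] - actual[s]) for s in states) / len(states)


# Made-up margins (percentage points) for one backtest year.
actual_margin = {"X": 2.0, "Y": -1.0}
prev_pres_margin = {"X": 5.0, "Y": 1.0}   # baseline: last presidential result
projected_margin = {"X": 3.0, "Y": -2.0}  # baseline plus regularized swing

mae_prev = mean_absolute_error(prev_pres_margin, actual_margin)
mae_proj = mean_absolute_error(projected_margin, actual_margin)
```

Averaging these per-year MAEs by projection type is what the `group_by(proj_type)` summaries above report.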

Observation: Regularization toward 0 or the Mean?

Here’s the map with shrinkage toward 0, which will move the estimates toward the prior year’s election. Biden does worse in WI and slightly worse in AZ and PA, because the presidential swing estimates get pulled down toward zero instead of up toward the nation-wide state-level mean (3.5%). But he does better in NC, where the relatively good house results get pulled toward zero instead of down to the state-level average midterm swing (-11.5%).

However, based on my updated backtesting, the MAE estimates are worse when you shrink toward zero, which is what I did back in 2020. This makes me feel good because the mathematical/statistical theory says that shrinking toward the group mean should produce high quality estimates, while there’s not much theory that suggests shrinking toward zero should improve estimation.

Methodological Details

For each state it estimates the “swing” from 2016 to 2020 for president and 2018-2022 for the U.S. house; then simply adds that to 2020 presidential returns. The projected state-level swing is weighted 60-40 toward presidential results.

Now, the tricky bit is that I estimate “swing” using a James-Stein-adjusted state-level slope. This method “shrinks” the slant of each slope toward 50-50 or toward the average slope. 50-50 is what I used in 2020, but recent corrections I’ve made to my backtesting scripts reveal that it has a slightly higher mean absolute error going back to the 2004 election.
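As a rough illustration of the difference between the two shrinkage targets, here is the textbook positive-part James-Stein estimator in Python. It is a generic sketch, not the adjustment implemented in the repository: it assumes each estimate has the same known sampling variance `sigma2`, and the input values are arbitrary.

```python
def james_stein(values, sigma2, toward_mean=True):
    """Positive-part James-Stein shrinkage of a list of noisy estimates.

    toward_mean=True shrinks toward the average of the values;
    toward_mean=False shrinks toward zero.
    sigma2 is the (assumed known, common) sampling variance of each estimate.
    """
    k = len(values)
    target = sum(values) / k if toward_mean else 0.0
    dof = k - 3 if toward_mean else k - 2
    spread = sum((v - target) ** 2 for v in values)
    if dof <= 0 or spread == 0:
        return list(values)  # too few values, or nothing to shrink
    factor = max(0.0, 1.0 - dof * sigma2 / spread)
    return [target + factor * (v - target) for v in values]


# Arbitrary example slopes with an assumed sampling variance of 1.
slopes = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
toward_average = james_stein(slopes, sigma2=1.0, toward_mean=True)
toward_zero = james_stein(slopes, sigma2=1.0, toward_mean=False)
```

When the true slopes cluster around a nonzero average, shrinking toward zero pulls every estimate in the same direction, while shrinking toward the mean only pulls in the outliers.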

I’ve since updated the approach in a number of important ways, based on backtesting (looking at how well the method performs on past elections). I now regularize (or “shrink”) fewer quantities and do so toward the mean instead of toward zero. I should also note that the original code I used had a few minor errors, which I’ve since fixed.

Disaggregating ‘Ideological Segregation’

Sol Messing — Wed, 02 Aug 2023 00:00:00 GMT

TLDR:

[UPDATED SEPT 30] Yesterday, Science published a letter I wrote arguing that there is little evidence of algorithmic bias in Facebook’s feed ranking system that would serve to increase ideological segregation, also known as the “Filter Bubble” hypothesis.
This contradicts claims in González-Bailón et al 2023 that Newsfeed ranking increases ideological segregation. This claim was the main piece of evidence in the Science Special Issue on Meta that might support the controversial cover that suggested that Meta’s algorithms are “Wired to Split.”
The issue is that while domain-level analysis suggests feed-ranking increases ideological segregation, URL-level analysis shows no difference in ideological segregation before and after feed-ranking.
And we should strongly prefer their URL-level analysis. Domain-level analysis effectively mislabels highly partisan content as “moderate/mixed,” especially on websites like YouTube, Reddit, and Twitter (aggregation bias/ecological fallacy).
Interestingly, the authors seem to agree—the discussion section points out problems with domain-level analysis.
Another Science paper from the same issue, Guess et al 2023, shows (in the SM) that Newsfeed ranking actually decreases exposure to political content from like-minded sources compared with reverse-chronological feed ranking.
The evidence in the 4 recent papers is not consistent with a meaningful Filter Bubble effect in 2020; nor does it support the notion that Meta’s algorithms are “Wired to Split.”
Furthermore, domain-level aggregation bias is a big issue in a great deal of past research on ideological segregation, because domain-level analysis understates media polarization. Because González-Bailón et al 2023 gives both URL- and domain-level estimates, we can see the magnitude of aggregation bias. It’s huge.
I make a number of other observations about what we know about whether social media is polarizing and discuss implications for the controversial Science cover and Meta’s flawed claims that this research is exculpatory.
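A toy example makes the aggregation problem easier to see. The sketch below uses a deliberately simplified polarization measure (view-weighted mean absolute slant), not the segregation index from González-Bailón et al 2023. The domains, slant scores, and view counts are all invented.

```python
from collections import defaultdict


def url_level_polarization(items):
    """View-weighted mean absolute slant, scoring each URL on its own slant.

    items: list of (domain, slant, views), with slant in [-1, 1].
    """
    total_views = sum(views for _, _, views in items)
    return sum(abs(slant) * views for _, slant, views in items) / total_views


def domain_level_polarization(items):
    """Same measure, but every URL inherits its domain's average slant."""
    slant_sum = defaultdict(float)
    views_sum = defaultdict(float)
    for domain, slant, views in items:
        slant_sum[domain] += slant * views
        views_sum[domain] += views
    total_views = sum(views_sum.values())
    return sum(
        abs(slant_sum[d] / views_sum[d]) * views_sum[d] for d in slant_sum
    ) / total_views


# A hypothetical video platform hosts strongly partisan content from both sides.
toy_items = [
    ("video.example", -0.9, 100),
    ("video.example", 0.9, 100),
    ("news.example", 0.2, 100),
]
url_score = url_level_polarization(toy_items)
domain_score = domain_level_polarization(toy_items)
```

In this toy data the video platform averages out to a slant of zero, so the domain-level score comes out far lower than the URL-level score even though two thirds of the views went to highly partisan URLs.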

Intro/ICYMI


On BlueSky

Sol Messing — Mon, 03 Apr 2023 00:00:00 GMT

TL/DR Summary

BlueSky has a chance to dethrone Twitter right now, but that path is narrow.
Its exclusive, invite-only model means its user base is now small, elite, and homogeneous, with few bad actors. Almost everyone likes it. But the real test will be when it opens to the public.
It is designed for true account portability and in theory should prevent a single company from owning the entire network as it scales up.
However, it’s unclear if an ecosystem of small companies can do the job of content moderation in the same ways that centralized social networks do. The same is true of running modern feed-ranking and follow-recommendation systems.
There will be growing pressure to make money using ads to cover costs as the network scales up, which will incentivize centralizing key data and resources, undermining the original model.
Future possibilities include: (1) BlueSky remains de-facto centralized, “in beta” until it can get composable moderation right, which turns out to be the foreseeable future; (2) big players (Google, Facebook) join the party and dominate the ecosystem; (3) small, unmoderated, ad-free apps proliferate and the network becomes overrun with spam, NSFW, hate, scams and gifts that come with a lack of moderation.

Pretty much everyone at Twitter—and especially Jack Dorsey—has long known that BlueSky could replace Twitter. When I joined Twitter in 2021, I soon learned our CEO was terribly unpopular internally, sporting a job approval rating under 40 percent, by far the lowest of any executive at the company.

In fact, Jack was obsessed with decentralization. He seemed convinced that it was a mistake to have Twitter organized as a corporation, and he would rant about this on company-wide calls, which he seemed to be taking from caves in South Asia. This is when everyone else at the company was desperately trying to increase revenues to save the company from implosion.

A photo of young Jack Dorsey in a cave.

Enter BlueSky, which would decentralize Twitter. Jack launched the initiative in 2019, and his plan was to migrate Twitter to this new protocol. It puts user data, including posts and follow lists, on open, public portable data servers (PDSs), which means true account portability. Any business or organization could index those servers, or what I will call the “BlueSkyVerse” (technically the AT Protocol), rank posts, and create a front-end interface.

A node labelled as BlueSky or App sits atop various Portable Data Server (PDS) nodes, with arrows (edges) pointing to them. A caption to the right of the App node reads “indexing, ranking, moderation, UX” and a caption to the right of the PDS nodes reads “Open AT protocol: user posts, likes, follow graph.”

But wait a minute! Remember during Elon Musk’s acquisition how everyone said that the value of Twitter isn’t the tech, but rather the network of creators and the communities that exist there? If you decouple that network from the platform, you give up your most valuable asset: Google, Meta, and others can index the network, develop a user interface, create some algorithms, show ads, and eat your lunch.

And yet, Jack was about to do just that, filling Twitter’s moat by turning its most valuable asset into a protocol. Of course, this did not go over well with employees who weren’t independently wealthy, nor the board, who eventually pushed him out.

BlueSky nicely captures the essence of Jack’s reign as half-time CEO: how little he cared about Twitter as a business and how much he cared about Twitter as an ecosystem.

But back to the question everyone cares about right now: will this new system lead to a better social network, or set of networks? Is this finally the Twitter alternative we’re looking for?

Make no mistake about it—BlueSky was designed by Twitter to replace Twitter. This makes it very different from the other new social media protocols, apps, etc. that we’ve seen come on the scene of late. As John Gruber put it, “If you hated Twitter, you’ll like Mastodon. If you liked Twitter, you’ll love BlueSky.”

So it’s a contender, despite how hard it is to start a social network from scratch. And don’t get any funny ideas about a post-surveillance-capitalism social network—if BlueSky takes off, it will most likely devolve into a less-moderated, less-profitable version of Twitter, Inc (aka Twitter 1.0). It will indeed encourage competition for front-end interfaces to explore the BlueSkyVerse. But the biggest challenges that social networks have to face—content moderation, discoverability, and monetization—require big technical and infrastructural investments to do well. They may only be viable for well-capitalized companies that generate big profits.

But of course, I would be very nervous if I still worked at Twitter.

Will it work?

Now is a unique opportunity for a Twitter rival. Twitter CEO Elon Musk tends to say all manner of nutty things, he has decimated Twitter’s trust and safety org, and cut staffing by more than 80%. And the company slashed infrastructure budgets needed for automated content moderation—internal sources say the company has cut 3bn since peak spending prior to the recession, while external accounts say Musk ordered a 1bn cut himself.

It shows: in the wake of the Allen massacre on Saturday, graphic videos and misinformation spread across the platform. Advertisers don’t want to risk putting their brands next to that kind of content and many have suspended advertising on the platform.

We’ve all wondered which alternative social media system might replace Twitter. Could it be Mastodon, Spoutible, Post News, maybe Substack Notes? Or perhaps Truth Social or Gab or Gettr?!

A screenshot from Gab.com, with a post showing a flag of the UN with the text: “Need to burn a flag? Make it this one.”

I’m guessing it’s not going to be those other networks. The new centralized social network entrants (Spoutible, Post News, and Substack Notes) feel sterile and inauthentic when you first get started, partially because they are built around conventional media outlets, partially because they didn’t pay enough attention to discoverability in onboarding. Gettr/Gab/Truth Social have libertarian-borderline-right-wing moderation setups, and the vast majority of people on Twitter have little interest in a right-wing echo chamber where there’s no one to troll.

A screenshot from Spoutible showing a blank timeline.

Mastodon is losing steam for many reasons—onboarding is terribly confusing, it’s broken into communal servers that are all very different but that all seem uptight. Moderation there has been characterized as “petulant nannyism.”

Like Twitter, and unlike Mastodon, BlueSky can surface content from this entire web of activity across the BlueSkyVerse and delight you with memes and witticisms, many of which were about “Sexy” ALF (yes, the 80s TV star) when I signed up.

Many beta users say BlueSky feels like a breath of fresh air, like a throwback to early Twitter. For now, BlueSky is invite-only and so missing are the scammers, crypto bros, right-wing nuts, and tone policing randos looking for followers you find on Twitter. It feels more communal and less exhausting. Unclear how long that will last.

So maybe BlueSky has a legit claim to the Throne of Discourse, post-Twitter.

Content Moderation

First of all, content moderation is not just a “nice-to-have” thing that keeps the press happy. Facebook and others have found that content moderation increases retention. And look at the flip side: most people don’t want to hang out at what Mike Masnick calls “Nazi bars,” which is what platforms with permissive moderation policies will often become known for, whether they are actually Nazis or just radical free-speech advocates. Once that happens, kiss a lot of your core user base and ad revenue goodbye—which is what seems to be happening at Twitter.

Of course, content moderation is the bane of the modern social media network. It’s expensive, it will always be wrong, it can easily create a PR dumpster fire, and its benefits are extremely difficult to measure. This new protocol was designed with content moderation in mind, so let me break that down before talking about the problems that will surely come up.

On BlueSky, speech happens on your PDS, but reach happens on the centralized app—Bluesky for now. And they are in fact moderating, so if they find a post that violates policy, they may take it off their app. It’s still up on the PDSs, it’s just not indexed in BlueSky. So great, it allows for a slightly truer form of “freedom of speech but not reach.”

How does this actually work? The BlueSky team wants to create a “moderation ecosystem,” in which labels (“spam”, “nsfw”) can be created by anyone, and apps like BlueSky can then choose what labels to act upon. Right now, it’s completely centralized at BlueSky, and they have an automated layer and decisions are made by “server administrators.” Eventually though, there will be other label sources, other apps besides BlueSky and many servers beyond bsky.social. They’re proposing a “choose your own moderation” approach.
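The “speech on the PDS, reach on the app” split and the label idea can be sketched with a few toy classes. None of these names come from the AT Protocol or any BlueSky library; they are hypothetical stand-ins meant only to show how two apps could index the same servers and act on different labels.

```python
from dataclasses import dataclass, field


@dataclass
class Post:
    uri: str
    text: str


@dataclass
class PortableDataServer:
    """Hypothetical PDS: stores posts and never deletes them on an app's behalf."""
    posts: dict = field(default_factory=dict)

    def publish(self, post: Post) -> None:
        self.posts[post.uri] = post


@dataclass
class Labeler:
    """Hypothetical label source: anyone can run one and attach labels to posts."""
    name: str
    labels: dict = field(default_factory=dict)  # uri -> set of label strings

    def label(self, uri: str, tag: str) -> None:
        self.labels.setdefault(uri, set()).add(tag)


@dataclass
class App:
    """Hypothetical front end: chooses which labelers to trust and which labels to hide."""
    trusted_labelers: list
    hidden_labels: set

    def build_index(self, servers) -> list:
        visible = []
        for server in servers:
            for post in server.posts.values():
                tags = set()
                for labeler in self.trusted_labelers:
                    tags |= labeler.labels.get(post.uri, set())
                if not (tags & self.hidden_labels):
                    visible.append(post)
        return visible


pds = PortableDataServer()
pds.publish(Post("post/1", "hello world"))
pds.publish(Post("post/2", "buy cheap followers"))

spam_labeler = Labeler("spam-watch")
spam_labeler.label("post/2", "spam")

strict_app = App(trusted_labelers=[spam_labeler], hidden_labels={"spam", "nsfw"})
permissive_app = App(trusted_labelers=[spam_labeler], hidden_labels=set())

strict_uris = [p.uri for p in strict_app.build_index([pds])]
permissive_uris = [p.uri for p in permissive_app.build_index([pds])]
```

The labeled post disappears from the strict app’s index but is still sitting on the PDS, which is exactly why de-indexing alone may feel insufficient for the worst categories of content.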

OK what are the downsides?

First, there are key parts of moderation that raise questions under this framework. If you doxx someone’s home address for targeted harassment, post a bunch of Child Sexual Abuse Material (CSAM) or non-consensual sexual imagery, it feels insufficient to merely de-index those posts. There are cases where it may not be legally sufficient under the Digital Services Act, NetzDG, or U.S. Copyright Law.

The spam-detection arms race is another example: the more open you are about how it works, the faster the spammers get around your detection systems. Somewhat relatedly, the fact that blocklists are public on BlueSky due to the BlueSkyVerse architecture is already stirring controversy.

Finally, a big part of a healthy information ecosystem is keeping bad actors off your platform in the first place. In centralized networks, that’s often done by IP screening, cell phone/text message screening, email validation, and/or by using other private data. But a PDS hosts public data, so the centralized app would need to create parallel user accounts to collect and maintain that data.

All that means it’s difficult to see an alternative to a world where BlueSky and other AT apps need to start collecting private user data, even if it’s inconsistent with the clean decentralized, portable data model illustrated above. The line between PDS and user account will get very fuzzy very fast.

And, once apps do this for content moderation, wouldn’t they also wish to do it for advertising as well? Content moderation isn’t free.

Right now, signups are based on invites, which helps keep out bad actors. But eventually BlueSky will need to open up fully once it’s out of beta.

When that happens, the job of content moderation will be far more complex than in a place like Mastodon, because the BlueSky architecture is meant to enable “scale and global discoverability.” With Mastodon/the Fediverse, each server has its own policies, norms, and content moderation, which is far simpler in its small, federated worlds. In the BlueSkyVerse, you have no choice but to scale up moderation.

Recommender systems in the BlueSkyVerse

Will BlueSky be incentivized to build a feed-ranking system into their product and start logging the vast scope of data that inspired the phrase “surveillance capitalism?” They have already started down that path—in fact they’ve built the BlueSkyVerse to facilitate global discovery—large scale indexing and ranking across all PDSs in the network.

Right now, the “What’s Hot” feed does global discovery, but in a way that is pretty basic—it’s showing popular stuff from the last 30 minutes. For now this is fine, it’s the core of most modern recommender systems in social media websites.
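A feed like that takes very little code, which is part of the point. The sketch below is a guess at the general shape, not BlueSky’s implementation: using like counts as the popularity signal, the dictionary field names, and the `limit` are all assumptions.

```python
from datetime import datetime, timedelta


def whats_hot(posts, now, window_minutes=30, limit=50):
    """Return recent posts ordered by like count, most liked first.

    posts: list of dicts with 'uri', 'created_at' (datetime), and 'likes'.
    """
    cutoff = now - timedelta(minutes=window_minutes)
    recent = [p for p in posts if cutoff <= p["created_at"] <= now]
    return sorted(recent, key=lambda p: p["likes"], reverse=True)[:limit]


now = datetime(2023, 4, 3, 12, 0)
toy_posts = [
    {"uri": "a", "created_at": datetime(2023, 4, 3, 11, 50), "likes": 5},
    {"uri": "b", "created_at": datetime(2023, 4, 3, 11, 45), "likes": 9},
    {"uri": "c", "created_at": datetime(2023, 4, 3, 11, 0), "likes": 100},
]
hot_uris = [p["uri"] for p in whats_hot(toy_posts, now)]
```

Note that nothing here is personalized: every user sees the same list, so no per-user behavioral data is needed yet.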

Contrast this with Mastodon, where you can technically follow people from another server but the system isn’t designed so servers index each other and form one network. This is an important reason I think BlueSky could have legs, but Mastodon will probably not replace Twitter.

Setting aside any monetary pressures facing BlueSky for a minute, I suspect they will be driven toward increased data collection and deployment, simply because you need to do that to move the metrics that tell you your product is improving. This may be further cemented by the culture of modern engineering organizations, where engineering leaders and PMs ruthlessly focus on moving a “north star” metric, which is almost always some variant of time spent. “Time spent, daily active users, session counts, these are measures of whether you’re making your product better; the fact that they are all highly correlated with potential ad revenue is coincidental.”

Of course, to do anything like what Twitter and Facebook do with their recommender systems—for both follow recommendations and for feed-ranking—will require a lot more resources. For the follow graph, that entails predicting which users are likely to form mutual follow relationships or satisfactory follow-only relationships, which can be done with shortcuts but is ultimately a difficult (graph machine learning) problem. For feed-ranking, that requires predicting what users are likely to interact with what content, which both Twitter and Facebook had entire divisions of engineers and data scientists working on.
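One of the simplest shortcuts for follow recommendations is to count how many accounts you already follow also follow a candidate. The sketch below shows that heuristic only. It is a common baseline, not a description of what Twitter or Facebook run, and the usernames are made up.

```python
from collections import Counter


def recommend_follows(follows, user, k=3):
    """Suggest accounts followed by the accounts `user` already follows.

    follows: dict mapping each user to the set of accounts they follow.
    Candidates are ranked by how many of the user's follows also follow them,
    with ties broken alphabetically.
    """
    already = follows.get(user, set())
    scores = Counter()
    for friend in already:
        for candidate in follows.get(friend, set()):
            if candidate != user and candidate not in already:
                scores[candidate] += 1
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [candidate for candidate, _ in ranked[:k]]


toy_follows = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan", "eve"},
    "cat": {"dan"},
}
suggestions = recommend_follows(toy_follows, "ann")
```

This only needs the public follow graph, which a PDS already exposes. Predicting who will actually interact with whom is where the private behavioral data and the large engineering teams come in.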

Pressures to centralize and monetize the BlueSkyVerse

Venture capitalists and startups in Silicon Valley are always talking about “moats.” If you invest a great deal of resources to build a technology or a new marketplace, what’s to stop a competitor from drinking your milkshake?

There’s an influential idea among “Web 3.0” circles, which is that Facebook, Instagram, and Twitter are the landlords of castles you can’t leave. That’s not supposed to happen this time—the BlueSkyVerse was designed around account portability and front-end/algorithmic competition. The hope is this will create an ecosystem of small companies doing bits and pieces of what big social media companies do today.

At the same time, everything I’ve seen so far suggests that large investments are going to be required to even start playing in the BlueSkyVerse. There are barriers to entry on data processing to even index it as users grow, to create a legit feed and UX, and to do content moderation at that kind of scale. Jack has given billions to the BlueSky team to get the system to where it is today.

So what happens if the BlueSkyVerse really takes off? We might indeed see real competition for front-end apps that do custom algorithmic ranking and figure out innovative ways to moderate content. We might see further media fragmentation—perhaps front-end providers will try to differentiate themselves by topic or political orientation like television channels do.

But running a modern social media website is expensive. If it grows as big as Twitter, indexing the BlueSkyVerse will become a challenge, same for running modern recommender systems. And if you want ad revenue you need content moderation, which you can’t solve with AI alone—you need humans in the loop, which means you don’t get the kind of economies of scale you’d see with automated systems. What’s more, you often need sensitive user data to do these things well, and you need bespoke solutions to new adversarial tactics you find. So it’s hard to fully rely on an external company for these solutions, as the creators of BlueSky seemed to envision.

The future of the network

I see a few possibilities if BlueSky gets really big: the first is that BlueSky the app simply dominates this system—they moved first, they understand the system, they can do content moderation, they figure out how to scale up, and they may decide to sell ads. At the same time if BlueSky does become “Twitter 3.0,” there have to be consequences to the fact that I can simply take my posts and follow-graph to a competing service and still be on the same network.

Or maybe not. Maybe they will realize that the challenges of content moderation favor keeping the network as is, and the BlueSkyVerse will remain closed for a long time. Perhaps forever.

But if it does really launch and open up, it seems likely that established tech starts to play—Google jumps in, dedicates a small fraction of the resources it used to fund Google+, indexes the BlueSkyVerse in a day, and boom… has a competitor to Facebook. Maybe Facebook jumps in too, but that’s a tricky proposition because once part of Facebook/Instagram has true account portability what happens to the rest of the company?

Of course, another outcome that seems likely is a conservative social media front-end provider. Maybe Truth Social integrates with the BlueSkyVerse. It won’t make much money because many in that demographic seem happy with Twitter for now, and there will be substantial brand risk for potential advertisers.

Finally, we might see pure anarchy. In this “race to the bottom” scenario, a set of small, unmoderated, ad-free apps proliferate. Since people don’t like ads, they use these apps. The network becomes overrun with spam, NSFW, hate, scams and gifts that come with a lack of moderation. Of course, it’s unclear these apps would be tolerated by the app stores, but this is one direction things might generally go.

What can we learn from ‘The Algorithm,’ Twitter’s partial open-sourcing of its feed-ranking recommendation system?

Sol Messing — Mon, 03 Apr 2023 00:00:00 GMT

Last Friday (2023-03-31) Twitter released what it calls “the algorithm,” which appears to be a highly redacted, incomplete part of the code that governs the “for you” home timeline ranking system. And I saw nothing to suggest the parts of the code they put in the GitHub repository weren’t authentic.

It’s highly unusual for a tech company to open up a product at the core of its monetization strategy. The thinking is that the more engaging the content you show people right when they log in, the more likely they are to stick around. And the more you keep people logged in, the more they see ads. And the more data you can get to show them better ads!

Transparency, or a distraction from closing the API?

Is this a step forward for transparency as Musk and Twitter would claim? I am skeptical. You can’t learn much from this release in and of itself—you need the underlying model features, parameters, and data to really understand the algorithm. Those combine into a system that’s effectively different for everyone! So even if you had all that, you’d likely need to algorithmically audit the system to really get a handle on it.

And Twitter made it prohibitively expensive for external researchers to get that data through its API with the recent price updates ($500k/yr). So at the same time Twitter is releasing this code, it has made it incredibly difficult for researchers to audit it.

What’s in the code? Gossip and Rumors

Ukraine: There were some initial reports that Twitter was downranking tweets about Ukraine. I looked at the code and can tell you those claims are wrong. Twitter has an audio-only Clubhouse clone called Spaces, and that code is for that product, not ordinary tweets on the home timeline. What’s more, this is likely a label related only to crisis misinformation, as per Twitter’s Crisis Misinformation Policy.

Musk Metrics: One of the most interesting things we learned from the code is that Twitter created an entire suite of metrics about Elon Musk’s personal Twitter experience. The code shows they fed those metrics to the experimentation platform (Duck Duck Goose, or DDG), which at least historically has been used to evaluate whether or not to ship products.

This episode is consistent with reporting that engineers are very concerned about how any features they ship affect the CEO’s personal experience on Twitter. Other reporting has suggested that there may have been a Musk-centric boost feature that shipped, and you would want exactly this kind of instrumentation to understand how that worked in practice.

Republican, Democrat Metrics: We also learned that Twitter is logging similar metrics for lists of prominent Democrat and Republican accounts, ostensibly to understand whether any features that they ship affect those sets of accounts equally. Now we know that conservative accounts tend to share more misinformation than liberal accounts on both Twitter and on Facebook. And Musk has alleged that Democrats and Big Tech are colluding to enforce policy violation unequally across parties.

But if you have these “partisan equality” stats as part of your ship criteria, perhaps on equal footing with policy violation frequency, you can see how this could really affect the types of health and safety features that actually make it to the site in production.

This code was then comically removed via pull requests from Twitter. Because once you delete something on GitHub, it just goes away. Right?

Twitter Blue Boost: What’s more, we sorta knew that Twitter Blue users get a boost in feed ranking, but the code makes it clear that it could double your score among people who don’t follow you, and quadruple it for those who do.

As Jonathan Stray pointed out, if this counts as a paid promotion, the FTC might require Twitter to label your tweets as ads. Now we kind of already knew this from Musk’s Twitter Blue announcement, but having evidence in the code might cross a different line for the FTC.

So what about the ackshual algorithm? What does this say about feed ranking?

The code itself is there but it’s missing specifics—key parameters, feature sets, and model weights are absent or abstracted. And obviously the data.

The most critical thing we learned about Twitter’s ranking algorithm is probably from a readme file that former Facebook Data Scientist Jeff Allen found. If we take that at face value, a fav (Twitter like) is worth half a retweet. A reply is worth 27 retweets, and a reply with a response from a tweet’s author is worth a whopping 75 retweets!

Now it’s not quite that simple—what about when a tweet is first posted and there’s no data? Twitter’s deep learning system (in the heavy ranker) will do some heavy lifting and predict the likelihood of each of these actions based on the tweet author, their network, any initial engagements, the tweet text, and thousands of signals and embeddings.
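To make the weighting concrete, here is a minimal sketch. It assumes the final score is a weighted sum of predicted engagement probabilities, uses the relative weights as I described them above (not values checked against the repository here), and makes up the probabilities for two hypothetical tweets.

```python
# Relative weights as described in the post, in units of one retweet.
# These follow the post's reading of the readme and are not verified here.
WEIGHTS = {
    "fav": 0.5,
    "retweet": 1.0,
    "reply": 27.0,
    "reply_engaged_by_author": 75.0,
}


def weighted_score(probabilities):
    """Sum of weight * predicted probability over engagement types."""
    return sum(WEIGHTS[action] * p for action, p in probabilities.items())


# Hypothetical predicted probabilities for two tweets.
likes_heavy = {"fav": 0.20, "retweet": 0.02, "reply": 0.001,
               "reply_engaged_by_author": 0.0}
reply_heavy = {"fav": 0.02, "retweet": 0.01, "reply": 0.02,
               "reply_engaged_by_author": 0.005}

print(weighted_score(likes_heavy))
print(weighted_score(reply_heavy))
```

Under these made-up numbers, the tweet that is likely to draw replies outscores the tweet that is far more likely to be liked, which is the practical upshot of weights this lopsided.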

Of course, what happens in the first few minutes when a tweet is posted deeply shapes who sees and engages with it downstream in the future.

[And the way this is implemented in practice is that the model handles all cases, but as you get more and more real time data on a tweet, those real time features dominate everything else and push those probabilities close to 1, see discussion here.]

Now I should point out that there are some spammy accounts claiming to have found ranking parameters in the code. They’re wrong: those parameters are used to retrieve tweets from your network for candidate generation only. Lucene is an open source search tool.

I should point out, however, that some of the “Earlybird” code was at one point used in timeline ranking, and it appears that it may be used in cr-mixer, which is used in candidate generation for out-of-network tweets.

Interestingly, Twitter appears to remove competitor URLs, perhaps only for tweets that are outside of network (you don’t follow the author).

What else goes into “the Algorithm?”

What gets ranked in the first place? The other piece here is the “TikTok” part of the ranking algorithm, which is also incomplete without the models/data/parameters/etc. What I mean is the code that takes content from across the platform and says “I’m going to put this into your queue for the heavy ranker to sort out.”

Now on Twitter, that historically often meant tweets posted by or replied to by accounts you follow. But Twitter realized it could find a lot more content for that heavy ranker magic.

There’s a complex system that inserts tweets into your queue for ranking. This is called candidate generation in the “recommendation system” subfield of applied computing.

If you follow a lot of people on twitter like me, about half of the candidate tweets in twitter’s ranked “for you” timeline at any given time are from people you follow.

Now, if you don’t follow a ton of people, or if you have a new account, you can run out of these tweets, and then Twitter will try to find additional candidates so that you have ranked content. If so, that means this system is going to govern what’s in your home timeline feed, much like TikTok: gathering content it predicts you’ll like from across the platform.

This takes place in cr-mixer, and although some of the high level function calls are there, much of the code and the models appear to be missing, and many files come with this warning at the top: “This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.”

Twitter seems to have made public some of the systems underlying candidate generation, including its SimCluster model.

BTW, I’d like to give a shout out to Vicki Boykis and Igor Brigadir, who are doing amazing work to map out the codebase and unearth exactly what’s missing and what’s not.

Trust and Safety

A lot of the code related to Trust and Safety is missing, presumably to prevent bad actors from learning too much and gaming those systems. However, there do seem to be some specifics about the kinds of things Twitter considers borderline or violating that I don’t think were previously public. There are a bunch of safety parameters in the code, some of which are in Twitter’s policy documents, but some are not.

There are entries like “HighCryptospamScore” that appear in the code, which may give scammers hints about how to craft tweets to get around detection systems. The same is true for code that contains links to “UntrustedUrl,” “TweetContainsHatefulConductSlur” for low, medium and high severity.

There’s also a reference to a “Do Not Amplify” parameter in the code, which was discussed in the Twitter Files but seems not to be publicly documented in its policies. There are entries like “AgathaSpam,” which refers to a proprietary embedding used across the codebase. Twitter also has a bunch of visibility rules hardcoded in Scala that might be useful to bad actors trying to game the system, outlining what rules are in play for all tweets, new users, user mentions, liked tweets, realtime spam detection, etc. Finally, some of the consequences for those violations are spelled out in Scala as well.

Of course, it’s really hard to know with certainty that any of this wasn’t in public somehow before this release.

Past vote data outperformed the polls. How did it go so wrong?

Sol Messing — Sun, 08 Nov 2020 00:00:00 GMT

It’s becoming clear that the 2020 polls underestimated Trump’s support by anywhere from a 4-8 point margin depending on your accounting–a significantly worse miss than in 2016, when state polls were off but the national polls did relatively well.

In fact, this year we were better off using projections based on past vote history in each state to predict how things would go in battleground states, as I’ll show below.

But I also want to start to ask questions about what happened this time around. The polling from 2018 looked encouraging, convincing many pollsters that the post-2016 reckoning had fixed many issues called out in the 2016 AAPOR report on election polling. After 2018, FiveThirtyEight wrote that the “Polls are Alright”.

But the second Miami-Dade reported results from the 2020 election, we knew something was probably wrong with the 2020 polls.

As Stefan notes (we worked together at Pew Research Center’s Data Labs), the error seems slightly lower in key battleground states, though the polls missed big in WI, perhaps in part due to its horrifically bad voter file data.

Unlike 2016, both state and national polls appeared to underestimate Trump’s support, as this early (Nov 7) analysis from Tom Wood shows:

Polling versus past votes

Perhaps what surprised me the most about polling this time around was when I went to evaluate some election projections I put together in April that we used internally at Acronym to help evaluate where we might want to spend. I pulled in the NYTimes polling averages and compared them with the latest state-level presidential results from the AP. I then did the same for the April projections. Turns out the projections were significantly more accurate than the polling averages:

We used these projections, and other extant data (including the fact that there are two Senate races in play), when making what turned out to be a very lucky decision to start spending money in Georgia. We were one of the biggest and earliest spenders in that race.

What are these projections? I simply took the last two state-level Presidential and U.S. House election totals, estimated each state’s “trajectory,” and added that to each state’s Democratic margin from the previous cycle.

(Note that I also weighted 60-40 toward the Presidential results, and slightly regularized both the latest margin and the trajectory toward zero.)
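For the curious, here is a rough sketch of that calculation. The 60-40 weighting and the idea of shrinking both pieces toward zero come from the description above; the shrinkage factor, the way the two offices are blended before taking the trajectory, and the example margins are all assumptions for illustration, not my actual April inputs.

```python
def shrink(value, factor):
    """Regularize a value toward zero by a multiplicative factor in [0, 1]."""
    return value * factor


def project_margin(pres_prev, pres_last, house_prev, house_last,
                   pres_weight=0.6, shrinkage=0.9):
    """Project a state's Democratic margin (in points) for the next cycle.

    pres_* and house_* are Democratic margins from the two most recent
    Presidential and U.S. House elections in the state.
    The shrinkage value is hypothetical.
    """
    house_weight = 1.0 - pres_weight
    # Blend the two offices, weighted 60-40 toward the Presidential result.
    last = pres_weight * pres_last + house_weight * house_last
    prev = pres_weight * pres_prev + house_weight * house_prev
    # Trajectory: how far the blended margin moved between the two cycles.
    trajectory = last - prev
    # Shrink both the latest margin and the trajectory toward zero, then add.
    return shrink(last, shrinkage) + shrink(trajectory, shrinkage)


# Hypothetical state trending toward Democrats.
print(project_margin(pres_prev=-8.0, pres_last=-5.0,
                     house_prev=-10.0, house_last=-4.0))
```

In this made-up example the state moved about four points toward Democrats between cycles, so the projection lands close to even despite a blended Republican margin of almost five points last time.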

Informing this approach is work from Yair Ghitza describing what went wrong in 2016, which suggested polarization and other state-level trends would continue, in addition to national trends or “uniform swing.”

I should note that this may only have worked because of something peculiar about this election cycle–I haven’t gone and back-tested this approach or anything like that.

Seems I was not the only one who noticed this kind of pattern:

What went wrong: The Usual Suspects

Humble-brag aside, it’s worth asking what might have gone wrong with polling in 2020.

The 2016 AAPOR report on election polling provides some guidance for how we might start to examine issues with the 2020 polls.

Undecided voters: Undecideds broke toward Trump late in the election in 2016–polls found as many as 13 percent of voters were undecided on election day or planned to vote for a third party. According to Poynter, there were half as many of these voters in 2020, so this is unlikely to be as big a factor as in 2016.

Low education non-response & adjustment: In 2016, individuals with lower levels of education were much less likely to answer polls but still voted, and broke for Trump. The national polls adjusted for this but state level polls did not, which is partially why forecasting models that rely on state-level polls missed so hard.

While many state-level pollsters did this in 2020, Pew Research Center still found problems with state level polling this time around, for example failing to adjust for race and education simultaneously–non-college whites are far more likely to support Trump than non-college non-whites.

What’s more, pollsters adjusted only for college/non-college, which may not have been enough. They might need to use more fine-grained adjustment, accounting separately for whether respondents have a high school degree and a college degree. Also, errors and missing data in how people report their education in a survey mean trouble if you want to fully fix the issue.

Volunteerism & civic engagement: Even if you adjust for high levels of non-response among individuals with lower education, pollsters still may have problems reaching low civic engagement voters, a bias that seems to persist even after modeling/weighting adjustments. In the past this hasn’t mattered as much, but these folks may be showing up to the polls for Trump.

Other Potential Factors

Likely voter models: This is difficult to fully unpack since each polling house does this slightly differently and not all publish their methods—some ask a battery of voter questions, some use models, some recruit off the voter file. But there’s only a weak relationship between who votes and who scores high on the likely voter battery. To make matters worse, 2020 was a very high-turnout election, which could have introduced even more instability into likely voter models.

Another important point from Peter Suzman is that likely voter screens could have inflated estimates of Dem turnout if they asked if respondents had already voted—it was Democrats who voted early.

However, that would only explain error in likely voter models, not polling based on registered voters, which also seemed to miss big this cycle, as I pointed out:

COVID-19: I wrote about this back in June. It’s possible that COVID-19 made lines long and kept people home in urban areas and non-white communities. Yes we had record turnout but all it takes is a few percent of people who encounter a bit of voting friction, who fail to register in person, don’t get in-person canvassing/GOTV contact, don’t vote by mail early, and/or don’t vote in person.

At the same time, David Shor points out in a piece by Nate Cohn at the New York Times, that “…after lockdown, Democrats just started taking surveys, because they were locked at home and didn’t have anything else to do.”

Without Dems doing the usual in-person registration drives, organizing, canvassing, etc. plus long lines in the hardest hit areas, and with Democrats taking surveys at unusually high rates, we might expect to see Trump overperform in areas hit hardest by COVID-19.

And indeed the data show just that. NPR has a nice visualization of this:

Another possibility is that shutdowns, school closings, and job losses stoked anger & resentment in centrist & right-leaning voters. I remember watching a local FB group quickly organize around the issue of school-openings and eventually morph into a hub for protests.

EDIT: I took a look at Trump’s performance by the urbanicity and racial makeup of those counties and here’s what I found:

Trump outperforms 2016 in non-white counties, and UNDER-performs in mostly-white counties. Same for more urban counties. That’s consistent w/ covid hitting non-white counties much harder in terms of registration, long-lines, and lower VBM rates.

That seems to stand in sharp contrast to speculation that Trump would be hit hardest in areas where people are most likely to know someone with COVID.

Shy Trump voters: There’s a hypothesis out there that people are embarrassed to admit that they would vote for Trump. The evidence for this is limited–Kyle Dropp and co at Morning Consult did some experimental work on this and found that people were slightly more likely in the 2016 primaries (but NOT the General and not in 2020) to say that they would vote for Trump when answering via online survey compared with speaking with a live pollster over the phone. But they’ve done many follow-on surveys since and the pattern doesn’t persist.

I am skeptical that this could be as much of a factor as some on social media seem to be claiming, but it’s hard to get good data to answer this question and I acknowledge that absence of evidence is not evidence of absence. A number of commentators have claimed that since the polls underestimated support for all Republicans, this is an unlikely explanation.

That sounds pretty air-tight at first glance but it’s possible that some undecideds, perhaps embarrassed about having Trump as a figurehead of the Republican party, refused to say with certainty who they would actually vote for. Nevertheless, based on the pattern of results we’ve seen so far, this really can’t explain very much of the polling error this time around.

The Role of Election Forecasts

If you’re a forecaster, it’s very easy to look at all the polling data and come away with overconfident estimates of a candidate’s support. Many forecasters in 2016 did just that, failing to account for the fact that errors across states and pollsters were likely correlated, and producing estimates that put Clinton’s chances above 95%.

The Huffington Post famously roasted FiveThirtyEight for trying to adjust for this state-level polling error the day before the 2016 election.
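To see why correlated error matters so much, here is a toy simulation. Everything in it is hypothetical: five states with identical three-point leads, "winning" defined as carrying a majority of states rather than electoral votes, and error sizes picked so that each state's total error variance is about the same in both scenarios. It is a sketch of the general idea, not any forecaster's actual model.

```python
import random


def win_probability(leads, state_sd, shared_sd, n_sims=20000, seed=0):
    """Share of simulations in which the leader carries a majority of states.

    leads: polled leads (in points) for the leading candidate in each state.
    state_sd: standard deviation of each state's independent polling error.
    shared_sd: standard deviation of a polling error common to all states.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_sims):
        # One draw of the shared error applies to every state in this run.
        shared = rng.gauss(0, shared_sd) if shared_sd > 0 else 0.0
        carried = sum(
            1 for lead in leads
            if lead + shared + rng.gauss(0, state_sd) > 0
        )
        if carried > len(leads) / 2:
            wins += 1
    return wins / n_sims


leads = [3.0, 3.0, 3.0, 3.0, 3.0]  # hypothetical polled leads
# Same total error per state (about 4 points), split differently.
independent = win_probability(leads, state_sd=4.0, shared_sd=0.0)
correlated = win_probability(leads, state_sd=2.65, shared_sd=3.0)

print(independent, correlated)
```

When the errors are independent, misses in one state tend to be offset by hits in another, so the leader's chances look very strong. Move part of that same error into a component shared by every state and the estimated chance of winning drops noticeably, even though no individual state poll got any noisier.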

But even when forecasters get it right, forecasting can create firm expectations that one candidate will win, which in 2016 was complicated by a destiny narrative driven by media coverage of election forecasting.

Sean Westwood, Yph Lelkes and I recently published a research paper in the Journal of Politics showing just how much additional confidence forecasts give us, and wrote about the implications for the 2020 election in a recent USA Today op ed.

I believe it was the sharp violation of expectations that was so disappointing to Clinton supporters and so invigorating for the MAGA crowd—the Washington elite had underestimated “real Americans” yet again.

Trump’s chances are better than they look

Sol Messing — Sat, 20 Jun 2020 00:00:00 GMT

According to the latest polling research, Trump’s chances of hanging on to power beyond 2020 look pretty dismal. Nate Cohn published an impressive battleground poll from New York Times/Siena showing Biden ahead of Trump by at least six points in pivotal states. The Economist’s forecast, powered by Elliott Morris and Andrew Gelman, is suggesting Biden is likely to get 64% of electoral college votes, and that if the election were held 100 times Biden would win 90 times to Trump’s 10.

At this point I would like to remind you of that feeling you felt on election night 2016. When a month earlier, CNN’s ‘Poll of Polls’ had Clinton up by 9 points and two prominent forecasters put Clinton’s chances at 99%. Remember that?

I could probably stop there, but I’m not going to because although we’ve fixed some of the issues from 2016, we have COVID-19. And COVID will mess with our election in ways very likely to hurt Democrats, and I know of no pollster factoring this into their method or likely voter model.

After 2016, Sean Westwood, Yphtach Lelkes and I began a multi-year research project (recently published in the Journal of Politics) and found that when you have high confidence that one candidate will win, you’re less likely to vote. The fact that everyone thought Clinton would win in 2016 shaped Comey’s decision to release his infamous letter that some believe cost Clinton the election, changed the way campaigns operated, and likely lowered Democratic turnout.

In addition to showing this in an experiment, one pattern that clearly pops out in the data we analyzed (ANES timeseries) is that people who think the leading candidate will win by quite a bit report voting at about a 3% lower rate. That’s in line with other research showing that early exit polls indicating one candidate is likely to win decrease turnout, and are more likely to affect Democrats. Yet this is by no means an upper bound—one study found more decisive exit polling depressed turnout by 11 points.

While it’s if anything a noisy indicator of the influence Clinton’s ostensible lead may have had on Democrats compared with Republicans, the proportion of Democrats who thought Clinton would ‘win by quite a bit’ was much higher in 2016 than for Republicans, and much higher than it’d been in many years.

To be clear, I no longer occupy the role of dispassionate observer–I’m actively working in politics at the moment.

So while I like seeing Biden up, let me explain exactly why the margins we’re seeing could be a polling mirage.

COVID-19

Are pollsters accounting for the likely decline in urban turnout due to COVID-19? Not if they are assuming typical levels of turnout across urban and rural areas.

Make no mistake, COVID-19 is already affecting the political process: look at voter registration. As many colleagues who regularly deal with registration data have warned me, the usual rush of new voter registrations, often from young voters, has “fallen off a cliff.” Registration numbers started stronger than ever as the new year began, but as 538 notes, fell to unprecedented levels in March as pandemic social distancing measures took effect.

So it’s already hurting Democrats in terms of new registrations, but what might all this mean on election day? At first blush, it may be tempting to say to yourself, “COVID is affecting old people more than the young, and they break conservative so the left is probably fine,” before feeling slightly ashamed that you’re thinking about strategic considerations before the loss of life and sadness this statement implies.

Think a little deeper and you’ll likely realize that so far COVID-19 has affected left-leaning people in left-leaning places—non-White voters in urban areas far more than their suburban/rural counterparts. Even the recent surge in cases in sunbelt states is hitting urban and non-White regions hardest.

What’s more, conservatives seem to be far more willing to risk going out and about than liberals. A Pew study shows Republicans are far more likely to support lifting COVID restrictions quickly than Democrats.

With a deadly pandemic raging, will urban and non-urban voters go to the polls at the usual rates?

Post-pandemic primary voting has meant a vast reduction in the number of polling places and a big increase in mail-in-ballots. We’re seeing this in post-pandemic primaries like this Tuesday’s in Kentucky, New York, and Virginia.

In New York’s primary, there were reports of missing mail-in ballots. Kentucky also saw reports of long lines that disproportionately hit Black neighborhoods, in a primary that will determine the Democrat who runs against Senate Majority Leader Mitch McConnell.

What at first looks like maybe a silver lining is the surge in voting by mail-in ballot. And while Trump sees mail-in ballots as a threat to his re-election, the evidence is far from clear that widespread voting by mail would hurt his chances.

On the contrary, Stanford’s Andy Hall estimates that universal vote by mail should have no impact on either party’s vote share. However, as they note, vote by mail may very well have a disparate impact on minority voters, and their estimates assume that every voter is mailed a ballot, rather than needing to opt-in to voting by mail.

And just today, the Supreme Court denied an emergency request to allow all citizens in Texas to vote by mail. That’s not the last word, but conservatives are actively fighting measures like this one, which would have made it far easier to prepare to handle a deluge of mail-in ballots in the fall.

Furthermore, we’re already seeing evidence in the primaries of poll-workers failing to show up, lengthening the already long lines in urban areas that discourage voters.

If a little bit of rain can depress turnout in urban areas, fear of a deadly pandemic that spreads when you’re standing in line seems likely to as well.

What’s more, it’s going to take longer to count mail in ballots, and there will almost certainly be confusion about results as Poynter recently noted. Based on the President’s rhetoric around voting by mail, there will almost certainly be legal disputes about the legitimacy of certain results if not the election writ large.

Buckle up.

Things Change

Six months ago the big story was the prospect of war with Iran after Trump killed Soleimani. The political world is fundamentally different now and it’s more than possible that something important will happen between now and election day with political consequences.

Does that matter? Andrew Gelman (yes, the same) and Gary King have a paper suggesting it doesn’t, showing that we can predict elections remarkably well despite how much polls fluctuate. “Thus, the general campaign for president seems irrelevant to the outcome … despite all the media coverage of campaign strategy… except in very close elections.” And Alan Abramowitz’s forecasting model, which is the kind of model they are referencing and which has done extremely well in the past, has Trump’s chances in 2020 nearly even (though both the economy and Trump’s polling numbers have suffered since).

So even if you’re one of those people who think that in general the ebb and flow of historical events largely does not impact U.S. elections, it may matter more in 2020 than in a typical year, and that’s before you even factor in a global pandemic that has upended life in America.

OK, but how many people are really undecided about Trump? When asked who they’d vote for, 8 percent of people in Nate Cohn’s poll said something other than Biden or Trump.

According to the American Association for Public Opinion Research 2016 post-mortem, a tsunami of undecided voters went to Trump, which was a major reason we thought Clinton was going to win in 2016. One controversial possibility is that some of these undecided voters were actually “shy Trump supporters,” which might explain the swing. Of course, another controversial possibility is that the Comey letter cost Clinton the election.

Regardless, Trump is more well-known now than in 2016 and there are fewer undecideds this time around. But we have no clue how these folks will break in 2020, and in 2016 they broke for Trump.

Correcting for Political Engagement

That 8 percent undecided number above may very well be an underestimate. Arguably the biggest problem in survey research today is that you can’t fully adjust for the bias toward high political knowledge respondents. And low political knowledge voters are more likely than others to be undecided.

Education

Likewise, it’s difficult to survey Americans with low education. The vast majority of polls fail to recruit a representative swath of these potential voters and cannot fully adjust away the bias.

OK so what?

No election has split on education like 2016 going back to the beginning of Pew Research Center’s data on this in 1980. Non-college Whites voted for Trump over their college educated counterparts by a 35 point margin. And the best retrospective analyses show that his biggest gains have come from low-education White moderates in battleground states (and not as many have presumed, from those with conservative views on race and immigration, across the educational spectrum).

Many 2016 polls did not adjust their samples to account for education—something that mattered far less in the past and something not easy to do correctly. They systematically underestimated Trump’s support in part because of this issue.

Although the NYT/Siena poll and now many others do target and weight by education to increase representativeness, the methodology shows this poll (along with most) lumps together everyone without a college degree. While the AAPOR report concludes this may be ok, Trump’s support does appear to increase as education decreases, which means failing to disaggregate “no high school degree,” “high school degree,” and “some college” when adjusting for education may very well result in polls that understate Trump’s support.

Higher Error in Subnational Polls

The NYT/Siena poll is one of the best subnational polls out there, but the error in battleground polls like this is generally higher than in national polls. It’s harder to reach the right mix of people in individual states in a short period of time, which increases the error. By error, I mean the actual error in predicting presidential vote share, not the reported “margin of error,” which is usually around half the actual error.

The reported margin of error here is about 2%, so doubling that, 4%, plus the 8 percent who didn’t say Biden or Trump means there may be 12% wiggle room, possibly more.
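Spelled out as back-of-the-envelope arithmetic, using only the numbers above (the variable names are just labels for this sketch):

```python
reported_moe = 2.0        # reported margin of error, in points
error_multiplier = 2.0    # actual error runs at roughly double the reported margin
undecided_or_other = 8.0  # share who named neither Biden nor Trump, in points

actual_error = reported_moe * error_multiplier
wiggle_room = actual_error + undecided_or_other

print(wiggle_room)  # 12.0
```

This simply adds the two sources of slack together, so treat it as a rough ceiling on how far things could move rather than a formal confidence interval.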

Issues with the Voter File

The way pollsters recruit people for their survey has a huge impact on accuracy. If you don’t get data from the right mix of people you’re not going to get a good sense of which candidate is ahead, and you can only get so much juice out of adjusting your polls using approaches like weighting.

Pollsters often use random-digit dialing to get a representative sample, but many of the best election surveys run today are now conducted by calling people from the voter file. The Times used the voter file in part so they could poll congressional districts, which are drawn in such idiosyncratic shapes that they don’t line up with area codes nor almost any other data set with phone numbers.

The Times uses the voter file to target specific subsets of the population that are hard to reach, such as low-education voters. Unfortunately, running a voter-file based poll may still not get enough low-education voters—as Pew Research Center’s voter file study showed (it used the same voter file vendor as does the Times—L2). So you have to rely on statistical adjustment, increasing error.

What’s more, in that Pew study only 62% of respondents who answered on a cell phone were the actual person on the voter file. And these quality issues vary a lot by state—remember, the file is first gathered by the secretary of state and is subject to local laws and regulations. For example, Wisconsin’s voter file is notoriously bad.

All this increases total survey error and the chance that systematic biases will creep in.

While the polling does indeed suggest better news than if it showed Trump ahead, this is still very likely a highly competitive race.

Facebook Condor URLs Data Release

Sol Messing — Mon, 18 May 2020 00:00:00 GMT

On January 17, 2020 my team at Facebook launched one of the largest social science data sets ever constructed. It’s meant to facilitate research on misinformation from across the web, shared and spread on Facebook.

Full details on the release here.

We also released the URL sanitization framework, which I implemented (and which my SWE colleagues refactored).

What makes this data release unprecedented is that it contains exposure data describing external links that billions of users saw and read while using the site.

The data set goes beyond URL-level data, breaking down exposure and interactions by month, country, age, gender, and in the U.S., political page affinity (see Barbera et al 2015).

The data contain two tables: (1) a “URL attributes” table describing the 38 million URLs in the data set, including how many times users tagged those posts as containing misinformation, harassment, etc., and (2) a “breakdown” table, which aggregates counts of actions taken on URLs, broken out by user demographics and URL attributes.

The technical documentation reflects more work than most papers I’ve written. Its list of authors reflects the scale of this massive team effort, and that’s before you include the incredibly helpful advice we got from a number of computer scientists in the academy listed in the acknowledgements.

Perhaps most importantly, this release provides guarantees about anonymity in an incredibly rigorous way–action-level differential privacy, while preserving more underlying signal in the data.

Projecting Confidence

Sol Messing — Mon, 18 May 2020 00:00:00 GMT


Inspired by Donald Trump’s surprise victory over Hillary Clinton in the 2016 general election, Sean Westwood, Yphtach Lelkes and I set out to interrogate the question of whether election forecasts, particularly probabilistic forecasts, might create a sense of inevitability and ultimately lead people to stay home on election day.

Clinton herself was quoted in New York Magazine after the election:

I had people literally seeking absolution… ‘I’m so sorry I didn’t vote. I didn’t think you needed me.’ I don’t know how we’ll ever calculate how many people thought it was in the bag, because the percentages kept being thrown at people — ‘Oh, she has an 88 percent chance to win!’

Is it plausible that forecasting could have affected the election?

For this phenomenon to affect an election, it must: 1. be visible in the media so it reaches potential voters, 2. depress turnout, and 3. affect one side more than the other. In the case of 2016, that means affecting Clinton’s supporters (and/or Clinton campaigners) more than Trump’s.

We found evidence for all of the above. First, witness the rise of forecasts since 2008, when FiveThirtyEight first came on the scene:

What’s more, there is good evidence that one side will be more affected. Our research (see results below) suggests that the candidate who is ahead in the polls is more affected by probabilistic forecasts. In 2016, that was Hillary.

And irrespective of 2016, it’s outlets with a left-leaning audience that publish and cover election forecasts. The websites that present their poll aggregation results in terms of probabilities have left-leaning (negative) social media audiences—only realclearpolitics.com, which doesn’t emphasize win-probabilities, has a conservative audience:


These data come from the average self-reported ideology of people who share links to various sites hosting poll-aggregators on Facebook, data that come from this paper’s replication materials.

When you look at the balance of coverage of probabilistic forecasts on major television broadcasts, there is more coverage on MSNBC, which has a more liberal audience.


How much influence do forecasters really have?

It’s incredibly difficult to tease out when one media outlet is influencing another. However, a freak event in 2018 allows us to get some traction on this question, and suggests that FiveThirtyEight’s 2018 coverage was highly influential.

After FiveThirtyEight’s real-time forecast suddenly moved the GOP’s odds of taking the House from single digits to about 60% at around 8:15PM, PredictIt’s odds on the GOP rose above 50-50, and U.S. government bond yields rose 2-4 basis points. FiveThirtyEight then altered its prediction system and the markets calmed down.

This spike seems to have occurred because a number of big, Republican-dominated districts started reporting returns before those that went toward Democrats, and because the forecast was making inferences from partial vote counts:


This was first reported by Colby Smith & Brian Greeley of FT.com. They report that because markets expected to see more inflation under a Republican House (high spending, low taxes), U.S. bond yields rose.

Was this just a correlation? Possibly, but there was pretty much nothing else happening in the U.S., and it was like 1 am in Europe, as pointed out in the FT.com piece above.

Josh Tucker suggested that 538 might be driving prediction markets back in 2012 in a Monkey Cage blogpost.

Our research on forecasting and perception

Our research shows that probabilistic election forecasts make a race look less competitive. Participants in a national probability survey-experiment were substantially more certain that one candidate would win a hypothetical race after seeing a probabilistic forecast than after seeing the equivalent vote share estimate and margin of error. This is a big effect: those are confidence intervals, not standard errors, with p-values below .


Why do people do this?

More research is needed here but we do have some leads. First, small differences in the election metric most familiar to the public—vote share estimates—generally correspond to very large differences in the probability of a candidate’s chance of victory.

Andy Gelman referenced this in passing in a 2012 blogpost questioning the decimal precision (0.1 percent) that 538 used to communicate its forecast on its website:

That’s right: a change in 0.1 of win probability corresponds to a 0.004 percentage point share of the two-party vote. I can’t see that it can possibly make sense to imagine an election forecast with that level of precision…

Second, people sometimes confuse probabilistic forecasts with vote share projections, and incorrectly conclude that a candidate is projected to win, say, 85% of the vote, rather than having an 85% chance of winning the election. About 1 in 10 people did this in our experiment.

As Joshua Benton pointed out in a tweet, TalkingPointsMemo.com made this very mistake:

Finally, people tend to think in qualitative terms about the probability of events {%cite sunstein2002probability%}, {%cite keren1991calibration%}. An 85% likelihood that something will happen means it’s going to happen. These studies may help explain why after the 2016 election, so many criticized forecasters for “getting it wrong” (see this and this).
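To make the first point above concrete, here is a minimal R sketch of how small vote share differences translate into much larger win probability differences. It assumes forecast uncertainty is normal with a spread of 0.045, an illustrative value that is not taken from any real forecast; `forecast_sd`, `vote_share`, and `win_prob` are names made up for this example.

```r
# Assumption: forecast uncertainty is normal with a fixed, illustrative spread.
forecast_sd <- 0.045

# A handful of projected vote shares for the leading candidate.
vote_share <- c(0.50, 0.51, 0.52, 0.53, 0.55)

# Win probability: chance that the realized vote share exceeds one half.
win_prob <- 1 - pnorm(0.50, mean = vote_share, sd = forecast_sd)

print(data.frame(vote_share, win_prob = round(win_prob, 3)))
```

Under these assumptions, each extra point of vote share moves the win probability by several points, which is the gap in scale that the Gelman quote is getting at.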

What about voting?

Perhaps most critically, we show that probabilistic forecasts showing more of a blowout can lower voting. In Study 1, we find limited evidence of this based on self reports. In Study 2, we show that when participants are faced with incentives designed to simulate real world voting, they are less likely to vote when probabilistic forecasts show higher odds of one candidate winning. Yet they are not responsive to changes in vote share.


Could this actually affect real world voting?

Consider 2016—an unusually high number of Democrats thought the leading candidate would win by quite a bit:


And people who say the leading candidate will win by quite a bit in pre-election polling are about three percentage points less likely to say they voted after the election than people who say it’s a close race. That’s after controlling for election year, prior turnout, and party identification.


The data here are from the American National Election Study (ANES) and go back to 1952.

Past social science research also provides evidence that the perception of a close race boosts turnout. Some of the best evidence comes from work that analyzes the effects of releasing exit polling results before voting ends, which clearly removes uncertainty. Work examining the effects of East Coast television networks’ “early calls” for one candidate or another on West Coast turnout generally finds small but substantively meaningful effects, despite the fact that these calls occur late on election day (see also this). Similar work exploiting voting reform as a natural experiment shows a full 11 percentage point decrease in turnout in the French overseas territories that voted after exit polls were released. These designs are not confounded with the tendency for campaigns to invest more in competitive races.

Researchers consistently find robust correlations between tighter elections and higher turnout [see this; and this for reviews]. Furthermore, there is evidence from statistical models that prior election returns also explain turnout above and beyond campaign spending, particularly when good polling data is unavailable.

Field experiments provide additional evidence that perceptions of higher electoral competition increase turnout. This work finds substantive effects on turnout when polling results showing a closer race are delivered via telephone [among those who were reached] but null results when relying on postcards to deliver closeness messages. Finally, one study conducted in the weeks leading up to the 2012 presidential election found higher rates of self-reported, post-election turnout when delivering ostensible polling results showing Obama neck-and-neck with Romney, which was not consistent with the extant polling data showing a comfortable Obama lead.

Could this affect politicians as well?

Candidates’ perceptions of the closeness of an election can affect campaigning and representation {%cite enos2015campaign%}, {%cite Mutz:1997wy%}.

These perceptions can also shape policy decisions. For example, prior to the 2016 election, the Obama administration’s confidence in a Clinton victory was reportedly a factor in the muted response to Russian intervention in the election.

And former FBI Director James Comey, because of his confidence in a Clinton victory, said he felt that it was his duty to write a letter to Congress on October 28 saying he was reopening the investigation into her emails. Comey explained his actions based on his certain belief in a Clinton win: “[S]he’s gonna be elected president, and if I hide this from the American people, she’ll be illegitimate the moment she’s elected, the moment this comes out” {%cite keneally_2018%}. Nate Silver at one point said “the Comey letter probably cost Clinton the Election.”

Media coverage: Washington Post, FiveThirtyEight’s Politics Podcast, New York Magazine, Political Wire.

Impression of Influence

Sol Messing — Sun, 17 May 2020 00:00:00 GMT


The Impression of Influence: Legislator Communication, Representation, and Democratic Accountability. Princeton University Press, 2015. With Justin Grimmer and Sean Westwood. Media: Mischiefs of Faction.

Why Election Forecasting Matters

Sol Messing — Sun, 26 Apr 2020 00:00:00 GMT

Do you remember the night of Nov 8, 2016? I was glued to election coverage and obsessively checking probabilistic forecasts, wondering whether Clinton might do so well that she’d win in places like my home state of Arizona. Although FiveThirtyEight had Clinton’s chances at beating Trump at around 70%, most other forecasters had her at around 90%.

When she lost, many on both sides of the aisle were shocked. My co-authors and I wondered if America’s seeming confidence in a Clinton victory wasn’t driven in part by increasing coverage of probabilistic forecasts. And, if a Clinton victory looked inevitable, what did that do to turnout?

We weren’t alone. Clinton herself was quoted in New York Magazine after the election:

I had people literally seeking absolution… ‘I’m so sorry I didn’t vote. I didn’t think you needed me.’ I don’t know how we’ll ever calculate how many people thought it was in the bag, because the percentages kept being thrown at people — ‘Oh, she has an 88 percent chance to win!’

Enter our recent blog post and paper released on SSRN, “Projecting confidence: How the probabilistic horse race confuses and de-mobilizes the public,” by Sean Westwood, Solomon Messing, and Yphtach Lelkes. While our work cannot definitively say whether probabilistic forecasts played a decisive role in the 2016 election, it does indeed show that compared to more conventional vote share projections, probabilistic forecasts can confuse people, can give people more confidence that the candidate depicted as being ahead will win, may decrease turnout, and that liberals in the U.S. are more likely to encounter them.

We appreciate the media attention to this work, including coverage by the Washington Post, New York Magazine, and the Political Wire. What’s more, FiveThirtyEight devoted much of their Feb. 12 Politics Podcast to a spirited, and at points critical discussion of our work. We are open to criticism and will respond to some of the questions raised in this post.

Below, we’ll show that the evidence in our study and in other research is not inconsistent with our headline, as the hosts suggest: we’ll detail the evidence that probabilistic forecasts confuse people, irrespective of their technical accuracy. We’ll also discuss where we agree with the podcast hosts. Furthermore, we’ll discuss a few topics which, judging from the hosts’ discussion, may not have come through clearly enough in our paper. We’ll reiterate what this work contributes to social science: how the paper adds to our understanding of how people think about probabilistic forecasts and how they may decrease voting, particularly for the leading candidate’s supporters and among liberals in the U.S. We’ll then walk readers through the way we mapped vote share projections to probabilities in the study. Finally, we’ll discuss why this work matters, and conclude by pointing out future research we’d like to see in this area.

What’s new here?

The research contains a number of findings that are new to social science:

Presenting forecasted win-probabilities gives potential voters the impression that one candidate will win more decisively, compared with vote share projections (Study 1).
Higher win probabilities, but not vote share estimates, decrease voting in the face of the trade-offs embedded in our election simulation (Study 2). This helps confirm the findings in Study 1 and adds to the evidence from past research that people vote at lower rates when they perceive an election to be uncompetitive.
In 2016, probabilistic forecasts were covered more extensively than in the past and tended to be covered by outlets with more liberal audiences.

Where we agree

If what you care about is conveying an accurate sense of whether one candidate will win, probabilistic forecasts do this slightly better than vote share. And, they seem to give people an edge on accuracy when interpreting the vote share if your candidate is behind. Of course, people can be confused and still end up being accurate, as we’ll discuss below.

We also agree that people often do not accurately judge the likelihood of victory after seeing a vote share projection. That makes sense because, as the study shows, people appear to largely ignore the margin of error, which they’d need to map between vote share estimates and win probabilities.

We also agree that a lot of past work shows that people stay home when they think an election isn’t close. What we’re adding to that body of work is evidence that compared with vote share projections, probabilistic forecasts give people the impression that one candidate will win more decisively, and may thus more powerfully affect turnout.

Does the evidence in our study contradict our headline?

Our headline isn’t about accuracy, it’s about confusion. And the evidence from this research and past work taken as a whole suggests that probabilistic forecasts confuse people (something that came up at the end of the segment), even if the result sometimes is technically higher accuracy.

1. People in the study who saw only probabilistic forecasts were more likely to confuse probability and vote share. After seeing probabilistic forecasts, 8.6% of respondents mixed up vote share and probability, while only 0.6% of respondents did so after seeing vote share projections. We’re defining “mixed-up” as reporting the win-probability we provided as the vote share and vice-versa.

2. Figure 2B (Study 1) shows that people get their candidate’s likelihood of winning very wrong, even when we explicitly told them the probability a candidate will win. It’s true that they got slightly closer with a probability forecast, but they are still far off.


Why might this be? A lot of past research and evidence suggests that people have trouble understanding probabilities, as noted at the end of the podcast. People have a tendency to think about probabilities in subjective terms, so they have trouble understanding medical risks and even weather forecasts.

Nate Silver has himself made the argument that the backlash we saw to data and analytics in the wake of the 2016 election is due in part to the media misunderstanding probabilistic forecasts.

As the podcast hosts pointed out, people underestimated the true likelihood of winning after seeing both probabilistic forecasts and vote share projections. It’s possible that people are skeptical of any probabilistic forecast in light of the 2016 election. It’s possible they interpreted the likelihood not as hard-nosed odds, but in rather subjective terms — what might happen, consistent with past research. Regardless, they do not appear to reason about probability in a way that is consistent with how election forecasters define the probability of winning.

3. Looking at how people reason about vote share — the way people have traditionally encountered polling data — it’s clear from our results that when a person’s candidate is ahead and they see a probabilistic forecast, they rather dramatically overestimate the vote share. On the other hand, when they are behind, they get closer to the right answer.

But we know from past research that people have a “wishful thinking” bias, meaning they say their candidate is doing better than polling data suggests. That’s why there’s a positive bias when people are evaluating how their candidate will do, according to Figure 2A (Study 1).


The pattern in the data suggests that people are more accurate after seeing a probabilistic forecast for a losing candidate because of this effect, and not necessarily because they better understand that candidate’s actual chances of victory.

4. Perhaps even more importantly, none of the results here changed when we excluded the margin of error from the projections we presented to people. That suggests that the public may not understand error in the same way that statisticians do, and therefore may not be well-equipped to understand what goes into changes in probabilistic forecast numbers. And of course, very small changes in vote share projection numbers and estimates of error correspond to much larger swings in probabilistic forecasts.

5. Finally, as we point out in the paper, if probabilistic forecasters do not account for total error, they can really overestimate a candidate’s probability of winning. Of course, that’s because an estimate of the probability of victory bakes in estimates of error, which recent work has found is often about twice as large as the estimates of sampling error provided in many polls.

As Nate Silver has alluded to, if the forecaster does not account for unobserved error, including error that may be correlated across surveys, he or she will artificially inflate the estimated probability of victory or defeat. Of course, FiveThirtyEight does attempt to account for this error, and released far more conservative forecasts than others in this space in 2016.

Speaking in part to this issue, Andrew Gelman and Julia Azari recently concluded that “polling uncertainty could best be expressed not by speculative win probabilities but rather by using the traditional estimate and margin of error.” They seemed to be speaking about other forecasters, and did not directly reference FiveThirtyEight.

At the end of the day, it’s easy to see that a vote share projection of 55% means that “55% of the votes will go to Candidate A, according to our polling data and assumptions.” However, it’s less clear that an 87% win probability means that “if the election were held 1000 times, Candidate A would win 870 times, and lose 130 times, based on our polling data and assumptions.”

And most critically, we show that probabilistic forecasts showing more of a blowout could potentially lower voting. In Study 1, we provide limited evidence of this based on self reports. In Study 2, we show that when participants are faced with incentives designed to simulate real world voting, they are less likely to vote when probabilistic forecasts show higher odds of one candidate winning. Yet they are not responsive to changes in vote share.


What’s with our mapping between vote share and probability?

The podcast questions how a 55% vote share with a 2% margin of error is equivalent to an 87% win probability. This illustrates a common problem people have when trying to understand win probabilities: it’s difficult to reason about the relationship between win-probabilities and vote share without actually running the numbers.

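You can express a projection as either (1) the average vote share (can be an electoral college vote share or the popular vote share)

$$\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} x_i$$

and margin of error

$$t_{0.975,\,N} \times \frac{\hat{\sigma}}{\sqrt{N}}$$

Here the average for each survey is $x_i$, and there are $N$ surveys; $\hat{\sigma}$ is the spread of those survey averages.

Or (2) the probability of winning, that is, the probability that the vote share is greater than half, based on the observed vote share and standard error:

$$1 - \Phi\left(\frac{0.5 - \hat{\mu}}{\hat{\sigma}}\right)$$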

Going back to the example above, here’s the R code to generate those quantities:

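# Inputs for the hypothetical election
svy_mean = .55        # average vote share across surveys
svy_SD = 0.04483415   # see appendix
N_svy = 20            # number of surveys

# Margin of error: t critical value times SD / sqrt(N)
margin_of_error = qt(.975, df = N_svy) * svy_SD/sqrt(N_svy)

svy_mean
# [1] 0.55

margin_of_error
# [1] 0.02091224

# Win probability: chance that the vote share exceeds one half
prob_win = 1-pnorm(q = .50, mean = svy_mean, sd = svy_SD)
prob_win
# [1] 0.8676222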

More details about this approach are in our appendix. This is similar to how the Princeton Election Consortium generated win probabilities in 2016.

Of course, one can also use an approach based on simulation, as FiveThirtyEight does. In the case of the data we generated for our hypothetical election in Study 1, this approach is not necessary. However we recognize that in the case of real-world presidential elections, a simulation approach has clear advantages by virtue of allowing more flexible statistical assumptions and a better accounting of error.
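For readers who want to see what a simulation-based calculation looks like in its simplest form, here is a toy Monte Carlo sketch in R. It is not FiveThirtyEight’s method: it simply draws hypothetical outcomes from the same normal model and the same inputs used in the example above, and counts how often the candidate clears 50%. The seed, `n_sims`, `sim_share`, and `prob_win_sim` are arbitrary choices made for this illustration.

```r
set.seed(2016)  # arbitrary seed so the sketch is reproducible

svy_mean <- 0.55        # average vote share from the example above
svy_SD <- 0.04483415    # spread from the example above
n_sims <- 100000        # number of simulated elections (assumption)

# Draw hypothetical vote shares under the normal model.
sim_share <- rnorm(n_sims, mean = svy_mean, sd = svy_SD)

# Simulated win probability: share of draws where the candidate tops 50%.
prob_win_sim <- mean(sim_share > 0.50)

# Closed-form answer under the same assumptions, for comparison.
prob_win_exact <- 1 - pnorm(0.50, mean = svy_mean, sd = svy_SD)

print(c(simulated = prob_win_sim, exact = prob_win_exact))
```

With this many draws the two numbers should agree closely. The appeal of simulation in practice is that the draw step can be swapped for something richer (for example, correlated state-level errors) without needing a closed-form answer.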

Why does this matter?

To be clear, we are not analyzing real-world election returns. However, a lot of past research shows that when people think an election is in the bag, they tend to vote in real-world elections at lower rates. Our study provides evidence that probabilistic forecasts give people more confidence that one candidate will win and suggestive evidence that we should expect them to vote at lower rates after seeing probabilistic forecasts.

This matters a lot more if one candidate’s potential voters are differentially affected, and there’s evidence that may be the case.

1. Figure 2C in Study 1 suggests that the candidate who is ahead in the polls will be more affected by the increased certainty that probabilistic forecasts convey.


2. When you look at the balance of coverage of probabilistic forecasts on major television broadcasts, there is more coverage on MSNBC, which has a more liberal audience.


3. Consider who shares this material in social media–specifically the average self-reported ideology of people who share links to various sites hosting poll-aggregators on Facebook, data that come from this paper’s replication materials. The websites that present their results in terms of probabilities have left-leaning (negative) social media audiences. Only realclearpolitics.com, which doesn’t emphasize win-probabilities, has a conservative audience:


4. In 2016, the proportion of American National Election Study (ANES) respondents who thought the leading candidate would “win by quite a bit” was unusually high for Democrats…


5. And we know that people who say the leading presidential candidate will “win by quite a bit” in pre-election polling are about three percentage points less likely to report voting shortly after the election than people who say it’s a close race — and that’s after conditioning on election year, prior turnout, and party identification. The data here are from the ANES and go back to 1952.


These data do not conclusively show that probabilistic forecasts affected turnout in the 2016 election, but they do raise questions about the real world consequences of probabilistic forecasts.

What about media narratives?

We acknowledge that these effects may change depending on the context in which people encounter them — though people can certainly encounter a lone probability number in media coverage of probabilistic forecasts. We also acknowledge that our work cannot address how these effects compare to and/or interact with media narratives.

However, other work that is relevant to this question has found that aggregating all polls reduces the likelihood that news outlets focus on unusual polls that are more sensational or support a particular narrative.

In some ways, the widespread success and reliance on these forecasts represents a triumph of scientific communication. In addition to greater precision compared with one-off horserace polls, probabilistic forecasts can quantify how likely a given U.S. presidential candidate is to win using polling data and complex simulation, rather than leaving the task of making sense of state and national polls to speculative commentary about “paths to victory,” as we point out in the paper. And as one of the hosts noted, we aren’t calling for an end to election projections.

Future work

We agree with the hosts that there are open questions about whether the public gives more weight to these probabilistic forecasts than other polling results and speculative commentary. We have also heard questions raised about how much probabilistic forecasts might drive media narratives. These questions may prove difficult to answer and we encourage research that explores them.

We hope this research continues to create a dialogue about how to best communicate polling data to the public. We would love to see more research into how the public consumes and is affected by election projections, including finding the most effective ways to convey uncertainty.

Know your data - Pricing diamonds using scatterplots and predictive models

Sol Messing — Sun, 02 Feb 2020 00:00:00 GMT


My last post railed against the bad visualizations that people often use to plot quantitative data by groups, and pitted pie charts, bar charts and dot plots against each other for two visualization tasks. Dot plots came out on top. I argued that this is because humans are good at the cognitive task of comparing position along a common scale, compared to making judgements about length, area, shading, direction, angle, volume, curvature, etc., a finding credited to Cleveland and McGill. I enjoyed writing it and people seemed to like it, so I’m continuing my visualization series with the scatterplot.

Scatterplots

A scatterplot is a two-dimensional plane on which we record the intersection of two measurements for a set of case items–usually two quantitative variables. Just as humans are good at comparing position along a common scale in one dimension, our visual capabilities allow us to make fast, accurate judgements and recognize patterns when presented with a series of dots in two dimensions. This makes the scatterplot a valuable tool for data analysts both when exploring data and when communicating results to others.

In this post—part 1—I’ll demonstrate various uses for scatterplots and outline some strategies to help make sure key patterns are not obscured by the scale or qualitative group-level differences in the data (e.g., the relationship between test scores and income differs for men and women). The motivation in this post is to come up with a model of diamond prices that you can use to help make sure you don’t get ripped off, specified based on insight from exploratory scatterplots combined with (somewhat) informed speculation. In part 2, I’ll discuss the use of panels aka facets aka small multiples to shed additional light on key patterns in the data, and local regression (loess) to examine central tendencies in the data. There are far fewer bad examples of this kind of visualization in the wild than the 3D barplots and pie charts mocked in my last post, though I was still able to find this lovely scatterplot + trend-line.

Scatterplots and the Cartesian coordinate system

The scatterplot has a richer history than the visualizations I wrote about in my last post. The scatterplot’s face forms a two-dimensional Cartesian coordinate system, and Descartes’ invention/discovery of this eponymous plane in around 1637 represents one of the most fundamental developments in science. The Cartesian plane unites measurement, algebra, and geometry, depicting the relationship between variables (or functions) visually. Prior to the Cartesian plane, mathematics was divided into algebra and geometry, and the unification of the two made many new developments possible. Of course, this includes modern map-making (cartography), but the Cartesian plane was also an important step in the development of calculus, without which very little of our modern world would be possible.

The scatterplot is a powerful tool to help understand the relationship between variables, especially if that relationship is non-linear. Say you want to get a sense of whether you’re paying the right price when shopping for a diamond. You can use data on the price and characteristics of many diamonds to help figure out whether the price advertised for any given diamond is reasonable, and you can use scatterplots to help figure out how to model that data in a sensible way. Consider the important relationship between the price of a diamond and its carat weight (which corresponds to its size):

[Figure: scatterplot of diamond price against carat weight]

A few things pop out right away. We can see a non-linear relationship, and we can also see that the dispersion (variance) of the relationship also increases as carat size increases. With just a quick look at a scatterplot of the data, we’ve learned two important things about the functional relationship between price and carat size. And, we also therefore learned that running a linear model on this data as-is would be a bad idea.
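The two patterns described above can also be checked numerically. The following is a sketch, assuming the ggplot2 package (which ships the diamonds data) is installed; the bin edges and the names `carat_bin` and `spread_by_carat` are arbitrary choices for illustration, and the whole block is skipped if ggplot2 is unavailable.

```r
# Sketch only: requires the ggplot2 package, which ships the diamonds data.
spread_by_carat <- NULL
if (requireNamespace("ggplot2", quietly = TRUE)) {
  diamonds <- ggplot2::diamonds

  # Scatterplot of price against carat; alpha reduces overplotting.
  p <- ggplot2::ggplot(diamonds, ggplot2::aes(x = carat, y = price)) +
    ggplot2::geom_point(alpha = 0.1)
  # print(p)  # uncomment to draw the plot

  # Spread of price within carat bins (bin edges are an arbitrary choice).
  carat_bin <- cut(diamonds$carat, breaks = c(0, 0.5, 1, 1.5, 2, Inf))
  spread_by_carat <- tapply(diamonds$price, carat_bin, sd)
  print(round(spread_by_carat))
}
```

If the fan shape in the scatterplot is real, the standard deviation of price should be much larger in the mid-sized carat bins than in the smallest one.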

Diamonds

If you’ve ever used R, you’ve probably seen references to the diamonds data set that ships with Hadley Wickham’s ggplot2. It records the carat size and the price of more than 50 thousand diamonds, collected from http://www.diamondse.info/ in 2008, and if you’re in the market for a diamond, exploring this data set can help you understand what’s in store and at what price point. This is particularly useful because each diamond is unique in a way that isn’t true of most manufactured products we are used to buying: you can’t just plug in a model number and look up the price on Amazon. And even an expert cannot incorporate as much information about price as a picture of the entire market informed by data (though there’s no substitute for qualitative expertise to make sure your diamond is what the retailer claims).

But even if you’re not looking to buy a diamond, the socioeconomic and political history of the diamond industry is fascinating. Diamonds birthed the mining industry in South Africa, which is now by far the largest and most advanced economy in Africa. I worked a summer in Johannesburg, and can assure you that South Africa’s cities look far more like L.A. and San Francisco than Lagos, Cairo, Mogadishu, Nairobi, or Rabat. Diamonds have stoked conflicts ranging from the Boer Wars to modern day wars in Sierra Leone, Liberia, Côte d’Ivoire, Zimbabwe and the DRC, where the 200 carat Millennium Star diamond was sold to DeBeers at the height of the civil war in the 1990s. Diamonds were one of the few assets that Jews could conceal from the Nazis during the “Aryanization of Jewish property” in the 1930s, and the Congressional Research Service reports that Al Qaeda has used conflict diamonds to skirt international sanctions and finance operations from the 1998 East Africa Bombings to the September 11th attacks.

Though the diamonds data set is full of prices and fairly esoteric certification ratings, hidden in the data are reflections of how a legendary marketing campaign permeated and was subsumed by our culture, hints about how different social strata responded, and insight into how the diamond market functions as a result.

The story starts in 1870 according to The Atlantic, when many tons of diamonds were discovered in South Africa near the Orange River. Until then, diamonds were rare: only a few pounds were mined from India and Brazil each year. At the time, diamonds had no use outside of jewelry (unlike today, when they have many industrial applications), so price depended only on scarce supply. Hence, the project’s investors formed the De Beers Cartel in 1888 to control the global price. It was by most accounts the most successful cartel in history, controlling 90% of the world’s diamond supply until about 2000. But World War I and the Great Depression saw diamond sales plummet.

In 1938, according to the New York Times’ account, the De Beers cartel wrote to Philadelphia ad agency N. W. Ayer & Son to investigate whether “the use of propaganda in various forms” might jump-start diamond sales in the U.S., which looked like the only potentially viable market at the time. Surveys showed diamonds were low on the list of priorities among most couples contemplating marriage: a luxury for the rich, “money down the drain.” Frances Gerety, whom the Times compares to Mad Men’s Peggy Olson, took on the De Beers account at N. W. Ayer & Son, and worked toward the company’s goal “to create a situation where almost every person pledging marriage feels compelled to acquire a diamond engagement ring.” A few years later, she coined the slogan, “A Diamond Is Forever.”

The Atlantic’s Jay Epstein argues that this campaign gave birth to modern demand-advertising—the objective was not direct sales, nor brand strengthening, but simply to impress the glamour, sentiment and emotional charge contained in the product itself. The company gave diamonds to movie stars, sent out press packages emphasizing the size of diamonds celebrities gave each other, loaned diamonds to socialites attending prominent events like the Academy Awards and Kentucky Derby, and persuaded the British royal family to wear diamonds over other gems. The diamond was also marketed as a status symbol, to reflect “a man’s … success in life,” in ads with “the aroma of tweed, old leather and polished wood which is characteristic of a good club.” A 1980s ad introduced the two-month benchmark: “Isn’t two months’ salary a small price to pay for something that lasts forever?”

By any reasonable measure, Frances Gerety succeeded—getting engaged means getting a diamond ring in America. Can you think of a movie where two people get engaged without a diamond ring? When you announce your engagement on Facebook, what icon does the site display? Still think this marketing campaign might not be the most successful mass-persuasion effort in history? I present to you a James Bond film, whose title bears the diamond cartel’s trademark:

Awe-inspiring and terrifying. Let’s open the data set.

The first thing you should consider doing is plotting key variables against each other using the ggpairs() function from the GGally package. This function plots every variable against every other, pairwise. For a data set with as many rows as the diamonds data, you may want to sample first; otherwise things will take a long time to render. Also, if your data set has more than about ten columns, there will be too many plotting windows, so subset on columns first.

# Uncomment these lines and install if necessary:
#install.packages('GGally')
#install.packages('ggplot2')
#install.packages('scales')
#install.packages('memisc')
library(ggplot2)
library(GGally)
library(scales)
data(diamonds)
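# Sample 10,000 rows so the plot matrix renders in a reasonable time
diasamp = diamonds[sample(nrow(diamonds), 10000),]
# Recent GGally releases no longer accept params; wrap() passes the same options
ggpairs(diasamp,
  lower = list(continuous = wrap('points', shape = I('.'))),
  upper = list(combo = wrap('box', outlier.shape = I('.'))))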

Anyway, here’s the plot:

[Figure: ggpairs plot matrix of the diamonds sample]

What’s happening is that ggpairs is plotting each variable against the other in a pretty smart way. In the lower triangle of the plot matrix, it uses grouped histograms for qualitative-qualitative pairs and scatterplots for quantitative-quantitative pairs. In the upper triangle, it plots grouped histograms for qualitative-qualitative pairs (using the x- instead of the y-variable as the grouping factor), boxplots for qualitative-quantitative pairs, and provides the correlation for quantitative-quantitative pairs. What we really care about here is price, so let’s focus on that. We can see what might be relationships between price and clarity, and between price and color, which we’ll keep in mind for later when we start modeling our data, but the critical factor driving price is the size/weight of a diamond. Yet as we saw above, the relationship between price and diamond size is non-linear. What might explain this pattern? On the supply side, larger contiguous chunks of diamonds without significant flaws are probably much harder to find than smaller ones. This may help explain the exponential-looking curve, and I thought I noticed this when I was shopping for a diamond for my soon-to-be wife. Of course, this is related to the fact that the weight of a diamond is a function of volume, and volume is a function of x * y * z, suggesting that we might be especially interested in the cube root of carat weight.
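As a quick check on the volume argument, the sketch below compares carat weight with the product of the x, y and z dimensions recorded in the ggplot2 diamonds data. The names `dims_ok` and `volume` are my own, and dropping rows with a zero dimension is an assumption about how to treat what look like missing measurements.

```r
library(ggplot2)
data(diamonds)

# Assumption: a zero in x, y or z is a missing measurement, so drop those rows
dims_ok <- subset(as.data.frame(diamonds), x > 0 & y > 0 & z > 0)

# Approximate volume (cubic mm) from the three recorded dimensions
dims_ok$volume <- with(dims_ok, x * y * z)

# If weight is proportional to volume, carat and volume should be strongly
# correlated, and carat^(1/3) should track a typical linear dimension
cor_volume <- cor(dims_ok$carat, dims_ok$volume)
cor_length <- cor(dims_ok$carat^(1/3), dims_ok$volume^(1/3))
c(carat_vs_volume = cor_volume, cuberoot_vs_length = cor_length)
```

Correlations close to 1 here would support treating the cube root of carat as a stand-in for a diamond's linear size.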

On the demand side, customers in the market for a less expensive, smaller diamond are probably more sensitive to price than more well-to-do buyers. Many less-than-one-carat customers would surely never buy a diamond were it not for the social norm of presenting one when proposing. And, there are fewer consumers who can afford a diamond larger than one carat. Hence, we shouldn’t expect the market for bigger diamonds to be as competitive as that for smaller ones, so it makes sense that the variance as well as the price would increase with carat size.

Often the distribution of any monetary variable will be highly skewed and vary over orders of magnitude. This can result from path-dependence (e.g., the rich get richer) and/or the multiplicative processes (e.g., year-on-year inflation) that produce the ultimate price/dollar amount. Hence, it’s a good idea to look into compressing any such variable by putting it on a log scale (for more, take a look at this guest post on Tal Galili’s blog).
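To see why multiplicative processes produce this kind of skew, here is a small simulation sketch. Everything in it is made up for illustration: the starting amount of 100, the 30 yearly growth factors and the uniform range they are drawn from are all assumptions, not estimates from any diamond data.

```r
set.seed(42)
n_items <- 5000   # hypothetical number of prices to simulate
n_years <- 30     # hypothetical number of yearly multiplicative shocks

# Each price starts at 100 and is multiplied by a random yearly growth factor
growth <- matrix(runif(n_items * n_years, min = 0.9, max = 1.2), nrow = n_items)
amount <- 100 * apply(growth, 1, prod)

# Sample skewness: near 0 for a symmetric distribution, positive for a long right tail
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skewness(amount)
skew_log <- skewness(log10(amount))
c(raw = skew_raw, log10 = skew_log)
```

The raw amounts should show a clear right skew, while the logged amounts should be close to symmetric, because the log of a product is a sum of many small independent terms.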

p = qplot(price, data=diamonds, binwidth=100) +
theme_bw() +
ggtitle('Price')
p
p = qplot(price, data=diamonds, binwidth = 0.01) +
scale_x_log10() +
theme_bw() +
ggtitle('Price (log10)')
p

Indeed, we can see that the prices for diamonds are heavily skewed, but when put on a log10 scale seem much better behaved (i.e., closer to the bell curve of a normal distribution). In fact, we can see that the data show some evidence of bimodality on the log10 scale, consistent with our two-class, “rich-buyer, poor-buyer” speculation about the nature of customers for diamonds. Let’s re-plot our data, but now let’s put price on a log10 scale:

p = qplot(carat, price, data=diamonds) +
scale_y_continuous(trans=log10_trans() ) +
theme_bw() +
ggtitle('Price (log10) by Carat')
p

[Figure: price (log10) by carat]

Better, though still a little funky. Let’s try using the cube root of carat, as we speculated about above:

cubroot_trans = function() trans_new('cubroot', transform= function(x) x^(1/3), inverse = function(x) x^3 )
p = qplot(carat, price, data=diamonds) +
scale_x_continuous(trans=cubroot_trans(), limits = c(0.2,3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans=log10_trans(), limits = c(350,15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
theme_bw() +
ggtitle('Price (log10) by Cubed-Root of Carat')
p

Nice, looks like an almost-linear relationship after applying the transformations above to get our variables on a nice scale.

## Overplotting

Note that until now I haven’t done anything about overplotting—where multiple points take on the same value, often due to rounding. Indeed, price is rounded to dollars and carats are rounded to two digits. Not bad, though when we’ve got this much data we’re going to have some serious overplotting.

head(sort(table(diamonds$carat), decreasing=TRUE ))
head(sort(table(diamonds$price), decreasing=TRUE ))

 0.3 0.31 1.01  0.7 0.32    1 
2604 2249 2242 1981 1840 1558 

605 802 625 828 776 698 
132 127 126 125 124 121

Often you can deal with this by making your points smaller, using “jittering” to randomly shift points to make multiple points visible, and using transparency, which can be done in ggplot using the “alpha” parameter.

p = ggplot( data=diamonds, aes(carat, price)) +
geom_point(alpha = 0.5, size = .75, position='jitter') +
scale_x_continuous(trans=cubroot_trans(), limits = c(0.2,3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans=log10_trans(), limits = c(350,15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
theme_bw() +
ggtitle('Price (log10) by Cubed-Root of Carat')
p

This gives us a better sense of how dense and sparse our data is at key places.

## Using Color to Understand Qualitative Factors

When I was looking around at diamonds, I also noticed that clarity seemed to factor into price. Of course, many consumers are looking for a diamond of a certain size, so we shouldn’t expect clarity to be as strong a factor as carat weight. And I must admit that even though my grandparents were jewelers, I initially had a hard time discerning a diamond rated VVS1 from one rated SI2. Surely most people need a loupe to tell the difference. And, according to BlueNile, the cut of a diamond has a much more consequential impact on that “fiery” quality that jewelers describe as the quintessential characteristic of a diamond. On clarity, the website states, “Many of these imperfections are microscopic, and do not affect a diamond’s beauty in any discernible way.” Yet, clarity seems to explain an awful lot of the remaining variance in price when we visualize it as a color on our plot:

p = ggplot( data=diamonds, aes(carat, price, colour=clarity)) +
geom_point(alpha = 0.5, size = .75, position='jitter') +
scale_colour_brewer(type = 'div',
guide = guide_legend(title = NULL, reverse=T,
override.aes = list(alpha = 1))) +
scale_x_continuous(trans=cubroot_trans(), limits = c(0.2,3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans=log10_trans(), limits = c(350,15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
theme_bw() + theme(legend.key = element_blank()) +
ggtitle('Price (log10) by Cubed-Root of Carat and Clarity')
p

Despite what BlueNile says, we don’t see as much variation on cut (though most diamonds in this data set are ideal cut anyway):

p = ggplot( data=diamonds, aes(carat, price, colour=cut)) +
geom_point(alpha = 0.5, size = .75, position='jitter') +
scale_colour_brewer(type = 'div',
guide = guide_legend(title = NULL, reverse=T,
override.aes = list(alpha = 1))) +
scale_x_continuous(trans=cubroot_trans(), limits = c(0.2,3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans=log10_trans(), limits = c(350,15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
theme_bw() + theme(legend.key = element_blank()) +
ggtitle('Price (log10) by Cube-Root of Carat and Cut')
p

Color seems to explain some of the variance in price as well, though BlueNile states that the differences among color grades D-J are basically not noticeable.

p = ggplot( data=diamonds, aes(carat, price, colour=color)) +
geom_point(alpha = 0.5, size = .75, position='jitter') +
scale_colour_brewer(type = 'div',
guide = guide_legend(title = NULL, reverse=T,
override.aes = list(alpha = 1))) +
scale_x_continuous(trans=cubroot_trans(), limits = c(0.2,3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans=log10_trans(), limits = c(350,15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
theme_bw() + theme(legend.key = element_blank()) +
ggtitle('Price (log10) by Cube-Root of Carat and Color')
p

[Figure: price (log10) by cube root of carat, colored by color grade]

At this point, we’ve got a pretty good idea of how we might model price. But there are a few problems with our 2008 data: not only do we need to account for inflation, but the diamond market is quite different now than it was in 2008. In fact, when I fit models to this data and then attempted to predict the price of diamonds I found on the market, I kept getting predictions that were far too low. After some additional digging, I found the Global Diamond Report. It turns out that prices plummeted in 2008 due to the global financial crisis, and since then prices (at least for wholesale polished diamonds) have grown at roughly a 6 percent compound annual rate. The rapidly growing number of couples in China buying diamond engagement rings might also help explain this increase. After looking at data on PriceScope, I realized that diamond prices grew unevenly across different carat sizes, meaning that the model I initially estimated couldn’t simply be adjusted by inflation. While I could have done OK with that model, I really wanted to estimate a new model based on fresh data.

Thankfully I was able to put together a Python script to scrape diamondse.info without too much trouble. This dataset is about 10 times the size of the 2008 diamonds data set and features diamonds from all over the world certified by an array of authorities besides just the Gemological Institute of America (GIA). You can read in this data as follows (be forewarned: it’s over 500K rows):

#install.packages('RCurl')
library('RCurl')
diamondsurl = getBinaryURL('https://raw.github.com/solomonm/diamonds-data/master/BigDiamonds.Rda')
load(rawConnection(diamondsurl))

My GitHub repository has the code necessary to replicate each of the figures above. Most look quite similar, though this data set contains much more expensive diamonds than the original. Regardless of whether you’re using the original diamonds data set or the current larger diamonds data set, you can estimate a model based on what we learned from our scatterplots. We’ll regress log-price on carat, the cube root of carat, clarity, cut and color. I’m using only GIA-certified diamonds in this model and looking only at diamonds under $10K because these are the type of diamonds sold at most retailers I’ve seen and hence the kind I care most about. By trimming the most expensive diamonds from the dataset, our model will also be less likely to be thrown off by outliers at the high end of price and carat. The new data set has mostly the same columns as the old one, so we can just run the following (if you want to run it on the old data set, set data=diamonds, create logprice there, and drop the cert filter, since the original has no cert column).

diamondsbig$logprice = log(diamondsbig$price)
m1 = lm(logprice~ I(carat^(1/3)),
data=diamondsbig[diamondsbig$price < 10000 & diamondsbig$cert == 'GIA',])
m2 = update(m1, ~ . + carat)
m3 = update(m2, ~ . + cut )
m4 = update(m3, ~ . + color + clarity)
#install.packages('memisc')
library(memisc)
mtable(m1, m2, m3, m4)

Here are the results for my recently scraped data set:

===============================================================
                    m1          m2          m3          m4     
---------------------------------------------------------------
(Intercept)       2.671***    1.333***    0.949***   -0.464*** 
                 (0.003)     (0.012)     (0.012)     (0.009)   
I(carat^(1/3))    5.839***    8.243***    8.633***    8.320*** 
                 (0.004)     (0.022)     (0.021)     (0.012)   
carat                        -1.061***   -1.223***   -0.763*** 
                             (0.009)     (0.009)     (0.005)   
cut: V.Good                               0.120***    0.071*** 
                                         (0.002)     (0.001)   
cut: Ideal                                0.211***    0.131*** 
                                         (0.002)     (0.001)   
color: K/L                                            0.117*** 
                                                     (0.003)   
color: J/L                                            0.318*** 
                                                     (0.002)   
color: I/L                                            0.469*** 
                                                     (0.002)   
color: H/L                                            0.602*** 
                                                     (0.002)   
color: G/L                                            0.665*** 
                                                     (0.002)   
color: F/L                                            0.723*** 
                                                     (0.002)   
color: E/L                                            0.756*** 
                                                     (0.002)   
color: D/L                                            0.827*** 
                                                     (0.002)   
clarity: I1                                           0.301*** 
                                                     (0.006)   
clarity: SI2                                          0.607*** 
                                                     (0.006)   
clarity: SI1                                          0.727*** 
                                                     (0.006)   
clarity: VS2                                          0.836*** 
                                                     (0.006)   
clarity: VS1                                          0.891*** 
                                                     (0.006)   
clarity: VVS2                                         0.935*** 
                                                     (0.006)   
clarity: VVS1                                         0.995*** 
                                                     (0.006)   
clarity: IF                                           1.052*** 
                                                     (0.006)   
---------------------------------------------------------------
R-squared             0.888       0.892      0.899        0.969
N                338946      338946     338946       338946    
===============================================================

Now those are some very nice R-squared values: we are accounting for almost all of the variance in log-price with the 4Cs. If we want to know whether the price for a diamond is reasonable, we can now use this model and exponentiate the result (since we took the log of price). We need to multiply the result by exp(sigma^2/2), because after exponentiating, the error term no longer averages out: if the errors are normally distributed with variance sigma^2, then E[exp(error)] = exp(sigma^2/2) rather than 1.

To dig further into that last step, have a look at the Wikipedia page on log-normal distributed variables. Thanks to Miguel for catching this. Let’s take a look at an example from Blue Nile. I’ll use the full model, m4.

# Example from BlueNile
# Round 1.00 Very Good I VS1 $5,601
thisDiamond = data.frame(carat = 1.00, cut = 'V.Good', color = 'I', clarity='VS1')
modEst = predict(m4, newdata = thisDiamond, interval='prediction', level = .95)
exp(modEst) * exp(summary(m4)$sigma^2/2)

The results yield an expected value for price given the characteristics of our diamond, along with the upper and lower bounds of a 95% prediction interval. Note that because this is a linear model, predict() is just multiplying each model coefficient by the corresponding value in our data and summing the results. It turns out that this diamond is a touch pricier than the expected value under the full model, though it is by no means outside our 95% interval. BlueNile has by most accounts a better reputation than diamondse.info, however, and reputation is worth a lot in a business that relies on easy-to-forge certificates and in which the non-expert can be easily fooled.
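To make the exp(sigma^2/2) adjustment concrete, here is a minimal simulation sketch. It does not use the diamonds data: the intercept, slope, error standard deviation and all `sim_` names are invented for illustration, and the errors are assumed to be normal on the log scale.

```r
set.seed(1)
sim_n     <- 100000
sim_x     <- runif(sim_n)
sim_sigma <- 0.5    # hypothetical error SD on the log scale

# Simulate an outcome whose log is linear in x with normal errors
sim_logy <- 1 + 2 * sim_x + rnorm(sim_n, sd = sim_sigma)
sim_y    <- exp(sim_logy)

sim_fit <- lm(sim_logy ~ sim_x)

# Naive back-transformation vs. the log-normal correction
pred_naive     <- exp(fitted(sim_fit))
pred_corrected <- pred_naive * exp(summary(sim_fit)$sigma^2 / 2)

c(actual    = mean(sim_y),
  naive     = mean(pred_naive),
  corrected = mean(pred_corrected))
```

The naive predictions should come out systematically too low on average, while the corrected ones should land close to the actual mean. If the errors are far from normal, this particular correction is only approximate.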

This illustrates an important point about generalizing a model from one data set to another. First, there may be important differences between data sets (as I’ve speculated about above), making the estimates systematically biased. Second, there is overfitting: our model may be fitting noise present in the data set. Even a model cross-validated against out-of-sample predictions can be overfit to noise that results in differences between data sets. Of course, while this model may give you a sense of whether your diamond is a rip-off against diamondse.info diamonds, it’s not clear that diamondse.info should be regarded as a source of universal truth about whether the price of a diamond is reasonable. Nonetheless, to have the expected price at diamondse.info with a 95% interval is a lot more information than we had about the price we should be willing to pay for a diamond before we started this exercise.

An important point: even though we can predict diamondse.info prices almost perfectly based on a function of the 4Cs, one thing that you should NOT conclude from this exercise is that where you buy your diamond is irrelevant, which apparently used to be conventional wisdom in some circles. You will almost surely pay more if you buy the same diamond at Tiffany’s versus Costco. But Costco sells some pricey diamonds as well. Regardless, you can use this kind of model to give you an indication of whether you’re overpaying.

Of course, the value of a natural diamond is largely socially constructed. Like money, diamonds are only valuable because society says they are. There are no obvious economic efficiencies to be gained or return on investment in a diamond, except perhaps in a very subjective sense concerning your relationship with your significant other. To get a sense for just how much value is socially constructed, you can compare the price of a natural diamond to that of a synthetic diamond, which thanks to recent technological developments is of comparable quality to a “natural” diamond. Of course, natural diamonds fetch a dramatically higher price.

One last thing: there are few guarantees in life, and I offer none here. Though what we have here seems pretty good, data and models are never infallible, and obviously you can still get taken (or be persuaded to pass on a great deal) based on this model. Always shop with a reputable dealer, and make sure her incentives are aligned against selling you an overpriced diamond or, worse, one that doesn’t match its certificate. There’s no substitute for establishing a personal connection and lasting business relationship with an established jeweler you can trust.

## One Final Consideration

Plotting your data can help you understand it and can yield key insights. But even scatterplot visualizations can be deceptive if you’re not careful. Consider another data set that comes with the alr3 package: soil temperature data from Mitchell, Nebraska, collected by Kenneth G. Hubbard from 1976-1992, which I came across in Weisberg, S. (2005). Applied Linear Regression, 3rd edition. New York: Wiley (from which I’ve shamelessly stolen this example). Let’s plot the data, naively:

#install.packages('alr3')
library(alr3)
data(Mitchell)
qplot(Month, Temp, data = Mitchell) + theme_bw()

Looks kinda like noise. What’s the story here? When all else fails, think about it. What’s on the X axis? Month. What’s on the Y-axis? Temperature. Hmm, well there are seasons in Nebraska, so temperature should fluctuate every 12 months.

But we’ve put more than 200 months in a pretty tight space.

Let’s stretch it out and see how it looks:
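The stretched plot itself is not reproduced here, so what follows is only a sketch of one way to do it, not necessarily the approach originally used. To keep it runnable without alr3, it builds a stand-in data frame called `seasonal` (a hypothetical name) with the same Month and Temp columns; the cycle amplitude and noise level are invented.

```r
library(ggplot2)

# Stand-in for the Mitchell data (an assumption so the sketch runs without alr3):
# 204 monthly readings with a 12-month cycle plus noise
set.seed(1976)
seasonal <- data.frame(Month = 0:203)
seasonal$Temp <- 10 - 12 * cos(2 * pi * seasonal$Month / 12) + rnorm(204, sd = 2)

# Same scatterplot as before, but drawn much wider than it is tall and with
# points connected in time order so the yearly cycle becomes visible
stretched <- ggplot(seasonal, aes(Month, Temp)) +
  geom_point() +
  geom_line() +
  theme_bw() +
  theme(aspect.ratio = 1/8)
stretched
```

With alr3 loaded, passing Mitchell in place of `seasonal` should give the corresponding plot for the real data. The key design choice is the aspect ratio: a short, wide panel gives each 12-month cycle enough horizontal room to be seen.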

Don’t make that mistake.

That concludes part I of this series on scatterplots. Part II will illustrate the advantages of using facets/panels/small multiples, and show how tools to fit trendlines including linear regression and local regression (loess) can help yield additional insight about your data.

You can also learn more about exploratory data analysis via this Udacity course taught by my colleagues Dean Eckles and Moira Burke, and Chris Saden, which will be coming out in the next few weeks.

How to break regression

Sol Messing — Wed, 13 Jun 2018 00:00:00 GMT

Regression models are a cornerstone of modern social science. They’re at the heart of efforts to estimate causal relationships between variables in a multivariate environment and are the basic building blocks of many machine learning models. Yet social scientists can run into a lot of situations where regression models break.

Famed social psychologist Richard Nisbett recently argued that regression analysis is so misused and misunderstood that analyses based on multiple regression “are often somewhere between meaningless and quite damaging.” (He was mainly talking about cases in which researchers publish correlational results that are covered in the media as causal statements about the world.)

Below, I’ll walk through some of the potential pitfalls you might encounter when you fire up your favorite statistical software package and run regressions. Specifically, I’ll be using simulation in R as an educational tool to help you better understand the ways in which regressions can break.

Using simulations to unpack regression

The idea of using R simulations to help understand regression models was inspired by Ben Ogorek’s post on regression confounders and collider bias.

The great thing about using simulation in this way is that you control the world that generates your data. The code I’ll introduce below represents the true data-generating process, since I’m using R’s random number generators to simulate the data. In real life, of course, we only have the data we observe, and we don’t really know how the data-generating process works unless we have a solid theory (like Newtonian physics or evolution) in which the system of relevant variables and causal relationships is well understood. Social science really has no analogous theory.

What I’ll do here is create a dataset based on two random standard normal variables by simulating them using the rnorm() function, which draws random values from a normal distribution with mean 0 and standard deviation 1, unless you specify otherwise. I’ll create a functional relationship between y and x such that a 1 unit increase in x will be associated with a .4 unit increase in y.

# make the code reproducible by setting a random number seed
set.seed(100)

# When everything works:
N <- 1000
x <- rnorm(N)
y <- .4 * x + rnorm(N)
hist(x)
hist(y)

# Now estimate our model:
summary(lm(y ~ x))

Call:
lm(formula = y ~ x)
Residuals:
    Min      1Q  Median      3Q     Max 
-3.0348 -0.7013  0.0085  0.6212  3.1688 
Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)     0.003921   0.031039   0.126    0.899    
x               0.413415   0.030129  13.722   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9814 on 998 degrees of freedom
Multiple R-squared:  0.1587,    Adjusted R-squared:  0.1579 
F-statistic: 188.3 on 1 and 998 DF,  p-value: < 2.2e-16

# Plot it
library(ggplot2)
qplot(x, y) +
  geom_smooth(method='lm') +
  theme_bw() +
  ggtitle("The Perfect Regression")

Notice that the model estimates the functional relationship between x and y that I simulated quite well. The plot looks like this:

What about omitted variables? Our machinery actually still works if there is another factor causing y, as long as it is uncorrelated with x.
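A minimal sketch of that claim, using the same .4 slope as above. The extra cause `v_sim` and its .3 coefficient are hypothetical, and all names carry a `_sim` suffix so nothing from the earlier code is overwritten.

```r
# Note: this resets the random seed; use a fresh session to reproduce the printed output in this post
set.seed(100)
n_sim <- 1000
x_sim <- rnorm(n_sim)
v_sim <- rnorm(n_sim)   # hypothetical omitted cause of y, independent of x
y_sim <- .4 * x_sim + .3 * v_sim + rnorm(n_sim)

# Leaving v out should not bias the slope on x, because v is uncorrelated
# with x; it only adds noise to y
slope_without_v <- unname(coef(lm(y_sim ~ x_sim))["x_sim"])
slope_with_v    <- unname(coef(lm(y_sim ~ x_sim + v_sim))["x_sim"])
c(without_v = slope_without_v, with_v = slope_with_v)
```

Both slopes should sit near .4. Including v mainly buys a smaller residual variance, not a different answer.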

The dreaded omitted variable bias

Omitted variable bias (OVB) is much feared, and judging by the top internet search results, not well understood. Some top sources say it occurs when “an important” variable is missing or when a variable that “is correlated” with both x and y is missing. I even found a university econometrics course that defined OVB this way.

But neither of those definitions is quite right. OVB occurs when a variable that causes y is missing from the model (and is correlated with x). Let’s call that variable w. Because w is in play when we consider the causal relationship between x and y, it’s often referred to as “endogenous” or a “confounding variable.”

The example below first demonstrates that w, our confounding variable, will bias our results if we fail to include it in our model. The next two examples are essentially a re-telling of the post I mentioned above on collider bias, but emphasizing slightly different points.

w <- rnorm(N)
x <- .5 * w + rnorm(N)
y <- .4 * x + .3 * w + rnorm(N)

m1 <- lm(y ~ x)
summary (m1) # Omitted variable bias

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.2190 -0.7025  0.0314  0.7120  3.1158 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.01126    0.03310    0.34    0.734    
x            0.50179    0.03049   16.46   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.046 on 998 degrees of freedom
Multiple R-squared:  0.2135,    Adjusted R-squared:  0.2127 
F-statistic: 270.9 on 1 and 998 DF,  p-value: < 2.2e-16

There it is: classic omitted variable bias. We only observed x, and the influence of the omitted variable w was attributed to x in our model. If you rerun the regression with w in the model, you no longer get biased estimates.

m2 <- lm(y ~ x + w)
summary (m2) # No omitted variable bias after conditioning on w

Call:
lm(formula = y ~ x + w)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.2748 -0.6632 -0.0001  0.6933  2.9664 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.02841    0.03141   0.905    0.366    
x            0.40627    0.03132  12.973   <2e-16 ***
w            0.32344    0.03439   9.405   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9927 on 997 degrees of freedom
Multiple R-squared:  0.3024,    Adjusted R-squared:  0.301 
F-statistic: 216.1 on 2 and 997 DF,  p-value: < 2.2e-16
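The size of the bias in the first model can be worked out from the simulation's own parameters using the standard omitted-variable formula: the slope on x converges to its true value plus the effect of w on y times Cov(x, w) / Var(x). The sketch below does that arithmetic and then checks it on a larger simulated sample; the `_big` names and the sample size of 100,000 are my own choices.

```r
# Parameters copied from the simulation above
beta_x <- .4   # true effect of x on y
beta_w <- .3   # effect of the omitted variable w on y
a      <- .5   # effect of w on x

# Population moments implied by x = a * w + noise, with w and noise both N(0, 1)
cov_xw <- a           # Cov(x, w) = a * Var(w) = 0.5
var_x  <- a^2 + 1     # Var(x) = 0.25 + 1 = 1.25

# Slope that lm(y ~ x) converges to when w is omitted
expected_slope <- beta_x + beta_w * cov_xw / var_x
expected_slope   # 0.4 + 0.3 * 0.5 / 1.25 = 0.52

# Check on a much larger sample (new names, so m1, m2 and w are untouched).
# Note: this resets the random seed.
set.seed(100)
n_big <- 100000
w_big <- rnorm(n_big)
x_big <- a * w_big + rnorm(n_big)
y_big <- beta_x * x_big + beta_w * w_big + rnorm(n_big)
slope_big <- unname(coef(lm(y_big ~ x_big))["x_big"])
slope_big
```

The estimate of about .50 from the first model is within sampling error of this .52 target, which is why it overshoots the true .4.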

Note that the regression errors, also known as residuals, from the first model (m1) are correlated with w:

cor(w,m1$residuals)
[1] 0.2597859

Now, recall above that I wrote that it’s wrong to say that OVB occurs when our omitted variable is correlated with both x and y. And yet w is correlated with both x and y in this first example.

So why can’t we just say that OVB occurs when our omitted variable is correlated with both x and y? As the next example will show, correlation isn’t enough — w needs to cause both x and y. We can easily imagine a case in which we don’t have causality but we still see this kind of correlation — when x and y both cause w.

Let’s make this a little more concrete. Suppose we care about the effect of news media consumption (x) on voter turnout (y). One factor that some researchers think may cause both news media consumption and turnout is political interest (w). If we only measure media consumption and voter turnout, political interest is likely to confound our estimates.

But another school of thought from social psychology — along the lines of self-perception theory and cognitive dissonance — suggests that the causality could be reversed: Voting behavior might be mostly determined by other factors, and casting a ballot might prompt us to be more interested in political developments in the future. Similarly, watching the news might prompt us to become more interested in politics. Let’s suppose that second school of thought is right. If so, our simulated data will look like this:

media_consumption_x <- rnorm(N)
voter_turnout_y <- .1 * media_consumption_x + rnorm(N)

# Political interest increases after consuming media and participating, and, 
# in this hypothetical world, does *not* increase media consumption or participation
political_interest_w <- 1.2 * media_consumption_x + .6 * voter_turnout_y + rnorm(N)

cormat <- cor(as.matrix(data.frame(media_consumption_x, voter_turnout_y, political_interest_w)))
round(cormat, 2)

                     media_consumption_x voter_turnout_y political_interest_w
media_consumption_x                 1.00            0.11                 0.70
voter_turnout_y                     0.11            1.00                 0.46
political_interest_w                0.70            0.46                 1.00

As you can see, all factors are again correlated with each other. But this time, if we only include x (media consumption) and y (turnout) in the equation, we get an estimate close to the true effect of .1:

summary(lm(voter_turnout_y ~ media_consumption_x))

Call:
lm(formula = voter_turnout_y ~ media_consumption_x)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8460 -0.6972 -0.0076  0.6702  3.3925 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)         -0.01202    0.03217  -0.374 0.708839    
media_consumption_x  0.11719    0.03321   3.529 0.000436 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.014 on 998 degrees of freedom
Multiple R-squared:  0.01233,   Adjusted R-squared:  0.01134 
F-statistic: 12.46 on 1 and 998 DF,  p-value: 0.0004359

What makes defining omitted variable bias based on correlation so dangerous is that if we now include w (political interest), we will get a different kind of bias — what’s called collider bias or endogenous selection bias.

summary(lm(voter_turnout_y ~ media_consumption_x + political_interest_w))

Call:
lm(formula = voter_turnout_y ~ media_consumption_x + political_interest_w)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.1569 -0.5981 -0.0129  0.5701  2.8356 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)           0.003155   0.027098   0.116    0.907    
media_consumption_x  -0.437084   0.039102 -11.178   <2e-16 ***
political_interest_w  0.444571   0.021928  20.274   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.854 on 997 degrees of freedom
Multiple R-squared:  0.3007,    Adjusted R-squared:  0.2993 
F-statistic: 214.3 on 2 and 997 DF,  p-value: < 2.2e-16
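
The size of the collider bias can be worked out by hand from the simulation's parameters. This is a minimal sketch, assuming unit-variance noise terms as in the rnorm() calls above; the names b_xy, b_xw, b_yw, coef_w and coef_x are illustrative and do not appear in the original code.

```r
# Parameters of the simulated data-generating process above
b_xy <- .1    # true effect of media consumption (x) on turnout (y)
b_xw <- 1.2   # effect of x on political interest (w)
b_yw <- .6    # effect of y on w

# After partialling out x, y leaves only its noise e_y, and w leaves
# b_yw * e_y + e_w, where e_y and e_w are independent with variance 1.
coef_w <- b_yw / (b_yw^2 + 1)

# The total slope of w on x is b_xw + b_yw * b_xy. Conditioning on w
# subtracts coef_w times that slope from the true effect of x.
coef_x <- b_xy - coef_w * (b_xw + b_yw * b_xy)

round(c(media_consumption_x = coef_x, political_interest_w = coef_w), 3)
```

Under these assumptions the expected coefficients are roughly -0.46 and 0.44, close to the estimates in the regression output above, even though the true effect of x on y is .1.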

Simpson’s paradox

Simpson’s paradox often occurs in social science (and medicine, too) when you pool data instead of conditioning it on group membership (i.e., adding it as a factor in your regression model).

Suppose that, all other things being equal, consuming media causes a slight shift in policy preferences toward the left. But, on average, Republicans consume more news than non-Republicans. And we know that generally Republicans have much more right-leaning preferences.

If we just measure media consumption and policy preferences without including Republicans in the model, we’ll actually estimate that the effect goes in the direction opposite of the true causal effect.

N <- 1000

# Let's say that 40% of people in this population are Republicans
republican <- rbinom(N, 1, .4)

# And they consume more media
media_consumption <- .75 * republican + rnorm(N)

# Consuming more media causes a slight leftward shift in policy
# preferences, and Republicans have more right-leaning preferences
policy_prefs <- -.2 * media_consumption + 2 * republican + rnorm(N)

# for easier plotting later
df <- data.frame(media_consumption, policy_prefs, republican)
df$republican = factor(c("non-republican", "republican")[df$republican + 1])

# If we don't condition on being Republican, we'll actually estimate
# that the effect goes in the *opposite* direction
summary(lm(policy_prefs ~ media_consumption))


Call:
lm(formula = policy_prefs ~ media_consumption)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.6108 -0.9559 -0.0198  0.9257  3.9537 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        0.68923    0.04323   15.94  < 2e-16 ***
media_consumption  0.15269    0.03966    3.85 0.000126 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.317 on 998 degrees of freedom
Multiple R-squared:  0.01463,   Adjusted R-squared:  0.01365 
F-statistic: 14.82 on 1 and 998 DF,  p-value: 0.0001257

# Naive plot
library(ggplot2)
ggplot(df, aes(media_consumption, policy_prefs)) +
  geom_point() +
  geom_smooth(method='lm') +
  theme_bw() +
  ggtitle("Naive estimate (Simpson's Paradox)")

The estimate goes in the opposite direction of the true effect! Here’s what the plot looks like:

To resolve this paradox, we need to add a factor to the model that indicates whether or not a respondent is a Republican. Adding that factor gives each group its own intercept, so the slope on media consumption is estimated from variation within Republicans and within non-Republicans rather than from the gap between the two groups. Note that this is not like estimating an interaction term, where two explanatory variables are multiplied together. It’s not that the slopes are different; we just need to keep the difference in group means from contaminating the estimate.

# Condition on being a Republican to get the right estimates
summary(lm(policy_prefs ~ media_consumption + republican))

Call:
lm(formula = policy_prefs ~ media_consumption + republican)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5518 -0.6678 -0.0186  0.6562  3.3009 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        0.05335    0.03904   1.366    0.172    
media_consumption -0.13615    0.03111  -4.376 1.34e-05 ***
republican         1.93049    0.06758  28.565  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9774 on 997 degrees of freedom
Multiple R-squared:  0.4581,    Adjusted R-squared:  0.457 
F-statistic: 421.4 on 2 and 997 DF,  p-value: < 2.2e-16
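
The sign flip in the naive model can also be checked by hand using the standard omitted-variable formula. This is a minimal sketch, assuming the parameters used in the simulation above and unit-variance noise; the variable names are illustrative.

```r
p_rep     <- .4    # share of Republicans
gap_media <- .75   # extra media consumption among Republicans
b_media   <- -.2   # true effect of media consumption on policy preferences
b_rep     <- 2     # Republican shift in policy preferences

var_rep       <- p_rep * (1 - p_rep)          # variance of a 0/1 indicator
var_media     <- gap_media^2 * var_rep + 1    # group gap plus unit-variance noise
cov_media_rep <- gap_media * var_rep

# Naive slope = true effect + (effect of omitted variable) *
#               (slope of omitted variable on media consumption)
naive_slope <- b_media + b_rep * cov_media_rep / var_media
round(naive_slope, 3)
```

Under these assumptions the expected naive slope is about 0.12: positive, even though the true effect is -.2.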

# Conditioning on being Republican
ggplot(df, aes(media_consumption, policy_prefs, colour = republican)) +
  geom_point() +
  scale_color_manual(values = c("blue","red")) +
  geom_smooth(method='lm') +
  theme_bw() +
  ggtitle("Conditioning on being a Republican (Simpson's Paradox)")

Here’s what the plot looks like:

Correlated errors

Another cardinal sin — and one that we should worry a lot about because it often arises from social desirability bias in survey responses — is the phenomenon of correlated errors. This example is inspired by Vavreck (2007).

Here, self-reported turnout and media consumption are caused by a combination of social desirability bias and true turnout and true consumption, respectively:

N <- 1000

# The "Truth"
true_media_consumption <- rnorm(N)
true_vote <- .1 * true_media_consumption + rnorm(N)

# Social desirability bias
social_desirability <- rnorm(N)
# What we actually observe from self reports:
self_report_media_consumption <- true_media_consumption + social_desirability
self_report_vote <- true_vote + social_desirability

Let’s compare the estimated effect sizes of the self-reported data and the “true” data:

# Self reports
summary(lm(self_report_vote ~ self_report_media_consumption))

Call:
lm(formula = self_report_vote ~ self_report_media_consumption)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.9604 -0.7766  0.0142  0.8465  4.1811 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    0.02020    0.03951   0.511    0.609    
self_report_media_consumption  0.54605    0.02716  20.102   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.248 on 998 degrees of freedom
Multiple R-squared:  0.2882,    Adjusted R-squared:  0.2875 
F-statistic: 404.1 on 1 and 998 DF,  p-value: < 2.2e-16

# "Truth"
summary(lm(true_vote ~ true_media_consumption))

Call:
lm(formula = true_vote ~ true_media_consumption)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5814 -0.6677 -0.0077  0.6829  3.4799 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)             0.01372    0.03217   0.426    0.670
true_media_consumption  0.01313    0.03245   0.404    0.686

Residual standard error: 1.017 on 998 degrees of freedom
Multiple R-squared:  0.0001639, Adjusted R-squared:  -0.000838 
F-statistic: 0.1636 on 1 and 998 DF,  p-value: 0.686

The self-reported data is biased toward over-estimating the effect size, a very dangerous problem. How could we fix this? Well, one way is to actually measure social desirability and include it in the model:

summary(lm(self_report_vote ~ self_report_media_consumption + social_desirability))

Call:
lm(formula = self_report_vote ~ self_report_media_consumption + 
    social_desirability)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.6042 -0.6774 -0.0127  0.6899  3.4470 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    0.01208    0.03220   0.375    0.708    
self_report_media_consumption  0.01220    0.03246   0.376    0.707    
social_desirability            1.02245    0.04547  22.487   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.017 on 997 degrees of freedom
Multiple R-squared:  0.5277,    Adjusted R-squared:  0.5268 
F-statistic:   557 on 2 and 997 DF,  p-value: < 2.2e-16

Note that while most people think about social desirability as being a problem related to measurement error, it is essentially the same problem as omitted variable bias, as described above.
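
The inflation in the self-report regression can be worked out directly. This is a minimal sketch, assuming a true effect of .1 and unit variances for true media consumption and for the shared social desirability term, as in the simulation setup above; the names are illustrative.

```r
beta     <- .1   # true effect of media consumption on voting
var_true <- 1    # variance of true media consumption
var_sd   <- 1    # variance of the shared social desirability term

# Both self-reports contain the same social desirability term, so it
# enters the covariance (numerator) as well as the variance (denominator).
naive_slope <- (beta * var_true + var_sd) / (var_true + var_sd)
naive_slope
```

With these values the expected slope from self-reports is 0.55, which is in line with the estimate of roughly 0.55 from the self-report regression above and far from the true effect.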

It’s important to remember that omitted variable bias and correlated errors are just two potential problems with regression analysis. Regression models are also not immune to issues associated with low levels of statistical power, the failure to account for the influence of extreme values, and heteroskedasticity, among others. But by simulating the data-generating process, researchers can get a good sense of some of the more common ways in which statistical models might depart from reality.

Replication of ‘Bias in the Flesh’

Sol Messing — Mon, 16 Oct 2017 00:00:00 GMT

This post presents a replication of Messing et al. (2016, study 2), which showed that exposure to darker images of Barack Obama increased stereotype activation, as indicated by the tendency to finish incomplete word prompts---such as “W E L _ _ _ _”---in stereotype-consistent ways (“WELFARE”).

Overall, the replication shows that darker images of even counter-stereotypical exemplars like Barack Obama can increase stereotype activation, but that the strength of the effect is weaker than conveyed in the original study. A reanalysis of the original study conducted in the course of this replication effort unearthed a number of problems that, when corrected, yield estimates of the effect that are consistent with those documented in the replication. This reanalysis also follows.

I'm posting this to

disseminate a corrected version of the original study;
show how I found those problems with the original study in the course of conducting this replication;
circulate these generally confirmatory findings, along with a pooled analysis revealing a stronger effect among conservatives; and
provide a demonstration of how replication almost always enhances our knowledge about the original research, which I hope may encourage others to invest the time and money in such efforts.

First some context.

The original study that formed the basis of the manuscript shows that more negative campaign ads in 2008 were also more likely to contain darker images of President Obama. In 2009 when I started this work, I was most proud of the method to collect data on skin complexion outlined in study 1. I included another study, what's now study 3, which shows that 2012 ANES survey-takers were more likely to respond negatively to Chinese characters after being presented with darker images of Obama (this is called the Affect Misattribution Procedure (AMP)). But the AMP was not a true experiment and a reviewer was concerned that Study 3 did not provide sufficiently rigorous, causal evidence that darker images alone can cause negative affect. So I conducted an experiment that would establish a causal link between darker images of Obama and something I thought was even more important---stereotype activation. There were strong reasons to expect this effect based on past lab studies showing links between darker skin and negative stereotypes about Blacks, and past observational studies showing far more negative socioeconomic outcomes across the board among darker versus lighter skinned Black Americans. We found an effect and published the three studies.

This replication effort was prompted by a post-publication reanalysis and critique, which raised questions about potential weaknesses in the original analysis. My aim in replicating the study was to bring new data to the discussion and make sure we hadn’t polluted the literature with a false discovery.

The main objection was the way we formed our stereotype consistency index. The items assessing stereotype consistency comprised 11 words with missing blank spaces (e.g., L A _ _). Each fragment had as one possible solution a stereotype-related completion. The complete list follows: L A _ _ (LAZY): C R _ _ _ (CRIME); _ _ O R (POOR); R _ _ (RAP); WEL _ _ _ _ (WELFARE); _ _ C E (RACE); D _ _ _ Y (DIRTY); B R _ _ _ _ _ (BROTHER); _ _ A C K (BLACK); M I _ _ _ _ _ _ (MINORITY); D R _ _ (DRUG).

The author pointed out that there were many potential ways to analyze the original data---he claimed over 16 thousand. Yet very few of these are consistent with generally accepted research practices. We've known, arguably since the 16th century, that combining several measures reduces measurement error and hence variance in estimation. This is particularly important in social science, and especially for this particular study---it would be unwise to attempt to use a single word completion or an arbitrary subset thereof to measure a complex, noisy construct like stereotype activation as measured via a word completion game. Rather, taking the average or constructing an index based on clustering several measures should be expected to result in far less measurement error, which is what we did.

Still, I am sympathetic to concerns about the garden of forking paths, which is part of the motivation for this replication.

In the original study, I formed this index based on what I judged to be the most unambiguously negative word-completions (lazy, dirty, poor), consistent with past work suggesting that darker complexion activates the most negative stereotypes about Blacks. I calculated that these were the three variables that also maximized intraclass correlation (ICC). As a robustness check, I also computed a measure that maximized alpha reliability (AR). This measure contained more items, and also seemed to include stereotype-consistent word completions that were on balance negative---lazy, dirty, poor, crime, black, and welfare. I should have reported results based on a simple average of these items, which were not conclusive, but I did not.

The critical reanalysis cited above shows a handful of statistically significant patterns that are inconsistent with the expectations in the original study, which is suggestive evidence that it's quite possible to find signal in noise if you're analyzing arbitrary sets of variables with the originally collected data. However, as shown below in the much larger replication sample, none of these patterns replicate.

The critique also noted that we did not include an analysis of several trailing questions we included on the original survey. The concern is the file drawer problem - the incentives against and frequent failure to report null results - which obscures knowledge and is bad for the scientific enterprise.

I included those measures based on past work using the same images as stimuli, which found that darker images prompted more negative evaluations of Obama among people with more negative associations with Blacks, as measured using the Implicit Association Test (IAT). But testing a specification that conditioned on our main outcome of interest---stereotype-consistent word completions---would mean conditioning on a post-treatment variable, which is particularly worrisome since we saw an effect on stereotype activation in the study.

Below, I pool the data and report another specification that does not require us to condition on post-treatment variables. It takes advantage of the fact that conservatives had significantly higher levels of stereotype activation (which was documented in the original study), and shows that the effect is in fact stronger among this subgroup, providing preliminary evidence in favor of this hypothesis.

The remainder of this post will present my own reanalysis of the original data, the replication, and finally some additional analysis of the data now possible with the larger, pooled data set.

Re-analysis of original data

In the process of collecting data for the replication studies, I used the same interface, simply appending the new data as additional respondents completed the survey experiment. When I geo-coded the IP address data in the full data set, I found a discrepancy between the cases I originally geo-coded as U.S. cases, and the cases that now resolved to U.S. locations in the complete data set. Many of these respondents appeared in sequence, suggesting they may have been skipped, perhaps due to issues related to connectivity to the geo-location server I used.

This prompted me to conduct a full re-analysis of the data, which yields smaller estimates of stereotype activation. First, re-estimating the index yielded different items---'black' in place of 'dirty' for the ICC measure and 'race' in place of 'welfare', 'crime', and 'dirty' in the AR measure. This is due in part to the way I computed the original indices and in part due to correcting the geo-coding issue. In the original study, I computed the index of variables that maximized alpha and ICC by hand because the epiCalc::alphaBest function (now epiDisplay::alphaBest) does not return results (nor an error message) for these data. For reanalysis, I wrote a function that computed variables to include in the index via successive removal of items. The overall alpha is actually slightly lower in the new AR measure, while the new ICC measure has a slightly higher correlation coefficient.
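
To make "successive removal of items" concrete, here is one way such a procedure could look for the alpha measure. This is a sketch only: it is not the function used in the reanalysis, cronbach_alpha and best_alpha_subset are hypothetical names, the greedy stopping rule is an assumption, and the toy data are made up for illustration.

```r
# Cronbach's alpha for a matrix of item scores
# (rows = respondents, columns = items).
cronbach_alpha <- function(items) {
  items <- as.matrix(items)
  k <- ncol(items)
  item_vars <- apply(items, 2, var)
  total_var <- var(rowSums(items))
  (k / (k - 1)) * (1 - sum(item_vars) / total_var)
}

# Hypothetical helper: repeatedly drop the item whose removal raises alpha
# the most, stopping when no removal helps or only min_items remain.
best_alpha_subset <- function(items, min_items = 2) {
  items <- as.matrix(items)
  current <- cronbach_alpha(items)
  while (ncol(items) > min_items) {
    alphas_if_dropped <- sapply(seq_len(ncol(items)), function(j) {
      cronbach_alpha(items[, -j, drop = FALSE])
    })
    best <- which.max(alphas_if_dropped)
    if (alphas_if_dropped[best] <= current) break
    current <- alphas_if_dropped[best]
    items <- items[, -best, drop = FALSE]
  }
  list(items = colnames(items), alpha = current)
}

# Toy 0/1 completion data: a, b and c move together, d does not
toy <- cbind(
  a = c(0, 0, 1, 1, 0, 1, 0, 1),
  b = c(0, 0, 1, 1, 0, 1, 1, 1),
  c = c(0, 1, 1, 1, 0, 1, 0, 1),
  d = c(1, 0, 0, 1, 1, 0, 1, 0)
)
result <- best_alpha_subset(toy)
result
```

In the toy data the unrelated item d is the first to be removed. A greedy search like this is not guaranteed to find the subset with the highest possible alpha, which is one reason different procedures can select different items.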

For the sake of transparency, I first report results based on the original items included in the index as reported in Messing et al. 2015 using the updated data, then report the new ICC and AR measures.

Using the original indices with the errantly removed cases included, the ICC measure yields a revised estimate of a 20% increase in stereotype-consistent word completions rather than the originally reported 36% (M_Light = 0.33, M_Dark = 0.41, T(859.0) = 2.08, P = 0.038, two-sided). For the AR measure, instead of a 13% increase (M_Light = 0.97, M_Dark = 1.11, T(626.72) = 1.77, P = 0.078, two-sided), the revised estimate is an 8% increase (M_Light = 0.98, M_Dark = 1.06, T(850.9) = 1.12, P = 0.265, two-sided).

Re-estimating the indices when including all U.S. cases translates to less conclusive findings---a revised estimate of an 8% increase in stereotype activation in the original study (M_Light = 0.79, M_Dark = 0.86, T(850.7) = 1.27, P = 0.203, two-sided) using the ICC measure, and an 8% increase (M_Light = 0.87, M_Dark = 0.91, T(839.1) = 0.77, P = 0.439, two-sided) using the AR measure.

A slightly smaller effect was also observed when examining differences between conservatives and other participants. Correcting the geo-coding error and updating the indices reduced the estimate of stereotype activation for conservatives. Instead of a 53% increase, the original ICC measure yields a 29% increase (M_Other = 0.35, M_Conservative = 0.49, T(205.9) = 2.49, P = 0.013, two-sided). The new ICC measure yields an 18% increase (M_Other = 0.80, M_Conservative = 0.98, T(207.9) = 2.41, P = 0.017, two-sided). For the AR measure, instead of a 29% increase, this meant an 18% increase using either measure (original: M_Other = 0.99, M_Conservative = 1.18, T(210.4) = 2.11, P = 0.036, two-sided) (new: M_Other = 0.86, M_Conservative = 1.05, T(214.3) = 2.47, P = 0.014, two-sided).
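
The fractional degrees of freedom in the statistics above, such as T(859.0) or T(205.9), are what an unequal-variance (Welch) t-test produces, and that is the default in R's t.test(). Assuming that is the test behind these comparisons, a minimal sketch looks like this. The data are simulated placeholders, not the study data, and the vectors light and dark are hypothetical.

```r
set.seed(1)
# Hypothetical per-respondent counts of stereotype-consistent completions
light <- rbinom(400, size = 3, prob = .11)
dark  <- rbinom(430, size = 3, prob = .13)

# var.equal = FALSE (the default) gives Welch's test with fractional df
res <- t.test(dark, light, var.equal = FALSE)

sprintf("M_Light = %.2f, M_Dark = %.2f, T(%.1f) = %.2f, P = %.3f, two-sided",
        mean(light), mean(dark), res$parameter, res$statistic, res$p.value)
```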

The replication

I conducted one exact replication and one very close replication with slightly different images, which I pooled for a total of 3,151 respondents, substantially more than the 630 included in the original writeup. This gives me more statistical power and more precise estimates of the effect in question. (I provide results for each design separately - one of which appears underpowered - at the end of this post).

To be clear, I did not pre-register this replication. However, I've tried to err on the side of exhaustive reporting when the original study did not provide exacting specificity in analyzing the new data. Due to the nature of this replication---the presentation of the same analysis conducted in the original study---the p-values provide highly informative, if not conclusive evidence regarding the nature of the effect.

The average reported age was 36; 52% of participants identified as female; 84% identified as White, 8% as Black, 5% as Hispanic, and 3% as Other. 52% identified as liberal, 27% as moderate, and 22% as conservative.

Recomputing the ICC index yielded the following items: black, poor, drug. Recomputing the AR index yielded: lazy, black, poor, welfare, crime, drug, which is close to the original study.

In the replication data, the ICC yields a 5% increase in stereotype activation (M_Light = 0.90, M_Dark = 0.95, T(3142.8) = 1.56, P = 0.119, two-sided). Similarly, the alpha measure yields a 5% increase (M_Light = 1.04, M_Dark = 1.09, T(3145.8) = 1.70, P = 0.089, two-sided).

The original study isn't completely clear on the question of whether a replication should report on the recomputed ICC and AR measures, or the exact same items as in the original study, so it's worth reporting those as well. The original ICC measure yields a 3% increase in stereotype activation (M_Light = 0.36, M_Dark = 0.37, T(3148.9) = 0.51, P = 0.611, two-sided). Using the original AR measure yields a 6% increase (M_Light = 1.01, M_Dark = 1.07, T(3147.9) = 1.91, P = 0.057, two-sided).

Finally, an index that simply uses all stereotype-consistent items in the replication reveals a 5% increase in stereotype activation (M_Light = 1.30, M_Dark = 1.37, T(3147.7) = 2.05, P = 0.040, two-sided).

A pooled analysis, after normalizing the ICC and AR measures, yields similar results:

=================================================
                      ICC     Alpha      ALL
-------------------------------------------------
  (Intercept)       -0.041   -0.028    1.292***
                    (0.034)  (0.034)  (0.036)
  cond: Dark/Light   0.067*   0.058    0.067*
                    (0.032)  (0.032)  (0.034)
  study              0.006   -0.001    0.008
                    (0.023)  (0.023)  (0.025)
-------------------------------------------------
  R-squared             0.0      0.0       0.0
  N                  4012     4012      4012
=================================================

I also replicated this study with a different, lesser-known Black politician (Jesse White). However, a manipulation check revealed that only 36% of respondents said the candidate was Black in the “light” condition, compared to 83% in the darker condition, suggesting that any analysis would be severely confounded by perceived race of the target politician. (I did not ask this question in the Barack Obama studies).

Additional analysis

The superior power afforded by pooling all three studies may allow the exploration of treatment heterogeneity. Past work suggests the possibility that darker images might cause people inclined toward more stereotype-consistent responses to evaluate politicians more negatively. However, this analysis would condition on post-treatment variables, which in this case is particularly concerning since the treatment affects stereotype activation according to the original study and replication above. As an alternative, I consider a specification that uses conservative identification instead, which is a strong predictor of stereotype activation (as shown in the original study), but shouldn’t be affected by the treatment. It reveals evidence for the predicted interactions, suggesting that when conservatives are exposed to darker rather than lighter images of Obama, they have slightly “colder” feelings toward the former president (P = 0.039), perceive him to be less competent (P = 0.061), and less trustworthy (P = 0.083).

=================================================================
                             obama_therm  competence    trust
-----------------------------------------------------------------
  (Intercept)                 69.446***    4.213***    3.867***
                              (0.705)     (0.029)     (0.031)
  cond: Dark/Light             0.451       0.024       0.045
                              (0.989)     (0.041)     (0.044)
  iscons                     -40.497***   -1.574***   -1.625***
                              (1.565)     (0.064)     (0.069)
  cond: Dark/Light x iscons   -4.462*     -0.167      -0.165
                              (2.160)     (0.089)     (0.095)
-----------------------------------------------------------------
  R-squared                        0.3         0.3         0.2
  N                             3932        3928        3926
=================================================================

A plot of the model predictions for the thermometer ratings suggests that the effect is concentrated among conservatives.
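
The model predictions for the thermometer ratings can be reconstructed from the coefficients in the table above. This is a minimal sketch, assuming the condition is coded 1 for the darker image and iscons is coded 1 for conservatives; the coefficient labels used here are illustrative.

```r
# Thermometer-model coefficients as reported in the table above
coefs <- c(intercept = 69.446, dark = 0.451,
           iscons = -40.497, dark_x_iscons = -4.462)

# All four combinations of condition and ideology
cells <- expand.grid(dark = c(0, 1), iscons = c(0, 1))
cells$predicted_therm <- with(cells,
  coefs[["intercept"]] +
  coefs[["dark"]] * dark +
  coefs[["iscons"]] * iscons +
  coefs[["dark_x_iscons"]] * dark * iscons
)
cells

# Dark-minus-light difference within each group
effect_others        <- coefs[["dark"]]
effect_conservatives <- coefs[["dark"]] + coefs[["dark_x_iscons"]]
c(others = effect_others, conservatives = effect_conservatives)
```

Under that coding, the predicted difference between the darker and lighter image is about +0.5 points among non-conservatives and about -4.0 points among conservatives.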

Conclusion

The more items one uses to form an index, the less noise we should expect, and the more likely any replication attempt should be expected to succeed. It should also mean greater statistical precision. This could explain the remaining discrepancy between this study and the original after adjusting for the geo-coding error pointed out above. It's also possible that something about the timing or the subjects recruited in the replication studies explains the observed differences.

Nonetheless, this replication provides evidence that darker images of Black political figures, or at least of President Barack Obama, do in fact activate stereotypes. This much larger sample suggests that the true effect is smaller than what I found in the original study, which as noted above, contained some errors.

Replication materials available on dataverse.

Appendix

Below I present alternate specifications estimated without pooling. These specifications suggest first that replication 2 (as well as the original study) was not well-powered. They also suggest that the outcome measures with more items yield more reliable estimates.

Outcome measure summing all items:

Replication 1: M_Light = 1.29, M_Dark = 1.36, T(2115.4) = -1.59, P = 0.113, two-sided

Replication 2: M_Light = 1.30, M_Dark = 1.39, T(982.7) = -1.35, P = 0.177, two-sided

Original Alpha outcome measure:

Replication 1: M_Light = 0.99, M_Dark = 1.06, T(2114.9) = -1.79, P = 0.073, two-sided

Replication 2: M_Light = 1.04, M_Dark = 1.09, T(979.8) = -0.85, P = 0.393, two-sided

Newly estimated Alpha outcome measure:

Replication 1: M_Light = 1.02, M_Dark = 1.09, T(2109.4) = -1.77, P = 0.077, two-sided

Replication 2: M_Light = 1.07, M_Dark = 1.10, T(984.6) = -0.53, P = 0.597, two-sided

Newly estimated ICC outcome measure:

Replication 1: M_Light = 0.89, M_Dark = 0.94, T(2100.6) = -1.49, P = 0.137, two-sided

Replication 2: M_Light = 0.93, M_Dark = 0.96, T(985.9) = -0.67, P = 0.506, two-sided

Original ICC outcome measure:

Replication 1: M_Light = 0.35, M_Dark = 0.36, T(2122.5) = -0.22, P = 0.826, two-sided

Replication 2: M_Light = 0.38, M_Dark = 0.40, T(978.7) = -0.72, P = 0.472, two-sided

A replication of the prior critique and reanalysis follows. The patterns that run contrary to our original findings are not significant in the replication data.

Variable	effect size	p-value
feeling therm	-0.028	0.439
race minority welfare crime rap	0.022	0.538
race minority welfare rap	0.017	0.629
race minority rap	0.011	0.752
race	0.006	0.861
minority	-0.013	0.713
rap	0.024	0.502
welfare	0.014	0.695
comp	-0.013	0.72
crime	0.016	0.659
trust	0.002	0.946
brother	0.042	0.236
drug	0.008	0.822
lazy	0.026	0.474
black	0.084	0.019
dirty	0.028	0.438
poor	-0.002	0.957
allwcs	0.073	0.04
original	0.018	0.611
alpha	0.068	0.057

Ideologically diverse news, an agenda for future research

Eytan Bakshy — Fri, 24 Apr 2015 00:00:00 GMT

Earlier this month, we published an early access version of our paper in ScienceExpress (Bakshy et al. 2015), “Exposure to ideologically diverse news and opinion on Facebook.” The paper constitutes the first attempt to quantify the extent to which ideologically cross-cutting hard news and opinion is shared by friends, appears in algorithmically ranked News Feeds, and is actually consumed (i.e., clicked through to read).

We are grateful for the widespread interest in this paper, which grew out of two threads of related research that we began nearly five years ago: Eytan and Lada's work on the role of social networks in information diffusion (Bakshy et al. 2012) and Sean and Solomon's work on selective exposure in social media (Messing and Westwood 2012).

While Science papers are explicitly prohibited from suggesting future directions for research, we would like to shed additional light on our study and raise a few questions that we would be excited to see addressed in future work.

Tradeoffs when Selecting a Population

There were tradeoffs when deciding on who to include in this study. While we could have examined all U.S. adults on Facebook, we focused on people who identify as liberals or conservatives and encounter hard news, opinion, and other political content in social media regularly. We did so because many important questions around “echo chambers” and “filter bubbles” on Facebook relate to this subpopulation, and we used self-reported ideological preferences to define it.

Using self-reported ideological preferences in online profiles is not the only way to measure ideology or define the population of interest. Yet, people who publicly identify as liberals or conservatives in their Facebook profiles are an interesting and important subpopulation worthy of study for many reasons. As Hopkins and King 2010 have pointed out, studying the expression and behavior of those who are politically engaged online is of interest to political scientists studying activists (Verba, Schlozman, and Brady 1995), the media (Drezner and Farrell 2004), public opinion (Gamson 1992), social networks (Adamic and Glance 2005; Huckfeldt and Sprague 1995), and elite influence (Grindle 2005; Hindman, Tsioutsiouliklis, and Johnson 2003; Zaller 1992).

This subpopulation has limitations and is not the only population of interest. The data are not appropriate for those who seek estimates of the entire U.S. public, people without strong opinions, or people not on Facebook (at least not without additional extrapolation, re-weighting, additional evidence, etc.). While our data could plausibly also provide good estimates of the population of people who are ideologically active and have clear preferences, we are not claiming that's necessarily the case---that remains to be determined in future work.

We'd like to help other researchers looking to study other populations understand more about the population we've defined. An important question in this regard is what proportion of active U.S. adults actually report an identifiable left/right/center ideology in their profile. That number is 25%, or 10.1 million people.

It's also informative to examine the proportion of those users who provide identifiable profile affiliations conditional on demographics and Facebook usage:

Age	Percent reporting ideological affiliation
18-24	21.60%
25-44	28.50%
45-64	24.30%
65+	21.40%

Gender	Percent reporting ideological affiliation
Female	21.90%
Male	30.60%

Login Days	Percent reporting ideological affiliation
105-140	18.90%
140-185	26.70%

Clearly those who report an ideology in their profile tend to be more active on Facebook. They are also more likely to be men, which is consistent with the well-documented gender gap in American politics (Box-Steffensmeier 2004).

It's possible that these individuals differ from other Facebook users in other ways. It seems plausible to expect these people to have higher levels of political interest, a stronger sense of political ideology and political identity, and to be more likely to be active in politics than most others on Facebook. It's also possible that these individuals are more extroverted than the average user, especially in the somewhat taboo domain of politics. These possibilities also strike us as interesting questions for study in future work.

How to Measure Ideology

We hope others will replicate this work using other populations and ways of measuring ideology, which will provide a broader view of exposure to political media. Data on ideology could be collected by, for example, surveying users, imputing ideology based on user behavior, or joining data to the voter file. Each of these methods has advantages and potential challenges.

Using surveys in future work would allow researchers to collect data on ideology in a way that can facilitate comparisons with much of the extant literature in political science, and allow researchers to sample from a less politically engaged population. Of course, this could be tricky because survey response rates might be affected by the phenomenon under study. In other words, the salience of political discussion from the right or left, and/or prior choices to consume content could make people more/less likely to respond to a survey asking about ideology, or affect the way they report the strength of their ideological preferences. This could confound measurement in a way that would be difficult to detect and correct. Yet it would be fascinating to see how survey results compare to the results in this study.

We would also encourage the application of large-scale methods that impute individuals’ ideological leanings using social networks or revealed preferences. This would have the advantage of allowing researchers to estimate ideological preferences for a broader population, and could be applied to empirical contexts for which self-reported ideological affiliations are not present.

However, these approaches present challenges. Imputing ideology based on social networks would make it difficult to estimate what proportion of people’s networks contain individuals from the other side. Bond and Messing (2015) and Barberá (2015) discuss some of the challenges related to estimating ideology based on revealed preferences. Another challenge specific to the quantities estimated in our paper is that because behavior may be caused by the composition of individuals’ social networks, what their friends share, and how they engage with Facebook, using revealed preferences to select the population could introduce endogenous selection bias (Elwert and Winship 2014). A study that negotiates these issues would be a tremendously valuable contribution. Similar methods could also be used to obtain measures of ideological alignment of content.
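To make the endogenous selection concern concrete, here is a toy simulation in base R. It is only a sketch under strong assumptions: the variable names (`ideology_strength`, `network_diversity`, `activity`) are hypothetical, the data are simulated, and nothing here reproduces an analysis from the paper. Two traits are generated independently, a behavioral measure depends on both, and the sample is then restricted based on that behavior.

```r
set.seed(42)
n <- 100000

# Two independent traits (hypothetical names, standard normal)
ideology_strength <- rnorm(n)
network_diversity <- rnorm(n)

# A revealed-preference measure caused by both traits plus noise
activity <- ideology_strength + network_diversity + rnorm(n)

# In the full population the two traits are (nearly) uncorrelated
r_all <- cor(ideology_strength, network_diversity)

# Keep only the most active quarter, i.e. select on the common effect
selected <- activity > quantile(activity, 0.75)
r_selected <- cor(ideology_strength[selected], network_diversity[selected])

print(round(c(all = r_all, selected = r_selected), 2))
```

In the selected subsample the two traits become negatively correlated even though neither causes the other, which is the pattern that arises when conditioning on a collider.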

Lastly, researchers could use party registration from the voter file. This approach would yield millions of records, but would have different selection problems: match rates may differ by region, state, age, gender, etc. Again, the advantage of approaches like this is that these studies complement each other and provide a fuller picture of how exposure to viewpoints from the other side occurs in social media.

Future work should also examine how exposure varies in different subpopulations. For example, one hypothesis to test is whether those with weaker or less consistent ideological preferences have more cross-cutting content shared by friends, rendered in social media streams, and selected for reading. Some preliminary analysis suggests that indeed, among the individuals in our study, those with a weaker stated ideological affiliation have on average more cross-cutting content at each stage in the exposure process.

Other Data Sources

There are many other important questions related to this paper that necessitate new data sources: Does encountering cross-cutting content increase or decrease attitude polarization? What about attitudes toward members of the other side? Does it change specific policy preferences? Are liberals and conservatives more or less likely to see content in News Feed because it was cross-cutting? Do they actively avoid cross-cutting political content because of expressions in the title or because of the fact that the media source is suggestive of a cross-cutting article? How do changes to ranking algorithms and user interfaces affect selective exposure? And how can we better understand actual discourse about politics in social media, rather than merely shared media content?

Answering these questions necessitates collecting innovative data sets via online experimentation (Berinsky et al. 2012), social media (Ryan and Broockman 2012), crowdsourcing (Budak et al. 2014), large scale field experimentation (King et al. 2014), observational social media data, clever ways to collect data about individual differences in ranking (Hannak et al. 2013), smart ways to combine behavioral and survey data (Chen et al. 2014), and panel data (Athey and Mobius 2012, Flaxman et al. 2014).

Many of these are causal questions necessitating experimental and/or quasi-experimental designs. For example, the extent to which people select content because it is cross-cutting could be investigated using experiments like this one (e.g., Messing and Westwood 2012) or through identifying sources of natural exogenous variation. And while Diana Mutz and others have done ground-breaking research on the effects of encountering cross-cutting arguments on political attitudes (Mutz 2002b) and behavior (Mutz 2002a), more research into how these effects play out in the long term (using approaches like Druckman et al. 2012) would be of tremendous benefit to the literature. It is difficult to expose people to any sort of argument for a long period of time (say over the course of a U.S. national political campaign cycle), in a way that is not confounded with people's existing preferences and the social environment, though creative quasi-experimental work (Martin and Yurukoglu 2014) is emerging in this area.

Many of these questions necessitate that researchers identify the effects of cross-cutting arguments both on and off Facebook. To get a full picture of how cross-cutting arguments affect politics requires understanding the myriad of ways individuals get information, both on the Internet (Flaxman et al. 2014) and offline (Mutz 2002a), what kinds of information people discuss in offline contexts (Mutz 2002b), and the relative influence of all of these factors on opinions.

Finally, if individuals' online networks and choices do substantially impact the diversity of news in individuals' overall “information diets,” future research could examine the effects of connecting those with more disparate views (Klar 2014), encouraging consumption of cross-cutting content (Agapie and Munson 2015), or simply encouraging individuals to read more diverse news by making individuals more aware of the balance of news they consume (Munson et al. 2013).

These questions are especially important in light of the fact that there are substantial opportunities for people to read more news on Facebook. The plots below illustrate the average proportion of stories shared by friends, those that are seen in News Feed, and those clicked on for liberals and conservatives in the study. Clearly there is an opportunity to read more news from either side.

Dataverse

Finally, we believe that reproducing, replicating, and conducting additional analyses on extant data sets is extremely important and helps generate ideas for future work (King 1995, Leeper 2015). In that spirit, we have created a Dataverse archive. The repository includes replication data, scripts, as well as some additional supplementary data and code for extending our work.

References

E. Bakshy, S. Messing, L.A. Adamic. 2015. Exposure to ideologically diverse news and opinion on Facebook. Science.

E. Bakshy, I. Rosenn, C.A. Marlow, L.A. Adamic. 2012. The Role of Social Networks in Information Diffusion. ACM WWW 2012.

S. Messing and S.J. Westwood. 2012. Selective Exposure in the Age of Social Media: Endorsements Trump Partisan Source Affiliation When Selecting News Online. Communication Research.

P. Barberá. 2015. Birds of the same feather tweet together: Bayesian ideal point estimation using Twitter data. Political Analysis 23(1): 76-91.

R. Bond, S. Messing. 2015. Quantifying Social Media’s Political Space: Estimating Ideology from Publicly Revealed Preferences on Facebook. American Political Science Review.

F. Elwert and C. Winship. 2014. Endogenous Selection Bias: The Problem of Conditioning on a Collider Variable. Annual Review of Sociology.

G. King, J. Pan, and M. E. Roberts. 2014. Reverse-Engineering Censorship in China: Randomized Experimentation and Participant Observation. Science.

C. Budak, S. Goel, J.M. Rao. 2014. Fair and Balanced? Quantifying Media Bias Through Crowdsourced Content Analysis. November 17, 2014.

S. Athey, M. Mobius. The Impact of News Aggregators on Internet News Consumption: The Case of Localization. Working paper. http://faculty-gsb.stanford.edu/athey/documents/localnews.pdf

A. Hannak, P. Sapiezynski, A. Molavi Kakhki, B. Krishnamurthy, D. Lazer, A. Mislove, C. Wilson. 2013. Measuring personalization of web search. ACM WWW 2013.

A. Chen and A. Owen and M. Shi. Data Enriched Linear Regression. Working paper. http://arxiv.org/pdf/1304.1837v3.pdf

G.J. Martin, A. Yurukoglu. Bias in Cable News: Real Effects and Polarization. Working paper. http://web.stanford.edu/~ayurukog/cable_news.pdf

S.R. Flaxman, S. Goel, J.M. Rao. Filter Bubbles, Echo Chambers, and Online News Consumption. Working paper. https://5harad.com/papers/bubbles.pdf

D.C. Mutz. 2002a. The Consequences of Cross-Cutting Networks for Political Participation. American Journal of Political Science.

D.C. Mutz. 2002b. Cross-cutting Social Networks: Testing Democratic Theory in Practice. American Political Science Review.

J. N. Druckman, J. Fein, & T. Leeper. 2012. A source of bias in public opinion stability. American Political Science Review.

E. Agapie, S.A. Munson. 2015. “Social Cues and Interest in Reading Political News Stories.” AAAI ICWSM 2015.

S. Klar. 2014. Partisanship in a Social Setting. American Journal of Political Science.

S.A. Munson, S.Y. Lee, P. Resnick. 2013. Encouraging Reading of Diverse Political Viewpoints with a Browser Widget. AAAI ICWSM 2013.

G. King. 1995. “Replication, Replication.” Political Science and Politics. http://j.mp/1wP9Vqn

T. Leeper. 2015. What's in a Name? The Concepts and Language of Replication and Reproducibility. Blog post. http://thomasleeper.com/2015/05/open-science-language/

When to Use Stacked Barcharts?

Sol Messing — Sat, 11 Oct 2014 00:00:00 GMT

Yesterday a few of us on Facebook’s Data Science Team released a blogpost showing how candidates are campaigning on Facebook in the 2014 U.S. midterm elections. It was picked up in the Washington Post, in which Reid Wilson calls us "data wizards." Outstanding.

I used Hadley Wickham's ggplot2 for every visualization in the post except a map that Arjun Wilkins produced using D3, and for the first time I used stacked bar charts. Now as I've stated previously, one should generally avoid bar charts, and especially stacked bar charts, except in a few specific circumstances.

But let's talk about when not to use stacked bar charts first---I had the pleasure of chatting with Kaiser Fung of JunkCharts fame the other day, and I think what makes his site so compelling is the mix of schadenfreude and Fremdscham that makes taking apart someone else's mistake such an effective teaching strategy and such a memorable read. I also appreciate the subtle nod to junk art.

Here's a typical, terrible stacked bar chart, which I found on http://www.storytellingwithdata.com/ and which was originally published in a Wall Street Journal blogpost. It shows the share of the personal computing device market by operating system, over time. The problem with using a stacked bar chart is that there are only two common baselines for comparison (the top and bottom of the plotting area), but we are interested in the relative share for more than two OS brands. The post is really concerned with Microsoft, so one solution would be to plot Microsoft versus the rest, or perhaps Microsoft on top versus Apple on the bottom with "Other" in the middle. Then we'd be able to compare the market share of Apple and Microsoft over time. As the author points out, a trend over time can also be visualized with line plots.

By far the worst offender I found in my 5 minute Google search was from junkcharts and originally published on Vox. These cumulative sum plots are so bad I was surprised to see them still up. The first problem is that the plots represent an attempt to convey way too much information---either plot total sales or pick a few key brands that are most interesting and plot them on a multi-line chart or set of faceted time series plots. The only brand for which you can quickly get a sense of sales over time is the Chevy Volt because it's on the baseline. I'm sure the authors wanted to also convey the proportion of sales each year, but if you want to do that just plot the relative sales. Of course, the order in which the bars appear on the plot has no organizing principle, and you need to constantly move your eyes back and forth from the legend to the plot when trying to make sense of this monstrosity.

As Kaiser notes in his post, less is often more. Here's his redux, which uses lines and aggregates by both quarter and brand, resulting in a far superior visualization:

So when *should* you use a stacked bar chart? Here are two scenarios with examples, inspired by work with Eytan Bakshy and conversations with Ta Chiraphadhanakul and John Myles White.

1. You care about comparing the proportion of two things, in this case the share of posts by Democrats and Republicans, along a variety of dimensions. In this case those dimensions consist of keyword (dictionary-based) categories (above) and LDA topics (below). When these are sorted by relative proportion, the reader gains insight into which campaign strategies and issues are used more by Republican or Democratic candidates.

2. You care about comparing proportions along an ordinal, additive variable such as 5-point party identification, along a set of dimensions. I provide an example from a forthcoming paper below (I'll re-insert the axis labels once it's published). Notice that it draws the reader toward two sets of comparisons across dimensions -- one for strong Democrats and Republicans, the other for the set of *all* Democrats and *all* Republicans.

Of course, R code to produce these plots follows:

# Uncomment these lines and install if necessary:
#install.packages('ggplot2')
#install.packages('dplyr')
#install.packages('scales')
library(ggplot2)
library(dplyr)
library(scales)
# We start with the raw number of posts for each party for
# each candidate. Then we compute the total by party and
# category, saving the result for the steps below.
catsByParty <- catsByParty %>%
  group_by(party, all_cats) %>%
  summarise(tot = sum(posts))
# Next, compute the proportion by party for each category
# using dplyr::mutate
catsByParty <- catsByParty %>%
  group_by(all_cats) %>%
  mutate(prop = tot/sum(tot))
# Now compute the difference by category and order the
# categories by that difference (this assumes exactly two
# parties per category):
catsByParty <- catsByParty %>%
  group_by(all_cats) %>%
  mutate(pdiff = diff(prop))
catsByParty$all_cats <- reorder(catsByParty$all_cats, -catsByParty$pdiff)
# And plot:
ggplot(catsByParty, aes(x=all_cats, y=prop, fill=party)) +
  scale_y_continuous(labels = percent_format()) +
  geom_bar(stat='identity') +
  geom_hline(yintercept=.5, linetype = 'dashed') +
  coord_flip() +
  theme_bw() +
  ylab('Democrat/Republican share of page posts') +
  xlab('') +
  scale_fill_manual(values=c('blue', 'red')) +
  theme(legend.position='none') +
  ggtitle('Political Issues Discussed by Party\n')
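
The code above assumes a data frame called catsByParty that already exists, with one row per candidate and category and the columns party, all_cats, and posts. The original data are not included here, so the following is a made-up stand-in that only illustrates the expected shape (the candidate labels, categories, and counts are all hypothetical). Defining something like it before running the code above is one way to try the pipeline; the base-R lines at the end sketch what the per-category shares should come out to.

```r
# Toy stand-in for the input data; every value here is made up
catsByParty <- data.frame(
  candidate = rep(c("A", "B", "C", "D"), times = 2),
  party     = rep(c("Democrat", "Democrat", "Republican", "Republican"),
                  times = 2),
  all_cats  = rep(c("Economy", "Health"), each = 4),
  posts     = c(10, 20, 40, 30, 35, 25, 15, 25)
)

# Base-R cross-check: total posts by category and party,
# then each party's share within a category (rows sum to 1)
tot   <- xtabs(posts ~ all_cats + party, data = catsByParty)
share <- prop.table(tot, margin = 1)
print(round(share, 2))
```

With these toy counts, Democrats account for 30% of the posts in the first category and 60% in the second, so the stacked bars would fall on opposite sides of the dashed 50% line.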

Insight From Cleveland And Tufte On Plotting Numeric Data By Groups

Sun, 04 Mar 2012 00:00:00 GMT

After my post on making dotplots with concise code using plyr and ggplot, I got an email from my dad who practices immigration law and runs a website with a variety of immigration resources and tools. He pointed out that the post was written for folks who already know that they want to make dot plots, and who already know about bootstrapped standard errors. That’s not many people.

In an attempt to appeal to a broader audience, I’m starting a series in which I’ll outline the key principles I use when developing a visualization. In this post, I’ll articulate these principles, which combine some of Tufte’s aesthetic guidelines with Cleveland’s scientific approach to visualization, which is based on the psychological processes involved in making sense of visualizations, and has been rigorously tested via randomized controlled experiments. Based on these principles, I’ll argue that dotplots and scatterplots are better than other types of plots (especially pie charts) in most situations. In later posts, I’ll demonstrate another innovation whose widespread use I’ll credit to Cleveland and Tufte: the use of multiple panels (aka small multiples, trellis graphics, facets, generalized draftsman’s displays, multivar charts) to clearly convey the same information embedded in more complex and difficult to read visualizations, including multiple line plots and mosaic plots. In future posts I’ll also emphasize why it is important to provide some indication of the noise present in the underlying data using error bars or bands. Along the way, I’ll put you to the test–I’ll present some visualizations of the same data using different visualization techniques and ask you to try to get as much information as you can in 2 seconds from each type of visualization.

A good visualization conveys key information to those who may have trouble interpreting numbers and/or statistics, which can make your findings accessible to a wider audience (more on this below). Visualizations also give your audience a break from lexical processing, which is especially useful when you are presenting your findings–people can listen to you and process the findings from a well-designed visual at the same time, but most people have trouble listening while reading your PowerPoint bullet points. Visualizations also convey key information embedded in massive amounts of data, which can aid your own exploratory analysis of data, no matter how massive.

Yet most visualizations are flawed, drawn using elements that make it unnecessarily difficult for the human visual system to make sense of things. I see a lot of these visualizations attending research presentations, screening incoming draft manuscripts as the assistant editor for Political Communication, and as a consumer of media info-graphics (CNN is especially bad, have a look at this monstrosity). Kevin Fox has an especially compelling visual speaking to this here. A big part of the problem is that Microsoft makes it easy to draw flashy but ultimately confusing visualizations in Excel. If you are too busy to read this post in full, follow this short list of guidelines and you’ll be on your way to producing elegant visualizations that impose a minimal cognitive burden on your audience:

Never represent something in 2 or worse yet 3 dimensions if it can be represented in one—NEVER use pie charts, 3-D pie charts, stacked bar charts, or 3-D bar charts.
Remove as much chart junk as possible–unnecessary gridlines, shading, borders, etc.
Give your audience a sense of the noise present in your data–draw error bars or confidence bands if you are plotting estimates.
If you want to plot multiple types of groups on a single outcome (the visual analog of cross-tabulations/marginals), use multi-paneled plots. These can also help if overplotting looks too cluttered.
Avoid mosaic plots. Instead use paneled histograms.
Ditch the legend if you can (you almost always can).

The rest of the content in this series emphasizes why it makes sense to follow these guidelines. In this post I’ll look at the first point in detail and touch on the sixth. These two guidelines are most relevant when you want to look at a quantitative variable (e.g., earnings, vote-share, temperature, etc.) across different qualitative groupings (e.g., industry segment, candidate, party, racial group, season, etc.). This is one of the most common visualization tasks in business, media, and social science, and for this task people often use pie charts and/or bar charts, and occasionally dot plots.

The science of graphical perception

When most people think about visualization, they think first of Edward Tufte. Tufte emphasizes integrity to the data, showing relationships between phenomena, and above all else aesthetic minimalism. I appreciate his ruthless crusade against chart junk and pie charts (nice quote from Data without Borders). We share an affinity for multipanel plotting approaches, which he calls “small multiples” (thanks to Rebecca Weiss for pointing this out), though I think people give Tufte too much credit for their invention: both juiceanalytics and infovis-wiki write that Cleveland introduced the concept/principle. However, both Cleveland and Tufte published books in 1983 discussing the use of multipanel displays; David Smith over at Revolutions writes that “the ‘small-multiples’ principle of data visualization [was] pioneered by Cleveland and popularized in Tufte’s first book”; and the earliest reference to a work containing multipanel displays I could find was published *long* before Tufte’s 1983 work–Seder, Leonard (1950), “Diagnosis with Diagrams—Part I”, Industrial Quality Control (New York, New York: American Society for Quality Control) 7 (1): 11–19.

I’m less sure about Tufte’s advice to always show axes starting at zero, which can make comparison between two groups difficult, and to “show causality,” which can end up misleading your readers. Of course, the visualizations on display in the glossy pages of Tufte’s books are beautiful. But while his books are full of general advice that we should all keep in mind when creating plots, he does not put forth a theory of what works and what doesn’t when trying to visualize data.

Cleveland (with Robert McGill) develops such a theory and subjects it to rigorous scientific testing. In my last post I linked to one of Cleveland’s studies showing that dots (or bars) aligned on the same scale are indeed the best visualization to convey a series of numerical estimates. In this work, Cleveland examined how accurately our visual system can process visual elements or “perceptual units” representing underlying data. These elements include markers aligned on the same scale (e.g., dot plots, scatterplots, ordinary bar charts), the length of lines that are not aligned on the same scale (e.g., stacked bar plots), area (pie charts and mosaic plots), angles (also pie charts), shading/color, volume, curvature, and direction.

He runs two experiments: the first compares judgements about relative position (grouped bar charts) to judgements based only on length (stacked bar charts); the second compares judgements about relative position (ordinary bar charts) to judgements about angles/area (pie charts). Here are the materials he uses, courtesy of the Stanford Computer Graphics Lab:

The results are resoundingly clear: judgements about position relative to a baseline are dramatically more accurate than judgements about angles, area, or length (with no baseline). Hence, he suggests that we replace pie charts with bar charts or dot plots and that we replace stacked bar charts with grouped bar charts.

A striking and often overlooked finding in this work is the fact that the group of participants without technical training, “mostly ordinary housewives” as Cleveland describes them, performed just as well as the group of mostly men with substantial technical training and experience. This finding provides evidence for something that I’ve long suspected: that visualizations make it easier for people lacking quantitative experience to understand your results, serving to level the playing field. If you want your findings to be broadly accessible, it’s probably better to present a visualization rather than a bunch of numbers. It also suggests that if someone is having trouble interpreting your visualizations, it’s probably your fault.

Dotplots versus pie charts and stacked barplots

Now let’s put this to the test. Take a look at each visualization below for two seconds, looking for the percent of the vote that Mitt Romney, Ron Paul, and Jon Huntsman got.

Which is easiest to read? Which conveys information most accurately? Let’s first take a look at the most critical information–the order in which the candidates placed. In all plots, the candidates are arrayed in order from highest to lowest vote share, and it’s easy to see that Mitt won. But once we start looking at who came in second, third, and so on, differences emerge. It’s slightly harder to process order in the pie chart because your eye has to go around the plot rather than up and down in a straight line. In the stacked bar chart, we need to look up which color corresponds to which candidate in the legend (which Tufte told us not to use), adding a layer of cognitive processing.

Second, which conveys estimates most accurately? The dot plot is the clear winner here. We can quickly see that Romney got about 37%, Paul got about 24%, and Huntsman got about 16%, just by looking at dots relative to the axis. When we look at the pie chart, it’s really tough to estimate the exact percent each candidate got. Same with the stacked bar chart. We could add numbers to the pie and bar charts, which would even things out to some extent, but then why not just display a table with exact percents?

One argument I used to hear all the time when I worked in industry is that pie charts “convey a sense of proportion.” Well, sure, I guess I can kind of guesstimate that Ron Paul’s vote share is about 1/4. What about Jon Huntsman? Hmm, it looks like about 15 percent, which is 3/20. But wait, why do I want to convert things into fractions anyway? I don’t think in terms of fractions, I think in terms of percents. And if I really care about proportion, I suppose I could extend the axis from 0 to 100.

Suppose I want to plot results for the top 15 candidates, not just the top 6? Here’s what happens:

No contest, the pie chart fails completely. We’d need to add a legend with colors for each candidate, which adds another layer of cognitive processing–we’d need to look up each color in the legend as we go. And even after adding the legend, you wouldn’t be able to distinguish the lower performing candidates from say write-in votes because the pie slices would be too small. The stacked bar chart will fail for the same reasons, so I’ve excluded it in the interest of brevity. Note that we don’t need to add colors to the dotplot to convey the same information, which saves an extra plotting element that we can use to represent something else (say candidate’s campaign funds or total assets). And, on top of it all, the dot plot takes up less screen/page real estate!

Why do I use dot plots instead of ordinary bar charts? A nice visualization guide from perceptualedge.com points out that often we want to only visualize differences between groups in a narrow range (they use an example wherein monthly expenses vary from $4,250-$5,500). But the length of a bar is supposed to facilitate accurate comparisons between values, so when you use a bar plot starting from $4,250, the difference in bar lengths dramatically exaggerates the actual differences. Dot plots do not have this problem because dots encode values using only location, so one must reference the axis to interpret the value.
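A quick bit of arithmetic in R shows how large the exaggeration can get. The two expense values below are hypothetical numbers picked from inside the $4,250-$5,500 range mentioned above, not figures from the perceptualedge.com guide.

```r
# Hypothetical monthly expenses within the $4,250-$5,500 range
expenses <- c(low_month = 4500, high_month = 5500)
baseline <- 4250  # where the truncated bar axis starts

# Ratio of the underlying values
true_ratio <- unname(expenses["high_month"] / expenses["low_month"])

# Ratio of the bar lengths as drawn from the truncated baseline
bar_lengths <- expenses - baseline
bar_ratio <- unname(bar_lengths["high_month"] / bar_lengths["low_month"])

print(round(c(true_ratio = true_ratio, bar_ratio = bar_ratio), 2))
```

Here the taller bar is drawn five times as long as the shorter one, even though the underlying value is only about 22% larger.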

A related point is that bars are often used to convey counts–we use them in histograms to represent frequency and track say counts of dollars earned/raised in bar charts. In fact, a team of doctors I work with at the med school recently sent in a manuscript to Radiology containing a bar chart plotting mean values between groups; they got back the following comment from the statistical reviewer: “the y-axis is quantitative but the data are represented using bars as if the data were counts.” People often use bar plots to convey estimates of means (and I’ve certainly done this), which can serve to exaggerate differences in means and hence effect sizes if you do not plot the bars from zero.

In addition, dot plots have aesthetic advantages. They convey the numerical estimate in question with a single one-dimensional point, rather than a two dimensional bar. There’s simply less that the eye needs to process. Accordingly, if a pattern across qualitative groupings exists, it’s often easier to see with a dot plot. For example, below I plot the average user ratings for each article to which Sean Westwood and I exposed subjects in a news reading experiment. The pattern that emerges is an “S” curve in which one or two stories dominate the ratings, most are sort of average, and a few are uniformly terrible. Note that you’d probably want to use something like this more for yourself than to communicate your results to others as it might overload your audience with too much information–you’d do better to select a subset of these articles or remove some of the ones in the middle (thanks to Yph Lelkes for making this point).

One question that remains is if pie charts are so bad, why are they so common? Perhaps we like them because we find them comforting just as we find pies and pizza? Well if so we’d expect pie charts to be less common in places like Japan and China where people grow up eating different food. Consider info-graphics in newspapers: I haven’t yet done a systematic content analysis, but I was unable to find a single pie chart in Japan’s Yomiuri Shimbun or Asahi Shimbun, or in China’s Beijing Daily or Sing Tao Daily. I did see plenty of maps, however, which I suppose one could argue are reminiscent of noodles.

Implementation

The most efficient way to produce solid visualizations with the ability to implement multiple panels, proper standard error estimates, and dot plots is probably in R using the ggplot2 package. If you do not have time to learn R and remain tied to MS-Excel, stick to ordinary barplots to visualize quantitative variables among multiple groups (not recommended).

Otherwise, if you don’t already use it, download R and a decent editor like RStudio. Then get started with ggplot2 and dot plots by running the following code chunk, which will replicate the election figure above:

pres <- read.csv("https://SolomonMg.github.io/img/primaryres.csv", as.is=T)

# sort data in order of percent of vote:
pres <- pres[order(pres$Percentage, decreasing=T), ]

# only show top 15 candidates:
pres <- pres[1:15,]

# create a percentage variable
pres$Percentage <- pres$Percentage*100

# reorder the Candidate factor by percentage for plotting purposes:
pres$Candidate <- reorder(pres$Candidate, pres$Percentage)

# To install ggplot2, run the following line after deleting the #
#install.packages("ggplot2")
library(ggplot2)
ggplot(pres, aes(x = Percentage, y = factor(Candidate) )) +
geom_point() +
theme_bw() + xlab("Percent of Vote") + ylab("Candidate") +
ggtitle("New Hampshire Primary 2012")

After loading our data and running a few preliminary data processing operations, we pass ggplot our data set, “pres,” then we tell it what aesthetic elements we want to use, in this case that x is going to be our “Percentage” variable and y is going to be our “Candidate” variable. We tell ggplot that we want to display points for every xy pair. We also tell it to use the black and white theme. Then we tell it what to label the x and y axes, and give it a title.

We can also reproduce the article ratings by story plot above using ggplot2 (even though I originally produced the plot using the lattice package).

# To install ggplot2, run the following line after deleting the #
#install.packages("ggplot2")
library(ggplot2)
load(file("https://SolomonMg.github.io/img/db.Rda"))

# if you haven't installed dplyr, delete the # and run this line:
# install.packages("dplyr")
library(dplyr)
table(db$story)

# first we use dplyr to calculate the mean rating and SE for each story
ratingdat <- db %>% group_by(story) %>%
summarise(M = mean(rating, na.rm=T),
SE = sd(rating, na.rm=T)/sqrt(length(na.omit(rating))),
N = length(na.omit(rating)))

# make story into an ordered factor, ordering by mean rating:
ratingdat$story <- factor(ratingdat$story)
ratingdat$story <- reorder(ratingdat$story, ratingdat$M)

# take a look at our handiwork:
ggplot(ratingdat, aes(x = M, xmin = M-SE, xmax = M+SE, y = story )) +
geom_point() + geom_segment( aes(x = M-SE, xend = M+SE,
y = story, yend=story)) +
theme_bw() + xlab("Mean rating") + ylab("Story") +
ggtitle("Rating article by Story, with SE")

# Now save
ggsave(file="plots/dotplot-story-rating.pdf", height=14, width=8.5)
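The standard error behind the error bars is the standard deviation divided by the square root of the number of non-missing ratings. A base-R sketch of the same calculation on made-up ratings (the `toy` data frame and the `se` helper are illustrative, not part of the original data):

```r
# Hypothetical ratings for two stories, including a missing value
toy <- data.frame(
    story  = c("s1", "s1", "s1", "s2", "s2", "s2"),
    rating = c(2, 4, NA, 5, 5, 8))

# Standard error of the mean, ignoring missing values
se <- function(x) sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x)))

M  <- tapply(toy$rating, toy$story, mean, na.rm = TRUE)
SE <- tapply(toy$rating, toy$story, se)
cbind(M, SE)
```

Counting only the non-missing values matters here: dividing by the full group size would understate the standard error for story s1.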

Working with Bipartite/Affiliation Network Data in R

Sol Messing — Sun, 04 Mar 2012 00:00:00 GMT

Data can often be usefully conceptualized in terms of affiliations between people (or other key data entities). It might be useful to analyze common group membership, common purchasing decisions, or common patterns of behavior. This post introduces bipartite/affiliation network data and provides R code to help you process and visualize this kind of data. I recently updated this for use with larger data sets, though I put it together a while back.

Preliminaries

Much of the material here is covered in the more comprehensive “Social Network Analysis Labs in R and SoNIA,” on which I collaborated with Dan McFarland, Sean Westwood and Mike Nowak. For a great online introduction to social network analysis see the online book Introduction to Social Network Methods by Robert Hanneman and Mark Riddle.

Bipartite/Affiliation Network Data

A network can consist of different ‘classes’ of nodes. For example, a two-mode network might consist of people (the first mode) and groups in which they are members (the second mode). Another very common example of two-mode network data consists of users on a particular website who communicate in the same forum thread. Here’s a short example of this kind of data. Run this in R for yourself: just copy and paste into the command line or into a script and it will generate a dataframe that we can use for illustrative purposes:

df <- data.frame( 
    person = c('Sam','Sam','Sam','Greg','Tom','Tom','Tom','Mary','Mary'), 
    group = c('a','b','c','a','b','c','d','b','d'), 
    stringsAsFactors = F)

df

person group
1    Sam     a
2    Sam     b
3    Sam     c
4   Greg     a
5    Tom     b
6    Tom     c
7    Tom     d
8   Mary     b
9   Mary     d

Fast, efficient two-mode to one-mode conversion in R

Suppose we wish to analyze or visualize how the people are connected directly - that is, what if we want the network of people where a tie between two people is present if they are both members of the same group? We need to perform a two-mode to one-mode conversion.

To convert a two-mode incidence matrix to a one-mode adjacency matrix, one can simply multiply an incidence matrix by its transpose, which sums the common 1’s between rows. Recall that matrix multiplication entails multiplying the k-th entry of a row in the first matrix by the k-th entry of a column in the second matrix, then summing, such that the ij-th row-column entry in the resulting matrix represents the dot-product of the i-th row of the first matrix and the j-th column of the second. In mathematical notation:
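Writing $A$ and $B$ for the two matrices and $C = AB$ for their product (symbol names assumed here, since the text does not name them), with $k$ running over the shared dimension:

$$
C_{ij} = \sum_{k} a_{ik} \, b_{kj}
$$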

Notice further that multiplying a matrix by its transpose yields the following:
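For an incidence matrix $A$ (an assumed symbol), the $ij$-th entry of the product with its transpose is the dot product of rows $i$ and $j$ of $A$:

$$
(AA^{\top})_{ij} = \sum_{k} a_{ik} \, a_{jk}
$$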

Because our incidence matrix consists of 0’s and 1’s, the off-diagonal entries represent the total number of common columns, which is exactly what we wanted. We’ll use the %*% operator to tell R to do exactly this. Let’s take a look at a small example using toy data of people and groups to which they belong. We’ll coerce the data to an incidence matrix, then multiply the incidence matrix by its transpose to get the number of common groups between people.

This is easy to do using the matrix algebra functions included in R. But first, you need to restructure your (edgelist) network data as an incidence matrix. An incidence matrix records a 1 for row-column combinations where a tie is present and 0 otherwise. One easy way to do this in R is to use the table function and then coerce the table object to a matrix object:

m <- table( df )
M <- as.matrix( m )
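As a quick check on small data, the dense incidence matrix can be multiplied by its transpose directly. This sketch rebuilds the toy df from above so that it runs on its own; the name `co_people` is just illustrative:

```r
df <- data.frame(
    person = c('Sam','Sam','Sam','Greg','Tom','Tom','Tom','Mary','Mary'),
    group  = c('a','b','c','a','b','c','d','b','d'),
    stringsAsFactors = FALSE)

# person-by-group incidence matrix (rows and columns sorted alphabetically)
M <- as.matrix(table(df))

# person-by-person co-membership counts; the diagonal holds each
# person's own number of groups
co_people <- M %*% t(M)
co_people["Sam", "Tom"]   # groups shared by Sam and Tom (b and c)
```

This dense version is fine for a handful of rows, but the full person-by-group table grows quickly with real data.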

If you are using the network or sna packages, a network object can be coerced via as.matrix(your-network); with the igraph package use get.adjacency(your-network).

This is great, but what about if we are working with a really large data set? Network data is almost always sparse—there are far more pairwise combinations of potential connections than actual observed connections. Hence, we’d actually prefer to keep the underlying data structured in edgelist format, but we’d also like access to R’s matrix algebra functionality.

We can get the best of both worlds using the Matrix library to construct a sparse triplet representation of a matrix. But we’d also like to avoid building the entire incidence matrix and just feed Matrix our edgelist directly, a point that came up in a recent conversation I had with Sean Taylor. We feed Matrix our ‘person’ column to index ‘i’ (rows in the new incidence matrix), our ‘group’ column to index j (columns in the new incidence matrix), and we repeat ‘1’ for the length of the edgelist to denote an incidence.

library('Matrix')
A <- spMatrix(nrow = length(unique(df$person)),
    ncol = length(unique(df$group)),
    i = as.numeric(factor(df$person)),
    j = as.numeric(factor(df$group)),
    x = rep(1, nrow(df)))   # one '1' per edgelist row
row.names(A) <- levels(factor(df$person))
colnames(A) <- levels(factor(df$group))
A
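The Matrix package also offers a more general constructor, sparseMatrix, which may be the more convenient choice in newer versions of the package. A sketch of the same construction, assuming Matrix is installed; `A2`, `p` and `g` are illustrative names, and df is rebuilt so the block runs on its own:

```r
library(Matrix)

df <- data.frame(
    person = c('Sam','Sam','Sam','Greg','Tom','Tom','Tom','Mary','Mary'),
    group  = c('a','b','c','a','b','c','d','b','d'),
    stringsAsFactors = FALSE)

p <- factor(df$person)
g <- factor(df$group)

# One row per person, one column per group; x = 1 marks a membership.
# Note that sparseMatrix() sums x over repeated (i, j) pairs, so
# duplicate rows in the edgelist would show up as counts above 1.
A2 <- sparseMatrix(i = as.integer(p), j = as.integer(g),
                   x = rep(1, nrow(df)),
                   dims = c(nlevels(p), nlevels(g)),
                   dimnames = list(levels(p), levels(g)))
A2
```

Setting the dimnames inside the constructor avoids the two separate naming steps used with spMatrix.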

We will either convert to the ‘mode’ represented by the columns or by the rows. To get the one-mode representation of ties between rows (people in our example), multiply the matrix by its transpose. Note that you must use the matrix-multiplication operator %*% rather than a simple asterisk. The R code is:

Arow <- A %*% t(A)

But we can still do better! The function tcrossprod is faster and more efficient for this:

Arow <- tcrossprod(A)

Arow will now represent the one-mode matrix formed by the row entities—people will have ties to each other if they are in the same group, in our example. Here’s what it looks like:

Arow
4 x 4 sparse Matrix of class "dgCMatrix"
     Greg Mary Sam Tom
Greg    1    .   1   .
Mary    .    2   1   2
Sam     1    1   3   2
Tom     .    2   2   3

To get the one-mode matrix formed by the column entities (i.e. the number of people that each pair of groups has in common), enter the following command:

Acol <- t(A) %*% A

Again, we can do this more efficiently, this time with crossprod, which computes t(A) %*% A without forming the transpose explicitly:

Acol <- crossprod(A)
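Base R provides crossprod and tcrossprod for ordinary dense matrices too, so the equivalence with the explicit products is easy to verify. A small check on the toy person-by-group matrix, typed in by hand here so the block is self-contained:

```r
# Toy incidence matrix: rows are people, columns are groups
M <- matrix(c(1, 0, 0, 0,
              0, 1, 0, 1,
              1, 1, 1, 0,
              0, 1, 1, 1), nrow = 4, byrow = TRUE,
            dimnames = list(c("Greg", "Mary", "Sam", "Tom"),
                            c("a", "b", "c", "d")))

# tcrossprod(M) is M %*% t(M): ties among rows (people)
all.equal(tcrossprod(M), M %*% t(M))

# crossprod(M) is t(M) %*% M: ties among columns (groups)
all.equal(crossprod(M), t(M) %*% M)
```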

And the resulting co-membership matrix is as follows:

as.matrix(Acol)
  a b c d
a 2 1 1 0
b 1 3 2 2
c 1 2 2 1
d 0 2 1 2

Although we’ve used a very small network for our example, this code is highly extensible to the analysis of larger networks with R.

Analysis of Two Mode Data and Mobility

Let’s work with some actual affiliation data, collected by Dan McFarland on student extracurricular affiliations. It’s a longitudinal data set, with 3 waves - 1996, 1997, 1998. It consists of students (anonymized) and the student organizations in which they are members (e.g. National Honor Society, wrestling team, cheerleading squad, etc.). What we’ll do is to read in the data, explore it, make a few two-to-one mode conversions, and visualize it.

# Load the 'igraph' library
library('igraph')
# (1) Read in the data files, NA data objects coded as 'na'
magact96 = read.delim('https://solomonmg.github.io/assets/img/mag_act96.txt',
na.strings = 'na')
magact97 = read.delim('https://solomonmg.github.io/assets/img/mag_act97.txt',
na.strings = 'na')
magact98 = read.delim('https://solomonmg.github.io/assets/img/mag_act98.txt',
na.strings = 'na')

Missing data is coded as “na” in this data, which is why we gave R the command na.strings = “na”.

These files consist of four columns of individual-level attributes (ID, gender, grade, race), then a bunch of group membership dummy variables (coded “1” for membership, “0” for no membership). We need to set aside the first four columns (which do not change from year to year).

magattrib = magact96[,1:4]
g96 <- as.matrix(magact96[,-(1:4)]); row.names(g96) = magact96$ID.
g97 <- as.matrix(magact97[,-(1:4)]); row.names(g97) = magact97$ID.
g98 <- as.matrix(magact98[,-(1:4)]); row.names(g98) = magact98$ID.
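A toy version of this split may make the indexing clearer. The data frame below is made up (only the "ID." column name mirrors the real files), and `attrib` and `inc` are illustrative names:

```r
# Hypothetical miniature of the activity files: four attribute columns
# followed by membership dummies
toy <- data.frame(ID. = c(101, 102, 103),
                  GENDER = c(1, 2, 1), GRADE = c(9, 10, 11), RACE = c(4, 4, 5),
                  Chess = c(1, 0, 1), Band = c(0, 0, 1))

attrib <- toy[, 1:4]              # individual-level attributes
inc <- as.matrix(toy[, -(1:4)])   # student-by-group incidence matrix
row.names(inc) <- toy$ID.
inc
dim(inc)   # 3 students by 2 groups
```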

By using the [,-(1:4)] index, we drop those columns so that we have a student-by-group incidence matrix for each year, and then tell R to set the row names of the matrix to the student’s ID. Note that we need to keep the “.” after ID in this dataset (because it’s in the name of the variable). Now we load these two-mode matrices into igraph:

i96 <- graph.incidence(g96, mode=c('all') )
i97 <- graph.incidence(g97, mode=c('all') )
i98 <- graph.incidence(g98, mode=c('all') )

Plotting two-mode networks

Now, let’s plot these graphs. The igraph package has excellent plotting functionality that allows you to assign visual attributes to igraph objects before you plot. The alternative is to pass 20 or so arguments to the plot.igraph() function, which gets really messy.

Let’s assign some attributes to our graph. First we set vertex attributes, making sure to make them slightly transparent by altering the alpha (opacity) value, using the rgb(r,g,b,alpha) function to set the color. This makes it much easier to look at a really crowded graph, which might look like a giant hairball otherwise. You can read up on the RGB color model here.

Each node (or “vertex”) object is accessible by calling V(g), and you can call (or create) a node attribute by using the $ operator so that you call V(g)$attribute. Here’s how to set the color attribute for a set of nodes in a graph object:

V(i96)$color[1:1295] <- rgb(1,0,0,.5)
V(i96)$color[1296:1386] <- rgb(0,1,0,.5)

Notice that we index the V(g)$color object by a seemingly arbitrary value, 1295. This marks the end of the student nodes, and 1296 is the first group node. You can view which nodes are which by typing V(i96). R prints out a list of all the nodes in the graph, and those with a number are obviously different from those that consist of a group name.
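Rather than hard-coding the boundary, one option is to use the type vertex attribute that igraph attaches to bipartite graphs built from an incidence matrix. A sketch on a made-up matrix, assuming an igraph version in which graph.incidence is still available (newer releases rename it, for example to graph_from_biadjacency_matrix):

```r
library(igraph)

# Toy student-by-group incidence matrix (made-up IDs and groups)
inc <- matrix(c(1, 0,
                1, 1,
                0, 1), nrow = 3, byrow = TRUE,
              dimnames = list(c("101", "102", "103"), c("Chess", "Band")))
g <- graph.incidence(inc, mode = "all")

# igraph marks the two modes in V(g)$type:
# FALSE for row nodes (students), TRUE for column nodes (groups)
V(g)$type

# Color by mode without knowing where the students end
V(g)$color <- ifelse(V(g)$type, rgb(0, 1, 0, .5), rgb(1, 0, 0, .5))
```

The same logical vector can be used to set labels or sizes by mode, and it keeps working after vertices are deleted.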

Now we’ll set some other graph attributes:

V(i96)$label <- V(i96)$name
V(i96)$label.color <- rgb(0,0,.2,.5)
V(i96)$label.cex <- .4
V(i96)$size <- 6
V(i96)$frame.color <- NA

You can also set edge attributes. Here we’ll make the edges nearly transparent and slightly yellow because there will be so many edges in this graph:

E(i96)$color <- rgb(.5,.5,0,.2)

Now, we’ll open a pdf “device” on which to plot. This is just a connection to a pdf file. Note that the code below will take a minute or two to execute (or longer if you have a pre-Intel dual-core processor).

pdf('i96.pdf')
plot(i96, layout=layout.fruchterman.reingold)
dev.off()

Note that we’ve used the Fruchterman-Reingold force-directed layout algorithm here. Generally speaking, when you have a ton of edges, the Kamada-Kawai layout algorithm works well, but it can get really slow for networks with a lot of nodes. Also, for larger networks, layout.fruchterman.reingold.grid is faster, but can fail to produce a plot with any meaningful pattern if you have too many isolates, as is the case here. Experiment for yourself. Here’s what we get:

It’s oddly reminiscent of a crescent and star, but impossible to read. Now, if you open the pdf output, you’ll notice that you can zoom in on any part of the graph ad infinitum without losing any resolution. How is that possible in such a small file? It’s possible because the pdf device output consists of data based on vectors: lines, polygons, circles, ellipses, etc., each specified by a mathematical formula that your pdf program renders when you view it. Regular bitmap or jpeg picture output, on the other hand, consists of a pixel-coordinate mapping of the image in question, which is why you lose resolution when you zoom in on a digital photograph or a plot produced with most other programs.

Let’s remove all of the isolates (the crescent), change a few aesthetic features, and replot. First, we’ll remove isolates by deleting all nodes with a degree of 0, meaning that they have zero edges. Then, we’ll suppress labels for students and make their nodes smaller and more transparent. Then we’ll make the edges narrower and more transparent. Then, we’ll replot using various layout algorithms:

i96 <- delete.vertices(i96, V(i96)[ degree(i96)==0 ])
V(i96)$label[1:857] <- NA
V(i96)$color[1:857] <-  rgb(1,0,0,.1)
V(i96)$size[1:857] <- 2
E(i96)$width <- .3
E(i96)$color <- rgb(.5,.5,0,.1)
pdf('i96.2.pdf')
plot(i96, layout=layout.kamada.kawai)
dev.off()
pdf('i96.3.pdf')
plot(i96, layout=layout.fruchterman.reingold.grid)
dev.off()
pdf('i96.4.pdf')
plot(i96, layout=layout.fruchterman.reingold)
dev.off()

I personally prefer the Fruchterman-Reingold layout in this case. The nice thing about this layout is that it really emphasizes centrality–the nodes that are most central are nearly always placed in the middle of the plot. Here’s what it looks like:

Very pretty, but you can’t see which groups are which at this resolution. Zoom in on the pdf output, and you can see things pretty clearly.

Two mode to one mode data transformation

We’ve emphasized groups so much in this visualization that we might want to just create a network consisting of group co-membership. First we need to create a new network object. We’ll do that the same way for this network as for our example at the top of this page:

g96e <- t(g96) %*% g96
g97e <- t(g97) %*% g97
g98e <- t(g98) %*% g98
i96e <- graph.adjacency(g96e, mode = 'undirected')

Now we need to transform the graph so that multiple edges become an attribute ( E(g)$weight ) of each unique edge:

E(i96e)$weight <- count.multiple(i96e)
i96e <- simplify(i96e)
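Depending on the igraph version, simplify() may combine the weight attribute of merged edges (for example by summing) rather than keep the count as set above, so it is worth inspecting E(i96e)$weight afterwards. One alternative, sketched here on the toy group co-membership counts from the start of this post rather than on g96e, is to let graph.adjacency read the counts as weights directly (`co` and `gco` are illustrative names):

```r
library(igraph)

# Group co-membership counts from the toy example
co <- matrix(c(2, 1, 1, 0,
               1, 3, 2, 2,
               1, 2, 2, 1,
               0, 2, 1, 2), nrow = 4, byrow = TRUE,
             dimnames = list(letters[1:4], letters[1:4]))

# weighted = TRUE stores each nonzero cell as an edge weight;
# diag = FALSE drops the group sizes on the diagonal (no self-loops)
gco <- graph.adjacency(co, mode = "undirected", weighted = TRUE, diag = FALSE)
E(gco)$weight
```

With this approach there are no parallel edges to collapse, so no call to simplify() is needed.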

Now we’ll set the other plotting parameters as we did above:

# Set vertex attributes
V(i96e)$label <- V(i96e)$name
V(i96e)$label.color <- rgb(0,0,.2,.8)
V(i96e)$label.cex <- .6
V(i96e)$size <- 6
V(i96e)$frame.color <- NA
V(i96e)$color <- rgb(0,0,1,.5)
# Set edge alpha (opacity) according to edge weight
egam <- (log(E(i96e)$weight)+.3)/max(log(E(i96e)$weight)+.3)
E(i96e)$color <- rgb(.5,.5,0,egam)
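To see what the scaling in the last two lines does, it can be run on a few made-up edge weights. The vector `w` below is hypothetical; the formula is the one used above:

```r
# Made-up edge weights: number of shared members between pairs of groups
w <- c(1, 2, 10, 50)

# Log-scale so a few very large overlaps do not wash out the rest; the
# + .3 keeps weight-1 edges (log(1) = 0) faintly visible; dividing by
# the maximum maps everything into (0, 1], the range rgb() accepts
alpha <- (log(w) + .3) / max(log(w) + .3)
round(alpha, 2)

# One semi-transparent color per edge
edge_col <- rgb(.5, .5, 0, alpha)
edge_col
```

The constant .3 is a tuning choice: a larger value makes weak ties more visible at the cost of less contrast.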

We set edge alpha (opacity) as a function of how many edges exist between two nodes, or in this case, how many students each group has in common. For illustrative purposes, let’s compare how the Kamada-Kawai and Fruchterman-Reingold algorithms render this graph:

pdf('i96e.pdf')
plot(i96e, main = 'layout.kamada.kawai', layout=layout.kamada.kawai)
plot(i96e, main = 'layout.fruchterman.reingold', layout=layout.fruchterman.reingold)
dev.off()

I like the Kamada-Kawai layout for this graph, because the center of the graph is too busy otherwise. And here’s what the resulting plot looks like:

You can check out the difference between each layout yourself. Here’s what the pdf output looks like. Page 1 shows the Kamada-Kawai layout and page 2 shows the Fruchterman-Reingold layout.

Group overlap networks and plots

Now we might also be interested in the percent overlap between groups. Note that this will be a directed graph, because the percent overlap will not be symmetric across groups–for example, it may be that 3/4 of Spanish NHS members are in NHS, but only 1/8 of NHS members are in the Spanish NHS. We’ll create this graph for all years in our data (though we could do it for one year only). First we’ll need to create a percent overlap graph. We start by dividing each row by the diagonal (this is really easy in R):

ol96 <- g96e/diag(g96e)
ol97 <- g97e/diag(g97e)
ol98 <- g98e/diag(g98e)
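Why this divides rows rather than columns: R recycles the diagonal vector down the columns, so cell [i, j] is divided by the i-th diagonal entry. A quick check on the toy group co-membership counts from the start of this post (`co` and `ol` are illustrative names):

```r
co <- matrix(c(2, 1, 1, 0,
               1, 3, 2, 2,
               1, 2, 2, 1,
               0, 2, 1, 2), nrow = 4, byrow = TRUE,
             dimnames = list(letters[1:4], letters[1:4]))

# Each row is scaled by that row-group's size (its diagonal entry)
ol <- co / diag(co)
ol["a", "b"]   # share of group a's members who are also in b: 1/2
ol["b", "a"]   # share of group b's members who are also in a: 1/3
```

The two values differ, which is why the resulting overlap graph has to be treated as directed.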

Next, sum the matrices and set any NA cells (caused by dividing by zero in the step above) to zero:

magall <- ol96 + ol97 + ol98
magall[is.na(magall)] <- 0

Note that magall now consists of a percent overlap matrix, but because we’ve summed over 3 years, the maximum is now 3 instead of 1. Let’s compute average club size, by taking the mean across each value in each diagonal:

magdiag <- apply(cbind(diag(g96e), diag(g97e), diag(g98e)), 1, mean )

Finally, we’ll generate centrality measures for magall. When we create the igraph object from our matrix, we need to set weighted=T because otherwise igraph dichotomizes edges at 1. This can distort our centrality measures because now edges represent more than binary connections–they represent the percent of membership overlap.

magallg <- graph.adjacency(magall, weighted=T)
# Degree
V(magallg)$degree <- degree(magallg)
# Betweenness centrality
V(magallg)$btwcnt <- betweenness(magallg)

Before we plot this, we should probably filter some of the edges, otherwise our graph will probably be too busy to make sense of visually. Take a look at the distribution of connection strength by plotting the density of the magall matrix:

plot(density(magall))

Nearly all of the edge weights are below 1; in other words, the average percent overlap for most pairs of clubs is less than 1/3. Let’s filter at 1, so that an edge represents an overlap averaging at least 1/3 of the members of the group in question.

magallgt1 <- magall
magallgt1[magallgt1<1] <- 0
magallggt1 <- graph.adjacency(magallgt1, weighted=T)
# Removes loops:
magallggt1 <- simplify(magallggt1, remove.multiple=FALSE, remove.loops=TRUE)

Before we do anything else, we’ll create a custom layout based on Fruchterman-Reingold wherein we adjust the coordinates by hand using the tkplot gui tool to make sure all of the labels are visible. This is very useful if you want to create a really sharp-looking network visualization for publication.

magallggt1$layout <- layout.fruchterman.reingold(magallggt1)
V(magallggt1)$label <- V(magallggt1)$name
tkplot(magallggt1)

Let the plot load, then maximize the window, and select View -> Fit to Screen so that you get maximum resolution for this large graph. Now hand-place the nodes, making sure no labels overlap:

Pay special attention to whether the labels overlap (or might overlap if the font was bigger) along the vertical. Save the layout coordinates to the graph object:

magallggt1$layout <- tkplot.getcoords(1)

We use “1” here only because this was the first tkplot object we called. If you called tkplot a few times, use the last plot object. You can tell which object is visible because at the top of the tkplot interface, you’ll see something like “Graph plot 1” or in the case of my screenshot above “Graph plot 7” (it was the seventh time I called tkplot).

# Set vertex attributes
V(magallggt1)$label <- V(magallggt1)$name
V(magallggt1)$label.color <- rgb(0,0,.2,.6)
V(magallggt1)$size <- 6
V(magallggt1)$frame.color <- NA
V(magallggt1)$color <- rgb(0,0,1,.5)
# Set edge attributes
E(magallggt1)$arrow.size <- .3
# Set edge alpha (opacity) according to edge weight
egam <- (E(magallggt1)$weight+.1)/max(E(magallggt1)$weight+.1)
E(magallggt1)$color <- rgb(.5,.5,0,egam)

One thing that we can do with this graph is to set label size as a function of degree, which adds a “tag-cloud”-like element to the visualization:

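# degree was stored on magallg above, not on this filtered graph,
# so compute it here before using it
V(magallggt1)$degree <- degree(magallggt1)
V(magallggt1)$label.cex <- V(magallggt1)$degree/(max(V(magallggt1)$degree)/2)+ .3
# note: unfortunately one must play with the formula above to get the
# ratio just right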

Let’s plot the results:

pdf('magallggt1customlayout.pdf')
plot(magallggt1)
dev.off()

Note that we used the custom layout; because we made it part of the igraph object magallggt1, we did not need to specify it in the plot command. Here’s the pdf output, and here’s what it looks like:

This visualization reveals much more information about our network than our crescent-star visualization.

Mobility, Markov, and Transition Probabilities

In order to shed light on how people flow through these groups, we’ll compute transition probabilities. A matrix of transition probabilities like this is what defines a Markov chain.

First we’ll create a new matrix by multiplying the 1996 magnet matrix with the 1997 magnet matrix, so that you see the number of students moving from each 1996 membership to each 1997 membership.

Before we actually do this, we need to do some data munging to make sure that the rows and columns for g96 and g97 are the same. We’ll use the match() function for this.

    
# First, let's get an idea of how many column-names (activities) and row
# names (student ids) are in common between the two years:
  
(cnames = intersect( colnames(g96), colnames(g97) ) )
(rnames = intersect( row.names(g96), row.names(g97) ) )
  
# Great, there are a lot of names in common. Now we
# need to make sure we are only using the rows
# and columns of each matrix that contain entries used in
# both years. We also need to make sure that the columns and
# rows are in the same order.
  
# In order to accomplish this we are going to exploit R's
# indexing capabilities. We are going to have R "rebuild"
# each matrix according to the order of rnames and cnames.
# We'll use the match() function to accomplish this.
g96matched = g96[ match(rnames, row.names(g96)), match(cnames, colnames(g96)) ]
g97matched = g97[ match(rnames, row.names(g97)), match(cnames, colnames(g97)) ]
  
# We need to do the same thing for the diagonal of the matrix g96e, which is
# our co-membership/affiliation matrix computed above:
mag96diagmatched = diag( g96e[ match(cnames, colnames(g96e)), 
  match(cnames, colnames(g96e)) ] )
  
# Now let's check to make sure things worked correctly:
which(row.names(g96matched) != row.names(g97matched))
which(colnames(g96matched) != colnames(g97matched))
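The match()-based alignment can be traced on two tiny matrices. Everything below is made up for illustration; the cell values are arbitrary labels rather than 0/1 memberships, so that the reordering is easy to follow:

```r
# Two toy matrices with different row/column sets and orders
x96 <- matrix(1:6, nrow = 3,
              dimnames = list(c("101", "102", "103"), c("Band", "Chess")))
x97 <- matrix(1:6, nrow = 2,
              dimnames = list(c("103", "101"), c("Chess", "Drama", "Band")))

rn <- intersect(row.names(x96), row.names(x97))   # shared student ids
cn <- intersect(colnames(x96), colnames(x97))     # shared activities

# match() returns, for each shared name, its position in the target,
# so indexing by it both subsets and reorders
x96m <- x96[match(rn, row.names(x96)), match(cn, colnames(x96))]
x97m <- x97[match(rn, row.names(x97)), match(cn, colnames(x97))]

# Rows and columns now line up across the two years
identical(dimnames(x96m), dimnames(x97m))
```

Indexing by name, as in x97[rn, cn], would give the same result here; match() makes the positional logic explicit.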

Now that these matrices are conformable (same students in the rows, same activities in the columns), we can multiply them to get the matrix of transition counts:

mag96_97 = t(g96matched) %*% g97matched

Let’s munge the 97 and 98 data and repeat:


cnames = intersect( colnames(g97), colnames(g98) ) 
rnames = intersect( row.names(g97), row.names(g98) )
g97matched = g97[ match(rnames, row.names(g97)), match(cnames, colnames(g97)) ]
g98matched = g98[ match(rnames, row.names(g98)), match(cnames, colnames(g98)) ]

And again for the 97-98 transition:

mag97_98 = t(g97matched) %*% g98matched

Now we need to get the group-level membership matrix diagonal, ordered by the current set of columns.

mag96diagmatched = diag( g96e[ match(cnames, colnames(g96e)), 
  match(cnames, colnames(g96e)) ] )

mag97diagmatched = diag( g97e[ match(cnames, colnames(g97e)), 
  match(cnames, colnames(g97e)) ] )
  
mag98diagmatched = diag( g98e[ match(cnames, colnames(g98e)), 
  match(cnames, colnames(g98e)) ] )

And finally we can create the transition probability matrix! Divide mag96_97 by mag96diagmatched to get the transition probability matrix (Markov chain):

magmob96_97 = mag96_97/mag96diagmatched 
magmob97_98 = mag97_98/mag97diagmatched
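A small worked case shows what the entries of such a matrix mean. The two membership matrices below are invented (four students, two groups), and `moves`, `size1` and `trans` are illustrative names:

```r
# Toy memberships for four students in two groups, two consecutive years
y1 <- matrix(c(1, 0,
               1, 0,
               1, 1,
               0, 1), nrow = 4, byrow = TRUE,
             dimnames = list(as.character(1:4), c("Band", "Chess")))
y2 <- matrix(c(1, 0,
               0, 1,
               1, 1,
               0, 1), nrow = 4, byrow = TRUE,
             dimnames = list(as.character(1:4), c("Band", "Chess")))

moves <- t(y1) %*% y2     # [i, j]: in group i in year 1 and group j in year 2
size1 <- colSums(y1)      # year-1 group sizes (the diagonal of t(y1) %*% y1)
trans <- moves / size1    # row i divided by the year-1 size of group i
trans
rowSums(trans)
```

Because a student can belong to several groups at once, the rows here do not sum to 1, so each entry is best read as the share of a year-1 group's members who appear in a given year-2 group rather than as a strict Markov transition probability.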

Now add the matrices and divide by 2:

mobility_all <- (magmob96_97 + magmob97_98)/2

Now plot as with the event-overlap graphs!