<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Sol Messing</title>
<link>https://solomonmg.github.io/blog.html</link>
<atom:link href="https://solomonmg.github.io/blog.xml" rel="self" type="application/rss+xml"/>
<description>Projects and notes on data visualization, elections, social media, AI, and statistics.</description>
<generator>quarto-1.9.38</generator>
<lastBuildDate>Sun, 24 May 2026 00:00:00 GMT</lastBuildDate>
<item>
  <title>Interactive Web Replication &amp; Update of State Media Influence on LLMs</title>
  <dc:creator>Sol Messing</dc:creator>
  <link>https://solomonmg.github.io/blog/state-media-llm/</link>
  <description><![CDATA[ 





<p><a href="https://doi.org/10.1038/s41586-026-10506-7" class="btn-paper">PDF</a> <a href="https://github.com/state-media-influence-llm/replication" class="btn-paper">Code</a> <a href="https://twitter.com/SolomonMg" class="btn-paper">Follow</a></p>
<p><a href="https://hannahwaight.com">Hannah Waight</a>, <a href="https://eddieyang.net">Eddie Yang</a>, Yin Yuan, <a href="https://polisci.ucsd.edu/people/faculty/faculty-directory/currently-active-faculty/roberts-profile.html">Molly Roberts</a>, <a href="https://scholar.princeton.edu/bstewart">Brandon Stewart</a>, <a href="https://wp.nyu.edu/joshuatucker/">Josh Tucker</a> and I <a href="https://doi.org/10.1038/s41586-026-10506-7">published a paper in Nature (2026)</a> showing that state-controlled media in LLM training data influences how those models talk about politics.</p>
<!-- The paper shows (1) state-coordinated Chinese media are in open training corpora; (2) pretraining on that content moves a model's outputs in a pro-government direction, especially in the regime's language; and (3) commercial models answer political prompts more favorably toward the regime in that language. -->
<p>But we ran those audits in 2023-2024, and it took a long time to get the paper published!</p>
<p>We wanted to know what happens with the current generation of models, especially how they memorize state media talking points. I’d also recently seen a lot of criticism on Twitter of LLM papers that relied on legacy-generation models. And I wanted to show what being more pro-regime looks like in the actual text.</p>
<p>And you don’t land a <em>Nature</em> paper every day!</p>
<p>Now in the past, this is the kind of thing that I would get excited about but never actually execute because there’s a lot of slow and boring scaffolding work outside my expertise required to set this up. But I started using Claude Code late last year and of course CLI-AI tools are great for stuff like this.</p>
<p>In fact, Josh and I recently wrote a <a href="https://www.brookings.edu/articles/the-train-has-left-the-station-agentic-ai-and-the-future-of-social-science-research/">piece in Brookings</a> about how agentic AI might make it possible to do more public outreach like this.</p>
<p>So I built an <a href="https://state-media-influence-llm.github.io">interactive companion site</a> that replicates the core studies from the paper on current-generation models. The whole team gave feedback and what came out was pretty cool. It looked great and a few new and important findings emerged from the effort.</p>
<p>By and large, <a href="https://state-media-influence-llm.github.io/global.html">the core findings hold</a>. In 38 countries, where more than 70% of language speakers reside, there’s a strong negative correlation between press freedom and pro-government LLM valence (-0.89) relative to English in current-generation models. Every current-generation model still produces more pro-government answers in Chinese than in English about Chinese leaders and institutions. Memorization rates for state-coordinated media phrases continue to be at or above rates for general web text.</p>
<p>Two years of capability improvements and safety work have not changed the underlying issue.</p>
<p><strong>The highlights</strong></p>
<ul>
<li><p><strong><a href="https://state-media-influence-llm.github.io/memorization.html">Memorization effects are far larger for new models</a>.</strong> As expected, newer, larger models memorize state media-aligned text at <em>much</em> higher rates than do the models we tested in the paper. We prompted models with the first half of 2,000 distinctive phrases and measured how often each model completes the second half of the phrase nearly perfectly. Half of the phrases are from Chinese state media talking points (red) and half from general Chinese web text (green/blue).</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/blog/state-media-llm/memorization_rates.png" class="img-fluid figure-img"></p>
<figcaption>Memorization rates across paper-era and current-generation LLMs. State-coordinated media phrases in red, general CulturaX web text in green. Newer models complete the held-out half of each phrase at substantially higher rates than the paper-era models.</figcaption>
</figure>
</div></li>
<li><p><strong><a href="https://state-media-influence-llm.github.io/audit.html">Newer models tend to be even more positive toward China in Chinese</a>.</strong></p></li>
<li><p><strong><a href="https://state-media-influence-llm.github.io/cross_model_audit.html">DeepSeek V4 Pro is overwhelmingly pro-China</a>.</strong></p></li>
</ul>
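<p>To make the memorization measure in the first highlight concrete, here’s a minimal sketch of how a “nearly perfect completion” rate could be scored. This is an illustration, not the code from the replication repo: the similarity measure (one minus normalized edit distance), the 0.9 cutoff, and the example strings are all my own assumptions.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Share of phrases whose completion nearly matches the held-out second half.
# Similarity = 1 - normalized Levenshtein distance; the 0.9 cutoff is a
# made-up threshold, not a value taken from the paper.
memorization_rate &lt;- function(completions, targets, cutoff = 0.9) {
  d &lt;- mapply(function(a, b) adist(a, b)[1, 1], completions, targets)
  sim &lt;- 1 - d / pmax(nchar(completions), nchar(targets))
  mean(sim &gt;= cutoff)
}

# Toy example: one exact completion, one clear miss.
targets     &lt;- c("the quick brown fox", "jumps over the lazy dog")
completions &lt;- c("the quick brown fox", "leaps across a sleepy cat")
rate &lt;- memorization_rate(completions, targets)</code></pre></div>
<p>In practice you would compute this rate separately for the state media phrases and the general web phrases, then compare the two.</p>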
<p>DeepSeek V4 Pro is overwhelmingly pro-China in both languages. Spot-checking suggests it’s spouting state media talking points in English: “principles of socialism with Chinese characteristics” and “whole-process people’s democracy.” To examine DeepSeek’s pro-China valence relative to other models, I ran pairwise LLM-as-judge comparisons across nine current-generation models, holding language constant, and fit a Bradley-Terry model. DeepSeek V4 Pro ranks first on China-favorability in both English <em>and</em> Chinese.</p>
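<p>For readers who haven’t seen a Bradley-Terry model before, here’s a minimal sketch of the idea using base R. It treats each pairwise judgment as a logistic regression on the difference in model “abilities.” The model names and win counts are made up for illustration, and the actual analysis in the repo may be set up differently.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical win counts: wins[i, j] = times model i was judged more
# pro-government than model j (toy numbers, not results from the post).
models &lt;- c("model_a", "model_b", "model_c")
wins &lt;- matrix(c(0, 7, 8,
                 3, 0, 6,
                 2, 4, 0),
               nrow = 3, byrow = TRUE, dimnames = list(models, models))

# One row per unordered pair: +1 for the first model, -1 for the second.
pairs &lt;- t(combn(length(models), 2))
X &lt;- matrix(0, nrow = nrow(pairs), ncol = length(models),
            dimnames = list(NULL, models))
X[cbind(seq_len(nrow(pairs)), pairs[, 1])] &lt;- 1
X[cbind(seq_len(nrow(pairs)), pairs[, 2])] &lt;- -1

n_win  &lt;- wins[pairs]           # first model of the pair wins
n_loss &lt;- wins[pairs[, 2:1]]    # second model of the pair wins

# Bradley-Terry as a logistic regression: P(i beats j) = plogis(a_i - a_j).
# model_a is the reference category, so its ability is fixed at 0.
fit &lt;- glm(cbind(n_win, n_loss) ~ X[, -1] - 1, family = binomial)
ability &lt;- c(0, coef(fit))
names(ability) &lt;- models
ranking &lt;- names(sort(ability, decreasing = TRUE))</code></pre></div>
<p>The estimated abilities are only identified relative to the reference model, so what matters is the ordering and the gaps between models, not the raw values.</p>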
<!-- 828 politically sensitive prompts about leaders, countries, and institutions across six countries — once in English, once in Chinese. A six-judge LLM panel (GPT-OSS-120B, GPT-5.2, Claude Opus 4.6, Grok 4, Gemini 3.1 Pro, DeepSeek V3.2) scores which language produces the more pro-government response. I depart from the paper on one methodological choice: prompts where the model refuses in either language are excluded, since refusals predictably lose to substantive responses on the other side and confound the underlying valence question (DeepSeek V4 Pro and Gemini 3.1 Pro each refuse about a quarter of the prompts in at least one language). -->
<p><strong>Code and data:</strong> <a href="https://github.com/state-media-influence-llm/replication">github.com/state-media-influence-llm/replication</a></p>
<p><strong>Paper:</strong> <a href="https://doi.org/10.1038/s41586-026-10506-7">Waight et al.&nbsp;2026, <em>Nature</em></a> (<a href="https://rdcu.be/fiyrF">complimentary copy</a>)</p>
<p><strong>Companion site:</strong> <a href="https://state-media-influence-llm.github.io">state-media-influence-llm.github.io</a></p>



 ]]></description>
  <guid>https://solomonmg.github.io/blog/state-media-llm/</guid>
  <pubDate>Sun, 24 May 2026 00:00:00 GMT</pubDate>
  <media:content url="https://solomonmg.github.io/blog/state-media-llm/featured.png" medium="image" type="image/png" height="123" width="144"/>
</item>
<item>
  <title>An Early Election 2024 Forecast</title>
  <dc:creator>Sol Messing</dc:creator>
  <link>https://solomonmg.github.io/blog/election-projection-regularized-swing/</link>
  <description><![CDATA[ 





<p>Early projections for 2024 based on previous Presidential and House returns slightly favor Republicans. These projections are completely unrelated to Biden’s recent polling numbers.</p>
<!-- ![](/img/Map_for_2024_JS_Swing.png "A simple forecasted map for 2024. Created https://www.270towin.com/maps/WWE2B.") -->
<p>Here’s the story behind this approach: In early 2020, I ran battleground state election forecasts for Acronym. The results suggested Georgia would be extremely competitive, and Acronym spent more money there than many other non-profit actors. After the election, we could see that those projections had much lower forecasting error than polling data, as I describe in <a href="https://solomonmg.github.io/post/what-the-polls-got-wrong-in-2020/">this post</a>.</p>
<p>Because this approach does not use polling data, it’s not susceptible to any of the potential problems with polls I talk about in that post: undecided voters breaking late, low education non-response, bad likely voter modeling, partisan non-response, shy Trumpers, etc.</p>
<p>The core idea behind this approach is a fact not emphasized enough in most stats/ML courses: if you’re going to try to predict something, it’s very hard to do better than using the same variable at t - 1 if you can. And we can. This approach goes one step further: it looks at the direction that variable has been moving and assumes that things are likely to keep moving in that same direction.</p>
<p>What that means for presidential election forecasts: for each state, estimate the “swing” from 2016 to 2020 for president and 2018-2022 for the U.S. house; then simply add that to 2020 presidential returns. Those estimates of state-level swing are regularized (mathematically “nudged”) toward national trends, which you’ll like if you believe “uniform swing” is particularly important. The projected state-level swing is weighted 60-40 toward presidential results.</p>
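<p>Here’s a minimal sketch of that recipe with toy numbers. The states, vote shares, and the shrinkage constant <code>lambda</code> are all made up, and the simple shrink-toward-the-mean function is a stand-in for the regularization, not the code in the repo.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Toy Democratic vote shares (percent) for three hypothetical states.
dat &lt;- data.frame(
  state      = c("state_1", "state_2", "state_3"),
  pres_2016  = c(47.0, 49.5, 51.0),
  pres_2020  = c(49.0, 50.0, 50.5),
  house_2018 = c(50.0, 52.0, 53.0),
  house_2022 = c(48.0, 49.0, 52.5)
)

# Raw swings: change between the last two cycles of each type.
dat$pres_swing  &lt;- dat$pres_2020  - dat$pres_2016
dat$house_swing &lt;- dat$house_2022 - dat$house_2018

# Pull each state's swing part of the way toward the cross-state mean.
# lambda = 0 means no shrinkage; lambda = 1 means pure uniform swing.
# The default of 0.5 is an arbitrary illustrative value.
shrink_to_mean &lt;- function(x, lambda = 0.5) {
  mean(x) + (1 - lambda) * (x - mean(x))
}

# 60-40 weighting toward the presidential swing, added to 2020 returns.
dat$proj_2024 &lt;- dat$pres_2020 +
  0.6 * shrink_to_mean(dat$pres_swing) +
  0.4 * shrink_to_mean(dat$house_swing)</code></pre></div>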
<p>Here’s a cleaner plot showing the actual forecast values in potential 2024 battleground states:</p>
<p><img src="https://solomonmg.github.io/img/EstStateDemVote.png" title="A simple forecast for 2024 battleground states. Hat tip to [Tom Cunningham](https://tecunningham.github.io) who suggested this plot design." class="img-fluid"></p>
<p>I should now point to a link to the data and code: https://github.com/SolomonMg/election_projection_regularized_swing, and thank the <a href="https://electionlab.mit.edu">MIT Election Data + Science Lab</a> for curating <a href="https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/42MVDX">these</a> <a href="https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/IG0UN2">data</a>.</p>
<section id="electoral-math" class="level3">
<h3 class="anchored" data-anchor-id="electoral-math">Electoral Math:</h3>
<p>I’m going to rely on www.270towin.com to translate these projections into an electoral map. A better way to do this might be to come up with conservative estimates of error and simulate a few thousand elections, but I’m not estimating an extremely rigorous Bayesian model nor including enough extant data to really justify a FiveThirtyEight style forecast.</p>
<p>If you call anything lower than a 3 point margin either way a “tossup,” here’s what the electoral map looks like:</p>
<p><img src="https://solomonmg.github.io/img/Map_for_2024_JS_Swing.png" title="Electoral map for 2024, lower than a 3% margin is a tossup. Created https://www.270towin.com/maps/WWE2B." class="img-fluid"></p>
<p>That looks OK for Biden, but if you really trust this approach, you might want to say anything lower than 2% is a tossup. Then the electoral math looks very bad for Biden:</p>
<p><img src="https://solomonmg.github.io/img/Map_for_2024_JS_margin2Swing.png" title="Electoral map for 2024, lower than a 2% margin is a tossup. Created https://www.270towin.com/maps/WWExg." class="img-fluid"></p>
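<p>The tossup rule behind these two maps is easy to sketch. The function below assumes the projections are Democratic two-party vote shares in percent, so the margin is twice the distance from 50; that assumption, the function name, and the example shares are mine.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Label a state from its projected Democratic two-party share (percent).
classify_state &lt;- function(dem_share, tossup_margin = 3) {
  margin &lt;- 2 * dem_share - 100   # Dem minus Rep margin, in points
  ifelse(abs(margin) &lt; tossup_margin, "tossup",
         ifelse(margin &gt; 0, "dem", "rep"))
}

# Hypothetical projected shares for three states.
shares &lt;- c(51.2, 48.0, 50.4)
calls_3pt &lt;- classify_state(shares, tossup_margin = 3)
calls_2pt &lt;- classify_state(shares, tossup_margin = 2)</code></pre></div>
<p>Tightening the threshold from 3 to 2 points moves the first toy state (a 2.4 point margin) out of the tossup column, which is the same mechanism that changes the map above.</p>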
</section>
<section id="observation-polarization-and-accuracy" class="level3">
<h3 class="anchored" data-anchor-id="observation-polarization-and-accuracy">Observation: Polarization and Accuracy</h3>
<p>These projections essentially assume party identification, demographic trends, and voting behavior will mostly continue in the same general direction as in the past. They should have a lot of appeal if you think polarization means most people have already made up their minds about who to vote for for President, that Presidential campaign effects are relatively small (in equilibrium at least), and/or that “demographics are destiny.” What’s more, the results are regularized toward national trends, which you’ll like if you believe that <a href="https://press.uchicago.edu/ucp/books/book/chicago/I/bo27596045.html">local politics has been “nationalized,” as Dan Hopkins argues</a> and thus that <a href="https://projects.fivethirtyeight.com/2020-swing-states/">“uniform swing” in the electorate is an increasingly important factor explaining state-level election results</a>, even though Florida bucked the national trend in 2020.</p>
<p>In fact, over time, as polarization seems to worsen, this approach improves in accuracy:</p>
<p><img src="https://solomonmg.github.io/img/Battleground_MAE_projections.png" title="Backtested forecasts improve in accuracy over time, and 2020 was far easier to predict than past elections." class="img-fluid"></p>
<p>However, these projections do not account for events since 2022. Older voters pass away and younger voters become eligible to vote, changing the makeup of the electorate. Public opinion and sentiment may change in response to economic conditions (inflation/income/unemployment/etc.), policy developments (e.g., related to abortion), international affairs like the Gaza conflict, or candidate attributes like Biden’s age or Trump’s legal troubles.</p>
</section>
<section id="observation-patterns-in-u.s.-elections" class="level3">
<h3 class="anchored" data-anchor-id="observation-patterns-in-u.s.-elections">Observation: Patterns in U.S. Elections</h3>
<p>These projections also do not explicitly model well-known voting patterns, instead relying on change from one cycle to another to get reasonable estimates. The most notable trend is that the president’s party almost always loses seats in the house in midterm elections. https://www.jstor.org/stable/2130810 https://fivethirtyeight.com/features/why-the-presidents-party-almost-always-has-a-bad-midterm/</p>
<p>Because the model only looks at the state-level difference between the last two <em>midterm</em> cycles, these projections capture how midterm returns <em>change</em> in each state, which goes some way toward correcting for the pattern of consistently lower midterm performance, and in part may reflect changes in sentiment toward the president.</p>
<p>A less reliable trend that’s held since FDR’s time is that incumbent presidents have tended to get a higher percent of the popular vote in their election for a second term (Obama in 2012 was a notable exception). https://www.presidency.ucsb.edu/statistics/data/presidential-election-mandates What’s more, house midterm results seem to be particularly bad just before an incumbent is voted out of office. It’s not clear if this is a bug or a feature, or how reliably this would be picked up using this method, but it’s worth pointing out.</p>
</section>
<section id="observation-accuracy-over-previous-election-results" class="level3">
<h3 class="anchored" data-anchor-id="observation-accuracy-over-previous-election-results">Observation: Accuracy over Previous Election Results</h3>
<p>If it’s hard to do better than election returns at t - 1, does this approach actually do better? Yes, by a little. Including all states, these projections have lower mean absolute error (MAE). For some reason these projections miss badly in 2004, the earliest year for which I can compute them using the MIT data. Excluding that year shows they do in fact do quite a bit better than relying on previous presidential election results alone.</p>
<p><img src="https://solomonmg.github.io/img/AllStates_MAE_projections.png" title="Accuracy over time for projections compared with simply using the previous election." class="img-fluid"></p>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb1-1"><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> bt_dat <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">group_by</span>(proj_type) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summarise</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(mae))</span>
<span id="cb1-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># A tibble: 2 × 2</span></span>
<span id="cb1-3">  proj_type <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mean(mae)</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span></span>
<span id="cb1-4">  <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>chr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span>           <span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">&lt;</span>dbl<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span></span>
<span id="cb1-5"><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> prev pres        <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.11</span></span>
<span id="cb1-6"><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> proj             <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.65</span></span>
<span id="cb1-7"></span>
<span id="cb1-8"><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> bt_dat <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(years <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2004</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">group_by</span>(proj_type) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summarise</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(mae))</span>
<span id="cb1-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># A tibble: 2 × 2</span></span>
<span id="cb1-10">  proj_type <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mean(mae)</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span></span>
<span id="cb1-11">  <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>chr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span>           <span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">&lt;</span>dbl<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span></span>
<span id="cb1-12"><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> prev pres        <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.36</span></span>
<span id="cb1-13"><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> proj             <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.51</span></span></code></pre></div>
<p>Results are more subtle if we restrict to battleground states:</p>
<div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb2-1"><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> bt_dat <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">group_by</span>(proj_type) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summarise</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(mae))</span>
<span id="cb2-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># A tibble: 2 × 2</span></span>
<span id="cb2-3">  proj_type <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mean(mae)</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span></span>
<span id="cb2-4">  <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>chr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span>           <span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">&lt;</span>dbl<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span></span>
<span id="cb2-5"><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> prev pres        <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.19</span></span>
<span id="cb2-6"><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> proj             <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.16</span></span>
<span id="cb2-7"><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> bt_dat <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(years <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2004</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">group_by</span>(proj_type) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summarise</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(mae))</span>
<span id="cb2-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># A tibble: 2 × 2</span></span>
<span id="cb2-9">  proj_type <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mean(mae)</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span></span>
<span id="cb2-10">  <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>chr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span>           <span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">&lt;</span>dbl<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span></span>
<span id="cb2-11"><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> prev pres       <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.28</span> </span>
<span id="cb2-12"><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> proj            <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.995</span></span></code></pre></div>
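<p>For reference, here’s a minimal sketch of how an MAE comparison like the ones above can be computed from backtest rows. The data frame, its column names, and every number in it are hypothetical; the real <code>bt_dat</code> in the repo is structured differently.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Toy backtest rows: one per state-year, with the actual Democratic share
# and two competing predictions (all values are made up).
bt &lt;- data.frame(
  year      = c(2016, 2016, 2020, 2020),
  actual    = c(48.0, 52.0, 49.5, 51.0),
  prev_pres = c(50.0, 53.0, 48.0, 52.0),  # previous presidential result
  proj      = c(49.0, 52.5, 49.0, 51.5)   # swing-based projection
)

# Mean absolute error: average size of the miss, ignoring direction.
mae &lt;- function(pred, actual) mean(abs(pred - actual))

mae_by_type &lt;- c(
  prev_pres = mae(bt$prev_pres, bt$actual),
  proj      = mae(bt$proj, bt$actual)
)</code></pre></div>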
</section>
<section id="observation-regularization-toward-0-or-the-mean" class="level3">
<h3 class="anchored" data-anchor-id="observation-regularization-toward-0-or-the-mean">Observation: Regularization toward 0 or the Mean?</h3>
<p>Here’s the map with shrinkage toward 0, which will move the estimates toward the prior year’s election. Biden does worse in WI and slightly worse in AZ and PA, because the presidential swing estimates get pulled down toward zero instead of up toward the nation-wide state-level mean (3.5%). But he does better in NC, where the relatively good house results get pulled toward zero instead of down to the state-level average midterm swing (-11.5%).</p>
<p><img src="https://solomonmg.github.io/img/Map_for_2024_JS_0_Swing.png" title="A simple forecasted map for 2024, shrinkage toward zero. Created https://www.270towin.com/maps/66rZw." class="img-fluid"></p>
<p>However, based on my updated backtesting, the MAE estimates are worse when you shrink toward zero, which is what I did back in 2020. This makes me feel good because the mathematical/statistical theory says that shrinking toward the group mean should produce high quality estimates, while there’s not much theory that suggests shrinking toward zero should improve estimation.</p>
</section>
<section id="methological-details" class="level3">
<h3 class="anchored" data-anchor-id="methological-details">Methodological Details</h3>
<p>For each state, the method estimates the “swing” from 2016 to 2020 for president and 2018-2022 for the U.S. house; then simply adds that to 2020 presidential returns. The projected state-level swing is weighted 60-40 toward presidential results.</p>
<p>Now, the tricky bit is that I estimate “swing” using a James-Stein-adjusted state-level slope. This method “shrinks” each slope toward 50-50 or toward the average slope. 50-50 is what I used in 2020, but recent corrections I’ve made to my backtesting scripts reveal that it has a slightly higher mean absolute error going back to the 2004 election.</p>
<p>I’ve since updated the approach in a number of important ways, based on backtesting (looking at how well the method performs on past elections). I now regularize (or “shrink”) fewer quantities and do so toward the mean instead of toward zero. I should also note that the original code I used had a few minor errors, which I’ve since fixed.</p>
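<p>To illustrate the difference between shrinking toward zero and shrinking toward the mean, here’s a minimal sketch of the textbook positive-part James-Stein estimator in both forms. It is not the code from the repo, which adjusts slopes and may differ in its details. The sampling variance <code>sigma2</code> is assumed known here, and its value and the example swings are made up.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Positive-part James-Stein shrinkage toward zero.
# x: noisy estimates (e.g., state swings); sigma2: assumed common variance.
james_stein_to_zero &lt;- function(x, sigma2) {
  k &lt;- length(x)
  stopifnot(k &gt;= 3)
  shrink &lt;- max(0, 1 - (k - 2) * sigma2 / sum(x^2))
  shrink * x
}

# Positive-part James-Stein shrinkage toward the group mean.
# Estimating the mean costs one degree of freedom, hence k - 3.
james_stein_to_mean &lt;- function(x, sigma2) {
  k &lt;- length(x)
  stopifnot(k &gt;= 4)
  xbar &lt;- mean(x)
  shrink &lt;- max(0, 1 - (k - 3) * sigma2 / sum((x - xbar)^2))
  xbar + shrink * (x - xbar)
}

# Hypothetical state-level swings, in points.
swings &lt;- c(2.0, 0.5, -0.5, 3.0, 1.0)
to_zero &lt;- james_stein_to_zero(swings, sigma2 = 1)
to_mean &lt;- james_stein_to_mean(swings, sigma2 = 1)</code></pre></div>
<p>Shrinking toward zero pulls every state back toward its previous result, while shrinking toward the mean pulls every state toward the average swing, which is closer to the “uniform swing” idea.</p>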


</section>

 ]]></description>
  <guid>https://solomonmg.github.io/blog/election-projection-regularized-swing/</guid>
  <pubDate>Thu, 11 Jan 2024 00:00:00 GMT</pubDate>
  <media:content url="https://solomonmg.github.io/blog/election-projection-regularized-swing/featured.png" medium="image" type="image/png" height="83" width="144"/>
</item>
<item>
  <title>Disaggregating ‘Ideological Segregation’</title>
  <dc:creator>Sol Messing</dc:creator>
  <link>https://solomonmg.github.io/blog/thoughts-on-election-2020/</link>
  <description><![CDATA[ 





<section id="tldr" class="level3">
<h3 class="anchored" data-anchor-id="tldr">TLDR:</h3>
<ol type="1">
<li>[UPDATED SEPT 30] Yesterday, Science <a href="../../pdf/science.adk1211.pdf">published a letter I wrote</a> arguing that there is little evidence of algorithmic bias in Facebook’s feed ranking system that would serve to increase ideological segregation, also known as <a href="https://books.google.com/books/about/The_Filter_Bubble.html?id=Qn2ZnjzCE3gC">the “Filter Bubble” hypothesis</a>.</li>
<li>This contradicts claims in <a href="https://www.science.org/doi/full/10.1126/science.ade7138">González-Bailón et al 2023</a> that Newsfeed ranking increases ideological segregation. This claim was the main piece of evidence in the Science <a href="https://www.science.org/toc/science/381/6656">Special Issue on Meta</a> that might support the controversial cover that suggested that Meta’s algorithms are “Wired to Split.”</li>
<li>The issue is that while domain-level analysis suggests feed-ranking increases ideological segregation, URL-level analysis shows <em>no difference</em> in ideological segregation before and after feed-ranking.</li>
<li>And we should strongly prefer their URL-level analysis. Domain-level analysis <em>effectively mislabels highly partisan content</em> as “moderate/mixed,” especially on websites like YouTube, Reddit, and Twitter (<a href="https://ori.hhs.gov/education/products/niu_authorship/mistakes/09mistake-a.htm">aggregation bias/ecological fallacy</a>).</li>
<li>Interestingly, the authors seem to agree—the discussion section points out problems with domain-level analysis.</li>
<li>Another <em>Science</em> paper from the same issue, <a href="https://www.science.org/doi/10.1126/science.abp9364">Guess et al 2023</a>, shows (in the SM) that Newsfeed ranking actually <em>decreases</em> exposure to <em>political content</em> from like-minded sources compared with reverse-chronological feed ranking.</li>
<li>The evidence in the 4 recent papers is not consistent with a meaningful Filter Bubble effect in 2020; nor does it support the notion that Meta’s algorithms are “Wired to Split.”</li>
<li>Furthermore, domain-level aggregation bias is a big issue in a great deal of past research on ideological segregation, because domain-level analysis <em>understates</em> media polarization. Because <a href="https://www.science.org/doi/full/10.1126/science.ade7138">González-Bailón et al 2023</a> gives both URL- and domain-level estimates, we can see the magnitude of aggregation bias. It’s huge.</li>
<li>I make a number of other observations about what we know about whether social media is polarizing and discuss implications for the controversial Science cover and Meta’s flawed claims that this research is exculpatory. <!-- 6. None of this is the last word on social media algorithms---as [González-Bailón et al 2023](https://www.science.org/doi/full/10.1126/science.ade7138) point out, we need additional research on friend/page/group/etc recommender systems, which may polarize the graph itself.  --></li>
</ol>
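<p>To make the aggregation-bias point in items 3, 4, and 8 concrete, here’s a minimal sketch using an isolation index in the style of Gentzkow and Shapiro: conservatives’ average exposure to conservative audiences minus liberals’ average exposure to conservative audiences. The domain name and visit counts are made up, and the paper’s own segregation measure may be computed differently.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Isolation index over a set of units (URLs or domains).
# cons, lib: visits to each unit by conservatives and by liberals.
isolation_index &lt;- function(cons, lib) {
  share_cons &lt;- cons / (cons + lib)   # conservative share of each audience
  sum(cons / sum(cons) * share_cons) - sum(lib / sum(lib) * share_cons)
}

# Toy visit counts for two URLs hosted on the same hypothetical domain:
# one is seen only by conservatives, the other only by liberals.
urls &lt;- data.frame(
  domain = c("video-site.example", "video-site.example"),
  cons   = c(100, 0),
  lib    = c(0, 100)
)

# URL-level analysis: the two audiences never overlap.
url_level &lt;- isolation_index(urls$cons, urls$lib)

# Domain-level analysis: pooling the URLs makes the domain look mixed.
by_domain &lt;- aggregate(cbind(cons, lib) ~ domain, data = urls, FUN = sum)
domain_level &lt;- isolation_index(by_domain$cons, by_domain$lib)</code></pre></div>
<p>In this extreme toy case the URL-level index is 1 (complete segregation) while the domain-level index is 0, because the pooled domain looks like a perfectly “moderate/mixed” source.</p>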
</section>
<section id="introicymi" class="level3">
<h3 class="anchored" data-anchor-id="introicymi">Intro/ICYMI</h3>
<details>
<summary>
Click to expand
</summary>
<p>Last week saw the release of a series of <a href="https://www.science.org/toc/science/381/6656">excellent papers in <em>Science</em></a>. I was particularly interested in <a href="https://www.science.org/doi/full/10.1126/science.ade7138">González-Bailón et al 2023</a>, which measures “ideological segregation.” This concept is based on <a href="https://web.stanford.edu/~gentzkow/research/echo_chambers.pdf">Matt Gentzkow and Jesse Shapiro’s 2011 work</a>. As they note, “The index ranges from 0 (all conservative and liberal visits are to the same outlet) to 1 (conservatives only visit 100% conservative outlets and liberals only visit 100% liberal outlets).”</p>
<p>This paper also replicates and extends my own work with Eytan Bakshy and Lada Adamic, also published in <a href="https://solomonmg.github.io/pdf/Science-2015-Bakshy-1130-2.pdf"><em>Science</em> in 2015</a>.</p>
<p>To be clear, <a href="https://www.science.org/doi/full/10.1126/science.ade7138">González-Bailón et al 2023</a> goes a lot further, examining how these factors vary over time, investigating clusters of isolated partisan media organizations, and tracing patterns in the consumption of misinformation. They find (1) ideological segregation is high; (2) ideological segregation “increases after algorithmic curation” consistent with the “Filter Bubble” hypothesis; (3) there is a substantial right wing “echo chamber” in which conservatives are essentially siloed from the rest of the site (4) where misinformation thrives.</p>
<p>I’ve had many years to think about issues related to these questions, after working with similar data from 2012 in my dissertation, and co-authoring a <a href="https://solomonmg.github.io/pdf/Science-2015-Bakshy-1130-2.pdf">Science paper</a> while working at Facebook using 2014 data. I also saw the design (but not results) presented at the <a href="https://www.ssrc.org/programs/digital-platforms-initiative/2023-ssrc-workshop-on-the-economics-of-social-media/">2023 SSRC Workshop on the Economics of Social Media</a>, though I did not notice these issues until I saw the final paper.</p>
<p>I put together these thoughts after discussion and feedback from Dean Eckles and Tom Cunningham, former colleagues at Stanford, Facebook, and Twitter.</p>
<!-- [Click here to read the backstory on Echo Chambers, Filter Bubbles, and Selective Exposure, including where my own past work fits in](#a-brief-history-of-echo-chambers-filter-bubbles-and-selective-exposure) -->
</details>
</section>
<section id="a-brief-history-of-echo-chambers-filter-bubbles-and-selective-exposure" class="level3">
<h3 class="anchored" data-anchor-id="a-brief-history-of-echo-chambers-filter-bubbles-and-selective-exposure">A Brief History of Echo Chambers, Filter Bubbles, and Selective Exposure</h3>
<details>
<summary>
Click to expand
</summary>
<!-- 20 years ago I worked as a foreign media analyst and I noticed that a great deal of misinformation circulating in news websites in Indonesia and the Middle East came from Alex Jones' InfoWars (yes he's been around for a long time). I became fascinated with the question of how people get their media and how technology changes that.  -->
<p>The conventional academic wisdom when I started my PhD was that we shouldn’t expect to see much in the way of media effects (<a href="https://books.google.com/books/about/The_Effects_of_Mass_Communication.html?id=CzcGAQAAIAAJ">Klapper 1960</a>) because people tended to “select into” content that reinforced their views (<a href="https://www.jstor.org/stable/2747198">Sears and Freedman 1967</a>).</p>
<p>In 2007, Cass Sunstein wrote <a href="https://www.jstor.org/stable/j.ctt7tbsw">“Republic.com 2.0”</a>, which warned that the internet could allow us to even more easily isolate ourselves into “information cocoons” and “echo chambers.” Technology allows us to “filter” exactly what we want to see, and design our own programming. Cass also suggested this could lead to polarization.</p>
<!-- What's more, new media should be expected to further strengthen this "minimal effects" hypothesis ([Bennett and Iyengar 2008](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=a2fcebd37ba1e919f662a287841b0356603cd0a4)).  -->
<p>In 2009-2010, Sean Westwood and I ran a series of studies suggesting that <a href="https://journals.sagepub.com/doi/10.1177/0093650212466406">popularity and social cues</a> in news aggregators and social media websites might be a way out of selective news consumption: we might be <em>more</em> exposed to cross-cutting views on platforms that feature a social component, and furthermore, this social component seemed to be more important than the media “source” label. That was great in the abstract, but what happens on actual websites that people use?</p>
<p>The extent to which widely used social media platforms might allow us to exit “echo chambers” depended on the extent of cross-partisan friendships and interactions on platforms like Facebook. I did a PhD internship at Facebook to look into that, and the question of whether encountering <a href="https://www.dropbox.com/s/nu39148ukbab34r/CH7brief.pdf?raw=true">political news on social media was ideologically polarizing</a> (note that the results are not as well powered as I would like). That work evolved into dissertation chapters and eventually our <a href="https://solomonmg.github.io/pdf/Science-2015-Bakshy-1130-2.pdf">Science paper</a>.</p>
<p>Now <a href="https://en.wikipedia.org/wiki/Eli_Pariser">Eli Pariser</a> had just published a book on “<a href="https://books.google.com/books/about/The_Filter_Bubble.html?id=Qn2ZnjzCE3gC">Filter Bubbles</a>” suggesting that media technologies like Google Search and Facebook NewsFeed not only allowed us to ignore the “other side,” but actively filtered out search results and friends’ posts with perspectives from the other side.</p>
<p>What’s more, a lot of people in the Human Computer Interaction (HCI) world were very interested in how one might examine this empirically, and I started collecting data with Eytan Bakshy that would do just that. The paper would allow us to quantify echo chambers created by our network of contacts, filter bubbles, and partisan selective exposure in social media by looking at exposure to <a href="https://www.jstor.org/stable/3117813">ideologically “cross-cutting” content</a>.</p>
<p>We defined a few key components:</p>
<p><strong>Random</strong> - The set of content (external URLs) shared on Facebook writ large.</p>
<p><strong>Potential</strong> - The set of content shared by one’s friends</p>
<p><strong>Exposed</strong> - The set of content appearing in one’s Newsfeed.</p>
<p><strong>Selected</strong> - The set of content one clicks on.</p>
<p><strong>Endorsed</strong> - The set of content one ‘likes’.</p>
<p><img src="https://solomonmg.github.io/img/Ch6Fig6.5.jpg" title="Figure 6 from Messing 2013: Our original analysis of the distribution of ideologically-aligned content on Facebook using data from 2012." class="img-fluid"></p>
<p><img src="https://solomonmg.github.io/img/ScienceBakshyFig3B.jpg" title="Figure 3B from Bakshy et al 2015: Data from 2014 quantifying cross-cutting content on Facebook." class="img-fluid"></p>
<p>Unlike when I started to study social media in 2009 and no one was interested, in 2014, people understood that social media was an important force that was reshaping at least media if not society more broadly. And Science was particularly interested in the role algorithms played in this environment.</p>
<p>We published our results in <a href="https://solomonmg.github.io/pdf/Science-2015-Bakshy-1130-2.pdf">Science</a>, and the response from many was “well this is smaller than expected,” including a piece in <em>Wired</em> from <a href="https://www.wired.com/2015/05/did-facebooks-big-study-kill-my-filter-bubble-thesis/">Eli Pariser himself</a>. David Lazer, one of the lead authors of <a href="https://www.science.org/doi/full/10.1126/science.ade7138">González-Bailón et al 2023</a>, also <a href="https://education.biu.ac.il/sites/education/files/shared/science-2015-lazer-1090-1.pdf">wrote a perspective in <em>Science</em></a>.</p>
<p>However, the piece immediately drew a great deal of criticism. This is in part because I wrote that exposure was driven more by individual choices than algorithms, which ignored the potential influence of friend recommendation systems and, more broadly, swept aside the extent to which interfaces structure interactions on websites, which my own dissertation work had shown was quite substantial.</p>
<p>The study also had important limitations, many of which I addressed in a post suggesting how <a href="https://solomonmg.github.io/post/exposure-to-ideologically-diverse-response/">future work could provide a more robust picture</a>.</p>
<p>I was (much later) tech lead for Social Science One (2018-2020), which gave external researchers access to data (the <a href="../../pdf/Facebook_DP_URLs_Dataset.pdf">‘Condor’ URLs data set</a>) via differential privacy. My goal was to enable the kind of research done in <a href="https://www.science.org/doi/full/10.1126/science.ade7138">González-Bailón et al 2023</a>. However, it soon became clear that Social Science One’s data sharing model, and differential privacy in particular, was not suitable for ground-breaking research.</p>
<p>I personally advocated (with Facebook Researcher and longtime colleague <a href="https://twitter.com/anniefranco">Annie Franco</a>) for the collaboration model used in the Election 2020 project, wherein external researchers would collaborate with Facebook researchers. That model would have to shield the research from any interference from Facebook’s Communications and Policy arm, which might attempt to interfere with the interpretation or publication of any resulting papers, which would create ethical conflicts.</p>
<p>I advocated for pre-registration to accomplish this, not merely to ensure scientific rigor but to protect against conflicts of interest and selective reporting of results. However, I left Facebook in January of 2020 and have not been deeply involved in the project since.</p>
<p><a href="https://www.science.org/doi/full/10.1126/science.ade7138">González-Bailón et al 2023</a> does in fact accomplish most if not all of what I recommended future research do and goes much further than our original study, and the authors should be applauded for it. The paper shows that on Facebook (1) ideological segregation is high (in fact it’s arguably higher than implied in the paper); (2) there is a substantial right wing “echo chamber” in which conservatives are siloed from the rest of the site (3) where misinformation thrives. When they start to talk about the filter bubble though, things get more complicated.</p>
</details>
</section>
<section id="is-there-a-filter-bubble-on-facebook" class="level3">
<h3 class="anchored" data-anchor-id="is-there-a-filter-bubble-on-facebook">Is there a Filter Bubble on Facebook</h3>
<p>Are Facebook’s algorithms “Wired to split” the public? This question is at the core of the recent <em>Science</em> issue, and it’s hotly debated in the field of algorithmic bias. A suspicion that the answer is “yes” has motivated a number of policy and regulatory actions. Armed with unprecedented data, <a href="https://www.science.org/doi/full/10.1126/science.ade7138">González-Bailón et al 2023</a> seeks to answer this and other related questions.</p>
<p>There are three relevant claims that answer this question in the text: (1) “ideological segregation is high and increases as we shift from potential exposure to actual exposure to engagement” in the abstract, (2) “The algorithmic promotion of compatible content from this inventory is positively associated with an increase in the observed segregation as we move from potential to exposed audiences” in the discussion section, and (3) “Segregation scores drawn from exposed audiences are higher than those based on potential audiences … (the difference between potential and engaged audiences is only visible at the domain level),” in the caption of Figure 2.</p>
<p>These statements are generally confirmatory of algorithmic segregation, aka the <a href="https://books.google.com/books/about/The_Filter_Bubble.html?id=Qn2ZnjzCE3gC">Filter Bubble hypothesis</a>.</p>
<p>But look at Figure 2, on which these claims seem to be based. Figure 2B shows an increase in observed segregation as you move from potential to exposed audiences. BUT Figure 2C, which describes the same phenomenon, does <em>not</em> (as noted in the caption).</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/BailonFig2BC.jpeg" title="When viewed at the level of the URL (2C), the study is consistent with a negligible Filter Bubble effect" class="img-fluid figure-img"></p>
<figcaption>Figure 2 (B &amp; C) from González-Bailón et al 2023: When viewed at the level of the URL (2C), the study is consistent with a <em>negligible</em> Filter Bubble effect</figcaption>
</figure>
</div>
</section>
<section id="so-which-is-it" class="level3">
<h3 class="anchored" data-anchor-id="so-which-is-it">So which is it?</h3>
<p>First, what’s the difference between these two figures? 2B aggregates things at the domain level (e.g., www.yahoo.com) while 2C aggregates things at the URL level (e.g., https://www.yahoo.com/news/pence-trumps-indictment-anyone-puts-002049678.html). That means 2B treats all shares from yahoo.com the same, while 2C looks at each story separately.</p>
<p>If you’re like me, when you think of political news, you have in mind domains like FoxNews.com or MSNBC.com, where it’s likely that the website itself has a distinct partisan flavor.</p>
<p>But YouTube.com and Twitter.com, which obviously host a ton of far left, far right, and “mixed” content, both appear in the “Top 100 Domains by Views” in the study’s SM. And indeed, Figure S10 below shows that there is an array of domains that host some far right content and some far left content.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/BailonFigS10.jpeg" title="Some domains host some far right content (URLs) and some far left content." class="img-fluid figure-img"></p>
<figcaption>Figure S10 from González-Bailón et al 2023: Some domains host some far right content and some far left content.</figcaption>
</figure>
</div>
<!-- There's also just something about content that makes an argument that we seem to want to share, as suggested by analysis in my [dissertation](https://www.dropbox.com/s/zfw1d9j60hqjil7/sudiss.pdf?raw=true), which shows that NYT editorials are more likely than content from other sections to be shared (via email). Consider two OpEds at the NYT---one by conservative columnist Bret Stevens, the other by well-known liberal Nicholas Kristof. The former is going to be shared more by conservatives, the latter more by liberals.  -->
<p>But even if we’re talking about NYTimes.com, it’s not hard to see that, for example, conservatives might be more likely to share conservative Op Eds from Bret Stephens, while liberals may be more likely to share Op Eds from Nicholas Kristof.</p>
<p>If you aggregate your analysis to the domain level, you’ll miss this aspect of media polarization.</p>
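<p>A toy calculation makes the aggregation problem concrete. The sketch below is illustrative only: the domains, URLs, and view counts are invented, and the index follows the Gentzkow and Shapiro isolation formula rather than the exact pipeline used in the paper. Each hypothetical domain hosts one story read mostly by conservatives and one read mostly by liberals.</p>

```python
from collections import defaultdict

# Hypothetical toy audience counts: (domain, url, conservative views, liberal views).
VIEWS = [
    ("mixed.example", "/right-leaning-video", 90, 10),
    ("mixed.example", "/left-leaning-video", 10, 90),
    ("paper.example", "/conservative-op-ed", 80, 20),
    ("paper.example", "/liberal-op-ed", 20, 80),
]


def isolation_index(rows, key):
    """Isolation index after grouping rows into units with `key`.

    Returns the average conservative share of the units conservatives see
    minus the average conservative share of the units liberals see.
    """
    cons = defaultdict(int)
    lib = defaultdict(int)
    for row in rows:
        unit = key(row)
        cons[unit] += row[2]
        lib[unit] += row[3]

    total_cons = sum(cons.values())
    total_lib = sum(lib.values())

    cons_exposure = 0.0
    lib_exposure = 0.0
    for unit in cons:
        share_cons = cons[unit] / (cons[unit] + lib[unit])
        cons_exposure += cons[unit] / total_cons * share_cons
        lib_exposure += lib[unit] / total_lib * share_cons
    return cons_exposure - lib_exposure


# Same views, two levels of aggregation.
domain_level = isolation_index(VIEWS, key=lambda r: r[0])
url_level = isolation_index(VIEWS, key=lambda r: (r[0], r[1]))

print("domain-level index:", round(domain_level, 3))
print("URL-level index:", round(url_level, 3))
```

<p>With these made-up numbers each domain’s audience is an even split, so the domain-level index is 0, while the URL-level index is 0.5. Nothing about who read what changed; only the unit of analysis did.</p>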
</section>
<section id="domains-or-urls" class="level3">
<h3 class="anchored" data-anchor-id="domains-or-urls">Domains or URLs</h3>
<p>So which is right? <a href="https://www.science.org/doi/full/10.1126/science.ade7138">González-Bailón et al 2023</a> suggests that we should prefer the URL-level analysis. Here’s the passage in the discussion section, which describes why analyzing media polarization at the level of the domain is problematic:</p>
<p>“As a result of social curation, exposure to URLs is systematically more segregated than exposure to domains… <em>A focus on domains rather than URLs will likely understate, perhaps substantially, the degree of segregation in news consumption online.</em>” (Emphasis added)</p>
<p>What’s more, past work (co-authored by one of the lead authors) shows that <a href="https://osf.io/vbwer">domain-level analysis can indeed mask “curation bubbles”</a> in which “specific stories attract different partisan audiences than is typical for the outlets that produced them.”</p>
<p>You can see this in the data clear as day—let’s go back to Figure 2A, which shows a massive increase in estimated segregation when using URLs rather than domains:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/BailonFig2.jpg" title="Figure 2A shows much higher levels of audience segregation when you look at the URL level rather than the domain level" class="img-fluid figure-img"></p>
<figcaption>Figure 2 from González-Bailón et al 2023: There are much higher levels of audience segregation when you look at the URL level rather than the domain level</figcaption>
</figure>
</div>
<p>Returning to Figures 2B and 2C, it seems that the potential audience at domain level is <em>artificially less segregated</em>, due to the aggregation at the level of the domain.</p>
</section>
<section id="what-explains-the-filter-bubble-discrepency" class="level3">
<h3 class="anchored" data-anchor-id="what-explains-the-filter-bubble-discrepency">What explains the ‘Filter Bubble discrepancy’?</h3>
<p>One possibility is that posts linking to content from “mixed” domains like YouTube, Reddit, Twitter, Yahoo, etc. do not score as well in feed-ranking. It’s possible that partisan content on these domains is more likely to be downranked as misinformation or spam, or maybe Facebook-native videos (which render faster/better) have an edge over YouTube, or perhaps there are domain-level features in feed ranking, or maybe there is some other reason that content from ‘non-mixed’ domains just performs better in the rather complex recommendation system that powers Newsfeed ranking.</p>
<p>Regardless, that would explain the results in Figure 2—making the potential audience look artificially broader than the actual audience, when you analyze content at the domain level.</p>
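<p>This mechanism can be sketched with another toy example. Everything here is an assumption made for illustration: the domains and counts are invented, the “downranking” is a crude rule that keeps one in five views of content from a single mixed domain, and the index is the Gentzkow and Shapiro isolation formula rather than the paper’s own code.</p>

```python
from collections import defaultdict

# Hypothetical potential-audience counts: (domain, url, conservative, liberal).
POTENTIAL = [
    ("mixed.example", "/right-video", 90, 10),
    ("mixed.example", "/left-video", 10, 90),
    ("right.example", "/story", 90, 10),
    ("left.example", "/story", 10, 90),
]


def downrank(rows, domain, keep_one_in):
    """Crude stand-in for ranking: keep one in `keep_one_in` views from `domain`."""
    return [
        (d, u, cons_n // keep_one_in, lib_n // keep_one_in)
        if d == domain
        else (d, u, cons_n, lib_n)
        for d, u, cons_n, lib_n in rows
    ]


def isolation_index(rows, by_url):
    """Isolation index with units defined as URLs (by_url=True) or domains."""
    cons, lib = defaultdict(int), defaultdict(int)
    for d, u, cons_n, lib_n in rows:
        unit = (d, u) if by_url else d
        cons[unit] += cons_n
        lib[unit] += lib_n
    n_cons, n_lib = sum(cons.values()), sum(lib.values())
    return sum(
        (cons[k] / n_cons - lib[k] / n_lib) * cons[k] / (cons[k] + lib[k])
        for k in cons
    )


EXPOSED = downrank(POTENTIAL, "mixed.example", keep_one_in=5)

for label, rows in (("potential", POTENTIAL), ("exposed", EXPOSED)):
    print(
        label,
        "domain:", round(isolation_index(rows, by_url=False), 3),
        "url:", round(isolation_index(rows, by_url=True), 3),
    )
```

<p>In this toy setup the domain-level index rises from 0.32 to about 0.53 after ranking, while the URL-level index stays at 0.64. Nobody’s mix of stories became more one-sided; the mixed domain simply carries less weight, which is the pattern you would expect if Figure 2B’s increase were an artifact of domain-level aggregation.</p>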
</section>
<section id="what-about-the-reverse-chron-experiment" class="level3">
<h3 class="anchored" data-anchor-id="what-about-the-reverse-chron-experiment">What about the Reverse-Chron experiment?!</h3>
<p>Surely we can paint a fuller picture of the impact of algorithmic ranking on media polarization with that other excellent recent <em>Science</em> paper which looked at the <em>causal</em> effect of turning off Newsfeed ranking. Maybe we can cross-reference that paper and get a clearer picture of what’s happening.</p>
<p><a href="https://www.science.org/doi/10.1126/science.abp9364">Guess et al 2023</a> shows that Newsfeed induces proportionally <em>more</em> exposure to cross-cutting sources but also more exposure to like-minded sources. It reduces exposure to moderate or mixed sources. Importantly, this is not just news that news sources link to; it’s all content that everyone posts, including life updates, pictures, videos, etc.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/GuessFig.jpg" class="img-fluid figure-img"></p>
<figcaption>Figure 2 from Guess et al 2023: Compared to a reverse chronologically-ranked feed, FB’s ranking system induces proportionally more exposure to “like-minded” sources but also more to cross-cutting sources, defined at the level of the entity posting.</figcaption>
</figure>
</div>
<p>Ok, but what about political content? In the supplementary materials, we see that when it comes to <em>political content</em>, Newsfeed ranking actually <em>decreases</em> exposure to political content from like-minded sources.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/GuessChronScienceTabS20.jpg" class="img-fluid figure-img"></p>
<figcaption>Figure S20 from Guess et al 2023: A reverse chronologically-ranked Newsfeed induces proportionally less exposure to “like-minded” sources but also less to cross-cutting sources, defined at the level of the entity posting.</figcaption>
</figure>
</div>
<p>What about exposure to political content posted by cross-cutting sources? The SM doesn’t provide that, but it does provide a paragraph noting that Newsfeed <em>decreased</em> exposure to political news from partisan sources relative to reverse-chron!</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/GuessChronScienceS3.3.jpg" class="img-fluid figure-img"></p>
<figcaption>S3.3 in Guess et al 2023: A reverse chronologically-ranked Newsfeed induces proportionally less exposure to “like-minded” sources but also less to cross-cutting sources, defined at the level of the entity posting.</figcaption>
</figure>
</div>
<p>Now a big caveat here is that it’s clear from the main results that political content is not doing well in Newsfeed ranking. Note that there are <a href="https://www.wsj.com/articles/facebook-politics-controls-zuckerberg-meta-11672929976">reports that the company decided to downrank political and news content in 2021</a>.</p>
<p>Regardless, the results in the SM are not at all suggestive of a filter bubble, at least during the 2020 election—the experimental results suggest that if anything feedranking is showing us <em>less</em> polarizing content than we would see with reverse-chron.</p>
</section>
<section id="how-fb-groups-impact-estimates-of-algorithmic-segregation" class="level3">
<h3 class="anchored" data-anchor-id="how-fb-groups-impact-estimates-of-algorithmic-segregation">How FB Groups impact estimates of algorithmic segregation</h3>
<p>I sent a much earlier draft of this to <a href="https://www.asc.upenn.edu/people/faculty/sandra-gonzalez-bailon-phd">Sandra González-Bailón</a> and <a href="https://cssh.northeastern.edu/faculty/david-lazer/">David Lazer</a>, lead authors for <a href="https://www.science.org/doi/full/10.1126/science.ade7138">González-Bailón et al 2023</a>. Sandra pointed me to Figure S14, which does show a slight increase in segregation post-ranking for URLs shared by users and pages, but shows the <em>opposite</em> for content shared in those often-contentious Facebook groups.</p>
<!-- People may be especially likely to come across "persons dissimilar to themselves, and with modes of thought and action unlike those with which they are familiar" [(Mill cited in Mutz and Mondak, 2006)](https://www.polisci.upenn.edu/sites/default/files/mutz_mondak_2006.pdf). And it seems likely based on the plot below that group posts are treated differently from page posts and friend posts in Facebook feedranking.  -->
<p>So it’s really <em>not</em> that there’s no difference at all pre- and post- ranking, just that the difference is on average zero once you include groups. Of course, the population of people who see content from groups and/or pages in Newsfeed may be unusual, and future work should dig into this variation.</p>
<p>I should also note that this small but real difference seems more or less consistent with what we <a href="https://www.science.org/doi/10.1126/science.aaa1160">found in past work</a>, which only examined news shared by users (excluding pages and groups).</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/BailonFigS14.jpg" class="img-fluid figure-img"></p>
<figcaption>Figure S14 from González-Bailón et al 2023: There is a modest filter bubble effect among users and pages, and a “reverse filter bubble” for content shared in groups.</figcaption>
</figure>
</div>
</section>
<section id="the-algorithm-and-the-most-politically-engaged" class="level3">
<h3 class="anchored" data-anchor-id="the-algorithm-and-the-most-politically-engaged">The algorithm and the most politically engaged</h3>
<p>There is also a hint of an increase in segregation post-ranking among the most politically engaged 10% of Facebook users. However, the paper notes that this trend “is only clear for domain-level data,” which we’ve already established should not be used here. (Note that they define high political interest users as those in the “top 10% of engagement… (comments, likes, reactions, reshares) with content classified as political on Facebook…”.)</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/BailonFigS19.jpg" class="img-fluid figure-img"></p>
<figcaption>Figure S19 shows a modest filter bubble effect among the most engaged users at the peak of the 2020 election.</figcaption>
</figure>
</div>
<p>A similar pattern holds for the top 1% of users (Figure S23 in the SM).</p>
</section>
<section id="algorithmic-segregation-and-ideology" class="level3">
<h3 class="anchored" data-anchor-id="algorithmic-segregation-and-ideology">Algorithmic segregation and ideology</h3>
<p>Feed ranking seems to expose both conservatives and liberals to more liberal content. This is consistent with my priors that conservative content is more likely to violate policy and be taken down or subject to “soft actioning” (e.g., downranking) for borderline violations.</p>
<p>So now things get messy—should we really say liberals are in a filter bubble (and conservatives aren’t) if misinformation is included in that calculation?</p>
<!-- exposure to cross cutting content was thought to [reduce political participation](https://www.jstor.org/stable/3088437). -->
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/BailonTabS8.jpeg" title="Feedranking exposes you to slightly more liberal content." class="img-fluid figure-img"></p>
<figcaption>Feedranking exposes you to slightly more liberal content, presumably due to misinformation actioning.</figcaption>
</figure>
</div>
<p>We can see a similar pattern when we look at exposure to cross-cutting content, which is the measure our 2015 Science paper used. Conservatives see more liberal content in feed than their friends share.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/BailonFigS11.jpg" title="Replication of Bakshy et al 2015" class="img-fluid figure-img"></p>
<figcaption>Replication of Bakshy et al 2015, showing little effect of feed-ranking among content shared by friends. However, when including page and group content, feed-ranking plays a more important role—exposing liberals to proportionally less cross-cutting content and conservatives to comparatively more.</figcaption>
</figure>
</div>
<p>We can also see that things have changed a lot since we wrote our original Science piece in 2015. Liberals seem to see <em>far</em> more cross-cutting content, conservatives less.</p>
<!-- What about how this compares to others ways we get media on the internet? The evidence is not great here, but [Flaxman et al 2016](https://academic.oup.com/poq/article/80/S1/298/2223402) show that ideological segregation is higher for search than for social media websites, but lower for news aggregators and direct navigation to news websites.  -->
</section>
<section id="brevity-is-a-double-edged-sword" class="level3">
<h3 class="anchored" data-anchor-id="brevity-is-a-double-edged-sword">Brevity is a double-edged sword</h3>
<p>Science gives authors only limited space and it would have been difficult to dig into everything I’ve written about in 2 pages. I also know better than almost anyone just how much work went into these papers (I would bet thousands of hours for each of several authors), and how difficult it can be to explain everything perfectly when you’re pulling off such a big lift. I should also point out that these papers are very nuanced, well-caveated, and careful not to overstate their results regarding the filter bubble or algorithmic polarization.</p>
<p>Still, I do wish this work had squarely focused on URL-level analyses.</p>
</section>
<section id="what-this-means-for-other-studies" class="level3">
<h3 class="anchored" data-anchor-id="what-this-means-for-other-studies">What this means for other studies</h3>
<p>This also means that past estimates of ideological segregation based on domain-level analysis probably <em>understate</em> media polarization in a big way. This includes those based on <a href="https://journalqd.org/article/view/2586/2683">Facebook data</a>, <a href="https://onlinelibrary.wiley.com/doi/epdf/10.1111/ajps.12589">browser data</a>, data describing various <a href="https://academic.oup.com/poq/article/80/S1/298/2223402">platforms</a>, or simply websites across the <a href="https://web.stanford.edu/~gentzkow/research/echo_chambers.pdf">internet</a>. I have said this for a long time but I did not think the magnitude was as large as shown in <a href="https://www.science.org/doi/full/10.1126/science.ade7138">González-Bailón et al 2023</a>.</p>
</section>
<section id="so-is-facebook-polarizing" class="level3">
<h3 class="anchored" data-anchor-id="so-is-facebook-polarizing">So, is Facebook polarizing?</h3>
<p>It’s very hard to give a good answer to this question. In “The Paradox of Minimal Effects,” Stephen Ansolabehere points out that re-election depends overwhelmingly on whether the country is prosperous and at peace, not what happens with media politics. This is thought to be because people selectively consume media, which serves mainly to reinforce their beliefs; while at the same time, the sum total of people’s private lived experiences corresponds reasonably well to aggregated economic data.</p>
<p>There’s an argument that social media exists somewhere in between the conventional media and one’s lived experiences. And what about evidence? As Sean Westwood and I have shown, partisan selectivity is far less severe when you <a href="https://journals.sagepub.com/doi/10.1177/0093650212466406">add a social element to news consumption</a>. What’s more, field-experimental work I did shows that <a href="https://www.dropbox.com/s/nu39148ukbab34r/CH7brief.pdf?raw=true">increasing the prominence of political news in Facebook’s Newsfeed</a> shifted issue positions toward the majority of news encountered (left-leaning), particularly among political moderates.</p>
<p>Tom Cunningham recently wrote a nice <a href="https://tecunningham.github.io/posts/2023-07-27-meta-2020-elections-experiments.html#other-evidence-on-media-and-polarization">summary of some of the evidence related to the question of whether any kind of media might increase affective polarization</a>, which we discussed at length.</p>
<p>The evidence suggests any effect is likely small. First, we see that while social media is a global phenomenon, affective polarization is not—the <a href="https://direct.mit.edu/rest/article-abstract/doi/10.1162/rest_a_01160/109262/Cross-Country-Trends-in-Affective-Polarization?redirectedFrom=fulltext">UK, Japan, and Germany have seen affective depolarization</a>.</p>
<p>Second, perhaps the highest quality experimental study on this question I’ve seen is <a href="https://osf.io/jrw26/">Broockman and Kalla (2022)</a>, which finds that paying heavy Fox News viewers to watch CNN has generally depolarizing effects, though, as Tom points out, it finds null effects on traditional measures of affective polarization.</p>
<p>Third, Tom and I have also discussed an excellent experimental study attempting to shed light on this: “<a href="https://www.aeaweb.org/articles?id=10.1257/aer.20190658">Welfare Effects of Social Media</a>,” which concludes that Facebook is likely polarizing. They find that “deactivating Facebook for the four weeks before the 2018 US midterm election… makes people less informed, it also makes them less polarized by at least some measures, consistent with the concern that social media have played some role in the recent rise of polarization in the United States.”</p>
<p>The study defines political polarization in an unusual way—including congenial media exposure—how much news you see from your own side—in its polarization index. Most political scientists would consider congenial media exposure as <em>the thing that might cause polarization</em>, but not an aspect of polarization in and of itself.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/AlcottetalFigure3.jpg" title="Allcott et al 2020 Figure 3 demonstrates the biggest effect is on a measure of media exposure" class="img-fluid figure-img"></p>
<figcaption>Allcott et al 2020 Figure 3</figcaption>
</figure>
</div>
<p>They do explain that their effects on affective polarization are not significant and they don’t try to hide what’s going into the measure. But you have to read beyond the abstract and the media headlines to really understand this point.</p>
<!-- They make the following claim in a footnote: "Online Appendix Table A16 shows that the effect on the political polarization index is *robust to excluding each of the seven individual component variables in turn*, although the point estimate moves toward zero and the unadjusted p-value rises to 0.09 when omitting congenial news exposure."  -->
<p>Notably, in a robustness test in the appendix, the effect on the polarization index loses statistical significance when you exclude this variable.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/AlcottetalFigureA16.jpg" title="Allcott et al 2020 Figure A16 shows that the effect on polarization is not significant at P<0.05 if you exclude congenial media exposure." class="img-fluid figure-img"></p>
<figcaption>Allcott et al 2020 Figure A16</figcaption>
</figure>
</div>
<p>That means the folks who didn’t deactivate Facebook had higher levels of political knowledge and higher levels of issue polarization. This makes sense because if a person doesn’t know where the parties stand on an issue, she is less likely to be sure about where she ought to stand.</p>
</section>
<section id="implications-for-how-this-was-publicized" class="level3">
<h3 class="anchored" data-anchor-id="implications-for-how-this-was-publicized">Implications for how this was publicized</h3>
<p>All of this is relevant in light of the controversial <em>Science</em> cover, which suggests Facebook’s <em>algorithms</em> are “Wired to Split” us. It may be true, but the evidence across all four <em>Science</em> and <em>Nature</em> papers is not decisive on this question.</p>
<p>None of the experiments published so far show an impact on affective or ideological polarization. What’s more, the proper URL-level analyses in <a href="https://www.science.org/doi/full/10.1126/science.ade7138">González-Bailón et al 2023</a> show only a modest ‘Filter Bubble’ for certain subsets of the data, and, when it comes to political news, the reverse-chronological feed ranking experiment shows that Newsfeed ranking feeds us <em>less</em> polarized political content than we would see in a reverse-chron Newsfeed.</p>
<p>Of course, the spin from the Meta Comms team that these <a href="https://www.wsj.com/articles/does-facebook-polarize-users-meta-disagrees-with-partners-over-research-conclusions-24fde67a">results are exculpatory</a> is also highly problematic. This claim is not only wrong but amateurish and self-defeating from a strategic perspective, and I was surprised to read about it.</p>
<p>For all the amazing work done to produce the experimental results, the data are too noisy to detect small but potentially compounding effects on polarization, as suggested in <a href="https://statmodeling.stat.columbia.edu/author/dean/">a post-publication review from Dean Eckles</a> and <a href="https://tecunningham.github.io/posts/2023-07-27-meta-2020-elections-experiments.html">Tom Cunningham</a>.</p>
<p>What’s more, even the excellent work done here in <a href="https://www.science.org/doi/full/10.1126/science.ade7138">González-Bailón et al 2023</a> does not speak to the question of effects on polarization that other key recommender systems at Facebook may have: the People You May Know (PYMK) algorithm, which facilitates network connections on the website, along with the Pages You Might Like (PYML) and Groups You Might Like (GYML). The authors make a similar point in the Supplementary Materials, S3.2—pointing out that inventory, or the “potential audience,” “results from another curation process determining the structure and composition of the Facebook graph, which itself results from social and algorithmic dynamics.”</p>
<p>This means that we should <em>not</em> necessarily conclude that exposure is all about individual choices and not algorithms based on the sum total of evidence we have (a point I should have better emphasized <a href="https://solomonmg.github.io/pdf/Science-2015-Bakshy-1130-2.pdf">in past work</a>)—algorithms may play an important role and as usual, more research is needed.</p>
<p>It would seem to me that both <em>Science</em> and Meta Comms are going beyond the data here.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/ScienceCoverWiredtoSplit.jpeg" title="Wired to Split" class="img-fluid figure-img"></p>
<figcaption>The Science Cover suggests Facebook’s algorithms are “Wired to Split” us.</figcaption>
</figure>
</div>
</section>
<section id="disclosures" class="level3">
<h3 class="anchored" data-anchor-id="disclosures">Disclosures</h3>
<p>As noted above, from 2018-2020, I was tech lead for Social Science One, which gave external researchers <em>direct</em> access to data (the <a href="../../pdf/Facebook_DP_URLs_Dataset.pdf">‘Condor’ URLs data set</a>) via differential privacy. However, that project has not yielded much research output for a number of organizational and operational reasons, including the fact that differential privacy is not yet suitable for such a complex project.</p>
<p>While at Facebook, I personally advocated (with Annie Franco) for the collaboration model used in the Election 2020 project, wherein external researchers would collaborate with Facebook researchers. That model would have to shield the research from any interference from Facebook’s Communications and Policy arm, which would violate scientific ethics. It would involve pre-registration, not merely to ensure scientific rigor but to protect against conflicts of interest and selective reporting of results. However, I left Facebook in January of 2020 and have not been deeply involved in the project since.</p>
<p>I recently left Twitter (requesting to be in the first rounds of layoffs after Elon Musk took over) and started a job at NYU’s CSMaP lab when my employment with Twitter ended. There are authors who are affiliated with my lab on the paper, including one of the PIs, Josh Tucker. My graduate school advisor, Shanto Iyengar, is also on the paper, and I consider the majority of the authors to be my colleagues and friends.</p>
<p>See also my <a href="../../disclosures/">disclosures page</a>.</p>
<!-- I do wish the samples and timeframes for the ranking experiments were bigger so we could understand potentially smaller effects, which may be very important.  -->


</section>

 ]]></description>
  <guid>https://solomonmg.github.io/blog/thoughts-on-election-2020/</guid>
  <pubDate>Wed, 02 Aug 2023 00:00:00 GMT</pubDate>
  <media:content url="https://solomonmg.github.io/blog/thoughts-on-election-2020/featured.jpeg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>On BlueSky</title>
  <dc:creator>Sol Messing</dc:creator>
  <link>https://solomonmg.github.io/blog/bluesky-quasi-decentralized-social-network/</link>
  <description><![CDATA[ 





<p>TL;DR Summary</p>
<ol type="1">
<li>BlueSky has a chance to dethrone Twitter right now, but that path is narrow.&nbsp;</li>
<li>Its exclusive, invite-only model means its user base is now small, elite, and homogeneous, with few bad actors. Almost everyone likes it. But the real test will be when it opens to the public.&nbsp;</li>
<li>It is designed for true account portability and in theory should prevent a single company from owning the entire network as it scales up.&nbsp;</li>
<li>However, it’s unclear if an ecosystem of small companies can do the job of content moderation in the same ways that centralized social networks do. The same is true of running modern feed-ranking and follow-recommendation systems.&nbsp;</li>
<li>There will be growing pressure to make money using ads to cover costs as the network scales up, which will incentivize centralizing key data and resources, undermining the original model.</li>
<li>Future possibilities include: (1) BlueSky remains de facto centralized, “in beta” until it can get composable moderation right, which turns out to be the foreseeable future; (2) big players (Google, Facebook) join the party and dominate the ecosystem; (3) small, unmoderated, ad-free apps proliferate and the network becomes overrun with spam, NSFW, hate, scams, and the other gifts that come with a lack of moderation.</li>
</ol>
<hr>
<p>Pretty much everyone at Twitter—and especially Jack Dorsey—has long known that BlueSky could replace Twitter. When I joined Twitter in 2021, I soon learned our CEO was terribly unpopular internally, sporting a job approval rating under 40 percent, by far the lowest of any executive at the company.</p>
<p>In fact, Jack was obsessed with decentralization. He seemed convinced that it was a mistake to have Twitter organized as a corporation, and he would rant about this on company-wide calls, which he seemed to be taking from caves in South Asia. This is when everyone else at the company was desperately trying to increase revenues to save the company from implosion.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/JackDorseyInCaveCreditTwitter.jpg" title="This photo of Jack Dorsey captures his general aspect on many all-hands calls." class="img-fluid figure-img"></p>
<figcaption>A photo of young Jack Dorsey in a cave.</figcaption>
</figure>
</div>
<p>Enter BlueSky, which would decentralize Twitter. Jack launched the initiative in 2019, and his plan was to migrate Twitter to this new protocol. It puts user data including posts and follow lists on open, public portable data servers (PDSs) that mean true account portability. Any business or organization could index those servers, or what I will call the “BlueSkyVerse” (technically the <a href="https://blueskyweb.xyz/blog/10-18-2022-the-at-protocol">AT Protocol</a>), rank posts, and create a front end interface.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/BlueSkyVerse-illustration.jpg" title="The BlueSky App reads posts and the follow graph from Portable Data Servers, centralizing them in an index, ranking, and displaying them for users." class="img-fluid figure-img"></p>
<figcaption>A node labelled as BlueSky or App sits atop various Portable Data Server (PDS) nodes, with arrows (edges) pointing to them. A caption to the right of the App node reads “indexing, ranking, moderation, UX” and a caption to the right of the PDS nodes reads “Open AT protocol: user posts, likes, follow graph.”</figcaption>
</figure>
</div>
<p>But wait a minute! Remember during Elon Musk’s acquisition how everyone said that the value of Twitter isn’t the tech, but rather the network of creators and the communities that exist there? If you decouple that network from the platform, you give up your most valuable asset: Google, Meta, and others can index the network, develop a user interface, create some algorithms, show ads, and eat your lunch.</p>
<p>And yet, Jack was about to do just that, filling Twitter’s moat by turning its most valuable asset into a protocol. Of course, this did not go over well with employees who weren’t independently wealthy, nor the board, who eventually pushed him out.</p>
<p>BlueSky nicely captures the essence of Jack’s reign as half-time CEO: how little he cared about Twitter as a business and how much he cared about Twitter as an ecosystem.</p>
<p>But back to the question everyone cares about right now: will this new system lead to a better social network, or set of networks? Is this finally the Twitter alternative we’re looking for?</p>
<p>Make no mistake about it—BlueSky was designed by Twitter to replace Twitter. This makes it very different from the other new social media protocols, apps, etc. that we’ve seen come on the scene of late. As <a href="https://mastodon.social/@gruber/110314523447694321">John Gruber put it</a>, “If you hated Twitter, you’ll like Mastodon. If you liked Twitter, you’ll love BlueSky.”</p>
<p>So it’s a contender, despite how hard it is to start a social network from scratch. And don’t get any funny ideas about a post-surveillance-capitalism social network—if BlueSky takes off, it will most likely devolve into a less-moderated, less-profitable version of Twitter, Inc (aka Twitter 1.0). It will indeed encourage competition for front-end interfaces to explore the BlueSkyVerse. But the biggest challenges that social networks have to face—content moderation, discoverability, and monetization—require big technical and infrastructural investments to do well. They may only be viable for well-capitalized companies that generate big profits.</p>
<p>But of course, I would be very nervous if I still worked at Twitter.</p>
<p><strong>Will it work?</strong></p>
<p>Now is a unique opportunity for a Twitter rival. Twitter CEO Elon Musk tends to say <a href="https://www.cnn.com/2022/10/30/business/musk-tweet-pelosi-conspiracy/index.html">all manner of nutty things</a>, has <a href="https://techcrunch.com/2022/11/21/elon-musk-twitter-netzdg-test/">decimated Twitter’s trust and safety org</a>, and has cut staffing by more than 80%. And the company slashed infrastructure budgets needed for automated content moderation: internal sources say the company has cut $3bn since peak spending prior to the recession, while external accounts say <a href="https://www.reuters.com/technology/musk-orders-twitter-cut-infrastructure-costs-by-1-bln-sources-2022-11-03/">Musk ordered a $1bn cut himself</a>.</p>
<p>It shows: in the wake of the Allen massacre on Saturday, <a href="https://www.dallasnews.com/news/2023/05/11/gore-conspiracies-spread-on-elon-musks-loosely-moderated-twitter-after-allen-shooting/">graphic videos and misinformation spread across the platform</a>. Advertisers don’t want to risk putting their brands next to that kind of content and <a href="https://www.cnbc.com/2022/11/01/ad-giant-ipg-advises-brands-to-pause-twitter-spending.html">many have suspended advertising</a> on the platform.</p>
<p>We’ve all wondered which alternative social media system might replace Twitter. Could it be Mastodon, Spoutible, Post News, maybe Substack Notes? Or perhaps Truth Social or Gab or Gettr?!</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/NoAdsonGab.jpg" title="The only ads I saw on Gab.com were ads for advertising on Gab.com." class="img-fluid figure-img"></p>
<figcaption>A screenshot from Gab.com, with a post showing a flag of the UN with the text: “Need to burn a flag? Make it this one.”</figcaption>
</figure>
</div>
<p>I’m guessing it’s not going to be those other networks. The new centralized social network entrants (Spoutible, Post News, and Substack Notes) feel sterile and inauthentic when you first get started, partially because they are built around conventional media outlets, partially because they didn’t pay enough attention to discoverability in onboarding. Gettr, Gab, and Truth Social have libertarian-borderline-right-wing moderation setups, and the vast majority of people on Twitter have little interest in a right-wing echo chamber where there’s no one to troll.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/SpoutableBlankTimeline.jpg" title="You can get the best designers and engineers on the planet but if you show people a blank timeline and recommendations to follow a bunch of people they've never heard of, no one is going to use your platform." class="img-fluid figure-img"></p>
<figcaption>A screenshot from Spoutible showing a blank timeline.</figcaption>
</figure>
</div>
<p>Mastodon is losing steam for many reasons—onboarding is terribly confusing, it’s broken into communal servers that are all very different but that all seem uptight. Moderation there has been characterized as “<a href="https://mastodon.social/@gruber/110328355532624579">petulant nannyism</a>.”</p>
<p>Like Twitter, and <a href="https://werd.io/2023/the-fediverse-and-the-at-protocol">unlike Mastodon</a>, BlueSky can surface content from this entire web of activity across the BlueSkyVerse and delight you with memes and witticisms, many of which were about <a href="https://faineg.substack.com/p/how-i-accidentally-ruined-bluesky">“Sexy” ALF (yes, the 80s TV star)</a> when I signed up.</p>
<p>Many beta users say BlueSky feels like a breath of fresh air, like a throwback to early Twitter. For now, BlueSky is invite-only, so missing are the scammers, crypto bros, right-wing nuts, and tone-policing randos looking for followers you find on Twitter. It feels more communal and less exhausting. Unclear how long that will last.</p>
<p>So maybe BlueSky has a legit claim to the Throne of Discourse, post-Twitter.</p>
<p><strong>Content Moderation</strong></p>
<p>First of all, content moderation is not just a “nice-to-have” thing that keeps the press happy. Facebook and others have found that content moderation <a href="http://tecunningham.github.io/2023-04-28-ranking-by-engagement.html">increases retention</a>. And look at the flip side: most people don’t want to hang out at what Mike Masnick calls “<a href="https://www.techdirt.com/2023/05/04/on-social-media-nazi-bars-tradeoffs-and-the-impossibility-of-content-moderation-at-scale/">Nazi bars</a>,” which is what platforms with permissive moderation policies will often become known for, whether they are actually Nazis or just radical free-speech advocates. Once that happens, kiss a lot of your core user base and ad revenue goodbye—which is what seems to be happening at Twitter.</p>
<p>Of course, content moderation is the bane of the modern social media network. It’s expensive, <a href="https://www.techdirt.com/2019/11/20/masnicks-impossibility-theorem-content-moderation-scale-is-impossible-to-do-well/">it will always be wrong</a>, it can easily create a PR dumpster fire, and its benefits are extremely difficult to measure. This new protocol was designed with content moderation in mind, so let me break that down before talking about the problems that will surely come up.</p>
<p>On BlueSky, speech happens on your PDS, but reach happens on the centralized app (BlueSky, for now). And they <a href="https://blueskyweb.xyz/blog/4-13-2023-moderation">are in fact moderating</a>, so if they find a post that violates policy, they may take it off their app. It’s still up on the PDSs; it’s just not indexed in BlueSky. So great, it allows for a slightly truer form of “freedom of speech but not reach.”</p>
<p>How does this actually work? The BlueSky team wants to create a “<a href="https://blueskyweb.xyz/blog/4-13-2023-moderation">moderation ecosystem</a>,” in which labels (“spam”, “nsfw”) can be created by anyone, and apps like BlueSky can then choose what labels to act upon. Right now, it’s completely centralized at BlueSky, and they have an automated layer and decisions are made by “server administrators.” Eventually though, there will be other label sources, other apps besides BlueSky and many servers beyond bsky.social. They’re proposing a “choose your own moderation” approach.</p>
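<p>To make the “choose your own moderation” idea concrete, here is a minimal Python sketch. It is not the AT Protocol’s API: the <code>Label</code> and <code>AppPolicy</code> classes, the labeler names, and the post ids are all hypothetical, made up for illustration. It only shows how two apps reading the same pool of labels could end up showing different feeds.</p>

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    """A hypothetical moderation label attached to a post by some labeler."""
    post_id: str
    value: str   # e.g. "spam" or "nsfw"
    source: str  # hypothetical name of whoever issued the label


@dataclass
class AppPolicy:
    """A hypothetical front-end app's choice of which labels to act upon."""
    trusted_sources: set[str]
    hide_values: set[str]

    def visible(self, post_id: str, labels: list[Label]) -> bool:
        # Hide a post only if a trusted labeler gave it a label this app acts on.
        for label in labels:
            if (label.post_id == post_id
                    and label.source in self.trusted_sources
                    and label.value in self.hide_values):
                return False
        return True


# The same labels are available to every app in this toy ecosystem.
labels = [
    Label("post-1", "spam", "labeler-a"),
    Label("post-2", "nsfw", "labeler-b"),
]
posts = ["post-1", "post-2", "post-3"]

strict_app = AppPolicy(trusted_sources={"labeler-a", "labeler-b"},
                       hide_values={"spam", "nsfw"})
lenient_app = AppPolicy(trusted_sources={"labeler-a"}, hide_values={"spam"})

strict_feed = [p for p in posts if strict_app.visible(p, labels)]
lenient_feed = [p for p in posts if lenient_app.visible(p, labels)]
```

<p>Under these assumptions the strict app shows only <code>post-3</code>, while the lenient app also shows <code>post-2</code>, because it neither trusts <code>labeler-b</code> nor acts on “nsfw.” Note that nothing is deleted in either case: the posts stay on their servers and only the index changes.</p>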
<p>OK what are the downsides?</p>
<p>First, there are key parts of moderation that raise questions under this framework. If you doxx someone’s home address for targeted harassment, post a bunch of Child Sexual Abuse Material (CSAM) or non-consensual sexual imagery, it feels insufficient to merely de-index those posts. There are cases where it <a href="https://www.nytimes.com/2023/05/03/technology/dorsey-musk-twitter-bluesky-nostr.html">may not be legally sufficient</a> under the Digital Services Act, NetzDG, or U.S. Copyright Law.</p>
<p>The spam-detection arms race is another example: the more open you are about how it works, the faster the spammers get around your detection systems. Somewhat relatedly, the fact that blocklists are public on BlueSky due to the BlueSkyVerse architecture is <a href="https://twitter.com/MattBinder/status/1652142389165797377?s=20">already stirring controversy</a>.</p>
<p>Finally, a big part of a healthy information ecosystem is keeping bad actors off your platform in the first place. In centralized networks, that’s often done by IP screening, cell phone/text message screening, email validation, and/or by using other private data. But a PDS hosts public data, so the centralized app would need to create parallel user accounts to collect and maintain that data.</p>
<p>All that means it’s difficult to see an alternative to a world where BlueSky and other AT apps need to start collecting private user data, even if it’s inconsistent with the clean decentralized, portable data model illustrated above. The line between PDS and user account will get very fuzzy very fast.</p>
<p>And, once apps do this for content moderation, wouldn’t they also wish to do it for advertising as well? Content moderation isn’t free.</p>
<p>Right now, signups are based on invites, which helps keep out bad actors. But eventually BlueSky will need to open up fully once it’s out of beta.</p>
<p>When that happens, the job of content moderation will be far more complex than in a place like Mastodon, because the BlueSky architecture is meant to enable “scale and global discoverability.” With Mastodon/the Fediverse, each server has its own policies, norms, and content moderation, which is far simpler in its small, federated worlds. In the BlueSkyVerse, you have no choice but to scale up moderation.</p>
<p><strong>Recommender systems in the BlueSkyVerse</strong></p>
<p>Will BlueSky be incentivized to build a feed-ranking system into their product and start logging the vast scope of data that inspired the phrase “surveillance capitalism”? They have already started down that path. In fact, they’ve built the BlueSkyVerse to facilitate global discovery: large-scale indexing and ranking across all PDSs in the network.</p>
<p>Right now, the “What’s Hot” feed does global discovery, but in a way that is pretty basic: it’s showing popular stuff from the last 30 minutes. For now this is fine; <a href="http://tecunningham.github.io/2023-04-28-ranking-by-engagement.html">it’s the core of most modern recommender systems</a> in social media websites.</p>
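<p>A feed of “popular stuff from the last 30 minutes” can be sketched in a few lines of Python. This is an illustration and not BlueSky’s code: the <code>whats_hot</code> function, the post dictionaries, and the use of like counts as the popularity signal are all assumptions.</p>

```python
from datetime import datetime, timedelta, timezone


def whats_hot(posts, now, window=timedelta(minutes=30), top_k=3):
    """Return ids of the most-liked posts created within `window` before `now`."""
    recent = [p for p in posts
              if timedelta(0) <= now - p["created_at"] <= window]
    ranked = sorted(recent, key=lambda p: p["likes"], reverse=True)
    return [p["id"] for p in ranked[:top_k]]


now = datetime(2023, 5, 1, 12, 0, tzinfo=timezone.utc)
posts = [
    {"id": "a", "likes": 50, "created_at": now - timedelta(minutes=5)},
    # Very popular, but outside the 30-minute window, so it is dropped.
    {"id": "b", "likes": 900, "created_at": now - timedelta(hours=2)},
    {"id": "c", "likes": 120, "created_at": now - timedelta(minutes=20)},
]

hot = whats_hot(posts, now)
```

<p>Here <code>hot</code> comes out as <code>["c", "a"]</code>. The ranking is identical for every user, which is what makes it cheap: there is no per-user model and no behavioral logging beyond the like counts.</p>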
<p>Contrast this with Mastodon, where you can technically follow people from another server but the system isn’t designed so servers index each other and form one network. This is an important reason I think BlueSky could have legs, but Mastodon will probably not replace Twitter.</p>
<p>Setting aside any monetary pressures facing BlueSky for a minute, I suspect they will be driven toward increased data collection and deployment, simply because you need to do that to move the metrics that tell you your product is improving. This may be further cemented by the culture of modern engineering organizations, where engineering leaders and PMs ruthlessly focus on moving a “north star” metric, which is almost always some variant of time spent. Time spent, daily active users, session counts: these are measures of whether you’re making your product better, and the fact that they are all highly correlated with potential ad revenue is coincidental.</p>
<p>Of course, doing anything like what Twitter and Facebook do with their recommender systems, for both follow recommendations and feed-ranking, will require a lot more resources. For the follow graph, that entails predicting which users are likely to form mutual follow relationships or satisfactory follow-only relationships, which can be done with shortcuts but is ultimately a difficult (graph machine learning) problem. For feed-ranking, that requires predicting which users are likely to interact with which content, which both Twitter and Facebook had entire divisions of engineers and data scientists working on.</p>
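<p>As an example of the kind of shortcut available for follow recommendations, the sketch below counts how many of the accounts you follow also follow a candidate (a “friends of friends” heuristic). It is a toy under stated assumptions: the <code>recommend_follows</code> function and the tiny graph are hypothetical, and this is not a description of what Twitter, Facebook, or BlueSky actually run.</p>

```python
from collections import Counter


def recommend_follows(follows, user, top_k=2):
    """Rank accounts `user` does not follow by how many of the accounts
    `user` already follows also follow them."""
    already = follows.get(user, set())
    counts = Counter()
    for followed in already:
        for candidate in follows.get(followed, set()):
            if candidate != user and candidate not in already:
                counts[candidate] += 1
    # Highest count first; break ties by name so the output is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_k]]


# Hypothetical follow graph: each key follows the accounts in its set.
follows = {
    "ana": {"bo", "cy"},
    "bo": {"dee", "eli"},
    "cy": {"dee"},
}

recs = recommend_follows(follows, "ana")
```

<p>For this graph <code>recs</code> is <code>["dee", "eli"]</code>, since two of the accounts “ana” follows also follow “dee.” A heuristic like this says nothing about whether the resulting relationship will be mutual or satisfying, which is where the harder graph machine learning problem begins.</p>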
<p><strong>Pressures to centralize and monetize the BlueSkyVerse</strong></p>
<p>Venture capitalists and startups in Silicon Valley are always talking about “moats.” If you invest a great deal of resources to build a technology or a new marketplace, what’s to stop a competitor from drinking your milkshake?</p>
<p>There’s an influential idea among “Web 3.0” circles, which is that Facebook, Instagram, and Twitter are the landlords of castles you can’t leave. That’s not supposed to happen this time—the BlueSkyVerse was designed around account portability and front-end/algorithmic competition. The hope is this will create an ecosystem of small companies doing bits and pieces of what big social media companies do today.</p>
<p>At the same time, everything I’ve seen so far suggests that large investments are going to be required to even start playing in the BlueSkyVerse: there are barriers to entry on data processing to even index it as users grow, to create a legit feed and UX, and to do content moderation at that kind of scale. Jack has given billions to the BlueSky team to get the system to where it is today.</p>
<p>So what happens if the BlueSkyVerse really takes off? We might indeed see real competition for front-end apps that do custom algorithmic ranking and figure out innovative ways to moderate content. We might see further media fragmentation—perhaps front-end providers will try to differentiate themselves by topic or political orientation like television channels do.</p>
<p>But running a modern social media website is expensive. If it grows as big as Twitter, indexing the BlueSkyVerse will become a challenge, same for running modern recommender systems. And if you want ad revenue you need content moderation, which you can’t solve with AI alone—you need humans in the loop, which means you don’t get the kind of economies of scale you’d see with automated systems. What’s more, you often need sensitive user data to do these things well, and you need bespoke solutions to new adversarial tactics you find. So it’s hard to fully rely on an external company for these solutions, as the creators of BlueSky seemed to envision.</p>
<p><strong>The future of the network</strong></p>
<p>I see a few possibilities if BlueSky gets really big: the first is that BlueSky the app simply dominates this system—they moved first, they understand the system, they can do content moderation, they figure out how to scale up, and they may decide to sell ads. At the same time if BlueSky does become “Twitter 3.0,” there have to be consequences to the fact that I can simply take my posts and follow-graph to a competing service and still be on the same network.</p>
<p>Or maybe not. Maybe they will realize that the challenges of content moderation favor keeping the network as is, and the BlueSkyVerse will remain closed for a long time. Perhaps forever.</p>
<p>But if it does really launch and open up, it seems likely that established tech starts to play—Google jumps in, dedicates a small fraction of the resources it used to fund Google+, indexes the BlueSkyVerse in a day, and boom… has a competitor to Facebook. Maybe Facebook jumps in too, but that’s a tricky proposition because once part of Facebook/Instagram has true account portability what happens to the rest of the company?</p>
<p>Of course, another outcome that seems likely is a conservative social media front-end provider. Maybe Truth Social integrates with the BlueSkyVerse. It won’t make much money because many in that demographic seem happy with Twitter for now, and there will be substantial brand risk for potential advertisers.</p>
<p>Finally, we might see pure anarchy. In this “race to the bottom” scenario, a set of small, unmoderated, ad-free apps proliferate. Since people don’t like ads, they use these apps. The network becomes overrun with spam, NSFW, hate, scams, and the other gifts that come with a lack of moderation. Of course, it’s unclear these apps would be tolerated by the app stores, but this is one direction things might generally go.</p>



 ]]></description>
  <guid>https://solomonmg.github.io/blog/bluesky-quasi-decentralized-social-network/</guid>
  <pubDate>Mon, 03 Apr 2023 00:00:00 GMT</pubDate>
  <media:content url="https://solomonmg.github.io/blog/bluesky-quasi-decentralized-social-network/featured.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>What can we learn from ‘The Algorithm,’ Twitter’s partial open-sourcing of its feed-ranking recommendation system?</title>
  <dc:creator>Sol Messing</dc:creator>
  <link>https://solomonmg.github.io/blog/twitter-the-algorithm/</link>
  <description><![CDATA[ 





<p>Last Friday (2023-03-31) Twitter released what it calls “the algorithm,” which appears to be a highly redacted, incomplete portion of the code that governs the “for you” home timeline ranking system. And I saw nothing to suggest the parts of the code they put in the GitHub repository weren’t authentic.</p>
<p>It’s highly unusual for a tech company to open up a product at the core of its monetization strategy. The thinking is that the more engaging the content you show people right when they log in, the more likely they are to stick around. And the more you keep people logged in, the more they see ads. And the more data you can get to show them better ads!</p>
<p><strong>Transparency, or a distraction from closing the API?</strong></p>
<p>Is this a step forward for transparency as Musk and Twitter would claim? I am skeptical. You can’t learn much from this release in and of itself—you need the underlying model features, parameters, and data to really understand the algorithm. Those combine into a system that’s effectively different for everyone! So even if you had all that, you’d likely need to algorithmically audit the system to really get a handle on it.</p>
<p>And Twitter made it <a href="https://www.wired.com/story/twitter-data-api-prices-out-nearly-everyone/">prohibitively expensive</a> for external researchers to get that data through its API with the recent price updates ($500k/yr). So at the same time Twitter is releasing this code, it’s made it incredibly difficult for researchers to <em>audit</em> this code.</p>
<p><strong>What’s in the code? Gossip and Rumors</strong></p>
<p><strong>Ukraine</strong> There were some <a href="https://twitter.com/SolomonMg/status/1642845123531751425?s=20">initial reports</a> that Twitter was downranking tweets about Ukraine. I looked at the code and can tell you those claims are wrong: Twitter has an audio-only <a href="https://www.clubhouse.com">Clubhouse</a> clone called Spaces, and that code is for that product, not ordinary tweets on the home timeline. What’s more, this is likely a label related only to <strong>crisis misinformation</strong>, as per Twitter’s <a href="https://help.twitter.com/en/rules-and-policies/crisis-misinformation">Crisis Misinformation Policy</a>.</p>
<blockquote class="twitter-tweet blockquote">
<a href="https://twitter.com/SolomonMg/status/1642560420392103936"></a>
</blockquote>
<p><strong>Musk Metrics</strong> One of the most interesting things we learned from the code is that Twitter created an entire suite of metrics about Elon Musk’s personal Twitter experience. The code shows they fed those metrics to the experimentation platform (Duck Duck Goose, or DDG), which at least historically has been used to evaluate whether or not to ship products.</p>
<blockquote class="twitter-tweet blockquote">
<a href="https://twitter.com/wongmjane/status/1641884551189512192"></a>
</blockquote>
<p>This episode is consistent with reporting that engineers are very concerned about how any features they ship <a href="https://www.theverge.com/2023/2/9/23593099/elon-musk-twitter-fires-engineer-declining-reach-ftc-concerns">affect the CEO’s personal experience on Twitter</a>. And other <a href="https://arstechnica.com/tech-policy/2023/02/report-musk-had-twitter-engineers-boost-his-tweets-after-biden-got-more-views/">reporting has suggested that there may have been a Musk-centric boost feature</a> that shipped, and you would want exactly this kind of instrumentation to understand how that worked in practice.</p>
<p><strong>Republican, Democrat Metrics</strong> We also learned that Twitter is logging similar metrics for lists of prominent Democrat and Republican accounts, <a href="https://www.yahoo.com/entertainment/twitters-recommendation-algorithm-is-now-on-github-200511112.html">ostensibly to understand</a> whether any features that they ship affect those sets of accounts equally. Now we know that <a href="https://www.nature.com/articles/s41467-022-34769-6">conservative accounts tend to share more misinformation than liberal accounts both on Twitter</a> and <a href="https://www.science.org/doi/full/10.1126/sciadv.aau4586">on Facebook</a>. And <a href="https://www.washingtonpost.com/technology/2023/02/08/house-republicans-twitter-files-collusion/">Musk has alleged that Democrats and Big Tech are colluding</a> to enforce policy unequally across parties.</p>
<p>But if you have these “partisan equality” stats as part of your ship criteria, perhaps on equal footing with policy violation frequency, you can see how <strong>this could really affect the types of health and safety features that actually make it to the site in production</strong>.</p>
<p>This code was then comically removed via pull requests from Twitter. Because once you delete something on GitHub, it just goes away. Right?</p>
<blockquote class="twitter-tweet blockquote">
<a href="https://twitter.com/colin_fraser/status/1641960748233662464"></a>
</blockquote>
<p><strong>Twitter Blue Boost</strong> What’s more, we sorta knew that Twitter Blue users get a boost in feed ranking, but the code makes it clear that it could double your score among people who don’t follow you, and quadruple it for those who do.</p>
<blockquote class="twitter-tweet blockquote">
<a href="https://twitter.com/beeonaposy/status/1641878347557883910"></a>
</blockquote>
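<p>To make the arithmetic concrete, here is a minimal sketch of how a boost like that would play out. The function and constant names are hypothetical and are not identifiers from Twitter’s code. Only the 2x and 4x multipliers come from the description above.</p>

```python
# Hypothetical names; only the 2x / 4x values come from the post.
OUT_OF_NETWORK_MULTIPLIER = 2.0  # viewer does not follow the Blue author
IN_NETWORK_MULTIPLIER = 4.0      # viewer follows the Blue author


def boosted_score(base_score, author_is_blue, viewer_follows_author):
    """Apply the Twitter Blue multiplier to a tweet's ranking score."""
    if not author_is_blue:
        return base_score
    if viewer_follows_author:
        return base_score * IN_NETWORK_MULTIPLIER
    return base_score * OUT_OF_NETWORK_MULTIPLIER
```

<p>Under these assumptions, a Blue author’s tweet with the same underlying score as a non-subscriber’s would outrank it for followers and non-followers alike.</p>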
<p>As <a href="https://twitter.com/jonathanstray/status/1642200687101501441">Jonathan Stray pointed out</a>, if this counts as a paid promotion, the FTC might require Twitter to label your tweets as ads. Now we kind of already knew this from Musk’s Twitter Blue announcement, but having evidence in the code might cross a different line for the FTC.</p>
<p><strong>So what about the ackshual algorithm? What does this say about feed ranking?</strong></p>
<p>The code itself is there but it’s missing specifics—key parameters, feature sets, and model weights are absent or abstracted. And obviously the data.</p>
<p>The most critical thing we learned about Twitter’s ranking algorithm is probably from a readme file that former Facebook Data Scientist <a href="https://twitter.com/jeff4llen">Jeff Allen</a> found. If we take that at face value, a fav (Twitter like) is worth half a retweet. A reply is worth 27 retweets, and a reply with a response from a tweet’s author is worth a whopping 75 retweets!</p>
<blockquote class="twitter-tweet blockquote">
<a href="https://twitter.com/jeff4llen/status/1641901988047626241"></a>
</blockquote>
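<p>Taken at face value, those ratios amount to a weighted sum over engagement types. The sketch below is illustrative only: the weights are the ratios quoted above, expressed in “retweet units,” and the function name and input format are my own assumptions, not Twitter’s code.</p>

```python
# Relative weights in "retweet units", using the ratios quoted in this post.
WEIGHTS = {
    "fav": 0.5,
    "retweet": 1.0,
    "reply": 27.0,
    "reply_with_author_response": 75.0,
}


def engagement_score(predicted_probs):
    """Weighted sum of predicted engagement probabilities.

    predicted_probs maps an action name to the model's predicted
    probability (0 to 1) that the viewer takes that action.
    Actions missing from the dict are treated as probability 0.
    """
    return sum(
        weight * predicted_probs.get(action, 0.0)
        for action, weight in WEIGHTS.items()
    )
```

<p>Under these weights, even a small predicted chance of a reply can outweigh a much larger predicted chance of a like.</p>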
<p>Now it’s not quite that simple—what about when a tweet is first posted and there’s no data? Twitter’s deep learning system (in the heavy ranker) will do some heavy lifting and predict the likelihood of each of these actions based on the tweet author, their network, any initial engagements, the tweet text, and thousands of signals and embeddings.</p>
<p>Of course, what happens in the first few minutes when a tweet is posted deeply shapes who sees and engages with it downstream in the future.</p>
<p>[And the way this is implemented in practice is that the model handles all cases, but as you get more and more real time data on a tweet, those real time features dominate everything else and push those probabilities close to 1, see <a href="https://twitter.com/SolomonMg/status/1642154005588504577?s=20">discussion here</a>.]</p>
<p>Now I should point out that there are some spammy accounts claiming to have found ranking parameters in the code. They’re wrong: those parameters are used to <a href="https://github.com/twitter/the-algorithm/blob/7f90d0ca342b928b479b512ec51ac2c3821f5922/src/java/com/twitter/search/README.md">retrieve tweets from your network for candidate generation only</a>. <a href="https://lucene.apache.org">Lucene</a> is an open source search tool.</p>
<blockquote class="twitter-tweet blockquote">
<a href="https://twitter.com/SolomonMg/status/1642563414970060800"></a>
</blockquote>
<p>I should point out, however, that some of the “Earlybird” code was at one point used in timeline ranking, and it appears that <a href="https://github.com/twitter/the-algorithm/blob/7f90d0ca342b928b479b512ec51ac2c3821f5922/cr-mixer/server/src/main/scala/com/twitter/cr_mixer/similarity_engine/EarlybirdTensorflowBasedSimilarityEngine.scala">it may be used in cr-mixer</a>, which is used in candidate generation for <a href="https://github.com/twitter/the-algorithm/blob/7f90d0ca342b928b479b512ec51ac2c3821f5922/cr-mixer/README.md">out-of-network tweets</a>.</p>
<p>Interestingly, <a href="https://github.com/twitter/the-algorithm/blob/main/home-mixer/server/src/main/scala/com/twitter/home_mixer/functional_component/filter/OutOfNetworkCompetitorURLFilter.scala">Twitter appears to remove competitor URLs</a>, perhaps only for tweets that are outside of network (you don’t follow the author).</p>
<p><strong>What else goes into “the Algorithm”?</strong></p>
<p>What gets ranked in the first place? The other piece here is the “TikTok” part of the ranking algorithm, which is also incomplete without the models/data/parameters/etc. What I mean is the code that takes content from across the platform and says “I’m going to put this into your queue for the heavy ranker to sort out.”</p>
<p>Now on Twitter, that historically meant tweets posted by or replied to by accounts you follow. But Twitter realized it could find a lot more content for that heavy ranker magic.</p>
<p>There’s a complex system that inserts tweets into your queue for ranking. This is called <strong>candidate generation</strong> in the “recommendation system” subfield of applied computing.</p>
<p>If you follow a lot of people on twitter like me, about <strong>half</strong> of the candidate tweets in twitter’s ranked “for you” timeline at any given time are from people you follow.</p>
<p>Now, if you don’t follow a ton of people, or if you have a new account, you can run out of these tweets, and then Twitter will try to find additional candidates so that you have ranked content. If so, this system is going to govern what appears in your home timeline feed, much like TikTok, gathering content it predicts you’ll like from across the platform.</p>
<p>This takes place in <a href="https://github.com/twitter/the-algorithm/blob/7f90d0ca342b928b479b512ec51ac2c3821f5922/cr-mixer/README.md">cr-mixer</a>, and although some of the <a href="https://github.com/twitter/the-algorithm/blob/7f90d0ca342b928b479b512ec51ac2c3821f5922/cr-mixer/server/src/main/scala/com/twitter/cr_mixer/candidate_generation/CandidateSourcesRouter.scala">high level function calls are there</a>, much of the code and the models appear to be missing, and many files come with this warning at the top: “This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.”</p>
<p>Twitter seems to have made some of the systems underlying candidate generation public, including its <a href="https://github.com/twitter/the-algorithm/tree/7f90d0ca342b928b479b512ec51ac2c3821f5922/src/scala/com/twitter/simclusters_v2">SimCluster model</a>.</p>
<p>BTW, I’d like to give a shout out to <a href="https://twitter.com/vboykis">Vicki Boykis</a> and <a href="https://twitter.com/igorbrigadir">Igor Brigadir</a>, who are <a href="https://github.com/igorbrigadir/awesome-twitter-algo">doing amazing work to map out the codebase</a> and unearth exactly what’s missing and what’s not.</p>
<p><strong>Trust and Safety</strong></p>
<p>A lot of the code related to Trust and Safety is missing, presumably to prevent bad actors from learning too much and gaming those systems. However, there do seem to be some specifics about the kinds of things Twitter considers borderline or violating that I don’t think were previously public. There are a bunch of safety parameters in the code, some of which are in Twitter’s policy documents, but some are not.</p>
<p>There are entries like “HighCryptospamScore” that <a href="https://github.com/twitter/the-algorithm/blob/7f90d0ca342b928b479b512ec51ac2c3821f5922/visibilitylib/src/main/scala/com/twitter/visibility/rules/DownrankingRules.scala">appear in the code</a>, which may give scammers hints about how to craft tweets to get around detection systems. The same is true for <a href="https://github.com/twitter/the-algorithm/blob/7f90d0ca342b928b479b512ec51ac2c3821f5922/visibilitylib/src/main/scala/com/twitter/visibility/models/TweetSafetyLabel.scala#L115">code that contains links</a> to “UntrustedUrl,” “TweetContainsHatefulConductSlur” for low, medium and high severity.</p>
<p>There’s also a reference to a “Do Not Amplify” <a href="https://github.com/twitter/the-algorithm/blob/7f90d0ca342b928b479b512ec51ac2c3821f5922/visibilitylib/src/main/scala/com/twitter/visibility/models/SpaceSafetyLabelType.scala#L26">parameter in the code</a>, which was discussed in the Twitter Files but seems not to be publicly documented in its policies. There are entries like “AgathaSpam,” which refers to a proprietary embedding used across the codebase. Twitter also has a bunch of visibility rules hardcoded in Scala that might be useful to bad actors trying to game the system, outlining what rules are in play for all tweets, new users, user mentions, liked tweets, realtime spam detection, etc. Finally, some of the consequences for those violations are <a href="https://github.com/twitter/the-algorithm/blob/7f90d0ca342b928b479b512ec51ac2c3821f5922/visibilitylib/src/main/scala/com/twitter/visibility/rules/Action.scala">spelled out in Scala</a> as well.</p>
<p>Of course, it’s really hard to know with&nbsp;certainty&nbsp;that any of this wasn’t in public somehow before this release.</p>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>



 ]]></description>
  <guid>https://solomonmg.github.io/blog/twitter-the-algorithm/</guid>
  <pubDate>Mon, 03 Apr 2023 00:00:00 GMT</pubDate>
  <media:content url="https://solomonmg.github.io/blog/twitter-the-algorithm/featured.png" medium="image" type="image/png" height="90" width="144"/>
</item>
<item>
  <title>Past vote data outperformed the polls. How did it go so wrong?</title>
  <dc:creator>Sol Messing</dc:creator>
  <link>https://solomonmg.github.io/blog/what-the-polls-got-wrong-in-2020/</link>
  <description><![CDATA[ 





<p>It’s becoming clear that the 2020 polls underestimated Trump’s support by anywhere from 4 to 8 points depending on your accounting–a significantly worse miss than in 2016, when <a href="https://fivethirtyeight.com/features/the-polls-are-all-right/">state polls were off but the national polls did relatively well</a>.</p>
<p>In fact, this year we were better off using projections based on past vote history in each state to predict how things would go in battleground states, as I’ll show below.</p>
<p>But I also want to start to ask questions about what happened this time around. The polling from 2018 looked encouraging, convincing many pollsters that the post-2016 reckoning had fixed many issues called out in the <a href="https://www.aapor.org/Education-Resources/Reports/An-Evaluation-of-2016-Election-Polls-in-the-U-S.aspx">2016 AAPOR report on election polling</a>. After 2018, FiveThirtyEight wrote that the <a href="https://fivethirtyeight.com/features/the-polls-are-all-right/">“Polls are Alright”</a>.</p>
<p>But the second Miami-Dade reported results from the 2020 election, we knew something was probably wrong with the 2020 polls.</p>
<!-- <blockquote class="twitter-tweet"><a href="https://twitter.com/stefanjwojcik/status/1325786708022079488"></a></blockquote> -->
<p>As Stefan notes (we worked together at Pew Research Center’s Data Labs), the error seems slightly lower in key battleground states, though the polls missed big in WI, perhaps in part due to its horrifically bad voter file data.</p>
<p>Unlike 2016, both state and national polls appeared to underestimate Trump’s support, as this early (Nov 7) analysis from <a href="https://twitter.com/thomasjwood">Tom Wood</a> shows:</p>
<blockquote class="twitter-tweet blockquote">
<a href="https://twitter.com/thomasjwood/status/1325199348553162752"></a>
</blockquote>
<!-- [![normal](/img/TWpollingerror.jpeg)](https://twitter.com/thomasjwood/status/1325199348553162752) -->
<section id="polling-versus-past-votes" class="level2">
<h2 class="anchored" data-anchor-id="polling-versus-past-votes">Polling versus past votes</h2>
<p>Perhaps what surprised me the most about polling this time around was when I went to evaluate some election projections I put together in April that we used internally at Acronym to help evaluate where we might want to spend. I pulled in the <a href="https://www.nytimes.com/live/2020/presidential-polls-trump-biden">NYTimes polling averages</a> and compared them with the latest state-level presidential results from the AP. I then did the same for the April projections. Turns out the projections were significantly more accurate than the polling averages:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/PollingVSPastVoteProj.png" class="img-fluid figure-img"></p>
</figure>
</div>
<p>We used these projections, and other extant data (including the fact that there are two Senate races in play), when making what turned out to be a very lucky decision to start spending money in Georgia. We were one of the biggest and earliest spenders in that race.</p>
<p>What are these projections? I simply took the last two state-level Presidential and U.S. House election totals, estimated each state’s “trajectory,” and added that to each state’s Democratic margin from the previous cycle.</p>
<p>(Note that I also weighted 60-40 toward the Presidential results, and slightly regularized both the latest margin and the trajectory toward zero.)</p>
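<p>For readers who want to see the shape of that calculation, here is a rough sketch. It is not the code I used: the shrinkage factor of 0.9, the use of a simple cycle-over-cycle difference as the “trajectory,” and all function names are assumptions for illustration. Only the 60-40 weighting and the idea of regularizing toward zero come from the description above.</p>

```python
def blend(pres_margin, house_margin, pres_weight=0.6):
    """Blend presidential and House Democratic margins, 60-40 by default."""
    return pres_weight * pres_margin + (1.0 - pres_weight) * house_margin


def project_margin(previous_cycle, latest_cycle, shrink=0.9):
    """Project a state's next Democratic margin, in points.

    Each cycle is a (presidential_margin, house_margin) pair.
    shrink < 1 pulls both the latest margin and the trajectory
    toward zero; 0.9 is an assumed value, not the one actually used.
    """
    previous = blend(*previous_cycle)
    latest = blend(*latest_cycle)
    trajectory = latest - previous  # assumed: simple change between cycles
    return shrink * latest + shrink * trajectory
```

<p>For example, a state that moved from R+5 to R+1 across both offices would be passed in as <code>project_margin((-5, -5), (-1, -1))</code>. With no shrinkage (<code>shrink=1.0</code>) that continues the trend to roughly D+3, and the regularization pulls the projection back toward zero.</p>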
<p>Informing this approach is work from <a href="https://catalist.us/yair-ghitza-phd/">Yair Ghitza</a> describing what went wrong in 2016, which suggested polarization and other state-level trends would continue, in addition to national trends or&nbsp;“uniform swing.”&nbsp;</p>
<blockquote class="twitter-tweet blockquote">
<a href="https://twitter.com/SolomonMg/status/1325564912798752773"></a>
</blockquote>
<p>I should note that this may only have worked because of something peculiar about this election cycle–I haven’t gone and back-tested this approach or anything like that.</p>
<p>Seems I was not the only one who noticed this kind of pattern:</p>
<blockquote class="twitter-tweet blockquote">
<a href="https://twitter.com/SolomonMg/status/1325522770890027008"></a>
</blockquote>
</section>
<section id="what-went-wrong-the-usual-suspects" class="level2">
<h2 class="anchored" data-anchor-id="what-went-wrong-the-usual-suspects">What went wrong: The Usual Suspects</h2>
<p>Humble-brag aside, it’s worth asking what might have gone wrong with polling in 2020.</p>
<p>The <a href="https://www.aapor.org/Education-Resources/Reports/An-Evaluation-of-2016-Election-Polls-in-the-U-S.aspx">2016 AAPOR report on election polling</a> provides some guidance for how we might start to examine issues with the 2020 polls.</p>
<p><strong>Undecided voters</strong>: Undecideds broke toward Trump late in the election in 2016–polls found as many as 13 percent of voters were <a href="https://fivethirtyeight.com/features/the-invisible-undecided-voter/">undecided on election day or planned to vote for a third party</a>. According to Poynter, there were <a href="https://www.poynter.org/fact-checking/2020/2020-is-not-like-2016-heres-whats-different/">half as many of these voters in 2020</a>, so this is unlikely to be as big a factor as in 2016.</p>
<p><strong>Low education non-response &amp; adjustment</strong>: In 2016, individuals with lower levels of education were much less likely to answer polls but still voted, and broke for Trump. The national polls adjusted for this but state level polls did not, which is partially why forecasting models that rely on state-level polls missed so hard.</p>
<p>While many state-level pollsters did this in 2020, Pew Research Center still <a href="https://www.pewresearch.org/methods/2020/08/18/a-resource-for-state-preelection-polling/">found problems with state level polling this time around</a>, for example failing to adjust for race and education simultaneously–non-college whites are far more likely to support Trump than non-college non-whites.</p>
<p>What’s more, pollsters adjusted only for college/non-college, which may not have been enough. They might need to use more fine grained adjustment–accounting separately for whether respondents have a high school degree and whether they have a college degree. Also, errors and missing data in how people report their education in a survey mean trouble if you want to fully fix the issue.</p>
<p><strong>Volunteerism &amp; civic engagement</strong>: Even if you adjust for low response rates among individuals with lower education, pollsters still may have problems reaching <a href="https://www.pewresearch.org/fact-tank/2015/07/21/the-challenges-of-polling-when-fewer-people-are-available-to-be-polled/">low civic engagement voters, a bias that seems to persist even after modeling/weighting adjustments</a>. In the past this hasn’t mattered as much, but these folks may be showing up to the polls for Trump.</p>
</section>
<section id="other-potential-factors" class="level2">
<h2 class="anchored" data-anchor-id="other-potential-factors">Other Potential Factors</h2>
<p><strong>Likely voter models</strong>: This is difficult to fully unpack since each polling house does this slightly differently and not all publish their methods—some ask a battery of voter questions, some use models, some recruit off the voter file. But there’s only a weak relationship between who votes and who scores high on the likely voter battery. To make matters worse, 2020 was a very high-turnout election, which could have introduced even more instability into likely voter models.</p>
<p>Another important point from Peter Suzman is that likely voter screens could have inflated estimates of Dem turnout if they asked if respondents had already voted—it was Democrats who voted early.</p>
<blockquote class="twitter-tweet blockquote">
<a href="https://twitter.com/Biomaven/status/1325545770230161408"></a>
</blockquote>
<p>However, that would only explain error in likely voter models, not polling based on registered voters, which also seemed to miss big this cycle, as I pointed out:</p>
<blockquote class="twitter-tweet blockquote">
<a href="https://twitter.com/SolomonMg/status/1325605403636486146"></a>
</blockquote>
<p><strong>COVID-19</strong>: I wrote about <a href="https://solomonmg.github.io/post/trumps-chances-are-better-than-they-look/">this back in June</a>. It’s possible that COVID-19 made lines long and kept people home in urban areas and non-white communities. Yes, we had record turnout, but all it takes is a few percent of people who encounter a bit of voting friction, who fail to register in person, don’t get in-person canvassing/GOTV contact, don’t vote by mail early, and/or don’t vote in person.</p>
<p>At the same time, <a href="https://www.nytimes.com/2020/11/10/upshot/polls-what-went-wrong.html">David Shor points out</a> in a piece by Nate Cohn at the New York Times, that “…after lockdown, Democrats just started taking surveys, because they were locked at home and didn’t have anything else to do.”</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/DavidShorOLNCNYT.png" class="img-fluid figure-img"></p>
</figure>
</div>
<p>Without Dems doing the usual in-person registration drives, organizing, canvassing, etc. plus long lines in the hardest hit areas, and with Democrats taking surveys at unusually high rates, we might expect to see Trump overperform in areas hit hardest by COVID-19.</p>
<p>And indeed the data show just that. <a href="https://www.npr.org/sections/health-shots/2020/11/06/930897912/many-places-hard-hit-by-covid-19-leaned-more-toward-trump-in-2020-than-2016">NPR has a nice visualization of this</a>:</p>
<blockquote class="twitter-tweet blockquote">
<a href="https://twitter.com/SolomonMg/status/1326348370869415937"></a>
</blockquote>
<p>Another possibility is that shutdowns, school closings, and job losses stoked anger &amp; resentment in centrist &amp; right-leaning voters. I remember watching a local FB group quickly organize around the issue of school-openings and eventually morph into a hub for protests.</p>
<p>EDIT: I took a look at Trump’s performance by the urbanicity and racial makeup of those counties and here’s what I found:</p>
<p><img src="https://solomonmg.github.io/img/share_over_2016_pct_nonwhite_covid_up_weighted.png" class="img-fluid" alt="normal"> <img src="https://solomonmg.github.io/img/share_over_2016_pct_urban_covid_up_weighted.png" class="img-fluid" alt="normal"></p>
<p>Trump outperforms 2016 in non-white counties, and UNDER-performs in mostly-white counties. Same for more urban counties. That’s consistent with COVID hitting non-white counties much harder in terms of registration, long lines, and lower VBM rates.</p>
<p>That seems to stand in sharp contrast to speculation that Trump would be hit hardest in areas where people are most likely to know someone with COVID.</p>
<p><strong>Shy Trump voters</strong>: There’s a hypothesis out there that people are embarrassed to admit that they would vote for Trump. The evidence for this is limited–Kyle Dropp and co at Morning Consult did some experimental work on this and found that people were slightly more likely <a href="https://morningconsult.com/form/shy-trump-2020/">in the 2016 primaries (but NOT the General and not in 2020)</a> to say that they would vote for Trump when answering via online survey compared with speaking with a live pollster over the phone. But they’ve done many follow-on surveys since and the pattern doesn’t persist.</p>
<blockquote class="twitter-tweet blockquote">
<a href="https://twitter.com/NateSilver538/status/1324948324718436352"></a>
</blockquote>
<p>I am skeptical that this could be as much of a factor as some on social media seem to be claiming, but it’s hard to get good data to answer this question, and I acknowledge that absence of evidence is not evidence of absence. A number of commentators have claimed that since the polls underestimated support for all Republicans, this is an unlikely explanation.</p>
<p>That sounds pretty air-tight at first glance but it’s possible that some undecideds, perhaps embarrassed about having Trump as a figurehead of the Republican party, refused to say with certainty who they would actually vote for. Nevertheless, based on the pattern of results we’ve seen so far, this really can’t explain very much of the polling error this time around.</p>
</section>
<section id="the-role-of-election-forecasts" class="level2">
<h2 class="anchored" data-anchor-id="the-role-of-election-forecasts">The Role of Election Forecasts</h2>
<p>If you’re a forecaster, it’s very easy to look at all the polling data and come away with overconfident estimates of a candidate’s support. Many forecasters in 2016 did just that, failing to account for the fact that errors across states and pollsters were likely correlated, and producing estimates that put Clinton’s chances above 95%.</p>
<p>The Huffington Post famously <a href="https://www.huffpost.com/entry/nate-silver-election-forecast_n_581e1c33e4b0d9ce6fbc6f7f">roasted FiveThirtyEight</a> for trying to adjust for this state-level polling error the day before the 2016 election.</p>
<p>But even when forecasters get it right, forecasting can create firm expectations that one candidate will win, which in 2016 was complicated by a destiny narrative driven by media coverage of election forecasting.</p>
<p>Sean Westwood, Yph Lelkes and I recently published a <a href="https://solomonmg.github.io/pdf/aggregator.pdf">research paper</a> in the Journal of Politics showing just how much additional confidence forecasts give us, and wrote about the implications for the 2020 election in a recent <a href="https://www.usatoday.com/story/opinion/2020/10/01/election-forecasts-can-wrong-you-still-need-vote-column/5857993002/">USA Today op ed</a>.</p>
<p>I believe it was the sharp violation of expectations that was so disappointing to Clinton supporters and so invigorating for the MAGA crowd—the Washington elite had underestimated “real Americans” yet again.</p>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>


</section>

 ]]></description>
  <guid>https://solomonmg.github.io/blog/what-the-polls-got-wrong-in-2020/</guid>
  <pubDate>Sun, 08 Nov 2020 00:00:00 GMT</pubDate>
  <media:content url="https://solomonmg.github.io/blog/what-the-polls-got-wrong-in-2020/featured.png" medium="image" type="image/png" height="105" width="144"/>
</item>
<item>
  <title>Trump’s chances are better than they look</title>
  <dc:creator>Sol Messing</dc:creator>
  <link>https://solomonmg.github.io/blog/trumps-chances-are-better-than-they-look/</link>
  <description><![CDATA[ 





<p>According to the latest polling research, Trump’s chances of hanging on to power beyond 2020 look pretty dismal. Nate Cohn published an impressive battleground <a href="https://www.nytimes.com/2020/06/25/upshot/poll-2020-biden-battlegrounds.html">poll from New York Times/Siena</a> showing Biden ahead of Trump by at least six points in pivotal states. The Economist’s forecast, powered by Elliott Morris and Andrew Gelman, is suggesting Biden is likely to get 64% of electoral college votes, and that if the election were held 100 times Biden would <a href="https://statmodeling.stat.columbia.edu/2020/06/12/election-2020-is-coming-our-poll-aggregation-model-with-elliott-morris-of-the-economist/">win 90 times to Trump’s 10</a>.</p>
<p>At this point I would like to remind you of that feeling you felt on election night 2016. When a month earlier, <a href="https://www.cnn.com/2016/10/23/politics/hillary-clinton-donald-trump-presidential-polls/index.html">CNN’s ‘Poll of Polls’ had Clinton up by 9 points</a> and two prominent forecasters put Clinton’s chances at 99%. Remember that?</p>
<p>I could probably stop there, but I’m not going to because although we’ve fixed some of the issues from 2016, we have COVID-19. And COVID will mess with our election in ways very likely to hurt Democrats, and I know of no pollster factoring this into their method or likely voter model.</p>
<p>After 2016, Sean Westwood, Yphtach Lelkes and I began a multi-year research project (recently published in the <a href="https://www.journals.uchicago.edu/doi/abs/10.1086/708682?mobileUi=0">Journal of Politics</a>) and found that when you have high confidence that one candidate will win, <a href="https://solomonmg.github.io/project/projecting_confidence/">you’re less likely to vote</a>. The fact that everyone thought Clinton would win in 2016 shaped Comey’s decision to release his infamous letter that <a href="https://fivethirtyeight.com/features/the-comey-letter-probably-cost-clinton-the-election/">some believe cost Clinton the election</a>, changed the way campaigns operated, and likely <a href="https://solomonmg.github.io/pdf/aggregator.pdf">lowered Democratic turnout</a>.</p>
<p>In addition to showing this in an experiment, one pattern that clearly pops out in the data we analyzed (ANES timeseries) is that people who think the leading candidate will win by quite a bit report voting at about a 3% lower rate. That’s in line with other research showing that early exit polls indicating one candidate is likely to win <a href="https://www.jstor.org/preview-page/10.2307/2748722?seq=1">decrease turnout</a>, and are more likely to <a href="https://repository.upenn.edu/cgi/viewcontent.cgi?referer=&amp;httpsredir=1&amp;article=1018&amp;context=asc_papers">affect Democrats</a>. Yet this is by no means an upper bound—one study found more decisive exit polling <a href="https://www.sciencedirect.com/science/article/pii/S0014292115000483">depressed turnout by 11 points</a>.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/closerace_vote_anes.png" class="img-fluid figure-img"></p>
</figure>
</div>
<p>While it’s if anything a noisy indicator of the influence Clinton’s ostensible lead may have had on Democrats compared with Republicans, the proportion of Democrats who thought Clinton would ‘win by quite a bit’ was much higher in 2016 than for Republicans, and much higher than it’d been in many years.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/anes_turnout_closerace_mc_tall.png" class="img-fluid figure-img"></p>
</figure>
</div>
<p>To be clear, I no longer occupy the role of dispassionate observer–I’m actively working in politics at the moment.</p>
<p>So while I like seeing Biden up, let me explain exactly why the margins we’re seeing could be a polling mirage.</p>
<section id="covid-19" class="level2">
<h2 class="anchored" data-anchor-id="covid-19">COVID-19</h2>
<p>Are pollsters accounting for the likely decline in urban turnout due to COVID-19? Not if they are assuming typical levels of turnout across urban and rural areas.</p>
<p>Make no mistake, COVID-19 is already affecting the political process. Look at voter registration. As many colleagues who regularly deal with registration data have warned me, the usual rush of new voter registrations, often from young voters, has “fallen off a cliff.” Registration numbers started stronger than ever as the new year began, but as <a href="https://fivethirtyeight.com/features/voter-registrations-are-way-way-down-during-the-pandemic/">538 notes, fell to unprecedented levels in March</a> as pandemic social distancing measures took effect.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://fivethirtyeight.com/features/voter-registrations-are-way-way-down-during-the-pandemic/"><img src="https://solomonmg.github.io/img/538-voter-registrations-are-way-way-down-during-the-pandemic.png" class="img-fluid figure-img"></a></p>
</figure>
</div>
<p>So it’s already hurting Democrats in terms of new registrations, but what might all this mean on election day? At first blush, it may be tempting to say to yourself, “COVID is affecting old people more than the young, and they break conservative so the left is probably fine,” before feeling slightly ashamed that you’re thinking about strategic considerations before the loss of life and sadness this statement implies.</p>
<p>Think a little deeper and you’ll likely realize that so far COVID-19 has affected left-leaning people in left-leaning places—<a href="https://www.npr.org/2020/04/12/832455226/what-coronavirus-exposes-about-americas-political-divide">non-White voters in urban areas</a> far more than their suburban/rural counterparts. Even the recent <a href="https://www.theatlantic.com/politics/archive/2020/06/coronavirus-surge-sun-belt-could-doom-trump/613495/">surge in cases in sunbelt states</a> is hitting urban and non-White regions hardest.</p>
<p>What’s more, conservatives seem to be far more willing than liberals to risk going out and about. A Pew study shows <a href="https://www.pewresearch.org/fact-tank/2020/05/07/americans-remain-concerned-that-states-will-lift-restrictions-too-quickly-but-partisan-differences-widen/">Republicans are far more likely</a> to support lifting COVID restrictions quickly than Democrats.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://www.pewresearch.org/fact-tank/2020/05/07/americans-remain-concerned-that-states-will-lift-restrictions-too-quickly-but-partisan-differences-widen/"><img src="https://solomonmg.github.io/img/covid-partisan-differences-widen-Pew.png" class="img-fluid figure-img"></a></p>
</figure>
</div>
<p>With a deadly pandemic raging, will urban and non-urban voters go to the polls at the usual rates?</p>
<p>Post-pandemic primary voting has meant a vast reduction in the number of polling places and a big increase in mail-in ballots. We’re seeing this in post-pandemic primaries like this Tuesday’s in <a href="https://www.nytimes.com/2020/06/23/us/politics/kentucky-new-york-election-recap.html?action=click&amp;module=Top%20Stories&amp;pgtype=Homepage">Kentucky, New York, and Virginia</a>.</p>
<p>In New York’s primary, there were reports of <a href="https://www.thecity.nyc/2020/6/23/21300471/nyc-primary-missing-ballots-busted-machines-pandemic">missing mail-in ballots</a>. Kentucky also saw reports of <a href="https://www.motherjones.com/politics/2020/06/kentucky-slashes-polling-places-voting-rights-mcgrath-booker-lebron-james/">long lines that disproportionately hit Black neighborhoods</a>, in a primary that will determine the Democrat who runs against Senate Majority Leader Mitch McConnell.</p>
<p>What at first looks like maybe a silver lining is the surge in voting by mail-in ballot. And while Trump sees mail-in ballots as a <a href="https://www.politico.com/news/2020/06/19/trump-interview-mail-voting-329307">threat to his re-election</a>, the evidence is far from clear that widespread voting by mail would hurt his chances.</p>
<p>On the contrary, Stanford’s Andy Hall and his co-authors estimate that universal vote by mail <a href="https://www.pnas.org/content/117/25/14052">should have no impact on either party’s vote share</a>. However, as they note, vote by mail may very well have a disparate impact on minority voters, and their estimates assume that every voter is mailed a ballot, rather than needing to opt in to voting by mail.</p>
<p>And just today, the Supreme Court <a href="https://www.nytimes.com/2020/06/26/us/supreme-court-texas-vote-by-mail.html?action=click&amp;module=Top%20Stories&amp;pgtype=Homepage">denied an emergency request</a> to allow all citizens in Texas to vote by mail. That’s not the last word, but conservatives are actively fighting measures like this one, which would have made it far easier to prepare to handle a deluge of mail-in ballots in the fall.</p>
<p>Furthermore, we’re already seeing evidence in the primaries of <a href="https://www.nytimes.com/2020/03/09/us/virus-election-voting.html">poll-workers failing to show up</a>, lengthening the already long lines in urban areas that discourage voters.</p>
<p>If <a href="https://faculty.ucmerced.edu/thansford/Articles/The%20Republicans%20Should%20Pray%20for%20Rain%20-%20Weather,%20Turnour,%20and%20Voting%20in%20U.S.%20Presidential%20Elections.pdf">a little bit of rain can depress turnout in urban areas</a>, fear of a deadly pandemic that spreads when you’re standing in line seems likely to as well.</p>
<p>What’s more, it’s going to take longer to count mail-in ballots, and there will almost certainly be confusion about results, <a href="https://www.poynter.org/fact-checking/2020/be-patient-on-election-night-2020-counting-the-returns-will-take-time/">as Poynter recently noted</a>. Based on the President’s rhetoric around voting by mail, there will almost certainly be legal disputes about the legitimacy of certain results if not the election writ large.</p>
<p>Buckle up.</p>
</section>
<section id="things-change" class="level2">
<h2 class="anchored" data-anchor-id="things-change">Things Change</h2>
<p>Six months ago the big story was the prospect of war with Iran after Trump killed Soleimani. The political world is fundamentally different now and it’s more than possible that something important will happen between now and election day with political consequences.</p>
<p>Does that matter? Andrew Gelman (yes, the same) and Gary King have a paper suggesting it doesn’t—showing that we can <a href="https://www.jstor.org/preview-page/10.2307/194212?seq=1">predict elections remarkably well despite how much polls fluctuate</a>. “Thus, the general campaign for president seems irrelevant to the outcome … despite all the media coverage of campaign strategy… <em>except in very close elections</em>.” And <a href="https://pollyvote.com/en/components/models/retrospective/fundamentals-plus-models/time-for-change-model/">Alan Abramowitz’s forecasting model</a>, which is the kind of model they are referencing and which has done extremely well in the past, has <a href="https://www.rasmussenreports.com/public_content/political_commentary/commentary_by_alan_i_abramowitz/assessing_trump_s_chances_forecasting_the_2020_presidential_election">Trump’s chances in 2020 nearly even</a> (though both the economy and Trump’s polling numbers have suffered since).</p>
<p>So even if you’re one of those people who think that in general the ebb and flow of historical events largely does not impact U.S. elections, it may matter more in 2020 than in a typical year, and that’s before you even factor in a global pandemic that has upended life in America.</p>
<p>OK, but how many people are really undecided about Trump? When asked who they’d vote for, 8 percent of people in Nate Cohn’s poll said something other than Biden or Trump.</p>
<p>According to the American Association for Public Opinion Research 2016 post-mortem, <a href="https://www.aapor.org/Education-Resources/Reports/An-Evaluation-of-2016-Election-Polls-in-the-U-S.aspx">a tsunami of undecided voters went to Trump</a>, which was a major reason we thought Clinton was going to win in 2016. One controversial possibility is that some of these undecided voters were actually “<a href="https://morningconsult.com/2016/11/03/shy-trump-social-desirability-undercover-voter-study/">shy Trump supporters</a>,” which might explain the swing. Of course, another controversial possibility is that <a href="https://fivethirtyeight.com/features/the-comey-letter-probably-cost-clinton-the-election/">the Comey letter cost Clinton the election</a>.</p>
<p>Regardless, Trump is better known now than in 2016 and there <em>are</em> fewer undecideds this time around. But we have no clue how these folks will break in 2020, and in 2016 they broke for Trump.</p>
</section>
<section id="correcting-for-political-engagement" class="level2">
<h2 class="anchored" data-anchor-id="correcting-for-political-engagement">Correcting for Political Engagement</h2>
<p>That 8 percent undecided number above may very well be an underestimate. Arguably the biggest problem in survey research today is that you can’t fully adjust for the <a href="https://www.pewresearch.org/fact-tank/2015/07/21/the-challenges-of-polling-when-fewer-people-are-available-to-be-polled/">bias toward high political knowledge respondents</a>. And low political knowledge voters are more likely than others to be undecided.</p>
</section>
<section id="education" class="level2">
<h2 class="anchored" data-anchor-id="education">Education</h2>
<p>Likewise, it’s difficult to survey Americans with low education. The <strong>vast majority</strong> of polls fail to recruit a representative swath of these potential voters and cannot fully adjust away the bias.</p>
<p>OK so what?</p>
<p>No election has <a href="https://www.pewresearch.org/fact-tank/2016/11/09/behind-trumps-victory-divisions-by-race-gender-education/">split on education like 2016</a> going back to the beginning of Pew Research Center’s data on this in 1980. Non-college Whites voted for Trump over their college educated counterparts by a <strong>35 point margin</strong>. And the best retrospective analyses show that his biggest gains have <a href="https://williammarble.co/docs/vb.pdf">come from low-education White moderates in battleground states</a> (and <em>not</em>, as many have presumed, from those with conservative views on race and immigration, across the educational spectrum).</p>
<p>Many 2016 polls did not adjust their samples to account for education—something that mattered far less in the past and something not easy to do correctly. They systematically underestimated Trump’s support in part because of this issue.</p>
<p>Although the NYT/Siena poll and now many others do target and weight by education to increase representativeness, the <a href="https://int.nyt.com/data/documenttools/nyt-siena-poll-methodology-june-2020/f6f533b4d07f4cbe/full.pdf">methodology</a> shows this poll (along with most) lumps together everyone without a college degree. While the <a href="https://www.aapor.org/Education-Resources/Reports/An-Evaluation-of-2016-Election-Polls-in-the-U-S.aspx">AAPOR report concludes this may be ok</a>, Trump’s support does appear to increase as education decreases, which means failing to disaggregate “no high school degree,” “high school degree,” and “some college” when adjusting for education may very well result in some bias in favor of Trump.</p>
</section>
<section id="higher-error-in-subnational-polls" class="level2">
<h2 class="anchored" data-anchor-id="higher-error-in-subnational-polls">Higher Error in Subnational Polls</h2>
<p>The NYT/Siena poll is one of the best subnational polls out there, but the error in battleground polls like this is generally <a href="https://yougov.co.uk/topics/politics/articles-reports/2016/11/11/first-thoughts-polling-problems-2016-us-elections">higher than national polls</a>. It’s harder to reach the right mix of people in individual states in a short period of time, which increases the error. By error, I mean the actual error in predicting presidential vote share, not the reported “margin of error,” which is usually <a href="https://www.nytimes.com/2016/10/06/upshot/when-you-hear-the-margin-of-error-is-plus-or-minus-3-percent-think-7-instead.html"><strong>around half the actual error</strong></a>.</p>
<p>The reported margin of error here is about 2%, so doubling that, 4%, plus the 8 percent who didn’t say Biden or Trump means there may be 12% wiggle room, possibly more.</p>
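<p>The arithmetic behind that figure can be sketched in a few lines of Python. This is only an illustration: the function name <code>wiggle_room</code> and its parameters are hypothetical, and the doubling factor is the rule of thumb from the paragraph above, not an estimate for this particular poll.</p>

```python
def wiggle_room(reported_moe, undecided, error_multiplier=2.0):
    """Rough uncertainty in percentage points.

    Scales the reported margin of error by a rule-of-thumb multiplier
    (an assumption, based on the "around half the actual error" heuristic)
    and adds the share of respondents who named neither candidate.
    """
    return reported_moe * error_multiplier + undecided


# Numbers from the text: 2-point reported margin, 8 percent other/undecided.
print(wiggle_room(2.0, 8.0))  # 12.0
```

<p>Simply adding the two quantities is a deliberately crude choice; it treats the worst case on both at once, which is why the text says “possibly more” rather than offering a precise interval.</p>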
</section>
<section id="issues-with-the-voter-file" class="level2">
<h2 class="anchored" data-anchor-id="issues-with-the-voter-file">Issues with the Voter File</h2>
<p>The way pollsters recruit people for their survey has a huge impact on accuracy. If you don’t get data from the right mix of people you’re not going to get a good sense of which candidate is ahead, and you can only get so much juice out of adjusting your polls using approaches like weighting.</p>
<p>Pollsters often use random-digit dialing to get a representative sample, but many of the best election surveys run today are now conducted by calling people from the voter file. The <a href="https://www.nytimes.com/2018/09/06/upshot/live-poll-explainer.html">Times used the voter file</a> in part so they could poll congressional districts, which are drawn in such idiosyncratic shapes that they don’t line up with area codes or almost any other data set with phone numbers.</p>
<p>The Times uses the voter file to target specific subsets of the population that are hard to reach, such as low-education voters. Unfortunately, running a voter-file based poll may still not get enough low-education voters—as <a href="https://www.pewresearch.org/methods/2018/10/09/performance-of-the-samples/">Pew Research Center’s voter file study</a> showed (it used the same voter file vendor as does the Times—L2). So you have to rely on statistical adjustment, increasing error.</p>
<p>What’s more, in that Pew study only <a href="https://www.pewresearch.org/methods/2018/10/09/comparing-survey-sampling-strategies-random-digit-dial-vs-voter-files/#overview-of-study-methodology">62% of respondents who answered on a cell phone</a> were the actual person on the voter file. And these quality issues vary a lot by state—remember, the file is first gathered by the secretary of state and is subject to local laws and regulations. For example, Wisconsin’s voter file is notoriously bad.</p>
<p>All this increases total survey error and the chance that systematic biases will creep in.</p>
<p>While the polling does indeed suggest better news than if it showed Trump ahead, this is still very likely a highly competitive race.</p>


</section>

 ]]></description>
  <guid>https://solomonmg.github.io/blog/trumps-chances-are-better-than-they-look/</guid>
  <pubDate>Sat, 20 Jun 2020 00:00:00 GMT</pubDate>
  <media:content url="https://solomonmg.github.io/blog/trumps-chances-are-better-than-they-look/featured.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Facebook Condor URLs Data Release</title>
  <dc:creator>Sol Messing</dc:creator>
  <link>https://solomonmg.github.io/blog/condor-data-release/</link>
  <description><![CDATA[ 





<p><a href="../../pdf/Facebook_DP_URLs_Dataset.pdf" class="btn-paper">PDF</a> <a href="https://twitter.com/SolomonMg" class="btn-paper">Follow</a></p>
<p>On January 17, 2020 my team at Facebook launched one of the largest social science data sets ever constructed. It’s meant to facilitate research on misinformation from across the web, shared and spread on Facebook.</p>
<p><a href="../../pdf/Facebook_DP_URLs_Dataset.pdf">Full details on the release here</a>.</p>
<p>We also released the <a href="https://github.com/facebookresearch/URL-Sanitization">URL sanitization framework</a>, which I implemented (and which my SWE colleagues refactored).</p>
<p>What makes this data release unprecedented is that it contains <em>exposure data</em> describing external links that billions of users saw and read while using the site.</p>
<p>The data set goes beyond URL-level data, breaking down exposure and interactions by month, country, age, gender, and in the U.S., political page affinity (see Barbera et al 2015).</p>
<p>The data contain two tables: (1) a “URL attributes” table describing the 38 million URLs in the data set, including how many times users tagged those posts as containing misinformation, harassment, etc. and (2) a “breakdown” table, which aggregates counts of actions taken on URLs, broken out by user demographics and URL attributes.</p>
<p>The <a href="../../pdf/Facebook_DP_URLs_Dataset.pdf">technical documentation</a> reflects more work than most papers I’ve written. The list of authors reflects the scale of this massive team effort, and that’s before you include the incredibly helpful advice we got from a number of computer scientists in the academy listed in the acknowledgements.</p>
<p>Perhaps most importantly, this release provides guarantees about anonymity in an incredibly rigorous way–action-level differential privacy, while preserving more underlying signal in the data.</p>
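<p>For readers unfamiliar with differential privacy, the core idea is to add calibrated random noise to each released count. The sketch below uses the classic Gaussian mechanism purely as an illustration. It is not the code used for this release, and the function names, sensitivity, epsilon, and delta values are all hypothetical; see the technical documentation for the actual procedure and parameters.</p>

```python
import math
import random


def gaussian_sigma(sensitivity, epsilon, delta):
    """Noise scale for the classic Gaussian mechanism.

    Uses the textbook bound, which holds for epsilon between 0 and 1.
    """
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def privatize_counts(counts, sensitivity, epsilon, delta, rng):
    """Return the counts with independent Gaussian noise added to each one."""
    sigma = gaussian_sigma(sensitivity, epsilon, delta)
    return [count + rng.gauss(0.0, sigma) for count in counts]


# Hypothetical example: three aggregated action counts and made-up parameters.
rng = random.Random(0)
noisy = privatize_counts([120, 45, 3], sensitivity=1.0, epsilon=0.5,
                         delta=1e-5, rng=rng)
print([round(value, 1) for value in noisy])
```

<p>Large counts survive this kind of noise nearly intact while very small counts are swamped by it, which is the intended trade-off: aggregate patterns stay visible, but no single action can be confidently inferred.</p>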



 ]]></description>
  <guid>https://solomonmg.github.io/blog/condor-data-release/</guid>
  <pubDate>Mon, 18 May 2020 00:00:00 GMT</pubDate>
  <media:content url="https://solomonmg.github.io/blog/condor-data-release/featured.png" medium="image" type="image/png" height="121" width="144"/>
</item>
<item>
  <title>Projecting Confidence</title>
  <dc:creator>Sol Messing</dc:creator>
  <link>https://solomonmg.github.io/blog/projecting-confidence/</link>
  <description><![CDATA[ 





<p><a href="../../pdf/aggregator.pdf" class="btn-paper">PDF</a> <a href="https://twitter.com/SolomonMg" class="btn-paper">Follow</a></p>
<p>Inspired by Donald Trump’s surprise victory over Hillary Clinton in the 2016 general election, <a href="https://www.dartmouth.edu/~seanjwestwood/">Sean Westwood</a>, <a href="http://ylelkes.com/">Yphtach Lelkes</a> and I set out to interrogate the question of whether election forecasts—particularly probabilistic forecasts—might create a sense of inevitability, and ultimately lead people to stay home on election day.</p>
<p>Clinton herself was quoted in <a href="http://nymag.com/daily/intelligencer/2017/05/hillary-clinton-life-after-election.html?mid=nymag_press">New York Magazine</a> after the election:</p>
<blockquote class="blockquote">
<p>I had people literally seeking absolution… ‘<em>I’m so sorry I didn’t vote. I didn’t think you needed me.</em>’ I don’t know how we’ll ever calculate how many people thought it was in the bag, because the percentages kept being thrown at people — ‘<em>Oh, she has an 88 percent chance to win!</em>’</p>
</blockquote>
<p><strong>Is it plausible that forecasting could have affected the election?</strong></p>
<p>For this phenomenon to affect an election, it must: 1. be visible in the media so it reaches potential voters, 2. depress turnout, and 3. affect one side more than the other. In the case of 2016, that means affecting Clinton’s supporters (and/or Clinton campaigners) more than Trump’s.</p>
<p>We found evidence for all of the above. First, witness the rise of forecasts since 2008, when FiveThirtyEight first came on the scene:</p>
<p><img src="https://solomonmg.github.io/img/forecast_google_news.png" class="img-fluid"></p>
<p>What’s more, there is good evidence that one side will be more affected. Our research (see results below) suggests that <em>the candidate who is ahead in the polls is more affected</em> by probabilistic forecasts. In 2016, that was Hillary.</p>
<p>And irrespective of 2016, it’s outlets with a <em>left-leaning audience</em> that publish and cover election forecasts. The websites that present their poll aggregation results in terms of probabilities have left-leaning (negative) social media audiences—only realclearpolitics.com, which doesn’t emphasize win-probabilities, has a conservative audience:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/bma_science_alignment.png" class="img-fluid figure-img"></p>
<figcaption>half</figcaption>
</figure>
</div>
<p>These data come from the average self-reported ideology of people who share links to various sites hosting poll-aggregators on Facebook, data that come from <a href="http://science.sciencemag.org/content/early/2015/05/06/science.aaa1160.full">this paper</a>’s <a href="http://dx.doi.org/10.7910/DVN/LDJ7MS">replication materials</a>.</p>
<p>When you look at the balance of coverage of probabilistic forecasts on major television broadcasts, there is more coverage on MSNBC, which has a more liberal audience.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/msnbc_mentions.png" class="img-fluid figure-img"></p>
<figcaption>half</figcaption>
</figure>
</div>
<p><strong>How much influence do forecasters really have?</strong></p>
<p>It’s incredibly difficult to tease out when one media outlet is influencing another. However, a freak event in 2018 allows us to get some traction on this question, and suggests that FiveThirtyEight’s 2018 coverage was highly influential.</p>
<p>After FiveThirtyEight’s real-time forecast suddenly moved the GOP’s odds of taking the House from single digits to about 60% at around 8:15PM, PredictIt’s odds on the GOP rose above 50-50, and <em>U.S. government bond yields rose 2-4 basis points.</em> FiveThirtyEight then altered its prediction system and the markets calmed down.</p>
<p><img src="https://solomonmg.github.io/img/538-markets.jpg" class="img-fluid"></p>
<p>This spike seems to have occurred because a number of big, <a href="https://fivethirtyeight.com/live-blog/2018-election-results-coverage/#3495">Republican-dominated districts started reporting returns before those that went toward Democrats</a> and because the model was making inferences from partial vote counts:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/538realtimepolling.jpg" class="img-fluid figure-img"></p>
<figcaption>half</figcaption>
</figure>
</div>
<p>This was <a href="https://ftalphaville.ft.com/2018/11/07/1541617447000/Debt-markets-let-us-know-what-they-think-about-Republicans-last-night/">first reported by Colby Smith &amp; Brendan Greeley of FT.com</a>. They report that because markets expected to see more inflation under a Republican House (high spending, low taxes), U.S. bond yields rose.</p>
<p>Was this just a correlation? Possibly, but there was pretty much nothing else happening in the U.S., and it was like 1 am in Europe, as pointed out in the FT.com piece above.</p>
<p>Josh Tucker suggested that <a href="http://themonkeycage.org/2012/10/convergence-between-polls-and-prediction-markets-in-us-presidential-election/">538 might be driving prediction markets</a> back in 2012 in a Monkey Cage blogpost.</p>
<p><strong>Our research on forecasting and perception</strong></p>
<p><a href="../../pdf/aggregator.pdf">Our research</a> shows that probabilistic election forecasts make a race look less competitive. Participants in a national probability survey-experiment were substantially more certain that one candidate would win a hypothetical race after seeing a probabilistic forecast than after seeing the equivalent vote share estimate and margin of error. This is a big effect—those are confidence intervals not standard errors, with p-values below <img src="https://latex.codecogs.com/png.latex?10%5E%7B-11%7D">.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/certaintyc.png" class="img-fluid figure-img"></p>
<figcaption>normal</figcaption>
</figure>
</div>
<p><strong>Why do people do this?</strong></p>
<p>More research is needed here but we do have some leads. First, small differences in the election metric most familiar to the public—vote share estimates—generally correspond to very large differences in the probability of a candidate’s chance of victory.</p>
<p>Andy Gelman referenced this in passing in a <a href="https://andrewgelman.com/2012/10/22/is-it-meaningful-to-talk-about-a-probability-of-65-7-that-obama-will-win-the-election/">2012 blogpost</a> questioning the decimal precision (0.1 percent) that 538 used to communicate its forecast on its website:</p>
<blockquote class="blockquote">
<p>That’s right: a change in 0.1 of win probability corresponds to a 0.004 percentage point share of the two-party vote. I can’t see that it can possibly make sense to imagine an election forecast with that level of precision…</p>
</blockquote>
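<p>To see why small vote share differences translate into large probability differences, here is a minimal sketch that assumes the final two-party vote share is normally distributed around the forecast with a standard deviation of 2 points. Both the normal model and the 2-point figure are illustrative assumptions, not the model used by 538 or in our paper, and <code>win_probability</code> is a hypothetical helper.</p>

```python
import math


def win_probability(expected_share, share_sd):
    """Chance that a candidate's two-party vote share exceeds 50 percent.

    Assumes (for illustration only) that the final share is normally
    distributed around expected_share with standard deviation share_sd.
    """
    z = (expected_share - 0.5) / share_sd
    # Standard normal CDF written with the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


for share in (0.50, 0.51, 0.52, 0.53):
    chance = win_probability(share, 0.02)
    print(f"{share:.0%} of the vote: {chance:.0%} chance of winning")
```

<p>Under these assumptions, moving the forecast from 50 to 52 percent of the vote moves the win probability from 50 percent to roughly 84 percent. A shift that looks minor on the vote share scale looks decisive on the probability scale.</p>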
<p>Second, people sometimes confuse probabilistic forecasts with vote share projections, and incorrectly conclude that a candidate is projected to win, say, 85 percent of the vote, rather than having an 85% chance of winning the election. About 1 in 10 people did this in our experiment.</p>
<p>As <a href="https://twitter.com/jbenton/status/1059898288139354112">Joshua Benton pointed out in a tweet</a>, TalkingPointsMemo.com <a href="https://talkingpointsmemo.com/news/issa-calls-race-early">made this very mistake</a>:</p>
<p><img src="https://solomonmg.github.io/img/TPMCorrection.jpg" class="img-fluid"></p>
<p>Finally, people tend to think in qualitative terms about the probability of events (Sunstein 2002; Keren 1991). An 85% likelihood that something will happen means it’s going to happen. These studies may help explain why after the 2016 election, so many criticized forecasters for “getting it wrong” (see <a href="https://www.nytimes.com/2016/11/10/technology/the-data-said-clinton-would-win-why-you-shouldnt-have-believed-it.html">this</a> and <a href="http://www.slate.com/articles/news_and_politics/politics/2016/01/nate_silver_said_donald_trump_had_no_shot_where_did_he_go_wrong.html">this</a>).</p>
<p><strong>What about voting?</strong></p>
<p>Perhaps most critically, we show that probabilistic forecasts showing more of a blowout can lower voting. In Study 1, we find limited evidence of this based on self reports. In Study 2, we show that when participants are faced with incentives designed to simulate real world voting, they are less likely to vote when probabilistic forecasts show higher odds of one candidate winning. Yet they are not responsive to changes in vote share.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/FT_18.01.03_prob_vote.png" class="img-fluid figure-img"></p>
<figcaption>normal</figcaption>
</figure>
</div>
<p><strong>Could this actually affect real world voting?</strong></p>
<p>Consider 2016—an unusually high number of Democrats thought the leading candidate would <em>win by quite a bit</em>:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/anes_turnout_closerace_mc_tall.png" class="img-fluid figure-img"></p>
<figcaption>normal</figcaption>
</figure>
</div>
<p>And people who say the leading candidate will <em>win by quite a bit</em> in pre-election polling are about three percentage points less likely to say they voted after the election than people who say it’s a close race. That’s after controlling for election year, prior turnout, and party identification.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/closerace_vote_anes.png" class="img-fluid figure-img"></p>
<figcaption>normal</figcaption>
</figure>
</div>
<p>The data here are from the <a href="https://electionstudies.org">American National Election Study (ANES)</a> and go back to 1952.</p>
<p>Past social science research also provides evidence that the perception of a close race boosts turnout. Some of the best evidence comes from work that analyzes the effects of releasing exit polling results before voting ends, which clearly removes uncertainty. Work examining the effects of East Coast television networks’ “early calls” for one candidate or another on West Coast turnout generally finds small but substantively meaningful effects, despite the fact that these calls occur <a href="https://repository.upenn.edu/cgi/viewcontent.cgi?article=1018&amp;context=asc_papers">late on election day</a>, see also <a href="https://academic.oup.com/poq/article-pdf/50/3/331/5135691/50-3-331.pdf?casa_token=Ize8hznQUHAAAAAA:6tMTMfvGSN4tFXyyXbwkew4E47cLlCG8FegNu4ulkzqHE3hJZMzfurBb-Y1GWQcvLbZTYUysOMebxg">this</a>. Similar work exploiting voting reform as a natural experiment <a href="https://eprints.qut.edu.au/83681/1/1-s2.0-S0014292115000483-main.pdf">shows a full 11 percentage point decrease</a> in turnout in the French overseas territories that voted after exit polls were released. These designs are not confounded with the tendency for campaigns to invest more in campaigns in competitive races.</p>
<p>Researchers consistently find robust correlations between tighter elections and higher turnout [see <a href="https://www.sciencedirect.com/science/article/abs/pii/S0261379405000910">this</a> and <a href="https://biopen.bi.no/bi-xmlui/bitstream/handle/11250/2389104/Geys_ES%202016.pdf?sequence=5&amp;isAllowed=y">this</a> for reviews]. Furthermore, there is evidence from statistical models that <a href="https://digitalcommons.unl.edu/cgi/viewcontent.cgi?article=1052&amp;context=poliscifacpub">prior election returns also explain turnout</a> above and beyond campaign spending, particularly when good polling data is unavailable.</p>
<p>Field experiments provide additional evidence that perceptions of higher electoral competition increase turnout. This work finds substantive effects on turnout when polling results showing a closer race are delivered <a href="https://huber.research.yale.edu/materials/67_paper.pdf">via telephone</a> [among those who were reached] but null results when relying on <a href="https://www.nber.org/papers/w23071">postcards</a> to deliver closeness messages. Finally, one study conducted in the weeks leading up to the 2012 presidential election found higher rates of self-reported, post-election turnout when delivering ostensible polling results showing Obama neck-and-neck with Romney <a href="http://www.aapor.org/AAPOR_Main/media/AnnualMeetingProceedings/2013/Session_C-5-1-Vannette.pdf">which was not consistent with the extant polling data showing a comfortable Obama lead</a>.</p>
<p><strong>Could this affect politicians as well?</strong></p>
<p>Candidates’ perceptions of the closeness of an election can affect campaigning and representation (Enos and Hersh 2015; Mutz 1997).</p>
<p>These perceptions can also shape policy decisions. For example, prior to the 2016 election, the Obama administration’s confidence in a Clinton victory was reportedly a factor in the muted response to <a href="https://www.washingtonpost.com/graphics/2017/world/national-security/obama-putin-election-hacking/">Russian intervention in the election</a>.</p>
<p>And former FBI Director James Comey, because of his confidence in a Clinton victory, said he felt that it was his duty to write a letter to Congress on October 28 saying he was reopening the investigation into her emails. Comey explained his actions based on his certain belief in a Clinton win: “[S]he’s gonna be elected president, and if I hide this from the American people, she’ll be illegitimate the moment she’s elected, the moment this comes out” (Keneally 2018). Nate Silver at one point said “<a href="https://fivethirtyeight.com/features/the-comey-letter-probably-cost-clinton-the-election/">the Comey letter probably cost Clinton the Election</a>.”</p>
<p><img src="https://solomonmg.github.io/img/ComeyABCCNNresize.jpg" class="img-fluid"></p>
<p><strong>Media coverage</strong> <a href="https://www.washingtonpost.com/news/politics/wp/2018/02/06/clintons-achilles-heel-in-2016-may-have-been-overconfidence/?utm_term=.619133ce9312">Washington Post</a>, <a href="https://fivethirtyeight.com/features/politics-podcast-whats-so-wrong-with-nancy-pelosi/">FiveThirtyEight’s Politics Podcast</a>, <a href="http://nymag.com/intelligencer/2018/02/americans-dont-understand-election-probabilities.html?gtm=bottom&amp;gtm=bottom">New York Magazine</a>, <a href="https://politicalwire.com/2018/02/06/election-forecasts-lower-voter-turnout/">Political Wire</a>.</p>



 ]]></description>
  <guid>https://solomonmg.github.io/blog/projecting-confidence/</guid>
  <pubDate>Mon, 18 May 2020 00:00:00 GMT</pubDate>
  <media:content url="https://solomonmg.github.io/blog/projecting-confidence/featured.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Impression of Influence</title>
  <dc:creator>Sol Messing</dc:creator>
  <link>https://solomonmg.github.io/blog/impression-of-influence/</link>
  <description><![CDATA[ 





<p><a href="../../pdf/GrimmerWestwoodMessingBook.pdf" class="btn-paper">PDF</a> <a href="https://twitter.com/SolomonMg" class="btn-paper">Follow</a></p>
<p><a href="../../pdf/GrimmerWestwoodMessingBook.pdf"><strong>The Impression of Influence: Legislator Communication, Representation, and Democratic Accountability</strong></a> <strong>Princeton University Press, 2015</strong>. With Justin Grimmer and Sean Westwood - Media: <a href="http://www.mischiefsoffaction.com/2015/01/its-frequency-not-size-compromise.html">Mischiefs of Faction</a>.</p>



 ]]></description>
  <guid>https://solomonmg.github.io/blog/impression-of-influence/</guid>
  <pubDate>Sun, 17 May 2020 00:00:00 GMT</pubDate>
  <media:content url="https://solomonmg.github.io/blog/impression-of-influence/featured.png" medium="image" type="image/png" height="144" width="144"/>
</item>
<item>
  <title>Why Election Forecasting Matters</title>
  <dc:creator>Sol Messing</dc:creator>
  <link>https://solomonmg.github.io/blog/response-to-fivethirtyeights-podcast-about-our-paper-projecting-confidence/</link>
  <description><![CDATA[ 





<p>Do you remember the night of Nov 8, 2016? I was glued to election coverage and obsessively checking probabilistic forecasts, wondering&nbsp;whether Clinton might do <em>so well</em> that she’d win in places like my home state of Arizona. Although FiveThirtyEight had Clinton’s chances at beating Trump at around 70%, most other forecasters had her at around 90%.</p>
<p>When she lost, <a href="https://www.nytimes.com/2016/11/10/us/politics/donald-trump-election-reaction.html">many on both sides of the aisle</a> were shocked. My co-authors and I wondered if America’s seeming confidence in a Clinton victory wasn’t driven in part by increasing coverage of probabilistic forecasts. And, if a Clinton victory looked inevitable, what did that do to turnout?</p>
<p><img src="https://solomonmg.github.io/img/forecast_google_news.png" class="img-fluid"></p>
<p>We weren’t alone. Clinton herself was quoted in <a href="http://nymag.com/daily/intelligencer/2017/05/hillary-clinton-life-after-election.html?mid=nymag_press">New York Magazine</a> after the election:</p>
<blockquote class="blockquote">
<p>I had people literally seeking absolution… ‘<em>I’m so sorry I didn’t vote. I didn’t think you needed me.</em>’ I don’t know how we’ll ever calculate how many people thought it was in the bag, because the percentages kept being thrown at people — ‘<em>Oh, she has an 88 percent chance to win!</em>’</p>
</blockquote>
<p>Enter&nbsp;our recent <a href="http://www.pewresearch.org/fact-tank/2018/02/06/use-of-election-forecasts-in-campaign-coverage-can-confuse-voters-and-may-lower-turnout/">blog post</a> and <a href="https://papers.ssrn.com/abstract=3117054">paper released on SSRN</a>, “Projecting confidence: How the probabilistic horse race confuses and de-mobilizes the public,” by&nbsp;Sean Westwood, Solomon Messing, and Yphtach Lelkes. While our work cannot definitively say whether probabilistic forecasts played a decisive role in the 2016 election, it does show that, compared to more conventional vote share projections, probabilistic forecasts can confuse people, can give people more confidence that the candidate depicted as being ahead will win, and may decrease turnout. It also shows that liberals in the U.S. are more likely to encounter them. We appreciate the media attention to this work, including coverage by the <a href="https://www.washingtonpost.com/news/politics/wp/2018/02/06/clintons-achilles-heel-in-2016-may-have-been-overconfidence/">Washington Post</a>, <a href="http://nymag.com/daily/intelligencer/2018/02/americans-dont-understand-election-probabilities.html">New York Magazine</a>, and the <a href="https://politicalwire.com/2018/02/06/election-forecasts-lower-voter-turnout/">Political Wire</a>.</p>
<p>What’s more, FiveThirtyEight devoted much of their Feb.&nbsp;12 <a href="https://fivethirtyeight.com/features/politics-podcast-whats-so-wrong-with-nancy-pelosi/">Politics Podcast</a> to a spirited, and at points critical, discussion of our work. We are open to criticism, and in this post we respond to some of the questions raised. Below, we’ll show that, contrary to what the hosts suggest, the evidence in our study and in other research is <em>not</em> inconsistent with our headline: we’ll detail the evidence that probabilistic forecasts <em>confuse</em> people, irrespective of their technical accuracy. We’ll also discuss where we agree with the podcast hosts.</p>
<p>Furthermore, we’ll discuss a few topics which, judging from the hosts’ discussion, may not have come through clearly enough in our paper. We’ll reiterate what this work contributes to social science: how the paper adds to our understanding of how people think about probabilistic forecasts and how those forecasts may decrease voting, particularly for the leading candidate’s supporters and among liberals in the U.S. We’ll then walk readers through the way we mapped vote share projections to probabilities in the study. Finally, we’ll discuss why this work matters, and conclude by pointing out future research we’d like to see in this area.</p>
<p><strong>What’s new here?</strong></p>
<p>The research contains a number of findings that are new to social science:</p>
<ul>
<li>Presenting forecasted win-probabilities gives potential voters the impression that one candidate will win more decisively, compared with vote share projections (Study 1).</li>
<li>Higher win probabilities, but <em>not</em> vote share estimates, <em>decrease voting</em> in the face of the trade-offs embedded in our election simulation (Study 2). This helps confirm the findings in Study 1 and adds to the evidence from past research that people vote at lower rates when they perceive an election to be uncompetitive.</li>
<li>In 2016, probabilistic forecasts were covered more extensively than in the past and tended to be covered by outlets with more liberal audiences.</li>
</ul>
<p><strong>Where we agree</strong></p>
<p>If what you care about is conveying an accurate sense of whether one candidate will win, probabilistic forecasts do this slightly better than vote share projections. They also seem to give people an edge on accuracy when interpreting the vote share if their candidate is behind. Of course, people can be confused and still end up being accurate, as we’ll discuss below.</p>
<p>We also agree that people often do not accurately judge the likelihood of victory after seeing a vote share projection. That makes sense because, as the study shows, people appear to largely ignore the margin of error, which they’d need to map between vote share estimates and win probabilities.</p>
<p>We also agree that a lot of past work shows that people stay home when they think an election isn’t close. What we’re adding to that body of work is evidence that compared with vote share projections, probabilistic forecasts give people the impression that one candidate will win more decisively, and may thus <em>more powerfully</em> affect turnout.</p>
<p><strong>Does the evidence in our study contradict our headline?</strong></p>
<p>Our headline isn’t about accuracy; it’s about <em>confusion</em>. The evidence from this research and past work, taken as a whole, suggests that probabilistic forecasts confuse people (something that came up at the end of the segment), even if the result is sometimes technically higher accuracy.</p>
<p>1. People in the study who saw only probabilistic forecasts were more likely to confuse probability and vote share. After seeing probabilistic forecasts, 8.6% of respondents mixed up vote share and probability, while only 0.6% of respondents did so after seeing vote share projections. We’re defining “mixed-up” as reporting the win-probability we provided as the vote share and vice-versa.</p>
<p>2. Figure 2B (Study 1) shows that people get their candidate’s likelihood of winning very wrong, even when we explicitly told them the probability a candidate will win. It’s true that they got slightly closer with a probability forecast, but they are still far off.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/likelihood_loess.png" class="img-fluid figure-img"></p>
</figure>
</div>
<p>Why might this be? A lot of past research and evidence suggests that people have trouble understanding probabilities, as noted at the end of the podcast. People have a tendency to think about <a href="https://chicagounbound.uchicago.edu/cgi/viewcontent.cgi?article=1384&amp;context=law_and_economics">probabilities in subjective terms</a>, so they have trouble understanding <a href="https://www.ncbi.nlm.nih.gov/pubmed/26161749">medical risks</a> and even <a href="http://pubman.mpdl.mpg.de/pubman/item/escidoc:2101059/component/escidoc:2101058/GG_30_Chance_2005.pdf">weather forecasts</a>.</p>
<p>Nate Silver has himself <a href="https://fivethirtyeight.com/features/the-media-has-a-probability-problem/">made the argument</a> that <a href="https://www.nytimes.com/2016/11/10/technology/the-data-said-clinton-would-win-why-you-shouldnt-have-believed-it.html">the backlash</a> we saw to data and analytics in the wake of the 2016 election is due in part to the media misunderstanding probabilistic forecasts.</p>
<p>As the podcast hosts pointed out, people <em>underestimated</em> the true likelihood of winning after seeing both probabilistic forecasts and vote share projections. It’s possible that people are skeptical of any probabilistic forecast in light of the 2016 election. It’s possible they interpreted the likelihood not as hard-nosed odds, but in rather subjective terms — <em>what might happen</em>, <a href="https://chicagounbound.uchicago.edu/cgi/viewcontent.cgi?article=1384&amp;context=law_and_economics">consistent with past research</a>. Regardless, they do not appear to reason about probability in a way that is consistent with how election forecasters define the probability of winning.</p>
<p>3. Looking at how people reason about vote share — the way people have traditionally encountered polling data — it’s clear from our results that when a person’s candidate is ahead and they see a probabilistic forecast, they rather dramatically overestimate the vote share. On the other hand, when they are behind, they get closer to the right answer.</p>
<p>But we know from past research that people have a “<a href="http://journals.sagepub.com/doi/abs/10.1177/0956797609356421">wishful</a> <a href="http://www.tandfonline.com/doi/abs/10.1207/s15324834basp1304_6">thinking</a>” <a href="https://academic.oup.com/ijpor/article-abstract/9/2/105/713900">bias</a>, meaning they say their candidate is doing better than polling data suggests. That’s why there’s a positive bias when people are evaluating how their candidate will do, according to Figure 2A (Study 1).</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/votea.png" class="img-fluid figure-img"></p>
</figure>
</div>
<p>The pattern in the data suggests that people are more accurate after seeing a probabilistic forecast for a losing candidate because of this effect, and <em>not</em> necessarily because they better understand that candidate’s actual chances of victory.</p>
<p>4. Perhaps even more importantly, <em>none</em> of the results here changed when we excluded the margin of error from the projections we presented to people. That suggests that the public may not understand error in the same way that statisticians do, and therefore may not be well-equipped to understand what goes into changes in probabilistic forecast numbers. And of course, very small changes in vote share projection numbers and estimates of error correspond to much larger swings in probabilistic forecasts.</p>
<p>5. Finally, as we point out in the paper, if probabilistic forecasters do not account for total error, they can really overestimate a candidate’s probability of winning. Of course, that’s because an estimate of the probability of victory bakes in estimates of error, which recent work has found is <a href="http://www.stat.columbia.edu/~gelman/research/unpublished/polling-errors.pdf">often about twice as large as the estimates of sampling error</a> provided in many polls.</p>
<p>As <a href="https://fivethirtyeight.com/features/the-polls-were-skewed-toward-democrats/">Nate Silver has alluded to</a>, a forecaster who does not account for unobserved error, including error that may be correlated across surveys, will artificially inflate the estimated probability of victory or defeat. Of course, FiveThirtyEight <em>does</em> attempt to account for this error, and released far more conservative forecasts than others in this space in 2016.</p>
<p>Speaking in part to this issue, <a href="http://www.stat.columbia.edu/~gelman/research/published/what_learned_in_2016_5.pdf">Andrew Gelman and Julia Azari</a> recently concluded that “polling uncertainty could best be expressed not by speculative win probabilities but rather by using the traditional estimate and margin of error.” They seemed to be speaking about other forecasters, and did not directly reference FiveThirtyEight.</p>
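<p>To make the point about total error concrete, here is a minimal R sketch using the example figures from later in this post (a 55% projected vote share and a standard deviation of 0.04483415). Doubling the standard deviation is an illustrative assumption, motivated by the finding above that total error is often about twice the reported sampling error; it is not a calculation taken from our paper.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Illustrative sketch: how unaccounted-for error inflates a win probability
svy_mean = 0.55          # projected vote share (example used in this post)
svy_SD = 0.04483415      # standard deviation used in this post's example

# Win probability under a normal approximation: P(vote share &gt; 0.5)
prob_reported = 1 - pnorm(q = 0.50, mean = svy_mean, sd = svy_SD)

# Assumption for illustration only: total error is twice as large
prob_doubled = 1 - pnorm(q = 0.50, mean = svy_mean, sd = 2 * svy_SD)

round(c(reported = prob_reported, doubled = prob_doubled), 2)</code></pre></div>
<p>Under these assumptions the win probability falls from roughly 87% to roughly 71%, even though the projected vote share has not moved.</p>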
<p>At the end of the day, it’s easy to see that a vote share projection of 55% means that “55% of the votes will go to Candidate A, according to our polling data and assumptions.” However, it’s less clear that an 87% win probability means that “if the election were held 1000 times, Candidate A would win 870 times, and lose 130 times, based on our polling data and assumptions.”</p>
<p>And most critically, we show that probabilistic forecasts showing more of a blowout could potentially lower voting. In Study 1, we provide limited evidence of this based on self reports. In Study 2, we show that when participants are faced with incentives designed to simulate real world voting, they are less likely to vote when probabilistic forecasts show higher odds of one candidate winning. Yet they are not responsive to changes in vote share.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/FT_18.01.03_prob_vote.png" class="img-fluid figure-img"></p>
</figure>
</div>
<p><strong>What’s with our mapping between vote share and probability?</strong></p>
<p>The podcast questions how a 55% vote share with a 2% margin of error is equivalent to an 87% win probability. This illustrates a common problem people have when trying to understand win probabilities: it’s difficult to reason about the relationship between win-probabilities and vote share without actually running the numbers.</p>
<p>You can express a projection as either (1) the average vote share (can be an electoral college vote share or the popular vote share)</p>
<p><img src="https://latex.codecogs.com/png.latex?%20%5Chat%20%5Cmu_v%20=%20%5Cfrac%7B1%7D%7BN%7D%5Csum_%7Bi%7D%5E%7BN%7D%5Cbar%20x_i"></p>
<p>and margin of error</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Chat%20%5Cmu_v%20%5Cpm%20T%5E%7B0.975%7D_%7Bdf%20=%20N%7D%20%5Ctimes%20%5Cfrac%7B%20%5Chat%20%5Csigma_v%7D%7B%5Csqrt%20N%7D"></p>
<p>Here the average for each survey is <img src="https://latex.codecogs.com/png.latex?%5Cbar%20x_i">, and there are <img src="https://latex.codecogs.com/png.latex?N%20=%2020"> surveys.</p>
<p>Or (2) the probability of winning — the probability that the vote share is greater than half, based on the observed vote share and standard error:</p>
<p><img src="https://latex.codecogs.com/png.latex?1%20-%20%5CPhi%20%5Cleft%20(%5Cfrac%7B.5%20-%20%5Chat%20%5Cmu_v%7D%7B%5Chat%5Csigma_v%7D%20%5Cright%20)"></p>
<p>Going back to the example above, here’s the R code to generate those quantities:</p>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb1-1">svy_mean <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> .<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">55</span> </span>
<span id="cb1-2">svy_SD <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.04483415</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># see appendix </span></span>
<span id="cb1-3">N_svy <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span> </span>
<span id="cb1-4">margin_of_error <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">qt</span>(.<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">975</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">df =</span> N_svy) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> svy_SD<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sqrt</span>(N_svy) </span>
<span id="cb1-5"></span>
<span id="cb1-6">svy_mean </span>
<span id="cb1-7">[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.55</span> </span>
<span id="cb1-8"></span>
<span id="cb1-9">margin_of_error </span>
<span id="cb1-10">[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.02091224</span> </span>
<span id="cb1-11"></span>
<span id="cb1-12">prob_win <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pnorm</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">q =</span> .<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mean =</span> svy_mean, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sd =</span> svy_SD) </span>
<span id="cb1-13">prob_win </span>
<span id="cb1-14">[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.8676222</span> </span></code></pre></div>
<p>More details about this approach are in our appendix. This is similar to how the Princeton Election Consortium generated win probabilities in 2016.</p>
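<p>The same calculation also shows why small movements in vote share translate into large swings in win probability. The sketch below reuses <code>svy_SD</code> from the example above; the alternative vote shares of 51% and 53% are hypothetical values chosen purely for illustration.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Illustrative sketch: win probability at several projected vote shares
svy_SD = 0.04483415                  # same standard deviation as above
svy_means = c(0.51, 0.53, 0.55)      # 0.51 and 0.53 are hypothetical

# pnorm() is vectorized, so this returns one probability per vote share
prob_win = 1 - pnorm(q = 0.50, mean = svy_means, sd = svy_SD)
round(prob_win, 2)</code></pre></div>
<p>With these inputs, a four-point difference in projected vote share (51% versus 55%) corresponds to a win probability of roughly 59% versus 87%.</p>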
<p>Of course, one can also use an approach based on simulation, as FiveThirtyEight does. In the case of the data we generated for our hypothetical election in Study 1, this approach is not necessary. However, we recognize that in the case of real-world presidential elections, a simulation approach has clear advantages by virtue of allowing more flexible statistical assumptions and a better accounting of error.</p>
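<p>For readers curious what a simulation-based version looks like, here is a toy R sketch. It simply draws hypothetical election outcomes from the same normal distribution used above and counts how often the candidate clears 50%. This is not FiveThirtyEight’s model, which is far more elaborate; the seed and the number of draws (<code>n_sims</code>) are arbitrary choices made for illustration.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Toy simulation: estimate the win probability by repeated random draws
set.seed(2016)            # arbitrary seed so the draws are reproducible
svy_mean = 0.55
svy_SD = 0.04483415
n_sims = 100000           # arbitrary number of simulated elections

# Draw simulated vote shares, then compute the fraction of simulated wins
sim_vote_share = rnorm(n_sims, mean = svy_mean, sd = svy_SD)
sim_prob_win = mean(sim_vote_share &gt; 0.50)
sim_prob_win</code></pre></div>
<p>With this many draws the simulated value should land close to the analytic 0.8676222 computed above. The appeal of simulation is that the <code>rnorm()</code> step can be swapped for richer assumptions, such as error that is correlated across surveys.</p>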
<p><strong>Why does this matter?</strong></p>
<p>To be clear, we are not analyzing real-world election returns. However, a lot of past <a href="https://huber.research.yale.edu/materials/67_paper.pdf">research</a> <a href="http://www2.gsu.edu/~polsnn/priorbeliefs.pdf">shows</a> that when people <a href="https://www.jstor.org/stable/1953324?seq=1#page_scan_tab_contents">think</a> an election is <a href="https://repository.upenn.edu/cgi/viewcontent.cgi?referer=&amp;httpsredir=1&amp;article=1018&amp;context=asc_papers">in the bag</a>, they tend to <a href="https://www.sciencedirect.com/science/article/pii/S0014292115000483">vote in real-world elections</a> at <a href="https://www.jstor.org/stable/2748722?seq=1#page_scan_tab_contents">lower rates</a>. Our study provides evidence that probabilistic forecasts give people more confidence that one candidate will win and suggestive evidence that we should expect them to vote at lower rates after seeing probabilistic forecasts.</p>
<p>This matters <em>a lot more</em> if one candidate’s potential voters are differentially affected, and there’s evidence that may be the case.</p>
<p>1. Figure 2C in Study 1 suggests that the <em>candidate who is ahead</em> in the polls will be more affected by the increased certainty that probabilistic forecasts convey.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/certaintyc.png" class="img-fluid figure-img"></p>
</figure>
</div>
<p>2. When you look at the balance of coverage of probabilistic forecasts on major television broadcasts, there is more coverage on MSNBC, which has a more liberal audience.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/msnbc_mentions.png" class="img-fluid figure-img"></p>
</figure>
</div>
<p>3. Consider who shares this material in social media–specifically the average self-reported ideology of people who share links to various sites hosting poll-aggregators on Facebook, data that come from <a href="http://science.sciencemag.org/content/early/2015/05/06/science.aaa1160.full">this paper</a>’s <a href="http://dx.doi.org/10.7910/DVN/LDJ7MS">replication materials</a>. The websites that present their results in terms of probabilities have left-leaning (negative) social media audiences. Only realclearpolitics.com, which doesn’t emphasize win-probabilities, has a conservative audience:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/bma_science_alignment.png" class="img-fluid figure-img"></p>
</figure>
</div>
<p>4. In 2016, the proportion of American National Election Study (ANES) respondents who thought the leading candidate would “win by quite a bit” was unusually high for Democrats…</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/anes_turnout_closerace_mc_tall.png" class="img-fluid figure-img"></p>
</figure>
</div>
<p>5. And we know that people who say the leading presidential candidate will “win by quite a bit” in pre-election polling are about three percentage points less likely to report voting shortly after the election than people who say it’s a close race — and that’s after conditioning on election year, prior turnout, and party identification. The data here are from the ANES and go back to 1952.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/closerace_vote_anes.png" class="img-fluid figure-img"></p>
</figure>
</div>
<p>These data do not conclusively show that probabilistic forecasts affected turnout in the 2016 election, but they do raise questions about the real world consequences of probabilistic forecasts.</p>
<p><strong>What about media narratives?</strong></p>
<p>We acknowledge that these effects may change depending on the context in which people encounter them — though people can certainly encounter a lone probability number in media coverage of probabilistic forecasts. We also acknowledge that our work cannot address how these effects compare to and/or interact with media narratives.</p>
<p>However, other work that is relevant to this question has found that aggregating all polls <a href="https://academic.oup.com/poq/article-abstract/80/4/943/2738970?redirectedFrom=fulltext">reduces the likelihood that news outlets</a> focus on unusual polls that are more sensational or support a particular narrative.</p>
<p>In some ways, the widespread success of, and reliance on, these forecasts represent a triumph of scientific communication. In addition to offering greater precision than one-off horserace polls, probabilistic forecasts can quantify how likely a given U.S. presidential candidate is to win using polling data and complex simulation, rather than leaving the task of making sense of state and national polls to speculative commentary about “paths to victory,” as we point out in the <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3117054">paper</a>. And as one of the hosts noted, we aren’t calling for an end to election projections.</p>
<p><strong>Future work</strong></p>
<p>We agree with the hosts that there are open questions about whether the public gives more weight to these probabilistic forecasts than other polling results and speculative commentary. We have also heard questions raised about how much probabilistic forecasts might <em>drive</em> media narratives. These questions may prove difficult to answer and we encourage research that explores them.</p>
<p>We hope this research continues to create a dialogue about how to best communicate polling data to the public. We would love to see more research into how the public consumes and is affected by election projections, including finding the most effective ways to convey uncertainty.</p>



 ]]></description>
  <guid>https://solomonmg.github.io/blog/response-to-fivethirtyeights-podcast-about-our-paper-projecting-confidence/</guid>
  <pubDate>Sun, 26 Apr 2020 00:00:00 GMT</pubDate>
  <media:content url="https://solomonmg.github.io/blog/response-to-fivethirtyeights-podcast-about-our-paper-projecting-confidence/featured.png" medium="image" type="image/png" height="111" width="144"/>
</item>
<item>
  <title>Know your data - Pricing diamonds using scatterplots and predictive models</title>
  <dc:creator>Sol Messing</dc:creator>
  <link>https://solomonmg.github.io/blog/visualization-series-scatterplot-understanding-the-diamond-market/</link>
  <description><![CDATA[ 





<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/fbfa61f67413c8e2805c507a14b38c24c5373265.png" class="img-fluid figure-img"></p>
<figcaption>ggpairs</figcaption>
</figure>
</div>
<p>My last post railed against the <a href="../../blog/visualization-series-insight-from-cleveland-and-tufte-on-plotting-numeric-data-by-groups/">bad visualizations that people often use to plot quantitative data by groups</a>, and pitted pie charts, bar charts and dot plots against each other for two visualization tasks. &nbsp;Dot plots came out on top. &nbsp;I argued that this is because humans are better at the cognitive task of comparing position along a common scale than at making judgements about length, area, shading, direction, angle, volume, curvature, etc., a finding credited to Cleveland and McGill. &nbsp;I enjoyed writing it and people seemed to like it, so I’m continuing my visualization series with the scatterplot.</p>
<section id="scatterplots" class="level2">
<h2 class="anchored" data-anchor-id="scatterplots">Scatterplots</h2>
<p>A scatterplot is a two-dimensional plane on which we record the intersection of two measurements for a set of case items–usually&nbsp;two quantitative variables. &nbsp;Just as humans are good at&nbsp;comparing position along a common scale&nbsp;in one dimension,&nbsp;our visual capabilities allow us to make fast, accurate judgements and recognize patterns when presented with a series of dots in two dimensions. This makes the scatterplot a valuable tool for data analysts both when exploring data and when communicating results to others.</p>
<p>In this post—part 1—I’ll demonstrate various uses for scatterplots and outline some strategies to help make sure key patterns are not obscured by the scale or qualitative group-level differences in the data (e.g., the relationship between test scores and income differs for men and women). The motivation in this post is to come up with a model of diamond prices that you can use to help make sure you don’t get ripped off, specified based on insight from exploratory scatterplots combined with (somewhat) informed speculation. In part 2, I’ll discuss the use of panels aka facets aka small multiples to shed additional light on key patterns in the data, and local regression (loess) to examine central tendencies in the data. There are far fewer bad examples of this kind of visualization in the wild than the 3D barplots and pie charts mocked in my last post, though I was still able to find <a href="http://www.showmethemath.com/Concepts_Explained/Scatter_Plot/homeworkScatterPlotAnswer.gif">this lovely scatterplot + trend-line</a>.</p>
<p><img src="https://solomonmg.github.io/img/0e5cd98eb90fbc27e55e776f3303057ef7a35dcb.gif" class="img-fluid"></p>
</section>
<section id="scatterplots-and-the-cartesian-coordinate-system" class="level2">
<h2 class="anchored" data-anchor-id="scatterplots-and-the-cartesian-coordinate-system">Scatterplots and the Cartesian coordinate system</h2>
<p>The scatterplot has a richer history than the visualizations I wrote about in my last post.&nbsp;&nbsp;The scatterplot’s face forms a two-dimensional Cartesian coordinate system, and Descartes’ invention/discovery of this eponymous plane in 1637 represents one of the most fundamental developments in science. &nbsp;The Cartesian plane unites measurement, algebra, and geometry, depicting the relationship between variables (or functions) visually. Prior to the Cartesian plane, mathematics was divided into algebra and geometry, and the unification of the two made many new developments possible. &nbsp;Of course, this includes modern map-making (cartography), but the <a href="http://en.wikipedia.org/wiki/Cartesian_coordinate_system#History">Cartesian plane was also an important step in the development of calculus</a>, without which very little of our modern world would be possible.</p>
<p>The scatterplot is a powerful tool to help understand the relationship between variables, and especially if that relationship is non-linear. Say you want to get a sense of whether you’re paying the right price when shopping for a diamond. You can use data on the price and characteristics of many diamonds to help figure out whether the price advertised for any given diamond is reasonable, and you can use scatterplots to help figure out how to model that data in a sensible way. Consider the important relationship between the price of a diamond and its carat weight (which corresponds to its size):</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/635214c79e184de850272a4790deed0dc870a49a.png" class="img-fluid figure-img"></p>
<figcaption>caratprice</figcaption>
</figure>
</div>
<p>A few things pop out right away. We can see a non-linear relationship, and we can also see that the dispersion (variance) of price increases as carat size increases. With just a quick look at a scatterplot of the data, we’ve learned two important things about the functional relationship between price and carat size. We’ve therefore also learned that running a linear model on this data as-is would be a bad idea.</p>
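<p>A plot along these lines can be sketched in a few lines of R, assuming the ggplot2 package is installed (the <code>diamonds</code> data set ships with it). The transparency, point size, and axis labels are my own choices for illustration and are not necessarily the settings used for the figure above.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r">library(ggplot2)   # provides the plotting functions and the diamonds data

# Price against carat weight; alpha reduces overplotting among the many points
p = ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1, size = 0.5) +
  labs(x = "Carat weight", y = "Price (USD)")
print(p)</code></pre></div>
<p>Lowering <code>alpha</code> is a design choice: with tens of thousands of points, semi-transparent dots make the dense regions and the widening spread at higher carat weights easier to see.</p>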
</section>
<section id="diamonds" class="level2">
<h2 class="anchored" data-anchor-id="diamonds">Diamonds</h2>
<p><a name="Diamonds"></a></p>
<p>If you’ve ever used R, you’ve probably seen references to the diamonds data set that ships with Hadley Wickham’s ggplot2. It records the carat size and the price of more than 50 thousand diamonds from http://www.diamondse.info/, collected <a href="http://r.789695.n4.nabble.com/Year-of-data-collection-for-diamonds-dataset-in-ggplot2-td4506598.html">in 2008</a>. If you’re in the market for a diamond, exploring this data set can help you understand what’s in store and at what price point. This is particularly useful because each diamond is unique in a way that isn’t true of most manufactured products we are used to buying: you can’t just plug in a model number and look up the price on Amazon. And even an expert cannot incorporate as much information about price as a picture of the entire market informed by data (though there’s no substitute for qualitative expertise to make sure your diamond is what the retailer claims).</p>
<p>But even if you’re not looking to buy a diamond, the socioeconomic and political history of the diamond industry is fascinating. Diamonds birthed the mining industry in South Africa, which is now by far the largest and most advanced economy in Africa. I worked a summer in Johannesburg, and can assure you that South Africa’s cities look far more like L.A. and San Francisco than Lagos, Cairo, Mogadishu, Nairobi, or Rabat. Diamonds have stoked conflicts ranging from the Boer Wars to modern-day wars in Sierra Leone, Liberia, Côte d’Ivoire, Zimbabwe and the DRC, where the 200 carat Millennium Star diamond was sold to De Beers at the height of the civil war in the 1990s. Diamonds were one of the few assets that Jews could conceal from the Nazis during <a href="http://www.archives.gov/research/holocaust/articles-and-papers/turning-history-into-justice.html">the “Aryanization of Jewish property”</a> in the 1930s, and the Congressional Research Service reports that <a href="http://royce.house.gov/uploadedfiles/rl30751.pdf">Al Qaeda has used conflict diamonds to skirt international sanctions and finance operations from the 1998 East Africa Bombings to the September 11th attacks</a>.</p>
<p><img src="https://solomonmg.github.io/img/c7c2fd41f3cf8c8b7423ab84c12dfbd14fed71ca.jpg" class="img-fluid"></p>
<p>Though the diamonds data set is full of prices and fairly esoteric certification ratings, hidden in the data are reflections of how a <a href="http://www.nytimes.com/2013/05/05/fashion/weddings/how-americans-learned-to-love-diamonds.html">legendary marketing campaign permeated and was subsumed by our culture</a>, hints about how different social strata responded, and insight into how the diamond market functions as a result.</p>
<p><a href="http://www.theatlantic.com/magazine/archive/1982/02/have-you-ever-tried-to-sell-a-diamond/304575/">The story starts in 1870</a>, according to The Atlantic, when many tons of diamonds were discovered in South Africa near the Orange River. Until then, diamonds were rare: only a few pounds were mined from India and Brazil each year. At the time, diamonds had none of the industrial uses they have today, so they were valued only as jewelry and price depended only on scarce supply. Hence, the project’s investors formed the De Beers Cartel in 1888 to control the global price. It was by most accounts the most successful cartel in history, <a href="http://en.wikipedia.org/wiki/De_Beers#Diamond_monopoly">controlling 90% of the world’s diamond supply until about 2000</a>. But World War I and the Great Depression saw diamond sales plummet.</p>
<p><img src="https://solomonmg.github.io/img/0346d018cd97813448c1ce59a20353ed7c900013.jpg" class="img-fluid"></p>
<p>In 1938, according to the New York Times’ account, the De Beers cartel wrote to Philadelphia ad agency N. W. Ayer &amp; Son to investigate whether “the use of propaganda in various forms” might jump-start diamond sales in the U.S., which looked like the only potentially viable market at the time. Surveys showed diamonds were low on the list of priorities among most couples contemplating marriage: a luxury for the rich, “money down the drain.” Frances Gerety, whom the Times compares to Mad Men’s Peggy Olson, took on the De Beers account at N. W. Ayer &amp; Son, and worked toward the company’s goal “to create a situation where almost every person pledging marriage feels compelled to acquire a diamond engagement ring.” A few years later, she coined the slogan, “A Diamond Is Forever.”</p>
<p><img src="https://solomonmg.github.io/img/90a8efdb27b9c59fbf6566591c14165beab4bfd6.jpg" class="img-fluid"></p>
<p>The Atlantic’s Jay Epstein argues that this campaign&nbsp;gave birth to modern demand-advertising—the objective was not direct sales, nor brand&nbsp;strengthening, but simply to impress the glamour, sentiment and emotional charge contained in the product itself. &nbsp;The company gave diamonds to movie stars, sent out press packages emphasizing the size of diamonds celebrities gave each other, loaned diamonds to socialites attending prominent events like the Academy Awards and Kentucky Derby, and persuaded the British royal family to wear diamonds over other gems. &nbsp;The diamond was also marketed as a status symbol, to reflect&nbsp;“a man’s … success in life,” in ads with “the aroma of tweed, old leather and polished wood which is characteristic of a good club.” &nbsp;A 1980s ad introduced the two-month benchmark: “Isn’t two months’ salary a small price to pay for something that&nbsp;lasts forever?”</p>
<p><img src="https://solomonmg.github.io/img/4f1b0bd62c2ab1b46e453b707d2981d84879d4ed.png" class="img-fluid"></p>
<p>By any reasonable measure, Frances Gerety succeeded: getting engaged means getting a diamond ring in America. Can you think of a movie where two people get engaged without a diamond ring? When you announce your engagement on Facebook, what icon does the site display? Still think this marketing campaign might not be the most successful mass-persuasion effort in history? I present to you a James Bond film whose title echoes the diamond cartel’s trademark slogan:</p>
<p><img src="https://solomonmg.github.io/img/f4eda42fe398a38837e01e97e4d07606a20171fe.jpg" class="img-fluid"></p>
<p>Awe-inspiring and terrifying. &nbsp;Let’s open the data set. &nbsp;</p>
<p>The first thing you should consider doing is plotting key variables against each other using the ggpairs() function. This function plots every variable against every other, pairwise. For a data set with as many rows as the diamonds data, you may want to sample first; otherwise things will take a long time to render. Also, if your data set has more than about ten columns, there will be too many panels to read, so subset on columns first.</p>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Uncomment these lines and install if necessary:</span></span>
<span id="cb1-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#install.packages('GGally')</span></span>
<span id="cb1-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#install.packages('ggplot2')</span></span>
<span id="cb1-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#install.packages('scales')</span></span>
<span id="cb1-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#install.packages('memisc')</span></span>
<span id="cb1-6"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(ggplot2)</span>
<span id="cb1-7"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(GGally)</span>
<span id="cb1-8"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(scales)</span>
<span id="cb1-9"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data</span>(diamonds)</span>
<span id="cb1-10">diasamp <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> diamonds[<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sample</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">length</span>(diamonds<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>price), <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10000</span>),]</span>
<span id="cb1-11"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggpairs</span>(diasamp, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">params =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">shape =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">I</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'.'</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">outlier.shape =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">I</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'.'</span>)))</span></code></pre></div>
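<p>A caveat if you run this today: newer GGally releases appear to have dropped the <code>params</code> argument in favor of <code>wrap()</code>, so the call above may fail. Here is a minimal sketch of an equivalent call, assuming a GGally version that provides <code>wrap()</code>. The seed and the column subset are illustrative choices made for this sketch, not part of the original call.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r">library(ggplot2)
library(GGally)
data(diamonds)
set.seed(20130525)  # arbitrary seed, only so the sample is reproducible
diasamp = diamonds[sample(nrow(diamonds), 10000), ]
# wrap() attaches extra arguments to the named panel function
pm = ggpairs(diasamp[, c('carat', 'cut', 'color', 'clarity', 'price')],
  lower = list(continuous = wrap('points', shape = I('.'))),
  upper = list(combo = wrap('box', outlier.shape = I('.'))))
pm</code></pre></div>
<p>Dropping the column subset gives the full matrix, at the cost of a slower render.</p>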
<!-- * R style note: I started using the "=" operator over "<-" after reading [John Mount's post on the topic](http://www.win-vector.com/blog/2013/04/prefer-for-assignment-in-r/?utm_source=rss&utm_medium=rss&utm_campaign=prefer-for-assignment-in-r), which shows how using "<-" (but not "=") incorrectly can result in silent errors.  There are other good reasons: 1.) WordPress and R-Bloggers occasionally mangle "<-" thinking it is HTML code in ways unpredictable to me; 2.) "=" is what every other programming language uses; and 3.) (as pointed out by Alex Foss in comments) consider "foo<-3" --- did the author mean to assign foo to 3 or to compare foo to -3?  Plus, 4.) the way R interprets that expression depends on white space---and if I'm using an editor like Emacs or Sublime where I don't have a shortcut key assigned to "<-", I sometimes get the whitespace wrong.  This means spending extra time and brainpower on debugging, both of which are in short supply.  
 -->
<p>Anyway, here’s the plot:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/5226c40443deaf7c082cd464531f4e27c0f151be.png" class="img-fluid figure-img"></p>
<figcaption>ggpairs</figcaption>
</figure>
</div>
<p>What’s happening is that ggpairs is plotting each variable against every other in a pretty smart way. In the lower triangle of the plot matrix, it uses grouped histograms for qualitative-qualitative pairs and scatterplots for quantitative-quantitative pairs. In the upper triangle, it plots grouped histograms for qualitative-qualitative pairs (using the x- instead of the y-variable as the grouping factor), boxplots for qualitative-quantitative pairs, and provides the correlation for quantitative-quantitative pairs. What we really care about here is price, so let’s focus on that. We can see what might be relationships between price and clarity, and between price and color, which we’ll keep in mind for later when we start modeling our data, but the critical factor driving price is the size/weight of a diamond. Yet as the carat-price panel shows, the relationship between price and diamond size is non-linear. What might explain this pattern? On the supply side, larger contiguous chunks of diamond without significant flaws are probably much harder to find than smaller ones. This may help explain the exponential-looking curve, and I thought I noticed this when I was shopping for a diamond for my soon-to-be wife. Of course, this is related to the fact that the weight of a diamond is a function of volume, and volume is a function of x * y * z, suggesting that we might be especially interested in the cube root of carat weight.</p>
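<p>The volume argument is easy to sanity-check against the data. The sketch below treats <code>x * y * z</code> as a bounding-box volume, which is a simplifying assumption since a cut diamond does not fill its bounding box, and drops the handful of rows where a dimension is recorded as zero.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r">library(ggplot2)
data(diamonds)
vol = with(diamonds, x * y * z)   # bounding-box volume, x, y, z in mm
ok = vol != 0                     # a few rows record a dimension as 0
# If weight tracks volume, carat and x * y * z should be strongly correlated
cor_vol = cor(diamonds$carat[ok], vol[ok])
# ...and a single linear dimension should track the cube root of carat
cor_len = cor(diamonds$carat[ok]^(1/3), diamonds$x[ok])
c(volume = cor_vol, length = cor_len)</code></pre></div>
<p>Correlations close to 1 on both lines would support using the cube root of carat as the size variable.</p>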
<p>On the demand side, customers in the market for a less expensive, smaller diamond are probably more sensitive to price than more well-to-do buyers. Many less-than-one-carat customers would surely never buy a diamond were it not for the social norm of presenting one when proposing. &nbsp;And, there are <em>fewer</em> consumers who can afford a diamond larger than one carat. &nbsp;Hence, we shouldn’t expect the market for bigger diamonds to be as competitive as that for smaller ones, so it makes sense that the variance as well as the price would increase with carat size.</p>
<p>Often the distribution of a monetary variable will be highly skewed and vary over orders of magnitude. This can result from path-dependence (e.g., the rich get richer) and/or the multiplicative processes (e.g., year-on-year inflation) that produce the ultimate price/dollar amount. Hence, it’s a good idea to look into compressing any such variable by putting it on a log scale (for more take a look at <a href="http://www.r-statistics.com/2013/05/log-transformations-for-skewed-and-wide-distributions-from-practical-data-science-with-r/">this guest post on Tal Galili’s blog</a>).</p>
<div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb2-1">p <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">qplot</span>(price, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data=</span>diamonds, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">binwidth=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb2-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb2-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggtitle</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Price'</span>)</span>
<span id="cb2-4">p</span>
<span id="cb2-5">p <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">qplot</span>(price, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data=</span>diamonds, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">binwidth =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb2-6"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_x_log10</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb2-7"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb2-8"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggtitle</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Price (log10)'</span>)</span>
<span id="cb2-9">p</span></code></pre></div>
<p><img src="https://solomonmg.github.io/img/0cc713fc225f620a8284b05a00900ec51e379221.png" class="img-fluid"></p>
<p><img src="https://solomonmg.github.io/img/cdd7c96b96c0c2deef8844c214703a2243a3070b.png" class="img-fluid"></p>
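<p>The two histograms can also be summarized with a single number each. Here is a rough sketch using sample skewness, where values near zero indicate a roughly symmetric distribution. Note that <code>skew()</code> is a helper defined here for illustration; it is not a base R function.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r">library(ggplot2)
data(diamonds)
# Sample skewness: third central moment over the cubed standard deviation
skew = function(v) mean((v - mean(v))^3) / sd(v)^3
skew_raw = skew(diamonds$price)
skew_log = skew(log10(diamonds$price))
c(raw = skew_raw, log10 = skew_log)</code></pre></div>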
<p>Indeed, we can see that the prices for diamonds are heavily skewed, but when put on a log10 scale seem much better behaved (i.e., closer to the bell curve of a normal distribution). &nbsp;In fact, we can see that the data show some evidence of bimodality on the log10 scale, consistent with our two-class, “rich-buyer, poor-buyer” speculation about the nature of customers for diamonds. Let’s re-plot our data, but now let’s put price on a log10 scale:</p>
<div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb3-1">p <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">qplot</span>(carat, price, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data=</span>diamonds) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb3-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_y_continuous</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">trans=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log10_trans</span>() ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb3-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb3-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggtitle</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Price (log10) by Carat'</span>)</span>
<span id="cb3-5">p</span></code></pre></div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/f7fcbbe09988fd9cddc474aecbbd918c448b7575.png" class="img-fluid figure-img"></p>
<figcaption>caratpricelog10</figcaption>
</figure>
</div>
<p>Better, though still a little funky. Let’s try using the cube root of carat, as we speculated above:</p>
<div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb4-1">cubroot_trans <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>() <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">trans_new</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'cubroot'</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">transform=</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(x) x<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">^</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">inverse =</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(x) x<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">^</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span> )</span>
<span id="cb4-2">p <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">qplot</span>(carat, price, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data=</span>diamonds) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_x_continuous</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">trans=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cubroot_trans</span>(), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">limits =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>),</span>
<span id="cb4-4"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">breaks =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_y_continuous</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">trans=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log10_trans</span>(), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">limits =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">350</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15000</span>),</span>
<span id="cb4-6"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">breaks =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">350</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5000</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10000</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15000</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-7"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-8"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggtitle</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Price (log10) by Cubed-Root of Carat'</span>)</span>
<span id="cb4-9">p</span></code></pre></div>
<p><img src="https://solomonmg.github.io/img/271f4d38318b2eefdbe9c5ca6ed34a467e1dab66.png" class="img-fluid"></p>
<p>Nice, looks like an almost-linear relationship after applying the transformations above to get our variables on a nice scale.</p>
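<p>One rough way to check the “almost-linear” impression is to compare straight-line fits before and after the transformations. This is only a sketch: the two R-squared values are computed on different response scales, so treat the comparison as a gauge of linearity rather than a formal model comparison.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r">library(ggplot2)
data(diamonds)
# Straight-line fit on the raw scale
fit_raw = lm(price ~ carat, data = diamonds)
# The same fit after the transformations used in the plot above
fit_trans = lm(log10(price) ~ I(carat^(1/3)), data = diamonds)
r2 = c(raw = summary(fit_raw)$r.squared,
       transformed = summary(fit_trans)$r.squared)
r2</code></pre></div>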
</section>
<section id="overplotting" class="level2">
<h2 class="anchored" data-anchor-id="overplotting">Overplotting</h2>
<p>Note that until now I haven’t done anything about overplotting—where multiple points take on the same value, often due to rounding. &nbsp;Indeed, price is rounded to dollars and carats are rounded to two digits. &nbsp;Not bad, though when we’ve got this much data we’re going to have some serious overplotting.</p>
<div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb5-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">head</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sort</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">table</span>(diamonds<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>carat), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">decreasing=</span><span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span> ))</span>
<span id="cb5-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">head</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sort</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">table</span>(diamonds<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>price), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">decreasing=</span><span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span> ))</span></code></pre></div>
<pre><code> 0.3 0.31 1.01  0.7 0.32    1 
2604 2249 2242 1981 1840 1558 

605 802 625 828 776 698 
132 127 126 125 124 121 </code></pre>
<p>Often you can deal with this by making your points smaller, using “jittering” to randomly shift points to make multiple points visible, and using transparency, which can be done in ggplot using the “alpha” parameter.</p>
<div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb7-1">p <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>( <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data=</span>diamonds, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(carat, price)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb7-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_point</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">alpha =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size =</span> .<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">75</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">position=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'jitter'</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb7-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_x_continuous</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">trans=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cubroot_trans</span>(), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">limits =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>),</span>
<span id="cb7-4"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">breaks =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb7-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_y_continuous</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">trans=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log10_trans</span>(), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">limits =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">350</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15000</span>),</span>
<span id="cb7-6"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">breaks =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">350</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5000</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10000</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15000</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb7-7"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb7-8"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggtitle</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Price (log10) by Cubed-Root of Carat'</span>)</span>
<span id="cb7-9">p</span></code></pre></div>
<p><img src="https://solomonmg.github.io/img/8f93ee4c395a7382a2893b01d640b65e7006ebe9.png" class="img-fluid"></p>
<p>This gives us a better sense of how dense and sparse our data is at key places.</p>
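<p>To get a feel for how much overplotting there is before any jittering, one option is to count the rows that share their exact carat and price with at least one other row. This is a minimal sketch in base R; <code>pairs_cp</code> and <code>share_overplotted</code> are names introduced here for illustration.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r">library(ggplot2)
data(diamonds)
pairs_cp = diamonds[, c('carat', 'price')]
# TRUE for every row whose (carat, price) pair also occurs in another row
shared = duplicated(pairs_cp) | duplicated(pairs_cp, fromLast = TRUE)
share_overplotted = mean(shared)
share_overplotted</code></pre></div>
<p>This only counts exact ties. Points that merely land within a marker’s width of each other also overplot, which is why transparency helps even where no two rows are identical.</p>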
</section>
<section id="using-color-to-understand-qualitative-factors" class="level2">
<h2 class="anchored" data-anchor-id="using-color-to-understand-qualitative-factors">Using Color to Understand Qualitative Factors</h2>
<p>When I was looking around at diamonds, I also noticed that clarity seemed to factor into price. Of course, many consumers are looking for a diamond of a certain size, so we shouldn’t expect clarity to be as strong a factor as carat weight. And I must admit that even though my grandparents were jewelers, I initially had a hard time discerning a diamond rated VVS1 from one rated SI2. Surely most people need a loupe to tell the difference. And, <a href="http://www.bluenile.com/diamonds/diamond-cut">according to BlueNile, the cut of a diamond has a much more consequential impact on that “fiery” quality that jewelers describe as the quintessential characteristic of a diamond</a>. On clarity, the website states, “<a href="http://www.bluenile.com/diamonds/diamond-clarity">Many of these imperfections are microscopic, and do not affect a diamond’s beauty in any discernible way</a>.” Yet, clarity seems to explain an awful lot of the remaining variance in price when we visualize it as a color on our plot:</p>
<div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb8-1">p <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>( <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data=</span>diamonds, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(carat, price, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour=</span>clarity)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb8-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_point</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">alpha =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size =</span> .<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">75</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">position=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'jitter'</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb8-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_colour_brewer</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">type =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'div'</span>,</span>
<span id="cb8-4"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">guide =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">guide_legend</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NULL</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">reverse=</span>T,</span>
<span id="cb8-5"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">override.aes =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">alpha =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>))) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb8-6"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_x_continuous</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">trans=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cubroot_trans</span>(), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">limits =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>),</span>
<span id="cb8-7"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">breaks =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb8-8"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_y_continuous</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">trans=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log10_trans</span>(), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">limits =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">350</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15000</span>),</span>
<span id="cb8-9"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">breaks =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">350</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5000</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10000</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15000</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb8-10"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">legend.key =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>()) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb8-11"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggtitle</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Price (log10) by Cube-Root of Carat and Clarity'</span>)</span>
<span id="cb8-12">p</span></code></pre></div>
<p><img src="https://solomonmg.github.io/img/8ec0d95374a095c4268c9cfad923354f819bbdfb.png" class="img-fluid"></p>
<hr>
<p>Despite what BlueNile says, we don’t see as much variation on cut (though most diamonds in this data set are ideal cut anyway):</p>
<hr>
<div class="sourceCode" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb9-1">p <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>( <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data=</span>diamonds, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(carat, price, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour=</span>cut)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb9-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_point</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">alpha =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size =</span> .<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">75</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">position=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'jitter'</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb9-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_colour_brewer</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">type =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'div'</span>,</span>
<span id="cb9-4"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">guide =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">guide_legend</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NULL</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">reverse=</span>T,</span>
<span id="cb9-5"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">override.aes =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">alpha =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>))) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb9-6"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_x_continuous</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">trans=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cubroot_trans</span>(), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">limits =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>),</span>
<span id="cb9-7"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">breaks =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb9-8"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_y_continuous</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">trans=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log10_trans</span>(), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">limits =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">350</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15000</span>),</span>
<span id="cb9-9"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">breaks =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">350</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5000</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10000</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15000</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb9-10"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">legend.key =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>()) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb9-11"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggtitle</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Price (log10) by Cube-Root of Carat and Cut'</span>)</span>
<span id="cb9-12">p</span></code></pre></div>
<p><img src="https://solomonmg.github.io/img/1ef4c90edca2b6e2014d25f2a901a221f9926f6e.png" class="img-fluid"></p>
<p>Color seems to explain some of the variance in price as well, though <a href="http://www.bluenile.com/diamonds/diamond-color">BlueNile states that the differences among color grades D through J are basically not noticeable</a>.</p>
<div class="sourceCode" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb10-1">p <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>( <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data=</span>diamonds, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(carat, price, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour=</span>color)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb10-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_point</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">alpha =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size =</span> .<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">75</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">position=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'jitter'</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb10-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_colour_brewer</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">type =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'div'</span>,</span>
<span id="cb10-4"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">guide =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">guide_legend</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NULL</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">reverse=</span>T,</span>
<span id="cb10-5"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">override.aes =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">alpha =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>))) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb10-6"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_x_continuous</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">trans=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cubroot_trans</span>(), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">limits =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>),</span>
<span id="cb10-7"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">breaks =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb10-8"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_y_continuous</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">trans=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log10_trans</span>(), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">limits =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">350</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15000</span>),</span>
<span id="cb10-9"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">breaks =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">350</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5000</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10000</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15000</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb10-10"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">legend.key =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>()) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb10-11"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggtitle</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Price (log10) by Cube-Root of Carat and Color'</span>)</span>
<span id="cb10-12">p</span></code></pre></div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://solomonmg.github.io/img/277ef617f6ebe3749e64a855be635406ea80dba7.png" class="img-fluid figure-img"></p>
<figcaption>Price (log10) by cube root of carat, colored by diamond color</figcaption>
</figure>
</div>
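<p>The three plots above differ only in which variable is mapped to colour, so the repeated code could be folded into a small helper. This is just a sketch: <code>plot_by()</code> is a hypothetical name, and <code>cubroot_trans()</code> is redefined here with <code>scales::trans_new()</code> only so the block runs on its own; it is assumed to match the transformation used in the plots above.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r">library(ggplot2)
library(scales)

# Cube-root axis transformation (assumed to match the one used above)
cubroot_trans = function() {
  trans_new('cubroot',
            transform = function(x) x^(1/3),
            inverse = function(x) x^3)
}

# Hypothetical helper: the same scatterplot, coloured by any factor column
plot_by = function(colour_var, data = diamonds) {
  ggplot(data, aes(carat, price, colour = .data[[colour_var]])) +
    geom_point(alpha = 0.5, size = .75, position = 'jitter') +
    scale_colour_brewer(type = 'div',
      guide = guide_legend(title = NULL, reverse = TRUE,
                           override.aes = list(alpha = 1))) +
    scale_x_continuous(trans = cubroot_trans(), limits = c(0.2, 3),
                       breaks = c(0.2, 0.5, 1, 2, 3)) +
    scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                       breaks = c(350, 1000, 5000, 10000, 15000)) +
    theme_bw() + theme(legend.key = element_blank()) +
    ggtitle(paste('Price (log10) by Cube-Root of Carat and', colour_var))
}

# Build the clarity version; plot_by('cut') and plot_by('color') work the same way
p = plot_by('clarity')</code></pre></div>
<p>Generating the title from the column name also keeps each title in sync with the variable actually being plotted.</p>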
<p>At this point, we’ve got a pretty good idea of how we might model price. But there are a few problems with our 2008 data: not only do we need to account for inflation, but the diamond market is quite different now than it was in 2008. In fact, when I fit models to this data and then attempted to predict the price of diamonds I found on the market, I kept getting predictions that were far too low. After some additional digging, I found the <a href="http://www.bain.com/publications/articles/global-diamond-report-2013.aspx">Global Diamond Report</a>. It turns out that prices plummeted in 2008 due to the global financial crisis, and since then prices (at least for wholesale polished diamonds) have grown at roughly a 6 percent compound annual rate. The <a href="http://diamonds.blogs.com/diamonds_update/diamond-prices/">rapidly growing number of couples in China buying diamond engagement rings</a> might also help explain this increase. After looking at data on PriceScope, I realized that <a href="http://www.pricescope.com/diamond-prices/diamond-prices-chart">diamond prices grew unevenly across different carat sizes</a>, meaning that the model I initially estimated couldn’t simply be adjusted by inflation. While I could have done OK with that model, I really wanted to estimate a new model based on fresh data.</p>
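<p>To get a feel for what a roughly 6 percent compound annual rate implies, a back-of-the-envelope calculation helps. The five-year horizon below is an arbitrary illustration, not a figure taken from the report.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Rough compounding check; the 5-year horizon is a hypothetical choice
annual_rate = 0.06
years = 5
growth_factor = (1 + annual_rate)^years
growth_factor</code></pre></div>
<p>That works out to a cumulative increase of roughly a third. Since growth differed by carat size, no single multiplier like this would have repaired the old model’s predictions.</p>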
<p>Thankfully I was able to put together a <a href="https://github.com/solomonm/diamonds-data/blob/master/dinfo.py">Python script to scrape diamondse.info</a> without too much trouble. This dataset is about 10 times the size of the 2008 diamonds data set and features diamonds from all over the world certified by an array of authorities besides just the Gemological Institute of America (GIA). You can read in this data as follows (be forewarned: it’s over 500K rows):</p>
<div class="sourceCode" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb11-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#install.packages('RCurl')</span></span>
<span id="cb11-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'RCurl'</span>)</span>
<span id="cb11-3">diamondsurl <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">getBinaryURL</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'https://raw.github.com/solomonm/diamonds-data/master/BigDiamonds.Rda'</span>)</span>
<span id="cb11-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">load</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rawConnection</span>(diamondsurl))</span></code></pre></div>
<p>My <a href="https://github.com/solomonm/diamonds-data">GitHub repository has the code necessary to replicate each of the figures above</a>. Most look quite similar, though this data set contains much more expensive diamonds than the original. Regardless of whether you’re using the original diamonds data set or the current larger diamonds data set, you can estimate a model based on what we learned from our scatterplots. We’ll regress log-price on carat, the cube root of carat, clarity, cut and color. I’m using only GIA-certified diamonds in this model and looking only at diamonds under $10K because these are the type of diamonds sold at most retailers I’ve seen and hence the kind I care most about. By trimming the most expensive diamonds from the dataset, our model will also be less likely to be thrown off by outliers at the high end of price and carat. The new data set has mostly the same columns as the old one, so we can just run the following (if you want to run it on the old data set, set data=diamonds, create the logprice column there, and drop the cert filter, since the old data set has no cert column).</p>
<div class="sourceCode" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb12-1">diamondsbig<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>logprice <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log</span>(diamondsbig<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>price)</span>
<span id="cb12-2">m1 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(logprice<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">I</span>(carat<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">^</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)),</span>
<span id="cb12-3"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data=</span>diamondsbig[diamondsbig<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>price <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10000</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span> diamondsbig<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>cert <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'GIA'</span>,])</span>
<span id="cb12-4">m2 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">update</span>(m1, <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> . <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> carat)</span>
<span id="cb12-5">m3 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">update</span>(m2, <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> . <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> cut )</span>
<span id="cb12-6">m4 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">update</span>(m3, <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> . <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> color <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> clarity)</span>
<span id="cb12-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#install.packages('memisc')</span></span>
<span id="cb12-8"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(memisc)</span>
<span id="cb12-9"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mtable</span>(m1, m2, m3, m4)</span></code></pre></div>
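<p>For the original data set, a little adaptation is needed, since ggplot2’s <code>diamonds</code> has no <code>logprice</code> or <code>cert</code> column. Here is a minimal sketch, assuming ggplot2 is installed; the <code>_old</code> names are hypothetical and chosen only to avoid overwriting the models above.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r">library(ggplot2)  # provides the original diamonds data

diamonds_old = as.data.frame(diamonds)
diamonds_old$logprice = log(diamonds_old$price)
# There is no cert column in this data set, so only the price filter applies
diamonds_old = diamonds_old[diamonds_old$price &lt; 10000, ]

# cut, color and clarity are ordered factors here, so lm() uses polynomial
# contrasts by default and the coefficient labels differ from the table below
m1_old = lm(logprice ~ I(carat^(1/3)), data = diamonds_old)
m2_old = update(m1_old, ~ . + carat)
m3_old = update(m2_old, ~ . + cut)
m4_old = update(m3_old, ~ . + color + clarity)

# R-squared for each nested model
r2_old = sapply(list(m1 = m1_old, m2 = m2_old, m3 = m3_old, m4 = m4_old),
                function(m) summary(m)$r.squared)
round(r2_old, 3)</code></pre></div>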
<p>Here are the results for my recently scraped data set:</p>
<pre><code>===============================================================
                    m1          m2          m3          m4     
---------------------------------------------------------------
(Intercept)       2.671***    1.333***    0.949***   -0.464*** 
                 (0.003)     (0.012)     (0.012)     (0.009)   
I(carat^(1/3))    5.839***    8.243***    8.633***    8.320*** 
                 (0.004)     (0.022)     (0.021)     (0.012)   
carat                        -1.061***   -1.223***   -0.763*** 
                             (0.009)     (0.009)     (0.005)   
cut: V.Good                               0.120***    0.071*** 
                                         (0.002)     (0.001)   
cut: Ideal                                0.211***    0.131*** 
                                         (0.002)     (0.001)   
color: K/L                                            0.117*** 
                                                     (0.003)   
color: J/L                                            0.318*** 
                                                     (0.002)   
color: I/L                                            0.469*** 
                                                     (0.002)   
color: H/L                                            0.602*** 
                                                     (0.002)   
color: G/L                                            0.665*** 
                                                     (0.002)   
color: F/L                                            0.723*** 
                                                     (0.002)   
color: E/L                                            0.756*** 
                                                     (0.002)   
color: D/L                                            0.827*** 
                                                     (0.002)   
clarity: I1                                           0.301*** 
                                                     (0.006)   
clarity: SI2                                          0.607*** 
                                                     (0.006)   
clarity: SI1                                          0.727*** 
                                                     (0.006)   
clarity: VS2                                          0.836*** 
                                                     (0.006)   
clarity: VS1                                          0.891*** 
                                                     (0.006)   
clarity: VVS2                                         0.935*** 
                                                     (0.006)   
clarity: VVS1                                         0.995*** 
                                                     (0.006)   
clarity: IF                                           1.052*** 
                                                     (0.006)   
---------------------------------------------------------------
R-squared             0.888       0.892      0.899        0.969
N                338946      338946     338946       338946    
===============================================================</code></pre>
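<p>Because the outcome is log-price, each coefficient on a category can be read as an approximate percentage difference in price relative to the omitted baseline category, holding the other terms fixed. A small sketch of that conversion, using three values copied by hand from the m4 column above (the vector names are hypothetical):</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Coefficients copied from the m4 column of the table above
coefs = c(cut_ideal = 0.131, color_D = 0.827, clarity_IF = 1.052)

# exp(b) - 1 is the proportional price difference versus the baseline level
pct_diff = 100 * (exp(coefs) - 1)
round(pct_diff, 1)</code></pre></div>
<p>For example, the Ideal cut coefficient of 0.131 corresponds to a price roughly 14 percent higher than the baseline cut, all else equal.</p>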
<p>Now those are some very nice R-squared values: we are accounting for almost all of the variance in price with the 4Cs. &nbsp;If we want to know whether the price for a diamond is reasonable, we can now use this model and exponentiate the result (since we took the log of price). &nbsp;We need to multiply the result by exp(sigma^2/2), because once we exponentiate, the error term no longer averages out: its log has mean zero, but with normal errors exp(error) has mean exp(sigma^2/2):</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0AE(log(y)%20%5Cmid%20%5Cmathbf%7BX%7D%20=%20%5Cmathbf%7Bx%7D)%20&amp;=%20E(%5Cmathbf%7BX%7D%5Cbeta%20+%20%5Cepsilon)%5C%5C%0A%20%20E(y%20%5Cmid%20%5Cmathbf%7BX%7D%20=%20%5Cmathbf%7Bx%7D)%20&amp;=%20E(%20exp(%20%5Cmathbf%7BX%7D%5Cbeta%20+%20%5Cepsilon%20)%20)%5C%5C%0A%20%20&amp;=%20E(%20exp(%20%5Cmathbf%7BX%7D%5Cbeta%20)%20%5Ctimes%20exp(%20%5Cepsilon%20)%20)%20%5C%5C%0A%20%20&amp;=%20E(%20exp(%20%5Cmathbf%7BX%7D%5Cbeta%20)%20)%20%5Ctimes%20E(%20exp(%20%5Cepsilon%20)%20)%20%5C%5C%0A%20%20&amp;=%20exp(%5Cmathbf%7BX%7D%5Chat%5Cbeta)%20%5Ctimes%20exp(%20%5Cfrac%7B%5Chat%5Csigma%5E2%7D%7B2%7D%20)%0A%5Cend%7Balign*%7D%0A"></p>
<p>To dig further into that last step, have a look at the <a href="http://en.wikipedia.org/wiki/Log-normal_distribution#Arithmetic_moments">Wikipedia page on log-normal distributed variables</a>. Thanks to <a href="https://sites.google.com/site/miguelgodinhomatos/">Miguel</a> for catching this. Let’s take a look at an example from Blue Nile. I’ll use the full model, m4.</p>
<div class="sourceCode" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb14-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Example from BlueNile</span></span>
<span id="cb14-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Round 1.00 Very Good I VS1 $5,601</span></span>
<span id="cb14-3">thisDiamond <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">carat =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.00</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">cut =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'V.Good'</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'I'</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">clarity=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'VS1'</span>)</span>
<span id="cb14-4">modEst <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">predict</span>(m4, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">newdata =</span> thisDiamond, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">interval=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'prediction'</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">level =</span> .<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">95</span>)</span>
<span id="cb14-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">exp</span>(modEst) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">exp</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summary</span>(m4)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>sigma<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">^</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span></code></pre></div>
<p>The results yield an expected value for price given the characteristics of our diamond, along with the upper and lower bounds of a 95% prediction interval. Note that because this is a linear model, predict() is just multiplying each model coefficient by the corresponding value in our data and summing the results. It turns out that this diamond is a touch pricier than the expected value under the full model, though it is by no means outside our 95% interval. BlueNile has by most accounts a better reputation than diamondse.info, however, and reputation is worth a lot in a business that relies on easy-to-forge certificates and in which the non-expert can be easily fooled.</p>
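<p>As a rough check on the back-transformation, here is a sketch on simulated data (not the diamonds data; <code>fit</code>, <code>sigma</code>, and the coefficients 1 and 0.8 are made up for illustration). It checks that predict() on a linear model matches the coefficients multiplied by the data values and summed, and that exponentiating a log-scale prediction understates the expected value unless it is multiplied by exp(sigma^2/2), assuming normal errors.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Simulated stand-in for a log(price) model; every name and number here is hypothetical
set.seed(1)
n = 100000
sigma = 0.5
x = rnorm(n)
logy = 1 + 0.8 * x + rnorm(n, sd = sigma)
fit = lm(logy ~ x)

# predict() for a linear model: multiply coefficients by data values and sum
new = data.frame(x = 1)
byHand = sum(coef(fit) * c(1, new$x))
sameAsPredict = isTRUE(all.equal(unname(predict(fit, newdata = new)), byHand))

# Back-transform to the original scale, without and with the log-normal correction
naive = exp(byHand)
corrected = naive * exp(summary(fit)$sigma^2 / 2)
truth = exp(1 + 0.8 * 1 + sigma^2 / 2)  # E(y | x = 1) in this simulation
c(naive = naive, corrected = corrected, truth = truth)</code></pre></div>
<p>Under these assumptions the naive back-transform should land near exp(1.8), while the corrected value should land near exp(1.925), the true conditional mean in this simulation.</p>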
<p>This illustrates an important point about generalizing a model from one data set to another. First, there may be important differences between data sets, as I’ve speculated about above, making the estimates systematically biased. Second, there is overfitting: our model may be fitting noise present in the data set. Even a model cross-validated against out-of-sample predictions can be over-fit to noise that results in differences between data sets. Of course, while this model may give you a sense of whether your diamond is a rip-off against diamondse.info diamonds, it’s not clear that diamondse.info should be regarded as a source of universal truth about whether the price of a diamond is reasonable. Nonetheless, to have the expected price at diamondse.info with a 95% interval is a lot more information than we had about the price we should be willing to pay for a diamond before we started this exercise.</p>
<p>An important point—even though we can predict diamondse.info prices almost perfectly based on a function of the 4c’s, one thing that you should NOT conclude from this exercise is that <em>where</em> you buy your diamond is irrelevant, which apparently used to be conventional wisdom in some circles. &nbsp;You will almost surely pay more if you <a href="https://web.archive.org/web/20131231035425/http://www.businessweek.com/articles/2013-05-06/tiffany-vs-dot-costco-which-diamond-ring-is-better">buy the same diamond at Tiffany’s versus Costco</a>. But <a href="https://web.archive.org/web/20140217105722/http://www.costco.com:80/2.12-ctw-Round-Brilliant-Cut-Internally-Flawless,-D-Color-Diamond-%22Audrey%22-Platinum-Wedding-Set.product.100006730.html">Costco sells some pricy diamonds</a> as well. Regardless, you can use this kind of model to give you an indication of whether you’re overpaying.</p>
<p>Of course, the value of a natural diamond is largely socially constructed. Like money, diamonds are only valuable because society says they are: there are no obvious economic efficiencies to be gained or return on investment in a diamond, except perhaps in a very subjective sense concerning your relationship with your significant other. To get a sense for just how much value is socially constructed, you can compare the price of a natural diamond to that of a synthetic diamond, which, thanks to recent technological developments, is of comparable quality to a “natural” diamond. Of course, natural diamonds fetch a dramatically higher price.</p>
<p>One last thing: there are few guarantees in life, and I offer none here. Though what we have here seems pretty good, data and models are never infallible, and obviously you can still get taken (or be persuaded to pass on a great deal) based on this model. Always shop with a reputable dealer, and make sure her incentives are aligned against selling you an overpriced diamond or, worse, one that doesn’t match its certificate. There’s no substitute for establishing a personal connection and lasting business relationship with an established jeweler you can trust.</p>
</section>
<section id="one-final-consideration" class="level2">
<h2 class="anchored" data-anchor-id="one-final-consideration">One Final Consideration</h2>
<p>Plotting your data can help you understand it and can yield key insights. &nbsp;But even scatterplot visualizations can be deceptive if you’re not careful. &nbsp;Consider another data set that comes with the alr3 package: soil temperature data from&nbsp;Mitchell, Nebraska, collected by&nbsp;Kenneth G. Hubbard from&nbsp;1976 to 1992, which I came across in&nbsp;Weisberg, S. (2005).&nbsp;<em>Applied Linear Regression</em>, 3rd edition. New York: Wiley (from which I’ve shamelessly stolen this example). Let’s plot the data, naively:</p>
<div class="sourceCode" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb15-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#install.packages('alr3')</span></span>
<span id="cb15-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(alr3)</span>
<span id="cb15-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data</span>(Mitchell)</span>
<span id="cb15-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">qplot</span>(Month, Temp, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> Mitchell) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>()</span></code></pre></div>
<p><img src="https://solomonmg.github.io/img/f02902f327e42be253eb2e245e5876de3395060b.png" class="img-fluid"></p>
<p>Looks kinda like noise. &nbsp;What’s the story here? When all else fails, think about it. What’s on the X axis? &nbsp;Month. &nbsp;What’s on the Y-axis? &nbsp;Temperature. &nbsp;Hmm, well there are seasons in Nebraska, so temperature should fluctuate every 12 months.</p>
<p>But we’ve put more than 200 months in a pretty tight space.</p>
<p>Let’s stretch it out and see how it looks:</p>
<p><img src="https://solomonmg.github.io/img/11fe76cda575e832918edc0bccf89bd911dc04ee.png" class="img-fluid"></p>
<p>Don’t make that mistake.</p>
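<p>One way to get a stretched-out view like the one above is to make the panel much wider than it is tall and to put an axis break at every twelfth month. The sketch below is an assumption about how that could be done, not the code behind the figure; it uses a simulated seasonal series (the <code>soil</code> data frame and its values are hypothetical) so that it runs without the alr3 package.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r">library(ggplot2)

# Simulated stand-in for monthly soil temperatures: 204 months, 12-month cycle
set.seed(1)
soil = data.frame(Month = 0:203)
soil$Temp = 10 + 12 * sin(2 * pi * soil$Month / 12) + rnorm(nrow(soil), sd = 2)

p = ggplot(soil, aes(x = Month, y = Temp)) +
  geom_point() +
  scale_x_continuous(breaks = seq(0, 203, by = 12)) +  # one break per year
  theme_bw() +
  theme(aspect.ratio = 1 / 8)  # wide and short, so each yearly cycle has room
p</code></pre></div>
<p>With the real data, swapping <code>soil</code> for <code>Mitchell</code> should give the same kind of plot, since both have a Month and a Temp column.</p>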
<p>That concludes part I of this series on scatterplots. &nbsp;Part II will illustrate the advantages of using facets/panels/small multiples, and show how tools to fit trendlines including linear regression and local regression (loess) can help yield additional insight about your data.</p>
<p>You can also learn more about <a href="https://www.udacity.com/course/ud651">exploratory data analysis via this Udacity course taught by my colleagues Dean Eckles and Moira Burke, and Chris Saden</a>, which will be coming out in the next few weeks.</p>


</section>

 ]]></description>
  <guid>https://solomonmg.github.io/blog/visualization-series-scatterplot-understanding-the-diamond-market/</guid>
  <pubDate>Sun, 02 Feb 2020 00:00:00 GMT</pubDate>
  <media:content url="https://solomonmg.github.io/blog/visualization-series-scatterplot-understanding-the-diamond-market/featured.png" medium="image" type="image/png" height="144" width="144"/>
</item>
<item>
  <title>How to break regression</title>
  <dc:creator>Sol Messing</dc:creator>
  <link>https://solomonmg.github.io/blog/how-to-break-regression/</link>
  <description><![CDATA[ 





<p>Regression models are a cornerstone of modern social science. They’re at the heart of efforts to estimate causal relationships between variables in a multivariate environment and are the basic building blocks of many machine learning models. Yet social scientists can run into a lot of situations where regression models break.</p>
<p>Famed social psychologist Richard Nisbett <a href="https://www.edge.org/conversation/richard_nisbett-the-crusade-against-multiple-regression-analysis">recently argued</a> that regression analysis is so misused and misunderstood that analyses based on multiple regression “are often somewhere between meaningless and quite damaging.” (He was mainly talking about cases in which researchers publish correlational results that are covered in the media as causal statements about the world.)</p>
<p>Below, I’ll walk through some of the potential pitfalls you might encounter when you fire up your favorite <a href="https://seanjtaylor.com/post/39573264781/the-statistics-software-signal">statistical software</a> package and run regressions. Specifically, I’ll be using simulation in R as an educational tool to help you better understand the ways in which regressions can break.</p>
<p><strong>Using simulations to unpack regression</strong></p>
<p>The idea of using R simulations to help understand regression models was inspired by Ben Ogorek’s <a href="http://anythingbutrbitrary.blogspot.com/2016/01/how-to-create-confounders-with.html">post</a> on regression confounders and collider bias.</p>
<p>The great thing about using simulation in this way is that you control the world that generates your data. The code I’ll introduce below represents the true <em>data-generating process</em>, since I’m using R’s random number generators to simulate the data. In real life, of course, we only have the data we observe, and we don’t really know how the data-generating process works unless we have a solid theory (like Newtonian physics or evolution) in which the system of relevant variables and causal relationships is well understood. Social science has no real analogue to such a theory.</p>
<p>What I’ll do here is create a dataset based on two random standard normal variables by simulating them using the <em>rnorm()</em> function, which draws random values from a normal distribution with mean 0 and standard deviation 1, unless you specify otherwise. I’ll create a functional relationship between y and x such that a 1 unit increase in x will be associated with a&nbsp;.4 unit increase in y.</p>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># make the code reproducible by setting a random number seed</span></span>
<span id="cb1-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">set.seed</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)</span>
<span id="cb1-3"></span>
<span id="cb1-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># When everything works:</span></span>
<span id="cb1-5">N <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span></span>
<span id="cb1-6">x <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(N)</span>
<span id="cb1-7">y <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> .<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> x <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(N)</span>
<span id="cb1-8"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">hist</span>(x)</span>
<span id="cb1-9"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">hist</span>(y)</span>
<span id="cb1-10"></span>
<span id="cb1-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Now estimate our model:</span></span>
<span id="cb1-12"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summary</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> x))</span>
<span id="cb1-13"></span>
<span id="cb1-14">Call<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span></span>
<span id="cb1-15"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">formula =</span> y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> x)</span>
<span id="cb1-16">Residuals<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span></span>
<span id="cb1-17">    Min      <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>Q  Median      <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>Q     Max </span>
<span id="cb1-18"><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.0348</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.7013</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0085</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.6212</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.1688</span> </span>
<span id="cb1-19">Coefficients<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span></span>
<span id="cb1-20">                    Estimate Std. Error t value <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Pr</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">|</span>t<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|</span>)    </span>
<span id="cb1-21">(Intercept)     <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.003921</span>   <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.031039</span>   <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.126</span>    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.899</span>    </span>
<span id="cb1-22">x               <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.413415</span>   <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.030129</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">13.722</span>   <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2e-16</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">*</span></span>
<span id="cb1-23"><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">---</span></span>
<span id="cb1-24">Signif. codes<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>  <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> ‘<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">*</span>’ <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.001</span> ‘<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>’ <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span> ‘<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>’ <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span> ‘.’ <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span> ‘ ’ <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb1-25">Residual standard error<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.9814</span> on <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">998</span> degrees of freedom</span>
<span id="cb1-26">Multiple R<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>squared<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1587</span>,    Adjusted R<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>squared<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1579</span> </span>
<span id="cb1-27">F<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>statistic<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">188.3</span> on <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> and <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">998</span> DF,  p<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>value<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">&lt;</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.2e-16</span></span>
<span id="cb1-28"></span>
<span id="cb1-29"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Plot it</span></span>
<span id="cb1-30"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(ggplot2)</span>
<span id="cb1-31"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">qplot</span>(x, y) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb1-32">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_smooth</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">method=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'lm'</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb1-33">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb1-34">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggtitle</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"The Perfect Regression"</span>)</span></code></pre></div>
<p>Notice that the model estimates the functional relationship between x and y that I simulated quite well. The plot looks like this:</p>
<p><img src="https://cdn-images-1.medium.com/max/1600/1*0zIR7Mtuak5DamPOgEiHog.png" class="img-fluid"></p>
<p>What about omitted variables? Our machinery actually still works if there is another factor causing y, as long as it is <em>uncorrelated</em> with x.</p>
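<p>Here is a quick standalone sketch of that claim. The variable names and the .3 coefficient on w are my own choices for illustration, and I use a larger sample than above so that the estimate is less noisy. Because w is drawn independently of x, leaving it out of the model should still give an estimate close to .4.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical example: w causes y but is independent of x
set.seed(100)
n_sim &lt;- 10000
x_ind &lt;- rnorm(n_sim)
w_ind &lt;- rnorm(n_sim)  # drawn separately, so uncorrelated with x_ind
y_ind &lt;- .4 * x_ind + .3 * w_ind + rnorm(n_sim)

# Omit w_ind on purpose; the slope on x_ind should still be near .4
est &lt;- unname(coef(lm(y_ind ~ x_ind))["x_ind"])
est</code></pre></div>
<p>The omitted w just ends up in the error term, which makes the residuals noisier but does not pull the slope away from .4.</p>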
<p><strong>The dreaded omitted variable bias</strong></p>
<p>Omitted variable bias (OVB) is much feared, and judging by the top internet search results, not well understood. Some top sources say it occurs when “<a href="http://carecon.org.uk/UWEcourse/OVbias.pdf">an important</a>” variable is missing or when a variable that “<a href="https://en.wikipedia.org/wiki/Omitted-variable_bias">is correlated</a>” with both x and y is missing. I even found a university <a href="http://www3.wabash.edu/econometrics/EconometricsBook/chap18.htm">econometrics</a> course that defined OVB this way.</p>
<p>But neither of those definitions is quite right. OVB occurs when a variable that <em>causes</em> y is missing from the model (and is correlated with x). Let’s call that variable w. Because w is in play when we consider the causal relationship between x and y, it’s often referred to as “endogenous” or a “confounding variable.”</p>
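<p>For a single omitted variable, the textbook result is that the coefficient on x converges to the true effect of x plus the effect of w on y multiplied by Cov(x, w) / Var(x). The arithmetic below is a sketch that plugs in the values used in this section’s simulation (x is .5 times w plus standard normal noise, and w enters y with a coefficient of .3); the names <code>b_x</code>, <code>b_w</code>, and <code>a</code> are mine.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r">b_x &lt;- .4  # true effect of x on y
b_w &lt;- .3  # effect of the omitted variable w on y
a   &lt;- .5  # x = a * w + noise, with w and the noise both standard normal

cov_xw &lt;- a * 1        # Cov(x, w) = a * Var(w)
var_x  &lt;- a^2 * 1 + 1  # Var(x) = a^2 * Var(w) + Var(noise)

# What the regression of y on x alone converges to
implied &lt;- b_x + b_w * cov_xw / var_x
implied  # 0.52</code></pre></div>
<p>So a model that leaves out w should report a slope near .52 rather than .4, and the bias disappears if either the effect of w on y or the covariance between x and w is zero.</p>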
<p>The example below first demonstrates that w, our confounding variable, will bias our results if we fail to include it in our model. The next two examples are essentially a re-telling of the <a href="http://anythingbutrbitrary.blogspot.com/2016/01/how-to-create-confounders-with.html">post I mentioned above</a> on collider bias, but emphasizing slightly different points.</p>
<div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb2-1">w <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(N)</span>
<span id="cb2-2">x <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> .<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> w <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(N)</span>
<span id="cb2-3">y <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> .<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> x <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> .<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> w <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(N)</span>
<span id="cb2-4"></span>
<span id="cb2-5">m1 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> x)</span>
<span id="cb2-6"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summary</span> (m1) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Omitted variable bias</span></span>
<span id="cb2-7"></span>
<span id="cb2-8">Call<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span></span>
<span id="cb2-9"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">formula =</span> y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> x)</span>
<span id="cb2-10"></span>
<span id="cb2-11">Residuals<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span></span>
<span id="cb2-12">    Min      <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>Q  Median      <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>Q     Max </span>
<span id="cb2-13"><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.2190</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.7025</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0314</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.7120</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.1158</span> </span>
<span id="cb2-14"></span>
<span id="cb2-15">Coefficients<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span></span>
<span id="cb2-16">            Estimate Std. Error t value <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Pr</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">|</span>t<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|</span>)    </span>
<span id="cb2-17">(Intercept)  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01126</span>    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.03310</span>    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.34</span>    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.734</span>    </span>
<span id="cb2-18">x            <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.50179</span>    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.03049</span>   <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">16.46</span>   <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2e-16</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">*</span></span>
<span id="cb2-19"><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">---</span></span>
<span id="cb2-20">Signif. codes<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>  <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> ‘<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">*</span>’ <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.001</span> ‘<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>’ <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span> ‘<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>’ <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span> ‘.’ <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span> ‘ ’ <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb2-21"></span>
<span id="cb2-22">Residual standard error<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.046</span> on <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">998</span> degrees of freedom</span>
<span id="cb2-23">Multiple R<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>squared<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2135</span>,    Adjusted R<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>squared<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2127</span> </span>
<span id="cb2-24">F<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>statistic<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">270.9</span> on <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> and <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">998</span> DF,  p<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>value<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">&lt;</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.2e-16</span></span></code></pre></div>
<p>There it is: classic omitted variable bias. We only observed x, so the influence of the omitted variable w was attributed to x in our model. If you rerun the regression with w in the model, the estimate for x is no longer biased.</p>
<div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb3-1">m2 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> x <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> w)</span>
<span id="cb3-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summary</span>(m2) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># No omitted variable bias after conditioning on w</span></span>
<span id="cb3-3"></span>
<span id="cb3-4">Call<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span></span>
<span id="cb3-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">formula =</span> y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> x <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> w)</span>
<span id="cb3-6"></span>
<span id="cb3-7">Residuals<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span></span>
<span id="cb3-8">    Min      <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>Q  Median      <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>Q     Max </span>
<span id="cb3-9"><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.2748</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.6632</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0001</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.6933</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.9664</span> </span>
<span id="cb3-10"></span>
<span id="cb3-11">Coefficients<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span></span>
<span id="cb3-12">            Estimate Std. Error t value <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Pr</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">|</span>t<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|</span>)    </span>
<span id="cb3-13">(Intercept)  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.02841</span>    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.03141</span>   <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.905</span>    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.366</span>    </span>
<span id="cb3-14">x            <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.40627</span>    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.03132</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">12.973</span>   <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2e-16</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">*</span></span>
<span id="cb3-15">w            <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.32344</span>    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.03439</span>   <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">9.405</span>   <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2e-16</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">*</span></span>
<span id="cb3-16"><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">---</span></span>
<span id="cb3-17">Signif. codes<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>  <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> ‘<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">*</span>’ <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.001</span> ‘<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>’ <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span> ‘<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>’ <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span> ‘.’ <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span> ‘ ’ <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb3-18"></span>
<span id="cb3-19">Residual standard error<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.9927</span> on <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">997</span> degrees of freedom</span>
<span id="cb3-20">Multiple R<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>squared<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.3024</span>,    Adjusted R<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>squared<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.301</span> </span>
<span id="cb3-21">F<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>statistic<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">216.1</span> on <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> and <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">997</span> DF,  p<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>value<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">&lt;</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.2e-16</span></span></code></pre></div>
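<p>The size of the bias can also be checked directly. For least squares, the coefficient on x in the short regression (without w) equals the coefficient on x in the long regression (with w) plus the coefficient on w times the slope from regressing w on x. Here is a minimal sketch that verifies this identity on freshly simulated data. The seed, the coefficients 0.5, 0.4 and 0.3, and the <code>_sim</code> variable names are assumptions chosen for illustration, not necessarily the values behind the output above.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Illustrative simulation: the seed and coefficients are assumptions,
# not the values used to produce the regression output above.
set.seed(42)
n_sim &lt;- 1000
w_sim &lt;- rnorm(n_sim)                               # confounder
x_sim &lt;- 0.5 * w_sim + rnorm(n_sim)                 # w causes x
y_sim &lt;- 0.4 * x_sim + 0.3 * w_sim + rnorm(n_sim)   # x and w both cause y

b_short &lt;- coef(lm(y_sim ~ x_sim))                  # omits w
b_long  &lt;- coef(lm(y_sim ~ x_sim + w_sim))          # conditions on w
delta   &lt;- coef(lm(w_sim ~ x_sim))["x_sim"]         # slope of w on x

# Omitted variable bias identity, which holds exactly in-sample for OLS:
# short slope = long slope + (coefficient on w) * (slope of w on x)
implied_short &lt;- b_long["x_sim"] + b_long["w_sim"] * delta
all.equal(unname(b_short["x_sim"]), unname(implied_short))</code></pre></div>
<p>The last line compares the short-regression slope with the value implied by the identity. The gap between <code>b_short</code> and <code>b_long</code> is the bias that omitting w introduces.</p>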
<p>Note that the regression errors, also known as residuals, from the first model (m1) are correlated with w:</p>
<div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb5-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cor</span>(w,m1<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>residuals)</span>
<span id="cb5-2">[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2597859</span></span></code></pre></div>
<p>Now, recall above that I wrote that it’s wrong to say that OVB occurs when our omitted variable is correlated with both x and y. And yet in this first example, w is correlated with both x and y.</p>
<p>So why can’t we just say that OVB occurs when our omitted variable is correlated with both x and y? As the next example will show, correlation isn’t enough: w needs to <em>cause</em> both x and y. We can easily imagine a case in which w causes neither x nor y but we still see this kind of correlation, namely when x and y both cause w.</p>
<p>Let’s make this a little more concrete. Suppose we care about the effect of news media consumption (x) on voter turnout (y). One factor that some researchers think may cause both news media consumption and turnout is political interest (w). If we only measure media consumption and voter turnout, political interest is likely to confound our estimates.</p>
<p>But another school of thought from social psychology — along the lines of self-perception theory and <a href="https://en.wikipedia.org/wiki/Cognitive_dissonance">cognitive dissonance</a> — suggests that the causality could be reversed: Voting behavior might be mostly determined by other factors, and casting a ballot might prompt us to be <em>more</em> interested in political developments in the future. Similarly, watching the news might prompt us to become <em>more</em> interested in politics. Let’s suppose that second school of thought is right. If so, our simulated data will look like this:</p>
<div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb6-1">media_consumption_x <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(N)</span>
<span id="cb6-2">voter_turnout_y <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> .<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> media_consumption_x <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(N)</span>
<span id="cb6-3"></span>
<span id="cb6-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Political interest increases after consuming media and participating, and, </span></span>
<span id="cb6-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># in this hypothetical world, does *not* increase media consumption or participation</span></span>
<span id="cb6-6">political_interest_w <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.2</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> media_consumption_x <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> .<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> voter_turnout_y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(N)</span>
<span id="cb6-7"></span>
<span id="cb6-8">cormat <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cor</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.matrix</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(media_consumption_x, voter_turnout_y, political_interest_w)))</span>
<span id="cb6-9"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">round</span>(cormat, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb6-10"></span>
<span id="cb6-11">                     media_consumption_x voter_turnout_y political_interest_w</span>
<span id="cb6-12">media_consumption_x                 <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.00</span>            <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.11</span>                 <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.70</span></span>
<span id="cb6-13">voter_turnout_y                     <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.11</span>            <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.00</span>                 <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.46</span></span>
<span id="cb6-14">political_interest_w                <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.70</span>            <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.46</span>                 <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.00</span></span></code></pre></div>
<p>As you can see, all three variables are again correlated with each other. But this time, if we <em>only</em> include x (media consumption) and y (turnout) in the regression, we get an estimate close to the true effect of 0.1 that we built into the simulation:</p>
<div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb7-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summary</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(voter_turnout_y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> media_consumption_x))</span>
<span id="cb7-2"></span>
<span id="cb7-3">Call<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span></span>
<span id="cb7-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">formula =</span> voter_turnout_y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> media_consumption_x)</span>
<span id="cb7-5"></span>
<span id="cb7-6">Residuals<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span></span>
<span id="cb7-7">    Min      <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>Q  Median      <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>Q     Max </span>
<span id="cb7-8"><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.8460</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.6972</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0076</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.6702</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.3925</span> </span>
<span id="cb7-9"></span>
<span id="cb7-10">Coefficients<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span></span>
<span id="cb7-11">                    Estimate Std. Error t value <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Pr</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">|</span>t<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|</span>)    </span>
<span id="cb7-12">(Intercept)         <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01202</span>    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.03217</span>  <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.374</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.708839</span>    </span>
<span id="cb7-13">media_consumption_x  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.11719</span>    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.03321</span>   <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.529</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.000436</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">*</span></span>
<span id="cb7-14"><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">---</span></span>
<span id="cb7-15">Signif. codes<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>  <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> ‘<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">*</span>’ <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.001</span> ‘<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>’ <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span> ‘<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>’ <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span> ‘.’ <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span> ‘ ’ <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb7-16"></span>
<span id="cb7-17">Residual standard error<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.014</span> on <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">998</span> degrees of freedom</span>
<span id="cb7-18">Multiple R<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>squared<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01233</span>,   Adjusted R<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>squared<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01134</span> </span>
<span id="cb7-19">F<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>statistic<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">12.46</span> on <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> and <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">998</span> DF,  p<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>value<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0004359</span></span></code></pre></div>
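<p>A single simulated sample gives 0.117, close to the true effect of 0.1 set in the code above. To check that the remaining gap is sampling noise rather than bias, we can repeat the simulation many times and look at the average estimate. Here is a minimal sketch of that check. The seed, the number of replications, and the helper name <code>sim_slope</code> are assumptions introduced for illustration.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Illustrative sketch: seed, replication count, and helper name are assumptions.
set.seed(123)
n_obs &lt;- 1000
n_reps &lt;- 500

# Simulate one dataset with the same x and y equations as above and return the
# slope from regressing turnout on media consumption alone. Political interest
# is not simulated here because it does not enter this regression.
sim_slope &lt;- function(n) {
  x &lt;- rnorm(n)
  y &lt;- 0.1 * x + rnorm(n)
  unname(coef(lm(y ~ x))["x"])
}

slopes &lt;- replicate(n_reps, sim_slope(n_obs))
mean(slopes)  # should land close to the true effect of 0.1
sd(slopes)    # spread of the estimate across simulated samples</code></pre></div>
<p>If the average sits near 0.1, the regression that leaves out w is doing its job, and any single estimate such as 0.117 differs from 0.1 only by chance.</p>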
<p>What makes defining omitted variable bias based on correlation so dangerous is that if we now include w (political interest), we will get a different kind of bias — what’s called <a href="https://en.wikipedia.org/wiki/Collider_%28epidemiology%29">collider bias</a> or <a href="https://www.annualreviews.org/doi/10.1146/annurev-soc-071913-043455">endogenous selection bias</a>.</p>
<div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb8-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summary</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(voter_turnout_y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> media_consumption_x <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> political_interest_w))</span>
<span id="cb8-2"></span>
<span id="cb8-3">Call<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span></span>
<span id="cb8-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">formula =</span> voter_turnout_y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> media_consumption_x <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> political_interest_w)</span>
<span id="cb8-5"></span>
<span id="cb8-6">Residuals<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span></span>
<span id="cb8-7">    Min      <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>Q  Median      <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>Q     Max </span>
<span id="cb8-8"><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.1569</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5981</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0129</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5701</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.8356</span> </span>
<span id="cb8-9"></span>
<span id="cb8-10">Coefficients<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span></span>
<span id="cb8-11">                      Estimate Std. Error t value <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Pr</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">|</span>t<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|</span>)    </span>
<span id="cb8-12">(Intercept)           <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.003155</span>   <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.027098</span>   <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.116</span>    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.907</span>    </span>
<span id="cb8-13">media_consumption_x  <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.437084</span>   <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.039102</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">11.178</span>   <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2e-16</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">*</span></span>
<span id="cb8-14">political_interest_w  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.444571</span>   <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.021928</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">20.274</span>   <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2e-16</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">*</span></span>
<span id="cb8-15"><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">---</span></span>
<span id="cb8-16">Signif. codes<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>  <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> ‘<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">*</span>’ <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.001</span> ‘<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>’ <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span> ‘<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>’ <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span> ‘.’ <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span> ‘ ’ <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb8-17"></span>
<span id="cb8-18">Residual standard error<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.854</span> on <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">997</span> degrees of freedom</span>
<span id="cb8-19">Multiple R<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>squared<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.3007</span>,    Adjusted R<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>squared<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2993</span> </span>
<span id="cb8-20">F<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>statistic<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">214.3</span> on <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> and <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">997</span> DF,  p<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>value<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">&lt;</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.2e-16</span></span></code></pre></div>
<p><strong>Simpson’s paradox</strong></p>
<p>Simpson’s paradox often occurs in social science (and medicine, too) when you pool data instead of conditioning on group membership (i.e., adding group membership as a factor in your regression model).</p>
<p>Suppose that, all other things being equal, consuming media causes a slight shift in policy preferences toward the left. But, on average, Republicans consume more news than non-Republicans. And we know that generally Republicans have much more right-leaning preferences.</p>
<p>If we just measure media consumption and policy preferences without including an indicator for being Republican in the model, we’ll actually estimate that the effect goes in the direction <em>opposite</em> to the true causal effect.</p>
<div class="sourceCode" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb9-1">N <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span></span>
<span id="cb9-2"></span>
<span id="cb9-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Let's say that 40% of people in this population are Republicans</span></span>
<span id="cb9-4">republican <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rbinom</span>(N, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, .<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>)</span>
<span id="cb9-5"></span>
<span id="cb9-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># And they consume more media</span></span>
<span id="cb9-7">media_consumption <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> .<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">75</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> republican <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(N)</span>
<span id="cb9-8"></span>
<span id="cb9-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Consuming more media causes a slight leftward shift in policy</span></span>
<span id="cb9-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># preferences, and Republicans have more right-leaning preferences</span></span>
<span id="cb9-11">policy_prefs <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>.<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> media_consumption <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> republican <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(N)</span>
<span id="cb9-12"></span>
<span id="cb9-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># for easier plotting later</span></span>
<span id="cb9-14">df <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(media_consumption, policy_prefs, republican)</span>
<span id="cb9-15">df<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>republican <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">factor</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"non-republican"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"republican"</span>)[df<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>republican <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb9-16"></span>
<span id="cb9-17"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># If we don't condition on being Republican, we'll actually estimate</span></span>
<span id="cb9-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># that the effect goes in the *opposite* direction</span></span>
<span id="cb9-19"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summary</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(policy_prefs <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> media_consumption))</span>
<span id="cb9-20"></span>
<span id="cb9-21"></span>
<span id="cb9-22">Call<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span></span>
<span id="cb9-23"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">formula =</span> policy_prefs <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> media_consumption)</span>
<span id="cb9-24"></span>
<span id="cb9-25">Residuals<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span></span>
<span id="cb9-26">    Min      <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>Q  Median      <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>Q     Max </span>
<span id="cb9-27"><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.6108</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.9559</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0198</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.9257</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.9537</span> </span>
<span id="cb9-28"></span>
<span id="cb9-29">Coefficients<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span></span>
<span id="cb9-30">                  Estimate Std. Error t value <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Pr</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">|</span>t<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|</span>)    </span>
<span id="cb9-31">(Intercept)        <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.68923</span>    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.04323</span>   <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">15.94</span>  <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2e-16</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">*</span></span>
<span id="cb9-32">media_consumption  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.15269</span>    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.03966</span>    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.85</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.000126</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">*</span></span>
<span id="cb9-33"><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">---</span></span>
<span id="cb9-34">Signif. codes<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>  <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> ‘<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">*</span>’ <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.001</span> ‘<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>’ <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span> ‘<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>’ <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span> ‘.’ <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span> ‘ ’ <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb9-35"></span>
<span id="cb9-36">Residual standard error<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.317</span> on <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">998</span> degrees of freedom</span>
<span id="cb9-37">Multiple R<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>squared<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01463</span>,   Adjusted R<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>squared<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01365</span> </span>
<span id="cb9-38">F<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>statistic<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">14.82</span> on <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> and <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">998</span> DF,  p<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>value<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0001257</span></span>
<span id="cb9-39"></span>
<span id="cb9-40"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Naive plot</span></span>
<span id="cb9-41"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">qplot</span>(media_consumption, policy_prefs) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb9-42">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_smooth</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">method=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'lm'</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb9-43">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb9-44">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggtitle</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Naive estimate (Simpson's Paradox)"</span>) </span></code></pre></div>
<p>The estimate goes in the opposite direction of the true effect! Here’s what the plot looks like:</p>
<p><img src="https://cdn-images-1.medium.com/max/1600/1*2gxiWJN7ElkYAO-wuvakzA.png" class="img-fluid"></p>
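<p>To see why the sign flips, it can help to work out the slope the naive regression should recover on average. A rough sketch using the standard omitted-variable bias formula and the parameter values from the simulation above (the variable names here are introduced only for illustration):</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Population quantities implied by the simulation above
p     &lt;- 0.4    # share of Republicans
a     &lt;- 0.75   # extra media consumption among Republicans
beta  &lt;- -0.2   # true effect of media consumption on policy preferences
gamma &lt;- 2      # shift in policy preferences for Republicans

var_rep   &lt;- p * (1 - p)         # variance of the 0/1 Republican indicator
cov_rep_m &lt;- a * var_rep         # Cov(republican, media_consumption)
var_m     &lt;- a^2 * var_rep + 1   # Var(media_consumption); rnorm() noise has variance 1

# Omitted-variable bias: naive slope = beta + gamma * Cov / Var
naive_slope &lt;- beta + gamma * cov_rep_m / var_m
naive_slope</code></pre></div>
<p>Under these assumptions the expected naive slope is about 0.12, positive even though the true effect is -0.2, and the fitted value of 0.153 above is within about one standard error of it.</p>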
<p>To resolve this paradox, we need to add a factor to the model that indicates whether or not a respondent is a Republican. Adding that factor lets us estimate <em>separate</em> intercepts for Republicans and non-Republicans while sharing a single slope for media consumption. Note that this is <em>not</em> like estimating an interaction term, where two explanatory variables are multiplied together. It’s not that the slopes are <em>different</em>; the two groups simply sit at different baseline levels, and the model has to account for that.</p>
<div class="sourceCode" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb10-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Condition on being a Republican to get the right estimates</span></span>
<span id="cb10-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summary</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(policy_prefs <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> media_consumption <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> republican))</span>
<span id="cb10-3"></span>
<span id="cb10-4">Call<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span></span>
<span id="cb10-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">formula =</span> policy_prefs <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> media_consumption <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> republican)</span>
<span id="cb10-6"></span>
<span id="cb10-7">Residuals<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span></span>
<span id="cb10-8">    Min      <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>Q  Median      <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>Q     Max </span>
<span id="cb10-9"><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.5518</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.6678</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0186</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.6562</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.3009</span> </span>
<span id="cb10-10"></span>
<span id="cb10-11">Coefficients<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span></span>
<span id="cb10-12">                  Estimate Std. Error t value <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Pr</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">|</span>t<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|</span>)    </span>
<span id="cb10-13">(Intercept)        <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05335</span>    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.03904</span>   <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.366</span>    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.172</span>    </span>
<span id="cb10-14">media_consumption <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.13615</span>    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.03111</span>  <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">4.376</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.34e-05</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">*</span></span>
<span id="cb10-15">republican         <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.93049</span>    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.06758</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">28.565</span>  <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2e-16</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">*</span></span>
<span id="cb10-16"><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">---</span></span>
<span id="cb10-17">Signif. codes<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>  <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> ‘<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">*</span>’ <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.001</span> ‘<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>’ <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span> ‘<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>’ <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span> ‘.’ <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span> ‘ ’ <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb10-18"></span>
<span id="cb10-19">Residual standard error<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.9774</span> on <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">997</span> degrees of freedom</span>
<span id="cb10-20">Multiple R<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>squared<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.4581</span>,    Adjusted R<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>squared<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.457</span> </span>
<span id="cb10-21">F<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>statistic<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">421.4</span> on <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> and <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">997</span> DF,  p<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>value<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">&lt;</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.2e-16</span></span>
<span id="cb10-22"></span>
<span id="cb10-23"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Conditioning on being Republican</span></span>
<span id="cb10-24"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">qplot</span>(media_consumption, policy_prefs, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data=</span>df, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour =</span> republican) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb10-25">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_color_manual</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">values =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"blue"</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"red"</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb10-26">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_smooth</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">method=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'lm'</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb10-27">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb10-28">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggtitle</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Conditioning on being a Republican (Simpson's Paradox)"</span>)</span></code></pre></div>
<p>Here’s what the plot looks like:</p>
<p><img src="https://cdn-images-1.medium.com/max/1600/1*rwNGDFcsSSMiGsJoc6ZxsQ.png" class="img-fluid"></p>
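<p>One way to check that a common slope is adequate is to also fit a model with an interaction term and compare the two. A minimal sketch, which re-simulates the data with an arbitrary seed so that it runs on its own (the names <code>fit_additive</code> and <code>fit_interaction</code> are introduced here for illustration):</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r">set.seed(1)  # arbitrary seed, for reproducibility only
N &lt;- 1000
republican &lt;- rbinom(N, 1, .4)
media_consumption &lt;- .75 * republican + rnorm(N)
policy_prefs &lt;- -.2 * media_consumption + 2 * republican + rnorm(N)

# Common slope, separate intercepts
fit_additive &lt;- lm(policy_prefs ~ media_consumption + republican)
# Separate slopes and separate intercepts
fit_interaction &lt;- lm(policy_prefs ~ media_consumption * republican)

# The interaction coefficient estimates how much the slope differs by group;
# the data-generating process sets that difference to zero
coef(summary(fit_interaction))["media_consumption:republican", ]

# F-test of the additive model against the interaction model
anova(fit_additive, fit_interaction)</code></pre></div>
<p>Because the simulated slopes are identical by construction, the interaction term should usually be indistinguishable from zero here. In real data that is an empirical question.</p>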
<p><strong>Correlated errors</strong></p>
<p>Another cardinal sin, and one that we should worry a lot about because it often arises from social desirability bias in survey responses, is correlated errors. This example is inspired by <a href="https://www.nowpublishers.com/article/Details/QJPS-6005">Vavreck (2007)</a>.</p>
<p>Here, self-reported turnout and self-reported media consumption are each the sum of the true value (true turnout and true consumption, respectively) and a shared social desirability bias:</p>
<div class="sourceCode" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb11-1">N <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span></span>
<span id="cb11-2"></span>
<span id="cb11-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># The "Truth"</span></span>
<span id="cb11-4">true_media_consumption <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(N)</span>
<span id="cb11-5">true_vote <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> .<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> true_media_consumption <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(N)</span>
<span id="cb11-6"></span>
<span id="cb11-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># social desirability bias</span></span>
<span id="cb11-8">social_desirability <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(N)</span>
<span id="cb11-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># What we actually observe from self-reports:</span></span>
<span id="cb11-10">self_report_media_consumption <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> true_media_consumption <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> social_desirability</span>
<span id="cb11-11">self_report_vote <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> true_vote <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> social_desirability</span></code></pre></div>
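<p>The shared error term adds its variance to both the covariance between the two self-reports and the variance of the regressor, which pulls the estimated slope away from the true one. A back-of-the-envelope sketch, assuming the vote equation depends on true media consumption with a coefficient of .1 and every noise term has variance 1 (the seed, the larger sample size, and the names <code>sr_media</code> and <code>sr_vote</code> are choices made for this illustration):</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r">set.seed(1)  # arbitrary seed, for reproducibility only
N &lt;- 1e5     # large N so the estimate sits close to its expectation

true_media &lt;- rnorm(N)
true_vote  &lt;- .1 * true_media + rnorm(N)  # assumed true effect of .1
bias       &lt;- rnorm(N)                    # shared social desirability error

sr_media &lt;- true_media + bias  # self-reported media consumption
sr_vote  &lt;- true_vote + bias   # self-reported turnout

# Expected slope from self-reports:
# Cov(sr_vote, sr_media) / Var(sr_media) = (.1 * 1 + 1) / (1 + 1)
expected_slope &lt;- (.1 * 1 + 1) / (1 + 1)
expected_slope

# Simulated slope from self-reports, for comparison
simulated_slope &lt;- unname(coef(lm(sr_vote ~ sr_media))[2])
simulated_slope</code></pre></div>
<p>Under these assumptions the expected slope is 0.55, more than five times the assumed true effect of .1.</p>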
<p>Let’s compare the effect sizes estimated from the self-reported data and from the “true” data:</p>
<div class="sourceCode" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb12-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Self reports</span></span>
<span id="cb12-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summary</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(self_report_vote <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> self_report_media_consumption))</span>
<span id="cb12-3"></span>
<span id="cb12-4">Call<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span></span>
<span id="cb12-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">formula =</span> self_report_vote <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> self_report_media_consumption)</span>
<span id="cb12-6"></span>
<span id="cb12-7">Residuals<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span></span>
<span id="cb12-8">    Min      <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>Q  Median      <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>Q     Max </span>
<span id="cb12-9"><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.9604</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.7766</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0142</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.8465</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">4.1811</span> </span>
<span id="cb12-10"></span>
<span id="cb12-11">Coefficients<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span></span>
<span id="cb12-12">                              Estimate Std. Error t value <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Pr</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">|</span>t<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|</span>)    </span>
<span id="cb12-13">(Intercept)                    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.02020</span>    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.03951</span>   <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.511</span>    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.609</span>    </span>
<span id="cb12-14">self_report_media_consumption  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.54605</span>    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.02716</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">20.102</span>   <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2e-16</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">*</span></span>
<span id="cb12-15"><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">---</span></span>
<span id="cb12-16">Signif. codes<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>  <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> ‘<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">*</span>’ <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.001</span> ‘<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>’ <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span> ‘<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>’ <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span> ‘.’ <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span> ‘ ’ <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb12-17"></span>
<span id="cb12-18">Residual standard error<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.248</span> on <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">998</span> degrees of freedom</span>
<span id="cb12-19">Multiple R<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>squared<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2882</span>,    Adjusted R<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>squared<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2875</span> </span>
<span id="cb12-20">F<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>statistic<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">404.1</span> on <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> and <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">998</span> DF,  p<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>value<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">&lt;</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.2e-16</span></span>
<span id="cb12-21"></span>
<span id="cb12-22"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># "Truth"</span></span>
<span id="cb12-23"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summary</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(true_vote <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> true_media_consumption))</span>
<span id="cb12-24"></span>
<span id="cb12-25">Call<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span></span>
<span id="cb12-26"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">formula =</span> true_vote <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> true_media_consumption)</span>
<span id="cb12-27"></span>
<span id="cb12-28">Residuals<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span></span>
<span id="cb12-29">    Min      <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>Q  Median      <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>Q     Max </span>
<span id="cb12-30"><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.5814</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.6677</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0077</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.6829</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.4799</span> </span>
<span id="cb12-31"></span>
<span id="cb12-32">Coefficients<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span></span>
<span id="cb12-33">                       Estimate Std. Error t value <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Pr</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">|</span>t<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|</span>)</span>
<span id="cb12-34">(Intercept)             <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01372</span>    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.03217</span>   <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.426</span>    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.670</span></span>
<span id="cb12-35">true_media_consumption  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01313</span>    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.03245</span>   <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.404</span>    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.686</span></span>
<span id="cb12-36"></span>
<span id="cb12-37">Residual standard error<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.017</span> on <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">998</span> degrees of freedom</span>
<span id="cb12-38">Multiple R<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>squared<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0001639</span>, Adjusted R<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>squared<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>  <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.000838</span> </span>
<span id="cb12-39">F<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>statistic<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1636</span> on <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> and <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">998</span> DF,  p<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>value<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.686</span></span></code></pre></div>
<p>The self-reported data overstate the effect: the estimated slope is 0.55 (p &lt; 0.001) in the self-reports, versus 0.01 (p = 0.69) in the “true” data. That is a very dangerous problem. How could we fix this? Well, one way is to actually measure social desirability and include it in the model:</p>
<div class="sourceCode" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb13-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summary</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(self_report_vote <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> self_report_media_consumption <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> social_desirability))</span>
<span id="cb13-2"></span>
<span id="cb13-3">Call<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span></span>
<span id="cb13-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">formula =</span> self_report_vote <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> self_report_media_consumption <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb13-5">    social_desirability)</span>
<span id="cb13-6"></span>
<span id="cb13-7">Residuals<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span></span>
<span id="cb13-8">    Min      <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>Q  Median      <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>Q     Max </span>
<span id="cb13-9"><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.6042</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.6774</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0127</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.6899</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.4470</span> </span>
<span id="cb13-10"></span>
<span id="cb13-11">Coefficients<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span></span>
<span id="cb13-12">                              Estimate Std. Error t value <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Pr</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">|</span>t<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|</span>)    </span>
<span id="cb13-13">(Intercept)                    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01208</span>    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.03220</span>   <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.375</span>    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.708</span>    </span>
<span id="cb13-14">self_report_media_consumption  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01220</span>    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.03246</span>   <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.376</span>    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.707</span>    </span>
<span id="cb13-15">social_desirability            <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.02245</span>    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.04547</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">22.487</span>   <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2e-16</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">*</span></span>
<span id="cb13-16"><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">---</span></span>
<span id="cb13-17">Signif. codes<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>  <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> ‘<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">*</span>’ <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.001</span> ‘<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>’ <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span> ‘<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>’ <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span> ‘.’ <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span> ‘ ’ <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb13-18"></span>
<span id="cb13-19">Residual standard error<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.017</span> on <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">997</span> degrees of freedom</span>
<span id="cb13-20">Multiple R<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>squared<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5277</span>,    Adjusted R<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>squared<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5268</span> </span>
<span id="cb13-21">F<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>statistic<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>   <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">557</span> on <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> and <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">997</span> DF,  p<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>value<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">&lt;</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.2e-16</span></span></code></pre></div>
<p>Note that while most people think about social desirability as a measurement-error problem, it is essentially the same problem as omitted variable bias, as described above.</p>
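<p>A rough way to see where the inflated slope comes from, as a sketch under an assumption not restated here (that the true variables and <code>social_desirability</code> are independent draws with variance 1): the shared error is the only source of covariance between the two self-reports, so the expected bivariate slope is that covariance divided by the variance of the self-reported predictor.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Assumption: each component is an independent draw with variance 1.
var_true_media &lt;- 1
var_social_desirability &lt;- 1

# The shared social desirability term is the only thing the two
# self-reports have in common, so it is their entire covariance.
cov_self_reports &lt;- var_social_desirability

# Variance of the self-reported predictor: true signal plus shared error.
var_self_report_media &lt;- var_true_media + var_social_desirability

# Expected slope of self_report_vote ~ self_report_media_consumption.
expected_slope &lt;- cov_self_reports / var_self_report_media
expected_slope</code></pre></div>
<p>Under these assumptions the expected slope is 0.5 even though the true effect is zero, which is in line with the 0.55 estimated from the self-reports.</p>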
<p>It’s important to remember that omitted variable bias and correlated errors are just two potential problems with regression analysis. Regression models are also not immune to issues associated with low levels of <a href="https://en.wikipedia.org/wiki/Power_%28statistics%29">statistical power</a>, the failure to account for the influence of extreme values, and <a href="https://en.wikipedia.org/wiki/Heteroscedasticity">heteroskedasticity</a>, among others. But by simulating the data-generating process, researchers can get a good sense of some of the more common ways in which statistical models might depart from reality.</p>



 ]]></description>
  <guid>https://solomonmg.github.io/blog/how-to-break-regression/</guid>
  <pubDate>Wed, 13 Jun 2018 00:00:00 GMT</pubDate>
  <media:content url="https://solomonmg.github.io/blog/how-to-break-regression/featured.png" medium="image" type="image/png" height="81" width="144"/>
</item>
<item>
  <title>Replication of ‘Bias in the Flesh’</title>
  <dc:creator>Sol Messing</dc:creator>
  <link>https://solomonmg.github.io/blog/replication-of-bias-in-the-flesh/</link>
  <description><![CDATA[ 





<p>This post presents a replication of&nbsp;<a href="../../pdf/HSVmetricsCampaignsDarknessPOQFINAL.pdf">Messing et al</a>. (2016, study 2), which showed that exposure to darker images of Barack Obama increased stereotype activation, as indicated by the tendency to finish incomplete word prompts---such as “W E L _ _ _ _”---in stereotype-consistent ways (“WELFARE”).</p>
<p>Overall, the replication shows that darker images of even&nbsp;<a href="http://www.sciencedirect.com/science/article/pii/S0022103110002635">counter-stereotypical exemplars like Barack Obama</a>&nbsp;can increase stereotype activation, but that the strength of the effect is weaker than conveyed in the original study. &nbsp;A reanalysis of the original study conducted in the course of this replication effort unearthed a number of problems that, when corrected, yield estimates of the effect that are consistent with those documented in the replication. This reanalysis also follows.</p>
<p>I'm posting this to</p>
<ol type="1">
<li>disseminate a <a href="../../pdf/HSVmetricsCampaignsDarknessPOQFINAL.pdf">corrected version of the original study</a>;</li>
<li>show how I found those problems with the original study in the course of conducting this replication;</li>
<li>circulate these generally confirmatory findings, along with a pooled analysis revealing a stronger effect among conservatives; and</li>
<li>provide a demonstration of how replication almost always enhances our knowledge about the original research, which I hope may encourage others to invest the time and money in such efforts.</li>
</ol>
<p>First some context.</p>
<p>The original study that formed the basis of the manuscript shows that more negative campaign ads in 2008 were also more likely to contain darker images of President Obama. In 2009 when I started this work, I was most proud of the method to collect data on skin complexion outlined in study 1.&nbsp; I included another study, what's now study 3, which shows that 2012 ANES survey-takers were more likely to respond negatively to Chinese characters after being presented with darker images of Obama (this is called the&nbsp;<a href="http://onlinelibrary.wiley.com/doi/10.1111/spc3.12148/abstract">Affect Misattribution Procedure (AMP)</a>). But the AMP was not a true experiment and a reviewer was concerned that Study 3 did not provide sufficiently rigorous, causal evidence that darker images alone can cause negative affect.&nbsp; So I conducted an experiment that would establish a causal link between darker images of Obama and something I thought was even more important---stereotype activation. There were strong reasons to expect this effect based on past lab studies showing links between&nbsp;<a href="https://ase.tufts.edu/psychology/tuscLab/documents/pubsCognitive2002.pdf">darker skin</a> and <a href="https://www.ncbi.nlm.nih.gov/pubmed/12088132">negative stereotypes about Blacks</a>, and past observational studies showing far more <a href="https://wcfia.harvard.edu/files/wcfia/files/2007_27_hochschild.pdf">negative socioeconomic outcomes across the board among darker versus lighter skinned Black Americans</a>. We found an effect and published the three studies.</p>
<p>This replication effort was prompted by a post-publication reanalysis and <a href="http://www.ljzigerell.com/?p=3622">critique</a>, which raised questions about potential weaknesses in the original analysis. My aim in replicating the study was to bring new data to the discussion and make sure we hadn’t polluted the literature with a false discovery.</p>
<p>The main objection was the way we formed our stereotype consistency index. The items assessing stereotype consistency comprised 11 words with missing blank spaces (e.g., L A _ _). Each fragment had as one possible solution a stereotype-related completion. The complete list follows: L A _ _ (LAZY): C R _ _ _ (CRIME); _ _ O R (POOR); R _ _ (RAP); WEL _ _ _ _ (WELFARE); _ _ C E (RACE); D _ _ _ Y (DIRTY); B R _ _ _ _ _ (BROTHER); _ _ A C K (BLACK); M I _ _ _ _ _ _ (MINORITY); D R _ _ (DRUG).</p>
<p>The author pointed out that there were many potential ways to analyze the original data---he claimed over 16 thousand. Yet very few of these are consistent with generally accepted research practices. We've known, arguably since the <a href="http://www.jstor.org/stable/2333051?origin=crossref&amp;seq=1#page_scan_tab_contents">16th century</a>, that combining several measures reduces measurement error and hence variance in estimation. This is particularly important in social science, and especially for this particular study---it would be unwise to attempt to use a single word completion or an arbitrary subset thereof to measure a complex, noisy construct like stereotype activation as measured via a word completion game. Rather, taking the <a href="https://web.stanford.edu/~jrodden/issues_apsr.pdf">average or constructing an index based on clustering several measures</a>&nbsp;should be expected to result in far less measurement error, which is what we did.</p>
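<p>The measurement-error argument can be illustrated with a small simulation. This is a sketch, not the analysis from the paper: the latent trait, the noise level, and the continuous item scores are all assumptions made for illustration (the real items are binary word completions).</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r">set.seed(1)
n_respondents &lt;- 2000
n_items &lt;- 11

# Hypothetical latent construct we are trying to measure.
latent &lt;- rnorm(n_respondents)

# Each item is the latent construct plus its own independent noise.
items &lt;- sapply(seq_len(n_items), function(j) {
  latent + rnorm(n_respondents, sd = 2)
})

# How well does one item track the construct, versus the average of all items?
single_item_cor &lt;- cor(items[, 1], latent)
index_cor &lt;- cor(rowMeans(items), latent)
c(single_item = single_item_cor, index = index_cor)</code></pre></div>
<p>Because the item-specific noise partly cancels when items are averaged, the index should track the latent construct more closely than any single item does.</p>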
<p>Still, I am sympathetic to concerns about the <a href="http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf">garden of forking paths</a>, which is part of the motivation for this replication.</p>
<p>In the original study, I formed this index based on what I judged to be the most unambiguously negative word-completions (lazy, dirty, poor), consistent with past work suggesting that darker complexion activates the <a href="https://ase.tufts.edu/psychology/tuscLab/documents/pubsCognitive2002.pdf">most&nbsp;</a><a href="https://www.ncbi.nlm.nih.gov/pubmed/12088132">negative</a> stereotypes about Blacks. I calculated that these were also the three variables that maximized the intraclass correlation (ICC). As a robustness check, I also computed a measure that maximized alpha reliability (AR). This measure contained more items, and also seemed to include stereotype-consistent word completions that were on balance negative: lazy, dirty, poor, crime, black, and welfare. I should have reported, but did not report, results based on a simple average of these items, which were not conclusive.</p>
<p><a href="http://www.ljzigerell.com/?p=3622">The critical reanalysis</a>&nbsp;cited above shows a handful of statistically significant patterns that are inconsistent with the expectations in the original study, which is suggestive evidence that it's quite possible to find signal in noise if you're analyzing arbitrary sets of variables with the originally collected data. However, as shown in the much larger replication sample below, none of these patterns replicate.</p>
<p>The critique also noted that we did not include an analysis of several trailing questions we included on the original survey. The concern is the <a href="http://science.sciencemag.org/content/345/6203/1502">file drawer problem</a>: the incentives against reporting null results, and the frequent failure to report them, obscure knowledge and are bad for the scientific enterprise.</p>
<p>I included those measures based on <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1643225">past work</a> using the same images as stimuli, which found that darker images prompted more negative evaluations of Obama among people with more negative associations with Blacks, as measured using the Implicit Associations Test (IAT).&nbsp;But testing a specification that conditioned on our main outcome of interest---stereotype-consistent word completions---would mean conditioning on a post-treatment variable, particularly worrisome since we saw an effect on stereotype activation in the study.</p>
<p>Below, I pool the data and report another specification that does not require us to condition on post-treatment variables. It takes advantage of the fact that conservatives had significantly higher levels of stereotype activation (which was documented in the original study), and shows that the effect is in fact stronger among this subgroup, providing preliminary evidence in favor of this hypothesis.</p>
<p>The remainder of this post will present my own reanalysis of the original data, the replication, and finally some additional analysis of the data now possible with the larger, pooled data set.</p>
<p><strong>Re-analysis of original data</strong></p>
<p>In the process of collecting data for the replication studies, I used the same interface, simply appending the new data as additional respondents completed the survey experiment. When I geo-coded the IP address data in the full data set, I found a discrepancy between the cases I originally geo-coded as U.S. cases, and the cases that now resolved to U.S. locations in the complete data set. &nbsp;Many of these respondents appeared in sequence, suggesting they may have been skipped, perhaps due to issues related to connectivity to the geo-location server I used.</p>
<p>This prompted me to conduct a full re-analysis of the data, which yields smaller estimates of stereotype activation. First, re-estimating the index yielded different items---'black' in place of 'dirty' for the ICC measure and 'race' in place of 'welfare', 'crime', and 'dirty' in the AR measure. This is due in part to the way I computed the original indices and in part to correcting the geo-coding issue. In the original study, I computed the index of variables that maximized alpha and ICC by hand because the epiCalc::alphaBest function (now epiDisplay::alphaBest) returns neither results nor an error message for these data. For the reanalysis, I wrote a function that selects the variables to include in the index via successive removal of items. The overall alpha is actually slightly lower in the new AR measure, while the new ICC measure has a slightly higher correlation coefficient.</p>
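<p>For readers who want a feel for what selection by successive removal can look like, here is a minimal sketch for the alpha case only. It is not the function used in the reanalysis, which is not shown in this post: the names <code>cronbach_alpha</code> and <code>drop_items_for_alpha</code>, the stopping rule, and the simulated items are all assumptions made for illustration.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Cronbach's alpha for a matrix with one column per item.
cronbach_alpha &lt;- function(items) {
  k &lt;- ncol(items)
  item_vars &lt;- apply(items, 2, var)
  total_var &lt;- var(rowSums(items))
  (k / (k - 1)) * (1 - sum(item_vars) / total_var)
}

# Hypothetical helper: repeatedly drop the single item whose removal raises
# alpha the most, and stop when no removal improves alpha.
drop_items_for_alpha &lt;- function(items, min_items = 2) {
  current &lt;- items
  repeat {
    if (ncol(current) &lt;= min_items) break
    alpha_now &lt;- cronbach_alpha(current)
    alpha_without &lt;- sapply(seq_len(ncol(current)), function(j) {
      cronbach_alpha(current[, -j, drop = FALSE])
    })
    if (max(alpha_without) &lt;= alpha_now) break
    current &lt;- current[, -which.max(alpha_without), drop = FALSE]
  }
  list(items = colnames(current), alpha = cronbach_alpha(current))
}

# Simulated example: four items that share a latent trait, two that are pure noise.
set.seed(2)
n &lt;- 500
latent &lt;- rnorm(n)
related &lt;- sapply(1:4, function(j) latent + rnorm(n))
noise &lt;- sapply(1:2, function(j) rnorm(n))
items &lt;- cbind(related, noise)
colnames(items) &lt;- c(paste0("related_", 1:4), paste0("noise_", 1:2))

result &lt;- drop_items_for_alpha(items)
result</code></pre></div>
<p>A greedy search like this is not guaranteed to find the subset with the highest possible alpha, and selecting items on the same data used to test the hypothesis is exactly the kind of analytic flexibility that motivates checking the result against new data.</p>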
<p>For the sake of transparency, I first report results based on the original items included in the index as reported in Messing et al.&nbsp;2015 using the updated data, then report the new ICC and AR measures.</p>
<p>Using the original indices with the errantly removed cases included, the ICC measure yields a revised estimate of a 20% increase in stereotype activation, rather than the 36% increase in stereotype-consistent word completions originally reported (M_Light = 0.33, M_Dark = 0.41, T(859.0) = 2.08, P = 0.038, two-sided). For the AR measure, instead of a 13% increase (M_Light = 0.97, M_Dark = 1.11, T(626.72) = 1.77, P = 0.078, two-sided), the revised estimate is an 8% increase (M_Light = 0.98, M_Dark = 1.06, T(850.9) = 1.12, P = 0.265, two-sided).</p>
<p>Re-estimating the indices when including all U.S. cases translates to less conclusive findings---a revised estimate of an 8% increase in stereotype activation in the original study (M_Light = 0.79, M_Dark = 0.86, T(850.7) = 1.27, P = 0.203, two-sided) using the ICC measure, and an 8% increase (M_Light = 0.87, M_Dark = 0.91, T(839.1) = 0.77, P = 0.439, two-sided) using the AR measure.</p>
<p>A slightly smaller effect was also observed when examining differences between conservatives and other participants. Correcting the geo-coding error and updating the indices reduced the estimate of stereotype activation for conservatives. Instead of a 53% increase, the original ICC measure yields a 29% increase (M_Other = 0.35, M_Conservative = 0.49, T(205.9) = 2.49, P = 0.013, two-sided). &nbsp;The new ICC measure yields an 18% increase (M_Other = 0.80, M_Conservative = 0.98, T(207.9) = 2.41, P = 0.017, two-sided). &nbsp;For the AR measure, instead of a 29% increase, this meant an 18% increase using either measure (original: M_Other = 0.99, M_Conservative = 1.18, T(210.4) = 2.11, P = 0.036, two-sided) (new: M_Other = 0.86, M_Conservative = 1.05, T(214.3) = 2.47, P = 0.014, two-sided).</p>
<p><strong>The replication</strong></p>
<p>I conducted one exact replication and one very close replication with slightly different images, which I pooled for a total of 3,151 respondents, substantially more than the 630 included in the original writeup. &nbsp;This gives me more statistical power and more precise estimates of the effect in question. (I provide results for each design separately, one of which appears underpowered, at the end of this post.)</p>
<p>To be clear, I did not pre-register this replication. However, where the original study did not specify exactly how the analysis should be conducted, I've tried to err on the side of exhaustive reporting when analyzing the new data. Because this replication presents the same analysis conducted in the original study, the p-values provide highly informative, if not conclusive, evidence regarding the nature of the effect.</p>
<p><img src="https://solomonmg.github.io/img/cb54a931e7c01f3bfd03caa2899518b447ab470e.png" class="img-fluid"></p>
<p>The average reported age was 36; 52% of participants identified as female; 84% identified as White, 8% as Black, 5% as Hispanic, and 3% as Other. 52% identified as liberal, 27% as moderate, and 22% as conservative.</p>
<p>Recomputing the ICC index yielded the following items: black, poor, drug. Recomputing the AR index yielded: lazy, black, poor, welfare, crime, drug, which is close to the original study.</p>
<p>In the replication data, the ICC yields a 5% increase in stereotype activation (M_Light = 0.90, M_Dark = 0.95, T(3142.8) = 1.56, P = 0.119, two-sided). Similarly, the alpha measure yields a 5% increase (M_Light = 1.04, M_Dark = 1.09, T(3145.8) = 1.70, P = 0.089, two-sided).</p>
<p>The original study isn't completely clear on the question of whether a replication should report on the recomputed ICC and AR measures, or the exact same items as in the original study, so it's worth reporting those as well. The <em>original</em> ICC measure yields a 3% increase in stereotype activation (M_Light = 0.36, M_Dark = 0.37, T(3148.9) = 0.51, P = 0.611, two-sided). Using the <em>original</em> AR measure yields a 6% increase (M_Light = 1.01, M_Dark = 1.07, T(3147.9) = 1.91, P = 0.057, two-sided).</p>
<p>Finally, it's worth reporting that an index that simply uses all stereotype-consistent items in the replication reveals a 5% increase in stereotype activation (M_Light = 1.30, M_Dark = 1.37, T(3147.7) = 2.05, P = 0.040, two-sided).</p>
<p>A pooled analysis, after normalizing the ICC and AR measures, yields similar results:</p>
<p><img src="https://solomonmg.github.io/img/a0415aa41faee2db8452e14712ae9505f7a96f0a.png" class="img-fluid"></p>
<pre><code>=================================================
                      ICC     Alpha      ALL
-------------------------------------------------
  (Intercept)       -0.041   -0.028    1.292***
                    (0.034)  (0.034)  (0.036)
  cond: Dark/Light   0.067*   0.058    0.067*
                    (0.032)  (0.032)  (0.034)
  study              0.006   -0.001    0.008
                    (0.023)  (0.023)  (0.025)
-------------------------------------------------
  R-squared             0.0      0.0       0.0
  N                  4012     4012      4012
=================================================</code></pre>
<p>I also replicated this study with a different, lesser-known Black politician (Jesse White). However, a manipulation check revealed that only 36% of respondents said the candidate was Black in the “light” condition, compared to 83% in the darker condition, suggesting that any analysis would be severely confounded by perceived race of the target politician. (I did not ask this question in the Barack Obama studies).</p>
<p><strong>Additional analysis</strong></p>
<p>The superior power afforded by pooling all three studies may allow the exploration of treatment heterogeneity. <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1643225">Past work</a> suggests the possibility that darker images might cause people inclined toward more stereotype-consistent responses to evaluate politicians more negatively. However, this analysis would&nbsp;<a href="http://andrewgelman.com/2017/09/12/conditioning-post-treatment-variables-can-ruin-experiment/">condition on post-treatment variables</a>, which in this case is particularly concerning since the treatment affects stereotype activation according to the original study and replication above. &nbsp;As an alternative, I consider a specification that uses conservative identification instead, which is a strong predictor of stereotype activation (as shown in the original study), but shouldn’t be affected by the treatment. It reveals evidence for the predicted interactions, suggesting that when conservatives are exposed to darker rather than lighter images of Obama, they have slightly “colder” feelings toward the former president (P = 0.039), perceive him to be less competent (P = 0.061), and less trustworthy (P = 0.083).</p>
<pre><code>=================================================================
                             obama_therm  competence    trust
-----------------------------------------------------------------
  (Intercept)                 69.446***    4.213***    3.867***
                              (0.705)     (0.029)     (0.031)
  cond: Dark/Light             0.451       0.024       0.045
                              (0.989)     (0.041)     (0.044)
  iscons                     -40.497***   -1.574***   -1.625***
                              (1.565)     (0.064)     (0.069)
  cond: Dark/Light x iscons   -4.462*     -0.167      -0.165
                              (2.160)     (0.089)     (0.095)
-----------------------------------------------------------------
  R-squared                        0.3         0.3         0.2
  N                             3932        3928        3926
=================================================================</code></pre>
<p>A plot of the model predictions for the thermometer ratings suggests that the effect is concentrated among conservatives.</p>
<p><img src="https://solomonmg.github.io/img/f82207e502f7cdd9c8a6b178efa7ca801d1b8e4c.png" class="img-fluid"></p>
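<p>To read the interaction table above in terms of predicted ratings, the fitted values for each of the four cells can be rebuilt directly from the reported coefficients. The short Python sketch below does this for the <code>obama_therm</code> column. It assumes the condition variable is coded 1 for the darker image and 0 for the lighter one, and that <code>iscons</code> is 1 for conservatives; the function names are hypothetical.</p>
<pre><code># Coefficients copied from the obama_therm column of the table above.
COEFS = {
    "intercept": 69.446,
    "dark": 0.451,
    "iscons": -40.497,
    "dark_x_iscons": -4.462,
}


def predicted_therm(dark, iscons, coefs=COEFS):
    """Model-implied thermometer rating for one cell of the 2 x 2 design."""
    return (
        coefs["intercept"]
        + coefs["dark"] * dark
        + coefs["iscons"] * iscons
        + coefs["dark_x_iscons"] * dark * iscons
    )


def dark_effect(iscons, coefs=COEFS):
    """Dark-minus-light difference in predicted rating for a given group."""
    return predicted_therm(1, iscons, coefs) - predicted_therm(0, iscons, coefs)


for group, flag in (("other", 0), ("conservative", 1)):
    print(group, round(dark_effect(flag), 3))</code></pre>
<p>With these coefficients the implied dark-versus-light difference is 0.451 points for other participants and 0.451 - 4.462 = -4.011 points for conservatives, which is the pattern the plot shows.</p>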
<p><strong>Conclusion</strong></p>
<p>The more items one uses to form an index, the less noise we should expect, and the more likely any replication attempt is to succeed. &nbsp;It should also mean greater statistical precision.&nbsp;This could explain the remaining discrepancy between this study and the original after adjusting for the geo-coding error pointed out above. It's also possible that something about the timing or the subjects recruited in the replication studies explains the observed differences.</p>
<p>Nonetheless, this replication provides evidence that darker images of Black political figures, or at least of President Barack Obama, do in fact activate stereotypes. &nbsp;This much larger sample suggests that the true effect is smaller than what I found in the original study, which, as noted above, contained some errors.</p>
<p>Replication materials available on <a href="http://dx.doi.org/10.7910/DVN/WY7PR8">dataverse</a>.</p>
<p><strong>Appendix</strong></p>
<p>Below I present alternate specifications estimated without pooling. These specifications suggest first that replication 2 (as well as the original study) was not well-powered. They also suggest that the outcome measures with more items yield more reliable estimates.</p>
<p>Outcome measure summing all items:</p>
<p>Replication 1:&nbsp;M_Light = 1.29, M_Dark = 1.36, T(2115.4) = -1.59, P = 0.113, two-sided</p>
<p>Replication 2:&nbsp;M_Light = 1.30, M_Dark = 1.39, T(982.7) = -1.35, P = 0.177, two-sided</p>
<p>Original Alpha outcome measure:</p>
<p>Replication 1:&nbsp;M_Light = 0.99, M_Dark = 1.06, T(2114.9) = -1.79, P = 0.073, two-sided</p>
<p>Replication 2:&nbsp;M_Light = 1.04, M_Dark = 1.09, T(979.8) = -0.85, P = 0.393, two-sided</p>
<p>Newly estimated Alpha&nbsp;outcome measure:</p>
<p>Replication 1:&nbsp;M_Light = 1.02, M_Dark = 1.09, T(2109.4) = -1.77, P = 0.077, two-sided</p>
<p>Replication 2:&nbsp;M_Light = 1.07, M_Dark = 1.10, T(984.6) = -0.53, P = 0.597, two-sided</p>
<p>Newly estimated ICC&nbsp;outcome measure:</p>
<p>Replication 1:&nbsp;M_Light = 0.89, M_Dark = 0.94, T(2100.6) = -1.49, P = 0.137, two-sided</p>
<p>Replication 2:&nbsp;M_Light = 0.93, M_Dark = 0.96, T(985.9) = -0.67, P = 0.506, two-sided</p>
<p>Original ICC&nbsp;outcome measure:</p>
<p>Replication 1:&nbsp;M_Light = 0.35, M_Dark = 0.36, T(2122.5) = -0.22, P = 0.826, two-sided</p>
<p>Replication 2:&nbsp;M_Light = 0.38, M_Dark = 0.40, T(978.7) = -0.72, P = 0.472, two-sided</p>
<p>A replication of a prior <a href="http://www.ljzigerell.com/?p=3622">critique and reanalysis</a>. &nbsp;The patterns that run contrary to our original findings are not significant in the replication data.</p>
<table style="height:1612px;" width="584">
<tbody>
<tr>
<td>
<strong>Variable</strong>
</td>
<td>
<strong>effect size</strong>
</td>
<td>
<strong>p-value</strong>
</td>
</tr>
<tr>
<td>
<span style="font-weight:400;">feeling therm</span>
</td>
<td>
<span style="font-weight:400;">-0.028</span>
</td>
<td>
<span style="font-weight:400;">0.439</span>
</td>
</tr>
<tr>
<td>
<span style="font-weight:400;">race minority welfare crime rap</span>
</td>
<td>
<span style="font-weight:400;">0.022</span>
</td>
<td>
<span style="font-weight:400;">0.538</span>
</td>
</tr>
<tr>
<td>
<span style="font-weight:400;">race minority welfare rap</span>
</td>
<td>
<span style="font-weight:400;">0.017</span>
</td>
<td>
<span style="font-weight:400;">0.629</span>
</td>
</tr>
<tr>
<td>
<span style="font-weight:400;">race minority rap</span>
</td>
<td>
<span style="font-weight:400;">0.011</span>
</td>
<td>
<span style="font-weight:400;">0.752</span>
</td>
</tr>
<tr>
<td>
<span style="font-weight:400;">race</span>
</td>
<td>
<span style="font-weight:400;">0.006</span>
</td>
<td>
<span style="font-weight:400;">0.861</span>
</td>
</tr>
<tr>
<td>
<span style="font-weight:400;">minority</span>
</td>
<td>
<span style="font-weight:400;">-0.013</span>
</td>
<td>
<span style="font-weight:400;">0.713</span>
</td>
</tr>
<tr>
<td>
<span style="font-weight:400;">rap</span>
</td>
<td>
<span style="font-weight:400;">0.024</span>
</td>
<td>
<span style="font-weight:400;">0.502</span>
</td>
</tr>
<tr>
<td>
<span style="font-weight:400;">welfare</span>
</td>
<td>
<span style="font-weight:400;">0.014</span>
</td>
<td>
<span style="font-weight:400;">0.695</span>
</td>
</tr>
<tr>
<td>
<span style="font-weight:400;">comp</span>
</td>
<td>
<span style="font-weight:400;">-0.013</span>
</td>
<td>
<span style="font-weight:400;">0.72</span>
</td>
</tr>
<tr>
<td>
<span style="font-weight:400;">crime</span>
</td>
<td>
<span style="font-weight:400;">0.016</span>
</td>
<td>
<span style="font-weight:400;">0.659</span>
</td>
</tr>
<tr>
<td>
<span style="font-weight:400;">trust</span>
</td>
<td>
<span style="font-weight:400;">0.002</span>
</td>
<td>
<span style="font-weight:400;">0.946</span>
</td>
</tr>
<tr>
<td>
<span style="font-weight:400;">brother</span>
</td>
<td>
<span style="font-weight:400;">0.042</span>
</td>
<td>
<span style="font-weight:400;">0.236</span>
</td>
</tr>
<tr>
<td>
<span style="font-weight:400;">drug</span>
</td>
<td>
<span style="font-weight:400;">0.008</span>
</td>
<td>
<span style="font-weight:400;">0.822</span>
</td>
</tr>
<tr>
<td>
<span style="font-weight:400;">lazy</span>
</td>
<td>
<span style="font-weight:400;">0.026</span>
</td>
<td>
<span style="font-weight:400;">0.474</span>
</td>
</tr>
<tr>
<td>
<span style="font-weight:400;">black</span>
</td>
<td>
<span style="font-weight:400;">0.084</span>
</td>
<td>
<span style="font-weight:400;">0.019</span>
</td>
</tr>
<tr>
<td>
<span style="font-weight:400;">dirty</span>
</td>
<td>
<span style="font-weight:400;">0.028</span>
</td>
<td>
<span style="font-weight:400;">0.438</span>
</td>
</tr>
<tr>
<td>
<span style="font-weight:400;">poor</span>
</td>
<td>
<span style="font-weight:400;">-0.002</span>
</td>
<td>
<span style="font-weight:400;">0.957</span>
</td>
</tr>
<tr>
<td>
<span style="font-weight:400;">allwcs</span>
</td>
<td>
<span style="font-weight:400;">0.073</span>
</td>
<td>
<span style="font-weight:400;">0.04</span>
</td>
</tr>
<tr>
<td>
<span style="font-weight:400;">original</span>
</td>
<td>
<span style="font-weight:400;">0.018</span>
</td>
<td>
<span style="font-weight:400;">0.611</span>
</td>
</tr>
<tr>
<td>
<span style="font-weight:400;">alpha</span>
</td>
<td>
<span style="font-weight:400;">0.068</span>
</td>
<td>
<span style="font-weight:400;">0.057</span>
</td>
</tr>
</tbody>
</table>



 ]]></description>
  <guid>https://solomonmg.github.io/blog/replication-of-bias-in-the-flesh/</guid>
  <pubDate>Mon, 16 Oct 2017 00:00:00 GMT</pubDate>
  <media:content url="https://solomonmg.github.io/blog/replication-of-bias-in-the-flesh/featured.png" medium="image" type="image/png" height="108" width="144"/>
</item>
<item>
  <title>Ideologically diverse news, an agenda for future research</title>
  <dc:creator>Eytan Bakshy</dc:creator>
  <dc:creator>Sol Messing</dc:creator>
  <link>https://solomonmg.github.io/blog/exposure-to-ideologically-diverse-response/</link>
  <description><![CDATA[ 





<p>Earlier this month, we published an early access version of our paper in ScienceExpress (<a href="../../pdf/Science-2015-Bakshy-1130-2.pdf">Bakshy et al.&nbsp;2015</a>), “Exposure to ideologically diverse news and opinion on Facebook.” The paper constitutes the first attempt to quantify the extent to which ideologically cross-cutting hard news and opinion is shared by friends, appears in algorithmically ranked News Feeds, and is actually consumed (i.e., clicked through to read).</p>
<p>We are grateful for the widespread interest in this paper, which grew out of two threads of related research that we began nearly five years ago: Eytan and Lada's work on the role of social networks in information diffusion (<a href="http://arxiv.org/pdf/1201.4145v2.pdf">Bakshy et al.&nbsp;2012</a>) and Sean and Solomon's work on selective exposure in social media (<a href="http://crx.sagepub.com/content/41/8/1042">Messing and Westwood 2012</a>).</p>
<p>While <em>Science</em> papers are explicitly prohibited from suggesting future directions for research, we would like to shed additional light on our study and raise a few questions that we would be excited to see addressed in future work.</p>
<p><strong>Tradeoffs when Selecting a Population</strong></p>
<p>There were tradeoffs when deciding whom to include in this study. While we could have examined all U.S. adults on Facebook, we focused on people who identify as liberals or conservatives and encounter hard news, opinion, and other political content in social media regularly. We did so because many important questions around “echo chambers” and “filter bubbles” on Facebook relate to this subpopulation, and we used self-reported ideological preferences to define it.</p>
<p>Using self-reported ideological preferences in online profiles is not the only way to measure ideology or define the population of interest. Yet, people who publicly identify as liberals or conservatives in their Facebook profiles are an interesting and important subpopulation worthy of study for many reasons. As <a href="http://gking.harvard.edu/files/gking/files/words.pdf">Hopkins and King 2010</a> have pointed out, studying the expression and behavior of those who are politically engaged online is of interest to political scientists studying activists (<a href="http://www.hup.harvard.edu/catalog.php?isbn=9780674942936">Verba, Schlozman, and Brady 1995</a>), the media (<a href="http://195.130.87.21:8080/dspace/handle/123456789/979">Drezner and Farrell 2004</a>), public opinion (<a href="https://scholar.google.com/scholar?cluster=15572370959137190849&amp;hl=en&amp;as_sdt=2005&amp;sciodt=0,5">Gamson 1992</a>), social networks (<a href="http://dl.acm.org/citation.cfm?id=1134277">Adamic and Glance 2005</a>; <a href="http://www.langtoninfo.co.uk/web_content/9780521542234_frontmatter.pdf">Huckfeldt and Sprague 1995</a>), and elite influence (<a href="http://press.princeton.edu/titles/8425.html">Grindle 2005</a>; <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.160.8347&amp;rep=rep1&amp;type=pdf">Hindman, Tsioutsiouliklis, and Johnson 2003</a>; <a href="https://books.google.com/books?hl=en&amp;lr=&amp;id=83yNzu6toisC&amp;oi=fnd&amp;pg=PR8&amp;dq=The+Nature+and+Origins+of+Mass+Opinion&amp;ots=6oEwiBZtSM&amp;sig=ySnwRZUuVtOyTkC-3raaL88pYns#v=onepage&amp;q=The%20Nature%20and%20Origins%20of%20Mass%20Opinion&amp;f=false">Zaller 1992</a>).</p>
<p>This subpopulation has limitations and is not the only population of interest. The data are not appropriate for those who seek estimates of the entire U.S. public, people without strong opinions, or people not on Facebook (at least not without additional extrapolation, re-weighting, additional evidence, etc.). While our data <em>could plausibly</em> also provide good estimates of the population of people who are ideologically active and have clear preferences, we are not claiming that's necessarily the case---that remains to be determined in future work.</p>
<p>We'd like to help other researchers looking to study other populations understand more about the population we've defined. An important question in this regard is what proportion of active U.S. adults actually report an identifiable left/right/center ideology in their profile. That number is 25%, or 10.1 million people.</p>
<p>It's also informative to examine the proportion of those users who provide identifiable profile affiliations conditional on demographics and Facebook usage:</p>
<table style="border-collapse:collapse;width:473px;height:512px;" border="0" width="174" cellspacing="0" cellpadding="0">
<colgroup>
<col style="width:65pt;" span="2" width="87">
</colgroup>
<tbody>
<tr style="height:18pt;">
<td class="xl63" style="height:18pt;width:65pt;" width="87" height="24">
<strong>Age</strong>
</td>
<td class="xl63" style="width:65pt;" width="87">
<strong>Percent reporting ideological affiliation</strong>
</td>
</tr>
<tr style="height:18pt;">
<td class="xl63" style="height:18pt;" height="24">
18-24
</td>
<td class="xl64">
21.60%
</td>
</tr>
<tr style="height:18pt;">
<td class="xl63" style="height:18pt;" height="24">
25-44
</td>
<td class="xl64">
28.50%
</td>
</tr>
<tr style="height:18pt;">
<td class="xl63" style="height:18pt;" height="24">
45-64
</td>
<td class="xl64">
24.30%
</td>
</tr>
<tr style="height:18pt;">
<td class="xl63" style="height:18pt;" height="24">
65+
</td>
<td class="xl64">
21.40%
</td>
</tr>
<tr style="height:18pt;">
<td class="xl63" style="height:18pt;" height="24">
</td>
<td class="xl63">
</td>
</tr>
<tr style="height:18pt;">
<td class="xl63" style="height:18pt;" height="24">
<strong>Gender</strong>
</td>
<td class="xl63">
<strong>Percent reporting ideological affiliation</strong>
</td>
</tr>
<tr style="height:18pt;">
<td class="xl63" style="height:18pt;" height="24">
Female
</td>
<td class="xl64">
21.90%
</td>
</tr>
<tr style="height:18pt;">
<td class="xl63" style="height:18pt;" height="24">
Male
</td>
<td class="xl64">
30.60%
</td>
</tr>
<tr style="height:18pt;">
<td class="xl63" style="height:18pt;" height="24">
</td>
<td class="xl63">
</td>
</tr>
<tr style="height:18pt;">
<td class="xl63" style="height:18pt;" height="24">
<strong>Login Days</strong>
</td>
<td class="xl63">
<strong>Percent reporting ideological affiliation</strong>
</td>
</tr>
<tr style="height:18pt;">
<td class="xl63" style="height:18pt;" height="24">
105-140
</td>
<td class="xl64">
18.90%
</td>
</tr>
<tr style="height:18pt;">
<td class="xl63" style="height:18pt;" height="24">
140-185
</td>
<td class="xl64">
26.70%
</td>
</tr>
</tbody>
</table>
<p>Clearly those who report an ideology in their profile tend to be more active on Facebook. They are also more likely to be men, which is consistent with the well-documented gender gap in American politics (<a href="http://journals.cambridge.org/action/displayAbstract?fromPage=online&amp;aid=245226&amp;fileId=S0003055404001315">Box-Steffensmeier 2004</a>).</p>
<p>It's possible that these individuals differ from other Facebook users in other ways. It seems plausible to expect these people to have higher levels of political interest, a stronger sense of political ideology and political identity, and to be more likely to be active in politics than most others on Facebook. It's also possible that these individuals are more extroverted than the average user, especially in the somewhat taboo domain of politics. These possibilities also strike us as interesting questions for study in future work.</p>
<p><strong>How to Measure Ideology</strong></p>
<p>We hope others will replicate this work using other populations and ways of measuring ideology, which will provide a broader view of exposure to political media. Data on ideology could be collected by, for example, surveying users, imputing ideology based on user behavior, or joining data to the voter file. Each of these methods has advantages and potential challenges.</p>
<p>Using surveys in future work would allow researchers to collect data on ideology in a way that can facilitate comparisons with much of the extant literature in political science, and allow researchers to sample from a less politically engaged population. Of course, this could be tricky because survey response rates might be affected by the phenomenon under study. In other words, the salience of political discussion from the right or left, and/or prior choices to consume content could make people more/less likely to respond to a survey asking about ideology, or affect the way they report the strength of their ideological preferences. This could confound measurement in a way that would be difficult to detect and correct. Yet it would be fascinating to see how survey results compare to the results in this study.</p>
<p>We would also encourage the application of large-scale methods that impute individuals’ ideological leanings using social networks or revealed preferences. This would have the advantage of allowing researchers to estimate&nbsp;ideological preferences for&nbsp;a broader population, and could be applied to empirical contexts for which self-reported ideological affiliations are not present.</p>
<p>However, these approaches present challenges. Imputing ideology based on social networks would make it difficult to estimate what proportion of people’s networks contain individuals from the other side.&nbsp;<a href="http://journals.cambridge.org/action/displayAbstract?fromPage=online&amp;aid=9586211&amp;fileId=S0003055414000525">Bond and Messing, 2015</a> and <a href="http://pan.oxfordjournals.org/content/23/1/76.full.pdf?keytype=ref&amp;ijkey=uMFPw4dsMHM7608">Barberá 2014</a> discuss some of the challenges related to estimating ideology based on revealed preferences. Another challenge specific to the quantities estimated in our paper is that because behavior may be caused by the composition of individuals’ social networks, what their friends share, and how they engage with Facebook, using revealed preferences to select the population could introduce endogenous selection bias (<a href="http://www.annualreviews.org/doi/abs/10.1146/annurev-soc-071913-043455">Elwert and Winship 2014</a>). A study that negotiates these issues would be a tremendously valuable contribution. Similar methods could also be used to obtain measures of ideological alignment of content.</p>
<p>Lastly, researchers could use party registration from the voter file. This approach would yield millions of records, but would have different selection problems: match rates may differ by region, state, age, gender, etc. Again, the advantage of approaches like this is that these studies complement each other and provide a fuller picture of how exposure to viewpoints from the other side occurs in social media.</p>
<p>Future work should also examine how exposure varies in different subpopulations. For example, one hypothesis to test is whether those with weaker or less consistent ideological preferences have more cross-cutting content shared by friends, rendered in social media streams, and selected for reading. Some preliminary analysis suggests that indeed, among the individuals in our study, those with a weaker stated ideological affiliation have on average more cross-cutting content at each stage in the exposure process.</p>
<p><img src="https://solomonmg.github.io/img/5a469dbbfb5b2916a87554ca4a247579da11e2b2.png" class="img-fluid"></p>
<p><strong>Other Data Sources</strong></p>
<p>There are many other important questions related to this paper that necessitate new data sources: Does encountering cross-cutting content increase or decrease attitude polarization? What about attitudes toward members of the other side? Does it change specific policy preferences? Are liberals and conservatives more or less likely to see content in News Feed <em>because</em> it was cross-cutting? Do they actively avoid cross-cutting political content <em>because</em> of expressions in the title or because of the fact that the media source is suggestive of a cross-cutting article? How do changes to ranking algorithms and user interfaces affect selective exposure? And how can we better understand actual discourse about politics in social media, rather than merely shared media content?</p>
<p>Answering these questions necessitates collecting innovative data sets via online experimentation (<a href="http://pan.oxfordjournals.org/content/20/3/351.short">Berinsky et al.&nbsp;2012</a>), social media (<a href="http://scholar.harvard.edu/dtingley/files/fall2012.pdf">Ryan and Broockman 2012</a>), crowdsourcing (<a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2526461">Budak et al.&nbsp;2014</a>), large scale field experimentation (<a href="http://gking.harvard.edu/publications/randomized-Experimental-Study-Censorship-China">King et al.&nbsp;2014</a>), observational social media data, clever ways to collect data about individual differences in ranking (<a href="http://www.lazerlab.net/sites/default/files/publications/Measuring%20Personalization%20of%20Web%20Search.pdf">Hannak et al.&nbsp;2013</a>), smart ways to combine behavioral and survey data (<a href="http://arxiv.org/pdf/1304.1837v3.pdf">Chen et al.&nbsp;2014</a>), and panel data (<a href="http://faculty-gsb.stanford.edu/athey/documents/localnews.pdf">Athey and Mobius 2012</a>, <a href="https://5harad.com/papers/bubbles.pdf">Flaxman et al.&nbsp;2014</a>).</p>
<p>Many of these are causal questions necessitating experimental and/or quasi experimental designs. For example, the extent to which people select content because it is cross-cutting could be investigated using experiments like this one (e.g., <a href="http://crx.sagepub.com/content/41/8/1042">Messing and Westwood 2012</a>) or through identifying sources of natural exogenous variation. And while Diana Mutz and others have done ground-breaking research on the effects of encountering cross-cutting arguments on political attitudes (<a href="http://journals.cambridge.org/action/displayAbstract?fromPage=online&amp;aid=208463&amp;fileId=S0003055402004264">Mutz 2002</a>b) and behavior (<a href="http://www.jstor.org/stable/3088437">Mutz 2002</a>a), more research into how these effects play out in the long term (using approaches like <a href="http://faculty.wcas.northwestern.edu/~jnd260/pub/Druckman%20Fein%20Leeper%20APSR.pdf">Druckman et al 2012</a>) would be of tremendous benefit to the literature. It is difficult to expose people to any sort of argument for a long period of time (say over the course of a U.S. national political campaign cycle), in a way that is not confounded with people's existing preferences and the social environment, though creative quasi-experimental work (<a href="http://web.stanford.edu/~ayurukog/cable_news.pdf">Martin and Yurukoglu 2014</a>) is emerging in this area.</p>
<p>Many of these questions necessitate that researchers identify the effects of cross-cutting arguments both on and off Facebook. To get a full picture of how cross-cutting arguments affect politics requires understanding the myriad of ways individuals get information, both on the Internet (<a href="https://5harad.com/papers/bubbles.pdf">Flaxman et al.&nbsp;2014</a>) and offline (<a href="http://journals.cambridge.org/action/displayAbstract?fromPage=online&amp;aid=208463&amp;fileId=S0003055402004264">Mutz 2002</a>a), what kinds of information people discuss in offline contexts (<a href="http://dx.doi.org/10.1017/S0003055402004264">Mutz 2002</a>b), and the relative influence of all of these factors on opinions.</p>
<p>Finally, if individuals' online networks and choices do substantially impact the diversity of news in individuals' overall “information diets,” future research could examine the effects of connecting those with more disparate views (<a href="http://ajps.org/2015/03/11/partisanship-in-social-settings-when-democrats-and-republicans-meet/">Klar 2014</a>), encouraging consumption of cross-cutting content (<a href="http://www.smunson.com/portfolio/projects/socnews_icwsm15.pdf">Agapie and Munson 2015</a>), or simply encouraging individuals to read more diverse news by making individuals more aware of the balance of news they consume (<a href="http://www.smunson.com/portfolio/projects/aggdiversity/balancer-icwsm.pdf">Munson et al.&nbsp;2013</a>).</p>
<p>These questions are especially important in light of the fact that there are substantial opportunities for people to read more news on Facebook. The plots below illustrate the average proportion of stories shared by friends, those that are seen in News Feed, and those clicked on for liberals and conservatives in the study. Clearly there is an opportunity to read more news from either side.</p>
<p><img src="https://solomonmg.github.io/img/ddb611f4a223213b3a653208e5bdd048d6b9cd94.png" class="img-fluid"></p>
<p><strong>Dataverse</strong></p>
<p>Finally, we believe that reproducing, replicating, and conducting additional analyses on extant data sets is extremely important and helps generate ideas for future work (<a href="http://gking.harvard.edu/files/abs/replication-Abs.shtml">King 1995</a>, <a href="http://thomasleeper.com/2015/05/open-science-language/">Leeper 2015</a>). In that spirit, we have created a <a href="https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/AAI7VA">Dataverse archive</a>. The repository includes replication data, scripts, as well as some additional supplementary data and code for extending our work.</p>
<p><strong>References</strong></p>
<p>E. Bakshy, S. Messing, L.A. Adamic. 2015. Exposure to ideologically diverse news and opinion on Facebook. <em>Science</em>.</p>
<p>E. Bakshy, I. Rosenn, C.A. Marlow, L.A. Adamic. 2012. The Role of Social Networks in Information Diffusion. <em>ACM WWW 2012.</em></p>
<p>S. Messing and S.J. Westwood. 2012. Selective Exposure in the Age of Social Media: Endorsements Trump Partisan Source Affiliation When Selecting News Online. <em>Communication Research</em>.</p>
<p>P. Barberá. 2015. Birds of the same feather tweet together: Bayesian ideal point estimation using Twitter data. <em>Political Analysis</em>, <em>23</em>(1), 76-91.</p>
<p>R. Bond, S. Messing. 2015. Quantifying Social Media’s Political Space: Estimating Ideology from Publicly Revealed Preferences on Facebook. <em>American Political Science Review</em>.</p>
<p>F. Elwert and C. Winship. 2014. Endogenous Selection Bias: The Problem of Conditioning on a Collider Variable. <em>Annual Review of Sociology.</em></p>
<p>G. King, J. Pan, and M. E. Roberts. 2014. Reverse-Engineering Censorship in China: Randomized Experimentation and Participant Observation. <em>Science</em>.</p>
<p>C. Budak, S. Goel, J.M. Rao. 2014. Fair and Balanced? Quantifying Media Bias Through Crowdsourced Content Analysis. Working paper (November 17, 2014).</p>
<p>S. Athey, M. Mobius. The Impact of News Aggregators on Internet News Consumption: The Case of Localization. Working paper. <a href="http://faculty-gsb.stanford.edu/athey/documents/localnews.pdf" class="uri">http://faculty-gsb.stanford.edu/athey/documents/localnews.pdf</a></p>
<p>A. Hannak, P. Sapiezynski, A. Molavi Kakhki, B. Krishnamurthy, D. Lazer, A. Mislove, C. Wilson. 2013. Measuring personalization of web search. <em>ACM WWW 2013</em>.</p>
<p>A. Chen and A. Owen and M. Shi. Data Enriched Linear Regression. Working paper. <a href="http://arxiv.org/pdf/1304.1837v3.pdf" class="uri">http://arxiv.org/pdf/1304.1837v3.pdf</a></p>
<p>G.J. Martin, A. Yurukoglu. Bias in Cable News: Real Effects and Polarization. Working paper. <a href="http://web.stanford.edu/~ayurukog/cable_news.pdf" class="uri">http://web.stanford.edu/~ayurukog/cable_news.pdf</a></p>
<p>S.R. Flaxman, S. Goel, J.M. Rao. Filter Bubbles, Echo Chambers, and Online News Consumption. Working paper. <a href="https://5harad.com/papers/bubbles.pdf" class="uri">https://5harad.com/papers/bubbles.pdf</a></p>
<p>D.C. Mutz. 2002. The Consequences of Cross-Cutting Networks for Political Participation. <em>American Journal of Political Science</em>.</p>
<p>D.C. Mutz. 2002. Cross-cutting Social Networks: Testing Democratic Theory in Practice. <em>American Political Science Review</em>.</p>
<p>J. N. Druckman, J. Fein, &amp; T. Leeper. 2012. A source of bias in public opinion stability. <em>American Political Science Review</em>.</p>
<p>E. Agapie, S.A. Munson. 2015. “<a href="http://smunson.com/portfolio/projects/socnews_icwsm15.pdf">Social Cues and Interest in Reading Political News Stories</a>.” <em>AAAI ICWSM 2015</em>.</p>
<p>S. Klar. 2014. Partisanship in a Social Setting. <em>American Journal of Political Science</em>.</p>
<p>S.A. Munson, S.Y. Lee, P. Resnick. 2013. Encouraging Reading of Diverse Political Viewpoints with a Browser Widget. <em>AAAI ICWSM 2013</em>.</p>
<p>G. King. 1995. “Replication, Replication.” <em>Political Science and Politics</em>. <a href="http://j.mp/1wP9Vqn" class="uri">http://j.mp/1wP9Vqn</a></p>
<p>T. Leeper. 2015. What's in a Name? The Concepts and Language of Replication and Reproducibility. Blog post. <a href="http://thomasleeper.com/2015/05/open-science-language/" class="uri">http://thomasleeper.com/2015/05/open-science-language/</a></p>



 ]]></description>
  <guid>https://solomonmg.github.io/blog/exposure-to-ideologically-diverse-response/</guid>
  <pubDate>Fri, 24 Apr 2015 00:00:00 GMT</pubDate>
  <media:content url="https://solomonmg.github.io/blog/exposure-to-ideologically-diverse-response/featured.png" medium="image" type="image/png" height="76" width="144"/>
</item>
<item>
  <title>When to Use Stacked Barcharts?</title>
  <dc:creator>Sol Messing</dc:creator>
  <link>https://solomonmg.github.io/blog/when-to-use-stacked-barcharts/</link>
  <description><![CDATA[ 





<p>Yesterday a few of us on&nbsp;Facebook’s Data Science Team released a <a href="https://www.facebook.com/notes/10152581594083859/">blogpost showing how candidates are campaigning on Facebook in the 2014 U.S. midterm elections</a>. It was <a href="http://www.washingtonpost.com/blogs/govbeat/wp/2014/10/10/how-candidates-use-facebook-motivation-more-than-persuasion/">picked up in the Washington Post</a>, in which <a href="http://www.washingtonpost.com/people/reid-wilson">Reid Wilson</a> calls us "data wizards." Outstanding.</p>
<p>I used <a href="http://had.co.nz/">Hadley Wickham's</a> ggplot2 for every visualization in the post except a map&nbsp;that <a href="http://web.stanford.edu/~arjunw/">Arjun Wilkins</a> produced using&nbsp;D3, and for the first time I used stacked bar charts. &nbsp;Now as I've stated previously, <a href="../../blog/visualization-series-insight-from-cleveland-and-tufte-on-plotting-numeric-data-by-groups">one should generally avoid bar charts, and especially stacked bar charts</a>, except in a few specific circumstances.</p>
<p>But let's talk about when not to use stacked bar charts first. I had the pleasure of chatting with Kaiser Fung of <a href="http://junkcharts.typepad.com/">JunkCharts</a> fame the other day, and I think what makes his site so compelling is the mix of schadenfreude and <a href="http://betterthanenglish.com/fremdscham-german/">Fremdscham</a> that makes taking apart someone else's mistake such an effective teaching strategy and such a memorable read. I also appreciate the subtle nod to <a href="https://en.wikipedia.org/wiki/Found_object">junk art</a>.</p>
<p>Here's a typical, terrible stacked bar chart, which I found on http://www.storytellingwithdata.com/ and which was originally published in a <a href="http://blogs.wsj.com/digits/2012/10/22/microsoft-windows-8-forrester/">Wall Street Journal blogpost</a>. It shows the share of the personal computing device market by operating system, over time. The problem with using a stacked bar chart is that there are only two common baselines for comparison (the top and bottom of the plotting area), but we are interested in the relative share for more than two OS brands. The post is really concerned with Microsoft, so one solution would be to plot Microsoft versus the rest, or perhaps Microsoft on top versus Apple on the bottom with "Other" in the middle. Then we'd be able to compare market share over time for Apple and Microsoft. As the author points out, a trend over time can also be visualized with line plots.</p>
<p><img src="https://solomonmg.github.io/img/b5e194c5114b79478a7ffcf600b18cd205a3a1b7.jpg" class="img-fluid"></p>
<p>By far the worst offender I found in my 5 minute Google search was <a href="http://junkcharts.typepad.com/junk_charts/2014/08/one-guaranteed-to-make-stephen-few-cry-.html">from junkcharts</a> and originally published on <a href="http://www.vox.com/2014/7/28/5944065/electric-cars-plug-in-vehicles-rising-sales-US">Vox</a>. These cumulative sum plots are so bad I was surprised to see them still up. The first problem is that the plots represent an attempt to convey way too much information: either plot total sales, or pick a few key brands that are most interesting and plot them on a multi-line chart or set of faceted time series plots. The only brand for which you can quickly get a sense of sales over time is the Chevy Volt because it's on the baseline. I'm sure the authors wanted to also convey the proportion of sales each year, but if you want to do that just plot the relative sales. Of course, the order in which the bars appear on the plot has no organizing principle, and you need to constantly move your eyes back and forth from the legend to the plot when trying to make sense of this monstrosity.</p>
<p><img src="https://solomonmg.github.io/img/a1281b5624937d4bf8069706e74028aeb8f9952d.png" class="img-fluid"></p>
<p>As Kaiser notes in his post, less is often more. Here's his redux, which uses lines and aggregates by both quarter and brand, resulting in a far superior visualization:</p>
<p><img src="https://solomonmg.github.io/img/df59a98319dbf363b90514243d0ded9d9aa191ce.png" class="img-fluid"></p>
<p>So when <em>should</em> you use a stacked bar chart? Here are two scenarios with examples, inspired by work with <a href="http://eytan.github.io/">Eytan Bakshy</a> and conversations with <a href="http://ta.virot.me/">Ta Chiraphadhanakul</a> and <a href="http://www.johnmyleswhite.com/">John Myles White</a>.</p>
<p>1.&nbsp;You care about comparing the&nbsp;proportion of two things, in this case the share of posts by Democrats and Republicans, along a variety of dimensions. &nbsp;In this case those dimensions consist of keyword (dictionary-based) categories (above) and LDA topics (below). &nbsp;When these are sorted by relative proportion, the reader gains insight into which campaign strategies and issues are used more by Republican or Democratic candidates.</p>
<p><img src="https://solomonmg.github.io/img/b16c48bc6a47363ced0e685cb506df99e15e0a73.png" class="img-fluid"></p>
<ol start="2" type="1">
<li>You care about comparing proportions along an ordinal, additive variable such as 5-point party identification, along a set of dimensions. &nbsp;I provide an example from a forthcoming paper below (I'll re-insert&nbsp;the axis labels&nbsp;once it's published). &nbsp;Notice that it draws the reader toward two sets of comparisons across dimensions -- one for&nbsp;strong Democrats and Republicans, the other for&nbsp;the set of <em>all</em> Democrats and <em>all</em> Republicans.</li>
</ol>
<p><img src="https://solomonmg.github.io/img/312fb516b818733414acbe395ded53b8424f37a2.png" class="img-fluid"></p>
<p>Of course, R code to produce these plots follows:</p>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Uncomment these lines and install if necessary:</span></span>
<span id="cb1-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#install.packages('ggplot2')</span></span>
<span id="cb1-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#install.packages('dplyr')</span></span>
<span id="cb1-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#install.packages('scales')</span></span>
<span id="cb1-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(ggplot2)</span>
<span id="cb1-6"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(dplyr)</span>
<span id="cb1-7"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(scales)</span>
<span id="cb1-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># We start with the raw number of posts for each party for</span></span>
<span id="cb1-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># each candidate. Then we compute the total by party and</span></span>
<span id="cb1-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># category.</span></span>
<span id="cb1-11">catsByParty <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> catsByParty <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">group_by</span>(party, all_cats) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb1-12"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summarise</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">tot =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(posts))</span>
<span id="cb1-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Next, compute the proportion by party for each category</span></span>
<span id="cb1-14"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># using dplyr::mutate</span></span>
<span id="cb1-15">catsByParty <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> catsByParty <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb1-16"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">group_by</span>(all_cats) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb1-17"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">prop =</span> tot<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(tot))</span>
<span id="cb1-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Now compute the difference by category and order the</span></span>
<span id="cb1-19"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># categories by that difference:</span></span>
<span id="cb1-20">catsByParty <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> catsByParty <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">group_by</span>(all_cats) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb1-21"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">pdiff =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">diff</span>(prop))</span>
<span id="cb1-22">catsByParty<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>all_cats <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">reorder</span>(catsByParty<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>all_cats, <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>catsByParty<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>pdiff)</span>
<span id="cb1-23"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># And plot:</span></span>
<span id="cb1-24"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(catsByParty, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x=</span>all_cats, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y=</span>prop, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fill=</span>party)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb1-25"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_y_continuous</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">labels =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">percent_format</span>()) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb1-26"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_bar</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">stat=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'identity'</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb1-27"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_hline</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">yintercept=</span>.<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linetype =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'dashed'</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb1-28"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coord_flip</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb1-29"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb1-30"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ylab</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Democrat/Republican share of page posts'</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb1-31"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">xlab</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">''</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb1-32"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_fill_manual</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">values=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'blue'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'red'</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb1-33"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">legend.position=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'none'</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb1-34"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggtitle</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Political Issues Discussed by Party</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\n</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>)</span></code></pre></div>
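<p>For the second scenario, here is a minimal sketch rather than the code behind the figure above. The data frame <code>pidShares</code>, its three dimensions ("A", "B", "C"), the proportions, and the color choices are all made up for illustration. The idea is to stack an ordered factor so that the two strong-partisan segments sit against the outer edges of the plotting area, where each has a common baseline.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r">library(ggplot2)
library(scales)

# Hypothetical data: share of respondents at each level of 5-point
# party identification, within three made-up dimensions.
pid_levels &lt;- c("Strong Democrat", "Weak Democrat", "Independent",
                "Weak Republican", "Strong Republican")
pidShares &lt;- data.frame(
  dimension = rep(c("A", "B", "C"), each = 5),
  pid = factor(rep(pid_levels, times = 3), levels = pid_levels),
  prop = c(0.30, 0.20, 0.15, 0.15, 0.20,
           0.20, 0.20, 0.20, 0.20, 0.20,
           0.10, 0.15, 0.20, 0.25, 0.30)
)

# Stack the ordered levels; reverse = TRUE stacks them in factor
# order starting from zero. The diverging palette is an arbitrary choice.
p &lt;- ggplot(pidShares, aes(x = dimension, y = prop, fill = pid)) +
  geom_bar(stat = "identity", position = position_stack(reverse = TRUE)) +
  scale_y_continuous(labels = percent_format()) +
  scale_fill_manual(values = c("#2166AC", "#92C5DE", "#D9D9D9",
                               "#F4A582", "#B2182B")) +
  coord_flip() +
  theme_bw() +
  xlab("") +
  ylab("Share by party identification")
print(p)</code></pre></div>
<p>Because the shares within each dimension sum to one, the boundary between the Democratic and Republican segments can also be read against either edge, which is what makes the "all Democrats" versus "all Republicans" comparison possible.</p>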



 ]]></description>
  <guid>https://solomonmg.github.io/blog/when-to-use-stacked-barcharts/</guid>
  <pubDate>Sat, 11 Oct 2014 00:00:00 GMT</pubDate>
  <media:content url="https://solomonmg.github.io/blog/when-to-use-stacked-barcharts/featured.png" medium="image" type="image/png" height="120" width="144"/>
</item>
<item>
  <title>Insight From Cleveland And Tufte On Plotting Numeric Data By Groups</title>
  <link>https://solomonmg.github.io/blog/visualization-series-insight-from-cleveland-and-tufte-on-plotting-numeric-data-by-groups/</link>
  <description><![CDATA[ 





<p>After my post on making&nbsp;<a href="http://solomonmessing.wordpress.com/2011/11/26/putting-it-all-together-concise-code-to-make-dotplots-with-weighted-bootstrapped-standard-errors/">dotplots with concise code using plyr and ggplot</a>, I got an email from my dad who practices immigration law and runs a <a href="http://www.messinglawoffices.com/default.aspx">website with a variety of immigration resources and tools</a>. &nbsp;He pointed out that the post was written for folks who&nbsp;already know that they want to make dot plots, and who already know about bootstrapped standard errors. &nbsp;That’s not many people.</p>
<p>In an attempt to appeal to a broader audience, I’m starting a series in which I’ll outline the key principles I use when developing a visualization. &nbsp;In this post, I’ll articulate these principles, which combine some of Tufte’s aesthetic guidelines with Cleveland’s scientific approach to visualization, which is based on the psychological processes involved in making sense of visualizations, and has been&nbsp;rigorously&nbsp;tested via randomized controlled experiments. &nbsp;Based on these principles, I’ll argue that dotplots and scatterplots are better than other types of plots (especially pie charts) in most situations. &nbsp;In later posts, I’ll demonstrate another innovation whose widespread use I’ll credit to Cleveland and Tufte: the use of multiple panels (aka small multiples, trellis graphics, facets, generalized draftsman’s displays, multivar charts) to clearly convey the same information embedded in more complex and difficult to read visualizations, including multiple line plots and mosaic plots. In future posts I’ll also emphasize why it is important to provide some indication of the noise present in the underlying data using error bars or bands. &nbsp;Along the way, I’ll put you to the test–I’ll present some visualizations of the same data using different visualization techniques and ask you to try to get as much information as you can in 2 seconds from each type of visualization.</p>
<p>A good visualization conveys key information to those who may have trouble&nbsp;interpreting&nbsp;numbers and/or statistics, which can make your findings accessible to a wider audience (more on this below). &nbsp;Visualizations also&nbsp;give your audience a break from lexical processing, which is especially useful when you are presenting your findings–people can listen to you and process the findings from a well-designed visual at the same time, but most people have trouble listening while reading your PowerPoint bullet points. &nbsp;Visualizations also convey key information embedded in massive amounts of data, which can aid your own exploratory analysis of data, no matter how massive.</p>
<p>Yet most visualizations are flawed, drawn using elements that make it unnecessarily difficult for the human visual system to make sense of things. &nbsp;I see a lot of these visualizations attending research presentations, screening incoming draft manuscripts as the assistant editor for <a href="http://www.tandf.co.uk/journals/upcp">Political Communication</a>, and as a consumer of media info-graphics (CNN is especially bad, have a look at <a href="http://tech.fortune.cnn.com/tag/pie-chart/">this monstrosity</a>). &nbsp;Kevin Fox has an <a href="http://fury.com/2010/03/why-3d-pie-charts-are-bad/">especially compelling visual speaking to this here</a>. A big part of the problem is that Microsoft makes it easy to draw flashy but ultimately confusing visualizations in Excel. &nbsp;If you are too busy to read this post in full, follow this short list of guidelines and you’ll be on your way to producing elegant visualizations that impose a minimal cognitive burden on your audience:</p>
<ol type="1">
<li><p>Never represent something in 2 or <a href="http://www.psdgraphics.com/wp-content/uploads/2009/02/3d_pie_chart.jpg">worse yet 3 dimensions</a> if it can be represented in one—NEVER use pie charts, 3-D pie charts, stacked bar charts, or 3-D bar charts.</p></li>
<li><p>Remove as much chart junk as possible–unnecessary gridlines, shading, borders, etc.</p></li>
<li><p>Give your audience a sense of the noise present in your data–draw error bars or confidence bands if you are plotting estimates.</p></li>
<li><p>If you want to plot multiple types of groups on a single outcome (the visual analog of cross-tabulations/marginals), use <a href="http://solomonmessing.files.wordpress.com/2011/11/trtbypid2.png?w=640">multi-paneled plots</a>.&nbsp;These can also help if <a href="http://www.bo.astro.it/~eps/buz10503/ff08.jpg">overplotting looks too cluttered</a>.</p></li>
<li><p>Avoid mosaic plots. Instead use&nbsp;<a href="http://wiki.stdout.org/rcookbook/Graphs/Facets%20(ggplot2)?action=AttachFile&amp;do=get&amp;target=hp_sex_smoker_free_free.png">paneled histograms</a>.</p></li>
<li><p>Ditch the legend if you can (you almost always can).</p></li>
</ol>
<p>The rest of the content in this series emphasizes why it makes sense to follow these guidelines. In this post I’ll look at the first point in detail and touch on the sixth. These two guidelines are most relevant when you want to look at a quantitative variable&nbsp;(e.g., earnings, vote-share, temperature, etc.) across different qualitative groupings (e.g., industry segment, candidate, party, racial group, season, etc.). &nbsp;This is one of the most common visualization tasks in business, media, and social science, and for this task people often use pie charts and/or bar charts, and occasionally dot plots.</p>
<p><strong>The science of graphical perception</strong></p>
<p>When most people think about visualization, they think first of <a href="http://www.edwardtufte.com/tufte/">Edward Tufte</a>. &nbsp;Tufte emphasizes integrity to the data, showing relationships between phenomena, and above all else aesthetic minimalism. &nbsp;I appreciate his ruthless crusade against <a href="http://chartjunk.karmanaut.com/">chart junk</a>&nbsp;and <a href="http://jakeporway.com/2011/08/data-without-borders-logo-contest/">pie charts (nice quote from Data without Borders)</a>. We share an affinity for multipanel plotting approaches, which he calls “small multiples,” (thanks to <a href="http://www.stanford.edu/~rjweiss/">Rebecca Weiss</a> for pointing this out) though I think people give Tufte too much credit for their invention: both <a href="http://www.juiceanalytics.com/writing/better-know-visualization-small-multiples/">juiceanalytics</a> and <a href="http://www.infovis-wiki.net/index.php/Small_Multiples">infovis-wiki</a> write that Cleveland introduced the concept/principle. However, both Cleveland and Tufte published books in 1983 discussing the use of multipanel displays; <a href="http://blog.revolutionanalytics.com/2011/11/small-multiples-of-the-sky.html">David Smith over at Revolutions</a> writes that “the ‘small-multiples’ principle of data visualization [was] pioneered by Cleveland and popularized in Tufte’s first book”; and the earliest reference to a work containing multipanel displays I could find was published <em>long</em> before Tufte’s 1983 work: Seder, Leonard (1950), “Diagnosis with Diagrams—Part I”, Industrial Quality Control (New York, New York: American Society for Quality Control) 7 (1): 11–19.</p>
<iframe src="https://giphy.com/embed/11JbaLzOXsg6Fq" width="200" frameborder="0" class="giphy-embed" align="right">
</iframe>
<p>I’m less sure about Tufte’s advice to always show axes starting at zero, which can make comparison between two groups difficult, and to “show causality,” which can end up misleading your readers. &nbsp;Of course, the visualizations on display in the glossy pages of Tufte’s books are beautiful. &nbsp;But while his books are full of general advice that we should all keep in mind when creating plots, he does not put forth a theory of what works and what doesn’t when trying to visualize data.</p>
<p>Cleveland (with Robert McGill) develops such a theory and subjects it to rigorous scientific testing. In my last post I linked to one of&nbsp;Cleveland’s studies showing that&nbsp;<a href="https://www.cs.ubc.ca/~tmm/courses/cpsc533c-04-spr/readings/cleveland.pdf">dots (or bars) aligned on the same scale are indeed the best visualization to convey a series of numerical estimates</a>.&nbsp; In this work, Cleveland examined how accurately our visual system can process visual elements or “perceptual units” representing underlying data.&nbsp; These elements include markers aligned on the same scale (e.g., dot plots, scatterplots, ordinary&nbsp;bar charts), the length of lines that are not aligned on the same scale (e.g., stacked bar plots), area (pie charts and mosaic plots), angles (also pie charts), shading/color, volume, curvature, and direction.</p>
<p><img src="https://solomonmg.github.io/img/graphicalelementscleveland.png" title="Graphical Elements (Cleveland)" class="img-fluid"></p>
<p>He runs two experiments: the first compares judgements about relative position&nbsp;(grouped bar charts)&nbsp;to judgements based only on length (stacked bar charts); the second compares judgements about relative position (ordinary bar charts) to judgements about angles/area (pie charts). &nbsp;Here are the materials he uses, courtesy of the <a href="http://graphics.stanford.edu/">Stanford Computer Graphics Lab</a>:</p>
<p><img src="https://solomonmg.github.io/img/slide012.png" title="Graphical perception experiments" class="img-fluid"></p>
<p><img src="https://solomonmg.github.io/img/slide013.png" title="Graphical perception experiments" class="img-fluid"></p>
<p>The results are resoundingly clear: judgements about position relative to a baseline are dramatically more accurate than judgements about angles, area, or length (with no baseline).&nbsp; Hence, he suggests that we replace pie charts with bar charts or dot plots and that we replace stacked bar charts with grouped bar charts.</p>
<p>A striking and often overlooked finding in this work is the fact that the group of participants without technical training, “mostly ordinary housewives” as Cleveland describes them, performed <em>just as well</em> as the group of mostly men with substantial technical training and experience. &nbsp; This finding provides evidence for something that I’ve long suspected: that visualizations make it easier for people lacking quantitative experience to understand your results, serving to level the playing field.&nbsp; If you want your findings to be broadly accessible, it’s probably better to present a visualization rather than a bunch of numbers.&nbsp; It also suggests that if someone is having trouble interpreting your visualizations, it’s probably your fault.</p>
<p><strong>Dotplots versus pie charts and stacked barplots</strong></p>
<p>Now let’s put this to the test. &nbsp;Take a look at each visualization below for two seconds, looking for the percent of the vote that Mitt Romney, Ron Paul, and Jon Huntsman got.</p>
<p><a href="../../img/primarydot2.png"><img src="https://solomonmg.github.io/img/primarydot2.png" title="primaryDot" class="img-fluid"></a><a href="../../img/primarypie1.png"><img src="https://solomonmg.github.io/img/primarypie1.png" title="primaryPie" class="img-fluid"></a><a href="../../img/primarystacked.png"><img src="https://solomonmg.github.io/img/primarystacked.png" title="primaryStacked" class="img-fluid"></a></p>
<p>Which is easiest to read? Which conveys information most accurately? Let’s first take a look at the most critical information–the order in which the candidates placed. &nbsp;In all plots, the candidates are arrayed in order from highest to lowest vote share, and it’s&nbsp;easy to see that Mitt won. &nbsp;But once we start looking at who came in second, third, and so on, differences emerge. &nbsp;It’s slightly harder to process order in the pie chart because your eye has to go around the plot rather than up and down in a straight line. &nbsp;In the stacked bar chart, we need to look up which color corresponds to which candidate in the legend (which Tufte told us not to use), adding a layer of cognitive processing.</p>
<p>Second, which conveys estimates most accurately? The dot plot is the clear winner here. &nbsp;We can quickly see that Romney got about 37%, Paul got about 24%, and Huntsman got about 16%, just by looking at dots relative to the axis. &nbsp;When we look at the pie chart, it’s really tough to estimate the exact percent each candidate got. &nbsp;Same with the stacked bar chart. We could add numbers to the pie and bar charts, which would even things out to some extent, but then why not just display a table with exact percents?</p>
<p>One argument I used to hear all the time&nbsp;when I worked in industry is that pie charts “convey a sense of proportion.” &nbsp;Well, sure, I guess I can kind of guesstimate that Ron Paul’s vote share is about 1/4. &nbsp;What about Jon Huntsman? Hmm, it looks like about 15 percent, which is 3/20. &nbsp;But wait, why do I want to convert things into fractions anyway? I don’t think in terms of fractions, I think in terms of percents. &nbsp;And if I really care about proportion, I suppose I could extend the axis from 0 to 100.</p>
<p>Suppose I want to plot results for the top 15 candidates, not just the top 6? &nbsp;Here’s what happens:</p>
<p><a href="../../img/primarydot15.png"><img src="https://solomonmg.github.io/img/primarydot15.png" title="primaryDot15" class="img-fluid"></a><a href="../../img/primarypie15.png"><img src="https://solomonmg.github.io/img/primarypie15.png" title="primaryPie15" class="img-fluid"></a></p>
<p>No contest, the pie chart fails completely. &nbsp;We’d need to add a legend with colors for each candidate, which adds another layer of cognitive processing–we’d need to look up each color in the legend as we go. &nbsp;And even after adding the legend, you wouldn’t be able to distinguish the lower performing candidates from, say, write-in votes because the pie slices would be too small. &nbsp;The stacked bar chart will fail for the same reasons, so I’ve excluded it in the interest of brevity. &nbsp;Note that we don’t need to add colors to the dotplot to convey the same information, which saves an extra plotting element that we can use to represent something else (say a candidate’s campaign funds or total assets). &nbsp;And, on top of it all, the dot plot takes up less screen/page real estate!</p>
<p>Why do I use dot plots instead of ordinary bar charts? A <a href="http://www.perceptualedge.com/articles/b-eye/encoding_values_in_graph.pdf">nice visualization guide from perceptualedge.com</a>&nbsp;points out that often we want to only visualize differences between groups in a narrow range (they use an example wherein monthly expenses vary from $4,250-$5,500).&nbsp;But the length of a bar is supposed to facilitate accurate comparisons between values, so when you use a bar plot starting from $4,250, the difference in bar lengths dramatically exaggerates the actual differences. Dot plots do not have this problem because dots encode values using only location, so one must reference the axis to interpret the value.</p>
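<p>To get a feel for how large the distortion can be, here is a minimal sketch in base R. The two expense values are the endpoints of the range quoted above; the $4,000 baseline is an assumption chosen purely for illustration (it is not from the guide), so that both bars have nonzero length.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># lowest and highest monthly expenses from the range quoted above
expenses &lt;- c(low = 4250, high = 5500)

# hypothetical axis baseline (an assumption for illustration only)
baseline &lt;- 4000

# ratio of the actual values
true_ratio &lt;- unname(expenses["high"] / expenses["low"])

# ratio of the drawn bar lengths when the bars start at the baseline
bar_ratio &lt;- unname((expenses["high"] - baseline) / (expenses["low"] - baseline))

round(true_ratio, 2)  # 1.29
bar_ratio             # 6</code></pre></div>
<p>Under these assumptions the taller bar is drawn six times as long as the shorter one, even though the underlying value is only about 29 percent larger.</p>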
<p>A related point is that bars are often used to convey counts–we use them in histograms to represent frequency and to track, say, counts of dollars earned/raised in bar charts. &nbsp;In fact, a team of doctors I work with at the med school recently sent in a manuscript to Radiology containing a bar chart plotting mean values between groups; they got back the following comment from the statistical reviewer: “the y-axis is quantitative but the data are represented using bars as if the data were counts.” &nbsp;People often use bar plots to convey estimates of means (and <a href="http://www.stanford.edu/~messing/APSAPoster.pdf">I’ve certainly done this</a>), which can serve to exaggerate differences in means and hence effect sizes if you do not plot the bars from zero.</p>
<p>In addition, dot plots have aesthetic advantages. &nbsp;They convey the numerical estimate in question with a single one-dimensional point, rather than a two dimensional bar. &nbsp;There’s simply less that the eye needs to process. &nbsp;Accordingly, if a pattern across qualitative groupings exists, it’s often easier to see with a dot plot. &nbsp;For example, below I plot the average user ratings for each article to which <a href="http://www.stanford.edu/~seanjw/">Sean Westwood</a> and I exposed subjects in a news reading experiment. &nbsp;The pattern that emerges is an “S” curve in which one or two stories dominate the ratings, most are sort of average, and a few are uniformly terrible. Note that you’d probably want to use something like this more for yourself than to communicate your results to others as it might overload your audience with too much information–you’d do better to select a subset of these articles or remove some of the ones in the middle (thanks to Yph Lelkes for making this point).</p>
<p><a href="../../img/dotplot-story-rating.png"><img src="https://solomonmg.github.io/img/dotplot-story-rating.png" title="dotplot-story-rating" class="img-fluid"></a></p>
<p>One question that remains is if pie charts are so bad, why are they so common? Perhaps we like them because we find them comforting just as we find pies and pizza? Well if so we’d expect pie charts to be less common in places like Japan and China where people grow up eating different food. &nbsp;Consider info-graphics in newspapers: I haven’t yet done a systematic content analysis, but I was unable to find a single pie chart in Japan’s Yomiuri Shimbun nor the Asahi Shimbun; nor in China’s Beijing Daily nor Sing Tao Daily. &nbsp;I did see plenty of maps, however, which I suppose one could argue are reminiscent of noodles.</p>
<p><strong>Implementation</strong></p>
<p>The most efficient way to produce solid visualizations with the ability to implement multiple panels, proper standard error estimates, and dot plots is probably in R using the ggplot2 package. &nbsp;If you do not have time to learn R and remain tied to MS-Excel, stick to ordinary barplots to visualize quantitative variables among multiple groups (not recommended).</p>
<p>Otherwise, if you don’t already use it,&nbsp;<a href="http://cran.r-project.org/">download R</a>&nbsp;and a decent editor like <a href="http://rstudio.org/">RStudio</a>. &nbsp;Then get started with ggplot2 and dot plots by running the following code chunk, which will replicate the election figure above:</p>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb1-1">pres <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">read.csv</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"https://SolomonMg.github.io/img/primaryres.csv"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">as.is=</span>T)</span>
<span id="cb1-2"></span>
<span id="cb1-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># sort data in order of percent of vote:</span></span>
<span id="cb1-4">pres <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> pres[<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">order</span>(pres<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>Percentage, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">decreasing=</span>T), ]</span>
<span id="cb1-5"></span>
<span id="cb1-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># only show top 15 candidates:</span></span>
<span id="cb1-7">pres <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> pres[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span>,]</span>
<span id="cb1-8"></span>
<span id="cb1-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># convert the proportion to a percentage:</span></span>
<span id="cb1-10">pres<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>Percentage <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> pres<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>Percentage<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span></span>
<span id="cb1-11"></span>
<span id="cb1-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># reorder the Candidate factor by percentage for plotting purposes:</span></span>
<span id="cb1-13">pres<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>Candidate <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">reorder</span>(pres<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>Candidate, pres<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>Percentage)</span>
<span id="cb1-14"></span>
<span id="cb1-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># To install ggplot2, run the following line after deleting the #</span></span>
<span id="cb1-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#install.packages("ggplot2")</span></span>
<span id="cb1-17"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(ggplot2)</span>
<span id="cb1-18"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(pres, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> Percentage, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">factor</span>(Candidate) )) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb1-19"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_point</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb1-20"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">xlab</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Percent of Vote"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ylab</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Candidate"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb1-21"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggtitle</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"New Hampshire Primary 2012"</span>)</span></code></pre></div>
<p>After loading our data and running a few preliminary data processing operations, we pass ggplot our data set, “pres,” then we tell it what aesthetic elements we want to use, in this case that x is going to be our “Percentage” variable and y is going to be our “Candidate” variable. We tell ggplot that we want to display a point for every xy pair and to use the black and white theme. Then we tell it what to label the x and y axes, and give it a title.</p>
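<p>If you do care about conveying proportion, as discussed earlier, one option is to extend the axis from 0 to 100. Here is a minimal sketch, assuming a small hand-typed data frame (called <code>toy</code> here, a name chosen for illustration) with the approximate percentages read off the dot plot rather than the exact values in the CSV:</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r">library(ggplot2)

# approximate vote shares read off the dot plot (illustrative, not exact)
toy &lt;- data.frame(
  Candidate  = c("Romney", "Paul", "Huntsman"),
  Percentage = c(37, 24, 16)
)

# order candidates by vote share so the leader appears at the top
toy$Candidate &lt;- reorder(toy$Candidate, toy$Percentage)

p &lt;- ggplot(toy, aes(x = Percentage, y = Candidate)) +
  geom_point() +
  scale_x_continuous(limits = c(0, 100)) +  # show the full 0-100 range
  theme_bw() + xlab("Percent of Vote") + ylab("Candidate")
p</code></pre></div>
<p>The trade-off is that a full 0 to 100 axis compresses the differences between candidates, so whether it helps depends on what you want the reader to compare.</p>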
<p>We can also reproduce the article ratings by story plot above using ggplot2 (even though I originally produced the plot using the lattice package).</p>
<div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb2-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># To install ggplot2, run the following line after deleting the #</span></span>
<span id="cb2-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#install.packages("ggplot2")</span></span>
<span id="cb2-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(ggplot2)</span>
<span id="cb2-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">load</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">file</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"https://SolomonMg.github.io/img/db.Rda"</span>))</span>
<span id="cb2-5"></span>
<span id="cb2-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># if you haven't installed dplyr, delete the # and run this line:</span></span>
<span id="cb2-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># install.packages("dplyr")</span></span>
<span id="cb2-8"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(dplyr)</span>
<span id="cb2-9"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">table</span>(db<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>story)</span>
<span id="cb2-10"></span>
<span id="cb2-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># first we use dplyr to calculate the mean rating and SE for each story</span></span>
<span id="cb2-12">ratingdat <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> db <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">group_by</span>(story) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb2-13"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summarise</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">M =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(rating, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.rm=</span>T),</span>
<span id="cb2-14"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">SE =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sd</span>(rating, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.rm=</span>T)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sqrt</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">length</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">na.omit</span>(rating))),</span>
<span id="cb2-15"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">N =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">length</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">na.omit</span>(rating)))</span>
<span id="cb2-16"></span>
<span id="cb2-17"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># make story into an ordered factor, ordering by mean rating:</span></span>
<span id="cb2-18">ratingdat<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>story <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">factor</span>(ratingdat<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>story)</span>
<span id="cb2-19">ratingdat<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>story <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">reorder</span>(ratingdat<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>story, ratingdat<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>M)</span>
<span id="cb2-20"></span>
<span id="cb2-21"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># take a look at our handiwork:</span></span>
<span id="cb2-22"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(ratingdat, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> M, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">xmin =</span> M<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>SE, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">xmax =</span> M<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>SE, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> story )) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb2-23"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_point</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_segment</span>( <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> M<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>SE, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">xend =</span> M<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>SE,</span>
<span id="cb2-24"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> story, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">yend=</span>story)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb2-25"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">xlab</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Mean rating"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ylab</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Story"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb2-26"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggtitle</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Rating article by Story, with SE"</span>)</span>
<span id="cb2-27"></span>
<span id="cb2-28"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Now save</span></span>
<span id="cb2-29"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggsave</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">file=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"plots/dotplot-story-rating.pdf"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">height=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">14</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">width=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">8.5</span>)</span></code></pre></div>
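<p>For reference, the SE column computed in the code above is the sample standard deviation divided by the square root of the number of non-missing ratings. A minimal base R sketch of that calculation, using a made-up vector of ratings (hypothetical values, not data from the experiment):</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># hypothetical ratings for a single story, with one missing value
rating &lt;- c(4, 5, 3, NA, 4)

n  &lt;- length(na.omit(rating))             # number of non-missing ratings
m  &lt;- mean(rating, na.rm = TRUE)          # mean rating
se &lt;- sd(rating, na.rm = TRUE) / sqrt(n)  # standard error of the mean

c(N = n, M = m, SE = se)</code></pre></div>
<p>Dividing by the count of non-missing values, rather than the length of the whole vector, keeps the standard error from being understated when some subjects skipped a rating.</p>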



 ]]></description>
  <guid>https://solomonmg.github.io/blog/visualization-series-insight-from-cleveland-and-tufte-on-plotting-numeric-data-by-groups/</guid>
  <pubDate>Sun, 04 Mar 2012 00:00:00 GMT</pubDate>
  <media:content url="https://solomonmg.github.io/blog/visualization-series-insight-from-cleveland-and-tufte-on-plotting-numeric-data-by-groups/featured.png" medium="image" type="image/png" height="186" width="144"/>
</item>
<item>
  <title>Working with Bipartite/Affiliation Network Data in R</title>
  <dc:creator>Sol Messing</dc:creator>
  <link>https://solomonmg.github.io/blog/working-with-bipartite-affiliation-network-data-in-r/</link>
  <description><![CDATA[ 





<p>Data can often be usefully conceptualized in terms of affiliations between people (or other key data entities). It might be useful to analyze common group membership, common purchasing decisions, or common patterns of behavior. This post introduces bipartite/affiliation network data and provides R code to help you process and visualize this kind of data. I recently updated this for use with larger data sets, though I put it together a while back.</p>
<section id="preliminaries" class="level2">
<h2 class="anchored" data-anchor-id="preliminaries">Preliminaries</h2>
<p>Much of the material here is covered in the more comprehensive&nbsp;<a href="http://sna.stanford.edu/rlabs.php">“Social Network Analysis Labs in R and SoNIA,”</a>&nbsp;on which I collaborated with Dan McFarland, Sean Westwood and Mike Nowak. For a great online introduction to social network analysis see the online book&nbsp;<a href="http://www.faculty.ucr.edu/~hanneman/nettext/">Introduction to Social Network Methods</a>&nbsp;by Robert Hanneman and Mark Riddle.</p>
</section>
<section id="bipartiteaffiliation-network-data" class="level2">
<h2 class="anchored" data-anchor-id="bipartiteaffiliation-network-data">Bipartite/Affiliation Network Data</h2>
<p>A network can consist of different ‘classes’ of nodes. For example, a two-mode network might consist of people (the first mode) and groups in which they are members (the second mode). Another very common example of two-mode network data consists of users on a particular website who communicate in the same forum thread. Here’s a short example of this kind of data. Run this in R for yourself - just copy and paste it into the command line or into a script and it will generate a dataframe that we can use for illustrative purposes:</p>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb1-1">df <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>( </span>
<span id="cb1-2">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">person =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Sam'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Sam'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Sam'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Greg'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Tom'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Tom'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Tom'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Mary'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Mary'</span>), </span>
<span id="cb1-3">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">group =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'a'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'b'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'c'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'a'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'b'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'c'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'d'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'b'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'d'</span>), </span>
<span id="cb1-4">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">stringsAsFactors =</span> F)</span>
<span id="cb1-5"></span>
<span id="cb1-6">df</span>
<span id="cb1-7"></span>
<span id="cb1-8">person group</span>
<span id="cb1-9"><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>&nbsp;&nbsp;&nbsp; Sam&nbsp;&nbsp;&nbsp;&nbsp; a</span>
<span id="cb1-10"><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>&nbsp;&nbsp;&nbsp; Sam&nbsp;&nbsp;&nbsp;&nbsp; b</span>
<span id="cb1-11"><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>&nbsp;&nbsp;&nbsp; Sam&nbsp;&nbsp;&nbsp;&nbsp; c</span>
<span id="cb1-12"><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>&nbsp;&nbsp; Greg&nbsp;&nbsp;&nbsp;&nbsp; a</span>
<span id="cb1-13"><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>&nbsp;&nbsp;&nbsp; Tom&nbsp;&nbsp;&nbsp;&nbsp; b</span>
<span id="cb1-14"><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>&nbsp;&nbsp;&nbsp; Tom&nbsp;&nbsp;&nbsp;&nbsp; c</span>
<span id="cb1-15"><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span>&nbsp;&nbsp;&nbsp; Tom&nbsp;&nbsp;&nbsp;&nbsp; d</span>
<span id="cb1-16"><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>&nbsp;&nbsp; Mary&nbsp;&nbsp;&nbsp;&nbsp; b</span>
<span id="cb1-17"><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9</span>&nbsp;&nbsp; Mary&nbsp;&nbsp;&nbsp;&nbsp; d</span></code></pre></div>
</section>
<section id="c1" class="level2">
<h2 class="anchored" data-anchor-id="c1">Fast, efficient two-mode to one-mode conversion in R</h2>
<p>Suppose&nbsp;we wish to analyze or visualize how the people are connected directly - that is, what if we want the network of people where a tie between two people is present if they are both members of the same group? We need to perform a two-mode to one-mode conversion.</p>
<p>To convert a two-mode incidence matrix to a one-mode adjacency matrix, one can simply multiply an incidence matrix by its transpose, which sums the common 1’s between rows. Recall that matrix multiplication entails multiplying the k-th entry of a row in the first matrix by the k-th entry of a column in the second matrix, then summing, such that the ij-th row-column entry in the resulting matrix represents the dot-product of the i-th row of the first matrix and the j-th column of the second. In mathematical notation:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AAB%20=%20%5Cleft%20%5B%0A%20%20%5Cbegin%7Barray%7D%7Bcc%7D%0A%20%20%20%20%20%20a%20&amp;%20b%20%20%5C%5C%5C%5C%5C%5C%0A%20%20%20%20%20%20c%20&amp;%20d%0A%20%20%5Cend%7Barray%7D%20%5Cright%20%5D%0A%20%20%5Cleft%20%5B%20%5Cbegin%7Barray%7D%7Bcc%7D%0A%20%20%20%20%20%20e%20&amp;%20f%20%20%5C%5C%5C%5C%5C%5C%0A%20%20%20%20%20%20g%20&amp;%20h%0A%20%20%5Cend%7Barray%7D%20%5Cright%20%5D%20=%0A%20%20%5Cleft%20%5B%0A%20%20%5Cbegin%7Barray%7D%7Bcc%7D%0A%20%20%20%20ae+bg%20&amp;%20af+bh%20%5C%5C%5C%5C%5C%5C%0A%20%20%20%20ce+dg%20&amp;%20cf+dh%0A%20%20%5Cend%7Barray%7D%0A%20%20%5Cright%20%5D%0A"></p>
<p>Notice further that multiplying a matrix by its transpose yields the following:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0AAA'%20=%0A%20%20%5Cleft%5B%0A%20%20%5Cbegin%7Barray%7D%7Bcc%7D%0A%20%20%20%20%20%20a%20&amp;%20b%20%20%5C%5C%5C%5C%5C%5C%0A%20%20%20%20%20%20c%20&amp;%20d%0A%20%20%5Cend%7Barray%7D%0A%20%20%5Cright%5D%0A%20%20%5Cleft%5B%0A%20%20%5Cbegin%7Barray%7D%7Bcc%7D%0A%20%20%20%20%20%20a%20&amp;%20c%20%20%5C%5C%5C%5C%5C%5C%0A%20%20%20%20%20%20b%20&amp;%20d%0A%20%20%5Cend%7Barray%7D%0A%20%20%5Cright%5D%20=%0A%20%20%5Cleft%5B%0A%20%20%5Cbegin%7Barray%7D%7Bcc%7D%0A%20%20%20%20aa+bb%20&amp;%20ac+bd%20%5C%5C%5C%5C%5C%5C%0A%20%20%20%20ca+db%20&amp;%20cc+dd%0A%20%20%5Cend%7Barray%7D%0A%20%20%5Cright%5D%0A%5Cend%7Balign%7D%0A"></p>
<p>Because our incidence matrix consists of 0’s and 1’s, the off-diagonal entries represent the total number of common columns, which is exactly what we wanted. We’ll use the <code>%*%</code> operator to tell R to do exactly this. Let’s take a look at a small example using toy data of people and groups to which they belong. We’ll coerce the data to an incidence matrix, then multiply the incidence matrix by its transpose to get the number of common groups between people.</p>
<p>This is easy to do using the matrix algebra functions included in R. But first, you need to restructure your (edgelist) network data as an incidence matrix. An incidence matrix records a 1 for row-column combinations where a tie is present and 0 otherwise. One easy way to do this in R is to use the table function and then coerce the table object to a matrix object:</p>
<div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb2-1">m&nbsp;<span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">table</span>( df )</span>
<span id="cb2-2">M <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.matrix</span>(&nbsp;m )</span></code></pre></div>
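<p>To make this concrete, here is a minimal sketch on a made-up edgelist. The names and memberships in <code>df</code> are an assumption on our part, chosen only so that the counts are easy to check by hand.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical edgelist: one row per (person, group) membership.
df &lt;- data.frame(
  person = c("Greg", "Mary", "Mary", "Sam", "Sam", "Sam", "Tom", "Tom", "Tom"),
  group  = c("a",    "b",    "d",    "a",   "b",   "c",   "b",   "c",   "d")
)

# Cross-tabulate into a 0/1 incidence matrix (people in rows, groups in columns).
M &lt;- as.matrix(table(df))

# People-by-people co-membership counts.
Mrow &lt;- M %*% t(M)
Mrow["Mary", "Tom"]   # groups Mary and Tom share (b and d)
diag(Mrow)            # number of groups each person belongs to</code></pre></div>
<p>The diagonal counts each person's own memberships, so it is usually ignored or zeroed out before treating the result as a network.</p>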
<p>If you are using the network or sna packages, a network object&nbsp;can be coerced via&nbsp;<code>as.matrix(your-network)</code>; with the igraph package use&nbsp;<code>get.adjacency(your-network)</code>.</p>
<p>This is great, but what about if we are working with a really large data set? Network data is almost always sparse—there are far more pairwise combinations of potential connections than actual observed connections. Hence, we’d actually prefer to keep the underlying data structured in edgelist format, but we’d also like access to R’s matrix algebra functionality.</p>
<p>We can get the best of both worlds using the Matrix library to construct a sparse triplet representation of a matrix. But we’d also like to avoid building the entire incidence matrix and just feed Matrix our edgelist directly, a point that came up in a recent conversation I had with <a href="http://seanjtaylor.com/">Sean Taylor</a>. We feed <code>Matrix</code> our ‘person’ column to index ‘i’ (rows in the new incidence matrix), our ‘group’ column to index j (columns in the new incidence matrix), and we repeat ‘1’ for the length of the edgelist to denote an incidence.</p>
<div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb3-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Matrix'</span>)</span>
<span id="cb3-2">A <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">spMatrix</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">nrow=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">length</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">unique</span>(df<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>person)),</span>
<span id="cb3-3"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ncol=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">length</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">unique</span>(df<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>group)),</span>
<span id="cb3-4"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">i =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.numeric</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">factor</span>(df<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>person)),</span>
<span id="cb3-5"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">j =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.numeric</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">factor</span>(df<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>group)),</span>
<span id="cb3-6"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rep</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">length</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.numeric</span>(df<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>person))) )</span>
<span id="cb3-7"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">row.names</span>(A) <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">levels</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">factor</span>(df<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>person))</span>
<span id="cb3-8"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">colnames</span>(A) <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">levels</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">factor</span>(df<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>group))</span>
<span id="cb3-9">A</span></code></pre></div>
<p>We will either convert to the ‘mode’ represented by the columns or by the rows. To get the one-mode representation of ties between rows (people in our example), multiply the matrix by its transpose. Note that you must use the matrix-multiplication operator&nbsp;<code>%*%</code>&nbsp;rather than a simple asterisk, which would multiply element by element. The R code is:</p>
<div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb4-1">Arow <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> A <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%*%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">t</span>(A)</span></code></pre></div>
<p>But we can still do better! The function tcrossprod is faster and more efficient for this:</p>
<div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb5-1">Arow <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tcrossprod</span>(A)</span></code></pre></div>
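<p>If you want to convince yourself that the two routes agree, here is a small self-contained check. It is only a sketch: the edgelist is hypothetical, and we compare dense copies because the two calls can return different sparse classes.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r">library(Matrix)

# Hypothetical edgelist, stored as factors so codes and labels stay aligned.
person &lt;- factor(c("Greg", "Mary", "Mary", "Sam", "Sam", "Sam", "Tom", "Tom", "Tom"))
group  &lt;- factor(c("a", "b", "d", "a", "b", "c", "b", "c", "d"))

# Sparse incidence matrix built straight from the edgelist.
A &lt;- spMatrix(nrow = nlevels(person), ncol = nlevels(group),
              i = as.integer(person), j = as.integer(group),
              x = rep(1, length(person)))
dimnames(A) &lt;- list(levels(person), levels(group))

# Both routes should give the same person-by-person counts.
via_product    &lt;- as.matrix(A %*% t(A))
via_tcrossprod &lt;- as.matrix(tcrossprod(A))
all.equal(via_product, via_tcrossprod, check.attributes = FALSE)</code></pre></div>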
<p>Arow will now represent the one-mode matrix formed by the row entities—people will have ties to each other if they are in the same group, in our example. Here’s what it looks like:</p>
<pre><code>Arow
4 x 4 sparse Matrix of class "dgCMatrix"
     Greg Mary Sam Tom
Greg    1    .   1   .
Mary    .    2   1   2
Sam     1    1   3   2
Tom     .    2   2   3</code></pre>
<p>To get the one-mode matrix formed by the column entities (i.e.&nbsp;the number of people each pair of groups has in common), enter the following command:</p>
<div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb7-1">Acol <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">t</span>(A) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%*%</span> A</span></code></pre></div>
<p>Again, we can use tcrossprod to make this even more efficient:</p>
<div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb8-1">Acol <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tcrossprod</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">t</span>(A))</span></code></pre></div>
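<p>As an aside, <code>crossprod(A)</code> computes the same product, <code>t(A) %*% A</code>, without the explicit transpose. Here is a quick sketch on a tiny made-up incidence matrix (the values are ours, not from the data above):</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r">library(Matrix)

# Hypothetical incidence matrix: 3 people (rows) by 2 groups (columns).
A &lt;- Matrix(c(1, 0,
              1, 1,
              0, 1), nrow = 3, byrow = TRUE, sparse = TRUE)

# Group-by-group counts, computed two ways.
via_crossprod  &lt;- as.matrix(crossprod(A))
via_tcrossprod &lt;- as.matrix(tcrossprod(t(A)))
same &lt;- all.equal(via_crossprod, via_tcrossprod, check.attributes = FALSE)
same</code></pre></div>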
<p>And the resulting co-membership matrix is as follows:</p>
<pre><code>Mcol
group
group a b c d
a 2 1 1 0
b 1 3 2 2
c 1 2 2 1
d 0 2 1 2</code></pre>
<p>Although we’ve used a very small network for our example, this code&nbsp;is highly extensible to the analysis of larger networks with&nbsp;R.</p>
</section>
<section id="c1" class="level2">
<h2 class="anchored" data-anchor-id="c1">Analysis of Two Mode Data and Mobility</h2>
<p>Let’s work with some actual affiliation data, collected by Dan McFarland on student extracurricular affiliations. It’s a longitudinal data set with&nbsp;3 waves: 1996, 1997, and 1998. It consists of students (anonymized) and the student organizations in which they are members (e.g.&nbsp;National Honor Society, wrestling team, cheerleading squad, etc.). We’ll read in the data, explore it, make a few two-to-one mode conversions, and visualize it.</p>
<div class="sourceCode" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb10-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Load the 'igraph' library</span></span>
<span id="cb10-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'igraph'</span>)</span>
<span id="cb10-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># (1) Read in the data files, NA data objects coded as 'na'</span></span>
<span id="cb10-4">magact96 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">read.delim</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'https://solomonmg.github.io/assets/img/mag_act96.txt'</span>,</span>
<span id="cb10-5"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.strings =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'na'</span>)</span>
<span id="cb10-6">magact97 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">read.delim</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'https://solomonmg.github.io/assets/img/mag_act97.txt'</span>,</span>
<span id="cb10-7"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.strings =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'na'</span>)</span>
<span id="cb10-8">magact98 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">read.delim</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'https://solomonmg.github.io/assets/img/mag_act98.txt'</span>,</span>
<span id="cb10-9"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.strings =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'na'</span>)</span></code></pre></div>
<p>Missing data is coded as “na” in this data, which is why we gave R the command&nbsp;na.strings = “na”.</p>
<p>These files consist of four columns of individual-level attributes (ID, gender, grade, race), then a bunch of group membership dummy variables (coded “1” for membership, “0” for no membership). &nbsp;We need to set aside the first four columns (which do not change from year to year).</p>
<div class="sourceCode" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb11-1">magattrib <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> magact96[,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>]</span>
<span id="cb11-2">g96 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.matrix</span>(magact96[,<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>)]); <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">row.names</span>(g96) <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> magact96<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>ID.</span>
<span id="cb11-3">g97 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.matrix</span>(magact97[,<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>)]); <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">row.names</span>(g97) <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> magact97<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>ID.</span>
<span id="cb11-4">g98 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.matrix</span>(magact98[,<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>)]); <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">row.names</span>(g98) <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> magact98<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>ID.</span></code></pre></div>
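<p>Here is the same split on a made-up stand-in for one wave, in case you want to see the shapes involved without downloading anything. Apart from <code>ID.</code>, the column names and all values are assumptions for illustration.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical wave: four attribute columns, then 0/1 membership dummies.
wave &lt;- data.frame(
  ID.    = c(101, 102, 103),
  GENDER = c("f", "m", "f"),
  GRADE  = c(9, 10, 11),
  RACE   = c(1, 2, 1),
  Chess  = c(1, 0, 0),
  Band   = c(1, 1, 0)
)

attrib &lt;- wave[, 1:4]               # individual-level attributes
g &lt;- as.matrix(wave[, -(1:4)])      # everything else: the incidence matrix
row.names(g) &lt;- wave$ID.            # label rows by student ID

dim(g)   # students by groups</code></pre></div>
<p>Note that the result has one row per student and one column per group, so it will generally be rectangular.</p>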
<p>By using the <code>[,-(1:4)]</code> index, we drop those columns so that&nbsp;we have an incidence matrix (students in rows, groups in columns) for each year, and then tell R to set the row names of the matrix to the student’s ID. Note that we need to keep the “.” after ID in this dataset (because it’s in the name of the variable). Now we load these two-mode matrices into igraph:</p>
<div class="sourceCode" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb12-1">i96 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">graph.incidence</span>(g96, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mode=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'all'</span>) )</span>
<span id="cb12-2">i97 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">graph.incidence</span>(g97, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mode=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'all'</span>) )</span>
<span id="cb12-3">i98 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">graph.incidence</span>(g98, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mode=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'all'</span>) )</span></code></pre></div>
<section id="plotting-two-mode-networks" class="level3">
<h3 class="anchored" data-anchor-id="plotting-two-mode-networks">Plotting two-mode networks</h3>
<p>Now, let’s plot these graphs. The igraph package has excellent plotting functionality that allows you to assign visual attributes to igraph objects before you plot. The alternative is to pass 20 or so arguments to the <code>plot.igraph()</code> function, which gets really messy.</p>
<p>Let’s assign some attributes to our graph. First we set vertex attributes, making sure to make them slightly transparent by altering the alpha (opacity) value, using the&nbsp;<code>rgb(r,g,b,alpha)</code>&nbsp;function to set the color. This makes it much easier to look at a really crowded graph, which might look like a giant hairball otherwise. You can read up on the RGB color model&nbsp;<a href="http://en.wikipedia.org/wiki/RGB_color_model">here</a>.</p>
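<p>A quick way to see what the fourth argument does is to build a color and take it apart again. This is just a base-R sketch; the variable name is ours.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Half-transparent red: full red channel, no green or blue, opacity 0.5.
semi_red &lt;- rgb(1, 0, 0, 0.5)
semi_red                          # a hex string with a trailing opacity byte

# Decompose it back into 0-255 channel values, opacity included.
channels &lt;- col2rgb(semi_red, alpha = TRUE)
channels</code></pre></div>
<p>An opacity of 1 is fully opaque and 0 is invisible, so values around .1 to .5 work well when many nodes or edges overlap.</p>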
<p>Each node (or “vertex”) object is accessible by calling&nbsp;<code>V(g)</code>, and you can call (or create) a node attribute by using the <code>$</code> operator so that you call&nbsp;<code>V(g)$attribute</code>. Here’s how to set the color attribute for a set of nodes in a graph object:</p>
<div class="sourceCode" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb13-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">V</span>(i96)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>color[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1295</span>] <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rgb</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,.<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span>
<span id="cb13-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">V</span>(i96)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>color[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1296</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1386</span>] <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rgb</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,.<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span></code></pre></div>
<p>Notice that we index the&nbsp;<code>V(g)$color</code>&nbsp;object by a seemingly arbitrary value, 1295.&nbsp; This marks the end of the student nodes, and 1296 is the first group node. You can view which nodes are which by typing&nbsp;V(i96). R prints out a list of all the nodes in the graph, and those with a number are obviously different from those that consist of a group name.</p>
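<p>Hard-coded positions like 1295 shift as soon as vertices are removed, so it can be safer to pick out group nodes by name instead. Below is a minimal sketch on a tiny made-up graph. We build it with <code>graph_from_data_frame()</code> rather than from an incidence matrix, purely to keep the example self-contained, and use <code>delete_vertices()</code>, the underscore spelling of <code>delete.vertices()</code>; the student IDs and group names are hypothetical.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r">library(igraph)

# Tiny hypothetical incidence matrix: student 103 belongs to no group.
m &lt;- matrix(c(1, 0,
              1, 1,
              0, 0), nrow = 3, byrow = TRUE,
            dimnames = list(c("101", "102", "103"), c("Chess", "Band")))

# Build the two-mode graph from the implied edgelist, isolates included.
hits  &lt;- which(m == 1, arr.ind = TRUE)
edges &lt;- data.frame(from = rownames(m)[hits[, "row"]],
                    to   = colnames(m)[hits[, "col"]])
g &lt;- graph_from_data_frame(edges, directed = FALSE,
                           vertices = data.frame(name = c(rownames(m), colnames(m))))

# Pick out group nodes by name rather than by position.
is_group &lt;- V(g)$name %in% colnames(m)
V(g)$color &lt;- ifelse(is_group, rgb(0, 1, 0, .5), rgb(1, 0, 0, .5))

# The same test still works after isolates are dropped and positions shift.
g2 &lt;- delete_vertices(g, V(g)[degree(g) == 0])
groups_left &lt;- V(g2)$name[V(g2)$name %in% colnames(m)]
groups_left</code></pre></div>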
<p>Now we’ll set some other graph attributes:</p>
<div class="sourceCode" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb14-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">V</span>(i96)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>label <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">V</span>(i96)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>name</span>
<span id="cb14-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">V</span>(i96)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>label.color <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rgb</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,.<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>,.<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span>
<span id="cb14-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">V</span>(i96)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>label.cex <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> .<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span></span>
<span id="cb14-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">V</span>(i96)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>size <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span></span>
<span id="cb14-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">V</span>(i96)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>frame.color <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA</span></span></code></pre></div>
<p>You can also set edge attributes. Here we’ll make the edges nearly transparent and slightly yellow because there will be so many edges in this graph:</p>
<div class="sourceCode" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb15-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">E</span>(i96)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>color <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rgb</span>(.<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>,.<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,.<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span></code></pre></div>
<p>Now, we’ll open a pdf “device” on which to plot. This is just a connection to a pdf file. Note that the code below will take a minute or two&nbsp;to execute (or longer if your processor predates Intel’s dual-core chips).</p>
<div class="sourceCode" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb16-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pdf</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'i96.pdf'</span>)</span>
<span id="cb16-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">plot</span>(i96, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">layout=</span>layout.fruchterman.reingold)</span>
<span id="cb16-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dev.off</span>()</span></code></pre></div>
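<p>Because the layout is the slow part, one option is to compute the coordinates once and reuse them across plots. Here is a rough sketch using <code>layout_with_fr()</code>, the underscore spelling of the Fruchterman-Reingold layout in current igraph; the ring graph and the temporary file are stand-ins, not part of the analysis.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r">library(igraph)

set.seed(42)                  # force-directed layouts start from random positions
g &lt;- make_ring(10)            # stand-in graph; any igraph object works here
coords &lt;- layout_with_fr(g)   # one row of x/y coordinates per vertex

out &lt;- tempfile(fileext = ".pdf")
pdf(out)
plot(g, layout = coords)                     # first page
plot(g, layout = coords, vertex.size = 4)    # same positions, different styling
dev.off()</code></pre></div>
<p>Passing the stored matrix to <code>layout=</code> keeps node positions identical from plot to plot, which makes it easier to compare the effect of aesthetic changes.</p>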
<p>Note that we’ve used the Fruchterman-Reingold force-directed layout algorithm here.&nbsp; Generally speaking, when you have a ton of edges, the Kamada-Kawai layout algorithm works well, but it can get really slow for networks with a lot of nodes.&nbsp;Also, for larger networks, <code>layout.fruchterman.reingold.grid</code> is faster,&nbsp;but can fail to produce a plot with any meaningful pattern&nbsp;if you have&nbsp;too many isolates, as is the case here. Experiment for yourself. Here’s what we get:</p>
<p><img src="https://solomonmg.github.io/img/i96.jpg" class="img-fluid"></p>
<p>It’s oddly reminiscent of a crescent and star, but impossible to read. Now, if you open the&nbsp;<a href="../../img/i96.pdf">pdf output</a>, you’ll notice that you can&nbsp;zoom in on any part of the graph ad infinitum without losing any resolution. How is that possible in such a small file? It’s possible because the pdf device output consists of data based on vectors: lines, polygons, circles, ellipses, etc., each specified by a mathematical formula that your pdf program renders when you view it. Regular bitmap or jpeg picture output, on the other hand, consists of a pixel-coordinate mapping of the image in question, which is why you lose resolution when you zoom in on a digital photograph or a plot produced with most other programs.</p>
<p>Let’s remove all of the isolates (the crescent), change a few aesthetic features, and replot. First, we’ll remove isolates by deleting all nodes with a degree of 0, meaning that they have zero edges. Then, we’ll suppress labels for&nbsp;students and make their nodes smaller and more transparent. Then we’ll make the edges narrower and more transparent. Finally, we’ll replot using various layout algorithms:</p>
<div class="sourceCode" id="cb17" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb17-1">i96 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">delete.vertices</span>(i96, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">V</span>(i96)[ <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">degree</span>(i96)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> ])</span>
<span id="cb17-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">V</span>(i96)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>label[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">857</span>] <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA</span></span>
<span id="cb17-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">V</span>(i96)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>color[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">857</span>] <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span>&nbsp; <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rgb</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,.<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb17-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">V</span>(i96)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>size[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">857</span>] <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span></span>
<span id="cb17-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">E</span>(i96)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>width <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> .<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span></span>
<span id="cb17-6"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">E</span>(i96)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>color <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rgb</span>(.<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>,.<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,.<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb17-7"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pdf</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'i96.2.pdf'</span>)</span>
<span id="cb17-8"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">plot</span>(i96, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">layout=</span>layout.kamada.kawai)</span>
<span id="cb17-9"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dev.off</span>()</span>
<span id="cb17-10"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pdf</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'i96.3.pdf'</span>)</span>
<span id="cb17-11"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">plot</span>(i96, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">layout=</span>layout.fruchterman.reingold.grid)</span>
<span id="cb17-12"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dev.off</span>()</span>
<span id="cb17-13"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pdf</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'i96.4.pdf'</span>)</span>
<span id="cb17-14"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">plot</span>(i96, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">layout=</span>layout.fruchterman.reingold)</span>
<span id="cb17-15"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dev.off</span>()</span></code></pre></div>
<p>I personally prefer the Fruchterman-Reingold layout in this case. The nice thing about this layout is that it really emphasizes centrality: the nodes that are most central are nearly always placed in the middle of the plot. Here’s what it looks like:</p>
<p><img src="https://solomonmg.github.io/img/i962.jpg" class="img-fluid"></p>
<p>Very pretty, but you can’t see which groups are which at this resolution. Zoom in on the&nbsp;<a href="../../img/samplepdf.pdf">PDF output</a>, and you can see things pretty clearly.</p>
</section>
<section id="two-mode-to-one-mode-data-transformation" class="level3">
<h3 class="anchored" data-anchor-id="two-mode-to-one-mode-data-transformation">Two mode to one mode data transformation</h3>
<p>We’ve emphasized groups so much in this visualization that we might want to create a network consisting only of group co-membership. First we need to create a new network object. We’ll do that the same way for this network as for our example at the top of this page, by multiplying the transposed membership matrix by itself:</p>
<div class="sourceCode" id="cb18" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb18-1">g96e <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">t</span>(g96) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%*%</span> g96</span>
<span id="cb18-2">g97e <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">t</span>(g97) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%*%</span> g97</span>
<span id="cb18-3">g98e <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">t</span>(g98) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%*%</span> g98</span>
<span id="cb18-4">i96e <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">graph.adjacency</span>(g96e, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mode =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'undirected'</span>)</span></code></pre></div>
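<p>To make the matrix step concrete, here is a minimal sketch on a made-up membership matrix (the student and club names are hypothetical, not taken from the data above). Rows are students and columns are clubs, so <code>t(m) %*% m</code> gives a club-by-club matrix whose off-diagonal cells count shared members and whose diagonal holds each club’s size:</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Toy two-mode (student x club) matrix; all names are invented for illustration
m &lt;- matrix(c(1, 1, 0,
              1, 0, 0,
              1, 1, 1,
              0, 1, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("s1", "s2", "s3", "s4"),
                            c("Band", "Chess", "Debate")))

# One-mode projection onto clubs: cell [i, j] counts students in both i and j
co &lt;- t(m) %*% m

diag(co)              # club sizes
co["Band", "Chess"]   # number of students shared by Band and Chess</code></pre></div>
<p>The result is symmetric, which is why the co-membership graph can be treated as undirected.</p>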
<p>Now we need to transform the graph so that multiple edges become an attribute (&nbsp;<code>E(g)$weight</code>&nbsp;) of each unique edge:</p>
<div class="sourceCode" id="cb19" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb19-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">E</span>(i96e)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>weight <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">count.multiple</span>(i96e)</span>
<span id="cb19-2">i96e <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">simplify</span>(i96e)</span></code></pre></div>
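<p>One caveat if you run this step on a recent igraph release: to the best of my understanding, <code>simplify()</code> now merges the attributes of collapsed edges according to its <code>edge.attr.comb</code> argument, and the default for <code>weight</code> is to sum, which would inflate the counts. The sketch below, on a made-up three-group matrix, asks for the first value instead so the weight stays equal to the number of parallel edges. The newer function names (<code>graph_from_adjacency_matrix</code>, <code>count_multiple</code>) and the toy data are assumptions, not part of the original analysis:</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r">library(igraph)

# Toy co-membership counts; group names are invented for illustration
co &lt;- matrix(c(0, 3, 1,
               3, 0, 0,
               1, 0, 0),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

# Unweighted import: a cell value of 3 becomes three parallel edges
g &lt;- graph_from_adjacency_matrix(co, mode = "undirected", diag = FALSE)

# Tag every edge with its multiplicity, then collapse duplicates,
# keeping the first weight rather than summing the copies
E(g)$weight &lt;- count_multiple(g)
g &lt;- simplify(g, edge.attr.comb = list(weight = "first"))

sort(E(g)$weight)</code></pre></div>
<p>It is worth printing <code>E(i96e)$weight</code> after simplifying your own graph and comparing a few values against the co-membership matrix to confirm the weights are what you expect.</p>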
<p>Now we’ll set the other plotting parameters as we did above:</p>
<div class="sourceCode" id="cb20" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb20-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Set vertex attributes</span></span>
<span id="cb20-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">V</span>(i96e)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>label <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">V</span>(i96e)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>name</span>
<span id="cb20-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">V</span>(i96e)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>label.color <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rgb</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,.<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>,.<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>)</span>
<span id="cb20-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">V</span>(i96e)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>label.cex <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> .<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span></span>
<span id="cb20-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">V</span>(i96e)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>size <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span></span>
<span id="cb20-6"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">V</span>(i96e)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>frame.color <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA</span></span>
<span id="cb20-7"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">V</span>(i96e)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>color <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rgb</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,.<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span>
<span id="cb20-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Set edge gamma according to edge weight</span></span>
<span id="cb20-9">egam <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> (<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">E</span>(i96e)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>weight)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>.<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">max</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">E</span>(i96e)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>weight)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>.<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)</span>
<span id="cb20-10"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">E</span>(i96e)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>color <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rgb</span>(.<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>,.<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,egam)</span></code></pre></div>
<p>We set edge transparency (the alpha value passed to <code>rgb()</code>, stored in <code>egam</code>) as a function of how many edges exist between two nodes, or in this case, how many students each pair of groups has in common. For illustrative purposes, let’s compare how the Kamada-Kawai and Fruchterman-Reingold algorithms render this graph:</p>
<div class="sourceCode" id="cb21" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb21-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pdf</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'i96e.pdf'</span>)</span>
<span id="cb21-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">plot</span>(i96e, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">main =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'layout.kamada.kawai'</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">layout=</span>layout.kamada.kawai)</span>
<span id="cb21-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">plot</span>(i96e, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">main =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'layout.fruchterman.reingold'</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">layout=</span>layout.fruchterman.reingold)</span>
<span id="cb21-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dev.off</span>()</span></code></pre></div>
<p>I like the Kamada-Kawai layout for this graph, because the center of the graph is too busy otherwise. And here’s what the resulting plot looks like: <img src="https://solomonmg.github.io/img/i96e-kk.jpeg" class="img-fluid"></p>
<p>You can check out the difference between each layout yourself. Here’s what the&nbsp;<a href="../../img/i96e.pdf">PDF output looks like</a>. Page 1 shows the Kamada-Kawai layout and page 2 shows the Fruchterman-Reingold layout.</p>
</section>
<section id="group-overlap-networks-and-plots" class="level3">
<h3 class="anchored" data-anchor-id="group-overlap-networks-and-plots">Group overlap networks and plots</h3>
<p>Now we might also be interested in the percent overlap between groups. Note that this will be a directed graph, because the percent overlap will not be symmetric across groups. For example, it may be that 3/4 of&nbsp;Spanish NHS members are in NHS, but only 1/8 of NHS members are&nbsp;in the Spanish NHS. We’ll create this graph for all years in our data (though we could do it for one year only). First we’ll need to create a percent overlap matrix. We start by dividing&nbsp;each row by its diagonal entry, which is the size of that row’s group (this is really easy in R):</p>
<div class="sourceCode" id="cb22" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb22-1">ol96 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> g96e<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">diag</span>(g96e)</span>
<span id="cb22-2">ol97 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> g97e<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">diag</span>(g97e)</span>
<span id="cb22-3">ol98 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> g98e<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">diag</span>(g98e)</span></code></pre></div>
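<p>As a quick check of what this division does, here is a minimal sketch on a made-up co-membership matrix (the counts are hypothetical, chosen to reproduce the 3/4 and 1/8 example). Because R recycles the vector returned by <code>diag()</code> down the columns, every cell in row <em>i</em> is divided by the size of club <em>i</em>, which is why the result is not symmetric:</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Toy club-by-club co-membership counts; the diagonal holds club sizes
clubs &lt;- c("NHS", "SpanishNHS", "Empty")
co &lt;- matrix(c(24, 3, 0,
                3, 4, 0,
                0, 0, 0),
             nrow = 3, byrow = TRUE,
             dimnames = list(clubs, clubs))

# Divide each row by its own diagonal entry (vector recycled down columns)
ol &lt;- co / diag(co)

ol["SpanishNHS", "NHS"]   # share of SpanishNHS members who are also in NHS
ol["NHS", "SpanishNHS"]   # share of NHS members who are also in SpanishNHS
ol["Empty", ]             # a club with no members gives 0/0, i.e. NaN</code></pre></div>
<p>The last row shows where the missing values come from when a club has no members in a given year.</p>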
<p>Next, sum the matrices and set any missing cells (NaN, caused by dividing zero by zero in the step above) to zero:</p>
<div class="sourceCode" id="cb23" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb23-1">magall <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> ol96 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> ol97 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> ol98</span>
<span id="cb23-2">magall[<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(magall)] <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span></code></pre></div>
<p>Note that&nbsp;<code>magall</code>&nbsp;is now a summed percent overlap matrix: because we’ve added up 3 years, the maximum is now 3 instead of 1. Let’s&nbsp;compute average club size by taking the mean of each club’s diagonal entry across the three years:</p>
<div class="sourceCode" id="cb24" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb24-1">magdiag <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">apply</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cbind</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">diag</span>(g96e), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">diag</span>(g97e), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">diag</span>(g98e)), <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, mean )</span></code></pre></div>
<p>Finally, we’ll generate&nbsp;centrality measures for magall. When we create the igraph object from our matrix, we need to set&nbsp;<code>weighted=T</code>&nbsp;because otherwise igraph dichotomizes edges at 1. This can distort our centrality measures&nbsp;because edges now represent more than binary connections: they represent the percent of membership overlap.</p>
<div class="sourceCode" id="cb25" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb25-1">magallg <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">graph.adjacency</span>(magall, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">weighted=</span>T)</span>
<span id="cb25-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Degree</span></span>
<span id="cb25-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">V</span>(magallg)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>degree <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">degree</span>(magallg)</span>
<span id="cb25-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Betweenness centrality</span></span>
<span id="cb25-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">V</span>(magallg)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>btwcnt <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">betweenness</span>(magallg)</span></code></pre></div>
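<p>Note that <code>degree()</code> counts edges and does not look at their weights. If you want a score that reflects how large the overlaps are, one option (a suggestion, not something used in the analysis above) is igraph’s <code>strength()</code>, which sums edge weights. A minimal sketch on a made-up overlap matrix:</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r">library(igraph)

# Toy directed overlap matrix; names and values are invented for illustration
grp &lt;- c("A", "B", "C")
m &lt;- matrix(c(0,   0.5, 0.2,
              0.1, 0,   0,
              0,   0,   0),
            nrow = 3, byrow = TRUE,
            dimnames = list(grp, grp))

g &lt;- graph_from_adjacency_matrix(m, mode = "directed", weighted = TRUE)

degree(g, mode = "out")     # number of outgoing edges per group
strength(g, mode = "out")   # sum of outgoing edge weights per group</code></pre></div>
<p>Also keep in mind that, as I understand the igraph documentation, <code>betweenness()</code> picks up a <code>weight</code> edge attribute by default and reads it as a distance, so larger overlaps count as longer paths unless you pass your own <code>weights</code> argument.</p>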
<p>Before we plot this, we should filter some of the edges; otherwise our graph will probably be too busy to make sense of visually. Take a look at the distribution of connection strength by plotting the density of the magall matrix:</p>
<div class="sourceCode" id="cb26" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb26-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">plot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">density</span>(magall))</span></code></pre></div>
<p><img src="https://solomonmg.github.io/img/densitymagall.jpeg" class="img-fluid"></p>
<p>Nearly all of the edge weights are below 1. Because the weights are summed over three years, that means the average yearly overlap for most pairs of clubs is less than 1/3. Let’s filter at 1, so that an edge represents an average overlap of at least 1/3 of the members of the group in question.</p>
<div class="sourceCode" id="cb27" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb27-1">magallgt1 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> magall</span>
<span id="cb27-2">magallgt1[magallgt1<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb27-3">magallggt1 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">graph.adjacency</span>(magallgt1, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">weighted=</span>T)</span>
<span id="cb27-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&nbsp;Removes loops:</span></span>
<span id="cb27-5">magallggt1 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">simplify</span>(magallggt1, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">remove.multiple=</span><span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">remove.loops=</span><span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span></code></pre></div>
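<p>The cutoff of 1 is a judgment call. One way to compare candidate cutoffs, sketched here on a made-up summed-overlap matrix rather than on <code>magall</code>, is to count how many off-diagonal cells would survive each threshold:</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Toy summed-overlap matrix (3 years, so the diagonal is 3); values are invented
w &lt;- matrix(c(3,   1.2, 0.4,
              0.9, 3,   0.1,
              2.0, 0.3, 3),
            nrow = 3, byrow = TRUE)

# Ignore the diagonal, which would only produce self-loops
off &lt;- w[row(w) != col(w)]

# Number of directed edges kept at each candidate cutoff
cutoffs &lt;- c(0.5, 1, 2)
kept &lt;- sapply(cutoffs, function(k) sum(off &gt;= k))
names(kept) &lt;- cutoffs
kept</code></pre></div>
<p>Picking the largest cutoff that still leaves every group of interest connected is one reasonable rule of thumb, though the right value depends on the figure you want.</p>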
<p>Before we do anything else, we’ll create a custom layout based on Fruchterman-Reingold, wherein we adjust the coordinates by hand using the&nbsp;tkplot&nbsp;GUI tool to make sure all of the labels are visible. This is very useful if you want to create a really sharp-looking network visualization for publication.</p>
<div class="sourceCode" id="cb28" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb28-1">magallggt1<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>layout <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">layout.fruchterman.reingold</span>(magallggt1)</span>
<span id="cb28-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">V</span>(magallggt1)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>label <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">V</span>(magallggt1)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>name</span>
<span id="cb28-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tkplot</span>(magallggt1)</span></code></pre></div>
<p>Let the plot load, then&nbsp;maximize the window and select View -&gt; Fit to Screen&nbsp;so that&nbsp;you get maximum resolution for this large graph. Now hand-place the nodes, making sure no labels overlap: <img src="https://solomonmg.github.io/img/tkplotscreenshot.png" class="img-fluid"></p>
<p>Pay special attention to whether the labels overlap (or might overlap if the font were bigger) along the vertical. Save the layout coordinates to the graph object:</p>
<div class="sourceCode" id="cb29" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb29-1">magallggt1<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>layout <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tkplot.getcoords</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span></code></pre></div>
<p>We use “1” here, which is correct only if this was the first tkplot window you opened. If you called tkplot a few times, use the number of the most recent plot. You can tell which plot is visible because at the top of the&nbsp;tkplot&nbsp;interface, you’ll see something like “Graph plot 1” or, in the case of my screenshot above, “Graph plot 7” (it was the seventh time I called tkplot).</p>
<div class="sourceCode" id="cb30" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb30-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Set vertex attributes</span></span>
<span id="cb30-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">V</span>(magallggt1)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>label <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">V</span>(magallggt1)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>name</span>
<span id="cb30-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">V</span>(magallggt1)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>label.color <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rgb</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,.<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>,.<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>)</span>
<span id="cb30-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">V</span>(magallggt1)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>size <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span></span>
<span id="cb30-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">V</span>(magallggt1)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>frame.color <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA</span></span>
<span id="cb30-6"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">V</span>(magallggt1)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>color <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rgb</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,.<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span>
<span id="cb30-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Set edge attributes</span></span>
<span id="cb30-8"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">E</span>(magallggt1)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>arrow.size <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> .<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span></span>
<span id="cb30-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Set edge gamma according to edge weight</span></span>
<span id="cb30-10">egam <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> (<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">E</span>(magallggt1)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>weight<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">+.1</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">max</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">E</span>(magallggt1)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>weight<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">+.1</span>)</span>
<span id="cb30-11"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">E</span>(magallggt1)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>color <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rgb</span>(.<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>,.<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,egam)</span></code></pre></div>
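<p>The transparency scaling above is easy to sanity-check on a toy weight vector (the values below are invented). The fourth argument to <code>rgb()</code> is alpha, so dividing by the maximum guarantees the heaviest edge is fully opaque and everything else falls between 0 and 1:</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Toy edge weights (summed overlap); values are invented for illustration
w &lt;- c(1, 2, 3)

# Rescale so the heaviest edge gets alpha = 1 and lighter edges fade out;
# the + .1 offset keeps a zero-weight edge from being fully transparent
egam &lt;- (w + .1) / max(w + .1)

# rgb() takes alpha as its fourth argument and returns #RRGGBBAA strings
cols &lt;- rgb(.5, .5, 0, egam)
cols</code></pre></div>
<p>If the light edges are still too prominent, applying <code>log()</code> to the weights before rescaling, as in the co-membership plot earlier, is another option.</p>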
<p>One thing that we can do with this graph is to set label size as a function of degree, which adds a “tag-cloud”-like element to the visualization:</p>
<div class="sourceCode" id="cb31" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb31-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">V</span>(magallggt1)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>label.cex <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">V</span>(magallggt1)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>degree<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">max</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">V</span>(magallggt1)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>degree)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> .<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span></span>
<span id="cb31-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#note, unfortunately one must play with the formula above to get the</span></span>
<span id="cb31-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#ratio just right</span></span></code></pre></div>
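<p>To get a feel for the scaling formula before tuning it, here is a minimal sketch on a made-up degree vector. The vector <code>deg</code> and its values are assumptions for illustration, not taken from the data:</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical degrees for five vertices (illustration only)
deg = c(1, 2, 5, 10, 20)

# Same formula as above: degree / (max(degree) / 2) + 0.3
label_cex = deg / (max(deg) / 2) + .3

# Inspect the smallest and largest resulting label sizes
range(label_cex)</code></pre></div>
<p>Changing the divisor controls the spread between the smallest and largest labels, while the additive constant sets the minimum label size.</p>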
<p>Let’s plot the results:</p>
<div class="sourceCode" id="cb32" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb32-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pdf</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'magallggt1customlayout.pdf'</span>)</span>
<span id="cb32-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">plot</span>(magallggt1)</span>
<span id="cb32-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dev.off</span>()</span></code></pre></div>
<p>Note that because we stored the custom layout in the igraph object&nbsp;magallggt1, we did not need to specify it in the plot command. Here’s the&nbsp;<a href="../../img/magallggt1custom.pdf">pdf output</a>, and here’s what it looks like:</p>
<p><img src="https://solomonmg.github.io/img/magallggt1custom.png" class="img-fluid"></p>
<p>This visualization reveals much more information about our network than our crescent-star visualization.</p>
</section>
<section id="mobility-markov-and-transition-probabilities" class="level3">
<h3 class="anchored" data-anchor-id="mobility-markov-and-transition-probabilities">Mobility, Markov, and Transition Probabilities</h3>
<p>To shed light on how people flow through these groups, we’ll compute transition probabilities. A matrix of such transition probabilities is what defines a Markov chain.</p>
<p>First we’ll create a new matrix by multiplying the transposed 1996 magnet matrix by the 1997 magnet matrix, so that each entry gives the number of students moving from a 1996 membership to a 1997 membership.</p>
<p>Before we actually do this, we need to do some data munging to make sure that the rows and columns for g96 and g97 are the same. We’ll use the match() function for this.</p>
<div class="sourceCode" id="cb33" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb33-1">    </span>
<span id="cb33-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># First, let's get an idea of how many column-names (activities) and row</span></span>
<span id="cb33-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># names (student ids) are in common between the two years:</span></span>
<span id="cb33-4">  </span>
<span id="cb33-5">(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">cnames =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">intersect</span>( <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">colnames</span>(g96), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">colnames</span>(g97) ) )</span>
<span id="cb33-6">(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">rnames =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">intersect</span>( <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">row.names</span>(g96), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">row.names</span>(g97) ) )</span>
<span id="cb33-7">  </span>
<span id="cb33-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Great, there are a lot of names in common. Now we</span></span>
<span id="cb33-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># need to make sure we are only using the rows</span></span>
<span id="cb33-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># and columns of each matrix that contain entries used in</span></span>
<span id="cb33-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># both years. We also need to make sure that the columns and</span></span>
<span id="cb33-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># rows are in the same order.</span></span>
<span id="cb33-13">  </span>
<span id="cb33-14"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># In order to accomplish this we are going to exploit R's</span></span>
<span id="cb33-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># indexing capabilities. We are going to have R "rebuild"</span></span>
<span id="cb33-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># each matrix according to the order of rnames and cnames.</span></span>
<span id="cb33-17"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># We'll use the match() function to accomplish this.</span></span>
<span id="cb33-18">g96matched <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> g96[ <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">match</span>(rnames, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">row.names</span>(g96)), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">match</span>(cnames, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">colnames</span>(g96)) ]</span>
<span id="cb33-19">g97matched <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> g97[ <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">match</span>(rnames, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">row.names</span>(g97)), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">match</span>(cnames, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">colnames</span>(g97)) ]</span>
<span id="cb33-20">  </span>
<span id="cb33-21"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># We need to do the same thing for the diagonal of the matrix g96e, which is</span></span>
<span id="cb33-22"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># our co-membership/affiliation matrix computed above:</span></span>
<span id="cb33-23">mag96diagmatched <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">diag</span>( g96e[ <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">match</span>(cnames, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">colnames</span>(g96e)), </span>
<span id="cb33-24">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">match</span>(cnames, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">colnames</span>(g96e)) ] )</span>
<span id="cb33-25">  </span>
<span id="cb33-26"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Now let's check to make sure things worked correctly:</span></span>
<span id="cb33-27"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">which</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">row.names</span>(g96matched) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">row.names</span>(g97matched))</span>
<span id="cb33-28"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">which</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">colnames</span>(g96matched) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">colnames</span>(g97matched))</span></code></pre></div>
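<p>The match()-based alignment can be easier to follow on a tiny example. The sketch below uses two made-up membership matrices; the names <code>a</code>, <code>b</code>, the student ids, and the activities are all hypothetical:</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Two small, made-up membership matrices (students x activities)
a = matrix(c(1, 0, 1,
             0, 1, 1), nrow = 2, byrow = TRUE,
           dimnames = list(c("s1", "s2"), c("band", "chess", "drama")))
b = matrix(c(1, 1,
             0, 1,
             1, 0), nrow = 3, byrow = TRUE,
           dimnames = list(c("s2", "s3", "s1"), c("drama", "band")))

# Names shared by both matrices, in the order they appear in a
rn = intersect(row.names(a), row.names(b))
cn = intersect(colnames(a), colnames(b))

# match() returns the position of each shared name in the target matrix,
# so indexing by it both subsets and reorders in one step
a_m = a[match(rn, row.names(a)), match(cn, colnames(a))]
b_m = b[match(rn, row.names(b)), match(cn, colnames(b))]

# The row and column names of the two results should now agree exactly
identical(dimnames(a_m), dimnames(b_m))</code></pre></div>
<p>Student s3 and the chess activity appear in only one matrix, so both are dropped, and the rows and columns of <code>b</code> are reordered to follow <code>a</code>.</p>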
<p>Now that these matrices are aligned, we can multiply them to get the matrix of transition counts, which we normalize into transition probabilities further down:</p>
<div class="sourceCode" id="cb34" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb34-1">mag96_97 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">t</span>(g96matched) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%*%</span> g97matched</span></code></pre></div>
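<p>As a rough illustration of what this product contains, here is a sketch with two made-up, already-aligned membership matrices. The objects <code>y1</code> and <code>y2</code> and their values are assumptions for the example:</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Made-up membership matrices for three students (rows) and two
# activities (columns) in two consecutive years
y1 = matrix(c(1, 0,
              1, 0,
              0, 1), nrow = 3, byrow = TRUE,
            dimnames = list(c("s1", "s2", "s3"), c("band", "chess")))
y2 = matrix(c(1, 0,
              0, 1,
              0, 1), nrow = 3, byrow = TRUE,
            dimnames = list(c("s1", "s2", "s3"), c("band", "chess")))

# Entry [i, j] counts the students who were in activity i in year 1
# and in activity j in year 2
counts = t(y1) %*% y2
counts</code></pre></div>
<p>Here one band member stays in band and one moves to chess, while the single chess member stays in chess. These are counts, not yet probabilities.</p>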
<p>Let’s munge the 97 and 98 data and repeat:</p>
<div class="sourceCode" id="cb35" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb35-1"></span>
<span id="cb35-2">cnames <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">intersect</span>( <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">colnames</span>(g97), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">colnames</span>(g98) ) </span>
<span id="cb35-3">rnames <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">intersect</span>( <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">row.names</span>(g97), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">row.names</span>(g98) )</span>
<span id="cb35-4">g97matched <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> g97[ <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">match</span>(rnames, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">row.names</span>(g97)), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">match</span>(cnames, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">colnames</span>(g97)) ]</span>
<span id="cb35-5">g98matched <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> g98[ <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">match</span>(rnames, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">row.names</span>(g98)), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">match</span>(cnames, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">colnames</span>(g98)) ]</span></code></pre></div>
<p>And again for the 97-98 transition:</p>
<div class="sourceCode" id="cb36" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb36-1">mag97_98 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">t</span>(g97matched) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%*%</span> g98matched</span></code></pre></div>
<p>Now we need the diagonal of each group-level co-membership matrix, which holds the number of students in each group, ordered by the current set of columns.</p>
<div class="sourceCode" id="cb37" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb37-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># mag96diagmatched was already computed above with the 96-97 column order,</span></span>
<span id="cb37-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># which is the order mag96_97 uses, so we do not recompute it with the 97-98 cnames.</span></span>
<span id="cb37-3"></span>
<span id="cb37-4">mag97diagmatched <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">diag</span>( g97e[ <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">match</span>(cnames, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">colnames</span>(g97e)), </span>
<span id="cb37-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">match</span>(cnames, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">colnames</span>(g97e)) ] )</span>
<span id="cb37-6">  </span>
<span id="cb37-7">mag98diagmatched <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">diag</span>( g98e[ <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">match</span>(cnames, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">colnames</span>(g98e)), </span>
<span id="cb37-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">match</span>(cnames, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">colnames</span>(g98e)) ] )</span></code></pre></div>
<p>And finally we can create the transition probability matrix! Divide mag96_97 by mag96diagmatched to get the transition probability matrix (Markov chain) magmob96_97, and do the same for the 97-98 transition:</p>
<div class="sourceCode" id="cb38" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb38-1">magmob96_97 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> mag96_97<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>mag96diagmatched </span>
<span id="cb38-2">magmob97_98 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> mag97_98<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>mag97diagmatched</span></code></pre></div>
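<p>Dividing a matrix by a vector relies on R’s recycling rules, which can be surprising. The sketch below, using made-up counts and group sizes (<code>counts</code> and <code>size</code> are hypothetical), shows that each row is divided by the matching group size:</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Made-up transition counts between two activities
counts = matrix(c(3, 1,
                  2, 6), nrow = 2, byrow = TRUE,
                dimnames = list(c("band", "chess"), c("band", "chess")))

# Hypothetical year-1 group sizes, in the same order as the rows of counts
size = c(band = 4, chess = 8)

# R recycles the vector down the columns, so row i is divided by size[i]
probs = counts / size
probs

# Equivalent, more explicit form using sweep() over rows
all.equal(probs, sweep(counts, 1, size, "/"))</code></pre></div>
<p>The rows sum to 1 in this toy example because every hypothetical student holds exactly one membership in each year. With multiple memberships per student, or students missing from one year, the row sums need not equal 1.</p>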
<p>Now average the two transition matrices by adding them and dividing by 2:</p>
<div class="sourceCode" id="cb39" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb39-1">mobility_all <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> (magmob96_97 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> magmob97_98)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span></span></code></pre></div>
<p>Now plot as with the event-overlap graphs!</p>
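<p>One possible way to do this is sketched below, under stated assumptions: it uses a made-up matrix <code>mob</code> in place of mobility_all, and it uses graph_from_adjacency_matrix() from recent igraph releases (older versions name this function graph.adjacency()). The edge-width scaling is an arbitrary choice for illustration:</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r">library(igraph)

# Made-up averaged transition matrix for three activities (hypothetical)
mob = matrix(c(.6, .3, .1,
               .2, .7, .1,
                0, .4, .6), nrow = 3, byrow = TRUE,
             dimnames = list(c("band", "chess", "drama"),
                             c("band", "chess", "drama")))

# Directed, weighted graph: an edge from i to j carries the probability of
# moving from activity i to activity j; diag = FALSE drops the self-loops
g = graph_from_adjacency_matrix(mob, mode = "directed", weighted = TRUE,
                                diag = FALSE)

# Scale edge width by transition probability, then plot
E(g)$width = 1 + 5 * E(g)$weight
plot(g, edge.arrow.size = .5)</code></pre></div>
<p>Zero entries produce no edge, so only transitions that actually occur are drawn.</p>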


</section>
</section>

 ]]></description>
  <guid>https://solomonmg.github.io/blog/working-with-bipartite-affiliation-network-data-in-r/</guid>
  <pubDate>Sun, 04 Mar 2012 00:00:00 GMT</pubDate>
  <media:content url="https://solomonmg.github.io/blog/working-with-bipartite-affiliation-network-data-in-r/featured.jpg" medium="image" type="image/jpeg"/>
</item>
</channel>
</rss>
