Propagation of Epidemics in Citation Networks

Introduction

Academic research should live or die by its own merits. However, human cognitive shortcuts have long been believed to give undue advantage to particular institutions or researchers, sometimes blinding reviewers to errors or a lack of rigor in their work. We investigate this imbalance by studying the spread of ideas across academic research networks using disease spread models adapted from epidemiology.

We focus specifically on the spread of ideas in Machine Learning (ML), a specialised area of research in Computer Science. We take the papers published in the year 2018 as our set of base “pathogens” and assess the network growth dynamics of idea spread amongst major ML conferences.

We assess whether idea spread is driven by connectivity amongst the original authors or by the explicit prestige of their institution, using an epidemiological model to simulate the spread of an idea on our collaboration network.

Approach

We approach our problem with the following steps:

Analogies with Epidemiology : We base our work on analogies with concepts and models in epidemiology, modelling the spread of ideas on the propagation of epidemics of infectious diseases. This involves drawing parallels between concepts and events in the spread of infectious diseases and the spread of research in academia.
Building a Citation and Infection Network : We query the Microsoft Academic Graph to extract data for papers published in 2018 and beyond at the major ML conferences. We create a network (a directed graph) of all papers citing one another. From each pair of adjacent nodes in this citation network, we keep only those edges that satisfy our conditions for infection.
Estimating Idea Quality : We query openly available review data from OpenReview.net for papers accepted to ICLR in 2018-2020. We compute the average rating of each paper by taking the mean of its three independent review scores.
Devising a Prestige Metric : We retain the prestige metric used in previous work, but add nuance by factoring in rankings that are driven more by reputation and assigning them a higher weight. Our university prestige metric is a weighted average of the prestige metric used in previous work, the Geometric Mean Count (GMC) of papers published in the areas relevant to our task (CSRankings), and the Peer Assessment Score (PAS) used in the US News rankings.
Tracing Infections : We treat ICLR 2018 papers as patient-zero papers and walk down the path of all the papers that cited the source (directly or indirectly) until we reach all of its last descendants. We record the authors of each patient-zero paper and the authors of all of its descendants.
Preparing a Collaboration Network : We create a collaboration network of machine learning researchers by taking all papers published in top conferences between 2016 and 2020 and creating a link between every pair of authors who published a paper together during that period.
Simulating Epidemics : To evaluate whether idea spread is tied to the connectivity and collaborations of the original authors rather than the explicit prestige of their institution, we simulate the spread of an idea on our collaboration network using the Susceptible-Infected-Recovered (SIR) model of epidemics.
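
The last two steps can be illustrated with a short Python sketch. It is not the project's code: the toy `papers` list, the function names, and the parameters `beta` (per-contact transmission probability per step) and `gamma` (per-step recovery probability) are all assumptions made for this example. The sketch links every pair of co-authors, seeds the patient-zero authors as infected, and runs a discrete-time stochastic SIR process, reporting how many authors were ever infected.

```python
import random
from itertools import combinations


def build_collaboration_network(papers):
    """Link every pair of authors who share at least one paper.

    `papers` is an iterable of author lists (hypothetical input format).
    Returns an adjacency dict mapping author -> set of co-authors.
    """
    graph = {}
    for authors in papers:
        for author in authors:
            graph.setdefault(author, set())  # keep solo authors as isolated nodes
        for a, b in combinations(sorted(set(authors)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def simulate_sir(graph, seeds, beta, gamma, rng):
    """Run one discrete-time stochastic SIR epidemic.

    Returns the set of nodes that were ever infected (all end up recovered).
    """
    if not 0 < gamma <= 1:
        raise ValueError("gamma must be in (0, 1] so the epidemic terminates")
    infected = set(seeds) & set(graph)
    recovered = set()
    while infected:
        newly_infected = set()
        # Sorted iteration keeps runs reproducible for a given rng seed.
        for node in sorted(infected):
            for neighbour in sorted(graph[node]):
                susceptible = neighbour not in infected and neighbour not in recovered
                if susceptible and rng.random() < beta:
                    newly_infected.add(neighbour)
        newly_recovered = {n for n in sorted(infected) if rng.random() < gamma}
        recovered |= newly_recovered
        infected = (infected - newly_recovered) | newly_infected
    return recovered


def mean_epidemic_size(graph, seeds, beta=0.3, gamma=0.5, runs=200, seed=0):
    """Average number of ever-infected authors over repeated simulations."""
    rng = random.Random(seed)
    sizes = [len(simulate_sir(graph, seeds, beta, gamma, rng)) for _ in range(runs)]
    return sum(sizes) / runs


# Toy data, invented for illustration only.
papers = [
    ["alice", "bob"],
    ["bob", "carol", "dave"],
    ["erin"],
]
network = build_collaboration_network(papers)
print(mean_epidemic_size(network, seeds=["alice"]))
```

Averaging over many runs matters because a single stochastic run can die out immediately or reach the whole connected component; the mean gives a steadier estimate of how far an idea could spread from a given set of authors.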

Results

After investigating the mechanisms that drive idea propagation, we find that neither the prestige of an institution nor the rating given by peer reviewers predicts whether an academic idea will spread. We infer that idea spread could be egocentric, depending more on individual researchers than on the institutions at which they work. It thus appears to be driven by how well connected the researchers themselves are, which can lead to uptake by collaborators and, in turn, to wider spread and name recognition.

This conclusion is drawn from the following observations:

Highly ranked universities have more connected researchers, perhaps driven by the fact that they have more research faculty in general.
Ideas of extremely high or low caliber infect fewer researchers, suggesting that most researchers get infected with good or moderate ideas.
There is a strong positive correlation between simulated epidemic size and the actual number of infected authors, suggesting that local collaboration network structure is an important vehicle for paper spread.
There is no significant relationship between either prestige or paper rating and the actual number of infected authors after controlling for simulated epidemic size, suggesting that the local structure of collaborations surrounding a paper's authors is the most important factor influencing its spread.
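
One common way to check a relationship "after controlling for" a third variable is a partial correlation: regress both variables on the control and correlate the residuals. The sketch below illustrates that idea only; it is an assumption that this resembles the analysis, since the summary above does not state which statistical method was used. The function names and the toy numbers are invented for the example and are not project data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def residuals(y, z):
    """Residuals of the least-squares line y ~ intercept + slope * z."""
    n = len(y)
    mean_y, mean_z = sum(y) / n, sum(z) / n
    slope = sum((zi - mean_z) * (yi - mean_y) for zi, yi in zip(z, y)) / sum(
        (zi - mean_z) ** 2 for zi in z
    )
    return [(yi - mean_y) - slope * (zi - mean_z) for yi, zi in zip(y, z)]


def partial_correlation(x, y, z):
    """Correlation between x and y once the linear effect of z is removed."""
    return pearson(residuals(x, z), residuals(y, z))


# Toy values, invented for illustration only (one entry per patient-zero paper).
prestige = [0.9, 0.4, 0.7, 0.2, 0.6]       # hypothetical prestige scores
simulated = [12.0, 5.0, 9.0, 3.0, 8.0]     # hypothetical simulated epidemic sizes
infected = [30.0, 11.0, 25.0, 6.0, 18.0]   # hypothetical actual infected-author counts

print("raw correlation:", pearson(prestige, infected))
print("partial correlation:", partial_correlation(prestige, infected, simulated))
```

A raw correlation that shrinks towards zero once the simulated epidemic size is partialled out would be consistent with the observation above: whatever association prestige has with spread is already explained by the collaboration structure around the authors.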

Resources

Resource	Link
Code	Github Link
Complete Report (All Phases)	Comprehensive Project Report
Final Paper Only	Project Paper
Presentation	Project Presentation
Software	Tarball Link

CS-8803 : Data Science for Epidemiology Project (Fall 2020)
