GDELT datasets

last modified Oct 15, 2014 02:56 PM
Two new GDELT Global Knowledge Graph (GKG) datasets were recently released:
The first is the set of underlying GKG datasets behind our paper that data-mined more than 21 billion words of academic literature from JSTOR, DTIC, CORE, CiteSeerX, CIA, and the Internet Archive.
In the hope of seeding new kinds of research that incorporate the cultural knowledge of the world's academic literature, we are making the GKG datasets behind that paper available for open research. NOTE that these do NOT contain the text of the articles themselves, only the metadata computed from each article. That metadata includes the references cited in each paper, which allows applications such as identifying the most cited authors and institutions relating to specific geographies, topics, and socio-political groups. The full GKG dataset collection, around 40GB in size, is available.
We have also released a new Human Rights GKG, which encodes in quantitative form a cross-section of the world’s public knowledge of human rights issues, drawn from the hundreds of thousands of textual reports, calls to action, alerts, field interviews, and other material published by organizations around the globe. This initial GKG covers over 110,000 documents from a number of the major international human rights report archives, offering a computable overview of global human rights issues over the decades.
The GDELT GKG format encodes lists of social groups, organizations, locations, major themes, emotions, and a range of other metadata computed from each document, making it possible to conduct a wide array of studies that blend spatial, semantic, citation, and network analyses.
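To make the idea of blending list-valued fields more concrete, here is a minimal Python sketch that counts how often each theme appears alongside each location. It assumes a simplified, hypothetical record layout: one tab-separated row per document, with semicolon-separated lists inside each field. The column names, their order, and the sample rows are invented for illustration and are not taken from the GKG documentation, so check the codebook for the file you actually download before adapting this.

```python
import csv
import io
from collections import Counter

# Hypothetical column layout, for illustration only. The real GKG files have
# more columns and richer sub-fields; consult the GKG codebook for details.
COLUMNS = ["date", "themes", "locations", "organizations"]
LIST_FIELDS = ("themes", "locations", "organizations")


def parse_record(row):
    """Turn one tab-separated row into a dict, splitting list fields on ';'."""
    record = dict(zip(COLUMNS, row))
    for field in LIST_FIELDS:
        raw = record.get(field, "")
        # Drop empty strings produced by empty fields or trailing delimiters.
        record[field] = [item for item in raw.split(";") if item]
    return record


def theme_counts_by_location(lines):
    """Count how often each theme co-occurs with each location."""
    counts = {}
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        record = parse_record(row)
        for location in record["locations"]:
            counts.setdefault(location, Counter()).update(record["themes"])
    return counts


# Made-up sample rows standing in for a downloaded file.
SAMPLE = (
    "20141015\tHUMAN_RIGHTS;REFUGEES\tKenya;Somalia\tUnited Nations\n"
    "20141015\tHUMAN_RIGHTS\tKenya\t\n"
)

if __name__ == "__main__":
    result = theme_counts_by_location(io.StringIO(SAMPLE))
    for location, themes in sorted(result.items()):
        print(location, themes.most_common())
```

The same pattern, a dictionary of counters keyed on one field and updated with another, could be pointed at any pair of list-valued fields, for example organizations against themes, once the real column positions are known.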
We're very much looking forward to seeing what you all are able to do with these new GKG collections! For more information on the GDELT Project more broadly, see the main site or the blog.