Global Factor Data

How to download JKP data from WRDS through R

AUTHOR
Theis I. Jensen, Bryan Kelly, and Lasse H. Pedersen

Overview

This document explains how to download the data set from "Is There a Replication Crisis in Finance?" by Jensen, Kelly, and Pedersen (2023, henceforth JKP) directly from R. To run this code successfully, you need a WRDS account with access to CRSP and Compustat. A viable alternative is to download the data directly from WRDS's web interface.

Downloading the JKP data from WRDS

Step 0 - Download the relevant packages

library(RPostgres)
library(httr)
library(readxl)
library(tidyverse)
library(data.table)

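If any of these packages are missing, you can install them from CRAN first (a one-time step):

# Install the required packages from CRAN (only needed once)
install.packages(c("RPostgres", "httr", "readxl", "tidyverse", "data.table"))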

Step 1 - Connect to WRDS

The first step is to connect to the WRDS server. To do so, start by replacing USERNAME and PASSWORD in the code below with your WRDS username and password. Next, run the code to connect to the WRDS server. Note: WRDS uses multi-factor authentication, so you might have to approve the request before the code will execute successfully.

wrds <- dbConnect(Postgres(),
                  host = 'wrds-pgdata.wharton.upenn.edu',
                  port = 9737,
                  dbname = 'wrds',
                  sslmode = 'require',
                  user = 'USERNAME',
                  password = 'PASSWORD')

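To verify that the connection works before sending larger queries, a quick check like the one below can help (a minimal sketch using functions from the DBI package, which RPostgres builds on):

# Sanity check: the connection should be valid and a trivial query should return one row
DBI::dbIsValid(wrds)
DBI::dbGetQuery(wrds, "SELECT 1 AS ok")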

Step 2 - Decide on your desired data subset

The full JKP data set is massive, but most people only need a subset of it. Here I'll show how to generate the data set used by JKP, except that I'll only use data from developed countries. To identify developed countries, I use the "Country Classification.xlsx" file available from our GitHub repository. The code below downloads this file and extracts the ISO codes for the developed countries.

# Convenience function to download Excel files from GitHub
github_excel <- function(link) {
  temp_file <- tempfile(fileext = ".xlsx")
  req <- GET(link,
             authenticate(Sys.getenv("GITHUB_PAT"), ""),
             write_disk(path = temp_file))
  data <- readxl::read_excel(temp_file)
  unlink(temp_file)
  return(data)
}

# Download country classification file
link <- "https://github.com/bkelly-lab/ReplicationCrisis/raw/master/GlobalFactors/Country%20Classification.xlsx" # Notice I'm using the 'raw' rather than 'blob' URL
countries <- link |> github_excel()

# Extract developed countries
(countries_rel <- countries |>
    filter(msci_development=="developed") |>
    pull(excntry))


[1] "USA" "JPN" "HKG" "GBR" "CAN" "AUS" "DEU" "FRA" "SWE" "CHE" "SGP" "ITA" [13] "ESP" "ISR" "NLD" "NOR" "BEL" "DNK" "FIN" "NZL" "AUT" "IRL" "PRT"

Next, the data set contains more than 400 stock characteristics, but JKP only uses a subset of 153 characteristics that are used to create published equity factors. The list of relevant characteristics is in the "Factor Details.xlsx" file, also available from our GitHub repository. The code below downloads this file and extracts the relevant characteristics.

# Extract the factor details files
link <- "https://github.com/bkelly-lab/ReplicationCrisis/raw/master/GlobalFactors/Factor%20Details.xlsx"
chars <- link |> github_excel()

# Extract the relevant characteristics
chars_rel <- chars |>
    filter(!is.na(abr_jkp)) |>
    pull(abr_jkp)


Finally, JKP relies on four screens: common stocks only (common=1), stocks listed on a main exchange (exch_main=1), the primary security of each firm (primary_sec=1), and the main observation for each security (obs_main=1).

I'll include these screens directly in the SQL queries below; the snippet after this paragraph also collects them in a reusable string.
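The code below is a small convenience sketch (not part of the original workflow) that stores the four screens as a single SQL fragment, which can then be pasted into any of the queries in Step 3:

# The four JKP screens collected as a reusable SQL fragment
screens <- "common=1 and exch_main=1 and primary_sec=1 and obs_main=1"

# Example: a WHERE clause combining the screens with a country filter
where_dnk <- paste0("WHERE ", screens, " and excntry='DNK'")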

Step 3 - Extracting the data

Extracting data from one country

First, I'm going to show a simple extract of some identifying information, the stock's size group, its market equity, and its excess return over the next month, for stocks listed in Denmark:

# Convenience function to fetch data from WRDS
wrds_fetch <- function(wrds, sql_query, n=-1){
    res <- dbSendQuery(wrds, sql_query)
    data <- dbFetch(res, n=n)
    dbClearResult(res)
    return(data)
}

# Download JKP data from Denmark
sql_query <- paste0(
    "SELECT id, eom, excntry, gvkey, permno, size_grp, me, ret_exc_lead1m
     FROM contrib.global_factor
     WHERE common=1 and exch_main=1 and primary_sec=1 and obs_main=1 and
           excntry='DNK';"
)
data <- wrds |> wrds_fetch(sql_query)

# Show
data |> as_tibble()


# A tibble: 59,781 × 8
          id eom        excntry gvkey  permno size_grp    me ret_exc_lead1m
       <dbl> <date>     <chr>   <chr>   <dbl> <chr>    <dbl>          <dbl>
 1 301555201 1985-12-31 DNK     015552     NA large     633.       -0.126
 2 301563002 1985-12-31 DNK     015630     NA small     239.       -0.0943
 3 300802002 1985-12-31 DNK     008020     NA large     752.       -0.0212
 4 301644901 1985-12-31 DNK     016449     NA large     572.       -0.141
 5 301560001 1985-12-31 DNK     015600     NA small     262.       -0.0565
 6 302390301 1985-12-31 DNK     023903     NA small     312.       -0.0630
 7 302391101 1985-12-31 DNK     023911     NA small     267.       -0.0988
 8 302482601 1985-12-31 DNK     024826     NA small     139.        0.00548
 9 310113005 1985-12-31 DNK     101130     NA small     200.        0.201
10 310115101 1985-12-31 DNK     101151     NA small     229.       -0.118
# ℹ 59,771 more rows

Extracting data from many countries

Next, I'm downloading the 153 characteristics from all developed countries (on my machine, this takes around 25 minutes, and the resulting data set is 12.4GB):

sql_query <- paste0(
    "SELECT id, eom, excntry, gvkey, permno, ",
    paste0(chars_rel, collapse = ", "),
    " FROM contrib.global_factor
      WHERE common=1 and exch_main=1 and primary_sec=1 and obs_main=1 and
            excntry in (",
    paste0("'", countries_rel, "'", collapse = ", "),
    ");"
)
data <- wrds |> wrds_fetch(sql_query)

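Because the query takes a while, it may be worth saving the result to disk right away so it can be reloaded without hitting WRDS again. A minimal sketch using data.table::fwrite (the file name is just an example):

# Save the full developed-markets data set locally; reload later with fread()
data |> fwrite("jkp_developed_markets.csv")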

Extracting data from many countries with limited RAM

If your computer has limited RAM, a viable alternative is to download the data country-by-country and save the country-specific files to your local PC:

for (country in countries_rel) {
    sql_query <- paste0(
        "SELECT id, eom, excntry, gvkey, permno, ",
        paste0(chars_rel, collapse = ", "),
        " FROM contrib.global_factor
          WHERE common=1 and exch_main=1 and primary_sec=1 and obs_main=1 and
                excntry = '", country, "';"
    )

    # Get data
    data <- wrds |> wrds_fetch(sql_query)

    # Save to a country-specific CSV file
    data |> fwrite(paste0(country, ".csv"))
}

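If you later want all countries in one object, the per-country files can be read back and stacked. A minimal sketch using data.table (note that the combined object will still require substantial RAM):

# Read the country-specific files back in and stack them into one data.table
files <- paste0(countries_rel, ".csv")
data_all <- files |> lapply(fread) |> rbindlist()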

Notes