Global Factor Data

How to download JKP data from WRDS

AUTHORS
Theis I. Jensen, Bryan Kelly, and Lasse H. Pedersen

Overview

This document explains how to download the data set from "Is There a Replication Crisis in Finance?" by Jensen, Kelly, and Pedersen (2023, henceforth JKP) directly from Python or R. To successfully run this code, you need a WRDS account with access to CRSP and Compustat. A viable alternative is to download the data directly from WRDS's web interface.

Step 0 - Load the relevant packages

Python:

import wrds
import pandas as pd
R:

library(RPostgres)
library(httr)
library(readxl)
library(tidyverse)
library(data.table)

Step 1 - Connect to WRDS

The first step is to connect to the WRDS server.

Python:

Run the code below to establish the connection.

wrds_db = wrds.Connection()

You will see prompts asking for your username and password; enter them to proceed. You may also see a prompt asking whether to create a .pgpass file. Choose 'y' and proceed.
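If you prefer to skip the username prompt, the wrds package also accepts the username as an argument (a minimal sketch; 'USERNAME' is a placeholder for your own account name):

# Optional: pass the username directly instead of typing it at the prompt
wrds_db = wrds.Connection(wrds_username='USERNAME')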

R:

Start by replacing USERNAME and PASSWORD with your WRDS username and password in the code below. Next, run the code to connect to the WRDS server.

wrds <- dbConnect(Postgres(),
                   host = 'wrds-pgdata.wharton.upenn.edu',
                   port = 9737,
                   dbname = 'wrds',
                   sslmode = 'require',
                   user = 'USERNAME',
                   password = 'PASSWORD')

Note: WRDS uses multi-factor authentication, so you might have to approve the request before the code will successfully execute.

Step 2 - Decide on your desired data subset

The full JKP data set is massive, but most people only need a subset of it. Here I'll show how to generate the data set used by JKP, except that I'll only use data from developed countries.

To identify the developed countries, I use the "Country Classification.xlsx" file available from our GitHub repository. The code below downloads this file and extracts the ISO codes of the developed countries.

Python:

# Download and extract the list of developed countries
countries = pd.read_excel(
    'https://github.com/bkelly-lab/ReplicationCrisis/raw/master/GlobalFactors/Country%20Classification.xlsx'
)
countries_rel = countries[countries['msci_development'] == 'developed']['excntry'].tolist()
R:

# Convenience function to download Excel files from GitHub
github_excel <- function(link) {
  temp_file <- tempfile(fileext = ".xlsx")
  req <- GET(link,
             authenticate(Sys.getenv("GITHUB_PAT"), ""),
             write_disk(path = temp_file))
  data <- readxl::read_excel(temp_file)
  unlink(temp_file)
  return(data)
}

# Download country classification file
# Notice I'm using the 'raw' rather than 'blob' URL
link <- "https://github.com/bkelly-lab/ReplicationCrisis/raw/master/GlobalFactors/Country%20Classification.xlsx"
countries <- link |> github_excel()

# Extract developed countries
(countries_rel <- countries |>
  filter(msci_development == "developed") |>
  pull(excntry))
[1] "USA" "JPN" "HKG" "GBR" "CAN" "AUS" "DEU" "FRA" "SWE" "CHE" "SGP" "ITA"
[13] "ESP" "ISR" "NLD" "NOR" "BEL" "DNK" "FIN" "NZL" "AUT" "IRL" "PRT"

Next, the data set contains more than 400 stock characteristics, but JKP only use the subset of 153 characteristics that are used to create published equity factors. The code below downloads the factor details file and extracts the relevant characteristics.

Python:

# Download and extract the list of relevant characteristics
chars = pd.read_excel('https://github.com/bkelly-lab/ReplicationCrisis/raw/master/GlobalFactors/Factor%20Details.xlsx')
chars_rel = chars[chars['abr_jkp'].notna()]['abr_jkp'].tolist()
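As a quick check, the resulting list should contain the 153 characteristics mentioned above:

# The list should have 153 entries, matching the count of published factors
print(len(chars_rel))  # 153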
R:

# Download the factor details file
link <- "https://github.com/bkelly-lab/ReplicationCrisis/raw/master/GlobalFactors/Factor%20Details.xlsx"
chars <- link |> github_excel()

# Extract the relevant characteristics
chars_rel <- chars |>
  filter(!is.na(abr_jkp)) |>
  pull(abr_jkp)

Finally, JKP relies on four screens, which correspond to four indicator columns in the data set:

- common = 1: keep common stocks only
- exch_main = 1: keep stocks listed on a main exchange
- primary_sec = 1: keep the primary security of each firm
- obs_main = 1: keep the main observation for each stock

I'll include these screens directly in the SQL queries below.
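Since the same four screens appear in every query, it can be convenient to keep them in a single string (a small sketch; the variable name screens is my own):

# The four JKP screens, reused in every query below
screens = "common=1 and exch_main=1 and primary_sec=1 and obs_main=1"
# Example use in an f-string query:
# f"... WHERE {screens} and excntry='DNK'"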

Step 3 - Extracting the data

Extracting data from one country

First, I'm going to show a simple extract of some identifying information, the stock's size group, its market equity, and its return over the next month, for stocks listed in Denmark:

Python:

# Download JKP data from Denmark
sql_query = f"""
    SELECT id, eom, excntry, gvkey, permno, size_grp, me, ret_exc_lead1m
    FROM contrib.global_factor
    WHERE common=1 and exch_main=1 and primary_sec=1 and obs_main=1 and
    excntry='DNK'
"""

data = wrds_db.raw_sql(sql_query)

         id             eom           excntry  gvkey     permno  size_grp   me            ret_exc_lead1m
0        301555201.0    1985-12-31    DNK      015552    None    large      632.873233    -0.125864
1        301563002.0    1985-12-31    DNK      015630    None    small      238.626127    -0.094336
2        300802002.0    1985-12-31    DNK      008020    None    large      752.215748    -0.021207
3        301560001.0    1985-12-31    DNK      015600    None    small      262.161970    -0.056547
4        301644901.0    1985-12-31    DNK      016449    None    large      572.112443    -0.140906
...      ...            ...           ...      ...       ...     ...        ...           ...
59776    335097901.0    2023-04-30    DNK      350979    None    mega       14.437491     NaN
59777    335111701.0    2023-04-30    DNK      351117    None    mega       17.879868     NaN
59778    335118601.0    2023-04-30    DNK      351186    None    mega       36.533362     NaN
59779    335351301.0    1985-12-31    DNK      353513    None    mega       33.802300     NaN
59780    335630801.0    1985-12-31    DNK      356308    None    None       NaN           NaN
59781 rows × 8 columns
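As a quick sanity check, you can count observations per year to see how the Danish coverage evolves over time (a minimal sketch that reuses the data frame from the query above):

# Count observations per year-end (assumes `data` from the query above)
data['eom'] = pd.to_datetime(data['eom'])
print(data.groupby(data['eom'].dt.year).size().tail())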
R:

# Convenience function to fetch data from WRDS
wrds_fetch <- function(wrds, sql_query, n=-1){
  res <- dbSendQuery(wrds, sql_query)
  data <- dbFetch(res, n=n)
  dbClearResult(res)
  return(data)
}

# Download JKP data from Denmark
sql_query <- paste0(
  " SELECT id, eom, excntry, gvkey, permno, size_grp, me, ret_exc_lead1m
    FROM contrib.global_factor
    WHERE common=1 and exch_main=1 and primary_sec=1 and obs_main=1 and
          excntry='DNK';"
)
data <- wrds |> wrds_fetch(sql_query)

# Show
data |> as_tibble()
# A tibble: 59,781 × 8

      id           eom           excntry gvkey     permno   size_grp me      ret_exc_lead1m
      <dbl>        <date>        <chr>   <chr>     <dbl>    <chr>    <dbl>   <dbl>
1     301555201    1985-12-31    DNK     015552    NA       large    633.    -0.126
2     301563002    1985-12-31    DNK     015630    NA       small    239.    -0.0943
3     300802002    1985-12-31    DNK     008020    NA       large    752.    -0.0212
4     301644901    1985-12-31    DNK     016449    NA       large    572.    -0.141
5     301560001    1985-12-31    DNK     015600    NA       small    262.    -0.0565
6     302390301    1985-12-31    DNK     023903    NA       small    312.    -0.0630
7     302391101    1985-12-31    DNK     023911    NA       small    267.    -0.0988
8     302482601    1985-12-31    DNK     024826    NA       small    139.    0.00548
9     310113005    1985-12-31    DNK     101130    NA       small    200.    0.201
10    310115101    1985-12-31    DNK     101151    NA       small    229.    -0.118
# ℹ 59,771 more rows
Extracting data from many countries

Next, I'm downloading the 153 characteristics from all developed countries.

Python:

On my machine, this takes around 35 minutes, and the data set is 17.3 GB in CSV format.

sql_query = f"""
    SELECT id, eom, excntry, gvkey, permno, size_grp, me, {', '.join(map(str, chars_rel))}
    FROM contrib.global_factor
    WHERE common=1 and exch_main=1 and primary_sec=1 and obs_main=1 and
    excntry in ({', '.join("'" + str(item) + "'" for item in countries_rel)})
"""

data = wrds_db.raw_sql(sql_query)
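To avoid re-running a 35-minute query, you'll probably want to save the result right away (a sketch; the file name jkp_developed.csv is my own choice):

# Save the extract locally; this is the ~17.3 GB CSV mentioned above
data.to_csv('jkp_developed.csv', index=False)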

R:

On my machine, this takes around 25 minutes, and the data set is 12.4 GB.

sql_query <- paste0(
  " SELECT id, eom, excntry, gvkey, permno, size_grp, me, ",
  paste0(chars_rel, collapse = ", "),
  " FROM contrib.global_factor
    WHERE common=1 and exch_main=1 and primary_sec=1 and obs_main=1 and
          excntry in (",
  paste0("'", countries_rel, "'", collapse = ", "),
  ");"
)
data <- wrds |> wrds_fetch(sql_query)
Extracting data from many countries with limited RAM

If your computer has limited RAM, a viable alternative is to download the data country-by-country and save the country-specific files to your local PC:

Python:

for country in countries_rel:
    sql_query = f"""
        SELECT id, eom, excntry, gvkey, permno, size_grp, me, {', '.join(map(str, chars_rel))}
        FROM contrib.global_factor
        WHERE common=1 and exch_main=1 and primary_sec=1 and obs_main=1 and
        excntry = '{str(country)}'
    """
    data = wrds_db.raw_sql(sql_query)
    data.to_csv(f'{country}.csv')
R:

for (country in countries_rel) {

  sql_query <- paste0(
    " SELECT id, eom, excntry, gvkey, permno,",
          paste0(chars_rel, collapse=", " ),
    " FROM contrib.global_factor
      WHERE common=1 and exch_main=1 and primary_sec=1 and obs_main=1 and
            excntry = '", country, "';")

  # Get data
  data <- wrds |> wrds_fetch(sql_query)

  # Save
  data |> fwrite(paste0(country, ".csv"))
}
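If you later want the full panel back in memory, you can read the per-country files and stack them (a minimal Python sketch; it assumes the CSVs written by the loop above sit in the working directory):

import pandas as pd

# Read the per-country files written above and stack them into one panel
files = [f'{country}.csv' for country in countries_rel]
data = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)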
