Global Factor Data

How to download JKP data from WRDS through Python

AUTHOR
Theis I. Jensen, Bryan Kelly, and Lasse H. Pedersen

Overview

This document explains how to download the data set from Is There a Replication Crisis in Finance by Jensen, Kelly, and Pedersen (2023, henceforth JKP) directly from Python. To run this code successfully, you need a WRDS account with access to CRSP and Compustat. A viable alternative is to download the data directly from WRDS's web interface.

Downloading the JKP data from WRDS

Step 0 - Import the relevant packages

import wrds
import pandas as pd


Step 1 - Connect to WRDS

The first step is to connect to the WRDS server. Run the code below to establish the connection.

wrds_db = wrds.Connection()


You will be prompted for your username and password; enter them to proceed. You may also be asked whether to create a .pgpass file; choose 'y' to store your credentials for future sessions. Note: WRDS uses multi-factor authentication, so you may have to approve the login request on your device before the code executes.

Step 2 - Decide on your desired data subset

The full JKP data set is massive, but most people only need a subset of it. Here I'll show how to generate the data set used by JKP, except that I'll only use data from developed countries.

To identify developed countries, I use the "Country Classification.xlsx" file available from our GitHub repository. The code below downloads this file and extracts the ISO codes of the developed countries.

# Download and extract the list of developed countries
countries = pd.read_excel('https://github.com/bkelly-lab/ReplicationCrisis/raw/master/GlobalFactors/Country%20Classification.xlsx')
countries_rel = countries[countries['msci_development'] == 'developed']['excntry'].tolist()

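The country codes extracted above later go into a SQL IN (...) clause. The small helper below is my own addition (not part of the JKP code); it mirrors the quoting used in the queries later in this guide:

```python
def sql_in_list(values):
    """Quote each value and join with commas for use in a SQL IN (...) clause."""
    return ', '.join("'" + str(v) + "'" for v in values)

# Example: sql_in_list(['USA', 'DNK', 'JPN']) -> "'USA', 'DNK', 'JPN'"
```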

Next, the data set contains more than 400 stock characteristics, but JKP only use a subset of 153 characteristics that underlie published equity factors. The list of relevant characteristics is in "Factor Details.xlsx". The code below downloads this file and extracts the relevant characteristics.

# Download and extract the list of relevant characteristics
chars = pd.read_excel('https://github.com/bkelly-lab/ReplicationCrisis/raw/master/GlobalFactors/Factor%20Details.xlsx')
chars_rel = chars[chars['abr_jkp'].notna()]['abr_jkp'].tolist()


Finally, JKP relies on four screens, which correspond to the four flags used in every query below:

- common=1 keeps common stocks
- exch_main=1 keeps stocks listed on a country's main exchanges
- primary_sec=1 keeps each stock's primary security
- obs_main=1 keeps the main observation for each security-month

I'll include these screens in the SQL query.
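To keep the queries readable, the four screen conditions can be collected in one reusable WHERE fragment. The helper below is a sketch of my own, not part of the original code:

```python
# The four JKP screens applied in every query in this guide.
SCREENS = ['common=1', 'exch_main=1', 'primary_sec=1', 'obs_main=1']

def where_screens(extra=None):
    """Join the four screens (plus an optional extra condition) with 'and'."""
    conditions = SCREENS + ([extra] if extra is not None else [])
    return ' and '.join(conditions)
```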

Step 3 - Extracting the data

Extracting data from one country

First, I'm going to show a simple extract of some identifying information, the stock's size group, its market equity, and its return over the next month, for stocks listed in Denmark:

# Download JKP data from Denmark
sql_query = """
          SELECT id, eom, excntry, gvkey, permno, size_grp, me, ret_exc_lead1m
          FROM contrib.global_factor
          WHERE common=1 and exch_main=1 and primary_sec=1 and obs_main=1 and
                excntry='DNK'
          """

data = wrds_db.raw_sql(sql_query)

id eom excntry gvkey permno size_grp me ret_exc_lead1m
0 301555201.0 1985-12-31 DNK 015552 None large 632.873233 -0.125864
1 301563002.0 1985-12-31 DNK 015630 None small 238.626127 -0.094336
2 300802002.0 1985-12-31 DNK 008020 None large 752.215748 -0.021207
3 301560001.0 1985-12-31 DNK 015600 None small 262.161970 -0.056547
4 301644901.0 1985-12-31 DNK 016449 None large 572.112443 -0.140906
... ... ... ... ... ... ... ... ...
59776 335097901.0 2023-04-30 DNK 350979 None mega 14.437491 NaN
59777 335111701.0 2023-04-30 DNK 351117 None mega 17.879868 NaN
59778 335118601.0 2023-04-30 DNK 351186 None mega 36.533362 NaN
59779 335351301.0 1985-12-31 DNK 353513 None mega 33.802300 NaN
59780 335630801.0 1985-12-31 DNK 356308 None None NaN NaN

59781 rows × 8 columns
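Before using the extract, it is good practice to drop exact duplicate rows and sort the panel. The function below is a minimal sketch of my own using plain pandas; it is not part of the JKP walkthrough:

```python
import pandas as pd

def clean_panel(df):
    """Drop exact duplicate rows and sort the panel by security id and month end."""
    return (df.drop_duplicates()
              .sort_values(['id', 'eom'])
              .reset_index(drop=True))
```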

Extracting data from many countries

Next, I'm downloading the 153 characteristics for all developed countries (on my machine, this takes around 35 minutes, and the data set is 17.3 GB in CSV format):

sql_query = f"""
          SELECT id, eom, excntry, gvkey, permno, size_grp, me, {', '.join(map(str, chars_rel))}
          FROM contrib.global_factor
          WHERE common=1 and exch_main=1 and primary_sec=1 and obs_main=1 and
                excntry in ({', '.join("'" + str(item) + "'" for item in countries_rel)})
          """

data = wrds_db.raw_sql(sql_query)

Extracting data from many countries with limited RAM

If your computer has limited RAM, a viable alternative is to download the data country-by-country and save the country-specific files to your local PC:

for country in countries_rel:
    sql_query = f"""
              SELECT id, eom, excntry, gvkey, permno, size_grp, me, {', '.join(map(str, chars_rel))}
              FROM contrib.global_factor
              WHERE common=1 and exch_main=1 and primary_sec=1 and obs_main=1 and
                    excntry = '{country}'
              """

    data = wrds_db.raw_sql(sql_query)
    data.to_csv(f'{country}.csv')

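Once the per-country files are on disk, they can be stacked back into a single DataFrame. The helper below is my own sketch, assuming the CSVs were written as in the loop above (with the default pandas index column):

```python
import glob
import pandas as pd

def combine_country_files(pattern='*.csv'):
    """Read every per-country CSV matching `pattern` and stack them into one DataFrame."""
    files = sorted(glob.glob(pattern))
    frames = [pd.read_csv(f, index_col=0) for f in files]
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```

Reading the files one at a time keeps peak memory close to the size of the final combined frame rather than requiring the full download to fit in RAM at once.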

Notes