JKP data: code migration from SAS/R to Python

Introduction

We're excited to announce the release of the newest version of the code that generates the global dataset of factor returns, stock returns, and firm characteristics from "Is there a Replication Crisis in Finance?" (Jensen, Kelly, and Pedersen, Journal of Finance 2023). This new version has been completely rewritten in Python, replacing the previous SAS and R codebase, and is freely available on GitHub. Join our community discussions to ask questions, request features, or share insights. We are grateful to Faheem Almas and Fernando Reyes De La Luz for excellent research assistance.

Highlights of this release:

Migration to Python: We’ve switched from SAS and R to Python.
Modular, faster codebase: The code is now more modular and leverages Polars and DuckDB for performance.
Substantial speed-ups: The full database and factor returns can be generated in ~6 hours.
Single-command execution: The characteristics database and factor returns can be computed using a single Python command.
Reproducible environments with uv: We now use uv for fast, lockfile-based dependency management, making installs and runs consistent across machines.
More data: The code now outputs additional data, like global standardized quarterly and annual accounting data; global daily and monthly factors for the Fama-French three-factor model and the Hou-Xue-Zhang four-factor model; and global GICS industry returns.
Modern output formats: Outputs have moved from CSV to Parquet.

Naturally, changing the programming language led to some differences in the underlying algorithms relative to SAS. Nevertheless, most functions and the overall structure of the SAS/R codebase have been retained. For details on factor definitions and construction, see our documentation.

Output structure

All output is contained in:

jkp-data/data/processed/

The output is organized as follows:

accounting_data/
- quarterly.parquet and annual.parquet contain firm-level accounting characteristics from Compustat data sourced from quarterly and annual filings, respectively, and standardized across countries.
characteristics/
- world_data_unfiltered.parquet contains the monthly stock file with stock-level characteristics, without filters.
- world_data_filtered.parquet contains the monthly stock file with stock-level characteristics filtered by primary_sec == 1 (primary security flag), common == 1 (common stock flag), obs_main == 1 (main observation flag), exch_main == 1 (main exchange flag).
- Partitions of world_data_filtered.parquet by country (e.g., GBR.parquet, USA.parquet). Files are named using ISO Alpha-3 country codes.
other_output/
- ap_factors_monthly.parquet and ap_factors_daily.parquet contain the returns of FF3 and HXZ4 factor portfolios for every country at monthly and daily frequencies, respectively.
- market_returns.parquet and market_returns_daily.parquet contain market-portfolio returns for every country at monthly and daily frequencies, respectively.
- nyse_cutoffs.parquet contains NYSE market-capitalization quantiles used for portfolio construction.
- return_cutoffs.parquet and return_cutoffs_daily.parquet contain return quantiles used as winsorization thresholds.
portfolios/
- pfs(_daily).parquet contains portfolios formed at monthly (daily) frequency by sorting stocks into three groups based on non-microcap breakpoints. Portfolio 1 (3) includes stocks with the lowest (highest) value of the characteristic.
- hml(_daily).parquet contains long–short portfolios formed at monthly (daily) frequency that go long Portfolio 3 (high characteristic values) and short Portfolio 1 (low characteristic values) from pfs(_daily).parquet.
- lms(_daily).parquet contains long–short portfolios formed at monthly (daily) frequency following the Jensen, Kelly, and Pedersen (2023) signing convention (e.g., long low-asset-growth, short high-asset-growth).
- cmp(_daily).parquet contains rank-weighted (characteristic-managed) portfolios formed at monthly (daily) frequency within mega, large, small, micro, and nano capitalization groups in the U.S.
- country_factors(_daily)/ contains country-by-country lms.parquet files for easier use at monthly (daily) frequency. Files are named using ISO Alpha-3 country codes.
- regional_factors(_daily)/ contains regional factor portfolios based on lms.parquet at monthly (daily) frequency, constructed following Jensen, Kelly, and Pedersen (2023).
- clusters(_daily).parquet contains returns for clusters of factors organized into 13 themes.
- regional_clusters(_daily)/ contains regional cluster outputs.
- industry_gics.parquet contains GICS industry-level returns.
return_data/
- world_dsf.parquet contains daily stock-level returns for all securities in the database.
- world_ret_monthly.parquet contains monthly stock-level returns for all securities in the database.
- daily_rets_by_country/ contains partitions of world_dsf.parquet by country (e.g., GBR.parquet, USA.parquet). Files are named using ISO Alpha-3 country codes.
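
To make the pfs, hml, and lms descriptions in the portfolios/ list more concrete, here is a minimal sketch of a three-group sort. It rests on several assumptions that the post does not spell out: the breakpoints are taken to be the tercile cut points of the characteristic among non-microcap stocks, stocks are equal-weighted within each portfolio for simplicity, and the input layout (dicts with keys char, ret, is_micro) and all function names are hypothetical. It is not the code used in the repository.

```python
from statistics import mean, quantiles


def tercile_breakpoints(values):
    """Return the two cut points that split `values` into three groups."""
    lo, hi = quantiles(values, n=3, method="inclusive")
    return lo, hi


def assign_portfolio(x, lo, hi):
    """Portfolio 1 holds the lowest values, portfolio 3 the highest."""
    if x <= lo:
        return 1
    if x <= hi:
        return 2
    return 3


def hml_return(stocks):
    """High-minus-low return for one period.

    `stocks` is a list of dicts with hypothetical keys 'char', 'ret',
    and 'is_micro'. Breakpoints use non-microcap stocks only (assumed),
    but every stock is assigned to a portfolio.
    """
    non_micro = [s["char"] for s in stocks if not s["is_micro"]]
    lo, hi = tercile_breakpoints(non_micro)
    groups = {1: [], 2: [], 3: []}
    for s in stocks:
        groups[assign_portfolio(s["char"], lo, hi)].append(s["ret"])
    # Equal-weighted portfolio returns (a simplification).
    return mean(groups[3]) - mean(groups[1])


def lms_return(hml, sign):
    """Apply a signing convention: sign=-1 flips the factor so that it is
    long low values and short high values (e.g., asset growth)."""
    return sign * hml
```

With sign set to -1, lms_return turns a high-minus-low return into the long-low, short-high orientation mentioned for asset growth.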

The jkp-data/data/processed/ directory also contains two additional subfolders, raw/ and interim/, used by the code to store temporary data for computing stock-level characteristics.

The Python functions behave almost identically to their SAS/R counterparts. The exception is standardized_accounting_data, which was changed to handle duplicates more precisely.

The SAS/R and Python outputs diverge slightly. The main sources of these differences are:

Numerical precision: Differences in floating-point handling between SAS and Python can alter inequality evaluations and calculated values.
Data revisions: Underlying data may have changed.
Algorithmic adjustments: Differences due to updates in standardized_accounting_data for improved duplicate handling.
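
As a generic illustration of the numerical-precision point (not taken from the codebase), the snippet below shows how a strict inequality can flip on a value that is mathematically on the boundary. The variable names are made up for the example.

```python
import math

threshold = 0.3
value = 0.1 + 0.2  # mathematically equal to 0.3

# In binary floating point the sum is slightly above 0.3, so a strict
# inequality that "should" be False evaluates to True.
strictly_above = value > threshold

# A tolerance-based comparison treats the two numbers as equal.
equal_within_tol = math.isclose(value, threshold, rel_tol=1e-9)
```

A filter or sort that depends on such a boundary comparison can therefore include or exclude an observation depending on how each language rounds intermediate results.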

Next, we show a strong alignment between the characteristics and factors produced by the SAS/R code and those produced by the new Python code.

Comparison of characteristics at stock level

The data contains 402 stock-level characteristics. We start by computing the Spearman correlation between each characteristic from SAS/R and the corresponding characteristic from Python. The correlation is computed across all firms, dates, and countries where both versions have non-missing values (in practice, the overlap is close to perfect).
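
The comparison described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: missing values are represented as None, the two lists are already aligned observation by observation, and the function names are hypothetical rather than those used to produce the figures.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman_on_overlap(sas, py):
    """Spearman correlation over pairs where both values are non-missing."""
    pairs = [(a, b) for a, b in zip(sas, py) if a is not None and b is not None]
    x, y = zip(*pairs)
    # Spearman is the Pearson correlation of the ranks.
    return pearson(average_ranks(x), average_ranks(y))
```

Because only ranks enter the calculation, a single extreme value in one version does not pull the Spearman correlation away from 1 as long as the ordering is preserved.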

Figure 1: Histogram of Spearman rank correlations.

Figure 1 shows that the Spearman correlations are high, with the lowest correlation above 0.994, the average correlation at 0.999, and most correlations effectively equal to 1.

Turning to the Pearson correlations, two characteristics (resff3_6_1 and resff3_12_1) have low correlations, but these low correlations are solely due to outliers. The 1st and 99th percentiles of the characteristics are effectively identical between SAS and Python. Besides these two characteristics, sale_emp and bidaskhl_21d have Pearson correlations of 0.977 and 0.991, respectively, and all other characteristics have a Pearson correlation above 0.997. We provide a wide range of summary statistics to compare characteristics in SAS/R and Python in sas_vs_py_summ_stats.parquet.¹
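
To see how a few outliers can drive a low Pearson correlation even when the bulk of the distribution matches, consider the synthetic example below. The data are invented for illustration (they are not the resff3 values), and the clipping bounds are chosen by hand for this toy series rather than estimated as percentiles.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def clip(values, lo, hi):
    """Limit every value to the interval [lo, hi]."""
    return [min(max(v, lo), hi) for v in values]


# Two series that agree everywhere except for one extreme observation.
sas = [float(i) for i in range(1, 101)]
py = list(sas)
py[-1] = 1e6

raw_corr = pearson(sas, py)  # dominated by the single outlier
clipped_corr = pearson(clip(sas, 1.0, 100.0), clip(py, 1.0, 100.0))
```

Here raw_corr is far below 1 while clipped_corr is essentially 1, which is the pattern one would expect when disagreement is confined to the tails.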

US factor comparison

We next compute the Pearson correlations between the U.S. long–short factor-portfolio returns produced by SAS/R and Python. For each factor, we weight stocks using the same capped value weights as in Jensen, Kelly, and Pedersen (2023). Figure 2 shows that all correlations are close to 1.
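
As a rough sketch of what capped value weighting means for a long–short return, the example below caps each stock's market capitalization before computing weights. The cap level is passed in as a parameter because this post does not restate it, and the function names and input layout are hypothetical.

```python
def capped_value_weights(market_caps, cap):
    """Value weights after capping each market cap at `cap`."""
    capped = [min(me, cap) for me in market_caps]
    total = sum(capped)
    return [me / total for me in capped]


def weighted_return(returns, weights):
    """Portfolio return as the weighted sum of stock returns."""
    return sum(r * w for r, w in zip(returns, weights))


def long_short_return(long_leg, short_leg, cap):
    """Long-minus-short return; each leg is a list of (market_cap, return)."""

    def leg_return(leg):
        caps, rets = zip(*leg)
        return weighted_return(rets, capped_value_weights(caps, cap))

    return leg_return(long_leg) - leg_return(short_leg)
```

Capping limits the influence of the very largest firms while still giving larger stocks more weight than smaller ones.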

Figure 2: Correlation of US factor portfolios produced by Python and SAS/R.

Next, we compare the mean returns and standard deviations of the factors. In the scatter plots below, the x-axis shows the statistic for SAS/R and the y-axis shows the corresponding statistic for Python. If the two versions match, the points should lie on a 45-degree line through the origin (slope 1, intercept 0). Figure 3 shows that the alignment is close to perfect.
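
One simple way to quantify "on the 45-degree line" is to fit a least-squares line of the Python statistic on the SAS/R statistic and check that the slope is near 1 and the intercept near 0. The sketch below is an illustrative check with hypothetical function names, not the procedure used to draw the figures.

```python
def fit_line(x, y):
    """Least-squares slope and intercept of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


def max_abs_deviation(x, y):
    """Largest vertical distance of any point from the line y = x."""
    return max(abs(b - a) for a, b in zip(x, y))
```

For example, fit_line(sas_means, py_means) on two identical lists returns a slope of 1 and an intercept of 0 up to rounding, and max_abs_deviation reports the single worst-matching factor.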

Figure 3: Comparison of mean and standard deviation of US factor portfolios produced by Python and SAS/R.

World ex-US factor comparison

We now repeat the comparison for the World ex-US region. Figure 4 shows that the return correlations are high, but lower than in the U.S. The differences mainly arise in earlier years, when few stocks have a non-missing characteristic and small changes in which stocks are held long or short can have a large impact on the factor return. Figure 5, however, shows that the mean and standard deviation of the factors across the two datasets align closely.

Figure 4: Correlation of World ex-US factor portfolios produced by Python and SAS/R.

Figure 5: Comparison of mean and standard deviation of World ex-US factor portfolios produced by Python and SAS/R.

Overall, the mean and standard deviation of the factor returns are close to those produced in SAS/R.

Correlation by factor-country

Finally, we examine correlations at the factor–country level. Figure 6 summarizes these comparisons by plotting the inverse cumulative distribution of the correlations for each weighting scheme (equal weights, value weights, and capped value weights). The figure shows that the vast majority of factors across countries and weighting schemes achieve a correlation close to 1.
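
For readers who want to build a similar summary from their own comparison, the sketch below computes the points of an empirical inverse CDF and the share of correlations at or above a chosen threshold. The function names and the example threshold are hypothetical.

```python
def inverse_cdf_points(correlations):
    """Return (cumulative share, correlation) pairs for an inverse-CDF plot.

    The first element of each pair is the fraction of observations at or
    below the correlation given in the second element.
    """
    ordered = sorted(correlations)
    n = len(ordered)
    return [((i + 1) / n, value) for i, value in enumerate(ordered)]


def share_at_least(correlations, threshold):
    """Fraction of correlations that are at or above `threshold`."""
    return sum(c >= threshold for c in correlations) / len(correlations)
```

Plotting the pairs with the cumulative share on the x-axis gives a curve that stays low only for the few worst-matching series and then rises quickly toward 1.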

Figure 6: Inverse CDF of return series correlation.

Conclusion

We have completely migrated the JKP data codebase from SAS/R to Python, resulting in a more modular, faster, and user-friendly implementation. The new code produces outputs that closely align with those from the previous SAS/R codebase, ensuring continuity and reliability for users. Our hope is that migration to Python code will make it easier for people to use, modify, and contribute to the codebase. Please post any questions, concerns, or feedback in our GitHub Discussions, or browse the full source code in our GitHub repository.

Footnotes

¹ In the new Python code, the ‘div’ and ‘eqnpo’ characteristics in the market_chars_monthly subroutine are coerced to 0 whenever their magnitude falls below 1e-5, in order to eliminate arithmetic noise. When comparing characteristic values, we have likewise set ‘div’ and ‘eqnpo’ to 0 in the SAS results whenever their magnitude is smaller than 1e-5. In the portfolio return comparisons, however, we use the original SAS/R output, that is, before applying the zero-floor adjustments to div_* and eqnpo_*.
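
The zero-floor adjustment described in this footnote amounts to the following rule, shown here as a plain-Python sketch. The function name is hypothetical, missing values are assumed to be represented as None, and the actual implementation in market_chars_monthly may differ.

```python
def zero_floor(values, tol=1e-5):
    """Set entries whose magnitude is below `tol` to exactly 0.

    Missing values (None) are passed through unchanged.
    """
    return [
        None if v is None else (0.0 if abs(v) < tol else v)
        for v in values
    ]
```

Applying the same rule to both the SAS and Python values before comparing them prevents tiny nonzero residues from being treated as genuine differences.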