Global Factor Data

Common Task Framework Rules

For step-by-step instructions on how to submit a model, please refer to the Python or R guide, as applicable.

Section 1: Research Methodology Rules

Rule 1: Temporal Integrity

All information used for portfolio construction at time t must be available at or before time t. Portfolio weights cannot use future information in any form.

Example:

You cannot use returns from June 2025 to create a portfolio in May 2025, because those returns were not yet observable at the formation date.
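A minimal sketch of enforcing this rule in pandas. The column names `eom` and `ret` are illustrative here; the actual input schema is described in Rules 12 and 14:

```python
import pandas as pd

def information_set(df: pd.DataFrame, t: pd.Timestamp) -> pd.DataFrame:
    """Return only rows observable at portfolio-formation date t."""
    return df[df["eom"] <= t]

# Illustrative data: the June 2025 row must be excluded when forming
# a portfolio in May 2025.
returns = pd.DataFrame({
    "eom": pd.to_datetime(["2025-04-30", "2025-05-31", "2025-06-30"]),
    "ret": [0.01, -0.02, 0.03],
})
usable = information_set(returns, pd.Timestamp("2025-05-31"))
# The 2025-06-30 observation is dropped.
```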

Rule 2: Feature Selection Constraints

Feature selection must be entirely algorithmic. Manual selection of features based on historical performance knowledge is prohibited. All feature engineering must be performed programmatically within the submitted code.

Prohibited Example:

Creating a model that exclusively uses 12-month return momentum and book-to-market equity to build a portfolio. Such selections could introduce look-ahead bias if the feature choices were influenced by their known historical performance.

Permitted Example:

Building a portfolio based on the ten best-performing characteristics at time t, where the selection is determined algorithmically using only information available at that point in time.
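One hedged way to implement such algorithmic selection is sketched below. The characteristic names and the scoring rule (trailing mean return of each characteristic's strategy) are invented for illustration; any rule is fine as long as it uses only data dated at or before t:

```python
import pandas as pd
import numpy as np

def top_k_characteristics(char_returns: pd.DataFrame,
                          t: pd.Timestamp, k: int = 10) -> list:
    """Rank characteristics by the mean return of their strategies,
    using only observations dated at or before t, and keep the top k."""
    history = char_returns[char_returns["eom"] <= t]
    scores = history.groupby("characteristic")["ret"].mean()
    return scores.nlargest(k).index.tolist()

# Toy example with three fake characteristic strategies; month-start
# dates stand in for month-end labels for brevity.
rng = np.random.default_rng(0)
months = pd.date_range("2020-01-01", periods=12, freq="MS")
toy = pd.DataFrame({
    "eom": np.tile(months, 3),
    "characteristic": np.repeat(["mom_12m", "be_me", "size"], 12),
    "ret": rng.normal(0.0, 0.05, 36),
})
selected = top_k_characteristics(toy, pd.Timestamp("2020-12-31"), k=2)
```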

Rule 3: Data Sources

Only the provided CTF dataset is permitted. External data sources, including macroeconomic indicators, alternative data, or web-scraped information, are prohibited. Feature engineering through mathematical transformations of provided characteristics is permitted.

Prohibited Example:

Using external macroeconomic data, alternative datasets, or web-scraped information to enhance your model.

Permitted Example:

Creating new features by transforming or combining existing characteristics in the provided dataset (e.g., ratios, moving averages, or interaction terms).
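For instance, a ratio feature and a trailing moving average can be built like this. The column names (`book_equity`, `market_equity`) are placeholders, not the dataset's actual field names:

```python
import pandas as pd

# Toy characteristics frame; column names are illustrative only.
chars = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "eom": pd.to_datetime(["2024-01-31", "2024-02-29", "2024-03-31"] * 2),
    "book_equity": [10.0, 11.0, 12.0, 5.0, 5.5, 6.0],
    "market_equity": [20.0, 22.0, 20.0, 10.0, 11.0, 12.0],
})

# Ratio feature.
chars["be_me"] = chars["book_equity"] / chars["market_equity"]

# Trailing 2-month moving average within each security, using only
# current and past rows (no look-ahead).
chars = chars.sort_values(["id", "eom"])
chars["be_me_ma2"] = chars.groupby("id")["be_me"].transform(
    lambda s: s.rolling(2, min_periods=1).mean()
)
```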

Rule 4: Reproducibility Requirements

Submissions must be fully reproducible. All code must be self-contained with complete dependency specifications:

  • Python: Use requirements.txt or pyproject.toml for package dependencies
  • R: Use renv.lock for reproducible package versions

Dependencies must be available from PyPI (Python) or CRAN (R). Private package repositories or local packages are not supported. See Rule 8 for pre-installed packages.

Best Practices:
  • Pin exact package versions (e.g., statsmodels==0.14.0 not statsmodels>=0.14)
  • Consider using uv (Python) or renv (R) for precise dependency management and reproducibility
  • Test your submission in a fresh environment before uploading
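For example, a pinned requirements.txt might look like the following (the package set and version numbers are purely illustrative, not a recommendation):

```
pandas==2.2.2
numpy==1.26.4
scikit-learn==1.4.2
statsmodels==0.14.0
```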

Rule 5: Technical Implementation

Required submission components:

  • Model Script (required): A self-contained Python or R script that implements the main() function with the specified signature (see Rules 11-12)
  • Portfolio Weights (generated): CSV file with portfolio weights for all observations where ctff_test is True
  • Methodology Document (optional but encouraged): A PDF describing your approach

See the Python and R guides for more details.

Rule 6: Portfolio Construction

Portfolios must be rebalanced monthly. No constraints are imposed on portfolio construction methodology. Shorting, leverage, position limits, and turnover constraints are permitted at the submitter's discretion.
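As one example of the freedom this rule allows, a simple dollar-neutral long-short normalization with a gross-leverage cap could be sketched as follows (the function and its parameters are an assumption for illustration, not a required methodology):

```python
import pandas as pd

def scale_long_short(w: pd.Series, gross: float = 2.0) -> pd.Series:
    """Demean raw signals into dollar-neutral weights and scale so
    that sum(|w|) equals the chosen gross exposure."""
    centered = w - w.mean()
    total = centered.abs().sum()
    if total == 0:
        return centered
    return centered * (gross / total)

signal = pd.Series([0.3, 0.1, -0.2, -0.2],
                   index=[10006, 17566, 38914, 22592])
weights = scale_long_short(signal)
# Weights sum to ~0 and |weights| sum to 2.0.
```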

Rule 7: Academic Integrity

Submissions of prior published work are encouraged. Multiple submissions are permitted to allow iterative improvement. Standard academic citation practices are recommended for methodology descriptions.

Section 2: Execution Environment Rules

Rule 8: Pre-Installed Packages

Each execution environment includes a baseline set of packages. Additional dependencies may be specified in your dependency file.

Python Pre-Installed Packages:

  • pandas (≥2.0.0)
  • numpy (≥1.24.0)
  • pyarrow (≥10.0.0)
  • boto3 (≥1.26.0)
  • scipy
  • scikit-learn
  • polars
  • joblib

R Pre-Installed Packages:

  • arrow (required for Parquet I/O)
  • data.table (high-performance data manipulation)
  • dplyr (tidyverse data manipulation)
  • tidyr (data tidying)

Additional packages may be installed by including a requirements.txt (Python) or renv.lock (R) file with your submission. All additional packages must pass security scanning before installation.

Rule 9: Compute Resources and Time Limits

Submissions execute on high-performance computing infrastructure with the following resource allocations:

  Resource              Specification
  CPU cores             32
  Memory                300 GB RAM
  Execution time limit  24 hours

Submissions that exceed memory limits will terminate with an out-of-memory error. Submissions that exceed the time limit will be terminated and marked as failed.

Best Practices:
  • Use vectorized operations over explicit loops
  • Consider memory-efficient data types (e.g., float32 instead of float64)
  • Test locally with a validation dataset before full submission
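The dtype suggestion above can be illustrated directly; downcasting to float32 roughly halves memory at the cost of about 7 significant decimal digits of precision:

```python
import numpy as np
import pandas as pd

# A large float64 column, then the same data stored as float32.
df = pd.DataFrame({"ret": np.random.default_rng(0).normal(size=1_000_000)})
before = df["ret"].memory_usage(deep=True)
df["ret"] = df["ret"].astype("float32")
after = df["ret"].memory_usage(deep=True)
# after is roughly half of before.
```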

Note: Resource specifications may be updated. Check this page for current values.

Rule 10: Network Isolation

Submitted code executes in a fully isolated network environment:

  • No outbound network access: Attempts to connect to external URLs, APIs, or services will fail
  • No internet connectivity: The execution environment has no route to the public internet
  • Build-time network access: Network access is available only during the container build phase for package installation

Any code that requires runtime network access will fail. All data required for model execution is provided via the function arguments.

Section 3: Submission Format Rules

Rule 11: Main Function Signature

Your submission must define a main function with the exact signature specified for your language:

Python:

def main(chars: pd.DataFrame, features: pd.DataFrame, daily_ret: pd.DataFrame) -> pd.DataFrame:
    """
    Args:
        chars: Stock characteristics (ctff_chars.parquet)
        features: Computed features (ctff_features.parquet)
        daily_ret: Historical daily returns (ctff_daily_ret.parquet)

    Returns:
        DataFrame with columns: id, eom, w
    """
    # Your model logic
    return output_df

R:

main <- function(chars, features, daily_ret) {
    # chars: data.frame from ctff_chars.parquet
    # features: data.frame from ctff_features.parquet
    # daily_ret: data.frame from ctff_daily_ret.parquet

    # Return data.frame with columns: id, eom, w
    return(output_df)
}

Submissions without a valid main function will fail validation.
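A minimal end-to-end sketch of a valid Python submission might look like this: an equal-weight portfolio within each test month. It assumes `chars` carries `id`, `eom`, and the boolean `ctff_test` column mentioned in Rule 5; this is a toy baseline, not a suggested model:

```python
import pandas as pd

def main(chars: pd.DataFrame, features: pd.DataFrame,
         daily_ret: pd.DataFrame) -> pd.DataFrame:
    """Toy baseline: equal weights within each test month."""
    test = chars.loc[chars["ctff_test"], ["id", "eom"]].copy()
    counts = test.groupby("eom")["id"].transform("size")
    test["w"] = 1.0 / counts
    return test[["id", "eom", "w"]]

# Smoke test with fake inputs.
fake_chars = pd.DataFrame({
    "id": [1, 2, 3, 1, 2],
    "eom": pd.to_datetime(["2024-01-31"] * 3 + ["2024-02-29"] * 2),
    "ctff_test": [True, True, True, True, False],
})
out = main(fake_chars, pd.DataFrame(), pd.DataFrame())
```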

Rule 12: Output Format Requirements

Your main() function must return a DataFrame with the following schema:

  Column  Type     Description
  id      integer  Security identifier from the input data
  eom     date     End-of-month date (YYYY-MM-DD format)
  w       float    Portfolio weight

About the id column:

The id values come from the input data and must be returned unchanged. For CRSP securities, the id is the CRSP permno. For Compustat securities, the id is a composite identifier. Your output should use the same id values present in the input DataFrames.

Example output:

  id     eom         w
  10006  2024-01-31  0.05
  17566  2024-01-31  0.03
  38914  2024-01-31  -0.02

Validation Requirements:

  • DataFrame must be non-empty
  • Column names must be exactly id, eom, w (case-sensitive)
  • No missing values in any column
  • Output size must not exceed 50 MB when serialized
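The first three requirements can be checked locally before returning; a sketch of such a guard (the helper name is ours, not part of the pipeline):

```python
import pandas as pd

def validate_output(df: pd.DataFrame) -> None:
    """Raise if the output violates the Rule 12 checks that can be
    verified locally (non-empty, exact columns, no missing values)."""
    assert len(df) > 0, "output must be non-empty"
    assert list(df.columns) == ["id", "eom", "w"], \
        "columns must be exactly id, eom, w"
    assert not df.isna().any().any(), "missing values are not allowed"

out = pd.DataFrame({
    "id": [10006, 17566],
    "eom": pd.to_datetime(["2024-01-31", "2024-01-31"]),
    "w": [0.05, 0.03],
})
validate_output(out)  # passes silently
```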

The pipeline infrastructure automatically captures your function's return value and writes it to the output file. Your results must be returned exclusively via the main() function's return value. Do not attempt startup scripts, container entrypoints, or direct output file writing.

Rule 13: Source File Constraints

Source code files must meet the following requirements:

  Constraint         Limit
  Maximum file size  1 MB per file
  File encoding      UTF-8
  Binary files       Not permitted (source files only)

Files exceeding these limits will be rejected during validation.

Rule 14: Data Input

Your main() function receives three DataFrames as arguments, loaded from the following Parquet files:

  Argument   Source File             Description
  chars      ctff_chars.parquet      Stock characteristics (fundamental data)
  features   ctff_features.parquet   Computed features (technical indicators)
  daily_ret  ctff_daily_ret.parquet  Historical daily returns

The pipeline loads these files and passes them to your function. You do not need to read files directly.

Execution Modes:
  • Validation: Uses a small subset (~4 MB) for quick testing
  • Full: Uses the complete dataset (~1.1 GB) for final scoring

The CTF_EXECUTION_MODE environment variable indicates which mode is running, but most models do not need to check this.
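If your model does want to adapt to the mode, reading the variable is straightforward. The exact values it takes are an assumption here (defaulting to the cheap path when unset, e.g. during local runs):

```python
import os

# The pipeline sets CTF_EXECUTION_MODE; locally it is usually unset,
# so default to inexpensive validation-style settings.
mode = os.environ.get("CTF_EXECUTION_MODE", "validation")
if mode == "validation":
    n_iterations = 10   # cheap settings for the small test subset
else:
    n_iterations = 500  # full-effort settings for final scoring
```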

Section 4: Security Rules

Rule 15: Prohibited Operations

The following operations are prohibited and will cause submission rejection:

Network Operations:

  • socket, urllib, requests, http.client (Python)
  • download.file(), url(), httr calls (R)

Shell Execution:

  • subprocess, os.system, os.popen (Python)
  • system(), system2(), shell() (R)

Dynamic Code Execution:

  • eval(), exec(), compile() (Python)
  • eval(), parse() with arbitrary strings (R)

Filesystem Access:

  • Reading or writing files outside permitted temporary directories
  • Attempts to access system files, environment secrets, or other submissions

Credential Exposure:

  • Hardcoded API keys, passwords, or tokens in source code

Warning:

Submissions containing these patterns will fail security validation.

Rule 16: Dependency Security Scanning

All package dependencies are scanned for known vulnerabilities before installation:

Scanning Tools:

  • Python: OSV-Scanner, pip-audit, GuardDog (malware detection)
  • R: ROSV (R OSV wrapper)

Important:

Submissions with dependencies containing HIGH or CRITICAL severity vulnerabilities will be rejected. If you believe your submission was incorrectly rejected, contact the administrators.

Section 5: Execution Process Rules

Rule 17: Logging and Debugging

Standard output and error streams from your code are captured:

  • Use print() (Python) or print()/cat() (R) for debugging output
  • Adding logging output helps CTF administrators diagnose issues with your submission
  • Avoid logging sensitive information

Consider logging progress updates, timing information, and intermediate results to aid troubleshooting.
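For instance, a simple timed progress message (flushing ensures the output is captured even if the run is terminated):

```python
import time

start = time.monotonic()
print("fitting model ...", flush=True)
# ... model fitting would happen here ...
elapsed = time.monotonic() - start
print(f"fit complete in {elapsed:.1f}s", flush=True)
```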

Rule 18: Randomness and Reproducibility

For reproducible results:

  • Set random seeds explicitly (e.g., np.random.seed(42) or set.seed(42))
  • Avoid operations with non-deterministic ordering unless seeded
  • Test locally to verify consistent outputs

While not strictly enforced, reproducible code helps with debugging and validation.

Best Practices:
  • Set seeds at the very start of your main() function
  • Run your model multiple times locally to verify consistent output
  • Document any intentional non-determinism in your methodology
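The seeding advice above can be sketched as follows; seeding both the stdlib and NumPy covers most libraries, and `default_rng` is the preferred modern NumPy interface:

```python
import random
import numpy as np

SEED = 42
random.seed(SEED)      # stdlib RNG
np.random.seed(SEED)   # legacy global NumPy RNG (some libraries use it)
rng = np.random.default_rng(SEED)  # preferred modern NumPy generator

draw = rng.normal(size=3)
# Re-seeding a fresh generator reproduces the same draw exactly.
draw_again = np.random.default_rng(SEED).normal(size=3)
```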

Note: These rules are designed to ensure the academic integrity, security, and real-world applicability of submitted models. Adherence to these guidelines is essential for meaningful comparative analysis within the Common Task Framework. Rules and resource specifications may be updated; please check this page regularly for the most current requirements.