"""Elastic Net model for stock return prediction using pooled and stock-wise training modes.

NOTE: This module is designed for the CTF pipeline, which provides load_data() and export_data()
functions. The __name__ == '__main__' block has been removed as it's not needed for pipeline
execution and had parameter mismatches.

PERFORMANCE OPTIMIZATIONS (2025-11-16):
---------------------------------------
1. Pre-allocation optimization (O(n²) → O(n)):
   - Replaced np.vstack/np.append pattern in retrain loop with pre-allocated arrays
   - Use index pointer (train_end) instead of copying growing arrays each iteration
   - Eliminates quadratic memory copying overhead as training set expands
   - Security: Safe - pure memory layout optimization, no external input handling
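
   A minimal sketch of the before/after pattern (illustrative; names mirror the
   retrain branch below):

       # Before: O(n²) total copying as the window grows each interval
       #   X_train = np.vstack([X_train, X_block])
       # After: concatenate once, then just advance an index pointer
       X_full = np.vstack([X_train_initial, X_test_all])
       train_end = len(X_train_initial)
       for block_len in block_sizes:  # block_sizes is illustrative
           model.fit(X_full[:train_end], y_full[:train_end])
           train_end += block_len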

2. Warm start for incremental fitting:
   - Added warm_start=True to ElasticNet model initialization
   - Reuses previous coefficients as starting point for next fit
   - Significantly faster convergence when data changes incrementally
   - Security: Safe - sklearn built-in parameter, no security implications
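
   A minimal sketch of the warm-start usage (names mirror the retrain branch below):

       model = ElasticNet(alpha=0.001, l1_ratio=0.9, warm_start=True)
       model.fit(X_full[:train_end], y_full[:train_end])  # first fit from scratch
       train_end += block_len
       model.fit(X_full[:train_end], y_full[:train_end])  # restarts from previous coef_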

3. Simplified standardization:
   - Replaced complex Polars-based standardization with simple pandas groupby
   - Removed DataFrame ↔ Polars roundtrip overhead
   - More maintainable, fewer dependencies to audit
   - Security: Safer - standard pandas operations, no complex Polars transformations
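
   The core pattern is a per-date z-score via groupby/transform (see standardize() below):

       df[col] = df.groupby('eom')[col].transform(
           lambda x: (x - x.mean()) / x.std(ddof=0))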

4. Removed unused StandardScaler:
   - Date-specific standardization already handles feature scaling
   - Eliminates duplicate/conflicting standardization logic
   - Security: Safe - removes unused code path

Expected speedup: 3-10x for typical validation datasets, scaling with the number of retrain intervals.

BUG FIX #1 (2025-11-16):
------------------------
WHAT WENT WRONG: Model execution failed with error: 'DataFrame' object has no attribute 'to_list'
    - Line 10 used `features.to_list()` to convert pandas Index/Series to a Python list
    - This caused an AttributeError during pipeline execution

WHY IT WENT WRONG: `to_list()` is not available on every pandas/numpy object
    - NumPy arrays expose only `tolist()` (no underscore); calling `to_list()` raises AttributeError
    - Recent pandas Series/Index accept both spellings, but DataFrame has neither method
      (handling of a DataFrame-typed `features` input is covered in BUG FIX #2 below)
    - This is a common mistake when switching between pandas and other libraries (e.g., Polars uses `to_list()`)

HOW IT WAS FIXED: Changed `features.to_list()` to `features.tolist()` on line 10
    - Replaced the incorrect method name with the correct pandas/numpy method
    - The code now properly converts the pandas Index/Series to a Python list for iteration
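
    Quick reference (illustrative):

        np.array([1, 2]).tolist()          # OK: numpy arrays have only tolist()
        pd.Series([1, 2]).to_list()        # OK in recent pandas (alias of tolist())
        pd.DataFrame({'a': [1]}).tolist()  # AttributeError: DataFrame has neither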

BUG FIX #2 (2025-11-16):
------------------------
WHAT WENT WRONG: Pipeline failed with error: 'Column not found: features'
    - Test script passed but actual pipeline failed
    - Test script was converting features DataFrame to a Series
    - Pipeline provides features as a DataFrame with a "features" column

WHY IT WENT WRONG: Test-production parity issue
    - The pipeline follows the pattern from benchmark code (e.g., ipca_pf.py)
    - Pipeline signature: main(chars: DataFrame, features: DataFrame, daily_ret: DataFrame)
    - Features DataFrame has a "features" column containing the feature names
    - Test script was incorrectly extracting this to a Series before passing to main()

HOW IT WAS FIXED: Updated standardize() function to handle DataFrame features parameter
    - Line 96-105: Added DataFrame detection and proper column extraction
    - If features is DataFrame with "features" column, extract: features['features'].tolist()
    - If features is DataFrame without "features" column, extract first column
    - Otherwise, treat as Series/list (backward compatibility)
    - Updated run_test_1_validation.py to keep features as DataFrame (matching pipeline)
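
    The resulting extraction logic, as implemented in standardize() and main():

        if isinstance(features, pd.DataFrame):
            if 'features' in features.columns:
                feature_list = features['features'].tolist()
            else:
                feature_list = features.iloc[:, 0].tolist()
        else:
            feature_list = features.tolist() if hasattr(features, 'tolist') else list(features)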

BUG FIX #3 (2025-11-16):
------------------------
WHAT WENT WRONG: Model execution failed with error: 'Input X contains NaN'
    - ElasticNet cannot handle NaN values in feature data
    - Validation data contained rows with missing values in some features
    - Model training failed even though standardization replaces NaN with 0

WHY IT WENT WRONG: Incomplete NaN handling
    - Standardization only handles NaN for features that are being standardized
    - Some rows may have NaN in features that aren't in the standardization list
    - Need to drop rows with ANY NaN values in features before model training

HOW IT WAS FIXED: Added NaN filtering before model training
    - Lines 169-174: Added dropna() on existing feature_columns + target_variable
    - Lines 278-283: Added same NaN filtering to mode selection function
    - Filters columns to only those that exist in DataFrame before dropna()
    - Drops any rows with NaN in existing features or target before training
    - Updated skip message to mention "after NaN removal" for clarity
    - Ensures clean data is provided to ElasticNet model
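
    The filtering pattern, as applied in both training functions below:

        cols_to_check = [col for col in feature_columns + [target_variable] if col in train_data.columns]
        train_data = train_data.dropna(subset=cols_to_check)
        test_data = test_data.dropna(subset=cols_to_check)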

BUG FIX #4 (2025-11-16):
------------------------
WHAT WENT WRONG: Pipeline CSV validation failed with "Weight value '' is not numeric"
    - Model output contained NaN values in the 'w' (weight) column
    - When exported to CSV via to_csv(), NaN values become empty strings ''
    - Pipeline's schema validator strictly requires numeric values
    - Validation failed: "Row 2: Weight value '' is not numeric; Row 3..." (98 total errors)

WHY IT WENT WRONG: shift(1) operation creates NaN for first observation per stock
    - Code uses shift(1) to prevent look-ahead bias: stock_output.groupby('id')['predicted_return'].shift(1)
    - This shifts predictions forward by 1 period per stock (grouped by 'id')
    - First observation of each stock has no "previous period" prediction → NaN
    - NaN is correct for pandas operations, but invalid for CSV export to pipeline

HOW IT WAS FIXED: Added dropna() after shift operations to remove NaN rows
    - Line 212-213: Added dropna(subset=['predicted_return']) in simple mode
    - Line 270-271: Added dropna(subset=['predicted_return']) in retrain mode
    - Drops rows where shift created NaN (first observation per stock)
    - Matches benchmark model behavior (ipca_pf, kns_pc, one_over_n all return clean DataFrames)
    - Ensures all weight values are numeric for CSV validation
    - Reduces output from 193 rows to ~95 rows (depending on number of stocks)
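
    A toy illustration (hypothetical data) of why shift(1) yields one NaN per group:

        out = pd.DataFrame({'id': ['A', 'A', 'B'], 'predicted_return': [0.1, 0.2, 0.3]})
        out['predicted_return'] = out.groupby('id')['predicted_return'].shift(1)
        # -> [NaN, 0.1, NaN]; dropna then removes rows 0 and 2
        out = out.dropna(subset=['predicted_return'])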

BUG FIX #5 (2025-11-16):
------------------------
WHAT WENT WRONG: ElasticNet convergence warning during full data run
    - ConvergenceWarning: "Objective did not converge. You might want to increase iterations..."
    - Duality gap: 2.016e+02, tolerance: 7.870e-01
    - Duality gap was 256x larger than tolerance threshold (critical convergence failure)
    - Model did not find optimal solution → coefficients inaccurate → predictions unreliable

WHY IT WENT WRONG: Extremely small alpha caused numerical instability
    - Original alpha=0.000001 approaches zero (sklearn warns against this for "numerical reasons")
    - With such minimal regularization, problem approaches unregularized OLS
    - Coordinate descent algorithm struggles with numerical precision at this scale
    - Pure Lasso (l1_ratio=1.0) is harder for coordinate descent to converge on than a mixed L1/L2 penalty
    - Default max_iter=1000 insufficient for convergence with very small alpha

HOW IT WAS FIXED: Increased regularization for numerical stability
    - Line 377: Changed alpha from 0.000001 to 0.001 (1000x increase)
    - Line 378: Changed l1_ratio from 1.0 to 0.9 (added 10% Ridge component)
    - alpha=0.001 still provides very light regularization but is numerically stable
    - Small Ridge component helps convergence without changing model character
    - Original alpha provided almost no regularization anyway (near-zero)
    - Expected: No convergence warnings, same or better execution time
    - Alternative (if must keep alpha=0.000001): Set max_iter=50000, selection='random'
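
    Both configurations, side by side (as described above):

        # Adopted fix: light but numerically stable regularization
        model = ElasticNet(alpha=0.001, l1_ratio=0.9, warm_start=True)
        # Alternative if alpha=0.000001 must be kept (slower to fit):
        model = ElasticNet(alpha=0.000001, l1_ratio=1.0, max_iter=50000, selection='random')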

CLEANUP (2025-11-16):
--------------------
Removed __name__ == '__main__' block:
    - The pipeline provides its own load_data() and export_data() implementations
    - The __main__ block had a parameter mismatch (main() expects 3 args, was called with 2)
    - Not needed for pipeline execution
"""

# Import necessary libraries
from sklearn.linear_model import ElasticNet
import numpy as np
import pandas as pd

# Optimized standardization using pandas groupby
def standardize(df, features):
    """Standardize features by date (eom) using z-score normalization.

    Applies date-specific standardization: (x - mean) / std for each feature
    within each date group. Handles NaN/inf values by replacing with 0.

    Args:
        df: DataFrame with features and 'eom' column
        features: DataFrame with 'features' column, Series, or list of feature column names

    Returns:
        DataFrame with standardized features
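
    Example (illustrative, toy data):
        >>> toy = pd.DataFrame({'eom': ['2024-01'] * 2, 'x': [1.0, 3.0]})
        >>> standardize(toy, ['x'])['x'].tolist()
        [-1.0, 1.0]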
    """
    # Extract feature list from DataFrame (pipeline provides DataFrame with "features" column)
    if isinstance(features, pd.DataFrame):
        if 'features' in features.columns:
            feature_list = features['features'].tolist()
        else:
            # Fallback: assume single column
            feature_list = features.iloc[:, 0].tolist()
    else:
        # Already a Series or list
        feature_list = features.tolist() if hasattr(features, 'tolist') else list(features)

    # Create a copy to avoid modifying original
    df = df.copy()

    # Standardize each feature by date
    for feature in feature_list:
        # Group by date and apply z-score standardization
        df[feature] = df.groupby('eom')[feature].transform(
            lambda x: (x - x.mean()) / x.std(ddof=0)
        )

        # Replace NaN/inf with 0 (handles division by zero when std=0)
        df[feature] = df[feature].replace([np.inf, -np.inf], np.nan).fillna(0)

    return df


# Function to train a pooled Elastic Net model on all stocks with retraining after a specified number of days
def train_and_predict_pooled_elastic_net_by_days(df, feature_columns, target_variable, model_type='simple', alpha=0.001, l1_ratio=0.5, days_interval=5):
    """Train Elastic Net model with optional retraining on expanding window.

    Optimized version using:
    - Pre-allocated arrays (O(n) instead of O(n²) memory copying)
    - warm_start=True for faster convergence
    - Index-based window expansion instead of vstack/append

    Args:
        df: DataFrame with features, target, and 'ctff_test' flag
        feature_columns: List of feature column names
        target_variable: Name of target column
        model_type: 'simple' (single fit) or 'retrain' (expanding window)
        alpha: L1/L2 regularization strength
        l1_ratio: Balance between L1 and L2 (1.0 = pure Lasso)
        days_interval: Number of dates between retraining

    Returns:
        DataFrame with predictions
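
    Example (illustrative; column names are placeholders):
        preds = train_and_predict_pooled_elastic_net_by_days(
            df, ['feat_a', 'feat_b'], 'ret_exc_lead1m',
            model_type='retrain', alpha=0.001, l1_ratio=0.9, days_interval=100)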
    """
    predictions_list = []

    # Split into train and test sets based on 'ctff_test' flag
    train_data = df[df['ctff_test'] == False].copy()
    test_data = df[df['ctff_test'] == True].copy()

    # Drop rows with NaN values in features or target
    # ElasticNet cannot handle NaN values
    # Only check columns that actually exist in the DataFrame
    cols_to_check = [col for col in feature_columns + [target_variable] if col in train_data.columns]

    train_before = len(train_data)
    test_before = len(test_data)
    train_data = train_data.dropna(subset=cols_to_check)
    test_data = test_data.dropna(subset=cols_to_check)
    train_after = len(train_data)
    test_after = len(test_data)

    print(f"NaN filtering: Train {train_before} → {train_after} rows ({train_before-train_after} dropped), "
          f"Test {test_before} → {test_after} rows ({test_before-test_after} dropped), "
          f"{len(cols_to_check)} columns checked ({len(feature_columns)} features + target)")

    # Skip if there is no test data or train data
    if len(train_data) == 0 or len(test_data) == 0:
        print(f"Skipping due to empty train or test data after NaN removal.")
        return pd.DataFrame()

    # Initialize the Elastic Net model with warm_start for incremental fitting
    elastic_net_model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, warm_start=True)

    if model_type == 'simple':
        # Train the Elastic Net model on the full train data once
        X_train = train_data[feature_columns].values
        y_train = train_data[target_variable].values

        elastic_net_model.fit(X_train, y_train)

        # Predict on all test data in one go
        X_test = test_data[feature_columns].values
        stock_predictions = elastic_net_model.predict(X_test)

        stock_output = pd.DataFrame({
            'eom': test_data['eom'].values,
            'id': test_data['id'].values,
            'actual_return': test_data[target_variable].values.flatten(),
            'predicted_return': stock_predictions
        })
        stock_output['predicted_return'] = stock_output.groupby('id')['predicted_return'].shift(1)
        # Drop rows with NaN predictions (first observation per stock from shift)
        stock_output = stock_output.dropna(subset=['predicted_return'])
        return stock_output

    elif model_type == 'retrain':
        # Sort test data by date for time-ordered processing
        test_data_sorted = test_data.sort_values(by='eom').reset_index(drop=True)
        unique_dates = test_data_sorted['eom'].unique()

        # PRE-ALLOCATE: Build full matrix [train rows; test rows in time order]
        # This eliminates O(n²) vstack/append overhead
        X_train_initial = train_data[feature_columns].values
        y_train_initial = train_data[target_variable].values

        X_test_all = test_data_sorted[feature_columns].values
        y_test_all = test_data_sorted[target_variable].values

        # Concatenate once instead of vstack-ing each iteration
        X_full = np.vstack([X_train_initial, X_test_all])
        y_full = np.concatenate([y_train_initial, y_test_all])

        # Track the end of the training window (exclusive index)
        train_end = len(X_train_initial)
        test_start_idx = 0  # Current position in test_data_sorted

        # Iterate over dates in intervals
        for i in range(0, len(unique_dates), days_interval):
            interval_dates = unique_dates[i:i + days_interval]

            # Find rows for this date interval
            date_mask = test_data_sorted['eom'].isin(interval_dates)
            block_len = date_mask.sum()

            if block_len == 0:
                continue

            # Fit on current training window (0:train_end in X_full/y_full)
            elastic_net_model.fit(X_full[:train_end], y_full[:train_end])

            # Extract this interval's test block
            test_end_idx = test_start_idx + block_len
            X_block = X_full[len(X_train_initial) + test_start_idx:
                             len(X_train_initial) + test_end_idx]
            y_block = y_full[len(X_train_initial) + test_start_idx:
                             len(X_train_initial) + test_end_idx]

            # Predict on this block
            interval_predictions = elastic_net_model.predict(X_block)

            # Build output for this interval
            date_group = test_data_sorted[date_mask]
            interval_output = pd.DataFrame({
                'eom': date_group['eom'].values,
                'id': date_group['id'].values,
                'actual_return': y_block,
                'predicted_return': interval_predictions
            })
            interval_output['predicted_return'] = interval_output.groupby('id')['predicted_return'].shift(1)
            # Drop rows with NaN predictions; the shift is applied within this interval's
            # block, so the first observation per stock in each interval is dropped
            interval_output = interval_output.dropna(subset=['predicted_return'])
            predictions_list.append(interval_output)

            # Expand training window to include this block (just move pointer, no copying!)
            train_end += block_len
            test_start_idx = test_end_idx

        # Combine all predictions (guard the edge case where no interval produced output)
        if not predictions_list:
            return pd.DataFrame()
        final_output_df = pd.concat(predictions_list, ignore_index=True)
        return final_output_df

    return pd.DataFrame()

def train_and_predict_elastic_net_with_mode(df, feature_columns, target_variable, model_type='simple', training_mode='pooled', alpha=0.001, l1_ratio=0.5, days_interval=5):
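    """Dispatch Elastic Net training to pooled or per-stock ('stockwise') mode.

    Pooled mode delegates to train_and_predict_pooled_elastic_net_by_days on the
    full DataFrame (that function performs its own NaN filtering); stockwise mode
    calls the same function once per stock id and concatenates the results.

    Returns:
        DataFrame with predictions, or an empty DataFrame when no data survives
        NaN filtering or training_mode is unrecognized.
    """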

    predictions_list = []

    # Split into train and test sets based on 'ctff_test' flag
    train_data = df[df['ctff_test'] == False].copy()
    test_data = df[df['ctff_test'] == True].copy()

    # Drop rows with NaN values in features or target (for stockwise mode)
    # ElasticNet cannot handle NaN values
    # Only check columns that actually exist in the DataFrame
    cols_to_check = [col for col in feature_columns + [target_variable] if col in train_data.columns]
    train_data = train_data.dropna(subset=cols_to_check)
    test_data = test_data.dropna(subset=cols_to_check)

    # Skip if there is no test data or train data
    if len(train_data) == 0 or len(test_data) == 0:
        print(f"Skipping due to empty train or test data after NaN removal.")
        return pd.DataFrame()

    if training_mode == 'pooled':
        # Reuse the pooled training function
        return train_and_predict_pooled_elastic_net_by_days(
            df, feature_columns, target_variable, model_type, alpha, l1_ratio, days_interval
        )

    elif training_mode == 'stockwise':
        # Train separately for each stock by calling the previous function with filtered data
        stock_ids = train_data['id'].unique()

        for stock_id in stock_ids:
            # Filter data for the current stock
            stock_train_data = train_data[train_data['id'] == stock_id]
            stock_test_data = test_data[test_data['id'] == stock_id]

            # Skip if no test data for the current stock
            if len(stock_test_data) == 0:
                continue

            # Combine train and test data into a single DataFrame for the current stock
            stock_df = pd.concat([stock_train_data, stock_test_data])

            # Use the same function as for pooled mode, but with stock-specific data
            stock_output = train_and_predict_pooled_elastic_net_by_days(
                stock_df, feature_columns, target_variable, model_type, alpha, l1_ratio, days_interval
            )

            # Append the predictions for the current stock to the overall list
            predictions_list.append(stock_output)

        # Combine all stock-specific predictions into a single DataFrame
        if not predictions_list:
            return pd.DataFrame()
        final_output_df = pd.concat(predictions_list, ignore_index=True)
        return final_output_df

    # Unrecognized training_mode: return empty rather than an implicit None
    return pd.DataFrame()

def main(chars, features, daily_ret):
    df = chars
    target_variable = 'ret_exc_lead1m'

    # Extract feature list from DataFrame (pipeline provides DataFrame with "features" column)
    if isinstance(features, pd.DataFrame):
        if 'features' in features.columns:
            feature_columns = features['features'].tolist()
        else:
            feature_columns = features.iloc[:, 0].tolist()
    else:
        feature_columns = features.tolist() if hasattr(features, 'tolist') else list(features)

    df = standardize(df=df, features=features)
    
    # Define the Elastic Net hyperparameters
    # Increased alpha and added Ridge component for convergence (see BUG FIX #5 in the module docstring)
    alpha_value = 0.001  # Regularization strength (was 0.000001, increased 1000x for numerical stability)
    l1_ratio_value = 0.9  # Ratio of L1 (Lasso) to L2 (Ridge) - 90% Lasso + 10% Ridge for convergence
    
    # Set the number of days to take in for each retraining iteration
    days_interval = 100  # Change this value to control how many days to take before retraining
    
    # Call the function with 'retrain' mode and specify the training mode
    model_type = 'retrain'
    training_mode = 'pooled'  # Set 'pooled' or 'stockwise' based on preference
    
    # Train the model with the chosen parameters
    final_predictions_df = train_and_predict_elastic_net_with_mode(
        df, feature_columns, target_variable, model_type, training_mode, alpha_value, l1_ratio_value, days_interval)
    # Convert predictions to rank-based portfolio weights within each date
    pf = (
        final_predictions_df
        .assign(rank=lambda x: x.groupby('eom')['predicted_return'].rank(method='first'))
        .assign(count=lambda x: x.groupby('eom')['rank'].transform('count'))
        .assign(w=lambda x: x['rank'] / x['count'])
        [['id', 'eom', 'w']]
    )

    return pf