ML MODEL TRAINING
Part 4: From Failed LLMs to Gradient Boosting Success
1. THE QUESTION
Part 3's Omega System used hand-coded signal aggregation rules. The results were impressive: 9,868% returns in 2021 backtests. But those rules encode human assumptions about markets.
What if machine learning could find the patterns instead?
The goal: train models on the same 171 features (expanded to ~16,000 through feature engineering) and see if they can learn to predict price direction. No hand-coded rules. Just data.
2. THE DATASET
| Parameter | Value |
|---|---|
| Assets | 94 cryptocurrencies |
| Time Range | 2015 - 2025 (10 years) |
| Rows | ~197,000 |
| Base Features | 171 (from Omega System) |
| Engineered Features | ~16,000 |
| Target Variable | Price direction (binary: up/down) |
| Train/Test Split | 90/10 time-based (no lookahead) |
Feature engineering expanded the 171 base metrics to ~16,000 through rolling windows, lags, interactions, and cross-asset relationships. For training efficiency, the top 500-1000 features were selected by variance.
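A minimal sketch of this expand-then-filter pipeline, using pandas. Window sizes, lag choices, and column names here are illustrative, not the project's actual configuration:

```python
import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame, windows=(7, 30), lags=(1, 7)) -> pd.DataFrame:
    """Expand each base column with rolling statistics and lagged copies."""
    out = {}
    for col in df.columns:
        for w in windows:
            out[f"{col}_roll{w}_mean"] = df[col].rolling(w).mean()
            out[f"{col}_roll{w}_std"] = df[col].rolling(w).std()
        for lag in lags:
            out[f"{col}_lag{lag}"] = df[col].shift(lag)
    return pd.concat([df, pd.DataFrame(out)], axis=1)

def select_by_variance(df: pd.DataFrame, k: int = 500) -> pd.DataFrame:
    """Keep the k highest-variance columns as a cheap pre-training filter."""
    top = df.var().nlargest(min(k, df.shape[1])).index
    return df[top]

# Time-based 90/10 split: train strictly precedes test, so no lookahead.
# split = int(len(df) * 0.9)
# train, test = df.iloc[:split], df.iloc[split:]
```

Rolling windows and lags only ever look backward, which is what keeps the time-based split honest: no feature at row t uses information from rows after t.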
3. FAILED EXPERIMENT: LARGE LANGUAGE MODELS
The first approach was ambitious: train a Large Language Model (Qwen 3 4B) to predict prices.
The reasoning seemed sound: LLMs have shown remarkable pattern-recognition abilities, and financial data is ultimately sequential information.
In practice, the mismatch was fundamental. LLMs are designed for language: discrete tokens with semantic meaning. Financial time series are continuous numerical data. These are different domains requiring different architectures.
Why LLMs Failed
| Problem | Explanation |
|---|---|
| Tokenization mismatch | Numbers get tokenized inconsistently ("123.45" → multiple tokens) |
| No numerical reasoning | LLMs don't understand that 50.1 > 49.9 in a meaningful way |
| Training efficiency | Billions of parameters for a task that doesn't need language understanding |
| Hallucination risk | LLMs can generate plausible-sounding but wrong predictions |
4. FAILED EXPERIMENT: REINFORCEMENT LEARNING (PPO)
The second approach: train a Proximal Policy Optimization (PPO) agent to trade.
Unlike supervised learning (predict up/down), RL agents learn through interaction. The agent takes actions (BUY/SELL/HOLD), receives rewards (profit/loss), and learns a policy.
After 40,000+ timesteps of training, the agent collapsed to a single action: HOLD. It learned that doing nothing was the safest way to avoid losses.
Training Metrics Before Collapse
| Metric | Value | Interpretation |
|---|---|---|
| entropy_loss | -0.106 → -0.155 | Collapsing to single action |
| explained_variance | 0.874 → 0.272 | Losing predictive power |
| mean_reward | -0.01 | Slight negative (fees eating profits) |
| episode_length | 39,434 | Never closing positions |
Why PPO Failed
- Sparse rewards: Trading rewards come at episode end, making credit assignment difficult
- Long episodes: 39,000+ steps before any feedback signal
- Local minimum: HOLD avoids losses, so the agent gets stuck there
- Exploration collapse: Entropy dropped, meaning the agent stopped exploring alternatives
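The credit-assignment problem above can be made concrete. With a typical discount factor, a reward arriving 39,000 steps later is invisible to the actions at the start of the episode. A small illustration (gamma = 0.99 is an assumed, conventional value, not taken from the actual training config):

```python
def discounted_credit(reward: float, gamma: float, steps: int) -> float:
    """Credit a terminal reward assigns to an action `steps` earlier."""
    return reward * gamma ** steps

# For a 39,000-step episode (the episode length reported above),
# a terminal reward of 1.0 contributes essentially nothing to early actions:
credit = discounted_credit(1.0, 0.99, 39_000)
print(credit)  # ~6e-171: the learning signal has effectively vanished
```

With the gradient signal this diluted, HOLD (which reliably avoids fee-driven losses) becomes a stable attractor, which is exactly the collapse the entropy metric recorded.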
5. WHAT WORKED: GRADIENT BOOSTING
After the LLM and RL failures, the focus shifted to proven approaches for tabular data: Gradient Boosting Decision Trees (GBDT).
Three models were selected: XGBoost, CatBoost, and LightGBM. Each has slightly different algorithms, providing ensemble diversity.
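One way such an ensemble could combine the three models is soft voting over their predicted P(up). This is a sketch of the idea, not the project's confirmed combination rule; the weighting scheme is an assumption:

```python
import numpy as np

def soft_vote(prob_lists, weights=None):
    """Weighted average of per-model P(up) predictions.

    prob_lists: one list of probabilities per model, same sample order.
    weights: optional per-model weights (e.g. validation AUC); default equal.
    """
    probs = np.asarray(prob_lists, dtype=float)  # shape: (n_models, n_samples)
    if weights is None:
        weights = np.ones(len(probs))
    weights = np.asarray(weights, dtype=float)
    return weights @ probs / weights.sum()
```

Averaging only helps to the extent the models err differently, which is why algorithmic diversity across XGBoost, CatBoost, and LightGBM matters more than raw count.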
Model Status
| Model | Status | HPO Trials | Best AUC | Notes |
|---|---|---|---|---|
| XGBoost | DONE | 500/500 | 0.566 | Best performer |
| CatBoost | DONE | 500/500 | 0.530 | GPU-accelerated |
| LightGBM | DONE | 500/500 | 0.520 | Memory-optimized |
| TFT | DONE | — | N/A | Poor fit for classification |
Hyperparameter Optimization
Each model underwent 500 trials of Bayesian optimization using Optuna with TPESampler. This isn't random search—the optimizer learns from previous trials to explore promising parameter regions.
| Parameter | Search Range |
|---|---|
| n_estimators | 500 - 3000 |
| max_depth | 4 - 15 |
| learning_rate | 0.001 - 0.1 (log scale) |
| subsample | 0.5 - 1.0 |
| colsample_bytree | 0.5 - 1.0 |
| reg_alpha | 1e-8 - 10 (log scale) |
| reg_lambda | 1e-8 - 10 (log scale) |
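The table's ranges map directly onto an Optuna objective. A sketch assuming the standard `optuna` `suggest_*` API, with the model-training part of the objective elided:

```python
def xgb_search_space(trial):
    """Map an Optuna trial to the XGBoost search space tabled above.

    `trial` is expected to expose Optuna's suggest_* methods; a real
    objective would train a model with these params and return its
    validation AUC for the TPE sampler to maximize.
    """
    return {
        "n_estimators": trial.suggest_int("n_estimators", 500, 3000),
        "max_depth": trial.suggest_int("max_depth", 4, 15),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.1, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "reg_alpha": trial.suggest_float("reg_alpha", 1e-8, 10.0, log=True),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-8, 10.0, log=True),
    }

# With optuna installed, the study would run along these lines:
#   study = optuna.create_study(direction="maximize",
#                               sampler=optuna.samplers.TPESampler())
#   study.optimize(objective, n_trials=500)
```

The log scale on learning_rate and the regularization terms matters: it lets the sampler spend as much effort distinguishing 0.001 from 0.01 as 0.01 from 0.1.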
XGBoost Best Configuration
| Parameter | Value |
|---|---|
| n_estimators | 2,520 |
| max_depth | 14 |
| learning_rate | 0.084 |
| min_child_weight | 10 |
| subsample | 0.633 |
| colsample_bytree | 0.857 |
| gamma | 1.22 |
| Best AUC | 0.566 |
6. UNDERSTANDING THE RESULTS
What Does AUC 0.566 Mean?
AUC (Area Under ROC Curve) measures how well a model distinguishes between classes:
| AUC Value | Interpretation |
|---|---|
| 0.50 | Random chance (coin flip) |
| 0.50 - 0.60 | Poor, but better than random |
| 0.60 - 0.70 | Moderate predictive power |
| 0.70 - 0.80 | Good |
| 0.80+ | Excellent (suspicious for financial data) |
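AUC has a direct probabilistic reading: it is the chance that a randomly drawn positive example gets a higher score than a randomly drawn negative one. A minimal pure-Python version makes that concrete:

```python
def auc(labels, scores):
    """AUC = P(random positive outranks random negative), ties count half.

    Pairwise O(n_pos * n_neg) version, fine for illustration.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A model with AUC 0.566 ranks a random "up" day above a random
# "down" day 56.6% of the time: a thin but real edge over a coin flip.
```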
HPO vs Final Test AUC
Important caveat: HPO AUC scores are from validation data. Final test scores on truly unseen data are often lower:
| Model | HPO Best AUC | Final Test AUC | Drop |
|---|---|---|---|
| CatBoost | 0.530 | ~0.51 | -0.02 |
| LightGBM | 0.520 | ~0.50 | -0.02 |
| XGBoost | 0.566 | TBD | — |
7. TFT: THE NEURAL NETWORK ATTEMPT
Temporal Fusion Transformer (TFT) is a neural network architecture designed for time series forecasting.
TFT trained successfully, but it is designed for regression (predicting continuous values), not classification (predicting up/down). It may be revisited for predicting price magnitude.
8. TECHNICAL CHALLENGES
Training on 197K rows × 16K features presented engineering challenges:
Memory Management
- Original dataset required 128GB+ RAM
- Created low-memory version selecting top 500 features by variance
- Cast all features to float32 (half memory of float64)
- Explicit garbage collection between operations
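The float32 downcast and garbage-collection steps amount to a short helper; a sketch of the approach (the selection of which columns to downcast is illustrative):

```python
import gc
import numpy as np
import pandas as pd

def shrink(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast float64 columns to float32, halving their memory footprint."""
    float_cols = df.select_dtypes("float64").columns
    df[float_cols] = df[float_cols].astype(np.float32)
    gc.collect()  # reclaim the freed float64 buffers promptly
    return df
```

float32 keeps about 7 significant decimal digits, which is ample precision for indicator-style features but exactly half the 8 bytes per value of float64.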
Data Type Issues
- Mixed object/numeric columns caused training failures
- Added `pd.to_numeric(errors='coerce')` preprocessing
- Handled infinity values with `replace([np.inf, -np.inf], 0)`
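The two fixes above combine into one short cleaning pass. A sketch; the zero-fill of coerced NaNs is an assumed policy, not one stated in the training code:

```python
import numpy as np
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Coerce every column to numeric, then neutralize non-finite values."""
    df = df.apply(pd.to_numeric, errors="coerce")  # unparsable objects -> NaN
    df = df.replace([np.inf, -np.inf], 0)          # infinities from ratio features
    return df.fillna(0)                            # assumption: zero-fill NaNs
```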
Model Saving
- Multiple training runs lost due to crashes before saving
- Implemented immediate save after training, before evaluation
- Added timestamped backups to prevent overwrites
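The save-first discipline is simple to enforce; a sketch of one way to do it (pickle and the directory layout are assumptions, and libraries like XGBoost also offer their own `save_model` formats):

```python
import pickle
import time
from pathlib import Path

def save_model(model, name: str, out_dir: str = "models") -> Path:
    """Persist a model immediately after fit(), before any evaluation,
    under a timestamped filename so reruns never overwrite a good run."""
    path = Path(out_dir)
    path.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d_%H%M%S")
    target = path / f"{name}_{stamp}.pkl"
    with open(target, "wb") as fh:
        pickle.dump(model, fh)
    return target
```

The key ordering rule: call this between `fit()` and the evaluation loop, so a crash during evaluation costs a metric, not hours of training.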
9. FINAL STATUS
| Model | Status | Result |
|---|---|---|
| XGBoost | DONE | AUC 0.566 — Best performer |
| CatBoost | DONE | AUC 0.530 |
| LightGBM | DONE | AUC 0.520 |
| TFT | DONE | Poor fit for classification |
| PPO | ABANDONED | Collapsed to HOLD action |
| Qwen LLM | ABANDONED | Wrong architecture for numerical data |
10. NEXT: REGIME DETECTION (PART 5)
With price prediction models complete, the next phase is regime detection—dedicated models to classify market conditions as Bull, Bear, or Sideways.
These models don't predict prices. They provide context. When the ensemble knows "we're in a bear market," it can weight signals and adjust risk parameters accordingly.
• Hidden Markov Model (HMM) — Unsupervised regime discovery using 219 features
• Random Forest Classifier — Supervised classification with 235 features, hyperparameter-optimized
• Bidirectional LSTM + Attention — 90-day sequences, multi-task learning (daily/weekly/monthly)
• Ensemble Voting — Combine all three for robust regime signals
Dataset: 233,507 rows × 203 features × 97 assets (2014-2026)
Labels: 100% hindsight-accurate (UP/DOWN/SAME, BULL/BEAR/SIDEWAYS)
11. PRELIMINARY CONCLUSION
The journey from LLMs to gradient boosting reflects a fundamental truth in machine learning: match the architecture to the problem.
- LLMs excel at language, not numbers
- Reinforcement learning needs careful reward design
- Gradient boosting remains the gold standard for tabular data
- Hyperparameter optimization matters—Trial 244 vs Trial 3 is 0.49 vs 0.57 AUC
The models consistently score above chance (AUC > 0.5) on out-of-sample data. That is no guarantee of trading profits, but it is evidence that learnable patterns exist. Whether those patterns persist in live markets is the ultimate test.
All gradient boosting models trained. XGBoost achieved best AUC (0.566). Regime detection models (Part 5) are now in development.
© 2026 Omega Arena