ML MODEL TRAINING
Part 4: From Failed LLMs to Gradient Boosting Success
1. THE QUESTION
Part 3's Omega System used hand-coded signal aggregation rules. The results were impressive: 9,868% returns in 2021 backtests. But those rules encode human assumptions about markets.
What if machine learning could find the patterns instead?
The goal: train models on the same 171 features (expanded to ~16,000 through feature engineering) and see if they can learn to predict price direction. No hand-coded rules. Just data.
2. THE DATASET
| Parameter | Value |
|---|---|
| Assets | 94 cryptocurrencies |
| Time Range | 2015 - 2025 (10 years) |
| Rows | ~197,000 |
| Base Features | 171 (from Omega System) |
| Engineered Features | ~16,000 |
| Target Variable | Price direction (binary: up/down) |
| Train/Test Split | 90/10 time-based (no lookahead) |
Feature engineering expanded the 171 base metrics to ~16,000 through rolling windows, lags, interactions, and cross-asset relationships. For training efficiency, the top 500-1000 features were selected by variance.
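A minimal sketch of this expand-then-filter pipeline, using pandas. Window sizes, lag choices, and column names here are illustrative, not the project's actual configuration:

```python
import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame, windows=(7, 30), lags=(1, 7)) -> pd.DataFrame:
    """Expand each base column with rolling statistics and lagged copies."""
    out = {}
    for col in df.columns:
        for w in windows:
            out[f"{col}_roll{w}_mean"] = df[col].rolling(w).mean()
            out[f"{col}_roll{w}_std"] = df[col].rolling(w).std()
        for lag in lags:
            out[f"{col}_lag{lag}"] = df[col].shift(lag)
    return pd.concat([df, pd.DataFrame(out)], axis=1)

def select_by_variance(df: pd.DataFrame, k: int = 500) -> pd.DataFrame:
    """Keep the k highest-variance columns as a cheap pre-training filter."""
    top = df.var().nlargest(min(k, df.shape[1])).index
    return df[top]

# Time-based 90/10 split: train strictly precedes test, so no lookahead.
# split = int(len(df) * 0.9)
# train, test = df.iloc[:split], df.iloc[split:]
```

Rolling windows and lags only ever look backward, which is what keeps the time-based split honest: no feature at row t uses information from rows after t.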
3. FAILED EXPERIMENT: LARGE LANGUAGE MODELS
The first approach was ambitious: train a Large Language Model (Qwen 3 4B) to predict prices.
The reasoning seemed sound: LLMs have shown remarkable pattern-recognition abilities, and financial data is ultimately sequential information.
In practice, the mismatch was fundamental. LLMs are designed for language: discrete tokens with semantic meaning. Financial time series are continuous numerical data. These are different domains requiring different architectures.
Why LLMs Failed
| Problem | Explanation |
|---|---|
| Tokenization mismatch | Numbers get tokenized inconsistently ("123.45" → multiple tokens) |
| No numerical reasoning | LLMs don't understand that 50.1 > 49.9 in a meaningful way |
| Training efficiency | Billions of parameters for a task that doesn't need language understanding |
| Hallucination risk | LLMs can generate plausible-sounding but wrong predictions |
4. FAILED EXPERIMENT: REINFORCEMENT LEARNING (PPO)
The second approach: train a Proximal Policy Optimization (PPO) agent to trade.
Unlike supervised learning (predict up/down), RL agents learn through interaction. The agent takes actions (BUY/SELL/HOLD), receives rewards (profit/loss), and learns a policy.
After 40,000+ timesteps of training, the agent collapsed to a single action: HOLD. It learned that doing nothing was the safest way to avoid losses.
Training Metrics Before Collapse
| Metric | Value | Interpretation |
|---|---|---|
| entropy_loss | -0.106 → -0.155 | Collapsing to single action |
| explained_variance | 0.874 → 0.272 | Losing predictive power |
| mean_reward | -0.01 | Slight negative (fees eating profits) |
| episode_length | 39,434 | Never closing positions |
Why PPO Failed
- Sparse rewards: Trading rewards come at episode end, making credit assignment difficult
- Long episodes: 39,000+ steps before any feedback signal
- Local minimum: HOLD avoids losses, so the agent gets stuck there
- Exploration collapse: Entropy dropped, meaning the agent stopped exploring alternatives
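The credit-assignment problem above can be made concrete. With a typical discount factor, a reward arriving 39,000 steps later is invisible to the actions at the start of the episode. A small illustration (gamma = 0.99 is an assumed, conventional value, not taken from the actual training config):

```python
def discounted_credit(reward: float, gamma: float, steps: int) -> float:
    """Credit a terminal reward assigns to an action `steps` earlier."""
    return reward * gamma ** steps

# For a 39,000-step episode (the episode length reported above),
# a terminal reward of 1.0 contributes essentially nothing to early actions:
credit = discounted_credit(1.0, 0.99, 39_000)
print(credit)  # ~6e-171: the learning signal has effectively vanished
```

With the gradient signal this diluted, HOLD (which reliably avoids fee-driven losses) becomes a stable attractor, which is exactly the collapse the entropy metric recorded.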
5. WHAT WORKED: GRADIENT BOOSTING
After the LLM and RL failures, the focus shifted to proven approaches for tabular data: Gradient Boosting Decision Trees (GBDT).
Three models were selected: XGBoost, CatBoost, and LightGBM. Each has slightly different algorithms, providing ensemble diversity.
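One way such an ensemble could combine the three models is soft voting over their predicted P(up). This is a sketch of the idea, not the project's confirmed combination rule; the weighting scheme is an assumption:

```python
import numpy as np

def soft_vote(prob_lists, weights=None):
    """Weighted average of per-model P(up) predictions.

    prob_lists: one list of probabilities per model, same sample order.
    weights: optional per-model weights (e.g. validation AUC); default equal.
    """
    probs = np.asarray(prob_lists, dtype=float)  # shape: (n_models, n_samples)
    if weights is None:
        weights = np.ones(len(probs))
    weights = np.asarray(weights, dtype=float)
    return weights @ probs / weights.sum()
```

Averaging only helps to the extent the models err differently, which is why algorithmic diversity across XGBoost, CatBoost, and LightGBM matters more than raw count.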
Model Status
| Model | Status | HPO Trials | Best AUC | Notes |
|---|---|---|---|---|
| XGBoost | DONE | 500/500 | 0.566 | Best performer |
| CatBoost | DONE | 500/500 | 0.530 | GPU-accelerated |
| LightGBM | DONE | 500/500 | 0.520 | Memory-optimized |
| TFT | DONE | — | N/A | Poor fit for classification |
Hyperparameter Optimization
Each model underwent 500 trials of Bayesian optimization using Optuna with TPESampler. This isn't random search—the optimizer learns from previous trials to explore promising parameter regions.
| Parameter | Search Range |
|---|---|
| n_estimators | 500 - 3000 |
| max_depth | 4 - 15 |
| learning_rate | 0.001 - 0.1 (log scale) |
| subsample | 0.5 - 1.0 |
| colsample_bytree | 0.5 - 1.0 |
| reg_alpha | 1e-8 - 10 (log scale) |
| reg_lambda | 1e-8 - 10 (log scale) |
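The table's ranges map directly onto an Optuna objective. A sketch assuming the standard `optuna` `suggest_*` API, with the model-training part of the objective elided:

```python
def xgb_search_space(trial):
    """Map an Optuna trial to the XGBoost search space tabled above.

    `trial` is expected to expose Optuna's suggest_* methods; a real
    objective would train a model with these params and return its
    validation AUC for the TPE sampler to maximize.
    """
    return {
        "n_estimators": trial.suggest_int("n_estimators", 500, 3000),
        "max_depth": trial.suggest_int("max_depth", 4, 15),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.1, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "reg_alpha": trial.suggest_float("reg_alpha", 1e-8, 10.0, log=True),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-8, 10.0, log=True),
    }

# With optuna installed, the study would run along these lines:
#   study = optuna.create_study(direction="maximize",
#                               sampler=optuna.samplers.TPESampler())
#   study.optimize(objective, n_trials=500)
```

The log scale on learning_rate and the regularization terms matters: it lets the sampler spend as much effort distinguishing 0.001 from 0.01 as 0.01 from 0.1.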
XGBoost Best Configuration
| Parameter | Value |
|---|---|
| n_estimators | 2,520 |
| max_depth | 14 |
| learning_rate | 0.084 |
| min_child_weight | 10 |
| subsample | 0.633 |
| colsample_bytree | 0.857 |
| gamma | 1.22 |
| Best AUC | 0.566 |
6. UNDERSTANDING THE RESULTS
What Does AUC 0.566 Mean?
AUC (Area Under ROC Curve) measures how well a model distinguishes between classes:
| AUC Value | Interpretation |
|---|---|
| 0.50 | Random chance (coin flip) |
| 0.50 - 0.60 | Poor, but better than random |
| 0.60 - 0.70 | Moderate predictive power |
| 0.70 - 0.80 | Good |
| 0.80+ | Excellent (suspicious for financial data) |
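AUC has a direct probabilistic reading: it is the chance that a randomly drawn positive example gets a higher score than a randomly drawn negative one. A minimal pure-Python version makes that concrete:

```python
def auc(labels, scores):
    """AUC = P(random positive outranks random negative), ties count half.

    Pairwise O(n_pos * n_neg) version, fine for illustration.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A model with AUC 0.566 ranks a random "up" day above a random
# "down" day 56.6% of the time: a thin but real edge over a coin flip.
```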
HPO vs Final Test AUC
Important caveat: HPO AUC scores are from validation data. Final test scores on truly unseen data are often lower:
| Model | HPO Best AUC | Final Test AUC | Drop |
|---|---|---|---|
| CatBoost | 0.530 | ~0.51 | -0.02 |
| LightGBM | 0.520 | ~0.50 | -0.02 |
| XGBoost | 0.566 | TBD | — |
7. TFT: THE NEURAL NETWORK ATTEMPT
Temporal Fusion Transformer (TFT) is a neural network architecture designed for time series forecasting.
TFT trained successfully, but it is designed for regression (predicting continuous values), not classification (predicting up/down). It may be revisited for predicting price magnitude.
8. TECHNICAL CHALLENGES
Training on 197K rows × 16K features presented engineering challenges:
Memory Management
- Original dataset required 128GB+ RAM
- Created low-memory version selecting top 500 features by variance
- Cast all features to float32 (half memory of float64)
- Explicit garbage collection between operations
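The float32 downcast and garbage-collection steps amount to a short helper; a sketch of the approach (the selection of which columns to downcast is illustrative):

```python
import gc
import numpy as np
import pandas as pd

def shrink(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast float64 columns to float32, halving their memory footprint."""
    float_cols = df.select_dtypes("float64").columns
    df[float_cols] = df[float_cols].astype(np.float32)
    gc.collect()  # reclaim the freed float64 buffers promptly
    return df
```

float32 keeps about 7 significant decimal digits, which is ample precision for indicator-style features but exactly half the 8 bytes per value of float64.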
Data Type Issues
- Mixed object/numeric columns caused training failures
- Added `pd.to_numeric(errors='coerce')` preprocessing
- Handled infinity values with `replace([np.inf, -np.inf], 0)`
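The two fixes above combine into one short cleaning pass. A sketch; the zero-fill of coerced NaNs is an assumed policy, not one stated in the training code:

```python
import numpy as np
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Coerce every column to numeric, then neutralize non-finite values."""
    df = df.apply(pd.to_numeric, errors="coerce")  # unparsable objects -> NaN
    df = df.replace([np.inf, -np.inf], 0)          # infinities from ratio features
    return df.fillna(0)                            # assumption: zero-fill NaNs
```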
Model Saving
- Multiple training runs lost due to crashes before saving
- Implemented immediate save after training, before evaluation
- Added timestamped backups to prevent overwrites
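The save-first discipline is simple to enforce; a sketch of one way to do it (pickle and the directory layout are assumptions, and libraries like XGBoost also offer their own `save_model` formats):

```python
import pickle
import time
from pathlib import Path

def save_model(model, name: str, out_dir: str = "models") -> Path:
    """Persist a model immediately after fit(), before any evaluation,
    under a timestamped filename so reruns never overwrite a good run."""
    path = Path(out_dir)
    path.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d_%H%M%S")
    target = path / f"{name}_{stamp}.pkl"
    with open(target, "wb") as fh:
        pickle.dump(model, fh)
    return target
```

The key ordering rule: call this between `fit()` and the evaluation loop, so a crash during evaluation costs a metric, not hours of training.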
9. FINAL STATUS
| Model | Status | Result |
|---|---|---|
| XGBoost | DONE | AUC 0.566 — Best performer |
| CatBoost | DONE | AUC 0.530 |
| LightGBM | DONE | AUC 0.520 |
| TFT | DONE | Poor fit for classification |
| PPO | ABANDONED | Collapsed to HOLD action |
| Qwen LLM | ABANDONED | Wrong architecture for numerical data |
10. NEXT: REGIME DETECTION (PART 5)
With price prediction models complete, the next phase is regime detection—dedicated models to classify market conditions as Bull, Bear, or Sideways.
These models don't predict prices. They provide context. When the ensemble knows "we're in a bear market," it can weight signals and adjust risk parameters accordingly.
• Hidden Markov Model (HMM) — Unsupervised regime discovery using 219 features
• Random Forest Classifier — Supervised classification with 235 features, hyperparameter-optimized
• Bidirectional LSTM + Attention — 90-day sequences, multi-task learning (daily/weekly/monthly)
• Ensemble Voting — Combine all three for robust regime signals
Dataset: 233,507 rows × 203 features × 97 assets (2014-2026)
Labels: 100% hindsight-accurate (UP/DOWN/SAME, BULL/BEAR/SIDEWAYS)
11. PRELIMINARY CONCLUSION
The journey from LLMs to gradient boosting reflects a fundamental truth in machine learning: match the architecture to the problem.
- LLMs excel at language, not numbers
- Reinforcement learning needs careful reward design
- Gradient boosting remains the gold standard for tabular data
- Hyperparameter optimization matters—Trial 244 vs Trial 3 is 0.49 vs 0.57 AUC
The models consistently score above chance (AUC > 0.5) on out-of-sample data. That is no guarantee of trading profits, but it is evidence that learnable patterns exist. Whether those patterns persist in live markets is the ultimate test.
All gradient boosting models trained. XGBoost achieved best AUC (0.566). Regime detection models (Part 5) are now in development.
© 2026 Omega Arena