Analysis Task¶
This project aims to analyze data from the NBA Draft Combine for the years 2000-2020. The objective is to understand the relationship between player metrics measured at the combine and draft outcomes across all positions.
In doing so, this analysis asks the question: how and to what extent does combine data correlate with draft decisions among NBA prospects (2000-2020)? More specifically:
1. Which input variables does the trained model rely on most heavily when making predictions of draft outcome?
2. Do the input variables contain statistically significant information that a Random Forest learning algorithm can exploit to make useful predictions?
Data Source and Preparation¶
The data for this analysis is derived from official NBA stats and compiled here: https://www.kaggle.com/datasets/marcusfern/nba-draft-combine
After external preprocessing and cleaning, the source material contains 1,249 entries covering the following metrics: height, weight, wingspan, standing reach, standing vertical, max vertical, bench reps, lane agility time, three-quarter court sprint time, modified lane agility time, hand length, hand width, and body fat percentage, along with player position and draft outcome.
Analysis¶
Import Packages¶
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import permutation_test_score
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report
import base64
from IPython.display import HTML, display
from rich.console import Console
from rich.table import Table
from rich.text import Text
from tabulate import tabulate
Import and Clean Files¶
# Load CSV
data = pd.read_csv("NBA Draft (2000-2020).csv")
# Replace empty strings with NaN
data = data.replace("", np.nan)
# Identify numeric columns
numeric_columns = data.select_dtypes(include=[np.number]).columns
# Fill missing values with column median
for col in numeric_columns:
    data[col] = data[col].fillna(data[col].median())
Set Up and Train AI (Random Forest Classification)¶
# Define features and target
X = data.drop(columns=["drafted", "yearCombine"])
y = data["drafted"]
# Convert categorical variables to numerical
X = pd.get_dummies(X)
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=49)
# Train a Random Forest (100 trees, fixed random seed) on the training set
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
Test AI Model on Reserved Test Data¶
# Predict the test-set targets from their features
y_pred = rf.predict(X_test)
# Calculate accuracy and F1 metrics
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
# Rich table display
console = Console()
table = Table(title="Overall Metrics")
table.add_column("Metric", style="black", no_wrap=True)
table.add_column("Value", justify="right")
table.add_row("Accuracy", f"{acc:.3f}")
table.add_row("F1 Score", f"{f1:.3f}")
console.print(table)
Overall Metrics ┏━━━━━━━━━━┳━━━━━━━┓ ┃ Metric ┃ Value ┃ ┡━━━━━━━━━━╇━━━━━━━┩ │ Accuracy │ 0.608 │ │ F1 Score │ 0.660 │ └──────────┴───────┘
# Print the confusion matrix as a Rich table
console = Console()
cm = confusion_matrix(y_test, y_pred)
table = Table(title="Confusion Matrix")
table.add_column(" ", justify="right")
table.add_column("Predicted 0", justify="center")
table.add_column("Predicted 1", justify="center")
table.add_row("Actual 0", str(cm[0, 0]), str(cm[0, 1]))
table.add_row("Actual 1", str(cm[1, 0]), str(cm[1, 1]))
console.print(table)
Confusion Matrix ┏━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓ ┃ ┃ Predicted 0 ┃ Predicted 1 ┃ ┡━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩ │ Actual 0 │ 57 │ 57 │ │ Actual 1 │ 41 │ 95 │ └──────────┴─────────────┴─────────────┘
AI Model Evaluation Results¶
The results of this evaluation are encouraging. The model was tested on data from 250 prospects (one-fifth of the total entries) and correctly predicted their draft status 152 times: 60.8% accuracy. Had one simply guessed "drafted" for every prospect in the test group (as shown above in the confusion matrix), one would have been right 54.4% of the time (136 drafted of 250 tested). Taking this as the baseline, the model performs 6.4 percentage points above it.
The confusion matrix shows the model tends to slightly overpredict drafting (152 predicted drafted versus 136 actually drafted) but does well overall.
The F1 score (0.660) balances the model’s ability to correctly identify drafted players (recall) with how often players predicted as drafted actually were drafted (precision). On a scale from 0 to 1, this moderate score indicates the model is reasonably effective at predicting drafted players.
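These figures can be verified directly from the confusion-matrix counts reported above (true negatives 57, false positives 57, false negatives 41, true positives 95):

```python
# Recompute the headline metrics from the confusion-matrix counts above.
tn, fp, fn, tp = 57, 57, 41, 95
total = tn + fp + fn + tp

accuracy = (tn + tp) / total        # correct predictions: 152 / 250
baseline = (fn + tp) / total        # always guess "drafted": 136 / 250
precision = tp / (tp + fp)          # predicted drafted who were drafted
recall = tp / (tp + fn)             # drafted players the model found
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy:.3f}")  # 0.608
print(f"Baseline: {baseline:.3f}")  # 0.544
print(f"F1 score: {f1:.3f}")        # 0.660
```

This hand check confirms that the accuracy and F1 figures in the text follow from the confusion matrix alone.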
Feature Importance¶
Given that the model has the ability to predict drafting better than the baseline, we return to our first question:
Which input variables does the trained model rely on most heavily when making predictions of draft outcome?
# Extract feature importances
rf_importances = rf.feature_importances_
# Create a DataFrame sorted by importance
feature_importance_df = pd.DataFrame({
"Feature": X_train.columns,
"Importance": rf_importances
}).sort_values("Importance", ascending=False).reset_index(drop=True)
console = Console()
# Limit to top N features
top_n = 10
df = feature_importance_df.head(top_n)
max_importance = df["Importance"].max()
bar_width = 30
# Create a horizontal bar
def importance_bar(value, max_value, width=30):
filled = int((value / max_value) * width)
return Text("█" * filled + " " * (width - filled), style="blue")
# Create a Rich table
table = Table(title="Random Forest Feature Importance (Top 10)")
table.add_column("Feature", style="black", no_wrap=True)
table.add_column("Importance", justify="right")
table.add_column("")
# Add each feature as a row with a horizontal bar
for _, row in df.iterrows():
table.add_row(
row["Feature"],
f"{row['Importance']:.4f}",
importance_bar(row["Importance"], max_importance, bar_width))
# Print the table
console.print(table)
Random Forest Feature Importance (Top 10) ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ Feature ┃ Importance ┃ ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ timeLaneAgility │ 0.1130 │ ██████████████████████████████ │ │ body_fat_pct │ 0.1062 │ ████████████████████████████ │ │ weight │ 0.0968 │ █████████████████████████ │ │ max_vertical │ 0.0950 │ █████████████████████████ │ │ wingspan │ 0.0899 │ ███████████████████████ │ │ height │ 0.0871 │ ███████████████████████ │ │ timeThreeQuarterCourtSprint │ 0.0860 │ ██████████████████████ │ │ reach_standing │ 0.0833 │ ██████████████████████ │ │ standing_vertical │ 0.0830 │ ██████████████████████ │ │ bench_reps │ 0.0753 │ ███████████████████ │ └─────────────────────────────┴────────────┴────────────────────────────────┘
These results reveal that lane agility time carries the most weight in the model’s decision of whether a player is drafted across all positions, followed by body fat percentage and weight. Max vertical jump, wingspan, and height also play notable roles in the model’s predictions.
Still, the distribution is relatively flat: bench reps, the least important combine metric, carries only about one-third less weight than lane agility time (0.075 versus 0.113).
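This flatness can be quantified from the top-10 importances printed above:

```python
# Compare the least and most important combine metrics from the table above.
importances = {
    "timeLaneAgility": 0.1130, "body_fat_pct": 0.1062, "weight": 0.0968,
    "max_vertical": 0.0950, "wingspan": 0.0899, "height": 0.0871,
    "timeThreeQuarterCourtSprint": 0.0860, "reach_standing": 0.0833,
    "standing_vertical": 0.0830, "bench_reps": 0.0753,
}
lo, hi = min(importances.values()), max(importances.values())
print(f"bench_reps / timeLaneAgility = {lo / hi:.2f}")  # 0.67
```

A ratio of roughly two-thirds means no single metric dominates the model's decisions.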
Data Predictability¶
Finally, we can ask the question:
Do the input variables contain statistically significant information that a Random Forest learning algorithm can exploit to make useful predictions?
To answer this question, we'll look at the p-value:
score, perm_scores, p_value = permutation_test_score(
rf,
X,
y,
scoring="f1",
cv=5,
n_permutations=500,
n_jobs=-1
)
# Print the results
print("=== Permutation Test Results ===")
print(f"Permutation test p-value: {p_value:.4f}")
=== Permutation Test Results === Permutation test p-value: 0.5968
A p-value of 0.05 or lower would indicate that the model’s predictions are statistically significant. Our p-value (0.5968) is far above this threshold. Although the model’s observed accuracy and F1 score were encouraging, this high p-value suggests that its apparent predictive power is likely attributable to random chance rather than genuine signal.
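For intuition, `permutation_test_score` refits the model on randomly shuffled labels and computes the p-value as the fraction of permuted scores that match or beat the real score, with a +1 correction in both numerator and denominator. A minimal sketch of that calculation, using illustrative numbers rather than the actual run:

```python
import numpy as np

# Illustrative values only: real_score and perm_scores are synthetic stand-ins,
# not the scores from the analysis above.
rng = np.random.default_rng(0)
real_score = 0.62                            # score on the true labels
perm_scores = rng.normal(0.60, 0.03, 500)    # scores after shuffling labels

# sklearn's formula: p = (count of permuted scores >= real score + 1) / (n + 1)
p_value = (np.sum(perm_scores >= real_score) + 1) / (len(perm_scores) + 1)
print(f"p-value: {p_value:.4f}")
```

A real score that barely separates from the shuffled-label scores, as here, yields a large p-value, which is exactly the pattern our model exhibits.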
This answers our question: the input variables from the NBA combine likely do not contain enough statistically significant information to train a learning algorithm that predicts the draft better than chance. In other words, the NBA combine alone is not predictive of draft outcomes, even though some correlations may still exist.
Conclusion¶
This analysis explored the predictive value of NBA Draft Combine metrics for draft outcomes from 2000–2020 using a Random Forest classifier. While the model achieved an accuracy of 60.8% and an F1 score of 0.660, permutation testing revealed a high p-value (0.5968), indicating that these predictions are not statistically significant. The combine data alone does not reliably predict draft outcomes.
Key Insights:¶
Most influential metrics: Lane agility time, body fat percentage, and weight were the features the model relied on most heavily, followed by max vertical jump, wingspan, and height.
Overall predictability: Despite some correlations between combine metrics and draft likelihood, the dataset lacks sufficient statistical signal to consistently inform draft decisions.
Final Remarks:¶
The NBA Draft Combine offers a snapshot of player athleticism, but these measurements were not shown to be predictive of draft outcome. Draft decisions are also shaped by factors not captured at the combine, including in-game performance, skill development, and team strategy. Integrating these elements could improve future predictive models.