What is this?
An Expected Goals (xG) dashboard for the top 5 European football leagues — Premier League, La Liga, Bundesliga, Serie A, and Ligue 1. It pulls shot-by-shot data from Understat, trains a custom XGBoost model to predict goal probability, and presents the results as interactive team, player, and match analytics.
The model is trained on historical data (2020–2024) and scored against a 2025 holdout set, where it achieves a Brier score of 0.074 — comparable to Understat's own model at 0.072.
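For readers unfamiliar with the metric, the Brier score is the mean squared difference between each predicted probability and the actual outcome (1 for a goal, 0 otherwise), so lower is better. The sketch below is a minimal illustration in plain Python. The function name and the sample shots are made up for the example and are not taken from the project's code or data.

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted goal probability and outcome (1 = goal)."""
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    if not probs:
        raise ValueError("at least one shot is required")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Hypothetical shots: predicted xG and whether each one was scored.
predicted_xg = [0.05, 0.30, 0.76, 0.10]
scored = [0, 1, 1, 0]

print(round(brier_score(predicted_xg, scored), 4))
```

A forecast that always predicts 0.5 scores 0.25 on any data set, which gives a rough sense of scale for values such as 0.074.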
Data Pipeline
The pipeline uses the understat Python library to pull every match result, then fetches individual shot data per match via get_match_shots(). Each shot record includes position (x, y), shot type, situation, last action, and whether it resulted in a goal. In total, that is ~250k shots across 5 leagues and 6 seasons.
Predictions are calibrated with CalibratedClassifierCV(method="isotonic", cv=5) so that predicted probabilities are well calibrated: a model predicting 0.30 xG should see goals ~30% of the time. Penalty kicks are assigned a fixed xG of 0.76 rather than being modelled (insufficient variance).
Specialist Models
Three situation groups currently have dedicated models. Understat doesn't tag counter-attacks as a separate situation (they appear as OpenPlay), so counter context is captured via the fast_break feature instead.
- OpenPlay — Full 24-feature set. Deeper trees (max_depth up to 5). Covers the vast majority of shots (~74%). Includes throughball, rebound and transition proxies.
- FromCorner — 13 features focused on header mechanics: header × distance, header × angle, centrality, weak-angle header flag. Corners are almost exclusively aerial situations; removing irrelevant features reduces overfitting.
- SetPiece — Covers SetPiece (indirect free kicks) and DirectFreekick. 13 features focused on distance, angle and shot type. No transition features — set pieces are stationary.
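One way to picture how shots reach the three specialist models is a small routing function. This is a sketch under assumptions, not the project's actual code: the group names, the score_shot helper, the "situation" key and the "Penalty" label are illustrative, and each model is represented by a plain callable that returns an xG value.

```python
PENALTY_XG = 0.76  # fixed penalty value stated in the text

# Situation label -> specialist model group (group names are hypothetical).
SITUATION_TO_MODEL = {
    "OpenPlay": "open_play",
    "FromCorner": "from_corner",
    "SetPiece": "set_piece",
    "DirectFreekick": "set_piece",  # shares the SetPiece model
}


def score_shot(shot, models):
    """Return xG for one shot dict, routing by its situation label.

    `models` maps a group name to a callable taking the shot and returning xG.
    """
    situation = shot["situation"]
    if situation == "Penalty":  # assumed label; penalties bypass the models
        return PENALTY_XG
    group = SITUATION_TO_MODEL[situation]  # KeyError on an unknown situation
    return models[group](shot)
```

Raising on an unknown situation, rather than silently falling back to the open-play model, makes it obvious if the data source ever introduces a new label.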
The Watch List
Rather than a static season table, the Watch List is a rolling 30-day form guide. For each player with ≥2 goals in the last 30 days, it computes a 0–100 rating from three components:
- Efficiency (40%) — Goals ÷ xG, capped at 2.0× to prevent small-sample flukes from dominating. A ratio above 1.0 means the player is scoring more than their chances would predict.
- Shot Quality (35%) — Mean xG per shot, capped at 0.50. Rewards players getting into high-value positions, not just high-volume shooters.
- Volume (25%) — Log-scaled shot count. Ensures consistent creators score higher than one-hit wonders, without letting shot-count alone dominate.
Each component is normalised 0–1 across all qualifying players before blending, then the composite score is scaled to 0–100. Arrows show movement vs the prior 30-day window — hover for previous period stats.
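The blend described above can be sketched in a few lines of plain Python. Several details are assumptions made for illustration only: min-max normalisation, log1p as the log scaling, multiplying the blended score by 100, and every function and variable name. The weights, caps and the two-goal threshold come from the text.

```python
import math

WEIGHTS = {"efficiency": 0.40, "shot_quality": 0.35, "volume": 0.25}


def raw_components(goals, xg, shots):
    """Raw (un-normalised) components for one player over the 30-day window."""
    return {
        "efficiency": min(goals / xg, 2.0) if xg > 0 else 0.0,      # capped at 2.0x
        "shot_quality": min(xg / shots, 0.50) if shots > 0 else 0.0,  # mean xG per shot
        "volume": math.log1p(shots),                                  # assumed log scaling
    }


def min_max(values):
    """Scale values to 0-1; identical values all map to 0.5 (an assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def watch_list_ratings(players):
    """players: name -> (goals, xg, shots). Returns name -> 0-100 rating."""
    names = [n for n, (goals, _, _) in players.items() if goals >= 2]
    if not names:
        return {}
    raw = [raw_components(*players[n]) for n in names]
    scores = dict.fromkeys(names, 0.0)
    for key, weight in WEIGHTS.items():
        for name, value in zip(names, min_max([r[key] for r in raw])):
            scores[name] += weight * value
    return {n: round(100 * s, 1) for n, s in scores.items()}


# Hypothetical players: (goals, xG, shots) in the last 30 days.
sample = {"A": (5, 2.0, 10), "B": (2, 2.5, 20), "C": (1, 3.0, 30)}
print(watch_list_ratings(sample))
```

In this toy sample, player C is dropped for having fewer than two goals, and because normalisation is relative to the qualifying pool, a player's rating can move even when their own numbers do not.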
Known Limitations
- No defensive context — Understat doesn't expose freeze-frame data (defender positions). This is likely the main reason our AUC (0.792) slightly trails Understat (0.805).
- No goalkeeper data — Shot-stopping quality isn't modelled. A shot at an elite keeper has the same xG as one at a weaker keeper.
- Understat coordinates come from event data, not tracking data, so positional accuracy is limited by the data source.
- Counter-attack proxy — Understat doesn't provide a counter-attack flag. We approximate with last-action context (TakeOn, BallRecovery, Throughball).
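As an illustration of the counter-attack proxy, a flag of this kind could be derived from the last action before the shot. The exact rule is not given here, so the version below is only a guess at its shape: the "lastAction" and "situation" keys and the restriction to open play are assumptions, while the three action names are the ones listed above.

```python
TRANSITION_ACTIONS = frozenset({"TakeOn", "BallRecovery", "Throughball"})


def fast_break(shot):
    """Return 1 if an open-play shot followed a transition-style last action, else 0."""
    is_open_play = shot.get("situation") == "OpenPlay"          # assumed restriction
    is_transition = shot.get("lastAction") in TRANSITION_ACTIONS  # assumed field name
    return int(is_open_play and is_transition)
```

A proxy like this will mislabel some shots, for example a take-on during slow build-up play, which is why it is listed as a limitation rather than a true counter-attack tag.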