About The Model

Learn more about our models and how we predict MLB game outcomes

About The Model & Training Methods

Our Models

We employ two sophisticated models to predict the outcomes of MLB games. Each model is designed to leverage different sets of data and machine learning techniques to maximize the accuracy of our predictions.

Model 1

Data Sources

  • Box Scores & Stats (1998-2024): Our analysis begins with more than two decades of box scores and game statistics. The dataset covers team-level results such as runs, hits, and errors, along with individual player metrics such as batting average, strikeouts, and walks. This depth of history lets the model pick up long-term trends and subtle patterns that a smaller sample would miss.
  • MLB Data and Historical Sports Book Odds: In addition to game statistics, we incorporate historical sports book odds. These odds provide valuable insights into how bookmakers have viewed the probability of various outcomes over time. By analyzing discrepancies between actual outcomes and predicted odds, our model can adjust its predictions to account for market expectations and biases.
  • MLB Player Positions: Understanding the dynamics of player positions is crucial. Our model considers not only the positions players typically occupy but also their performance in those roles. This includes defensive statistics, positional versatility, and how players' positions affect their offensive and defensive contributions.
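
To make the odds comparison concrete, here is a minimal sketch of how a moneyline can be turned into an implied probability. It assumes the odds are stored in American format (for example -150 or +130); the function names are illustrative and are not taken from our codebase.

```python
def implied_probability(american_odds: int) -> float:
    """Convert American moneyline odds to the bookmaker's implied win probability."""
    if american_odds < 0:
        # Favorite: you risk |odds| to win 100.
        return -american_odds / (-american_odds + 100)
    # Underdog: you risk 100 to win `odds`.
    return 100 / (american_odds + 100)


def remove_vig(home_odds: int, away_odds: int) -> tuple[float, float]:
    """Rescale both sides so they sum to 1, stripping out the bookmaker's margin."""
    home = implied_probability(home_odds)
    away = implied_probability(away_odds)
    total = home + away  # greater than 1 when the book charges a margin
    return home / total, away / total


# Hypothetical line: home team -150, away team +130.
home_p, away_p = remove_vig(-150, 130)
print(round(home_p, 3), round(away_p, 3))
```

Comparing these market probabilities with how often such teams actually won is one way to surface the discrepancies mentioned above.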

Machine Learning Technique

XGBoost: This model uses XGBoost, a widely used gradient boosting framework known for its efficiency and accuracy. XGBoost builds decision trees one after another, with each new tree trained to correct the errors of the trees before it, and sums their outputs to produce the final prediction. It handles large datasets well and can capture complex interactions between variables. We tune our implementation through cross-validation and parameter optimization so that it fits the patterns in our data without overfitting to them.
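
The boosting idea can be illustrated with a deliberately small sketch. This is not XGBoost itself, which adds regularization and many optimizations omitted here; it is plain gradient boosting with one-split trees on a single feature, and the data is made up for the example.

```python
def fit_stump(xs, residuals):
    """Find the one threshold split that minimizes squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_boosted(xs, ys, n_rounds=50, learning_rate=0.1):
    """Each round fits a stump to the current residuals and adds a shrunken copy of it."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


# Made-up data: a team-strength gap (x) and whether the home team won (y).
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
model = fit_boosted(xs, ys)
print(model(-2.0), model(2.0))
```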

Model 2

Data Sources

  • Game Data (2010-2024): This model draws from a decade and a half of comprehensive game statistics, providing a robust foundation for analyzing recent trends and developments in MLB. This dataset includes everything from game outcomes and scoring patterns to detailed player statistics and in-game events.
  • Pitchers & Pitchers' ERA: A key focus of this model is pitching performance. By analyzing detailed statistics on pitchers, including Earned Run Average (ERA), strikeouts, walks, and other pitching metrics, we can gauge a pitcher's effectiveness and consistency. ERA, in particular, serves as a crucial indicator of a pitcher's ability to prevent runs.
  • Team Batting Stats: Our model also incorporates extensive batting statistics, including metrics such as batting averages, on-base percentages, slugging percentages, and advanced stats like Weighted On-Base Average (wOBA) and Wins Above Replacement (WAR). These metrics help assess the offensive capabilities of teams and individual players, providing a comprehensive view of their potential impact on game outcomes.
  • Historical Sports Book Odds: As with Model 1, we integrate historical sports book odds to gain insights into market expectations and biases. This information helps our model adjust its predictions to reflect the perspectives of experienced bookmakers.
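
For readers unfamiliar with the metrics above, the sketch below shows the standard formulas for ERA and on-base percentage. The function names and the sample numbers are illustrative only.

```python
def era(earned_runs: int, innings_pitched: float) -> float:
    """Earned Run Average: earned runs allowed per nine innings."""
    return 9 * earned_runs / innings_pitched


def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sacrifice_flies):
    """OBP: times on base divided by the plate appearances that count toward it."""
    reached = hits + walks + hit_by_pitch
    chances = at_bats + walks + hit_by_pitch + sacrifice_flies
    return reached / chances


# Hypothetical season lines, not real players.
print(era(earned_runs=60, innings_pitched=180))
print(round(on_base_percentage(150, 60, 5, 500, 5), 3))
```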

Machine Learning Techniques

  • Random Forest: This ensemble method builds multiple decision trees and merges their results to improve accuracy and control overfitting. By averaging the predictions of many trees, our Random Forest model provides robust and reliable predictions, even in the presence of noisy data.
  • Logistic Regression: A statistical model that uses a logistic function to model binary dependent variables, logistic regression is particularly useful for estimating the probability of binary outcomes, such as win/loss predictions. This technique helps our model provide clear and interpretable probability estimates.
  • Neural Network: A neural network passes the input features through layers of interconnected nodes, a structure loosely inspired by neurons in the brain. This lets the model learn complex interactions and non-linear relationships that simpler models can miss.
  • XGBoost: We use XGBoost again for its superior performance in handling various types of data and producing highly accurate predictions. Its ability to manage large datasets and complex interactions makes it an invaluable tool in our predictive arsenal.
  • Light Gradient Boosting Machine (LightGBM): LightGBM is a gradient boosting framework that uses tree-based learning algorithms, optimized for speed and efficiency. Its ability to handle large-scale data with high efficiency makes it a key component of our model.
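
Of the techniques listed, logistic regression is the simplest to show in a few lines. The sketch below computes a win probability from a weighted sum of features; the two features and the weight values are hypothetical placeholders, not fitted coefficients from our model.

```python
import math


def win_probability(features, weights, bias):
    """Logistic regression: squash a weighted sum of features into the range (0, 1)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 / (1 + math.exp(-z))


# Hypothetical features: [starter ERA difference, team OBP difference] (home minus away).
weights = [-0.4, 8.0]  # placeholder values for illustration only
even_matchup = win_probability([0.0, 0.0], weights, bias=0.0)
home_edge = win_probability([-1.0, 0.02], weights, bias=0.0)
print(even_matchup, round(home_edge, 3))
```

Because each weight applies to one named feature, it is easy to see which inputs push the probability up or down, which is why this technique is described as interpretable.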

Model Selection

After running each game through all of these techniques, we choose which prediction to use based on a combination of model performance metrics and historical accuracy. Because each technique has different strengths and picks up on different variables and interactions, comparing them lets us rely on whichever has proven most accurate rather than depending on a single algorithm.
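
One way to picture the selection step is to score each technique's past probabilities against known results and keep the best one. The sketch below uses log loss as the metric, which is an assumption made for the example; the model names and numbers are invented.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood; lower means better-calibrated probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def select_best(y_true, predictions_by_model):
    """Score every model on the same games and return the one with the lowest loss."""
    scores = {name: log_loss(y_true, probs) for name, probs in predictions_by_model.items()}
    return min(scores, key=scores.get), scores


# Invented held-out results (1 = home win) and invented model outputs.
results = [1, 0, 1, 1, 0]
candidates = {
    "random_forest": [0.60, 0.40, 0.70, 0.55, 0.45],
    "logistic_regression": [0.80, 0.20, 0.75, 0.70, 0.30],
    "xgboost": [0.50, 0.50, 0.50, 0.50, 0.50],
}
best, scores = select_best(results, candidates)
print(best)
```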

Additional Tools and Methods

kProps Tool

Overview: The kProps tool is an essential component of our prediction strategy, designed to extract and analyze key propositions from detailed game and player data. It focuses on several crucial aspects of team and player performance to enhance the accuracy of our predictions.

Key Features:

  • Team Rosters & Player Positions: The kProps tool provides comprehensive information on team compositions and individual player roles. By analyzing team rosters, we can assess the impact of player availability, positional changes, and lineup adjustments on game outcomes.
  • Game Logs & Player Performance: Tracking game-by-game performance is vital for understanding player trends and consistency. The kProps tool collects detailed game logs for pitchers and hitters, allowing us to analyze performance metrics over time. This includes metrics such as innings pitched, strikeouts, earned runs, and batting statistics.
  • Starting Pitcher Identification: Identifying starting pitchers is crucial for accurate game predictions. The kProps tool determines which pitchers are scheduled to start, using historical data and current roster information. This helps us assess the potential impact of starting pitchers on game outcomes, considering factors like pitcher rest days and performance trends.
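
As an example of the rest-day factor mentioned above, the following sketch computes days of rest from a pitcher's previous start dates. It assumes the usual convention that a pitcher who starts on April 1 and again on April 6 is on four days of rest; the function name and dates are illustrative.

```python
from datetime import date


def days_of_rest(start_dates, game_date):
    """Full days off between the most recent prior start and the given game."""
    previous = [d for d in start_dates if d < game_date]
    if not previous:
        return None  # no earlier start on record
    return (game_date - max(previous)).days - 1


# Hypothetical schedule for one pitcher.
starts = [date(2024, 4, 1), date(2024, 4, 6)]
print(days_of_rest(starts, date(2024, 4, 11)))
```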

Data Sources:

  • MLB API: Real-time and historical data from the official MLB API provide a reliable and up-to-date source of game and player information. This ensures our predictions are based on the most current and accurate data available.
  • Game Statistics & Player Metrics: Detailed statistics on games and player performance form the core of the kProps tool. By analyzing these metrics, we can identify key performance indicators and trends that influence game outcomes.

kProps Tool: Starting Pitcher Strikeout Prediction

Overview: Our kProps tool is designed to predict the number of strikeouts for starting pitchers using advanced machine learning techniques. Trained on comprehensive data from 2020 to 2024, it focuses on key metrics such as innings pitched, ERA, and historical strikeout rates to provide accurate predictions for today’s games.

Key Features:

  • Player Performance Data: The model analyzes detailed game logs for each starting pitcher, including innings pitched, strikeouts, and ERA, to assess their performance over time.
  • Historical Data: The model uses historical data to identify trends and patterns in a pitcher’s strikeout performance.
  • Matchup Analysis: Evaluates the dynamics between the starting pitcher and the opposing team's batting lineup, taking into account past matchups and performance against similar hitters.

Model Training:

Our kProps tool uses a Linear Regression model, trained on data from the past five years. The model is built using a carefully selected dataset that captures the nuances of pitcher performance and strikeout potential. Here's an outline of the training process:

  • Data Collection: Gathering game logs and statistics for all starting pitchers from 2020 to 2024.
  • Feature Selection: Key features such as innings pitched and ERA are used to train the model.
  • Model Evaluation: The model is evaluated with metrics such as Mean Squared Error, which measures how far its predictions fall from actual strikeout totals.
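
The training outline above can be sketched in plain Python. This is an illustration under stated assumptions, not our training code: it fits ordinary least squares on two features (innings pitched and ERA) and scores held-out rows with Mean Squared Error. The rows are synthetic, generated from a known line so the fit is easy to check.

```python
def fit_linear(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    X = [[1.0, *row] for row in rows]  # leading 1.0 is the intercept column
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(n)]
         + [sum(x[i] * y for x, y in zip(X, targets))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def predict(weights, row):
    return weights[0] + sum(w * x for w, x in zip(weights[1:], row))


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Synthetic rows: (innings pitched, ERA) -> strikeouts = 1 + IP - 0.5 * ERA.
train_rows = [(6.0, 3.0), (5.0, 4.5), (7.0, 2.5), (4.0, 5.0)]
train_targets = [5.5, 3.75, 6.75, 2.5]
test_rows = [(6.0, 4.0), (5.0, 3.5)]
test_targets = [5.0, 4.25]

weights = fit_linear(train_rows, train_targets)
mse = mean_squared_error(test_targets, [predict(weights, r) for r in test_rows])
print([round(w, 3) for w in weights], round(mse, 6))
```

With real game logs the targets would be noisy, so the held-out error would not be near zero as it is with this synthetic data.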

Prediction Process:

  • Daily Pitcher Data: The tool retrieves today’s starting pitchers and fetches their game logs.
  • Feature Preparation: The model prepares the features required for prediction, focusing on innings pitched and ERA.
  • Strikeout Predictions: The model predicts the number of strikeouts for each starting pitcher based on their historical performance data.
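
The feature preparation step might look roughly like the sketch below. The game-log layout (a list of dicts with "ip" and "er" keys) and the coefficient values are hypothetical, chosen only to show the flow from game logs to a strikeout estimate. Box scores write innings as "5.2" for five and two-thirds innings, so that notation is converted first.

```python
def innings_to_float(ip_notation: str) -> float:
    """Convert box-score innings ('6.1' = 6 and 1/3 innings) to a real number."""
    whole, _, outs = ip_notation.partition(".")
    return int(whole) + (int(outs) if outs else 0) / 3


def prepare_features(game_logs):
    """Aggregate a pitcher's recent starts into the two model inputs."""
    innings = [innings_to_float(g["ip"]) for g in game_logs]
    total_ip = sum(innings)
    total_er = sum(g["er"] for g in game_logs)
    return {
        "avg_innings_pitched": total_ip / len(game_logs),
        "era": 9 * total_er / total_ip,
    }


def predict_strikeouts(features, intercept, ip_coef, era_coef):
    """Apply a fitted linear model; the coefficients passed below are placeholders."""
    return intercept + ip_coef * features["avg_innings_pitched"] + era_coef * features["era"]


# Hypothetical recent starts for one pitcher.
logs = [{"ip": "6.0", "er": 2}, {"ip": "5.2", "er": 3}, {"ip": "6.1", "er": 1}]
features = prepare_features(logs)
estimate = predict_strikeouts(features, intercept=1.0, ip_coef=1.0, era_coef=-0.5)
print(features, round(estimate, 2))
```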

The kProps tool provides an informed, statistically backed prediction of strikeouts for today’s starting pitchers, helping users make better betting decisions.

Home Run Predictions

Overview: Our home run prediction model focuses on identifying likely home run outcomes based on a variety of factors. This model enhances our overall prediction accuracy by pinpointing players and conditions conducive to home runs.

Key Features:

  • Player Batting Stats: The model analyzes batting metrics such as at-bats, slugging percentage, isolated power (ISO), home runs per at-bat (HR/AB), plate appearances, and runs batted in (RBI). These metrics help assess a player's power-hitting potential.
    • Plate Appearances (PA): More trips to the plate generally mean more opportunities to hit a home run.
    • Slugging Percentage (SLG): This stat measures a player's power by factoring in extra-base hits. A high SLG is a good indicator of a player's ability to hit the long ball.
    • Isolated Power (ISO): ISO is slugging percentage minus batting average, so it counts only the extra bases gained on doubles, triples, and home runs and ignores singles. It's a good measure of a player's raw power.
    • Home Runs per At-Bat (HR/AB): This directly measures a player's home run frequency.
    • Runs Batted In (RBI): While RBI isn't a perfect stat, it can be a decent proxy for power. A player who drives in a lot of runs is likely hitting for power.
  • Pitcher Matchups: Understanding the dynamics between hitters and pitchers is crucial for predicting home runs. The model considers pitcher performance metrics such as home runs allowed, innings pitched, earned run average (ERA), strikeouts, walks, hits allowed, and WHIP (walks plus hits per inning pitched) to evaluate the likelihood of a player hitting a home run against a specific pitcher.
  • Historical Data: Utilizing historical home run data allows the model to identify patterns and trends in player performance and game conditions. This includes analyzing factors such as ballpark characteristics, weather conditions, and game situations.
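
The power and pitching metrics listed above follow standard formulas, sketched here for reference. The function names and the sample stat lines are illustrative.

```python
def slugging(singles, doubles, triples, home_runs, at_bats):
    """SLG: total bases per at-bat."""
    return (singles + 2 * doubles + 3 * triples + 4 * home_runs) / at_bats


def isolated_power(doubles, triples, home_runs, at_bats):
    """ISO: extra bases per at-bat, equal to SLG minus batting average."""
    return (doubles + 2 * triples + 3 * home_runs) / at_bats


def hr_per_ab(home_runs, at_bats):
    return home_runs / at_bats


def whip(walks, hits_allowed, innings_pitched):
    """WHIP: walks plus hits allowed per inning pitched."""
    return (walks + hits_allowed) / innings_pitched


# Hypothetical hitter: 90 singles, 30 doubles, 2 triples, 28 home runs in 500 at-bats.
slg = slugging(90, 30, 2, 28, 500)
iso = isolated_power(30, 2, 28, 500)
avg = (90 + 30 + 2 + 28) / 500
print(round(slg, 3), round(iso, 3), round(hr_per_ab(28, 500), 3))

# Hypothetical pitcher: 50 walks and 160 hits allowed over 180 innings.
print(round(whip(50, 160, 180), 3))
```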

Machine Learning Technique:

  • Random Forest Classifier & Regressor: Our home run prediction model uses two Random Forest models. The Random Forest Classifier handles the binary question of whether a home run will occur, while the Random Forest Regressor predicts the expected number of home runs.
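
The difference between the two outputs can be shown with a toy ensemble. The sketch below is a simplification, not our model: it bags one-split trees on a single feature (a full random forest grows deeper trees and also samples features), and the data is invented. Trained on 0/1 labels, the averaged output reads as a probability; trained on counts, it reads as an expected number.

```python
import random


def fit_stump(xs, ys):
    """One-split tree on a single feature; predicts the mean target on each side."""
    values = sorted(set(xs))
    if len(values) < 2:  # bootstrap sample with one distinct x: predict a constant
        mean = sum(ys) / len(ys)
        return lambda x: mean
    best = None
    for threshold in values[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_forest(xs, ys, n_trees=200, seed=0):
    """Train each tree on a bootstrap resample and average the trees' outputs."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(tree(x) for tree in trees) / len(trees)


# Invented data: hitter ISO, whether he homered (0/1), and how many he hit.
iso = [0.10, 0.12, 0.15, 0.18, 0.20, 0.22, 0.25, 0.28]
hit_hr = [0, 0, 0, 0, 1, 0, 1, 1]
hr_count = [0, 0, 0, 0, 1, 0, 1, 2]

hr_probability = fit_forest(iso, hit_hr)   # classifier-style output
expected_hr = fit_forest(iso, hr_count)    # regressor-style output
print(hr_probability(0.27), expected_hr(0.27))
```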

Disclaimers

Unpredictable Variables:

Our predictions do not account for sudden injuries, late changes in weather, or other unforeseen factors that might affect the game's outcome. Such variables introduce a level of randomness that is beyond the scope of any predictive model.

Inherent Randomness in Baseball:

Baseball, like all sports, involves a significant amount of randomness and luck. Events such as an unexpected home run or an error can significantly alter the game's outcome. Our models incorporate extensive data to mitigate this randomness but cannot eliminate it entirely.

Responsible Gambling:

Gambling involves risk, and it is important to bet responsibly. Our predictions are based on historical data and statistical models, which are not foolproof. If you or someone you know has a gambling problem, it is important to seek help. Organizations like the National Council on Problem Gambling provide resources and support for those in need.

Informational Purposes:

The predictions made by our models are intended for informational purposes only. They are not guaranteed outcomes, and users should not rely solely on these predictions for making gambling decisions.

By understanding the limitations and potential risks, users can make more informed decisions and enjoy the predictive insights provided by our models responsibly.