Sports Betting Machine Learning Algo (WIP)


Python SQL Discord.py PyTorch Beautifulsoup OCR Machine Learning



About the project

This document describes a project to develop a machine learning algorithm that classifies sports bets of different forms from messages (generally Discord messages, but any input source can be used). This is step 1 of a larger project; each part is outlined below.

 

Part 1 - Development of the model

  1. Collecting, labeling and preprocessing data from all sorts of potential sports betting messages (images/text).
  2. Model design in PyTorch using RNN and transformer-based models, starting with BERT/RoBERTa and fine-tuning with Hugging Face's transformers library.
  3. Training the model using supervised learning.
  4. Evaluation of the model using live feed data that betting staff post (new unbiased data which will give the best evaluation).
  5. Postprocessing to ignore betting recaps and any bets on events that have already happened. Checking for wins and losses is covered in Part 2.
  6. Deployment.

Part 1.5 - Using OpenAI in the interim while the model is being trained

  1. Utilizing gpt-3.5-turbo-instruct to pull sports betting data from messages.
  2. Messages go through some preprocessing before running through the model.
  3. Making sure the prompt is good enough to return consistent results each time to avoid errors.

Part 2 - Storing bets and returning W/L of each bet

  1. Once the model predicts and classifies the data properly, the extracted bets are stored in a database.
  2. Each bet is saved in the format {sport: "", teams: ["team1", "team2"], players: ["player1", "player2", "etc"], bet_type: "over/under/etc", points: "", units: "", etc}
  3. Once each betting day ends (generally midnight EST), the data will be checked to see if the bet resulted in a win or a loss.
  4. Daily recaps on the data will be made.
  5. Data will also be accessible visually via a Streamlit app to give better historical W/L data on all handicappers for members to view.
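
As a rough illustration of steps 1 to 3, the sketch below stores a bet record as JSON in SQLite and grades a simple over/under bet once the final stat total is known. The table name `bets`, its columns, the helper names, and the `"P"` result for a push (the total landing exactly on the line) are all assumptions made for this example, not the project's actual schema.

```python
import json
import sqlite3


def create_bets_table(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: result stays NULL until the bet has been graded.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS bets (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            handicapper TEXT NOT NULL,
            bet_json TEXT NOT NULL,
            result TEXT
        )
        """
    )


def store_bet(conn: sqlite3.Connection, handicapper: str, bet: dict) -> int:
    # Serialize the whole bet record so list fields (teams, players) survive intact.
    cur = conn.execute(
        "INSERT INTO bets (handicapper, bet_json) VALUES (?, ?)",
        (handicapper, json.dumps(bet)),
    )
    conn.commit()
    return cur.lastrowid


def grade_over_under(bet: dict, actual_total: float) -> str:
    # Compare the final stat total against the line stored in the "points" field.
    line = float(bet["points"])
    if actual_total == line:
        return "P"  # assumed convention for a push
    if bet["bet_type"] == "over":
        return "W" if actual_total > line else "L"
    if bet["bet_type"] == "under":
        return "W" if actual_total < line else "L"
    raise ValueError(f"cannot grade bet_type {bet['bet_type']!r} as over/under")
```

Keeping the record as a single JSON column avoids a schema change every time a new label field is added; separate columns would make the daily recap queries easier, so the trade-off depends on how the data is read back.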

Part 3 - Converting images with bets into text

  1. Some bettors send images with their bets.
  2. OCR was used to pull the text from the messages and then that text was run through the model to classify the data. 

Part 4 - Future additions

Once this is working and gathering betting data/results accurately, the next step would be to look into adding the following.

  • Taking in the bet and gathering data related to the teams/players, then spitting out a confidence interval on the bet, maybe giving it a risk factor.
  • Giving confidence intervals on the handicappers rather than the bet, to give a rating on the handicapper giving members bets. 
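
One possible way to put a confidence interval on a handicapper is a Wilson score interval on their win rate. This is a suggested technique rather than something the project has settled on, and the function name and the default `z = 1.96` (roughly a 95% interval) are assumptions for the sketch.

```python
import math


def wilson_interval(wins: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a win rate of wins/total."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= wins <= total:
        raise ValueError("wins must be between 0 and total")
    p = wins / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# Same 60% win rate, very different certainty:
small_sample = wilson_interval(6, 10)
large_sample = wilson_interval(60, 100)
```

The interval narrows as the number of graded bets grows, so a handicapper with 6 wins in 10 bets gets a much wider range than one with 60 wins in 100, even though both have the same raw win rate.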

 

Part 1 - Development of the Model

This part goes over all the necessary steps and work involved in creating the model. This was the bulk of the work for the whole project, especially gathering and labeling the data so there was enough to build out an accurate model.

 

Step 1 - Gathering Data

  • Old data is in HTML format (pulled from old Discord channels).
  • Data is then extracted with Beautifulsoup and saved in a database.
  • Once all the data is in the database, each message is given a label.
    • The labeling work is done through Discord.
    • All data is stored in a database, but Discord is used to send surveys to anyone willing to help label the data.
    • A command pulls a random unlabeled message and lets a user label it.
    • The labels that the data can be given are {sport: "", teams: ["team1", "team2"], players: ["player1", "player2", "etc"], bet_type: "over/under/etc", points: "", units: "", etc}. Below you can find an example of the Discord survey.
  • There is also a possibility of images containing bets so another set of data will be collected containing images.
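
The "pull a random unlabeled message" step could look roughly like the sketch below. It uses SQLite for illustration, and the `messages` table, its `content` and `label` columns, and the function names are assumptions; the project's real schema and database engine may differ.

```python
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    # Hypothetical table: label is NULL until someone completes the survey.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS messages (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            content TEXT NOT NULL,
            label TEXT
        )
        """
    )


def fetch_random_unlabeled(conn: sqlite3.Connection):
    # ORDER BY RANDOM() is fine for a modest table; returns None when nothing is left.
    return conn.execute(
        "SELECT id, content FROM messages WHERE label IS NULL "
        "ORDER BY RANDOM() LIMIT 1"
    ).fetchone()


def save_label(conn: sqlite3.Connection, message_id: int, label_json: str) -> None:
    # Store the finished survey answers (serialized elsewhere) against the message.
    conn.execute("UPDATE messages SET label = ? WHERE id = ?", (label_json, message_id))
    conn.commit()
```

A Discord command handler would call `fetch_random_unlabeled`, walk the user through the survey, and finish with `save_label`.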

Survey for messages

The survey will have the following questions to get the necessary label for the data.

  1. Full message is sent to the user.
  2. The user is asked whether the message contains a bet. If it does not, or if it is a recap, the message is removed from the database and the user is given a new survey.
  3. The user is asked to extract only the relevant text (just the bet, leaving out historical data and anything else irrelevant).
  4. The user pastes in the extracted text.
  5. Preprocessing is done at this stage: the text is trimmed to the user's input and converted to lowercase.
  6. Send the new refined message.
  7. Ask how many bets are made in this message.
  8. Then ask the following questions for each bet in that message. NA is written if the data is not provided (ex. bookmaker).
    1. Sport/eSport -  basketball, soccer, valorant, cs2, etc.
    2. Team(s) - format as (team1, team2),  if none then write NA
    3. Player(s) - format as (player1, player2, etc) if none then write NA
    4. bet_type - over/under, moneyline, spread, etc.
    5. p_r_a - explicitly write out points and/or rebounds and/or assists; if nothing is specified, assume points (ex. Points + Assists)
    6. total_points - ex. 22.5
    7. points_spread - ex. -15
    8. units_bet - NA (will mean default 1u)
    9. bookmaker - ex. MGM 
    10. odds - ex. 1.87
  9. Once this is done then give a new survey if the user wants. 
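
The per-bet answers from step 8 need to be normalized before they are saved as a label. The sketch below shows one way to do that, applying the rules stated above (NA means missing, p_r_a defaults to points, units_bet defaults to 1). The function names and the shape of the `answers` dictionary are assumptions for illustration.

```python
def parse_list_answer(answer: str) -> list[str]:
    """Turn '(team1, team2)' into ['team1', 'team2'] and 'NA' into []."""
    cleaned = answer.strip().strip("()").strip()
    if not cleaned or cleaned.upper() == "NA":
        return []
    return [item.strip() for item in cleaned.split(",") if item.strip()]


def optional_answer(answer: str):
    """Return None for 'NA' or an empty answer, otherwise the trimmed text."""
    cleaned = answer.strip()
    if not cleaned or cleaned.upper() == "NA":
        return None
    return cleaned


def build_label(answers: dict) -> dict:
    # `answers` maps each survey field name to the raw text the user typed.
    return {
        "sport": answers["sport"].strip().lower(),
        "teams": parse_list_answer(answers["teams"]),
        "players": parse_list_answer(answers["players"]),
        "bet_type": answers["bet_type"].strip().lower(),
        "p_r_a": (optional_answer(answers["p_r_a"]) or "points").lower(),
        "total_points": optional_answer(answers["total_points"]),
        "points_spread": optional_answer(answers["points_spread"]),
        "units_bet": optional_answer(answers["units_bet"]) or "1",
        "bookmaker": optional_answer(answers["bookmaker"]),
        "odds": optional_answer(answers["odds"]),
    }
```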

Survey for images

This survey follows the one used for text but skips the text-extraction steps.

  1. Image is sent to the user.
  2. The user is asked whether the image contains a bet. If it does not, or if it is a recap, the image is removed from the database and the user is given a new survey.
  3. Ask how many bets are made in the image.
  4. Then ask the following questions for each bet in that image. NA is written if the data is not provided (ex. bookmaker).
    1. Sport/eSport -  basketball, soccer, valorant, cs2, etc.
    2. Team(s) - format as (team1, team2),  if none then write NA
    3. Player(s) - format as (player1, player2, etc) if none then write NA
    4. bet_type - over/under, moneyline, spread, etc.
    5. p_r_a - explicitly write out points and/or rebounds and/or assists; if nothing is specified, assume points (ex. Points + Assists)
    6. total_points - ex. 22.5
    7. points_spread - ex. -15
    8. units_bet - NA (will mean default 1u)
    9. bookmaker - ex. MGM 
    10. odds - ex. 1.87
  5. Once this is done then give a new survey if the user wants. 

Part 1.5 - Using OpenAI in the Interim While the Model is Being Trained

Since gathering and labeling enough data will take time, along with all the other steps of model development, I opted to use OpenAI's gpt-3.5-turbo-instruct to extract and classify each bet in a message and store it in a database. 

 

Some preprocessing was done to the data in order to clean it up for the request. This was done to shorten the length to only include the bet as well as to help return consistent results. Below you can find some of the filtering that was done.

import re

lines = message.lower().split('\n')
betting_lines = ""

words_to_ignore = ["recap", "summary"]
words_to_keep = ["pts", "reb", "ast", "p+a", "p+r", "a+r", "bet"]
over_under_pattern = r'\b[ou]\d+(\.\d+)?\b'
unit_pattern = r'\d+(\.\d+)?u'

# Longer keys come first so "p+a+r" is expanded before "p+a" can match part of it
abbreviations = {
    "p+a+r": "points + assists + rebounds",
    "pts": "points",
    "ast": "assists",
    "reb": "rebounds",
    "p+a": "points + assists",
    "p+r": "points + rebounds",
    "a+r": "assists + rebounds"
}

over_under_mapping = {
    'o': 'over', 
    'u': 'under'
}

def replace_over_under(match):
    full_match = match.group(0)
    prefix = full_match[0]  # Extract the prefix ('o' or 'u')
    value = full_match[1:]  # Extract the numerical value
    return f"{over_under_mapping[prefix]}{value}"

if not any(keyword in message.lower() for keyword in words_to_ignore):
    for line in lines:
        matches_over_under = re.findall(over_under_pattern, line)
        matches_unit = re.findall(unit_pattern, line)
        if matches_over_under or matches_unit:
            for abbreviation, full_form in abbreviations.items():
                line = line.replace(abbreviation, full_form)
            modified_line = re.sub(over_under_pattern, replace_over_under, line)
            betting_lines = betting_lines + modified_line + "\n"

The initial filtering reads the message line by line and pulls out just the bets. Because bets contain specific text (an over/under value such as o20.5 or a unit size such as 1u), regular expressions are used to keep only the lines that are needed. 

The next step expands the abbreviations, which helps classify each part of the bet more accurately when the request is made. 
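
One caveat with plain string replacement is that a short abbreviation such as "ast" also matches inside ordinary words like "last". The sketch below is an alternative that expands abbreviations only when they stand on their own, trying longer keys first. The dictionary is copied from the snippet above; the function name and the lookaround-based pattern are assumptions for illustration, not the project's code.

```python
import re

ABBREVIATIONS = {
    "pts": "points",
    "ast": "assists",
    "reb": "rebounds",
    "p+a": "points + assists",
    "p+r": "points + rebounds",
    "a+r": "assists + rebounds",
    "p+a+r": "points + assists + rebounds",
}

# Longest keys first so "p+a+r" is tried before "p+a". The lookarounds reject
# a match that touches another letter or '+', so "ast" inside "last" is left alone,
# while a digit directly before the abbreviation (as in "25pts") is still allowed.
_ABBREVIATION_PATTERN = re.compile(
    r"(?<![a-z+])("
    + "|".join(re.escape(key) for key in sorted(ABBREVIATIONS, key=len, reverse=True))
    + r")(?![a-z+])"
)


def expand_abbreviations(line: str) -> str:
    # Lowercase first, matching the preprocessing done on the full message.
    return _ABBREVIATION_PATTERN.sub(
        lambda match: ABBREVIATIONS[match.group(1)], line.lower()
    )
```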

 

Once this was done, some text was added before and after the message to get consistent results. Both pieces are shown below. 

prompt_front = "Provide detail for each bet mentioned in the following message:\n"
promt_end = """
List all bets in the message with newline to separate each bet. Include the following data for each, keeping the same category names and if it doesn't exist in the bet, write None:

Sport: Must specify sport/esport
Teams: List any teams involved in the bet, if mentioned, with values separated by a ','
Players: Name of the player(s) involved in the bet, if mentioned, with values separated by a ','
Bet Type: Specify the type of bet (over, under, moneyline, spread, parlay, handicap, live)
Pts/Reb/Ast: Specify if the bet involves points, rebounds, or assists (points, rebounds, assists) with values separated by a '+'
Total Points: The total points involved in the bet, if mentioned
Point Spread: The spread needed for the bet, if mentioned
Units Bet: The number of units bet on this particular bet, don't include u or units
Bookmaker: Name of the bookmaker associated with the bet
Odds: The odds associated with the bet
"""

 

The returned response would look something like this:

Sport: Basketball
Teams: None
Players: J. Poole
Bet Type: Over
Pts/Reb/Ast: Points + Assists
Total Points: 20.5
Point Spread: None
Units Bet: 1
Bookmaker: MGM
Odds: 1.87

Sport: Basketball
Teams: None
Players: T. Hendricks
Bet Type: Over
Pts/Reb/Ast: Points + Rebounds
Total Points: 17.5
Point Spread: None
Units Bet: 1
Bookmaker: MGM
Odds: 1.85

Sport: Basketball
Teams: None
Players: L. James
Bet Type: Over
Pts/Reb/Ast: Points + Rebounds
Total Points: 31.5
Point Spread: None
Units Bet: 1
Bookmaker: FanDuel
Odds: 1.87
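
A response in this shape can be turned into one dictionary per bet by splitting on blank lines and then on the first colon of each line. The sketch below is one way to do it; the function name and the choice to return Teams and Players as lists are assumptions, while the field names come from the prompt.

```python
import re

FIELDS = [
    "Sport", "Teams", "Players", "Bet Type", "Pts/Reb/Ast",
    "Total Points", "Point Spread", "Units Bet", "Bookmaker", "Odds",
]
LIST_FIELDS = {"Teams", "Players"}  # the prompt asks for these separated by ','


def parse_bets(response_text: str) -> list[dict]:
    bets = []
    # Bets are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", response_text.strip()):
        bet = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            key, value = key.strip(), value.strip()
            if not sep or key not in FIELDS:
                continue  # skip anything that is not a known "Field: value" line
            if value.lower() == "none":
                bet[key] = [] if key in LIST_FIELDS else None
            elif key in LIST_FIELDS:
                bet[key] = [item.strip() for item in value.split(",") if item.strip()]
            else:
                bet[key] = value
        if bet:
            bets.append(bet)
    return bets
```

Skipping unknown lines instead of raising keeps the parser tolerant of small formatting drift in the model output; a stricter version could reject any bet that is missing one of the ten fields.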

 

This would then be sent elsewhere to be processed and stored in the database; you can read more on that in Part 2. The full function is attached below.

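import re

from openai import AsyncOpenAI

# Assumed setup: an async client from the openai package (v1 or later),
# which reads OPENAI_API_KEY from the environment by default
client = AsyncOpenAI()

async def get_ai_response(message):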

    lines = message.lower().split('\n')
    betting_lines = ""

    words_to_ignore = ["recap", "summary"]
    words_to_keep = ["pts", "reb", "ast", "p+a", "p+r", "a+r", "bet"]
    over_under_pattern = r'\b[ou]\d+(\.\d+)?\b'
    unit_pattern = r'\d+(\.\d+)?u'

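    # Longer keys come first so "p+a+r" is expanded before "p+a" can match part of it
    abbreviations = {
        "p+a+r": "points + assists + rebounds",
        "pts": "points",
        "ast": "assists",
        "reb": "rebounds",
        "p+a": "points + assists",
        "p+r": "points + rebounds",
        "a+r": "assists + rebounds"
    }

    over_under_mapping = {
        'o': 'over', 
        'u': 'under'
    }

    # Defined inside the function so it can see the local over_under_mapping
    def replace_over_under(match):
        full_match = match.group(0)
        prefix = full_match[0]  # Extract the prefix ('o' or 'u')
        value = full_match[1:]  # Extract the numerical value
        return f"{over_under_mapping[prefix]}{value}"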

    if not any(keyword in message.lower() for keyword in words_to_ignore):
        for line in lines:
            matches_over_under = re.findall(over_under_pattern, line)
            matches_unit = re.findall(unit_pattern, line)
            if matches_over_under or matches_unit:
                for abbreviation, full_form in abbreviations.items():
                    line = line.replace(abbreviation, full_form)
                modified_line = re.sub(over_under_pattern, replace_over_under, line)
                betting_lines = betting_lines + modified_line + "\n"

    prompt_front = "Provide detail for each bet mentioned in the following message:\n"
    promt_end = """
    List all bets in the message with newline to separate each bet. Include the following data for each, keeping the same category names and if it doesn't exist in the bet, write None:

    Sport: Must specify sport/esport
    Teams: List any teams involved in the bet, if mentioned, with values separated by a ','
    Players: Name of the player(s) involved in the bet, if mentioned, with values separated by a ','
    Bet Type: Specify the type of bet (over, under, moneyline, spread, parlay, handicap, live)
    Pts/Reb/Ast: Specify if the bet involves points, rebounds, or assists (points, rebounds, assists) with values separated by a '+'
    Total Points: The total points involved in the bet, if mentioned
    Point Spread: The spread needed for the bet, if mentioned
    Units Bet: The number of units bet on this particular bet, don't include u or units
    Bookmaker: Name of the bookmaker associated with the bet
    Odds: The odds associated with the bet
    """

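    # No betting lines were found (or the message was a recap), so skip the API call
    if betting_lines == "":
        return None

    # gpt-3.5-turbo-instruct is a completions model, so it takes a plain prompt
    # through the completions endpoint rather than chat.completions
    response = await client.completions.create(
        model="gpt-3.5-turbo-instruct",
        prompt=prompt_front + betting_lines + promt_end,
        temperature=0.5,
        max_tokens=500
    )

    betting_info = response.choices[0].text

    return betting_info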

 

Part 2 - Storing Bets and Returning W/L of Each Bet