Use Advanced MySQL Operations to Analyze Python Web Scraper Data

In this tutorial, we will be building a web scraper in Python to aggregate data from the top five soccer leagues in the world. Then, we'll use advanced MySQL operations such as dynamic SQL, user variables, and CTEs to analyze the data and create visualizations.

2 months ago   •   14 min read

By Ochuko Onojakpor
Table of contents

Prerequisites

To follow this tutorial, you should have the following:

  • Python 3.7 or newer.
  • Arctype
  • Basic understanding of SQL.
  • A text editor.

Installing the required libraries

The libraries required for this tutorial are as follows:

  • numpy — fundamental package for scientific computing with Python
  • pandas — library providing high-performance, easy-to-use data structures, and data analysis tools
  • requests — is the only Non-GMO HTTP library for Python, safe for human consumption. (love this line from official docs :D)
  • BeautifulSoup — a Python library for pulling data out of HTML and XML files.

To install the libraries required for this tutorial, run the following commands below:

pip install numpy
pip install pandas
pip install requests
pip install bs4

Building the Python Web Scraper

Now that we have all the required libraries installed let’s get to building our web scraper.

Importing the Python libraries

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import json

Carrying Out Site Research

The first step in any web scraping project is researching the web page you want to scrape and learn how it works. That is critical to finding where to get the data from the site.  For this tutorial, we'll be using http://understat.com.

understat website

We can see on the home page that the site has data for six European leagues. However, we will be extracting data for just the top 5 leagues(teams excluding RFPL).

We can also notice that data on the site starts from 2014/2015 to 2020/2021. Let’s create variables to handle only the data we require.

# create urls for all seasons of all leagues
base_url = 'https://understat.com/league'
leagues = ['La_liga', 'EPL', 'Bundesliga', 'Serie_A', 'Ligue_1']
seasons = ['2016', '2017', '2018', '2019', '2020']

The next step is to figure out where the data on the web page is stored. To do so, open Developer Tools in Chrome, navigate to the Network tab, locate the data file (in this example, 2018), and select the “Response” tab. After executing requests, this is what we'll get.

soccer data file location
Using Developer Tools to determine where the data is stored

After looking through the web page's content, we discovered that the data is saved beneath the "script" element in the teamsData variable and is JSON encoded. As a result, we'll need to track down this tag, extract JSON from it, and convert it to a Python-readable data structure.

soccer data json encoding
Parsing JSON-encoded data

Decoding the JSON Data with Python

season_data = dict()
for season in seasons:
url = base_url+'/'+league+'/'+season
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
# Based on the structure of the webpage, I found that data is in the JSON variable, under <script> tags
scripts = soup.find_all('script')
string_with_json_obj = ''
# Find data for teams
for el in scripts:
if 'teamsData' in str(el):
string_with_json_obj = str(el).strip()
# strip unnecessary symbols and get only JSON data
ind_start = string_with_json_obj.index("('")+2
ind_end = string_with_json_obj.index("')")
json_data = string_with_json_obj[ind_start:ind_end]
json_data = json_data.encode('utf8').decode('unicode_escape')
#print(json_data)

After running the python code above, you should get a bunch of data that we’ve cleaned up.

Understanding the Scraper Data

When we start looking at the data, we realize it's a dictionary of dictionaries with three keys: id, title, and history. Ids are also used as keys in the dictionary's initial layer.

The Collaborative SQL Editor

Therefore, we can deduce that history has information on every match a team has played in its own league (League Cup or Champions League games are not included).

After reviewing the first layer dictionary, we can begin to compile a list of team names.

# Get teams and their relevant ids and put them into separate dictionary
teams = {}
for id in data.keys():
teams[id] = data[id]['title']

We see that column names frequently appear; therefore, we put them in a separate list. Also, look at how the sample values appear.

columns = []
# Check the sample of values per each column
values = []
for id in data.keys():
columns = list(data[id]['history'][0].keys())
values = list(data[id]['history'][0].values())
break

Now let’s get data for all teams. Uncomment the print statement in the code below to print the data to your console.

# Getting data for all teams
dataframes = {}
for id, team in teams.items():
teams_data = []
for row in data[id]['history']:
teams_data.append(list(row.values()))
df = pd.DataFrame(teams_data, columns=columns)
dataframes[team] = df
# print('Added data for {}.'.format(team))

After you have completed this code, we will have a dictionary of DataFrames with the key being the team's name and the value being the DataFrame containing all of the team's games.

Manipulating the Data Table

When we look at the DataFrame content, we can see that metrics like PPDA and OPPDA (ppda and ppda allowed) are represented as total sums of attacking/defensive actions.

However, they are shown as coefficients in the original table. Let's clean that up.

for team, df in dataframes.items():
dataframes[team]['ppda_coef'] = dataframes[team]['ppda'].apply(lambda x: x['att']/x['def'] if x['def'] != 0 else 0)
dataframes[team]['oppda_coef'] = dataframes[team]['ppda_allowed'].apply(lambda x: x['att']/x['def'] if x['def'] != 0 else 0)

We now have all of our numbers, but for every game. The totals for the team are what we require. Let's look at the columns we need to add up. To do so, we returned to the original table on the website and discovered that all measures should be added together, with only PPDA and OPPDA remaining as means in the end. First, let’s define the columns we need to sum and mean.

cols_to_sum = ['xG', 'xGA', 'npxG', 'npxGA', 'deep', 'deep_allowed', 'scored', 'missed', 'xpts', 'wins', 'draws', 'loses', 'pts', 'npxGD']
cols_to_mean = ['ppda_coef', 'oppda_coef']

Finally, let’s calculate the totals and means.

for team, df in dataframes.items():
sum_data = pd.DataFrame(df[cols_to_sum].sum()).transpose()
mean_data = pd.DataFrame(df[cols_to_mean].mean()).transpose()
final_df = sum_data.join(mean_data)
final_df['team'] = team
final_df['matches'] = len(df)
frames.append(final_df)
full_stat = pd.concat(frames)
full_stat = full_stat[['team', 'matches', 'wins', 'draws', 'loses', 'scored', 'missed', 'pts', 'xG', 'npxG', 'xGA', 'npxGA', 'npxGD', 'ppda_coef', 'oppda_coef', 'deep', 'deep_allowed', 'xpts']]
full_stat.sort_values('pts', ascending=False, inplace=True)
full_stat.reset_index(inplace=True, drop=True)
full_stat['position'] = range(1,len(full_stat)+1)
full_stat['xG_diff'] = full_stat['xG'] - full_stat['scored']
full_stat['xGA_diff'] = full_stat['xGA'] - full_stat['missed']
full_stat['xpts_diff'] = full_stat['xpts'] - full_stat['pts']
cols_to_int = ['wins', 'draws', 'loses', 'scored', 'missed', 'pts', 'deep', 'deep_allowed']
full_stat[cols_to_int] = full_stat[cols_to_int].astype(int)

In the code above, we reordered columns for better readability, sorted rows based on points, reset the index, and added column ‘position’.

We also added the differences between the expected metrics and real metrics.

Lastly, we converted the floats to integers where appropriate.

Beautifying the Final Output of the Dataframe

beautified data frame
We want our data table to look like the original

Finally, let’s beautify our data to become similar to the site data in the image above. To do this, run the python code below.

python
col_order = ['position', 'team', 'matches', 'wins', 'draws', 'loses', 'scored', 'missed', 'pts', 'xG', 'xG_diff', 'npxG', 'xGA', 'xGA_diff', 'npxGA', 'npxGD', 'ppda_coef', 'oppda_coef', 'deep', 'deep_allowed', 'xpts', 'xpts_diff']
full_stat = full_stat[col_order]
full_stat = full_stat.set_index('position')
# print(full_stat.head(20))

To print a part of the beautified data, uncomment the print statement in the code above.

Compiling the Final Python Data Aggregator Code

To get all the data, we need to loop through all the leagues and seasons then manipulate it to be exportable as a CSV file.

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import json
# create urls for all seasons of all leagues
base_url = 'https://understat.com/league'
leagues = ['La_liga', 'EPL', 'Bundesliga', 'Serie_A', 'Ligue_1']
seasons = ['2016', '2017', '2018', '2019', '2020']
full_data = dict()
for league in leagues:
season_data = dict()
for season in seasons:
url = base_url+'/'+league+'/'+season
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
# Based on the structure of the webpage, I found that data is in the JSON variable, under <script> tags
scripts = soup.find_all('script')
string_with_json_obj = ''
# Find data for teams
for el in scripts:
if 'teamsData' in str(el):
string_with_json_obj = str(el).strip()
# print(string_with_json_obj)
# strip unnecessary symbols and get only JSON data
ind_start = string_with_json_obj.index("('")+2
ind_end = string_with_json_obj.index("')")
json_data = string_with_json_obj[ind_start:ind_end]
json_data = json_data.encode('utf8').decode('unicode_escape')
# convert JSON data into Python dictionary
data = json.loads(json_data)
# Get teams and their relevant ids and put them into separate dictionary
teams = {}
for id in data.keys():
teams[id] = data[id]['title']
# EDA to get a feeling of how the JSON is structured
# Column names are all the same, so we just use first element
columns = []
# Check the sample of values per each column
values = []
for id in data.keys():
columns = list(data[id]['history'][0].keys())
values = list(data[id]['history'][0].values())
break
# Getting data for all teams
dataframes = {}
for id, team in teams.items():
teams_data = []
for row in data[id]['history']:
teams_data.append(list(row.values()))
df = pd.DataFrame(teams_data, columns=columns)
dataframes[team] = df
# print('Added data for {}.'.format(team))
for team, df in dataframes.items():
dataframes[team]['ppda_coef'] = dataframes[team]['ppda'].apply(lambda x: x['att']/x['def'] if x['def'] != 0 else 0)
dataframes[team]['oppda_coef'] = dataframes[team]['ppda_allowed'].apply(lambda x: x['att']/x['def'] if x['def'] != 0 else 0)
cols_to_sum = ['xG', 'xGA', 'npxG', 'npxGA', 'deep', 'deep_allowed', 'scored', 'missed', 'xpts', 'wins', 'draws', 'loses', 'pts', 'npxGD']
cols_to_mean = ['ppda_coef', 'oppda_coef']
frames = []
for team, df in dataframes.items():
sum_data = pd.DataFrame(df[cols_to_sum].sum()).transpose()
mean_data = pd.DataFrame(df[cols_to_mean].mean()).transpose()
final_df = sum_data.join(mean_data)
final_df['team'] = team
final_df['matches'] = len(df)
frames.append(final_df)
full_stat = pd.concat(frames)
full_stat = full_stat[['team', 'matches', 'wins', 'draws', 'loses', 'scored', 'missed', 'pts', 'xG', 'npxG', 'xGA', 'npxGA', 'npxGD', 'ppda_coef', 'oppda_coef', 'deep', 'deep_allowed', 'xpts']]
full_stat.sort_values('pts', ascending=False, inplace=True)
full_stat.reset_index(inplace=True, drop=True)
full_stat['position'] = range(1,len(full_stat)+1)
full_stat['xG_diff'] = full_stat['xG'] - full_stat['scored']
full_stat['xGA_diff'] = full_stat['xGA'] - full_stat['missed']
full_stat['xpts_diff'] = full_stat['xpts'] - full_stat['pts']
cols_to_int = ['wins', 'draws', 'loses', 'scored', 'missed', 'pts', 'deep', 'deep_allowed']
full_stat[cols_to_int] = full_stat[cols_to_int].astype(int)
col_order = ['position', 'team', 'matches', 'wins', 'draws', 'loses', 'scored', 'missed', 'pts', 'xG', 'xG_diff', 'npxG', 'xGA', 'xGA_diff', 'npxGA', 'npxGD', 'ppda_coef', 'oppda_coef', 'deep', 'deep_allowed', 'xpts', 'xpts_diff']
full_stat = full_stat[col_order]
full_stat = full_stat.set_index('position')
# print(full_stat.head(20))
season_data[season] = full_stat
df_season = pd.concat(season_data)
full_data[league] = df_season

To analyze our data in Arctype, we need to export the data to a CSV file. To do this, copy and paste the code below.

python
data = pd.concat(full_data)
data.to_csv('understat.com.csv')

Analyzing Scraper Data with MySQL

Now that we have a clean CSV file containing our soccer data, let's create some visualizations.  First, we'll need to import the CSV file into a MySQL table.

Importing CSV Data into MySQL

To use the data we extracted, we need to import the CSV data as a table in our database. To do this, follow the steps below:

Step 1
In the database menu, click on the three-dotted icon and select “Import Table”. Click on “accept” to accept the schema.

Step 2
Enter table name as “soccer_data”, then rename the first two columns to “league” and “year”. Leave all other settings and click the “Import CSV” button, as seen in the image below.

arctype table create table from imported csv
Arctype makes importing CSV data fast and simple

After following the steps above, the “soccer_data” table should be populated with data from the CSV file, as seen in the image below.

arctype table view
Our soccer data is now in a MySQL database!

Now that we have imported our data stored in a CSV file, we can compare various data and visualize them on data charts.

Use Dynamic SQL to Create a Pivot Table and Bar Chart

We will be analyzing scored and missed shots data for one league across all years in order to calculate each team's shots per goal ratio. The perfect league to run this analysis on is the “Bundesliga” as they are a league known for taking many outside-the-box shots.  

Creating a Shots-Per-Goal Pivot Table

For this visualization, we're going to need our results in a pivot-style table with a unique column for each season in the dataset.  This is the basic logic of our query:

SELECT
	year,
	SUM(
		CASE 
			WHEN year = '2020' THEN (scored + missed) / scored 
			ELSE NULL
		END
	) AS `2020 Season`
FROM 
	soccer_data
WHERE
	league = 'Bundesliga'
GROUP BY
	team
ORDER BY
	team;

This way, the shots-per-goal ratio of each team in the 2020 season is outputted in a column called 2020 Season.  But what if we want five separate columns doing the same thing for five seasons?  Of course, we can define each one manually, or we can use GROUP CONCAT() and user variables to do this dynamically.  The only dynamic component of our query is the season columns in our SELECT statement, so let's start by SELECTing this query string into a variable (@sql).  

SELECT
  GROUP_CONCAT(
    DISTINCT CONCAT(
      'SUM(case when year = ''',
      year,
      ''' then (scored + missed) / scored ELSE NULL END) AS `',
      year,
      ' Season`'
    )
    ORDER BY
      year ASC
  ) INTO @sql
FROM
  soccer_data;

Here, DISTINCT CONCAT() is generating a SUM(CASE WHEN year=...)  column definition for each distinct value in the year column of our table. If you want to see the exact output, simply add SELECT @sql; on a new line and execute the query.

The fast and easy-to-use SQL client for developers and teams

Now that we have the dynamic portion of our query string, we just need to add in everything else around it like this:

SET
  @sql = CONCAT(
    'WITH pivot_data AS (SELECT team, ',
    @sql,
    'FROM understat_com 
	WHERE league=''Bundesliga'' 
	GROUP BY team 
	ORDER BY team) 
	SELECT * 
	FROM pivot_data 
	WHERE `2019 Season` IS NOT NULL 
	AND `2020 Season` IS NOT NULL;'
  );

Finally, we just need to prepare an SQL statement from the string in @sql and execute it:

PREPARE stmt FROM @sql;
EXECUTE stmt;

Run, rename and save the entire query above.

Visualizing The Goal-per-Shot Ratio with a Bar Chart

An excellent way to visualize the query above is using a bar chart. Use the team column for the x-axis and each of our Season columns for the y-axis.  This should yield a bar chart like the one below:

arctype complex barchart visualization
This bar chart displays every team's goal-per-shot ratio for the five seasons between 2016 and 2020

Create a 'Top 5 vs. The Rest' Pie Chart Using CTEs

We will be analyzing data for one league from different years in order to compare the total wins of the top five teams against the rest of the league. The perfect league to run this analysis on is the “Serie A” as they record many wins.

How to Separate the Top 5 Teams using a WITH Statement

For this visualization, we essentially want our query results to look like this:

team wins
Team 1 100
Team 2 90
Team 3 80
Team 4 70
Team 5 60
Others 120

With this in mind, we'll focus first on the rows for the top five teams:  

WITH top5 AS(
	SELECT 
		team,
		SUM(wins) as wins 
	FROM 
		soccer_data
	WHERE 
		league='Serie_A'
	GROUP BY 
		1
	ORDER BY 
		2 DESC
	LIMIT 5
)
SELECT * FROM top5

Here, we're using a WITH clause to create a Common Table Expression (CTE) called top5 with the team name and total wins for the top 5 teams.  Then, we're selecting everything in top5.

Now that we have the top five teams, let's use UNION to add the rest:

UNION
SELECT
	'Other' as team,
	SUM(wins) as wins
FROM
	soccer_data
WHERE
	league='Serie_A'
	AND team NOT IN (SELECT team FROM top5)

Run, rename and save the entire query above.

Visualizing Top 5 vs. The Rest with a Pie Chart

An excellent way to visualize the query above is using a pie chart. Use the team column for 'category' and wins for 'values'.  After adding the columns, we should have a pie chart like the one in the image below.  As you can see, the top five teams comprise close to 50% of wins in the Serie A league:

top five vs the rest pie chart arctype visualization
There are 28 teams in Serie A. Five teams account for nearly 50% of wins.

Comparing Win-Loss Ratios with CTEs and Dynamic SQL

For this query, we'll be using CTEs and dynamic SQL to compare the win-loss ratios of top three teams in Serie A to all other teams. We'll want our result set to look something like this:

year team 1 team 2 team 3 others
2016 value value value value
2017 value value value value
2018 value value value value
2019 value value value value
2020 value value value value

The fundamental query logic should look something like this:

SELECT 
	year, 
	MAX(CASE 
		WHEN team = 'team1' THEN wins / losses
		ELSE NULL
	END) AS `team 1`
	AVG(CASE
		WHEN team NOT in ('team1','team2','team3') THEN wins / losses
		ELSE NULL
	END) AS `other`
FROM 
	soccer_stats
WHERE
	league = 'Serie_A'
GROUP BY 
	team

Of course, this won't quite work without some MySQL magic.

Separating the Top 3 Teams

First things first, let's separate our top three teams using a CTE:

WITH top3 AS(
  SELECT
    team,
    AVG(wins / loses) as wins_to_losses
  FROM
    soccer_data
  WHERE
    league = 'Serie_A'
  GROUP BY
    team
  ORDER BY
    2 DESC
  LIMIT
    3
)

Generating Dynamic SQL Strings inside a CTE

Because each of these teams will need its own column, we'll need to use dynamic SQL to generate some special CASE statements.  We'll also need to generate a CASE statement for our 'Other' column.  For this, we'll use dynamic SQL inside a CTE:

variable_definitions AS(
	SELECT
	(
		GROUP_CONCAT(
			CONCAT(
				'''', 
				team, 
				''''
			)
		)
	) as team_names,
    (
		GROUP_CONCAT(
			DISTINCT CONCAT(
			'MAX(case when team = ''',
			team,
			''' then wins / loses ELSE NULL END) AS `',
			team,
			'`'
			)
		)
	) as column_definitions
	FROM top3
)

Next, let's take the team_names and column_definitions strings and INSERT them into variables:

SELECT
  team_names,
  column_definitions 
INTO 
  @teams,
  @sql
FROM
  variable_definitions;

At this point, we should have a list of the top three teams in string format saved to @teams and our column case statements for the top three teams saved to @sql.  We just have to build the final query:

SET
  @sql = CONCAT(
    'SELECT year, ',
    @sql,
    ', AVG(CASE WHEN team NOT IN (',
    @teams,
    ') THEN wins / loses ELSE NULL END) AS `Others` ',
    'FROM soccer_data WHERE league = ''Serie_A'' GROUP BY year;'
  );
 prepare stmt FROM @sql;
 EXECUTE stmt;

You can find the query in full at the bottom of this article.


Visualizing Win-Loss-Ratio with an Area Chart

An excellent way to visualize the query above is using an area chart. To create this area chart, use year for the x-axis and all other columns for the y-axis.  Your chart should look something like this:

arctype area chart wins to losses
The final result

Because we're using dynamic SQL, we can easily add more team columns by changing LIMIT 3 in the top3 CTE to LIMIT 5:

area chart top five teams arctype

Conclusion

In this article, you learned how to extract sports data with python from a website and use advanced MySQL operations to analyze and visualize it with Arctype. In addition, you saw how easy it is to run SQL queries on your database using Arctype and got the chance to explore some of its core features and functionalities.

The source code of the python script, the CSV file, and other data are available on Github. If you have any questions, don't hesitate to contact me on Twitter: @LordChuks3.

Final SQL Query:

WITH top3 AS(
  SELECT
    team,
    AVG(wins / loses) as wins_to_losses
  FROM
    soccer_data
  WHERE
    league = 'Serie_A'
  GROUP BY
    team
  ORDER BY
    2 DESC
  LIMIT
    3
), variable_definitions AS(
  SELECT
    (GROUP_CONCAT(CONCAT('''', team, ''''))) as team_names,
    (
      GROUP_CONCAT(
        DISTINCT CONCAT(
          'MAX(case when team = ''',
          team,
          ''' then wins / loses ELSE NULL END) AS `',
          team,
          '`'
        )
      )
    ) as column_definitions
  FROM
    top3
)
SELECT
  team_names,
  column_definitions 
INTO 
  @teams,
  @sql
FROM
  variable_definitions;
SET
  @sql = CONCAT(
    'SELECT year, ',
    @sql,
    ', AVG(CASE WHEN team NOT IN (',
    @teams,
    ') THEN wins / loses ELSE NULL END) AS `Others` ',
    'FROM soccer_data WHERE league = ''Serie_A'' GROUP BY year;'
  );
 prepare stmt FROM @sql;
 EXECUTE stmt;
JOIN the Arctype Newsletter
Programming stories, tutorials, and database tips every 2 weeks

Spread the word

Keep reading