Querying GitHub data with SQL: Insights for developers and engineering leaders

In this article, we'll learn about using a tool called MergeStat to query and gain insights from a GitHub repo.

2 years ago   •   6 min read

By Bonnie
You can use Arctype and MergeStat to get key insights about your Repository.
Table of contents

Most people think of GitHub as a code hosting and collaboration platform, but it can also be used as a data source for insights about software development. In this article, we will show you how to use MergeStat and MySQL to query GitHub data and extract valuable insights for developers and engineering leaders. We will start by getting data from a GitHub repo using MergeStat. Then, we will connect to a MySQL server with the MergeStat data using Arctype. Finally, we will explore the data and visualize it using some nifty graphs!

What can we learn by querying GitHub data?

Querying GitHub data with SQL can help us extract useful insights about our project and our team. Some use cases of visualized GitHub data include:

  • Engineering metrics
  • Code quality and pattern monitoring
  • Software supply chain analytics
  • Configuration and "infra-as-code" insights
  • Audit and compliance
  • Developer onboarding
  • Vulnerability monitoring

In this article, I will show you how to get data from a GitHub repository using a tool called MergeStat and an experimental MySQL compatibility layer. MergeStat enables you to perform SQL queries on data in git repositories. Once we have MergeStat set up, we will connect to a MySQL server with a MergeStat backend using Arctype, which will allow us to visualize and explore data. Arctype is a database management and visualization tool that allows easy and effective management of data in your database.

Building and running the MergeStat application

Our first step to getting data from GitHub? Setting up MergeStat and its dependencies. The best way to install MergeStat's dependencies on your system is by building and running a Docker container locally.

Docker is a software platform that allows you to quickly build, test, and deploy applications. Docker provides you with the ability to package and run an application in a loosely isolated environment called a container.

Let us now build a container using Docker so that we can be able to run the MergeStat application on our system. You will need to download or clone MergeStat source code on GitHub to get started.

Open the folder with the source code using the terminal or command line. Then copy and paste the docker build command shown below in the terminal.

docker build . -t mergestat/mysql-server
Building Docker

Note that you need Docker installed in your system to run the Docker commands in your terminal.

Press enter to run the command. Your terminal should look like the one shown below.

Screenshot of terminal.
The result of running the build command.

Let's now run the docker run command so the MergeStat application can start running on our system. MergeStat will enable us to get data from GitHub and use MySQL and Arctype to view it.

Copy and paste the docker run command shown below in the terminal and press enter.

docker run -p 13306:3306 -v ${PWD}:/repo mergestat/mysql-server
Running Docker

MergeStat is now running in your system, as shown below.

Screenshot of terminal
MergeStat is now running.

Getting GitHub repo data and visualizing it

We will use Arctype to view data on a GitHub repo. You can download the Arctype app to connect to this local database. Also, make sure you have a MySQL database installed in your system.

Screenshot of Arctype
Adding a database connection in Arctype

You will need to add your database credentials. In our case, select MySQL. You can fill in the credentials using the following information:

  • Name: enter any name that you want
  • Host: enter 127.0.0.1
  • Port: enter 13306
  • User: enter root
  • Password: enter root
  • Database: enter the URL of the GitHub repo that you want to study

The credentials should look as shown below:

Screenshot of connection credentials
Example connection credentials

After entering the credentials, press the "Test Connection" button at the bottom right of your screen to be sure that everything works. Then save the connection.

Once MySQL and Arctype have successfully connected, you can create a new query by clicking the + button, as highlighted below.

Screenshot of Arctype
Creating a new query in Arctype

Run the query below to see all commits in your GitHub repo. Don’t forget to name your query and save it to access it later.

select * from commits
Selecting all commits from a repo

After running the query, you should have something like shown below.

Screenshot of Arctype
Result of the query, displaying all commits

Let us now create and run a query that produces a chart of unique contributors every month by author_email. Here is what the query will look like.

SELECT
    count(DISTINCT author_email),
    YEAR(author_when) AS y,
    MONTH(author_when) AS m,
    DATE_FORMAT(author_when, '%Y-%m')
FROM commits
WHERE parents < 2
GROUP BY y, m
ORDER BY y, m
A query for displaying a chart of unique contributors per month.

After running the query, you should have something like the result shown below.

Screenshot of Arctype
The result - unique contributors per month (by email) as seen in Arctype

We will now use the table we just created to come up with a chart. On the top right corner of the table, you have "Table" and "Chart" options—select "Chart," as shown below.

Screenshot of Arctype
Select the "Chart" option here.

You can select any chart type on the right side of your screen. In our case, we will choose a bar chart.

Place DATE FORMAT on the X-axis and count, y, and m on the Y-axis. You will then come up with a chart as shown below. Yours will look different and will depend on the data you are working with, of course.

Screenshot of Arctype
Chart of commit authors in Arctype

Author by number of commits per month and year

Let's now create a bar chart that shows how many commits one person has made in a GitHub repository. For this example, we will use the React Codebase repository on GitHub.

First, create a new workspace on Arctype (as explained in the section above) and name it ReactCodeBase. Then, connect to MySQL and enter the credentials below.

Run the query below in Arctype (and don’t forget to save it).

select 
author_name,
count(*) as Total,
YEAR(author_when) AS y,
MONTH(author_when) AS m,
DATE_FORMAT(author_when, '%Y-%m')
from commits 
group by author_name, y, m
Finding the number of commits per month and year by author

You can then come up with a table and chart like the one shown below. Feel free to configure your chart as you please.

Screenshot of Arctype
Results of the query in Arctype
Screenshot of Arctype
The chart derived from the query results

Top 10 contributors

In this example, we will create and run a query that creates a table and a chart of the top 10 contributors to a GitHub repository. We'll still be using the React Codebase repository.

Run the query below in Arctype and save it.

SELECT
author_name,
Count(author_name) AS TotalCommits FROM commits
GROUP BY author_name ORDER BY TotalCommits DESC LIMIT 10;
Finding the top 10 contributors

You can come up with a table and chart that look like the ones shown below.

Screenshot of Arctype
The result of the query, showing the top 10 contributors in Arctype
Screenshot of Arctype
A chart showing the top contributors, also in Arctype

Conclusion

Now that you know how to get data from a GitHub repo and visualize it, you can decide to create and run all kinds of SQL queries to obtain valuable insights about your project. From engineering metrics to fun stats about your top contributors, you'll be able to learn more about your code than you previously thought possible. The sky is the limit!

JOIN the Arctype Newsletter
Programming stories, tutorials, and database tips every 2 weeks
Follow Arctype's Development
Programming stories, tutorials, and database tips every 2 weeks

Spread the word

Keep reading