SQL Cheat Sheet #2: Intro to Data Science

a year ago   •   2 min read

By Kevin Zhang

If you are seeking a career in data science or analytics, then you’ve undoubtedly already begun to familiarize yourself with SQL (structured query language) and its many functions. Whether or not you are well versed in SQL at this point, it is important that your skillset be expanded and strengthened.

Understanding the dataset

Before you can analyze or manipulate the data, you first need to understand what it is. Counting how many tweets Elon Musk has posted in the last 24 hours, or generating a list of popular websites in Italy, requires some understanding and knowledge of the dataset. Here are a few go-to queries:

Counting rows and items

One of the most frequently used aggregate functions in SQL is COUNT(). COUNT() takes one argument, i.e., the column name whose values need to be counted.

select count(*) from myTable;

Sum

SUM() function is used to calculate the sum of a numerical column.

select sum(myRow) as total_sum from myTable;

Calculate Average

To calculate the average, we use AVG().

select avg(myRow) as total_average from myTable;

Finding distincts

The DISTINCT keyword in SQL is used to remove duplicate data and keep the first occurrence of each value.

select disctinct myRow from myTable;

Understanding highs and lows

Next, it is helpful to understand the various values in your dataset. Are they similar? To they differ? Do they make sense at all? Here are a few queries to help you get started:

Calculate standard deviation

The standard deviation can give you a better idea of the variation of the values. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.

select stddev(myRow) as std_dev from myTable;

Extreme value identification

Finding the min and max will further help you understand how much the data varies, both for given rows, but also for the data overall.

select max(myRow) as max_spend from myTable;

select min(myRow) as min_spend from myTable;

Sorting and limiting

Lastly, to help us put our data into perspective, we can sort and limit. Here is how you can do so in SQL:

Limiting data

This query selects the first five rows in our table. LIMIT is  used when dealing with large amounts of data and when you'd like to see what the intial values look like.

select * from myTable limit 10;

Offset data

If you just want to select only the fourth and fifth rows, we can use the OFFSET clause to skip 2 rows.

select * from myTable limit 10 offset 2;

Order by

We can also sort our data into ascending and descending order by using the keyword order by. In addition, we can use asc and desc to accurately sort.

select * from myTable order by myRow;

// We can also do it in ascending and descending order
select * from myTable order by myRow desc;
select * from myTable order by myRow asc;

I hope this blogpost is a helpful first step to help you work with your data in SQL. In Arctype, you can make sure to save your queries if you find yourself running them often. Simply create a new query and make sure to save it (see below)!

Spread the word