How to use SQL to directly query files

Did you know it was possible to query files using SQL directly? In this article, we'll take a look at a few tools you can use to quickly and conveniently extract insights from your .csv, .tsv, and .json files... and much more.


By Harshil Patel
In this article, we review four tools for querying files directly with SQL

Have you ever been sent a file and asked to find important information buried within it? Your coworkers would be very impressed if you could query the files in a quick and efficient manner. But… how exactly are you going to achieve such a feat?

As you probably know, SQL allows you to modify database data quickly and easily. When trying to work with data files, developers usually load data into a database and manage it via SQL. In a perfect world, you could just query the database to get the information your company needs. But in real life, there’s a catch: data loading is often not straightforward.
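To see what that catch looks like in practice, here is a minimal sketch of the load-then-query routine using Python's built-in sqlite3 module (the table and data are invented for illustration). This is exactly the boilerplate the tools below let you skip:

```python
import sqlite3

# The "load it into a database first" workflow: before you can run a
# single query, you must create a table, declare its columns, and
# insert every row yourself.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, value REAL)")

# Stand-in for rows parsed out of a data file with csv.reader.
rows = [(1, "Jack", 5.0), (2, "John", 11.0)]
conn.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)

# Only now can you ask the question you actually cared about.
total = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(total)  # 2
```

Multiply that setup by every ad-hoc file a coworker sends you, and the appeal of querying files directly becomes obvious.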

It would be extremely convenient if SQL queries could be run directly on files, skipping the database setup step. It turns out, other people have noticed this fact and have built tools to query your files directly with SQL. In this article, we'll look at a few tools and list some different SQL file query scenarios that they can handle. We'll also compare the tools and share some general considerations. Let’s begin.

TextQL

TextQL might be a suitable fit if you're looking for a simple tool to query .csv or .tsv files. TextQL lets you execute SQL on structured text with ease, and it supports quote-escaped delimiters. When running a query, TextQL automatically detects numeric and datetime values in appropriate formats, which saves you cleanup work. You can also list as many files and folders as you want: just name the files or directories you're working with at the end of the command.

Installation

Let's look at a few ways to install TextQL. One of the simplest methods you can use to get started is to use Homebrew. Just run the command below:

brew install textql
Installing TextQL via Homebrew

If you're running Docker, you can also install TextQL with a single short command:

docker build -t textql .
Installing TextQL via Docker

Lastly, if you're using Arch Linux, you can install TextQL from the AUR (Arch User Repository) with an AUR helper such as yaourt:

yaourt textql-git
Installing TextQL via AUR

Now that we've gotten TextQL set up and working, let's see what it can do.

Example

We're going to examine an example to understand how TextQL works. Assume we have the following data.

cat sample_data.csv
Id,name,value,timestamp
1,Jack,5,1643004723
2,John,11,1643114723
3,James,-3,1645596723
Sample data

Now, excluding the header, let's count the number of data items we have.

textql -header -sql "select count(*) from sample_data" sample_data.csv

#output
3
Counting the number of items

You should see an output of "3," which is exactly what we'd expect. We can also use TextQL to find the maximum value of this data. We can do that by running the command below:

textql -header -sql "select max(value) from sample_data" sample_data.csv

#output
11
Finding the maximum value

Again, we see the expected value of 11. Very nice!

q

You might consider using q to query your files if you are working with .csv or .tsv files and need something that works faster than TextQL. q aims to bring SQL’s expressive capability to the Linux command line by offering direct access to multi-file SQLite3 databases and simple access to text as actual data.

q uses the SQLite engine. It allows you to run SQL-like statements directly on tabular text data, with the data being auto-cached to speed up subsequent queries on the same file. Using q, SQL statements can be run directly on multi-file SQLite3 databases without requiring them to be merged or loaded into memory.

Installation

Like TextQL, q is easy to install via Homebrew. Simply run the command below:

brew install harelba/q/q
Installing q via Homebrew

q also ships standalone executables, including a Windows installer: run the executable and follow the prompts displayed on the screen. Full installation instructions, as well as releases for each platform, can be found on the project's releases page.

You can also install q as an .rpm package. Download the package from the releases page, then install it with either of the following commands:

rpm -ivh <package-filename>
or
rpm -U <package-filename> 
Installing q as an .rpm package

Example

To get an idea of how q works, let's select two columns from a .csv file whose header row names its columns a1, a2, ... aN:

q -d , -H "select a1,a5 from file.csv"
Querying a file to find certain columns

Now, we'll count the DISTINCT values in a specific field. In this case, let's see how many unique UUIDs we have with the following command:

q -d , -H "SELECT COUNT(DISTINCT(uuid)) FROM ./file.csv"
Finding the number of UUIDs

The output of this command is a single number: the count of distinct UUIDs in the file.
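Because q is backed by SQLite, that COUNT(DISTINCT ...) behaves exactly as it would in a plain SQLite session. A quick sketch with Python's built-in sqlite3 module (the table name and UUID values are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE file (uuid TEXT)")
# Note the duplicate "a-1": COUNT(DISTINCT ...) counts each value once.
conn.executemany(
    "INSERT INTO file VALUES (?)",
    [("a-1",), ("a-1",), ("b-2",), ("c-3",)],
)
distinct = conn.execute("SELECT COUNT(DISTINCT uuid) FROM file").fetchone()[0]
print(distinct)  # 3
```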

OctoSQL

OctoSQL is primarily a command-line application that allows you to query a variety of databases and file types using SQL in a single interface, as well as perform JOINs between them. It is also an extensible, fully-featured dataflow engine that can be used to provide a SQL interface for your applications. OctoSQL validates and optimizes queries based on the source types involved, and it can process massive volumes of data, returning partial results before the full query completes. Out of the box, OctoSQL supports only .csv and JSON files; you'll need to install a plugin to query other file formats, such as Excel or Parquet.

Installation

You can run the following command to install OctoSQL with Homebrew.

brew install cube2222/octosql/octosql
Installing OctoSQL via Homebrew

Installation with Go is also supported, as shown below:

go install github.com/cube2222/octosql@latest
Installing OctoSQL via Go

As mentioned, OctoSQL has a robust plugin capability. You can install plugins using commands like the one shown below:

octosql plugin install postgres
Installing a plugin in OctoSQL

Example

Let's have a look at a simple example of invoice data. Suppose we have a .csv of invoice data in the format shown below:

octosql "SELECT * FROM ./invoices2.csv"

cust_id,cust_name,value
121, Jack,599.00
122, John,1100.00
123, James,400.50
Sample invoice data

Suppose we want to calculate a sum of the invoice values. In that case, we could use the following command to find the answer:

#Sum
octosql "SELECT cust_id, SUM(value) AS value_sum FROM ./invoices2.csv GROUP BY cust_id ORDER BY value_sum DESC"
Finding the sum of values

Similarly, we can find a count of all the entries using this command:

#Count
octosql "SELECT COUNT(*) FROM mydb.customer"
Finding the count of customers

We can also perform more advanced operations like joins. The command below will get us what we want:

octosql "SELECT inv_id, value, email FROM ./invoices.csv i JOIN mydb.customers c ON i.customer_id = c.customer_id"
Performing a join with OctoSQL

OctoSQL may be the ideal fit if you require a tool that can work with a variety of file formats. One drawback worth considering is that OctoSQL uses a custom engine instead of SQLite, which means some SQLite features may be missing.

DSQ

DSQ is a command-line tool that lets you execute SQL queries against data files. It supports a wide range of file formats, including .csv, JSON, .tsv, Excel, Parquet, and .ods.

Installation

To install DSQ on macOS or Linux, visit the releases page and download the release you want. Then simply unzip the download and add dsq to your $PATH. You can follow the same steps on Windows.

You can also install DSQ in Go using the following command:

$ go install github.com/multiprocessio/dsq@latest
Installing DSQ via Go

Example

DSQ allows you to either stream data or provide a file name to work with. The example below shows how you can use either a .json or .ndjson format, depending on your requirements.

$ dsq testdata.json "SELECT * FROM {} WHERE x > 10"
#or
$ dsq testdata.ndjson "SELECT name, AVG(time) FROM {} GROUP BY name ORDER BY AVG(time) DESC"
Using DSQ to interact with .json or .ndjson data

You can also use DSQ to work with data from different origin types. The example below connects a .csv dataset and a .json dataset.

$ dsq testdata/join/users.csv testdata/join/ages.json \
      "select {0}.name, {1}.age from {0} join {1} on {0}.id = {1}.id"
Working with data from two different sources (.csv and .json)
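Under the hood, a SQLite-backed tool like DSQ effectively loads each source into its own table and joins them. Doing the same by hand with Python's csv, json, and sqlite3 modules shows how much work the one-liner above replaces (the file contents here are invented for illustration):

```python
import csv
import io
import json
import sqlite3

users_csv = "id,name\n1,Jack\n2,John\n"
ages_json = '[{"id": 1, "age": 31}, {"id": 2, "age": 27}]'

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE ages (id INTEGER, age INTEGER)")

# Load the .csv source into one table and the .json source into another.
for row in csv.DictReader(io.StringIO(users_csv)):
    conn.execute("INSERT INTO users VALUES (?, ?)", (row["id"], row["name"]))
for rec in json.loads(ages_json):
    conn.execute("INSERT INTO ages VALUES (?, ?)", (rec["id"], rec["age"]))

# The join that DSQ expresses with the {0} and {1} placeholders.
result = conn.execute(
    "SELECT users.name, ages.age FROM users JOIN ages "
    "ON users.id = ages.id ORDER BY users.id"
).fetchall()
print(result)  # [('Jack', 31), ('John', 27)]
```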

If you need a tool that can handle a wide range of file formats and runs on SQLite, DSQ may be the tool for you.

Cases for SQL file queries

To see these tools in action, let's look at a few SQL file query examples. We'll be using .txt or .csv files for these examples, since almost all of the tools discussed above support these formats.

Filtering via SQL

Filtering allows you to view only the information you wish to see: only the desired records are displayed. Using a filter, you can limit the data in a view without affecting the underlying data.

For example, let's start by selecting records from a text file that contains the information below.

Class Name English History Math
1 James Kim 78 65 67
1 John White 87 61 79
2 Paige Davis 77 82 94
3 Edwin Henderson 65 78 94

Suppose we want to select the students in class 1 from the above file. The file's first row provides the column names, while the remaining rows hold the records we want to query. We can run the following to get that information:

select * from E:/txt/Students_file.txt where CLASS = 1
Getting the students from class 1

In SQL, you may sort data in ascending or descending order by one or more columns. To try this out, let's sort the table of student scores by class in ascending order and total score in descending order. We can do that with the following command:

select * from E:/txt/Students_file.txt order by CLASS, ENGLISH+HISTORY+MATH desc
Ordering in descending order

You can combine results using aggregation by grouping records on a shared value, for example to calculate the sum of the values within each group. Grouped aggregates give a summary of a set of rows. For instance, you can use the command below to find each class's lowest English score, highest History score, and total Math score.

select CLASS, min(ENGLISH), max(HISTORY), sum(MATH) from E:/txt/Students_file.txt group by CLASS
Grouping by class
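To make the three queries above concrete, here is the same filter, sort, and aggregation run through Python's built-in sqlite3 module with the student records loaded into a table (any of the SQLite-backed tools in this article would return the same rows):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students "
    "(class INTEGER, name TEXT, english INTEGER, history INTEGER, math INTEGER)"
)
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?, ?)",
    [
        (1, "James Kim", 78, 65, 67),
        (1, "John White", 87, 61, 79),
        (2, "Paige Davis", 77, 82, 94),
        (3, "Edwin Henderson", 65, 78, 94),
    ],
)

# Filter: only the students in class 1.
class1 = conn.execute("SELECT name FROM students WHERE class = 1").fetchall()

# Sort: class ascending, then total score descending.
ordered = conn.execute(
    "SELECT name FROM students ORDER BY class, english + history + math DESC"
).fetchall()

# Aggregate: per-class lowest English, highest History, and total Math.
grouped = conn.execute(
    "SELECT class, MIN(english), MAX(history), SUM(math) "
    "FROM students GROUP BY class ORDER BY class"
).fetchall()

print(class1)   # [('James Kim',), ('John White',)]
print(ordered)  # [('John White',), ('James Kim',), ('Paige Davis',), ('Edwin Henderson',)]
print(grouped)  # [(1, 78, 65, 146), (2, 77, 82, 94), (3, 65, 78, 94)]
```

Note that in the sorted result, John White comes before James Kim because his total (227) is higher, even though both are in class 1.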

As you can see, the ability to perform SQL queries directly on files can help us answer important questions from our data.

Conclusion

In conclusion, let's use the chart below to take a quick look at all of the tools we mentioned before.

Name     Supported Files                            Engine
q        .csv, .tsv                                 SQLite
TextQL   .csv, .tsv                                 SQLite
OctoSQL  .csv, .json (Excel, Parquet via plugins)   Custom
DSQ      .csv, .tsv, .json, Parquet, Excel, logs    SQLite

Without having to go through a database, you can use the tools above to execute SQL on your files for instant insights. Each of these tools has different pros and cons, so be sure to choose the option that best suits your needs. Happy querying.
