Have you ever been sent a file and asked to find important information buried within it? Your coworkers would be very impressed if you could query the files in a quick and efficient manner. But… how exactly are you going to achieve such a feat?
As you probably know, SQL allows you to modify database data quickly and easily. When trying to work with data files, developers usually load data into a database and manage it via SQL. In a perfect world, you could just query the database to get the information your company needs. But in real life, there’s a catch: data loading is often not straightforward.
It would be extremely convenient if SQL queries could be run directly on files, skipping the database setup step. It turns out, other people have noticed this fact and have built tools to query your files directly with SQL. In this article, we'll look at a few tools and list some different SQL file query scenarios that they can handle. We'll also compare the tools and share some general considerations. Let’s begin.
TextQL might be a suitable fit for you if you're looking for a simple tool to query .csv or .tsv files. TextqQL allows you to execute SQL on structured text with ease. It also allows you to use quote-escaped delimiters. When running a query, TextQL can automatically detect numeric and datetime data in appropriate formats, which helps you make your work easier. TextQL lets you list as many files and folders as you want. You can load the files or directories you’re working with by listing them at the end of the command.
Let's look at a few ways to install TextQL. One of the simplest methods you can use to get started is to use Homebrew. Just run the command below:
If you're running Docker, you can also install TextQL with a single short command:
Lastly, if you're using Linux, you may be familiar with the AUR tool. AUR installation is also straightforward:
Now that we've gotten TextQL set up and working, let's see what it can do.
We're going to examine an example to understand how TextQL works. Assume we have the following data.
Now, excluding the header, let's count the number of data items we have.
You should see an output of "3," which is exactly what we'd expect. We can also use TextQL to find the maximum value of this data. We can do that by running the command below:
Again, we see the expected value of 11. Very nice!
You might consider using q to query your files if you are working with .csv or .tsv files and need something that works faster than TextQL. q aims to bring SQL’s expressive capability to the Linux command line by offering direct access to multi-file SQLite3 databases and simple access to text as actual data.
q uses the SQLite engine. It allows you to run SQL-like statements directly on tabular text data, with the data being auto-cached to speed up subsequent queries on the same file. Using q, SQL statements can be run directly on multi-file SQLite3 databases without requiring them to be merged or loaded into memory.
Like TextQL, it's easy to install q via homebrew. Simply run the command below:
q also has a standalone executable, which you can download from this link. If you want to use the Windows installer, you can run the executable and follow the prompts displayed on the screen. Full installation instructions, as well as releases for each platform, can be found here.
You can also install q as an .rpm package. You can download the package from this link, then install it via the following commands:
To get an idea of how q works, let's run a query on a file where columns are named a1, a2, ... aN.
Now, we'll COUNT DISTINCT values in a specific field. In this case, let's try to see how many UUIDs we have with the following command:
The output of this command will be a numeric value, equal to the number of UUIDs in the file.
OctoSQL is primarily a command-line application that allows you to query a variety of databases and filetypes using SQL in a single interface, as well as perform JOINS between them. OctoSQL is a fully expandable, fully-featured dataflow engine that can be used to provide a SQL interface for your applications. It validates and optimizes queries based on database types. It may process massive volumes of data and return partial results before completing the complete query. Out of the box, OctoSQL only supports .csv and JSON files. You'll need to install a plugin to query other file formats, like Excel or Parquet files.
You can run the following command to install OctoSQL with Homebrew.
Installation with Go is also supported, as shown below:
As mentioned, OctoSQL has a robust plugin capability. You can install plugins using commands like the one shown below:
Let's have a look at a simple example of invoice data. Suppose we have .csv of invoice data in the format shown below:
Suppose we want to calculate a sum of the invoice values. In that case, we could use the following command to find the answer:
Similarly, we can find a count of all the entries using this command:
We can also perform more advanced operations like joins. The command below will get us what we want:
OctoSQL may be the ideal fit for you if you require a tool that can work with a variety of file formats. One drawback that few people consider is that OctoSQL uses a custom engine instead of SQLite, which indicates that it may be missing some features.
DSQ is a command-line tool that lets you execute SQL queries. It supports a wide range of file formats, including .csv, JSON, .tsv, Excel, Parquet, and .ods.
To install DSQ on a Mac or Linux operating system, use the following command, visit the releases page and download the release you want. Then simply unzip the download and add DSQ to your $PATH. You can follow the same steps for Windows.
You can also install DSQ in Go using the following command:
DSQ allows you to either stream data or provide a file name to work with. The example below shows how you can use either a .json or .ndjson format, depending on your requirements.
You can also use DSQ to work with data from different origin types. The example below connects a .csv dataset and a .json dataset.
If you need a tool that can handle a wide range of file formats and uses SQLite, this is the tool for you.
Cases for SQL file queries
To see these tools in action, let's look at a few SQL file query examples. We'll be using .txt or .csv files for these examples, since almost all of the tools we discussed above support this file format.
Filtering via SQL
Filtering allows you to view only the information you wish to see. Filters are useful for displaying only the desired records from a form or report. Using a filter, you may limit the data in a view without affecting the architecture of the underlying object.
For example, let's start by selecting records from a text file that contains the information below.
Suppose we want to choose the students in class 1 in the above file. The file's first row provides column names, whereas the rest of the rows have comprehensive information that we can use to query the records we want. So, we could run the following command to get the information we want:
In SQL, you may sort data in ascending or descending order by one or more columns. To try this out, let's sort the table of student scores by class in ascending order and total score in descending order. We can do that with the following command:
You can combine results using aggregation by grouping records depending on their value. Grouping can also be used to calculate the sum of many values in a group. The grouped aggregates give a summary of a set of rows. For example, you can use the command below to find the lowest English score, the highest History score, and the overall Math score.
As you can see, the ability to perform SQL queries directly on files can help us answer important questions from our data.
In conclusion, let's use the chart below to take a quick look at all of the tools we mentioned before.
|OctoQL||.csv, .json, Excel, Parquet||Custom|
|DSQ||.csv, .tsv, .json, Parquet, Excel, logs||SQLite|
Without having to go through a database, you can use the tools above to execute SQL on your files for instant insights. Each of these tools has different pros and cons, so be sure to choose the option that best suits your needs. Happy querying.