Parsing is the process of refining the structure of your data to adhere to a specific data storage format. Turning your data into a particular form is usually necessary to run SQL queries inside your database. For example, some SELECT queries require that data selected from two or more tables have the same amount of columns and that the data types of those columns would also be the same.
Parsing data can lead to mistakes that cost engineers time and resources. Moreover, the bigger your data set is, the more difficult it is to parse and understand. This post will introduce simple SQL operations - and their pros and cons - that make transforming your raw data simple and fast.
How do you parse data?
Generally, the process of parsing data goes something along these lines:
- Before you parse your data, you must have it at hand in the first place. This step is where engineers acquire data needed for parsing, typically from a raw CSV, JSON file, database dump, .txt file, or many other sources.
- Once engineers have decided what kind of data they would need to parse and work with, they generally import it into certain database instances.
- Once the data reaches a database, the real work begins. An engineer parsing the data might use CRUD (Create, Read, Update, and Delete) operations, indexes, and partitions.
To accomplish these steps there are just a few SQL keywords to understand. One gotcha is that these SQL commands are destructive: they can change the schema and data. Therefore engineers should watch for the privileges inside their database instances. Here are some theoretical use cases:
ALTER TABLEcommands are helpful if we want to add (or remove) indexes to or from our database.
DELETEwill be beneficial if we want to update a bunch of records and we can't think of a reliable way to do that, or if we want to delete a bunch of records existing in a table once we no longer need them.
DROP TABLEis helpful to save space on the disk (especially if the table comes with many indexes or partitions). Also, we may drop our tables if we want to perform swapping operations (swapping operations might be handy if we have a lot of data).
Now that you have the order of operations and some commands, here is a real scenario:
- We have data about an orchard. Say it is formatted with columns named
- We decided that we would only need to use the owner's name and his garden size and that all other columns are unnecessary.
- We remove the columns and rows we don't need by running
DELETEqueries and go about our day.
Easy, right? However, avoid jumping to any conclusions so quickly. Instead, we will look into what can happen once we parse big data sets.
Advanced considerations for parsing
As a database engineer or developer with very large amounts of data to parse there are subtleties to the operations above to keep in mind.
The more data you have, the more complex parsing will become, and ordinary queries like
ALTER TABLE might not be an ideal choice anymore: in that case, you’re better off dropping the table, adding or removing the necessary columns, and loading data into the new table instead. That is actually a simple operation.
CREATE TABLE demo AS demo2
The above will work if you want to create a table called
demo with the structure of
INSERT INTO demo (column) SELECT column FROM demo2
will also work if you want to insert data into a single column.
LOAD DATA INFILE
If you’re dealing with bigger sets of data,
LOAD DATA INFILE is your friend – one can use
LOAD DATA INFILE to load data only into specific columns of a database by specifying the columns they want to load data into at the end of the statement.
LOAD DATA INFILE ‘file.txt’ INTO TABLE demo_table (demo_column);
Bear in mind that when parsing is concerned,
LOAD DATA INFILE comes with its own features including giving you the ability to ignore certain columns by writing
IGNORE which may be useful in scenarios where you have a lot of data, but do not want to keep it all: export your data into a file, then create a table without the column in question, and use a query like this:
LOAD DATA INFILE ‘file.txt’ IGNORE INTO TABLE demo_table (demo_column) FIELDS TERMINATED BY ‘:’;
To insert the first column of data before the “:” sign into a column called
demo_column. That way the parsing will be way faster compared to running other similar queries.
Parsing data relies on the settings allocated to your database – double-check the settings you have set inside of my.cnf (or if you use other database management systems, other files), use functions available in Linux to see how many resources you have, and make sure to properly use all of the resources you can to make your DBMS perform at the best of its ability. The better your database will be optimized, the faster parsing operations will become.
Keep in mind that there are certain standards that database administrators usually use to parse data including the format that the data is formatted in. The majority of parsed data sets will parse the data using the “\t” (tab) delimiter, others will parse data using “:”, and some other people like to use “,” to make the data suitable for Excel. There also are other characters that are used, but these are not as frequent.
See where we’re going with this? The use of basic functions is enough to enable you to quickly copy and paste stuff over into a SQL script. However, don’t think we’re finished – parsing also comes with problems unique to itself.
Common parsing problems
Parsing can also be problematic once more extensive data sets enter the picture. One of the scenarios where the parsing of data can become a bit inconvenient is when we need to change the structure of a table. In MySQL, for example, designs of a table are usually changed by either removing, moving, or renaming a column in question. In MySQL, you can remove data by running
DELETE queries, while you can add columns or drop them using the
ALTER TABLE statement.
Both of these statements can become problematic in their own ways. For example,
DELETE queries might cause issues when we have indexes or partitions set up (both indexes and partitions speed up
SELECT queries at the expense of slowing down
DELETEs). In contrast,
ALTER queries might become problematic because when we involve more significant amounts of data,
ALTER TABLE queries usually copy the data on the disk before doing any actual "work" on the information itself.
For example, here's a process in which
ALTER TABLE might work against us:
- We have a table with a structure relevant to the data.
- We find that we might not necessarily need a couple of columns, so we elect to remove them altogether.
- We run
ALTER TABLEqueries to change the structure of a table (rename, delete, move columns, etc.)
We might need to allocate a little more time towards the third step of the procedure because of what is happening behind the scenes. When you run an
ALTER TABLE query, MySQL usually goes through a couple of stages to complete this process:
- MySQL creates a new table with our new desired schema.
- MySQL copies all the rows from our old table to the new table it just made.
- MySQL swaps the old table with the new table and removes the old table altogether.
Not surprisingly, some parts of this process can take a long time to execute, especially if we're dealing with more extensive data sets. So how do we overcome this, you ask? There is one way:
- We need to create a new table with the structure we desire to employ (so that we avoid running
ALTER TABLEqueries and making a copy of data on the disk in the first place.)
- We insert the data into our table.
- We insert data from our old table to the new table we just created and then drop the old table. Simple as that!
If we perform the steps above, MySQL will no longer need to create a temporary table. Problem solved!
When we are dealing with indexes or partitions, on the other hand, we must be aware that both of them slow down
DELETE queries. In that case, it may be wise only to delete data from a specific partition rather than deleting it from the entire table at once.
Tools you can use to parse and analyze data
Database GUI tools often allow you to upload CSVs to your database directly and use a GUI selector to filter out which columns you want.
There are also handy data importing tools like Flatfile or ETL tools like Fivetran.
In this post, we have focused on simple SQL operations (with some surprising side effects) that you can use to quickly load your data so you can get on with writing queries and building applications.