Our Journey with IndexedDB

By upgrading to IndexedDB as the local data store, developers can export unlimited data from Arctype, get faster results from queries, and see much less memory consumption from the app. This is the story of our implementation

20 days ago   •   10 min read

By Everett Berry

Memory usage is an important metric for Arctype. Developers expect their tools - especially UIs - to be fast and smooth. Recently we improved our memory management algorithms using IndexedDB, Socket.IO, and Typescript. This post takes you inside our reason for the changes, design considerations, implementation, and impact.

Prior to this work, querying a database from Arctype was simple. We would grab every row up to 10,000 and load them into memory. We would display a fraction of rows in the results pane, the spreadsheet view, or a chart. When you needed more rows, we would throw out the first 10,000 and fetch another batch.

Most individual queries never returned that many rows. But significant memory usage could occur across queries in multiple tabs or on dashboards with multiple charts. For imports, we previously had a limit of 100MB. For exports, we only outputted the first 10,000 lines. This meant you could not download your whole database in a CSV. Although we might question your reason for doing this.

Previous versions of Arctype could experience spiky memory usage in certain cases.
Previous versions of Arctype could experience spiky memory usage for lots of data.

To deliver consistently low memory usage and fast rendering UI, we looked at options to write some rows to disk before they were displayed. Arctype already works well locally thanks to features like offline mode and no-account access so this was a natural progression. localStorage was the first place that came to mind. But IndexedDB has a few advantages over localStorage.

  • It is asynchronous, so fetching data is non-blocking
  • LocalStorage has a maximum storage limit of 5MB, making it non-ideal to cache large results or the results from a significant amount of queries

And so we implemented IndexedDB as a disk cache in Arctype to remove all existing import and export limits, to reduce the memory footprint, and to render results faster. Before we go through each tool in Arctype and spec out the changes, let's explore what IndexedDB is and how its API works.

What is IndexedDB?

IndexedDB is a transactional database system meant to persistently store data inside a user's browser. Its low level API lets developers create web applications with rich query abilities regardless of whether the client is connected to the internet. The API is centered around a few key concepts.

Databases are identified by a name and integer.

  • Connections must be opened to the database and closed after used.
  • When you open a database you must handle the case where the database doesn't exist or the database is of a previous version.
  • When there are multiple databases open - as in the case with multiple tabs in Arctype - one of those tabs may want to upgrade the database. Generally you need to close your open databases so the other tab can proceed.

Object Stores are arbitrary javascript objects. Databases can store any amount of them. Keys identify object stores. Every object store has exactly one. The key can either come from the object itself, through a property path, or you can generate sequential ids.

Indexes exist in any number inside object stores. They are:

  • Located in object by property path
  • Offer a way of looking up objects quickly
  • Also can enforce unique constraints

Transactions. All reads and writes to the object store must be done in a transaction. Transactions can use cursors to get multiple data like:

  • To simply get a single object by a key or an index you can use get() method. But to retrieve multiple rows you must use cursors.
  • Can get all objects where the index equals a value. Or retrieve them in a range

The IndexedDB API is highly asynchronous, with callbacks in the form of DOM events. Most functions return a response object. You actually get the result by installing onerror and onsuccess handlers on that object, as well as possibly handlers for other events. This design diverges from the now-standard promises pattern, but the API was created a few years before promises became ubiquitous in Javascript, which is why it was designed this way.

Now.. what has to change inside Arctype to accommodate IndexedDB?

Planning for Change

Arctype has 3 main tools. Query view is a SQL editor with autocomplete, query variables, search-in-results and other niceties for when you need to craft a great query. Table view is a spreadsheet-esque component with single cell editing, filtering and sorting, schema change tools, and more. Dashboards are ways to visualize queries, group together tables and charts, and add some simple inputs and internal tools. Prior to IndexedDB, all these components fetched and displayed rows slightly differently.

Arctype consists of a Query Editor, Table View, and Dashboard.
Arctype consists of a Query Editor, Table View, and Dashboard shown in Light Mode (beta).

Query View

Everything in IndexedDB is retrieved asynchronously. That means that for rendering rows in the results pane in Query view, we will have to store them in memory for a short amount of time. We can use a simple cache of 1000 rows to achieve this. This memory cache would only need to exist for the current query tab. We will create a small library to manage the storage and expose a few React hooks for Arctype developers. In the new scheme, the rows are not directly stored in Redux and instead a reference to the RowState object (discussed in the implementation section below) is tracked. We can add a few metadata fields to our Redux store to help with the bookkeeping, but there are some complications.

The IndexedDB change is visible to the user through the Results Pane.
The IndexedDB change is visible to the user through the Results Pane.

One is that Query view allows client side row filtering via the Search results box in the pink rectangle above. That brings some extra complications because it is not straightforward to just ask "give me row 42 from localstorage" when you actually want the 42nd filtered row. We also have a View JSON mode for any cells in Query View that contain JSON. In the new regime, all JSON rows will have to be loaded into memory.

Table View (Spreadsheet)

The spreadsheet retrieves rows on-demand as they are rendered, and if the rows are not ready yet the callback can just return NULL. Once the rows come in, the memoized function that fetches new rows will change and the spreadsheet will be rerendered. The way our virtualized rendering works, it should be pretty easy for the spreadsheet to take a callback that would be called whenever the visible set of rows changed, which in turn is helpful for maintaining the set of rows that are paged into memory.

Switching to query streaming using IndexedDB just felt better.
Switching to query streaming using IndexedDB just felt better.

In prior versions of Arctype, Table view would stream the results page-by-page from the database by making queries with OFFSET ${start} LIMIT ${nRows} clauses. With IndexedDB we can remove that clause and rely on socket.io to just stream the query right from the database without the offset and limit clause. This has a noticeable impact on the rendering of the component, shown above.

Table View allows users to export their tables - with filters or raw - to Excel and CSV files. One investigation that was needed was to determine whether the CSV and Excel export library has to accumulate the file in memory, or if it can stream it to disk. This would have an impact on whether exports worked from all platforms, or just the desktop platforms and no the web.

Dashboards

To make dashboards and public queries work with more than 10,000 rows we needed to change these APIs to support a streaming interface. We could create a new REST endpoint for the streaming version of the runQuery API. This way we do not break compatibility with older clients. Once this is done we will be able to integrate these APIs into the IndexedDB query manager code. Then we can make Dashboard queries run through this new interface.

Instead of loading all data into Tables and Charts, we now reference the queries in IndexedDB
Instead of loading all data into Tables and Charts, we now reference the queries in IndexedDB

The Dashboard will be modified so that instead of storing the full data directly in the chart or table, it just has a query ID to reference the running query. And anywhere where we were accessing the data, it will have to use the APIs exposed by our IndexedDB code to access the data. The charts will have to just grab all the data. The spreadsheet component should be able to work in the same way as the spreadsheet in the Query View. The table component may need some modifications to work in the new paradigm.

We also need to add some code to the query manager where we figure out what queries are still "live" — i.e. which ones are still needed by some tab that is open. It will have to read the runIDs out of the dashboard objects. The diagram above shows this indirection.

The Implementation

Now that we have an understanding of how each tool within Arctype needs to be modified to support the new memory regime, let's discuss the specific steps (and bugs!) we went through to implement IndexedDB and provide a smoother experience across all query workloads. In particular, we created a centralized RowStore class in Typescript, had to modify our streaming interface, and sorted through two major paging issues.

RowStore Class

The state surrounding the use of IndexedDB is in a provider. There are also hooks that components can call to get action functions for operating on queries, as well as reserving a view of a range of rows in the query. We developed a RowStore class that manages the row database and caches of rows for each query. The database is very simple, it just contains the rows, keyed by the queryId and the row index.

Writing rows can take time, so we want to display rows right away. They go right into the cache, and then are written out in the background. The code keeps track of the ranges of rows that are in memory, as well as the rows that are desired to be in memory. We call the second type "retained" ranges. They are retained either because there is a spreadsheet open on the screen looking at them, or because there is a write operation pending on them.

if row in cache
    	display()
elif row in retained and not in cache
	load()
else if row not in retained and in cache
	evict()

When there are retained rows that are not actually in memory, we begin a load from indexeddb. When there are rows in memory that are not retained, we can free them, or evict them from the cache.

Driver Callbacks and SocketIO

Previously our streaming query APIs between the main process and the render process would trigger data events as quickly as we get rows from the database. We needed a way of slowing down the query to match the rate that the results can be processed. Otherwise we have no option but to let the rows buffer in memory, which will cause the memory usage of the app to spike until the render process is able to work its way through the backlog.

For the web app, the way we were using socketIO for queries was not conducive to the ACK messages needed to prevent those memory spikes. We were actually using the broadcast messaging functionality internally, even though we were never actually sending the messages to more than one client. Now, instead of having the query API be a normal REST endpoint that then triggers broadcast messages on websockets, we now do the entire query through the socket.io connection and ACK it between the processes as below. This is a cleaner design in the long run.

socket.on('message', async ({ data, elapsedMs }, ack) => {
	logIntegrationEvent('queryResource', { runId: queryRunId, event: 'message' })
	await onMessage(data, elapsedMs)
	ack()
})

The steps to implement this change were:

  • Modify the Postgres and MySQL backends to take async functions as callbacks. The query will be paused while the callbacks are running. Then we make the APIs in the renderer also take async callbacks, and make sure the callbacks do not return until processing is complete
  • Create a mechanism for communicating backpressure through from the backend to the renderer
    • In the main process, the renderer will send events back to acknowledge each message that is sent after the processing callback has run
    • In the server, the renderer will use socket.io's message acknowledgement feature.
  • Limit the number of rows in memory for dashboard elements like charts. This isn't just about memory consumption. If a user creates a chart with an incredible amount of rows it will take so long to render as to effectively freeze the app for the user.

Memory Management

Managing thousands of rows across cache, memory, writes, and evictions in this new way led to some interesting bugs during testing. In one instance, rows were being swapped out of memory even when visible on the user's screen. This only seemed to occur when the React debugging code was turned off. Our hypothesis was that this made the renders happen faster.

In the IndexedDB code we keep rows in memory if it's a member of a range that some code in the application is requesting to be kept. When you are looking at a table in query view, the query view should be requesting that those rows be kept. But in this case they were removed from memory (and eventually read back in).

When rows in the results pane were being removed and recreated instead of updated they could be dropped in between. In the process of fixing that we found a second bug that was causing us to retain all the rows instead of just the requested ones! This made our memory issue much worse and more persistent. Now once the rows are done writing they will stop taking up memory as is intended.

if (range.viewRangeData && filter !== '' && filter !== undefined) {
	const filteredRows: unknown[] = []
	range.viewRangeData.filteredRows = filteredRows
}

Fixing both these bugs was trivial. We now no longer retain everything. We were destroying then recreating ranges instead of updating them which lead to rows being dropped from memory when they should not have been. Finally we wait for one IndexedDB  transaction to finish before initiating another. This allows us to drop ranges from memory sooner after writing them.

Learnings and Future Work

That's it. Our usage of indexedDB is actually very basic. One object store of rows, keyed by query id and row index. For future improvements, we are considering using a promise-based wrapper of IndexedDB. The native API is event based and a little bit difficult to work with but our usage is simple enough that it is not that big of an inconvenience.

The actual secret sauce here is the cache code which decides what rows are in memory depending on what rows we're trying to display, or decides if the rows haven't been committed to IndexedDB yet. From there, we have to write rows to IndexedDB in transactions as they come in - or if the set of visible rows change, and secondly we have to page rows back from IndexedDB.

With these changes, you can export unlimited data from Arctype, see your query results appear faster in Query View, and enjoy the benefits of Arctype using much less memory on your machine. We consistently see under 200 MB even for tens of tabs with queries running all over. Interested in working on technical problems like this? We are hiring.

Spread the word

Keep reading