TMWL June’20 — CSVs, databases and V8 internals

Programmers share what they’ve learned

Maria Kucharczyk
SoftwareMill Tech Blog

--

Every month we share what we’ve learned in our team. In June, Jakub, Michał and Marcin discovered:

  • how to get the path of a JavaScript/TypeScript test file that calls a function from within the function.
  • how to do joins on CSVs the lazy way ;)
  • how the intarray extension can help optimize your query execution plan.

Get the path of a file that calls a function from within the function — by Jakub

This month I learned how to get the path of a JavaScript/TypeScript test file that calls a function from within the function. The solution is both surprisingly simple and surprisingly hacky.

You probably know how Jest snapshot testing works. During test execution, Jest generates snapshot files with the representation of some components, strings or data structures. The directory with snapshot files is created in the same location as the test file.

The approach taken by Jest is a no-brainer: it just takes the path of the file provided by the test runner. However, I work on a test tool that is supposed to be test-runner-agnostic, so I don’t have access to the test file path. I have to figure it out… from the stack trace.
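Here is a minimal TypeScript sketch of the trick (the getCallerFile name and the exact file-matching regex are illustrative, not the actual implementation):

    // Temporarily swap V8's stack trace formatter to get raw CallSite objects.
    function getCallerFile(): string | undefined {
      const originalPrepare = Error.prepareStackTrace;
      try {
        // Make `stack` evaluate to the structured CallSite array instead of a string.
        Error.prepareStackTrace = (_error, structuredStackTrace) => structuredStackTrace;
        const callSites = new Error().stack as unknown as NodeJS.CallSite[];
        // Pick the first frame whose file name looks like a test file.
        return (
          callSites
            .map((site) => site.getFileName())
            .find((file) => !!file && /\.(spec|test)\.(js|ts)$/.test(file)) ?? undefined
        );
      } finally {
        // Restore the original formatter so nothing else is affected.
        Error.prepareStackTrace = originalPrepare;
      }
    }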

The Error object in JavaScript has a stack property that contains the stack trace. The V8 JavaScript engine allows customizing it: the stack trace is formatted by calling Error.prepareStackTrace(error, structuredStackTrace). In the listing above I temporarily replace this function so that it returns the array of CallSite objects. Thanks to that, I can get this array by calling new Error().stack.

Since a CallSite object exposes the getFileName function, the rest of the solution is pretty straightforward: I can search for a file whose name ends with .spec.js or .test.js (or their TypeScript equivalents).

You can read more about it in this thread on Stack Overflow. Besides, there are at least two popular libraries (7M and 19M weekly downloads) that use this approach: caller-callsite and get-caller-file.

Lazy dev doing joins on CSVs — by Michał

This month I had to perform some transformations on CSV files. Nothing complicated, actually: just a simple join of two files with additional filtering. I wondered: what is the easiest way to do it?

Maybe write a Scala app? Well, I would need to check how the file is escaped and decide how to parse it, manually or with some additional library. Then I would need to do the joins manually and save the result back to a file. Hmm, too complicated.

So maybe… I can just load the CSV into a database (e.g. Postgres can be spawned with a one-line Docker command), do the joins and filtering there, and export the result back to CSV? Sounds quite good. The standard way to load a CSV into a db is to define a table schema and execute the COPY command, as sketched below. However… if the db runs in Docker, I would need to get the file into the container, do some docker run magic for the COPY operation, etc. Nah, there must be a simpler way.
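For reference, the COPY route I decided against could look like this (the table layout and file path are made up for illustration):

    -- Define a schema matching the CSV columns (names are made up here).
    CREATE TABLE orders (
      id        integer,
      client_id integer,
      amount    numeric
    );

    -- COPY reads from the *server's* filesystem, hence the extra hop
    -- of getting the file into the Docker container first.
    COPY orders FROM '/tmp/orders.csv' WITH (FORMAT csv, HEADER true);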

Then a miracle happened… it turned out that it is enough to use… IntelliJ. Thanks to its database tools it is possible to connect to a database instance and import a CSV file via a wizard, where the IDE automatically proposes a table schema and does everything for you. Just a few clicks. After doing the joins and filters it is possible to download the result as CSV. A few minutes and the job is done.

To get more details about these features, you can read the JetBrains docs here.

PostgreSQL: intarray extension in action — by Marcin

Recently we’ve been asked by SwissBorg to help them improve the performance of their Akka persistence journals. They are growing very fast and the universal Akka-persistence-JDBC plugin has become too generic for their needs.

They would like to have a plugin that’s more Postgres-oriented and squeezes all the juice from its features (partitions, BRIN indexes, arrays), so we rolled up our sleeves and started the symphony of research & coding.

One of the challenges was to find an alternative to storing tags in a text column. Because we wanted to keep the total size of the indexes as small as possible and use caching instead of query joins, we decided to assign a unique integer id to each unique tag and store those ids in an int[] column.

Yet another decision was to partition the journal, for reasons that are out of scope for this short note but will be covered on the blog soon, so stay tuned!

We ended up with the following schema:
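Something like this (a sketch; the column names are reconstructed for illustration and the real DDL is more detailed):

    CREATE TABLE journal (
      ordering        bigserial,
      persistence_id  text    NOT NULL,
      sequence_number bigint  NOT NULL,
      deleted         boolean NOT NULL DEFAULT false,
      tags            int[],
      message         bytea   NOT NULL,
      PRIMARY KEY (persistence_id, sequence_number)
    ) PARTITION BY LIST (persistence_id);

    -- A BRIN index on the ordering column, as discussed below:
    CREATE INDEX journal_ordering_idx ON journal USING BRIN (ordering);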

Where each partition is also partitioned by the sequence_number range:
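For example (partition names and bounds are illustrative):

    -- One partition per persistence id, itself range-partitioned:
    CREATE TABLE journal_p1 PARTITION OF journal
      FOR VALUES IN ('some-persistence-id')
      PARTITION BY RANGE (sequence_number);

    CREATE TABLE journal_p1_1 PARTITION OF journal_p1
      FOR VALUES FROM (1) TO (10000000);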

After filling the table with random data (a volume of 100 000 000 events) I started measuring performance for the following query:
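It was a query of roughly this shape (the tag id is an example; the ordering bounds match the plan discussed below):

    SELECT *
    FROM journal
    WHERE ordering > 1000000 AND ordering <= 10000000
      AND NOT deleted
      AND tags @> '{3}'::int[];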

(the @> operator stands for “contains”, i.e. the left array contains all elements of the right one; see the array functions documentation for more details)

I wasn’t surprised by the result — it took about 12 seconds on a remote Postgres (Aurora) instance. Let’s take a look at the execution plan:
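Schematically (costs and row counts omitted; partition and index names are illustrative):

    Append
      ->  Bitmap Heap Scan on journal_p1_1
            Recheck Cond: (ordering > 1000000) AND (ordering <= 10000000)
            Filter: (NOT deleted) AND (tags @> '{3}'::integer[])
            ->  Bitmap Index Scan on journal_p1_1_ordering_idx
                  Index Cond: (ordering > 1000000) AND (ordering <= 10000000)
      ->  ... analogous scans over the remaining partitions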

The query planner uses a BRIN index on the ordering column. BRIN indexes contain information about block ranges, and because blocks might contain records that are out of the condition’s range (ordering > 1000000 AND ordering <= 10000000), an additional re-check is required.

Also, Postgres reads all the records that fit in the range and then filters them, so only those not deleted and containing the given tag id are left.

Let’s try to improve the schema and create a GIN index on the tags column:
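Something along these lines (the index name is illustrative):

    CREATE INDEX journal_tags_idx ON journal USING GIN (tags);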

And re-run the same query.

This time it took approximately 10 seconds (!), while I was expecting a real boost from the newly created index…

Let’s try to solve the mystery and take a look at the query execution plan again:
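Schematically it looked almost the same as before; the tags condition still sits in the filter rather than in an index scan:

    Append
      ->  Bitmap Heap Scan on journal_p1_1
            Recheck Cond: (ordering > 1000000) AND (ordering <= 10000000)
            Filter: (NOT deleted) AND (tags @> '{3}'::integer[])
            ->  Bitmap Index Scan on journal_p1_1_ordering_idx
                  Index Cond: (ordering > 1000000) AND (ordering <= 10000000)
      ->  ... analogous scans over the remaining partitions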

The query cost looks a little bit better, but it’s still not what we expected. For some reason, Postgres doesn’t want to take the newly created GIN index into account when performing the search (though it probably has some impact on the filtering efficiency).

The RDBMS doesn’t provide any kind of query hints and the query takes too long.

Are we doomed to failure?

No!

Postgres comes with a handy extension called intarray (also available in AWS Aurora). As the name suggests, it brings some extras for int[] columns, such as functions, operators and (most interesting for us now) support for indexed searches!

Let’s enable the extension and replace our index with one that leverages the dedicated operator class:
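In SQL (again, the index name is illustrative):

    CREATE EXTENSION IF NOT EXISTS intarray;

    DROP INDEX journal_tags_idx;
    CREATE INDEX journal_tags_idx ON journal USING GIN (tags gin__int_ops);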

Let’s try the query again.

The difference is hard to overlook! 1 second vs 10–12 seconds is much more than I expected.

The query plan has changed a lot:
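Schematically (same caveats as above):

    Append
      ->  Bitmap Heap Scan on journal_p1_1
            Recheck Cond: (tags @> '{3}'::integer[]) AND (ordering > 1000000) AND (ordering <= 10000000)
            Filter: (NOT deleted)
            ->  BitmapAnd
                  ->  Bitmap Index Scan on journal_p1_1_tags_idx
                        Index Cond: (tags @> '{3}'::integer[])
                  ->  Bitmap Index Scan on journal_p1_1_ordering_idx
                        Index Cond: (ordering > 1000000) AND (ordering <= 10000000)
      ->  ... analogous scans over the remaining partitions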

It seems that intarray’s gin__int_ops operator class convinced the query planner to use the newly created GIN index on tags together with the BRIN one on ordering.

I can say that this extension module is the tool that helped me the most this month!

And what have you learned in June? Let us know! :)

BTW, we are always looking for outstanding professionals to join our team!

Questions? Ask us anything about remote work, what cooperation with us looks like, what projects we have, or about anything else - on the dedicated Slack channel 💡
