How to store real-time data to database


Question

Here is my situation:

  • I have a WebSocket connection that feeds me data real-time
  • I have an async callback function that gets the data and inserts into a queue
  • I have another thread that reads from the queue and stores each item in a Postgres database, one row at a time (using Python's psycopg2 library)

The problem is that the real-time data arrives faster than it can be inserted into the database, so the queue grows and the server runs out of memory within a couple of hours. (Is this because the psycopg2 library is slow?)

A simple solution would be to create more threads that insert into the database; however, that would cause rows to be inserted out of order. Is there a database that keeps the data sorted? Other suggestions would be very much appreciated.

asked 2020-02-23 by MoneyBall


Answer

Generally speaking, with relational databases, 1000 single-row inserts will be slower than a single 1000-row insert. Batching up your data in memory and loading it to the database in multi-row statements may be fast enough to keep up.

You're already building a queue as data arrives. Right now, you pop the top item from the queue and insert it into the database as a single row. Instead, pop the top N items and insert them all in a single INSERT statement. The ideal value for N isn't universal, so you'll have to experiment: try 1000, then 10000, and adjust from there based on the throughput you measure.
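As a rough sketch of that batching loop (using the standard-library sqlite3 module as a stand-in for Postgres, since it needs no server; with psycopg2 you'd use `psycopg2.extras.execute_values` for the multi-row insert, and the table/column names here are just placeholders):

```python
import queue
import sqlite3

BATCH_SIZE = 1000  # the "N" to tune; try 1000, 10000, ...

def drain_batch(q, max_rows):
    """Pop up to max_rows items from the queue; return whatever is available."""
    rows = []
    try:
        # Block briefly for the first row so the loop can idle when quiet.
        rows.append(q.get(timeout=1))
        while len(rows) < max_rows:
            rows.append(q.get_nowait())
    except queue.Empty:
        pass
    return rows

def insert_batch(conn, rows):
    # One multi-row statement instead of len(rows) single-row inserts.
    conn.executemany("INSERT INTO ticks (ts, price) VALUES (?, ?)", rows)
    conn.commit()

# Demo with an in-memory database and a pre-filled queue.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticks (ts TEXT, price REAL)")
q = queue.Queue()
for i in range(2500):
    q.put((f"2020-02-23T00:00:{i:04d}", float(i)))

while not q.empty():
    batch = drain_batch(q, BATCH_SIZE)
    if batch:
        insert_batch(conn, batch)

print(conn.execute("SELECT COUNT(*) FROM ticks").fetchone()[0])  # 2500
```

Because the consumer pops a whole batch at a time from a single queue, ordering is preserved without extra threads.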

I’ve also seen (and supported) architectures where high-volume data is streamed to a file, then asynchronously loaded to the DB. This helps ensure that your receiving service never runs out of memory and crashes (thus losing the data in memory), but it does add some complexity, and it doesn’t guarantee that you’ll always load data faster than it comes in. Rather, it splits that off into its own problem to solve.

In this configuration, your receiving service writes data to a file until it reaches N MB or N rows in size, then rolls over into the next file. (Personally, I like datetime stamps in file names to preserve file order.) Asynchronously, a separate service picks up every file except the most recent one and loads them to the database (using COPY, or whatever Python library you find for that).
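The writer side of that spool-to-file setup might look something like the sketch below. This is a minimal illustration, not a production implementation: the file naming, row format, and `MAX_ROWS` threshold are all assumptions, and a real loader would also need to handle fsync, partial files, and cleanup after a successful COPY.

```python
import os
import tempfile
from datetime import datetime, timezone

MAX_ROWS = 1000  # roll over to a new file after this many rows

class SpoolWriter:
    """Appends rows to a datetime-stamped file, rolling over at MAX_ROWS."""

    def __init__(self, directory):
        self.directory = directory
        self.current = None
        self.rows_in_file = 0
        self._roll()

    def _roll(self):
        if self.current:
            self.current.close()
        # Datetime-stamped names keep the files sorted in arrival order.
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S.%f")
        path = os.path.join(self.directory, f"ticks-{stamp}.csv")
        self.current = open(path, "w")
        self.rows_in_file = 0

    def write_row(self, *fields):
        self.current.write(",".join(str(f) for f in fields) + "\n")
        self.rows_in_file += 1
        if self.rows_in_file >= MAX_ROWS:
            self._roll()

    def close(self):
        self.current.close()

def loadable_files(directory):
    """Every spool file except the newest, which may still be written to."""
    return sorted(os.listdir(directory))[:-1]

# Demo: 2500 rows span three files; the two oldest are safe to load.
spool_dir = tempfile.mkdtemp()
writer = SpoolWriter(spool_dir)
for i in range(2500):
    writer.write_row(i, i * 0.5)
writer.close()
print(len(os.listdir(spool_dir)))      # 3
print(len(loadable_files(spool_dir)))  # 2
```

The loader service would then walk `loadable_files()` on a timer, COPY each file into Postgres, and delete (or archive) it on success.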

answered 2020-02-23 by Andy Mallon