How to optimize a MySQL database of 250 million rows for bulk inserts and selects

I am a student who has been given the task of designing a database for sensor data. My university currently has a large database that is being filled with this data, but a lot of what is being stored is not necessary. They want me to extract some of the fields from the existing database and insert them into a new one, which will only hold the ‘essentials’. I will need to extract every row from the old one, as well as fetch new data once a day.

  • There are 1500 sensors.
  • They generate a reading every minute.
  • Approximately 2.1 million readings every day.
  • The current database has about 250 million rows.

The queries that will be performed will typically select sensor readings for a set of sensors within a given time span.
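
To be concrete, a typical query against the new READINGS table (defined below) would look something like this; the sensor IDs and the epoch range are just placeholders:

SELECT SENSOR_ID, TIMESTAMP, VALUE, STATUS
FROM READINGS
WHERE SENSOR_ID IN (12, 17, 300)                      -- placeholder sensor IDs
  AND TIMESTAMP BETWEEN 1577836800 AND 1580515200     -- placeholder UNIX epoch range
ORDER BY SENSOR_ID, TIMESTAMP;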

I was initially naive about the added complexity that large amounts of data introduce, so I grossly underestimated the time needed for this task. Because of this, and the fact that I don’t have access to the server from home, I am here asking for help and input.

The initial design looks like this:

CREATE TABLE IF NOT EXISTS SENSORS (
    ID smallint UNSIGNED NOT NULL AUTO_INCREMENT,
    NAME varchar(500) NOT NULL UNIQUE,
    VALUEFACETS varchar(500) NOT NULL,
    PRIMARY KEY (ID)
); 

CREATE TABLE IF NOT EXISTS READINGS (
    ID int UNSIGNED AUTO_INCREMENT,
    TIMESTAMP int UNSIGNED NOT NULL,
    VALUE float NOT NULL,
    STATUS int NOT NULL,
    SENSOR_ID smallint UNSIGNED NOT NULL,
    PRIMARY KEY (ID),
    INDEX (TIMESTAMP),
    FOREIGN KEY (SENSOR_ID) REFERENCES SENSORS(ID)
);

Design Question

My first question is whether I should keep an auto-incremented key for the readings, or if it would be more beneficial to have a composite key on TIMESTAMP (UNIX epoch) and SENSOR_ID?

This question applies both to the fact that I have to insert 2.1 million rows per day and to the fact that I want to optimize for the aforementioned queries.
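
For clarity, the composite-key alternative I have in mind would look roughly like this (only a sketch, I have not tested it on the server):

CREATE TABLE IF NOT EXISTS READINGS (
    TIMESTAMP int UNSIGNED NOT NULL,
    VALUE float NOT NULL,
    STATUS int NOT NULL,
    SENSOR_ID smallint UNSIGNED NOT NULL,
    PRIMARY KEY (TIMESTAMP, SENSOR_ID),  -- or (SENSOR_ID, TIMESTAMP); the column order is part of the question
    FOREIGN KEY (SENSOR_ID) REFERENCES SENSORS(ID)
);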

Initial Bulk insert:

After a lot of trial and error, and after finding a guide online, I have found that inserting with LOAD DATA INFILE will best suit this purpose. I have written a script that selects 500 000 rows at a time from the old database and writes them (all 250 million) to a CSV file, which will look like this:

TIMESTAMP,SENSOR_ID,VALUE,STATUS
2604947572,1399,96.434564,1432543

My plan is then to sort it with GNU sort and split it into files containing 1 million rows each.

Before inserting these files, I will remove the index on TIMESTAMP and run these commands:

SET FOREIGN_KEY_CHECKS = 0;
SET UNIQUE_CHECKS = 0;
SET SESSION transaction_isolation = 'READ-UNCOMMITTED';
SET sql_log_bin = 0;

After inserting, I will of course revert these changes.
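
To make the load step concrete, this is roughly what I have in mind per file; the file path is a placeholder, and I am assuming the secondary index ends up with the default name TIMESTAMP:

-- drop the secondary index before the bulk load
ALTER TABLE READINGS DROP INDEX TIMESTAMP;

-- load one of the split files (add IGNORE 1 LINES if the file still contains the header row)
LOAD DATA INFILE '/var/lib/mysql-files/readings_000.csv'
INTO TABLE READINGS
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(TIMESTAMP, SENSOR_ID, VALUE, STATUS);

-- once every file has been loaded, recreate the index
ALTER TABLE READINGS ADD INDEX (TIMESTAMP);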

  • Is this plan at all viable?

  • Can the inserts be sped up if I sort the CSV by SENSOR_ID and TIMESTAMP instead of TIMESTAMP and SENSOR_ID?

  • After turning the index back on following the bulk insert, will the insertion of 2 million rows each day be possible?

  • Is it possible to do the daily inserts with regular insert statements (a sketch of what I mean is below), or will I have to use LOAD DATA INFILE in order to keep up with the input load?
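
For reference, by regular insert statements I mean batched multi-row inserts along these lines (the values are made up):

INSERT INTO READINGS (TIMESTAMP, SENSOR_ID, VALUE, STATUS)
VALUES
    (1612137600, 1, 21.5, 0),   -- made-up rows; the real script would batch
    (1612137600, 2, 19.8, 0),   -- a few thousand rows per statement
    (1612137660, 1, 21.6, 0);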

my.cnf

Every setting is at its default except for these:

  innodb_flush_log_at_trx_commit=2
  innodb_buffer_pool_size=5G
  innodb_flush_method=O_DIRECT
  innodb_doublewrite=0

Are there any other optimizations I need for this particular purpose?

The server has 8 GB of RAM.
mysqld Ver 8.0.22
Ubuntu 20.04

Any thoughts, ideas or inputs would be greatly appreciated.