My use case is something like the following:
- There are `n` medical cases today.
- Every day 10 new cases arrive.
- Each case is a time series of varying length, of measurements of raw data. Assume a bunch of ecg signals.
- Raw data comes from various sources.
- Each source has its own sampling rate (up to 100 samples per second), and can have missing samples.
- Data sources are not necessarily exactly synchronized.
- Each data source has ‘c_k’ columns.
- Each case is stored in an .h5 file with ~540,000 samples at the highest sampling rate (100 Hz).
- Data is mostly sampled as matrices; the columns are those matrices flattened, one column per matrix index.
- The code base has to be in Python.
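To make the flattened-matrix columns concrete, here is a minimal sketch (the 2x2 matrix shape and the row-major flattening order are made-up illustrations, not part of the actual data):

```python
import numpy as np

# Hypothetical source that samples a 2x2 matrix per tick.
# Each matrix is flattened (row-major) so each matrix entry
# becomes one column of the per-sample row.
samples = [np.array([[1, 2], [3, 4]]),
           np.array([[5, 6], [7, 8]])]

rows = np.stack([m.ravel() for m in samples])  # shape (2, 4)
print(rows)
# Columns correspond to matrix indices (0,0), (0,1), (1,0), (1,1).
```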
Example: two sources. Source 1 has c_1=1 and a sampling rate of 1: “every row”. Source 2 has c_2=2 and a sampling rate of 2: “every 2 rows”. There is no “time” column. (Should there be?)
Sample | 1_1 | 2_1  | 2_2
1      | 1.0 | 10.0 | 100.0
2      | 2.0 |      |
3      | 3.0 | 30.0 | 300.0
4      | 4.0 |      |
5      | 5.0 | 50.0 | 500.0
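For reference, the table above maps naturally onto a pandas DataFrame, with the slower source's off-samples as NaN (a sketch, assuming pandas is an acceptable dependency):

```python
import numpy as np
import pandas as pd

# The example table: source 1 ("1_1") runs at every row, source 2
# ("2_1", "2_2") at every second row, so its off-samples are NaN.
df = pd.DataFrame(
    {
        "1_1": [1.0, 2.0, 3.0, 4.0, 5.0],
        "2_1": [10.0, np.nan, 30.0, np.nan, 50.0],
        "2_2": [100.0, np.nan, 300.0, np.nan, 500.0],
    },
    index=pd.RangeIndex(1, 6, name="sample"),
)
print(df)
```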
I would like the ability to query in an SQL-like manner over any combination of cases, samples, and data sources, with maximum flexibility, for research purposes.
This is not for production. Both querying and adding data happen infrequently, and there are no scale-related constraints.
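To illustrate the kind of query I mean, here is a hedged pandas sketch; the case IDs, column names, and values are all invented, and the in-memory dict stands in for whatever storage the design ends up using:

```python
import pandas as pd

# Hypothetical query: "across cases 1 and 2, give me every sample
# where column 1_1 > 2.0, together with the matching 2_1 value."
cases = {
    1: pd.DataFrame({"1_1": [1.0, 3.0, 5.0], "2_1": [10.0, 30.0, 50.0]}),
    2: pd.DataFrame({"1_1": [2.0, 4.0], "2_1": [20.0, 40.0]}),
}

# Concatenate with a case_id index level -- SQL's UNION ALL --
# then filter rows, which plays the role of a WHERE clause.
all_cases = pd.concat(cases, names=["case_id", "sample"])
result = all_cases[all_cases["1_1"] > 2.0]
print(result)
```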
What would be a good design for this?
I assume some information may be missing, so please ask and I will add in whatever is needed.