To process or query large data sets, such as ECG and EEG data, efficiently or in real time, data compression is a common pre-processing technique. Compressing data with a tolerated error (under a given metric) can greatly reduce the original data size and benefits both storage and query efficiency. The stored compressed data approximates the original data under the given metric and can be used for various analyses thereafter. Approximate query algorithms work on the compressed data and produce the results. Example queries include finding the best match, or all similar patterns, for a given pattern. To achieve the desired application outcomes, the following interlocking topics need to be considered.
Compression of time series data
The key technique is to develop efficient and effective compression algorithms that can be used for querying in real time. This requires the algorithms to have low time and space complexity, preferably O(N) time and O(log N) space, as achieved by our current compression algorithms. These algorithms need to be extended to more general applications and to support other error-tolerant metrics.
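To make the idea of compression with a tolerated error concrete, the following is a minimal sketch of a single-pass, piecewise-constant approximation under a maximum-error bound. It is an illustration only and is not the algorithm referred to above; the names compress_pca, decompress and eps are assumptions introduced for this example.

```python
from typing import Iterable, List, Tuple

# Hypothetical synopsis format: (start index, length, representative value)
Segment = Tuple[int, int, float]


def compress_pca(samples: Iterable[float], eps: float) -> List[Segment]:
    """Greedy piecewise-constant approximation.

    Every sample in a segment lies within eps of the segment's value
    (up to floating-point rounding). Runs in one pass over the input.
    """
    segments: List[Segment] = []
    start, count = 0, 0
    lo = hi = 0.0
    for i, x in enumerate(samples):
        if count == 0:
            start, count, lo, hi = i, 1, x, x
            continue
        new_lo, new_hi = min(lo, x), max(hi, x)
        if new_hi - new_lo > 2 * eps:
            # x does not fit: close the current segment at its mid-range
            segments.append((start, count, (lo + hi) / 2))
            start, count, lo, hi = i, 1, x, x
        else:
            count, lo, hi = count + 1, new_lo, new_hi
    if count:
        segments.append((start, count, (lo + hi) / 2))
    return segments


def decompress(segments: List[Segment]) -> List[float]:
    """Rebuild an approximation of the original samples."""
    out: List[float] = []
    for _, length, value in segments:
        out.extend([value] * length)
    return out
```

For example, compress_pca([1.0, 1.1, 0.9, 5.0, 5.2, 5.1, 1.0], eps=0.2) yields three segments for seven samples. The working state is constant in size; the output grows with the number of segments, which depends on the signal and on eps.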
Currently we are investigating the use of these algorithms for physiological data, which is pseudo-periodic. Compression of this data is important, but the main advantage will come from operations that can be performed directly on the compressed data, such as real-time queries over large volumes of data. One such application may be physiological data monitored while a patient is under anaesthetic. This can be regarded as multidimensional data and will require our current algorithms to be extended to multiple dimensions.
Approximate query processing
While it is a challenge to obtain efficient compression algorithms, it is even more challenging to derive these algorithms in a way which supports efficient querying.
The key to this challenge is being able to retrieve the answer from the synopsis rather than needing to rebuild the uncompressed data.
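As a small illustration of answering a query from the synopsis alone, the sketch below computes an approximate range maximum directly over compressed segments. It assumes a hypothetical synopsis format of (start index, length, value) triples sorted by start index; the function name range_max is likewise an assumption for this example and does not describe our actual query algorithms.

```python
from bisect import bisect_right
from typing import List, Tuple

# Hypothetical synopsis format: (start index, length, representative value)
Segment = Tuple[int, int, float]


def range_max(segments: List[Segment], a: int, b: int) -> float:
    """Approximate maximum of the samples with index in [a, b).

    Only the segments overlapping the range are inspected; the original
    samples are never rebuilt.
    """
    if not segments or a >= b:
        raise ValueError("empty synopsis or empty range")
    # In practice the list of start indices would be built once and reused.
    starts = [s for s, _, _ in segments]
    first = max(bisect_right(starts, a) - 1, 0)
    best = None
    for start, length, value in segments[first:]:
        if start >= b:
            break
        if start + length > a:  # segment overlaps [a, b)
            best = value if best is None else max(best, value)
    if best is None:
        raise ValueError("range lies outside the series")
    return best
```

If each segment value is within a tolerance eps of every sample it represents, the returned value is within eps of the true maximum, so the error bound of the compression carries over to the query result.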
For a single time series, queries can include the examination of the relative and/or absolute changes of signals in span, frequency and shape.
A common query may require similar patterns to be found across multiple time series. Algorithms for matching patterns in (pseudo-periodic) data are especially important.
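For reference, a best-match query can be stated as a brute-force baseline over uncompressed samples, as sketched below using z-normalised Euclidean distance over a sliding window. This is only a baseline that a synopsis-based method would aim to approximate more cheaply; the choice of distance and the names znorm and best_match are assumptions for this example.

```python
import math
from typing import List, Sequence, Tuple


def znorm(window: Sequence[float]) -> List[float]:
    """Scale a window to zero mean and unit variance (flat windows map to zeros)."""
    n = len(window)
    mean = sum(window) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / n)
    if std == 0:
        return [0.0] * n
    return [(x - mean) / std for x in window]


def best_match(series: Sequence[float], pattern: Sequence[float]) -> Tuple[int, float]:
    """Return (position, distance) of the window most similar to pattern."""
    m = len(pattern)
    if m == 0 or m > len(series):
        raise ValueError("pattern must be non-empty and no longer than the series")
    p = znorm(pattern)
    best_pos, best_dist = -1, math.inf
    for i in range(len(series) - m + 1):
        w = znorm(series[i:i + m])
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(w, p)))
        if dist < best_dist:  # ties keep the earliest position
            best_pos, best_dist = i, dist
    return best_pos, best_dist
```

Because each window is normalised, a beat of the same shape but a different amplitude or baseline is still treated as a close match, which suits pseudo-periodic signals. To search multiple time series, the same function can be applied to each series and the results compared.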
To perform these queries efficiently, the methods may require space compensation and operations classification, although this is far from clear.
Traditional database management systems are not designed for rapid and continuous loading of individual data items, and they do not directly support time series computations such as continuous queries (i.e., queries that let users obtain new results from a database without having to issue the same query repeatedly).
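The notion of a continuous query can be illustrated with a minimal sketch: a standing query is registered once and emits a result whenever newly arrived data satisfies it. The example below keeps a sliding-window mean over a stream and reports each position where the mean exceeds a threshold. The function name continuous_mean and its parameters are assumptions for this example.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple


def continuous_mean(
    stream: Iterable[float], window: int, threshold: float
) -> Iterator[Tuple[int, float]]:
    """Yield (index, mean) each time the mean of the last `window` items exceeds threshold."""
    if window <= 0:
        raise ValueError("window must be positive")
    buf: Deque[float] = deque()
    total = 0.0
    for i, x in enumerate(stream):
        buf.append(x)
        total += x
        if len(buf) > window:
            total -= buf.popleft()  # drop the item that left the window
        if len(buf) == window:
            mean = total / window
            if mean > threshold:
                yield i, mean
```

For example, list(continuous_mean([60, 62, 61, 90, 95, 99, 63], window=3, threshold=80)) reports results at indices 4, 5 and 6. Each new item is processed in constant time and only the current window is kept in memory.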
In the case of physiological data generated by real-time surveillance systems, the data are often voluminous and expensive to analyze.
Since most analyses are interested in data dynamics such as trends and outliers, a high level of data abstraction and aggregation is needed.
We are investigating how to extend our compression algorithms so that these types of queries can be executed in real time.
Top-k similarity join
Linking patient records from heterogeneous data repositories is a core requirement in a Health Information Environment (HIE). It enables epidemic disease alerting, better health services and higher-quality data. The prevalent approach to this record linkage task is the similarity join technique. A major challenge is how to perform the similarity join in an efficient and scalable way when linking large data sets. Another issue is how to determine the similarity threshold. This aggravates the efficiency problem, as current practice is merely trial and error.
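A top-k formulation avoids choosing a threshold in advance by asking for the k most similar record pairs. The sketch below shows the problem statement as a brute-force baseline, using Jaccard similarity over character bigrams and a bounded heap. It compares every pair and therefore does not scale to large data sets; the similarity measure and the names qgrams, jaccard and topk_join are assumptions for this example.

```python
import heapq
from typing import List, Set, Tuple


def qgrams(text: str, q: int = 2) -> Set[str]:
    """Set of lower-cased character q-grams of text."""
    s = text.lower()
    if len(s) < q:
        return {s}
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|, defined as 1.0 for two empty sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def topk_join(left: List[str], right: List[str], k: int) -> List[Tuple[str, str, float]]:
    """Return the k most similar (left, right) pairs, most similar first."""
    if k <= 0:
        return []
    right_sets = [qgrams(r) for r in right]
    heap: List[Tuple[float, int, int]] = []  # min-heap holding the current top k
    for i, l in enumerate(left):
        left_set = qgrams(l)
        for j, right_set in enumerate(right_sets):
            item = (jaccard(left_set, right_set), i, j)
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                heapq.heapreplace(heap, item)  # evict the weakest pair
    return [(left[i], right[j], sim) for sim, i, j in sorted(heap, reverse=True)]
```

For example, joining ["John Smith", "Mary Jones"] with ["Jon Smith", "Maria Jones", "Peter Pan"] and k = 2 pairs "John Smith" with "Jon Smith" and "Mary Jones" with "Maria Jones". The smallest similarity kept in the heap acts as a threshold that rises as better pairs are found, which is the property an efficient top-k join can exploit to prune candidate pairs.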