Datasets¶
Available datasets:
Name | Type of experiment | Channels | Sampling rate | Original size |
---|---|---|---|---|
CWRU | Initiated faults | up to 3 vibrations | 12 kHz / 48 kHz | 656 MB |
Dcase2020 | Labelled data | 1 acoustic | 16 kHz | 16.5 GB |
DIRG | Initiated faults & Run-to-failure | 6 vibrations | 51.2 kHz / 102.4 kHz | 3 GB |
FEMTO | Run-to-failure | up to 3: 2 vibrations + 1 temperature | 25.6 kHz / 10 Hz | 3 GB |
Fraunhofer151 | Labelled data | 5: 1 voltage + 1 rpm + 3 vibrations | 4096 Hz | 11 GB |
IMS | Run-to-failure | up to 8 vibrations | 20480 Hz | 6.1 GB |
Paderborn | Initiated faults & Run-to-failure | 8: 2 currents + 1 vibration + 3 mechanical + 1 temperature | 64 kHz / 4 kHz / 1 Hz | 20.8 GB |
Meaning of different types of experiment:
- Initiated faults: fault is precisely labelled, e.g. by the name of the faulty component.
- Labelled data: fault is roughly labelled, e.g. by 'normal' or 'faulty'.
- Run-to-failure: continuous experiment; in general, no information about the faulty state is available.
Installation & Preprocessing¶
Nomenclature
In `dpmhm` we distinguish three concepts:
- Original dataset: what is provided by the original source
- Built dataset: original dataset formatted by tensorflow-datasets
- Preprocessed dataset: built dataset after preprocessing steps
As mentioned in Workflow, a dataset needs to be installed and preprocessed before being fed to ML models.
Installation¶
Let's take the dataset CWRU as an example. For installation, simply use:
from dpmhm import datasets
datasets.install('CWRU')
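Once installed, the built dataset can be loaded like any other `tensorflow-datasets` dataset. Below is a minimal sketch; the registered name `'cwru'` and the data directory are assumptions and may differ on your system:

```python
import tensorflow_datasets as tfds

# Load the built dataset; the name 'cwru' and data_dir are assumptions.
ds = tfds.load('cwru', data_dir='~/tensorflow_datasets')

# Inspect the standard interface of the built dataset.
print(ds['train'].element_spec)
```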
See the tutorials Installation of Datasets and Preprocessing of Datasets for an in-depth walkthrough.
Preprocessing¶
A preprocessing pipeline consists of 3 levels of transformations:
- File-level preprocessing: data selection, label modification, signal resampling & truncation etc.
- Data-level preprocessing: feature extraction, windowed view, data augmentation etc.
- Model-level preprocessing: adaptation to the specification of a machine learning model.
Let's take the example of a dataset of 3-channel acoustic records, 10 seconds per file. The corresponding steps would be (see the sketch below):
- Select only the first channel and refine the labels.
- Compute the spectrogram of each record and split into patches of shape (64, 64).
- Make batches of paired views of the patches, for training a contrastive learning model.
See the page Datasets for more details.
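A minimal sketch of the data-level step above, assuming `ds` is a built dataset with a hypothetical channel key `'channel1'`; the STFT parameters are illustrative and not part of `dpmhm`:

```python
import tensorflow as tf

def to_patches(record):
    # Keep only the first channel (hypothetical key 'channel1').
    x = record['signal']['channel1']
    # Spectrogram via the short-time Fourier transform (illustrative parameters).
    spec = tf.abs(tf.signal.stft(x, frame_length=256, frame_step=128))
    # Split the (time, frequency) image into non-overlapping 64x64 patches.
    spec = spec[tf.newaxis, ..., tf.newaxis]   # shape (1, time, freq, 1)
    patches = tf.image.extract_patches(
        images=spec,
        sizes=[1, 64, 64, 1], strides=[1, 64, 64, 1],
        rates=[1, 1, 1, 1], padding='VALID')
    return tf.reshape(patches, (-1, 64, 64))

# Each file yields several patches; unbatch() flattens them into single patches.
ds_patches = ds['train'].map(to_patches).unbatch()
```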
Convention for the data structure¶
Built datasets have a standard interface that can be inspected via the property `.element_spec`. In `dpmhm` a built dataset contains the following fields:
Name | Type | Description |
---|---|---|
`signal` | dict | data of this record |
`sampling_rate` | int or dict | sampling rate (Hz) of `signal` |
`metadata` | dict | all other information about this record |
The same structure is shared by all items of the dataset. When `sampling_rate` is a dictionary, it has the same keys as `signal` and its values are the sampling rates of the corresponding channels in `signal`; otherwise it is a number representing the common sampling rate of all channels.
For example, the structure of the dataset CWRU looks like:
>>> ds['train'].element_spec
{'metadata': {'FaultComponent': TensorSpec(shape=(), dtype=tf.string, name=None),
'FaultLocation': TensorSpec(shape=(), dtype=tf.string, name=None),
'FaultSize': TensorSpec(shape=(), dtype=tf.float32, name=None),
'FileName': TensorSpec(shape=(), dtype=tf.string, name=None),
'LoadForce': TensorSpec(shape=(), dtype=tf.uint32, name=None),
'NominalRPM': TensorSpec(shape=(), dtype=tf.uint32, name=None),
'RPM': TensorSpec(shape=(), dtype=tf.uint32, name=None)},
'sampling_rate': TensorSpec(shape=(), dtype=tf.uint32, name=None),
'signal': {'BA': TensorSpec(shape=(None,), dtype=tf.float32, name=None),
'DE': TensorSpec(shape=(None,), dtype=tf.float32, name=None),
'FE': TensorSpec(shape=(None,), dtype=tf.float32, name=None)}}
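Individual records can then be accessed as plain dictionaries following this specification (a sketch, assuming `ds` is the built CWRU dataset loaded earlier):

```python
for record in ds['train'].take(1):
    x = record['signal']['DE']        # drive-end vibration waveform
    sr = record['sampling_rate']      # a scalar here: common rate of all channels
    meta = record['metadata']
    print(meta['FaultComponent'].numpy(), sr.numpy(), x.shape)
```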
Data structure
We follow the principle that all original information must be preserved in the built dataset. However, some of it may be non-essential and can be dropped in subsequent preprocessing steps, which may modify the element specification of the dataset.
Performance¶
The subsequent preprocessing steps of a dataset actually define a pipeline of transformations, with most of the heavy lifting done by the method `.map()`. In the graph execution mode of TensorFlow these transformations are not executed until an element of the preprocessed dataset is loaded into memory. Some intermediate steps may be repeated many times and hinder performance. As a remedy, one can first pre-compute the heavyweight intermediate transformations and export the transformed dataset to disk, then apply only the lightweight final transformations in memory.
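A sketch of this export-then-reload pattern follows; the path and the heavyweight transformation `to_patches` are assumptions, and `Dataset.save()`/`Dataset.load()` require TensorFlow ≥ 2.10 (older versions use `tf.data.experimental.save()`/`load()`):

```python
import tensorflow as tf

# Pre-compute the heavyweight intermediate transformation once and export it to disk.
ds_heavy = ds['train'].map(to_patches, num_parallel_calls=tf.data.AUTOTUNE).unbatch()
ds_heavy.save('/tmp/cwru_patches')

# Later: reload the pre-computed dataset and keep only lightweight steps in memory.
ds_final = (tf.data.Dataset.load('/tmp/cwru_patches')
            .shuffle(1024)
            .batch(32)
            .prefetch(tf.data.AUTOTUNE))
```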