Showcase the time series machine learning model for canary analysis.
Given two time series of equal length (e.g. canary phases), sampled at the same frequency, the model labels the pair as follows:
- They are similar if the time series patterns are similar and the values are within an acceptable deviation range.
- They are dissimilar if the patterns are different, or the values are outside of an acceptable deviation range. Acceptable deviation range is something the model infers from the training data.
We generate a synthetic dataset inspired by the UCI Synthetic Control dataset (see the screenshot below), which is commonly used to validate time series models in the academic community.
The time series in this dataset follow the normal pattern, with no clear trend. The pattern can be written as y(t) = m + s, where s captures the variation or noise and is drawn from a standard normal distribution, s ~ N(0, 1). To introduce anomalies into the dataset, we randomly vary the level m between 1 and a specified upper limit. The code snippet below implements this.
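import numpy as np

def normal(n_samples=100, t_samples=30, m_max=1):
    # Pass m_max >= 2: np.random.randint excludes the upper bound, so m is in 1..m_max-1.
    data = []
    for i in range(n_samples):
        m = np.random.randint(1, m_max, 1)
        sample = m + np.random.normal(0, 1, t_samples)  # y(t) = m + s, s ~ N(0, 1)
        data.append(sample)
    return data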
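As a quick check of what the generator produces, the sketch below is a self-contained, vectorized variant of it. The name `make_dataset`, the `seed` parameter, and the returned `levels` array are assumptions added for illustration and are not part of the original snippet. It confirms that the result is one row per time series and that each row's mean sits close to its integer level m.

```python
import numpy as np


def make_dataset(n_samples=100, t_samples=30, m_max=2, seed=None):
    """Return (data, levels) where data[i] = levels[i] + N(0, 1) noise."""
    if m_max < 2:
        raise ValueError("m_max must be >= 2 because the upper bound is exclusive")
    rng = np.random.default_rng(seed)
    # One integer level m per series, drawn from {1, ..., m_max - 1}.
    levels = rng.integers(1, m_max, size=n_samples)
    # Unit-variance Gaussian noise s ~ N(0, 1) for every time step.
    noise = rng.normal(0.0, 1.0, size=(n_samples, t_samples))
    return levels[:, None] + noise, levels


data, levels = make_dataset(n_samples=30, t_samples=30, m_max=5, seed=0)
print(data.shape)  # (30, 30)
# Largest distance between a series' sample mean and its true level.
print(np.abs(data.mean(axis=1) - levels).max())
```

With 30 points per series, the sample mean has a standard deviation of about 0.18, so series whose levels differ by 1 or more are well separated on average.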
We generate multiple datasets with 30 time series each. Each dataset uses a different upper limit m_max for the variable m, which is drawn starting from 1 up to m_max (np.random.randint excludes the upper bound, so the largest value drawn is m_max - 1). The wider the range, the higher the probability of finding dissimilar time series in the dataset.
For each dataset, we compare every time series against every time series, in both orders and including itself, which amounts to 30 × 30 = 900 comparisons.
We plot the percentage of comparisons labeled dissimilar for each dataset against that dataset's upper limit for m (m_max). We expect the percentage of dissimilarities detected to grow as m_max increases and to taper off at some point.
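To make the count of 900 concrete, the short sketch below enumerates the index pairs for 30 series using only the standard library. It shows that 900 is the number of ordered pairs including each series paired with itself, and how many pairs remain if self-pairs and mirrored duplicates are excluded. The variable names are illustrative.

```python
from itertools import combinations, product

n_samples = 30
indices = range(n_samples)

# Every (i, j) combination, as produced by two nested loops over the dataset.
ordered_pairs = list(product(indices, repeat=2))
# Pairs that compare a series with itself.
self_pairs = [(i, j) for i, j in ordered_pairs if i == j]
# Distinct pairs where order does not matter and i != j.
unordered_pairs = list(combinations(indices, 2))

print(len(ordered_pairs))    # 900
print(len(self_pairs))       # 30
print(len(unordered_pairs))  # 435
```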
The code snippet for this is given below. The call to our SAX HMM model is the SAXHMMDistanceFinder construction followed by compute_dist().
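# SAXHMMDistanceFinder is the SAX HMM model class; its import is not shown here.
n_samples = 30
n_comparison = n_samples * n_samples  # every ordered pair, including self-pairs
for m_max in range(2, 30, 1):
    data = np.array(normal(n_samples=n_samples, m_max=m_max))
    error = 0.
    for i in range(n_samples):
        test = data[i, :]
        for j in range(n_samples):
            control = data[j, :]
            sdf = SAXHMMDistanceFinder(control, test)
            result = sdf.compute_dist()
            # risk == 1 marks the pair as dissimilar
            if result['risk'] == 1:
                error += 1
    # error / n_comparison is the fraction of pairs labeled dissimilar for this m_max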
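The loop above depends on the SAX HMM model, which is not included here. To show the experiment end to end, the sketch below swaps in `MeanGapDistanceFinder`, a hypothetical stand-in that flags a pair when the two sample means differ by more than an assumed cutoff of 0.5. It is not the SAX HMM model and will not reproduce its numbers; it only mirrors the constructor, `compute_dist()`, and the `risk` key used above, and shows how the per-dataset count becomes a percentage. The seeded generator and the function name `dissimilar_percentages` are also assumptions.

```python
import numpy as np


class MeanGapDistanceFinder:
    """Hypothetical stand-in for SAXHMMDistanceFinder. NOT the SAX HMM model."""

    def __init__(self, control, test, threshold=0.5):
        self.control = np.asarray(control, dtype=float)
        self.test = np.asarray(test, dtype=float)
        self.threshold = threshold  # assumed cutoff, chosen for this sketch

    def compute_dist(self):
        # Flag the pair when the two sample means differ by more than the cutoff.
        gap = abs(self.control.mean() - self.test.mean())
        return {"risk": int(gap > self.threshold)}


def dissimilar_percentages(n_samples=30, t_samples=30,
                           m_max_values=range(2, 30), seed=0):
    """Map each m_max to the percentage of ordered pairs flagged dissimilar."""
    rng = np.random.default_rng(seed)
    percentages = {}
    for m_max in m_max_values:
        # y(t) = m + s with m in {1, ..., m_max - 1} and s ~ N(0, 1).
        levels = rng.integers(1, m_max, size=n_samples)
        data = levels[:, None] + rng.normal(0.0, 1.0, size=(n_samples, t_samples))
        flagged = 0
        for i in range(n_samples):
            for j in range(n_samples):
                result = MeanGapDistanceFinder(data[j], data[i]).compute_dist()
                flagged += result["risk"]
        # Convert the count into a percentage of all n_samples**2 comparisons.
        percentages[m_max] = 100.0 * flagged / (n_samples * n_samples)
    return percentages


percentages = dissimilar_percentages()
for m_max in (2, 5, 15, 29):
    print(m_max, round(percentages[m_max], 1))
```

The returned dictionary can be passed to any plotting library to draw the percentage against m_max.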
We see 75% of comparisons predicted dissimilar at m_max=5, and the rate reaches its peak by m_max=15. As expected, the percentage labeled dissimilar grows as m_max increases and tapers off at about 93%.
The results showcase the SAX HMM machine learning model for time series canary analysis. The dataset was synthetically generated, much like the UCI Synthetic Control dataset. The dissimilar detection rate grows with m_max, as we expect.
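One rough yardstick for these numbers is the probability that two independently generated series were given different levels m. The sketch below computes it under the assumption that m is uniform on the integers 1 to m_max - 1, which is what np.random.randint(1, m_max) draws. This is an idealization and not a description of how the SAX HMM model decides: it ignores the self-comparisons included in the 900 pairs and any misclassification. The function name is illustrative.

```python
from fractions import Fraction


def different_level_probability(m_max):
    """P(two independent levels differ) for m uniform on {1, ..., m_max - 1}."""
    n_levels = m_max - 1
    if n_levels < 1:
        raise ValueError("m_max must be >= 2")
    # Two draws coincide with probability 1 / n_levels.
    return 1 - Fraction(1, n_levels)


for m_max in (2, 5, 15, 29):
    p = different_level_probability(m_max)
    print(m_max, f"{float(p):.1%}")
# 2 0.0%
# 5 75.0%
# 15 92.9%
# 29 96.4%
```

Under this assumption the yardstick gives 75% at m_max=5 and about 92.9% at m_max=15, close to the reported figures, although it keeps rising slowly for larger m_max while the reported rate levels off at about 93%.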
Reference: Synthetic Control Chart Time Series by Dr Robert Alcock.
Thanks for reading!