

Hands-on with time series data in Python: from resampling and forecasting to clustering | Code

Last updated: 2019-02-11
Original work by Arnaud Zinflou, translated by Guo Yipu
Produced by Quantum Bit | Official account QbitAI

Time series data is data recorded in chronological order.

Stock prices, daily weather, and weight changes are all time series data. This type of data is quite common and is also a challenge for all data scientists.

So, if you ever come across time series data, how can you use Python to handle it?

Time series data sampling

Dataset

The example used here is electricity consumption by households in London between November 2011 and February 2014.

This dataset is sampled every half hour, recording how much electricity each household uses. Based on this data, we can generate some charts for analysis.

Since the data we care about is mainly two dimensions, time and electricity consumption, the other columns can be dropped.
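A minimal sketch of that preparation (the file name, timestamp column name, and kWh column name are assumptions; adjust them to your copy of the dataset):

```python
import pandas as pd

# Load the London household electricity readings
# (file and column names are assumptions about the dataset layout)
data = pd.read_csv('london_household_energy.csv')

# Parse the timestamp column and use it as the index
data['timestamp'] = pd.to_datetime(data['timestamp'])
data = data.set_index('timestamp')

# Keep only the electricity consumption column and drop everything else
data = data[['kWh']]
```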

Resampling

Let’s start with resampling. Resampling means changing the frequency of a time series; it is a very useful technique in feature engineering for adding structure to supervised learning models.

Pandas’ resampling method works much like groupby. The following example makes this easier to understand.

First, you need to change the sampling period to weekly:
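A minimal sketch of this step, assuming the kWh column and datetime index prepared above:

```python
# Resample the half-hourly readings to weekly totals
weekly = data['kWh'].resample('W').sum()
weekly.head()
```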

data.resample() is used to resample the kWh column in the data frame.

The 'W' indicates that we want to change the sampling period to weekly.

sum() is used to find the total amount of electricity during this period.

Of course, we can also follow the same pattern and change the sampling period to daily.
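For example (same assumptions as above):

```python
# Same pattern, aggregated per day instead of per week
daily = data['kWh'].resample('D').sum()
daily.head()
```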

Pandas has many built-in resampling options, such as different time periods ('H' for hourly, 'D' for daily, 'W' for weekly, 'M' for month end, 'Q' for quarter end, and so on), as well as different aggregation methods (sum, mean, max, min, and more).

You can use these directly or define your own.
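A couple of these options combined, still assuming the kWh column from above:

```python
# Monthly totals ('M' is the month-end alias)
monthly_total = data['kWh'].resample('M').sum()

# Different aggregation methods follow the same pattern
daily_mean = data['kWh'].resample('D').mean()
daily_max = data['kWh'].resample('D').max()
```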

Modeling with Prophet

Facebook released Prophet in 2017; it is available in both Python and R.

Prophet is naturally good at analyzing time series data. It is adaptable to any time scale and can handle outliers and missing data well. It is very sensitive to trend changes and takes into account the impact of special times such as holidays. Change points can be customized.

Before using Prophet, let’s rename the columns in the dataset: the date column becomes ds, and the column holding the value we want to predict becomes y.
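A sketch of the renaming, assuming we start from the daily series computed earlier:

```python
# Prophet expects a dataframe with a 'ds' (datestamp) column and a 'y' (value) column
df = daily.reset_index()
df.columns = ['ds', 'y']
```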

The following example is a time series with daily intervals.

Import Prophet, create a model, and fit it to the data.

In Prophet, the changepoint_prior_scale parameter controls how sensitive the model is to trend changes: the higher the value, the more sensitive it is. Here we set it to 0.15.
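A minimal sketch of this step. The import reflects the fbprophet package as distributed around the time of the article; newer releases use `from prophet import Prophet`:

```python
from fbprophet import Prophet

# A higher changepoint_prior_scale makes the model more sensitive to trend changes
model = Prophet(changepoint_prior_scale=0.15)
model.fit(df)
```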

To make forecasts, we create a future dataframe, specify the forecast horizon and frequency, and then Prophet generates the predictions.

The forecast horizon here is set to two months, at daily frequency.
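A sketch of the forecasting step; periods=60 stands in for roughly two months of daily forecasts and is an assumption about the exact horizon:

```python
# Build a dataframe extending 60 days (about two months) into the future, at daily frequency
future = model.make_future_dataframe(periods=60, freq='D')

# Generate the forecast and plot actuals, predictions and the uncertainty interval
forecast = model.predict(future)
model.plot(forecast)
```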

There you have it, we can now predict household electricity consumption for the next two months.

In the figure, the black dots are the actual values, the blue line shows the predicted values, and the light blue shaded area is the uncertainty interval.

Of course, if the forecast period is very long, the uncertainty will also increase.

With Prophet, we can also easily plot the trend and seasonality components.
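Prophet produces these component plots directly:

```python
# Trend plus yearly and weekly seasonality components of the forecast
model.plot_components(forecast)
```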

Looking at the component plots, we can clearly see that, over the year, household electricity consumption rises in autumn and winter and falls in spring and summer, and that consumption on Sundays is higher than on the other six days of the week.

LSTM prediction

LSTM recurrent neural networks can learn from long sequences of observations. Below is the architecture diagram of an LSTM cell:

LSTM looks well suited to time series forecasting. Let's apply it to our data, resampled to a daily frequency:

LSTMs are sensitive to the scale of the input data, especially when sigmoid or tanh activation functions are used.

So we normalize the data, that is, rescale it to the range [0, 1] or [-1, 1]. This is easy to do with the MinMaxScaler preprocessing class from the scikit-learn library.
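A sketch of the normalization, assuming `daily` is the daily kWh series computed earlier:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Rescale the daily kWh values to the [0, 1] range
values = daily.values.reshape(-1, 1).astype('float32')
scaler = MinMaxScaler(feature_range=(0, 1))
dataset = scaler.fit_transform(values)
```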

Now, split the dataset into training and testing sets.

The following code uses 80% of the data as the training set and reserves the remaining 20% for the test set.
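A minimal sketch of the split:

```python
# First 80% for training, remaining 20% for testing (no shuffling, to preserve time order)
train_size = int(len(dataset) * 0.80)
train, test = dataset[:train_size], dataset[train_size:]
print(len(train), len(test))
```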

Define a function to create a new dataset and use this function to prepare for modeling.
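A sketch of such a function; the look_back window of 10 days is an assumption:

```python
def create_dataset(series, look_back=1):
    """Build samples where X is `look_back` past values and y is the next value."""
    X, y = [], []
    for i in range(len(series) - look_back - 1):
        X.append(series[i:(i + look_back), 0])
        y.append(series[i + look_back, 0])
    return np.array(X), np.array(y)

look_back = 10  # number of previous days used as input (an assumption)
X_train, y_train = create_dataset(train, look_back)
X_test, y_test = create_dataset(test, look_back)
```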

The input data of the LSTM network needs to be set into a specific array structure: [samples, time steps, features].

Our data is currently shaped as [samples, features], so we need to add the time-step dimension. The following transformation reshapes the training and test sets into the required form.
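For example:

```python
# Reshape from [samples, features] to [samples, time steps, features]
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))
```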

Now we can design and fit the LSTM network.
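A minimal sketch of such a network; the layer size, number of epochs, and batch size are assumptions rather than the article's exact settings:

```python
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import LSTM, Dense

# One LSTM layer followed by a single output neuron
model_lstm = Sequential()
model_lstm.add(LSTM(32, input_shape=(look_back, 1)))
model_lstm.add(Dense(1))
model_lstm.compile(loss='mean_squared_error', optimizer='adam')

# Track loss on both the training and test sets
history = model_lstm.fit(X_train, y_train,
                         validation_data=(X_test, y_test),
                         epochs=20, batch_size=16, verbose=1)

# Loss curves for training vs. test
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='test')
plt.legend()
plt.show()
```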

From the loss plot, we can see that the model performs similarly on both the training and test sets.

As shown in the figure below, LSTM performs very well when fitting the test set.

Clustering

Finally, we will perform clustering on our example dataset.

There are many clustering methods; one of them is hierarchical clustering.

There are two ways to build the hierarchy: top-down (divisive) and bottom-up (agglomerative). Here we go bottom-up.

The approach is simple: load the original data, then add two columns for the day of the year and the hour of the day.
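A sketch of that preparation. The final pivot to one row per day and one column per hour is my assumption about how the matrix fed to the clustering step is built:

```python
# Add calendar features to the half-hourly data
df_clust = data.copy()
df_clust['day_of_year'] = df_clust.index.dayofyear
df_clust['hour_of_day'] = df_clust.index.hour

# One row per day, one column per hour of the day (assumed preparation for clustering)
samples = df_clust.pivot_table(index='day_of_year',
                               columns='hour_of_day',
                               values='kWh',
                               aggfunc='sum').fillna(0)
```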

Linkage and Dendrograms

The linkage function combines distance information and groups objects into clusters based on similarity, connecting them to each other to create larger clusters. This process is iterated until all objects in the original dataset are connected to each other in the hierarchical tree.

This completes the clustering of our data:
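A minimal sketch, using the day-by-hour matrix assumed above:

```python
from scipy.cluster.hierarchy import linkage

# Agglomerative clustering with Ward's variance minimization criterion
Z = linkage(samples.values, 'ward')
```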

Done. Simple, isn’t it?

But what is ward in the code?

ward refers to Ward's method: the keyword tells the linkage function to use the Ward variance minimization algorithm.

Now, take a look at the clustering dendrogram:
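A sketch of the plotting step:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram

plt.figure(figsize=(25, 10))
plt.title('Hierarchical clustering dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(Z, leaf_rotation=90., leaf_font_size=8.)
plt.show()
```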

On the x-axis are the labels, or sample indices;

On the y-axis is distance;

The vertical lines are cluster merges;

The horizontal lines indicate which clusters/labels are part of the merge to form new clusters;

The length of the vertical line is the distance at which new clusters are formed.

Simplified to be clearer:
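One way to simplify it is to truncate the tree to the last few merged clusters; showing the last 12 is an assumption:

```python
# Show only the last 12 merged clusters to keep the plot readable
plt.figure(figsize=(15, 8))
dendrogram(Z, truncate_mode='lastp', p=12,
           show_leaf_counts=True, show_contracted=True)
plt.show()
```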

Original article

https://towardsdatascience.com/playing-with-time-series-data-in-python-959e2485bff8


-over-




