ADAPTABLE AUTOREGRESSIVE MOVING AVERAGE FILTER TRIGGERING CONVOLUTIONAL NEURAL NETWORKS FOR CHOREOGRAPHIC MODELING

ABSTRACT: Choreographic modeling, that is, the identification of key choreographic primitives, is a significant element of Intangible Cultural Heritage (ICH) performing art modeling. Recently, deep learning architectures, such as LSTMs and CNNs, have been utilized for choreographic identification and modeling. However, such approaches are sensitive to capturing errors and fail to model the dynamic characteristics of a dance, since they assume stationarity between the input-output data. To address these limitations, in this paper, we introduce an AutoRegressive Moving Average (ARMA) filter into a conventional CNN model; the classification output feeds back to the input layer, improving overall classification accuracy. In addition, an adaptive implementation algorithm is introduced, exploiting a first-order Taylor series expansion, to update the network response so that it fits the dynamic characteristics of a dance. This way, the network parameters (i.e., weights) are dynamically modified, further improving classification accuracy. Experimental results on real-life dance sequences indicate that the proposed approach outperforms conventional deep learning mechanisms.


INTRODUCTION
The domain of Intangible Cultural Heritage (ICH) comprises a vast range of non-material elements, such as performing arts (e.g., folklore dances), music and oral cultural traditions (Kurin, 2004). It is clear that ICH elements are of great importance and, therefore, these assets have been identified by UNESCO to ensure their efficient protection and preservation. As far as the preservation of performing arts is concerned, kinesiology analysis and choreographic modeling constitute a very important aspect of folklore dance modeling. One of the most important elements of choreographic analysis is the identification of the dancer's movements and poses (i.e., the dancer's postures). Recently, motion capturing digitization systems have become capable of providing 3D measurements of the body parts of a dancer (Rallis et al., 2018). Then, one can proceed to the identification of the key primitives of a dance.
In general, deep learning models receive as input either raw visual signals of a choreographic sequence or transformed data, that is, 3D features, and then generate labelled classes corresponding to dance choreographic primitives. Recently, Long Short-Term Memory (LSTM) networks have proven especially useful in choreographic modeling. LSTM networks usually operate on 3D skeleton data of a dancer instead of RGB content. This way, the complexity of the input data is reduced, increasing choreographic classification performance. The main advantage of an LSTM network is its recurrent characteristics, which can also be implemented in a bi-directional way (i.e., non-causal modeling). Non-causality is necessary since the modeling and identification of choreographic primitives depend on both the backward and forward steps of the dancer.
The main drawback of using 3D skeleton data sequences through an LSTM network is that the choreographic modeling performance is highly sensitive to skeleton signal errors. Missing skeleton points, resulting from errors of the motion capturing devices, significantly affect the performance of choreographic primitive classification. Another limitation is the assumption of stationarity between the input-output data, which means that the network weights of the LSTM model remain constant during choreographic modeling. However, a dance sequence presents several dynamics, and a dancer's attributes, such as gender, age and personalized style, significantly affect the overall dance performance.
Instead, by using RGB content as input to a deep learning network, we avoid the aforementioned skeleton error issues. Convolutional Neural Networks (CNNs) have recently proven to be robust classifiers, especially for processing high-dimensional RGB visual data (LeCun et al., 1998), (Makantasis et al., 2017a). Therefore, CNN networks have been used for human action recognition (Varol et al., 2018), (Kamel et al., 2019).
However, issues related to the dynamic nature of a choreography cannot be addressed using conventional CNN models, since the model parameters (i.e., network weights) remain constant during the operation of the model. Additionally, the RGB data alone deteriorate the overall choreographic modeling performance due to the enormous amount of spatio-temporal information, which confuses the classification for the following reasons. First, the purpose of the convolutional layer of a CNN is to transform the raw RGB visual data into low-dimensional representations through "deep convolutions". In this case, the convolutional layer transforms the whole input image frame, including the background content irrelevant to choreographic modeling, into low-dimensional representations, which are then fed to a fully connected neural network. Second, a conventional CNN structure does not have the recurrent characteristics inherent in an LSTM model, let alone its bi-directional capabilities. Finally, the network weights are assumed to be constant throughout network operation, failing, therefore, to address the dynamic characteristics of a dance.

Related Works
Kinesiology modeling methods are distinguished into those that exploit supervised learning and those that follow an unsupervised paradigm. In the literature, the proposed works cover human activity indexing (Ben-Arie et al., 2002), pose identification (Chéron et al., 2015), action prediction (Hadfield, Bowden, 2013), emotion recognition (Fan et al., 2016) and background subtraction (Piccardi, 2004). In (Milbich et al., 2017), an unsupervised approach is proposed for modelling human activities, while in (Rallis et al., 2018), summarization of folklore dances has been introduced using a hierarchical SMRS algorithm. In this context, the work of (Wang et al., 2011) introduced an action recognition framework exploiting dense trajectories. Finally, in (Kolekar, Dash, 2016), hidden Markov models (HMMs) have been proposed for human activity recognition.
Recently, deep machine learning methods have been introduced for the analysis of folklore sequences. A brief review of deep learning for computer vision applications can be found in the literature. In (Zeng et al., 2014), a CNN model has been introduced for human activity analysis, while the work of (Khaire et al., 2018) uses RGB-D and skeleton data for activity analysis. In (Simonyan, Zisserman, 2014), the authors introduce a two-stream convolutional neural network structure for action recognition in videos. In this context, the work of (Wang et al., 2017) introduces a three-stream CNN for action recognition modelling, while the work of (Kamel et al., 2018) proposes CNN structures on depth maps and postures for human action recognition. Finally, Makantasis et al. (Makantasis et al., 2016) introduce a behavioural understanding approach for industrial environments, while in (Gan et al., 2015), the authors introduce a flexible deep CNN for detecting spatio-temporal relationships in videos.
Another area of research related to this paper is background modeling and, consequently, foreground extraction. Towards this direction, salient maps have been proposed in (Makantasis et al., 2013), exploiting concepts of visual attention algorithms. In this context, the work of (Babaee et al., 2018) introduces a background modeling algorithm using CNN structures. Similarly, in (Varadarajan et al., 2015), the authors introduce Mixture of Gaussians methods to face background dynamics. In (Bianchi et al., 2019), the authors propose a neural network implementation of the ARMA filter with a recursive and distributed formulation, obtaining a convolutional layer that is efficient to train, localized in the node space, and transferable to new graphs unseen during training. In (Defferrard et al., 2016), the authors are interested in generalizing CNNs from low-dimensional regular grids to high-dimensional irregular domains, such as social networks, brain connectomes or word embeddings, represented by graphs.

Paper contribution
To face the aforementioned limitations, in this paper, we introduce a novel CNN model with AutoRegressive Moving Average (ARMA) capabilities. In addition, we introduce adaptive capabilities into the proposed non-linear ARMA model, in a way that the network weights are dynamically adapted to face the current choreographic dynamics. We call this model an adaptable ARMA-based CNN filter due to its adaptive and AutoRegressive Moving Average capabilities.
In particular, the proposed network filter feeds back its classification output to the input layer, implementing an autoregressive triggering mechanism; the output variable depends on its own previous values. In addition, we introduce a Tapped Delay Line (TDL) input to the CNN model in order to capture the temporal dependencies of a choreography. The TDL filter implements a moving average (Doulamis et al., 2003).
Finally, we introduce a computationally efficient and adaptive algorithm for dynamically modifying the network weights of the fully connected layer of the CNN model to fit the dynamic nature of a choreography. The proposed adaptation allows the new ARMA-enriched CNN to automatically adapt its behavior to the current conditions, while simultaneously respecting the already accumulated knowledge as much as possible. This way, the new model is able to capture the non-stationary behavior of a choreography.
In addition, to face the first limitation of using a conventional CNN model for choreographic modeling, we apply a background subtraction algorithm prior to the classification stage. In this context, the background content irrelevant to choreographic modeling is isolated, creating an RGB mask of the dancers' postures. In this way, the hierarchies of convolutions of the CNN transform the RGB dancers' postures into low-dimensional representations, e.g., kinesiology features of the dancers, which are then used for choreographic modeling. Therefore, the proposed approach faces the skeleton error sensitivity issues of the current LSTM filters and simultaneously addresses the previously discussed limitations of using conventional CNN models on raw RGB data (that is, dynamic and adaptive training, since the output of a dance pose estimator should be affected by its own previous values).

This paper is organized as follows: Section 1.1 describes previous works. The new proposed ARMA-enriched CNN model is discussed in Section 2. In this section, the adaptive behavior of the model is also given, along with the proposed optimization process to maximize its efficiency and the variational inference-based background subtraction method. Experimental results on real-life dances are presented in Section 3. Finally, Section 4 draws the conclusions.

AN ARMA-ENRICHED CNN FOR CHOREOGRAPHY MODELING
Fig. 1 indicates the overall proposed architecture for choreographic modeling. As observed, our framework encompasses the following components. The first is responsible for data acquisition (the motion capturing sensors) and is used to obtain the RGB images of a choreographic sequence as well as the skeleton data. The second component is related to background subtraction, reducing the content irrelevant to choreographic modeling. This information is fed as input to the proposed adaptive ARMA-enriched CNN model (the third component). The adaptive ARMA-enriched CNN filter is a conventional CNN enriched with an ARMA filter, as well as with adaptive network weight strategies for dynamically adjusting the model response to fit the dance dynamics. The MA component is responsible for delaying the input signals into several taps. In addition, the AR filter is responsible for feeding back the classification output to the input, so that the current choreographic modeling is related to its own previous values. Finally, the adaptive algorithm is responsible for dynamically modifying the weights of the fully connected layer of the CNN to face the dynamic nature of a choreography.

The Kinect-based Acquisition Component
The acquisition module adopted for modeling the dancer's motion trajectories in 3D space exploits the Kinect-II motion capturing system. It should be mentioned that the Kinect motion capturing system also extracts the respective RGB visual data. Fig. 2 shows a snapshot of the Kinect-II setup used for motion capturing of the dance sequences.
The recorded data from the Kinect system are used to extract a) the RGB visual content of the choreography and b) the respective 3D skeleton joints. In this paper, we use only the RGB information as sensorial input to identify the choreographic primitives, since skeleton data are sensitive to errors, especially when using low-cost motion capturing systems such as the Kinect.

The Autoregressive Moving Average Convolutional Neural Network
In the following, we assume a non-linear relationship, denoted as g(·), which relates the output of the neural network model y(n) to the input sensorial signals x(n) at a time instance n. The purpose of g(·) is to transform the raw RGB input signals x(n) into labeled choreographic primitive classes. Therefore, we have that

y(n) = g(x(n), x(n − 1), · · · , x(n − q), y(n − 1), · · · , y(n − p)) + e(n)   (1)

where q expresses a time window of previous observations affecting the choreographic classification of the current image frame n, while p is the order of the previous classification outputs affecting the choreographic modeling. The error e(n) is an independent and identically distributed (i.i.d.) process.
In order to approximate the non-linear function g(·), we use machine learning methods, which minimize the error e(n) through training. In particular, it has been proven that a Tapped Delay Line (TDL) input filter can approximate the non-linear function of (1) with any degree of accuracy (Doulamis et al., 2003).
The main limitation of using a simple fully connected neural network (e.g., a feedforward one) is that the training procedure is unstable, especially in cases where large amounts of multidimensional data are used as input signals, such as series of RGB image content. To face these difficulties, CNN models have been proposed as an alternative classification mechanism for processing RGB input signals, compared to conventional feedforward structures (LeCun et al., 1998).

The Moving Average behavior:
To model the MA property in a CNN filter, we include a Tapped Delay Line (TDL) layer in the network. This is illustrated in Fig. 3. The TDL layer is responsible for delaying the input signal for q discrete time instances. Therefore, it implements the x(n), x(n − 1), · · · , x(n − q) relationship of (1). The MA behavior means that the identification of a choreographic primitive at a time instance n should not be limited to a single image frame, but rather extend to a set of frames. That is, the vector y(n) depends on the current and q previous samples x(n − j), j = 0, · · · , q.
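The TDL buffer described above can be sketched with a small NumPy routine; the function name and the zero-padding of the first q frames are our assumptions, not details stated in the paper.

```python
import numpy as np

def tapped_delay_line(frames, q):
    """Stack each frame with its q predecessors along a new axis,
    implementing the MA input x(n), x(n-1), ..., x(n-q) of Eq. (1).
    Positions before the start of the sequence are zero-padded."""
    n_frames = frames.shape[0]
    taps = np.zeros((n_frames, q + 1) + frames.shape[1:], dtype=frames.dtype)
    for n in range(n_frames):
        for j in range(q + 1):
            if n - j >= 0:
                taps[n, j] = frames[n - j]
    return taps
```

At inference time, `taps[n]` would hold the window of q + 1 frames fed to the CNN for time instance n.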

The AutoRegressive behavior:
On the other hand, the output of the pose estimator should not only depend on external, even cumulative, input, but also on its classification output history, so as to eliminate abrupt spikes in the recognition output. Therefore, including an additional time window of previous classification outputs in the input of the model allows the consideration of previous identification behavior and ensures a smoother output. This is also illustrated in Fig. 3, where the classification output feeds back to the input layer. The AR behavior implements the second part of (1), that is, y(n) is related to its own previous values y(n − 1), · · · , y(n − p).
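A minimal sketch of this AR feedback loop follows; the classifier interface, the zero initialization of the output history and the flattened feedback encoding are our assumptions for illustration.

```python
def autoregressive_inference(classify, frames, p, n_classes):
    """Run a classifier whose input is extended with its own p previous
    output vectors -- the AR part y(n-1), ..., y(n-p) of Eq. (1).
    The output history is initialized with zero vectors."""
    history = [[0.0] * n_classes for _ in range(p)]
    outputs = []
    for x in frames:
        feedback = [v for h in history for v in h]  # flatten the p outputs
        y = classify(x, feedback)
        outputs.append(y)
        history = history[1:] + [y]                 # slide the output window
    return outputs
```

Because each output depends on the previous ones, isolated one-frame misclassifications are damped rather than propagated as spikes.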

The Convolutional Layer:
The purpose of this layer is to extract descriptors from the sensorial input signals in a latent way. In the following, the outputs of the convolutional layer of the CNN are denoted as f1, f2, · · · , fL. These outputs are fed as inputs to the classification layer, which is responsible for choreographic modeling. The structure of the convolutional layer adopted in this paper is the following: it consists of convolutions with ReLU activations and max pooling filters. The first layer of convolutions consists of 32 filters of size 5x5x3. The second layer consists of 64 convolutional filters of size 5x5x32. The classification layer uses the descriptors of the convolutional layer, that is f1, f2, . . . , fL, to provide the final choreographic modeling. Fig. 3 depicts the structure of the proposed deep learning model for choreographic modeling.
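The dimensioning of the two convolutional layers can be checked with a short helper; stride 1, no padding and the 64x64 input resolution below are our assumptions, since the paper does not state them.

```python
def conv2d_shape(h, w, k, stride=1, pad=0):
    """Spatial output size of a convolution with a k x k kernel."""
    return ((h + 2 * pad - k) // stride + 1,
            (w + 2 * pad - k) // stride + 1)

def conv2d_params(k, c_in, n_filters):
    """Trainable parameters: k*k*c_in weights plus one bias per filter."""
    return n_filters * (k * k * c_in + 1)

# Layer 1: 32 filters of size 5x5x3; Layer 2: 64 filters of size 5x5x32
p1 = conv2d_params(5, 3, 32)
p2 = conv2d_params(5, 32, 64)
```

For a hypothetical 64x64 RGB input, the first layer would produce 60x60 feature maps with 2,432 trainable parameters, and the second layer adds 51,264.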
Therefore, our proposed ARMA-enriched CNN architecture supports both input and output memory, thus approximating a non-linear ARMA (NARMA) filter empowered by a CNN. We call this model the Autoregressive Moving Average Convolutional Neural Network, in short the ARMA-CNN model. Fig. 3 presents the proposed ARMA-CNN architecture adopted for choreographic modeling.

The Adaptive Behavior of the ARMA-Enriched CNN
The main limitation of the aforementioned architecture is that it assumes a stationary input-output relationship. However, this is not valid in choreographic modeling, since many dynamics are involved. Therefore, adaptation strategies are required to update the model response in a highly dynamic way.
Let us denote as wb the parameters of the fully connected neural layer, that is, the network weights, before the network adaptation. Let us also denote as wa the network weights after the adaptation. We assume that these weights are related as follows:

wa = wb + dw   (2)

In Eq. (2), dw refers to a small perturbation of the network weights. Eq. (2) means that we only need to compute the small perturbation dw in order to estimate the new network weights (that is, after the adaptation) from the previous ones, wb.

Usually, a choreography consists of a constant main choreographic pattern. For example, the main choreographic patterns of two different choreographies are depicted in Fig. 4. A frequency domain approach is adopted for estimating the main choreographic pattern, as in (Baihua Li, Holstein, 2002). Let us denote that, using the method of (Baihua Li, Holstein, 2002), the main choreographic pattern has been estimated as

γ = {ci(ns), · · · , ci(ne)}   (3)

In Eq. (3), ci(t) expresses the choreographic primitive that the image frame at time instance t belongs to, while ns and ne refer to the start and end time instances of the main choreographic pattern. In case a misclassification occurs within the choreographic pattern group, network weight adaptation is needed. Therefore, the new network weights are estimated in a way that the network response, after the weight adaptation, approximates the main choreographic pattern group sequence:
ywa(n) ≈ ci(n), ∀ ci(n) ∈ γ   (4)

In Eq. (4), ywa(n) denotes the response of the network at time instance n using the new adapted weights wa. Eq. (4) means that the network response should respect the main choreographic pattern sequence.
Using the assumption of Eq. (2), one can apply a first-order Taylor series expansion to estimate the small weight perturbation dw. In this way, a system of linear equations is derived as follows:

ei(n) = Ai · dw   (5)

In Eq. (5), Ai expresses a matrix that is derived from the previous network weights wb, while ei(n) is a scalar expressing the difference of the network response before and after the adaptation. Therefore, solving Eq. (5), one can estimate the small weight perturbation dw and thus the new weights wa. The new weights are estimated so that the network response approximates the main choreographic pattern (see Eq. (4)).
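The first-order Taylor expansion behind Eq. (5) can be illustrated numerically. Here a toy scalar function stands in for the network response, and a finite-difference Jacobian plays the role of the matrix Ai; both are our stand-ins for illustration, not the paper's actual network.

```python
import numpy as np

def numerical_jacobian(f, w, eps=1e-6):
    """Finite-difference Jacobian of f at w; entry (i, j) holds df_i/dw_j."""
    y0 = np.atleast_1d(f(w))
    J = np.zeros((y0.size, w.size))
    for j in range(w.size):
        wp = w.copy()
        wp[j] += eps
        J[:, j] = (np.atleast_1d(f(wp)) - y0) / eps
    return J

# First-order prediction: f(wb + dw) ~ f(wb) + J @ dw
f = lambda w: np.array([w[0] ** 2 + 3.0 * w[1]])   # toy "network response"
w_b = np.array([1.0, 2.0])
J = numerical_jacobian(f, w_b)
dw = np.array([0.01, -0.01])
pred = f(w_b) + J @ dw
```

For a small dw the linear prediction closely matches the true response, which is exactly what makes the linear system of Eq. (5) a valid surrogate for the non-linear network.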

The Optimization Procedure
The main problem in solving Eq. (5) is that we have only one equation, whereas the number of weights is large. This means that dw is a multi-dimensional vector of size equal to the number of network weights of the fully connected layer of the network (see Fig. 3). Therefore, there is no unique solution to Eq. (5).
To address this limitation, an additional constraint is introduced in this paper. Particularly, we select, among all possible solutions satisfying Eq. (5), the one that yields a minimum modification, i.e., the smallest perturbation dw. This means that we have the following constrained optimization framework:

min ||dw||²  subject to  ei(n) = Ai · dw   (7)

Solving Eq. (7), we can estimate the small perturbation dw. An alternative framework is not to seek the minimum-norm dw subject to the constraint of Eq. (5), but instead to modify the previous network knowledge, as discussed in (Doulamis et al., 2003).
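The minimum-norm solution of such an underdetermined system has a standard closed form via the Moore-Penrose pseudoinverse; the single-equation, three-weight example below is illustrative only.

```python
import numpy as np

def min_norm_perturbation(A, e):
    """Smallest-norm dw satisfying A @ dw = e (fewer equations than
    unknowns), via the Moore-Penrose pseudoinverse: dw = pinv(A) @ e."""
    return np.linalg.pinv(A) @ e

# One linear constraint on three weights, as in Eq. (5)/(7)
A = np.array([[1.0, 2.0, 2.0]])
e = np.array([3.0])
dw = min_norm_perturbation(A, e)
```

Among all dw satisfying the constraint, the pseudoinverse picks the one closest to zero, which is precisely the "minimum modification" criterion of Eq. (7).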

Variational Inference of Gaussian Modeling for Background Subtraction
As far as background modeling is concerned, a variational inference approach of Gaussian mixtures is adopted (Makantasis et al., 2017b). The advantage of this algorithm compared to traditional Mixture of Gaussians schemes is that it substitutes scalar parameters with probability distributions; therefore, more accurate background modeling is performed. In addition, this approach is less computationally complex than traditional Mixture of Gaussians schemes, which is an important aspect for folklore analysis. Initially, every pixel is described by its intensity in the RGB colour space. Then, the probability of each pixel belonging to the foreground or the background is computed with the following equation:

P(Xt) = Σ_{i=1}^{K} ωi,t · η(Xt, µi,t, Σi,t)   (8)

Actually, in a variational inference approach, the variable ωi,t is a probability density function, say P(Xt|ω), instead of a scalar value as in a conventional Gaussian Mixture Model. However, in Eq. (8), we denote it as a scalar for simplicity purposes (more information can be found in (Makantasis et al., 2017b)). In Eq. (8), Xt expresses the current pixel in frame t and K the number of distributions of the mixture. The weight of the i-th distribution in frame t is expressed as ωi,t, its mean as µi,t and its covariance as Σi,t. Moreover, η(Xt, µi,t, Σi,t) denotes the probability density function, defined as a Gaussian distribution.
The difference between a Gaussian mixture and its variational inference counterpart is that the weights ωi,t of Eq. (8) are probability distributions instead of scalars. Therefore, better function approximations are achieved, improving background/foreground separation performance, as discussed in (Makantasis et al., 2017b).
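The scalar-weight version of Eq. (8) can be sketched as follows. A univariate (per-channel) Gaussian is assumed for simplicity; the variational variant, which replaces the scalar weights with distributions, is beyond this sketch.

```python
import math

def gaussian_pdf(x, mu, var):
    """Univariate Gaussian density eta(x; mu, sigma^2)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def mixture_probability(x, weights, means, variances):
    """P(X_t) = sum_i w_{i,t} * eta(X_t; mu_{i,t}, Sigma_{i,t}) -- Eq. (8)
    with scalar weights and K = len(weights) mixture components."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))
```

A pixel intensity falling near one of the high-weight background components yields a large P(Xt) and is labelled background; otherwise it is treated as foreground.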

Description of the dataset used
For evaluating and comparing the proposed algorithm against state-of-the-art methods, folklore video sequences are used. All video sequences are Greek traditional folkloric dances, the selection of which was made by dance experts from the Aristotle University of Thessaloniki to achieve variability in terms of styling, rhythm and gender. The selection of different genders is due to the fact that men and women follow a different style in their dance performance. Table 1 describes the folklore dance sequences used in this experiment. For every dance video sequence, a small description is provided for clarification purposes. The adopted frame rate is about 30 fps. This results in a time window of about 15 to 30 frames, that is, a delay of about 0.5 to 1 sec. In this table, we depict the main choreographic primitives of each dance. It should be mentioned that these primitives do not refer to the steps of the choreography as taught by a dance trainer, but to the main "activities" of the dance in digitized form. Fig. 4 visually depicts the main choreographic primitives of two dance sequences. As observed, the choreographic primitives share similarities with each other, imposing difficulties in the recognition process.

Choreographic Identification Performance
The proposed approach was compared with traditionally adopted classifiers, such as k-Nearest-Neighbor (kNN), kernel-based SVM structures, a Feedforward Neural Network (FNN1) with 1 hidden layer of 10 neurons, and another one (FNN2) with 2 hidden layers of 10 neurons per layer. Finally, the CNN classifier was tested with a normal input layer as well as with an input layer with the autoregressive moving average behavior proposed in this paper. For comparison, we include metrics from information retrieval, such as precision and recall, as well as accuracy and F1-score. During the experiments, the dataset was split into a training set and a test set following a 90 to 10 ratio. Fig. 5 presents the aforementioned metrics for the different machine learning configurations. As observed, the proposed method, that is, using Autoregressive and Moving Average (ARMA) behavior through an adaptive implementation, outperforms the compared machine learning classifiers. The effect of background modeling is presented in Table 2. It is clear that background modeling improves the overall classification performance. This is mainly due to the fact that irrelevant visual information (that is, the background content) is isolated from the classification process. It should be mentioned that in Fig. 5 the results are obtained using the background separation algorithm.
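The reported metrics can be computed as below; this is the standard single-class (binary) form, while the paper's multi-class setting would average these values per class.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Precision, recall, F1-score and accuracy for one positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1, accuracy
```
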
The effect of the background modeling and, therefore, of the foreground estimation is depicted in Fig. 6. Background removal is very important for choreographic modeling, since content irrelevant to the choreography is discarded. Fig. 7 indicates the effect of the size of the memory window on classification performance. As observed, the use of the memory window in the classification procedure increases the total accuracy of each algorithm (SVM, kNN, FNN1, FNN2, CNN).

CONCLUSIONS
This paper presented an adaptable autoregressive and moving average (ARMA) layer integrated into a conventional CNN filter to model the dynamic behavior of a choreography. The proposed architecture improves upon the performance of LSTM networks, which are currently used for choreography modeling and receive as input the 3D skeleton points of the dancers. The main issue of using 3D skeleton features is that the classification performance is quite sensitive to skeleton errors. For this reason, an alternative approach was adopted in this paper, based on the capabilities of CNN models.
In particular, we use RGB input data for choreographic modeling, since RGB inputs are insensitive to skeleton errors. However, the main drawbacks of this approach are that a) conventional CNNs do not have the recurrent characteristics of LSTM structures, failing, therefore, to handle the dynamics inherently present in a choreography, b) the background visual content confuses the classification since it is irrelevant to the choreography, and c) they assume stationarity between the input-output data, which contradicts the dynamic nature of a choreography. To address the aforementioned issues, we introduced, in this paper, a novel AutoRegressive Moving Average (ARMA) filter into a CNN model in order to stimulate recurrent network characteristics. In addition, to face the choreography dynamics, we introduced an adaptation mechanism in which the network weights of the fully connected hidden layer are dynamically updated to fit the current environmental characteristics. Experimental results on real-life sequences illustrate the efficiency of the proposed model against conventional deep machine learning filters.
As future work, such a framework can be used in the context of educational or entertainment applications for Intangible Cultural Heritage.