Multi-feature spatial distribution alignment enhanced domain adaptive method for tool condition monitoring

Highlights

▪ A domain adaptive method for aligning multi-feature spatial distributions is proposed.
▪ A ResNet18_BiLSTM feature extraction model is proposed to reduce signal fluctuations.
▪ A soft thresholding technique based on an attention mechanism is proposed to improve informativeness.

Abstract

Transfer learning (TL) has been successfully applied to tool condition monitoring (TCM) to address the lack of labeled data in real industrial scenarios. In current TL models, a domain offset in the joint distribution of input features and output labels still exists after the feature distributions of the two domains are aligned, resulting in performance degradation. A multi-feature spatial distribution alignment (MSDA) method is proposed, comprising correlation alignment for deep domain adaptation (Deep CORAL) and joint maximum mean discrepancy (JMMD). Deep CORAL learns nonlinear transformations that align the source and target domains at the feature level through second-order statistical correlations. JMMD improves domain alignment by aligning the joint distribution of input features and output labels. ResNet18, combined with a bidirectional long short-term memory network and an attention mechanism, is developed to extract domain-invariant features. TCM experiments with four transfer tasks were conducted and demonstrate the effectiveness of the proposed method.


Introduction
In recent years, with the continual development of machining processes, the complexity and accuracy of machined products have greatly increased, and the condition of the tool during processing directly affects the surface quality of the machined product. To obtain high-precision machined products, it is necessary to establish an effective tool condition monitoring (TCM) system [6,18,41]. Generally, a tool's life is divided into three periods: initial (break-in) wear, steady wear, and failure. When a new tool is first used, it goes through a short break-in phase between the tool and the machined workpiece, followed by a slow increase in wear over a long period of steady wear. Failure is the final, sharp wear stage of the tool until the end of its useful life [22]. As tool wear deteriorates, the surface quality of the workpiece decreases.
Therefore, many researchers have conducted a great deal of research on TCM in order to achieve high-quality machined products [5,8,35,43]. The results show that 10% to 40% of process downtime is caused by tool faults, which often means that only 50% to 80% of the effective tool life is used [24,39].
Therefore, an effective TCM method is of great importance for improving productivity, improving the surface quality of machined products, and saving costs [20].
Tool condition is difficult to describe with precise mathematical models because it is nonlinear, time-varying, and continuous in actual industrial scenarios. Since the 1980s, TCM has been extensively studied [1], and many effective models have been proposed, including statistical, physical, data-driven, and hybrid models [16]. Data-driven models have shown significant benefits in monitoring tool condition because they do not depend on complex physical models or systematic a priori knowledge [11,40]. They can effectively extract wear feature information from time- or frequency-domain tool signals without requiring empirical knowledge [28,36]. Guan et al. proposed a method based on the Hilbert marginal spectrum to analyze wear signals for effective feature extraction and accurate classification of tool wear conditions [11]. Yan et al. used a ResNet18 network to fuse signals collected from multiple channels, which effectively improved tool wear monitoring accuracy [36].
Nawrocki et al. utilized vibroacoustic signals obtained from spindle bearings in mass production machines in the automotive industry to diagnose the spindle and detect wear symptoms [28].
Jamshidi et al. employed machine tool spindle current and multi-scale analysis for tool condition monitoring [17]. Kasim et al. proposed the Z-rotation method to calculate the milling tool wear progress index based on variance across signal components [19]. Rizal et al. developed an embedded multisensor system on a rotary dynamometer for real-time condition monitoring of milling tools [29]. Data-driven condition monitoring methods require a large number of labeled training samples to learn the model [4,23]; however, in the actual machining process, machines usually operate under different working conditions, and it is challenging to collect enough labeled samples for model training under each condition [25].
Transfer learning (TL) attempts to resolve this issue. Li et al. proposed an adaptive partial domain approach for smart fault diagnosis [15,27,34]. Chen et al. proposed a method to calibrate data labels using a TL algorithm, which makes TL play a significant role in the fault diagnosis of wind turbines [4].

For data-driven methods, TCM is a time-series problem that captures the nonlinear mapping between the time series of previous and future tool conditions during the machining process [38]. Although a recurrent neural network (RNN) can retain short-term memory of its inputs and establish a mapping between that short-term memory and a target vector [3], it cannot solve the long-term prediction problem.
Long short-term memory (LSTM) and gated recurrent unit (GRU) networks, as variants of RNN, can capture long-term dependencies between cutting force signals and tool states [9,21]. However, due to model complexity, the difficulty of training increases sharply with the number of layers, as does the demand for training data [10].
Moreover, performance is also affected by the number of hidden layers and units.
To solve the above problems, this paper proposes a novel multi-feature spatial distribution alignment (MSDA) network for TCM under variable working conditions. It takes advantage of TL to relax the requirement for feature distribution consistency between the training and test sets and to reduce the dependence on labeled samples. A structure based on ResNet18 and BiLSTM is constructed for long-term and short-term prediction of tool conditions to improve the feature extraction capability. To retain valuable features, the extracted features are processed by an attention mechanism and soft thresholding to minimize noise-related information. The contributions of this paper are as follows: (1) A domain adaptive method for aligning multi-feature spatial distributions is proposed to achieve feature alignment between the source and target domains, considering the joint distribution of features and labels in the deeper layers of the aligned neural network.
(2) A ResNet18_BiLSTM feature extraction model is proposed to extract features in the spatial and temporal dimensions and reduce the effects of signal fluctuations; the residual network avoids gradient vanishing and information loss and preserves the integrity of signal features.
(3) A soft thresholding technique based on an attention mechanism is proposed to effectively improve the informativeness of the extracted features.

Theoretical background
When the datasets in the source and target domains have different feature distributions, traditional supervised learning algorithms are often unable to achieve effective classification, and domain adaptation is well suited for this situation.
Suppose the data in the source and target domains obey the probability distributions $P$ and $Q$, respectively. For domain adaptation, our goal is to construct a deep neural network that classifies unlabeled data in the target domain through transferable feature learning:
$$\hat{y} = f(x),$$
where $f(\cdot)$ denotes the DNN and $\hat{y}$ is the model prediction. The purpose of domain adaptation is therefore to minimize the target-domain risk $R_t(f) = \Pr_{(x,y)\sim Q}\left[f(x) \neq y\right]$ with supervision from the source data.
We can write the total domain adaptation loss as
$$\mathcal{L} = \mathcal{L}_{cls} + \lambda\,\mathcal{L}_{TL},$$
where $\mathcal{L}_{cls}$ is the cross-entropy classification loss, $\lambda$ is a trade-off parameter, and $\mathcal{L}_{TL}$ denotes the transfer loss that reduces the difference in feature distributions between the two domains. The cross-entropy loss is
$$\mathcal{L}_{cls} = -\frac{1}{n_s}\sum_{i=1}^{n_s}\sum_{c=1}^{C} \mathbb{1}[y_i = c]\,\log \hat{y}_{i,c},$$
where $C$ is the count of all possible labels and $\mathbb{1}[\cdot]$ is the indicator function.
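For illustration, the cross-entropy classification term with the indicator function can be sketched in a few lines of NumPy (a minimal sketch; the variable names are ours, not from the paper):

```python
import numpy as np

def cross_entropy(probs, labels):
    """Cross-entropy classification loss.

    probs:  (n, C) predicted class probabilities
    labels: (n,)   integer class labels in [0, C)
    The indicator function simply selects the predicted probability
    of the true class for each sample.
    """
    n = probs.shape[0]
    # equivalent to -(1/n) * sum_i sum_c 1[y_i == c] * log p_{i,c}
    return float(-np.mean(np.log(probs[np.arange(n), labels])))
```

A perfectly confident correct prediction gives a loss of 0; a uniform prediction over $C$ classes gives $\log C$.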

Multi-feature spatial distribution alignment (MSDA)
The approach of Deep CORAL is analogous to that of DDC, DAN, and ReverseGrad [14]: it adds an additional loss (the CORAL loss) that minimizes the difference between the covariance matrices of the learned features across domains. This is analogous to minimizing MMD with a polynomial kernel, but it is more powerful than DDC (which merely aligns the sample means) and easier to optimize than DAN or ReverseGrad, and it can be seamlessly integrated into different layers of a deep network. The CORAL loss is defined as
$$\mathcal{L}_{CORAL} = \frac{1}{4d^{2}}\,\big\| C_S - C_T \big\|_F^{2},$$
where $C_S$ and $C_T$ denote the covariance matrices of the source and target domain datasets, respectively, $d$ is the dimension of each sample, and $\|\cdot\|_F$ denotes the Frobenius norm between the covariance matrices. The covariance matrices are computed as
$$C_S = \frac{1}{n_S - 1}\Big(D_S^{\top} D_S - \frac{1}{n_S}\big(\mathbf{1}^{\top} D_S\big)^{\top}\big(\mathbf{1}^{\top} D_S\big)\Big),$$
with $C_T$ computed analogously from the target feature matrix $D_T$, where $\mathbf{1}$ denotes a column vector whose elements are all equal to 1.
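The CORAL loss is straightforward to sketch in NumPy (a minimal illustration; `np.cov` uses the same $(n-1)$-normalized covariance as the formula above):

```python
import numpy as np

def coral_loss(Ds, Dt):
    """CORAL loss: squared Frobenius distance between the source
    and target feature covariance matrices, scaled by 1/(4 d^2).

    Ds: (n_s, d) source feature matrix
    Dt: (n_t, d) target feature matrix
    """
    d = Ds.shape[1]
    Cs = np.cov(Ds, rowvar=False)   # (d, d) source covariance
    Ct = np.cov(Dt, rowvar=False)   # (d, d) target covariance
    return float(np.sum((Cs - Ct) ** 2) / (4.0 * d ** 2))
```

The loss is zero when both domains share the same second-order statistics and grows with the covariance mismatch, independently of sample order.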

Joint Maximum Mean Discrepancy
To introduce JMMD, we first recall the concept of MMD, which measures the distance between the source and target distributions $P$ and $Q$ in a reproducing kernel Hilbert space:
$$\mathrm{MMD}(P, Q) = \big\| \mathbb{E}_{P}\left[\phi(x^{s})\right] - \mathbb{E}_{Q}\left[\phi(x^{t})\right] \big\|_{\mathcal{H}}^{2}.$$
By matching the joint distributions of the activations of several network layers, the resulting distance metric becomes the joint maximum mean discrepancy (JMMD):
$$\mathcal{L}_{JMMD} = \Big\| \mathbb{E}_{P}\Big[\bigotimes_{l \in \mathcal{L}} \phi^{l}\big(z^{s,l}\big)\Big] - \mathbb{E}_{Q}\Big[\bigotimes_{l \in \mathcal{L}} \phi^{l}\big(z^{t,l}\big)\Big] \Big\|_{\otimes_{l}\mathcal{H}^{l}}^{2},$$
where $\mathcal{L}$ is the set of top-level network layers, $|\mathcal{L}|$ is the number of layers in the matching set, $z^{s,l}$ denotes the $l$-th layer activation generated in the source domain, and $z^{t,l}$ the $l$-th layer activation generated in the target domain.
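For intuition, a single-layer empirical MMD estimate with a Gaussian kernel can be sketched as follows; JMMD extends this by multiplying the kernel matrices of several layers' activations before averaging (a minimal sketch; the kernel bandwidth is an assumption, not a value from the paper):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = (np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-gamma * sq)

def mmd2(Xs, Xt, gamma=1.0):
    """Biased empirical estimate of squared MMD between two samples."""
    return float(rbf_kernel(Xs, Xs, gamma).mean()
                 + rbf_kernel(Xt, Xt, gamma).mean()
                 - 2.0 * rbf_kernel(Xs, Xt, gamma).mean())
```

In JMMD the kernel at each pair of samples is the product of the layer-wise kernels over the matching set, so feature-layer activations and classifier outputs are aligned jointly rather than independently.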
We add these terms to the loss function to achieve domain adaptation for feature transfer between the source and target domains, and the final loss function is set as
$$\mathcal{L}_{MSDA} = \mathcal{L}_{cls} + \alpha\,\mathcal{L}_{CORAL} + \beta\,\mathcal{L}_{JMMD}.$$
As can be seen from the MSDA loss formula, the training process contains two trade-off parameters, $\alpha$ and $\beta$. These two trade-off parameters have an important impact on the accuracy of MSDA, and we will determine their settings based on specific experimental data in Section 4.

Attention_ResNet_BiLSTM model
The key idea behind ResNet is the addition of directly connected channels to the network, i.e., the concept of identity shortcuts.

Traditional residual network
ResNet is a deep-learning method that has received much attention in the past few years [14], and the residual block (RB) is its basic building block. As shown in Figure 4, the RB consists of two ReLU layers, two batch normalization (BN) layers, two convolutional layers, and a shortcut connection; the shortcut connection is what allows ResNet to outperform general ConvNets. In a general convolutional network, the cross-entropy error gradient is back-propagated layer by layer. With identity shortcuts, the gradient can flow effectively to earlier layers near the input, allowing efficient parameter updates. Figure 4 also shows the general structure of ResNet, which is made up of an input layer, convolutional layers, several RBs, a BN layer, a ReLU, a global average pooling (GAP) layer, and a fully connected output layer, and which serves as the basis for the improvements made in this study.
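The effect of the identity shortcut can be illustrated with a toy fully connected residual block in NumPy (a simplified sketch: dense layers stand in for the conv/BN/ReLU stack, and the weights are illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """y = F(x) + x: the block learns only the residual F rather than
    the full mapping, and the shortcut lets gradients pass straight
    through to earlier layers."""
    out = relu(x @ W1)      # stands in for conv + BN + ReLU
    out = out @ W2          # stands in for conv + BN
    return relu(out + x)    # identity shortcut, then activation
```

When the residual branch contributes nothing (zero weights), the block reduces to the identity for non-negative inputs, which is exactly why very deep stacks of such blocks remain trainable.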

Soft thresholding based on attention mechanism
The cutting force signal collected in the experiment contains rich information about changes in tool condition, but it inevitably contains some noise. When the model extracts features from the signal, this noise is not beneficial for monitoring the tool condition. To retain valuable features and remove redundancy, unimportant features are identified through the attention mechanism and set to zero by soft thresholding [7], thereby enhancing the neural network's ability to extract useful information from the cutting force signal. Traditional soft thresholding often requires setting filters based on human experience, and setting filter thresholds requires a great deal of expertise. Deep learning changes this way of thinking: instead of setting thresholds manually, it learns them automatically by gradient descent. The soft thresholding formula is
$$y = \begin{cases} x - \tau, & x > \tau \\ 0, & -\tau \le x \le \tau \\ x + \tau, & x < -\tau \end{cases}$$
where $x$ is the input feature, $y$ is the output feature, and $\tau$ is the threshold (a positive parameter). Unlike the ReLU activation, which sets all negative features to zero, soft thresholding only sets features close to zero to zero, thereby preserving useful negative features.
The soft threshold is inserted into the residual block as a nonlinear layer, as shown in Figure 5. This residual block differs from the classical residual block in Figure 4 in that it contains a special module dedicated to learning the threshold. The absolute value of the feature map $X$ is first subjected to a GAP operation to obtain a one-dimensional vector, which is then passed to a fully connected (FC) layer to obtain a scaling parameter; a sigmoid function then scales it to the range (0, 1):
$$\alpha = \frac{1}{1 + e^{-z}},$$
where $\alpha$ is the scaling parameter and $z$ is the output of the FC layer. The scaling parameter is multiplied by the mean of $|X|$ to obtain the desired threshold:
$$\tau = \alpha \cdot \underset{w,h,c}{\mathrm{average}}\ \big|x_{w,h,c}\big|,$$
where $\tau$ is the threshold and $w$, $h$, $c$ are the width, height, and channel index of the extracted feature map, respectively. The iterative deep learning process keeps the threshold in a reasonable range of values, so that soft thresholding does not set all outputs to zero.
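The thresholding pipeline above (GAP over |X|, FC layer, sigmoid scaling, soft thresholding) can be sketched in NumPy as follows (the one-unit FC weight is illustrative; in the network it is learned by gradient descent):

```python
import numpy as np

def soft_threshold(x, tau):
    """Shrink values toward zero; entries with |x| <= tau become zero."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def attention_threshold(X, w_fc, b_fc=0.0):
    """Derive the threshold from the feature map itself:
    GAP of |X| -> FC -> sigmoid scaling -> tau = alpha * mean(|X|)."""
    gap = np.mean(np.abs(X))             # global average pooling of |X|
    z = w_fc * gap + b_fc                # 1-unit FC layer (illustrative)
    alpha = 1.0 / (1.0 + np.exp(-z))     # sigmoid keeps alpha in (0, 1)
    tau = alpha * gap                    # threshold scaled by mean |X|
    return soft_threshold(X, tau)
```

Because alpha is strictly less than 1, tau is always smaller than the mean of |X|, so soft thresholding never zeroes the entire feature map.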

Bi-directional long short term memory
LSTM is a kind of RNN; plain RNNs were found to suffer from vanishing gradients, exploding gradients, and poor long-range dependencies. The LSTM cell uses an input gate $i_t$, a forget gate $f_t$, and an output gate $o_t$ to control the flow of information. A BiLSTM processes the sequence in both directions and combines the two outputs at each step:
$$h_t = w_1 \overrightarrow{h}_t + w_2 \overleftarrow{h}_t,$$
where $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ denote, at moment $t$, the outputs of the forward LSTM and the reverse LSTM, respectively, $h_t$ is the final output feature vector, and $w_1$, $w_2$ are weight parameters of the BiLSTM to be learned. After the feature vectors are processed by the BiLSTM, two hyperbolic tangent (tanh) functions are stacked as activation functions to monitor the current condition.
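The bidirectional pass can be sketched generically as follows (a minimal illustration: a plain step function stands in for the full LSTM cell, and the combining weights are illustrative, not learned values from the paper):

```python
def bidirectional(seq, step, h0):
    """Run a recurrent step function over a sequence in both
    directions and combine the per-step outputs.

    seq:  sequence of inputs (one item per time step)
    step: function (h, x) -> h, a stand-in for an LSTM cell
    h0:   initial hidden state
    """
    T = len(seq)
    fwd, bwd = [None] * T, [None] * T
    h = h0
    for t in range(T):                 # forward pass
        h = step(h, seq[t])
        fwd[t] = h
    h = h0
    for t in reversed(range(T)):       # backward pass
        h = step(h, seq[t])
        bwd[t] = h
    w1, w2 = 0.5, 0.5                  # illustrative combining weights
    return [w1 * f + w2 * b for f, b in zip(fwd, bwd)]
```

Each output thus sees context from both the past and the future of the signal, which is what lets the BiLSTM capture tool-condition dependencies in both temporal directions.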

Training setup
To rationalize the testing process, we trained each model for 160 iterations. For the first 50 iterations we used only model weight-sharing transfer to obtain a pre-trained model, and then activated the MSDA domain adaptation strategy. Model training and testing alternated. During training we used mini-batch Adam for backpropagation with a batch size of 64, and a stepwise strategy as the learning-rate decay method: the initial learning rate of 0.001 is multiplied by 0.1 at iterations 80 and 120, respectively.
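The stepwise decay described above can be written as a small helper (a sketch matching the stated schedule: initial rate 0.001, multiplied by 0.1 at iterations 80 and 120):

```python
def step_lr(iteration, base_lr=1e-3, milestones=(80, 120), gamma=0.1):
    """Learning rate after 'iteration' steps under stepwise decay:
    the base rate is multiplied by gamma at each milestone passed."""
    lr = base_lr
    for m in milestones:
        if iteration >= m:
            lr *= gamma
    return lr
```

This gives 0.001 for iterations 0-79, 0.0001 for 80-119, and 0.00001 from iteration 120 onward.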

Normalization
Data normalization is a fundamental step in transfer learning that ensures the input values fall within a specific range. It plays a crucial role in reducing differences in data distribution between the source and target domains, enhancing the model's generalization capability, and improving its adaptability to target-domain data. In this study, we employ Z-score normalization, which is calculated as
$$x' = \frac{x - \mathrm{mean}(x)}{\mathrm{std}(x)},$$
where $x$ is the input data, $\mathrm{mean}(x)$ is the mean of $x$, and $\mathrm{std}(x)$ is the standard deviation of $x$.
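In NumPy, this normalization is a one-liner (a minimal sketch of the formula above):

```python
import numpy as np

def zscore(x):
    """Z-score normalization: zero mean, unit standard deviation."""
    return (x - np.mean(x)) / np.std(x)
```

Applying the same statistics consistently to both domains keeps the source and target inputs on a comparable scale before alignment.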

Experimental setup
The TCM experiments were performed on a CNC machining center (DMTG VDL850A), as shown in Figure 6. The workpiece material was AISI 1045 steel with dimensions of L1300 mm × W100 mm × H80 mm. The tools used were three-flute uncoated carbide end milling cutters with a diameter of 10 mm; their chemical properties are given in Table 1. The tool and workpiece were cut dry, so tool wear was relatively rapid; after each tool stroke of 1.5 m, the tool was taken offline and its wear was measured under an industrial microscope.

Selection of cutting force
In the process of milling tool condition monitoring, we use radial cutting force as the input to the model. The radial force is one of the primary components of the cutting force and directly reflects the interaction between the tool and the workpiece during the cutting process. Furthermore, the relationship between radial force and tool condition or workpiece material is more apparent. Additionally, the variation range of radial force is typically larger than that of axial force or tangential force because the relative motion between the tool and the workpiece primarily occurs along the radial direction during the cutting process. Therefore, the radial force exhibits higher sensitivity and provides more information for tool condition detection and monitoring.

Model hyperparameter analysis
Grid search is a commonly used method for hyperparameter tuning; one of its advantages is its simplicity and intuitiveness. From the 3D accuracy plot we also found that when $\alpha$ is fixed and $\beta$ is changed, the accuracy changes little, whereas when $\beta$ is fixed and $\alpha$ is changed, the accuracy changes significantly, indicating that $\alpha$ has a greater influence on MSDA domain adaptation than $\beta$. Fig. 10 shows the results of the model hyperparameter analysis.
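A grid search over the two trade-off parameters can be sketched as follows (the candidate grid and the evaluation function are placeholders, not values from the paper; in practice `evaluate` would train MSDA and return validation accuracy):

```python
import itertools

def grid_search(evaluate, alphas, betas):
    """Exhaustively evaluate every (alpha, beta) grid point and
    return the pair with the highest score."""
    best, best_acc = None, float("-inf")
    for a, b in itertools.product(alphas, betas):
        acc = evaluate(a, b)   # e.g. validation accuracy of MSDA
        if acc > best_acc:
            best, best_acc = (a, b), acc
    return best, best_acc
```

The cost grows as the product of the grid sizes, which is why the observation that accuracy is more sensitive to one parameter can be used to search that axis more finely.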

Attention_ResNet_BiLSTM
To test the efficacy and superiority of the proposed Attention_ResNet_BiLSTM model, we examine the training curves. From Figure 13, we can see that in the first 50 pre-training iterations the accuracy and loss values fluctuate.

MSDA Performance Evaluation
The accuracy and loss values fluctuate greatly in the first 50 pre-training sessions, but after the MSDA domain adaptive approach is introduced at the 51st session, both soon stabilize, which demonstrates the effectiveness of our method for the transfer task. To prevent the training dataset from being too small to support model training and evaluation, we expand the dataset using data overlapping. Data overlapping refers to extracting subsequences of a signal such that neighboring subsequences share a certain overlapping portion. A larger overlap percentage not only provides more training samples but also increases the diversity of samples, which improves the learning ability and stability of the model. We therefore use a 75% overlap ratio. From Fig. 13, it can be seen that the proposed framework exhibits obvious accuracy and loss jitter on the validation set during the D3→D4 transfer task and requires many iterations to stabilize, so the D3→D4 task was targeted for the data expansion experiments.
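Data overlapping as described above amounts to a sliding window whose stride is a fraction of the window length; with a 75% overlap ratio the stride is one quarter of the window (the window length below is illustrative):

```python
import numpy as np

def overlapping_windows(signal, window, overlap=0.75):
    """Split a 1-D signal into fixed-length windows with the given
    overlap ratio between neighboring windows."""
    stride = max(1, int(window * (1.0 - overlap)))
    return np.array([signal[i:i + window]
                     for i in range(0, len(signal) - window + 1, stride)])
```

With 75% overlap, each new window shares three quarters of its samples with the previous one, multiplying the number of training samples roughly fourfold compared with non-overlapping segmentation.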
The experimental results are presented in Figure 14.