Long-range Dependencies Learning Based on Non-Local 1D-Convolutional Neural Network for Rolling Bearing Fault Diagnosis

In the field of data-driven bearing fault diagnosis, the convolutional neural network (CNN) has been widely researched and applied due to its superior feature extraction and classification ability. However, the convolutional operation can only process a local neighborhood at a time and thus lacks the ability to capture long-range dependencies. Therefore, building an efficient learning method for long-range dependencies is crucial for comprehending and expressing signal features, considering that the vibration signals obtained in a real industrial environment always exhibit strong instability, periodicity, and temporal correlation. This paper introduces the non-local mean into the CNN and presents a 1D non-local block (1D-NLB) to extract long-range dependencies. The 1D-NLB computes the response at a position as a weighted average of the features at all positions. Based on it, we propose a non-local 1D convolutional neural network (NL-1DCNN) aimed at rolling bearing fault diagnosis. Furthermore, the 1D-NLB can simply be plugged into most existing deep learning architectures to improve their fault diagnosis ability. Under multiple noise conditions, the 1D-NLB improves the performance of the CNN on the wheelset bearing dataset of a high-speed train and on the Case Western Reserve University bearing dataset. The experimental results show that the NL-1DCNN exhibits superior results compared with six state-of-the-art fault diagnosis methods.


Introduction
ROLLING bearings are pivotal components of rotating machinery; their damage directly degrades the performance of the mechanical system and can cause safety problems as well as enormous economic losses. Long-term operation under adverse conditions can easily cause different kinds of damage such as cracks, abrasion, and gaps. Therefore, health condition monitoring of rolling bearings is crucial to protect the machinery system from safety problems [1].
With the development of the internet of things and the demand for long-term condition monitoring, companies have obtained enormous amounts of industrial data. Since data-driven machine learning methods can extract features of the machinery system from historical data automatically, they have been widely applied in the field of rolling bearing fault diagnosis. In general, traditional diagnosis methods [2][3][4][5] mainly include two steps: (1) feature extraction and (2) fault recognition. Feature extraction [2,6] obtains features that reflect the state of the machine through a feature extraction algorithm. Fault recognition [3,7] uses a classifier algorithm to identify and classify the obtained features. However, manually extracted statistical features can hardly characterize the complex dynamic features of vibration signals. Moreover, most of these classifier algorithms are shallow models, which cannot learn complex non-linear relationships effectively and are thus prone to wrong judgments.
In recent years, deep learning has attracted more and more attention in the field of fault diagnosis [8][9][10][11]. Compared with traditional methods, deep learning methods can extract features from lower to higher levels automatically through multiple nonlinear operations, and thus diagnose with higher intelligence. In particular, the convolutional neural network (CNN) has achieved remarkable success in fault diagnosis tasks due to its unique feature learning mechanism [12][13][14]. For example, Ince et al. [15] proposed a new one-dimensional CNN (1DCNN) for the real-time fault diagnosis of motors. Peng et al. [16] used a 1D deep residual CNN to diagnose the fault status of train wheelset bearings. Chen et al. [17] combined the CNN with an extreme learning machine to improve the fault diagnosis performance of the network. These methods are based on the 1DCNN [15][16][17][18][19][20][21], which mainly takes signals as input and automatically extracts fault features and diagnoses fault types through 1D convolution. In addition, Xia et al. [22] proposed a multi-sensor-based CNN fault diagnosis method that learns spatial and temporal information from multiple sensors simultaneously to obtain better results. Wen et al. [23] used a two-dimensional CNN (2DCNN) to diagnose the health status of various mechanical components. These methods are based on the 2DCNN [22][23][24]; they recombine the 1D signal into a 2D image or time spectrum, and then use a 2D network architecture to obtain the final diagnosis result. However, compared with the 1DCNN, the network structure and operation process required by the 2DCNN are more complex. Therefore, in this paper, we use the 1DCNN to address the fault diagnosis of rolling bearings.
Even though the CNN has been successfully applied to bearing fault diagnosis, it was initially introduced to solve computer vision problems such as image segmentation [25] and face recognition [26]. To accomplish these tasks, the CNN concentrates on the relevant information of a local neighborhood. As a result, the CNN pays insufficient attention to the relevance of long-distance information.
Nevertheless, the vibration signal of rotating machinery is significantly different from an image. It is a temporal signal with strong periodicity. In addition, because of complicated operating conditions, these signals always exhibit strong nonlinearity and instability. Therefore, there is a strong correlation among different time points, and a large quantity of valuable information may be hidden among these periodicities and correlations. For example, as shown in Fig. 1, when a bearing has a local fault, the faulty part and other components produce a periodic short-term impact and excite the bearing system into high-frequency free-decay vibration at its resonance frequency. Therefore, if we only consider the signal within a local region, diagnosis is more likely to be interfered with by random factors [27]. Apart from this, comparing the relationships among the amplitudes of impulse points in different periods and positions is considered practical for fully understanding the information in the signal. The non-local mean (NLM) algorithm was first introduced by Buades et al. [28] in the field of image de-noising. This algorithm first breaks the image into patches of the same size. Then, it replaces the value at one pixel with a weighted average based on the similarity between the patch to which the pixel belongs and other patches. In that way, the NLM can exploit the dependencies between one pixel and all other pixels. Therefore, this method has a strong ability to capture long-range dependencies and has shown extraordinary performance in image de-noising. Besides, the NLM is also widely used in de-noising 1D time-series signals and has achieved impressive results.
The contributions of this paper are summarized as follows: 1) Inspired by the NLM algorithm in the field of signal de-noising, this paper proposes a non-local module based on the 1DCNN for capturing long-term dependencies of signals.
2) The proposed 1D-NLB can be integrated into every 1DCNN as an efficient, simple, and universal component, thereby improving the diagnosis performance of the network.
3) This paper proposes the 1DCNN based on 1D-NLB to diagnose the health status of rolling bearings.
4) The NL-1DCNN has been extensively verified on the wheelset bearing dataset and the Case Western Reserve University (CWRU) bearing dataset [32], achieving better diagnostic results than six state-of-the-art fault diagnosis methods.
The rest of this paper is organized as follows. In Section II, the realization of the NLM algorithm on signals is described. In Section III, the proposed NL-1DCNN is described in detail. Section IV verifies the effectiveness and superiority of the NL-1DCNN. Section V summarizes the whole paper.

Realization of NLM on Vibration Signal
The NLM algorithm for signal de-noising is mainly based on the following procedure. First, a neighborhood block is constructed with each vibration signal point as its center, and structural information similar to the neighborhood block is then searched for over the global range of the signal. Finally, the information is weighted and averaged to eliminate noise in the vibration signal.
Suppose the vibration signal of a faulty rolling bearing is expressed as

y(t) = x(t) + n(t)

where x(t) is the fault impulse signal, n(t) is the noise generated by other factors such as resonance, and y(t) is the observed signal.
The mission of de-noising is to eliminate n(t) from the observed vibration signal y(t) so that the original fault impulse signal x(t) can be recovered. For any position t, the estimate K(t), which is the weighted average of the signal values within a predefined search neighborhood N(t), is given by

K(t) = \frac{1}{Z(t)} \sum_{s \in N(t)} \omega(t, s)\, y(s)

where \omega(t, s) is the weight between the sth searched point and the tth desired point, and N(t) represents the search window centered on position t.
Z(t) = \sum_{s \in N(t)} \omega(t, s)

is the normalizing factor. The weight, as described in [33], is given by

\omega(t, s) = \exp\!\left(-\frac{d^2(t, s)}{\lambda^2}\right), \qquad d^2(t, s) = \sum_{\delta \in \Delta} \big(y(t + \delta) - y(s + \delta)\big)^2

where \lambda is the bandwidth parameter and \Delta represents the local patch of L_\Delta points surrounding position t; the patch surrounding position s also contains L_\Delta points; d^2 is the sum of the squared Euclidean distances between the local patches centered on the signal points t and s. The novelty of NLM is that the weight between two local patches relies on their similarity rather than their physical distance [34]. Therefore, the de-noising process of NLM is non-local.

Fig. 2. Illustration of the universal architecture of the 1D-NLB. "×" denotes batch matrix multiplication and "+" denotes element-wise addition. This module can well capture the long-distance dependencies of the input signal.
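The NLM estimate above can be sketched in a few lines of NumPy. This is a minimal illustration of K(t) with the exponential patch-similarity weight, assuming reflective padding at the signal boundaries; the parameter defaults are illustrative only, not values from the paper.

```python
import numpy as np

def nlm_denoise_1d(y, patch_half=3, search_half=50, lam=1.0):
    """NLM estimate K(t): weighted average of y(s) over the search window N(t).

    The weight between positions t and s decays exponentially with the
    squared Euclidean distance d^2 between the local patches around them.
    Boundary patches use reflective padding (an implementation choice).
    """
    n = len(y)
    pad = np.pad(y, patch_half, mode="reflect")
    out = np.empty(n)
    for t in range(n):
        patch_t = pad[t : t + 2 * patch_half + 1]           # patch around t
        lo, hi = max(0, t - search_half), min(n, t + search_half + 1)
        weights = np.empty(hi - lo)
        for k, s in enumerate(range(lo, hi)):
            patch_s = pad[s : s + 2 * patch_half + 1]       # patch around s
            d2 = np.sum((patch_t - patch_s) ** 2)           # patch distance d^2
            weights[k] = np.exp(-d2 / lam ** 2)             # similarity weight
        out[t] = np.dot(weights, y[lo:hi]) / weights.sum()  # normalized average
    return out
```

Because the weights depend on patch similarity rather than physical distance, points far from t with a similar local structure still contribute strongly to the estimate.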

The Proposed Nl-1dcnn Fault Diagnosis Method
In this section, the generic definition of the non-local operation in the CNN is first introduced. Then we give an instance based on this definition. Finally, the NL-1DCNN aimed at rolling bearing fault diagnosis is introduced in detail.

Definition of 1D Non-Local
Different from the implementation of the NLM algorithm in vibration signal de-noising, the non-local operation in the 1DCNN takes feature signals as input and outputs feature signals containing global feature information. Therefore, we define a generic non-local operation in the 1DCNN as

m_i = \frac{1}{\kappa(n)} \sum_{\forall j} f(n_i, n_j)\, g(n_j)

where i is the index of a position on the output feature signal, whose response is the value obtained by the non-local operation; j is the index that enumerates all possible positions; n is the input feature signal and m is the output, which has the same length as n. The function f computes the dependency between index i and every index j of the signal. The function g computes the response of the input signal at position j. The response is normalized by the factor \kappa(n).
This operation takes the relationship between position i and every position j into consideration and outputs the weighted average of the responses. Therefore, it allows the network to perceive the long-range dependencies among different regions of the input feature signal in a single operation. By comparison, the convolutional operation can only learn features within a local neighborhood whose size equals that of the convolution kernel. Likewise, a recurrent neural network (RNN) can only capture dependencies among neighboring time steps.
The 1D non-local operation is conceptually simple. The basic idea is to calculate the long-range correlation between the current position and all other positions in the input signal, so that the network can capture both the detailed local information and the global information of the input signal at once. In addition, this operation can easily be implemented in a CNN with only a small increase in parameters.
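The generic operation above can be sketched in NumPy (batch dimension omitted for clarity). This is an illustrative sketch, assuming the exponentiated dot-product form of f and a linear embedding g, under which the normalized weights reduce to a softmax over all positions j.

```python
import numpy as np

def non_local_op(n, Wg):
    """Generic 1D non-local operation: m_i = (1/kappa) * sum_j f(n_i, n_j) g(n_j).

    n  : (W, C) input feature signal, W positions, C channels.
    Wg : (C, C) weight matrix of the linear embedding g (learned in practice).
    """
    sim = n @ n.T                               # dot-product similarities, (W, W)
    sim = sim - sim.max(axis=1, keepdims=True)  # shift for numerical stability
    w = np.exp(sim)                             # f(n_i, n_j) = exp(n_i^T n_j)
    w /= w.sum(axis=1, keepdims=True)           # kappa(n) normalization -> softmax
    g = n @ Wg                                  # g(n_j) = Wg n_j, (W, C)
    return w @ g                                # weighted average response m
```

Each output position is a weighted average over all W input positions, which is exactly what gives the operation its global receptive field.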

1D Non-Local Block
According to the above definition, the pivot of the 1D-NLB operation is the function f, which calculates similarity, and the function g, which computes the response. Thus, the realization of these two functions is highly related to the performance of the 1D-NLB. In this paper, for simplicity, we only consider g as a linear transformation:

g(n_j) = W_g n_j

where W_g is a weight matrix to be learned. According to the implementation of non-local operations in [28,31], a natural choice of f is the Gaussian function. For the convenience of capturing the dependencies among different regions in the signal, we define f as

f(n_i, n_j) = e^{n_i^{T} n_j}

where n_i^{T} n_j is the dot-product similarity, which is easy to realize on various neural network platforms and does not add any training parameters. Thus, the normalizing factor is defined as

\kappa(n) = \sum_{\forall j} f(n_i, n_j).

Fig. 2 illustrates the realization of the 1D-NLB in the 1DCNN. n is the input feature signal, n ∈ R^{B×W×C}, where B is the batch size, W is the length of the signal, and C is the number of channels. At the very beginning, n is multiplied by n^T to obtain the matrix v, v ∈ R^{B×W×W}. Then, v is fed into a softmax layer to obtain the dependencies between each position of n and all other positions. The result can be expressed as

\hat{v} = \mathrm{softmax}(n n^{T}).

Meanwhile, n goes through a 1×1 convolutional layer W_1 to halve its channels. After that, it is multiplied by \hat{v} and passed through another 1×1 convolutional layer W_2 so that the number of channels recovers to C. Thus, the output m is calculated by

m = W_2\big(\hat{v}\,(W_1 n)\big).

At last, in order to optimize the feature signal while retaining the original information, we introduce a residual connection on this basis to form a complete 1D-NLB. As a result, the output m is rewritten as

m = n + W_2\big(\hat{v}\,(W_1 n)\big).

The proposed method computes the dependencies between each local region of the input signal and the entire signal. Besides, this information is extracted with only an extremely small increase in training parameters. The 1D-NLB is very simple to plug into most existing 1DCNNs. It can also be embedded after any layer of the network to combine long-range dependencies with short-range information at different levels. Therefore, it allows us to build an architecture with a strong ability to learn the global information contained in the signal.
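The data flow of Fig. 2 can be sketched in NumPy as follows. This is an illustrative re-implementation, not the authors' code: the two 1×1 convolutions are represented as plain weight matrices, and the batch dimension is omitted.

```python
import numpy as np

def nlb_1d(n, W1, W2):
    """1D non-local block with residual connection (NumPy sketch).

    n  : (W, C) feature signal (batch dimension omitted for clarity).
    W1 : (C, C//2) weights of the 1x1 convolution that halves the channels.
    W2 : (C//2, C) weights of the 1x1 convolution that restores the channels.
    """
    v = n @ n.T                                # (W, W) pairwise similarities
    v = v - v.max(axis=1, keepdims=True)       # shift for numerical stability
    v_hat = np.exp(v)
    v_hat /= v_hat.sum(axis=1, keepdims=True)  # softmax over all positions
    half = n @ W1                              # 1x1 conv: halve channels
    mixed = v_hat @ half                       # aggregate over all positions
    m = mixed @ W2                             # 1x1 conv: restore channels to C
    return n + m                               # residual connection
```

Note that the block's output has exactly the shape of its input, which is what makes it pluggable after any convolutional layer.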

Non-Local 1D-Convolutional Neural Network
The 1D-NLB can simply be embedded in the 1DCNN to improve its ability to learn the long-range dependencies of input signals. Based on the 1D-NLB, we propose the NL-1DCNN, which aims at rolling bearing fault diagnosis. The universal architecture of the NL-1DCNN is shown in Fig. 3. The NL-1DCNN takes a 1D vibration signal as input. First, two shallow convolution modules learn the shallow feature information in the signal. Subsequently, a 1D-NLB learns the long-range dependency features of the signal. Through the feature learning of the shallow convolution modules, the input signal of the 1D-NLB encodes enough semantic information, so that the 1D-NLB can capture the temporal correlation in the signal with higher effectiveness and accuracy. This is why two shallow convolution modules are placed before the 1D-NLB. In addition, the NL-1DCNN uses multiple further convolution modules to encode the high-level semantic features of the signal, so that different types of signals are sufficiently distinguishable. Each convolutional module consists of a 1D convolutional layer, a batch normalization layer, and a ReLU activation layer. We implement down-sampling by setting a large convolution stride, which minimizes the corresponding information loss.
For the classification stage, the learned feature is sent to a global average pooling (GAP) [35] layer followed by a softmax activation.
Assuming there are H different classes, the output probability Q_h for class h is calculated by

Q_h = \frac{e^{q_h}}{\sum_{j=1}^{H} e^{q_j}}

where q_h is the hth input of the softmax layer. The diagnosis output is the fault label corresponding to the largest Q_h.
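The classification head can be sketched as follows; a minimal NumPy illustration of the softmax over the H class logits, with the usual max-shift for numerical stability.

```python
import numpy as np

def softmax_probs(q):
    """Q_h = exp(q_h) / sum_j exp(q_j) for a vector q of H class logits."""
    e = np.exp(q - np.max(q))  # shift logits for numerical stability
    return e / e.sum()
```

The predicted fault label is then simply the index of the largest Q_h, e.g. `softmax_probs(q).argmax()`.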
The detailed architecture of the NL-1DCNN is presented in TABLE I. The length of the input signal of the NL-1DCNN is 2048 × 1, which ensures that the input signal contains a complete period. Six convolutional modules are applied in the NL-1DCNN in total. Among them, the first two convolution modules capture the shallow information of the input signal, the 1D-NLB then learns long-range dependency features, and the last four convolution modules learn high-level semantic features. The number of channels of the network's convolution modules gradually increases from 16 to 128. The stride of the first layer is set to 4, and the strides of the other layers are set to 2, so that the feature signal is finally compressed to 16 × 128. Inspired by [16,19,36], we use wide convolution kernels to learn more fault-related features of the signal. To balance the feature extraction capability and the number of parameters of the network, the kernel size gradually decreases, from 24 × 1 down to 3 × 1. The proposed network thus uses large convolution kernels in the shallow layers to obtain sufficient shallow features from the signal. The extracted features are then filtered and abstracted using small convolution kernels in the deep layers to build high-level features that can be used for device health identification.
Apart from this, we use a GAP layer to compress the feature signal into a vector, which dramatically decreases the number of trained parameters compared with using a fully connected layer. The probability is output by the softmax function.
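As a sanity check on the architecture described above, the following sketch traces the feature-signal length through the six convolution modules ("same"-padded convolutions with stride 4 and then stride 2). Only the strides, the channel progression, and the first and last kernel sizes (24 and 3) are stated in the text; the intermediate kernel sizes here are our own plausible assumption for the taper.

```python
import math

def conv_out_len(length, stride):
    """Output length of a 'same'-padded 1D convolution with the given stride."""
    return math.ceil(length / stride)

# (name, kernel, stride, channels); kernels between 24 and 3 are assumed.
layers = [
    ("conv1", 24, 4, 16),
    ("conv2", 16, 2, 32),   # the 1D-NLB is inserted after this layer (NL-1DCNN-2)
    ("conv3", 9, 2, 64),
    ("conv4", 6, 2, 64),
    ("conv5", 4, 2, 128),
    ("conv6", 3, 2, 128),
]

length = 2048                       # input signal length
for name, kernel, stride, channels in layers:
    length = conv_out_len(length, stride)
# length is now 16, channels 128, i.e. the 16 x 128 feature map before GAP.
```

Tracing the strides gives 2048 → 512 → 256 → 128 → 64 → 32 → 16, matching the 16 × 128 feature map stated in the text.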

Experiment Verification
In this section, we perform ablation studies and comparative experiments on the wheelset bearing dataset and the motor bearing dataset from the CWRU to verify the effectiveness and superiority of the proposed non-local operation and fault diagnosis method.

Experiment Setup
Deep-learning-based methods need a large quantity of samples to optimize their parameters, and the process of slicing the training samples with overlap proposed in [16,19] can enormously increase the number of training samples. Therefore, we adopt the same method for data augmentation. The length of each sample is 2048, while the step size of the sliding segmentation is set to 128 in our experiments. 2048 is greater than the number of sampling points in one rotation cycle of the device, so each sample contains complete cycle information.
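The overlapped slicing used for data augmentation can be sketched as follows; `slice_with_overlap` is an illustrative helper, not the authors' code, with the window and step sizes from the text as defaults.

```python
import numpy as np

def slice_with_overlap(signal, win=2048, step=128):
    """Slice a long 1D signal into overlapping training samples.

    Adjacent samples share win - step points, which multiplies the number
    of training samples obtained from one recording.
    """
    n_samples = (len(signal) - win) // step + 1
    return np.stack([signal[i * step : i * step + win] for i in range(n_samples)])
```

For example, a recording of 4096 points yields (4096 − 2048) // 128 + 1 = 17 samples instead of 2 non-overlapping ones.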
The proposed NL-1DCNN is implemented with the Keras library under Python 3.5. The training and testing processes are performed on a workstation with an Intel Core i7-6850K CPU and a GTX 2080 GPU. In addition, in z-score normalization we replaced division by the standard deviation with division by the variance; we found that this makes the network achieve better performance. During training, we adopt the Adam optimizer with a learning rate of 0.0001. The batch size is 196 on the wheelset bearing dataset and 96 on the motor bearing dataset. In this paper, we adopt three generic performance indicators: accuracy, recall, and precision.
To better simulate the strong noise disturbance of bearings in real circumstances, we added additional Gaussian white noise to the raw signals. The SNR is defined as

\mathrm{SNR_{dB}} = 10 \log_{10}\!\left(\frac{P_{\mathrm{signal}}}{P_{\mathrm{noise}}}\right)

where P_signal and P_noise are the power of the signal and the noise, respectively.
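Injecting noise at a prescribed SNR can be sketched as follows; `add_noise_snr` is an illustrative helper, not part of the authors' code. It scales the noise power so that the SNR definition above holds on average.

```python
import numpy as np

def add_noise_snr(x, snr_db, rng=None):
    """Add white Gaussian noise so that 10*log10(P_signal / P_noise) = snr_db."""
    if rng is None:
        rng = np.random.default_rng()
    p_signal = np.mean(x ** 2)                  # signal power
    p_noise = p_signal / (10 ** (snr_db / 10.0))  # required noise power
    return x + rng.normal(0.0, np.sqrt(p_noise), size=x.shape)
```

At SNR = −6 dB the required noise power is 10^0.6 ≈ 3.98 times the signal power, consistent with the noise intensity quoted later in the experiments.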
In this paper, the NL-1DCNN is compared with six state-of-the-art deep-learning-based methods. First, we compare the NL-1DCNN with the dislocated time series CNN (DTS-CNN) proposed by Liu et al. [27]. The DTS-CNN uses a dislocate layer so that the network can, to a certain extent, learn the correlation between different time series in the signal. In the experiments, m, n, and k of the DTS-CNN are set to 10, 512, and 30, respectively, and a dropout layer with a dropout rate of 0.2 is used in the fully connected layer to suppress overfitting. In addition, we compare the NL-1DCNN with an LSTM-based method. The LSTM has a good ability to learn temporal correlation features. In this experiment, the LSTM used has two LSTM cells, with 64 time steps and an input dimension of 32.
Finally, we also selected two state-of-the-art 1DCNN-based fault diagnosis methods, namely the wide first-layer kernels CNN (WDCNN) [19] and the residual-learning-based CNN (ResCNN) [18], which use a wide convolution kernel and a residual network structure, respectively, and two state-of-the-art 2DCNN-based fault diagnosis methods, namely the Wen-CNN [23] and the hierarchical learning rate adaptive deep CNN (ADCNN) [24], which both convert 1D signals into 2D images and then use different 2D network structures to learn fault features. To fairly compare the performance of the different methods, we trained and tested them under the same experimental conditions, and four-fold cross validation is applied to verify the performance of every method.

Data description
The wheelset bearing test rig provides the experimental data. As shown in Fig. 4, the wheelset bearing test rig is mainly composed of a drive motor, a belt transmission device, a lateral loading set, a vertical loading set, and two fan motors. The vertical and lateral loading sets are designed to mimic the two-dimensional loads in real train operation. An axle and its two supporting bearings are assembled on the test rig. Acceleration sensors are used to collect the vibration signals of the rolling bearings. The acceleration sensors are fixed at the 9 o'clock and 12 o'clock positions of the axle box, and the sampling frequency is 5120 Hz. The experimental bearings are double-row tapered roller bearings. The photos and models of these faulty bearings are shown in Fig. 5. These faulty bearings were produced naturally during the operation of high-speed trains. Since various faults occur to wheelset bearings during real operation, 12 typical fault conditions combined with the healthy condition are set. The faults are distributed over the inner race, outer race, rolling elements, and cage of the wheelset bearing, and the severity of the faults differs. The detailed information of the tested wheelset bearings is listed in the accompanying table.
As shown in Fig. 6, the raw vibration signals of the 12 health conditions of the wheelset bearing dataset are displayed. In addition, in order to illustrate the influence of noise on the vibration signal, we show the vibration signal after adding different degrees of noise. As shown in Fig. 7, we added 6 dB, 0 dB, and −6 dB Gaussian white noise to the vibration signals of two fault categories. It can be seen that when a small amount of noise is added, the noise has little effect on the vibration signal. However, when a large amount of noise is added, the original waveform of the vibration signal is completely destroyed by the noise, making it difficult to distinguish. In actual situations, noise is inevitable.
Therefore, in the following experiments, we will also discuss the influence of noise on the deep learning model and the anti-noise performance of our proposed method.

Influence of the position of 1D-NLB
The proposed 1D-NLB can be embedded after any layer of the network to capture the long-range dependencies of the feature signal. However, because the length and semantic level of the feature signal differ between layers, the features learned by the 1D-NLB at these layers also differ. Therefore, embedding the 1D-NLB at different positions in the network brings different diagnostic performance.
In order to explore the impact on performance of embedding the 1D-NLB in different layers of the network, we set up a total of seven network structures in this experiment: the 1DCNN (the same structure as the NL-1DCNN but without the 1D-NLB) and NL-1DCNN-1, NL-1DCNN-2, ..., NL-1DCNN-6, where the number after the name indicates the layer after which the 1D-NLB is embedded. With SNR = −6 dB, we performed experiments on these seven methods. TABLE III and Fig. 8 show the accuracy, recall, and precision of these methods on the wheelset bearing dataset. The experimental results show that the 1DCNN only obtains 76.80% accuracy, 74.30% recall, and 75.56% precision. After adding the 1D-NLB after the first convolutional layer, NL-1DCNN-1 achieves 81.64% accuracy, 80.13% recall, and 82.90% precision, improvements of 4.84%, 5.83%, and 5.75%, respectively. This is a large improvement, which illustrates the effectiveness of the proposed 1D-NLB. NL-1DCNN-2 achieves even better performance: its accuracy, recall, and precision improve by 7.53%, 8.60%, and 8.26% over the 1DCNN, respectively. This shows that the 1D-NLB can encode enough long-distance dependencies from shallow feature signals, allowing the network to achieve better performance.
In addition, we also observed that starting from NL-1DCNN-3, the diagnostic performance of the network decreases compared with NL-1DCNN-2. Furthermore, the performance of NL-1DCNN-6 is even worse than that of the 1DCNN. This shows that the 1D-NLB is very sensitive to its position in the network, and its performance changes with that position. In summary, we can conclude that as the position of the 1D-NLB in the network deepens, performance first increases and then decreases. This phenomenon is readily explained.
The main role of the 1D-NLB is to capture the long-range dependencies of the feature signal, and whether sufficient temporal dependencies can be captured is closely related to the input of the 1D-NLB. When the 1D-NLB is located in a shallow layer, the input feature signal has sufficient length but a low semantic level, so increasing the semantic level of the input signal improves the performance of the 1D-NLB. When the 1D-NLB is located in a deep layer, the length of the feature signal becomes the greater restrictive factor. In particular, the length of the feature signal output by the sixth convolution layer is only 16; in this case, the 1D-NLB can no longer learn any temporally related features from such a short feature signal. As a result, network performance declines from NL-1DCNN-3 onward. Therefore, when designing a 1D-NLB-based fault diagnosis method, it is necessary to balance two key factors: the semantic level and the feature signal length.
In order to understand more clearly the improvement in network performance brought by the 1D-NLB, we use the t-SNE technique [37] to visualize the feature distributions of NL-1DCNN-2 and the 1DCNN in a 2D space. It is worth noting that the only difference between NL-1DCNN-2 and the 1DCNN is that NL-1DCNN-2 contains a 1D-NLB. The visualization results are shown in Fig. 9, where dots of different colors represent different health conditions. According to subfigures A1 and B1, the shallow features of the two networks are equally indistinguishable. Subsequently, the 1D-NLB makes the features of NL-1DCNN-2 more distinguishable than those of the 1DCNN: the features in subfigures B2 and B3 remain clustered together, whereas the degree of dispersion in A2 and A3 is clearly greater, so the features in A2 and A3 are more discriminative. This phenomenon shows that the long-distance dependencies captured by the 1D-NLB help the network distinguish and diagnose different fault categories. It not only proves the validity of the 1D-NLB but also shows that the long-distance dependencies of the signal help the network fully understand its hidden features. Precisely because the 1D-NLB learns features that ordinary CNNs cannot, the network obtains better diagnostic results.

Influence of the number of 1D-NLBs
In order to further explore the impact of the number of 1D-NLBs on diagnostic performance, we add one and two additional 1D-NLBs to the network on the basis of NL-1DCNN-2, named NL-1DCNN-2-1 and NL-1DCNN-2-2, respectively. With SNR = −6 dB, we performed experiments on these three methods. The accuracy, recall, and precision of the three methods are shown in TABLE IV.
We find that the number of 1D-NLBs has little effect on network performance: NL-1DCNN-2, NL-1DCNN-2-1, and NL-1DCNN-2-2 achieve similar fault diagnosis performance. This shows that a single 1D-NLB can capture adequate long-distance dependencies and greatly improve the performance of the network. Although NL-1DCNN-2-1 is slightly better than NL-1DCNN-2, adding more modules also increases the computational burden to a certain extent. Therefore, in the subsequent experiments, the network structure of our proposed method is kept consistent with NL-1DCNN-2.

Effectiveness of 1D-NLB in existing methods
In order to verify the wide applicability of 1D-NLB in the CNN-based fault diagnosis methods, this experiment continues to explore the performance of 1D-NLB in the existing CNN methods. We use the WDCNN as the baseline, and then embed 1D-NLB into different layers of the WDCNN. A total of five different network structures are designed, which are named WDCNN-1, WDCNN-2, ..., WDCNN-5. The number after their name indicates the layer after which the 1D-NLB is embedded. With SNR = −6dB, we performed experiments on these six methods. TABLE V and Fig. 10 show the accuracy, recall and precision of these methods.
Obviously, the proposed 1D-NLB can also effectively improve the fault diagnosis performance of the WDCNN. For example, the accuracy of WDCNN-2 is improved by 4.09% compared with the WDCNN. Consistent with the previous experiments, as the position of the 1D-NLB in the WDCNN gets deeper, the diagnostic performance of the network first increases and then decreases. This again shows that the length and semantic level of the feature signal have a great impact on the performance of the 1D-NLB. In addition, we find that the improvement of WDCNN-2 over the WDCNN is smaller than that of NL-1DCNN-2 over the 1DCNN. This is because the WDCNN uses a very large down-sampling rate in its first convolution layer, which makes the feature signal too short for the 1D-NLB to achieve its best performance. This also shows that, to maximize the performance of the 1D-NLB, a reasonably designed network structure is needed. Even though the WDCNN is not optimized for the 1D-NLB, the module still considerably improves its fault diagnosis performance, which strongly proves the wide applicability of the 1D-NLB. This experiment demonstrates that the proposed 1D-NLB can simply be embedded in other existing CNN architectures to improve their performance, even if these CNNs are not specifically optimized for it. Therefore, the 1D-NLB has very wide application potential and can be used as a general module to improve the performance of most CNN networks.

Compared with state-of-the-art methods on the wheelset bearing dataset
In order to verify the superiority of the proposed NL-1DCNN and explore its performance under different noise conditions, we compare the NL-1DCNN with six state-of-the-art deep-learning-based fault diagnosis methods under three noise levels (SNR = −6 dB, 0 dB, and 6 dB). The results effectively prove the fault diagnosis ability of the proposed method in a weak-noise environment. In addition, when SNR = −6 dB, which means the noise power is 3.98 times that of the raw signal, the NL-1DCNN can still obtain 84.33% fault diagnosis accuracy, which is 11.95% higher than the Wen-CNN. This is good proof that the NL-1DCNN has strong anti-noise performance even without any de-noising preprocessing. In addition, we find that the LSTM, with its long-distance dependency learning ability, performs well on this dataset: at SNR = −6 dB, it obtains a diagnostic accuracy of 81.06%. This also confirms that networks with long-distance dependency learning capabilities can capture more essential signal features and thus obtain better fault diagnosis results when dealing with time-series signals. By contrast, the DTS-CNN only obtained 64.15% accuracy at SNR = −6 dB. This shows that the applicability of the DTS-CNN is unsatisfactory, and it is difficult to adapt it to the fault diagnosis task of the wheelset bearing dataset.
TABLE VI also lists the parameter counts of our method and the comparison methods.
Since we add only one 1D-NLB module, the number of parameters in our model remains relatively small; the proposed method therefore achieves a large performance gain with little parameter increase.
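For reference, the SNR convention above relates decibels to power ratios via SNR = 10·log10(P_signal/P_noise), so SNR = −6dB corresponds to a noise power roughly 3.98 times the signal power, matching the "3.98 times" figure quoted earlier. A minimal sketch of corrupting a signal at a target SNR (assuming additive white Gaussian noise; the function name is illustrative, not from the paper):

```python
import numpy as np

def add_noise(signal, snr_db, rng=None):
    """Add white Gaussian noise so that 10*log10(P_signal/P_noise) = snr_db."""
    if rng is None:
        rng = np.random.default_rng(0)
    p_signal = np.mean(signal ** 2)
    p_noise = p_signal / (10 ** (snr_db / 10))   # target noise power
    noise = rng.standard_normal(signal.shape) * np.sqrt(p_noise)
    return signal + noise

# At SNR = -6 dB, the noise-to-signal power ratio is 10**0.6 ~= 3.98
ratio = 10 ** (6 / 10)
```

This is the standard way noisy test sets are synthesized from clean vibration records: the clean signal's power sets the noise scale, so the same function covers all three conditions (−6dB, 0dB, 6dB) by changing one argument.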

Comparison with state-of-the-art methods on the CWRU bearing dataset
In order to explore the applicability of the proposed method on the CWRU bearing dataset, we compare the NL-1DCNN with six state-of-the-art deep learning methods under three noise conditions (SNR = −6dB, 0dB, and 6dB). TABLE VII shows the accuracy, recall, and precision of these methods.
We find that the NL-1DCNN outperforms the six comparison methods under all three noise conditions. At SNR = 6dB, it achieves a fault diagnosis accuracy of 99.89%. At SNR = 0dB, where the noise power equals the raw signal power, it still achieves 99.17% accuracy. These results show the excellent fault diagnosis performance of the NL-1DCNN. Moreover, the NL-1DCNN performs better on the motor bearing dataset than on the wheelset bearing dataset, obtaining 91.23% accuracy at SNR = −6dB. In addition, the LSTM obtains only 65.27% diagnostic accuracy on this dataset, whereas the DTS-CNN exhibits relatively good results, achieving 88.69% accuracy at SNR = −6dB. Although the performance of the DTS-CNN is still far from that of the NL-1DCNN, this again proves the importance of long-range dependencies for fault diagnosis tasks.
Comparing the performance of these methods on the two datasets, the DTS-CNN and the LSTM are greatly affected by the dataset and perform well only on some of them, whereas the NL-1DCNN achieves excellent performance on both, which shows its good adaptability and, to a certain extent, its application potential in other fault diagnosis tasks of rotating machinery. To show the performance of these methods more clearly, we use t-SNE to visualize the final output distributions of the NL-1DCNN, the LSTM, the DTS-CNN, the WDCNN, the Wen-CNN, the ResCNN, and the ADCNN in a two-dimensional space. The visualization results are shown in Fig. 11, where different colors represent different health conditions of the motor bearings. Clearly, the output distribution of the NL-1DCNN has the best class separation, followed by the DTS-CNN and the Wen-CNN. This is consistent with the results of TABLE VII, which show that the proposed NL-1DCNN performs better on the motor bearing dataset.
In order to better understand the diagnostic performance of the proposed method for each health category, the confusion matrix of the NL-1DCNN at SNR = 6dB is displayed in Fig. 12. Our method distinguishes normal samples from fault samples with 100% accuracy. Moreover, in identifying fault types, the NL-1DCNN recognizes inner race faults and outer race faults with 100% accuracy and accurately identifies the degree of bearing failure. It makes only a few misjudgments in diagnosing ball faults, and these merely confuse one degree of ball fault with another. This shows that our method can accurately distinguish different fault categories, with only occasional misjudgments when determining the degree of a fault.
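The per-class analysis above follows directly from a confusion matrix built over label/prediction pairs. The sketch below is generic (not the authors' code; the class labels are toy values, with class 2 standing in for one degree of ball fault once misjudged as another class):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i, j] counts samples whose true class is i and predicted class is j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def recall_precision(cm):
    # recall: diagonal over row sums; precision: diagonal over column sums
    return np.diag(cm) / cm.sum(axis=1), np.diag(cm) / cm.sum(axis=0)

# toy example: 3 classes, one class-2 sample misjudged as class 1
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 2, 1]
cm = confusion_matrix(y_true, y_pred, 3)
rec, prec = recall_precision(cm)
```

Perfectly recognized classes sit on the diagonal with recall and precision of 1.0, while a single misjudgment between fault degrees shows up as one off-diagonal count, exactly as read off Fig. 12.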

Conclusions
In this paper, we propose the NL-1DCNN for rolling bearing fault diagnosis. The method aims to improve the long-range dependency learning ability of the network so as to fully capture the hidden features of the signals. To this end, we introduce the non-local mean method into the CNN and build a 1D-NLB for capturing long-range dependencies. The basic idea of the 1D-NLB is to compute the long-range correlation between the current position and all other positions, so that the network can quickly capture both local and global information of the input signal. We validate the effectiveness of the method on two bearing datasets, and the experimental results show that the diagnostic performance of the NL-1DCNN is considerably better than that of the six comparison methods. The conclusions are summarized as follows: 1) Long-range dependencies help the network fully understand the hidden information of the signal, and this information is very important for fault diagnosis tasks. 2) The proposed 1D-NLB absorbs the advantages of the non-local mean de-noising algorithm and has excellent learning ability for long-range dependencies. It can be easily embedded in most CNN architectures to improve their fault diagnosis performance.
3) The NL-1DCNN has good fault diagnosis performance, and it has consistent performance on two datasets, which shows its application potential in other fault diagnosis tasks.
In addition, the performance of the proposed method is still relatively low under strong noise, which cannot meet the needs of practical applications. Moreover, in practice it is often impossible to obtain enough fault samples, and the proposed method cannot cope with this situation well. Therefore, in future work, we will focus on improving the model's performance in strong-noise environments and introduce few-shot learning ideas to improve the diagnostic model's performance under limited labeled samples.