U-shaped Vision Transformer and Its Application in Gear Pitting Measurement

Although convolutional neural networks (CNNs) have become the mainstream segmentation models, the locality of convolution prevents them from effectively learning global, long-range semantic information. To further improve segmentation performance, we propose the u-shaped vision Transformer (UsViT), a model based on Transformer and convolution. Specifically, residual Transformer blocks are designed in the encoder of UsViT, taking advantage of both the residual network and the Transformer backbone. Moreover, transpositions in each Transformer layer enable information interaction between spatial locations and feature channels, enhancing feature-learning capability. In the decoder, different dilation rates are introduced in each convolutional layer to enlarge the receptive field. In addition, residual connections are applied to make information propagation smoother when training the model. We first verify the superiority of UsViT on the public Automatic Portrait Matting dataset, on which it achieves 90.43% Acc, 95.56% DSC, and 94.66% IoU with relatively few parameters. Finally, UsViT is applied to gear pitting measurement in a gear contact fatigue test, and the comparative results indicate that UsViT can improve the accuracy of pitting detection.


I. INTRODUCTION
Gears are widely used motion- and power-transmission components in mechanical equipment, and they are prone to failure under poor working conditions. Moreover, pitting is the main failure mode of gears, and it has been detected by vibration-based methods in past years [1][2][3]. However, it is difficult for vibration-based methods to detect gear pitting quantitatively. The gear pitting area ratio is a key metric for evaluating the degree of failure, especially in the gear contact fatigue test [4].
To calculate the gear pitting area ratio, machine vision is a feasible tool [5]. In machine-vision-based methods, the key to measuring the gear pitting area ratio is the precise segmentation of the effective tooth surface and the pitting from the acquired gear image. However, gear pitting is generally irregular, which brings great challenges to traditional computer vision techniques such as threshold segmentation and edge segmentation [6][7][8].
Convolutional neural networks (CNNs) have shown excellent feature-extraction ability and good semantic segmentation performance in recent years, so they have been successfully applied to automatic detection in various fields [9][10][11]. Zhang et al. [12] proposed a simple but efficient segmentation model for road-area extraction by introducing residual connections into U-Net. Ding et al. [13] proposed an improved algorithm based on the encoder-decoder framework of U-Net to accurately segment common defects such as untilled corners, scratches, and dirt in magnetic disc quality detection.
Du et al. [14] proposed a seismic crack recognition method based on ResU-Net and a dense CRF model, improving efficiency and accuracy of detection on a seismic image dataset. Li et al. [15] proposed a ship detection method based on U-Net++ and a multiple side-output fusion algorithm, addressing the complicated backgrounds and various ship sizes in satellite remote sensing images. Transformer [16] has achieved great success in natural language processing (NLP) over the last few years owing to its powerful ability to learn long-range feature information. ViT [17]

II. METHOD

A. ARCHITECTURE OVERVIEW
As depicted in Fig.
X'_l = MSA(LN(X_{l-1})) + X_{l-1},
X_l = MLP(LN(X'_l)) + X'_l,

where X_{l-1} and X_l are the input and output feature maps of the l-th Transformer layer, X'_l is the output feature map of the MSA module, LN denotes layer normalization, and the output of the last layer is the output feature map of the Transformer block.
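The two residual updates above can be sketched in a few lines of NumPy. The MSA and MLP sublayers are passed in as placeholder callables here, since this is only an illustration of the residual pre-norm structure, not of the full model:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the channel dimension (last axis).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def transformer_layer(x, msa, mlp):
    # X'_l = MSA(LN(X_{l-1})) + X_{l-1}
    h = msa(layer_norm(x)) + x
    # X_l = MLP(LN(X'_l)) + X'_l
    return mlp(layer_norm(h)) + h

# Toy check: with zero-output sublayers, the residual paths
# pass the input through unchanged.
x = np.ones((4, 8))
y = transformer_layer(x, msa=lambda t: t * 0.0, mlp=lambda t: t * 0.0)
```

With real MSA/MLP sublayers the residual additions are what keep gradients flowing through deep stacks of such layers.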
In the Transformer layer, the multi-head self-attention [22][23] is formulated as:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V,

where Q, K, and V are calculated as the products of three learnable weight matrices with the input feature maps, and d_k denotes the channel dimension of K.
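As an illustration, the attention formula can be sketched in NumPy for a single head; the toy shapes and random weights below are ours, not taken from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    d_k = K.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

# One head on a toy sequence of 4 tokens with d_k = 8.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = attention(X @ Wq, X @ Wk, X @ Wv)
```

Multi-head attention repeats this with several independent weight triples and concatenates the per-head outputs.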

III. EXPERIMENTS

A. DATASET
To verify the superior performance of UsViT, we first conduct experiments on a public dataset.
Automatic Portrait Matting [24] is a portrait segmentation dataset collected from Flickr, which contains 2000 images with high-quality portraits.
With a resolution of 800 × 600, these images are randomly split into 1500, 200, and 300 for training, validation, and testing, respectively. In addition, the labels were produced by closed-form [25] and KNN [26] matting to ensure the high quality of the dataset.

B. IMPLEMENTATION DETAILS
UsViT was trained on a computing platform

C. EXPERIMENTAL RESULTS
The

IV. APPLICATION

A. ACQUISITION OF GEAR PITTING IMAGES
As shown in Fig. 3, the first step in gear pitting measurement is to acquire gear pitting images. The experimental device is presented in Fig. 4. The left of

C. EXPERIMENTAL RESULTS
Test results on the Gear Pitting dataset are listed in Table III. From the table, we can note that the segmentation performance of U, respectively, which indicates that our approach can achieve better segmentation performance and save computational cost. What's more, the Re (average Re) of UsViT is only 6.78%, which is smaller than those of the other segmentation models. Therefore, the proposed UsViT is more suitable for calculating the gear pitting area ratio.

After several Transformer blocks, we obtain fine-grained feature maps, which are subsequently reshaped to H/16 × W/16 × C. The reshaped feature maps are then fed into a progressive upsampling decoder consisting of four identical stages, each upsampling by a factor of 2, to reach the full resolution of H × W. The resolutions and dimensions of the different stages are shown in Fig. 1. Each stage consists of a Decoder layer, which includes a transposed convolutional layer and four convolutional layers with residual connections. Furthermore, instead of ordinary convolutions, different dilation rates (1, 3, 5, 7) are introduced in the convolutional layers to enlarge the receptive field. In addition, with the help of residual connections, multiscale feature information from the different dilated convolutions is fused, which boosts semantic segmentation performance. Finally, the output feature map of the decoder is processed by a 1 × 1 convolutional layer with a softmax activation function to predict the pixel-level segmentation mask.
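The effect of the dilation rates (1, 3, 5, 7) on the receptive field can be checked with a short calculation. We assume 3 × 3 kernels here, which this excerpt does not state explicitly:

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field of stacked stride-1 convolutions.

    A k x k conv with dilation d covers the same span as a
    d*(k-1)+1 kernel, so each layer grows the stack's receptive
    field by d*(k-1).
    """
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += d * (k - 1)
    return rf

# Four 3x3 conv layers, as in one Decoder stage:
dilated = receptive_field([3] * 4, [1, 3, 5, 7])  # -> 33
plain = receptive_field([3] * 4, [1, 1, 1, 1])    # -> 9
```

Under this assumption, the dilated stack sees a 33-pixel-wide context versus 9 pixels for plain convolutions, with the same parameter count.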

with an NVIDIA GTX 2080Ti GPU, Python 3.6, and TensorFlow 2.1.
Due to the limitation of computing resources, the input images were resized to 256 × 256. During training, Adam was used as the optimizer and the total number of epochs was set to 120. The initial learning rate was set to 0.0001 and the batch size to 8. In the experiments, we used cross-entropy as the loss function. Accuracy (Acc), Dice similarity coefficient (DSC [27]), and intersection over union (IoU [28]) are used as evaluation metrics on the test set.
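The three metrics can be sketched for binary masks as follows. This is a minimal NumPy version; the paper's exact implementation may differ, e.g. in how multi-class results are averaged:

```python
import numpy as np

def seg_metrics(pred, gt):
    """Pixel accuracy, Dice coefficient, and IoU for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()          # true positives
    acc = (pred == gt).mean()                    # pixel accuracy
    dsc = 2 * tp / (pred.sum() + gt.sum())       # Dice similarity
    iou = tp / np.logical_or(pred, gt).sum()     # intersection / union
    return acc, dsc, iou

# Toy 1x4 masks: one of two foreground pixels is recovered.
gt = np.array([[1, 1, 0, 0]])
pred = np.array([[1, 0, 0, 0]])
acc, dsc, iou = seg_metrics(pred, gt)
# acc = 3/4, dsc = 2/3, iou = 1/2
```

Note that DSC weighs the overlap more generously than IoU (DSC = 2·IoU/(1+IoU)), which is why reported DSC values are typically higher than IoU on the same predictions.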

Fig. 2 demonstrates some segmentation results of portrait images obtained by various segmentation models. As depicted in Fig. 2, the segmentation results of UsViT are closer to the ground truth than those of the other models. Especially at the edges of the segmentation results, UsViT shows smooth details similar to the ground truth. This indicates that UsViT has a stronger feature-representation learning ability and better segmentation performance. In short, UsViT takes advantage of both convolution and Transformer, realizing the interaction of local and global semantic information, so it obtains better segmentation results.

Fig. 2. Segmentation results on the Automatic Portrait Matting dataset.

D. ABLATION STUDY

In this experiment, we conducted an ablation study on Automatic Portrait Matting, removing one factor at a time to investigate the contributions of the different factors. There are four variants of the base model: A1, without residual connections in the encoder; A2, without transpositions in the Transformer layer; A3, without residual connections in the Decoder layer; A4, without dilation rates in the Decoder layer. Table II shows the results of the ablation study. It is obvious that the residual connections in the encoder and the dilation rates in the Decoder layer contribute most to the improvement of segmentation performance. Residual connections in the encoder make feature-information propagation smoother. Different dilation rates in the Decoder layer enlarge the receptive field and aggregate feature information. UsViT combines the advantages of both factors, thereby achieving the best segmentation performance.

Fig. 4 shows a general view of the test rig on the left, while the right illustrates the test gearbox and the vision measuring device. To acquire images of the gear teeth online, a vision measuring system was designed. First, we used a transparent plexiglass plate as the upper cover of the test gearbox so that photographs of the gear teeth could be taken clearly. Then, to facilitate adjustment of the shooting angle, a CCD industrial camera was fixed on a flexibly adjustable bracket, and an LED light source was used. Through gear contact fatigue experiments and the vision measuring system, 800 gear pitting images were collected. To complete the Gear Pitting dataset, we made the corresponding labels with the Labelme image annotation tool. The resolution of the pitting images is 256 × 256, and the ratio of training, validation, and testing images is 7:1:2. Except that the number of epochs was set to 160 on the Gear Pitting dataset, the other implementation details are the same as for the Automatic Portrait Matting dataset.

Fig. 3. Flow chart of the application in gear pitting measurement.

B. RELATIVE ERROR

In addition to the three metrics of Acc, DSC, and IoU, the relative error of the gear pitting area ratio (Re) is also employed to evaluate segmentation
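A minimal sketch of how such a relative error could be computed, assuming Re compares the predicted and ground-truth pitting area ratios; the exact definition is truncated in this excerpt, so the formula below is our assumption:

```python
import numpy as np

def pitting_area_ratio(pitting_mask, tooth_mask):
    # Ratio of pitting pixels to effective tooth-surface pixels.
    return pitting_mask.sum() / tooth_mask.sum()

def relative_error(pred_ratio, true_ratio):
    # Re = |r_pred - r_true| / r_true  (assumed definition).
    return abs(pred_ratio - true_ratio) / true_ratio

# Toy 10x10 tooth surface: true pitting covers 10%, prediction 8%.
tooth = np.ones((10, 10))
true_pit = np.zeros((10, 10)); true_pit[:2, :5] = 1
pred_pit = np.zeros((10, 10)); pred_pit[:2, :4] = 1
re = relative_error(pitting_area_ratio(pred_pit, tooth),
                    pitting_area_ratio(true_pit, tooth))
# re = |0.08 - 0.10| / 0.10 = 0.2
```

Because Re is a ratio of ratios, small per-pixel segmentation errors near the pitting boundary translate directly into the measurement error, which is why edge quality matters for this application.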

Fig. 5 demonstrates some segmentation results of gear pitting images obtained by various segmentation models. As depicted in Fig. 5, UsViT takes advantage of convolution and Transformer

Fig. 5. Segmentation results on the Gear Pitting dataset.

V. CONCLUSION

In this paper, we propose UsViT, an efficient and powerful framework based on Transformer and convolution. Specifically, residual Transformer blocks are designed in the encoder of UsViT, taking advantage of both the residual network and the Transformer backbone. Moreover, transpositions in each Transformer layer enable information interaction between spatial locations and feature channels, enhancing feature-learning capability. In the decoder, different dilation rates are introduced in each convolutional layer to enlarge the receptive field. In addition, residual connections are applied to make information propagation smoother when training the model. The experiments on the public Automatic Portrait Matting dataset verify the advantages of the proposed UsViT, which achieves 90.43% Acc, 95.56% DSC, and 94.66% IoU with relatively few parameters. Finally, UsViT is applied to gear pitting measurement in a gear contact fatigue test, and the comparative results indicate that UsViT can improve the accuracy of pitting detection.
comparison of the proposed UsViT model with U-Net and its variants on Automatic Portrait Matting is shown in Table I. It is apparent from the

Table I. Segmentation performance on the Automatic Portrait Matting dataset

Table II. Ablation study on the impact of different factors

Table III. Segmentation performance on the Gear Pitting dataset