LRe Trans Model of Interface Visual Interaction Suitable for Preschooler Robots

Abstract: Traditional contact and non-contact methods for estimating interaction forces and recognizing visual interaction behavior have significant drawbacks with regard to biocompatibility, sensor size, material fragility, and the trade-off between algorithm accuracy and speed. To address these limitations, this study proposes a lightweight, regularized Transformer-based visual interaction behavior recognition method. The method comprises three main parts: image input and slice preprocessing, global semantic representation based on a deep lightweight vision Transformer, and regularized interaction behavior recognition. The model collects and analyzes image data of preschool children through a dynamic window and then realizes the visual interaction process with preschool children through machine interaction. Experiments show that the new method achieves 97.6% accuracy and a 97.5% F1 score for interaction behavior recognition on a large-scale robot interaction dataset, with an average single inference time of only 0.18 seconds. These results indicate that the LRe Trans-based visual interaction behavior recognition method holds clear advantages for the specific problem of robots interacting with preschoolers. The method not only provides a theoretical basis for this field but also offers potential for future applications.


INTRODUCTION
Under current technological conditions, intelligent robots are increasingly prominent in healthcare, education, children's companionship, disaster relief, emergency deliveries, and intelligent services. In healthcare, robots can reduce the burden on medical staff and increase the efficiency of treatment by performing precise assistive maneuvers during surgery or routine tasks in patient care. In education, robots can serve as teaching aids or child companions, facilitating children's learning and social skill development through dynamic interactions. In disaster relief, robots can enter hazardous areas that are inaccessible or difficult for humans to reach and perform rescue tasks. These robots' ability to perceive behavioral information is crucial for achieving precise operations [1][2]. The human-machine interaction interface has to organize multiple functions within limited space. Interface designs commonly employ three glyph-shaped layouts in the form of "T," "口," and "三," as well as vertically symmetrical layouts. In addition, the interface combines text and illustrations to reduce the user's reaction time, simplify operation steps, and lower the difficulty of differentiation. For instance, a robotic device designed for pre-kindergarten children can provide accurate voice responses and behavioral interactions, using visual data from the interface's acquisition system, to address their needs for education and companionship in diverse situations. Behavioral data for robots is obtained through both contact and non-contact approaches, with contact sensing being the predominant method. The robot's foremost end uses miniature force/torque sensors to measure the interaction force and behavior with the manipulated object [3][4]. However, this approach faces numerous limitations, including a large sensor size that causes operational failure, as well as high cost [5][6]. To address these limitations, this study presents the Lightweight Regularized Transformer (LRe Trans), a Visual Interactive Behavior Recognition (VIBR) model built on a regularized Transformer. LRe Trans overcomes the aforementioned limitations and achieves precise and rapid non-contact recognition of Interaction Behavior (IB).
The rest of the paper is organized as follows. Section 2 presents domestic and international research findings related to this study. Section 3 provides a detailed description of the LRe Trans model developed in this research. Section 4 describes the experiments conducted to validate the model and the results obtained. Section 5 gives an overall summary of the study, its limitations, and prospects for follow-up research.

Related Works
The domains of artificial intelligence, electronic information, and human-computer interaction have developed rapidly, producing rich research findings in this area. For the safety of human-robot collaboration in future factory environments, Mazhar O et al. created a real-time safety framework based on static gestures and 3D skeleton extraction. The researchers used a Kinect V2 depth map to remove the background of hand images and used random patterns and architectural templates as replacement backgrounds for data augmentation. The final experimental results showed that the framework did improve the safety of human-robot collaboration [7].
The LRe Trans IB recognition model provides a comprehensive understanding of the interaction situation and is suitable for preschooler robots to better accomplish the interaction. The interface of the preschooler robot needs to place the feedback window for key data at the focal position of the user's vision, so that the user can obtain the most critical information in the shortest time. As depicted on the right-hand side of Fig. 1, the robot's system interface is active during operation. The robot's primary functions, such as capturing images and issuing commands, are organized through the various frames of the system interface. The interface style needs to be unified, while allowing font size adjustment and personalized text icons for easy understanding. The preschooler robot consists of a sensing module and a main control module, with a cute appearance, smooth lines, and rounded corners. The sensing module has three intelligent cubes in blue, green, and yellow; its surface is coated with a frosted finish, and a 1.44-inch true-color screen is set on the front. The main control module has an even cuter design with a prominent color. Its panel carries more functional parts: on the left side are the speakers, evenly distributed as five slanted bars with a built-in metal mesh; on the right side is the knob, whose edge carries volume scale lines and is well damped. A brand label is printed in the center, along with LED indicators that show the different operating states.
To verify the validity of the approach, the raw video must be divided into separate frames and the IB in each frame assigned the appropriate label. The input raw data are defined as shown in Equation (1):

$X = \{x_{1}, x_{2}, \ldots, x_{t}, \ldots, x_{N}\}$ (1)

In Equation (1), $t$ is the timestamp and $N$ is the time step. To capture the dynamic changes between consecutive frames and to reduce the impact of noise and other disturbing factors on the accuracy of the model, the study applies standard deviation normalization to the data and then windows it with a time-window approach. The windowing of the data is performed as in Equation (2):

$w_{r} = \{\tilde{x}_{(r-1)s+1}, \tilde{x}_{(r-1)s+2}, \ldots, \tilde{x}_{(r-1)s+\Delta t}\}$ (2)
In Equation (2), $\Delta t$ denotes the length of the window, $\tilde{x}$ is the standard-deviation-normalized data, $s$ is the moving step of the sliding window, and $w_{r}$ denotes the $r$-th time-window data. Window processing enables the model to find correlations between consecutive frames, leading to better understanding and prediction of IB. Since the baseline architecture of the LRe Trans model constructed in this study is the Swin Transformer framework, the raw data also need to be sliced with the Patch Partition tool after the time-window processing.

Fig. 3 Structure diagram for semantic feature extraction based on deep lightweight visual Transformer

Figure 3 shows the process of dividing the input RGB three-channel image into non-overlapping block feature maps of the same size. These block feature maps are then combined into a window, which is mapped to the feature extraction dimension by Linear Embedding. Within each window, feature extraction is performed using the self-attention mechanism. It is important to note that the new block feature maps keep the image's dimensions divisible. For example, in the first stage, 8 block feature maps are combined and the mapping dimension is set to 128; in the second and third stages these values are set to 16 and 32, and 256 and 512, respectively. As the depth of the network increases, the total number of feature maps is halved each time, while the embedding dimension is doubled. The size of the feature map decreases step by step and its dimension increases gradually, forming a pyramid-like hierarchical structure. The structure of the semantic feature extraction module is shown in Fig. 3. The Swin Transformer Block is the main structure performing global feature extraction; it realizes the semantic representation of image features mainly with the help of the self-attention mechanism. The Block structure is displayed in Fig. 4.
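As a rough illustration of the time-window preprocessing of Equations (1) and (2), the following Python sketch applies standard deviation normalization and sliding-window segmentation to a sequence of frames; the names `normalize_and_window`, `window_len`, and `stride`, as well as the random input, are assumptions made only for this example, not the authors' implementation.

```python
import numpy as np

def normalize_and_window(frames: np.ndarray, window_len: int, stride: int):
    """Standard-deviation normalization followed by sliding-window segmentation.

    frames: array of shape (N, H, W, C) holding N consecutive video frames.
    Returns an array of shape (num_windows, window_len, H, W, C).
    """
    # Standard deviation normalization over the whole sequence.
    mean = frames.mean()
    std = frames.std() + 1e-8          # small constant avoids division by zero
    normalized = (frames - mean) / std

    # Slide a window of length `window_len` with step `stride` over the frames.
    windows = []
    start = 0
    while start + window_len <= len(normalized):
        windows.append(normalized[start:start + window_len])
        start += stride
    return np.stack(windows)

# Example: 100 frames of 224x224 RGB images, windows of 16 frames, step 8.
frames = np.random.rand(100, 224, 224, 3).astype(np.float32)
windowed = normalize_and_window(frames, window_len=16, stride=8)
print(windowed.shape)  # (11, 16, 224, 224, 3)
```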
For an image, the computational amount of standard MSA is given by Equation (3):

$\Omega(\mathrm{MSA}) = 4hwC^{2} + 2(hw)^{2}C$ (3)

In Equation (3), $h$ and $w$ are the height and width of the feature map and $C$ is the depth obtained by Linear Embedding. After the feature maps are fused into windows, the computational amount of W-MSA becomes Equation (4):

$\Omega(\mathrm{W\text{-}MSA}) = 4hwC^{2} + 2M^{2}hwC$ (4)

In Equation (4), $M$ is the size of the fusion window; the amount of computation can be controlled through the window size, which not only reduces the computational load of the model but also effectively improves its scalability. Specifically, Swin Transformer uses the shifted-window technique to address the exchange of information between adjacent windows within the same stage when W-MSA is used. Conventional window slicing partitions the feature map uniformly, whereas the shifted window is sliced starting from the center of the $h$ and $w$ dimensions of the feature map. The cut windows are therefore offset, so that information can be exchanged between windows, which solves the problem that different windows cannot communicate with each other. Swin Transformer also uses a form of relative position encoding: a relative position bias is added when calculating the similarity between the query and the key. Through this design, the resulting windows are no longer regular, there is interaction between windows, and the expressive ability of the model is enhanced. The relative position bias is shown in Equation (5) [21].

 
$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\!\left(\frac{QK^{T}}{\sqrt{d}} + B\right)V$ (5)

In Equation (5), $Q$ denotes the Query matrix and $K$ the Key matrix, both of which have the same dimension $d$; $V$ denotes the Value matrix; and $B$ denotes the relative position bias taken from the offset matrix. The number of patches in a window is $m^{2}$. The offset matrix $B$ is defined as shown in Equation (6):

$B \in \mathbb{R}^{m^{2} \times m^{2}}, \qquad B_{ij} = \hat{B}_{(x_{i}-x_{j},\; y_{i}-y_{j})}, \qquad \hat{B} \in \mathbb{R}^{(2m-1)\times(2m-1)}$ (6)

where $(x_{i}, y_{i})$ are the coordinates of the $i$-th patch within the window, so that the bias for each pair of patches is indexed by their relative position.
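The following PyTorch-style sketch illustrates window-based self-attention with a relative position bias of the kind described by Equations (5) and (6); it is a minimal single-head illustration rather than the authors' implementation, and the class name `WindowAttention` and its parameters are assumptions made for this example.

```python
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    """Minimal window self-attention with relative position bias (single head)."""

    def __init__(self, dim: int, window_size: int):
        super().__init__()
        self.scale = dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

        m = window_size
        # Learnable bias table B_hat with (2m-1)*(2m-1) entries.
        self.bias_table = nn.Parameter(torch.zeros((2 * m - 1) ** 2))

        # Precompute the relative position index for every patch pair in a window.
        coords = torch.stack(torch.meshgrid(
            torch.arange(m), torch.arange(m), indexing="ij"), dim=0)   # (2, m, m)
        coords = coords.flatten(1)                                      # (2, m*m)
        rel = coords[:, :, None] - coords[:, None, :]                   # (2, m*m, m*m)
        rel = rel.permute(1, 2, 0) + (m - 1)                            # shift to >= 0
        index = rel[..., 0] * (2 * m - 1) + rel[..., 1]                 # (m*m, m*m)
        self.register_buffer("index", index)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_windows, m*m, dim), i.e. the tokens of each window.
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale                   # (nW, m*m, m*m)
        attn = attn + self.bias_table[self.index]                       # add bias B
        attn = attn.softmax(dim=-1)
        return self.proj(attn @ v)

# Example: 4 windows of 7x7 patches with 96-dimensional tokens.
attn = WindowAttention(dim=96, window_size=7)
out = attn(torch.randn(4, 49, 96))
print(out.shape)  # torch.Size([4, 49, 96])
```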

Fig. 5 Feature map schematic of multiple image samples

For any normalization layer, the data samples are denoted as a tensor of $N$ samples with $C$ channels, height $H$, and width $W$; the feature map representation of the image samples is shown in Fig. 5. $\mu$ is defined as the mean of the sample and $\sigma$ as the standard deviation of the sample, and the mean and standard deviation are used to normalize the pixels in the feature map with scaling and shifting. The process of passing the sample image pixels through the regularized normalization layer is shown in Equation (7) [22]:
$\hat{p}_{ncij} = \gamma \dfrac{p_{ncij} - \mu}{\sqrt{\sigma^{2} + \epsilon}} + \beta$ (7)

In Equation (7), $p_{ncij}$ and $\hat{p}_{ncij}$ are the pixels before and after normalization, respectively; $\gamma$ is the normalization scale parameter and $\beta$ is the shift parameter; $\epsilon$ is a small constant that controls the range of numerical variation. The only differences among batch, layer, and instance normalization lie in the set of pixels used to determine the mean and standard deviation; all three methods follow the same procedure. Therefore, the mean $\mu_{k}$ and standard deviation $\sigma_{k}$ are defined as shown in Equation (8) [23]:
$\mu_{k} = \dfrac{1}{\left|I_{k}\right|} \sum_{(n,c,i,j)\in I_{k}} p_{ncij}, \qquad \sigma_{k}^{2} = \dfrac{1}{\left|I_{k}\right|} \sum_{(n,c,i,j)\in I_{k}} \left(p_{ncij} - \mu_{k}\right)^{2}$ (8)

In Equation (8), $k$ refers to the different normalization methods, $I_{k}$ is the corresponding set of pixels, and $\left|I_{k}\right|$ is the number of pixels in the set. Based on this, the process of normalizing image samples by Switchable Normalization (SN) is shown in Equation (9) [24]:
$\hat{p}_{ncij} = \gamma \dfrac{p_{ncij} - \sum_{k} w_{k}\mu_{k}}{\sqrt{\sum_{k} w_{k}'\sigma_{k}^{2} + \epsilon}} + \beta$ (9)

In Equation (9), $\mu_{k}$ and $\sigma_{k}^{2}$ are the statistics estimated by the different methods, namely instance normalization, layer normalization, and batch normalization. $w_{k}$ and $w_{k}'$ are weighting ratios used to control the degree of variation in the weighted mean and variance. $w_{k}$ is defined as shown in Equation (10) [25]:

$w_{k} = \dfrac{e^{\lambda_{k}}}{\sum_{z\in\{in,\,ln,\,bn\}} e^{\lambda_{z}}}, \qquad k \in \{in,\, ln,\, bn\}$ (10)

In Equation (10), $in$, $ln$, and $bn$ represent the instance normalization, layer normalization, and batch normalization methods, respectively; $\lambda_{k}$ is a control parameter, which is fed into the softmax function to obtain $w_{k}$. In the online inference phase, when the SN layer is used for forward inference, the statistics for instance normalization and layer normalization are computed individually for each sample, whereas batch normalization uses the average over the batches in each iteration instead of computing a moving average [26]. To compute the batch average, the parameters of the network and of all SN layers are first frozen, and then a small batch consisting of a certain number of randomly selected samples from the training set is fed into the network. The average inference time of the model is used to gauge its computational efficiency, and the widely used overall accuracy and F1 score are adopted as the evaluation metrics for the IB recognition models [27].
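A minimal PyTorch-style sketch of the switchable normalization computation in Equations (8) to (10) is given below, assuming a 4-D input of shape (N, C, H, W); the class name `SwitchableNorm` and the simplified mixing weights are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class SwitchableNorm(nn.Module):
    """Simplified switchable normalization mixing IN, LN, and BN statistics."""

    def __init__(self, channels: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(1, channels, 1, 1))   # scale
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))   # shift
        # Control parameters lambda_in, lambda_ln, lambda_bn (Eq. 10).
        self.lam_mean = nn.Parameter(torch.zeros(3))
        self.lam_var = nn.Parameter(torch.zeros(3))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Statistics of Eq. (8) over the pixel sets of IN, LN, and BN.
        mu_in = x.mean(dim=(2, 3), keepdim=True)                   # per sample & channel
        var_in = x.var(dim=(2, 3), keepdim=True, unbiased=False)
        mu_ln = x.mean(dim=(1, 2, 3), keepdim=True)                # per sample
        var_ln = x.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
        mu_bn = x.mean(dim=(0, 2, 3), keepdim=True)                # per channel over batch
        var_bn = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)

        # Softmax weights of Eq. (10).
        w_mean = torch.softmax(self.lam_mean, dim=0)
        w_var = torch.softmax(self.lam_var, dim=0)

        mu = w_mean[0] * mu_in + w_mean[1] * mu_ln + w_mean[2] * mu_bn
        var = w_var[0] * var_in + w_var[1] * var_ln + w_var[2] * var_bn

        # Weighted normalization of Eq. (9).
        return self.gamma * (x - mu) / torch.sqrt(var + self.eps) + self.beta

# Example: a batch of 8 feature maps with 128 channels.
sn = SwitchableNorm(128)
print(sn(torch.randn(8, 128, 28, 28)).shape)  # torch.Size([8, 128, 28, 28])
```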

Behavior Recognition Effect of Visual Interaction Based on LRe Trans
The study uses the large-scale JIGSAWS dataset for model validation, with a training-validation-test protocol. In addition to comparison trials between the proposed method and various advanced techniques, the model's resilience to different noise disturbances is also verified.

Experimental dataset and parameter configuration
The dataset used in the study is JIGSAWS, an open-source large-scale robot hand interaction dataset. After video frame-splitting and data cleaning, 133,168 consecutive hand interaction images are obtained. To analyze the IB of the robot with preschool children, the study applies the windowing approach described above, which yields an interaction dataset of continuous windowed data. Each window provides a continuous sequence of robot actions, similar to the dynamic sequence during robot-child interaction. The windowed interaction dataset is randomly sampled and divided into 10 groups for training, validation, and testing, and the model is trained and tested on these sets with a ratio of 7:1:1. The data collection process uses a camera to capture images and various contact sensors, including force feedback sensors, to record the interaction forces between the robot and the objects. To analyze children's IBs, algorithms and processing flows are developed to convert the multimodal data, so that the robot's hand movements can be matched and simulated against those of the children during their interactions.
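As an illustration of the split described above, the sketch below randomly partitions a list of windowed samples into training, validation, and test subsets in a 7:1:1 ratio; the function name `split_dataset` and the fixed random seed are assumptions made only for this example.

```python
import random

def split_dataset(windows, ratios=(7, 1, 1), seed=42):
    """Randomly split windowed samples into train/val/test by the given ratio."""
    rng = random.Random(seed)
    indices = list(range(len(windows)))
    rng.shuffle(indices)

    total = sum(ratios)
    n_train = len(windows) * ratios[0] // total
    n_val = len(windows) * ratios[1] // total

    train = [windows[i] for i in indices[:n_train]]
    val = [windows[i] for i in indices[n_train:n_train + n_val]]
    test = [windows[i] for i in indices[n_train + n_val:]]
    return train, val, test

# Example with dummy window identifiers.
train, val, test = split_dataset(list(range(1000)))
print(len(train), len(val), len(test))  # 777 111 112
```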

Fig.6 Training loss and validation loss curve
To determine a suitable value of $\Delta t$, the study conducted parameter validation experiments; the results are shown in Fig. 7. The size of the patch also strongly affects model performance in Transformer-based designs. Therefore, the study compares different sizes such as 2×2, 4×4, and 8×8 to confirm the robustness of the model and to establish the ideal patch size. Fig. 8 illustrates how the size of the image slice affects the LRe Trans model's performance. As seen in Fig. 8, the accuracy, F1 score, and execution time of the visual IB approach are significantly affected by the patch size. The lowest accuracy of 96.19% is achieved when the patch size is 2×2, and the accuracy is highest, 97.63%, when the patch size is 8×8. In terms of execution time, as the patch size increases, the execution time first decreases and then increases; the shortest execution time, only 0.03 seconds, occurs at a patch size of 16×16. This is because the smaller the patch size, the larger the number of slices, which in turn increases the computational load. However, the LRe Trans model maintains more than 96% recognition accuracy and F1 score across multiple slice sizes, indicating good robustness. Taken together, the 8×8 slicing scheme offers the best balance between runtime and model performance; therefore, the patch size is set to 8×8 in the subsequent experiments. In the comparison experiments, the accuracy of TL-CNN is 86.52%, significantly lower than that of the LRe Trans model, while the closest performer, GMM+KF+GRNN, reaches an accuracy of only 96.30%. In terms of F1 score, the LRe Trans model also leads at 97.55%; among the remaining models, only DL-GA-DNN exceeds 90%, with an F1 score of 93.56%. In terms of execution time, the LRe Trans model outperforms all other methods, taking only 0.18 seconds, while the slowest method, JGRHI, takes 4.26 seconds, more than 23 times longer than the LRe Trans model.
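For completeness, the sketch below shows one conventional way to compute the overall accuracy, macro F1 score, and average per-sample inference time used as evaluation metrics in this section; the use of scikit-learn, the `evaluate` function name, and the dummy classifier are assumptions made for illustration only.

```python
import time
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def evaluate(model_fn, samples: np.ndarray, labels: np.ndarray):
    """Return overall accuracy, macro F1 score, and average inference time (s)."""
    predictions, timings = [], []
    for sample in samples:
        start = time.perf_counter()
        predictions.append(model_fn(sample))      # one forward pass per sample
        timings.append(time.perf_counter() - start)
    acc = accuracy_score(labels, predictions)
    f1 = f1_score(labels, predictions, average="macro")
    return acc, f1, float(np.mean(timings))

# Example with a dummy classifier that always predicts class 0.
samples = np.random.rand(100, 16, 224, 224, 3)
labels = np.random.randint(0, 5, size=100)
acc, f1, avg_time = evaluate(lambda s: 0, samples, labels)
print(acc, f1, avg_time)
```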

Fig. 1 Structure of LRe Trans model and robot system interface

The image input module, the deep vision Transformer semantic feature extraction module, and the regularized IB prediction module are the three primary components of the LRe Trans model shown in Fig. 1. The image input module acts like the eyes of the robot: it receives the continuous robotic-arm sensing images and preprocesses their data. The Patch Partition tool in the image input module cuts each image into many small pieces for further processing. These image chunks are then sent to the deep vision Transformer semantic feature extraction module. This module is the core of the LRe Trans model; it receives and processes the image information and extracts the global semantic features of the image data. Finally, the IB prediction module, into which the LRe Trans layer is introduced, accurately predicts and recognizes IBs by fusing the global semantic features passed on by the previous module. After slicing, each part of the data can be feature-coded individually while preserving the original features of the image [19]. The processing flow of Patch Partition is shown in Fig. 2.
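The following PyTorch-style sketch mirrors the three-module pipeline described above at a very high level: patch partition, a Transformer feature extractor, and a recognition head. The module names (`PatchPartition`, `feature_extractor`, `classifier`) and all layer sizes are assumptions for illustration only, not the authors' network.

```python
import torch
import torch.nn as nn

class PatchPartition(nn.Module):
    """Cut an image into non-overlapping patches and embed each patch."""

    def __init__(self, patch_size: int = 8, in_channels: int = 3, embed_dim: int = 128):
        super().__init__()
        # A strided convolution is a common way to slice and linearly embed patches.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, 3, H, W) -> (B, num_patches, embed_dim)
        return self.proj(x).flatten(2).transpose(1, 2)

class LReTransSketch(nn.Module):
    """High-level sketch of the three-module pipeline."""

    def __init__(self, num_classes: int = 10, embed_dim: int = 128):
        super().__init__()
        self.input_module = PatchPartition(embed_dim=embed_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=4, batch_first=True)
        self.feature_extractor = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.classifier = nn.Linear(embed_dim, num_classes)   # IB prediction head

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        tokens = self.input_module(images)            # slice + embed
        features = self.feature_extractor(tokens)     # global semantic features
        return self.classifier(features.mean(dim=1))  # pooled prediction of IB class

model = LReTransSketch()
logits = model(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 10])
```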

Fig. 2 Process flow chart of Patch Partition

3.2 Semantic Feature Extraction Based on Deep Lightweight Visual Transformer

The deep lightweight visual Transformer module proposed in the study draws on and optimizes the architectural ideas of the Swin Transformer to meet the timeliness requirements of practical robotic hand interaction scenarios. The semantic feature extraction module uses a hierarchical stacking architecture composed of three stage modules that progressively reduce the resolution of the input feature map; each stage is built from patch merging and linked Blocks. Compared with the local feature extraction of a CNN structure, this design aims to decrease the computational complexity of the model while expanding the receptive field and enhancing the effectiveness of deep feature extraction. Fig. 3 illustrates the change in image dimensions across the semantic feature extraction module.
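As a rough illustration of the patch merging used between stages, the sketch below concatenates each 2×2 group of neighboring patches and projects the result to a higher dimension, halving the spatial resolution while doubling the channel dimension; the class name `PatchMerging` follows the Swin Transformer convention, but the code itself is only an assumed simplification, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Merge each 2x2 neighborhood of patches: halve H and W, double the channels."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) with H and W divisible by 2.
        x0 = x[:, 0::2, 0::2, :]   # top-left patch of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]   # bottom-left
        x2 = x[:, 0::2, 1::2, :]   # top-right
        x3 = x[:, 1::2, 1::2, :]   # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)   # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))       # (B, H/2, W/2, 2C)

# Example: a 28x28 map of 128-dimensional patch features.
merge = PatchMerging(dim=128)
print(merge(torch.randn(2, 28, 28, 128)).shape)  # torch.Size([2, 14, 14, 256])
```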

Fig. 4 Structural diagram of Swin Transformer Block

As illustrated in Fig. 4, the Block uses an alternating execution strategy of Window Multi-head Self-Attention (W-MSA) and Shifted-Window Multi-head Self-Attention (SW-MSA) to keep computation efficient while preventing information loss at the junctions when images are sliced and feature maps are linked. Computing self-attention within each non-overlapping window significantly reduces the module's computational cost. The computational amount $\Omega(\mathrm{MSA})$ of an image for MSA is shown in Equation (3) [20].
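A minimal sketch of the alternating W-MSA/SW-MSA strategy is given below: consecutive blocks differ only in whether the feature map is cyclically shifted by half a window before window partitioning, which is how shifted windows let neighboring windows exchange information. The helper names `window_partition`, `window_reverse`, and `swin_block_pair` are assumptions, and the attention masking used for shifted windows is omitted for brevity.

```python
import torch

def window_partition(x: torch.Tensor, m: int) -> torch.Tensor:
    # (B, H, W, C) -> (B * H/m * W/m, m*m, C): tokens grouped by window.
    B, H, W, C = x.shape
    x = x.view(B, H // m, m, W // m, m, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, m * m, C)

def window_reverse(win: torch.Tensor, m: int, B: int, H: int, W: int) -> torch.Tensor:
    # Inverse of window_partition: reassemble windows into the full feature map.
    C = win.shape[-1]
    x = win.view(B, H // m, W // m, m, m, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

def swin_block_pair(x: torch.Tensor, attn, m: int) -> torch.Tensor:
    """Apply one W-MSA block, then one SW-MSA block with a cyclic half-window shift."""
    B, H, W, C = x.shape
    shift = m // 2

    # Block 1: regular window attention (W-MSA).
    x = window_reverse(attn(window_partition(x, m)), m, B, H, W)

    # Block 2: shift by half a window, attend, then shift back (SW-MSA).
    shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
    shifted = window_reverse(attn(window_partition(shifted, m)), m, B, H, W)
    return torch.roll(shifted, shifts=(shift, shift), dims=(1, 2))

# Example with an identity "attention" stand-in on a 28x28x96 feature map.
out = swin_block_pair(torch.randn(1, 28, 28, 96), attn=lambda t: t, m=7)
print(out.shape)  # torch.Size([1, 28, 28, 96])
```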
Fig. 6 displays the loss curves of the LRe Trans model on both the training and validation sets for robotic IB recognition. As Fig. 6 illustrates, the loss of the LRe Trans model converges to below 0.2 after roughly 20 training rounds and reaches a minimum below 0.1. The experimental results validate the applicability of the model to the experimental dataset.
Fig. 8 The effect of image slice size on the performance of the LRe Trans model

The paper compares the LRe Trans recognition model with the most recent VIBR approaches and spatio-temporal deep learning algorithms, and conducts validation experiments on a large-scale visual interaction dataset to verify its effectiveness.

Fig. 9 Performance comparison between the LRe Trans method and the latest visual interaction behavior recognition methods

Preschooler robots operate in real-world settings, where interference is introduced by the environment, acquisition equipment, and other factors. To assess the robustness of the LRe Trans model under various noise interferences, the study introduces noise into the experiment. In this experimental setup, each sample in the original test image dataset is subjected to random noise, salt-and-pepper noise, and Gaussian noise, producing test image samples with different levels of noise interference. IB recognition experiments are then performed with the LRe Trans model under these noisy test conditions. The results of the robustness validation under noise interference are shown in Fig. 10. As Fig. 10 shows, the training loss and validation loss of the LRe Trans model increase under salt-and-pepper, random, and Gaussian noise, with salt-and-pepper noise having the greatest impact on the model's performance.
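A small sketch of how such noisy test samples might be generated is shown below, adding Gaussian, salt-and-pepper, and uniform random noise to an image array; the function names and noise levels are hypothetical and are not taken from the paper.

```python
import numpy as np

def add_gaussian_noise(img: np.ndarray, sigma: float = 0.05) -> np.ndarray:
    """Additive zero-mean Gaussian noise; img is expected in [0, 1]."""
    return np.clip(img + np.random.normal(0.0, sigma, img.shape), 0.0, 1.0)

def add_salt_and_pepper_noise(img: np.ndarray, amount: float = 0.02) -> np.ndarray:
    """Randomly set a fraction of pixels to 0 (pepper) or 1 (salt)."""
    noisy = img.copy()
    mask = np.random.rand(*img.shape[:2])
    noisy[mask < amount / 2] = 0.0          # pepper
    noisy[mask > 1 - amount / 2] = 1.0      # salt
    return noisy

def add_random_noise(img: np.ndarray, scale: float = 0.05) -> np.ndarray:
    """Additive uniform random noise in [-scale, scale]."""
    return np.clip(img + np.random.uniform(-scale, scale, img.shape), 0.0, 1.0)

# Example: perturb one normalized 224x224 RGB test image.
image = np.random.rand(224, 224, 3)
noisy_versions = [add_gaussian_noise(image),
                  add_salt_and_pepper_noise(image),
                  add_random_noise(image)]
print([v.shape for v in noisy_versions])
```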

Fig. 11 Relevant evaluation results of 10 evaluation indicators

5. Conclusion

To address the difficulty that conventional contact behavior recognition techniques have in reconciling accuracy and speed for robots designed for preschoolers, a VIBR method was presented. The method relies on the LRe Trans model, which incorporates three stages: image input and slice preprocessing, global semantic representation from a deep lightweight visual Transformer, and regularized IB recognition. The model can competently identify long-memory-span interaction actions and efficiently capture global features. The experimental outcomes showed that the LRe Trans model performed proficiently on the robot interaction dataset: it achieved excellent accuracy and F1 score for IB recognition, 97.6% and 97.5% respectively, with an average inference time of only 0.18 seconds per image. These findings indicate that LRe Trans provides faster inference while matching or exceeding the accuracy of current state-of-the-art spatio-temporal deep learning algorithms. In addition, the LRe Trans model demonstrated strong resilience across different image slice sizes and noise interference conditions. The results show that the LRe Trans model can overcome the weaknesses of traditional machine learning techniques and deep learning algorithms, and it offers a visual interaction scheme with higher precision, faster speed, and better robustness for preschooler robots, which is of great practical value. However, the study relied solely on visual image data; incorporating multimodal heterogeneous data will be considered in future work to further enhance the accuracy and robustness of the behavior recognition model.

Table 1
Detailed information of the dataset