Online Professional-Creative Fusion Music Major Students' Classroom State Recognition Based on the Integration of DSC and LeNet-5 Models

Abstract: Online professional-creative fusion education in music majors is becoming increasingly prevalent, but accurately identifying students' classroom states remains a challenge. This research proposes a fusion approach based on depthwise separable convolution (DSC) and a convolutional neural network for recognizing the classroom states of online music major students. First, facial images of students during class are collected through sensor data. A LeNet-5 convolutional neural network, enhanced with DSC, then extracts features from these data and performs classification. Simultaneously, students' in-class behavioral data and assessment information are fused as multimodal data, yielding an integrated result for each student's classroom state. Experimental validation demonstrates that the proposed fusion method performs well in recognizing students' classroom states, with an average F1 score of 0.96, recall of 0.92, recognition accuracy of 94.12%, and recognition time of 2.10 seconds. The method accurately distinguishes whether students are focused, distracted, or not in a class state, providing music educators with an effective tool for understanding students' learning states and delivering personalized teaching management and guidance.


Introduction
The integration and development of professional education with innovation and entrepreneurship education is called professional-creative integration [1]. Since the concept of innovation and entrepreneurship education was introduced into higher education, Chinese universities have gradually tried to deeply integrate entrepreneurship education with professional education and have put forward corresponding reform measures [2]. With the advancement of science and technology, online education has emerged as a new form of education. However, teachers cannot ensure that students maintain a full learning state during the teaching process. Currently, online professional-creative fusion education in music majors is becoming increasingly common, but accurately identifying students' classroom states remains a challenge [3]. Teachers easily overlook students' learning experiences in online teaching, so the intelligent recognition of students' online learning states has become a focal point of current research. Through such recognition, teachers can gain real-time insight into students' classroom states, adjust course content promptly, and ensure maximum course efficiency [4]. Traditional machine-learning facial recognition methods rely heavily on extensive prior knowledge and experience; they generalize poorly and suffer from low efficiency and weak robustness [5]. In addition, collecting facial expressions requires high-precision sensors and devices, which can increase the financial burden on educational institutions. The acquired data may be affected by factors such as illumination, angle, and occlusion, degrading data quality and thus the training effect of the model. In practical applications, students' classroom states may also be affected by various factors, such as the learning environment, course content, and the teacher's teaching style; these factors can reduce the generalization ability of the model and prevent it from accurately identifying students' class states. To this end, this study proposes a method for recognizing the online class states of professional-creative fusion music majors that combines depthwise separable convolution (DSC) with the LeNet-5 convolutional neural network. Music educators can use this method to better understand students' classroom states and provide personalized teaching management and guidance. The innovation of this study lies in fusing the DSC and LeNet-5 models to achieve accurate recognition of online music major students' classroom states.
The rest of the paper is organized in four parts: literature review, method introduction, experimental analysis, and conclusion. These sections respectively survey the current research status, introduce the technical aspects and process of the model design, analyze the performance of the designed model through comparative experiments, and summarize the research while offering prospects for future work.

Related Works
Emotions, as one of the most direct labels for expressing human feelings, play a crucial role in conveying sentiment in real life. In recent years, researchers have explored facial expression recognition as a way to assess individuals' emotional states. Liu W and colleagues explored multimodal emotion recognition algorithms, comparing the recognition capability and robustness of Deep Canonical Correlation Analysis (DCCA) and the Bimodal Deep Autoencoder (BDAE). They introduced noise into the multimodal features and replaced electroencephalogram (EEG) features with noise; experimental results indicated that DCCA was more robust, with a recognition rate of 90.7% [6]. Khare S K and others aimed to achieve accurate automatic emotion classification by constructing an emotion recognition model using EEG signals and CNNs. They transformed EEG signals into images and fed these images into a CNN model for training and recognition; experimental analysis revealed an accuracy of 93.01% for this model [7]. Ashok Kumar P M and team proposed a novel feature method for intelligent facial emotion recognition. They first extracted faces from input images using the Viola-Jones method, employed Affine Scale-Invariant Feature Transform (ASIFT) to extract facial components as features, and reduced the number of descriptors using optimal descriptor selection; the extracted features were then input into a neural network for recognition [8]. Addressing inaccurate facial emotion recognition caused by mask usage during the COVID-19 pandemic, Castellano G and colleagues presented an automatic facial expression recognition system capable of identifying emotions from masked faces. The system focused solely on the eye region and achieved a detection accuracy of 90.12% [9]. Kumari N and team identified problems in existing facial emotion recognition methods, such as poor visibility and excessive noise; experimental analysis demonstrated a Dice index of 90.88% on the dataset [15].
In summary, the literature indicates that facial emotion recognition technology has matured, but its application in online classroom teaching is still limited. Collected data may be affected by factors such as illumination, angle, and occlusion, degrading data quality and thus the training effect of the model. In addition, students' classroom states may be affected by the learning environment, course content, and the teacher's teaching style, which reduces model generalization. The DSC network has shown maturity in facial recognition. To address the limitation of traditional online teaching, which lacks real-time observation and analysis of students' learning states, the study constructed a model for recognizing students' classroom behavior using DSC and the LeNet-5 network.

Integration of DSC and LeNet-5 Model
This research investigates the recognition of online classroom behavior of music major students based on depthwise separable convolution and the LeNet-5 model. By analyzing sensor data generated by students during class, the study proposes an approach combining DSC and the LeNet-5 model to accurately identify students' classroom behavior.

Data Collection and Preprocessing
To bring online education closer to traditional offline classroom education, there is a growing awareness of the need to understand user states and thoughts during online education so as to achieve maximum teaching efficiency [16][17][18]. The study captures students' classroom images from online learning videos at timed intervals, preprocesses them, and then employs the recognition model to identify students' learning states. Teachers integrate the obtained emotional states with the learning behaviors they record in class, as well as subsequent exams, question answering, and so on. This comprehensive approach provides a holistic understanding of students' learning conditions, as depicted in Figure 1, which illustrates the evaluation of students' in-class learning conditions using multiple sources of information. During model training, the cross-entropy loss function is taken as the optimization objective, and stochastic gradient descent (SGD) is used to update the parameters. To accelerate training and prevent overfitting, learning rate decay and early stopping are also used. After many experiments, the hyperparameter configuration was determined: a batch size of 32, a learning rate of 0.001, 50 iterations, and data augmentation by random cropping, rotation, and flipping. To evaluate performance, the dataset was divided into training, validation, and test sets in a 7:1:2 ratio: the training set is used to train the model, the validation set to tune hyperparameters and trigger early stopping, and the test set to evaluate final performance. A total of 1584 images were obtained after filtering out images with inappropriate angles, excessive darkness, and other undesirable qualities. To ensure training accuracy, the image data were preprocessed before training. In the application of neural network models, data preprocessing refers to cleaning, transforming, and normalizing data before feeding it into the network; its goal is to make the raw data more suitable for training, thereby improving the model's performance and generalization capability. Reasonable and effective preprocessing provides more valuable input for the network and enhances the practicality and applicability of the model. The dataset often exhibited inconsistent brightness and contrast across images due to variations in collection devices, camera parameter settings, and shooting environments. To address this, color correction was applied to the images before inputting them into the model, as described by Equation (1), and grayscale normalization was applied to handle uneven lighting, as described by Equation (2). DSC is a special convolutional operation that splits a convolution into two steps: a depthwise convolution followed by a pointwise convolution. This approach extracts features more effectively from facial images with rich texture and color variation while reducing computational complexity and model size [19][20]. The study applied DSC to the convolutional layers of the LeNet-5 network, replacing the original convolutional layers. The design rationale of DSC is to convolve the input image with a depthwise kernel to extract local features, then convolve the resulting feature map with a pointwise (1x1) kernel to achieve global interaction across channels. This design ensures high-quality feature extraction while reducing the computational and parameter complexity of the model. The comparison between DSC and traditional convolution, as well as its structure, is illustrated in Figure 4.
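The two-step operation described above can be sketched in plain NumPy. This is a minimal illustration of depthwise followed by pointwise convolution, not the paper's implementation; the kernel sizes, "valid" padding, and stride of 1 are illustrative assumptions.

```python
import numpy as np

def depthwise_separable_conv(x, dw_kernels, pw_kernels):
    """Depthwise separable convolution on one image of shape (H, W, C_in).

    dw_kernels: (k, k, C_in) -- one spatial filter per input channel.
    pw_kernels: (C_in, C_out) -- 1x1 pointwise mixing across channels.
    Uses 'valid' padding and stride 1; a teaching sketch, not optimized.
    """
    H, W, C_in = x.shape
    k = dw_kernels.shape[0]
    Ho, Wo = H - k + 1, W - k + 1
    # Step 1: depthwise -- each channel is convolved with its own k x k filter
    dw = np.zeros((Ho, Wo, C_in))
    for c in range(C_in):
        for i in range(Ho):
            for j in range(Wo):
                dw[i, j, c] = np.sum(x[i:i + k, j:j + k, c] * dw_kernels[:, :, c])
    # Step 2: pointwise -- a 1x1 convolution mixes information across channels
    return dw @ pw_kernels  # shape (Ho, Wo, C_out)

# Why DSC is lighter: a standard k x k conv mapping C_in -> C_out channels
# needs k*k*C_in*C_out weights; DSC needs only k*k*C_in + C_in*C_out.
k, C_in, C_out = 5, 6, 16
standard = k * k * C_in * C_out          # 2400 weights
separable = k * k * C_in + C_in * C_out  # 246 weights
```

The roughly tenfold reduction in weights for this layer shape illustrates why replacing LeNet-5's convolutional layers with DSC shrinks the model while preserving the two stages (local spatial filtering, then cross-channel interaction).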

 
In Equation (4), the scale and shift parameters are learnable. Furthermore, the research adopts a lightweight attention mechanism to further improve recognition performance. The attention module, placed in the convolutional pooling process, addresses the problem of feature map information being ignored due to differing information proportions by reasonably allocating weights among channels; it introduces few parameters while improving the performance of the convolutional neural network. The original input feature map is globally average pooled to obtain a descriptor for each channel; a fast one-dimensional convolution then produces channel attention, a sigmoid function converts this into per-channel weights, and the initial input features are multiplied by these channel weights to obtain attention-weighted features. This module is therefore stronger at extracting effective information [21][22][23]. The mechanism introduces an attention module after DSC, allowing the model to adaptively focus on key regions of the input image and thereby extract features from facial images more effectively. In summary, the study integrates DSC, the LeNet-5 network, and a lightweight attention mechanism into a novel network architecture aimed at enhancing facial image recognition performance.
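The pool-convolve-sigmoid-rescale pipeline described above can be sketched as follows. This is a hypothetical NumPy rendering of an ECA-style channel attention module consistent with the description (global average pooling, fast 1-D convolution across channels, sigmoid gating); the kernel size and zero padding are assumptions, not the paper's exact settings.

```python
import numpy as np

def channel_attention(feat, conv1d_weights):
    """Lightweight channel attention on a feature map of shape (H, W, C).

    conv1d_weights: (k,) 1-D kernel slid over the channel dimension
    (odd k assumed, zero padding). Returns the reweighted feature map
    and the per-channel weights.
    """
    H, W, C = feat.shape
    k = conv1d_weights.shape[0]
    pad = k // 2
    # 1) Global average pooling -> one descriptor per channel
    desc = feat.mean(axis=(0, 1))                      # shape (C,)
    # 2) Fast 1-D convolution across neighboring channels
    padded = np.pad(desc, pad)
    conv = np.array([np.dot(padded[c:c + k], conv1d_weights) for c in range(C)])
    # 3) Sigmoid maps the responses to per-channel weights in (0, 1)
    w = 1.0 / (1.0 + np.exp(-conv))
    # 4) Rescale the input feature map channel-wise
    return feat * w[None, None, :], w
```

Because the only parameters are the k weights of the 1-D kernel, the module adds almost no parameters, matching the "fewer parameters" claim in the text.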

Multidimensional Data Fusion in the Student Classroom Process
In practical applications, the study uses camera devices on online learning platforms to monitor and identify each student's learning status in real time during class. The research develops a student in-class status recognition model based on an improved LeNet-5 network and data fusion, as shown in Figure 5. Students' behavioral performance is analyzed to understand their participation, interaction, and attentiveness in class, and scores are assigned for classroom performance. Data collected during the learning process, such as online learning time, study duration, and learning efficiency, are used to assess students' effort and engagement. Finally, exam scores and assignment results are collected to evaluate students' learning outcomes and mastery.
Because different data sources have disparate scales, the study normalizes the data using non-dimensionalization: a mathematical transformation that constrains the numerical values of the data within a specific range or converts them into dimensionless pure numbers according to certain rules. This process eliminates dimensional influences, making different variables comparable and ensuring relatively consistent weights among features when they are input into a model. Non-dimensionalization also speeds up model convergence, prevents particular features from exerting excessive influence, and yields a more balanced distribution of weights among features during optimization. Each data source has its own perspective and information, and simply blending sources together can lose important details and features. The study therefore uses data preprocessing and cleaning techniques to eliminate biases and inconsistencies between sources. The stability and reliability of the data fusion results were evaluated through cross-validation and sensitivity analysis, with gradual adjustments made to avoid homogenization effects [24][25][26]. In machine learning algorithms such as neural networks, non-dimensionalization is commonly applied during data preprocessing to optimize model performance. The calculation is shown in Equation (5).
x′_ij = (x_ij − min_i{x_ij}) / (max_i{x_ij} − min_i{x_ij}),  i = 1, 2, …, n,  j = 1, 2, …, m  (5)
In Equation (5), x_ij denotes the j-th indicator value of the i-th student, n is the number of students, and m is the number of indicators.
Fig. 6 Principles of the analytic hierarchy process and the entropy method
As shown in Figure 6, the analytic hierarchy process (AHP) is a subjective evaluation method that builds a hierarchical model from the experience and expertise of decision makers. The complex problem is decomposed into levels and factors, and the relative importance of each factor is determined by pairwise comparison. The core of AHP is to establish a judgment matrix and obtain each factor's weight by computing the matrix's eigenvector and maximum eigenvalue. The entropy method, in contrast, is based on the characteristics of the data itself: it determines the weight of each factor by calculating the entropy and information utility value of the data. To combine the advantages of the two methods, a subjective-objective comprehensive weighting approach is used to determine the weight of each data type. By integrating the above data sources, the study gains a more comprehensive understanding of students' learning status and performance, providing more accurate teaching feedback and guidance for teachers and students. Based on the predictions of the LeNet-5 model, the study classifies and scores students' in-class statuses. The LeNet-5 model is a relatively simple CNN with relatively low computational complexity [27][28]. However, when the model is improved for new tasks, additional layers or adjusted parameters may increase that complexity. In this study, the data to be processed include students' facial expressions, behavioral performance, and academic performance, and the data volume is large, so training and inference can be relatively complex; the data preprocessing steps themselves also add computation. In view of this, the study reduces computational complexity while maintaining performance by adjusting parameters such as the number of layers, the convolution kernel size, and the stride. In addition, data compression is used to reduce the size and dimensionality of the input data, further lowering the model's computational cost.
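The normalization and weighting steps above can be sketched briefly. The snippet assumes Equation (5) is the usual min-max transformation and implements the entropy method as described; the blending coefficient `alpha` in `combined_weights` is a hypothetical choice, since the paper does not spell out its exact subjective-objective combination rule.

```python
import numpy as np

def min_max_normalize(X):
    """Column-wise min-max scaling to [0, 1] -- the standard
    non-dimensionalization. X: (n students, m indicators)."""
    mn, mx = X.min(axis=0), X.max(axis=0)
    span = np.where(mx > mn, mx - mn, 1.0)  # guard constant columns
    return (X - mn) / span

def entropy_weights(Xn):
    """Objective indicator weights via the entropy method. Xn must be
    normalized and non-negative; more dispersed indicators (lower
    entropy) receive larger weights."""
    n, m = Xn.shape
    P = Xn / (Xn.sum(axis=0, keepdims=True) + 1e-12)  # proportions per indicator
    logP = np.where(P > 0, np.log(P + 1e-12), 0.0)
    e = -(P * logP).sum(axis=0) / np.log(n)           # entropy, in [0, 1]
    d = 1.0 - e                                       # information utility value
    return d / d.sum()

def combined_weights(w_ahp, w_entropy, alpha=0.5):
    """Blend subjective AHP weights with objective entropy weights.
    alpha is an assumed mixing coefficient, not from the paper."""
    w = alpha * np.asarray(w_ahp) + (1.0 - alpha) * np.asarray(w_entropy)
    return w / w.sum()
```

A fused classroom-state score for each student is then the weighted sum of that student's normalized indicators (emotion score, behavior score, assessment score) under the combined weights.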

Performance Analysis of Recognition Model
With the emergence of online integrated education, recognizing students' learning statuses has become a focal point of research, and the study constructs an in-class status recognition model accordingly. To validate the effectiveness and reliability of the proposed model for recognizing music major students' in-class status, the research conducts the following experiments and analyses.

Performance Analysis of the Model
To validate the effectiveness of the improved network, its training was first compared with that of the baselines: the LeNet-5 network with fused DSC stabilized after 31 iterations, while the traditional LeNet-5 stabilized after 39 iterations, suggesting that the improved LeNet-5 network converges better. To evaluate the emotion recognition performance of the designed algorithm, the study used both the original and improved versions of the LeNet-5 network to recognize emotions in the test data, as depicted in Figure 8. As seen in Figure 8(a), the original network achieved a prediction accuracy of 72.86%, whereas the improved network achieved 84.96%, with higher accuracy across most emotion labels. Both models predicted the "happy" label well, suggesting that the features of happy faces are more distinct than those of other emotions; for the remaining emotions, the improved network consistently outperformed the original model. To further validate the performance of the designed emotion recognition model, it was compared with three other models under increasing amounts of test data. From Figure 9(a), it can be observed that as the test data increase, the test accuracy of all models gradually decreases, with Model 1 exhibiting the smallest decline. Specifically, the average recognition accuracy of Model 1 is 92.84%, compared with 89.47% for Model 2, 85.34% for Model 3, and 88.43% for Model 4.
Figure 9(b) indicates that the error range of Model 1 is significantly lower than that of the other three models, fluctuating primarily between 0.02 and 0.06. To compare the models further, an experiment recorded the F1 value, recall, recognition accuracy, and recognition time for the four models; the results are presented in Table 1. The average values of these four metrics for Model 1 are 0.96, 0.92, 94.12%, and 2.10 seconds. Relative to Model 1, the other three models show decreases of 0.08 in F1 and 0.10 in recall, a drop of 4.85% in accuracy, and an increase of 1.86 seconds in recognition time, respectively. Model 1 thus demonstrates superior emotion recognition capability, enabling efficient and accurate identification of students' class participation.
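For clarity, the indicators compared in Table 1 can be computed as below. This is an illustrative calculation for a binary "in the target state vs. not" labeling, not the paper's evaluation script; in the multi-class setting the same quantities would be averaged over classes.

```python
import numpy as np

def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, recall, and F1 for a binary labeling.

    Recall measures how many truly positive samples were found;
    F1 is the harmonic mean of precision and recall.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))  # true positives
    fp = np.sum((y_pred == positive) & (y_true != positive))  # false positives
    fn = np.sum((y_pred != positive) & (y_true == positive))  # false negatives
    accuracy = np.mean(y_true == y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, recall, f1
```

Recognition time, the fourth indicator in Table 1, is measured separately as wall-clock inference time per batch of images.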

Analysis of Model Application Effect
The study evaluates the application of the proposed emotion recognition method for assessing students' class participation, recording emotion states during class on a scale of 0 to 1. To assess practical application, the study used a camera to capture images of students in two sessions of professional-creative fusion classes and conducted emotion recognition analysis; the results are shown in Figure 10. Teachers can analyze students' learning situations after class and adjust their teaching accordingly. After fusing data from multiple sources and applying the model, a comprehensive evaluation of students' learning is obtained. To test the application effect of the model, the study applied it to a professional-creative fusion class in University A's music department. Students' learning conditions were evaluated over a week, teachers adjusted their courses based on the evaluation results, and learning was assessed again in the second week; the results are compared in Table 2. From Table 2, it can be observed that in the first week the learning states of students 1, 4, 7, and 9 were better than those of the other six students, with student 4 scoring highest. Students 2 and 3 had moderate learning states, while students 5, 6, 8, and 10 had poorer ones. After targeted instructional corrections based on the teacher's evaluations, students with initially good learning states maintained high performance in the second week, and students with initially moderate or poor learning states showed some improvement.
To further test the performance of the student classroom state recognition method based on fused emotion recognition and classroom data (Method 1), this study compares it in detail with existing methods for identifying student classroom states: the single-emotion recognition method of reference [29] (Method 2), the behavioral analysis method of reference [30] (Method 3), the physiological-signal-based method of reference [31] (Method 4), and the speech recognition method of reference [32] (Method 5). The comparison results are shown in Table 3, where the real-time performance index is obtained from ratings by the corresponding class's students and teachers, with a maximum score of 5 points; the higher the score, the better the real-time performance. To further evaluate the generality and reliability of the proposed fusion method in different settings, experiments were conducted through external validation from independent sources. The research team selected schools with different educational backgrounds and environments and collected classroom data in different classroom settings. These data were then analyzed and processed with Method 1 to identify students' classroom statuses. To ensure fairness and accuracy, the research team invited educational experts and external evaluation institutions to independently evaluate the experimental results, focusing on the accuracy, timeliness, and applicability of the identification methods in different educational contexts. The real-time performance scoring standard is consistent with Table 3, and the suitability index uses a percentage scale; the higher the score, the higher the applicability. The experimental results are shown in Table 4.
LeNet-5's concise structure gives it better robustness when handling these complex factors. To verify the rationality of choosing LeNet-5, LeNet-5, VGG16, and ResNet50 were trained and tested on the same dataset; the results are shown in Table 5. As shown in Table 5, the difference in recognition accuracy among the three algorithms is small, with VGG16 and ResNet50 holding only a slight advantage over LeNet-5. However, in terms of real-time performance and computing time, LeNet-5 shows obvious advantages; a longer computing time reflects higher model complexity. Therefore, considering both model performance and computing-resource demands, LeNet-5 is the most appropriate choice.
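The computational gap in Table 5 follows directly from parameter counts, which can be estimated with simple bookkeeping. The layer shapes below are the classic LeNet-5 convolution stages, used here purely for illustration; they are not claimed to be the paper's exact improved configuration.

```python
def conv_params(k, c_in, c_out, bias=True):
    """Weights in one standard k x k convolutional layer."""
    return k * k * c_in * c_out + (c_out if bias else 0)

def dsc_params(k, c_in, c_out, bias=True):
    """Weights for the same layer as depthwise separable convolution:
    a k x k depthwise step plus a 1x1 pointwise step."""
    return k * k * c_in + c_in * c_out + (c_in + c_out if bias else 0)

# Classic LeNet-5-style stages: 1 -> 6 and 6 -> 16 channels, 5x5 kernels.
stages = [(5, 1, 6), (5, 6, 16)]
standard_total = sum(conv_params(*s) for s in stages)  # 2572 weights
dsc_total = sum(dsc_params(*s) for s in stages)        # 306 weights
```

By contrast, VGG16 and ResNet50 stack dozens of much wider convolutional layers, which is why their computing time is far higher for a similar accuracy on this task.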

Conclusions
Online education, with advantages such as being unrestricted by time and location, has gradually become a new form of education. However, it lacks the personalized teaching found in traditional classrooms. To address this, the study proposed a new method for student emotion recognition using LeNet-5 and DSC, along with a multi-source data fusion model for assessing and analyzing students' in-class states. From the experimental results, the following conclusions can be drawn: (1) The model achieved an average F1 value of 0.96, a recall of 0.92, an accuracy of 94.12%, and a recognition time of 2.10 seconds; the proposed method efficiently and accurately identifies students' in-class states.
(2) The LeNet-5 network with fused DSC converged significantly faster, requiring only 35 training iterations, than the LeNet-5 network without DSC.
(3) By fusing information from multiple sources, the study obtained a comprehensive evaluation of students' learning, which was applied to a professional-creative fusion class at University A.
Evaluating the model's application effect by comparing learning conditions over two weeks revealed that, in the second week, well-performing students improved by over 4%, while poorly performing students improved by over 30%. This method effectively enhances students' learning states and provides real-time feedback for teachers.

Fig. 1 Evaluation of students' in-class learning status based on multi-source information
During model training, data from the PubFig dataset of Columbia University's biometric facial database were used as the training and test sets.
Fig. 2 Principles of the cubic interpolation method
The arrows in Figure 2 indicate the use of the accompanying equation to calculate the pixel at the position marked by the frame line of the image, yielding the new pixel value after calculation. Additionally, to address uneven lighting and significant differences in image contrast, all input images underwent grayscale normalization; the research employed grayscale histogram processing, as described by Equation (2).
Fig. 3 Network structure of LeNet-5
The traditional LeNet-5 network faces challenges in accurately and robustly recognizing facial images because of their complexity and diversity. To address this, the study combined DSC with the LeNet-5 network, proposing a new architecture to enhance facial image recognition performance. The LeNet-5 architecture performs well in image recognition tasks, but for more complex and diverse facial images its performance may be limited. Decomposing the traditional convolution into depthwise and pointwise steps significantly reduces the number of parameters and the computational complexity of the model.
Fig. 4 DSC versus traditional convolution and its structure
The proposed model builds on the traditional LeNet-5 with three convolutional layers and pooling layers, as well as a fully connected layer and an output layer. To address the gradient explosion problem and enhance resistance to overfitting, Batch Normalization (BN) layers were added after each convolutional structure. During training, the Adam optimizer was employed. The batch normalization algorithm is defined by Equation (3).
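The BN step referenced as Equation (3) is the standard normalize-then-rescale operation, which can be sketched as follows; gamma and beta are the learnable scale and shift parameters. This is the textbook formulation, shown here for clarity rather than as the paper's exact code.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization over a mini-batch of shape (batch, features).

    Normalizes each feature by the batch mean and variance, then applies
    the learnable scale (gamma) and shift (beta). The small eps keeps the
    division numerically stable and helps avoid exploding gradients.
    """
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta
```

Placing this after each convolutional structure keeps activations in a stable range during training, which is what mitigates gradient explosion and overfitting.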
Fig. 5 A class status identification model based on the improved LeNet-5 network and data fusion
First, facial expression data of students are analyzed with the deep learning network to calculate emotional scores, evaluating learning status from the captured facial images. Additionally, the study analyzes students' behavioral performance.

Fig. 7 Comparison of the training situations of the three network structures
As shown in Figure 7(a), the CNN model required 89 training iterations to achieve satisfactory recognition performance, while the traditional LeNet-5 reached high recognition accuracy after 38 iterations. The LeNet-5 network with fused DSC achieved the desired metrics after only 35 iterations. Figure 7(b) indicates that the LeNet-5 network with fused DSC stabilized after 31 iterations.

Fig. 8 Comparison of emotion recognition effects of the LeNet-5 network before and after improvement

Fig. 9 Identification accuracy and error of the model under different data quantities
Fig.10 Learning status results in the student class based on emotion recognition From Figure 10(a) and Figure 10(b), it is evident that at the beginning of the class, over 80% of the students are highly focused.However, after approximately 20 minutes into the class, most students exhibit varying degrees of changes in their learning states, indicating a decline.Nevertheless, overall, many students maintain a good learning state throughout the class duration.As shown in Figure 10(c) and Figure 10(d), the learning states of four individual students vary, providing an intuitive display.Teachers can analyze students' learning situations post-class and adjust their This article has been accepted for publication in a future issue of this journal, but it is not yet the definitive version.Content may undergo additional copyediting, typesetting and review before the final publication.Citation information: Haobo Lin, Online Professional-Creative Fusion Music Major Students' Classroom State Recognition Based on the Integration of DSC and LeNet-5 Models, Journal of Artificial Intelligence and Technology (2024), DOI: https://doi.org/10.37965/jait.2024.0504 This article has been accepted for publication in a future issue of this journal, but it is not yet the definitive version.Content may undergo additional copyediting, typesetting and review before the final publication.Citation information: Haobo Lin, Online Professional-Creative Fusion Music Major Students' Classroom State Recognition Based on the Integration of DSC and LeNet-5 Models, Journal of Artificial Intelligence and Technology (2024), DOI: https://doi.org/10.37965/jait.2024.0504avoiding the problem of gradient explosion.

Table 1 Comparison results of performance indicators for several models

From Table 1, it is evident that the average values of the four metrics for Model 1 are an F1 score of 0.96, a recall of 0.92, a recognition accuracy of 94.12%, and a recognition time of 2.10 seconds. Compared with Model 1, the other models perform worse on these metrics.
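As a concrete illustration of how such metrics are obtained, the sketch below computes accuracy, recall, and F1 from a handful of per-frame state labels. The label names and data here are invented for illustration only; they are not the paper's dataset.

```python
def classification_metrics(y_true, y_pred, positive):
    """Precision, recall, and F1 for one target class (one-vs-rest)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical ground-truth and predicted states for six frames
y_true = ["focused", "focused", "distracted", "absent", "focused", "distracted"]
y_pred = ["focused", "distracted", "distracted", "absent", "focused", "distracted"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
p, r, f1 = classification_metrics(y_true, y_pred, "focused")  # f1 -> 0.8 here
```

In a multi-class setting such as focused/distracted/absent, the per-class scores would then be averaged (e.g. macro-averaged) to yield the single figures reported in the table.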

Table 2 Results of students' learning status evaluation within two weeks

Table 3 Comparison of student classroom state recognition methods

Table 4 External validation of the independent sources

Compared with more advanced CNN models such as VGG and ResNet, LeNet-5 has a relatively simple structure, yet it still performs well in many tasks, especially image recognition problems. The design of LeNet-5 takes the limitations of computing resources into account, enabling efficient training and inference even with limited resources. VGG and ResNet may have stronger feature extraction capabilities, but they demand substantially more computation.
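The efficiency argument behind depthwise separable convolution can be made concrete by counting parameters. The sketch below compares a standard convolution layer with its depthwise separable counterpart; the layer sizes (5x5 kernels, 6 input channels, 16 output channels) are assumptions loosely modeled on LeNet-5's second convolution stage, and biases are omitted.

```python
def standard_conv_params(k, c_in, c_out):
    # One k x k filter spanning all input channels, per output channel
    return k * k * c_in * c_out

def dsc_params(k, c_in, c_out):
    # Depthwise stage: one k x k spatial filter per input channel
    # Pointwise stage: 1 x 1 convolution mixing channels
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 5, 6, 16          # assumed LeNet-5-scale layer
std = standard_conv_params(k, c_in, c_out)   # 2400 parameters
sep = dsc_params(k, c_in, c_out)             # 246 parameters
reduction = std / sep                        # roughly 9.8x fewer
```

Even at this small scale the separable form needs roughly an order of magnitude fewer parameters, which is why grafting DSC onto a compact backbone like LeNet-5 keeps training and inference cheap.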

Table 5 Comparison of different CNN models for student classroom state recognition

The recognition results also help teachers adjust their teaching strategies to improve overall teaching quality. Future research could integrate more practical teaching scenarios, explore more effective data fusion methods, and refine model application strategies to better promote students' learning development. To further improve and extend the model, deeper network structures such as ResNet or VGG could be adopted to increase the accuracy of emotion recognition, and GPUs or dedicated hardware accelerators could be used to speed up model inference.
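To make the DSC building block itself concrete, here is a minimal NumPy sketch of a stride-1, valid-padding depthwise separable convolution: a per-channel spatial filter followed by a 1x1 pointwise mix. This is an illustrative reference implementation under assumed shapes echoing a LeNet-5-like stage, not the paper's actual code.

```python
import numpy as np

def depthwise_separable_conv(x, dw_kernels, pw_weights):
    """Valid-padding, stride-1 depthwise separable convolution.

    x:          (C_in, H, W) input feature map
    dw_kernels: (C_in, k, k) one spatial filter per input channel
    pw_weights: (C_out, C_in) 1x1 pointwise mixing weights
    """
    c_in, h, w = x.shape
    _, k, _ = dw_kernels.shape
    out_h, out_w = h - k + 1, w - k + 1

    # Depthwise stage: filter each channel independently
    dw = np.zeros((c_in, out_h, out_w))
    for c in range(c_in):
        for i in range(out_h):
            for j in range(out_w):
                dw[c, i, j] = np.sum(x[c, i:i+k, j:j+k] * dw_kernels[c])

    # Pointwise stage: 1x1 convolution mixes channels across the map
    return np.tensordot(pw_weights, dw, axes=([1], [0]))

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 14, 14))   # assumed 6-channel 14x14 input
dwk = rng.standard_normal((6, 5, 5))
pww = rng.standard_normal((16, 6))
y = depthwise_separable_conv(x, dwk, pww)   # shape (16, 10, 10)
```

In practice a framework layer (e.g. Keras's `SeparableConv2D` or a PyTorch `Conv2d` with `groups=in_channels` followed by a 1x1 `Conv2d`) would replace these explicit loops, but the two-stage factorization is the same.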