UAV maneuvering decision-making algorithm based on the Twin Delayed Deep Deterministic Policy Gradient algorithm

Aiming at intelligent decision-making of UAVs based on situation information in air combat, this paper proposes a novel maneuvering decision method based on deep reinforcement learning. The autonomous maneuvering model of the UAV is formulated as a Markov Decision Process. The Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm and the Deep Deterministic Policy Gradient (DDPG) algorithm are used to train the model, and the experimental results of the two algorithms are analyzed and compared. The simulation results show that, compared with the DDPG algorithm, the TD3 algorithm has stronger decision-making performance and faster convergence, and is better suited to air combat problems. The proposed algorithm enables a UAV to autonomously make maneuvering decisions based on situation information such as position, speed, and relative azimuth, adjusting its actions to approach and successfully strike the enemy, and provides a new method for intelligent UAV maneuvering decision-making in air combat.


Ⅰ. Introduction
At present, unmanned aerial vehicles (UAVs) are widely used in military applications such as reconnaissance, attack, and jamming. Due to the complexity and variability of the battlefield, future UAVs will need to be capable of autonomous operations. Therefore, maneuvering decision-making algorithms for UAVs in modern air combat have become a popular research subject [1]. Traditional rule-based UAV maneuvering decision-making methods built on genetic algorithms or genetic fuzzy systems rely on human prior knowledge and significantly lack self-learning ability.
As an important paradigm in artificial intelligence, deep reinforcement learning has shown great advantages in solving a variety of problems and has also emerged in many applications in the field of UAV air combat maneuvering decision-making. Part of the research [4,5] combines deep reinforcement learning with traditional methods, such as game theory [4] and particle swarm optimization [5], to make UAV maneuvering decisions. However, traditional methods such as game theory require a clear and complete problem model. In other work [6][7][8][9][10][11][12][13], UAV maneuvering decision-making is realized by deep reinforcement learning alone: the autonomous maneuvering model of the UAV is established as a Markov Decision Process, and the decision function is fitted by a neural network. Through training, the UAV masters the optimal behavior strategy by interacting with the environment. However, existing research on UAV intelligent maneuvering decision-making based on deep reinforcement learning still has the following shortcomings: (1) the simulation environment is mainly a two-dimensional space, so exploration and analysis in the height dimension are lacking; and (2) the impact of radar and weapons on air combat is not considered, making it difficult to apply to a complex battlefield environment.
In response to the above problems, this paper establishes a three-dimensional (3D) UAV air combat model and proposes a UAV maneuvering decision algorithm based on deep reinforcement learning. The remainder of this paper is organized as follows. In Section Ⅱ, a UAV air combat model based on the characteristics of the 3D environment is defined. In Section Ⅲ, an intelligent UAV maneuvering decision-making algorithm based on deep reinforcement learning is proposed. In Section Ⅳ, simulation results demonstrate the effectiveness of the proposed algorithm in the field of air combat.

Ⅱ. UAV air combat model
The following assumptions are made in establishing the UAV motion and dynamics model:
- the UAV is a rigid body;
- the influence of the earth's rotation, revolution, and curvature is ignored; and
- owing to the large maneuvering range and short engagement time in close air combat, the impact of fuel consumption on mass and the effect of wind are ignored.

In 3D space, the UAV has physical descriptions such as position, speed, and attitude. The 3D coordinate system in which the UAV operates is defined as OXYZ: the positive X axis points north, the positive Z axis points east, and the positive Y axis points vertically up. The motion of the UAV is governed by

dX/dt = v cos θ cos ψ
dY/dt = v sin θ
dZ/dt = v cos θ sin ψ
dv/dt = a
dθ/dt = θ'
dψ/dt = ψ'        (1)

where [X, Y, Z] represents the position of the UAV in the coordinate system; v represents the speed of the UAV; θ represents the pitch angle of the UAV, ranging over [-90°, 90°]; ψ represents the heading angle of the UAV, ranging over [-180°, 180°]; dt represents the integration step; a represents the UAV acceleration; θ' represents the pitch angle variation; and ψ' represents the heading angle variation.
The UAV is treated as a mass point when observing its movement. By integration, the three-degree-of-freedom motion equation of the UAV is given in Eq. (1). Limited by the UAV's throttle and overload performance, the maneuvering process of the UAV in 3D space can be realized by setting suitable values of v, a, θ' and ψ'.
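As an illustration, the point-mass update of Eq. (1) can be sketched as a simple Euler integration; the step size and clamping choices below are ours, not values taken from the paper:

```python
import math

def step_uav(state, a, dtheta, dpsi, dt=0.1):
    """One Euler-integration step of the 3-DOF point-mass model (Eq. (1)).

    state = (X, Y, Z, v, theta, psi); angles in radians.
    """
    X, Y, Z, v, theta, psi = state
    # Position update: X points north, Y vertically up, Z east.
    X += v * math.cos(theta) * math.cos(psi) * dt
    Y += v * math.sin(theta) * dt
    Z += v * math.cos(theta) * math.sin(psi) * dt
    # Speed and attitude updates driven by the controls (a, theta', psi').
    v += a * dt
    theta += dtheta * dt
    psi += dpsi * dt
    # Keep pitch in [-90°, 90°] and wrap the heading to [-180°, 180°].
    theta = max(-math.pi / 2, min(math.pi / 2, theta))
    psi = (psi + math.pi) % (2 * math.pi) - math.pi
    return (X, Y, Z, v, theta, psi)
```

For example, a UAV flying level due north at 100 m/s advances 10 m along the X axis in one 0.1 s step.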
The two sides in the battle are modeled in the OXYZ coordinate system, as shown in Fig. 1. The relative position vector between our UAV and the enemy UAV is D, directed from our side to the enemy. The distance between our UAV and the enemy UAV is d. The angle between V and D is the relative azimuth q. Therefore, the combat situation of the two sides can be described by D, d and q, whose mathematical descriptions are

D = P_enemy − P_us        (2)
d = |D|        (3)
q = arccos( (V · D) / (|V| |D|) )        (4)

where P_us and P_enemy denote the position vectors of our UAV and the enemy UAV, and V is our velocity vector.
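The situation variables can be computed directly from the position and velocity vectors; a minimal sketch (function and argument names are ours):

```python
import math

def situation(p_us, p_enemy, v_us):
    """Relative distance d and relative azimuth q (in degrees).

    p_us, p_enemy: position tuples; v_us: our velocity vector.
    """
    # Relative position vector D, directed from our side to the enemy.
    D = [e - u for e, u in zip(p_enemy, p_us)]
    d = math.sqrt(sum(c * c for c in D))
    # q is the angle between our velocity V and D.
    dot = sum(a * b for a, b in zip(v_us, D))
    v_norm = math.sqrt(sum(c * c for c in v_us))
    cos_q = max(-1.0, min(1.0, dot / (v_norm * d)))
    return d, math.degrees(math.acos(cos_q))
```

For example, flying straight at an enemy 1000 m dead ahead gives d = 1000 and q = 0°.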
Ⅲ. Intelligent UAV maneuvering decision-making algorithm based on deep reinforcement learning

A. Task specification
In air combat, the maneuvering decision-making of a UAV plays a significant role in the combat result. After the positions of the UAVs on both sides are initialized, the UAV automatically generates maneuvering decisions from the battlefield situation information using the deep reinforcement learning algorithm, so that it can occupy a favorable position in the air combat. In consequence, lock-on and a preemptive attack on the enemy are realized.
The combat environment in this paper includes the UAVs on both sides of the battle. The entire combat process is divided into three modules: the situation information acquisition module for both sides, the maneuvering decision module based on the deep reinforcement learning algorithm, and the motion module. The situation information acquisition module calculates the situation information and provides it to the decision module for decision-making; the maneuvering decision module generates the maneuvering control quantities with the deep reinforcement learning algorithm and passes them to the motion module to maneuver our UAV; and the motion module updates our position information through the motion equations of the UAV, realizes the maneuver, and feeds the new information back to the situation information acquisition module for calculating the corresponding situation.

B. Related theory
The TD3 algorithm [16] is an actor-critic algorithm that operates over continuous action spaces. The DDPG algorithm [15] is its theoretical basis: DDPG is an actor-critic, model-free algorithm based on the deterministic policy gradient that likewise operates over continuous action spaces. However, DDPG suffers from the shortcoming that the estimated value function is larger than the true value function. To address this, TD3 improves the policy network and value network of DDPG, which makes TD3 perform better than DDPG in many continuous control tasks. The structure of TD3 is shown in Fig. 2.
TD3 maintains two sets of Critic networks, and the minimum Q value of the two networks is selected to calculate the target Q, thereby suppressing continuous overestimation. As in DDPG, the Actor network π_φ is used to output the action. Therefore, the TD3 algorithm contains three sets of six neural networks in total: two Actor networks and four Critic networks. The Actor network is updated only after the Critic networks have been updated several times; this delayed parameter update allows the Actor to make action decisions once the Critic estimates are no longer overestimated.
The action is selected with exploration noise:

a = π_φ(s) + ε,  ε ~ N(0, σ)        (5)

The target Q value is calculated as

y = r + γ min_{i=1,2} Q_{θ'_i}(s', π_{φ'}(s') + ε̃),  ε̃ ~ clip(N(0, σ̃), −c, c)        (6)

The Critic networks are updated by minimizing the loss function

L(θ_i) = N⁻¹ Σ (y − Q_{θ_i}(s, a))²        (7)

The Actor network is updated by the deterministic policy gradient

∇_φ J(φ) = N⁻¹ Σ ∇_a Q_{θ_1}(s, a)|_{a=π_φ(s)} ∇_φ π_φ(s)        (8)

and the target networks are updated softly:

θ'_i ← τ θ_i + (1 − τ) θ'_i,  φ' ← τ φ + (1 − τ) φ'        (9)

Fig. 2. The structure of TD3.
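The clipped double-Q target computation can be sketched as follows, with the target networks passed in as plain callables; the function name and hyperparameter values are illustrative, not the paper's settings:

```python
import random

def td3_target_q(r, s_next, critics_target, actor_target,
                 gamma=0.99, sigma=0.2, clip_c=0.5):
    """Clipped double-Q target of TD3, a minimal sketch.

    critics_target: pair of callables (s, a) -> Q value;
    actor_target:   callable s -> a.
    """
    # Target policy smoothing: add clipped Gaussian noise to the target action.
    noise = max(-clip_c, min(clip_c, random.gauss(0.0, sigma)))
    a_next = actor_target(s_next) + noise
    # Clipped double Q-learning: take the minimum of the two target critics.
    q_next = min(c(s_next, a_next) for c in critics_target)
    return r + gamma * q_next
```

Taking the minimum of the two critics is what suppresses the overestimation that plain DDPG suffers from.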

1) The design of action space and state space
The UAV makes maneuvering decisions based on the situation data and executes the corresponding maneuvers, so that the enemy enters its own missile attack envelope and the combat mission is completed. Therefore, the state space includes the UAV's own information and the obtainable enemy information, while the action space consists of the control quantities of the UAV's actions.
According to Eq. (1) to Eq. (4), the state space of this paper is described by a tuple of eight elements, expressed as the vector

s = [X, Y, Z, v, θ, ψ, d, q]        (10)

where X, Y, Z represent the position of our UAV on the three coordinate axes, v represents the speed of our UAV, θ represents the pitch angle of our UAV, ψ represents the heading angle of our UAV, d represents the distance between us and the enemy, and q represents the relative azimuth of the enemy.
According to Eq. (1), the motion of the UAV is controlled by the acceleration a, the pitch angle variation θ', and the heading angle variation ψ'. Therefore, the action space of the UAV is designed as a tuple of three elements, expressed as the vector

a_t = [a, θ', ψ']        (11)

2) The reward function of the air combat situation
We need to force the enemy UAV into our missile attack envelope to accomplish the mission. The range of the missile attack envelope is determined by the air-to-air missile's maximum attack distance. The time that the enemy has been continuously inside our missile attack envelope is denoted t_in. When Eq. (12) is satisfied, it is considered that our missile is successfully launched, the enemy is destroyed by it, and our combat is successful. The reward function in this paper is composed of continuous rewards and sparse rewards. The continuous reward function is negatively correlated with the relative azimuth and the relative distance. Accordingly, both an angle reward and a distance reward are considered in this paper.
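With a fixed time step, the continuous-lock launch condition reduces to a consecutive-steps check. A minimal sketch, assuming the 2 s lock time and 0.1 s time step used in the simulation settings of Section Ⅳ (the helper name is ours):

```python
def launch_ready(in_envelope_history, time_step=0.1, lock_time=2.0):
    """True once the enemy has stayed inside the missile attack envelope
    for lock_time seconds, i.e. for lock_time / time_step consecutive steps.

    in_envelope_history: list of booleans, one per simulation step.
    """
    need = int(round(lock_time / time_step))  # 20 steps at 0.1 s
    if len(in_envelope_history) < need:
        return False
    # Only the most recent `need` steps matter: the lock must be continuous.
    return all(in_envelope_history[-need:])
```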
The distance reward r_d is given as follows:
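To make the shape of such a reward concrete, the following is an illustrative sketch with made-up coefficients and normalization constants; it is not the paper's actual formula:

```python
def reward(d, q, t_in, d_max=15000.0, t_lock=2.0):
    """Illustrative reward of the stated shape (coefficients are assumptions).

    Continuous part decreases with relative azimuth q (degrees) and
    distance d (metres); a sparse bonus is paid once the enemy has stayed
    in the missile attack envelope for the lock time t_lock (seconds).
    """
    r = -q / 180.0 - d / d_max   # continuous angle and distance rewards
    if t_in >= t_lock:           # sparse reward: launch condition satisfied
        r += 10.0
    return r
```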

3) Algorithm procedure
According to the above definition, the training process of the UAV maneuvering decision-making algorithm based on TD3 is shown in Table 1.

Ⅳ. Experiment and analysis
A. Experimental parameter settings

1) Parameter settings of TD3 algorithm
The parameters of the TD3 algorithm are shown in Table 2, where the number of training rounds is the number of rounds used to train the networks from a given initial state; the maximum simulation step size is the maximum number of actions performed by the agent in one training round (when this number is reached, the round ends); the time step is the time interval at which the agent performs actions; and batch_size is the number of samples drawn from the replay buffer at each training update.
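The replay-buffer sampling that batch_size controls can be sketched as follows; the capacity and batch_size defaults are illustrative, not the values of Table 2:

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal uniform experience replay."""

    def __init__(self, capacity=100000):
        # Oldest transitions are discarded once capacity is exceeded.
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=128):
        # batch_size transitions drawn uniformly at random for one update.
        return random.sample(list(self.buffer), batch_size)
```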

2) Parameter settings of missile and UAV
The parameters of the missile and the UAV are shown in Table 3. It is assumed that the target locking time of the missile is 2 seconds; with a time step of 0.1 s, the missile must lock the target for 20 simulation steps before launch.

3) The structure of the policy network and value network

Fig. 3. The structure of the networks.
As shown in Fig. 3, the policy network (Actor) outputs the maneuvering action based on the current state. According to the state and action spaces of UAV maneuvering decision-making, the Actor network has 8 input nodes and 3 output nodes, and since the activation function of the output layer is the tanh function, the outputs are limited to [-1, 1]. The value network (Critic) is used to evaluate the value of performing the chosen action in the current state; its input layer and output layer have 11 and 1 neurons, respectively. Both the Actor and Critic networks are fully connected neural networks with two hidden layers of 256 neurons each, using the ReLU activation function.
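The layer sizes above can be checked with a bare NumPy forward pass (untrained random weights, purely to verify the shapes; helper names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(n_in, n_out):
    # Small random weights stand in for a trained fully connected layer.
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)

# Actor: 8 state inputs -> 256 -> 256 -> 3 actions, tanh output in [-1, 1].
actor = [dense(8, 256), dense(256, 256), dense(256, 3)]
# Critic: 8 + 3 = 11 inputs (state and action) -> 256 -> 256 -> 1 Q value.
critic = [dense(11, 256), dense(256, 256), dense(256, 1)]

def forward(layers, x, out_act):
    for W, b in layers[:-1]:
        x = np.maximum(x @ W + b, 0.0)  # ReLU hidden activations
    W, b = layers[-1]
    return out_act(x @ W + b)

state = np.zeros(8)
action = forward(actor, state, np.tanh)                 # shape (3,)
q = forward(critic, np.concatenate([state, action]),
            lambda y: y)                                # shape (1,)
```

The tanh output layer is what keeps each action component in [-1, 1] before it is scaled to the physical control ranges.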

B. Simulation experiment and analysis
In this section, the TD3 and DDPG algorithms are applied to the air combat maneuvering decision task through a set of experiments, and the efficiency of the two algorithms is compared. In the experiments, the red side is an intelligent agent that uses a deep reinforcement learning algorithm, and the blue side is a non-intelligent agent that performs fixed maneuvers. The initial distance between the UAVs is 15 km, and the initial relative azimuth is 40°. The parameters and network structure of the DDPG algorithm are the same as those of the TD3 algorithm in Section A.

1) Convergence speed comparison
In order to better evaluate the convergence speed of the algorithms, the total reward obtained in each round was recorded during the experiment to determine whether the reward converges. The change curves of the total reward of the DDPG and TD3 algorithms over 4000 training rounds under the same initial conditions are shown in Fig. 4.

Fig. 4. The change curves of total reward.
As shown in Fig. 4, the TD3 algorithm converged locally within rounds 250 to 2800 and jumped out of this local convergence at around round 2800 to achieve global convergence. The DDPG algorithm reached local convergence in multiple stages and ultimately never escaped it. In the end, the TD3 algorithm converges about 200 rounds earlier than the DDPG algorithm, and the maximum reward value it converges to is greater. At the same time, the DDPG algorithm exhibits considerable jitter during convergence, while the TD3 algorithm is more stable. Therefore, the TD3 algorithm trains faster and achieves better results than the DDPG algorithm.

Fig. 5 and Fig. 6 show the combat process of the UAV approaching the enemy and meeting the launch conditions in different planes. Fig. 5 shows the combat trajectory of the UAV in the horizontal plane. It can be seen from Fig. 5 that after the start of the battle, the blue side, which has no attack capability, moves randomly, and the initial relative azimuth and distance of the blue UAV relative to the red UAV are relatively large. In order to bring the blue side into its missile launch area, the red side first quickly changes heading to reduce the relative azimuth and sets up a tail-chase attack on the blue side.

Fig. 6 shows the altitude change of the UAV during combat. As can be seen from Fig. 6, in the initial state the two sides have a height difference, with the enemy below us; the red side under the TD3 algorithm gradually reduces this height difference during the engagement, while the red side under the DDPG algorithm always keeps a large height difference and stays above the enemy. The decision-making process of both algorithms is to first change direction to reduce the relative azimuth, then shorten the distance, and finally reach an attack situation that satisfies the launch conditions. However, comparing Fig. 5 and Fig. 6, the red side's early turning range under the TD3 algorithm is smaller and the relative azimuth decreases faster. When the launch conditions are finally met, the red side under the TD3 algorithm is closer to the enemy than under the DDPG algorithm, and the relative azimuth is smaller.

2) Test results comparison
A comprehensive comparison of the combat trajectories shows that, compared with the DDPG algorithm, the maneuver strategy generated by the TD3 algorithm enables the red side to meet the launch conditions more quickly and strike the enemy successfully, making it more suitable for actual combat.

Ⅴ. Conclusion
In this paper, a UAV air combat maneuvering decision-making algorithm based on deep reinforcement learning is established. The UAV maneuvering model and the UAV combat model are built mathematically. To make the battlefield environment more realistic, the concept of the missile attack envelope is introduced into the engagement. This paper then realizes UAV air combat maneuvering decision-making based on the DDPG and TD3 algorithms. Experimental results show that, compared with the DDPG algorithm, the TD3 algorithm has faster convergence and stronger optimization ability, and is more suitable for solving the UAV maneuvering decision problem.