An End-to-End Decision-Making Method for Autonomous Driving Based on Twin Delayed Deep Deterministic Policy Gradient with Discrete Actions
-
Abstract: Reinforcement-learning-based decision-making methods for vehicle driving suffer from low learning efficiency and non-smooth action output. To address these problems, an end-to-end decision-making method for autonomous driving is developed that fuses networks with different action spaces: the Twin Delayed Deep Deterministic Policy Gradient with Discrete actions (TD3WD) algorithm. An additional Q network that outputs discrete actions is added to the network of the baseline Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm to assist exploration during training. The output actions of the TD3 network and the additional Q network are fused by weighted averaging, and the fused action interacts with the environment, so that the environment is explored more fully and efficiently. When the Critic network is updated, the output of the additional network is merged into the target action as noise, which encourages the agent to explore the environment and makes the action-value estimates more accurate. Instead of raw images, image features extracted by a pre-trained network are used as the state input, reducing the computational cost of training. The proposed method is validated in autonomous-driving scenarios simulated on the Carla platform. The results show that in the training scenarios the proposed method learns more efficiently, converging about 30% faster than baseline algorithms such as TD3 and Deep Deterministic Policy Gradient (DDPG); in the test scenarios it performs better after convergence, reducing the average lane-crossing rate and the steering-angle variation by 74.4% and 56.4%, respectively.
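The action-fusion scheme described in the abstract can be summarized in a few lines. The sketch below is illustrative only: the discrete action set, the decay of the weight alpha, and the calling conventions of actor and q_net are all assumptions, not the authors' published code; the reading that the additional Q network scores one candidate discrete action at a time follows from its 1-dimensional output in Table 1 below.

```python
# A minimal sketch of TD3WD action fusion, assuming actor(state) and
# q_net(state, action) are callables returning NumPy values.
import numpy as np

# Hypothetical discrete action set: (steer, throttle, brake) triples.
DISCRETE_ACTIONS = np.array([
    [-0.5, 0.5, 0.0],   # steer left
    [ 0.0, 0.8, 0.0],   # accelerate straight ahead
    [ 0.5, 0.5, 0.0],   # steer right
])

def fused_action(state, actor, q_net, alpha):
    """Blend the continuous TD3 action with the best discrete action.

    alpha weights the discrete action (0.6 initially, per Table 2) and
    would decay as training progresses, so that the TD3 actor gradually
    takes over; the decay schedule is an assumption here.
    """
    a_cont = actor(state)                              # 3-dim continuous action
    scores = [q_net(state, a) for a in DISCRETE_ACTIONS]
    a_disc = DISCRETE_ACTIONS[int(np.argmax(scores))]  # greedy discrete action
    return alpha * a_disc + (1.0 - alpha) * a_cont     # weighted fusion
```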
-
Table 1. TD3WD network structure

Network | Layer | Dimension | Activation
Actor / Target Actor | Fully connected 1 | 256 | ReLU
Actor / Target Actor | Fully connected 2 | 256 | ReLU
Actor / Target Actor | Fully connected 3 | 128 | ReLU
Actor / Target Actor | Fully connected 4 | 64 | ReLU
Actor / Target Actor | Fully connected 5 | 3 | ReLU
Critic / Target Critic | Fully connected 1 | 256 | ReLU
Critic / Target Critic | Fully connected 2 | 256 | ReLU
Critic / Target Critic | Fully connected 3 | 128 | ReLU
Critic / Target Critic | Fully connected 4 | 64 | ReLU
Critic / Target Critic | Fully connected 5 | 1 | /
Eval Q / Target Q | Fully connected 1 | 256 | ReLU
Eval Q / Target Q | Fully connected 2 | 256 | ReLU
Eval Q / Target Q | Fully connected 3 | 1 | /
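Under the assumption that the implementation uses PyTorch (the paper does not state the framework), Table 1 maps onto plain fully connected stacks such as the following. The state dimension is a placeholder, since the state is a feature vector from a pre-trained network whose size is not given here, and feeding the action in at the first layer of the Critic and Q networks is likewise an assumption.

```python
# A sketch of the Table 1 architectures; layer widths and activations
# follow the table, everything else is assumed.
import torch.nn as nn

def mlp(in_dim, hidden, out_dim, out_act=None):
    """Stack fully connected layers with ReLU between them."""
    layers, prev = [], in_dim
    for h in hidden:
        layers += [nn.Linear(prev, h), nn.ReLU()]
        prev = h
    layers.append(nn.Linear(prev, out_dim))
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

state_dim = 512   # placeholder: size of the pre-trained image feature vector
action_dim = 3    # assumed meaning: steering, throttle, brake

# Actor / Target Actor: 256-256-128-64-3, ReLU on every layer
actor = mlp(state_dim, [256, 256, 128, 64], action_dim, nn.ReLU())

# Critic / Target Critic: 256-256-128-64-1, no output activation;
# state and action concatenated at the input (assumption).
critic = mlp(state_dim + action_dim, [256, 256, 128, 64], 1)

# Eval Q / Target Q: 256-256-1, no output activation; scores one
# candidate discrete action at a time (see the sketch after the abstract).
eval_q = mlp(state_dim + action_dim, [256, 256], 1)
```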
Table 2. Hyperparameter settings
Parameter | Value
Training episodes E | 3000
Discount factor γ1 | 0.99
Discount factor γ2 | 0.9
Actor learning rate lrA | 0.0001
Critic learning rate lrC | 0.001
Q-network learning rate lrQ | 0.001
Initial action weight α | 0.6
Maximum training episodes Ne | 300
Maximum steps per episode Ns | 1000
Exploration steps Nt | 3000
Replay buffer capacity M | 500 000
Batch size N | 256
Soft-update coefficient τ | 0.001
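For reference, Table 2 can be gathered into a single configuration mapping. The variable names mirror the table's symbols; which discount factor belongs to which network is an assumption, hedged in the comments.

```python
# Table 2 as a configuration dict; assignments of γ1 and γ2 are assumed.
config = dict(
    episodes=3000,        # E: total training episodes
    gamma1=0.99,          # γ1: discount factor (assumed: TD3 critics)
    gamma2=0.9,           # γ2: discount factor (assumed: additional Q network)
    lr_actor=1e-4,        # lrA: Actor learning rate
    lr_critic=1e-3,       # lrC: Critic learning rate
    lr_q=1e-3,            # lrQ: Q-network learning rate
    alpha_init=0.6,       # α: initial weight of the discrete action in the fusion
    max_episodes=300,     # Ne: maximum training episodes
    max_steps=1000,       # Ns: maximum steps per episode
    explore_steps=3000,   # Nt: exploration steps before learning starts
    buffer_size=500_000,  # M: replay buffer capacity
    batch_size=256,       # N: samples drawn per update
    tau=0.001,            # τ: soft-update coefficient for target networks
)
```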
Table 3. Town 1 test results
(Each cell lists: lane-crossing rate/%, steering-angle change, number of collisions.)

Algorithm | Straight driving | Curve turning | Intersection crossing | T-junction turning
TD3WD | 0, 0.09, 0 | 0.8, 0.13, 0 | 2.1, 0.19, 0 | 0.5, 0.25, 0
DDPGWD | 0, 0.21, 0 | 0.7, 0.16, 0 | 3.7, 0.22, 0 | 3.4, 0.21, 1
TD3 | 0, 0.39, 0 | 0.3, 0.36, 0 | 5.9, 0.41, 0 | 5.1, 0.37, 1
DDPG | 0.6, 0.41, 0 | 0.9, 0.39, 1 | 16.2, 0.42, 0 | 1.1, 0.35, 0
Table 4. Town 1 test results (New weather)
(Each cell lists: lane-crossing rate/%, steering-angle change, number of collisions.)

Algorithm | Straight driving | Curve turning | Intersection crossing | T-junction turning
TD3WD | 0, 0.08, 0 | 3.5, 0.21, 0 | 0, 0.12, 0 | 0.8, 0.18, 0
DDPGWD | 0, 0.12, 0 | 3.9, 0.23, 0 | 0, 0.15, 0 | 0.9, 0.21, 1
TD3 | 0.2, 0.42, 1 | 4.9, 0.40, 0 | 0, 0.42, 0 | 0.8, 0.40, 2
DDPG | 6.1, 0.48, 2 | 4.2, 0.39, 6 | 1.5, 0.48, 0 | 8.6, 0.41, 6
Table 5. Town 2 test results
(Each cell lists: lane-crossing rate/%, steering-angle change, number of collisions.)

Algorithm | Straight driving | Curve turning | Intersection crossing | T-junction turning
TD3WD | 0, 0.08, 0 | 0.1, 0.23, 0 | 0, 0.09, 0 | 1.3, 0.33, 1
DDPGWD | 0.1, 0.18, 0 | 16.7, 0.19, 0 | 0.9, 0.21, 0 | 3.5, 0.21, 1
TD3 | 0, 0.42, 0 | 2.9, 0.41, 0 | 0, 0.42, 0 | 4.6, 0.39, 2
DDPG | 0.3, 0.41, 0 | 7.3, 0.36, 5 | 1.4, 0.36, 0 | 4.9, 0.38, 1
-