Study on CUDA-Based Heterogeneous Parallelization of an Advanced Assembly Neutronics Program
Abstract: To improve the computational performance of the advanced assembly neutronics program KYLIN-II on problems with complicated boundary conditions, this paper studies the heterogeneous parallelization of KYLIN-II based on programmable graphics card technology. Massively threaded parallel computing is implemented for the resonance and transport modules, and the number of atomic operations in the heterogeneous parallel program is reduced by optimizing the iteration strategy. To verify the accuracy and speedup of the heterogeneous parallel program, calculations are carried out for test problems including the AFA3G super assembly, a hexagonal plate-type fuel assembly, and a multilayer sleeve-type fuel cell. The results show that heterogeneous parallelization does not affect the accuracy of the calculation results, that KYLIN-II achieves a speedup of more than 10 times on a single graphics card, and that optimizing the iteration process effectively reduces computing time. Compared with the traditional multi-core parallel mechanism based on the central processing unit (CPU), heterogeneous parallelization on graphics cards significantly reduces the economic cost of large-scale parallel computing with KYLIN-II and can serve as a direction for its further parallel optimization.
Key words:
- KYLIN-II
- Parallel technology
- Heterogeneous parallel
- Atomic operation
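The abstract states that the optimized iteration strategy reduces the number of atomic operations, but it does not describe how. As a loose illustration only, and not the authors' method, the sketch below shows one common way such a reduction can arise in a ray sweep: each segment's contributions over polar angles are summed privately first, so the shared region tally is updated once per segment rather than once per angle. The function names, the `SEGMENTS` data, and the counters are all hypothetical; each `+=` on the shared tally stands in for one atomic add on a GPU.

```python
from collections import defaultdict


def sweep_update_per_angle(segments):
    """Update the shared tally once per (segment, polar angle)."""
    tally = defaultdict(float)
    n_updates = 0
    for region, contributions in segments:
        for c in contributions:
            tally[region] += c  # stands in for one atomic add
            n_updates += 1
    return dict(tally), n_updates


def sweep_local_sum(segments):
    """Sum over polar angles privately, then update the shared tally once."""
    tally = defaultdict(float)
    n_updates = 0
    for region, contributions in segments:
        local = sum(contributions)  # private accumulation, no atomics needed
        tally[region] += local      # stands in for one atomic add
        n_updates += 1
    return dict(tally), n_updates


# Hypothetical data: (region index, contributions for three polar angles).
SEGMENTS = [
    (0, [0.5, 0.25, 0.125]),
    (1, [1.0, 0.5, 0.25]),
    (0, [0.25, 0.25, 0.5]),
]

tally_a, updates_a = sweep_update_per_angle(SEGMENTS)
tally_b, updates_b = sweep_local_sum(SEGMENTS)
print("shared-tally updates:", updates_a, "vs", updates_b)
print("tallies agree:", tally_a == tally_b)
```

Both variants produce the same region tallies on this data; only the number of updates to shared memory differs, which is the quantity an atomic-operation optimization targets.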
Table 1. Hardware Parameter Specifications for Heterogeneous Platform

| Hardware | Specifications |
| --- | --- |
| CPU | Intel Haswell E5 CPU, 2.6 GHz clock frequency |
| Quadro K6000 | Compute capability 3.5, core clock 902 MHz, memory clock 3004 MHz, 2880 CUDA cores |
| Tesla K20c | Compute capability 3.5, core clock 706 MHz, memory clock 2600 MHz, 2496 CUDA cores |

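Table 2. Comparison of Numerical Results of AFA3G Super Assembly

| Hardware | Optimized | Floating-point precision | keff | Resonance module time/s | Transport module time/s | Ray sweep time per group/ms |
| --- | --- | --- | --- | --- | --- | --- |
| CPU serial | No | Double | 1.047382 | 1459.4 (1454.2) | 2721.0 (2706.2)① | 4622.7 |
| Quadro K6000 | No | Single | 1.047382 | 131.3 (124.8) | 251.0 (234.8) | 399.6 |
| Quadro K6000 | No | Double | 1.047382 | 184.7 (177.9) | 355.1 (336.8) | 571.9 |
| Quadro K6000 | Yes | Single | 1.047382 | 111.0 (107.6) | 212.2 (199.8) | 341.6 |
| Quadro K6000 | Yes | Double | 1.047382 | 132.5 (129.0) | 254.6 (242.1) | 412.3 |
| Tesla K20c | No | Single | 1.047382 | 209.6 (204.5) | 393.8 (379.0) | 648.2 |
| Tesla K20c | No | Double | 1.047382 | 253.4 (248.0) | 478.1 (463.7) | 790.3 |
| Tesla K20c | Yes | Single | 1.047382 | 176.7 (173.9) | 332.8 (322.7) | 551.6 |
| Tesla K20c | Yes | Double | 1.047382 | 195.4 (192.7) | 367.2 (357.2) | 610.4 |

Note: ① Times in parentheses are the sum of the module's ray sweep time and CPU/GPU data copy time, excluding the time spent on coarse mesh acceleration and other processes.
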
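Table 3. Comparison of Numerical Results of Hexagonal Plate Fuel Assembly

| Hardware | Optimized | Floating-point precision | keff | Resonance module time/s | Transport module time/s | Ray sweep time per group/ms |
| --- | --- | --- | --- | --- | --- | --- |
| CPU serial | No | Double | 1.630950 | 69.9 (69.1) | 24442.4 (24323.9)② | 3890.4 |
| Quadro K6000 | No | Single | 1.630950 | 7.6 (6.7) | 2259.7 (2093.3) | 334.9 |
| Quadro K6000 | No | Double | 1.630950 | 8.5 (7.6) | 2544.3 (2375.3) | 380.1 |
| Quadro K6000 | Yes | Single | 1.630950 | 6.8 (6.1) | 2017.4 (1894.7) | 303.2 |
| Quadro K6000 | Yes | Double | 1.630950 | 7.5 (6.7) | 2165.1 (2044.2) | 327.1 |
| Tesla K20c | No | Single | 1.630950 | 13.2 (12.3) | 3531.8 (3406.7) | 545.3 |
| Tesla K20c | No | Double | 1.630950 | 13.7 (12.8) | 3638.3 (3513.4) | 562.4 |
| Tesla K20c | Yes | Single | 1.630950 | 11.9 (11.3) | 3150.9 (3060.4) | 489.9 |
| Tesla K20c | Yes | Double | 1.630950 | 12.5 (11.9) | 3276.0 (3185.4) | 510.0 |

Note: ② Times in parentheses are the sum of the module's ray sweep time and CPU/GPU data copy time, excluding the time spent on coarse mesh acceleration and other processes.
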
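Table 4. Comparison of Numerical Results of Multilayer Sleeve Fuel Cell

| Hardware | Optimized | keff | Resonance module time/s | Transport module time/s | Ray sweep time per group/ms |
| --- | --- | --- | --- | --- | --- |
| CPU serial | No | 1.518689 | 13.4 (13.3) | 3178.2 (3177.9)③ | 699.8 |
| Quadro K6000 | No | 1.518689 | 1.6 (1.6) | 274.4 (274.1) | 60.5 |
| Quadro K6000 | Yes | 1.518689 | 1.6 (1.5) | 262.1 (261.8) | 57.7 |
| Tesla K20c | No | 1.518689 | 4.1 (4.0) | 433.7 (433.5) | 95.9 |
| Tesla K20c | Yes | 1.518689 | 4.3 (4.0) | 417.6 (417.5) | 92.4 |

Note: ③ Times in parentheses are the sum of the module's ray sweep time and CPU/GPU data copy time, excluding the time spent on coarse mesh acceleration and other processes.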
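The speedup figure quoted in the abstract can be checked against the tabulated timings. The sketch below does this for the transport module on the Quadro K6000, taking the speedup as CPU serial time divided by GPU time and the optimization gain as the relative reduction in GPU time. The timing values are copied from Tables 2 to 4 (single-precision rows for Tables 2 and 3; Table 4 lists no precision column); the dictionary layout and function names are assumptions made for illustration.

```python
# Transport module wall time in seconds, copied from Tables 2-4
# (CPU serial vs. Quadro K6000 before and after the iteration optimization).
TRANSPORT_TIME_S = {
    "AFA3G super assembly": {"cpu": 2721.0, "gpu_base": 251.0, "gpu_opt": 212.2},
    "Hexagonal plate fuel assembly": {"cpu": 24442.4, "gpu_base": 2259.7, "gpu_opt": 2017.4},
    "Multilayer sleeve fuel cell": {"cpu": 3178.2, "gpu_base": 274.4, "gpu_opt": 262.1},
}


def speedup(cpu_time, gpu_time):
    """Speedup ratio: serial CPU time divided by GPU time."""
    return cpu_time / gpu_time


def relative_reduction(base_time, optimized_time):
    """Fraction of the unoptimized GPU time saved by the optimization."""
    return (base_time - optimized_time) / base_time


RESULTS = {}
for case, t in TRANSPORT_TIME_S.items():
    RESULTS[case] = {
        "speedup_base": speedup(t["cpu"], t["gpu_base"]),
        "speedup_opt": speedup(t["cpu"], t["gpu_opt"]),
        "opt_gain": relative_reduction(t["gpu_base"], t["gpu_opt"]),
    }
    r = RESULTS[case]
    print(f"{case}: speedup {r['speedup_base']:.1f}x -> {r['speedup_opt']:.1f}x, "
          f"GPU time reduced by {100 * r['opt_gain']:.1f}%")
```

The same two functions apply to the resonance module and to the Tesla K20c rows by substituting the corresponding table entries.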