Abstract
Purpose: This project studies machine learning theory, in particular deep learning and ensemble learning, and applies the corresponding algorithms to quantitative diagnosis based on zang-xiang (visceral manifestation) syndrome differentiation in traditional Chinese medicine (TCM). Building on implementations of AdaBoost, random forest, convolutional neural network, and spectral clustering, all widely used and well-performing machine learning algorithms, it constructs an integrated platform for TCM zang-xiang syndrome differentiation and quantitative diagnosis. The platform is centered on these algorithms and combines functional modules for data collection, data cleaning, quantitative diagnosis, and algorithm evaluation.
Methods: First, drawing on the university's resources as a TCM institution, 7964 raw medical records were collected from a TCM hospital. Second, scripts were written to perform basic cleaning of the records. TCM students were then organized to normalize and standardize the data items (symptoms, tongue manifestations, pulse manifestations, and syndrome types), and the resulting data-item dictionary was used for batch replacement to produce the final standardized record samples. Models were then built for AdaBoost, random forest, convolutional neural network, and spectral clustering: the samples were converted into well-formed feature vectors and fed to the models, the model parameters were tuned, and the best-performing parameters were evaluated on the test samples. Finally, the best models were integrated into a TCM zang-xiang syndrome differentiation and quantitative diagnosis system built with the Flask framework, a MySQL database, and Echarts.js, following the MVC pattern and an object-oriented development approach.
Results: Basic cleaning yielded 7518 valid medical records, and 700 records have been standardized. From these, 1871 symptom features, 32 tongue features, 16 pulse features, and 50 syndrome-type labels were extracted. On this cleaned data, the best accuracy achieved by AdaBoost, random forest, convolutional neural network, and spectral clustering was 44.62%, 47.59%, 52.47%, and 39.28%, respectively.
Conclusion: With the medical record data not yet fully standardized, the samples unevenly distributed, and model size limited by hardware, the convolutional neural network performed best on the test set, and its performance improved markedly as network complexity increased. The accuracy of the ensemble learning algorithms fluctuated between 40% and 50% as parameters were adjusted, and spectral clustering, as an unsupervised method, showed mediocre classification performance. With better sample quality, the performance of every model is expected to improve markedly.
Keywords: quantitative diagnosis; AdaBoost; random forest; convolutional neural network; spectral clustering; integrated platform
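The dictionary-based batch replacement and the conversion of records into feature vectors, as summarized in the Methods, can be illustrated with a minimal sketch. It is not the project's actual script: the dictionary entries, the function names (`standardize`, `build_vocabulary`, `to_feature_vector`), and the binary multi-hot encoding are all assumptions made for illustration, since the abstract does not specify how the vectors are encoded.

```python
from typing import Dict, List

# Hypothetical synonym dictionary mapping raw terms to standardized terms.
# The entries are illustrative and are not taken from the project's data.
TERM_DICT: Dict[str, str] = {
    "head pain": "headache",
    "cannot sleep": "insomnia",
}


def standardize(terms: List[str], term_dict: Dict[str, str]) -> List[str]:
    """Replace each raw term with its standardized form.

    Terms missing from the dictionary are kept unchanged; duplicates that
    arise after replacement are dropped while preserving order.
    """
    seen = set()
    result = []
    for term in terms:
        std = term_dict.get(term, term)
        if std not in seen:
            seen.add(std)
            result.append(std)
    return result


def build_vocabulary(records: List[List[str]]) -> Dict[str, int]:
    """Assign a fixed column index to every standardized term (sorted order)."""
    vocab = sorted({term for record in records for term in record})
    return {term: index for index, term in enumerate(vocab)}


def to_feature_vector(terms: List[str], vocab: Dict[str, int]) -> List[int]:
    """Encode one record as a binary multi-hot vector (an assumed encoding)."""
    vector = [0] * len(vocab)
    for term in terms:
        index = vocab.get(term)
        if index is not None:  # terms outside the vocabulary are ignored
            vector[index] = 1
    return vector


raw_records = [
    ["head pain", "cannot sleep"],
    ["headache", "head pain", "red tongue"],
]
records = [standardize(r, TERM_DICT) for r in raw_records]
vocab = build_vocabulary(records)
vectors = [to_feature_vector(r, vocab) for r in records]
```

Vectors of this shape, paired with syndrome-type labels, are the kind of input that classifiers such as AdaBoost or random forest accept. With the 1871 symptom, 32 tongue, and 16 pulse features reported in the Results, one such vector per record would be high-dimensional and sparse.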
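Contents
1. Introduction
1.1 Project Background
1.2 Purpose and Significance
1.3 Task Overview
1.3.1 Design Goals
1.3.2 Algorithm Requirements
1.3.3 Platform Features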
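2. Overview of Related Technologies
2.1 Related Algorithms
2.1.1 AdaBoost
2.1.2 Random Forest
2.1.3 Convolutional Neural Network
2.1.4 Spectral Clustering
2.2 Development Technologies
2.2.1 Front-End Technologies: HTML+CSS+JavaScript
2.2.2 Back-End Technologies: MySQL+Flask
2.2.3 Algorithm Frameworks: TensorFlow+Scikit-learn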
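3. Research on Quantitative Diagnosis Methods for TCM Zang-Xiang Syndrome Differentiation
3.1 Data Preparation
3.1.1 Medical Record Cleaning
3.1.2 Medical Record Standardization
3.2 AdaBoost Modeling
3.2.1 Constructing Feature Vectors
3.2.2 Building Decision Trees
3.2.3 Boosting Ensemble Learning
3.2.4 Algorithm Evaluation
3.3 Random Forest Modeling
3.3.1 Constructing Feature Vectors
3.3.2 Building Decision Trees
3.3.3 Bagging Ensemble Learning
3.3.4 Algorithm Evaluation
3.4 Convolutional Neural Network Modeling
3.4.1 Constructing Feature Vectors
3.4.2 Designing the Network Model
3.4.3 Training the Network
3.4.4 Model Evaluation and Parameter Tuning
3.5 Spectral Clustering Modeling
3.5.1 Constructing Feature Vectors
3.5.2 Constructing the Laplacian Matrix
3.5.3 Graph Partitioning and Clustering
3.5.4 Algorithm Evaluation
3.6 Algorithm Summary and Comparison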
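4. Design and Implementation of the TCM Zang-Xiang Intelligent Diagnosis Platform
4.1 Overall System Architecture
4.2 System Function Design
4.2.1 Medical Record Entry
4.2.2 Medical Record Cleaning
4.2.3 Dictionary Maintenance
4.2.4 Quantitative Diagnosis
4.2.5 Result Display
4.2.6 Algorithm Training
4.2.7 Algorithm Evaluation
4.3 Database Design
4.3.1 Dictionary Table
4.3.2 Medical Record Table
4.3.3 Algorithm Table
4.3.4 Diagnosis Table
4.4 System Function Demonstration
4.4.1 Medical Record Entry
4.4.2 Medical Record Cleaning
4.4.3 Dictionary Maintenance
4.4.4 Quantitative Diagnosis
4.4.5 Result Display
4.4.6 Algorithm Training
4.4.7 Algorithm Evaluation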
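5. Summary and Outlook
5.1 System Strengths and Features
5.2 System Weaknesses and Limitations
5.3 Ideas for Improvement and Outlook
Closing Remarks
Acknowledgments
References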