Abstract
Information asymmetry is a common problem in the talent market. Matching a CV (curriculum vitae) to a suitable job description is therefore an important task. In this paper, I develop an algorithm for this task, which can greatly reduce the painstaking manual work of reading through CVs and selecting the suitable ones.
The algorithm is largely based on classical methods in Natural Language Processing (NLP). I first use regular expressions to extract information from the structured text; this information is later used to filter CVs. I then focus on extracting keywords from the unstructured text. Using supervised machine learning, I train a keyword extraction model and compare two models for this task: a Conditional Random Field (CRF) and a bi-LSTM-CRF. Further analysis of the extracted keywords reveals a “long tail phenomenon”. In response to this challenge, I build an auto-encoder model to compress the keyword vectors of each CV and extract further information from them. The encoded keyword vectors turn out to carry rich semantic meaning: when visualized with Principal Component Analysis (PCA), the CVs' keyword vectors cluster according to their matched job descriptions. Finally, I build a Multi-Layer Perceptron (MLP) model to classify CVs and match them to the corresponding job descriptions.
I then analyze the model's performance on the test set by plotting the ROC curve and calculating the AUC. The model achieves an AUC of 0.95, which indicates its capability to match CVs to job descriptions.
Keywords: NLP, long tail, resume matching, machine learning
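Abstract (in Chinese)
Abstract (in English)
1 Introduction
1.1 Background and Purpose of the Project
1.1.1 Practical Significance
1.1.2 Theoretical Significance
1.2 Research Status at Home and Abroad
1.3 Research Methods and Content
1.4 Thesis Structure
2 Related Technologies
2.1 Named Entity Recognition
2.2 Named Entity Recognition with Conditional Random Fields
2.3 Named Entity Recognition Based on bi-LSTM-CRF
2.4 Principles and Applications of Auto-encoders
2.5 Principles of the Multi-Layer Perceptron
2.6 Gradient Descent Algorithm
3 Algorithm Design and Implementation
3.1 Data Preparation
3.1.1 Data Description
3.1.2 Data Structure and Field Analysis
3.1.3 Data Preprocessing
3.2 Feature Extraction
3.2.1 Extraction of Formatted Information with Regular Expressions
3.2.2 Extraction of Technical Keywords
3.2.3 Extracting Features from One-hot Encoded Keywords with an Auto-encoder
3.3 Design of the Prediction Model
3.3.1 Model Inputs and Outputs
3.3.2 Model Architecture
3.3.3 Model Training
4 Algorithm Evaluation
4.1 Analysis of Technical Keyword Extraction Results
4.1.1 Evaluation Metrics for Keyword Extraction
4.1.2 Comparison of Technical Keyword Extraction Results
4.2 Implementation and Result Analysis of the Auto-encoder
4.3 Results of the Recommendation Algorithm on the Test Dataset
4.3.1 Evaluation Metrics
4.3.2 Results and Analysis
Conclusion
Acknowledgements
References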