Greedy clustering on a LinkedIn dataset

See my GitHub:
https://github.com/HMY626/linkedin_data_cluster/blob/master/Clustering_job.py

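Much of data analysis boils down to counting: comparing n-gram similarity is counting, and computing the Jaccard coefficient is counting too.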
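To make the "it is all counting" point concrete, here is a minimal sketch of the Jaccard coefficient over the word tokens of two job titles. The function name `jaccard` and the example titles are illustrative assumptions, not taken from the linked script.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two token collections."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Two empty collections: define the similarity as 0 to avoid dividing by zero.
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical titles, split into word tokens.
title_a = "Senior Software Engineer".split()
title_b = "Software Engineer".split()

# 2 shared tokens counted against 3 distinct tokens overall.
similarity = jaccard(title_a, title_b)
```

Both the numerator and the denominator are plain set sizes, so the whole measure reduces to counting shared and distinct tokens.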

I use a LinkedIn dataset from Kaggle. Job titles on LinkedIn are not written consistently, so I need to normalize them first.

# (abbreviation, full form) pairs; dotted forms are listed before undotted ones
transforms = [
    ('Sr.', 'Senior'),
    ('Sr', 'Senior'),
    ('Jr.', 'Junior'),
    ('Jr', 'Junior'),
    ('CEO', 'Chief Executive Officer'),
    ('COO', 'Chief Operating Officer'),
    ('CTO', 'Chief Technology Officer'),
    ('CFO', 'Chief Finance Officer'),
    ('VP', 'Vice President'),
]

The abbreviations above form a replacement list that is used to normalize the job titles as far as possible.
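A minimal sketch of how such a replacement list might be applied to a single title. The helper `normalize_title` and the word-boundary matching are assumptions made for illustration; the linked script may simply chain `str.replace` calls instead.

```python
import re

transforms = [
    ('Sr.', 'Senior'),
    ('Sr', 'Senior'),
    ('Jr.', 'Junior'),
    ('Jr', 'Junior'),
    ('CEO', 'Chief Executive Officer'),
    ('COO', 'Chief Operating Officer'),
    ('CTO', 'Chief Technology Officer'),
    ('CFO', 'Chief Finance Officer'),
    ('VP', 'Vice President'),
]


def normalize_title(title):
    """Expand known abbreviations that appear as standalone tokens (hypothetical helper)."""
    for abbreviation, expansion in transforms:
        # Lookarounds keep 'VP' from matching inside a longer token such as 'SVP'.
        pattern = r"(?<!\w)" + re.escape(abbreviation) + r"(?!\w)"
        title = re.sub(pattern, expansion, title)
    return title
```

For example, `normalize_title('Sr. Software Engineer')` gives `'Senior Software Engineer'`, while a title such as `'SVP Marketing'` is left unchanged because `VP` is not a standalone token there.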

Dataset sample

[
  {
    "name": {
      "family_name": "BOLUKBAS, PMP",
      "given_name": "Mustafa"
    },
    "locality": "Turkey",
    "skills": [
      "Project Management",
      "Cross-functional Team Leadership",
      "Business Process Re-engineering",
      "Software Development Life Cycle",
      "Business",
      "Software Project Management",

    "name": {
      "family_name": "Hoffmann",
      "given_name": "Lynette"
    },
    "locality": "South Africa",
    "skills": [
      "Predictive Analytics",
      "Business Analysis",
      "Competitive Analysis",
      "Training Skills",
      "Market Research",
      "Market Analysis",
      "Marketing Strategy",
      "Strategy Development",
      "Pricing Analysis",
      "Pricing Strategy",
      "Pricing"
    ],
    "locality_code": [
      "South Africa",
      [
        -29.002309799194336,
        25.0803165435791
      ]
    ],
    "events": [
      {
        "from": "Eskom",
        "to": "ESKOM Distribution",
        "title1": "Graduate in Marketing",
        "start": 24085,
        "title2": "Market and Customer Services Analyst",
        "end": 24098
      },
      {
        "from": "ESKOM Distribution",
        "to": "Eskom Distribution",
        "title1": "Market and Customer Services Analyst",
        "start": 24098,
        "title2": "Pricing Advisor",
        "end": 24123
      },
      {
        "from": "Eskom Distribution",
        "to": "Ericsson Regional Sub-Saharan Africa",
        "title1": "Pricing Advisor",
        "start": 24123,
        "title2": "Price Manager",
        "end": 24135
      }
    ]
  },
  …………
]

The remaining keys are not important here.

Collecting geographic location information

I use the Bing geocoder from geopy to geocode the locations of the LinkedIn contacts, so that they can later be visualized with D3.
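As a rough sketch of this step, the code below counts contacts per `locality` and geocodes each distinct locality once. `geopy.geocoders.Bing` and its `geocode` method are real geopy APIs; the helper names `make_bing_geocoder` and `geocode_localities`, and the shape of the returned values, are assumptions for illustration. A Bing Maps API key and network access are needed for real lookups.

```python
from collections import Counter


def make_bing_geocoder(api_key):
    """Build a geopy Bing geocoder (needs the third-party geopy package and a Bing Maps key)."""
    from geopy.geocoders import Bing  # imported lazily so the counting logic runs without geopy
    return Bing(api_key=api_key)


def geocode_localities(records, geocoder):
    """Count contacts per locality and geocode each distinct locality once (hypothetical helper)."""
    counts = Counter(r["locality"] for r in records if r.get("locality"))
    coords = {}
    for locality in counts:
        location = geocoder.geocode(locality)  # geopy returns None when nothing matches
        if location is None:
            coords[locality] = None
        else:
            coords[locality] = (location.latitude, location.longitude)
    return counts, coords
```

With a real key this would be called as `counts, coords = geocode_localities(records, make_bing_geocoder("YOUR_BING_MAPS_KEY"))`, where `records` is the list loaded from the JSON file. The resulting `(latitude, longitude)` pairs have the same shape as the coordinates stored under `locality_code` in the sample.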

Clustering results
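For orientation, here is a minimal sketch of one simple form of greedy clustering over job titles: a single first-fit pass in which each title joins the first cluster whose representative is similar enough under the Jaccard coefficient, and otherwise starts a new cluster. The function names and the threshold of 0.5 are assumptions; the linked script may compare titles differently.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_cluster(titles, threshold=0.5):
    """First-fit greedy clustering of titles by token overlap (hypothetical helper)."""
    clusters = []  # each entry: (token set of the first member, list of member titles)
    for title in titles:
        tokens = set(title.lower().split())
        for representative, members in clusters:
            if jaccard(tokens, representative) >= threshold:
                members.append(title)
                break
        else:
            # No existing cluster was similar enough, so this title starts its own.
            clusters.append((tokens, [title]))
    return [members for _, members in clusters]
```

The result depends on the input order and on the threshold, which is typical of greedy approaches: a title is never reassigned once it has been placed.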