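I had half a day free today, so I ran a small text-similarity experiment.
I applied the k-nearest neighbors algorithm to a Wikipedia corpus of famous people to find the articles most similar to a given person's.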
knn_model = graphlab.nearest_neighbors.create(people, features=['tfidf'], label='name')
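The model above is built on a `tfidf` column, but the post does not show how that column was produced. As a rough illustration only, here is a minimal pure-Python sketch of one common TF-IDF variant (term count times `log(N / document frequency)`). The function name `tf_idf` and the toy `docs` are hypothetical, and the weighting GraphLab actually applies may differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per document (hypothetical helper)."""
    # Term frequency: raw word counts per document.
    counts = [Counter(text.lower().split()) for text in docs]
    n_docs = len(docs)
    # Document frequency: number of documents each word appears in.
    df = Counter(word for c in counts for word in c)
    # Weight = term count * log(N / document frequency).
    return [
        {word: tf * math.log(n_docs / df[word]) for word, tf in c.items()}
        for c in counts
    ]

# Toy corpus, made up for illustration (not the Wikipedia data).
docs = [
    "the president signed the bill",
    "the senator opposed the bill",
    "the striker scored a goal",
]
weights = tf_idf(docs)
```

In this toy corpus, a word that occurs in every document ("the") gets weight 0, while a word unique to one document ("president") gets the highest weight, which is why TF-IDF vectors emphasize the distinctive vocabulary of each article.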
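The results turned out to be fairly accurate.
For example, querying for the people most similar to Obama returns Clinton and other presidents, vice presidents, and politicians (the first row is Obama himself, at distance 0):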
obama = people[people['name'] == 'Barack Obama']
knn_model.query(obama)
query_label | reference_label | distance | rank |
---|---|---|---|
0 | Barack Obama | 0.0 | 1 |
0 | Joe Biden | 0.794117647059 | 2 |
0 | Joe Lieberman | 0.794685990338 | 3 |
0 | Kelly Ayotte | 0.811989100817 | 4 |
0 | Bill Clinton | 0.813852813853 | 5 |
Querying for the people most similar to Jiang Zemin returns Li Peng and others:
jiang_zemin = people[people['name'] == 'Jiang Zemin']
knn_model.query(jiang_zemin)
query_label | reference_label | distance | rank |
---|---|---|---|
0 | Jiang Zemin | 0.0 | 1 |
0 | Li Peng | 0.808219178082 | 2 |
0 | Hu Jintao | 0.825938566553 | 3 |
0 | Wang Chen (politician) | 0.834196891192 | 4 |
0 | Peng Qinghua | 0.834905660377 | 5 |
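To make the queries above more concrete, here is a minimal brute-force sketch of what a nearest-neighbor query over TF-IDF dictionaries could look like. It is not GraphLab's implementation: the distance used here is cosine distance, chosen as an assumption because the post does not state which distance the model uses, and the names `cosine_distance`, `query`, and the toy `reference` data are all hypothetical.

```python
import math

def cosine_distance(a, b):
    """Cosine distance between two sparse {word: weight} vectors."""
    dot = sum(w * b.get(word, 0.0) for word, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

def query(reference, label, k=5):
    """Return (reference_label, distance, rank) for the k closest entries."""
    target = reference[label]
    # Brute force: compare the target against every entry, itself included.
    ranked = sorted(
        (cosine_distance(target, vec), name) for name, vec in reference.items()
    )
    return [(name, dist, rank) for rank, (dist, name) in enumerate(ranked[:k], start=1)]

# Toy TF-IDF vectors, made up for illustration (not the Wikipedia data).
reference = {
    "politician_a": {"president": 2.0, "senate": 1.0},
    "politician_b": {"president": 1.0, "senate": 1.0, "election": 1.0},
    "athlete_c": {"goal": 2.0, "league": 1.0},
}
results = query(reference, "politician_a", k=3)
```

As in the tables above, the queried entry comes back first with a distance of (approximately) zero, followed by the entries that share the most heavily weighted words with it.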