Question: how do I compute the tag overlap between two groups, and what algorithm should I learn for it?

This topic was created 2393 days ago; the information mentioned may have changed or developed since then.

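I'd like to ask everyone for advice. I have only just started programming and am learning Python, and there is one idea I want to work on:
I have many groups of tags, each group different from the others, and I want to compute how similar any two groups are so that I can find the pair with the highest overlap. What algorithm should I study for this? Is there a Python solution?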

Python

Algorithms

Overlap

Learning

18 replies

lithiumii

Dec 3, 2019

I don't know algorithms, so this is a blind guess: PCA (principal component analysis)?

TaihongZhang

Dec 3, 2019

@lithiumii OK, I'll go take a look.

how2code

Dec 3, 2019

Whoever suggested PCA should be dragged out for a 251...

The simplest approach is probably keyword TF-IDF + cosine similarity.
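As a rough illustration of the TF-IDF + cosine similarity idea, here is a minimal pure-Python sketch. It assumes each group is a list of tag strings and uses a smoothed IDF, log((1 + N) / (1 + df)) + 1, which is one common variant rather than the only definition. The names `tfidf_vectors` and `cosine` and the sample tags are made up for this example.

```python
import math
from collections import Counter


def tfidf_vectors(groups):
    """Return one {tag: weight} dict per tag group (hypothetical helper)."""
    n = len(groups)
    # Document frequency: the number of groups each tag appears in.
    df = Counter(tag for group in groups for tag in set(group))
    vectors = []
    for group in groups:
        counts = Counter(group)
        total = len(group)
        vectors.append({
            # Term frequency times smoothed inverse document frequency.
            tag: (count / total) * (math.log((1 + n) / (1 + df[tag])) + 1)
            for tag, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {tag: weight} vectors."""
    dot = sum(w * v.get(tag, 0.0) for tag, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up sample data: three small tag groups.
groups = [
    ["python", "algorithm", "tags"],
    ["python", "algorithm", "learning"],
    ["cooking", "travel", "tags"],
]
vecs = tfidf_vectors(groups)
sim_01 = cosine(vecs[0], vecs[1])
sim_02 = cosine(vecs[0], vecs[2])
print(sim_01, sim_02)
```

With these weights, tags that appear in many groups count less than rare ones, so sharing a rare tag raises the score more than sharing a common one.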

ZRS

Dec 3, 2019

Just treat each label as its own dimension and compute the cosine similarity.
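If every label is a 0/1 dimension, the dot product is simply the number of shared labels and each vector's length is the square root of its label count, so the cosine can be computed directly on sets. A small sketch under that assumption (`cosine_binary` is a name invented here):

```python
import math


def cosine_binary(a, b):
    """Cosine similarity of two tag sets viewed as 0/1 vectors."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    # dot product = |a ∩ b|; vector norms = sqrt(|a|) and sqrt(|b|)
    return len(a & b) / math.sqrt(len(a) * len(b))


a = {'foo', 'bar', 'hello', 'world'}
b = {'foo', 'bar', 'hello', 'world', 'test'}
score = cosine_binary(a, b)
print(score)  # 4 / sqrt(20), roughly 0.894
```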

how2code

Dec 3, 2019

@how2code First result on Google: https://stackoverflow.com/questions/12118720/python-tf-idf-cosine-to-find-document-similarity

wangyzj

Dec 3, 2019

@lithiumii Haha, that's what I was thinking too.

klesh

Dec 3, 2019

See whether Jaccard Similarity or the Overlap Coefficient is enough for your needs?

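# Two example tag sets
a = {'foo', 'bar', 'hello', 'world'}
b = {'foo', 'bar', 'hello', 'world', 'test'}
c = a.intersection(b)  # tags present in both sets
d = a.union(b)         # tags present in either set
# Jaccard similarity: |a ∩ b| / |a ∪ b|
print('js(a, b)=', len(c) / len(d))
# Overlap coefficient: |a ∩ b| / min(|a|, |b|)
print('oc(a, b)=', len(c) / min(len(a), len(b)))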
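To get from a two-set score to the original question (many groups, find the pair with the highest overlap), one possible approach is to score every pair and keep the best one. The sketch below assumes the groups are stored in a dict mapping a group name to a set of tags; the names `jaccard`, `most_similar_pair`, `g1` to `g3`, and the sample tags are all invented for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity: |a ∩ b| / |a ∪ b| (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar_pair(groups):
    """Return (score, name1, name2) for the highest-scoring pair of groups.

    Expects at least two groups; ties are broken by group name.
    """
    return max(
        (jaccard(groups[x], groups[y]), x, y)
        for x, y in combinations(sorted(groups), 2)
    )


# Made-up sample data.
groups = {
    "g1": {"foo", "bar", "hello"},
    "g2": {"foo", "bar", "test"},
    "g3": {"foo", "bar", "hello", "world"},
}
best = most_similar_pair(groups)
print(best)  # (0.75, 'g1', 'g3')
```

This compares every pair, so the work grows quadratically with the number of groups. That is usually fine for a few thousand groups, but larger collections would need something smarter.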