e commerce - Algorithms for matching based on keywords intersection -


suppose have buyers , sellers trying find each other in market. buyers can tag needs keywords; sellers can same selling. i'm interested in finding algorithm(s) rank-order sellers in terms of relevance particular buyer on basis of 2 keyword sets.

here example:

buyer_keywords = {"furry", "four legs", "likes catnip", "has claws"}  

and have 2 potential sellers need rank order in terms of relevance:

seller_keywords[1] = {"furry", "four legs", "arctic circle", "white"} seller_keywords[2] = {"likes catnip", "furry",                        "hates mice", "yarn-lover", "whiskers"} 

if use intersection of keywords, not discrimination: both intersect on 2 keywords. if divide intersection count size of set union, seller 2 worse because of greater number of keywords. seem introduce automatic penalty method not correcting keyword set size (and don't want penalize adding keywords).

to put bit more structure on problem, suppose have truthful measure of intensity of keyword attributes (which have sum 1 each seller), e.g.,:

seller_keywords[1] = {"furry":.05,                        "four legs":.05,                        "arctic circle":.8,                        "white":.1}  seller_keywords[2] = {"likes catnip":.5,                        "furry":.4,                        "hates mice":.02,                        "yarn-lover":.02,                        "whiskers":.06} 

now sum value of hits: seller 1 gets score of .1, while seller 2 gets score of .9. far, good, might third seller limited, non-descriptive keyword set:

seller_keywords[3] = {"furry":1} 

this catapults them top hit on sole keyword, isn't good.

anyway, guess (and hope) general problem , there exist different algorithmic solutions known strengths , limitations. covered in cs101, think answer question might link relevant references.

i think you're looking use cosine similarity; it's basic technique gets pretty far first hack. intuitively, create vector every tag know has particular index:

terms[0] --> aardvark terms[1] --> anteater ... terms[n] --> zuckerberg 

then create vectors in space each person:

person1[0] = 0     # person doesn't care aardvarks person1[1] = 0.05  # person cares bit anteaters ... person1[n] = 0 

each person vector in n-dimensional space. can use cosine similarity calculate similarity between pairs of them. calculationally, same of asking angle between 2 vectors. want cosine close 1, means vectors collinear -- have similar values of dimensions.

to improve metric, may want use tf-idf weighting on elements in vector. tf-idf downplay importance of popular terms (e.g, 'iphone') , promote importance of unpopular terms person seems particularly associated with.

combining tf-idf weighting , cosine similarity applications this.


Comments

Popular posts from this blog

c# - How to set Z index when using WPF DrawingContext? -

razor - Is this a bug in WebMatrix PageData? -

android - layout with fragment and framelayout replaced by another fragment and framelayout -