java - Encoding record samples for expectation maximization algorithm

First, I'm a programmer without a data science background, and my working knowledge of statistics is quite limited.

I'm creating an entity matching tool to match records across internal datasets. I want to use the probabilistic matching technique described in these documents*. I have an understanding of how the technique works and how to apply it, except for the derivation of the agreement/disagreement weights using expectation maximization (EM).

Specifically, I'm unclear on how to encode record pairs into the double[][] format required by the EM implementation I have available: Apache Commons Math MultivariateNormalMixtureExpectationMaximization.

Here is a concrete example: matching company records.

A company has 2 fields: name (String) and country (enum), and I want to generate the m and u probabilistic weights using EM. How do I create the double[][] dataset for each field to feed to EM?
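For context on what the m and u values are used for: the documents referenced in this question are not reproduced here, so the sketch below assumes the standard Fellegi-Sunter form, where m is the probability that a field agrees given the pair is a true match and u is the probability that it agrees given the pair is not a match. The class name `MatchWeights` and the sample values of m and u are made up for illustration.

```java
public class MatchWeights {

    /** Agreement weight, assuming the Fellegi-Sunter form log2(m / u). */
    static double agreementWeight(double m, double u) {
        return Math.log(m / u) / Math.log(2);
    }

    /** Disagreement weight, assuming the Fellegi-Sunter form log2((1 - m) / (1 - u)). */
    static double disagreementWeight(double m, double u) {
        return Math.log((1 - m) / (1 - u)) / Math.log(2);
    }

    public static void main(String[] args) {
        // Illustrative values only; in practice EM would estimate these per field.
        double m = 0.9;
        double u = 0.1;
        System.out.println("agreement weight:    " + agreementWeight(m, u));
        System.out.println("disagreement weight: " + disagreementWeight(m, u));
    }
}
```

With these formulas, a field that agrees far more often among matches than among non-matches gets a large positive agreement weight and a negative disagreement weight.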

In the case of name, it's a string, so there is approximate agreement/disagreement, using a string similarity method (edit distance, phonetic index, etc.; the details aren't relevant here).

In the case of country, the data is normalized, so agreement occurs only on an exact match. Countries are over- and under-represented. A record from an under-represented country should have a higher weight than one from an over-represented country.
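One possible encoding, sketched here under stated assumptions rather than as a confirmed answer: each row of the double[][] is one candidate record pair and each column is one compared field, so the company example yields two columns. The name column below holds a normalized edit-distance similarity in [0, 1] and the country column holds 1.0 for an exact match and 0.0 otherwise. The types `Company` and `Country`, the class `PairEncoder`, and the choice of Levenshtein distance are all hypothetical stand-ins.

```java
import java.util.Locale;

public class PairEncoder {

    // Hypothetical enum and record type, for illustration only.
    enum Country { US, DE, LI }

    static final class Company {
        final String name;
        final Country country;
        Company(String name, Country country) {
            this.name = name;
            this.country = country;
        }
    }

    /** Classic Levenshtein edit distance using two rolling rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in [0, 1]: 1.0 means identical after lower-casing and trimming. */
    static double nameSimilarity(String a, String b) {
        String x = a.toLowerCase(Locale.ROOT).trim();
        String y = b.toLowerCase(Locale.ROOT).trim();
        int max = Math.max(x.length(), y.length());
        return max == 0 ? 1.0 : 1.0 - (double) levenshtein(x, y) / max;
    }

    /** One row per pair: column 0 = name similarity, column 1 = country agreement. */
    static double[] encode(Company a, Company b) {
        return new double[] {
            nameSimilarity(a.name, b.name),
            a.country == b.country ? 1.0 : 0.0
        };
    }

    /** Encodes every candidate pair; pairs[i] must hold exactly two records. */
    static double[][] encodeAll(Company[][] pairs) {
        double[][] data = new double[pairs.length][];
        for (int i = 0; i < pairs.length; i++) {
            data[i] = encode(pairs[i][0], pairs[i][1]);
        }
        return data;
    }
}
```

The resulting array has the rows-as-observations, columns-as-dimensions shape that the MultivariateNormalMixtureExpectationMaximization constructor accepts. Two caveats apply: that class fits a mixture of Gaussians, so a strictly 0/1 column may not suit it well, and this encoding does not capture country frequency, so the under-represented-country weighting would need separate handling.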

What do the values in the inner double[] mean/represent?
How many entries/columns should there be?
How do I encode the records into a double[]?

* Documents describing the probabilistic matching technique using EM
