Material informatics
1. Molecular Similarity
The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic used for gauging the similarity and diversity of sample sets.

The Jaccard coefficient measures similarity between finite sample sets and is defined as the size of the intersection divided by the size of the union of the sample sets:
\[J(A,B) = \frac{|A \cap B|}{|A \cup B|}\]
Note that by design, \(0\le J(A,B)\le 1\). If \(A \cap B\) is empty, then \(J(A,B)=0\).
The Jaccard distance, which measures dissimilarity between sample sets, is complementary to the Jaccard coefficient. It is obtained by subtracting the Jaccard coefficient from 1 or, equivalently, by dividing the difference between the sizes of the union and the intersection of the two sets by the size of the union:
\[d_J(A,B) = 1 - J(A,B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}\]
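As a minimal sketch of the two set-based definitions (intersection size over union size, and one minus that ratio), the Python below uses built-in sets. The function names and the example element sets are hypothetical and chosen only for illustration. Returning 0 when both sets are empty is an assumption made here to avoid dividing by zero.

```python
def jaccard_similarity(a, b):
    """Return |A ∩ B| / |A ∪ B| for two finite collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Both sets empty: the ratio would be 0/0. Returning 0.0 is an
        # assumed convention, in line with J = 0 for an empty intersection.
        return 0.0
    return len(a & b) / len(union)


def jaccard_distance(a, b):
    """Return the Jaccard distance, 1 - J(A, B)."""
    return 1.0 - jaccard_similarity(a, b)


# Hypothetical element sets for two compounds (illustrative data only).
a = {"Li", "Fe", "P", "O"}
b = {"Li", "Co", "O"}

print(jaccard_similarity(a, b))  # 2 shared elements out of 5 distinct ones
print(jaccard_distance(a, b))
```

Here the two sets share Li and O and contain five distinct elements in total, so the similarity is 2/5 and the distance is 3/5.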
Tanimoto's “similarity ratio” is given over bitmaps, where each bit of a fixed-size array represents the presence or absence of a characteristic in the plant being modeled. The ratio is defined as the number of common bits divided by the number of bits set (i.e. nonzero) in either sample.
Presented in mathematical terms, if samples \(X\) and \(Y\) are bitmaps, \(X_i\) is the \(i\)-th bit of \(X\), and \(\land\), \(\lor\) are the bitwise AND and OR operators respectively, then the similarity ratio \(T_s\) is:
\[T_s(X,Y) = \frac{\sum_i (X_i \land Y_i)}{\sum_i (X_i \lor Y_i)}\]
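The bitwise definition (common bits divided by bits set in either sample) can be sketched with Python integers used as bitmaps. The function name `tanimoto_bits` and the example bit patterns are hypothetical. Returning 0 when no bit is set in either bitmap is an assumed convention.

```python
def tanimoto_bits(x: int, y: int) -> float:
    """Similarity ratio of two bitmaps stored as non-negative Python ints."""
    common = bin(x & y).count("1")  # bits set in both X and Y
    either = bin(x | y).count("1")  # bits set in X or Y (or both)
    if either == 0:
        # No bit set in either bitmap: assumed convention, return 0.0.
        return 0.0
    return common / either


# Hypothetical 6-bit fingerprints (illustrative data only).
x = 0b101101
y = 0b100111

print(tanimoto_bits(x, y))  # 3 common bits out of 5 bits set in either
```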
If Jaccard or Tanimoto similarity is expressed over a bit vector, then it can be written as
\[T_s(X,Y) = \frac{X \cdot Y}{\|X\|^2 + \|Y\|^2 - X \cdot Y}\]
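The bit-vector form commonly quoted for this similarity is the dot product of the two vectors divided by the sum of their squared norms minus the dot product. The sketch below assumes that form and checks it against the bitwise count on 0/1 lists. The function names and example vectors are hypothetical, and returning 0 for two all-zero vectors is an assumed convention.

```python
def tanimoto_dot(x, y):
    """Tanimoto similarity via dot products: x.y / (|x|^2 + |y|^2 - x.y)."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x_sq = sum(xi * xi for xi in x)
    norm_y_sq = sum(yi * yi for yi in y)
    denom = norm_x_sq + norm_y_sq - dot
    return dot / denom if denom else 0.0  # assumed convention for all-zero input


def tanimoto_bitwise(x, y):
    """Same quantity by counting AND / OR bits, for comparison."""
    common = sum(xi & yi for xi, yi in zip(x, y))
    either = sum(xi | yi for xi, yi in zip(x, y))
    return common / either if either else 0.0


# Hypothetical equal-length bit vectors (illustrative data only).
x = [1, 0, 1, 1, 0, 1]
y = [1, 0, 0, 1, 1, 1]

print(tanimoto_dot(x, y))
print(tanimoto_bitwise(x, y))
```

For 0/1 vectors the dot product counts the common bits and each squared norm counts the bits set in that vector, so the denominator equals the number of bits set in either vector and the two functions agree.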