NGram Module Documentation¶
ngram
– A set class that supports lookup by N-gram string similarity¶
- class ngram.NGram(items=None, threshold=0.0, warp=1.0, key=None, N=3, pad_len=None, pad_char='$', **kwargs)¶
A set that supports searching for members by N-gram string similarity.
In Python 2, items should be unicode string or a plain ASCII str (bytestring) - do not use UTF-8 or other multi-byte encodings, because multi-byte characters will be split up.
- Parameters
threshold (float in 0.0 ... 1.0) – minimum similarity for a string to be considered a match.
warp (float in 1.0 ... 3.0) – use warp greater than 1.0 to increase the similarity of shorter string pairs.
items ([item, ...]) – iteration of items to index for N-gram search.
N (int >= 2) – number of characters per n-gram.
pad_len (int in 0 ... N-1) – how many characters padding to add (defaults to N-1).
pad_char (str or unicode) – character to use for padding. Default is ‘$’, but consider using thenon-breaking space character,
u'\xa0'
(u"\u00A0"
).key (function(item) -> str/unicode) – Function to convert items into string, default is no conversion. Recommended to use str or unicode for non-string items. Using anonymous function prevents NGram class from being pickled.
Instance variables:
- Variables
_grams – For each n-gram, the items containing it and the number of timesthe n-gram occurs in the item as
{str:{item:int, ...}, ...}
.length – maps items to length of the padded string representations as
{item:int, ...}
.
- add(item)¶
Add an item to the N-gram index (if it has not already been added).
>>> from ngram import NGram >>> n = NGram() >>> n.add("ham") >>> list(n) ['ham'] >>> n.add("spam") >>> list(n) ['ham', 'spam']
- clear()¶
Remove all elements from this set.
>>> from ngram import NGram >>> n = NGram(['spam', 'eggs']) >>> list(n) ['eggs', 'spam'] >>> n.clear() >>> list(n) []
- static compare(s1, s2, **kwargs)¶
Compares two strings and returns their similarity.
- Parameters
s1 – first string
s2 – second string
kwargs – additional keyword arguments passed to __init__.
- Returns
similarity between 0.0 and 1.0.
>>> from ngram import NGram >>> NGram.compare('spa', 'spam') 0.375 >>> NGram.compare('ham', 'bam') 0.25 >>> NGram.compare('spam', 'pam') #N=2 0.375 >>> NGram.compare('ham', 'ams', N=1) 0.5
- copy(items=None)¶
Return a new NGram object with the same settings, and referencing the same items. Copy is shallow in that each item is not recursively copied. Optionally specify alternate items to populate the copy.
>>> from ngram import NGram >>> from copy import deepcopy >>> n = NGram(['eggs', 'spam']) >>> m = n.copy() >>> m.add('ham') >>> list(n) ['eggs', 'spam'] >>> list(m) ['eggs', 'ham', 'spam'] >>> p = n.copy(['foo', 'bar']) >>> list(p) ['foo', 'bar']
- difference(*others)¶
Return the difference of two or more sets as a new set.
>>> from ngram import NGram >>> a = NGram(['spam', 'eggs']) >>> b = NGram(['spam', 'ham']) >>> list(a.difference(b)) ['eggs']
- difference_update(other)¶
Remove from this set all elements from other set.
>>> from ngram import NGram >>> n = NGram(['spam', 'eggs']) >>> other = set(['spam']) >>> n.difference_update(other) >>> list(n) ['eggs']
- discard(item)¶
Remove an element from a set if it is a member.
If the element is not a member, do nothing.
>>> from ngram import NGram >>> n = NGram(['spam', 'eggs']) >>> n.discard('spam') >>> n.discard('ham') >>> list(n) ['eggs']
- find(query, threshold=None)¶
Simply return the best match to the query, None on no match.
>>> from ngram import NGram >>> n = NGram(["Spam","Eggs","Ham"], key=lambda x:x.lower(), N=1) >>> n.find('Hom') 'Ham' >>> n.find("Spom") 'Spam' >>> n.find("Spom", 0.8)
- finditem(item, threshold=None)¶
Return most similar item to the provided one, or None if nothing exceeds the threshold.
>>> from ngram import NGram >>> n = NGram([(0, "Spam"), (1, "Ham"), (2, "Eggsy"), (3, "Egggsy")], ... key=lambda x:x[1].lower()) >>> n.finditem((3, 'Hom')) (1, 'Ham') >>> n.finditem((4, "Oggsy")) (2, 'Eggsy') >>> n.finditem((4, "Oggsy"), 0.8)
- intersection(*others)¶
Return the intersection of two or more sets as a new set.
>>> from ngram import NGram >>> a = NGram(['spam', 'eggs']) >>> b = NGram(['spam', 'ham']) >>> list(a.intersection(b)) ['spam']
- intersection_update(*others)¶
Update the set with the intersection of itself and other sets.
>>> from ngram import NGram >>> n = NGram(['spam', 'eggs']) >>> other = set(['spam', 'ham']) >>> n.intersection_update(other) >>> list(n) ['spam']
- items_sharing_ngrams(query)¶
Retrieve the subset of items that share n-grams the query string.
- Parameters
query – look up items that share N-grams with this string.
- Returns
mapping from matched string to the number of shared N-grams.
>>> from ngram import NGram >>> n = NGram(["ham","spam","eggs"]) >>> n.items_sharing_ngrams("mam") {'ham': 2, 'spam': 2}
- key(item)¶
Get the key string for the item.
>>> from ngram import NGram >>> n = NGram(key=lambda x:x[1]) >>> n.key((3,"ham")) 'ham'
- static ngram_similarity(samegrams, allgrams, warp=1.0)¶
Similarity for two sets of n-grams.
- Note
similarity = (a**e - d**e)/a**e
where a is “all n-grams”, d is “different n-grams” and e is the warp.- Parameters
samegrams – number of n-grams shared by the two strings.
allgrams – total of the distinct n-grams across the two strings.
- Returns
similarity in the range 0.0 to 1.0.
>>> from ngram import NGram >>> NGram.ngram_similarity(5, 10) 0.5 >>> NGram.ngram_similarity(5, 10, warp=2) 0.75 >>> NGram.ngram_similarity(5, 10, warp=3) 0.875 >>> NGram.ngram_similarity(2, 4, warp=2) 0.75 >>> NGram.ngram_similarity(3, 4) 0.75
- ngrams(string)¶
Iterates over the ngrams of a string (no padding).
>>> from ngram import NGram >>> n = NGram() >>> list(n._split("hamegg")) ['ham', 'ame', 'meg', 'egg']
- ngrams_pad(string)¶
Pads a string and iterates over its ngrams.
>>> from ngram import NGram >>> n = NGram() >>> list(n.split("ham")) ['$$h', '$ha', 'ham', 'am$', 'm$$']
- pad(string)¶
Pad a string in preparation for splitting into ngrams.
>>> from ngram import NGram >>> n = NGram() >>> n.pad('ham') '$$ham$$'
- pop()¶
Remove and return an arbitrary set element. Raises KeyError if the set is empty.
>>> from ngram import NGram >>> n = NGram(['spam', 'eggs']) >>> n.pop() 'eggs'
- remove(item)¶
Remove an item from the set. Inverts the add operation.
>>> from ngram import NGram >>> n = NGram(['spam', 'eggs']) >>> n.remove('spam') >>> list(n) ['eggs']
- search(query, threshold=None)¶
Search the index for items whose key exceeds threshold similarity to the query string.
- Parameters
query – returned items will have at least threshold similarity to the query string.
- Returns
list of pairs of (item, similarity) by decreasing similarity.
>>> from ngram import NGram >>> n = NGram([(0, "SPAM"), (1, "SPAN"), (2, "EG")], key=lambda x:x[1]) >>> n.search("SPA") [((0, 'SPAM'), 0.375), ((1, 'SPAN'), 0.375)] >>> n.search("M") [((0, 'SPAM'), 0.125)] >>> n.search("EG") [((2, 'EG'), 1.0)]
- searchitem(item, threshold=None)¶
Search the index for items whose key exceeds the threshold similarity to the key of the given item.
- Returns
list of pairs of (item, similarity) by decreasing similarity.
>>> from ngram import NGram >>> n = NGram([(0, "SPAM"), (1, "SPAN"), (2, "EG"), ... (3, "SPANN")], key=lambda x:x[1]) >>> n.searchitem((2, "SPA"), 0.35) [((0, 'SPAM'), 0.375), ((1, 'SPAN'), 0.375)]
- split(string)¶
Pads a string and iterates over its ngrams.
>>> from ngram import NGram >>> n = NGram() >>> list(n.split("ham")) ['$$h', '$ha', 'ham', 'am$', 'm$$']
- splititem(item)¶
Pads the string key of an item and iterates over its ngrams.
>>> from ngram import NGram >>> n = NGram(key=lambda x:x[1]) >>> item = (3,"ham") >>> list(n.splititem(item)) ['$$h', '$ha', 'ham', 'am$', 'm$$']
- symmetric_difference(other)¶
Return the symmetric difference of two sets as a new set.
>>> from ngram import NGram >>> a = NGram(['spam', 'eggs']) >>> b = NGram(['spam', 'ham']) >>> list(a.symmetric_difference(b)) ['eggs', 'ham']
- symmetric_difference_update(other)¶
Update the set with the symmetric difference of itself and other.
>>> from ngram import NGram >>> n = NGram(['spam', 'eggs']) >>> other = set(['spam', 'ham']) >>> n.symmetric_difference_update(other) >>> list(n) ['eggs', 'ham']
- union(*others)¶
Return the union of two or more sets as a new set.
>>> from ngram import NGram >>> a = NGram(['spam', 'eggs']) >>> b = NGram(['spam', 'ham']) >>> list(a.union(b)) ['eggs', 'ham', 'spam']
- update(items)¶
Update the set with new items.
>>> from ngram import NGram >>> n = NGram(["spam"]) >>> n.update(["eggs"]) >>> list(n) ['eggs', 'spam']