Welcome to NGram’s documentation!
Introduction
The NGram class extends the Python ‘set’ class with efficient fuzzy search for members by means of an N-gram similarity measure. It also has static methods to compare a pair of strings.
The N-grams are character-based, not word-based, and the class does not implement a language model; it merely searches for members by string similarity.
The documentation and tutorial are on the PyPI package documentation site.
Installation
Install python-ngram from PyPI using the pip installer:
pip install ngram
It should run on Python 2.6, Python 2.7 and Python 3.2.
How does it work?
The set stores arbitrary items, but for non-string items a key function (such as str) must be specified to provide a string representation. The key function can also be used to normalise string items (e.g. lower-casing) prior to N-gram indexing.
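As a rough illustration of the key function's role (not the library's own code), the sketch below pairs each stored item with the string that would be N-gram indexed. The helper name key_strings is hypothetical; str and str.lower stand in for the key functions mentioned above.

```python
def key_strings(items, key=None):
    """Pair each item with the string that would be N-gram indexed.

    Hypothetical helper for illustration only; it is not part of the
    NGram class.
    """
    if key is None:
        # With no key function, items are assumed to be strings already.
        key = lambda item: item
    return [(item, key(item)) for item in items]


# Non-string items need a key function such as str.
numbers = key_strings([2012, 2007], key=str)

# String items can be normalised, here by lower-casing.
names = key_strings(["Spam", "EGGS"], key=str.lower)
```

In both cases the original item is what the set stores; only the derived string is split into N-grams.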
To index a string, it pads the string with a specified dummy character, then splits it into overlapping substrings of N (default N=3) characters in length and associates each N-gram with the items that use it.
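The indexing step can be sketched in plain Python as follows. This is a minimal illustration, not the library's implementation: the function names are hypothetical, and both the "$" pad character and the pad length of N-1 characters on each side are assumptions made for the example.

```python
from collections import defaultdict


def split_ngrams(text, n=3, pad_char="$"):
    """Pad text on both sides, then list its overlapping n-character substrings."""
    # Assumption: pad with n-1 copies of pad_char on each side.
    padding = pad_char * (n - 1)
    padded = padding + text + padding
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_index(items, n=3, pad_char="$"):
    """Map each N-gram to the set of items that contain it."""
    index = defaultdict(set)
    for item in items:
        for gram in split_ngrams(item, n, pad_char):
            index[gram].add(item)
    return index


grams = split_ngrams("abc")   # ['$$a', '$ab', 'abc', 'bc$', 'c$$']
index = build_index(["spam", "span"])
```

Here index["spa"] holds both "spam" and "span", while index["pam"] holds only "spam". The padding lets the first and last characters of a string take part in as many N-grams as the interior characters do.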
To find items similar to a query string, it splits the query into N-grams, collects all items sharing at least one N-gram with the query, and ranks the items by score based on the ratio of shared to unshared N-grams between strings.
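The search step might look like the sketch below, written for Python 3. It is a simplified stand-in rather than the library's own scoring: the function names are hypothetical, N-grams are treated as sets, and the score is assumed to be the number of shared N-grams divided by the number of distinct N-grams across both strings.

```python
from collections import defaultdict


def split_ngrams(text, n=3, pad_char="$"):
    """Return the set of padded n-character substrings of text."""
    padding = pad_char * (n - 1)  # assumed pad length of n-1
    padded = padding + text + padding
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def search(items, query, n=3, threshold=0.0):
    """Return (item, score) pairs for items sharing an N-gram with query."""
    # Build the N-gram index: N-gram -> items that contain it.
    index = defaultdict(set)
    grams_of = {}
    for item in items:
        grams_of[item] = split_ngrams(item, n)
        for gram in grams_of[item]:
            index[gram].add(item)

    # Collect every item that shares at least one N-gram with the query.
    query_grams = split_ngrams(query, n)
    candidates = set()
    for gram in query_grams:
        if gram in index:
            candidates |= index[gram]

    # Score candidates (assumed measure: shared / all distinct N-grams).
    results = []
    for item in candidates:
        shared = len(query_grams & grams_of[item])
        total = len(query_grams | grams_of[item])
        score = shared / total
        if score >= threshold:
            results.append((item, score))

    # Highest score first; ties broken alphabetically for determinism.
    results.sort(key=lambda pair: (-pair[1], pair[0]))
    return results


results = search(["spam", "span", "eggs"], "spam")
```

With these inputs, "spam" scores 1.0, "span" scores one third (3 shared N-grams out of 9 distinct ones), and "eggs" is never considered because it shares no N-gram with the query.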
History
In 2007, Michel Albert (exhuma) wrote the python-ngram module based on Perl’s String::Trigram module by Tarek Ahmed, and committed the code for 2.0.0b2 to a now-disused SourceForge Subversion repo.
Since late 2008, Graham Poulter has maintained python-ngram, initially refactoring it to build on the set class, and also adding features, documentation, tests, performance improvements and Python 3 support.
Primary development takes place on GitHub, but changes are also pushed to the earlier repo on Google Code.
Release Notes
Version 3.3.0
Released 2012-06-29
- NEW FEATURES
  - Correct support for remaining set methods: pop, clear, union, intersection, difference, symmetric_difference
  - Can provide alternate items to the copy method
- IMPROVEMENTS
  - Update license from LGPL to LGPL version 3
  - Revised readme to work with GitHub, PyPI and generated docs.
  - Tox to run all doctests, pass under 2.7 and 3.2
- BUG FIXES
  - Fix unused threshold param in searchitem method
  - Fix intersection_update to accept multiple other iterables
Version 3.2.1
Released 2012-06-28
- Fix bug in symmetric_difference_update method
- Update release notes / changelog
- Update tutorial
Version 3.2.0
Released 2012-06-25
- NEW FEATURES
  - “csvjoin” script performs SQL-like join between CSV tables based on string similarity.
  - NGram instances can now be pickled/unpickled (added __reduce__)
  - Add searchitem method to search by item (search method takes a string)
  - Add find and finditem methods to return 1 result instead of a list.
- BREAKING CHANGES
  - The iconv parameter is now the “key” parameter (matches the sorted() builtin)
  - The qconv parameter no longer exists: use the searchitem method to query by item
  - The ngrams_pad method is deprecated in favour of the new split and splititem methods
  - The ngrams method is deprecated (the equivalent _split is for internal use)
- OTHER IMPROVEMENTS
  - Converted Mercurial repo to Git
  - Corrected indentation from 3 to 4 spaces
  - Added tox to run tests on Python 2.7 and 3.2
Version 3.1.0
Released 2009-12-07
- NEW FEATURES
  - Python 3 support via 2to3
  - Sphinx documentation generation
  - Tutorial documentation
- BREAKING CHANGES
  - str_item and str_query params are now iconv and qconv
- BUG FIXES
  - Integer division bug (e.g. arises when warp is 2 not 2.0)
- MINOR CHANGES
  - Setuptools replaced by Distribute (for Python 3)
  - Docstrings now reStructuredText for Sphinx
Version 3.0.0
Released 2009-07-03
This was a major refactoring without backward compatibility.
- NEW FEATURES
  - Accepts any hashable item; no longer limited to strings.
  - Rewritten as a subclass of set, gaining all set operations.
  - Docstrings added, using the Epydoc API doc generator.
- IMPROVEMENTS
  - Eliminated innermost level of dictionaries, reducing memory usage.
  - Revised to use Python 2.6 idioms, losing Python 2.2 compatibility.
  - Renamed things to follow PEP 8
  - Refactored the NGram class (new method decomposition)
License
This program is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>