Sunday, October 14, 2012

pattern, a Python web mining and NLP tool

By Vasudev Ram


pattern is a web mining and NLP (Natural Language Processing) library for Python.

It is from CLiPS (Computational Linguistics & Psycholinguistics), "a research center associated with the Linguistics department of the faculty of Arts of the University of Antwerp."

From the site:

[ It bundles tools for data retrieval (Google + Twitter + Wikipedia API, web spider, HTML DOM parser), text analysis (rule-based shallow parser, WordNet interface, syntactical + semantical n-gram search algorithm, tf-idf + cosine similarity + LSA metrics), clustering and classification (k-means, KNN, SVM), and data visualization (graph networks). ]

Example usage and output - from the site:
>>> from pattern.web import Twitter, plaintext
>>> for tweet in Twitter().search('"more important than"', cached=False):
...     print(plaintext(tweet.description))
 
'HINT: The mobile web is more important than mobile apps.'
'Start slowly, direction is more important than speed.'
'Imagination is more important than knowledge. - Albert Einstein'
...
I installed it (download the zip file, extract it and run "python setup.py install"), then tried it out with the test program above and a few variations on it. It partially works: it is able to fetch some tweets, but in some cases it gives errors that seem to be related to Unicode.
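
I have not pinned down the exact cause of those errors. One common source of this kind of problem is printing text that contains characters the console encoding cannot represent. Below is a minimal sketch of a workaround under that assumption. The helper name to_printable and the sample string (a variation on one of the tweets above) are hypothetical and are not part of pattern.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import sys


def to_printable(text, encoding=None):
    """Return text with characters the target encoding cannot
    represent replaced by '?', so that printing it does not fail."""
    # Fall back to ASCII if the console encoding is unknown (an assumption).
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(encoding, "replace").decode(encoding)


# Illustrative text only: a tweet-like string containing a non-ASCII dash.
sample = u"Imagination is more important than knowledge. \u2013 Albert Einstein"

# Force ASCII here to show the replacement; omit the argument to use
# the console's own encoding instead.
print(to_printable(sample, encoding="ascii"))
```

With a helper like this, the loop in the example above could print to_printable(plaintext(tweet.description)) instead of printing the text directly. That may hide the symptom rather than fix the underlying cause.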

It also has an NLP module for English and a few other languages, plus the clustering, classification and visualization tools mentioned in the description above.

UPDATE:

It is now working. I got it to fetch these recent tweets of mine (from my @vasudevram Twitter profile):
IGNORE THIS (testing a Twitter tool). test===444
IGNORE THIS (testing a Twitter tool). test===333
IGNORE THIS (testing a Twitter tool). test===222
IGNORE THIS (testing a Twitter tool). test===111

- Vasudev Ram - Dancing Bison Enterprises
