Stemming different languages in PHP

While working with a Naive Bayes classifier in PHP, I needed to do some stemming. In particular I needed a Porter-style stemmer for Swedish, but most PHP libraries only support English.

PECL stem to the rescue!

It is easily installed using:

pecl install stem

And adding the following to your php.ini:

extension=stem.so

(Don’t forget to restart Apache!)
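To confirm that the extension was actually picked up, a quick check along these lines can help. This is a minimal sketch: it assumes the extension registers itself under the name stem, and stem_extension_status() is a hypothetical helper name, not part of the package.

```php
<?php
// Describe whether the stem extension is visible to the PHP process
// running this script. 'stem' is the assumed extension name.
function stem_extension_status(): string
{
    return extension_loaded('stem')
        ? 'stem extension is loaded'
        : 'stem extension is NOT loaded; check php.ini and restart the web server';
}

echo stem_extension_status(), PHP_EOL;
```

Run it through the web server rather than the command line if you want to check the Apache setup, since the two often read different php.ini files.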

The stem package is based on the Snowball API. Currently the PECL package supports the following languages:

  • Danish
  • Dutch
  • English
  • Finnish
  • French
  • German
  • Hungarian
  • Italian
  • Norwegian
  • Portuguese
  • Romanian
  • Russian
  • Russian (UTF8)
  • Spanish
  • Swedish
  • Turkish (UTF8)

Using it is as simple as stem_LANGUAGE($word).

For example, to stem an English word:

echo stem_english('judges'); // Prints the stem, "judg"

Stemming a Swedish word is just as easy:

echo stem_swedish('affärscheferna'); // Prints the stem, "affärschef"
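Because the function name follows the stem_LANGUAGE pattern, you can pick the language at runtime, which is handy when tokenising text for a classifier. The sketch below is one way to do that, not part of the extension: stem_words() is a hypothetical helper, and it simply returns the words unchanged when the matching function is not available.

```php
<?php
// Stem every word in $words with the stem_LANGUAGE() function for $language.
// stem_words() is a hypothetical helper, not provided by the stem extension.
function stem_words(array $words, string $language): array
{
    // Build the function name, e.g. "stem_swedish" for "swedish".
    $function = 'stem_' . strtolower($language);

    // If the extension (or that language) is missing, pass the words through.
    if (!function_exists($function)) {
        return $words;
    }

    return array_map($function, $words);
}

print_r(stem_words(['judges', 'judging'], 'english'));
```

The function names for the UTF8 variants in the list above may not follow the plain language name, so check the extension's documentation before passing them in.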

Alternatives

If you are looking for a PHP-only solution that does not need an additional PHP extension, I can recommend the Porter Stemmer by Cam Spiers.