Stemming different languages in PHP

While working with a Naive Bayes classifier in PHP, I needed to do some stemming. In particular I needed a stemmer for Swedish, but most libraries only provide Porter stemming for English.

PECL stem to the rescue!

It is easily installed using:

pecl install stem

Then add the following to your php.ini:

extension=stem.so

(Don’t forget to restart Apache!)
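To confirm that the extension was actually picked up after the restart, a quick check along these lines may help. This is only a sketch: available_stemmers() is a hypothetical helper of my own, and it assumes the extension registers itself under the name "stem" (matching the stem.so file above). extension_loaded() and get_extension_funcs() are standard PHP functions.

```php
<?php
// Hypothetical helper (not part of the stem package): lists the per-language
// stem_* functions that the loaded extension exposes.
// Assumption: the extension is registered under the name "stem".
function available_stemmers($extension = 'stem')
{
    if (!extension_loaded($extension)) {
        return [];
    }

    // get_extension_funcs() returns false when nothing is found.
    $functions = get_extension_funcs($extension) ?: [];

    // Keep only the per-language functions, e.g. stem_english, stem_swedish.
    $stemmers = array_filter($functions, function ($name) {
        return strpos($name, 'stem_') === 0;
    });

    return array_values($stemmers);
}

$stemmers = available_stemmers();
echo $stemmers ? implode(', ', $stemmers) : 'No stem_* functions found', PHP_EOL;
```

If the list is empty, or the language you need is missing, the extension was probably not loaded or that language was not compiled in during installation.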

The stem package is based on the Snowball API. Currently the PECL package supports the following languages:

  • Danish
  • Dutch
  • English
  • Finnish
  • French
  • German
  • Hungarian
  • Italian
  • Norwegian
  • Portuguese
  • Romanian
  • Russian
  • Russian (UTF8)
  • Spanish
  • Swedish
  • Turkish (UTF8)

Using it is as simple as stem_LANGUAGE($word).

For example, to stem an English word:

echo stem_english('judges'); // Prints the stem, "judg"

Stemming a Swedish word is just as easy:

echo stem_swedish('affärscheferna'); // Prints the stem, "affärschef"
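For something like a classifier you usually want to stem a whole text rather than a single word. The following is a minimal sketch of how that could look; stem_text() is a hypothetical helper, not part of the stem package, and the tokenization rule (split on anything that is not a letter or digit) is just one simple choice. It falls back to returning the words unstemmed when the requested stem_LANGUAGE function is not available, which avoids the "Call to undefined function" fatal error.

```php
<?php
// Hypothetical helper: splits a text into words and stems each one with
// stem_LANGUAGE(), if that function exists in this PHP installation.
function stem_text($text, $language)
{
    $stemmer = 'stem_' . strtolower($language);

    // Split on anything that is not a letter or digit. The /u flag treats the
    // input as UTF-8, so characters such as "ä" stay inside their word.
    $words = preg_split('/[^\p{L}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false) {
        return []; // Input was not valid UTF-8.
    }

    // Stemmer for this language not compiled in: return the words as they are.
    if (!function_exists($stemmer)) {
        return $words;
    }

    return array_map($stemmer, $words);
}

print_r(stem_text('The judges were judging', 'english'));
```

The same call with 'swedish' as the second argument would use stem_swedish() instead, provided that stemmer was selected during installation.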

Alternatives

If you are looking for a PHP-only solution for English which does not need an additional PHP extension, I can recommend the Porter Stemmer by Cam Spiers.

16 thoughts on “Stemming different languages in PHP”

  1. Himanshu

    Hi,

    Thanks for the tutorial. It was extremely helpful as no documentation of pecl stem exists.
    I tried stemming using stem_english($word) but I am getting the following error:


    PHP Fatal error: Call to undefined function stem_english()

    However, when I use stem($word), it is stemming it (but not efficiently).

    Am I doing something wrong?

    Regards,
    Himanshu Joshi

    1. Stanislav Khromov Post author

      Hi Himanshu,

      After you run “pecl install stem” you are asked a series of questions about which languages you would like to compile into your stemmer. Make sure you select “yes” for the English stemmer. It looks like this:

      ...
      Compile English stemmer? [yes] : yes
      ...
      

      Afterwards, you can use the stem_english function.

      If you only require English stemming, you may also use the porter-stemmer written by Cam Spiers, which requires no additional modules: https://github.com/camspiers/porter-stemmer

      Edit: You can see which languages are available in your stemmer by checking phpinfo(); under “stem support”

  2. Benjamin Intal

    Very helpful article! For me I had to install pecl first since my server didn’t have it yet:

    apt-get install php-http
    pecl install pecl_http

    I’ll be using this to build my search index.

  3. Xavier

    Hi,

    After installing PEAR and placing the file “stem-1.5.1.tgz” under the “php5.4.3” folder, I tried “pecl install stem”, but got the error “The DSP stem.dsp does not exist”. See the log below. It is strange because the “stem.dsp” file was included in “stem-1.5.1.tgz”. Any idea how to fix that? Also, how can I uninstall it? Thanks for your help.

    C:\wamp\bin\php\php5.4.3>pecl install stem
    downloading stem-1.5.1.tgz …
    Starting to download stem-1.5.1.tgz (82,665 bytes)
    ………………..done: 82,665 bytes
    43 source files, building
    WARNING: php_bin C:\wamp\bin\php\php5.4.3\php.exe appears to have a suffix \php5.4.3\php.exe, but config variable php_suffix does not match
    ERROR: The DSP stem.dsp does not exist.

  4. Xavier

    Hi Stanislav,

    Thanks for your reply. I downloaded the zip file “php_stem-1.5.1-5.5-ts-vc11-x86” and placed the dll “php_stem” under the ext folder. I also added the line “extension=php_stem.dll” to the php.ini file.
    I probably missed a step, since calling the stem function from my PHP code still generates an error: Fatal error: Call to undefined function stem_english(). Do you have any idea what is wrong here?

    1. Stanislav Khromov Post author

      Hi Xavier,

      Can you check if the extension appears when you do a phpinfo(); ? (Search for stem on the page)

      If it does not appear, the extension was not installed correctly. I know many WAMP stacks on Windows ship multiple PHP versions, so make sure you added the extension to the folder of the version you are actually running.

  5. Alex

    Looks like stem_russian() is not working correctly:

    echo stem_russian("букеты");

    Gives the same string, always.

    1. Stanislav Khromov Post author

      If I recall correctly, you get to pick which languages should be included during the pecl install. Make sure you selected Y(es) for the language you need, in this case Russian.

  6. Gokhan

    I installed it with PECL and added the English and Turkish languages. But when I call stem_turkish(), PHP gives the error “Fatal error: Call to undefined function stem_turkish() in /var/www/servis/srvc/index.php on line 7”.

    When I try stem_english(), it works fine. What is the problem?

