# NAME

Search::Fulltext::Tokenizer::MeCab - Provides Japanese fulltext search for the [Search::Fulltext](http://search.cpan.org/perldoc?Search::Fulltext) module

# SYNOPSIS

    use Search::Fulltext;
    use Search::Fulltext::Tokenizer::MeCab;
    use Test::More;  # provides is_deeply and done_testing

    my $query = '猫';
    my @docs = (
        '我輩は猫である',
        '犬も歩けば棒に当る',
        '実家でてんちゃんって猫を飼ってまして,ものすっごい可愛いんですよほんと',
    );

    my $fts = Search::Fulltext->new({
        docs      => \@docs,
        tokenizer => "perl 'Search::Fulltext::Tokenizer::MeCab::tokenizer'",
    });
    my $results = $fts->search($query);
    is_deeply($results, [0, 2]);  # 1st & 3rd include '猫'
    $results = $fts->search('猫 AND 可愛い');
    is_deeply($results, [2]);     # only the 3rd includes both terms
    done_testing;

# DESCRIPTION

[Search::Fulltext::Tokenizer::MeCab](http://search.cpan.org/perldoc?Search::Fulltext::Tokenizer::MeCab) is a Japanese tokenizer that works with the fulltext search module [Search::Fulltext](http://search.cpan.org/perldoc?Search::Fulltext).
All you have to do is specify `perl 'Search::Fulltext::Tokenizer::MeCab::tokenizer'` as the `tokenizer` of [Search::Fulltext](http://search.cpan.org/perldoc?Search::Fulltext).

    my $fts = Search::Fulltext->new({
        docs      => \@docs,
        tokenizer => "perl 'Search::Fulltext::Tokenizer::MeCab::tokenizer'",
    });

`docs` must be UTF-8 strings.

Various query types are available, as described in ["QUERIES" in Search::Fulltext](http://search.cpan.org/perldoc?Search::Fulltext#QUERIES), but _wildcard queries_ (e.g. '我\*') and _phrase queries_ (e.g. '"我輩は猫である"') are not supported.

A user dictionary can be used to change the tokenizing behavior of [Text::MeCab](http://search.cpan.org/perldoc?Text::MeCab), which is used internally.
See the ENVIRONMENTAL VARIABLES section for details.

# ENVIRONMENTAL VARIABLES

Some environment variables are provided to customize the behavior of [Search::Fulltext::Tokenizer::MeCab](http://search.cpan.org/perldoc?Search::Fulltext::Tokenizer::MeCab).

Typical usage:

    $ ENV1=foobar ENV2=buz perl /path/to/your_script_using_this_module ARGS

- `MECABDIC_USERDIC`

    Specify the path(s) to __MeCab's user dictionary__.
    See MeCab's manual to learn how to create a user dictionary.

    Examples:

        MECABDIC_USERDIC="/path/to/yourdic1.dic"
        MECABDIC_USERDIC="/path/to/yourdic1.dic, /path/to/yourdic2.dic"

- `MECABDIC_DEBUG`

    When set to a value other than 0, debug strings are printed to STDERR.
    In particular, the output below helps you check how your `docs` are tokenized.

        string to be parsed: 我輩は猫である (7)
        token: 我輩 (2)
        token: は (1)
        token: 猫 (1)
        token: で (1)
        token: ある (2)
        ...
        string to be parsed: 猫 AND 可愛い (9)
        token: 猫 (1)
        string to be parsed: 可愛い (4)
        token: 可愛い (3)

    Note that queries are tokenized as well as `docs`.

# SUPPORTS

Bug reports and pull requests are welcome at [https://github.com/laysakura/Search-Fulltext-Tokenizer-MeCab](https://github.com/laysakura/Search-Fulltext-Tokenizer-MeCab)!

To read this manual via `perldoc`, use the `-t` option so that UTF-8 characters are displayed correctly.

    $ perldoc -t Search::Fulltext::Tokenizer::MeCab

# VERSION

Version 1.02

# AUTHOR

Sho Nakatani <lay.sakura@gmail.com>, a.k.a. @laysakura