Ngram.pm version 0.08
=====================

=head1 NAME

Text::Ngram - Ngram analysis of text

=head1 SYNOPSIS

  use Text::Ngram qw(ngram_counts add_to_counts);
  my $text   = "abcdefghijklmnop";
  my $hash_r = ngram_counts($text, 3); # Window size = 3
  # $hash_r => { abc => 1, bcd => 1, ... }

  add_to_counts($more_text, 3, $hash_r);

=head1 DESCRIPTION

n-Gram analysis is a field in textual analysis which uses sliding window
character sequences in order to aid topic analysis, language
determination and so on. The n-gram spectrum of a document can be used
to compare and filter documents in multiple languages, prepare word
prediction networks, and perform spelling correction.

The neat thing about n-grams, though, is that they're really easy to
determine. For n=3, for instance, we compute the n-gram counts like so:

    the cat sat on the mat
    ---                     $counts{"the"}++;
     ---                    $counts{"he "}++;
      ---                   $counts{"e c"}++;
       ...

This module provides an efficient XS-based implementation of n-gram
spectrum analysis.

There are two functions which can be imported:

=head2 ngram_counts

This first function returns a hash reference with the n-gram histogram 
of the text for the given window size. The default window size is 5.

    $href = ngram_counts(\%config, $text, $window_size);

The only necessary parameter is $text.

The possible value for \%config are:

=head3 flankbreaks

If set to 1 (default), breaks are flanked by spaces; if set to 0,
they're not. Breaks are punctuation and other non-alfabetic
characters, which, unless you use C<punctuation => 0> in your
configuration, do not make it into the returned hash.

Here's an example, supposing you're using the default value
for punctuation (1):

  my $text = "hello, world";
  my $hash = ngram_counts($text, 5);

That produces the following ngrams:

  {
    'hello' => 1,
    'ello ' => 1,
    ' worl' => 1,
    'world' => 1,
  }

On the other hand, this:

  my $text = "hello, world";
  my $hash = ngram_counts({flankbreaks => 0}, $text, 5);

Produces the following ngrams:

  {
    'hello' => 1,
    ' worl' => 1,
    'world' => 1,
  }

=head3 lowercase

If set to 0, casing is preserved. If set to 1, all letters are
lowercased before counting ngrams. Default is 1.

    # Get all ngrams of size 4 preserving case
    $href_p = ngram_counts( {lowercase => 0}, $text, 4 );

=head3 punctuation

If set to 0 (default), punctuation is removed before calculating the
ngrams.  Set to 1 to preserve it.

    # Get all ngrams of size 2 preserving punctuation
    $href_p = ngram_counts( {punctuation => 1}, $text, 2 );

=head3 spaces

If set to 0 default is 1, no ngrams contaning spaces will be returned.

   # Get all ngrams of size 3 that do not contain spaces
   $href = ngram_counts( {spaces => 0}, $text, 3);

If you're going to request both types of ngrams, than the best way to
avoid calculating the same thing twice is probably this:

    $href_with_spaces = ngram_counts($text[, $window]);
    $href_no_spaces = $href_with_spaces;
    for (keys %$href_no_spaces) { delete $href->{$_} if / / }

=head2 add_to_counts

This incrementally adds to the supplied hash; if C<$window> is zero or
undefined, then the window size is computed from the hash keys.

    add_to_counts($more_text, $window, $href)

=head1 TO DO

=over 6

=item * Look further into the tests. Sort them and add more.

=back

=head1 SEE ALSO

Cavnar, W. B. (1993). N-gram-based text filtering for TREC-2. In D.
Harman (Ed.), I<Proceedings of TREC-2: Text Retrieval Conference 2>.
Washington, DC: National Bureau of Standards.

Shannon, C. E. (1951). Predication and entropy of printed English.
I<The Bell System Technical Journal, 30>. 50-64.

Ullmann, J. R. (1977). Binary n-gram technique for automatic correction
of substitution, deletion, insert and reversal errors in words.
I<Computer Journal, 20>. 141-147.

=head1 AUTHOR

Maintained by Jose Castro, C<cog@cpan.org>.

Originally created by Simon Cozens, C<simon@cpan.org>.

=head1 COPYRIGHT AND LICENSE

Copyright 2004 by Jose Castro

Copyright 2003 by Simon Cozens

This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.