Romaji Conversion with Lucene - 明日から本気出す

[改訂新版] Apache Solr入門 ~オープンソース全文検索エンジン (Software Design plus)

Authors: 大谷純, 阿部慎一朗, 大須賀稔, 北野太郎, 鈴木教嗣, 平賀一昭, 株式会社リクルートテクノロジーズ, 株式会社ロンウイット
Publisher: 技術評論社
Release date: 2013/11/29
Format: large-format book

This is based on the JapaneseReadingFormFilter used in [7.1.3 Suggester]. (Actually, it is pretty much the same thing.)

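This time I installed a plugin in Eclipse and set this up as a Gradle project.
The build.gradle is the one generated by default, with the Lucene dependencies added.
Some settings in it are not actually needed here, so please bear with that.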

apply plugin: 'java'
apply plugin: 'eclipse'

sourceCompatibility = 1.6
targetCompatibility = 1.6
version = '1.0'
jar {
    manifest {
        attributes 'Implementation-Title': 'Gradle Quickstart', 'Implementation-Version': version
    }
}

repositories {
    mavenCentral()
}

dependencies {
    compile 'org.apache.lucene:lucene-core:4.7.2'
    compile 'org.apache.lucene:lucene-analyzers-common:4.7.2'
    compile 'org.apache.lucene:lucene-analyzers-kuromoji:4.7.2'
    testCompile 'junit:junit:4.+'
}

test {
    systemProperties 'property': 'value'
}

uploadArchives {
    repositories {
       flatDir {
           dirs 'repos'
       }
    }
}

And here is the code.

package com.example;

import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.analysis.ja.tokenattributes.ReadingAttribute;
import org.apache.lucene.analysis.ja.util.ToStringUtil;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class Romanize {

    public static void main(String[] args) {
        Analyzer analyzer = new JapaneseAnalyzer(Version.LUCENE_47);
        try {
            List<String> texts = Arrays
                    .asList("すもももももももものうち。", "メガネは顔の一部です。",
                            "日本経済新聞でモバゲーの記事を読んだ。",
                            "Java, Scala, Groovy, Clojure",
                            "ＬＵＣＥＮＥ、ＳＯＬＲ、Lucene, Solr",
                            "ｱｲｳｴｵカキクケコさしすせそABCＸＹＺ123４５６",
                            "Lucene is a full-featured text search engine library written in Java.");
            for (String text : texts) {
                TokenStream tokenStream = analyzer.tokenStream("", text);
                try {
                    romanize(tokenStream);
                } catch (IOException e) {
                    e.printStackTrace();
                } finally {
                    tokenStream.close();
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            analyzer.close();
        }
    }

    private static void romanize(TokenStream tokenStream) throws IOException {
        CharTermAttribute charTermAttr = tokenStream
                .addAttribute(CharTermAttribute.class);
        ReadingAttribute readingAttr = tokenStream
                .addAttribute(ReadingAttribute.class);
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            String token = charTermAttr.toString();
            String reading = readingAttr.getReading();
            String readingRoma = null;
            if (reading != null) {
                readingRoma = ToStringUtil.getRomanization(reading);
            } else if (isHalfWidthAlphanumeric(token)) {
                readingRoma = token;
            } else {
                readingRoma = ToStringUtil.getRomanization(token);
            }
            System.out.println("token:" + token + " ,reading:" + reading
                    + " ,readingRoma:" + readingRoma);
        }
        // Finish end-of-stream processing before the caller closes the stream.
        tokenStream.end();
    }

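    // True when every character encodes to a single UTF-8 byte, i.e. the text is all ASCII.
    // The charset is given explicitly so the result does not depend on the platform default.
    private static boolean isHalfWidthAlphanumeric(String text) {
        if (text == null || text.length() == 0
                || text.length() != text.getBytes(
                        java.nio.charset.Charset.forName("UTF-8")).length) {
            return false;
        }
        return true;
    }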

}

For how to use the Analyzer and so on, I referred to this page.

After JapaneseAnalyzer splits the text into tokens,

String reading = readingAttr.getReading();

this retrieves the katakana reading of each token, and

readingRoma = ToStringUtil.getRomanization(reading);

passing that reading to ToStringUtil.getRomanization then yields the romaji.
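To make the "katakana reading in, romaji out" step concrete, here is a toy sketch in plain Java. It is not Lucene's implementation: ToyRomanizer and its table are hypothetical, it covers only a handful of kana, and the rule that writes ン as "m" before b/m/p is inferred from the results further down (シンブン comes out as shimbun).

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration only (hypothetical class, not part of Lucene).
class ToyRomanizer {

    // Deliberately tiny table: just the kana needed for the samples in main.
    private static final Map<Character, String> TABLE = new HashMap<Character, String>();
    static {
        TABLE.put('ス', "su");
        TABLE.put('モ', "mo");
        TABLE.put('メ', "me");
        TABLE.put('ガ', "ga");
        TABLE.put('ネ', "ne");
        TABLE.put('カ', "ka");
        TABLE.put('オ', "o");
        TABLE.put('キ', "ki");
        TABLE.put('ジ', "ji");
        TABLE.put('シ', "shi");
        TABLE.put('ブ', "bu");
        TABLE.put('ヨ', "yo");
    }

    static String romanize(String katakana) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < katakana.length(); i++) {
            char c = katakana.charAt(i);
            if (c == 'ン') {
                // Look ahead: write the moraic nasal as "m" before b/m/p, else "n".
                String next = i + 1 < katakana.length()
                        ? TABLE.get(katakana.charAt(i + 1)) : null;
                boolean labial = next != null && "bmp".indexOf(next.charAt(0)) >= 0;
                sb.append(labial ? "m" : "n");
            } else {
                String roma = TABLE.get(c);
                // Characters missing from the toy table are passed through unchanged.
                sb.append(roma != null ? roma : String.valueOf(c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (String reading : new String[] {"スモモ", "メガネ", "キジ", "シンブン"}) {
            System.out.println(reading + " -> " + romanize(reading));
        }
    }
}
```

With this table, シンブン becomes shimbun and スモモ becomes sumomo, which lines up with the corresponding rows in the results. The real ToStringUtil.getRomanization handles far more cases than this sketch.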

else if (isHalfWidthAlphanumeric(token)) {
    readingRoma = token;
}

Half-width letters and digits are used as-is.
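One thing to be aware of: the check in the listing compares the string length with its encoded byte length, so any all-ASCII string passes, punctuation included. Below is a minimal sketch contrasting that approach with a stricter letters-and-digits test. HalfWidthCheck and both method names are hypothetical, and it assumes Java 7 or later for StandardCharsets (the build above targets 1.6).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for comparison; not part of the original listing.
class HalfWidthCheck {

    // Byte-length comparison: true for any non-empty all-ASCII string,
    // including punctuation and spaces.
    static boolean isSingleByteOnly(String text) {
        return text != null && text.length() > 0
                && text.length() == text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Stricter variant: true only when every character is an ASCII letter or digit.
    static boolean isAsciiAlphanumeric(String text) {
        if (text == null || text.isEmpty()) {
            return false;
        }
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
            boolean digit = c >= '0' && c <= '9';
            if (!letter && !digit) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"lucene", "123456", "c++", "メガネ", "ＡＢＣ"}) {
            System.out.println(s + " singleByte=" + isSingleByteOnly(s)
                    + " alnum=" + isAsciiAlphanumeric(s));
        }
    }
}
```

The two methods agree on every sample except "c++", which only the byte-length check accepts. For the tokens in this post the difference does not show up, so which one to use is mostly a matter of how strictly "alphanumeric" should be taken.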

else {
    readingRoma = ToStringUtil.getRomanization(token);
}

For any other word where no reading was available, I just tried romanizing the original token as-is for now f^^;

The results were as follows.

token:すもも ,reading:スモモ ,readingRoma:sumomo
token:もも ,reading:モモ ,readingRoma:momo
token:もも ,reading:モモ ,readingRoma:momo
token:メガネ ,reading:メガネ ,readingRoma:megane
token:顔 ,reading:カオ ,readingRoma:kao
token:一部 ,reading:イチブ ,readingRoma:ichibu
token:日本 ,reading:ニッポン ,readingRoma:nippon
token:日本経済新聞 ,reading:ニホンケイザイシンブン ,readingRoma:nihonkeizaishimbun
token:経済 ,reading:ケイザイ ,readingRoma:keizai
token:新聞 ,reading:シンブン ,readingRoma:shimbun
token:モバゲ ,reading:null ,readingRoma:mobage
token:記事 ,reading:キジ ,readingRoma:kiji
token:読む ,reading:ヨン ,readingRoma:yon
token:java ,reading:null ,readingRoma:java
token:scala ,reading:null ,readingRoma:scala
token:groovy ,reading:null ,readingRoma:groovy
token:clojure ,reading:null ,readingRoma:clojure
token:lucene ,reading:null ,readingRoma:lucene
token:solr ,reading:null ,readingRoma:solr
token:lucene ,reading:null ,readingRoma:lucene
token:solr ,reading:null ,readingRoma:solr
token:アイウエオカキクケコ ,reading:null ,readingRoma:aiueokakikukeko
token:しす ,reading:シス ,readingRoma:shisu
token:そ ,reading:ソ ,readingRoma:so
token:abcxyz ,reading:null ,readingRoma:abcxyz
token:123456 ,reading:null ,readingRoma:123456
token:lucene ,reading:null ,readingRoma:lucene
token:is ,reading:null ,readingRoma:is
token:a ,reading:null ,readingRoma:a
token:full ,reading:null ,readingRoma:full
token:featured ,reading:null ,readingRoma:featured
token:text ,reading:null ,readingRoma:text
token:search ,reading:null ,readingRoma:search
token:engine ,reading:null ,readingRoma:engine
token:library ,reading:null ,readingRoma:library
token:written ,reading:null ,readingRoma:written
token:in ,reading:null ,readingRoma:in
token:java ,reading:null ,readingRoma:java

Hmm.