Word segmentation apparatus and method (語分割装置および方法)

Abstract

PROBLEM TO BE SOLVED: To enable word segmentation even for new words and coined words that a morphological analyzer cannot segment.

SOLUTION: A word segmentation apparatus includes: an input means 101 that receives a character string and obtains an input character string; a segmentation means 102 that splits the input character string in two at every position between adjacent characters, obtaining for each position a segmented character string consisting of a first half character string and a latter half character string; an acquisition means 103 that acquires a first frequency, the number of occurrences of the input character string, a second frequency, the number of occurrences of the first half character string, and a third frequency, the number of occurrences of the latter half character string; a first determination means 104 that computes, for each segmented character string, the ratio of the first frequency to the smaller of the second frequency and the third frequency, and determines the segmented character string with the smallest ratio to be the optimal segmented character string; and a second determination means 105 that, when at least one of the optimal first half character string and the optimal latter half character string contained in the optimal segmented character string satisfies a stopping condition, determines that string to be a basic word.

Selected drawing: Figure 1

COPYRIGHT: (C) 2010, JPO & INPIT
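The procedure in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The abstract does not say how frequencies are obtained, what the stopping condition is, or what happens to a half that fails it, so the sketch makes these assumptions: frequencies come from a toy lookup table (`COUNTS`), the stopping condition is a caller-supplied predicate (`is_basic`), and a half that is not a basic word is split again recursively. The ratio is read as the first frequency divided by the smaller of the other two, and splits whose halves never occur are skipped to avoid division by zero. All names here are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# Toy occurrence counts (hypothetical data, not from the patent).
COUNTS = {"machinelearning": 5, "machine": 100, "learning": 80}


def freq(text: str) -> int:
    """Return how often `text` occurs; unseen strings count as 0."""
    return COUNTS.get(text, 0)


def best_split(
    text: str, freq: Callable[[str], int]
) -> Optional[Tuple[float, str, str]]:
    """Try every two-way split and keep the one with the smallest ratio
    freq(text) / min(freq(first), freq(second))."""
    best = None
    whole = freq(text)  # first frequency
    for i in range(1, len(text)):
        first, second = text[:i], text[i:]
        smaller = min(freq(first), freq(second))  # second vs. third frequency
        if smaller == 0:
            continue  # assumption: ignore splits with an unseen half
        ratio = whole / smaller
        if best is None or ratio < best[0]:
            best = (ratio, first, second)
    return best


def segment(
    text: str, freq: Callable[[str], int], is_basic: Callable[[str], bool]
) -> List[str]:
    """Split `text` into basic words; recursion is an assumption."""
    split = best_split(text, freq)
    if split is None:
        return [text]  # nothing left to split
    words: List[str] = []
    for half in split[1:]:
        if is_basic(half):  # hypothetical stopping condition
            words.append(half)
        else:
            words.extend(segment(half, freq, is_basic))
    return words


# Hypothetical stopping condition: the string is a known basic word.
BASIC = {"machine", "learning"}
result = segment("machinelearning", freq, lambda s: s in BASIC)
print(result)  # ['machine', 'learning']
```

With these toy counts only the split "machine" / "learning" has two halves that occur, giving a ratio of 5 / 80. A real system would draw the counts from a corpus or a search index and use whatever stopping condition the full specification defines.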

Patent Citations (2)

    Publication number | Publication date | Assignee | Title
    JP-2004348584-A | December 09, 2004 | Nippon Telegr & Teleph Corp (日本電信電話株式会社) | Method, device, storage medium, and program for word segmentation
    JP-S62191898-A | August 22, 1987 | Fujitsu Ltd | Compound word split processing system

Cited By (1)

    Publication number | Publication date | Assignee | Title
    CN-104750665-A | July 01, 2015 | Tencent Technology (Shenzhen) Co., Ltd. (腾讯科技(深圳)有限公司) | Text message processing method and text message processing device