NAME

Lingua::JA::Moji - Handle many kinds of Japanese characters

SYNOPSIS

Convert romanised Japanese to and from kana:

    use utf8;
    use Lingua::JA::Moji qw/kana2romaji romaji2kana/;
    my $romaji = kana2romaji ('あいうえお');
    print "$romaji\n";
    my $kana = romaji2kana ($romaji);
    print "$kana\n";
    

produces output

    aiueo
    アイウエオ

(This example is included as synopsis.pl in the distribution.)

Convert between different forms of kana:

    use utf8;
    use Lingua::JA::Moji ':all';
    my $h = 'あいうえおがっぷぴょん';
    print kata2hira ($h), "\n";
    print hira2kata (kata2hira ($h)), "\n";
    print kana2hw ($h), "\n";
    print kata2hira (hw2katakana (kana2hw ($h))), "\n";
    # Silly circled kana
    print kana2circled ($h), "\n";

produces output

    あいうえおがっぷぴょん
    アイウエオガップピョン
    ｱｲｳｴｵｶﾞｯﾌﾟﾋﾟｮﾝ
    あいうえおがっぷぴょん
    ㋐㋑㋒㋓㋔㋕゛ッ㋫゜㋪゜ョン

(This example is included as syn-kana.pl in the distribution.)

VERSION

This document describes Lingua::JA::Moji version 0.60 corresponding to git commit 9ad3d6b5308d54f0c1eae61dc5bf7119c2670074 made on Wed Feb 14 15:11:13 2024 +0900.

DESCRIPTION

This module provides methods to convert different written forms of Japanese into one another. It enables conversion between romanized Japanese, hiragana, and katakana. It also includes a number of unusual encodings such as Japanese braille and morse code, as well as conversions between Japanese and Cyrillic and Hangul. It also handles conversion between the Chinese characters (kanji) used before and after the character reforms of 1949, as well as the various bracketed and circled forms of kana and kanji.

All the functions in this module assume the use of Unicode encoding. All input and output strings must be encoded using Perl's "UTF-8" format.
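
In practice this is usually read as meaning that the functions expect Perl character strings rather than raw bytes, so bytes read from a file or socket need decoding first, and results need encoding before they are written out. The following is a minimal sketch under that assumption, using only the core Encode module; the variable names are illustrative and not part of Lingua::JA::Moji.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Three bytes: the UTF-8 encoding of HIRAGANA LETTER A (U+3042).
my $bytes = "\xE3\x81\x82";

# Decode into a Perl character string before calling a conversion function.
my $chars = decode ('UTF-8', $bytes);

# Encode again before writing the result to a byte-oriented handle.
my $out = encode ('UTF-8', $chars);

printf "%d byte(s), %d character(s)\n", length ($bytes), length ($chars);
```

A source file which contains Japanese string literals, like the examples in this document, gets the same effect for those literals from "use utf8".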

The module loads the data file for each conversion on demand, so the more obscure conversions should not add to memory use unless they are actually called.

This module does not handle the conversion of kanji words into kana, or kana into kanji.

ROMANIZATION

These functions convert Japanese letters to and from romanized forms.

is_romaji

    use Lingua::JA::Moji 'is_romaji';
    # The following line returns "undef"
    is_romaji ("abcdefg");
    # The following line returns a defined value
    is_romaji ('loyehye');
    # The following line returns a defined value
    is_romaji ("atarimae");

This detects whether a string of alphabetical characters, which may also include characters with macrons or circumflexes, "looks like" romanized Japanese. If the test is successful, it returns a true value, and if the test is unsuccessful, it returns a false value. If the string is empty, it returns a false value. Hyphens are not allowed as the first character.

This works by converting the string to kana via "romaji2kana" and seeing if it converts cleanly or not.

The "true" value returned is the output of the round-trip conversion, converted into wapuro format.

is_romaji_semistrict

    use Lingua::JA::Moji 'is_romaji_semistrict';
    # The following line returns "undef"
    is_romaji_semistrict ("abcdefg");
    # The following line returns "undef"
    is_romaji_semistrict ('loyehye');
    # The following line returns a defined value
    is_romaji_semistrict ("atarimae");
    # The following line returns a defined value
    is_romaji_semistrict ("pinku no dorufin");

Halfway between "is_romaji" and "is_romaji_strict", this allows some formations like "pinku no dorufin" but not the really unlikely stuff which "is_romaji" allows.

is_romaji_strict

    use Lingua::JA::Moji 'is_romaji_strict';
    # The following line returns "undef"
    is_romaji_strict ("abcdefg");
    # The following line returns "undef"
    is_romaji_strict ('loyehye');
    # The following line returns a defined value
    is_romaji_strict ("atarimae");

This detects whether a string of alphabetical characters, which may also include characters with macrons or circumflexes, "looks like" romanized Japanese. If the test is successful, it returns a true value, and if the test is unsuccessful, it returns a false value. If the string is empty, it returns a false value.

This test is much stricter than "is_romaji". It insists that the word does not contain constructions which may be valid as inputs to an IME, but which do not look like Japanese words.

The "true" value returned is the output of the round-trip conversion, converted into wapuro format.

This was added to the module in version "0.27".
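
The three validators can be compared side by side. The following is a minimal sketch, assuming the module is installed. It only reports whether each function returned a true value, since the exact strings returned are not listed in this document.

```perl
use strict;
use warnings;
use Lingua::JA::Moji qw/is_romaji is_romaji_semistrict is_romaji_strict/;

for my $input ('abcdefg', 'loyehye', 'atarimae') {
    # scalar () guards against an empty list being returned on failure.
    my @results = (
        scalar (is_romaji ($input)),
        scalar (is_romaji_semistrict ($input)),
        scalar (is_romaji_strict ($input)),
    );
    printf "%-10s loose=%s semistrict=%s strict=%s\n", $input,
        map { $_ ? 'yes' : 'no' } @results;
}
```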

is_voiced

    use Lingua::JA::Moji 'is_voiced';
    if (is_voiced ('が')) {
         print "が is voiced.\n";
    }

Given a kana or romaji input, is_voiced returns a true value if the sound is a voiced sound like a, za, ga, etc. and the undefined value if not.

kana2romaji

Convert kana to romaji.

    use Lingua::JA::Moji 'kana2romaji';
    $romaji = kana2romaji ("うれしいこども");
    # $romaji = 'uresîkodomo'

Convert kana to a romanized form.

An optional second argument, a hash reference, controls the style of conversion.

    use utf8;
    $romaji = kana2romaji ("しんぶん", {style => "hepburn"});
    # $romaji = "shimbun"

The options are

style

The style of romanization. The default style of romanization is "Nippon-shiki". The user can set the conversion style to "hepburn" or "passport" or "kunrei" or "common". If Hepburn is selected, then the following option use_m is set to "true", and the ve_type is set to "macron". The "common" style is the same as the Hepburn style, but it does things like changing "ジェット" to "jetto" rather than ignoring the small vowel.

Possible styles are as follows:

none/empty

Without a style, the Nippon-shiki romanization is used. This is the only romanisation style which allows round trips from kana to romanised and back.

common

This is a modification of the Hepburn system which also changes combinations of large kana + small vowel kana into the commonest romanized form. For example "ジェット" becomes "jetto" and "ウェ" becomes "we".

hepburn

This gives Hepburn romanization. This is strictly defined to be the actual Hepburn system, so you may prefer to use "common" if your kana contains things like ファ which you want to turn into "fa".

kunrei

This gives Kunrei-shiki romanisation, the form of romanisation used in children's education. This is similar to Nippon-shiki except for a few consonant-vowel combinations.

passport

This gives "passport romaji" where long "o" vowels get turned into "oh" and other long vowels are deleted. In this system "おおの" turns into "ohno" and "ゆうすけ" turns into "yusuke".

use_m

If this is true, "syllabic n"s (ん) which come before "b" or "p" sounds, such as the first "n" in "shinbun" (しんぶん, newspaper) will be converted into "m" rather than "n".

It is automatically set to a true value if you choose "hepburn" or "passport" styles of romanisation, but you can override that by setting it to a false, but not undefined, value, something like this:

    my $romaji = kana2romaji ($hiragana,
                          {style => 'hepburn', 
                           ve_type => 'wapuro',
                           use_m => 0,});

I apologise for the convoluted interface. See "HISTORY" for more on the haphazard design of the module.

ve_type

The ve_type option controls how long vowels are written. The default is to use circumflexes to represent long vowels. If style is set to hepburn or common, the default is set to use macrons. If style is set to passport, the value of ve_type is also set to passport. The choices are:

undef

A circumflex is used.

macron

A macron is used.

passport

"Oh" is used to write long "o" vowels, and other long vowels are ignored.

none

Long vowels are not indicated.

wapuro

The "chouon" marks become hyphens, and おう becomes ou.

wo

    kana2romaji ("ちりぬるを", { wo => 1 });

If "wo" is set to a true value, "を" becomes "wo", otherwise it becomes "o".
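
To see how the style option changes the result, the same kana can be run through each documented style. This is a sketch which assumes the module is installed. Only the Hepburn result for this word ("shimbun") is stated in this document, so the other outputs are left for the reader to inspect.

```perl
use strict;
use warnings;
use utf8;
use Lingua::JA::Moji 'kana2romaji';

binmode STDOUT, ':encoding(UTF-8)';
my $kana = 'しんぶん';
# An undefined style stands for the default, Nippon-shiki.
for my $style (undef, qw/hepburn kunrei passport common/) {
    my $romaji = defined $style
        ? kana2romaji ($kana, {style => $style})
        : kana2romaji ($kana);
    printf "%-10s %s\n", $style // 'default', $romaji;
}
```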

kana_consonant

    use Lingua::JA::Moji 'kana_consonant';
    $consonant = kana_consonant ('ざる');
    # $consonant = 's'

Given a kana input, return the "dictionary order" consonant of the first kana. If the first kana is any of あいうえお, it returns an empty string. If the kana is an unvoiced kana, it returns the corresponding consonant of the first kana in the Nippon-shiki romanisation. If the kana is a voiced kana, it returns the corresponding consonant of the unvoiced version of the first kana in the Nippon-shiki romanisation.

This enables Japanese words to be sorted into the order used in Japanese dictionaries, where the voiced/unvoiced distinction between, for example, za and sa, or ta and da, is ignored.
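
As an illustration of that use, the following sketch groups a few words by the value kana_consonant returns, so that voiced and unvoiced variants land in the same group. The word list and the grouping code are only an example and assume the module is installed.

```perl
use strict;
use warnings;
use utf8;
use Lingua::JA::Moji 'kana_consonant';

binmode STDOUT, ':encoding(UTF-8)';
my %by_consonant;
for my $word (qw/さる ざる たこ だこ あめ/) {
    push @{$by_consonant{kana_consonant ($word)}}, $word;
}
for my $consonant (sort keys %by_consonant) {
    # Words which start with a vowel are filed under the empty string.
    my $label = length ($consonant) ? $consonant : '(vowel)';
    print "$label: @{$by_consonant{$consonant}}\n";
}
```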

normalize_romaji

    use Lingua::JA::Moji 'normalize_romaji';
    $normalized = normalize_romaji ('tsumuji');

normalize_romaji converts romanized Japanese to a canonical form, which is based on the Nippon-shiki romanization, but without representing long vowels using a circumflex. In the canonical form, sokuon (っ) characters are converted into the string "xtu". If there is kana in the input string, this will also be converted to romaji.

normalize_romaji is for comparing two Japanese words which may be represented in different ways, for example in different romanization systems, to see if they refer to the same word despite the difference in writing. It does not provide a standardized or officially-sanctioned form of romanization.
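
A comparison of that kind might look like the following sketch. The helper same_word is hypothetical, not part of the module, and the example assumes the module is installed.

```perl
use strict;
use warnings;
use Lingua::JA::Moji 'normalize_romaji';

# Hypothetical helper: two spellings count as the same word if their
# normalized forms are identical.
sub same_word
{
    my ($first, $second) = @_;
    return normalize_romaji ($first) eq normalize_romaji ($second);
}

# A Hepburn-like spelling and a Nippon-shiki spelling of one word.
print same_word ('tsumuji', 'tumuzi') ? "same\n" : "different\n";
```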

romaji2hiragana

Convert romaji to hiragana.

    use Lingua::JA::Moji 'romaji2hiragana';
    $hiragana = romaji2hiragana ('babubo');
    # $hiragana = 'ばぶぼ'

Convert romanized Japanese into hiragana. This takes the same options as "romaji2kana". It also switches on the "wapuro" option, which uses long vowels with a kana rather than a "chouon".

romaji2kana

Convert romaji to kana.

    use Lingua::JA::Moji 'romaji2kana';
    $kana = romaji2kana ('yamaguti');
    # $kana = 'ヤマグチ'

Convert romanized Japanese to katakana. The romanization is highly liberal and will attempt to convert any romanization it sees into katakana. The rules of romanization are based on the behaviour of the Microsoft IME (input method editor). To convert romanized Japanese into hiragana, use "romaji2hiragana".

An optional second argument to the function contains options in the form of a hash reference,

     $kana = romaji2kana ($romaji, {wapuro => 1});

Use an option wapuro => 1 to convert long vowels into the equivalent kana rather than "chouon".

     $kana = romaji2kana ($romaji, {ime => 1});

Use the ime => 1 option to approximate the behaviour of an IME. For example, input "gumma" becomes グッマ and input "onnna" becomes オンナ. Passport romaji ("Ohshimizu") is disallowed if this option is switched on.

See also "is_romaji", "is_romaji_strict", and "is_romaji_semistrict" for validation of romanised Japanese inputs.

romaji_styles

    use Lingua::JA::Moji 'romaji_styles';
    my @styles = romaji_styles ();
    # Returns a true value
    romaji_styles ("hepburn");
    # Returns the undefined value
    romaji_styles ("frogs");

Given an argument, this returns a true value if it is a known style of romanization.

Without an argument, it returns a list of possible styles, as an array of hash references, with each hash reference containing the short name under the key "abbrev" and the full name under the key "full_name".
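
The list form can be used to build a menu of styles, for example. A minimal sketch, assuming the module is installed:

```perl
use strict;
use warnings;
use Lingua::JA::Moji 'romaji_styles';

# Each element is a hash reference with the keys "abbrev" and "full_name".
for my $style (romaji_styles ()) {
    print "$style->{abbrev}: $style->{full_name}\n";
}
```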

romaji_vowel_styles

    use Lingua::JA::Moji 'romaji_vowel_styles';

Returns a list of valid styles of romaji vowels.

KANA

These functions convert one form of kana into another.

cleanup_kana

    use Lingua::JA::Moji 'cleanup_kana';

This function converts any of hiragana, halfwidth katakana, or romaji input into katakana. It also converts various confusable kanji characters into kana. For example the "one" kanji 一 is converted into a "chouon", ー, and the "mouth" kanji 口 is converted into the katakana ロ (ro).

This is used as the "front end" function for a katakana to English web application.

This was added to the module in version "0.46".

hira2kata

Convert hiragana to katakana.

    use Lingua::JA::Moji 'hira2kata';
    $katakana = hira2kata ('ひらがな');
    # $katakana = 'ヒラガナ'

hira2kata converts hiragana into katakana. The input may be a single string or a list of strings. If the input is a list, it converts each element of the list, and in list context it returns a list of the converted inputs. In scalar context it returns a concatenation of the strings.

    my @katakana = hira2kata (@hiragana);

This does not convert "chouon" signs.
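
The hiragana and katakana blocks of Unicode are laid out in parallel, so the core of such a conversion can be sketched with Perl's tr operator. This is only an illustration of the idea. It is not taken from the module's source, and the name hira_to_kata_sketch is hypothetical.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper: map U+3041..U+3096 (hiragana) onto
# U+30A1..U+30F6 (katakana). Anything else passes through unchanged,
# including the chouon sign.
sub hira_to_kata_sketch
{
    my ($text) = @_;
    $text =~ tr/\x{3041}-\x{3096}/\x{30A1}-\x{30F6}/;
    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';
print hira_to_kata_sketch ('ひらがな'), "\n";
```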

hw2katakana

Convert halfwidth katakana to katakana.

    use Lingua::JA::Moji 'hw2katakana';
    $full_width = hw2katakana ('ｱｲｳｶｷｷﾞｮｳ｡');
    # $full_width = 'アイウカキギョウ。'

hw2katakana converts "halfwidth katakana" and halfwidth Japanese punctuation to fullwidth katakana and fullwidth punctuation. Its function is similar to the Emacs command japanese-zenkaku-region. For the opposite function, see kana2hw.

InHankakuKatakana

    use Lingua::JA::Moji 'InHankakuKatakana';
    use utf8;
    if ('ｱ' =~ /\p{InHankakuKatakana}/) {
        print "ｱ is half-width katakana\n";
    }

InHankakuKatakana is a character class for use in regular expressions with \p which can validate "halfwidth katakana".
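
Classes of this kind rely on Perl's user-defined character properties: a subroutine whose name begins with In or Is returns a list of code point ranges, and \p{...} then matches against them. The following sketch shows the mechanism with a hypothetical class, InMyHalfwidthKana. The range U+FF66 to U+FF9F is an assumption chosen for illustration and may differ from what InHankakuKatakana actually covers.

```perl
use strict;
use warnings;

# Hypothetical user-defined property. Each line of the returned string is
# a range: start and end code points in hexadecimal, separated by a tab.
sub InMyHalfwidthKana
{
    return "FF66\tFF9F\n";
}

# U+FF71 is the halfwidth katakana A, U+30A2 is the fullwidth one.
for my $char ("\x{FF71}", "\x{30A2}") {
    my $verdict = $char =~ /\p{InMyHalfwidthKana}/ ? 'halfwidth' : 'not halfwidth';
    printf "U+%04X is %s\n", ord ($char), $verdict;
}
```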

InKana

    use Lingua::JA::Moji 'InKana';
    $is_kana = ('アイウエオ' =~ /^\p{InKana}+$/);
    # $is_kana = '1'

A character class for use in regular expressions which matches all kana characters. This class catches meaningful combinations of hiragana, katakana, halfwidth katakana, circled katakana, and katakana combined words. It does not match the hentaigana characters of Unicode.

This is a combination of the existing Perl character classes Katakana, InKatakana, and InHiragana, minus unassigned characters, plus the "halfwidth katakana prolonged sound mark" (U+FF70) <ｰ> (chouon), the "halfwidth katakana voiced sound mark" (U+FF9E) <ﾞ> ("dakuten") and the "halfwidth katakana semivoiced sound mark" (U+FF9F) <ﾟ> ("handakuten"), minus '・', Unicode 30FB, "KATAKANA MIDDLE DOT". It is somewhat like the following:

    qr/\p{Katakana}|\p{InKatakana}|\p{InHiragana}|ｰ|ﾞ|ﾟ/

except that the unassigned points which are matched by \p{Katakana} are not matched and KATAKANA MIDDLE DOT is not matched.

is_hiragana

    use Lingua::JA::Moji 'is_hiragana';
    

This function returns a true value if its argument is a string of hiragana, and an undefined value if not. The entire string from beginning to end must all be kana for this to return true. The kana cannot include punctuation marks or "chouon".

is_kana

    use Lingua::JA::Moji 'is_kana';
    

This function returns a true value if its argument is a string of kana, or an undefined value if not. The input cannot contain punctuation or "chouon".

is_katakana

    use Lingua::JA::Moji 'is_katakana';

Returns a true value if the string is katakana. At the moment this doesn't do the half-width katakana or squared symbol katakana.

is_small

    use Lingua::JA::Moji 'is_small';
    $is_small = is_small ('ぁ');

Returns a true value for small kana, kana which have a bigger version as well, such as ぁ and あ.

join_sound_marks

    use Lingua::JA::Moji 'join_sound_marks';
    $joined = join_sound_marks ('か゛は゜つ゛');
    # $joined = 'がぱづ'

Join "dakuten" and "handakuten" (Unicode U+3099-U+309C) to kana where possible. Where they cannot be joined, strip them out. This only works on full width kana. The return value is the joined text.

This was added to the module in version "0.53".
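
The joining behaviour can be approximated with the core Unicode::Normalize module, since the voiced kana have canonical decompositions into a base kana plus a combining mark. This is a sketch of the idea only. The helper join_marks_sketch is hypothetical and is not the module's implementation.

```perl
use strict;
use warnings;
use Unicode::Normalize 'NFC';

# Hypothetical helper, not the module's implementation.
sub join_marks_sketch
{
    my ($text) = @_;
    # Turn the spacing marks U+309B and U+309C into the combining marks
    # U+3099 and U+309A so that normalization can compose them.
    $text =~ tr/\x{309B}\x{309C}/\x{3099}\x{309A}/;
    $text = NFC ($text);
    # Marks which could not be composed with a kana are dropped.
    $text =~ s/[\x{3099}\x{309A}]//g;
    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';
# KA + voiced mark, HA + semivoiced mark, TU + voiced mark.
my $input = "\x{304B}\x{309B}\x{306F}\x{309C}\x{3064}\x{309B}";
print join_marks_sketch ($input), "\n";
```

Note that NFC also composes any other decomposed sequences in the text, which may or may not be wanted.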

kana2hw

Convert kana to halfwidth katakana.

    use Lingua::JA::Moji 'kana2hw';
    $half_width = kana2hw ('あいウカキぎょう。');
    # $half_width = 'ｱｲｳｶｷｷﾞｮｳ｡'

kana2hw converts hiragana, katakana, and fullwidth Japanese punctuation to "halfwidth katakana" and halfwidth punctuation. Its function is similar to the Emacs command japanese-hankaku-region. For the opposite function, see hw2katakana. See also "katakana2hw" for a function which only converts katakana.

kana2katakana

Convert kana to katakana.

    use Lingua::JA::Moji 'kana2katakana';
    

This converts any of katakana, "halfwidth katakana", circled katakana and hiragana to full width katakana. It also joins "dakuten" and "handakuten" marks to kana where possible, or removes them, using "join_sound_marks".

kana_to_large

    use Lingua::JA::Moji 'kana_to_large';
    $large = kana_to_large ('ぁあぃい');
    # $large = 'ああいい'

Convert small-sized kana such as 「ぁ」 into full-sized kana such as 「あ」.

kata2hira

Convert katakana to hiragana.

    use Lingua::JA::Moji 'kata2hira';
    $hiragana = kata2hira ('カキクケコ');
    # $hiragana = 'かきくけこ'

kata2hira converts full-width katakana into hiragana. If the input is a list, it converts each element of the list, and in list context, returns a list of the converted inputs, otherwise it returns a concatenation of the strings.

    my @hiragana = kata2hira (@katakana);

This function does not convert "chouon" signs into long vowels. It also does not convert half-width katakana into hiragana.

katakana2hw

Convert katakana to halfwidth katakana.

    use Lingua::JA::Moji 'katakana2hw';
    $hw = katakana2hw ("あいうえおアイウエオ");
    # $hw = 'あいうえおｱｲｳｴｵ'

This converts katakana to "halfwidth katakana", leaving hiragana unchanged. See also "kana2hw".

katakana2square

    use Lingua::JA::Moji 'katakana2square';
    $sq = katakana2square ('カロリーアイウエオウォン');
    # $sq = '㌍アイウエオ㌆'

Convert katakana into a square thing if possible.

katakana2syllable

    use Lingua::JA::Moji 'katakana2syllable';
    $syllables = katakana2syllable ('ソーシャルブックマークサービス');

This breaks the given string into syllables. If the string is broken up character by character, it becomes 'ソ', 'ー', 'シ', 'ャ', 'ル'. However, by themselves, 'ー' and 'ャ' can't be spoken.

This breaks the string up into pronounceable syllables, so that $syllables becomes 'ソー', 'シャ', 'ル'. A "syllabic n" is attached to the preceding sequence, so for example フラナガン is broken up into four syllables, フ, ラ, ナ, ガン.
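
The grouping rule described above can be approximated with a single regular expression. This is a sketch under stated assumptions, not the module's implementation: the helper name syllables_sketch is hypothetical, it returns an array reference, and it gives the small tsu (ッ) no special treatment because the text above does not describe how that is handled.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper: one katakana followed by any run of small vowel or
# y-row kana, chouon, or syllabic n is treated as a single chunk.
sub syllables_sketch
{
    my ($text) = @_;
    my @chunks = $text =~ /([ァ-ヺ][ァィゥェォャュョヮーン]*)/g;
    return \@chunks;
}

binmode STDOUT, ':encoding(UTF-8)';
for my $word (qw/ソーシャル フラナガン/) {
    print join ('/', @{syllables_sketch ($word)}), "\n";
}
```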

This routine is used as the basis of the "Change your name to kanji" web application. The name is converted from English to kana, then this function is used to break the kana name into pieces to which a kanji may be attached. It's also used in a katakana to English converter for the case that no words can be matched, and suggestions are made for how to split the word into possible components.

This was added to the module in version "0.24".

nigori_first

    use Lingua::JA::Moji 'nigori_first';
    my @list = (qw/カン スウ ハツ オオ/);
    nigori_first (\@list);
    # Now @list = (qw/カン スウ ハツ オオ ガン ズウ バツ パツ/);

Given a list of kana, add all the possible versions of the words with the first kana with either a "dakuten" or a "handakuten" added.

This was intended for a search for a particular kanji in a dictionary. It is not actually in use anywhere at the moment.

This was added to the module in version "0.36".

smallize_kana

    use Lingua::JA::Moji 'smallize_kana';
    $smallize = smallize_kana ('オキヤクサマガカツタ');
    # $smallize = 'オキャクサマガカッタ'

Given katakana input, convert possible "old-style" kana usage with large kanas used for "youon" or "sokuon" into smaller kana. If the conversion succeeds, return the converted value, otherwise return the undefined value. (I found the undefined value works better as a return value on failure than returning the text itself, since it saves the user from having to check whether the text has changed.)

The conversion is not intelligent; it just attempts to do as much as possible. Although it will work to convert "shiyotsuchiyuu" ("シヨツチユウ") into "shotchuu" ("ショッチュウ"), it will also do stupid things like converting "chiyoda" (ちよだ) into "choda" (ちょだ).

This was added to the module in version "0.46".

split_sound_marks

    use Lingua::JA::Moji 'split_sound_marks';
    $split = split_sound_marks ('ガパヅ');
    # $split = 'カ゛ハ゜ツ゛'

Split "dakuten" and "handakuten" from kana where possible. U+309B and U+309C are chosen rather than U+3099 and U+309A. (This choice was somewhat arbitrary. I'm not sure which of the pairs should be used. I chose these because they were the ones already in use internally in the module in "kana2braille" and "kana2morse".) This only works on full width kana. The return value is the split text.

This was added to the module in version "0.53".

square2katakana

    use Lingua::JA::Moji 'square2katakana';
    $kata = square2katakana ('㌆');
    # $kata = 'ウォン'

Convert a square katakana box into its components.

strip_sound_marks

    use Lingua::JA::Moji 'strip_sound_marks';

Strip sound marks from kana, so that for example パン (katakana pan) becomes ハン (katakana han).

This was added to the module in version "0.59".

HENTAIGANA

Variant kana forms. Hentaigana are new in Unicode 10.0 (June 2017).

hentai2kana

    use Lingua::JA::Moji 'hentai2kana';

Convert hentaigana into hiragana. Hentaigana with multiple interpretations are converted into a list of kana separated by a middle dot character.

This was added to the module in version "0.43".

hentai2kanji

    use Lingua::JA::Moji 'hentai2kanji';
    $kanji = hentai2kanji ('𛀢');
    # $kanji = '家'

Convert hentaigana into their equivalent kanji.

This was added to the module in version "0.43".

kana2hentai

    use Lingua::JA::Moji 'kana2hentai';
    $hentai = kana2hentai ('ケンブ');
    # $hentai = '𛀢・𛀲・𛀳・𛀴・𛀵・𛀶・𛀷𛄝・𛄞𛂰・𛂱・𛂲゛'

Convert kana to equivalent hentaigana. If more than one hentaigana exists, they are returned joined with a middle dot. The "dakuten" and "handakuten" are split out of the kana using "split_sound_marks" before the conversion.

This was added to the module in version "0.43".

kanji2hentai

    use Lingua::JA::Moji 'kanji2hentai';
    $hentai = kanji2hentai ('家');
    # $hentai = '𛀢'

Convert kanji to equivalent hentaigana, where they exist.

This was added to the module in version "0.43".

WIDE ASCII FUNCTIONS

Functions for handling "wide ASCII".

ascii2wide

Convert printable ASCII characters to wide ASCII characters.

    use Lingua::JA::Moji 'ascii2wide';
    $wide = ascii2wide ('abCE019');
    # $wide = 'ａｂＣＥ０１９'

Convert ASCII into "wide ASCII". It also converts the ASCII space, ASCII 0x20 into a fullwidth space, U+3000.
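
The fullwidth forms of printable ASCII occupy U+FF01 to U+FF5E, in the same order as ASCII 0x21 to 0x7E, so the mapping can be sketched with tr. This illustrates the idea and is not taken from the module's source; the name ascii_to_wide_sketch is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's implementation.
sub ascii_to_wide_sketch
{
    my ($text) = @_;
    # Printable ASCII maps onto the fullwidth forms 0xFEE0 code points up.
    $text =~ tr/\x{21}-\x{7E}/\x{FF01}-\x{FF5E}/;
    # The space is handled separately and becomes U+3000.
    $text =~ tr/ /\x{3000}/;
    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';
print ascii_to_wide_sketch ('abCE019'), "\n";
```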

InWideAscii

    use Lingua::JA::Moji 'InWideAscii';
    use utf8;
    if ('Ａ' =~ /\p{InWideAscii}/) {
        print "Ａ is wide ascii\n";
    }

This is a character class for use with \p which matches "wide ASCII". It also matches the fullwidth space, U+3000.

wide2ascii

Convert wide ASCII characters to printable ASCII characters.

    use Lingua::JA::Moji 'wide2ascii';
    $ascii = wide2ascii ('ａｂＣＥ０１９');
    # $ascii = 'abCE019'

Convert "wide ASCII" into ASCII. It also converts the fullwidth space, U+3000, into an ASCII space, ASCII 0x20.

OTHER TYPES OF LETTERING

braille2kana

Convert Japanese braille to kana.

    use Lingua::JA::Moji 'braille2kana';
    

Converts Japanese braille (tenji) into the equivalent katakana.

circled2kana

Convert circled katakana to kana.

    use Lingua::JA::Moji 'circled2kana';
    $kana = circled2kana ('㋐㋑㋒㋓㋔');
    # $kana = 'アイウエオ'

This function converts the "circled katakana" of Unicode into full-width katakana. See also "kana2circled".

kana2braille

Convert kana to Japanese braille.

    use Lingua::JA::Moji 'kana2braille';
    

This converts kana into the equivalent Japanese braille (tenji) forms.

Bugs

This is not an adequate Japanese braille converter. Creating Japanese braille requires breaking Japanese sentences up into individual words, but this does not attempt to do that. People who are interested in building a Perl braille converter could start here.

kana2circled

Convert kana to circled katakana.

    use Lingua::JA::Moji 'kana2circled';
    $circled = kana2circled ('アイウエオガン');
    # $circled = '㋐㋑㋒㋓㋔㋕゛ン'

This function converts kana into the "circled katakana" of Unicode, which occupy code points U+32D0 to U+32FE. See also "circled2kana".

There is no circled form of the ン kana, "syllabic n", so this is left untouched. The "dakuten" and "handakuten" are split from the kana using "split_sound_marks".

kana2morse

Convert kana to Japanese morse code (wabun code).

    use Lingua::JA::Moji 'kana2morse';
    $morse = kana2morse ('ショッチュウ');
    # $morse = '--.-. -- .--. ..-. -..-- ..-'

Convert Japanese kana into Morse code. Japanese morse code does not have any way of representing small kana characters, so converting to and then from morse code will result in ショッチュウ becoming シヨツチユウ. The function "smallize_kana" may work to fix these outputs in some cases.

morse2kana

Convert Japanese morse code (wabun code) to kana.

    use Lingua::JA::Moji 'morse2kana';
    $kana = morse2kana ('--.-. -- .--. ..-. -..-- ..-');
    # $kana = 'シヨツチユウ'

Convert Japanese Morse code into kana. Each Morse code element must be separated by whitespace from the next one.
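
The round trip through Morse code, and the repair of the small kana afterwards, can be sketched as follows, assuming the module is installed. The check on the return value of smallize_kana follows its documented behaviour of returning the undefined value when no conversion is made.

```perl
use strict;
use warnings;
use utf8;
use Lingua::JA::Moji qw/kana2morse morse2kana smallize_kana/;

binmode STDOUT, ':encoding(UTF-8)';
my $original = 'ショッチュウ';
my $morse = kana2morse ($original);
# The small kana are lost at this point.
my $back = morse2kana ($morse);
# Try to restore them; the result is undefined if nothing was converted.
my $fixed = smallize_kana ($back);
print "$morse\n$back\n";
print defined $fixed ? "$fixed\n" : "(no conversion made)\n";
```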

KANJI

bad_kanji

    use Lingua::JA::Moji 'bad_kanji';
    my @bad_kanji = bad_kanji ();

Returns a list of kanji with negative meanings. See also https://www.lemoda.net/japanese/offensive-kanji/index.html.

This was added to the module in version "0.47".

bracketed2kanji

    use Lingua::JA::Moji 'bracketed2kanji';
    $kanji = bracketed2kanji ('㈱');
    # $kanji = '株'

Convert bracketed form of kanji into unbracketed form.

circled2kanji

    use Lingua::JA::Moji 'circled2kanji';
    $kanji = circled2kanji ('㊯');
    # $kanji = '協'

Convert the circled forms of kanji into their uncircled equivalents.

kanji2bracketed

    use Lingua::JA::Moji 'kanji2bracketed';
    $kanji = kanji2bracketed ('株');
    # $kanji = '㈱'

Convert an unbracketed form of kanji into bracketed form, if it exists, otherwise do nothing with it.

kanji2circled

    use Lingua::JA::Moji 'kanji2circled';
    $kanji = kanji2circled ('協嬉');
    # $kanji = '㊯嬉'

Convert the usual forms of kanji into circled equivalents, if they exist. Note that only a limited number of kanji have circled forms.

new2old_kanji

Convert Modern kanji to Pre-1949 kanji.

    use Lingua::JA::Moji 'new2old_kanji';
    $old = new2old_kanji ('三国 連太郎');
    # $old = '三國 連太郎'

Convert new-style (post-1949) kanji (Chinese characters) into old-style (pre-1949) kanji.

Bugs

The list of characters in this converter may not contain every pair of old/new kanji.

It will not correctly convert 弁 since this has three different equivalents in the old system.

old2new_kanji

Convert Pre-1949 kanji to Modern kanji.

    use Lingua::JA::Moji 'old2new_kanji';
    $new = old2new_kanji ('櫻井');
    # $new = '桜井'

Convert old-style (pre-1949) kanji (Chinese characters) into new-style (post-1949) kanji.

yurei_moji

    use Lingua::JA::Moji 'yurei_moji';
    my @yurei = yurei_moji ();

Returns a list of the yurei moji (幽霊文字), kanji which don't actually exist but were mistakenly included in a computer standard. See https://www.sljfaq.org/afaq/yuureimoji.html for more information.

This was added to the module in version "0.47".

CYRILLIZATION

This is an experimental cyrillization of kana based on the information in a Wikipedia article, http://en.wikipedia.org/wiki/Cyrillization_of_Japanese. The module author does not know anything about cyrillization of kana, so any assistance in correcting this is very welcome.

cyrillic2katakana

Convert the Cyrillic (Russian) alphabet to katakana.

    use Lingua::JA::Moji 'cyrillic2katakana';
    $kana = cyrillic2katakana ('симбун');
    # $kana = 'シンブン'

kana2cyrillic

Convert kana to the Cyrillic (Russian) alphabet.

    use Lingua::JA::Moji 'kana2cyrillic';
    $cyril = kana2cyrillic ('シンブン');
    # $cyril = 'симбун'

HANGUL (KOREAN LETTERS)

kana2hangul

    use Lingua::JA::Moji 'kana2hangul';
    $hangul = kana2hangul ('すごわざ');
    # $hangul = '스고와자'

Bugs

May be incorrect

This is based on lists found on the internet at http://kajiritate-no-hangul.com/kana.html and http://lesson-hangeul.com/50itiranhyo.html. There is currently no proof of correctness.

No reverse conversion

There is currently no hangul to kana conversion.

SEE ALSO

Other Perl modules on CPAN include

Japanese kana/romanization

Data::Validate::Japanese

This contains four validators for kanji and kana, is_hiragana, corresponding to "is_hiragana" in this module, and three more, is_kanji, is_katakana, and is_h_katakana, for half-width katakana.

Lingua::JA::Fold

Full/half width conversion, collation of Japanese text, including handling of line breaks.

Lingua::JA::Hepburn::Passport

Passport romanization, which means converting long vowels into "OH". This corresponds to "kana2romaji" in the current module using the passport => 1 option, for example

    $romaji = kana2romaji ("かとう", {style => 'hepburn', passport => 1});

Lingua::JA::Jtruncate

Handle character boundaries over bytes in the old Japanese encodings EUC, JIS, and Shift-JIS, for people who don't like converting to Unicode.

Until about 2008, I used to use CP932 (Microsoft variant of Shift-JIS) in Perl programs, until I had the bad experience of tracking down a very strange bug caused by the "kanji space", U+3000, containing an @ mark when written in CP932, and being interpreted by Perl as an array.

Lingua::JA::Kana

This contains convertors for hiragana, half width and full width katakana, and romaji. As of version 0.07 [Aug 06, 2012], the romaji conversion is less complete than this module.

Lingua::JA::NormalizeText

A huge collection of normalization functions for Japanese text. If Lingua::JA::Moji does not have it, Lingua::JA::NormalizeText may do.

Lingua::JA::Onbiki

Convert a Japanese tilde character into the appropriate vowel. To achieve this with Lingua::JA::Moji, see the following example:

    use utf8;
    use Lingua::JA::Moji ':all';
    for (qw/あったか〜い つめた〜い ん〜 アッタカ〜イ/) {
        my $word = $_;
        while ($word =~ /(\p{InKana})〜/ && $1 ne 'ん') {
            my $kana = $1;
            my $romaji = kana2romaji ($kana);
            $romaji =~ s/[^aiueo]//g;
            my $vowel = romaji2kana ($romaji);
            if ($kana =~ /\p{InHiragana}/) {
                $vowel = kata2hira ($vowel);
            }
            $word =~ s/$kana〜/$kana$vowel/g;
        }
        print "$_ -> $word\n";
    }

produces output

    あったか〜い -> あったかあい
    つめた〜い -> つめたあい
    ん〜 -> ん〜
    アッタカ〜イ -> アッタカアイ

(This example is included as onbiki.pl in the distribution.)

Lingua::JA::Regular::Unicode

This includes hiragana to katakana, full width / half width, and wide ascii conversion. The strange name is due to its being an extension of Lingua::JA::Regular using Unicode-encoded strings.

Lingua::JA::Romaji

Romaji to kana/kana to romaji conversion.

Lingua::JA::Romaji::Valid

Validate romanized Japanese. This module does something similar to "is_romaji", "is_romaji_strict", and "is_romaji_semistrict" in Lingua::JA::Moji, but it has some extra options as well.

Lingua::JA::Romanize::Japanese

Romanization of Japanese. The module also includes romanization of kanji via the kakasi kanji to romaji convertor, and other functions.

Kana/kanji conversion

Lingua::JA::Romanize::Japanese

Romanization of Japanese language via kakasi.

Lingua::JA::Romanize::MeCab

Romanization of Japanese language with MeCab

Text::MeCab

Data::HanConvert

🐉 "The data for converting between traditional and simplified Chinese languages"

Encode::CNMap

🐉 "enhanced Chinese encodings with Simplified-Traditional auto-mapping"

Encode::HanConvert

🐉 "Traditional and Simplified Chinese mappings"

Lingua::KO::Munja

This is similar to the present module for Korean.

Lingua::ZH::HanConvert

🐉 "Convert between Traditional and Simplified Chinese characters"

Regexp::Chinese::TradSimp

🐉 "Take a string containing Chinese text, and turn it into a traditional-simplified-insensitive regexp."

Books

Parts of this module are covered in the book "Perl CPAN Module Guide" by Naoki Tomita (in Japanese), ISBN 978-4862671080, published by WEB+DB PRESS plus, April 2011.

NOTES

This section explains some of the Japanese-language-specific terminology used elsewhere in the documentation. The headers in this section are in lower case for the benefit of internal documentation links. The explanatory links here go to the "sci.lang.japan Frequently Asked Questions", a Usenet FAQ about Japanese language.

chouon

The long vowel marker, "ー", or chōon, which is used in Japanese katakana to indicate a lengthened vowel. See the sci.lang.japan FAQ entry "What is the long line symbol used in katakana?"

dakuten
handakuten

Dakuten, 濁点, literally "voicing mark", and handakuten, 半濁点, literally "half voicing mark", are diacritic marks which appear on some kana to change the consonant sound. A dakuten marks a voiced consonant, as in が (ga) from か (ka), and a handakuten marks a "p" sound, as in ぱ (pa) from は (ha). In modern Japanese encodings, these marks are usually part of the kana character, but in "halfwidth katakana" they are separate characters which follow the kana, which reduces the number of characters which need to be encoded.

This module offers "split_sound_marks" and "join_sound_marks" to dissociate the marks from kana or associate them with kana, which may be useful, for example, in Morse code, Braille, or halfwidth kana conversion. It also offers "strip_sound_marks", which removes all dakuten and handakuten from text.
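The three functions named above can be tried out as follows. This is a minimal sketch which assumes that each function takes a single string and returns the converted string, and that each can be imported by name. The sample text is arbitrary.

```perl
use strict;
use warnings;
use utf8;
# Assumption: these three functions can be imported individually.
use Lingua::JA::Moji qw/split_sound_marks join_sound_marks strip_sound_marks/;

binmode STDOUT, ':encoding(UTF-8)';

my $kana = 'ガパヅ';
# Separate each dakuten or handakuten from its kana.
my $split = split_sound_marks ($kana);
# Reattach the separated marks to the kana they follow.
my $joined = join_sound_marks ($split);
# Discard the marks altogether.
my $stripped = strip_sound_marks ($kana);

print "$split\n$joined\n$stripped\n";
```

Printing the three results side by side shows the difference between separating the marks and removing them.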

halfwidth katakana

Halfwidth katakana, hankaku katakana (半角かたかな), is a legacy encoding of katakana based on an eight-bit encoding. See the sci.lang.japan FAQ entry "What is half-width katakana?" for full details.
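In Unicode, halfwidth katakana survive as compatibility characters, and a voiced halfwidth kana is two code points, the kana followed by its sound mark. The following sketch illustrates this with Perl's core Unicode::Normalize module rather than with this module's own conversion functions. NFKC normalization is used here only as a convenient way to show the relationship between the two forms.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize 'NFKC';

binmode STDOUT, ':encoding(UTF-8)';

# Halfwidth "ga": the kana U+FF76 followed by the separate dakuten U+FF9E.
my $halfwidth = "\x{FF76}\x{FF9E}";
# NFKC maps the halfwidth forms to fullwidth ones and composes the
# mark with the kana, giving the single character U+30AC.
my $fullwidth = NFKC ($halfwidth);

# Prints "2 -> 1".
printf "%d -> %d\n", length ($halfwidth), length ($fullwidth);
print "$fullwidth\n";
```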

sokuon

Sokuon, 促音, is the use of a small kana tsu to indicate a doubled consonant. This smaller letter was not used in some kinds of older encodings such as Morse code.

syllabic n

In this document, "syllabic n" means the kana ん or ン. See the sci.lang.japan FAQ entry "What is syllabic n?" for full details.

wide ASCII

Wide ASCII, fullwidth ASCII, or zenkaku eisūji (全角英数字) characters are a legacy of bitmapped fonts which has survived into the present day. "Wide ASCII" characters were originally special bitmapped font characters created to be the same size as one kanji or kana character. The name for normal ASCII characters in Japanese is hankaku eisūji (半角英数字), literally "half width English letters and numerals". See the sci.lang.japan FAQ entry 'What is "wide ASCII"?' for full details.
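In Unicode, the wide forms of the printable ASCII characters U+0021 to U+007E occupy U+FF01 to U+FF5E in the same order, and the wide space is U+3000. The following is a minimal sketch of a conversion based on that layout. The function wide_to_ascii_sketch is a hypothetical helper written for this illustration. It is not part of Lingua::JA::Moji and is not how the module itself performs the conversion.

```perl
use strict;
use warnings;
use utf8;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper for illustration only.
sub wide_to_ascii_sketch
{
    my ($text) = @_;
    # Map each wide character onto its ASCII counterpart.
    $text =~ tr/\x{FF01}-\x{FF5E}/\x{0021}-\x{007E}/;
    # The wide space lies outside that range.
    $text =~ tr/\x{3000}/ /;
    return $text;
}

# Prints "Perl5".
print wide_to_ascii_sketch ('Ｐｅｒｌ５'), "\n";
```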

youon

Youon (拗音) means the use of kana ending in "i" with a small ya, yu, or yo kana, such as しゃ (sha) or きょ (kyo). These are called "glides" by linguists.

EXPORT

This module exports its functions only on request. To import all the functions in the module, use

    use Lingua::JA::Moji ':all';
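Individual functions can also be requested by name, as the synopsis does. The following sketch imports only the two functions it uses, and repeats the conversion from the synopsis.

```perl
use strict;
use warnings;
use utf8;
# Request only the functions which are needed.
use Lingua::JA::Moji qw/kana2romaji romaji2kana/;

binmode STDOUT, ':encoding(UTF-8)';

my $romaji = kana2romaji ('あいうえお');
my $kana = romaji2kana ($romaji);
print "$romaji\n$kana\n";
```

As in the synopsis, this prints "aiueo" followed by "アイウエオ".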

DEPENDENCIES

Carp

Carp is used to report errors.

Convert::Moji

This is used for most of the work of the module.

JSON::Parse

This is used to read in some of the data.

ACKNOWLEDGEMENTS

Thanks to Naoki Tomita, David Steinbrunner, and Neil Bowers for fixes.

HISTORY

"Moji" (文字) means "letters" in Japanese. I started Lingua::JA::Moji out of a need for more comprehensive handling of Japanese text than was offered by any of the existing modules on CPAN. There were a lot of modules offering piecemeal romaji/kana conversions or hiragana/katakana conversions, but nothing comprehensive or robust. Lingua::JA::Moji was originally a private module. Most of the functions in the module are things I needed for my own projects.

The design using Convert::Moji was part of an abandoned plan to make a cross-language module which could produce, say, a JavaScript converter doing the same things as this Perl one, using the same text sources.

I wasn't really sure whether to release it, but eventually I released it to CPAN after users of an online romaji/kana converter on my website requested its source code. The module interface, in particular the hash reference options to "kana2romaji" and "romaji2kana", is rather messy, and some of the defaults are rather strange. However, since it was described in Naoki Tomita's book, and some people may be using it as is, I'm not very keen to change it in incompatible ways.

0.24

This version added "katakana2syllable".

0.27

This version added "is_romaji_strict".

0.36

This version added the "nigori_first" function.

0.37

This version added "is_romaji_semistrict".

0.43

This version added support for hentaigana. The data was copied and pasted from the Unicode 10.0 standard draft documents. See the directory data in the GitHub repository for the files used to generate this data.

0.46

This version disallowed hyphens as the first character of a romaji string and added "smallize_kana" and "cleanup_kana".

0.47

This version added a list of the "Yūrei moji" (幽霊文字), false kanji, and changed romanisation somewhat.

0.48

This version changed "kana2romaji" to be consistent with the documentation for the long vowel options wapuro and none.

0.53

This version added "join_sound_marks" and "split_sound_marks" to the module.

0.54

This version removed a function kana_order from the module. It improved the behaviour of "is_romaji_strict" after comparing its negatives and positives with a large number of English and nonsense words. It improved the behaviour of "smallize_kana" with regard to the "tsu" kana. "cleanup_kana" was improved to deal with stray dakuten and handakuten and some other odd kanji/kana confusions.

0.58

This version added "kana_consonant".

0.59

This version added "strip_sound_marks".

AUTHOR

Ben Bullock, <bkb@cpan.org>

COPYRIGHT & LICENCE

This package and associated files are copyright (C) 2008-2024 Ben Bullock.

You can use, copy, modify and redistribute this package and associated files under the Perl Artistic Licence or the GNU General Public Licence.