Unicode::Property::XS - Unicode properties implemented by lookup table in C code.
    use Unicode::Property::XS qw(:all);  # 'ucs_' is the default prefix

    my @property_letters;
    foreach my $ord (0x0000..0x37FF) {
        push @property_letters, ucs_L($ord);  # /\p{L}/
    }

    my @property_list = ucs_EaFullwidth1(0x0000..0x37FF);

    foreach my $ord (0x0000..0x3FFFF) {
        next if !ucs_Legal($ord);
        die "Internal error!"
            if ucs_M($ord) != ((chr($ord) =~ /\p{M}/) ? 1 : 0);
    }

    # qw() gives a list of characters; map each one to its code point
    my @myChars = qw( a b c d e f g 1 2 3 );
    my @property_list2 = ucs_L( map { ord } @myChars );

    __END__

    #################################

    # Set the prefix before the module is loaded
    BEGIN { $Unicode::Property::XS::Prefix = 'Is'; }
    use Unicode::Property::XS;

    my @property_letters;
    foreach my $ord (0x0000..0x37FF) {
        push @property_letters, IsL($ord);  # /\p{L}/
    }

    __END__

    #################################

    use Unicode::Property::XS qw( Legal :EastAsianWidth );
    use Unicode::EastAsianWidth;
    BEGIN { $Unicode::EastAsianWidth::EastAsian = 0; }

    foreach my $ord (0x0000..0xEFFFF) {
        next if !ucs_Legal($ord);
        my $lookup_value = ucs_EaFullwidth0($ord);  # /\p{InFullwidth}/
        # Normalize the match result to 1 or 0 before comparing numerically
        my $re_value = (chr($ord) =~ /\p{InFullwidth}/) ? 1 : 0;
        die "Error in Unicode::Property::XS!\n"
            if $lookup_value != $re_value;
    }

    __END__
Unicode properties in Perl regular expressions are handy, but they are somewhat slow when a given character is tested only a few times. So I made an XS module that looks properties up in a table instead. The properties in the "Unicode Character Properties" section of perlunicode and the properties in Unicode::EastAsianWidth are implemented.
The run-time dynamic library takes about 1.2 MB and includes all the property classes listed below. Please tell me if you have module-splitting or space-saving suggestions.
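The table-lookup idea described above can be sketched in pure Perl. This is only an illustration of the technique, not the module's C implementation: the bit string `$table`, the constant `$LIMIT`, and the function `table_L()` are hypothetical names, and the range 0x0000..0x37FF is simply borrowed from the synopsis.

```perl
use strict;
use warnings;

# Highest code point covered by this small demonstration table.
my $LIMIT = 0x37FF;

# Build a one-bit-per-code-point table once, using the regex engine
# as the source of truth for the L (Letter) property.
my $table = '';
for my $ord (0 .. $LIMIT) {
    vec($table, $ord, 1) = (chr($ord) =~ /\p{L}/) ? 1 : 0;
}

# Hypothetical lookup function: answers from the table, never from a regex.
# Code points outside the table are reported as 0 in this sketch.
sub table_L {
    my ($ord) = @_;
    return 0 if $ord < 0 || $ord > $LIMIT;
    return vec($table, $ord, 1);
}

printf "U+%04X => %d\n", $_, table_L($_) for ord('a'), ord('1'), 0x3042;
```

Packing one bit per code point keeps such a table small; whether the module's own tables are laid out this way is not stated in this document.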
All the functions except ucs_Legal() work the same way. Each takes a character as a numeric code point and returns 1 if the character is in that property class, or 0 if it is not. It also returns 0 if the code point is illegal (this should not happen if the input value comes from ord($ucs_char)). It returns 15 if the code point is in plane 15 and 16 if it is in plane 16; both are user-defined planes.
ucs_Legal() returns 1 if perl will not complain about chr($ucs_ord), and 0 otherwise.
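Because 15 and 16 are true values in Perl, a caller that only writes `if (ucs_L($ord))` would treat every code point in planes 15 and 16 as a match. The sketch below shows one way a caller might interpret the documented return values. The helpers `plane_of()`, `is_member()` and `is_private_plane()` are hypothetical and are not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: the plane of a code point is its value divided
# by 0x10000, so plane 15 starts at 0xF0000 and plane 16 at 0x100000.
sub plane_of {
    my ($ord) = @_;
    return $ord >> 16;
}

# Hypothetical helper: reduce a documented return value (0, 1, 15 or 16)
# to a strict yes/no answer, counting only 1 as "has the property".
sub is_member {
    my ($ret) = @_;
    return $ret == 1 ? 1 : 0;
}

# Hypothetical helper: true when the return value signals a
# user-defined plane rather than a property match.
sub is_private_plane {
    my ($ret) = @_;
    return ($ret == 15 || $ret == 16) ? 1 : 0;
}

printf "U+%X is in plane %d\n", $_, plane_of($_) for 0x41, 0xF0000, 0x10FFFD;
```

With the module loaded, a call such as `is_member(ucs_L($ord))` would then ignore the plane markers.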
The following functions can be exported to the caller's scope. ucs_Legal().
Functions for general properties: ucs_L(), ucs_LC(), ucs_Lu(), ucs_Ll(), ucs_Lt(), ucs_Lm(), ucs_Lo(), ucs_M(), ucs_Mn(), ucs_Mc(), ucs_Me(), ucs_N(), ucs_Nd(), ucs_Nl(), ucs_No(), ucs_P(), ucs_Pc(), ucs_Pd(), ucs_Ps(), ucs_Pe(), ucs_Pi(), ucs_Pf(), ucs_Po(), ucs_S(), ucs_Sm(), ucs_Sc(), ucs_Sk(), ucs_So(), ucs_Z(), ucs_Zs(), ucs_Zl(), ucs_Zp(), ucs_C(), ucs_Cc(), ucs_Cf(), ucs_Cs(), ucs_Co(), ucs_Cn().
Functions for bidirectional properties: ucs_BidiL(), ucs_BidiLRE(), ucs_BidiLRO(), ucs_BidiR(), ucs_BidiAL(), ucs_BidiRLE(), ucs_BidiRLO(), ucs_BidiPDF(), ucs_BidiEN(), ucs_BidiES(), ucs_BidiET(), ucs_BidiAN(), ucs_BidiCS(), ucs_BidiNSM(), ucs_BidiBN(), ucs_BidiB(), ucs_BidiS(), ucs_BidiWS(), ucs_BidiON().
Functions for scripts (the properties PhagsPa and Phoenician are not included, since they are not implemented in /\p{ }/ form): ucs_Arabic(), ucs_Armenian(), ucs_Balinese(), ucs_Bengali(), ucs_Bopomofo(), ucs_Braille(), ucs_Buginese(), ucs_Buhid(), ucs_CanadianAboriginal(), ucs_Cherokee(), ucs_Coptic(), ucs_Cuneiform(), ucs_Cypriot(), ucs_Cyrillic(), ucs_Deseret(), ucs_Devanagari(), ucs_Ethiopic(), ucs_Georgian(), ucs_Glagolitic(), ucs_Gothic(), ucs_Greek(), ucs_Gujarati(), ucs_Gurmukhi(), ucs_Han(), ucs_Hangul(), ucs_Hanunoo(), ucs_Hebrew(), ucs_Hiragana(), ucs_Inherited(), ucs_Kannada(), ucs_Katakana(), ucs_Kharoshthi(), ucs_Khmer(), ucs_Lao(), ucs_Latin(), ucs_Limbu(), ucs_LinearB(), ucs_Malayalam(), ucs_Mongolian(), ucs_Myanmar(), ucs_NewTaiLue(), ucs_Nko(), ucs_Ogham(), ucs_OldItalic(), ucs_OldPersian(), ucs_Oriya(), ucs_Osmanya(), ucs_PhagsPa(), ucs_Phoenician(), ucs_Runic(), ucs_Shavian(), ucs_Sinhala(), ucs_SylotiNagri(), ucs_Syriac(), ucs_Tagalog(), ucs_Tagbanwa(), ucs_TaiLe(), ucs_Tamil(), ucs_Telugu(), ucs_Thaana(), ucs_Thai(), ucs_Tibetan(), ucs_Tifinagh(), ucs_Ugaritic(), ucs_Yi().
Functions for extended properties: ucs_ASCIIHexDigit(), ucs_BidiControl(), ucs_Dash(), ucs_Deprecated(), ucs_Diacritic(), ucs_Extender(), ucs_HexDigit(), ucs_Hyphen(), ucs_Ideographic(), ucs_IDSBinaryOperator(), ucs_IDSTrinaryOperator(), ucs_JoinControl(), ucs_LogicalOrderException(), ucs_NoncharacterCodePoint(), ucs_OtherAlphabetic(), ucs_OtherDefaultIgnorableCodePoint(), ucs_OtherGraphemeExtend(), ucs_OtherIDStart(), ucs_OtherIDContinue(), ucs_OtherLowercase(), ucs_OtherMath(), ucs_OtherUppercase(), ucs_PatternSyntax(), ucs_PatternWhiteSpace(), ucs_QuotationMark(), ucs_Radical(), ucs_SoftDotted(), ucs_STerm(), ucs_TerminalPunctuation(), ucs_UnifiedIdeograph(), ucs_VariationSelector(), ucs_WhiteSpace().
Functions for derived properties: ucs_Alphabetic(), ucs_Lowercase(), ucs_Uppercase(), ucs_Math(), ucs_IDStart(), ucs_IDContinue(), ucs_Any(), ucs_Assigned(), ucs_Unassigned(), ucs_ASCII(), ucs_Common().
Functions for EastAsianWidth: ucs_EaF(), ucs_EaH(), ucs_EaA(), ucs_EaNa(), ucs_EaW(), ucs_EaN(), ucs_EaFullwidth0(), ucs_EaHalfwidth0(), ucs_EaFullwidth1(), ucs_EaHalfwidth1().
Characters in the InEastAsianAmbiguous category can be classified as either InFullwidth or InHalfwidth. ucs_EaFullwidth0() and ucs_EaHalfwidth0() represent the InFullwidth and InHalfwidth classes with $Unicode::EastAsianWidth::EastAsian = 0, whereas ucs_EaFullwidth1() and ucs_EaHalfwidth1() represent them with $Unicode::EastAsianWidth::EastAsian = 1. The actual value of $Unicode::EastAsianWidth::EastAsian at run time is irrelevant to these functions, since the lookup tables are prebuilt.
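The effect of the two variants can be sketched on a handful of code points. This is a rough illustration under two assumptions: that ambiguous characters count as full width when the East Asian setting is 1 and as half width when it is 0, and that the width classes in the hand-copied sample hash `%EA_CLASS` match EastAsianWidth.txt. The hash and the helper `is_fullwidth()` are hypothetical and are not the module's tables.

```perl
use strict;
use warnings;

# Illustrative sample only: East Asian Width classes for a few code
# points, copied by hand (not the module's prebuilt table).
my %EA_CLASS = (
    0x0041 => 'Na',  # LATIN CAPITAL LETTER A
    0x00A1 => 'A',   # INVERTED EXCLAMATION MARK (ambiguous)
    0x3042 => 'W',   # HIRAGANA LETTER A
    0xFF21 => 'F',   # FULLWIDTH LATIN CAPITAL LETTER A
    0xFF71 => 'H',   # HALFWIDTH KATAKANA LETTER A
);

# Hypothetical helper: is the code point full width, given whether
# ambiguous characters are treated as wide ($east_asian = 1) or not (0)?
# Code points missing from the sample are treated as neutral here.
sub is_fullwidth {
    my ($ord, $east_asian) = @_;
    my $class = exists $EA_CLASS{$ord} ? $EA_CLASS{$ord} : 'N';
    return 1 if $class eq 'F' || $class eq 'W';
    return 1 if $class eq 'A' && $east_asian;
    return 0;
}

for my $ord (sort { $a <=> $b } keys %EA_CLASS) {
    printf "U+%04X  setting 0: %d  setting 1: %d\n",
        $ord, is_fullwidth($ord, 0), is_fullwidth($ord, 1);
}
```

Only the ambiguous entry changes between the two settings, which is why the module can ship both answers as separate prebuilt functions.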
In my line-wrapping program, the total running time is cut in half by using this module, compared to the original regex version, i.e. the /\p{ }/ family. Caching the regex results does not help much. In tests with the Benchmark module, however, the performance difference is only 20%-50%.
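A comparison of this kind can be set up with the core Benchmark module. The sketch below does not use Unicode::Property::XS itself: it times a regex test against a pure-Perl bit table as a stand-in, so its numbers say nothing about the module's actual speed. The names `count_by_regex()` and `count_by_table()` are hypothetical.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @ords = (0x0000 .. 0x37FF);

# Pure-Perl stand-in for a prebuilt table (not the module's C table).
my $table = '';
vec($table, $_, 1) = (chr($_) =~ /\p{L}/ ? 1 : 0) for @ords;

# Count letters by asking the regex engine for every code point.
sub count_by_regex {
    my $n = 0;
    for my $ord (@ords) { $n++ if chr($ord) =~ /\p{L}/ }
    return $n;
}

# Count letters by reading the prebuilt bit table.
sub count_by_table {
    my $n = 0;
    for my $ord (@ords) { $n++ if vec($table, $ord, 1) }
    return $n;
}

# Run each variant 200 times and print a comparison chart.
cmpthese(200, {
    regex => \&count_by_regex,
    table => \&count_by_table,
});
```

Results vary with the Perl version and the range tested, so any ratio seen here should be taken as a local measurement only.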
perlunicode, Unicode::EastAsianWidth, http://www.unicode.org/unicode/reports/tr11/, http://unicode.org/Public/UNIDATA/EastAsianWidth.txt
Mindos Cheng, <mindos@gmail.com>
Copyright (C) 2008-2009 by Mindos Cheng
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.9 or, at your option, any later version of Perl 5 you may have available.