comp.lang.postscript / Re: Composite fonts for Unicode strings (was: Final request for feedback)

Hi All,

I'm about to publish my UTF-8 code. Before I do I'm asking for feedback
and opinions for what should be the last time.

What's different about what I'm finally intending to publish:

1. I'm using a dictionary for the UNICODE encoding map, instead of a
sparse array. This isn't because it's faster (3ns slower seems quite
acceptable) or smaller (a dictionary is over double the size for
GNU's UnifontMedium). I'm doing this because it's two fewer files to
publish: I don't need to publish sparseget and I don't need to publish
an AWK script to convert Fontforge .g2n files into a sparse array.

2. I've replaced utf8show with utf8decode (which generates an array of
UNICODE values) and unicodeshow.

3. I'm not storing the map in the font, but passing it as a parameter to
unicodeshow because I think it's simpler. Storing it in the font means
defining a new font (definefont).

These are the alternative programs for printing a UTF-8 string.

This is what I think I'll publish:

%!PS
%%IncludeResource: procset unicodeshow
%%IncludeResource: procset utf8decode
/Helvetica 20 selectfont
100 300 moveto
(Welcome to \342\200\234UTF-8\342\200\235 \342\230\272) utf8decode
ReverseAdobeGlyphList exch unicodeshow
showpage
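
As a rough model of what the two steps in the program above do, here is a small Python sketch. It is not David's PostScript code: the function names mirror his procedure names only for readability, and the three-entry map is a hypothetical excerpt standing in for ReverseAdobeGlyphList.

```python
# Python model of the decode-then-look-up pipeline; an illustration only.

def utf8decode(data: bytes) -> list:
    """Return the Unicode code points encoded by a UTF-8 byte string."""
    return [ord(ch) for ch in data.decode("utf-8")]

# Hypothetical excerpt of a code point -> glyph name map. A real map would
# also cover ASCII; here anything missing falls back to .notdef.
REVERSE_GLYPH_NAMES = {
    0x201C: "quotedblleft",
    0x201D: "quotedblright",
    0x263A: "smileface",
}

def glyph_names(codes, reverse_map):
    """Map each code point to a glyph name, or .notdef if it is unmapped."""
    return [reverse_map.get(code, ".notdef") for code in codes]

# The same octal escapes as in the PostScript string above.
data = b"Welcome to \342\200\234UTF-8\342\200\235 \342\230\272"
codes = utf8decode(data)
names = glyph_names(codes, REVERSE_GLYPH_NAMES)
```

The 26-byte string decodes to 20 code points, because each of the three non-ASCII characters occupies three bytes.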

This is what I was previously intending, using a dictionary:

%!PS
%%IncludeResource: procset unicodefont
%%IncludeResource: procset unicodeshow
%%IncludeResource: procset utf8decode
/Helvetica findfont 20 scalefont ReverseAdobeGlyphList unicodefont
/MyFont exch definefont setfont
100 300 moveto
(Welcome to \342\200\234UTF-8\342\200\235 \342\230\272) utf8decode
unicodeshow
showpage

There's one extra line if using a sparse array instead of a dictionary:

%%IncludeResource: procset sparseget

I think the first is better but am open to opposing opinions.

Thanks,

David

Re: Final request for feedback

https://www.rocksolidbbs.com/devel/article-flat.php?id=148&group=comp.lang.postscript#148

Newsgroups: comp.lang.postscript
Date: Tue, 22 Feb 2022 07:36:22 -0800 (PST)
In-Reply-To: <6211b169$1@news.ausics.net>
References: <6211b169$1@news.ausics.net>
Message-ID: <014e6696-760d-4aa0-a79c-f2c684db6f31n@googlegroups.com>
Subject: Re: Final request for feedback
From: luser.droog@gmail.com (luser droog)

by: luser droog - Tue, 22 Feb 2022 15:36 UTC

On Saturday, February 19, 2022 at 9:11:52 PM UTC-6, David Newall wrote:
> Hi All,
>
> I'm about to publish my UTF-8 code. Before I do I'm asking for feedback
> and opinions for what should be the last time.
>
That looks really good to me. I'm a little sad that definefont is out,
but it really doesn't appear to offer very much. It seems like PostScript
*almost* has the pieces available to put this together seamlessly.
But the conversion probably can't use a filtered file because of the need
to convert from a string to an array. And packing the glyph selection
into a composite font would be a ton of work if it's even possible.

On Tue, 22 Feb 2022 07:36:22 -0800 (PST)
luser droog <luser.droog@gmail.com> wrote:
> [...] And packing
> the glyph selection into a composite font would be a ton of work if
> it's even possible.

It is possible to create a tree of composite fonts, where each byte in
a UTF-8 sequence dispatches to the next font, and the last one picks
the glyph. The problems with this approach are 1. the complexity of
creating and populating the font tree, and 2. the fact that
the base fonts at the leaves can only encode 64 glyphs each (since
that's how many values the last byte in a multibyte UTF-8 sequence can
hold), and not even at the beginning of the /Encoding array, which is a
waste.
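
The 64-glyph limit can be checked with a few lines of Python (an illustration only, separate from the PostScript in this post): collect the final byte of every multibyte UTF-8 sequence and look at which values occur.

```python
# Collect the last byte of every multibyte UTF-8 sequence.
# Surrogates (U+D800..U+DFFF) cannot be encoded, so they are skipped.
trailing = set()
for start, stop in ((0x80, 0xD800), (0xE000, 0x110000)):
    for cp in range(start, stop):
        trailing.add(chr(cp).encode("utf-8")[-1])

trailing = sorted(trailing)
first_slot, last_slot = trailing[0], trailing[-1]
```

Only the 64 values 0x80 to 0xBF appear, so a leaf font's /Encoding array would be populated at indices 128 to 191 and nowhere else.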

A simpler approach is to reencode the UTF-8 string to a made-up UTF-24
encoding (3 bytes per codepoint), and then use a simple chain of 8/8 mapping
(FMapType 2) composite fonts. Here the first byte selects the Unicode
plane (sections of 65536 codepoints; only 4 or 5 are assigned), the
second byte the segment of 256 codepoints in that plane, and the third
one the glyph inside that segment.
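
To make the byte layout concrete, here is a minimal Python sketch of the re-encoding (the name utf8_to_utf24 is made up for this illustration; it leans on Python's own UTF-8 decoder instead of decoding by hand):

```python
def utf8_to_utf24(data: bytes) -> bytes:
    """Re-encode UTF-8 as 3 bytes per code point: plane, segment, glyph."""
    out = bytearray()
    for ch in data.decode("utf-8"):
        cp = ord(ch)
        out.append(cp >> 16)          # plane: which block of 65536 code points
        out.append((cp >> 8) & 0xFF)  # segment: which block of 256 in that plane
        out.append(cp & 0xFF)         # glyph: position inside the segment
    return bytes(out)

encoded = utf8_to_utf24("l\u00e0 \u263a".encode("utf-8"))
```

For the four characters "l", "à", space and "☺", this gives the byte triples 00 00 6C, 00 00 E0, 00 00 20 and 00 26 3A.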

While in theory this needs 1 composite font to choose the plane + 256
composite fonts (1 for each plane) + 256x256 base fonts = 65793 fonts, the
majority of them are just the same empty font.
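
The arithmetic, and how few distinct fonts a given text actually touches, can be sketched in Python (fonts_needed is a hypothetical helper for this illustration, not part of the PostScript code):

```python
# Worst case: every slot at every level holds its own font.
ROOT_FONTS = 1
PLANE_FONTS = 256
BASE_FONTS = 256 * 256
WORST_CASE = ROOT_FONTS + PLANE_FONTS + BASE_FONTS

def fonts_needed(text: str) -> tuple:
    """Count the distinct planes and 256-code-point segments a text uses."""
    planes = {ord(ch) >> 16 for ch in text}
    segments = {ord(ch) >> 8 for ch in text}
    return len(planes), len(segments)

latin = fonts_needed("oh l\u00e0 l\u00e0")
mixed = fonts_needed("\u0393\u03b5\u03b9\u03ac \u263a")
```

Here latin is (1, 1), because every character sits in plane 0, segment 0, while mixed is (1, 3). Everything else can point at the shared empty fonts.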

Below is an example of this approach. You get a unicode font by calling
"unicodize" on a font with CharStrings, and you reencode UTF-8 strings
with the "u" operator:

/Courier-Unicode /Courier findfont unicodize 12 scalefont setfont
(oh là là)u show

It uses the AdobeGlyphList for now -- maybe David will come up with
something better.

The code probably has some bugs. I only tested it with Emacs' "Hello"
demo:

%!PS

/f /Arial findfont def
/uf /UFont f unicodize def

uf 14 scalefont setfont
700
[
( Europe: ¡Hola!, Grüß Gott, Hyvää päivää, Tere õhtust, Bonġu)
( Cześć!, Dobrý den, Здравствуйте!, Γειά σας, გამარჯობა)
( Africa: ሠላም)
( Middle/Near East: שָׁלוֹם, السّلام عليكم)
( South Asia: નમસ્તે, नमस्ते, ನಮಸ್ಕಾರ, നമസ്കാരം, ଶୁଣିବେ,)
( ආයුබෝවන්, வணக்கம், నమస్కారం, བཀྲ་ཤིས་བདེ་ལེགས༎)
( South East Asia: ជំរាបសួរ, ສະບາຍດີ, မင်္ဂလာပါ, สวัสดีครับ, Chào bạn)
( East Asia: 你好, 早晨, こんにちは, 안녕하세요)
( Misc: Eĥoŝanĝo ĉiuĵaŭde, ⠓⠑⠇⠇⠕, ∀ p ∈ world • hello p □)
( CJK variety: GB(元气,开发), BIG5(元氣,開發), JIS(元気,開発), KSC(元氣,開發))
( Unicode charset: Eĥoŝanĝo ĉiuĵaŭde, Γειά σας, שלום, Здравствуйте!)
] {
1 index 20 exch moveto
u show
30 sub
} forall

pop
showpage

Here's the code. Our old friend the iterator makes an appearance :)

%!PS

%% create a composite font suitable for strings with UTF-24 encoding
%: key originalfont -- newfont
/unicodize {
40 dict begin
/ofont exch def
/key exch def
/fname key dup length string cvs def
/basefonts 10 dict def
/planefonts 10 dict def
%: string string -- name
/newname {
/s2 exch def /s1 exch def
/s s1 length s2 length add 1 add string def
s 0 s1 putinterval
s s1 length (-) putinterval
s s1 length 1 add s2 putinterval
s cvn
} def
%: int -- string
/tohex { 16 10 string cvrs } def
%: array element -- newarray
/append { /e exch def [ exch aload pop e ] } def
%: suffix -- font
/newbasefont {
/suffix exch def
/name fname suffix newname def
ofont dup length dict copy
dup /Encoding [ 256 { /.notdef } repeat ] put
dup /FontName name put
dup basefonts exch name exch put
} def
/emptybasefont (Base-E) newbasefont def
%: suffix -- font
/newplanefont {
/suffix exch def
/name fname suffix newname def
<< /FontType 0
/FontMatrix [ 1 0 0 1 0 0 ]
/FontName name
/FMapType 2
/Encoding [ 256 { 0 } repeat ]
/FDepVector [ emptybasefont ]
>>
dup planefonts exch name exch put
} def
/emptyplanefont (Plane-E) newplanefont def
/mainfont << /FontType 0
/FontMatrix [ 1 0 0 1 0 0 ]
/FontName fname
/FMapType 2
/Encoding [ 256 { 0 } repeat ]
/FDepVector [ emptyplanefont ]
>> def
%: font subfont code --
/addsubfont {
/c exch def /sf exch def /f exch def
f /FDepVector 2 copy get sf append put
f /Encoding get c f /FDepVector get length 1 sub put
} def
%: glyphname code --
/putglyph {
dup /plane exch 65536 idiv def
dup /range exch 65536 mod 256 idiv def
/code exch 256 mod def
/glyph exch def
/idx mainfont /Encoding get plane get def
idx 0 eq {
plane tohex newplanefont
dup mainfont exch plane addsubfont
} {
mainfont /FDepVector get idx get
} ifelse
/planefont exch def
/idx planefont /Encoding get range get def
idx 0 eq {
plane 256 mul range add tohex newbasefont
dup planefont exch range addsubfont
} {
planefont /FDepVector get idx get
} ifelse
/basefont exch def
basefont /Encoding get code glyph put
} def
%: glyphname -- code true | false
/getcode {
/g exch def
AdobeGlyphList g known {
AdobeGlyphList g get true
} {
/s g g length string cvs def
s length 7 eq {
s 0 3 getinterval (uni) eq {
s 7 string copy dup 0 (16#) putinterval
{ cvi } stopped { pop false } { true } ifelse
} {
s 0 1 getinterval (u) eq {
9 string dup 3 s 1 6 getinterval putinterval
dup 0 (16#) putinterval
{ cvi } stopped { pop false } { true } ifelse
} { false } ifelse
} ifelse
} { false } ifelse
} ifelse
} def
% fill the fonts...
ofont /CharStrings get { pop dup getcode { putglyph } { pop } ifelse } forall
% register them...
basefonts { definefont pop } forall
planefonts { definefont pop } forall
% register & return main font
key mainfont definefont
end
} bind def

%: string|array -- iterator ( -- nextchar true | false )
/sequenceiterator {
2 dict begin
/s exch def
/counter [ 0 ] def
[ counter 0 /get cvx s length /lt cvx [
s counter 0 /get cvx /get cvx true
counter 0 2 /copy cvx /get cvx 1 /add cvx /put cvx
] cvx [
false
] cvx /ifelse cvx
] cvx
end
} bind def

%% reencode UTF-8 to UTF-24
%: string -- string
/u {
3 dict begin
/src exch def
/nextch src sequenceiterator def
% count UTF-8 sequence starts
0 src { dup 128 lt exch 2#11000000 and 2#11000000 eq or
{ 1 } { 0 } ifelse add } forall
3 mul string /dest exch def
0 {
% decode sequence
nextch not { exit } if
dup 128 lt {
0 % 0xxxxxxx - 0 following bytes
} {
dup dup 2#11000000 ge exch 2#11011111 le and {
2#00011111 and 1 % 110xxxxx - 1 following byte
} {
dup dup 2#11100000 ge exch 2#11101111 le and {
2#00001111 and 2 % 1110xxxx - 2 following bytes
} {
dup dup 2#11110000 ge exch 2#11110111 le and {
2#00000111 and 3 % 11110xxx - 3 following bytes
} {
pop 0 0 % invalid sequence
} ifelse
} ifelse
} ifelse
} ifelse
{ 6 bitshift nextch pop 2#00111111 and add } repeat
% stack: index-to-dest, codepoint
2 copy 65536 idiv dest 3 1 roll put
exch 1 add exch 2 copy 65536 mod 256 idiv dest 3 1 roll put
exch 1 add exch 2 copy 256 mod dest 3 1 roll put pop
1 add
} loop
pop
dest
end
} bind def
Subject	Author
Final request for feedback	David Newall
Re: Final request for feedback	luser droog
Composite fonts for Unicode strings (was: Final request for	Carlos
Re: Composite fonts for Unicode strings (was: Final request for	David Newall
Re: Composite fonts for Unicode strings (was: Final request for	David Newall
Re: Final request for feedback	Carlos