date: Wed, 6 Aug 2008 23:52:40 -0700,
group: microsoft.public.win32.programmer.ui
Re: Code pages v. Arial Unicode MS
> I'm trying to understand the concept of code pages.
Some basic lingo here:
http://www.mihai-nita.net/article.php?artID=20060806a
> I have a font called "Arial Unicode MS" and it seems to have a lot of
> glyphs, including Chinese and Russian. Is there not a unique integer for each
> glyph in this font? If so, why do we need the concept of a code page?
Yes, there is a glyph ID for each glyph.
But the same glyph ID can represent a different character in a
different font.
You need a code page because it is a standard, designed for data
interchange. Glyph IDs are not standardised; they are controlled by
font creators.
Also, OpenType/TrueType fonts are limited to 65K glyphs, while Unicode
encodes more than 100K characters.
The mapping between code points ("characters") and glyphs is not necessarily
trivial. You can have one-to-one, one-to-many, many-to-one, and many-to-many
mappings. The mappings can depend on context, language, font, etc.
If you want to dig deeper into this you can go to the OpenType spec.
> Is not the concept of a font a superset of a code page?
No. A font is a collection of glyphs (a very dumbed-down definition).
A code page is a coded character set.
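To make "coded character set" concrete, here is a minimal sketch in portable C. It shows the same byte value decoding to different Unicode code points depending on which code page is assumed. The function name decode_byte is hypothetical, and the sketch covers only ASCII plus the upper ranges of code pages 1251 and 1252; it is not a substitute for MultiByteToWideChar.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: decode one byte under Windows code page 1251 or 1252.
   Returns the Unicode code point, or -1 for bytes this sketch does not cover. */
long decode_byte(int code_page, uint8_t b)
{
    assert(code_page == 1251 || code_page == 1252);

    if (b < 0x80)
        return b;                      /* ASCII range is shared by both */
    if (code_page == 1252 && b >= 0xA0)
        return b;                      /* 0xA0..0xFF match U+00A0..U+00FF */
    if (code_page == 1251 && b >= 0xC0)
        return 0x0410 + (b - 0xC0);    /* 0xC0..0xFF are U+0410..U+044F */
    return -1;                         /* outside the ranges handled here */
}
```

With this sketch, byte 0xE0 decodes to U+00E0 (Latin small a with grave) under 1252 and to U+0430 (Cyrillic small a) under 1251. Neither result says anything about which glyph ID a particular font uses to draw that character.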
> So I am confused. When I look at the documentation for MultiByteToWideChar
> http://msdn.microsoft.com/en-us/library/ms776413(VS.85).aspx I see 7 code
> pages.
> But when I call EnumSystemCodePages, I see well over a dozen. This looks
> more like http://www.microsoft.com/globaldev/reference/WinCP.mspx.
> Is Microsoft inconsistent in their use of the term "code page" here?
The 7 are predefined symbolic values; the others are numeric.
This is similar (if you like) to
#define MAX_USHORT 0xFFFF
#define MAX_PATH 260
That does not mean that MAX_USHORT and MAX_PATH are the only numeric
values possible. CP_UTF8 is just a nice way to say 65001.
Even the MultiByteToWideChar page says "for a list of code pages,
see Code Page Identifiers."
--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
date: Fri, 08 Aug 2008 04:35:40 -0700
author: Mihai N.
Re: Code pages v. Arial Unicode MS
> Do UTF-7 and UTF-8 supersede all the other code pages? If all web documents
> used UTF-7 or UTF-8 would there be any need for the other code pages?
UTF-7 is rarely used today.
You will most likely see UTF-8 on the web, in the Unix world, and for
data exchange and storage. You will see lots of UTF-16 where text
processing is needed (Win32 API, Mac OS X API, .NET, Java, JavaScript,
ICU, Qt, etc.), and some UTF-32 (relatively little) in Unix apps that
use wchar_t.
> What functions does IE call when it sees the encoding in the header of a
> response from a web server? As Joel on Software explains, the header of a
> response contains the encoding.
MLang: http://msdn.microsoft.com/en-us/library/aa767865(VS.85).aspx
Determining the code page of a web page is a bit more complicated.
You should start with RFC 2616: the charset parameter of the
Content-Type header.
That comes from the server.
You can also specify it in a meta tag.
And you can also guess the encoding, if all else fails
(MLang IMultiLanguage2, DetectInputCodepage & DetectCodepageInIStream).
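As a rough illustration of the first step, the server-declared encoding travels in the charset parameter of the HTTP Content-Type header (for example "text/html; charset=UTF-8"). The sketch below pulls that value out of a header string in portable C. The function name extract_charset is hypothetical, this is not how IE or MLang does it, and it ignores corner cases such as a longer parameter name that merely ends in "charset".

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the charset parameter of a Content-Type value
   into out (capacity out_size). Returns 1 if a value was found, else 0. */
int extract_charset(const char *content_type, char *out, size_t out_size)
{
    const char *key = "charset=";
    const size_t key_len = strlen(key);
    assert(content_type != NULL && out != NULL && out_size > 0);

    for (const char *p = content_type; *p != '\0'; ++p) {
        size_t i = 0;
        /* Parameter names are case-insensitive, so compare in lower case. */
        while (i < key_len && p[i] != '\0' &&
               tolower((unsigned char)p[i]) == key[i])
            ++i;
        if (i != key_len)
            continue;

        p += key_len;
        if (*p == '"')                 /* the value may be a quoted string */
            ++p;
        size_t n = 0;
        while (*p != '\0' && *p != '"' && *p != ';' &&
               !isspace((unsigned char)*p) && n + 1 < out_size)
            out[n++] = *p++;
        out[n] = '\0';
        return n > 0;
    }
    out[0] = '\0';
    return 0;
}
```

When the function returns 0, a caller would fall back to the meta tag and then to detection, in the order described above.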
> I believe all web pages are 8 bit characters by definition, correct?
Technically UTF-8 might need up to 4 bytes to represent some code points
(and what the user understands by "character" might need even more :-)
But trying to guess what you are really asking: UTF-16 is valid,
and IE understands it.
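The "up to 4 bytes" point can be checked with a small UTF-8 encoder. This is a minimal sketch in portable C; the function name utf8_encode is hypothetical and the code handles a single code point at a time.

```c
#include <assert.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 into out (room for 4 bytes).
   Returns the number of bytes written (1 to 4). */
int utf8_encode(uint32_t cp, uint8_t out[4])
{
    /* Surrogates and values above U+10FFFF are not valid scalar values. */
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    if (cp < 0x80) {                   /* 7 bits: plain ASCII */
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {                  /* up to 11 bits */
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                /* up to 16 bits */
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    /* up to 21 bits */
    out[0] = (uint8_t)(0xF0 | (cp >> 18));
    out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (uint8_t)(0x80 | (cp & 0x3F));
    return 4;
}
```

For example, U+0041 takes one byte, U+00E9 takes two (C3 A9), U+4E2D takes three (E4 B8 AD), and U+1F600 takes four (F0 9F 98 80).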
> So I think what happens is that IE looks at the encoding type, finds the
> code page and specifies this as an argument to MultiByteToWideChar and
> converts everything to Wide characters and then calls the GDI function
> DrawTextW which then uses each character code to index into an array of
> glyphs (aka a font such as Arial Unicode MS). Is this correct?
There might be some implementation details that are different, but you
got the principle right.
(IE probably calls ConvertStringToUnicode or ConvertStringToUnicodeEx,
or IMLangConvertCharset or ConvertINetMultiByteToUnicode (don't ask
me why so many :-). It can use IEnumScript, IMLangFontLink and
IMLangFontLink2 to split the text into runs that use the same script.
DrawTextW might in fact be DrawTextEx, ExtTextOut (or other GDI stuff),
or IE might even call Uniscribe directly (which I doubt).)
> What does MultiByteToWideChar convert to? UTF-16?
Yes, always UTF-16.
> What happens if MultiByteToWide finds an exotic Chinese character that
> cannot fit into 16 bits?
All Chinese characters in the code pages supported by Windows
have UTF-16 mappings.
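For code points that do not fit into 16 bits, UTF-16 uses a surrogate pair, that is, two 16-bit units. The sketch below shows that arithmetic in portable C; the function name utf16_encode is hypothetical, and this illustrates the encoding form only, not what MultiByteToWideChar does internally.

```c
#include <assert.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16 into out (room for 2 units).
   Returns the number of 16-bit units written (1 or 2). */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    /* Surrogates and values above U+10FFFF are not valid scalar values. */
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    if (cp < 0x10000) {                /* fits in a single 16-bit unit */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                     /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate: low 10 bits */
    return 2;
}
```

For example, U+4E2D stays a single unit (0x4E2D), while U+20000, the first code point of CJK Extension B, becomes the pair D840 DC00.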
> Does DrawTextW use IsDBCSLeadByte to compute the
> proper index to the glyph array? How does it do that?
IsDBCSLeadByte is not used or needed for Unicode text.
GDI calls Uniscribe, if needed (for complex scripts).
Determining the glyphs from a code point is not trivial: it requires
an understanding of how each script works, and of OpenType internals.
See http://www.microsoft.com/typography/Glyph%20Processing/overview.mspx
and "Script-specific Development" here:
http://www.microsoft.com/typography/SpecificationsOverview.mspx
to get some idea.
--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
date: Tue, 12 Aug 2008 21:08:24 -0700
author: Mihai N.