Ureader.com  
Microsoft software help and Community
date: Wed, 6 Aug 2008 23:52:40 -0700,    group: microsoft.public.win32.programmer.ui


Code pages v. Arial Unicode MS   
I'm trying to understand the concept of code pages.

I have a font called "Arial Unicode MS" and it seems to have a lot of 
glyphs, including Chinese and Russian. Is there not a unique integer for each 
glyph in this font? If so, why do we need the concept of a code page?

Is not the concept of a font a superset of a code page?

So I am confused. When I look at the documentation for MultiByteToWideChar 
http://msdn.microsoft.com/en-us/library/ms776413(VS.85).aspx I see 7 code 
pages.

But when I call EnumSystemCodePages, I see well over a dozen. This looks 
more like http://www.microsoft.com/globaldev/reference/WinCP.mspx.
Is Microsoft inconsistent in their use of the term "code page" here?

Thanks,
Siegfried
date: Wed, 6 Aug 2008 23:52:40 -0700   author:   Siegfried Heintze

RE: Code pages v. Arial Unicode MS   
"why do we need the concept of codepage"
Code pages are needed so that data can be interpreted correctly across 
different languages. Tying data files to a specific font is not a good 
idea. (Besides that, a code page gives you known sorting, upper/lower case 
conversion, and similar functions, something a font can't do.)

The code pages you see in the MultiByteToWideChar documentation are 
system-specific constants (at least some of them). CP_ACP can mean anything 
from win1250 to Chinese_CNS. You can pass MultiByteToWideChar any installed 
code page (those you get from EnumSystemCodePages) or one of the predefined 
constants.
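
To illustrate the point about interpretation, here is a minimal sketch showing the same bytes read under two different code pages. It uses Python's built-in `cp1252` and `cp1251` codecs as stand-ins for Windows code pages 1252 and 1251; the sample bytes are made up for the example.

```python
# The same two byte values, read under two different Windows code pages.
raw = bytes([0xC0, 0xE0])

as_western = raw.decode("cp1252")   # Western European code page
as_cyrillic = raw.decode("cp1251")  # Cyrillic code page

print(as_western)   # Latin capital and small A with grave accent
print(as_cyrillic)  # Cyrillic capital and small letter A

# Case conversion is defined on characters, so it only gives a sensible
# result once the bytes have been interpreted with the right code page.
print(as_cyrillic.upper())
```

Nothing in the bytes themselves says which reading is correct, which is why the code page has to travel with the data (or be agreed on in advance).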
date: Thu, 7 Aug 2008 00:26:09 -0700   author:   Mirek Suk

Re: Code pages v. Arial Unicode MS   
"Siegfried Heintze" wrote:
> Is not the concept of a font a superset of a code page?

You should read this article. It will answer all your questions.

"The Absolute Minimum Every Software Developer Absolutely, 
Positively Must Know About Unicode and Character Sets"
http://www.joelonsoftware.com/articles/Unicode.html


Alex
date: Thu, 7 Aug 2008 11:24:29 +0300   author:   Alex Blekhman

Re: Code pages v. Arial Unicode MS   
> I'm trying to understand the concept of code pages.
Some basic lingo here:
  http://www.mihai-nita.net/article.php?artID=20060806a


> I have a font called "Arial Unicode MS" and it seems to have a lot of 
> glyphs, including Chinese and Russian. Is there not a unique integer for each 
> glyph in this font? If so, why do we need the concept of a code page?
Yes, there is a glyph ID for each glyph.
But in different fonts the same glyph ID might represent
a different character.
You need a code page because it is a standard, designed for data
interchange. Glyph IDs are not standardised; they are controlled by
font creators.
Also, OpenType/TrueType fonts are limited to 65K glyphs, while Unicode
encodes more than 100K characters.
The mapping between code points ("characters") and glyphs is not necessarily
trivial. You can have one-to-one, one-to-many, many-to-one, and many-to-many
mappings. The mappings can depend on context, language, font, etc.
If you want to dig deeper into this you can go to the OpenType spec.
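
A rough sketch of why glyph IDs cannot serve as an interchange format: two fonts can assign different IDs to the same character, or one glyph to several characters. Both "fonts" below, their glyph numbers, and the `glyph_ids` helper are invented for illustration; a real font keeps this table in its `cmap` data, and real text shaping (ligatures, context) goes well beyond a per-character lookup.

```python
# Two hypothetical fonts, each mapping Unicode code points to its own glyph IDs.
FONT_A_CMAP = {0x0041: 36, 0x0410: 570}  # Latin 'A', Cyrillic 'A'
FONT_B_CMAP = {0x0041: 3, 0x0410: 3}     # this font reuses one glyph for both

def glyph_ids(text, cmap, missing_glyph=0):
    """Look up one glyph ID per code point; glyph 0 is the 'missing glyph' slot."""
    return [cmap.get(ord(ch), missing_glyph) for ch in text]

text = "A\u0410"  # Latin A followed by Cyrillic A
print(glyph_ids(text, FONT_A_CMAP))  # [36, 570]
print(glyph_ids(text, FONT_B_CMAP))  # [3, 3]
```

The code points in `text` are the same for both fonts; only the glyph IDs differ, so a file storing glyph IDs would be readable with one font only.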


> Is not the concept of a font a superset of a code page?
No. A font is a collection of glyphs (a very dumbed-down definition).
A code page is a coded character set.


> So I am confused. When I look at the documentation for MultiByteToWideChar 
> http://msdn.microsoft.com/en-us/library/ms776413(VS.85).aspx I see 7 code 
> pages.
> But when I call EnumSystemCodePages, I see well over a dozen. This looks 
> more like http://www.microsoft.com/globaldev/reference/WinCP.mspx.
> Is Microsoft inconsistent in their use of the term "code page" here?

The 7 are predefined symbolic values, the others are numeric.
Similar (if you want) to 
   #define MAX_USHORT  0xFFFF
   #define MAX_PATH    260
It does not mean that MAX_USHORT and MAX_PATH are the only numeric
values possible. CP_UTF8 is just a nice way to say 65001.

Even the MultiByteToWideChar page says "for a list of code pages,
see Code Page Identifiers."
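
The relationship between the symbolic names and the numeric identifiers can be sketched as follows. The seven constants carry the values defined in the Windows headers; the `CODEC_FOR_CODE_PAGE` table and the `decode_with_code_page` helper are hypothetical, and simply map a few numeric identifiers to Python codecs for demonstration.

```python
# The seven symbolic identifiers and their numeric values.
CP_ACP = 0          # system default ANSI code page (depends on configuration)
CP_OEMCP = 1        # system default OEM code page
CP_MACCP = 2        # system default Macintosh code page
CP_THREAD_ACP = 3   # ANSI code page of the current thread
CP_SYMBOL = 42
CP_UTF7 = 65000
CP_UTF8 = 65001

# Hypothetical lookup from a few numeric code page IDs to Python codec names.
# The "system default" values are left out because they depend on the machine.
CODEC_FOR_CODE_PAGE = {
    1250: "cp1250",
    1251: "cp1251",
    1252: "cp1252",
    CP_UTF7: "utf-7",
    CP_UTF8: "utf-8",
}

def decode_with_code_page(data, code_page):
    """Decode bytes using a numeric code page identifier."""
    return data.decode(CODEC_FOR_CODE_PAGE[code_page])

# The symbolic name and the plain number are interchangeable.
print(decode_with_code_page(b"\xe2\x82\xac", CP_UTF8))
print(decode_with_code_page(b"\xe2\x82\xac", 65001))
print(decode_with_code_page(b"\x80", 1252))
```

All three calls print the euro sign: the first two because `CP_UTF8` is just 65001, the third because code page 1252 stores that character as the single byte 0x80.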


-- 
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
date: Fri, 08 Aug 2008 04:35:40 -0700   author:   Mihai N.

Re: Code pages v. Arial Unicode MS   
Thanks folks!

Do UTF-7 and UTF-8 supersede all the other code pages? If all web documents 
used UTF-7 or UTF-8 would there be any need for the other code pages?

What functions does IE call when it sees the encoding in the header of a 
response from a web server? As Joel on software explains, the header of a 
response contains the encoding. I believe all web pages are 8 bit characters 
by definition, correct?

So I think what happens is that IE looks at the encoding type, finds the 
code page and specifies this as an argument to MultiByteToWideChar and 
converts everything to Wide characters and then calls the GDI function 
DrawTextW which then uses each character code to index into an array of 
glyphs (aka a font such as Arial Unicode MS). Is this correct?

What does MultiByteToWideChar convert to? UTF-16?

What happens if MultiByteToWideChar finds an exotic Chinese character that 
cannot fit into 16 bits? Does DrawTextW use IsDBCSLeadByte to compute the 
proper index to the glyph array? How does it do that?

Thanks!
Siegfried
date: Tue, 12 Aug 2008 19:55:18 -0700   author:   Siegfried Heintze

Re: Code pages v. Arial Unicode MS   
> Do UTF-7 and UTF-8 supersede all the other code pages? If all web documents 
> used UTF-7 or UTF-8 would there be any need for the other code pages?

UTF-7 is rarely used today.
You will most likely see UTF-8 on the web and in the Unix world, and for
data exchange and storage. You will see lots of UTF-16 where text processing
is needed (Win32 API, Mac OS X API, .NET, Java, JavaScript, ICU, Qt, etc.)
and some UTF-32 (relatively little) in Unix apps using wchar_t.



> What functions does IE call when it sees the encoding in the header of a 
> response from a web server? As Joel on software explains, the header of a 
> response contains the encoding.
MLang: http://msdn.microsoft.com/en-us/library/aa767865(VS.85).aspx
Determining the code page of a web page is a bit more complicated.
You should start with RFC 2616 and the charset parameter of the
Content-Type header.
That comes from the server.
You can also specify it in a meta tag.
And you can also guess the encoding, if everything else fails
(MLang IMultiLanguage2, DetectInputCodepage & DetectCodepageInIStream).
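
The order of precedence described above (server header first, then meta tag, then a fallback) might look roughly like this. This is only a sketch: the function names are invented, the regular expressions are deliberately simplistic, the `windows-1252` fallback is an assumption, and no attempt is made to reproduce what IE or MLang actually do when guessing.

```python
import re

def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type header value, or None."""
    match = re.search(r'charset\s*=\s*"?([\w.:-]+)"?', header_value or "", re.IGNORECASE)
    return match.group(1).lower() if match else None

def charset_from_meta(html_bytes):
    """Look for charset=... inside a <meta> tag near the start of the page."""
    head = html_bytes[:1024].decode("ascii", errors="ignore")
    match = re.search(r'<meta[^>]*charset\s*=\s*["\']?([\w.:-]+)', head, re.IGNORECASE)
    return match.group(1).lower() if match else None

def pick_charset(header_value, html_bytes, fallback="windows-1252"):
    """Prefer the HTTP header, then the meta tag, then an assumed default."""
    return (charset_from_content_type(header_value)
            or charset_from_meta(html_bytes)
            or fallback)

page = b'<html><head><meta charset="UTF-8"></head><body>...</body></html>'
print(pick_charset("text/html; charset=ISO-8859-1", page))  # iso-8859-1
print(pick_charset("text/html", page))                      # utf-8
print(pick_charset(None, b"<html></html>"))                 # windows-1252
```

A real browser replaces the last step with statistical detection, which is the job of the MLang detection calls mentioned above.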


> I believe all web pages are 8 bit characters by definition, correct?
Technically UTF-8 might need up to 4 bytes to represent some code points
(and what the user understands as a character might need even more :-).
But trying to guess what you are really asking: UTF-16 is valid for web pages,
and IE understands it.
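
A quick way to see the byte counts, using Python's `utf-8` codec. The sample characters are arbitrary picks, one from each length class, and `utf8_length` is a helper made up for this example.

```python
def utf8_length(ch):
    """Number of bytes UTF-8 needs for a single code point."""
    return len(ch.encode("utf-8"))

# One sample code point for each possible UTF-8 length.
for ch in ["A", "\u00e9", "\u20ac", "\U00020000"]:
    print(f"U+{ord(ch):04X}: {utf8_length(ch)} byte(s)")

# One user-perceived character can consist of several code points:
decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
print(len(decomposed), len(decomposed.encode("utf-8")))  # 2 code points, 3 bytes
```

So "8-bit" describes the size of the code unit, not the size of a character.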

> So I think what happens is that IE looks at the encoding type, finds the 
> code page and specifies this as an argument to MultiByteToWideChar and 
> converts everything to Wide characters and then calls the GDI function 
> DrawTextW which then uses each character code to index into an array of 
> glyphs (aka a font such as Arial Unicode MS). Is this correct?

There might be some implementation details that are different, but
you got the principle right.

(IE probably calls ConvertStringToUnicode or ConvertStringToUnicodeEx,
or IMLangConvertCharset or ConvertINetMultiByteToUnicode (don't ask
me why so many :-). It can use IEnumScript and IMLangFontLink and
IMLangFontLink2 to split the text into runs using the same script.
DrawTextW might in fact be DrawTextEx, ExtTextOut (or other GDI stuff),
or IE might even call Uniscribe directly (I doubt it).)

> What does MultiByteToWideChar convert to? UTF-16?
Yes, always UTF-16.

> What happens if MultiByteToWideChar finds an exotic Chinese character that 
> cannot fit into 16 bits?
All Chinese characters in the code pages supported by Windows
have UTF-16 mappings. A character that does not fit into 16 bits is
represented in UTF-16 as a surrogate pair (two 16-bit code units).
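
As an illustration of a character beyond the 16-bit range, the sketch below round-trips one through a multibyte encoding and then looks at its UTF-16 form. Python's `gb18030` codec stands in for the Windows GB18030 code page here, and U+20000 is just a convenient sample; this says nothing about how MultiByteToWideChar is implemented internally.

```python
ch = "\U00020000"                     # a CJK ideograph outside the 16-bit range
gb_bytes = ch.encode("gb18030")       # its multibyte form
decoded = gb_bytes.decode("gb18030")  # multibyte back to Unicode

# Split the UTF-16 (little-endian) form into 16-bit code units.
utf16 = decoded.encode("utf-16-le")
units = [int.from_bytes(utf16[i:i + 2], "little") for i in range(0, len(utf16), 2)]

print(len(gb_bytes))            # 4
print([hex(u) for u in units])  # ['0xd840', '0xdc00']
```

The character takes two 16-bit code units, a high surrogate followed by a low surrogate, so a wide-character buffer needs two elements for it.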

> Does DrawTextW use IsDBCSLeadByte to compute the 
> proper index to the glyph array? How does it do that?
IsDBCSLeadByte is not used/needed for Unicode text.
GDI calls Uniscribe, if needed (for complex scripts).
Determining the glyphs from a code point is not trivial; it requires
an understanding of how each script works, and of OpenType internals.
See http://www.microsoft.com/typography/Glyph%20Processing/overview.mspx
and "Script-specific Development" here:
http://www.microsoft.com/typography/SpecificationsOverview.mspx
to get some idea.

date: Tue, 12 Aug 2008 21:08:24 -0700   author:   Mihai N.

Re: Code pages v. Arial Unicode MS   
> (IE probably calls ConvertStringToUnicode or ConvertStringToUnicodeEx,
...
> or might even call uniscribe directly (I doubt).)

It looks like it calls Uniscribe directly:
"The Uniscribe DLL (USP10.DLL) currently ships with Windows 2000
and Internet Explorer 5.0+."


date: Tue, 12 Aug 2008 21:22:30 -0700   author:   Mihai N.
