Ureader.com  
Microsoft software help and Community
date: Tue, 27 May 2008 06:59:03 -0700,    group: microsoft.public.win32.programmer.international


char buffer to unicode buffer   
Hi all,

I need to convert an ANSI buffer to a Unicode buffer by using 
MultiByteToWideChar. 
Environment: VC++ 6.0.

In the result buffer, the non-ASCII characters are missing after decoding with 
CP_UTF8. How can I resolve this issue?

Please find the code snippet below.


#include <windows.h>
#include <tchar.h>
#include <string.h>

int _tmain(int argc, _TCHAR* argv[])
{
	CHAR szData[256] = {0};
	strcpy(szData, "AëBëC DöEöF");
	INT nDataLen = (INT)strlen(szData);
	// First call: ask for the required size in WCHARs
	// (the -1 length makes the count include the terminator).
	INT nDesBufferLen = ::MultiByteToWideChar(CP_UTF8, 0, szData, -1, 0, 0);
	WCHAR* pUnicodeData = new WCHAR[nDesBufferLen + 1];
	ZeroMemory(pUnicodeData, (nDesBufferLen + 1) * sizeof(WCHAR));
	// Second call: convert nDataLen bytes (terminator not included).
	UINT uiWideCharLen = ::MultiByteToWideChar(CP_UTF8, 0, szData, nDataLen,
	                                           pUnicodeData, nDesBufferLen);
	pUnicodeData[uiWideCharLen] = 0;
	// Here pUnicodeData has "ABC DEF" only. It didn't return "AëBëC DöEöF"
	// in the pUnicodeData buffer. How?
	delete[] pUnicodeData;

	return 0;
}

-- 
Thanks & Regards,
Bill.
date: Tue, 27 May 2008 06:59:03 -0700   author:   Bill

Re: char buffer to unicode buffer   
Your literal ANSI string is encoded using the default codepage, not UTF-8.

Replace CP_UTF8 with CP_ACP and it should work better... until you run the 
code on a computer with a different default codepage!

HTH,

Serge.
http://www.apptranslator.com - Localization tool for your applications
date: Tue, 27 May 2008 16:59:41 +0200   author:   Serge Wautier
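
A note on why CP_UTF8 lost the accented letters, assuming the literal was stored in Windows-1252: there ë is the single byte 0xEB and ö is 0xF6. In UTF-8 both values are lead bytes that must be followed by continuation bytes, so followed by plain ASCII they are malformed. As far as the MultiByteToWideChar documentation describes it, with dwFlags set to 0 older Windows versions silently drop such bytes, which matches the "ABC DEF" result. The sketch below is a portable structural check only, not what MultiByteToWideChar does internally, and the name looksLikeUtf8 is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: structural UTF-8 check only. It verifies that each
// lead byte is followed by the right number of 10xxxxxx continuation bytes.
// It does not reject overlong forms or surrogates.
bool looksLikeUtf8(const std::string& bytes)
{
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;
        if (lead < 0x80)                extra = 0;  // 0xxxxxxx: ASCII
        else if ((lead & 0xE0) == 0xC0) extra = 1;  // 110xxxxx: 2-byte sequence
        else if ((lead & 0xF0) == 0xE0) extra = 2;  // 1110xxxx: 3-byte sequence
        else if ((lead & 0xF8) == 0xF0) extra = 3;  // 11110xxx: 4-byte sequence
        else return false;                          // stray continuation or invalid lead

        if (extra > bytes.size() - i - 1) return false;  // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return false;        // not a continuation byte
        }
        i += extra + 1;
    }
    return true;
}
```

With this check, the Windows-1252 bytes "A\xEB" "B" are rejected, while the UTF-8 bytes "A\xC3\xAB" "B" (the same text, AëB) are accepted. The literals are split because a hex escape would otherwise swallow the following letter B.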

Re: char buffer to unicode buffer   
Thank you. It is working fine now with CP_UTF7 and CP_ACP.

Requirement: Our application receives "AëBëC DöEöF" data from a server through 
sockets. It arrives in a void buffer. How do I know the codepage of this 
data?

Do I need to apply CP_ACP, CP_UTF7 or CP_UTF8?

How can I handle this in a generic way for all cases?

Thanks in advance.

-- 
Thanks & Regards,
Bill.

date: Tue, 27 May 2008 22:02:00 -0700   author:   Bill

Re: char buffer to unicode buffer   
> Our application receives "AëBëC DöEöF" data from server through 
> sockets. It will be there in void buffer. How do I know the codepage
> of this data?

You can't know.
The server should tag the data with the code page, or the protocol should
specify that all communication is done using code page X.

If you have control over the server side, I would recommend establishing
UTF-8 as the encoding for the whole communication.
That means fewer code page conversions (better performance), less complexity,
and less risk of corrupted data.


-- 
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
date: Tue, 27 May 2008 22:47:27 -0700   author:   Mihai N.
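
To make the "tag the data with the code page" option concrete, here is a minimal sketch of one way a message could carry its own code page. Everything in it is an assumption for illustration: the wire layout (4-byte big-endian code page, 4-byte big-endian length, then the raw bytes) and the names TaggedText, packTaggedText and unpackTaggedText are hypothetical and not part of any existing protocol. The values 65001 and 1252 are the Windows code page identifiers for UTF-8 and Windows-1252.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical wire format: 4-byte big-endian code page identifier,
// 4-byte big-endian payload length, then the raw text bytes.
struct TaggedText {
    std::uint32_t codePage;  // e.g. 65001 (UTF-8) or 1252 (Windows-1252)
    std::string   bytes;     // text exactly as encoded by the sender
};

static void appendU32(std::string& out, std::uint32_t v)
{
    for (int shift = 24; shift >= 0; shift -= 8)
        out.push_back(static_cast<char>((v >> shift) & 0xFF));
}

static std::uint32_t readU32(const std::string& in, std::size_t pos)
{
    std::uint32_t v = 0;
    for (std::size_t k = 0; k < 4; ++k)
        v = (v << 8) | static_cast<unsigned char>(in[pos + k]);
    return v;
}

std::string packTaggedText(const TaggedText& msg)
{
    std::string wire;
    appendU32(wire, msg.codePage);
    appendU32(wire, static_cast<std::uint32_t>(msg.bytes.size()));
    wire += msg.bytes;
    return wire;
}

// Returns false if the buffer is shorter than its header or declared length.
bool unpackTaggedText(const std::string& wire, TaggedText& out)
{
    if (wire.size() < 8) return false;
    const std::uint32_t length = readU32(wire, 4);
    if (wire.size() - 8 < length) return false;
    out.codePage = readU32(wire, 0);
    out.bytes = wire.substr(8, length);
    return true;
}
```

The receiver would then pass out.codePage as the first argument of MultiByteToWideChar instead of guessing. If both sides agree on UTF-8 for everything, the tag becomes unnecessary, which is the simpler design recommended in the post above.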

Re: char buffer to unicode buffer   
Thanks a lot, Mihai, for your prompt response and for guiding me.

I didn't change anything on my server side. Still, it supports 
Chinese, Japanese, Arabic and some other languages as well. 

Why doesn't MultiByteToWideChar(CP_UTF8, ...) handle these characters?

And how is MultiByteToWideChar(CP_UTF7, ...) able to convert them?

Can you please clear my doubt?

Can you please suggest any good article or book to understand these 
encoding/decoding standards?

Thanks in advance.

-- 
Thanks & Regards,
Bill.

date: Tue, 27 May 2008 23:00:00 -0700   author:   Bill

Re: char buffer to unicode buffer   
> Can you please clear my doubt?

Unfortunately, I cannot :-)
You have to figure out what encoding the server is using when sending data.
And, if you have a way, configure the server to always send UTF-8
(which supports all the languages you mention).


> Can you please suggest any good article or book to understand these 
> encoding/decoding standards?

Not really. Because I am still not sure what you have there, I don't
know what would help you most.
I have tried to explain the "basic lingo" here:
   http://www.mihai-nita.net/article.php?artID=20060806a

Main point: different code pages assign different numbers to the same 
character. So if I get 85h and I don't know what code page it is in,
I have no way to know what the character is. Like an encrypted message :-)

If I send you a bunch of bytes (85h 92h A2h) and I don't tell you what
code page they are in, it is hopeless (see "Coded character set" in my post).
So either you send the bytes and tell me what the code page is for
each message you are sending, or we establish that we always use the
same code page (UTF-8 recommended) and you don't have to specify it
for every single message.


Same with your server: you have to figure out which code page
it uses to send the data, and then you will know which code page
you need to use on the client side (the same one, obviously :-)


-- 
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
date: Wed, 28 May 2008 22:59:55 -0700   author:   Mihai N.
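
The "85h 92h A2h" example from the post above can be made concrete with a small sketch. It decodes only those three bytes (plus ASCII) under two code pages chosen for illustration, the DOS code page 437 and Windows-1252; the choice of code pages and the names CodePage, decodeByte and decode are assumptions, and a real program would use full conversion tables or MultiByteToWideChar.

```cpp
#include <string>

enum class CodePage { Cp437, Cp1252 };

// Decodes only the three sample bytes from the post (85h, 92h, A2h) plus
// ASCII. Any other byte maps to U+FFFD because this sketch has no full table.
char32_t decodeByte(unsigned char b, CodePage cp)
{
    if (b < 0x80) return b;  // ASCII is identical in both code pages
    switch (b) {
    case 0x85: return cp == CodePage::Cp437 ? 0x00E0 : 0x2026;  // a-grave vs ellipsis
    case 0x92: return cp == CodePage::Cp437 ? 0x00C6 : 0x2019;  // AE ligature vs right quote
    case 0xA2: return cp == CodePage::Cp437 ? 0x00F3 : 0x00A2;  // o-acute vs cent sign
    default:   return 0xFFFD;  // not covered by this sketch
    }
}

// Decodes a whole byte string to Unicode code points under the given code page.
std::u32string decode(const std::string& bytes, CodePage cp)
{
    std::u32string text;
    for (char ch : bytes)
        text.push_back(decodeByte(static_cast<unsigned char>(ch), cp));
    return text;
}
```

The same three bytes come out as U+00E0 U+00C6 U+00F3 under code page 437 and as U+2026 U+2019 U+00A2 under Windows-1252, so without knowing the code page the receiver cannot tell which text was meant.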

Re: char buffer to unicode buffer   
"Serge Wautier"  wrote in message news:OtB$PpAwIHA.3500@TK2MSFTNGP02.phx.gbl...
> Your literal ANSI string is encoded using the default codepage, not UTF8.
> 
> Replace CP_UTF8 by CP_ACP and it should work better... until you run the 
> code on a computer with a different default codepage!

I am having a dog of a time understanding Visual Studio's and/or the C++ specification's way of dealing with string literals.

I mean, as a counterpoint, I have a code sample that surprised me by storing several string literals in a C++ source file and treating them as UTF-8 encoded. The source file itself was Unicode with a BOM.

Does the source file encoding directly influence the literal encoding? Even if conversions are required (conversions because the source file was UTF-16, and the string literal, lacking an L"", was strictly some kind of 8-bit encoding)?
date: Mon, 18 Aug 2008 10:11:07 +0200   author:   Chris Becke

Re: char buffer to unicode buffer   
> I am having a dog of a time understanding Visual Studio's and/or the C++
> specification's way of dealing with string literals.
The C++ specs are really fuzzy (or missing) in most areas regarding
international support.

> I mean, as a counterpoint, I have a code sample that surprised me by storing
> several string literals in a C++ source file and treating them as UTF-8
> encoded. The source file itself was Unicode with a BOM.
This is the "a string of char is a bunch of bytes" principle at work.
It is loved by Unix/Linux developers, since it makes everything seem easy.
If you have a BOM, a text editor will know the file is UTF-8.
When compiled, nothing happens.
And at runtime "you should know" that the source was UTF-8.

In general, C strings are "a bunch of bytes" and they stay "as is" when
compiled. If at runtime the environment does not match the compile-time
environment, you might have some nasty surprises
(and also nasty surprises if you have an international team, with
various developers using machines with various international settings).

> Does the source file encoding directly influence the literal encoding?
> Even if conversions are required (conversions because the source file was
> UTF-16, and the string literal, lacking an L"", was strictly some kind of
> 8-bit encoding)?
For a wchar_t string using L"..." the encoding of the source matters,
because the bytes in the source file must be converted to UTF-16 or UTF-32
(depending on OS).

VS will convert L"..." using the following code pages:
if( BOM )
    the conversion uses the Unicode form matching the BOM
else if ( there is a #pragma setlocale in the source file )
    that locale determines the code page
else
    the system code page (ANSI cp) is used


Of course, the best option is to take all strings out of the sources and move 
them into resources :-)


-- 
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
date: Mon, 18 Aug 2008 02:07:48 -0700   author:   Mihai N.
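
The three-way rule in the post above can be written out as a small function. This is only a model of the rule as the post describes it, not the compiler's actual implementation; the names LiteralSource and pickLiteralSource are hypothetical, and the caller is assumed to have already scanned the file for a #pragma setlocale. The byte order marks checked are the standard ones: EF BB BF for UTF-8, FF FE for UTF-16 little-endian, FE FF for UTF-16 big-endian.

```cpp
#include <string>

// Where the code page for converting L"..." literals comes from,
// following the order given in the post.
enum class LiteralSource { Utf8Bom, Utf16LeBom, Utf16BeBom, PragmaSetlocale, SystemAnsi };

// `head` holds the first bytes of the source file; `hasPragmaSetlocale`
// says whether a #pragma setlocale was found in it (assumed to be
// determined by the caller).
LiteralSource pickLiteralSource(const std::string& head, bool hasPragmaSetlocale)
{
    // 1. A byte order mark wins over everything else.
    if (head.compare(0, 3, "\xEF\xBB\xBF") == 0) return LiteralSource::Utf8Bom;
    if (head.compare(0, 2, "\xFF\xFE") == 0)     return LiteralSource::Utf16LeBom;
    if (head.compare(0, 2, "\xFE\xFF") == 0)     return LiteralSource::Utf16BeBom;
    // 2. Otherwise a #pragma setlocale decides.
    if (hasPragmaSetlocale)                      return LiteralSource::PragmaSetlocale;
    // 3. Otherwise fall back to the system (ANSI) code page.
    return LiteralSource::SystemAnsi;
}
```

A file without a BOM and without the pragma lands in the last branch, which is why the same source can produce different wide strings on machines with different system code pages.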

Re: char buffer to unicode buffer   
"Mihai N."  wrote in message news:Xns9AFE15AA3F59MihaiN@207.46.248.16...
> This is the "a string of char is a bunch of bytes" principle at work.
> [...]
> In general, C strings are "a bunch of bytes" and they stay "as is" when
> compiled.

The more interesting thing (to me) is the case where the "string of bytes" principle does not apply. Specifically, when the source file is Unicode (UTF-16) and L is not present, implying the literal is a char[], the compiler has to pick a char-sized encoding to convert the literal to.
date: Mon, 18 Aug 2008 11:43:24 +0200   author:   Chris Becke

Re: char buffer to unicode buffer   
> The more interesting thing (to me) is the case where the "string of bytes"
> principle does not apply. Specifically, when the source file is Unicode
> (UTF-16) and L is not present, implying the literal is a char[],
> the compiler has to pick a char-sized encoding to convert the
> literal to.

You are right. I guess the reverse of the rules applies :-)
And you probably end up with the text converted using the ANSI code page.


-- 
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
date: Mon, 18 Aug 2008 23:35:39 -0700   author:   Mihai N.

Re: char buffer to unicode buffer   
Well, VS8 picked UTF-8. I just wish I knew *why*.
date: Tue, 19 Aug 2008 08:46:32 +0200   author:   Chris Becke

Re: char buffer to unicode buffer   
> Well, VS8 picked UTF-8. I just wish I knew *why*.

That is not what I see.

I have tried this:
==============
#include <stdio.h>
#include <tchar.h>

int main( void ) {
//	const char *str = "JapaneseText";
	const char *str = "Tést";
	// Dump each byte of the narrow literal exactly as the compiler stored it.
	for( int i = 0; str[i]; ++i )
		_tprintf( _T("str[%d] = '%c' (0x%x)\n"), i, str[i], (unsigned char)str[i] );
	return 0;
}
==============
With the accented e the result is:
str[0] = 'T' (0x54)
str[1] = 'é' (0xe9)
str[2] = 's' (0x73)
str[3] = 't' (0x74)

With UTF-8 this would have been c3 a9.

With the Japanese text I get compile-time warnings:
c.cpp(5) : warning C4566: character represented by universal-character-name '\u65E5' cannot be represented in the current code page (1252)
c.cpp(5) : warning C4566: character represented by universal-character-name '\u672C' cannot be represented in the current code page (1252)
c.cpp(5) : warning C4566: character represented by universal-character-name '\u8A9E' cannot be represented in the current code page (1252)

So it really looks like it goes to the current code page, not UTF-8.

Maybe there is something else going on.
Any chance the sample is doing some fancy conversions at runtime?


-- 
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
date: Tue, 19 Aug 2008 11:14:07 -0700   author:   Mihai N.
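
For reference, the "c3 a9" figure in the post above follows directly from the UTF-8 encoding rules: é is U+00E9, which needs two bytes in UTF-8, while code page 1252 stores it as the single byte 0xE9. The sketch below encodes one code point by hand; the name encodeUtf8 is hypothetical, and the function does not validate its input.

```cpp
#include <string>

// Encodes one Unicode code point as UTF-8. Surrogates and values above
// U+10FFFF are not rejected in this sketch.
std::string encodeUtf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

encodeUtf8(0x00E9) yields the bytes C3 A9, and the first Japanese character from the warnings, U+65E5, yields E6 97 A5. Seeing a lone 0xE9 in the dump is therefore good evidence that the literal was stored in the ANSI code page rather than in UTF-8.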
