date: Tue, 27 May 2008 06:59:03 -0700,
group: microsoft.public.win32.programmer.international
char buffer to unicode buffer
Hi all,
I need to convert an ANSI buffer to a Unicode buffer by using
MultiByteToWideChar.
Environment: VC++ 6.0.
In the result buffer, the non-ASCII characters are missing after converting
with CP_UTF8. How do I resolve this issue?
Please find the code snippet below.
int _tmain(int argc, _TCHAR* argv[])
{
    CHAR szData[256] = {0};
    strcpy(szData, "AëBëC DöEöF");
    INT nDataLen = strlen(szData);
    INT nDesBufferLen = ::MultiByteToWideChar( CP_UTF8,
                                               0,
                                               szData,
                                               -1,
                                               0,
                                               0);
    WCHAR* pUnicodeData = new WCHAR[nDesBufferLen + 1];
    ZeroMemory(pUnicodeData, (nDesBufferLen + 1) * sizeof(WCHAR));
    UINT uiWideCharLen = ::MultiByteToWideChar( CP_UTF8,
                                                0,
                                                szData, nDataLen,
                                                pUnicodeData,
                                                nDesBufferLen);
    pUnicodeData[uiWideCharLen] = 0;
    // Here pUnicodeData has "ABC DEF" only. It didn't return "AëBëC DöEöF"
    // in the pUnicodeData buffer. How?
    if (pUnicodeData) delete[] pUnicodeData;
    return 0;
}
--
Thanks & Regards,
Bill.
date: Tue, 27 May 2008 06:59:03 -0700
author: Bill
Re: char buffer to unicode buffer
Your literal ANSI string is encoded using the default code page, not UTF-8.
Replace CP_UTF8 with CP_ACP and it should work better... until you run the
code on a computer with a different default code page!
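
A minimal, portable sketch of why the CP_UTF8 call lost the accented
letters, assuming the literal was compiled as Windows-1252 (an assumption;
the thread does not state the machine's code page). In that code page "ë" is
the single byte 0xEB, which is not a complete UTF-8 sequence, whereas real
UTF-8 stores it as the two bytes 0xC3 0xAB. The helper name isValidUtf8 is
hypothetical and is not part of the Win32 API. How MultiByteToWideChar treats
such bytes depends on the Windows version; older versions are documented to
drop them, which would match the "ABC DEF" result.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `bytes` is well-formed UTF-8.
// Rejects truncated sequences, stray continuation bytes, overlong forms,
// surrogates, and code points above U+10FFFF.
bool isValidUtf8(const std::string& bytes) {
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;     // number of continuation bytes expected
        unsigned long cp = 0;      // code point being assembled
        unsigned long minCp = 0;   // smallest value allowed for this length
        if (lead < 0x80)                { extra = 0; cp = lead;        minCp = 0; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; minCp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; minCp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; minCp = 0x10000; }
        else return false;                       // invalid lead byte
        if (extra > n - i - 1) return false;     // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < minCp) return false;                    // overlong encoding
        if (cp > 0x10FFFF) return false;                 // outside Unicode
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        i += extra + 1;
    }
    return true;
}
```

With this helper, the Windows-1252 bytes "A\xEB" "B" are rejected, while the
UTF-8 bytes "A\xC3\xAB" "B" are accepted. A check like this can tell you that
a buffer is not UTF-8, but it cannot tell you which ANSI code page it is.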
HTH,
Serge.
http://www.apptranslator.com - Localization tool for your applications
date: Tue, 27 May 2008 16:59:41 +0200
author: Serge Wautier
Re: char buffer to unicode buffer
Thank you. It is working fine now with CP_UTF7 and CP_ACP.
Requirement: Our application receives "AëBëC DöEöF" data from the server
through sockets. It arrives in a void buffer. How do I know the code page of
this data, i.e. whether I need to apply CP_ACP, CP_UTF7, or CP_UTF8?
How can I handle this in a generic way for all cases?
Thanks in advance.
--
Thanks & Regards,
Bill.
date: Tue, 27 May 2008 22:02:00 -0700
author: Bill
Re: char buffer to unicode buffer
> Can you please clear my doubt?
Unfortunately, I cannot :-)
You have to figure out what encoding the server is using when sending data.
And if you have a way to configure the server to always send UTF-8
(which supports all the languages you mention), do that.
> Can you please suggest any good article or book to understand these
> encoding/decoding standards?
Not really. Because I am still not sure what you have there, I don't
know what can help you most.
I have tried to explain the "basic lingo" here:
http://www.mihai-nita.net/article.php?artID=20060806a
Main point: different code pages assign different numbers to the same
character. So if I get 85h and I don't know in what code page that is,
I have no way to understand what the character is. Like an encrypted message :-)
If I send you a bunch of bytes (85h 92h A2h) and I don't tell you in what
code page they are, it is hopeless (see "Coded character set" in my post).
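
To make the 85h point concrete, here is a small sketch that decodes that one
byte under three code pages. The enum and function names are hypothetical,
and the table covers only ASCII plus the byte 0x85; a real converter would
use complete mapping tables (or MultiByteToWideChar with the right code page
identifier).

```cpp
#include <cassert>

// Hypothetical labels for three single-byte code pages.
enum class CodePage { Latin1, Windows1252, Cp437 };

// Returns the Unicode code point for `byte` under `page`.
// Only ASCII and 0x85 are covered; anything else yields U+FFFD.
char32_t decodeByte(CodePage page, unsigned char byte) {
    if (byte < 0x80) return byte;  // the ASCII range is the same in all three
    if (byte == 0x85) {
        switch (page) {
            case CodePage::Latin1:      return 0x0085;  // NEXT LINE control
            case CodePage::Windows1252: return 0x2026;  // horizontal ellipsis
            case CodePage::Cp437:       return 0x00E0;  // a with grave accent
        }
    }
    return 0xFFFD;  // not covered by this sketch
}
```

The same byte comes out as a control character, an ellipsis, or an accented
letter depending only on which code page the receiver assumes.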
So you either send the bytes and tell me what the code page is for
each message you are sending, or we establish that we always use the
same code page (UTF-8 recommended) and you don't have to specify it
for every single message.
Same with your server: you have to figure out what code page
it uses to send the data, and then you will know what code page
you need to use on the client side (the same one, obviously :-)
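
The first option (tell the receiver the code page with every message) could
look like the sketch below. The wire format, a 4-byte big-endian code page
identifier followed by the raw text bytes, and the names TaggedText,
packMessage and unpackMessage are all assumptions made for illustration;
nothing in this thread says the server uses such a format. The identifiers
1252 and 65001 are the Windows code page numbers for Windows-1252 and UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical message: text bytes plus the code page they are encoded in.
struct TaggedText {
    unsigned long codePage;  // e.g. 1252 or 65001 (UTF-8)
    std::string bytes;       // raw, still-encoded text
};

// Serializes the code page as 4 big-endian bytes, then appends the text.
std::string packMessage(const TaggedText& msg) {
    std::string wire;
    for (int shift = 24; shift >= 0; shift -= 8) {
        wire.push_back(static_cast<char>((msg.codePage >> shift) & 0xFF));
    }
    wire += msg.bytes;
    return wire;
}

// Parses a buffer produced by packMessage. Returns false if it is too short.
bool unpackMessage(const std::string& wire, TaggedText& out) {
    if (wire.size() < 4) return false;
    unsigned long cp = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        cp = (cp << 8) | static_cast<unsigned char>(wire[i]);
    }
    out.codePage = cp;
    out.bytes = wire.substr(4);
    return true;
}
```

On the client, out.codePage would then be passed as the first argument of
MultiByteToWideChar instead of a hard-coded CP_ACP or CP_UTF8. If both sides
agree to always use UTF-8, the header is unnecessary.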
--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
date: Wed, 28 May 2008 22:59:55 -0700
author: Mihai N.
Re: char buffer to unicode buffer
> I am having a dog of a time understanding Visual Studio's and/or the C++
> specifications way for dealing with string literals.
The C++ specs are really fuzzy (or missing) in most areas regarding
international support.
> I mean, as a counterpoint, I have a code sample that suprised me by storing
> several string literals in a c++ source file and treating them as utf-8
> encoded. The source file itself was unicode with a BOM.
This is the "a string of char is a bunch of bytes" principle at work.
It is loved by Unix/Linux developers, since it makes everything seem easy.
If you have a BOM, a text editor will know the file is UTF-8.
When compiled, nothing happens.
And at runtime "you should know" that the source was UTF-8.
In general, C strings are "a bunch of bytes" and they stay "as is" when
compiled. If at runtime the environment does not match the compile-time
environment, you might have some nasty surprises.
(And also nasty surprises if you have an international team, with
various developers using machines with various international settings.)
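
A small sketch of the "bunch of bytes" view: the char-based functions count
bytes, not characters, so the program has to know the encoding to interpret
them. The helper name countUtf8CodePoints is hypothetical, and it assumes
its input really is well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Counts UTF-8 code points by skipping continuation bytes (10xxxxxx).
// Assumes `s` is well-formed UTF-8; it performs no validation.
std::size_t countUtf8CodePoints(const char* s) {
    std::size_t count = 0;
    for (; *s != '\0'; ++s) {
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the UTF-8 bytes "A\xC3\xAB" "B", std::strlen reports 4 while this helper
reports 3. If the same text had been stored as Windows-1252 ("A\xEB" "B"),
strlen would report 3 and the UTF-8 helper would be the wrong tool entirely.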
> Does the source file encoding directly influence the literal encoding?
> Even if conversions are required (Conversions because the source file was
> utf-16, and the string literal, lacking a L"" was strictly some kind of
> 8 bit encoding).
For wchar_t strings using L"...", the encoding of the source matters,
because the bytes in the source file must be converted to UTF-16 or UTF-32
(depending on the OS).
With VS, the compiler converts L"..." using the following code pages:
    if( BOM )
        the conversion uses the Unicode form matching the BOM
    else if ( there is a #pragma setlocale in the source file )
        that locale determines the code page
    else
        the system code page (ANSI cp) is used
Of course, the best option is to take all strings out of the sources and
move them into resources :-)
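
When a wide literal has to stay in the source, one way to avoid depending on
the rules above is to keep the source pure ASCII and spell non-ASCII
characters as universal character names. This is a sketch assuming a
compiler that supports \u escapes in wide literals (current compilers do; I
have not checked VC++ 6.0). The constant name kSample is hypothetical.

```cpp
#include <cassert>
#include <cwchar>

// "AëBëC DöEöF" written with universal character names, so the result
// does not depend on the encoding the compiler assumes for the source file.
// \u takes exactly four hex digits, so the letter after it is not consumed.
const wchar_t kSample[] = L"A\u00EBB\u00EBC D\u00F6E\u00F6F";
```

Here kSample[1] is U+00EB and kSample[7] is U+00F6 regardless of the BOM,
#pragma setlocale, or the system code page.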
date: Mon, 18 Aug 2008 02:07:48 -0700
author: Mihai N.
Re: char buffer to unicode buffer
The more interesting thing (to me) is the case where the "string of bytes"
principle does not apply. Specifically, when the source file is Unicode
(UTF-16) and L is not present, implying the literal is a char[], the
compiler has to pick a char-sized encoding to convert the literal to.
date: Mon, 18 Aug 2008 11:43:24 +0200
author: Chris Becke