date: Wed, 7 Mar 2007 08:51:23 +0530,    group: microsoft.public.word.international.features


Non-Unicode text in Word/txt files   
Is it possible to detect legacy (non-Unicode) encodings inside Word and/or 
text documents?

In Word, I think the answer must be no: once some text is in Word--even if 
it uses a legacy font and has glyphs in the upper ASCII region 
(128-255)--Word will have turned those into what it treats as UTF-16 (even 
though the code points don't match what the font displays). I'm saying this 
based on my experience of sending such text through COM: even the legacy data 
comes through as wide BSTR data that is indistinguishable from UTF-16 (other 
than the semantics of the characters).

Since I can assume that the text files I have to deal with are either UTF-8 
(if Unicode-encoded) or narrow bytes of legacy data, I'm thinking that I 
could open the file in narrow mode (i.e. 8-bit bytes) and then try to do the 
UTF-8 to UTF-16 conversion. If that fails somehow, because of invalid UTF-8 
sequences, then I think I can detect a non-Unicode encoding.
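
A minimal sketch of the check described above, assuming the whole file fits in memory. Python is used only for illustration; the function name `classify_bytes` and its three labels are hypothetical, not part of any tool mentioned in this thread.

```python
def classify_bytes(data: bytes) -> str:
    """Label raw file bytes as 'ascii', 'utf-8' or 'legacy' (hypothetical labels)."""
    # Skip a UTF-8 byte order mark if the file starts with one.
    if data.startswith(b"\xef\xbb\xbf"):
        data = data[3:]
    # Only bytes 0-127: valid UTF-8, but also valid in most legacy code pages.
    if data.isascii():
        return "ascii"
    try:
        # Strict decoding raises on any malformed UTF-8 sequence.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "legacy"
    return "utf-8"


print(classify_bytes("café".encode("utf-8")))   # UTF-8 bytes
print(classify_bytes("café".encode("cp1252")))  # single-byte legacy bytes
print(classify_bytes(b"plain text"))            # no byte above 127
```

Two limits are worth keeping in mind. An "ascii" result is inconclusive, because a legacy font can redefine the glyphs for bytes 0-127. And a "utf-8" result is only probable: a legacy byte pair such as `C3 A9` happens to be well-formed UTF-8, so short legacy files can pass by accident.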

Can anyone think of a (better) way for either scenario?

Thanks,
Bob
date: Wed, 7 Mar 2007 08:51:23 +0530   author:   Bob Eaton

Re: Non-Unicode text in Word/txt files   
Hi Bob,

Not sure which one of two possible issues you have:

a) Text files using some DOS or Windows code page, or
b) Word files using some old font with non-standard "upper ascii" characters 
that you don't have installed.

If it's a), you should be able to import (open) them as "encoded text", and 
choose the proper code page.
The dialog, in Word 2003/2007 at least, shows you a preview, so you can 
check that you've picked the correct code page.

If the texts are already in Word, there is an add-in, eefonts.dot, that adds 
an entry "Fix broken text" to the Tools menu.
You select some bungled text (= text imported or pasted using the system 
code page instead of the proper one), choose the command, and select the 
country (code page) of the original text, then the add-in fixes the text.

For b), there are a few converters for some old custom Greek, Cyrillic, ... 
fonts, which you may be able to find by googling the font name(s).

For a), but also for b) if the font used the code assignments from some 
existing code page, you could also try to save as Windows text (*don't* 
allow character substitution in the Save dialog), and then re-import using 
the proper code page.
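
The same round trip (write the text out with the code page that was used by mistake, then read it back with the right one) can be sketched outside Word. This is an illustration only: the helper name `redecode` is hypothetical, and cp1251/cp1252 are just an example pair of code pages.

```python
def redecode(text: str, wrong: str, right: str) -> str:
    """Undo a wrong-code-page import.

    Re-encode with the code page that was applied by mistake to get the
    original bytes back, then decode them with the intended code page.
    """
    return text.encode(wrong).decode(right)


original = "Привет"
# Simulate the damage: Cyrillic cp1251 bytes read as if they were cp1252.
bungled = original.encode("cp1251").decode("cp1252")
fixed = redecode(bungled, wrong="cp1252", right="cp1251")

print(bungled)
print(fixed)
```

`encode` raises `UnicodeEncodeError` if the text contains a character the wrong code page cannot represent, which is the scripted counterpart of not allowing character substitution when saving: a substituted character would silently lose the original byte.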

If it's a), you'd probably need some heuristic way to search for bungled 
text (...search for characters that appear but should not); if it's b), you 
should be able to search for the font.
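
One possible shape for such a heuristic, as a sketch only: count every character that is neither plain ASCII nor in the set of letters you expect for the language. The function name `unexpected_chars` and the German example alphabet are assumptions made for the illustration.

```python
import string
from collections import Counter


def unexpected_chars(text: str, expected: str) -> Counter:
    """Count characters that are neither printable ASCII nor in `expected`."""
    allowed = set(string.printable) | set(expected)
    return Counter(ch for ch in text if ch not in allowed)


german_letters = "äöüÄÖÜß"  # assumed alphabet for this example
good = "Größe"
# Simulate bungled text: UTF-8 bytes read with the cp1252 code page.
bad = good.encode("utf-8").decode("cp1252")

print(unexpected_chars(good, german_letters))
print(unexpected_chars(bad, german_letters))
```

A non-empty result does not prove the text is bungled; it only points at passages worth a closer look.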

Regards,
Klaus
date: Sun, 11 Mar 2007 07:31:30 +0100   author:   Klaus Linke

Re: Non-Unicode text in Word/txt files   
Thanks Klaus,

I'm dealing with situation (b): non-standard fonts using the upper ASCII 
region for (typically) non-roman glyphs. In fact, they may be using the 
lower region (0-127) for non-ASCII glyphs as well... But I'm sure there's no 
way to tell those characters apart from ASCII without visually inspecting 
the glyph shapes.

The more I think about this, the more I doubt there's a way except for text 
files. That is, a text file that is a narrow stream of bytes can only be 
Unicode-encoded if it's UTF-8, and bytes in the range 128-255 are not 
legitimate UTF-8 sequences by themselves. A legitimate Unicode character 
in the range 128-255 would be represented in 2 bytes, whereas 
if I'm dealing with a legacy-encoded file, I might see such upper ASCII 
bytes by themselves. So, I think there's a way to detect this 
non-standard encoding by inspecting narrow byte stream text files... But in 
Word, I think everything is internally 'wide' (as a result of the code page 
conversion you mentioned) and therefore there's no way to tell whether it's 
legitimate Unicode or not without actually inspecting the glyphs.
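
A small worked example of that byte-level difference, using U+00E9 as the character. The helper `first_invalid_offset` is a hypothetical name; it relies on the `start` attribute that Python's `UnicodeDecodeError` carries.

```python
ch = "\u00e9"  # é, a code point in the 128-255 range

as_utf8 = ch.encode("utf-8")      # two bytes: lead byte plus continuation byte
as_legacy = ch.encode("latin-1")  # one byte with the same value as the code point


def first_invalid_offset(data: bytes):
    """Return the offset of the first malformed UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


print(as_utf8, as_legacy)
print(first_invalid_offset(b"caf" + as_utf8))    # well-formed
print(first_invalid_offset(b"caf" + as_legacy))  # lone high byte
```

Reporting the offset rather than a plain yes/no makes it easier to show the user where in the file the legacy data starts.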

Anyway, thanks again,
Bob
date: Sun, 11 Mar 2007 15:23:31 +0530   author:   Bob Eaton

Re: Non-Unicode text in Word/txt files   
Which fonts are those?

Klaus
date: Sun, 11 Mar 2007 17:32:22 +0100   author:   Klaus Linke

Re: Non-Unicode text in Word/txt files   
That's the problem. We're hoping to be able to detect this generically for 
any legacy (non-Unicode) encoded fonts, so it isn't any particular one.

Bob
date: Mon, 12 Mar 2007 16:00:45 +0530   author:   Bob Eaton

Re: Non-Unicode text in Word/txt files   
I don't think much can be done without knowing which fonts you have a 
problem with.
And BTW, you haven't really stated *what* problem(s) you have.
I assume you have garbage in the files because you no longer have the old 
fonts installed? Or you want characters or symbols converted to Unicode? But 
that's just guessing.

Regards,
Klaus
date: Mon, 12 Mar 2007 19:15:30 +0100   author:   Klaus Linke

Re: Non-Unicode text in Word/txt files   
Yes, I'm developing tools for doing encoding conversion
(http://scripts.sil.org/EncCnvtrs) that are being embedded in different
client applications.

Someone recently asked if there was a way to detect the presence of
non-Unicode data in Word and text documents so that they could know it
needed to be converted.

I suppose for our internal use, the number of legacy fonts we have to deal
with is finite and I could just search for those explicitly... but I just
prefer generic solutions...

By the way, do you know if there's a way to enumerate the fonts that are
actually used in a document (e.g. in the Word PIA interface or VBA)?
Though, I suppose this is a question for another group.

Thanks,
Bob
date: Tue, 13 Mar 2007 08:03:28 +0530   author:   Bob Eaton

Re: Non-Unicode text in Word/txt files   
Hi Bob,

Sorry if I sounded a bit exasperated, but it would have been easier for me 
to make suggestions if I knew which concrete font(s) you are trying to deal with.
Symbol fonts are easy to find; you can do a wildcard search:
        .Text = "[" & ChrW(&HF000) & "-" & ChrW(&HF0FF) & "]"
        .MatchWildcards = True
because Word uses that (corporate private use Unicode) code block for them.
If the old font isn't recognizable as a symbol font to Word, there isn't 
much you can do, though, I'm afraid.
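
Outside Word, the same range test can be applied to text that has already been pulled out of the document (for example over COM). A sketch under that assumption; `symbol_font_chars` is a hypothetical helper name.

```python
import re

# U+F000..U+F0FF: the block that, per the note above, Word uses for
# characters formatted in a symbol font.
SYMBOL_RANGE = re.compile("[\uf000-\uf0ff]")


def symbol_font_chars(text: str) -> list:
    """Return (offset, 'U+XXXX') for every character in the symbol range."""
    return [
        (m.start(), "U+{:04X}".format(ord(m.group())))
        for m in SYMBOL_RANGE.finditer(text)
    ]


sample = "a\uf0b7b"  # one private-use character between two ASCII letters
print(symbol_font_chars(sample))
```

As with the wildcard search, this only finds text in fonts that Word recognises as symbol fonts; a legacy font that claims a normal character set leaves ordinary code points behind and is not caught.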

As to your question about a list of the fonts that are used in a document, I don't 
have a good answer.
Probably the best and easiest way might be to make a copy of the document.
Then look at the font of the first character, store the name, and delete all 
characters in that font with one Find/Replace.
Wash, rinse, and repeat until nothing's left.
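
The loop itself is simple enough to model without Word. In this sketch the document copy is stood in for by a list of (font name, text) pairs; that representation, the function name `fonts_in_order`, and the font names are all made up for the illustration and are not the Word object model.

```python
def fonts_in_order(runs):
    """List the fonts used, in order of first appearance.

    runs: list of (font_name, text) pairs standing in for a document copy.
    """
    remaining = list(runs)  # work on a copy, never on the original
    fonts = []
    while remaining:
        font = remaining[0][0]  # font of the first remaining character
        fonts.append(font)
        # "Delete all characters in that font" in a single pass.
        remaining = [run for run in remaining if run[0] != font]
    return fonts


document = [
    ("BodyFont", "Some text, "),
    ("LegacyFontA", "xyz"),
    ("BodyFont", " and more text."),
]
print(fonts_in_order(document))
```

Each pass removes one font completely, so the number of passes equals the number of distinct fonts rather than the number of characters.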

Regards,
Klaus
date: Tue, 13 Mar 2007 05:02:46 +0100   author:   Klaus Linke

Re: Non-Unicode text in Word/txt files   
BTW, I love SIL and the work you do, and I'll check out your converter, but 
don't have much time this week.

:-)  Klaus
date: Tue, 13 Mar 2007 05:13:54 +0100   author:   Klaus Linke