date: Wed, 7 Mar 2007 08:51:23 +0530
group: microsoft.public.word.international.features
Re: Non-Unicode text in Word/txt files
Hi Bob,
Not sure which one of two possible issues you have:
a) Text files using some DOS or Windows code page, or
b) Word files using some old font with non-standard "upper ascii" characters
that you don't have installed.
If it's a), you should be able to import (open) them as "encoded text" and
choose the proper code page.
The dialog, in Word 2003/2007 at least, shows a preview, so you can check
that you have picked the correct code page.
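Outside Word, the same "preview, then pick" idea can be sketched in a few lines of Python. This is only an illustration: the candidate list and the function name `preview_decodings` are assumptions, not anything Word does internally.

```python
# Candidate code pages to preview; this list is an assumption for illustration.
CANDIDATES = ["cp1252", "cp1251", "cp1253", "cp437"]

def preview_decodings(raw: bytes, encodings=CANDIDATES, width: int = 40) -> dict:
    """Decode the same bytes under each candidate code page for side-by-side review."""
    previews = {}
    for name in encodings:
        # errors="replace" marks undecodable bytes with U+FFFD instead of raising.
        previews[name] = raw.decode(name, errors="replace")[:width]
    return previews

# Stand-in for bytes read from a legacy text file with open(path, "rb").
sample = "Привет".encode("cp1251")
previews = preview_decodings(sample)
```

Only one of the previews will read as real words; a person still has to make that call, just as with the preview in the dialog.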
If the texts are already in Word, there is an add-in, eefonts.dot, that adds
an entry "Fix broken text" to the Tools menu.
You select some bungled text (= text imported or pasted using the system
code page instead of the proper one), choose the command, and select the
country (code page) of the original text, then the add-in fixes the text.
For b), there are a few converters for some old custom Greek, Cyrillic, ...
fonts, which you may be able to find by googling the font name(s).
For a), but also for b) if the font used the code assignments from some
existing code page, you could also try to save as Windows text (*don't*
allow character substitution in the Save dialog), and then re-import using
the proper code page.
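The save-and-re-import round trip amounts to turning the bungled characters back into their original bytes and decoding those bytes again. A minimal Python sketch, assuming the text was wrongly read as Windows-1252 and was really Windows-1251 (both defaults, and the name `fix_bungled_text`, are illustrative assumptions):

```python
def fix_bungled_text(bungled: str, wrong_cp: str = "cp1252", right_cp: str = "cp1251") -> str:
    """Undo a decode that was done with the wrong code page.

    Error handling is strict on purpose: substituting characters would
    destroy the original byte values, which is why character substitution
    should not be allowed when saving.
    """
    raw = bungled.encode(wrong_cp)   # recover the original bytes
    return raw.decode(right_cp)      # reinterpret them with the intended code page

# Cyrillic bytes that were mistakenly decoded as Western European text.
bungled = "Привет".encode("cp1251").decode("cp1252")
fixed = fix_bungled_text(bungled)
```

If the bungled text contains a character that has no byte in the wrong code page, the `encode` step raises `UnicodeEncodeError`, which is a useful signal that the guess about the wrong code page is itself wrong.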
If it's a), you'd probably need some heuristic to search for bungled text
(e.g. search for characters that appear but should not); if it's b), you
should be able to search for the font.
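One possible form of such a heuristic, sketched in Python: count every non-ASCII character that is not on a list of characters you expect in the language at hand. The function name and the idea of a caller-supplied allowed set are assumptions for illustration; a real tool would need a better model of what is "expected".

```python
from collections import Counter

def find_suspect_chars(text: str, allowed: str) -> Counter:
    """Count non-ASCII characters that are not in the caller's allowed set."""
    allowed_set = set(allowed)
    return Counter(ch for ch in text if ord(ch) > 127 and ch not in allowed_set)

# German text with a run of bungled Cyrillic in it; umlauts and sharp s are expected.
suspects = find_suspect_chars("Grüße Ïðèâåò", allowed="äöüÄÖÜß")
```

A high count of unexpected accented letters is only a hint, not proof, since legitimate foreign names or quotations trigger it too.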
Regards,
Klaus
"Bob Eaton" wrote:
> Is it possible to detect legacy (non-Unicode) encodings inside Word
> and/or text documents?
>
> In Word, I think the answer must be no: once some text is in Word--even if
> it uses a legacy font and has glyphs in the upper ascii region
> (128-255)--Word will have turned those into what it thinks is UTF-16 (even
> though it isn't legitimate UTF-16). I'm saying this based on my experience
> of sending such text thru COM and even the legacy data comes thru as wide
> BSTR data that is indistinguishable from UTF-16 (other than the semantics
> of the characters).
>
> Since I can assume that the text files I have to deal with are either
> UTF-8 (if Unicode-encoded) or narrow bytes of legacy data, I'm thinking
> that I could open the file in narrow mode (i.e. 8-bit bytes) and then try
> to do the UTF-8 to UTF-16 conversion. If that fails somehow, because of
> invalid UTF-8 sequences, then I think I can detect a non-Unicode encoding.
>
> Can anyone think of a (better) way for either scenario?
>
> Thanks,
> Bob
>
>
date: Sun, 11 Mar 2007 07:31:30 +0100
author: Klaus Linke
Re: Non-Unicode text in Word/txt files
Thanks Klaus,
I'm dealing with situation (b): non-standard fonts using the upper ascii
region for (typically) non-roman glyphs. In fact, they may be using the
lower region (0-127) for non-ascii glyphs as well... But I'm sure there's
no way to tell those characters apart from ascii without visually
inspecting the glyph shapes.
The more I think about this, the more I doubt there's a way except for text
files. That is, a text file that is a narrow stream of bytes can only be
Unicode-encoded if it's UTF-8, and a byte in the range 128-255 is never
legitimate UTF-8 by itself. A legitimate Unicode character at or above
code point 128 is represented in (at least) 2 bytes, whereas if I'm dealing
with a legacy-encoded file, I might see such upper ascii bytes by
themselves. So, I think there's a way to detect this non-standard encoding
by inspecting narrow byte stream text files... But in Word, I think
everything is internally 'wide' (as a result of the code page conversion
you mentioned) and therefore there's no way to tell whether it's legitimate
Unicode or not without actually inspecting the glyphs.
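The text-file check described above can be sketched in Python as a strict UTF-8 decode attempt. The function name and the three labels are assumptions for illustration, and the sketch relies on the stated premise that the files are either UTF-8 or narrow legacy data.

```python
def classify_bytes(raw: bytes) -> str:
    """Classify a narrow byte stream as 'ascii', 'utf-8', or 'legacy'."""
    if all(b < 0x80 for b in raw):
        # Pure 7-bit data is valid UTF-8 and also valid in most legacy code
        # pages, so the bytes alone cannot tell the two apart.
        return "ascii"
    try:
        raw.decode("utf-8")  # strict by default: raises on any malformed sequence
    except UnicodeDecodeError:
        return "legacy"
    return "utf-8"

as_utf8 = classify_bytes("é".encode("utf-8"))     # bytes C3 A9
as_legacy = classify_bytes("é".encode("cp1252"))  # lone byte E9
```

Two caveats: a legacy file can happen to form valid UTF-8 sequences (unlikely in longer texts, but possible), and a legacy font that remaps only the 0-127 range is reported as "ascii".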
Anyway, thanks again,
Bob
date: Sun, 11 Mar 2007 15:23:31 +0530
author: Bob Eaton
Re: Non-Unicode text in Word/txt files
Which fonts are those?
Klaus
date: Sun, 11 Mar 2007 17:32:22 +0100
author: Klaus Linke
Re: Non-Unicode text in Word/txt files
Hi Bob,
Sorry if I sounded a bit exasperated, but it would have been easier for me
to make suggestions if I knew which concrete font(s) you are trying to deal
with.
Symbol fonts are easy to find; you can do a wildcard search:
    .Text = "[" & ChrW(&HF000) & "-" & ChrW(&HF0FF) & "]"
    .MatchWildcards = True
because Word uses that code block (in the Unicode corporate private use
area) for them.
If the old font isn't recognizable to Word as a symbol font, there isn't
much you can do, though, I'm afraid.
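For text that has already been pulled out of Word as a string (for example through COM), the same range test can be sketched in Python. The function name is hypothetical, and reading the low byte as the character code in the original font is an assumption about how these characters were mapped, not something stated in this thread.

```python
import re

# U+F000..U+F0FF: the range named in the wildcard search above.
SYMBOL_RANGE = re.compile("[\uf000-\uf0ff]")

def find_symbol_chars(text: str) -> list:
    """Return (offset, low_byte) for each character in U+F000..U+F0FF."""
    # Assumption: subtracting 0xF000 gives the byte code in the symbol font.
    return [(m.start(), ord(m.group()) - 0xF000) for m in SYMBOL_RANGE.finditer(text)]

hits = find_symbol_chars("ab\uf0b7c")
```

An empty result only means Word did not treat the font as a symbol font; it says nothing about other legacy fonts.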
As to your question about listing the fonts that are used in a document, I
don't have a good answer.
Probably the best and easiest way is to make a copy of the document.
Then look at the font of the first character, store the name, and delete all
characters in that font with one Find/Replace.
Wash, rinse, and repeat until nothing's left.
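The loop itself, independent of Word, can be sketched in Python. Here the document copy is modeled as a list of (character, font name) pairs, which is purely a stand-in for illustration; in Word each "delete" step would be one Find/Replace on the font.

```python
def fonts_in_use(chars_with_fonts):
    """List distinct font names in order of first appearance.

    chars_with_fonts: iterable of (character, font_name) pairs, a
    hypothetical stand-in for a copy of the document.
    """
    remaining = list(chars_with_fonts)
    fonts = []
    while remaining:                # repeat until nothing is left
        font = remaining[0][1]      # font of the first remaining character
        fonts.append(font)          # store the name
        # "Delete" every character in that font, like one Find/Replace pass.
        remaining = [pair for pair in remaining if pair[1] != font]
    return fonts

doc = [("a", "Times New Roman"), ("b", "Symbol"), ("c", "Times New Roman"), ("d", "Arial")]
used = fonts_in_use(doc)
```

The number of passes equals the number of distinct fonts, not the number of characters, which is what makes the approach practical on long documents.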
Regards,
Klaus
"Bob Eaton" wrote:
> Yes, I'm developing tools for doing encoding conversion
> (http://scripts.sil.org/EncCnvtrs) that is being embedded in different
> client applications.
>
> Someone recently asked if there was a way to detect the presence of
> non-Unicode data in Word and text documents so that they could know it
> needed to be converted.
>
> I suppose for our internal use, the number of legacy fonts we have to deal
> with is finite and I could just search for those explicitly... but I just
> prefer generic solutions...
>
> By the way, do you know if there's a way to enumerate the fonts that are
> actually used in a document? (e.g. in the Word PIA interface or VBA)?
> Though, I suppose this is a question for another group.
>
> Thanks,
> Bob
>
>
> "Klaus Linke" wrote in message
> news:O7eotJNZHHA.5044@TK2MSFTNGP05.phx.gbl...
>> I don't think much can be done without knowing which fonts you have a
>> problem with.
>> And BTW, you haven't really stated *what* problem(s) you have.
>> I assume you have garbage in the files because you no longer have the old
>> fonts installed? Or you want characters or symbols converted to Unicode?
>> But that's just guessing.
>>
>> Regards,
>> Klaus
>>
>>
>> "Bob Eaton" wrote:
>>> That's the problem. We're hoping to be able to detect this generically
>>> for any legacy (non-Unicode) encoded fonts, so it isn't any particular
>>> one.
>>>
>>> Bob
>>>
>>>
>>> "Klaus Linke" wrote in message
>>> news:udIIar$YHHA.3628@TK2MSFTNGP02.phx.gbl...
>>>> Which fonts are those?
>>>>
>>>> Klaus
>>>
>>>
>>
>>
>
>
>
date: Tue, 13 Mar 2007 05:02:46 +0100
author: Klaus Linke