Discussion:
Problem displaying Unicode characters in CMD
JJ
2017-08-03 18:21:56 UTC
See below screenshot.

<http://i.imgur.com/aY3JAqX.jpg>

My OS is Windows 7 with English language, BTW.

FYI, that CMD session was already started using the /U switch, and it's
already using a TrueType font (Lucida Console). The Consolas font has the
same problem too. My system already has the required fonts for displaying
most Unicode characters (especially CJK), as shown by Windows Explorer in
the screenshot.

There are claims that I have to set the active code page for that CMD
session to UTF-8 (65001) via the CHCP command, but even that didn't help. I
also tried UTF-16 (1200) code page since it's the closest thing to the OS
native UCS-2, but CMD says it's an invalid code page. My system code page is
set to English, BTW. The system code page must not be changed for my system.

With the CMD application, I have no problem working with Unicode characters
as data. I only have a problem displaying them.

Can anyone help?

PS)
- This is a CMD application problem, not the console window itself.
- Using an application other than CMD is not applicable, unless CMD simply
can't display Unicode characters.
Paul
2017-08-03 20:04:25 UTC
Post by JJ
See below screenshot.
<http://i.imgur.com/aY3JAqX.jpg>
.....
One of the answers here adds an extra entry to the Registry so you can
have another font choice. Maybe the characters you need would be in
there?

https://stackoverflow.com/questions/388490/unicode-characters-in-windows-command-line-how
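
For reference, the tweak from that answer is a Registry file along these
lines (a sketch only; the font name here is just an example, and the value
name must have one more zero than the last existing entry under that key):

```reg
Windows Registry Editor Version 5.00

; "0" is normally "Lucida Console"; each additional console font
; gets a value name with one more zero ("00", "000", ...).
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Console\TrueTypeFont]
"00"="DejaVu Sans Mono"
```

The font still has to pass the console's own (undocumented) suitability
checks before it shows up in the Properties dialog.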

CMD.exe seems to be able to pass the characters along (from a shell
perspective), but there are no real guarantees about what shows in the
display itself. Which is a disaster. What good is an interactive shell
that is not interactive?

Paul
JJ
2017-08-04 12:44:26 UTC
Post by Paul
One of the answers here, adds an additional entry to the Registry,
so you can have another font choice. Maybe the characters you need
would be in there ?
https://stackoverflow.com/questions/388490/unicode-characters-in-windows-command-line-how
Yes, I've just tried that. It seems that the console's settings dialog only
accepts monospace fonts which also meet some other unknown criteria.

Not all monospace fonts are accepted, e.g. "Bitstream Vera Sans Mono",
"DejaVu Vera Sans Mono", "saxMono". Some are displayed in the list but the
console won't use them, and some aren't even displayed in the list. I did
succeed in adding and using some monospace fonts, e.g. "Andale Mono", but
none of them have any CJK Unicode block.

AFAIK, the "MS Gothic" font is a monospace font designed for the Japanese
language, and it does have a CJK Unicode block (IIRC, it's the default GUI
font in CJK versions of Windows 95), but the console's settings dialog won't
accept that font (it won't display it in the list). So far, I haven't found
any monospace CJK-compatible font which is accepted by the console's
settings dialog.
Post by Paul
CMD.exe seems to be able to pass the characters (from a shell perspective),
but there are no real guarantees on what shows in the display itself.
Which is a disaster. What good is an interactive shell,
which is not interactive ?
I've read in a discussion on the net that CMD doesn't respect the code page
setting when displaying file names on the screen. It only works properly
when the output is redirected to a file. It's as if it only uses the system
code page, which is a global setting.
Mayayana
2017-08-03 23:13:05 UTC
"JJ" <***@vfemail.net> wrote

| - This is a CMD application problem. Not the console window itself.

I don't generally use console windows, but I assume
you can only choose one font. In that case, Lucida
is showing you what it's got, which doesn't include
Chinese characters.
JJ
2017-08-04 12:44:25 UTC
Post by Mayayana
I don't generally use console windows, but I assume
you can only choose one font. In that case, Lucida
is showing you what it's got, which doesn't include
Chinese characters.
There are 3 fonts to choose from in my system: "Consolas", "Lucida Console",
and "Raster Fonts". The first two are TrueType fonts.

You're right. The "Lucida Console" font does not have a Unicode block for
CJK characters. However, I use the "Microsoft Sans Serif" font for the
default Windows GUI via Windows Classic theme. "Microsoft Sans Serif" font
does not have a Unicode block for CJK characters either. Yet, Windows
Explorer can display the CJK characters correctly.

It's similar to using the "Lucida Console" font (or any other
TrueType/OpenType font) in Notepad. If you copy any CJK character from e.g.
Character Map, Notepad can display the characters correctly. This is
possible because the system borrows character glyphs from other fonts which
have them. CMD, however, behaves differently.
Mayayana
2017-08-04 13:40:03 UTC
"JJ" <***@vfemail.net> wrote

| You're right. The "Lucida Console" font does not have a Unicode block for
| CJK characters. However, I use the "Microsoft Sans Serif" font for the
| default Windows GUI via Windows Classic theme. "Microsoft Sans Serif" font
| does not have a Unicode block for CJK characters either. Yet, Windows
| Explorer can display the CJK characters correctly.
|
| It's similar like using "Lucida Console" font (or any other
| TrueType/OpenType font) in Notepad. If you copy any CJK character from
e.g.
| Character Map, Notepad can display the characters correctly. This is
| possible because the system borrows character glyphs from other font which
| have them. CMD however, behave differently.

I just tested Lucida in my console window on XP.
I get a rectangle for a Chinese character. Ditto with
Notepad, which I keep set to Verdana. Windows
Explorer is probably more sophisticated. Likewise with
browsers. For instance, I keep a webpage for reference
that I created with the full unicode set, showing
each as:

decimal value character UTF-8 byte values

I set the font as verdana in CSS, but foreign characters
still show up. Presumably the browser knows to pick a
font that suits. I know that Firefox has settings in
about:config for that. So if I use something like
&#x6074; to show the unicode Chinese character 24692
(6074 is the hexadecimal version) then the browser knows
to deal with that. I suspect those fonts may be built in.

But browsers are designed to show anything graphical.
Plain text windows are usually designed to show only
one font. I'm surprised your Notepad shows the characters.
Maybe MS made it more sophisticated in Vista/7 and it's
no longer a plain Win32 text window.

Also note with respect to Mike S's post: Local codepage
has nothing to do with unicode characters. It started out
as ASCII, using one byte. In 7-bit ASCII, 0-127 are basic
English characters. With the need to support foreign
languages, ANSI was developed. Still one byte per character.
0-127 are still the same. 128-255 are displayed depending
on the local codepage. In English, #149 is a bullet. In
Russian it's probably a Russian character. In Turkish,
Turkish. Etc. The codepage setting decides that. You
can set your system to function as Russian, Turkish, etc.

That solved the problem except for Korean, Chinese, and
Japanese, which use a multibyte character set to deal with
the limitations of ANSI. Most characters are still one byte,
but some byte values are lead bytes which signal that the
next byte belongs to the same character. So 65 is "A", for
instance, but 120 65 might be the character for "tree" using
the Japanese codepage. (Just an example. I don't know the
signifier numbers offhand. Nor do I know Japanese. :)

That's all in the world of one-byte encoding (which
confusingly includes multi-byte Asian characters).

Unicode, as Windows uses it, is two-byte encoding. All characters needed
have a number of their own. So Russian characters
might be, say, 340-420. Chinese characters seem to
be up in the mid-20,000s to 30,000s. It's an entirely
different approach. 0-127 are still the same as ASCII,
but the bytes for "ab" in ASCII or ANSI are 97-98.
In unicode they're 0-97-0-98. Always 2 bytes.

That created a problem. The computing world was
based on 1 byte = 1 character. Even multibyte encoding
reads one byte at a time. It's made up of numbers
from 0-255. Unicode is made up of numbers from 0
to 65535, using 2 bytes for each number. Completely
different encoding.
Unicode has been around for many years, but it
requires different treatment. Different programming
APIs. Webpages are written in ANSI. JPG EXIF tags
are in ANSI. Etc. Unicode is also superfluous to those
of us in N. America and Europe. So it's been slow to
be adopted.
To make the transition smoother, UTF-8 was
created. UTF-8 is similar to the multibyte Asian
encoding. It renders the unicode characters using
prepended flag bytes. So text can still be parsed
one byte at a time. Webpages can be ANSI or UTF-8
without changing the basic file structure. There
are no pesky null characters to screw things up.
All that's needed is for the browser to know which
way to parse. And of course, it still doesn't matter
much in the West. So everyone's happy. Since UTF-8
does actually function as unicode, codepages are
not used.
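
The byte patterns described above can be sketched in a few lines of Python
(an illustration of the encodings themselves, not anything CMD-specific):

```python
# ASCII/ANSI: one byte per character, "ab" is 97 98.
assert list("ab".encode("ascii")) == [97, 98]

# "Unicode" in the old Windows sense (UCS-2/UTF-16): always two bytes
# per character, so "ab" becomes 0-97-0-98 in big-endian byte order.
assert list("ab".encode("utf-16-be")) == [0, 97, 0, 98]

# UTF-8: ASCII stays one byte per character, so text can still be
# parsed byte by byte with no null bytes...
assert "ab".encode("utf-8") == b"ab"

# ...while other characters get prepended flag bytes; the Euro sign
# becomes a three-byte sequence.
assert list("€".encode("utf-8")) == [0xE2, 0x82, 0xAC]
```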

Your console window probably deals in unicode.
But fonts deal in characters. So if the window can
only render one font at a time then it won't be
able to render anything not drawn in Lucida.

That may be more than anyone cares to know. :)
But I figure it's worth explaining because the whole
thing can get very confusing and there's a lot of
misinformation about what's what when it comes to
character encoding.
JJ
2017-08-05 16:59:55 UTC
Post by Mayayana
I just tested Lucida in my console window on XP.
I get a rectangle for a Chinese character.
.....
Your console window probably deals in unicode.
But fonts deal in characters. So if the window can
only render one font at a time then it won't be
able to render anything not drawn in Lucida.
.....
Well, the code page should be irrelevant assuming that the font actually has
the required Unicode block, but apparently it isn't.

To add more confusion, here's what happens when the system code page is set
to Japanese.

<http://i.imgur.com/mHfuaSW.jpg>

And strangely, you'll notice that the Font Preview window shows that the "MS
Gothic" font name is not "MS Gothic" but "MS ゴシック" when the system code
page is set to other than Japanese (or probably other than CJK).
JJ
2017-08-05 17:03:19 UTC
Post by JJ
Well, the code page should be irrelevant assuming that the font actually
has the required Unicode block, but apparently it isn't.
.....
You probably already know that the Japanese code page uses the Yen currency
symbol in place of the backslash. This is the main reason I don't want to
change my system locale to Japanese. Otherwise, I would already be using it.
Mayayana
2017-08-05 20:16:24 UTC
"JJ" <***@vfemail.net> wrote

| Well, the code page should be irrelevant assuming that the font actually
| has the required Unicode block, but apparently it isn't.
|
No, it's two different things. The codepage is used to
parse ANSI/DBCS. Unicode is 2-byte encoding and includes
unique numeric values for all characters. That's what I was
trying to clarify. Codepage is used only for ANSI/DBCS. It's
not relevant with unicode because every character is
assigned a unique numeric value, while the purpose of a
codepage is to squeeze all languages into a possible 256
values in a byte. It does that by reusing bytes 128-255
depending on the language.

A font does not have a "unicode block". It only has characters.
Fonts and encoding are different things.

It gets complicated because DBCS languages (Chinese,
Japanese, Korean), have to use multiple bytes for single
characters in their non-unicode encoding, while all other
languages use one byte. If you just look at Western
languages it's easier to see. A text file with a single byte
128 (H80) is a Euro sign when using the English codepage.
In the Russian codepage it looks like a capital A. That's
how you'd see it in Notepad on an English or Russian
computer. The unicode value for a Euro sign is 8364,
or hex 20AC. H20AC would show in a hex editor as AC 20.
The English ANSI codepage would render that as an angled
dash followed by a space. The Russian codepage would
render it as something like a capital M followed by a space.
But if Notepad knows it's unicode then both computers
would render a Euro sign. Thus, no codepages for unicode.
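
That Euro-sign walkthrough can be checked in a few lines of Python (with
cp1252 standing in for the English ANSI codepage):

```python
# The unicode value for the Euro sign is 8364, or hex 20AC.
assert ord("€") == 8364 == 0x20AC

# In the English ANSI codepage it is the single byte H80.
assert "€".encode("cp1252") == b"\x80"

# As little-endian UTF-16 (Windows "unicode") it is stored as AC 20,
# which is what a hex editor would show.
assert "€".encode("utf-16-le") == b"\xac\x20"

# Feed those same two bytes to the English ANSI codepage and you get
# the "angled dash followed by a space" described above.
assert b"\xac\x20".decode("cp1252") == "¬ "
```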

| And strangely, you'll notice that the Font Preview window shows that the
| "MS Gothic" font name is not "MS Gothic" but "MS ゴシック" when the system
| code page is set to other than Japanese (or probably other than CJK).

Interesting. Maybe that's coming across in the dropdown text
window as unicode but being interpreted as DBCS.

So what you need seems to be a monospaced unicode
font that includes Japanese characters; then use Paul's
trick to get at it in the console window. *If* your console
window can really display unicode. There's a list here:

https://en.wikipedia.org/wiki/Unicode_font

A few are monospaced, but the selection seems to
be very limited. Arial Unicode MS has almost 40,000
characters, but many of the fonts only have 6,000 or
so. What you need is monospace unicode with
Japanese characters. Do any include Japanese? I don't
know. Maybe some Japanese company has specifically
made such a thing.

If you change the codepage you run into all sorts
of complications, as you've seen. Any byte above
127 will render corrupt, and other oddities like the funky
font dropdown selector can happen. With Japanese it will
probably be worse because it's a DBCS language rather
than just ANSI. With DBCS a byte above 127 will
be a flag indicating how to interpret the following byte.
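
The lead-byte behaviour can be seen with Python's Shift-JIS codec (one of
the Japanese DBCS encodings; using the "tree" character from the earlier
example, though the actual byte values are just whatever that codec emits):

```python
# Plain ASCII stays one byte per character, even in a DBCS encoding.
assert "A".encode("shift_jis") == b"A"

# The character for "tree" (木) takes two bytes, and the first byte is
# above 127, flagging that the following byte belongs to the same
# character rather than standing on its own.
b = "木".encode("shift_jis")
assert len(b) == 2 and b[0] > 127
```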
JJ
2017-08-06 15:33:07 UTC
Post by Mayayana
.....
A font does not have a "unicode block". It only has characters.
Fonts and encoding are different things.
.....
Maybe I should have mentioned the "Unicode block" as "Unicode subrange".
Sorry for the confusion.
Post by Mayayana
Interesting. Maybe that's coming across in dropdown text
window as unicode but being interpreted as DBCS.
That's impossible. The "Gothic" text can't possibly be "ゴシック" regardless of
what it was originally encoded with.

Did you actually see the katakana characters in the news message from your
news client? That (and this) message was encoded using Big5, BTW.
Post by Mayayana
So what you need seems to be a monospaced, unicode
font, that includes Japanese characters, then use Paul's
trick to get at it in the console window. *If* your console
https://en.wikipedia.org/wiki/Unicode_font
A few are monospaced, but the selection seems to
be very limited. Arial Unicode MS has almost 40,000
characters, but many of the fonts only have 6,000 or
so. What you need is monospace unicode with
Japanese characters. Do any include Japanese? I don't
know. Maybe some Japanese company has specifically
made such a thing.
None of the mentioned fonts is accepted by the console, unfortunately.
Paul
2017-08-06 15:45:29 UTC
Post by JJ
None of the mentioned fonts is accepted by the console, unfortunately.
There is Courier New, but the file isn't big enough.

How many versions of the Courier New font are there ?

There is Droid Sans Mono, but it's a smaller font file
than Courier New.

And the guys here provide some numbers, for just how
dire the situation is.

https://graphicdesign.stackexchange.com/questions/5697/courier-new-like-font-with-unicode-support

Paul
JJ
2017-08-07 18:25:25 UTC
Post by Paul
There is Courier New, but the file isn't big enough.
How many versions of the Courier New font are there ?
There is Droid Sans Mono, but it's a smaller font file
than Courier New.
And the guys here provide some numbers, for just how
dire the situation is.
https://graphicdesign.stackexchange.com/questions/5697/courier-new-like-font-with-unicode-support
In my collection, there are:
- Courier (Raster, TrueType, PostScript)
- Courier New KOI-8 (PostScript; KOI-8 character set)
- Courier Std (OpenType)
- Courier10 BT (TrueType, PostScript)
- CourierMCY (TrueType, PostScript)

AFAIK, all Courier fonts are monospaced, but I haven't seen any that have an
adequate Unicode subrange (which includes CJK).

FYI, Windows' built in PostScript fonts support can only handle ANSI/OEM
character set.

I have a font information tool I wrote years ago. Here is a list of the
Unicode subranges some of the mentioned fonts have.

Courier New:
<https://pastebin.com/6GqRtHK7>

Droid Sans Mono:
<https://pastebin.com/aP52cu6x>

FreeMono:
<https://pastebin.com/4prhSNsZ>

GNU UniFont: (mentioned by Mayayana)
<https://pastebin.com/3V2XMiyQ>

I have most of the fonts that have a CJK Unicode subrange from many sources.
I even have the excellent "Osaka" TrueType font from Mac OS X, converted to
a Windows version (Mac TTF files are not binary compatible with Windows
because they use big endian format). Yet, none of the CJK fonts in my
collection is accepted by the console's settings dialog if I don't set the
system locale to CJK.

I don't think this problem has any solution.
So, thanks for your time.
Mayayana
2017-08-06 17:06:00 UTC
"JJ" <***@vfemail.net> wrote

.....

Following Paul's link, I found this:

http://unifoundry.com/unifont.html (a font)

<https://upload.wikimedia.org/wikipedia/commons/e/e3/Unifont-6.3.20131006.png>
(a picture of the characters in that font)

I don't know if windows will load it.
There are also interesting notes on console windows
here:

https://stackoverflow.com/questions/388490/unicode-characters-in-windows-command-line-how

With the interesting idea of setting the display to UTF-8.

There's also this, from Michael Kaplan, who is, or
at least was, pretty much the language programming
expert at MS:

http://archives.miloush.net/michkap/archive/2008/03/18/8306597.html

He shows how to programmatically jump through
hoops to show rectangles in the console window that
function as the characters they're supposed to be.
Whoopee. It doesn't sound promising. But maybe you'll
be the first. :)

| > Interesting. Maybe that's coming across in dropdown text
| > window as unicode but being interpreted as DBCS.
|
| That's impossible. The "Gothic" text can't possibly be "ゴシック" regardless
| of what it was originally encoded with.
|

No, I wouldn't think so. But some kind of
fluke in the dropdown window is the only
explanation I can think of.

| Did you actually see the katakana characters in the news message from your
| news client? That (and this) message was encoded using Big5, BTW.
|

I see them in the window. If I look at the message source
it shows with the English code page, as a line of oddball
characters. If I save the post and open it in Notepad I see
rectangles. If I then paste that into an ANSI text window
as part of a webpage I get ??????... But if I replace those
with the rectangles from Notepad and save it as UTF-8,
IE will show the characters.
So... yes and no. :)
JJ
2017-08-07 18:25:23 UTC
Post by Mayayana
http://unifoundry.com/unifont.html (a font)
https://upload.wikimedia.org/wikipedia/commons/e/e3/Unifont-6.3.20131006.png
(a picture of the characters in that font)
I don't know if windows will load it.
It won't, unfortunately.
Post by Mayayana
There are also interesting notes on console windows
https://stackoverflow.com/questions/388490/unicode-characters-in-windows-command-line-how
With the interesting idea of setting the display to UTF-8.
Well, that SO question is about working with Unicode as data, not as
display. I don't have any problem with that either.
Post by Mayayana
There's also this, from Michael Kaplan, who is, or
at least was, pretty much the language programming
http://archives.miloush.net/michkap/archive/2008/03/18/8306597.html
He shows how to programmatically jump through
hoops to show rectangles in the console window that
function as the characters they're supposed to be.
Whoopee. It doesn't sound promising. But maybe you'll
be the first. :)
That actually shows the problem. The Windows console's design, in terms of
displaying characters, is not natively UCS-2/UTF-16. It's more like native
ANSI/OEM.
Post by Mayayana
No, I wouldn't think so. But some kind of
fluke in the dropdown window is the only
explanation I can think of.
FYI, most cross-platform applications use their own font rendering engine.
They don't rely on Windows' built-in font rendering engine. Moreover,
Thunderbird, Firefox, and other Gecko-based applications use the Gecko
browser engine for their main application GUI (as a GUI framework).
Post by Mayayana
I see them in the window. If I look at the message source
it shows with the English code page, as a line of oddball
characters.
That would be the Big5 encoded text shown using ANSI character set.
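
A Python sketch of that effect (an illustration only; 中 is just an example
CJK character): Big5 bytes round-trip through the Big5 codec, but decoded
with the Western ANSI codepage (cp1252) they become unrelated characters:

```python
text = "中"                 # an example CJK character
raw = text.encode("big5")   # its Big5 byte sequence

# Decoded as Big5, the bytes come back intact...
assert raw.decode("big5") == text

# ...but decoded as English ANSI (cp1252), each byte shows up as a
# separate Western character: the "line of oddball characters".
garbled = raw.decode("cp1252")
assert garbled != text and len(garbled) == len(raw)
```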
Post by Mayayana
If I save the post and open it in Notepad I see rectangles.
That's when the font used for Notepad doesn't have the glyphs for those
characters.
Post by Mayayana
If I then paste that into an ANSI text window
as part of a webpage I get ??????...
What application is that? The ANSI character set is roughly the same as the
code page. If the system code page is not CJK, Windows won't show the
correct characters, even assuming that the font used for display has the
glyphs for those characters.
Post by Mayayana
But if I replace those
with the rectangles from Notepad and save it as UTF-8,
IE will show the characters.
So... yes and no. :)
Well, IE has better internationalization support. Much better than the
console, apparently.

And if you take a look at the screenshot again, you'll notice that the
console removes both the "Courier New" and "Lucida Console" fonts from the
list when the system locale is set to CJK. So, it seems that the Windows
console design (in terms of display) is bound to the system code page. I
think that's the main problem.

OK, I do believe now that there's no solution for this.
Thanks for your support.
Mayayana
2017-08-07 22:23:51 UTC
"JJ" <***@vfemail.net> wrote

| > If I then paste that into an ANSI text window
| > as part of a webpage I get ??????...
|
| What application is that?

That's actually my own code editor. I made it with a
RichEdit window and included a toggle option for
ANSI or UTF-8. When set to ANSI I get ?s. When set
to UTF-8 I get rectangles. There seems to be some
kind of "sniffing" built in. In ANSI I should get ANSI
characters, but Windows apparently picks up that it's
UTF-8 and just doesn't try to render it. Yet if I load
a UTF-8 webpage I don't get ?s for single UTF-8
characters. I get characters above 128 in English ANSI.

This has been an interesting exploration. The encoding
options are so complicated. But I guess it makes sense
that the console window would be ANSI. Most programming
is English. I imagine CD or DEL don't change. So the only
reason to support other languages would be for local
differences in file/folder names.
JJ
2017-08-09 15:24:00 UTC
Post by Mayayana
That's actually my own code editor. I made it with a
RichEdit window and included a toggle option for
ANSI or UTF-8. When set to ANSI I get ?s. When set
to UTF-8 I get rectangles. There's seems to be some
kind of "sniffing" built in. In ANSI I should get ANSI
characters, but Windows apparently picks up that it's
UTF-8 and just doesn't try to render it. Yet if I load
a UTF-8 webpage I don't get ?s for single UTF-8
characters. I get characters above 128 in English ANSI.
The Windows RichEdit control is Unicode-aware, even if the host application
uses an ANSI GUI, e.g. WordPad in Windows 9x.

You can test it by using RichEdit's built-in ALT+X shortcut when your
application is set to ANSI mode. Press the shortcut when the input cursor is
placed after a character. Try that with two different characters where both
show as "?" or square characters.
Mayayana
2017-08-09 18:53:27 UTC
"JJ" <***@vfemail.net> wrote

|
| The Windows' RichEdit control is Unicode aware, even if the host
| application uses an ANSI GUI. e.g. Wordpad in Windows 9x.

Interesting. I just pasted your file name into
Wordpad and it got sniffed out as Japanese,
then rendered in a Japanese font that I didn't
know I had.

But my program is an editor for HTML and script.
I want it to be locked into either ANSI or UTF-8,
so I have a menu toggle, which changes the 3rd
parameter when I send an EM_STREAMIN message
to load a file.

More accurately, I want ANSI, but sometimes there
are UTF-8 webpages that are loaded and I want
to be able to handle those. It's kind of a shame,
really. English webpages don't need to be UTF-8.
ASCII bytes are valid UTF-8.
often use things like curly quotes in UTF-8 which
then corrupt the text if they're rendered as ANSI.
They're using just enough to create a problem for
ANSI rendering.

Mike S
2017-08-03 23:16:14 UTC
Post by JJ
.....
What happens when you try this?

Yeah, I've just resolved my problem. It was a fault of the default font in
cmd.exe, which can't handle Unicode characters. To fix it (Windows 7 x64 Pro):

Open/run cmd.exe
Click on the icon at the top-left corner
Select properties
Then "Font" bar
Select "Lucida Console" and OK.
Write Chcp 10000 at the prompt
Finally dir /b

Enjoy your clean UTF-16 output with hearts, Chinese signs, and much more!

https://stackoverflow.com/questions/10764920/utf-16-on-cmd-exe
Mike S
2017-08-03 23:19:15 UTC
Reply
Permalink
Raw Message
Post by Mike S
Post by JJ
See below screenshot.
<http://i.imgur.com/aY3JAqX.jpg>
My OS is Windows 7 with English language, BTW.
FYI, that CMD session was already started using the /U switch. And it's
already using a TrueType font (Lucida Console). The other Consolas font have
the same problem too. My system already have the required font for
displaying most Unicode characters (especially CJK) - as shown by the
Windows Explorer in the screenshot.
There are claims that I have to set the active code page for that CMD
session to UTF-8 (65001) via the CHCP command, but even that didn't help. I
also tried UTF-16 (1200) code page since it's the closest thing to the OS
native UCS-2, but CMD says it's an invalid code page. My system code page is
set to English, BTW. The system code page must not be changed for my system.
With CMD application, I have no problem working with Unicode
characters as
data. I only have problem displaying them.
Anyone can help?
PS)
- This is a CMD application problem. Not the console window itself.
- Using an application other than CMD is not applicable, unless CMD can't
display Unicode characters.
What happens when you try this?
Yeah,I've just resolved my problem. It was a fault of default font in
Open/run cmd.exe
Click on the icon at the top-left corner
Select properties
Then "Font" bar
Select "Lucida Console" and OK.
Write Chcp 10000 at the prompt
Finally dir /b
Enjoy your clean UTF-16 output with hearts, Chinese signs, and much more!
https://stackoverflow.com/questions/10764920/utf-16-on-cmd-exe
Sorry, forgot to add this

Chcp

Displays the number of the active console code page, or changes the
console's active console code page. Used without parameters, chcp
displays the number of the active console code page.
Syntax

chcp [nnn]

Code page   Country/region or language

437 United States
850 Multilingual (Latin I)
852 Slavic (Latin II)
855 Cyrillic (Russian)
857 Turkish
860 Portuguese
861 Icelandic
863 Canadian-French
865 Nordic
866 Russian
869 Modern Greek

https://technet.microsoft.com/en-us/library/bb490874.aspx
JJ
2017-08-04 12:44:26 UTC
Reply
Permalink
Raw Message
Post by Mike S
What happens when you try this?
Yeah,I've just resolved my problem. It was a fault of default font in
Open/run cmd.exe
Click on the icon at the top-left corner
Select properties
Then "Font" bar
Select "Lucida Console" and OK.
Write Chcp 10000 at the prompt
Finally dir /b
Enjoy your clean UTF-16 output with hearts, Chinese signs, and much more!
https://stackoverflow.com/questions/10764920/utf-16-on-cmd-exe
Unfortunately, it has no effect. The console font is already set to Lucida
Console. Setting the code page to 10000 (which is the Mac version of the
Western code page) gives no error, but the DIR command still shows the same
thing.
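One reason chcp 10000 can't be the whole answer: code page 10000 (Mac Roman) is a single-byte Western code page with no slots for CJK characters at all. A Python sketch, using Python's mac_roman codec as a stand-in for the console code page:

```python
# Code page 10000 (Mac Roman) is a single-byte Western code page, so
# Japanese characters have no representation in it in either direction.
try:
    "ソーラン".encode("mac_roman")
    representable = True
except UnicodeEncodeError:
    representable = False
print(representable)  # False: cp10000 simply has no mapping for CJK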

That SO answer may be a solution, but I think it's missing something else.

Did you test that on your own system with an actual Unicode file name? If
not, try creating a dummy file and renaming it to the name below. It's the
exact same file name as the one on my system.

ソーラン渡り鳥 (島津亜矢 + 田川寿美).aac

Note: the above text is encoded in UTF-8.
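For what it's worth, every character in that name sits in the Basic Multilingual Plane, which matches the "closest to the OS native UCS-2" point above: UTF-16 needs no surrogate pairs for it. A Python sketch counting the encoded sizes:

```python
name = "ソーラン渡り鳥 (島津亜矢 + 田川寿美).aac"

# All characters are in the BMP, so UTF-16 encodes the name without
# surrogate pairs: exactly 2 bytes per character.
assert all(ord(c) <= 0xFFFF for c in name)
print(len(name))                      # number of characters
print(len(name.encode("utf-16-le"))) # 2 * len(name): no surrogates needed
print(len(name.encode("utf-8")))     # larger: the CJK chars cost 3 bytes each
```

This confirms the thread's point that the data itself is unproblematic; only the console's rendering of it fails.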
Paul
2017-08-04 16:15:24 UTC
Reply
Permalink
Raw Message
Post by JJ
Post by Mike S
What happens when you try this?
Yeah,I've just resolved my problem. It was a fault of default font in
Open/run cmd.exe
Click on the icon at the top-left corner
Select properties
Then "Font" bar
Select "Lucida Console" and OK.
Write Chcp 10000 at the prompt
Finally dir /b
Enjoy your clean UTF-16 output with hearts, Chinese signs, and much more!
https://stackoverflow.com/questions/10764920/utf-16-on-cmd-exe
Unfortunately, it has no effect. The console font is already set to Lucida
Console. Setting the code page to 10000 (which is Mac version of Western
code page) gives no error, but the DIR command still show the same thing.
That SO answer may be a solution, but I think it's missing something else.
Did you test that on your own system with an actual Unicode file name? If
not, try creating a dummy file and rename it to below. It's the exact same
file name as the one in my system.
ソーラン渡り鳥 (島津亜矢 + 田川寿美).aac
Note: the above text is encoded in UTF-8.
I managed to modify my system enough so that Thunderbird
shows characters instead of boxes. But since the font
used (JhengHei Regular) isn't a monospaced font, there's
no way that cmd.exe is going to use a font like that. Even
with the registry hack, it will be excluded from the font menu.

https://s2.postimg.org/hax9prms9/no_squares.gif

This is the font I used. There's apparently more than one
font for the job, and the characters are different in them.
So only a native speaker/writer could possibly know whether
that's an appropriate representation.

http://www.microsoft.com/en-us/download/details.aspx?id=12072

msjh.ttf 14,713,760 bytes

I see a distinct lack of mono fonts, lots of "Regular" and "Bold".
And also font extensions, which most programs won't know how to use.
Adding more font standards (other than .ttf) isn't real progress
when nothing uses them.

I'd experiment with Courier New, but based on the size of the
file in my system (303,296 bytes), that's just not big enough
to have enough alternate pages of stuff.

I had a copy of FontForge set up once, and I could see the
pages in some of the fonts with it.

Paul
JJ
2017-08-05 17:00:05 UTC
Reply
Permalink
Raw Message
Post by Paul
I managed to modify my system enough so that Thunderbird
shows characters instead of boxes. But since the font
used (JhengHei Regular) isn't a monospaced font, there's
no way that cmd.exe is going to use a font like that. Even
with the registry hack, it will be excluded from the font menu.
https://s2.postimg.org/hax9prms9/no_squares.gif
This is the font i used. There's apparently more than one
font for the job, and the characters are different in them.
So only a native speaker/writer could possibly know whether
that's an appropriate representation.
http://www.microsoft.com/en-us/download/details.aspx?id=12072
msjh.ttf 14,713,760 bytes
I see a distinct lack of mono fonts, lots of "Regular" and "Bold".
And also font extensions, which most programs won't know how to use.
Adding more font standards (other than .ttf) isn't real progress
when nothing uses them.
I'd experiment with Courier New, but based on the size of the
file in my system (303,296 bytes), that's just not big enough
to have enough alternate pages of stuff.
I had a copy of FontForge set up once, and I could see the
pages in some of the fonts with it.
Well, Thunderbird doesn't use the Windows built-in console window. Moreover,
most cross-platform applications use their own font rendering engines.

Also see my recent reply to Mayayana.