-
-
Notifications
You must be signed in to change notification settings - Fork 31.7k
Windows: WindowsConsoleIO produces mojibake for strings longer than 32 KiB #82052
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
# To reproduce: # The error reproduces on Windows but not Linux. Tested in both powershell and CMD. # There is no error if N = 472 or 474.
N = 473
# There is no error if W = 39 or 41.
# (I tested with console windows of varying sizes, all well over 40 characters.)
W = 40
# There is no error if ch = "e" with no accent.
# There is still an error for other unicode characters like "Ö" or "ü".
ch = "é"
# There is no error without newlines.
s = (ch * W + "\n") * N
# Assert the string itself is correct.
assert all(c in (ch, "\n") for c in s)
print(s) # There is no error if we use N separate print statements # Similar scripts written in Groovy, JS and Ruby have no error. |
To be compatible with Windows 7, _io__WindowsConsoleIO_write_impl in Modules/_io/winconsoleio.c is forced to write to the console in chunks that do not exceed 32 KiB. It does so by repeatedly dividing the length to decode by 2 until the decoded buffer size is small enough. wlen = MultiByteToWideChar(CP_UTF8, 0, b->buf, len, NULL, 0);
while (wlen > 32766 / sizeof(wchar_t)) {
len /= 2;
wlen = MultiByteToWideChar(CP_UTF8, 0, b->buf, len, NULL, 0);
} With >>> 38786 // 2
19393
>>> 19393 // 82
236
>>> 19393 % 82
41 This means line 237 ends up with 20 'é' characters (UTF-8 b'\xc3\xa9') and one partial character sequjence, b'\xc3'. When this buffer is passed to MultiByteToWideChar to decode from UTF-8 to UTF-16, the partial sequence gets decoded as the replacement character U+FFFD. For the next write, the remaining b'\xa9' byte also gets decoded as U+FFFD. To avoid this, _io__WindowsConsoleIO_write_impl could decode the whole buffer in one pass, and slice that up into writes that are less than 32 KiB. Or it could ensure that its UTF-8 slices are always at character boundaries. |
I'd rather keep encoding incrementally, and reduce the length of each attempt until the last UTF-8 character does not have its top bit set (i.e. is the final character in a multi-byte sequence). Otherwise the people who like to print >2GB worth of data to the console will complain about the memory error :) |
Steve's approach makes sense and should be robust. side note: do we need to care about Windows 7 anymore in 3.10 given that microsoft no longer supports it? |
If the fix comes in time for Python 3.8, then it needs to support Windows 7. For Python 3.9+, the 32 KiB limit can be removed. The console documentation still includes the misleading disclaimer about "available heap". This refers to a relatively small block of shared memory (64 KiB IIRC) that's overlayed by a heap, not the default process heap. Shared memory is used by system LPC ports to efficiently pass large messages between a system server (e.g. csrss.exe, conhost.exe) and a client process. The console API used to use an LPC port, but in Windows 8.1+ it uses a driver instead, so none of the "available heap" warnings apply anymore. Microsoft should clarify the docs to stress that the warning is for Windows 7 and earlier. |
The following change to
I can make this into a PR (likely sometime after the holidays), but I'm not entirely clear how I'd create a test for it. And my C skills are rather rusty, so if someone would check I haven't made a stupid error that would be great 🙂 One thing this doesn't handle is if there isn't a UTF-8 final byte in the part of the buffer we are planning on converting, so len goes to 0. I can't imagine how that would happen, but I haven't checked if that's a possibility. And if it is, I don't know what else we can do about it... cc @willmcgugan who flagged this issue on Twitter... |
It's not possible in a valid string, so probably if len hits zero then we can either raise a Unicode error of some kind, or just restore the original length and let the console deal with it. (It's not possible because each character in a multi-byte sequence has progressively more top-bits set. So even theoretically, you can only have 9 bytes in a row representing a single character, but IIRC only 3 are allowed. We're at over 16,000, so we're good.) |
Is there an open pull request for the fix? |
Not yet. I was planning on doing one, but haven’t had the time yet. |
…01103) Don't send partial UTF-8 sequences to the Windows API
…pythonGH-101103) Don't send partial UTF-8 sequences to the Windows API (cherry picked from commit f34176b) Co-authored-by: Paul Moore <p.f.moore@gmail.com>
…pythonGH-101103) Don't send partial UTF-8 sequences to the Windows API (cherry picked from commit f34176b) Co-authored-by: Paul Moore <p.f.moore@gmail.com>
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
Linked PRs
The text was updated successfully, but these errors were encountered: