Windows: WindowsConsoleIO produces mojibake for strings longer than 32 KiB #82052

andy-ms · 2019-08-16T00:36:04Z

BPO	37871
Nosy	@gpshead, @pfmoore, @tjguk, @zware, @eryksun, @zooba, @Andy-MS

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2019-08-16.00:36:04.457>
labels = ['type-bug', '3.8', '3.9', 'expert-IO', '3.10', 'OS-windows']
title = 'Windows: WindowsConsoleIO produces mojibake for strings longer than 32 KiB'
updated_at = <Date 2021-03-20.23:35:16.331>
user = 'https://github.com/andy-ms'

bugs.python.org fields:

activity = <Date 2021-03-20.23:35:16.331>
actor = 'eryksun'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Windows', 'IO']
creation = <Date 2019-08-16.00:36:04.457>
creator = 'anhans'
dependencies = []
files = []
hgrepos = []
issue_num = 37871
keywords = []
message_count = 5.0
messages = ['349837', '349844', '349872', '389191', '389199']
nosy_count = 7.0
nosy_names = ['gregory.p.smith', 'paul.moore', 'tim.golden', 'zach.ware', 'eryksun', 'steve.dower', 'anhans']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'needs patch'
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue37871'
versions = ['Python 3.8', 'Python 3.9', 'Python 3.10']

Linked PRs

andy-ms · 2019-08-16T00:36:04Z

# To reproduce:
# Put this text in a file a.py and run py a.py.
# Or just run: py -c "print(('é' * 40 + '\n') * 473)"
# Scroll up for a while. One of the lines will be:
# éééééééééééééééééééé��ééééééééééééééééééé
# (You can spot this because it's slightly longer than the other lines.)
# The error is consistently on line 237, column 21 (1-indexed).

# The error reproduces on Windows but not Linux. Tested in both powershell and CMD.
# (Failed to reproduce on either a real Linux machine or on Ubuntu with WSL.)
# On Windows, the error reproduces every time consistently.

# There is no error if N = 472 or 474.
N = 473
# There is no error if W = 39 or 41.
# (I tested with console windows of varying sizes, all well over 40 characters.)
W = 40
# There is no error if ch = "e" with no accent.
# There is still an error for other unicode characters like "Ö" or "ü".
ch = "é"
# There is no error without newlines.
s = (ch * W + "\n") * N
# Assert the string itself is correct.
assert all(c in (ch, "\n") for c in s)
print(s)

# There is no error if we use N separate print statements
# instead of printing a single string with N newlines.

# Similar scripts written in Groovy, JS and Ruby have no error.
# Groovy: System.out.println(("é" * 40 + "\n") * 473)
# JS: console.log(("é".repeat(40) + "\n").repeat(473))
# Ruby: puts(("é" * 40 + "\n") * 473)

eryksun · 2019-08-16T04:05:43Z

To be compatible with Windows 7, _io__WindowsConsoleIO_write_impl in Modules/_io/winconsoleio.c is forced to write to the console in chunks that do not exceed 32 KiB. It does so by repeatedly dividing the length to decode by 2 until the decoded buffer size is small enough.

    wlen = MultiByteToWideChar(CP_UTF8, 0, b->buf, len, NULL, 0);
    while (wlen > 32766 / sizeof(wchar_t)) {
        len /= 2;
        wlen = MultiByteToWideChar(CP_UTF8, 0, b->buf, len, NULL, 0);
    }

With ('é' * 40 + '\n') * 473, encoded as UTF-8, we have 473 82-byte lines (note that "\n" has been translated to "\r\n"). This is 38,786 bytes, which is too much for a single write, so it splits it in two.

    >>> 38786 // 2
    19393
    >>> 19393 // 82
    236
    >>> 19393 % 82
    41

This means line 237 ends up with 20 'é' characters (UTF-8 b'\xc3\xa9') and one partial character sequjence, b'\xc3'. When this buffer is passed to MultiByteToWideChar to decode from UTF-8 to UTF-16, the partial sequence gets decoded as the replacement character U+FFFD. For the next write, the remaining b'\xa9' byte also gets decoded as U+FFFD.

To avoid this, _io__WindowsConsoleIO_write_impl could decode the whole buffer in one pass, and slice that up into writes that are less than 32 KiB. Or it could ensure that its UTF-8 slices are always at character boundaries.

zooba · 2019-08-16T15:39:55Z

I'd rather keep encoding incrementally, and reduce the length of each attempt until the last UTF-8 character does not have its top bit set (i.e. is the final character in a multi-byte sequence).

Otherwise the people who like to print >2GB worth of data to the console will complain about the memory error :)

gpshead · 2021-03-20T22:26:36Z

Steve's approach makes sense and should be robust.

side note: do we need to care about Windows 7 anymore in 3.10 given that microsoft no longer supports it?

eryksun · 2021-03-20T23:35:16Z

side note: do we need to care about Windows 7 anymore in
3.10 given that microsoft no longer supports it?

If the fix comes in time for Python 3.8, then it needs to support Windows 7. For Python 3.9+, the 32 KiB limit can be removed.

The console documentation still includes the misleading disclaimer about "available heap". This refers to a relatively small block of shared memory (64 KiB IIRC) that's overlayed by a heap, not the default process heap. Shared memory is used by system LPC ports to efficiently pass large messages between a system server (e.g. csrss.exe, conhost.exe) and a client process. The console API used to use an LPC port, but in Windows 8.1+ it uses a driver instead, so none of the "available heap" warnings apply anymore. Microsoft should clarify the docs to stress that the warning is for Windows 7 and earlier.

pfmoore · 2022-12-23T23:03:48Z

The following change to _io__WindowsConsoleIO_write_impl appears to fix the issue, by implementing @zooba's suggestion.

    while (wlen > 32766 / sizeof(wchar_t)) {
        len /= 2;
        /* Reduce the length until we hit the final byte of a UTF-8 sequence
         * (top bit is unset). Fix for github issue 82052.
         */
        while (len > 0 && (((char *)b->buf)[len-1] & 0x80) != 0)
            --len;
        wlen = MultiByteToWideChar(CP_UTF8, 0, b->buf, len, NULL, 0);
    }

I can make this into a PR (likely sometime after the holidays), but I'm not entirely clear how I'd create a test for it. And my C skills are rather rusty, so if someone would check I haven't made a stupid error that would be great 🙂

One thing this doesn't handle is if there isn't a UTF-8 final byte in the part of the buffer we are planning on converting, so len goes to 0. I can't imagine how that would happen, but I haven't checked if that's a possibility. And if it is, I don't know what else we can do about it...

cc @willmcgugan who flagged this issue on Twitter...

zooba · 2023-01-09T21:18:21Z

One thing this doesn't handle is if there isn't a UTF-8 final byte in the part of the buffer we are planning on converting, so len goes to 0. I can't imagine how that would happen, but I haven't checked if that's a possibility. And if it is, I don't know what else we can do about it...

It's not possible in a valid string, so probably if len hits zero then we can either raise a Unicode error of some kind, or just restore the original length and let the console deal with it.

(It's not possible because each character in a multi-byte sequence has progressively more top-bits set. So even theoretically, you can only have 9 bytes in a row representing a single character, but IIRC only 3 are allowed. We're at over 16,000, so we're good.)

ofek · 2023-01-10T06:49:20Z

Is there an open pull request for the fix?

pfmoore · 2023-01-10T09:37:06Z

Not yet. I was planning on doing one, but haven’t had the time yet.

…01103) Don't send partial UTF-8 sequences to the Windows API

…pythonGH-101103) Don't send partial UTF-8 sequences to the Windows API (cherry picked from commit f34176b) Co-authored-by: Paul Moore <p.f.moore@gmail.com>

…01103) Don't send partial UTF-8 sequences to the Windows API (cherry picked from commit f34176b) Co-authored-by: Paul Moore <p.f.moore@gmail.com>

andy-ms mannequin added 3.7 (EOL) end of life OS-windows type-bug An unexpected behavior, bug, or error labels Aug 16, 2019

eryksun added the topic-IO label Aug 16, 2019

vstinner changed the title ~~40 * 473 grid of "é" has a single wrong character on Windows~~ Windows: WindowsConsoleIO produces mojibake for strings longer than 32 KiB Aug 21, 2019

eryksun added 3.8 (EOL) end of life 3.9 only security fixes 3.10 only security fixes and removed 3.7 (EOL) end of life labels Mar 20, 2021

ezio-melotti transferred this issue from another repository Apr 10, 2022

pfmoore mentioned this issue Jan 17, 2023

gh-82052: Don't send partial UTF-8 sequences to the Windows API #101103

Merged

pfmoore closed this as completed in #101103 Jan 17, 2023

This was referenced Jan 17, 2023

[3.11] gh-82052: Don't send partial UTF-8 sequences to the Windows API (GH-101103) #101107

Merged

[3.10] gh-82052: Don't send partial UTF-8 sequences to the Windows API (GH-101103) #101108

Merged

pfmoore added a commit that referenced this issue Jan 17, 2023

gh-82052: Don't send partial UTF-8 sequences to the Windows API (GH-1…

f34176b

…01103) Don't send partial UTF-8 sequences to the Windows API

sorgloomer mentioned this issue Oct 16, 2023

Windows: WindowsConsoleIO produces mojibake replacement characters #110913

Closed

davep mentioned this issue Nov 7, 2023

Chunk writes on Windows Textualize/textual#3644

Merged

davep mentioned this issue Nov 7, 2023

Line drawing artefacts in Tree on Windows, in Windows Terminal Textualize/textual#2548

Closed

cmaloney mentioned this issue Jul 17, 2024

Simplify Windows write I/O around consoles on Windows 8+ #121940

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Windows: WindowsConsoleIO produces mojibake for strings longer than 32 KiB #82052

Windows: WindowsConsoleIO produces mojibake for strings longer than 32 KiB #82052

andy-ms mannequin commented Aug 16, 2019 •

edited by bedevere-bot

Loading

andy-ms mannequin commented Aug 16, 2019 •

edited by vstinner

Loading

eryksun commented Aug 16, 2019

zooba commented Aug 16, 2019

gpshead commented Mar 20, 2021

eryksun commented Mar 20, 2021

pfmoore commented Dec 23, 2022

zooba commented Jan 9, 2023

ofek commented Jan 10, 2023

pfmoore commented Jan 10, 2023

Windows: WindowsConsoleIO produces mojibake for strings longer than 32 KiB #82052

Windows: WindowsConsoleIO produces mojibake for strings longer than 32 KiB #82052

Comments

andy-ms mannequin commented Aug 16, 2019 • edited by bedevere-bot Loading

Linked PRs

andy-ms mannequin commented Aug 16, 2019 • edited by vstinner Loading

eryksun commented Aug 16, 2019

zooba commented Aug 16, 2019

gpshead commented Mar 20, 2021

eryksun commented Mar 20, 2021

pfmoore commented Dec 23, 2022

zooba commented Jan 9, 2023

ofek commented Jan 10, 2023

pfmoore commented Jan 10, 2023

andy-ms mannequin commented Aug 16, 2019 •

edited by bedevere-bot

Loading

andy-ms mannequin commented Aug 16, 2019 •

edited by vstinner

Loading