Commit f34176b

gh-82052: Don't send partial UTF-8 sequences to the Windows API (GH-101103)
1 parent c5660ae commit f34176b

2 files changed: +17 -1 lines changed

@@ -0,0 +1 @@
+Fixed an issue where writing more than 32K of Unicode output to the console screen in one go can result in mojibake.

Modules/_io/winconsoleio.c

+16 -1
@@ -954,7 +954,7 @@ _io__WindowsConsoleIO_write_impl(winconsoleio *self, Py_buffer *b)
 {
     BOOL res = TRUE;
     wchar_t *wbuf;
-    DWORD len, wlen, n = 0;
+    DWORD len, wlen, orig_len, n = 0;
     HANDLE handle;

     if (self->fd == -1)
@@ -984,6 +984,21 @@ _io__WindowsConsoleIO_write_impl(winconsoleio *self, Py_buffer *b)
        have to reduce and recalculate. */
     while (wlen > 32766 / sizeof(wchar_t)) {
         len /= 2;
+        orig_len = len;
+        /* Reduce the length until we hit the final byte of a UTF-8 sequence
+         * (top bit is unset). Fix for github issue 82052.
+         */
+        while (len > 0 && (((char *)b->buf)[len-1] & 0x80) != 0)
+            --len;
+        /* If we hit a length of 0, something has gone wrong. This shouldn't
+         * be possible, as valid UTF-8 can have at most 3 non-final bytes
+         * before a final one, and our buffer is way longer than that.
+         * But to be on the safe side, if we hit this issue we just restore
+         * the original length and let the console API sort it out.
+         */
+        if (len == 0) {
+            len = orig_len;
+        }
         wlen = MultiByteToWideChar(CP_UTF8, 0, b->buf, len, NULL, 0);
     }
     Py_END_ALLOW_THREADS
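
The trimming loop added above can be tried outside CPython with a small portable sketch. The helper name `trim_to_ascii_boundary` is hypothetical and does not appear in the commit; it only mirrors the loop and the `orig_len` fallback, using `size_t` in place of `DWORD` so that it builds without the Windows headers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the commit) mirroring the loop in the
 * diff: back up from `len` until the byte just before the cut has its top
 * bit clear, i.e. is an ASCII byte. If no such byte is found, fall back to
 * the original length, as the commit does with orig_len. */
static size_t trim_to_ascii_boundary(const char *buf, size_t len)
{
    size_t orig_len = len;

    assert(buf != NULL);
    while (len > 0 && (((const unsigned char *)buf)[len - 1] & 0x80) != 0)
        --len;
    return len == 0 ? orig_len : len;
}
```

For the six bytes `"ab\xE2\x82\xAC" "c"` ("ab", U+20AC, "c"), a cut at 4 would split the three-byte sequence, and the helper backs up to 2; a cut at 6 already ends on an ASCII byte and is left alone. Because every byte of a multi-byte UTF-8 sequence has its top bit set, the loop only stops on ASCII bytes, so a prefix containing no ASCII at all (for example `"\xE2\x82\xAC\xE2\x82\xAC"` cut at 4) reaches zero and takes the fallback path.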
