Binary Files vs. Text Files

firetoucher

Tags: file character printing newline variables image

Article link: https://blog.csdn.net/firetoucher/article/details/593639

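Binary files and text files are not fundamentally different, but text files are usually treated as a restricted subset. A tool that interprets a file as text may act on the EOF character (1A): for example, concatenating two 10-byte files in text mode drops the first EOF and produces a 19-byte file. To avoid such problems, never let a program treat a binary file as a text file. Binary storage is more compact and faster to read but is not portable; text costs more CPU time to parse but is more compatible.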

Actually, there is no difference. But since text files are often
interpreted as such, it is wise to limit the contents to the proper
subset. For example:

>dir tt1.txt

02/06/2006 07:05p 10 tt1.txt
1 File(s) 10 bytes

The file tt1.txt contains 10 bytes, but we only see 7 if we type it:

>type tt1.txt

abcdefg

because "type" expects the file to be text and not binary and
interprets some of the contents instead of printing them. A dump
reveals why we only see 7:

DUMP.EXE version 8-MAR-91
Block # 0 0
0 61 62 63 64 65 66 67 0D 0A 1A FF FF FF FF FF FF abcdefg...

Bytes 8, 9 & 10 are carriage return (0D), line feed (0A) and EOF (1A).

The EOF character is not strictly required, since the OS knows
there are exactly 10 bytes (the FFs are sector padding bytes not
part of the file).
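
To make the "10 bytes stored, 7 characters shown" effect concrete, here is a minimal sketch of a text-mode reader that stops at the first 1A byte. The function name text_view is made up for illustration, and the sketch only models the behaviour observed above; it is not how "type" is actually implemented.

```python
CTRL_Z = b"\x1a"  # the EOF marker that appears as 1A in the dump


def text_view(data: bytes) -> bytes:
    """Return the bytes a text-mode reader would show.

    Hypothetical model: everything before the first 0x1A is shown,
    everything from that byte onward is ignored.
    """
    end = data.find(CTRL_Z)
    return data if end == -1 else data[:end]


# The 10 bytes of tt1.txt as listed in the dump.
tt1 = b"abcdefg\r\n\x1a"

shown = text_view(tt1)
visible = shown.rstrip(b"\r\n")  # drop the line ending to count printable characters

print(len(tt1), "bytes in the file")
print(len(visible), "characters visible:", visible.decode("ascii"))
```

The directory listing reports the stored length, while the text view reports only what comes before the EOF marker.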

But watch what happens when I concatenate two copies together:

>copy tt1.txt+tt1.txt tt2.txt

tt1.txt
tt1.txt
1 file(s) copied.

>dir tt2.txt

02/06/2006 07:19p 19 tt2.txt
1 File(s) 19 bytes

10 bytes + 10 bytes = 19 bytes ??

A dump reveals what happened:

DUMP.EXE version 8-MAR-91
Block # 0 0
0 61 62 63 64 65 66 67 0D 0A 61 62 63 64 65 66 67 abcdefg..abcdefg
16 0D 0A 1A FF FF FF FF FF FF FF FF FF FF FF FF FF ...

The terminating EOF of the first copy of tt1.txt was dropped
as part of the concatenation. The OS expects only one (if any)
EOF character per file, and it had better be the last byte.

I could simply insert the original EOF back into the file:

>dir tt3.txt

02/06/2006 07:25p 20 TT3.TXT
1 File(s) 20 bytes
DUMP.EXE version 8-MAR-91
Block # 0 0
0 61 62 63 64 65 66 67 0D 0A 1A 61 62 63 64 65 66 abcdefg...abcdef
16 67 0D 0A 1A FF FF FF FF FF FF FF FF FF FF FF FF g...

But the OS won't like it:

>type tt3.txt

abcdefg

Even though the file is now 20 bytes long, "type" won't go past
the first EOF character.

The copy command has a binary option that will concatenate
without trying to interpret the contents:

>copy /b tt1.txt+tt1.txt tt4.txt

tt1.txt
tt1.txt
1 file(s) copied.

>dir tt4.txt

02/06/2006 07:29p 20 tt4.txt
1 File(s) 20 bytes

But that doesn't help the "type" command.

>type tt4.txt

abcdefg
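
The two concatenation results above (19 bytes without /b, 20 bytes with it) can be reproduced with a small model. This is only a sketch of the observed behaviour, assuming that text-mode concatenation cuts each source at its first 1A and then writes a single 1A at the end; the function names are hypothetical and this is not the real code behind "copy".

```python
EOF_MARK = b"\x1a"


def concat_text(*parts: bytes) -> bytes:
    """Text-mode model: cut each part at its first 0x1A, then append one 0x1A."""
    body = b"".join(part.split(EOF_MARK, 1)[0] for part in parts)
    return body + EOF_MARK


def concat_binary(*parts: bytes) -> bytes:
    """Binary model: join the bytes exactly as they are."""
    return b"".join(parts)


tt1 = b"abcdefg\r\n\x1a"          # the 10 bytes of tt1.txt

tt2 = concat_text(tt1, tt1)       # like: copy tt1.txt+tt1.txt tt2.txt
tt4 = concat_binary(tt1, tt1)     # like: copy /b tt1.txt+tt1.txt tt4.txt

print(len(tt2), tt2.hex(" "))
print(len(tt4), tt4.hex(" "))
```

In this model tt2 matches the 19-byte dump and tt4 has the same 20 bytes as tt3.txt, including the embedded 1A that stops "type" after the first line.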

These kinds of problems can also occur if you use FTP to send
a binary file in text mode.
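
As a rough illustration of the text-mode transfer problem, the sketch below models a transfer that rewrites every line feed as carriage return plus line feed. The function ascii_mode_transfer is a made-up, simplified stand-in (a real client would not double an existing CR LF), and the 16-bit value 10 is chosen only because its encoding happens to contain the byte 0A.

```python
import struct


def ascii_mode_transfer(data: bytes) -> bytes:
    """Hypothetical, simplified model of a text-mode transfer: LF becomes CR LF."""
    return data.replace(b"\n", b"\r\n")


# A 16-bit little-endian integer with the value 10 is stored as the bytes 0A 00.
original = struct.pack("<H", 10)
received = ascii_mode_transfer(original)

print(original.hex(" "), "->", received.hex(" "))
print(len(original), "bytes sent,", len(received), "bytes received")
```

The receiver ends up with three bytes where the sender had two, so the data can no longer be decoded as the 16-bit integer it started as.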

So, generally, assuming the content is ok, it's best to never
let a program think a binary file is a text file.

Efficiency

What people usually mean by this is storing data in the form of either
binary representation or an ASCII string. Say for example you have four
variables, the first two are short integers (assume this to be 16 bits)
and the second two are long integers (assume this to be 32 bits). You
want to store the values in a file. In text format there are many ways
to represent this in a file but the most common is the newline
separated file:

100
4050
234262
400000

Reading the text file requires reading 22 bytes (19 digits plus the
three newlines between them). And then you also have to convert the
ASCII strings back to integers using atoi(), atol() etc., which
consumes CPU time. Compare this to reading a pure binary file. Reading
the binary file only requires reading 12 bytes (2 + 2 + 4 + 4), and at
most you'd have to handle endianness by flipping the bytes over (or
calling ntohs() and friends), which consumes a lot less CPU time than
atoi().
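
The byte counts can be checked with a short sketch. It assumes single-byte newlines used as separators with no trailing newline, and it picks network (big-endian) byte order with no padding for the binary form; those choices are mine for illustration, not something the text prescribes.

```python
import struct

values = (100, 4050, 234262, 400000)

# Text form: decimal strings separated by single-byte newlines.
text = "\n".join(str(v) for v in values).encode("ascii")

# Binary form: two signed 16-bit and two signed 32-bit integers,
# network (big-endian) byte order, no padding between fields.
binary = struct.pack("!hhii", *values)

# Reading back: text needs a string-to-integer conversion per value,
# binary only needs the fixed-size fields unpacked.
parsed_text = tuple(int(line) for line in text.decode("ascii").split("\n"))
parsed_binary = struct.unpack("!hhii", binary)

print(len(text), "bytes as text")
print(len(binary), "bytes as binary")
```

Both forms round-trip to the same four values; the difference is the size on disk and the conversion work needed on every read.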

Now, with four values you'll not see much difference, but imagine doing
this for very large amounts of data. Compare, for example, the size of a
256-color 1024x768 image stored as a GIF file (which is binary) with
the same image stored as an X pixmap (XPM) file (which is ASCII text).

Compatibility

The main problem with binary representation is that it's
not portable. The same type can have a different size or
byte order on different machines, and different languages
can implement it in different ways. In many cases the speed
you gain by using binary files is not enough to make you
give up portability.
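
Byte order is the easiest portability problem to demonstrate. The sketch below encodes the same 32-bit value in both byte orders and then reads the little-endian bytes as if they were big-endian; the variable names are only illustrative.

```python
import struct

value = 400000  # 0x00061A80

little = struct.pack("<I", value)  # least significant byte first
big = struct.pack(">I", value)     # most significant byte first

# A reader that assumes the wrong byte order gets a different number.
misread = struct.unpack(">I", little)[0]

# A reader that knows the byte order recovers the original value.
correct = struct.unpack("<I", little)[0]

print(little.hex(" "), "vs", big.hex(" "))
print("read with the wrong byte order:", misread)
print("read with the right byte order:", correct)
```

Writing the byte order into the file format, as the explicit "<" and ">" prefixes do here, is one way to keep binary data readable across machines. Note that this particular value also contains the byte 1A, which a text-mode tool would treat as EOF.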