Binary Files vs. Text Files

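Binary files and text files do not differ in essence, but a text file is usually treated as a restricted subset of byte values. When files are concatenated in text mode, a trailing EOF character (0x1A) can be dropped: concatenating two 10-byte text files yields 19 bytes, because the first file's EOF is discarded. To avoid such problems, never let a program treat a binary file as a text file. Binary files also store data more efficiently but are not portable; text files cost more CPU time to parse but are more compatible.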


Strictly speaking, there is no difference: both are just sequences of
bytes. But since many programs interpret text files as text, it is wise
to limit their contents to the proper subset of byte values. For example:

 

>dir tt1.txt

02/06/2006  07:05p                  10 tt1.txt
               1 File(s)             10 bytes

The file tt1.txt contains 10 bytes, but we only see 7 if we type it:

 

>type tt1.txt

abcdefg

because "type" expects the file to be text and not binary and
interprets some of the contents instead of printing them. A dump
reveals why we only see 7:

                          DUMP.EXE    version 8-MAR-91
Block #    0        0
  0   61 62 63 64 65 66 67 0D 0A 1A FF FF FF FF FF FF    abcdefg...

Bytes 8, 9 and 10 are carriage return (0D), line feed (0A) and EOF (1A).

The EOF character is not strictly required, since the OS knows
there are exactly 10 bytes (the FFs are sector padding bytes not
part of the file).

But watch what happens when I concatenate two copies together:

 

>copy tt1.txt+tt1.txt tt2.txt

tt1.txt
tt1.txt
        1 file(s) copied.
>dir tt2.txt

02/06/2006  07:19p                  19 tt2.txt
               1 File(s)             19 bytes

10 bytes + 10 bytes = 19 bytes ??

A dump reveals what happened:

                          DUMP.EXE    version 8-MAR-91
Block #    0        0
  0   61 62 63 64 65 66 67 0D 0A 61 62 63 64 65 66 67    abcdefg..abcdefg
 16   0D 0A 1A FF FF FF FF FF FF FF FF FF FF FF FF FF    ...

The terminating EOF of the first copy of tt1.txt was dropped
as part of the concatenation. The OS expects at most one EOF
character per file, and it had better be the last byte.
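
The 19-byte result is consistent with a simple model of text-mode concatenation: read each source only up to its first EOF byte, then write a single EOF at the end of the output. The Python sketch below emulates that model. The function name `text_mode_concat` is hypothetical, and the model is only an assumption that matches the dump above, not the actual implementation of `copy`.

```python
EOF = 0x1A  # Ctrl-Z, the text-mode end-of-file marker seen in the dumps


def text_mode_concat(*sources: bytes) -> bytes:
    """Assumed model: cut each source at its first EOF byte,
    join the pieces, and append one EOF at the very end."""
    out = bytearray()
    for data in sources:
        cut = data.find(EOF)
        out += data if cut == -1 else data[:cut]
    out.append(EOF)
    return bytes(out)


tt1 = b"abcdefg\r\n\x1a"           # the 10 bytes from the first dump
tt2 = text_mode_concat(tt1, tt1)

print(len(tt1), len(tt2))          # 10 19
print(tt2.hex(" ").upper())        # same byte sequence as the tt2.txt dump
```

Under this model each input contributes 9 bytes and the output gains one EOF, which accounts for 9 + 9 + 1 = 19.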

I could simply insert the original EOF back into the file:

 

>dir tt3.txt

02/06/2006  07:25p                  20 TT3.TXT
               1 File(s)             20 bytes

                          DUMP.EXE    version 8-MAR-91
Block #    0        0
  0   61 62 63 64 65 66 67 0D 0A 1A 61 62 63 64 65 66    abcdefg...abcdef
 16   67 0D 0A 1A FF FF FF FF FF FF FF FF FF FF FF FF    g...

But the OS won't like it:

 

>type tt3.txt

abcdefg

Even though the file is now 20 bytes long, "type" won't go past
the first EOF character.
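
A minimal Python sketch of that behavior, assuming a text-mode reader simply stops at the first 0x1A byte. The helper name `text_view` is hypothetical; it only imitates what "type" appears to do here.

```python
EOF = 0x1A  # Ctrl-Z


def text_view(data: bytes) -> bytes:
    """Return the part of the data a text-mode reader would show,
    assuming it stops at the first EOF byte."""
    cut = data.find(EOF)
    return data if cut == -1 else data[:cut]


tt3 = b"abcdefg\r\n\x1a" * 2       # the 20 bytes of tt3.txt
print(len(tt3))                    # 20
print(text_view(tt3))              # b'abcdefg\r\n'
```

The second copy of the line is still in the file; it is just never reached by a reader that treats 1A as the end.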

The copy command has a binary option that will concatenate
without trying to interpret the contents:

 

>copy /b tt1.txt+tt1.txt tt4.txt

tt1.txt
tt1.txt
        1 file(s) copied.
>dir tt4.txt

02/06/2006  07:29p                  20 tt4.txt
               1 File(s)             20 bytes

But that doesn't help the "type" command.

 

>type tt4.txt

abcdefg

These kinds of problems can also occur if you use FTP to send
a binary file in text mode.

So, generally, assuming the content is ok, it's best to never
let a program think a binary file is a text file.
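
The same class of problem can be reproduced in Python, where the mechanism is newline translation rather than EOF handling. The sketch below writes four raw bytes, then reads them back once in binary mode and once in text mode. The file name `payload.bin` is hypothetical, and latin-1 is chosen only because it maps every byte to a character.

```python
import os
import tempfile

payload = bytes([0x61, 0x0D, 0x0A, 0x62])   # 'a', CR, LF, 'b' as raw data

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "payload.bin")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(payload)

    # Binary mode returns the bytes exactly as stored.
    with open(path, "rb") as f:
        as_binary = f.read()

    # Text mode (default newline handling) turns CR LF into a single "\n".
    with open(path, "r", encoding="latin-1") as f:
        as_text = f.read()

print(len(as_binary))   # 4
print(len(as_text))     # 3
```

One byte of the payload has silently disappeared in the text-mode read, which is harmless for text and fatal for binary data.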

 

Efficiency

What people usually mean by this is whether data is stored in binary
representation or as ASCII strings. Say, for example, you have four
variables: the first two are short integers (assume 16 bits each) and
the second two are long integers (assume 32 bits each). You want to
store the values in a file. In text format there are many ways to
represent them, but the most common is the newline-separated file:

100
4050
234262
400000

Reading the text file requires reading 22 bytes (19 digits plus the 3
newlines that separate them). You then also have to convert the ASCII
strings back to integers using atoi() etc., which consumes CPU time.
Compare this to reading a pure binary file: that requires reading only
12 bytes (2 + 2 + 4 + 4), and at most you'd have to handle endianness by
flipping the bytes over (or calling ntohs() and friends), which consumes
a lot less CPU time than atoi().
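
The byte counts can be checked with a short Python sketch. It assumes single-byte newlines with no trailing newline, and uses the `struct` format `"!hhii"` (network byte order, two 16-bit and two 32-bit signed integers, no padding) as a stand-in for the binary layout described above. Here `int()` plays the role of atoi() and the `"!"` prefix plays the role of ntohs() and friends.

```python
import struct

values = (100, 4050, 234262, 400000)

# Text form: decimal digits separated by newlines.
text = "\n".join(str(v) for v in values).encode("ascii")

# Binary form: "!" = network byte order, h = 16-bit, i = 32-bit.
binary = struct.pack("!hhii", *values)

print(len(text), len(binary))      # 22 12

# Reading back: text needs string-to-integer conversion,
# binary needs only a fixed-layout unpack.
from_text = tuple(int(s) for s in text.decode("ascii").split("\n"))
from_binary = struct.unpack("!hhii", binary)
print(from_text == values, from_binary == values)   # True True
```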

Now, with four values you won't see much difference, but imagine doing
this for very large amounts of data. Compare, for example, a 1024x768,
256-color image stored as a GIF file (which is binary) with the same
image stored as an X PixMap (XPM) file (which is ASCII text).

Compatibility

The main problem with binary representation is that it's
not portable. The same type can be laid out differently
(size, byte order) on different machines, and different
languages can implement it in different ways. In many cases
the speed you gain by using binary files is not enough to
make you give up portability.
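
Byte order is the most common example of this. The sketch below packs the same 32-bit value in big-endian and little-endian form with Python's `struct` module, then reads the big-endian bytes with the wrong byte order. It illustrates only the byte-order aspect, not differences in type size or padding.

```python
import struct

value = 400000                      # 0x00061A80

big = struct.pack(">i", value)      # big-endian layout
little = struct.pack("<i", value)   # little-endian layout
print(big.hex(" "))                 # 00 06 1a 80
print(little.hex(" "))              # 80 1a 06 00

# Reading big-endian bytes as little-endian gives a different number.
misread = struct.unpack("<i", big)[0]
print(misread == value)             # False

# Reading with the byte order the writer used recovers the value.
print(struct.unpack(">i", big)[0] == value)   # True
```

A text file containing the digits "400000" has no such ambiguity, which is the portability that the binary form gives up.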
