Internal and Tape Sorting Using the Replacement-Selection Technique

A general technique for sequencing unsorted records is presented. The technique is shown to be applicable for the first stage of a generalized sort program (the formation of initial strings) as well as for sorting records within a memory storage (an internal sort). It is shown that given N records in memory storage, records are sequenced using 1+log2 N tests per record, that initial string lengths will average 2N for random input records, and that reading, writing and processing can be accomplished simultaneously if the computer permits such overlap. 提出了一种对未排序记录进行排序的通用技术。该技术被证明适用于广义排序程序的第一阶段(初始字符串的形成)以及对存储器中的记录进行排序(内部排序)。结果表明,给定存储器中的N个记录,每个记录使用1+log2N测试对记录进行排序,随机输入记录的初始字符串长度平均为2N,并且如果计算机允许这种重叠,则可以同时完成读取、写入和处理。

Introduction and Summary

Sorting programs generally require two phases: formation of strings of sequenced records onto external memory, and repeated merging of the sequenced strings using external memory until a single sequenced string is produced. This paper presents a technique for string formations using a binary tree procedure. The technique is called replacement-selection, since records selected as the smallest in a group are replaced by a new unsorted record. 排序程序通常需要两个阶段:将已排序记录的字符串形成到外部存储器上,以及使用外部存储器重复合并已排序字符串,直到生成单个已排序字符串。本文提出了一种使用二叉树程序形成字符串的技术。这种技术被称为替换选择,因为被选为组中最小的记录会被一个新的未排序记录替换。

A mass storage area in memory is initially filled with unsorted records, each record being separated from the adjacent records by one word, subsequently called result words. The number of records stored in the mass storage area must be a multiple of four. Pairs of adjoining records are compared and the results of the comparison are stored in the result words. Then the smaller of each pair is compared and so on until the record with the smallest key is found. The result of all comparisons are stored in the result words. Records are not moved during the comparison procedure. After a record is selected as the smallest and written out or placed in an output area, a new input record replaces the selected record in its identical physical location in the mass storage area. 存储器中的大容量存储区域最初填充有未排序的记录,每个记录与相邻记录由一个字分隔,随后称为结果字。大容量存储区域中存储的记录数必须是四的倍数。相邻记录对被比较,并且比较的结果被存储在结果字中。然后比较每对中较小的一个,依此类推,直到找到具有最小密钥的记录。所有比较的结果都存储在结果字中。在比较过程中不会移动记录。在一个记录被选择为最小的并被写出或放置在输出区域中之后,新的输入记录将替换大容量存储区域中相同物理位置的所选记录。

A major advantage of this method is that reading, writing and computing operations are performed simultaneously if the computer permits overlap of these functions. Reading and writing of records is not dependent on the sequencing of all records in a mass storage area. As one or several records are sequenced they may be written on tape. Conversely, reading of input records occurs after one or several records are to be replaced. Input and output areas are not considered part of the mass storage area. Two input and two output areas are required to eliminate read or write interlock. 这种方法的一个主要优点是,如果计算机允许这些功能重叠,则可以同时执行读取、写入和计算操作。记录的读取和写入不依赖于大容量存储区域中所有记录的顺序。当一条或几条记录被排序时,它们可能会被写在磁带上。相反,输入记录的读取发生在一个或多个记录被替换之后。输入和输出区域不被视为大容量存储区域的一部分。需要两个输入和两个输出区域来消除读或写互锁。

The method requires one additional word per record. The number of comparisons per record I is equal to 1 + [log2N], where N is the number of records stored in memory. For randomly ordered keys approximately twice the number of records that can be stored in a mass storage area will be sequenced at one time. 2 该方法要求每个记录增加一个单词。每个记录I的比较数等于1+[log2N],其中N是存储在存储器中的记录数。对于随机排序的密钥,一次将对可存储在大容量存储区中的记录数量进行排序,大约是记录数量的两倍

The Replacement-Selection Method

The program is divided into two sections: initialization, and record selection. 该程序分为两个部分:初始化和记录选择。

INITIALIZATION LOGIC. Given N records lying in contiguous core locations separated by one word (a mass storage area), N-I tests are made as follows: 初始化逻辑。给定N个记录位于由一个字分隔的连续核心位置(大容量存储区),N-I测试如下:

  1. N/2 comparisons are made comparing successive pairs of records (1 and 2; 3 and 4; 5 and 6; etc.) which yield N/2 selected records, called X1, X2, X3, .-. , XN/2 • N/2比较是对连续的记录对(1和2;3和4;5和6;等等)进行比较,其产生N/2个选定的记录,称为X1、X2、X3、,XN/2
    The addresses of these selected records are stored in core memory in the location following each odd numbered record and starting after the first odd numbered record. 这些选定记录的地址存储在核心存储器中每个奇数记录之后的位置,并从第一个奇数记录开始。
    A pictorial representation of a storage area that can contain twelve records is shown in Figure 1. In the example selected, a core location is composed of 12 numeric characters. An address is 5 characters. The one word beneath each record is used for storing the results of comparisons between records. 图1显示了一个可以包含十二条记录的存储区域的图示。在所选示例中,核心位置由12个数字字符组成。地址是5个字符。每个记录下面的一个单词用于存储记录之间的比较结果。
    Figure 2 shows this area after initialization. The four fields in the result word (below each record) represent indicators to aid the program in selecting the proper record for comparison. 图2显示了初始化后的这个区域。结果字中的四个字段(在每个记录下面)表示指示符,以帮助程序选择合适的记录进行比较。
    The last field after the odd numbered records contains the beginning address of the selected records at the first level. 奇数记录后的最后一个字段包含第一级所选记录的起始地址。
  2. The selected records Xi , X2, X3, .‘. , XN/2 are then compared in pairs as follows: X1 and X2 ; X~ and X~ ; etc. N/4 tests are performed. The comparisons yield N/4 selected records called Yx , Y2 , Y3 , "’" , YN/4 • 所选记录Xi,X2,X3,.’,XN/2然后按如下方式成对比较:X1和X2;X和X;等等。进行N/4测试。比较产生N/4个选定记录,称为Yx、Y2、Y3、“'”、YN/4
    The addresses of these records are stored in core memory in a location following even numbered records and starting after the first even number record. In Figure 2, they are stored in the last field. Note that they are stored after consecutive even records. 这些记录的地址存储在核心存储器中偶数记录之后的位置,并从第一个偶数记录开始。在图2中,它们存储在最后一个字段中。请注意,它们存储在连续的偶数记录之后。
  3. The records Y~ , Y2, Y3, “” , Y~/4 are compared sequentially in overlapping groups of two, and each new selected record address is added to the list of addresses obtained in 2 above, e.g. YN/4+i, YN/4+2, etc. until N/4–1 comparisons are performed. 记录Y、Y2、Y3、“”、Y/4按两个重叠组的顺序进行比较,每个新选择的记录地址被添加到上面2中获得的地址列表中,例如YN/4+i、YN/4+2等,直到执行N/4-1比较。
    The addresses of selected records continue to be stored in core memory in locations following even nmnbered records. 所选记录的地址继续存储在核心存储器中,甚至存储在未命名记录之后的位置。
    During initialization the total number of comparisons is equal to N–1. In Figure 2 the number of comparisons, for example, is 6+3+2 = 11. 在初始化期间,比较的总数等于N-1。例如,在图2中,比较次数为6+3+2=11。

在这里插入图片描述
The last address (from the last test) contains the address of the smallest record. The selected record is transferred to an output area and new records are inputed and processed by record selection as described below until initialization must re-occur. 最后一个地址(来自上次测试)包含最小记录的地址。所选记录被转移到输出区域,并通过如下所述的记录选择来输入和处理新记录,直到必须重新进行初始化。
Figure 2 shows that Record 5 has been selected as the next smallest record and has been transferred to an output area. Record 5 is replaced in memory with the next input record (record 13) and the process of comparing of the key of this record against the other records in memory is called the record selection logic.
图2显示,记录5已被选为下一个最小的记录,并已被转移到输出区域。用下一个输入记录(记录13)替换存储器中的记录5,并且将该记录的关键字与存储器中的其他记录进行比较的过程被称为记录选择逻辑。
During the initialization, the values 22 and 33 are entered in the first two fields of every four locations beginning with the 2nd and 4th respectively. The 3rd field indicates where the result of a comparison should be stored and is calculated only once (norreally at the beginning of the program).
在初始化期间,值22和33被输入到每四个位置的前两个字段中,分别从第二个和第四个开始。第三个字段表示比较结果应该存储在哪里,并且只计算一次(也不是在程序开始时)。
在这里插入图片描述
在这里插入图片描述
RECORD SELECTION LOGIC. Each new incoming record goes through a sequence of tests until the record with next smallest key is selected. A record is selected by a comparison of two records at the 1st level, another comparison of two records at the 2nd level, until the last level is reached. 记录选择逻辑。每个新传入的记录都要经过一系列测试,直到选择了具有下一个最小关键字的记录。通过比较第一级的两个记录,再比较第二级的两条记录,选择一条记录,直到达到最后一级。

The number of levels is equal to the smallest integer greater to log2N. In the example, N is 12; since log2 12 ~ 3.59, the number of levels is 4. Note that all records do not go through four levels (records 9 through 12). The average number of tests per record is approximately log2N. The actual average number of tests are 3.66, rather than 3.59. (See footnote 1.) 级数等于大于log2N的最小整数。在该示例中,N为12;自log212~3.59以来,级别数为4。请注意,并非所有记录都经过四个级别(记录9到12)。每个记录的平均测试次数约为log2N。实际的平均测试次数是3.66次,而不是3.59次。(见脚注1。)

On the first level, a record is compared to the record located below it or above it in memory. 在第一级,将记录与内存中位于其下方或上方的记录进行比较。

An indicator in the word below the incoming record (as shown in Figure 3) is tested to determine the location of the records “paired partner.” Paired partners are two records which will lie in adjoining locations in core memory and which are compared in the initialization phase. In the example shown, a “22” or “33” in characters 1 and 2 of the work area below each record indicates that the records partner has a lower core location. If characters 1 and 2 are less than 22, then a records partner has a higher core location. Since a records partner is either the record above or below it, one test can determine the assigned partners. 测试传入记录下方单词中的指示符(如图3所示),以确定记录“配对伙伴”的位置。配对伙伴是位于核心内存中相邻位置的两个记录,在初始化阶段进行比较。在所示的示例中,每条记录下方工作区域的字符1和2中的“22”或“33”表示记录伙伴的核心位置较低。如果字符1和2小于22,则记录伙伴具有更高的核心位置。由于记录伙伴是它上面或下面的记录,一个测试可以确定指定的伙伴。

The location between the two records is also used to indicate whether a record can become part of the existing sequence. The third field in the result word indicates an address in a lower level. The lower level address contents are modified and a new partner for the selected record is determined by examining an indicator in the result word. The selected records partners address at each subsequent level will be stored in a result word that is in another even location. A “22” indicates the result word has a lower core location; “33” indicates the result word has a higher core location. The process continues until the last level is reached. 两个记录之间的位置也用于指示一个记录是否可以成为现有序列的一部分。结果字中的第三个字段指示较低级别中的地址。修改较低级别的地址内容,并通过检查结果字中的指示符来确定所选记录的新伙伴。每个后续级别的选定记录伙伴地址将存储在另一个偶数位置的结果字中。“22”表示结果字具有较低的核心位置;“33”表示结果词具有较高的核心位置。该过程一直持续到达到最后一个级别。

After each record is thus selected, it is transferred to an output area and the next input record is moved into the area just vacated. Therefore, each record (if blocked input) is moved to a work area and, after selection, to an output area. ~ 在这样选择每个记录之后,它被转移到输出区域,并且下一个输入记录被移动到刚刚空出的区域中。因此,每个记录(如果输入被阻止)都被移动到一个工作区域,并在选择后被移动到输出区域。

New records can become part of the existing string of outputs if its key is higher than the key of the other records already selected. If the key of the incoming record is lower than the last one selected for output, then a “1” is set up in the 1st or 2nd character of the result word for the two paired records. This indicator will indicate to the program that a record is not to be compared with the others in memory but is to be held aside until the current sequence of output is completed. When the number of records that cannot be part of the current sequence is equal to the number of records in the storage area, the program returns to the initialization logic. 如果新记录的关键字高于已选择的其他记录的关键字,则新记录可以成为现有输出字符串的一部分。如果输入记录的键低于最后一个选择输出的键,则在两个配对记录的结果字的第一个或第二个字符中设置“1”。该指示器将向程序指示,一个记录不与存储器中的其他记录进行比较,而是保留在一边,直到当前的输出序列完成。当不能成为当前序列一部分的记录数等于存储区域中的记录数时,程序返回到初始化逻辑。

Internal Sorting Using the Replacement-Selection Technique

The replacement-selection technique described in the previous section can readily be adapted to internal sorting. The main difference between the internal sort logic and the tape sort logic is that when a record is selected as the smallest it is not replaced, but its rank (the smallest record is ranked 1, the next smallest 2, and so on) is placed in the word below the selected record. If an output area is available, it may be directly placed in the output area and the exchange need not occur after the sorting. 上一节中描述的替换选择技术可以很容易地适用于内部排序。内部排序逻辑和磁带排序逻辑之间的主要区别在于,当一条记录被选为最小记录时,它不会被替换,但它的排名(最小记录排名为1,次最小的记录排名为2,依此类推)被放在所选记录下面的字中。如果输出区域可用,则可以将其直接放置在输出区域中,并且在排序之后不需要进行交换。

Except for this, the replacement-selection logic for internal sorting is similar to that described in the previous section. Each of N records are placed in memory separated by one word. A comparison of all N records is performed to produce a tree structure which related the sequence of groups of records to other groups of records. A tree structure for finding the smallest in N = 12 records would be the same as that shown in Figure 2. 除此之外,用于内部排序的替换选择逻辑与上一节中描述的类似。N个记录中的每一个都被放置在由一个字分隔的存储器中。执行所有N个记录的比较以产生将记录组的序列与其他记录组相关联的树结构。在N=12个记录中查找最小记录的树结构与图2所示的结构相同。

After N-1 tests the lowest record is selected. Its sequence (rank) number (which is 1, since this record had the lowest key) is stored with the address of the record, one word below the record. 4 At this point, an average of (log2N)-1 comparisons are required for the selection of each additional record. After all records have been sequenced and ranked, a maximum of N exchanges (based on information in the result word below each record) is required to rearrange the record. Note that the sequenced records will be separated by one word. Since in most cases, the record will be processed internally, this one-word separation will not slow down subsequent processing on the sorted records. If the sequenced records were to be written externally, the records would be moved to an output area rather than be “exchanged.” N-1测试后,选择最低记录。它的序列号(秩)是1,因为这个记录有最低的关键字)与记录的地址一起存储,在记录下面一个字。4在这一点上,选择每个附加记录需要平均(log2N)-1的比较。在对所有记录进行排序和排序后,最多需要N次交换(基于每个记录下面的结果词中的信息)来重新排列记录。请注意,按顺序排列的记录将由一个单词分隔。由于在大多数情况下,记录将在内部进行处理,因此这种一字分隔不会减缓对排序记录的后续处理。如果要从外部写入已排序的记录,则这些记录将被移动到输出区域,而不是“交换”

Bose and Nelson [3, 4] state that the P-Operator Method of Internal Sorting will require significantly fewer comparisons and exchanges of records than are required by the Exchange, Selection and Exchange, and Binary Insertion methods. Their conclusions are quite valid: given a work area that is completely filled with records, the POperator Method in general provides the best solution. However, when a small amount of memory can be sacrificed, the Replacement-Selection technique can provide a much more efficient internal sort than either of the four techniques previously published. The additional space required is one word per record. Therefore, given a memory area of size A and a record of size S, then the number of records (N) that can be sequenced in memory using the Replacement-Selection technique is A/(S+i), as compared to A/S for other techniques. Bose和Nelson[3,4]指出,与交换、选择和交换以及二进制插入方法相比,内部排序的P算子方法需要更少的记录比较和交换。他们的结论非常有效:给定一个完全充满记录的工作区域,操作员方法通常提供最佳解决方案。然而,当可以牺牲少量内存时,替换选择技术可以提供比先前发表的四种技术中的任何一种更有效的内部排序。所需的额外空间是每个记录一个字。因此,给定大小为a的存储器区域和大小为S的记录,则与其他技术的a/S相比,可以使用重新放置选择技术在存储器中排序的记录的数量(N)为a/(S+i)。

If the additional memory is not available, then N would have to be reduced by N/(S-4-1) records in order to confine the working area to A. 如果附加存储器不可用,则N必须减少N/(S-4-1)个记录,以便将工作区域限制在A。

The number of comparisons and the same number of exchanges required by the P-0perator method is given in [3, 4]. The Replacement-Selection method requires N[log2N] comparisons and N exchanges. [3,4]中给出了P操作器方法所需的比较次数和相同次数的交换。替换选择方法需要N个[log2N]比较和N个交换。

Table 1 shows a comparison of the P-Operator method and the Replacement-Selection technique for various values of N. (A memory size of NS for the P-Operator method and N(S-4-1) for the Replacement-Selection method.) 表1显示了针对N的各种值的P算子方法和替换选择技术的比较。(对于P算子方法,NS的存储器大小和对于替换选择方法,N(S-4-1)的存储器大小。)

在这里插入图片描述
REFERENCES

  1. GASSNER, BETTY JANE. Proof of a conjecture concerning sorting by replacement selection. Unpublished paper, 1958.
  2. IVERSON, KENNETH E. A Programming Language. Wiley, 1962, p. 239.
  3. BOSE, R. C., AND NELSON, R.J. A sorting problem. J. ACM 9 (1962), 282-296.
  4. SHELL, D.L. A high sorting procedure. Comm. ACM 2 (1959), 30-32.
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值