Reservoir Sampling - Sampling from a stream of elements(蓄水池算法,从流数据中抽样)

本文探讨了Reservoir Sampling算法的原理与实现,重点解释了如何从无限数据流中均匀抽取固定数量的样本,无需预知数据总量,仅通过一次遍历即可完成。详细阐述了算法的工作机制、复杂度分析及与随机数生成的关系。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

Problem Statement

Reservoir Sampling is an algorithm for sampling elements from a stream of data. Imagine you are given a really large stream of data elements (queries on google searches in May, products bought at Walmart during the Christmas season, names in a phone book, whatever). Your goal is to  efficiently  return a random sample of 1,000 elements  evenly distributed  from the original stream. How would you do it?

The right answer is generating random integers between 0 and N-1, then retrieving the elements at those indices and you have your answer. (Update: reader Martin astutely points out that this is sampling with replacement. To make this sampling without replacement, one simply needs to note whether or not your sample already pulled that random number and if so, choose a new random number. This can make the algorithm pretty expensive if the sample size is very close to N though).

So, let me make the problem harder. You don't know N (the size of the stream) in advance and you can't index directly into it. You can count it, but that requires making 2 passes of the data. You can do better. There are some heuristics you might try: for example to guess the length and hope to undershoot. It will either not work in one pass or will not be evenly distributed.

Simple Solution

A relatively easy and correct solution is to assign a random number to every element as you see them in the stream, and then always keep the top 1,000 numbered elements at all times. This is similar to how mysql does "ORDER BY RAND()" calls. This strategy works well, and only requires additionally storing the randomly generated number for each element.

Reservoir Sampling

Another, more complex option is reservoir sampling. First, you want to make a reservoir (array) of 1,000 elements and fill it with the first 1,000 elements in your stream. That way if you have exactly 1,000 elements, the algorithm works. This is the base case.

Next, you want to process the i'th element (starting with i = 1,001) such that at the end of processing that step, the 1,000 elements in your reservoir are randomly sampled amongst the i elements you've seen so far. How can you do this? Start with i = 1,001. With what probability after the 1001'th step should element 1,001 (or any element for that matter) be in the set of 1,000 elements? The answer is easy: 1,000/1,001. So, generate a random number between 0 and 1, and if it is less than 1,000/1,001 you should take element 1,001. In other words, choose to add element 1,001 to your reservoir with probability 1,000/1,001. If you choose to add it (which you likely will), then replace any element in the reservoir chosen randomly. I've shown that this produces a 1,000/1,001 chance of selecting the 1,001'th element, but what about the 2nd element in the list? The 2nd element is definitely in the reservoir at step 1,000 and the probability of it getting removed is the probability of element 1,001 getting selected multiplied by the probability of #2 getting randomly chosen as the replacement candidate. That probability is 1,000/1,001 * 1/1,000 = 1/1,001. So, the probability that #2 survives this round is 1 - that or 1,000/1,001.

This can be extended for the i'th round - keep the i'th element with probability  1,000/i  and if you choose to keep it, replace a random element from the reservoir. It is pretty easy to prove that this works for all values of i using induction. It obviously works for the i'th element based on the way the algorithm selects the i'th element with the correct probability outright. The probability any element before this step being in the reservoir is 1,000/(i-1) . The probability that they are removed is  1,000/i  * 1/1,000 = 1/i. The probability that each element sticks around given that they are already in the reservoir is  (i-1)/i  and thus the elements' overall probability of being in the reservoir after i rounds is  1,000/(i-1)  *  (i-1)/i  =  1,000/i .

This ends up a little complex, but works just the same way as the random assigned numbers above.
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值