Logarithmic Thinning.
Nov. 19th, 2006 05:44 pm

I'm about to geek out here on how to retain logarithmically distributed samples in a sparse array as part of a linear sampling process. If that sentence means nothing to you, or if you couldn't care less about the subject, don't bother to click.
Several ages ago, sps mentioned that he wanted to create a Linux window widget that was a logarithmically plotted CPU use meter. Now, it's easy enough to produce a log plot given several billion evenly sampled points, but this tends to waste massive amounts of memory. sps opined that it should be relatively easy to thin the samples as they were stored, so that only those actually needed to produce a log plot would be retained. I never could figure out how that might be accomplished, so I asked him about it last week, and he drew this diagram:
0) #
1) ##
2) ###
3) # ##
4) # ###
5) # # ##
6) # # ###
7) # # ##
8) # # ###
9) # # # ##
10) # # # ###
11) # # # ##
12) # # # ###
13) # # # # ##
14) # # # # ###
15) # # # ##
16) # # # ###
17) # # # # ##
18) # # # # ###
19) # # # # ##
20) # # # # ###
21) # # # # # ##
22) # # # # # ###
23) # # # # ##
24) # # # # ###
25) # # # # # ##
26) # # # # # ###
27) # # # # # ##
28) # # # # # ###
29) # # # # # # ##
30) # # # # # # ###
The above shows the contents of the sparse array after each successive addition of a sample, with logarithmic thinning. Each '#' indicates a retained sample, and each space a discarded one, so entry 5, which looks like '# # ##', shows that after storing six samples we're only retaining samples 0, 2, 4, and 5.
Stephen wasn't sure exactly what algorithm would generate the above efficiently, but he surmised that it would be something simple based on the bit patterns of the sample indexes. He was right. It took me all day yesterday, but I managed to figure out how to generate the above iteratively as samples are stored. In pseudocode, it goes something like this:
1. Add the latest sample to the end of the array. Call its index idx; idx's least significant bit is labelled bit 0.
2. Initialize an offset value to 2. Call it ofs.
3. Initialize a bit number to 0. Call it bit.
4. If bit number bit of idx is 0, or ofs is greater than idx, go to step 8.
5. Remove the sample with index idx-ofs.
6. Set ofs equal to 2*ofs+1. This is equivalent to shifting ofs left by one bit and setting the new lowest bit to 1.
7. Set bit equal to bit+1, and go to step 4.
8. Finished.
# logthin takes a sample reference, and a sparse array (hash) of samples.
# It adds the sample to the end of the array, and then thins the array so
# as to retain only a logarithmic number of samples as one goes further
# back in the array.
#
# Yes, there are more efficient ways to implement this.
sub logthin($\%)
{   my ($sample, $hash) = @_;
    my @keys = sort {$a <=> $b} keys %$hash;
    my $idx = @keys ? $keys[$#keys]+1 : 0;   # next index past the newest sample
    $$hash{$idx} = $sample;
    my $ofs = 2;
    my $bit = 0;
    # While bit number $bit of $idx is 1 and the target index $idx-$ofs
    # is still non-negative, delete the sample $ofs slots back.
    while( $idx & (1<<$bit) && $idx >= $ofs )
    {   delete $$hash{$idx-$ofs};
        $ofs += $ofs+1;    # ofs = 2*ofs+1: shift left, set new low bit
        $bit++;
    }
}
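To sanity-check the routine, here's a small driver (my own test scaffolding, not part of the original widget) that feeds logthin the indexes 0 through 30 and prints one row per sample; its output should match the diagram above:

my %samples;
for my $n (0 .. 30)
{   logthin($n, %samples);   # the ($\%) prototype passes %samples by reference
    my $row = '';
    $row .= exists $samples{$_} ? '#' : ' ' for 0 .. $n;
    printf "%2d) %s\n", $n, $row;
}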
This algorithm could still use some tweaking, which I am leaving as an exercise to the reader:
- While it's relatively easy to create a variant that retains fewer samples (by only storing every Nth sample), it's harder to create one that stores more. This version will only retain ~32 samples out of 4 billion. If one wanted to store a larger number, like 500, one would need a log base of something like 1.045 rather than 2. I'm not yet sure how one would do that.
- The number of samples retained is not monotonically increasing. This can make it difficult to deal with in a dynamic display situation where you only show the last N samples, since sudden decreases in the sample count can cause jitter. This can be solved by changing the while loop to only mark entries for deletion, and then deleting the oldest marked entry (if any) every time a sample is added, as sketched below.
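Here's a rough sketch of that second variant. The name logthin_deferred and the separate %pending hash for holding the deletion marks are my own additions, not part of the original:

# Same loop as logthin, but deletions are deferred: the while loop only
# marks entries in %$pending, and each call then removes at most one
# (the oldest) marked entry, so the retained count can't suddenly drop.
sub logthin_deferred($\%\%)
{   my ($sample, $hash, $pending) = @_;
    my @keys = sort {$a <=> $b} keys %$hash;
    my $idx = @keys ? $keys[$#keys]+1 : 0;
    $$hash{$idx} = $sample;
    my $ofs = 2;
    my $bit = 0;
    while( $idx & (1<<$bit) && $idx >= $ofs )
    {   $$pending{$idx-$ofs} = 1;    # mark for deletion instead of deleting
        $ofs += $ofs+1;
        $bit++;
    }
    my @marked = sort {$a <=> $b} keys %$pending;
    if (@marked)
    {   delete $$hash{$marked[0]};   # delete only the oldest marked entry
        delete $$pending{$marked[0]};
    }
}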
Okay. Geek content is finished. You may look again.
no subject
Date: 2006-11-20 01:41 pm (UTC)

The rule was this:
- Number backups (starting at 1).
- For each power of two, retain its most recent odd multiple.
1
1 2
. 2 3
. 2 3 4
. 2 . 4 5
. . . 4 5 6
. . . 4 . 6 7
. . . 4 . 6 7 8
. . . 4 . 6 . 8 9
etc. Not quite as retentive as yours. I suppose you could start numbering at zero and retain backup number zero forever.
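For what it's worth, here's one way to compute that rule directly. This is my own sketch in the same Perl style, not the commenter's code; for each power of two up to n, it keeps the largest odd multiple not exceeding n:

# For each power of two $p up to $n, keep the most recent (largest)
# odd multiple of $p that is <= $n; returns the retained backup numbers.
sub odd_multiple_rule
{   my ($n) = @_;
    my %keep;
    for (my $p = 1; $p <= $n; $p *= 2)
    {   my $m = int($n / $p);    # largest multiple of $p that is <= $n
        $m-- unless $m % 2;      # back up to an odd multiple
        $keep{$m * $p} = 1 if $m >= 1;
    }
    return sort {$a <=> $b} keys %keep;
}

Calling odd_multiple_rule(9), for instance, returns (4, 6, 8, 9), matching the last row of the diagram above.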