Stuff.

Feb. 6th, 2009 05:41 pm
swestrup: (Default)
[personal profile] swestrup
Haven't posted anything of substance this month. I actually haven't had a lot to say. The PVR thing is still coming along, and is taking much, Much, MUCH longer than I had hoped because I keep finding when I go to do step 7 that the docs tell me I should have done something different in step 2. I've now recompiled all of Gentoo at least three times. This all comes from having relied on autoinstallers for Linux in the past and not having actually known what relies on what or how various strange pieces of hardware are detected and drivers located for them. Quite often the answer seems to be to compile 200 drivers, try them in succession to see which ones work, choose whichever seems to have the most features, and delete the rest. Repeat for each unusual device on your system. But, in the end, progress is still being made and more and more of the system is actually starting to work. Lets hear it for the Gentoo documentation which seems heads-and-shoulders above anything else I've encountered in Linux so far.

In other news I went out and bought a set of headphones for skype so that I can join a tele-RPG that a friend is hosting, but I've yet to get either the headphones or Skype to work properly on this system. I gather that the first can be solved by obtaining a thorough understanding of ALSA and then going and tweaking the necessary settings. Do I have to point out, once again, that under Windows one just plugs in headphones and they work, without having to spend days learning the intricacies of the internal mixing software? As for getting skype working, I'm told its possible on a 64-bit system, but not necessarily easy. We'll see.

And then there's Anticipation. Where to start? I've been getting more and more back into helping out on the website, and there is a huge amount to do between now and the WorldCon this fall. Also, there's a steady (and ever increasing) stream of minutia that I have to take care of with regards to people's accounts and email and mailing lists and so on. I've started taking steps to offload this stuff because I find it very hard to keep switching tasks every few minutes. I'll never get anything accomplished that way.

On the intellectual front, I've been thinking about ways to find similar bit strings (in terms of Hamming Distance) in a huge list. I first came across this problem when I was working for Softguard around 1999, and I never did like the quick-and-dirty solution we came up with. Its been bugging me ever since, but I recently had my interest re-ignited when I found a piece of public-domain image classification software that generates 256-bit signatures for images and then performs an O(N^2) set of comparisons to see which are within 25 bit-flips of each other. The author is actively soliciting better ways of finding matches, and I'd like to help.

Anyway, my thoughts on the matter started with the most general solution that I could think of, which involved N-dimensional Space-Filling Curves, but googling on that matter lead me to the conclusion that most of what I was interested in was still in the realm of ongoing research. (My interest was because a space-filling curve changes an n-dimensional space into a 1-dimensional space (distance along the curve) and there are space-filling curves that are pretty good at maintaining locality, such that two points near in space are near on the curve -- this would give me a sorting order that would put nearby items near each other in the sort.)

Next I considered just the space of interest, the Q_n, the N-dimensional unit hypercube which is isomorphic to GF(2)^n, the n-bit Galois Field (of which I know far too little). In trying to imagine a space-filling curve that passed through every point in Q_n, I found that I was imagining a Hamiltonian Cycle which is describable by a Gray Code. I got very excited by this as I was originally of the impression that Gray Codes generally maintained locality. Well, they don't. At least the 'standard' Reflected Gray Codes don't. It turns out that there are Many MANY types of Gray Codes that have been invented and I'm still trying to figure out if one exists that meets my needs, or if anyone has shown that no such code exists, or what. It turns out to be a huge area of ongoing research as well.

In any case, lets assume that some Gray Code does work as a suitable locality metric. Lets call it MGC for "Metric Gray Code". Then one can transform the NxN comparison of thousands of 256-bit signatures as follows:

1) For a given threshold of bit flips B, determine 'T' the difference in MGC indexes that is guaranteed to contain all differences of B flips or less.
2) For each signature, find its positional index in the MGC.
3) Sort the list of signatures by MGC index.
4) Start with an empty list of comparison candidates. Go to the start of the sorted list.
5) Consider 'n' the next item on the sorted list. Discard all comparison candidates that differ by more than T from 'n'.
6) If the list of candidates is non-empty, compare 'n' to each item on the list in turn, outputting their differences.
7) Add the item 'n' to the list of candidates.
8) If we're not at the end of the sorted list, advance one position and go to step 5.

This will output all signatures that are within the threshold limit, accompanied by the exact differences. Nifty no?

Now if I can only find a suitable MGC, I will be happy indeed!


Hmm. For someone with 'nothing much' going on in his life, this turned out to be a pretty big post!

Date: 2009-02-07 12:19 am (UTC)
From: [identity profile] sps.livejournal.com
Why would such a thing exist, though? I don't see it restricting the search space any differently than hashing on substrings (the approach you say you don't like) does, because the input bits are symmetric in this phrasing of the problem.

Date: 2009-02-07 02:36 am (UTC)
From: [identity profile] sps.livejournal.com
The false negative rate is a function of the distance that is acceptable. For short distances, you can bring it to zero. Remember that we were looking at patterns like "anything written in Spanish" that had immense distances. Here, you want distance 25 on 256 bits, which means that 10-bit covering substrings are guaranteed to hit. It doesn't solve the problem, but it reduces it by some orders of magnitude, which is I think as good as a reordering method can do.

January 2017

S M T W T F S
1234567
891011121314
15161718192021
22232425262728
293031    

Most Popular Tags

Style Credit

Expand Cut Tags

No cut tags
Page generated Feb. 13th, 2026 09:34 pm
Powered by Dreamwidth Studios