Talk:Pending ideas

From The Okopipi Wiki


DHT code

I propose a DHT-based network topology, since it is the best way to handle any kind of DDoS attack, ruthlessly cost-efficient, very fast and immensely scalable. The easiest way to implement a DHT would be to use code from the BSD-licensed (meaning that you have the right to share it, change it, and share your changes to the code) Bamboo DHT. The Bamboo DHT is optimized for high node churn and is publicly accessible through OpenDHT.

"This service model of OpenDHT usage greatly simplifies deploying client applications. By using OpenDHT as a highly-available naming and storage service, clients can ignore the complexities of deploying and maintaining a DHT and instead concentrate on developing more sophisticated distributed applications." - OpenDHT

Similarity vs Compression

I read quite a while ago that you can use compression as a very effective tool to find out how similar two documents are. (I suppose you could use it for more than two, but that is not relevant here.) It is probably much cheaper computationally than messing with hashes, finding and using the correct algo, etc., and hashes would probably be less accurate than the compression method anyway!

So I just did some tests and it works great! But, it is _very_ dependent on the compression method.

I chose to use winrar (I haven't yet tested zip).
There were two variables: compression method (best, normal), and archive type (solid, non-solid).

I initially used 4 files. They had the form:
$a := $text "word"
$b1 := $text "ward"
$b2 := $text "wabd"
$b3 := $text "fish"

(i.e. the body is the same, but with a different word of the same size attached to the end.)
The file size of each was: 2,668 bytes.

The results:
2,870 a_a_nonsolid_normal.rar
2,871 a_b1_nonsolid_normal.rar
2,871 a_b2_nonsolid_normal.rar
2,870 a_b3_nonsolid_normal.rar

2,492 a_a_nonsolid_best.rar
2,492 a_b1_nonsolid_best.rar
2,492 a_b2_nonsolid_best.rar
2,491 a_b3_nonsolid_best.rar

1,517 a_a_solid_normal.rar
1,518 a_b1_solid_normal.rar
1,519 a_b2_solid_normal.rar
1,518 a_b3_solid_normal.rar

1,327 a_a_solid_best.rar
1,330 a_b1_solid_best.rar
1,331 a_b2_solid_best.rar
1,332 a_b3_solid_best.rar

So you can see, the results suck for the first three cases, but work great for the last one: solid archive, best compression.

I then did a couple more quick tests using solid, and best.
$b4 := $text "fishfish"
$b5 := $text (with brain -> apple) {brain is somewhere in the middle of $text}

Filesizes:
2,672 b4.txt
2,668 b5.txt

Results:
1,336 a_b4_solid_best.rar
1,340 a_b5_solid_best.rar

So, in conclusion, using winrar, solid archive, with the best compression option, is a very accurate test of similarity between two files. If you have any interesting counter-examples, please post here. Simul 04:55, 28 May 2006 (PDT)

BTW, it is not surprising that the solid archive method should be used. (What is a solid archive?) I am not so sure why the most aggressive compression method is needed over normal compression. -Simul.


If it were to use a similar hashing method it would surely run into licensing issues. Therefore 7Zip (7Zip Site) or something with an open license should be used. The compression itself should also be highly optimized, as otherwise it would bog down users that receive a lot of spam.

Also, the comparison of hash codes should be done as fast as possible, preferably multi-threaded, for several reasons:
1. It would not bring a system to a grinding halt
2. If one such item is spoofed in such a way that it compromises the archiving utility, the other threads would still continue.

The last item in itself also brings forth the necessity of running this in a memory space with low, if not no, user privileges, so that if it is compromised or spoofed, the user's / admin's PC / server is not compromised.

I do however like the simplicity of this solution. --Aprazeth 14:09, 28 May 2006 (PDT)
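A minimal sketch of the multi-threaded comparison suggested above, in Python. The names `extra_bytes` and `closest_match` are hypothetical, and `lzma` from the standard library stands in for whichever open-licensed compressor ends up being used. Note that threads only keep an ordinary error in one comparison from stopping the others; they do not isolate a crashed or compromised compressor, which would need separate low-privilege processes.

```python
import lzma
from concurrent.futures import ThreadPoolExecutor


def extra_bytes(sample: bytes, message: bytes) -> int:
    """Bytes the message adds when compressed together with the sample.

    A small value suggests the message is close to the sample.
    """
    alone = len(lzma.compress(sample))
    joined = len(lzma.compress(sample + message))
    return joined - alone


def closest_match(message: bytes, known: dict, workers: int = 4):
    """Compare message against every known sample on a thread pool.

    known maps a label to the sample bytes. Returns (label, extra_bytes)
    for the closest sample, or None if no comparison succeeded.
    """
    def score(item):
        label, sample = item
        try:
            return label, extra_bytes(sample, message)
        except Exception:
            # One failing comparison should not stop the others.
            return label, None

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(score, known.items()))

    scored = [(label, size) for label, size in results if size is not None]
    return min(scored, key=lambda r: r[1]) if scored else None
```

The standard-library compressors release Python's global lock while compressing, so the threads can actually run side by side on a multi-core machine.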



I did further tests using the same test files as above, using various compression algorithms. None of them worked quite as well as rar. Here are the results for 7Zip.

1,574 a_a_solid.7z
1,584 a_b1_solid.7z
1,584 a_b2_solid.7z
1,586 a_b3_solid.7z
1,589 a_b4_solid.7z
1,589 a_b5_solid.7z

So it doesn't distinguish between all the cases like rar, but I'd say it is close enough.

But CPU consumption may indeed be a problem if there is a lot of spam. I don't think there is any way around that, other than running it in the background when the computer is idle. --Simul 02:23, 29 May 2006 (PDT)

OK. I can't leave this one alone. I did some more quick tests. For compression algorithms that don't support the solid option, one can fudge it just by joining the two files together and then compressing them as one file. Here are the results for bzip2 and 7zip.

1,663 a_a.txt.bz2
1,666 a_b1.txt.bz2
1,667 a_b2.txt.bz2
1,667 a_b3.txt.bz2
1,666 a_b4.txt.bz2
1,669 a_b5.txt.bz2

1,531 a_a.7z
1,533 a_b1.7z
1,533 a_b2.7z
1,534 a_b3.7z
1,536 a_b4.7z
1,537 a_b5.7z

Respective command lines are:
bzip2 -k -9 a_b5.txt
7z u a_b5.7z a_b5.txt

OK. I am done. 7Zip should work just fine, and no license problems. --Simul 08:15, 29 May 2006 (PDT)
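The join-then-compress trick can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names are hypothetical, and the standard-library `bz2` and `lzma` modules stand in for the bzip2 and 7z command lines, so the byte counts will not match the tables above.

```python
import bz2
import lzma


def joined_size(a: bytes, b: bytes, compress=bz2.compress) -> int:
    """Compressed size of the two inputs joined into one stream."""
    return len(compress(a + b))


def extra_bytes(a: bytes, b: bytes, compress=bz2.compress) -> int:
    """How much bigger a+b compresses than a+a.

    This mirrors comparing a_b1 against a_a in the tables:
    the smaller the difference, the more similar b is to a.
    """
    return joined_size(a, b, compress) - joined_size(a, a, compress)


def more_similar(a: bytes, b1: bytes, b2: bytes, compress=bz2.compress) -> bytes:
    """Return whichever of b1 and b2 compresses better together with a."""
    if joined_size(a, b1, compress) <= joined_size(a, b2, compress):
        return b1
    return b2
```

Passing `compress=lzma.compress` switches the same functions over to the LZMA family used by 7Zip.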


So you know, there are many ways to create just as efficient archives. My favourite is afio (GPL'd). You can tell it "files smaller than $blah, stuff as is. Files larger than $blah, compress." This works out to be very computationally efficient and produces compressionally optimized archives. fwiw. (tqk)

Timeline

I'm not a programmer, but for the common folk like me, can anyone put out a development timeline?
