A forum for reverse engineering, OS internals and malware analysis 

Forum for discussion about user-mode development.
 #16541  by Tigzy
 Sat Nov 10, 2012 3:05 pm
Hi.

I'm looking for some documentation on the bests ways to find a signature in a file (PE, but nevermind).
Currently, I'm doing this:
Code: Select all
open file
parse sections
foreach section in sections
    foearch byte in bytes
       foreach signature in signatures
          memcmp(...)
       end
    end
end
This is quite fast cause I got less than 20 signatures to search, but in a whole database this can be very long I guess.
I'm passing all my signatures to SQLite, and I wonder if there are good ways to query the database for some signatures match.
How do AV product do to handle their huge databases?
 #16545  by Vrtule
 Sat Nov 10, 2012 8:44 pm
Hello,

you can probably improve your searching by using one of more advanced algorithm originally designed to search in text data. For example:

Rabin-Karp (the signatures need to be of equal length)
http://en.wikipedia.org/wiki/Rabin-karp

Aho-Corrasick (the signatures need not to be of equal length, howerver, the implementation is more complicated)
http://en.wikipedia.org/wiki/Aho-Corasick

Both algorithms allow you to match against multiple signatures at one step and they need to have all signatures, you are matching against, in memory.

I admit I know absolutely nothing about searching for signatures. But I think you may need to divide the signatures to certain classes. Signatures of certain class may appear only certain places (sections) which may also reduce the search space. But I have no practical experience with this topic, so be careful when reading these lines.
 #16547  by xdeadcode
 Sat Nov 10, 2012 10:13 pm
Hi Tigzy,

Well for sure this is not a trivial topic. I can suggest to look at clam source code (http://www.clamav.net/lang/pl/download/sources/) they have some interesting solutions out there (also look at what types of signatures they have).

To algorithms that are very useful for this kind of tasks I would add bloom-filter (here is easy description with example http://billmill.org/bloomfilter-tutorial/) and boyer-moore algorithm (http://en.wikipedia.org/wiki/Boyer%E2%8 ... _algorithm). Also take a look at sources of http://aluigi.altervista.org/mytoolz/signsrch.zip.

Also think if you really need to invent very complicated solution for this. Everything depends on how much signatures you will have in your database and how often users will need to use your soft - if your code will be used only from time to time (e.g because this is cleaner) then maybe implementation of such complicated piece of code will be pointless.

Best regards,
 #16548  by wacked2
 Sat Nov 10, 2012 10:48 pm
If you only need the signatures at the entrypoint (or any point for that matter) you could either order them ascending and then do a binary search for the only matching signatures.
Depending on the signature you could maybe even use a hashmap (unlikely - if the signature contains no XX then yes)

Though I think you the main bottleneck is that you try to match all bytes. You should cut down on that
The signature probably is in a fixed place (entrypoint?!) - WAY less bytes to compare

If you really need to check all bytes: Order your signatures so you only select a subset for each check. For that you probably need a special kind of database, SQLLite defiantly will need way to long to search through all entries and find the right one(s).
I don't know if you can really afford to spent that time (probably not :) ) but why not write a small own DB? On start it loads all the sigs and orders them in memory - needs time to 'initialize' but fast checking.
Then a small lookup table 'First byte' -> 'List of signatures' (or maybe even 2bytes? When you're having >50MB of signatures another 256^2*Pointersize bytes arn't that much anymore) and the actual search should be lightning fast

I have no idea whether that or the various more efficiently string comparing algorithms will offer a better result and which will take less time.

Edit: Thanks xdeadcode, that signsearch seems very interesting.
 #16551  by Tigzy
 Sun Nov 11, 2012 5:23 am
Thanks everyone!
I'll have a look to your algos
Yeah I was already on my way to look at the clamav source code.

For the moment I currently load every sig into some objects cause sqlite store strings and I convert it into bytes arrays
I will maybe make a map with index

This is more likely to understand professional algorithms than to make an efficient one for my soft ;)
 #18302  by Tigzy
 Fri Feb 22, 2013 11:49 am
Hi.

Found some good way to deal with SQLite database to keep signatures:

- Signatures are stored in string format in the database, including regex wildcards
- use <regex> lib to find matches in a string converted hex buffer

PseudoCode:
Code: Select all
BYTE* somebuffertoanalyse = Readsomething(...);
string bufferStr = ConvertHexBufferToString(somebuffertoanalyse);

foreach (string sig in database.signatures)
{
      regex_match matches;
      if (regex_search(bufferStr, regex(sig), &matches))
      {
          //we match - do whatever with the matches
      }
}