There are some optimizations in place, and ways to only match certain parts of bytes, but for all intents and purposes, this is how traditional AV works when it’s matching a signature.
With heuristic approaches, the AV matches on properties of the file rather than on specific bytes in its code.
One example of how it might work is the following:
• Does the executable import VirtualAlloc?
• Is the executable greater than 30KB and less than 75KB?
• Does the executable have a section whose permissions are read, write and execute?
• If all of these things are true, then it is a virus
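The checklist above can be sketched in a few lines of Python. This is only an illustration, not any vendor's actual engine: the `ExecutableInfo` and `Section` types are hypothetical stand-ins for whatever a real PE parser would produce, and the thresholds are the ones from the example.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Section:
    # Hypothetical, simplified view of a section's permissions.
    name: str
    readable: bool
    writable: bool
    executable: bool


@dataclass
class ExecutableInfo:
    # Hypothetical summary of an executable; a real engine would parse this from the file.
    size_bytes: int
    imports: Set[str]
    sections: List[Section]


def heuristic_match(exe: ExecutableInfo) -> bool:
    """Return True only if every rule from the example holds."""
    imports_virtualalloc = "VirtualAlloc" in exe.imports
    size_in_range = 30 * 1024 < exe.size_bytes < 75 * 1024
    has_rwx_section = any(
        s.readable and s.writable and s.executable for s in exe.sections
    )
    return imports_virtualalloc and size_in_range and has_rwx_section


rwx = Section(".text", readable=True, writable=True, executable=True)
sample = ExecutableInfo(50 * 1024, {"VirtualAlloc", "CreateFileW"}, [rwx])
# Same sample with junk data appended so it falls outside the size window.
padded = ExecutableInfo(80 * 1024, {"VirtualAlloc", "CreateFileW"}, [rwx])

sample_flagged = heuristic_match(sample)
padded_flagged = heuristic_match(padded)
```

Here `sample` satisfies all three rules, while `padded` fails the size rule alone, so the heuristic as a whole no longer fires.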
With heuristic matches, there may be up to 10 rules in place, but in the real world it is no more complicated than the example above with more rules added.
With hash-based approaches, the AV does the following:
• Take a hash over a certain area of the executable (MD5, SHA256, CRC32)
• Does the hash match the hash of a known virus?
• If yes, then it is a virus
The only way the above gets more complicated in the real world is that engines will sometimes take many different hashes across the binary and see if any of them match. For instance, an engine may cut the file into 1024-byte chunks, hash each chunk, and see if any of those hashes match a known virus.
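A minimal sketch of chunked hashing, assuming SHA-256 and fixed 1024-byte chunks aligned to the start of the file. The function names and the `known_bad` set are made up for illustration; real engines choose their own hash functions and regions.

```python
import hashlib
from typing import List, Set

CHUNK_SIZE = 1024  # chunk size taken from the example in the text


def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Hash each fixed-size chunk of the file (the last chunk may be shorter)."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def matches_known_virus(data: bytes, known_bad_hashes: Set[str]) -> bool:
    """Return True if any chunk hash appears in the known-bad set."""
    return any(h in known_bad_hashes for h in chunk_hashes(data))


# Hypothetical "malicious" chunk and the hash database built from it.
bad_chunk = b"\xcc" * CHUNK_SIZE
known_bad = {hashlib.sha256(bad_chunk).hexdigest()}

aligned = b"\x00" * CHUNK_SIZE + bad_chunk        # bad chunk sits on a chunk boundary
shifted = b"\x00" * (CHUNK_SIZE - 1) + bad_chunk  # same content, moved by one byte

aligned_hit = matches_known_virus(aligned, known_bad)
shifted_hit = matches_known_virus(shifted, known_bad)
```

In this sketch `aligned` matches, while `shifted` does not: moving the content by a single byte changes every chunk and therefore every hash.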
When other vendors say they use ML, what really happens is the following:
• Use an ML algorithm to scan lots of malicious software
• Have the ML algorithm generate a signature, heuristic, or hash as described above.
• Have humans vet the resulting signature, heuristic, or hash to make sure that nothing good is blocked
This means that other vendors’ detection algorithms at the end of the day result in just more of the same signatures the AV industry has used for several decades.
The problem with signatures is that if a single byte is changed in any of the important values, the signature no longer matches. A single recompilation with different strings easily evades most signature detection algorithms.
The problem with heuristics is that if a single one of the checks gets bypassed, such as the executable size, the entire heuristic fails. Adding data to malware can easily bypass these checks.
The problem with hashes is that if a single bit gets changed in any of the areas used to generate the hashes, the hashes are wildly different.
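To make the fragility concrete, here is a small sketch in which flipping a single bit defeats both a byte signature and a whole-file hash. The `SIGNATURE` bytes, the sample contents, and the `flip_bit` helper are all invented for the example.

```python
import hashlib

SIGNATURE = bytes.fromhex("deadbeef")  # hypothetical byte signature
PREFIX = b"header"

original = PREFIX + SIGNATURE + b"payload"
known_hash = hashlib.sha256(original).hexdigest()  # hash stored for the known sample


def flip_bit(data: bytes, byte_index: int, bit: int = 0) -> bytes:
    """Return a copy of data with one bit flipped at the given byte."""
    mutated = bytearray(data)
    mutated[byte_index] ^= 1 << bit
    return bytes(mutated)


# Flip the lowest bit of the first signature byte.
mutated = flip_bit(original, len(PREFIX))

signature_hit = SIGNATURE in mutated
hash_hit = hashlib.sha256(mutated).hexdigest() == known_hash
```

Both `signature_hit` and `hash_hit` come out `False` for the mutated sample, even though it differs from the original by exactly one bit.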
For an attacker, defeating a traditional AV signature requires knowing a single feature and changing it, which breaks the entire detection. At most, traditional AV uses a set of 5 or 10 easily changed features to convict a sample.
The reason Cylance’s detection algorithms are better is simple. Instead of a simple, straightforward, step-based algorithm, our algorithm is more like a maze with a lot of dead ends. Each feature tells you to go left or right inside the maze, and the value of goodness or badness depends on which dead end you wind up in. If you change the binary so that the algorithm goes left, left could produce a more malicious determination. Backtracking through the maze in order to ensure that your sample is good is as hard as solving a maze by taking random turns.
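The maze analogy can be illustrated with a toy decision tree. This is a sketch of the general idea only, not the actual model: the feature names, thresholds, and scores are invented, and a real model would use far more features and branches.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    score: float  # hypothetical score: higher means more malicious


@dataclass
class Node:
    feature: str
    threshold: float
    left: Union["Node", Leaf]   # taken when the feature value <= threshold
    right: Union["Node", Leaf]  # taken when the feature value > threshold


def walk(tree: Union[Node, Leaf], features: Dict[str, float]) -> float:
    """Follow left/right turns until a dead end (leaf) is reached."""
    node = tree
    while isinstance(node, Node):
        go_left = features[node.feature] <= node.threshold
        node = node.left if go_left else node.right
    return node.score


# Toy maze with invented features and scores.
TREE = Node(
    "size_kb", 75,
    left=Node("imports_virtualalloc", 0.5, left=Leaf(0.1), right=Leaf(0.8)),
    right=Node("entropy", 6.5, left=Leaf(0.3), right=Leaf(0.95)),
)

sample = {"size_kb": 50, "imports_virtualalloc": 1, "entropy": 7.2}
padded = dict(sample, size_kb=80)  # attacker pads the file to dodge a size rule

sample_score = walk(TREE, sample)
padded_score = walk(TREE, padded)
```

In this toy tree, padding the file sends the sample down a different branch where another feature decides the outcome, and the padded sample ends at a dead end with a higher score (0.95) than the original (0.8).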
In total, we’re measuring 2.7 million features right now, and each one of those features can be viewed as thousands of different turns in that maze.