This Detection System Analyzes Compression Ratios To Spot Phishing Attempts

Phishing is a fraudulent attempt to obtain sensitive information, with hackers disguising themselves as a trustworthy entity.

Often, phishing is carried out through email spoofing, instant messaging and text messaging to direct users to a fake website that matches the look and feel of the legitimate site, tricking users into entering their personal information.

Phishing is considered a social engineering trick, and is usually the easiest and fastest way to obtain sensitive information from unsuspecting victims. This is why the method is widely used by hackers of all kinds.

Researchers and tech companies have been trying to create new ways to counter phishing attempts. CSIRO's digital arm, Data61, has come up with a new approach to automatically identify phishing attempts, with a claimed higher success rate than existing methods.

Collaborating with the University of New South Wales (UNSW) and the Cyber Security Cooperative Research Centre (CSCRC), Data61 developed what it calls 'PhishZip'.

This system uses file compression algorithms to spot phishing attempts.

In a Data61 web post, research scientist Dr Arindam Pal said that:

"Previous phishing detection methods employed machine learning algorithms that used traditional classification techniques like logistic regression, support vector machines, decision trees and artificial neural networks."

"These algorithms can’t cope with the dynamic nature of phishing, which often sees fraudsters constantly change the design and hyperlink of an illicit site every few hours."

Because of this, existing anti-phishing methods, which include blacklists, content analysis platforms and web-based filters, can only provide limited protection before scammers develop new and more elaborate attacks.

In other words, existing methods can't adapt as quickly as the attacks change.

PhishZip instead applies the lossless DEFLATE file compression algorithm, a technique that encodes information using fewer bits than the original format to reduce file size, to distinguish phishing websites from their legitimate versions.

"Legitimate and phishing websites have different compression ratios," Pal said.
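To make the idea of a compression ratio concrete, here is a minimal sketch in Python using the standard zlib module, which implements DEFLATE inside a small zlib wrapper. The function name and the sample strings are assumptions made for illustration; this is not PhishZip's code, which has not been released.

```python
import zlib


def compression_ratio(text: str, level: int = 9) -> float:
    """Return compressed size divided by original size for UTF-8 text.

    Smaller values mean the text compressed better. The result includes
    the few bytes of zlib header and checksum around the DEFLATE stream.
    """
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("cannot compute a ratio for empty input")
    compressed = zlib.compress(raw, level)
    return len(compressed) / len(raw)


# Illustrative inputs only: highly repetitive text versus more varied text.
repetitive = "verify your account " * 200
varied = " ".join(str(i * i) for i in range(700))

print(f"repetitive: {compression_ratio(repetitive):.3f}")
print(f"varied:     {compression_ratio(varied):.3f}")
```

Repetitive content gives a much lower ratio than varied content. A detector built on this idea would compare ratios measured on real pages rather than on toy strings like these.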

"We then introduce a systematic process of selecting meaningful words which are associated with phishing and non-phishing websites and analyse the likelihood of those word occurrences, therefore calculating the optimal likelihood threshold."

"These words are then used as the pre-defined dictionary for our compression models and used to train the algorithm into identifying instances where a proliferation of these key words indicates a malicious website."
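One way to picture a "pre-defined dictionary" for a compression model is DEFLATE's preset dictionary feature, which Python exposes through the zdict argument of zlib.compressobj. The sketch below is an assumption about how such a comparison could look, not a description of PhishZip itself: the two word lists and the function names are hypothetical, and the decision rule (pick whichever dictionary compresses the page better) is a simplification that stands in for the likelihood threshold Pal describes.

```python
import zlib

# Hypothetical word lists for illustration only; these are NOT the
# dictionaries selected by the PhishZip researchers.
PHISHING_WORDS = b"verify account suspended password confirm urgent login update billing"
LEGITIMATE_WORDS = b"about careers privacy press investors help community developers terms"


def compressed_size(data: bytes, zdict: bytes) -> int:
    """Compress data with DEFLATE, primed with a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    return len(comp.compress(data) + comp.flush())


def looks_like_phishing(text: str) -> bool:
    """Return True if the phishing dictionary compresses the text better.

    Words already present in a preset dictionary can be encoded as short
    back-references, so the better-matching dictionary gives smaller output.
    """
    data = text.lower().encode("utf-8")
    return compressed_size(data, PHISHING_WORDS) < compressed_size(data, LEGITIMATE_WORDS)


print(looks_like_phishing("Urgent: verify your account password to confirm billing"))
print(looks_like_phishing("About us: careers, press, investors, privacy and terms"))
```

Nothing here needs a trained model or an HTML parser, since the raw page bytes go straight into the compressor. A realistic system would still need carefully chosen dictionaries and a threshold tuned on labelled pages.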

This means PhishZip should have an advantage over existing machine-learning-based anti-phishing methods, simply because it doesn't need model training or HTML parsing, in which information such as titles and headings is extracted from a webpage's HTML code.

In testing, when PhishZip was used on several phishing websites that are clones of PayPal, Facebook, Microsoft, ING Direct and other popular sites, the algorithm correctly identified 83% of phishing sites, which Data61 said is a marked improvement on existing methods.

The researchers were confident in the result, and have contributed comprehensive phishing datasets to PhishTank, a community run by OpenDNS for people to share, verify and track phishing data. This enables researchers and engineers around the world to leverage the techniques to improve the security of systems.

Since the start of the COVID-19 coronavirus pandemic, there has been a significant increase in phishing activity across the web. With work and study shifted to the home, people have increased their reliance on the internet for communication. Hackers see this as an opportunity.

“The technology could ultimately prevent significant financial losses for individuals and organisations,” Pal added.

As of the announcement, the team has yet to make PhishZip publicly available.

Published:

28/07/2020