Phishing detection on webpages in European non-English languages based on machine learning

Abstract

Machine learning-based phishing detection is crucial for preventing zero-day attacks. State-of-the-art phishing detection performs well on English webpages. However, it is not adequately accurate for webpages in minor languages. This work significantly improves phishing detection in European countries where minor languages are predominantly spoken. The proposed language-based phishing detection outperforms state-of-the-art methods for webpages in minor languages, achieving 99% accuracy on local webpages in the following countries: the Czech Republic, Denmark, Estonia, Croatia, Hungary, Lithuania, Latvia, Poland, Romania, Serbia, Slovakia, and Slovenia. The main improvement lies in reducing the false positive rate, that is, the share of local benign webpages incorrectly identified as phishing. The proposed method reduces the false positive rate by up to a factor of 10 for webpages in minor languages. The results are statistically robust across different webpage sets, scaled for real-world use. The Shapiro-Wilk test indicates no significant departure from a normal distribution of the results in each tested country, with p-values consistently above 0.05. Paired t-tests yield p-values consistently below 0.05, indicating statistically significant improvements in each country. The low variability in the results further demonstrates the robustness of the proposed method. To foster reproducibility, the source code, datasets, and raw results are made publicly available, including approximately two million local webpages from 16 countries. Reference data from English-speaking and major-language-speaking countries are also included. This work advances phishing detection for webpages in minor languages and contributes to more inclusive global cybersecurity.
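
The two quantities the abstract relies on, the false positive rate and the paired t-test, can be illustrated with a minimal sketch. This is not the authors' code: the function names and all numbers below are hypothetical, chosen only to show the arithmetic. In practice the p-values (and the Shapiro-Wilk check) would usually come from a statistics library such as SciPy (`scipy.stats.shapiro`, `scipy.stats.ttest_rel`); here only the t statistic is computed, using the Python standard library.

```python
import math
import statistics


def false_positive_rate(false_positives: int, true_negatives: int) -> float:
    """Share of benign pages wrongly flagged as phishing: FP / (FP + TN)."""
    benign_total = false_positives + true_negatives
    if benign_total == 0:
        raise ValueError("at least one benign sample is required")
    return false_positives / benign_total


def paired_t_statistic(baseline, proposed):
    """Return (t, degrees_of_freedom) for paired samples.

    t = mean(d) / (stdev(d) / sqrt(n)), where d = baseline - proposed.
    """
    if len(baseline) != len(proposed) or len(baseline) < 2:
        raise ValueError("need two equally sized samples with n >= 2")
    diffs = [b - p for b, p in zip(baseline, proposed)]
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    if sd_diff == 0:
        raise ValueError("all paired differences are identical")
    n = len(diffs)
    t = statistics.mean(diffs) / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical confusion-matrix counts for one country (not from the paper).
baseline_fpr_example = false_positive_rate(false_positives=50, true_negatives=950)
proposed_fpr_example = false_positive_rate(false_positives=5, true_negatives=995)
reduction_factor = baseline_fpr_example / proposed_fpr_example

# Hypothetical false positive rates over five webpage sets (not from the paper).
baseline_runs = [0.050, 0.048, 0.052, 0.049, 0.051]
proposed_runs = [0.005, 0.006, 0.004, 0.005, 0.006]
t_value, dof = paired_t_statistic(baseline_runs, proposed_runs)
```

With four degrees of freedom, a two-sided test at the 0.05 level rejects the null hypothesis when the absolute t statistic exceeds roughly 2.776, which is the sense in which a paired t-test p-value below 0.05 indicates a significant difference between the baseline and the proposed detector.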

Citation

Scientific Reports. 2025, vol. 15, issue October, 14 p.
https://www.nature.com/articles/s41598-025-21384-w

Document type

Peer-reviewed

Document version

Published version

Language of document

en

Creative Commons license

Except where otherwise noted, this item's license is described as Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International