Phishing detection on webpages in European non-English languages based on machine learning

Komosný, Dan

doi:10.1038/s41598-025-21384-w

Files

s4159802521384w.pdf (2.08 MB)

Date

2025-10-27

Authors

Komosný, Dan

Publisher

Springer Nature

ORCID

0000-0002-6551-7997

Abstract

Machine learning-based phishing detection is crucial for preventing zero-day attacks. State-of-the-art phishing detection performs well on English webpages. However, it is not adequately accurate for webpages in minor languages. This work significantly improves phishing detection in European countries where minor languages are predominantly spoken. The proposed language-based phishing detection outperforms state-of-the-art methods for webpages in minor languages, achieving 99% accuracy on local webpages in the following countries: the Czech Republic, Denmark, Estonia, Croatia, Hungary, Lithuania, Latvia, Poland, Romania, Serbia, Slovakia, and Slovenia. The main improvement lies in reducing the false positive rate, where local benign webpages are incorrectly identified as phishing. The proposed method reduces the false positive rate by up to a factor of 10 for webpages in minor languages. The results are statistically robust across different webpage sets, scaled for real-world use. The Shapiro-Wilk test confirms a normal distribution of the results in each tested country, with p-values consistently above 0.05. Paired T-tests yield p-values consistently below 0.05, indicating statistically significant improvements in each country. The low variability in the results further demonstrates the robustness of the proposed method. To foster reproducibility, the source code, datasets, and raw results are made publicly available, including approximately two million local webpages from 16 countries. Reference data from English-speaking and major-language-speaking countries are also included. This work advances phishing detection for webpages in minor languages and contributes to more inclusive global cybersecurity.
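
The two quantities the abstract relies on, the false positive rate and the paired t-test on per-run results, can be illustrated with a minimal sketch. It is not the author's published code: the function names, the counts, and the per-run values below are hypothetical and chosen only to make the arithmetic visible. The Shapiro-Wilk normality check is not reproduced here.

```python
from math import sqrt
from statistics import mean, stdev


def false_positive_rate(false_positives: int, true_negatives: int) -> float:
    """Share of benign webpages that were incorrectly flagged as phishing."""
    return false_positives / (false_positives + true_negatives)


def paired_t_statistic(baseline: list[float], proposed: list[float]) -> float:
    """t statistic of the paired differences (baseline minus proposed)."""
    diffs = [b - p for b, p in zip(baseline, proposed)]
    # Mean difference divided by its standard error (sample stdev / sqrt(n)).
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Hypothetical counts for one country; these are NOT results from the paper.
baseline_fpr = false_positive_rate(false_positives=50, true_negatives=950)
proposed_fpr = false_positive_rate(false_positives=5, true_negatives=995)
reduction_factor = baseline_fpr / proposed_fpr

# Hypothetical false positive rates of two detectors on the same five webpage sets.
baseline_runs = [0.050, 0.048, 0.052, 0.051, 0.049]
proposed_runs = [0.005, 0.006, 0.004, 0.005, 0.005]
t_stat = paired_t_statistic(baseline_runs, proposed_runs)

print(f"FPR reduction factor: {reduction_factor:.1f}")
print(f"paired t statistic (4 degrees of freedom): {t_stat:.1f}")
```

With five paired runs there are four degrees of freedom, so a t statistic whose magnitude exceeds about 2.776 corresponds to a two-sided p-value below 0.05. In practice a statistics library would report the p-values directly; for example, SciPy provides `scipy.stats.ttest_rel` for the paired test and `scipy.stats.shapiro` for the Shapiro-Wilk test.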

Keywords

Language, Phishing Detection, Machine Learning, False Positive Rate, Webpage URL, Cybersecurity

Citation

Scientific Reports. 2025, vol. 15, issue October, p. 1-14.
https://www.nature.com/articles/s41598-025-21384-w

Document type

Peer-reviewed

Document version

Published version

Language of document

en

DOI

10.1038/s41598-025-21384-w

URI

http://hdl.handle.net/11012/255604

Collections

Ústav telekomunikací (Department of Telecommunications)

Creative Commons license

Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International

Except where otherwise noted, this item's license is described as Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International
