Phishing detection on webpages in European non-English languages based on machine learning

dc.contributor.author: Komosný, Dan [cs]
dc.coverage.issue: October [cs]
dc.coverage.volume: 15 [cs]
dc.date.issued: 2025-10-27 [cs]
dc.description.abstract: Machine learning-based phishing detection is crucial for preventing zero-day attacks. State-of-the-art phishing detection performs well on English webpages. However, it is not adequately accurate for webpages in minor languages. This work significantly improves phishing detection in European countries where minor languages are predominantly spoken. The proposed language-based phishing detection outperforms state-of-the-art methods for webpages in minor languages, achieving 99% accuracy on local webpages in the following countries: the Czech Republic, Denmark, Estonia, Croatia, Hungary, Lithuania, Latvia, Poland, Romania, Serbia, Slovakia, and Slovenia. The main improvement lies in reducing the false positive rate, where local benign webpages are incorrectly identified as phishing. The proposed method reduces the false positive rate by up to a factor of 10 for webpages in minor languages. The results are statistically robust across different webpage sets, scaled for real-world use. The Shapiro-Wilk test confirms a normal distribution of the results in each tested country, with p-values consistently above 0.05. Paired T-tests yield p-values consistently below 0.05, indicating statistically significant improvements in each country. The low variability in the results further demonstrates the robustness of the proposed method. To foster reproducibility, the source code, datasets, and raw results are made publicly available, including approximately two million local webpages from 16 countries. Reference data from English-speaking and major-language-speaking countries are also included. This work advances phishing detection for webpages in minor languages and contributes to more inclusive global cybersecurity. [en]
dc.format: text [cs]
dc.format.extent: 14 [cs]
dc.format.mimetype: application/pdf [cs]
dc.identifier.citation: Scientific Reports. 2025, vol. 15, issue October, 14 p. [en]
dc.identifier.doi: 10.1038/s41598-025-21384-w [cs]
dc.identifier.issn: 2045-2322 [cs]
dc.identifier.orcid: 0000-0002-6551-7997 [cs]
dc.identifier.other: 198970 [cs]
dc.identifier.uri: http://hdl.handle.net/11012/255604
dc.language.iso: en [cs]
dc.relation.ispartof: Scientific Reports [cs]
dc.relation.uri: https://www.nature.com/articles/s41598-025-21384-w [cs]
dc.rights: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International [cs]
dc.rights.access: openAccess [cs]
dc.rights.sherpa: http://www.sherpa.ac.uk/romeo/issn/2045-2322/ [cs]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/4.0/ [cs]
dc.subject: Language [en]
dc.subject: Phishing Detection [en]
dc.subject: Machine Learning [en]
dc.subject: False Positive Rate [en]
dc.subject: Webpage URL [en]
dc.subject: Cybersecurity [en]
dc.title: Phishing detection on webpages in European non-English languages based on machine learning [en]
dc.title.alternative: Phishing detection on webpages in European non-English languages based on machine learning [en]
dc.type.driver: article [en]
dc.type.status: Peer-reviewed [en]
dc.type.version: publishedVersion [en]
eprints.grantNumber: info:eu-repo/grantAgreement/TA0/FW/FW10010014 [cs]
sync.item.dbid: VAV-198970 [en]
sync.item.dbtype: VAV [en]
sync.item.insts: 2025.12.19 21:53:51 [en]
sync.item.modts: 2025.12.19 21:32:49 [en]
thesis.grantor: Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií. Ústav telekomunikací [cs]

Files

Original bundle

Name: s4159802521384w.pdf
Size: 2.08 MB
Format: Adobe Portable Document Format
Description: file s4159802521384w.pdf