Phishing detection on webpages in European non-English languages based on machine learning
| dc.contributor.author | Komosný, Dan | cs |
| dc.coverage.issue | October | cs |
| dc.coverage.volume | 15 | cs |
| dc.date.issued | 2025-10-27 | cs |
| dc.description.abstract | Machine learning-based phishing detection is crucial for preventing zero-day attacks. State-of-the-art phishing detection performs well on English webpages. However, it is not adequately accurate for webpages in minor languages. This work significantly improves phishing detection in European countries where minor languages are predominantly spoken. The proposed language-based phishing detection outperforms state-of-the-art methods for webpages in minor languages, achieving 99% accuracy on local webpages in the following countries: the Czech Republic, Denmark, Estonia, Croatia, Hungary, Lithuania, Latvia, Poland, Romania, Serbia, Slovakia, and Slovenia. The main improvement lies in reducing the false positive rate, where local benign webpages are incorrectly identified as phishing. The proposed method reduces the false positive rate by up to a factor of 10 for webpages in minor languages. The results are statistically robust across different webpage sets, scaled for real-world use. The Shapiro-Wilk test confirms a normal distribution of the results in each tested country, with p-values consistently above 0.05. Paired T-tests yield p-values consistently below 0.05, indicating statistically significant improvements in each country. The low variability in the results further demonstrates the robustness of the proposed method. To foster reproducibility, the source code, datasets, and raw results are made publicly available, including approximately two million local webpages from 16 countries. Reference data from English-speaking and major-language-speaking countries are also included. This work advances phishing detection for webpages in minor languages and contributes to more inclusive global cybersecurity. | en |
| dc.format | text | cs |
| dc.format.extent | 14 | cs |
| dc.format.mimetype | application/pdf | cs |
| dc.identifier.citation | Scientific Reports. 2025, vol. 15, issue October, 14 p. | en |
| dc.identifier.doi | 10.1038/s41598-025-21384-w | cs |
| dc.identifier.issn | 2045-2322 | cs |
| dc.identifier.orcid | 0000-0002-6551-7997 | cs |
| dc.identifier.other | 198970 | cs |
| dc.identifier.uri | http://hdl.handle.net/11012/255604 | |
| dc.language.iso | en | cs |
| dc.relation.ispartof | Scientific Reports | cs |
| dc.relation.uri | https://www.nature.com/articles/s41598-025-21384-w | cs |
| dc.rights | Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International | cs |
| dc.rights.access | openAccess | cs |
| dc.rights.sherpa | http://www.sherpa.ac.uk/romeo/issn/2045-2322/ | cs |
| dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/4.0/ | cs |
| dc.subject | Language | en |
| dc.subject | Phishing Detection | en |
| dc.subject | Machine Learning | en |
| dc.subject | False Positive Rate | en |
| dc.subject | Webpage URL | en |
| dc.subject | Cybersecurity | en |
| dc.title | Phishing detection on webpages in European non-English languages based on machine learning | en |
| dc.title.alternative | Phishing detection on webpages in European non-English languages based on machine learning | en |
| dc.type.driver | article | en |
| dc.type.status | Peer-reviewed | en |
| dc.type.version | publishedVersion | en |
| eprints.grantNumber | info:eu-repo/grantAgreement/TA0/FW/FW10010014 | cs |
| sync.item.dbid | VAV-198970 | en |
| sync.item.dbtype | VAV | en |
| sync.item.insts | 2025.12.19 21:53:51 | en |
| sync.item.modts | 2025.12.19 21:32:49 | en |
| thesis.grantor | Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií. Ústav telekomunikací | cs |
Files
Original bundle
- Name: s4159802521384w.pdf
- Size: 2.08 MB
- Format: Adobe Portable Document Format
- Description: file s4159802521384w.pdf
