A Multi-Dimensional DNS Domain Intelligence Dataset for Cybersecurity Research

Authors

Hranický, Radek
Ondryáš, Ondřej
Horák, Adam
Pouč, Petr
Jeřábek, Kamil
Ebert, Tomáš
Polišenský, Jan

Abstract

The escalating sophistication and frequency of cyber threats require advanced solutions in cybersecurity research. In particular, phishing and malware detection have become increasingly reliant on data-driven approaches. This paper presents a dataset carefully curated to support research in network security, focusing on the classification and analysis of internet domains. The dataset contains information on over a million internet domains, with detailed labels distinguishing between phishing, malware, and benign domains. It is distinctive in its comprehensive compilation of metadata derived from multiple sources, including DNS records, TLS handshakes and certificates, WHOIS and RDAP services, IP-related data, and geolocation details. Such rich, multi-dimensional data allows for deeper analysis and understanding of the domain characteristics that are critical in identifying and categorizing cyber threats. Integrating information from diverse sources enhances the dataset's utility, providing a holistic view of each domain's footprint and its potential security implications. The data is formatted in JSON, which makes it accessible to researchers and easy to integrate into analytical tools and platforms for statistical analysis, machine learning, and other computational analyses. The dataset's volume and variety surpass any known publicly available resource in this field, making it a valuable asset for both academic research and the practical development and testing of cybersecurity solutions. This paper describes the value of the data, details the methodology employed in the collection process, and provides a clear description of the data structure. Such documentation is crucial for ensuring that the dataset can be effectively used and reapplied in a variety of research contexts. Its structured format and broad range of included features are critical for developing robust cybersecurity solutions and can be adapted for emerging threats.
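
As a rough illustration of how JSON-formatted, labeled domain records of the kind described in the abstract might be consumed, the following Python sketch parses a few records and counts them per class. The field names ("domain_name", "label", "dns") and the one-object-per-line layout are assumptions made for this example only; they are not taken from the paper, and the dataset's actual schema should be checked in the published article.

```python
import json
from collections import Counter

# Hypothetical sample records. The keys "domain_name", "label", and "dns"
# are illustrative assumptions, not the dataset's documented schema.
SAMPLE_JSON_LINES = """\
{"domain_name": "example.com", "label": "benign", "dns": {"A": ["192.0.2.1"]}}
{"domain_name": "login-update.example", "label": "phishing", "dns": {"A": []}}
{"domain_name": "files.example", "label": "malware", "dns": {}}
"""


def load_records(text):
    """Parse one JSON object per non-empty line into a list of dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def count_labels(records, label_key="label"):
    """Count how many records carry each class label."""
    return Counter(rec.get(label_key, "unknown") for rec in records)


def a_record_count(record):
    """Return the number of A records stored for a domain (0 if none)."""
    dns = record.get("dns") or {}
    return len(dns.get("A", []))


records = load_records(SAMPLE_JSON_LINES)
label_counts = count_labels(records)
```

A per-class count like this is a common first check for class imbalance before training a classifier; a simple derived feature such as the number of A records shows how the nested source-specific data could be flattened into model inputs.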

Citation

Data in Brief. 2026, vol. 62, issue October, p. 1-13.
https://www.sciencedirect.com/science/article/pii/S235234092500784X

Document type

Peer-reviewed

Document version

Published version

Language of document

en

Creative Commons license

Except where otherwise noted, this item's license is described as Creative Commons Attribution 4.0 International