Transformer-based Semantic Segmentation for Large-Scale Building Footprint Extraction from Very-High Resolution Satellite Images

dc.contributor.author: Gibril, Mohamed Barakat A.
dc.contributor.author: Al-Ruzouq, Rami
dc.contributor.author: Shanableh, Abdallah
dc.contributor.author: Jena, Ratiranjan
dc.contributor.author: Bolcek, Jan
dc.contributor.author: Zulhaidi Mohd Shafri, Helmi
dc.contributor.author: Ghorbanzadeh, Omid
dc.coverage.issue: 10
dc.coverage.volume: 73
dc.date.accessioned: 2024-05-14T06:45:29Z
dc.date.available: 2024-05-14T06:45:29Z
dc.date.issued: 2024-03-09
dc.description.abstract: Extracting building footprints from extensive very-high spatial resolution (VHSR) remote sensing data is crucial for diverse applications, including surveying, urban studies, population estimation, identification of informal settlements, and disaster management. Although convolutional neural networks (CNNs) are commonly utilized for this purpose, their effectiveness is constrained by limitations in capturing long-range relationships and contextual details due to the localized nature of convolution operations. This study introduces the Masked-attention Mask Transformer (Mask2Former), based on the Swin Transformer, for building footprint extraction from large-scale satellite imagery. To enhance the capture of large-scale semantic information and extract multiscale features, a hierarchical vision transformer with shifted windows (Swin Transformer) serves as the backbone network. An extensive analysis compares the efficiency and generalizability of Mask2Former with four CNN-based models (PSPNet, DeepLabV3+, UPerNet-ConvNeXt, and SegNeXt) and two transformer-based models (UPerNet-Swin and SegFormer) of differing complexity. Results reveal the superior performance of transformer-based models over their CNN-based counterparts, with exceptional generalization across diverse testing areas featuring varying building structures, heights, and sizes. Specifically, Mask2Former with the Swin Transformer backbone achieves a mean intersection over union (mIoU) between 88% and 93%, along with a mean F-score (mF-score) ranging from 91% to 96.35%, across various urban landscapes.
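To make the two headline metrics concrete, the following is a minimal sketch (not from the paper; plain numpy with hypothetical toy labels) of how per-class intersection over union and F-score are computed from a confusion matrix; averaging them over classes gives the mIoU and mF-score reported in the abstract.

# Illustrative sketch: per-class IoU and F-score from a confusion matrix
# for a binary building/background segmentation task. Toy data only.
import numpy as np

def confusion_matrix(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> np.ndarray:
    """Rows index ground-truth classes, columns index predicted classes."""
    mask = (gt >= 0) & (gt < num_classes)
    return np.bincount(
        num_classes * gt[mask].astype(int) + pred[mask].astype(int),
        minlength=num_classes ** 2,
    ).reshape(num_classes, num_classes)

def iou_and_fscore(cm: np.ndarray, eps: float = 1e-12):
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp  # predicted as class c but labelled otherwise
    fn = cm.sum(axis=1) - tp  # labelled class c but predicted otherwise
    iou = tp / (tp + fp + fn + eps)
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    fscore = 2 * precision * recall / (precision + recall + eps)
    return iou, fscore

# Hypothetical 2-class maps (0 = background, 1 = building).
gt = np.array([[0, 0, 1, 1], [0, 1, 1, 1]])
pred = np.array([[0, 1, 1, 1], [0, 0, 1, 1]])
cm = confusion_matrix(pred.ravel(), gt.ravel(), num_classes=2)
iou, f = iou_and_fscore(cm)
print(f"mIoU = {iou.mean():.3f}, mF-score = {f.mean():.3f}")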
dc.format: text
dc.format.extent: 4937-4954
dc.format.mimetype: application/pdf
dc.identifier.citation: ADVANCES IN SPACE RESEARCH. 2024, vol. 73, issue 10, p. 4937-4954.
dc.identifier.doi: 10.1016/j.asr.2024.03.002
dc.identifier.issn: 1879-1948
dc.identifier.orcid: 0009-0008-0271-6543
dc.identifier.other: 188212
dc.identifier.uri: https://hdl.handle.net/11012/245513
dc.language.iso: en
dc.publisher: Elsevier
dc.relation.ispartof: ADVANCES IN SPACE RESEARCH
dc.relation.uri: https://www.sciencedirect.com/science/article/pii/S0273117724002205
dc.rights: Creative Commons Attribution 4.0 International
dc.rights.access: openAccess
dc.rights.sherpa: http://www.sherpa.ac.uk/romeo/issn/1879-1948/
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.subject: remote sensing
dc.subject: satellite imagery
dc.subject: Mask2Former
dc.subject: CNN
dc.subject: Swin Transformer
dc.subject: vision transformer
dc.title: Transformer-based Semantic Segmentation for Large-Scale Building Footprint Extraction from Very-High Resolution Satellite Images
dc.type.driver: article
dc.type.status: Peer-reviewed
dc.type.version: publishedVersion
sync.item.dbid: VAV-188212
sync.item.dbtype: VAV
sync.item.insts: 2024.05.14 08:45:29
sync.item.modts: 2024.05.14 08:13:56
thesis.grantor: Brno University of Technology, Faculty of Electrical Engineering and Communication, Department of Radio Electronics
Files
Original bundle
Name: 1s2.0S0273117724002205main.pdf
Size: 11.51 MB
Format: Adobe Portable Document Format