Abstract
Introduction
Lesion segmentation in dermoscopic images significantly enhances the diagnostic performance of AI-based classification models. However, conventional methods often require pixel-level annotations, which are resource-intensive and prone to errors caused by external artifacts, such as hair and skin markings.
Methods
We propose a hybrid framework called SAM-enhanced YOLO, which integrates the Segment Anything Model (SAM) with You Only Look Once (YOLO) for precise pixel-level segmentation. This method combines YOLO’s efficient lesion localization with SAM’s advanced zero-shot segmentation capabilities. To further validate the framework, we compared it against traditional methods, including GrabCut and Otsu’s thresholding, as well as SAM used without YOLO (SAM-only). For SAM-only, lesion segmentation was initialized at the image center to simulate a typical dermoscopic imaging setup.
Results
SAM-enhanced YOLO demonstrated superior segmentation performance, achieving an Intersection over Union (IoU) of 0.738 and an F1-score (the harmonic mean of precision and recall) of 0.833, compared to 0.578 and 0.683, respectively, with SAM-only, corresponding to relative improvements of 28% in IoU and 22% in F1-score. The results were consistent across lesion shapes and contrast conditions, with SAM-enhanced YOLO exhibiting the lowest variability and highest robustness among the evaluated methods.
Conclusion
By reducing the need for pixel-level annotations and outperforming both standalone SAM and traditional methods, SAM-enhanced YOLO provides a scalable and resource-efficient solution for dermoscopic lesion segmentation. This framework holds significant potential for improving diagnostic workflows in clinical and resource-limited settings.
Plain Language Summary
Accurate lesion segmentation in dermoscopic images enhances the diagnostic capabilities of AI systems. Traditional methods demand manual pixel-level annotations, making them time-consuming and less efficient. This study introduces SAM-enhanced YOLO, combining advanced object detection and segmentation techniques to streamline and improve the accuracy of lesion analysis. By reducing annotation needs and improving performance, this method provides a practical solution for clinical applications.
Keywords:
- dermoscopy
- digital imaging
- object detection
- semantic segmentation
- skin lesion segmentation
Introduction
The segmentation of lesion regions in dermoscopic images has been shown to significantly improve the diagnostic performance of AI-based classification models. Barata et al reported that accurate segmentation, particularly through labor-intensive yet precise manual methods, enhanced classification accuracy by up to 15%, especially when lesion asymmetry and irregular borders were preserved.[1] Similarly, Al-Masni et al demonstrated that automated segmentation workflows improved F1-scores for malignant cases in the ISIC 2016 dataset by 4.71%.[2] These findings underscore the importance of lesion segmentation in enhancing diagnostic accuracy by enabling AI models to focus on clinically relevant features. However, achieving such precision typically relies on semantic segmentation, which requires pixel-level annotations. Creating such annotations is resource-intensive, posing a significant challenge to the widespread adoption of AI in clinical settings. Furthermore, traditional segmentation methods often face challenges in generalization due to their dependence on fully annotated datasets and susceptibility to external artifacts, such as hair, rulers, and skin markings. Addressing these limitations is crucial for advancing AI-based diagnostic tools.
YOLO (You Only Look Once) is a widely used object detection model renowned for its speed and accuracy. Introduced by Wang et al, the most recent version, YOLOv10, improves real-time object detection through end-to-end optimization and stronger generalization.[3] Although YOLO is effective for object localization, its bounding-box-based approach has inherent limitations in achieving precise pixel-level segmentation. To overcome this issue, we developed a hybrid method that integrates YOLO with advanced segmentation techniques, referred to as “SAM-enhanced YOLO” in this study.
Introduced by Rother et al, GrabCut is a graph-based iterative segmentation method designed to optimize pixel clustering for distinguishing foreground from background.[4] Ünver and Ayan demonstrated the effectiveness of combining YOLO for lesion localization with GrabCut for segmentation, achieving a sensitivity of 90% on the ISBI 2017 dataset.[5] Bagheri et al advanced this approach by combining YOLO for lesion localization with DeepLab for detailed segmentation, yielding high performance on challenging dermoscopic datasets.[6] These studies highlight the potential of integrating fast object detection with refined segmentation techniques. However, such methods often rely on pre-labeled datasets and are sensitive to external artifacts, such as hair, rulers, and skin markings.
The Segment Anything Model (SAM), a recently introduced segmentation framework, represents a significant step forward by offering zero-shot capabilities. Trained on the SA-1B dataset, which includes more than a billion segmentation masks, SAM exhibits robust generalization across a wide range of segmentation tasks.[7] Wang et al highlighted its potential by combining YOLO with SAM for three-dimensional object segmentation, achieving substantial improvements in both intersection over union (IoU) and pixel accuracy.[8] Inspired by these advancements, we developed a novel framework called “SAM-enhanced YOLO”, which aims to achieve high segmentation accuracy while significantly reducing annotation costs.
In this study, we implemented a framework in which YOLO provides bounding-box localization, which is subsequently refined into precise pixel-level segmentation using SAM. Unlike traditional methods that require pixel-level annotations, our approach relies only on bounding-box annotations for YOLO. Moreover, SAM’s adaptability reduces the need for extensive pre-processing and enhances segmentation accuracy, especially in challenging scenarios involving external artifacts, such as hair or skin markings, that often interfere with conventional techniques.
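To make the pipeline concrete, the sketch below shows one way the two stages can be wired together in Python. It is a simplified illustration rather than our exact implementation: the weight file names (yolo_lesion.pt, sam_vit_h.pth) are placeholders, and the predictor interface shown is that of the original segment-anything package, which the segment-anything-hq fork used in this study mirrors closely.

```python
import cv2
import numpy as np
from ultralytics import YOLO
from segment_anything import sam_model_registry, SamPredictor  # SAM-HQ exposes a similar API

# Placeholder weight files; not the exact checkpoints used in this study.
detector = YOLO("yolo_lesion.pt")                                # YOLO fine-tuned on lesion boxes
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")    # pretrained SAM backbone
predictor = SamPredictor(sam)

def segment_lesion(image_bgr: np.ndarray) -> np.ndarray:
    """YOLO proposes a bounding box; SAM refines it into a pixel-level lesion mask."""
    # Stage 1: coarse localization. Keep the highest-confidence detection.
    det = detector(image_bgr, verbose=False)[0]
    if len(det.boxes) == 0:
        return np.zeros(image_bgr.shape[:2], dtype=np.uint8)
    box = det.boxes.xyxy[det.boxes.conf.argmax()].cpu().numpy()

    # Stage 2: pixel-level refinement. SAM expects RGB input and takes the box as a prompt.
    predictor.set_image(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))
    masks, _, _ = predictor.predict(box=box, multimask_output=False)
    return masks[0].astype(np.uint8)
```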
Methods
To evaluate the SAM-enhanced YOLO framework, we used the ISIC 2018 dataset, which includes annotated dermoscopic images of melanoma and melanocytic nevi.[9] The dataset was divided into training and validation sets at an 80:20 ratio, and all images were resized to ensure compatibility with the framework. YOLO was implemented using the Ultralytics library (Version 8.3.34) and fine-tuned from a pre-trained YOLOv10 model described by Wang et al.[3] Training was conducted for 50 epochs with a batch size of 4, incorporating commonly used data augmentation techniques such as flipping, rotation, and color adjustment to improve robustness. SAM was implemented using the segment-anything-hq library (Version 0.3), with high-quality pretrained weights optimized for zero-shot segmentation tasks.[7]
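For reference, the fine-tuning step can be expressed with the Ultralytics training API roughly as shown below. The dataset configuration file name is a placeholder for our 80:20 split, and the augmentation values stand in for typical settings rather than an exact record of our configuration.

```python
from ultralytics import YOLO

# Start from pretrained YOLOv10 weights and fine-tune on lesion bounding boxes.
# "isic2018_lesion.yaml" is a placeholder pointing to the 80:20 train/validation split.
model = YOLO("yolov10n.pt")
model.train(
    data="isic2018_lesion.yaml",
    epochs=50,
    batch=4,
    imgsz=640,      # images resized for compatibility with the detector
    fliplr=0.5,     # horizontal flipping
    degrees=10.0,   # small random rotations
    hsv_h=0.015,    # mild color jitter (hue, saturation, value)
    hsv_s=0.7,
    hsv_v=0.4,
)
```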
For baseline comparisons, we included GrabCut, Otsu’s thresholding, and SAM used without YOLO. When SAM is used independently, without bounding-box guidance from YOLO, segmentation depends on the initial conditions provided to the model; in our evaluation, the image center was used as the initial foreground estimate, reflecting the lesion-centered nature of dermoscopic images. For simplicity, we refer to the use of SAM without bounding-box initialization from YOLO as “SAM-only” throughout this study. GrabCut was implemented using OpenCV (Version 4.10.0.84),[4] which iteratively optimizes pixel clusters to delineate lesion boundaries. Otsu’s method calculated global thresholds based on image histograms for segmenting lesion regions.[10] Segmentation quality was assessed using IoU, precision, recall, and F1-score. All metrics were computed on a per-pixel basis by comparing the predicted masks with the ground truth. Statistical analyses were performed to determine significant differences between the methods evaluated. A Wilcoxon signed-rank test with Bonferroni correction was applied to account for multiple comparisons. This non-parametric test was chosen because segmentation metrics such as IoU, precision, recall, and F1-score are bounded values that may not follow a normal distribution. Bonferroni correction was used to control the family-wise error rate across multiple pairwise comparisons.
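The baseline methods can be reproduced approximately as follows. This is a hedged sketch: the single center-point prompt for SAM-only and the central rectangle for GrabCut are plausible realizations of the initializations described above rather than exact copies of our code, and the Otsu variant assumes the lesion is darker than the surrounding skin.

```python
import cv2
import numpy as np
from segment_anything import SamPredictor  # SAM-HQ provides an equivalent predictor class

def sam_only_center(predictor: SamPredictor, image_rgb: np.ndarray) -> np.ndarray:
    """SAM-only baseline: a single foreground point at the image center serves as the prompt."""
    h, w = image_rgb.shape[:2]
    predictor.set_image(image_rgb)
    masks, _, _ = predictor.predict(
        point_coords=np.array([[w // 2, h // 2]]),
        point_labels=np.array([1]),   # 1 marks the point as foreground
        multimask_output=False,
    )
    return masks[0].astype(np.uint8)

def otsu_baseline(image_bgr: np.ndarray) -> np.ndarray:
    """Global Otsu threshold on the grayscale image (assumes lesion darker than skin)."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    return (mask > 0).astype(np.uint8)

def grabcut_baseline(image_bgr: np.ndarray, margin: float = 0.1, iters: int = 5) -> np.ndarray:
    """GrabCut initialized with a central rectangle covering most of the frame."""
    h, w = image_bgr.shape[:2]
    rect = (int(w * margin), int(h * margin), int(w * (1 - 2 * margin)), int(h * (1 - 2 * margin)))
    mask = np.zeros((h, w), dtype=np.uint8)
    bgd = np.zeros((1, 65), dtype=np.float64)
    fgd = np.zeros((1, 65), dtype=np.float64)
    cv2.grabCut(image_bgr, mask, rect, bgd, fgd, iters, cv2.GC_INIT_WITH_RECT)
    return np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype(np.uint8)
```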
Results
The segmentation performance of SAM-enhanced YOLO was evaluated using representative dermoscopic images, as illustrated in Figure 1. This figure visually compares segmentation results from SAM-enhanced YOLO, SAM-only, GrabCut, and Otsu’s thresholding against the ground truth. SAM-enhanced YOLO consistently achieved the most precise delineation of lesion boundaries and closely matched the ground truth. SAM-only demonstrated moderate accuracy but often struggled with precise boundary delineation in cases of low-contrast lesions or lesions with highly irregular shapes, as it lacks the bounding-box initialization provided by YOLO. GrabCut also demonstrated moderate accuracy, often struggling to capture faint lesion boundaries due to its reliance on initial user-defined regions, which can fail in low-contrast areas. In contrast, Otsu’s method frequently exhibited over-segmentation, including non-lesion areas such as lesion peripheries and surrounding artifacts, because it is limited by global thresholding based solely on histogram characteristics.
Figure 1 Comparative analysis of segmentation methods for dermoscopic lesions. Segmentation results for representative dermoscopic lesions using SAM-enhanced YOLO, SAM-only, Otsu’s thresholding, and GrabCut methods: (a) the original dermoscopic image; (b) the YOLO-generated bounding box, providing coarse lesion localization; the green bounding boxes and confidence scores in (b) indicate the detection results from YOLO; (c) ground truth mask for the lesion; (d) SAM-enhanced YOLO segmentation, achieving the closest match to the ground truth; (e) SAM-only segmentation, which utilizes the image center as the initial foreground for zero-shot segmentation; (f) segmentation using Otsu’s thresholding; and (g) segmentation using GrabCut. Among these methods, SAM-enhanced YOLO consistently demonstrated superior precision and robustness, particularly for lesions with irregular shapes or under challenging visual conditions. SAM-only segmentation provides a useful baseline, but its performance is less reliable without YOLO’s bounding-box initialization.
Performance metrics are summarized in Table 1 and Figure 2. SAM-enhanced YOLO achieved the highest mean IoU (0.738 ± 0.181), followed by GrabCut (0.619 ± 0.321), SAM-only (0.578 ± 0.280), and Otsu’s thresholding (0.477 ± 0.266).
Table 1 Segmentation Performance Comparison Across Four Methods. Each Score Represents the Mean ± Standard Deviation. All Metrics Were Computed on a Pixel-Wise Basis Using the Ground Truth Masks
Figure 2 Intersection over Union (IoU) performance comparison of SAM-enhanced YOLO, SAM-only, GrabCut, and Otsu’s thresholding methods. This figure presents the mean IoU scores for the four segmentation methods (SAM-enhanced YOLO, SAM-only, GrabCut, and Otsu’s thresholding) evaluated across three random seeds. SAM-enhanced YOLO achieved the highest IoU (0.738 ± 0.181), followed by GrabCut (0.619 ± 0.321), SAM-only (0.578 ± 0.280), and Otsu’s thresholding (0.477 ± 0.266). Error bars indicate the standard deviation of the IoU values. Statistically significant differences were observed among all methods based on Wilcoxon signed-rank tests with Bonferroni correction. In the box plots, x marks represent mean values, circles represent outliers, and triple asterisks (***) indicate statistically significant differences (p < 0.001).
In addition to IoU, we evaluated precision, recall, and F1-score to provide a more comprehensive assessment of segmentation quality. SAM-enhanced YOLO demonstrated the best overall balance across all metrics, exhibiting notably high precision (0.954 ± 0.123) and F1-score (0.833 ± 0.160). SAM-only and GrabCut exhibited moderate performance, while Otsu’s method consistently lagged behind. These results further emphasize the robustness and accuracy of SAM-enhanced YOLO across diverse lesion types and conditions. Statistical analyses confirmed that all pairwise differences between methods were significant (p < 0.05, Wilcoxon signed-rank test with Bonferroni correction).
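As a rough illustration of how these per-pixel metrics and the corrected significance test can be computed, consider the sketch below; preds_a, preds_b, and gts are hypothetical lists of binary masks, and the Bonferroni adjustment simply scales each p-value by the number of pairwise comparisons.

```python
import numpy as np
from scipy.stats import wilcoxon

def pixel_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> dict:
    """Per-pixel IoU, precision, recall, and F1 for one predicted/ground-truth mask pair."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return {
        "iou": tp / (tp + fp + fn + eps),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall + eps),
    }

def paired_test(preds_a, preds_b, gts, n_comparisons: int, metric: str = "iou") -> float:
    """Paired Wilcoxon signed-rank test on per-image scores, Bonferroni-corrected."""
    scores_a = [pixel_metrics(p, g)[metric] for p, g in zip(preds_a, gts)]
    scores_b = [pixel_metrics(p, g)[metric] for p, g in zip(preds_b, gts)]
    _, p_value = wilcoxon(scores_a, scores_b)
    return min(p_value * n_comparisons, 1.0)
```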
To further explore the limitations of SAM-enhanced YOLO, we focused on cases where the model achieved high precision but still failed to adequately segment the full extent of the lesion. Specifically, we extracted images with an IoU between 0.30 and 0.50, precision above 0.85, recall below 0.60, and an F1-score below 0.70. These thresholds were selected to isolate cases of partial segmentation, where the model accurately identified part of the lesion with minimal false positives but failed to capture the full region. From twenty such cases, six representative examples were selected and are shown in Figure 3.
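The selection criteria just described amount to a simple filter over the per-image metrics; a minimal sketch is given below, assuming records is a hypothetical list of per-image metric dictionaries with the keys iou, precision, recall, and f1 (as produced, for example, by a function like pixel_metrics above).

```python
def select_partial_failures(records: list[dict]) -> list[dict]:
    """Keep high-precision, low-recall cases: part of the lesion found, the rest missed."""
    return [
        m for m in records
        if 0.30 <= m["iou"] <= 0.50
        and m["precision"] > 0.85
        and m["recall"] < 0.60
        and m["f1"] < 0.70
    ]
```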
Figure 3 Representative failure cases for SAM-enhanced YOLO. Each row shows a single case in the following order: original dermoscopic image, ground truth, SAM-enhanced YOLO, SAM-only, Otsu’s thresholding, and GrabCut. Cases (1) and (2) show lesions with extremely faint or low-contrast boundaries, where all methods struggled to accurately delineate the lesion. Case (3) involves a very small lesion, where slight mismatches between prediction and ground truth caused disproportionately large penalties in evaluation metrics. Cases (4) and (5) contain dense hair artifacts that interfered with segmentation across all methods. Case (6) illustrates a lesion with ambiguous or imprecise ground truth annotation, making objective performance assessment difficult.
The six examples in Figure 3 illustrate four major patterns of failure. Images (1) and (2) show lesions with extremely faint and poorly defined boundaries, where all methods struggled to accurately delineate the lesion area. Image (3) depicts a particularly small lesion, where even slight misalignment led to a disproportionately large drop in evaluation scores. Images (4) and (5) contain dense hair artifacts, which appear to interfere with the segmentation process across all methods. Finally, image (6) shows an ambiguous or imprecise ground truth annotation, making objective evaluation difficult. These cases highlight conditions under which segmentation performance can degrade, including intrinsic image difficulty, external interference, and annotation quality.
In addition to segmentation accuracy, we compared the mean processing time per image for each method. Otsu’s method was the fastest, with a processing time that was negligible in our measurement environment. This was followed by YOLO-only (0.012 s), SAM-enhanced YOLO (0.027 s), and GrabCut (0.114 s). While Otsu’s method and YOLO-only offered faster inference, their segmentation accuracy was notably inferior. SAM-enhanced YOLO achieved a favorable trade-off between accuracy and computational efficiency, making it suitable for practical applications.
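One plausible way to obtain such per-image timings is a plain wall-clock loop, sketched below; segment_fn stands for any of the segmentation callables above, and the warm-up pass is an assumption rather than a recorded detail of our setup.

```python
import time

def mean_runtime(segment_fn, images, warmup: int = 3) -> float:
    """Average wall-clock seconds per image for a segmentation callable."""
    for img in images[:warmup]:        # warm-up pass to exclude one-off initialization costs
        segment_fn(img)
    start = time.perf_counter()
    for img in images:
        segment_fn(img)
    return (time.perf_counter() - start) / len(images)
```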
However, SAM-enhanced YOLO was not without limitations. As shown in Figure 3, it struggled in certain cases involving lesions with faint boundaries or high visual complexity. These challenges were primarily due to inaccuracies in bounding-box localization, which sometimes excluded parts of the lesion or included extraneous artifacts in the segmentation. For example, SAM-enhanced YOLO failed to properly capture lesions with extremely subtle boundaries that blended into the surrounding skin. These examples highlight scenarios where reliance on bounding-box localization may hinder segmentation accuracy.
Despite these limitations, the results collectively highlight the superiority of SAM-enhanced YOLO in terms of both segmentation accuracy and consistency. While SAM-only serves as a useful benchmark for evaluating the impact of YOLO’s bounding-box initialization, its lower performance underscores the importance of combining efficient object localization with precise semantic segmentation for dermoscopic lesion analysis.
Discussion
The superior IoU performance of SAM-enhanced YOLO can be attributed to its ability to adapt to varying lesion shapes and to handle low-contrast boundaries more effectively than other methods. SAM-only, while demonstrating reasonable segmentation accuracy (mean IoU: 0.578 ± 0.280), often struggled with precise boundary delineation in cases of irregular lesion shapes or low contrast. This limitation arises from its reliance on a fixed central initialization, which is less adaptable than YOLO’s bounding-box localization. Similarly, GrabCut, which relies on iterative optimization of pixel clusters, often failed to capture faint lesion boundaries, leading to incomplete segmentation. Otsu’s thresholding, on the other hand, exhibited over-segmentation due to its global thresholding approach, frequently including non-lesion areas and artifacts in the segmented region.
The comparison between SAM-only and SAM-enhanced YOLO underscores the critical role of bounding-box initialization in enhancing segmentation accuracy. By providing a localized region of interest, YOLO enables SAM to refine its predictions more effectively, resulting in significantly higher IoU scores and reduced variability across conditions. These findings emphasize the value of combining efficient object detection with advanced segmentation frameworks, particularly for challenging dermoscopic datasets. Nonetheless, the performance of SAM-only highlights the inherent strengths of SAM’s zero-shot segmentation capabilities, especially in simpler cases or when bounding-box annotations are unavailable. This underscores SAM’s versatility, even in the absence of YOLO’s guidance, albeit with reduced accuracy and consistency in more complex scenarios.
Moreover, SAM-enhanced YOLO exhibited limitations in certain cases. For example, as shown in Figure 3, it struggled with lesions that featured extremely faint boundaries or were affected by high visual complexity. Such cases often resulted from inaccuracies in bounding-box localization, where parts of the lesion were excluded or extraneous artifacts were included. In these scenarios, the central initialization used in SAM-only could occasionally capture portions of the lesion missed by YOLO’s bounding boxes, albeit with lower overall precision. These observations suggest potential benefits of hybrid approaches that combine YOLO’s localization with additional refinement mechanisms, and they highlight SAM-enhanced YOLO’s dependence on accurate object localization, which can limit its performance for subtle or irregularly shaped lesions. In addition, segmentation performance declined across all methods for particularly challenging images, such as those with small lesions, indistinct boundaries, or background colors similar to the lesion; these images appear inherently difficult to segment regardless of the algorithm used. Classical methods such as Otsu’s thresholding and GrabCut were particularly vulnerable under these conditions. Otsu’s method assumes a bimodal intensity distribution and tends to fail when lesion-background contrast is low. GrabCut relies on initial region labeling and statistical separation between foreground and background, making it ineffective when lesions are small or visually ambiguous.
Compared to these methods, SAM-only, a zero-shot segmentation model pretrained on large-scale general-purpose datasets, showed more consistent performance across varying conditions, though it lacks domain-specific accuracy. YOLO-only, which was trained specifically for skin lesion detection, provides high sensitivity for lesion localization but only coarse segmentation masks. SAM-enhanced YOLO integrates these two perspectives by combining domain-adaptive detection with generalized segmentation, reducing dependence on specific visual features. This hybrid architecture likely contributes to the method’s robustness and consistent performance across diverse image conditions.
In addition to IoU, we evaluated precision, recall, and F1-score to gain a more comprehensive understanding of each method’s performance. SAM-enhanced YOLO achieved the highest values across all metrics, particularly excelling in precision and F1-score. This suggests that the method not only correctly identifies lesion regions but also minimizes false positives and produces well-balanced masks.
In contrast, SAM-only and GrabCut showed lower recall, indicating that these methods often failed to capture the full extent of the lesions. The complementary nature of these metrics reinforces the reliability of SAM-enhanced YOLO, especially in diverse and complex lesion presentations.
These findings suggest that SAM-enhanced YOLO offers a practical balance between segmentation accuracy and generalizability by mitigating structural limitations observed in other methods.
However, potential biases in the dataset—such as overrepresentation of common lesion types, limited diversity in skin tones, or inconsistencies in ground truth annotations—may affect the generalizability of the results. To address this, future work should consider validating the framework on multiple datasets with varied demographic characteristics or expert-reviewed annotations. Furthermore, integrating the proposed segmentation pipeline with diagnostic models or applying it to longitudinal datasets may offer additional insights into its clinical utility and its ability to track lesion evolution over time.
Moreover, although SAM-enhanced YOLO was not the fastest method, it provided the most effective compromise between segmentation performance and computational cost, making it a strong candidate for integration into real-time dermoscopic workflows.
The failure patterns identified in Figure 3 also reveal that the limitations observed in SAM-enhanced YOLO were not unique to this method. In fact, similar trends were present across the other segmentation techniques, indicating that the highlighted cases represent inherently difficult scenarios. In examples involving very small lesions, the low evaluation scores were likely influenced by the limited lesion area, which can cause small segmentation discrepancies to result in disproportionately large penalties in IoU and F1-score. In some instances, the ground truth appeared to slightly overestimate the lesion boundary, further exaggerating such effects. For images containing dense hair artifacts, all methods were adversely affected, but this issue could potentially be mitigated by incorporating preprocessing steps such as black-hat filtering. Finally, cases with ambiguous or imprecise ground truth annotations emphasized the importance of high-quality labels. Future studies may benefit from using datasets with expert-reviewed annotations or from combining multiple datasets to improve label reliability and evaluation fairness.
Conclusion
Both visual and quantitative evaluations confirm that SAM-enhanced YOLO outperforms traditional methods, such as GrabCut and Otsu’s thresholding, across IoU, precision, recall, and F1-score, which together provide a comprehensive and balanced view of segmentation quality. Its ability to accurately and consistently delineate lesion boundaries, even in cases with irregular shapes and low-contrast features, underscores its robustness in handling the complexities of dermoscopic lesions.
Failure case analysis revealed that segmentation performance may decline under specific conditions, such as extremely faint lesion boundaries, very small lesion areas, or the presence of dense hair artifacts. Moreover, some discrepancies appeared to stem from imprecise or overly broad ground truth annotations. These limitations may be addressed through improved preprocessing (eg, artifact removal) and the use of expert-reviewed or cross-validated datasets.
By addressing the shortcomings of traditional segmentation methods and demonstrating the potential for further refinement through integration with robust detection and segmentation models, SAM-enhanced YOLO offers a scalable and resource-efficient solution for dermoscopic image analysis. This framework not only reduces the need for pixel-level annotations but also provides a strong foundation for improving diagnostic workflows in clinical and resource-limited settings.
Abbreviations
SAM, Segment Anything Model; YOLO, You Only Look Once; IoU, Intersection over Union.
Ethics Approval and Informed Consent
This study did not involve any human participants, medical records, or identifiable personal data. All data were obtained from publicly available datasets (eg, ISIC 2018), which are anonymized and freely accessible for research purposes. Therefore, this study did not require ethical review or approval by an institutional review board or ethics committee.
Consent for Publication
The study did not involve any human participants or identifiable personal data. All images and data used in this study were obtained from publicly available databases (eg, ISIC 2018), which ensure that the data are anonymized and freely available for research purposes. Therefore, no consent for publication was required.
Author Contributions
All authors made a significant contribution to the work reported, whether that is in the conception, study design, execution, acquisition of data, analysis and interpretation, or in all these areas; took part in drafting, revising or critically reviewing the article; gave final approval of the version to be published; have agreed on the journal to which the article has been submitted; and agree to be accountable for all aspects of the work.
Disclosure
The author declares no conflicts of interest in this work.
Data Sharing Statement
The data that support the findings of this study are available from the corresponding author upon reasonable request.