
Advancing e-waste classification with customizable YOLO based deep learning models



Introduction

E-waste, comprising discarded electrical or electronic devices, has increased at an alarming rate due to the rapid growth in the adoption of electronic devices1. This surge has created grave environmental and health challenges, necessitating precise classification and separation for efficient recycling and management2.

Computer vision, particularly object detection models, has shown immense potential to address these challenges. Object detection, which involves identifying and localizing objects within an image, plays a pivotal role in several industrial applications, including e-waste management, as proposed by Rosebrock et al.3. However, selecting an object detection model that harmoniously combines speed, accuracy, and computational efficiency remains a conundrum4.

YOLO, short for ‘You Only Look Once,’ was introduced by Redmon et al.7 and has been groundbreaking in real-time object detection. From its v5 iteration to v8, YOLO has consistently improved its speed and performance metrics.

In this research, we evaluate three prominent object detection models, YOLOv5, YOLOv7, and YOLOv8, for electronic waste classification. The crux of this research lies in discerning the performance of these models across different scenarios and input variations, enabling users to pinpoint the most suitable model for their specific e-waste management needs. In addition, this research endeavors to present a comprehensive understanding of the strengths and weaknesses of these state-of-the-art object detection models in the context of e-waste classification. By doing so, we anticipate assisting a myriad of stakeholders, from researchers and developers to practitioners in computer vision and environmental management, in making judicious choices.

Related work

The escalating issue of e-waste disposal demands innovative techniques for its effective identification and classification. Central to this challenge is deploying advanced deep-learning models that can seamlessly detect and sort different types of electronic waste. In recent years, deep learning techniques, particularly those based on CNNs, have emerged as a potent solution to this growing problem. Table 1 summarizes the historical origins of prominent object detection models, highlighting their respective innovations and applications in fields such as real-time object detection and electronic waste detection.

Table 1 Historical origins of prominent object detection models.

Zhou et al.12 pioneered this area, presenting one of the first CNN frameworks tailored for e-waste component type identification. Their study underscored the transformative potential of CNNs in discerning between various e-waste types, laying the groundwork for subsequent research in this domain. Jiang et al.4 supported this idea, demonstrating an automated e-waste sorting system based on deep learning. Their system exemplifies the accuracy and effectiveness of CNNs.

Amidst these advances in CNN-based methodologies, there has been a parallel surge in the popularity and development of the YOLO (You Only Look Once) series of models. Originally introduced by Redmon et al.7, YOLO revolutionized real-time object detection. The ensuing versions, YOLOv5, v7, and v8, testify to the rapid advancements in this line of research. Bochkovskiy et al.8 unveiled YOLOv5, emphasizing architectural refinements for bolstered performance and scalability. The subsequent iterations, YOLOv7 and v8, as presented by Wang et al.13, incorporated technologies like Feature Pyramid Networks14, further propelling the YOLO architecture to the forefront of real-time object detection.

In parallel, the role of comprehensive frameworks like the TensorFlow Object Detection API, introduced by Huang et al.6, cannot be overlooked. This open-source framework facilitates the end-to-end development of object detection models. However, its direct applicability to e-waste detection remains a relatively untapped area, even though preliminary investigations like that of Li et al.15 have hinted at its potential.

Recently, Zhang and Liu16 enriched the e-waste literature by providing an exhaustive review of deep learning-driven e-waste detection techniques, offering a holistic understanding of the evolving methodologies. Several subsequent studies have highlighted the strengths of YOLOv5 and allied architectures in e-waste detection. For instance, systems leveraging YOLOv5 alongside transfer learning have been proposed9,17, and YOLOv5 has been combined with ensemble learning18. Significant accuracy improvements were achieved by Chen et al.19 through the innovative integration of multi-scale feature fusion techniques with YOLOv5. The quest for improved detection methods has also been supported by cascaded networks: Alsubaei et al.20 introduced a specialized deep learning-infused cascaded object detection system designed specifically for electronic waste.

In sum, while CNN-based methodologies have set benchmarks in e-waste detection, the YOLO family’s continuous evolution and complementary frameworks, such as the TensorFlow Object Detection API, signify an expansive and promising frontier. This research combines these distinct threads, offering a comprehensive evaluation of the family of YOLO architectures in e-waste detection, focusing on accuracy and applicability.

The remainder of this paper is organized as follows. The novelty of YOLOv5 is discussed in Section “Novelty of YOLOv5”, YOLOv7 in Section “Novelty of YOLOv7”, and YOLOv8 in Section “Novelty of YOLOv8”. The architectures of the improved and customizable YOLOv5, YOLOv7, and YOLOv8 are then compared in detail for a lucid understanding. The methodology, the dataset used for this research, and the model descriptions appear in Section “Methodology”. The prominence of the YOLO models and their respective validations are presented in Section “Results and discussion”. Section “Conclusion” offers the concluding remarks of this research, with the improved YOLOv8 emerging as the champion for e-waste classification, an approach greatly inspired by Lou et al.21.

Novelty of YOLOv5

The originality of YOLOv5 principally stems from its architectural breakthroughs and enhanced performance. The system incorporates a distinctive blend of elements, comprising CSPDarknet53 as its core, PANet in the intermediate section, and a YOLO head for object detection.

  1. The CSPDarknet53 backbone is a customized iteration of the Darknet-53 architecture, a deep CNN structure. Its main feature is the utilization of the “cross-stage partial” technique, which divides the network into many stages with interconnected paths. This methodology improves the transmission of information and the dissemination of gradients.

  2. The PANet architecture in the neck region integrates characteristics from several layers of the backbone network, forming a feature pyramid. The method employs spatial pyramid pooling and lateral connections to aggregate features, enhancing object recognition across various scales.

  3. The YOLO head of YOLOv5 predicts the coordinates of bounding boxes and the probabilities of distinct classes at various grid scales. The model’s effectiveness in identifying objects of varying sizes is enhanced by its ability to predict across multiple scales.

The novelty of YOLOv5 is expressed mathematically in Eq. (1):

$$\text{YOLOv5} = f(\text{CSPDarknet53},\ \text{PANet},\ \text{YOLO Head}) \tag{1}$$

where

  a. “CSPDarknet53” refers to the CSPDarknet53 backbone network at the core of the model.

  b. “PANet” is the path aggregation network used in the neck region.

  c. “YOLO Head” denotes the YOLO detection head used for object detection.

Novelty of YOLOv7

The improved and customizable YOLOv7 incorporates advancements over the existing YOLOv5 framework to enhance speed and accuracy. Its novelty includes alterations to the core structure, an attention mechanism, and feature integration.

  1. The YOLOv7 model incorporates a customized variant of the CSPDarknet53 backbone network, which improves its ability to extract features.

  2. The YOLOv7 model includes a Spatial Attention Module (SAM) block, which enhances accuracy by enabling the model to concentrate on pertinent portions of the input.

  3. YOLOv7 utilizes PANet for feature fusion, similar to YOLOv5. This technique enhances object detection by merging features from various scales.

The novelty of YOLOv7 is expressed mathematically in Eq. (2):

$$\text{YOLOv7} = f(\text{Modified CSPDarknet53},\ \text{SAM Block},\ \text{PANet}) \tag{2}$$

where

  a. “Modified CSPDarknet53” refers to the altered version of the CSPDarknet53 backbone network.

  b. “SAM” stands for Spatial Attention Module, used to improve the feature extraction process.

  c. “PANet” stands for Path Aggregation Network, used for feature fusion.

Novelty of YOLOv8

The improved and customizable YOLOv8 represents the most recent iteration of the YOLO series and incorporates various pioneering elements such as anchor-free object identification, multi-scale prediction, and enhanced backbone networks.

  1. Anchor-Free Detection: YOLOv8 directly predicts an object’s center instead of relying on anchor boxes. This obviates the necessity of adjusting anchor boxes and enhances the precision of object localization.

  2. Multi-Scale Prediction: YOLOv8 predicts bounding boxes and class probabilities at many scales, enabling it to effectively recognize objects of varied sizes.

  3. Enhanced Backbone Network: YOLOv8 employs a customized variant of the CSPDarknet53 backbone network, incorporating improvements such as GhostNet modules and the elimination of redundant layers.

The novelty of YOLOv8 is expressed mathematically in Eq. (3):

$$\text{YOLOv8} = f(\text{Anchor-Free Detection},\ \text{Multi-Scale Prediction},\ \text{Improved Backbone Network}) \tag{3}$$

where

  a. Anchor-Free Detection is the method for directly determining the centers of objects.

  b. Multi-Scale Prediction refers to the ability to predict objects at different scales.

  c. Improved Backbone Network refers to the advancements made in the CSPDarknet53 backbone network.

These mathematical representations summarize the distinct contributions and advancements brought by each YOLO model, enabling a more precise comprehension of their innovations in object recognition.

Comparing the architectures of the improved YOLOv5, YOLOv7, and YOLOv8

YOLOv5

YOLOv5 is a popular single-shot object detection algorithm that builds upon the earlier versions of YOLO. It introduces architectural advancements and performance improvements. Below is a detailed explanation of the YOLOv5 architecture22.

Improved backbone network

Improved YOLOv5 utilizes a powerful backbone network called CSPDarknet53 (Cross-Stage Partial Network). CSPDarknet53 is a modified version of Darknet-53, a deep CNN architecture. It consists of 53 convolutional layers and is designed to effectively extract high-level features from input images. The backbone network follows a “cross-stage partial” strategy, where it splits the network into multiple stages, and each stage has a partial path and a cross-path. The partial path processes the features independently, while the cross-path aggregates information from previous stages. This architecture helps improve information flow and gradient propagation, leading to better performance.
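
To make the cross-stage split concrete, the sketch below shows a minimal CSP-style stage in PyTorch. The `ConvBlock` helper, channel counts, and block depth are illustrative assumptions, not the exact CSPDarknet53 configuration used by YOLOv5.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Conv -> BatchNorm -> SiLU, the basic unit used throughout the backbone."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class CSPStage(nn.Module):
    """Split channels into a processed partial path and an identity cross path,
    then concatenate: this shortens gradient paths while keeping capacity."""
    def __init__(self, channels, n_blocks=2):
        super().__init__()
        half = channels // 2
        self.partial = ConvBlock(channels, half)   # processed branch
        self.cross = ConvBlock(channels, half)     # shortcut branch
        self.blocks = nn.Sequential(
            *[ConvBlock(half, half, k=3) for _ in range(n_blocks)]
        )
        self.fuse = ConvBlock(channels, channels)  # merge both paths

    def forward(self, x):
        a = self.blocks(self.partial(x))
        b = self.cross(x)
        return self.fuse(torch.cat([a, b], dim=1))
```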

Neck

The neck of YOLOv5 is responsible for feature fusion and object detection at multiple scales. It consists of several convolutional layers and performs feature pyramid construction to handle objects of different sizes. The improved YOLOv5 employs a PANet (Path Aggregation Network) structure as its neck. PANet combines features from different layers of the backbone network to create a feature pyramid. It incorporates spatial pyramid pooling and lateral connections to aggregate features at different scales. This enables the network to detect objects of various sizes and improves detection accuracy.
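
The following is a minimal sketch of such a neck in PyTorch: a top-down pass followed by a bottom-up path aggregation pass. The channel counts, nearest-neighbor upsampling, and the omission of spatial pyramid pooling are simplifying assumptions, not YOLOv5's exact neck.

```python
import torch.nn as nn
import torch.nn.functional as F

class PANetNeck(nn.Module):
    def __init__(self, channels=(128, 256, 512)):
        super().__init__()
        c3, c4, c5 = channels
        self.lat4 = nn.Conv2d(c5, c4, 1)   # lateral 1x1 conv to match channels
        self.lat3 = nn.Conv2d(c4, c3, 1)
        self.down3 = nn.Conv2d(c3, c4, 3, stride=2, padding=1)  # bottom-up step
        self.down4 = nn.Conv2d(c4, c5, 3, stride=2, padding=1)

    def forward(self, p3, p4, p5):
        # Top-down: propagate semantically strong features to finer scales.
        t4 = p4 + F.interpolate(self.lat4(p5), scale_factor=2, mode="nearest")
        t3 = p3 + F.interpolate(self.lat3(t4), scale_factor=2, mode="nearest")
        # Bottom-up: propagate precise localization back up the pyramid.
        n4 = t4 + self.down3(t3)
        n5 = p5 + self.down4(n4)
        return t3, n4, n5  # multi-scale features fed to the detection head
```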

Head

The head of YOLOv5 performs object detection and localization based on the features generated by the backbone and neck. It predicts bounding box coordinates and class probabilities for the detected objects.

The improved YOLOv5 adopts a YOLO head architecture that consists of several convolutional layers5, followed by a set of output layers. The output layers predict bounding box coordinates relative to the cell in the grid and class probabilities for multiple anchor boxes. The head predicts bounding box offsets, objectness scores, and class probabilities at different grid scales. These predictions are made at different feature map resolutions, allowing the detection of objects at multiple scales.

Loss function

YOLOv5 utilizes a combination of loss functions to train the model effectively:

  1. Objectness Loss: measures the accuracy of object presence prediction within each grid cell.

  2. Localization Loss: measures the accuracy of predicted bounding box coordinates.

  3. Confidence Loss: measures the accuracy of predicted confidence scores.

  4. Classification Loss: measures the accuracy of predicted class labels.

The final loss is a linear combination of these individual loss components, and the network parameters are optimized using backpropagation and gradient descent algorithms.
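
A minimal sketch of this weighted combination in PyTorch is shown below. The loss weights, the smooth-L1 stand-in for YOLOv5's IoU-based box loss, and the folding of the objectness and confidence terms into a single binary cross-entropy term are illustrative assumptions.

```python
import torch.nn.functional as F

def yolo_composite_loss(pred_boxes, true_boxes, pred_obj, true_obj,
                        pred_cls, true_cls,
                        w_box=0.05, w_obj=1.0, w_cls=0.5):
    # Localization: penalize poor box agreement (IoU-based in real YOLOv5).
    box_loss = F.smooth_l1_loss(pred_boxes, true_boxes)
    # Objectness/confidence: is there an object in this cell at all?
    obj_loss = F.binary_cross_entropy_with_logits(pred_obj, true_obj)
    # Classification: which of the e-waste classes is it?
    cls_loss = F.binary_cross_entropy_with_logits(pred_cls, true_cls)
    # Final loss is a linear combination, optimized by backpropagation.
    return w_box * box_loss + w_obj * obj_loss + w_cls * cls_loss
```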

Model sizes

YOLOv5 offers models of different sizes: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. These models vary in terms of depth, width, and computational requirements. Smaller models like YOLOv5s are faster but may sacrifice some accuracy, while larger models like YOLOv5x provide higher accuracy but require more computational resources.
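
For instance, the different sizes can be pulled from the public ultralytics/yolov5 torch.hub entry point; weights download on first use, and the image path below is a hypothetical placeholder.

```python
import torch

# Small variant: fastest, lowest accuracy of the family.
model_s = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
# Extra-large variant: most accurate, most compute-hungry.
model_x = torch.hub.load("ultralytics/yolov5", "yolov5x", pretrained=True)

results = model_s("ewaste_sample.jpg")  # hypothetical test image
results.print()                          # per-class detection summary
```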

As shown in Fig. 1, the improved YOLOv5 architecture combines a powerful backbone network (CSPDarknet53) with a PANet neck and a YOLO head to perform efficient and accurate object detection. It leverages feature fusion and multi-scale predictions to handle objects of different sizes, making it suitable for a wide range of real-world applications. Figure 2 provides a detailed representation of the overall architecture, highlighting its layered structure and computational pathways.

Fig. 1. Components used in the improved YOLOv5 architecture.

Fig. 2. Improved YOLOv5 architecture.

YOLOv7

YOLOv7 by Wang et al.9,10,11 is a single-stage object detection algorithm that builds upon earlier versions of YOLO. It introduces several architectural advancements and performance improvements, including Trainable Bag-of-Freebies (BoF). BoF is a modular framework that allows users to plug in various techniques to improve model performance. The improved YOLOv7 includes several pre-trained BoF modules, such as GhostNet, Focus, and SPP, which can be used to improve the speed and accuracy of the model, as depicted in Fig. 3.

Fig. 3. Components used in the improved YOLOv7 architecture.

YOLOv7 head

The YOLOv7 head is a new design that improves object detection accuracy. It predicts bounding box coordinates and class probabilities for the detected objects using a set of convolutional layers followed by a set of output layers.

Improved backbone network

The improved YOLOv7 uses a modified version of the CSPDarknet53 backbone network, designed to effectively extract high-level features from input images. The modified backbone network includes several improvements, such as using GhostNet modules and removing unnecessary layers. As shown in Fig. 4, YOLOv7’s architecture enables more precise feature extraction, especially for smaller objects crucial for accurate e-waste classification.

Fig. 4. Improved YOLOv7 architecture.

The YOLOv7 is available in four sizes: YOLOv7s, YOLOv7m, YOLOv7l, and YOLOv7x. These models vary in terms of depth, width, and computational requirements. Smaller models like YOLOv7s are faster but may sacrifice some accuracy, while larger models like YOLOv7x provide higher accuracy but require more computational resources.

Overall, YOLOv7 is a powerful and efficient object detection algorithm that balances speed and accuracy well. It is suitable for various real-world applications, such as self-driving cars, robotics, and security surveillance.

YOLOv8

YOLOv8 by Wang et al.9,10,11 is the latest version of the YOLO object detection algorithm. It introduces several new features and improvements over YOLOv7, including anchor-free detection and multi-scale prediction.

Anchor-free detection

As illustrated in Fig. 5, the improved YOLOv8 is an anchor-free object detector, which predicts an object’s center directly instead of using anchor boxes. This eliminates the need to tune anchor boxes and improves object detection accuracy. Figure 6 shows the full architecture of YOLOv8, demonstrating its ability to scale across multiple resolutions.
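
A minimal sketch of how such anchor-free outputs can be decoded is given below; the tensor layout (per-cell center offsets and log-scale sizes) is an illustrative assumption rather than YOLOv8's exact head.

```python
import torch

def decode_anchor_free(raw, stride):
    """raw: (H, W, 4 + num_classes) -> boxes in image pixels.
    Channels 0..3 are (dx, dy, w, h) relative to each grid cell."""
    h, w, _ = raw.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    cx = (xs + raw[..., 0].sigmoid()) * stride   # center x, no anchor box
    cy = (ys + raw[..., 1].sigmoid()) * stride   # center y
    bw = raw[..., 2].exp() * stride              # width predicted directly
    bh = raw[..., 3].exp() * stride              # height predicted directly
    scores = raw[..., 4:].sigmoid()              # per-class probabilities
    boxes = torch.stack(
        [cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2], dim=-1)
    return boxes, scores
```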

Fig. 5. Components used in the improved YOLOv8 architecture.

Fig. 6. Improved YOLOv8 architecture.

Multi-scale prediction

The improved and customizable YOLOv8 predicts bounding boxes and class probabilities at different scales, which allows it to detect objects of various sizes more effectively.

Improved backbone network

The improved YOLOv8 uses a modified version of the CSPDarknet53 backbone network, designed to effectively extract high-level features from input images. The modified backbone network includes several improvements, such as using GhostNet modules and removing unnecessary layers.

YOLOv8 is available in four sizes: YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x. These models vary in terms of depth, width, and computational requirements. Smaller models like YOLOv8s are faster but may sacrifice some accuracy, while larger models like YOLOv8x provide higher accuracy but require more computational resources.

Overall, the YOLOv8 is a powerful and efficient object detection algorithm that balances speed and accuracy well. It is suitable for various real-world applications, such as self-driving cars, robotics, and security surveillance.

YOLOv8 is the first YOLO model trained on COCO v6, the latest COCO object detection dataset version. This means that YOLOv8 is better at detecting new and challenging objects, such as small objects and objects in complex scenes. Overall, the improved YOLOv8 significantly improves over the improved YOLOv7 and offers the best performance in speed and accuracy. It is a good choice for any task that requires real-time object detection.

Methodology

Figure 7 depicts the comprehensive structure of the proposed framework in this research investigation.

Fig. 7. Illustration of the workflow diagram.

Materials and methods

Image dataset fabrication refers to the generation of numerous images that can be used to train a machine-learning model. Creating a dataset requires collecting, organizing, and labeling data and is a necessary step in developing any object detection or classification model. This section discusses the creation of image datasets for seven classes: resistors, capacitors, motherboards, regulators, batteries, LCDs, and IoT sensors. Figure 8 visually represents the seven e-waste categories considered in this study.

Fig. 8. Seven electronic components encompassing batteries, resistors, capacitors, circuit boards, regulators, LCDs, and IoT sensors.

Image scraping and image standardization

The Python Imaging Library (PIL), as detailed by Lundh23, is a valuable resource for scraping and standardizing images to generate credible image datasets. PIL can perform a variety of operations when creating an image dataset, including opening, resizing, cropping, and saving images. If you have a collection of raw images with different sizes and aspect ratios, for instance, you can use PIL to resize them to a standard size, crop them to focus on the relevant object or region, and save them in a standard format. This ensures that all images in the dataset are consistent and can be used to train a machine-learning model effectively.
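
A minimal PIL standardization pass along these lines is sketched below; the folder names and the 640x640 target size are illustrative assumptions.

```python
from pathlib import Path
from PIL import Image

TARGET = (640, 640)  # common YOLO input size (assumption)
out_dir = Path("clean_images")
out_dir.mkdir(exist_ok=True)

for src in Path("raw_images").glob("*"):
    try:
        with Image.open(src) as im:
            im = im.convert("RGB")                 # normalize color mode
            im = im.resize(TARGET, Image.LANCZOS)  # uniform dimensions
            im.save(out_dir / f"{src.stem}.jpg", "JPEG", quality=95)
    except OSError:
        print(f"Skipping unreadable file: {src}")  # drop corrupt files
```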

Image data acquisition and segmentation

Collecting many images is the first step in creating an image dataset. In this instance, we focused on locating images of the seven classes. Images can be gathered from various sources, including search engines, online marketplaces, and manufacturer websites. Photographs should be of high quality and include a range of perspectives, lighting conditions, and backgrounds. It is essential that the images accurately represent the objects we wish to recognize in the real world.

Once many images have been collected, we must organize them into folders based on their respective classes. Seven folders have been created in this research work, one for each class. This step will make labeling and using the images easier to train the machine-learning model. PIL can also be used to clean and create the dataset. For instance, you may need to remove duplicate images, incorrect labels, or low-quality images. PIL provides various methods for manipulating images, such as blurring, sharpening, and adjusting brightness and contrast, which can improve the quality of images and make them easier to categorize.

Image labeling using bounding box

To label the images, a class must be assigned to each object. Numerous tools exist for labeling images, such as Labelbox, CVAT, and LabelImg. These tools enable us to draw a bounding box around each object in an image and assign it to the corresponding class. To avoid confusion when training the model, it is crucial to ensure that the labeling is uniform across all images. Figure 9 depicts an annotated sample produced with the image labeling tool makesense.io.
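
Whichever tool is used, YOLO-family trainers expect one text line per box in the form `class x_center y_center width height`, with values normalized to [0, 1]. A small conversion helper is sketched below; the pixel coordinates and class in the example are illustrative.

```python
def to_yolo_line(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space box to a normalized YOLO label line."""
    xc = (x_min + x_max) / 2 / img_w
    yc = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"

# e.g. a capacitor (class 1) at pixels (120, 80)-(260, 210) in a 640x640 image:
print(to_yolo_line(1, 120, 80, 260, 210, 640, 640))
```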

Fig. 9. A sample image out of the seven classes of electronic components annotated using ‘makesense.io’.

Data augmentation

Data augmentation is the process of generating new images by applying various transformations to the original images. It can increase the size of the dataset and enhance the model’s robustness. Common augmentation techniques include flipping, rotation, and cropping, as sketched below. It is essential that the augmented images accurately represent the objects we wish to identify in the real world. Figure 10 illustrates the data augmentation techniques implemented, significantly expanding the dataset and improving the model’s generalization capability.
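
The flip, rotation, and crop transforms mentioned above can be sketched with PIL as follows; note that geometric transforms must also be applied to the bounding-box labels, which is omitted here for brevity.

```python
from PIL import Image

def augment(im: Image.Image):
    """Yield simple geometric variants of one image (parameters illustrative)."""
    yield im.transpose(Image.FLIP_LEFT_RIGHT)     # horizontal flip
    yield im.rotate(15, expand=True)              # small rotation
    w, h = im.size
    yield im.crop((w // 10, h // 10, w - w // 10, h - h // 10))  # center crop
```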

Fig. 10. Image augmentation implemented on each image to expand the dataset’s size, thereby enriching the available data for the featured object detection model.

Dataset fragmentation

After collecting, organizing, labeling, and augmenting the images, we must divide the dataset into training, validation, and testing sets. The training set is used to train the model, the validation set is used to tune the model’s hyperparameters, and the testing set is used to assess the performance of the model. The division should be made randomly and reflect the distribution of classes in the real world.
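
A minimal random split along these lines is sketched below; the 70/20/10 ratio and folder layout are illustrative assumptions, as the exact proportions are not stated here.

```python
import random
import shutil
from pathlib import Path

random.seed(42)  # reproducible split
for class_dir in Path("dataset").iterdir():  # one folder per class
    images = sorted(class_dir.glob("*.jpg"))
    random.shuffle(images)
    n = len(images)
    splits = {"train": images[: int(0.7 * n)],
              "val": images[int(0.7 * n): int(0.9 * n)],
              "test": images[int(0.9 * n):]}
    for split, files in splits.items():
        out = Path(split) / class_dir.name  # preserve class structure
        out.mkdir(parents=True, exist_ok=True)
        for f in files:
            shutil.copy(f, out / f.name)
```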

Experimental dataset

Our research dataset comprises a total of 700 images, processed or synthesized using the techniques described above. The distribution across each dataset class is shown in Table 2.

Table 2 Distribution of the dataset across various classes of e-waste components.

Model selection

The improved YOLOv5, YOLOv7, and YOLOv8 are three distinct yet complementary pre-trained machine learning models that were utilized in our pursuit of a robust and effective solution for electronic waste detection.

YOLOv5

Ultralytics developed the YOLOv5 (You Only Look Once version 5) real-time object detection algorithm. We chose this model because of its lightweight construction, speed, and competitive precision. The innovative structure of YOLOv5, which consists of a backbone (CSPDarknet53), a neck (PANet and FPN), and a YOLOv3-style head, was especially advantageous. Our e-waste identification system required real-time detection and high accuracy, which this configuration made possible15. The improved YOLOv5 was known for its CSPDarknet53 backbone, PANet for feature fusion, and a straightforward YOLO head for object detection, focusing on efficiency and simplicity.

YOLOv7

The improved YOLOv7 model was selected due to its enhanced performance in terms of speed and precision compared to previous versions. Its enhanced architecture includes several additional components, including an enhanced backbone (CSPDarknet53), the SAM block for an attention mechanism, and the PANet for feature fusion. Improved YOLOv7 introduced improvements such as the Spatial Attention Module (SAM) and modifications to the CSPDarknet53 backbone for enhanced feature extraction. These enhancements made it an ideal candidate for this research.

YOLOv8

The improved YOLOv8 is a recent iteration of the YOLO series that boasts improved performance, particularly concerning average precision. Its most significant enhancements include using an asymmetric loss to balance classification and localization and incorporating Mish activation to improve accuracy.

Anchor-Free Detection: YOLOv8 eliminates the need for anchor boxes by predicting the center of objects directly, simplifying the detection process and potentially increasing accuracy and speed by reducing the number of box predictions and streamlining Non-Maximum Suppression (NMS).

New Convolutions and Architectural Changes: YOLOv8 introduces C2f modules, replacing C3 modules, and modifies the convolutional layers to improve feature extraction and information flow. This includes changes in kernel sizes and the method of feature concatenation, aiming for a more efficient architecture.

Adaptive Training and Advanced Data Augmentation: MixUp and CutMix are used for more robust model training. YOLOv8 also optimizes training with adaptive learning rates and loss function balancing.

Customizable Architecture and Multiple Backbones: YOLOv8 supports EfficientNet, ResNet, and CSPDarknet, offering flexibility and customization for specific use cases.

Enhanced Speed and Accuracy: The model achieves faster inference speeds without sacrificing accuracy, positioning it as a leading solution for real-time object detection tasks.

Due to these promising improvements, we decided to include this model in our research.
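
As a rough illustration of the MixUp augmentation mentioned above, the sketch below blends a batch of images with a shuffled copy of itself; the Beta parameter and batched float-tensor inputs are illustrative assumptions (CutMix instead pastes a rectangular patch from one image into another).

```python
import torch

def mixup(images, labels, alpha=0.2):
    """Blend a batch with a shuffled copy of itself.
    Returns mixed images, both label sets, and the mixing weight."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed = lam * images + (1 - lam) * images[perm]
    return mixed, labels, labels[perm], lam  # loss = lam*L_a + (1-lam)*L_b
```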

Results and discussion

This section presents the results obtained from the improved YOLOv5, YOLOv7, and YOLOv8 models.

YOLOv5 results

The confusion matrix in Fig. 11 compares the true labels (horizontal axis) against the predicted labels (vertical axis) for various electronic components and backgrounds. The diagonal cells represent the proportion of correct predictions for each class, with values such as 0.68 for batteries and 0.73 for circuit boards indicating a higher rate of correct identification. Off-diagonal cells show misclassifications; IoT sensors are often misclassified as LCDs, as indicated by a value of 0.14. The color gradient from white to dark blue represents the scale of the proportions, with darker shades indicating higher values.

Fig. 11. Confusion matrix of the improved YOLOv5 in e-waste classification.

Figure 12 depicts the model’s ability to accurately identify and classify components in a hardware context. This demonstrates the model’s applicability in automated electronic inventory management and quality control systems.

Fig. 12. The performance outcomes of the improved YOLOv5 model on the validation dataset.

The improved YOLOv5 model’s performance on the test dataset is depicted in Fig. 13, highlighting its proficiency in identifying and classifying various electronic components such as resistors, capacitors, and voltage regulators.

Fig. 13. The performance outcomes of the improved YOLOv5 model on the test dataset.

The F1-confidence curve in Fig. 14 showcases the model’s performance for various electronic components, including batteries, capacitors, circuit boards, IoT sensors, LCDs, regulators, and resistors, as evaluated by the improved YOLOv5 model. The curve illustrates the trade-off between the F1 score and the confidence threshold for detection. A peak F1 score of 0.62 at a confidence level of 0.404 for all classes combined indicates optimal detection performance at this confidence threshold, suggesting a balanced precision-recall relationship achieved by the model across different component classes.
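
Such a curve can be reproduced from raw detections by sweeping the confidence threshold. In the sketch below, `confidences`, `is_tp` (true-positive flags per detection), and `n_gt` (the number of ground-truth objects) are assumed inputs.

```python
import numpy as np

def f1_curve(confidences, is_tp, n_gt, thresholds=np.linspace(0, 1, 101)):
    """Return the threshold and value of the peak F1 score."""
    confidences, is_tp = np.asarray(confidences), np.asarray(is_tp)
    f1s = []
    for t in thresholds:
        keep = confidences >= t            # detections surviving the threshold
        tp = is_tp[keep].sum()
        fp = keep.sum() - tp
        precision = tp / max(tp + fp, 1e-9)
        recall = tp / max(n_gt, 1e-9)
        f1s.append(2 * precision * recall / max(precision + recall, 1e-9))
    best = int(np.argmax(f1s))
    return thresholds[best], f1s[best]     # e.g. ~0.404 and ~0.62 here
```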

Fig. 14. F1-confidence curve for all the classes showcased by the improved YOLOv5 model.

Figure 15 provides a visual representation of the improved YOLOv5 model’s detection results, with the histogram showing the frequency of instances for each component class and the scatter plots depicting the spatial distribution and aspect ratios of detected objects. The histogram reveals that the model detected resistors most frequently, and the scatter plots suggest a consistent detection pattern across various scales and positions within the images. This visualization offers insights into the model’s precision and recall capabilities, reflecting its effectiveness in identifying and localizing electronic components in a dataset.

Fig. 15. The improved YOLOv5 inference performance on a given dataset.

The improved YOLOv5 results are quantified in a series of graphs that track the model’s training and validation losses and its precision, recall, and mean Average Precision (mAP) over epochs. The graphs in Fig. 16 show a consistent decrease in loss metrics and increased precision and recall as training progresses, indicating the model’s improving accuracy in object detection. Notably, the mAP metrics for both 0.5 and 0.5:0.95 thresholds exhibit an upward trend, suggesting the model’s robust performance across different levels of detection difficulty.

Fig. 16. Improved YOLOv5 model’s loss graph on the E-waste dataset.

Improved YOLOv7 results

Figure 17 illustrates the performance evaluation of the improved YOLOv7 model. Compared to the v5 result, the v7 model achieves higher accuracy for all classes, especially for the background classes. The matrix shows varying degrees of predictive accuracy, with perfect predictions for LCDs (1.00) and less accuracy for other components (e.g., resistors with a true positive rate of 0.67). The shades of blue represent the probability scores, with darker shades indicating higher probabilities. Each category also has false positives, indicated by the ‘background FP’ column. This suggests that the v7 model better distinguishes between background and objects in the dataset.

Fig. 17. Confusion matrix of the improved YOLOv7 in E-waste classification.

Figure 18 depicts a montage of annotated results from an object detection model, specifically improved YOLOv7, tested on various electronic components, such as sensors, circuit boards, and regulators. Each component is boxed with a corresponding confidence score, demonstrating the model’s precision in identifying and classifying electronic items within a validation dataset. The high confidence scores, close to 1.0, indicate robust performance and accuracy in the model’s predictions.

Fig. 18. The performance outcomes of the improved YOLOv7 model on the validation dataset.

Figure 19 illustrates the precision-recall trade-off for various electronic component categories using the improved YOLOv7 object detection model on a validation dataset. Each line represents an F1 score curve for a specific category across different confidence thresholds, such as batteries, capacitors, and sensors. The aggregate performance for all classes is highlighted, with an optimal F1 score of 0.62 achieved at a confidence level of 0.471, indicating balanced detection accuracy and reliability for the model across diverse electronic components.

Fig. 19. F1-confidence curve for all the classes showcased by the improved YOLOv7 model.

Figure 20 comprehensively evaluates the improved YOLOv7 model on a validation dataset encapsulated through various performance metrics. The graphs display trends in loss for bounding box prediction (Box), objectness of detections, classification accuracy over epochs, and precision and recall curves. Additionally, the model’s mean Average Precision (mAP) is delineated for a single IoU threshold of 0.5 and a range from 0.5 to 0.95, signifying the model’s effectiveness in object detection tasks with an emphasis on the consistency of performance over different Intersection over Union (IoU) thresholds.
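
For reference, mAP@0.5:0.95 averages the average precision over IoU thresholds from 0.50 to 0.95 in steps of 0.05, which is why it is the stricter of the two metrics. A minimal IoU helper for boxes in (x1, y1, x2, y2) form is sketched below.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

thresholds = [0.5 + 0.05 * i for i in range(10)]  # 0.50, 0.55, ..., 0.95
```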

Fig. 20. Improved YOLOv7 model’s loss graph on the E-waste dataset.

YOLOv8 results

The confusion matrix from Fig. 21 illustrates the performance evaluation of the proposed YOLOv8 model. The model achieves very high accuracy for all classes, with the highest correct prediction rates for IoT sensors (0.87) and LCDs (0.80). While it does show a high false positive rate for the background, similar to improved YOLOv5, the true positive rates for specific components are improved compared to both previous versions, indicating enhancements in the model’s learning and classification abilities. Overall, the v8 model results are very promising and suggest a significant improvement over the v5 and v7 models.

Fig. 21. Confusion matrix of the improved YOLOv8 in E-waste classification.

Figure 22 illustrates the performance outcomes of the improved YOLOv8 model on the validation dataset, highlighting its ability to detect and classify a wide range of electronic components with high accuracy. The displayed objects include IoT sensors, capacitors, batteries, and LCDs, each correctly identified and boxed, signifying the model’s precision.

Fig. 22. The performance outcomes of the improved YOLOv8 model on the validation dataset.

Figure 23 showcases the results of the improved YOLOv8 model on the test dataset, demonstrating its refined object detection capabilities on electronic components such as resistors, capacitors, and voltage regulators. The images vividly showcase the model’s accuracy in identifying and delineating the boundaries of each component, with precise bounding boxes outclassing the other YOLO models.

Fig. 23. The performance outcomes of the improved YOLOv8 model on the test dataset.

The F1-Confidence Curve from Fig. 24 showcases the improved YOLOv8 model’s performance across individual classes. Achieving a peak F1 score of 0.63 at a confidence threshold of 0.375, YOLOv8 demonstrates significant improvements over previous iterations, like improved YOLOv5 and YOLOv7. This iteration represents a superior balance of precision and recall, indicating a more accurate and reliable performance in component classification tasks across multiple classes.

Fig. 24. F1-confidence curve for all the classes showcased by the improved YOLOv8 model.

Figure 25 presents a comprehensive analysis of the improved YOLOv8 model’s object detection performance, featuring a histogram of instance counts per class and density plots for object localization. The histogram indicates the number of detected instances for each class, with ‘Resistor’ being the most prevalent, while the density plots reveal the distribution of object positions (x, y) and aspect ratios (width, height) across the dataset. This multi-faceted depiction underscores the model’s capability to recognize a wide variety of objects and accurately localize them within the image space, highlighting the advances of improved YOLOv8 in terms of both detection and contextual understanding of objects in images.

Fig. 25. Improved YOLOv8 performance inference on a given dataset.

Figure 26 illustrates the superior performance of the improved YOLOv8 model compared to its predecessors, the improved YOLOv5 and YOLOv7, as evidenced by the descending trend in both training and validation loss curves and the ascending precision and recall rates. The marked improvement in the mean Average Precision (mAP) scores, especially at the stringent 0.5 IoU threshold, highlights YOLOv8’s advancements in accurately detecting and classifying objects. These results underscore the model’s significant evolution in learning and predictive accuracy, setting a new benchmark for object detection models.

Fig. 26. YOLOv8 model’s loss graph on the E-waste dataset.

Discussion on the state-of-the-art comparison

Table 3 presents the performance results attained by each model on our dataset, and Table 4 presents the performance of the custom YOLOv5, v7, and v8 models on the FICS-PCB standard dataset.

Table 3 Improved—YOLOv5, YOLOv7, and YOLOv8 Experimental Results.
Table 4 Performance of custom YOLOv5, v7 and v8 models on FICS PCB standard dataset.

The FICS-PCB dataset, introduced by Lu et al.24, has become one of the most frequently used datasets in recent years within the field of electronic component object recognition. The data reported here is an annotated extract supplied by Roboflow25. The Roboflow dataset utilized herein comprises six classes: ICs, capacitors, resistors, inductors, transistors, and diodes, enumerated in Table 4. Nonetheless, the YOLO models proposed in this study require significant enhancement regarding classification of the FICS-PCB dataset: the dataset was inherently limited, and the training and testing splits were difficult to construct, which hindered the performance of the YOLO models. The comparison depicted in Fig. 27 aims to ensure equitable and impartial results derived from this study.

Fig. 27. Comparison of F1-confidence curve/interval results for the improved YOLOv5, YOLOv7, and YOLOv8 versions.

The rapid evolution of object detection algorithms, particularly in the realm of YOLO models, has significantly enhanced the capabilities of computer vision systems. This research critically evaluates the performance of three advanced iterations of the YOLO architecture—YOLOv5, YOLOv7, and YOLOv8—in the context of electronic waste classification. The comparative analysis focuses on the architectural nuances and operational efficacy of the above-mentioned models.

Improved YOLOv5, with its CSPDarknet53 backbone and PANet, demonstrated substantial capabilities in detecting various e-waste components. Its multi-scale prediction feature enabled it to recognize objects of different sizes, a vital attribute in e-waste management. Improved YOLOv7, building upon improved YOLOv5’s foundation, incorporated additional features like the Spatial Attention Module (SAM) block, improving focus on pertinent input areas and enhancing accuracy. However, both models faced challenges in object orientation detection, suggesting room for improvement.

Improved YOLOv8 emerges as a superior model in this research work. Its anchor-free detection method directly predicts an object’s center, eliminates the need for anchor box adjustments, and improves object localization accuracy. The enhanced CSPDarknet53 backbone in improved YOLOv8, incorporating GhostNet modules and reduced redundancy, further refines feature extraction. This architectural superiority is evident in YOLOv8’s performance. It demonstrated exceptional precision and accuracy, significantly reducing training time, a testament to its computational efficiency.

The research work’s dataset included seven distinct e-waste categories: resistors, capacitors, motherboards, regulators, batteries, LCDs, and IoT sensors. An in-depth analysis revealed variances in classification accuracy across these categories. Certain classes, like batteries and circuit boards, showed higher rates of correct identification than others, like IoT sensors and LCDs, which were prone to misclassification. This disparity can be attributed to several factors, including object size, shape, and texture differences, which affect the model’s detection capabilities.

The differential classification accuracy among e-waste categories can partly be explained by the feature learning theory in deep neural networks. This theory posits that the network learns to identify and extract features critical for classification tasks. Classes with distinct, easily discernible features tend to be classified more accurately than those with subtle or complex features. For instance, the different shapes and sizes of batteries and circuit boards might make it easier for the models to learn and identify them than the varied and intricate designs of IoT sensors and LCDs.

In conclusion, the improved YOLOv8 stands out as the most effective model for e-waste classification, primarily due to its advanced architectural features. The improved backbone network and anchor-free detection offer enhanced precision and efficiency. However, the research also highlights the need for continued refinement, especially in object orientation detection, to further enhance the model’s utility in diverse e-waste management scenarios. Future research could explore integrating these advanced models with emerging technologies to create more sophisticated, automated e-waste sorting systems. The ultimate goal remains to align technological advancements with environmental sustainability, paving the way for innovative solutions in global e-waste management.

Conclusion

This comprehensive research work critically evaluated the YOLOv5, YOLOv7, and YOLOv8 object detection models, specifically tailored to address the escalating challenges in electronic waste (e-waste) classification. The proposed research was driven by the urgency to develop efficient, accurate, and sustainable solutions for e-waste management, a burgeoning environmental concern. The findings from our research work distinctly highlight the superiority of the improved YOLOv8 model in terms of accuracy, precision, and training efficiency. This model’s advanced architecture, featuring Feature Pyramid Networks and an improved CSPDarknet53 backbone, sets it apart from its predecessors. Notably, the improved YOLOv8 demonstrated remarkable performance in object detection with reduced training times, utilizing the computational strengths of the Tesla T4 GPU on Google Colab. This efficiency makes it exceptionally suitable for real-time e-waste detection applications. However, this research also unveiled certain limitations, primarily in object orientation detection. This aspect of YOLOv8’s functionality suggests a critical area for further development. Addressing this limitation can significantly enhance the model’s utility in complex and varied e-waste management scenarios.

The present research underlines the enormous potential and future scope for improvements in the YOLO architecture. Integrating these advanced models with emerging technologies is envisioned to create sophisticated automated e-waste sorting systems. Tailoring these models to cater to specific e-waste challenges, such as diverse material recognition and improved orientation sensitivity, is crucial to our future roadmap. Additionally, real-world implementation of these models and continuous refinement based on practical feedback will be pivotal in aligning technological innovation with environmental sustainability.

The implications of this research extend beyond mere technological advancements. It marks a significant step towards integrating innovative technological solutions with ecological responsibility. By setting a new benchmark in real-time object detection for e-waste management, this research work not only contributes to the field of environmental management but also aligns with global goals of resource conservation and environmental protection. Our work paves the way for future explorations that promise to revolutionize e-waste management systems, thereby substantially contributing to global sustainability efforts.

Data availability

Data supporting the findings will be available from Mr. Akhil Rajeev P (akhilrajeevpillai6419@gmail.com) upon reasonable request.

References

  1. Kiddee, P., Naidu, R. & Wong, M. H. Electronic waste management approaches: An overview. Waste Manag. 33(5), 1237–1250 (2013).


  2. Baldé, C.P., Forti, V., Gray, V., Kuehr, R. & Stegmann, P. The global e-waste monitor 2017: Quantities, flows and resources. United Nations University, International Telecommunication Union, and International Solid Waste Association (2017).

3. Rosebrock, A. Deep Learning for Computer Vision with Python, 1.3.0. PyImageSearch.com (2018).

  4. Jiang, X., Zhou, W., Liu, M. & Liu, J. Automated sorting system for e-waste based on deep learning. J. Clean. Prod. 318, 128650 (2021).


  5. Zhou, W., Liu, M., Liu, J. & Wang, X. E-waste component type identification using convolutional neural network. J. Clean. Prod. 270, 122418 (2020).


  6. Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., Fischer, I., Wojna, Z., Song, Y., Guadarrama, S. and Murphy, K. Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the IEEE conference on computer vision and pattern recognition, 7310–7311 (2017).

7. Redmon, J., Divvala, S., Girshick, R. & Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 779–788 (2016).

8. Bochkovskiy, A., Wang, C. Y. & Liao, H. Y. M. YOLOv4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934 (2020).

  9. Wang, J., Chen, Z. & Wang, X. An e-waste detection and sorting system based on YOLOv5 and transfer learning. J. Clean. Prod. 373, 133656 (2022).


  10. Wang, C. Y., Bochkovskiy, A., Liao, H. Y. & Singh, K. YOLOv7-tiny: A lightweight and fast object detection model. arXiv preprint arXiv:2203.07360 (2022).

  11. Wang, C. Y., Bochkovskiy, A. & Singh, K. YOLOv8-tiny: A lightweight and fast anchor-free object detection model. arXiv preprint arXiv:2204.06874 (2022).

  12. Zhou, Y. A YOLO-NL object detector for real-time detection. Expert System. Appl. 238, 122256. https://doi.org/10.1016/j.eswa.2023.122256 (2024).


  13. Wang, C. Y., Bochkovskiy, A. & Liao, H. Y. M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7464–7475 (2023).

14. Lin, T. Y., Dollár, P., Girshick, R., He, K., Hariharan, B. & Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2117–2125 (2017).

  15. Li, Y., Zhang, W., Liu, Y., & Liu, Y. E-waste detection based on deep learning. In 2019 IEEE International Conference on Robotics and Biomimetics (ROBIO), 2372–2377 (IEEE, 2019).

  16. Zhang, Y. & Liu, Y. A review of e-waste detection and sorting using deep learning. J. Clean. Prod. 369, 133135 (2022).


  17. Wang, S. & Zhang, Y. A novel e-waste detection and sorting system based on YOLOv5 and transfer learning. Waste Manag. 154, 29–41 (2022).


  18. Liu, X., Zhang, X. & Wang, X. A novel e-waste detection and sorting system based on YOLOv5 and ensemble learning. Waste Manag. 142, 365–374 (2022).


  19. Chen, J., Zhang, Y. & Liu, Y. A novel e-waste detection and sorting system based on multi-scale feature fusion and YOLOv5. Waste Manag. 149, 755–768 (2022).


  20. Alsubaei, F. S., Al-Wesabi, F. N. & Hilal, A. M. Deep learning-based small object detection and classification model for garbage waste management in smart cities and IoT environment. Appl. Sci. 12(5), 2281 (2022).


  21. Lou, L., Lu, K. & Xue, J. Defect detection based on improved YOLOx for ultrasonic images (2024).

22. Bochkovskiy, A., Wang, C. Y., Liao, H. Y. & Singh, K. Improving the performance of YOLOv5 using spatial pyramid pooling. Sens. Imaging 25(10), 2–16 (2021). arXiv preprint arXiv:2108.03373.

23. Lundh, F. Python Standard Library (O’Reilly Media, 2001).

  24. Lu, H., Mehta, D., Paradis, O., Asadizanjani, N., Tehranipoor, M. & Woodard, D. L. Fics-pcb: A multi-modal image dataset for automated printed circuit board visual inspection (2020).

  25. FICS-PCB computer vision dataset by Manel. (n.d.). Roboflow. https://universe.roboflow.com/manel-0keik/fics-pcb/browse?queryText=&pageSize=50&startingIndex=0&browseQuery=true


Acknowledgements

The authors extend their gratitude to the Researchers Supporting Project, number RSPD2025R800, King Saud University, Riyadh, Saudi Arabia.

Author information

Authors and Affiliations

  1. School of Computing Science and Engineering, VIT Bhopal University, Sehore, Madhya Pradesh, India

    P. Akhil Rajeev, Vivek Dharewa, D. Lakshmi & G. Vishnuvarthanan

  2. Department of Mechanical Engineering, Yeshwantrao Chavan College of Engineering, Nagpur, India

    Jayant Giri

  3. Division of Research and Development, Lovely Professional University, Phagwara, India

    Jayant Giri

  4. Centre for Research Impact & Outcome, Chitkara University Institute of Engineering and Technology, Chitkara University, Rajpura, Punjab, 140401, India

    Jayant Giri

  5. Department of Mechanical Engineering, Saveetha School of Engineering, Saveetha Institute of Medical and Technical Sciences (SIMATS), Saveetha University, Chennai, Tamil Nadu, India

    T. Sathish

  6. Department of Software Engineering, College of Computer and Information Sciences, King Saud University, 11543, Riyadh, Saudi Arabia

    Mubarak Alrashoud

7. School of Computing Science Engineering and Artificial Intelligence, VIT Bhopal University, Bhopal, India

    D. Lakshmi

Contributions

Akhil Rajeev P, Vivek Dharewa, Lakshmi D, and G. Vishnuvarthanan: writing (original draft), writing (review and editing), methodology, investigation. Jayant Giri and T. Sathish: writing (review and editing), methodology, formal analysis, supervision. Mubarak Alrashoud: writing (review and editing), formal analysis, funding.

Corresponding authors

Correspondence to G. Vishnuvarthanan or Jayant Giri.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article


Cite this article

Rajeev, P.A., Dharewa, V., Lakshmi, D. et al. Advancing e-waste classification with customizable YOLO based deep learning models. Sci Rep 15, 18151 (2025). https://doi.org/10.1038/s41598-025-94772-x
