Predictive maintenance (PM) has emerged as a crucial strategy for minimizing downtime, reducing operational costs, and improving system reliability across various industries. This study employs machine learning techniques to predict failures in industrial machinery, leveraging historical sensor data to train predictive models. The research evaluates multiple machine learning algorithms, including Random Forest (RF), Support Vector Machine (SVM), and Artificial Neural Networks (ANN), to determine the most effective approach for PM. The results indicate that RF provides the highest accuracy and precision in failure detection. The findings contribute to the advancement of PM strategies by demonstrating the effectiveness of data-driven approaches in minimizing equipment failures.
Industrial systems and equipment are essential resources in manufacturing, logistics, and infrastructure. The unanticipated breakdown of these systems may result in considerable financial losses, production interruptions, and safety risks. Historically, maintenance approaches have been divided into three primary categories: reactive maintenance (fixing after a breakdown), preventive maintenance (planned upkeep), and PM (predicting failures through data analysis) [1-3]. Among these, predictive maintenance has garnered increasing attention due to progress in artificial intelligence and machine learning (ML) [4-6].
ML-powered predictive maintenance leverages past sensor data and operational records to build models that anticipate failures before they occur. These models help organizations improve maintenance schedules, minimize unnecessary servicing, and prolong the lifespan of equipment. In this research, we investigate different ML methods, such as Random Forest (RF), Support Vector Machines (SVM), and Artificial Neural Networks (ANN), to evaluate their effectiveness in forecasting equipment malfunctions. The aim is to determine the best model for PM applications and examine the main factors affecting model performance [7-9].
This study emphasizes the preprocessing of sensor information, methods for feature selection, and metrics for model evaluation. The effectiveness of various ML algorithms is evaluated according to accuracy, precision, recall, and F1-score. The research also looks into the practical effects of implementing PM systems in industrial environments, highlighting the challenges and opportunities linked to data-driven maintenance approaches [10-12].
Mollapour et al. (2022) [13] studied pitting corrosion of a first-stage compressor blade under real operating conditions. Their research underscores the negative impact of pitting corrosion on compressor efficiency, stressing the importance of robust PM strategies to reduce failure risks. Similarly,
Nie et al. (2020) [14] examined the stress corrosion cracking characteristics of FV520B stainless steel found in malfunctioning compressor impellers. Their research revealed critical elements leading to corrosion-related failures and emphasized the significance of choosing materials and monitoring the environment within PM frameworks.
Afia et al. (2024) [15] showed how optimization methods can improve the precision of fault classification models. Their method greatly enhanced the rates of fault detection, rendering it a feasible option for real-time PM implementations.
Li et al. (2022) [16] discussed deep transfer learning, emphasizing its ability to enhance fault diagnosis precision across different industrial settings. The research highlighted that transfer learning can utilize pre-trained models to improve PM strategies, especially in settings with limited data.
Nambiar et al. (2024) [17] investigated feature fusion methods integrated with ML for predicting faults in air compressors. Their research indicated that combining various feature extraction techniques greatly improves the predictive accuracy of machine learning models. This method allows for enhanced fault detection and aids in the advancement of smart maintenance systems.
Patil et al. (2024) [18] classified various ML methods utilized in PM and emphasized significant challenges like data quality, interpretability of models, and computational efficiency. The review determined that data-driven methods are essential in PM as they facilitate early fault identification and proactive decision-making.
Dimitrova et al. (2022) [19] provided insights into non-destructive smart inspection of wind turbine blades based on Industry 4.0 strategies. Their study underscored the role of advanced sensing technologies, artificial intelligence, and big data analytics in PM. The findings suggest that Industry 4.0 frameworks can be extended to various industrial applications, including air compressors and rotating equipment, to enhance fault detection and maintenance efficiency.
METHODOLOGY
The project was implemented in Python using Jupyter Notebook, an Integrated Development Environment (IDE) that enables code execution alongside result visualization (LOCALWEB, 2023). The following Python libraries were used: NumPy for numerical computation, Pandas for data manipulation, Matplotlib and Seaborn for graphical visualization, SciPy for Z-Score tests to detect outliers, and Scikit-Learn for model creation, training, and testing.
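A minimal sketch of the environment implied by this setup is shown below; the import aliases and the specific estimators listed are assumptions based on the libraries and models named in this paper, not a verbatim excerpt of the original notebook.

```python
# Core libraries used throughout the notebook (aliases are conventional choices)
import numpy as np                # numerical computation
import pandas as pd               # data manipulation
import matplotlib.pyplot as plt   # graphical visualization
import seaborn as sns             # statistical plots (pair plots and heat maps)
from scipy import stats           # Z-Score test for outlier detection

# Scikit-Learn components for splitting, model selection, and the classifiers discussed later
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
```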
The dataset, sourced from Kaggle, is synthetic due to the unavailability of real industry data. It simulates a fictional industry and comprises 10,000 records across 10 columns: UDI (Unique Identifier), Product ID (Quality Variant and Serial Number), Type (Product Classification), Air Temperature (Tair) [K], Process Temperature (Tprocess) [K], Rotational Speed (Srot) [rpm], Torque (T) [Nm], Tool Wear (WrTool) [min], Target (Failure or No Failure), and Failure Type. The dataset description was translated and refined for clarity.
After importing the dataset into a Pandas DataFrame, the UDI column was set as the index to avoid redundancy. Data characterization revealed no null values but a class imbalance (96.5% non-failure vs. 3.5% failure), which could impact results. Inconsistencies in the Target column (27 misclassified entries) were corrected, and duplicate checks confirmed that no repeated entries existed, ensuring each Product ID was unique. Since Product ID provided no predictive value, it was removed (Figure 1).
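The pre-processing described above can be sketched as follows; the file name and the exact rule used to reconcile the 27 inconsistent entries are assumptions made for illustration (the actual correction applied is the one shown in Figure 1).

```python
# Load the synthetic Kaggle dataset (file name is an assumption)
df = pd.read_csv("predictive_maintenance.csv")
df = df.set_index("UDI")                            # UDI is only a running identifier

print(df.isnull().sum())                            # no null values expected
print(df["Target"].value_counts(normalize=True))    # ~96.5% non-failure vs ~3.5% failure

# One illustrative way to reconcile Target with Failure Type:
# rows flagged as non-failure but carrying a failure type are relabeled as failures
mask = (df["Target"] == 0) & (df["Failure Type"] != "No Failure")
df.loc[mask, "Target"] = 1

print(df["Product ID"].is_unique)                   # duplicate check: each Product ID appears once
df = df.drop(columns=["Product ID"])                # no predictive value
```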
Figure 1. Correction of Inconsistency in the Values of the Target Column
To identify outliers, histograms were generated for Tair, Tprocess, Srot, Torque, and WrTool. The most affected features were Srot [rpm], which had a right-skewed distribution, and Torque [Nm], which followed a normal distribution but contained significant outliers. Using the IQR method, Srot and Torque were confirmed as the only columns with outliers. The Z-Score test identified fewer outliers, but Power Failure cases remained frequent among them, suggesting that the outliers hold predictive value for failures. Therefore, the outliers were retained rather than removed (Figure 2).
Figure 2. Boxplot of the Srot [rpm] and Torque [Nm] Columns
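A brief sketch of the two outlier checks mentioned above, using the column notation adopted in this paper (the actual CSV headers may differ slightly):

```python
# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
def iqr_outliers(series):
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series[(series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)]

for col in ["Srot [rpm]", "Torque [Nm]"]:
    print(col, "->", len(iqr_outliers(df[col])), "IQR outliers")

# Z-Score test: |z| > 3 is a common threshold and flags fewer points here
z = np.abs(stats.zscore(df["Torque [Nm]"]))
print("Torque [Nm] Z-Score outliers:", int((z > 3).sum()))
```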
In-depth Data Analysis
With the pre-processing complete, the next step was to define which tests would be carried out in this project and to perform the remaining exploratory analysis for each of them. To run the tests, the values in the Type column were converted to ordered numerical values, since this column describes product quality: L was mapped to 0, M to 1, and H to 2 (see the sketch below), which also makes it possible to check whether this column is relevant for the predictions.
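A one-line sketch of this ordinal encoding:

```python
# Encode product quality as an ordered numerical variable: L < M < H
df["Type"] = df["Type"].map({"L": 0, "M": 1, "H": 2})
```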
The first test was divided into two parts. The first part consisted of creating another dataset without the Failure Type column, in order to test the separability of the data and assess whether failures can be identified regardless of their type. The second part was based on filtering the dataset to keep only the records that presented a failure, in order to find out whether it is possible to differentiate the failure type once the equipment has already been identified as failing.
The second test consisted of creating another dataset, this time without the Target column, with the objective of studying whether the prediction models would be able to differentiate the six Failure Type classes using the dataset in its entirety. The results of both tests are presented and discussed below.
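Under these definitions, the three working datasets can be sketched as follows (variable names are illustrative):

```python
# Test 1, part 1: predict Target (failure vs. non-failure) without the Failure Type column
df_test1_part1 = df.drop(columns=["Failure Type"])

# Test 1, part 2: only records that presented a failure, to classify the type of failure
df_test1_part2 = df[df["Target"] == 1].drop(columns=["Target"])

# Test 2: classify Failure Type directly on the full dataset, without the Target column
df_test2 = df.drop(columns=["Target"])
```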
Part 1
As mentioned before, the first step in this test was to remove the Failure Type column. A pairwise ("2 by 2") cross plot of the columns was then generated to check the correlations between them and, in particular, with the Target column (Figure 3).
Figure 3. 2 by 2 Cross Graph of the Columns of Part 1 of Test 1
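The pairwise cross plot can be reproduced with Seaborn's pairplot; this is a sketch of the call behind Figure 3 (and, with the hue argument, Figure 6):

```python
# Pairwise scatter plot of all columns of the Test 1 Part 1 dataset (Figure 3)
sns.pairplot(df_test1_part1)
plt.show()

# The same plot colored by the Target column corresponds to Figure 6:
# sns.pairplot(df_test1_part1, hue="Target")
```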
When analyzing the graph, a nearly linear correlation was observed between Tprocess [K] and Tair [K] and between Srot [rpm] and Torque [Nm]; however, none of the columns appears to have a strong correlation with the Target column.
To investigate this further, two correlation heat maps were created, one using the Pearson coefficient and the other using the Spearman coefficient, with the heatmap function from the Seaborn library, to check how much each variable influences the Target column.
Figure 4. Heat Map Using Pearson Correlation for Part 1 of Test 1
Figure 5. Heat Map Using Spearman Correlation for Part 1 of Test 1
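A sketch of the two heat maps, using pandas' corr method with the Pearson and Spearman coefficients:

```python
# Pearson (linear) and Spearman (monotonic) correlation heat maps (Figures 4 and 5)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
sns.heatmap(df_test1_part1.corr(method="pearson"), annot=True, cmap="coolwarm", ax=axes[0])
axes[0].set_title("Pearson")
sns.heatmap(df_test1_part1.corr(method="spearman"), annot=True, cmap="coolwarm", ax=axes[1])
axes[1].set_title("Spearman")
plt.show()
```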
Analyzing Figure 4 and Figure 5 confirmed the correlations mentioned above and showed that none of the columns has a particularly high correlation with the Target column. Therefore, a more in-depth test was carried out to verify that the models would be able to perform the classification.
Checking Target Separability
To perform this step, the dataset was first split into two sets: one containing 80% of the data, used to train the models, and the other with 20% of the data, used to validate and evaluate their performance. To ensure that failure records were present in both sets, the split was made with the random state set to 42, for reproducibility, and in a stratified manner, that is, maintaining the proportion of 96.5% non-failure and 3.5% failure in both sets.
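A sketch of this stratified split with Scikit-Learn:

```python
# 80/20 stratified split, keeping the 96.5% / 3.5% class proportion in both sets
X = df_test1_part1.drop(columns=["Target"])
y = df_test1_part1["Target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
```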
After the split, another pairwise cross plot of the columns was made, this time coloring the points by the Target column to check the distribution of the failures.
Figure 6. 2 by 2 Cross Chart of Columns with Target Separation
Analyzing Figure 6, it was noted that the Target classes cannot be separated by something as simple as a straight line, so linear methods would not be of much use for this problem. This created the need for a hypothesis test to validate whether there is a substantial difference between the elements of each class and, therefore, whether separation with some ML model is possible.
Target Separability Hypothesis Test
The test consisted of the following steps:
The hypotheses tested were:
For these hypotheses, the answers obtained for each column were:
Figure 7. Graphical View of the Target Separability Hypothesis Test
With the hypothesis test completed (Figure 7), it is possible to state that the EDA is complete and that almost all variables show a difference between Failure and Non-Failure, indicating that prediction models can be built, even though the separability is not trivial. The creation of these models is covered in Section 3.4 for organizational purposes.
For this part of the analysis, the dataset was filtered to include only instances where the Target value is 1, that is, only failure occurrences were considered. The goal was to determine whether, once a failure has been predicted, the model can identify the specific type of failure. A pairwise cross plot of the dataset columns was generated to visualize class separability, revealing that Power Failure exhibited the most distinct separation. To evaluate this separability further, the Failure Type column was transformed into five separate columns using the get_dummies function, followed by the creation of two heat maps to assess correlations: one using Pearson correlation and the other using Spearman correlation. The analysis of these heat maps highlighted key relationships, such as Torque showing a moderate correlation with failures due to Overstrain and WrTool, while WrTool failure had minimal linear correlation with Srot but exhibited a moderate monotonic (Spearman) correlation. Additionally, Heat Dissipation failure displayed a strong positive correlation with Tair and moderate negative correlations with Srot and WrTool. Given that the heat maps clearly demonstrated class separability, further hypothesis testing was deemed unnecessary. With the exploratory data analysis (EDA) for this section complete, the classification model development is addressed next (a sketch of the encoding step follows).
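A sketch of the encoding and correlation step described above, continuing from the datasets defined earlier:

```python
# One-hot encode the failure types and correlate them with the sensor columns
failed = df[df["Target"] == 1].drop(columns=["Target"])
dummies = pd.get_dummies(failed["Failure Type"], dtype=int)
corr_input = pd.concat([failed.drop(columns=["Failure Type"]), dummies], axis=1)

# Pearson heat map; repeating with method="spearman" gives the monotonic version
sns.heatmap(corr_input.corr(method="pearson"), annot=True, cmap="coolwarm")
plt.title("Correlations among failed equipment only")
plt.show()
```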
DATA ANALYSIS FOR TEST 2
As mentioned, this test consisted of removing the Target column to see how the models would perform when trying to classify all Failure Type classes directly, using the dataset in its entirety. For this test, a pairwise cross plot of the columns was made again, coloring the points by Failure Type class, this time for the full dataset (Figure 8).
Figure 8. 2 by 2 Cross Chart of Columns with Separation of Failure Type of the Data Set in its Entirety
Afterwards, heat maps of the dataset were made, again separating the Failure Type classes with the get_dummies function, to check their correlations.
Figure 9. Heat Map Using Pearson Correlation for Test 2
Figure 10. Heat Map Using Spearman Correlation for Test 2
Analyzing Figure 9 and Figure 10, a clear difference can be seen between the correlations found in the previous test and the correlations in this test, which indicated that differentiating the classes might be more difficult. Since the hypothesis test had already been performed, it was not necessary to repeat it to confirm the separability. At this point, the EDA for this test was completed; the prediction models created are addressed next to better organize the work.
This section outlines the parameters used for each prediction model and the evaluation metrics prioritized, with detailed results presented below. The search parameters remained consistent across tests, except for the evaluation metric. The chosen metrics were: Recall for Target = 1 in Test 1 Part 1, since identifying potential failures was crucial to prevent unexpected breakdowns during production; Accuracy for Test 1 Part 2, since the dataset already contained only failure instances and correctly classifying the failure type would save time in troubleshooting; and Precision for the No Failure class in Test 2, since a high precision for this class means that few actual failures are classified as No Failure. Model selection was performed using GridSearch and RandomSearch from Scikit-Learn, ensuring the best parameter combinations were chosen for training (an illustrative search is sketched below).
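An illustrative model selection run for Test 1 Part 1; the parameter grid and the number of folds are assumptions, since the exact search spaces are not listed in the text.

```python
# Grid search over a Decision Tree, scored on Recall of the failure class (Target = 1)
param_grid = {"max_depth": [3, 5, 10, None], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring="recall",   # recall of the positive class
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```

For Test 1 Part 2 and Test 2, the same search can be repeated with scoring="accuracy" and with a custom precision scorer focused on the No Failure class, respectively.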
With the best parameters chosen, the models were trained and then tested so that they could be compared, and the one with the best performance was identified. For better organization, the complete solution is also available in the work published on GitHub; the models with the best results are presented and discussed below.
Test 1 Results
Part 1
For this part of the first test, the scoring metric used was Recall, as explained, and the classifier that performed best was the Decision Tree (DT), with an average score of 0.644 in model selection, obtaining 85% Recall in training (Figure 11) and 77% in testing (Figure 12).
Figure 11. Model Training Result for Test 1 Part 1
Figure 12. Model Test Result for Test 1 Part 1
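A sketch of how the Recall figures and the error counts discussed below can be read from the held-out set:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Evaluate the selected Decision Tree on the 20% test set
best_dt = search.best_estimator_
y_pred = best_dt.predict(X_test)

print(classification_report(y_test, y_pred, digits=2))
# The confusion matrix makes explicit how many failing machines are predicted as non-failure
print(confusion_matrix(y_test, y_pred))
```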
Given the level of imbalance in the dataset, this result can be considered surprising, since the amount of failure data available for training was minimal and the model still achieved 77% Recall. There is a noticeable difference of 8% between the training and test results, which may be due to the difficult separability of the data and to the random split, which may have placed patterns in the test set that were not observed in the training set.
Despite this result, the model is not yet ready for deployment in industry, since it classified 16 pieces of equipment that will fail as "non-failure". Even with a relatively high Recall, these missed failures may be critical for the operation of the plant and could cause a greater loss than the profit provided by the 54 pieces of equipment correctly classified.
For the first part of the first test, the test results of all models were ranked by Recall percentage for comparison purposes:
Part 2
For this part of the first test, the scoring metric used was Accuracy, as explained, and the classifier that performed best was Logistic Regression (LR), with an average score of 0.885 in model selection, obtaining 97% Accuracy in training (Figure 13) and 87% in testing (Figure 14).
Figure 13. Model Training Result for Test 1 Part 2
Figure 14. Model Test Result for Test 1 Part 2
Observing both the training and the test results, it can be seen that the data with values considered outliers in the Torque [Nm] column were indeed useful for the correct classification of the Power Failure class, which reached an F1-score of 90%. Despite the 10% difference between training and test Accuracy, the model can be considered reliable, especially for classifying Heat Dissipation Failure and Power Failure, and it would save employees time when identifying the problem.
Despite this result, this model is also not ready for deployment in industry, since it is part of a larger pipeline and requires a correct Target classification in order to be used. Since the first part was rejected due to the amount of equipment classified incorrectly, this model also becomes unfeasible within that pipeline.
However, this model cannot be completely discarded, since its classification power and usefulness for industrial assistance are high on their own, and its errors do not impact maintenance if it is applied correctly. Therefore, it can be deployed as a standalone model if something or someone can correctly identify signs of equipment failure beforehand.
For the second part of the first test, the test results of all models were ranked by Accuracy percentage for comparison purposes:
Test 2 Results
For the second test, the scoring metric used was Precision, as explained, and the classifier that performed best was the DT with an average score of 0.977 in model selection, obtaining a result of 99% Precision in training (Figure 15) and 99% for testing (Figure 16).
Figure 15. Model Training Result for Test 2
Figure 16. Model Test Result for Test 2
Looking at the model's training and test results, the first impression is that the model is almost perfect. However, the dataset is imbalanced, with approximately 96.5% of the records labeled No Failure, so a model that predicts every record as No Failure already reaches 96.5% Accuracy and Precision. This indicates that the priority is to analyze how many records that are actually failures the model predicts as having no failure.
With this observation in mind, the model is not ready to be deployed in industry, since the number of failures it classifies as non-failures is greater than the number of such errors made in the first part of the first test. This result may have been caused both by the random distribution of data between the training and test sets and by the difficulty of directly identifying the Failure Types, given the difficult separation observed. For the second test, the test results of all models were ranked by Precision percentage for comparison purposes, using the number of incorrectly classified records as the tiebreaker criterion:
This work studied a set of synthetic data generated to simulate a fictitious industry and to resemble the reality of real industries, since obtaining such data is difficult, in an attempt to improve the maintenance area with respect to PM. Monitoring performed by a machine that can distinguish when equipment is about to be damaged by some type of failure has the capacity to improve the quality of maintenance, facilitate the important work of this sector, and reduce the company's losses.

With these objectives in mind, the study was carried out taking into account the data imbalance and the non-trivial separability of the targets, which lowered expectations about the results, since the Scikit-Learn library is not ideally suited to these conditions. Knowing this, separating the work into two different tests and executing the complete EDA for both was very important for understanding the problem as a whole and the difficulties to be faced. With the models created, it was observed that, in both tests, they had difficulty in correctly classifying the data, as expected, and the classifiers that stood out were the DT and LR. Despite the difficulties encountered and the low expectations, the results were surprising. Although expectations were exceeded, the models remain precarious and need further study to avoid problems in industry as much as possible, to facilitate the work of the maintenance sector, and to reduce the time during which production is stopped, thereby increasing the industries' profits.
Therefore, we conclude that the models, as they stand, are not competent enough to be put into practice because, although the number of correct classifications is much greater than the number of errors, the errors are damaging enough to make human monitoring more viable. Finally, much research remains to be done in this area before an ML model becomes more viable than delegating this function to an employee.