Leveraging Distributed Systems for Fault-Tolerant Cloud Computing: A Review of Strategies and Frameworks


  • Saman M. Almufti IT Dept., Technical College of Informatics, Akre University for Applied Sciences, Duhok, Iraq; College of Science, Department of Computer Science, Nawroz University, Duhok, Kurdistan Region, Iraq.
  • Subhi R. M. Zeebaree Energy Eng. Dept., Technical College of Engineering, Duhok Polytechnic University, Duhok, Iraq.




Ensuring system availability and reliability is crucial in the rapidly developing field of cloud computing. Fault tolerance in cloud infrastructure becomes more important as organizations grow more reliant on that infrastructure to support their critical operations. This article examines cloud computing and distributed systems, specifically the forms of cloud computing, the fault tolerance methods, and the frameworks that make cloud services robust and durable.

Cloud computing has transformed how organizations and individuals access and administer computing resources. The paper discusses several deployment options, including public, private, hybrid, and multi-cloud environments, which offer organizations flexibility, scalability, and cost-effectiveness. This flexibility makes cloud computing well suited to a wide range of applications, from hosting websites to running complex data analytics.

Despite these benefits, cloud computing faces substantial obstacles, including the need to maintain uninterrupted service in the face of hardware failures, network outages, and software errors. Fault tolerance is therefore critical to maintaining the dependability and availability of the system.


The primary objective of this study is to examine how distributed systems can be used to improve fault tolerance in cloud computing. Distributed systems are well suited to addressing fault tolerance challenges, owing to their intrinsic ability to divide workloads and data across several nodes. This approach relies on redundancy, replication, and the ability to recover seamlessly from disruptions, thereby enhancing the resilience and resource efficiency of cloud services. The paper reviews novel techniques and frameworks that use distributed systems to build fault-tolerant cloud computing architectures, emphasizing their substantial influence on the cloud computing domain. Finally, the paper includes a comparative analysis table covering twenty previous works.
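As an illustration of the redundancy and replication idea described above, the following is a minimal Python sketch and is not taken from the reviewed works. The names `Node` and `ReplicatedStore`, the in-memory storage, the `alive` flag used to simulate failures, and the CRC32-based placement rule are all assumptions made for this example. It shows data being written to several nodes so that a read can still succeed after one of them fails.

```python
import zlib


class Node:
    """Hypothetical in-memory storage node; `alive` simulates a crash when False."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.store = {}

    def put(self, key, value):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        self.store[key] = value

    def get(self, key):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.store[key]  # raises KeyError if this node never saw the key


class ReplicatedStore:
    """Writes each key to `replicas` nodes; reads fall back to the next replica."""

    def __init__(self, nodes, replicas=2):
        if not 1 <= replicas <= len(nodes):
            raise ValueError("replicas must be between 1 and the number of nodes")
        self.nodes = nodes
        self.replicas = replicas

    def _replica_set(self, key):
        # Deterministic placement (an assumption of this sketch): start at a
        # CRC32-derived index and take the next `replicas` nodes in ring order.
        start = zlib.crc32(key.encode("utf-8")) % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)] for i in range(self.replicas)]

    def put(self, key, value):
        # Redundancy: try every replica and count how many accepted the write.
        written = 0
        for node in self._replica_set(key):
            try:
                node.put(key, value)
                written += 1
            except ConnectionError:
                continue
        if written == 0:
            raise RuntimeError("no replica accepted the write")
        return written

    def get(self, key):
        # Recovery: skip replicas that are down or missing the key.
        for node in self._replica_set(key):
            try:
                return node.get(key)
            except (ConnectionError, KeyError):
                continue
        raise KeyError(key)
```

With three nodes and `replicas=2`, a value written through `ReplicatedStore.put` stays readable through `ReplicatedStore.get` after either one of its two replicas is marked as failed. The sketch deliberately omits consistency control, re-replication after failure, and failure detection, which the frameworks surveyed in the paper address in various ways.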







How to Cite

Almufti, S. M., & Zeebaree, S. R. M. (2024). Leveraging Distributed Systems for Fault-Tolerant Cloud Computing: A Review of Strategies and Frameworks. Academic Journal of Nawroz University, 13(2), 9–29. https://doi.org/10.25007/ajnu.v13n2a2012