High-performance computing is making the transition to the cloud, and many companies are porting their HPC applications to it. Although cloud computing offers a wide range of benefits, such as elasticity, scalability, seemingly unlimited resources, pay-as-you-go pricing, and hardware virtualization, it still poses many technical feasibility challenges. Read on!
Dynamic Scalability
One of the major problems with the cloud is that dynamic scalability is not up to the mark. Many HPC applications depend on dynamic scalability, that is, the application's ability to add and release compute nodes on demand.
For instance, suppose you set a response-time threshold of two milliseconds for user queries. When the response time exceeds the threshold, the application allocates additional compute nodes to handle the busy queries. However, compute node allocation can take up to 10 minutes, and such a delay prevents the new nodes from handling busy queries in time.
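The threshold-driven scaling described above can be sketched as follows. This is a minimal illustration, not any platform's real API: the `cloud` client and its `allocate_node()` method are hypothetical stand-ins for a provider SDK, and the allocation delay is the part that can stretch to minutes in practice.

```python
import time

# Illustrative response-time threshold (the two-millisecond example above).
RESPONSE_TIME_THRESHOLD_MS = 2.0

class AutoScaler:
    """Requests an extra compute node whenever queries get too slow."""

    def __init__(self, cloud):
        self.cloud = cloud   # hypothetical provider client with allocate_node()
        self.nodes = []

    def on_query_served(self, response_time_ms):
        # If the observed response time exceeds the threshold, ask the
        # provider for another node and report how long allocation took.
        if response_time_ms > RESPONSE_TIME_THRESHOLD_MS:
            start = time.monotonic()
            node = self.cloud.allocate_node()  # can take minutes in practice
            self.nodes.append(node)
            return time.monotonic() - start
        return 0.0
```

The delay returned by `on_query_served` is exactly the window during which busy queries go unhandled, which is why slow allocation undermines dynamic scalability.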
Low-Level Performance Control
Many HPC users do not have adequate control over the compute nodes. HPC applications read user queries and split them into sub-tasks distributed among compute nodes. Each node then requests data from storage depending on its sub-tasks. For instance, if the user's next query is the same, will the application reuse the previous data set or execute the same process again?
Even if you transfer the data from cloud storage to a node's local storage to reduce latency, the same compute node may not serve the next user query. Without low-level control, it is not easy to fully exploit the capacity of a compute node, and it is equally difficult to maximize data locality.
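The reuse question above amounts to per-node caching, which can be sketched with a simple memoized fetch. `fetch_from_storage` is a hypothetical stand-in for the application's storage client; the caveat from the text still applies, since this cache lives on one node and helps only if the same node serves the repeated query.

```python
import functools

def fetch_from_storage(key):
    # Placeholder for a slow read from remote cloud storage.
    return f"data-for-{key}"

@functools.lru_cache(maxsize=128)
def get_subtask_data(key):
    # Repeated queries for the same key reuse the previous data set
    # instead of re-reading remote storage -- but only on this node.
    return fetch_from_storage(key)
```

Without low-level placement control, there is no guarantee the next identical query lands on the node holding this cache, which is the data-locality problem in a nutshell.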
Multi-Tenancy
Many HPC companies are unsure whether compute nodes are dedicated to their application. Although multi-tenancy is a great cloud feature, it remains an issue for many HPC applications.
Multi-tenancy refers to sharing compute nodes among multiple HPC applications. Because many applications run on the same compute node, the network bandwidth available to each HPC application decreases significantly. As a result, you may face problems like performance degradation.
Reliability and Fault Tolerance
Reliability and fault tolerance are other feasibility issues for HPC in the cloud. For example, some cloud platforms, like Windows Azure, do not publish how long it takes to replace a failed node with a new one. In addition, many companies don't know the impact of hardware failure on HPC application performance.
It is crucial to study and account for such impacts when developing and maintaining the load management system. For example, on a Platform-as-a-Service (PaaS) architecture, one downside is that testing your HPC application against compute node failures and verifying its fault tolerance is very difficult.
No Remote Debugging
Although companies can debug their cloud HPC applications locally, the architecture might not support remote debugging. As a result, you may face problems developing and deploying large HPC applications in the cloud.
HPC program development involves technical problems like parallel and remote debugging. Research shows that these are new problems in cloud computing, and service providers must develop and offer effective debugging tools to companies.
It is crucial to use lightweight profiling tools to analyze and streamline performance. However, many cloud platforms lack these features, which causes problems for businesses.
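When the platform offers no profiling support, applications can at least self-profile with standard tooling. The sketch below uses Python's standard-library `cProfile` and `pstats` as one example of lightweight profiling; the `hot_loop` workload is purely illustrative.

```python
import cProfile
import io
import pstats

def hot_loop(n):
    # Stand-in for a compute-heavy sub-task worth profiling.
    total = 0
    for i in range(n):
        total += i * i
    return total

def profile_report(func, *args):
    """Run func under cProfile and return (result, top-5 report text)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args)
    profiler.disable()
    buf = io.StringIO()
    pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
    return result, buf.getvalue()
```

Shipping a report like this back from each compute node gives a crude but portable substitute for the platform-level profiling features the text says are missing.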
Although many cloud computing platforms provide great architectures, they have flexibility issues that prevent companies from optimizing HPC performance. HPC users need low-level controls to improve their high-performance computing applications.
If you want to overcome these challenges, contact Clovertex for high-performance computing services. We have a team of experienced and professional IT personnel with extensive knowledge of technical HPC.