Instance-related Failures

Last Updated At: 2025-08-14 11:11:37

High Bandwidth Usage Prevents Log-in

This document describes how to troubleshoot and resolve cases where a Linux or Windows CVM cannot be connected to remotely because of high bandwidth usage.

Failure Symptoms

• After you log in to the CVM console, the bandwidth monitoring data of the CVM shows that excessively high bandwidth usage is preventing connection to the Cloud Platform server.
• The self-diagnostic tool reports excessively high bandwidth usage.

Failure Location and Troubleshooting

For Linux Servers

After you log in to the Linux CVM via VNC, do the following:

Note: The following operations take a CVM running CentOS 7.6 as an example.

1. Run the following command to install iftop, a utility for monitoring traffic on Linux servers.
    yum install iftop -y
   Note: On Ubuntu, run apt-get install iftop -y instead.
2. Run the following command to install lsof.
    yum install lsof -y
3. Run the following command to start iftop.
    iftop
   In the iftop display:
   - <= and => indicate the direction of traffic.
   - TX means sent traffic.
   - RX means received traffic.
   - TOTAL indicates total traffic.
   - Cum indicates the cumulative traffic since iftop was started.
   - peak indicates the peak traffic rate.
   - rates represent the average traffic rate over the past 2s, 10s, and 40s.
4. Based on the IP address consuming traffic in iftop, run the following command to find the process connected to that IP address.
    lsof -i | grep IP
   For example, if the IP address consuming traffic is 203.205.141.123, run the following command:
    lsof -i | grep 203.205.141.123
   The following returned results show that the bandwidth of this server is mainly consumed by the SSH process.
    sshd 12145 root 3u IPv4 3294018 0t0 TCP 10.144.90.86:ssh->203.205.141.123:58614 (ESTABLISHED)
    sshd 12179 ubuntu 3u IPv4 3294018 0t0 TCP 10.144.90.86:ssh->203.205.141.123:58614 (ESTABLISHED)
5. Check the process that consumes more bandwidth to evaluate whether the process is normal.
   - If a business process is consuming more bandwidth, analyze whether the increase is caused by a change in access volume, and whether there is room for optimization or the server configuration needs to be upgraded.
   - If an abnormal process is consuming more bandwidth, the increase may be caused by a virus or Trojan horse. You can terminate the process manually or use security software to detect and remove it. You can also back up the data and reinstall the system.

Failure to Shut Down or Restart a Cloud Server

When you shut down or restart a cloud server, the operation fails in rare cases. If this happens, troubleshoot the cloud server as follows.

Possible Causes

• CPU or memory utilization is excessively high.
• The Linux cloud server does not have the ACPI management program installed.
• System updates on the Windows cloud server take too long.
• The Windows cloud server was applied for the first time and has not finished initializing.
• Software installed on the operating system, or a Trojan or virus infection, has damaged the system itself.

Troubleshooting

Check CPU/Memory Usage

1. Check the CPU/memory usage based on the operating system of the cloud server.
   o For Windows cloud servers: Right-click the taskbar in the cloud server and select Task Manager.
   o For Linux cloud servers: Run the top command and view the %CPU and %MEM columns.
2. Terminate processes with excessively high CPU or memory utilization according to the actual usage. If you still cannot shut down or restart the server, use the force shutdown/restart function.

Check if the ACPI Management Program is Installed

Note: This operation applies to Linux cloud servers.

Run the following command to check whether the ACPI process exists.
    ps -ef | grep -w "acpid" | grep -v "grep"
• If the ACPI process exists, use the force shutdown/restart function.
• If there is no ACPI process, install the ACPI management program. For specific operations, refer to Configuring Linux Power Management.

Check if Windows Update is Running

Note: This operation applies to Windows cloud servers.

In the Windows cloud server OS interface, click Start > Control Panel > Windows Update to see whether any patches or programs are being updated.
• When Windows installs certain patches, it performs additional processing while shutting down the system. The update may then take too long, causing the shutdown or restart to fail. We recommend that you wait until Windows updates are complete before attempting to shut down or restart the cloud server.
• If no patches or programs are being updated, use the force shutdown/restart function.

Check if the Cloud Server Has Been Initialized

Note: This operation applies to Windows cloud servers.

When you apply for a Windows cloud server for the first time, the system distributes the image via Sysprep, so initialization takes slightly longer. Until initialization is complete, Windows ignores shutdown and restart operations, causing them to fail.
• If the Windows cloud server you applied for is still initializing, we recommend that you wait until initialization is complete before attempting to shut down or restart the cloud server.
• If the cloud server has already been initialized, use the force shutdown/restart function.

Check if the Installed Software is Normal

Use inspection tools or antivirus software to check whether the software installed on the cloud server is normal or has been infected by Trojans, viruses, and so on.
• If abnormalities are found, the system itself may have been damaged, causing the shutdown or restart to fail.
  We recommend that you uninstall the software, scan with security software, or back up your data and reinstall the system.
• If no abnormalities are found, use the force shutdown/restart function.

Force Shutdown/Restart Function

Note: The cloud platform provides a force shutdown/restart feature, which can be used when multiple attempts to shut down or restart the cloud server fail. This operation forcefully shuts down or restarts the cloud server and may cause data loss or file system damage on the cloud server.

1. Log in to the cloud server console.
2. On the instance management page, select the cloud server to be shut down or restarted and perform the operation you need.
   o Shut down the cloud server: Click More > Instance Status > Shut Down.
   o Restart the cloud server: Click More > Instance Status > Restart.
3. In the pop-up window titled "Shutdown" or "Restart Instance," select the forced option and click OK.
   o To shut down: select "Force Shutdown".
   o To restart: select "Force Restart".

Unable to Create Network Namespace

Problem Description

When you run a command to create a new network namespace, the command hangs and cannot continue. dmesg shows the following message: "unregister_netdevice: waiting for lo to become free. Usage count = 1"

Causes of Problem

This problem is caused by a kernel bug. Currently, the following kernel versions have this bug:
● Ubuntu 16.04 x86_64 with kernel version 4.4.0-91-generic.
● Ubuntu 16.04 x86_32 with kernel version 4.4.0-92-generic.

Solution

Upgrade the kernel to version 4.4.0-98-generic, which fixes this bug.

Processing Procedures

1. Run the following command to check the current kernel version.
    uname -r
2. Run the following commands to check whether the 4.4.0-98-generic kernel version is available for upgrade.
    sudo apt-get update
    sudo apt-cache search linux-image-4.4.0-98-generic
   If the following information is displayed, the version exists in the source and can be installed:
    linux-image-4.4.0-98-generic - Linux kernel image for version 4.4.0 on 64 bit x86 SMP
3. Run the following command to install the new kernel version and the corresponding header package.
    sudo apt-get install linux-image-4.4.0-98-generic linux-headers-4.4.0-98-generic
4. Run the following command to restart the system.
    sudo reboot
5. After the system restarts, log in and run the following command to check the kernel version.
    uname -r
   If the following result is displayed, the kernel was updated successfully:
    4.4.0-98-generic

Kernel and IO Related Problems

Kernel Problem Location and Solution

Failure Symptoms

Kernel-related failures may make the machine impossible to log in to or cause it to restart abnormally.

Possible Reasons

Kernel hung_task
The hung task mechanism is implemented by the kernel thread khungtaskd, which monitors processes in the TASK_UNINTERRUPTIBLE (D) state. If a process remains in the D state for longer than kernel.hung_task_timeout_secs (default: 120 seconds), the stack information of the hung task is printed. If kernel.hung_task_panic=1 is configured, the hung task triggers a kernel panic and the machine restarts.

Kernel soft lockup
A soft lockup means that a CPU is occupied by kernel code and cannot run other processes. Soft lockups are detected by assigning each CPU a kernel thread, [watchdog/x], that runs periodically. If the thread does not run within a certain period (2*kernel.watchdog_thresh by default; in the 3.10 kernel, kernel.watchdog_thresh defaults to 10 seconds), a soft lockup is considered to have occurred. If kernel.softlockup_panic=1 is configured, the soft lockup triggers a kernel panic and the machine restarts.

Kernel panic
The kernel crashes abnormally, causing the machine to restart abnormally.
Common kernel panic scenarios are as follows:
• The kernel has a hung_task and kernel.hung_task_panic=1 is configured.
• The kernel has a soft lockup and kernel.softlockup_panic=1 is configured.
• A kernel bug is triggered.

Processing Procedures

Troubleshooting and resolving kernel-related problems is relatively complicated. We recommend that you submit a ticket for further diagnosis and handling.

Hard Disk Problem Location and Solution

Hard disk inode is full
Symptom: Creating a new file fails with the error message No space left on device, and the df -i command shows inode usage at 100%.
Possible cause: The file system has run out of inodes.
Processing steps: Delete unnecessary files or expand the hard disk capacity.

Hard disk space usage is full
Symptom: Creating a new file fails with the error message No space left on device, and the df -h command shows hard disk space usage at 100%.
Possible cause: The hard disk has run out of space.
Processing steps: Delete unnecessary files or expand the hard disk capacity.

Hard disk read only
Symptom: Files on the file system can be read, but new files cannot be created.
Possible cause: The file system is damaged.
Processing steps:
1. Create a snapshot to back up disk data. See Create a Snapshot for details.
2. Perform the corresponding processing steps according to the hard disk type:
   System disk
   Data disk
   It is recommended to directly restart the instance; for details, see Restarting an Instance.

Hard disk %util high
Symptom: The instance is stuck, and logging in using SSH or VNC is slow or unresponsive.
Possible cause: High IO causes hard disk %util to reach 100%.
Processing steps: Check whether the high IO is reasonable, and evaluate whether to reduce IO reads and writes or replace the hard disk with a higher-performance one.
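
The df -h and df -i checks described above can be combined into one small helper. The following is a minimal sketch, not an official tool: the function name flag_full and the 90% threshold used in the usage note are assumptions made for illustration, and the sketch assumes that mount points contain no spaces. It reads POSIX-format df output on standard input and prints every mount point whose usage column is at or above the given percentage.

```shell
#!/bin/sh
# flag_full is a hypothetical helper name, not part of any platform tooling.
# It reads POSIX-format df output (df -P or df -Pi) on stdin and prints the
# mount point and usage of every filesystem at or above the given percentage.
flag_full() {
    awk -v limit="$1" '
        NR > 1 {                      # skip the header line
            use = $5                  # 5th column: Capacity (IUse% with -i)
            sub(/%/, "", use)         # drop the trailing percent sign
            if (use + 0 >= limit)     # a "-" (no inode data) evaluates to 0
                print $6, use "%"
        }'
}
```

For example, df -P | flag_full 90 lists filesystems that are at least 90% full, and df -Pi | flag_full 90 does the same for inodes, assuming the installed df accepts -i together with -P (GNU df does).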

Missing Soft Links in System Bin or Lib

Symptom Description

During command execution or system startup, errors indicate that commands or libraries cannot be found.

Possible Causes

In systems such as CentOS 7, CentOS 8, and Ubuntu 20, directories such as bin, sbin, lib, and lib64 are soft links, as shown below:
    lrwxrwxrwx 1 root 7 Jun 19 2018 bin -> usr/bin
    lrwxrwxrwx 1 root 7 Jun 19 2018 lib -> usr/lib
    lrwxrwxrwx 1 root 9 Jun 19 2018 lib64 -> usr/lib64
    lrwxrwxrwx 1 root 8 Jun 19 2018 sbin -> usr/sbin
If these soft links are deleted, errors occur during command execution or system startup.

Solution

Follow the steps below to check for and create the required soft links.

Directions

1. Refer to Using Rescue Mode to enter rescue mode.
2. Run the mount and chroot commands described there. After running the chroot command:
   o If it reports an error, run cd /mnt/vm1.
   o If it reports no error, run cd /.
3. Run the following command to check whether the corresponding soft links exist.
    ls -al / | grep -E "lib|bin"
   o If they exist, please contact us for assistance through Online Support.
   o If they do not exist, run the following commands as needed to create the missing soft links.
    ln -s usr/lib64 lib64
    ln -s usr/sbin sbin
    ln -s usr/bin bin
    ln -s usr/lib lib
4. Run the following command to verify the soft links.
    chroot /mnt/vm1 /bin/bash
   If there are no error messages, the soft links have been repaired successfully.
5. Refer to Using Rescue Mode to exit rescue mode and boot the system.

Error Creating Files Due to "no space left on device"

Symptom Description

When you create a new file on a Linux cloud server, the error message "no space left on device" appears.

Possible Causes

• Hard disk space is full.
• File system inodes are exhausted.
• df and du report inconsistent usage.
   o A file has been deleted, but a process still holds the corresponding file descriptor, which prevents the hard disk space from being released.
   o Nested mounts.
     For example, the /data directory on the system disk occupies a large amount of space, and another data disk is also mounted at /data, so df and du report inconsistent usage for the system disk.

Solution

Refer to Troubleshooting Methods to identify and resolve the issue.

Troubleshooting Methods

Resolving Hard Disk Space Full Issues

1. Log in to the cloud server; for details, see Logging into a Linux Instance using Standard Login Method.
2. Run the following command to view hard disk usage.
    df -h
3. Identify the mount point with high hard disk usage and enter it with the following command.
    cd corresponding_mount_point
   For example, to enter the system disk mount point, run cd /.
4. Run the following command to find the directories that occupy the most space.
    du -x --max-depth=1 | sort -n
   Based on the size of the largest directory found, proceed as follows:
   o If the directory size is much lower than the total hard disk space used, continue troubleshooting per the Resolving Inconsistent df du Problems steps.
   o If the directory size is considerable, repeat step 4 inside that directory to locate the large files, and assess whether they can be deleted given your business requirements. If they cannot be deleted, expand the cloud disk to increase hard disk capacity.

Resolving File System Inode Full Issues

1. Log in to the cloud server; for details, see Logging into a Linux Instance using Standard Login Method.
2. Run the following command to view inode usage.
    df -i
3. Identify the mount point with high inode usage and enter it with the following command.
    cd corresponding_mount_point
   For example, to enter the system disk mount point, run cd /.
4. Run the following command to find the directories containing the most files, then address the issue. This command can take a long time to complete.
    find / -type f | awk -F / -v OFS=/ '{$NF="";dir[$0]++}END{for(i in dir)print dir[i]" "i}' | sort -k1 -nr | head

Resolving Inconsistent df du Problems

Addressing Process Holding onto File Descriptors

Run the following command to view processes that are holding deleted files open.
    lsof | grep delete
Based on the returned results, proceed as follows:
• Kill the corresponding process.
• Restart the corresponding services.
• If many processes hold onto file descriptors, consider rebooting the server.

Addressing Nested Mounts Issue

1. Run the mount command to mount the disk occupying large space to /mnt. For example:
    mount /dev/vda1 /mnt
2. Run the following command to enter /mnt.
    cd /mnt
3. Run the following command to find the directories that occupy the most space.
    du -x --max-depth=1 | sort -n
   Based on the returned results, assess whether directories or files can be deleted given your business requirements.
4. Run the umount command to unmount the disk. For example:
    umount /mnt
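
Before working through the df/du inconsistency steps above, it can help to measure how large the gap actually is. The following is a minimal sketch under stated assumptions: df_du_gap_kb is a hypothetical function name, and the result is only meaningful when the argument is the mount point itself (for example /). It subtracts the space that du can see on that filesystem from the space that df reports as used. A large positive result suggests deleted files that are still held open, or data hidden underneath a nested mount; small differences are normal.

```shell
#!/bin/sh
# df_du_gap_kb is a hypothetical helper name used only for illustration.
# It prints (space df reports as used) minus (space du can see), in KB,
# for the filesystem mounted at the given mount point.
df_du_gap_kb() {
    mount_point=$1
    # df -P prints one POSIX-format line per filesystem; column 3 is used KB.
    used_kb=$(df -Pk "$mount_point" | awk 'NR == 2 { print $3 }')
    # du -x stays on one filesystem, -s prints a single total, -k uses KB.
    # Permission errors are discarded; the total is still printed.
    seen_kb=$(du -xsk "$mount_point" 2>/dev/null | awk 'END { print $1 }')
    echo $((used_kb - seen_kb))
}
```

For example, df_du_gap_kb / reports the gap for the system disk. Run it as root so that du can read every directory; otherwise unreadable directories inflate the result.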