Special commentary on AIStation: AI cluster operation is as important as AI algorithm innovation

AIStation is Inspur's AI development resource platform for deep learning development and online inference deployment. It provides containerized deployment, visual development, and centralized management, with the aim of precise resource management and scheduling, agile data integration and acceleration, and workflow-oriented integration of AI scenarios with business processes. For this review, an STH editor tried Inspur AIStation hands-on as both administrator and user, and the resulting report explains its functions and practical value in detail. STH concluded that Inspur AIStation enables fine-grained management of AI resources, effectively connects the development environment, computing resources, and data resources, and improves development efficiency.

Here is the review by Patrick Kennedy, senior editor of STH.

We often see building and running AI clusters, that is, managing all the computing resources, users, data, and models across training and inference, described as a challenge. Doing AI cluster operations well may not be as glamorous as finding a new way to solve a deep learning problem, but it is crucial for scaling shared resources within an organization. Inspur AIStation aims to manage this life cycle. We spent some time hands-on with the solution to understand how it works. I also took the opportunity to ask Liu Jun, who leads AI at Inspur, several questions about the new product.

Inspur AIStation landing interface

Inspur has a test cluster in Shandong Province, China, which I accessed over a Cisco VPN. Although I do not know the exact location, I believe this Inspur building in Jinan (the capital and second-largest city of Shandong Province) is the site of the test cluster; it is not on the same campus as the Inspur intelligent factory we visited in 2019. Many of our readers are outside China and have never been to Shandong; for perspective, the population of Jinan is similar to that of New York.

If you have heard of New York but not Jinan, that is exactly why I want to focus on AIStation. Inspur is one of the top three server suppliers in the world, and about half of the AI servers in the Chinese market come from Inspur. Inspur targets hyperscale users, and one of its main strengths is AI servers such as the Inspur NF5468M5 and Inspur NF5488M5, which we recently reviewed. AIStation is the Inspur product that helps manage large numbers of AI training and inference servers, along with their data and users.

Fundamentally, AIStation is a Kubernetes-based clustering solution. What Inspur has done is bring together many of the common tools and tasks involved in running an AI cluster. For example, it manages users, groups, permissions, and quotas; the data associated with each user or group, along with the permissions and storage for that data; and development work and resource scheduling on the cluster. In addition, we will cover some of the monitoring and alerting at the job, user, and node levels.

2. AIStation operation: administrator's perspective

Not every screenshot is shown here; only a selection of the key interfaces is presented. Before looking at what users see in the system, I would like to cover management. The solution is very modern, built on Kubernetes and containers. Comparing it with many traditional GPU / HPC / AI scheduling systems makes its modern architecture easier to appreciate.

Cluster monitoring

Once AIStation is running, most day-to-day management can be done by script or through the web GUI. You can inspect the load and hardware configuration of each node, and even trace the chain from a user to a container to the hardware it runs on, down to the level of a single GPU.

Cluster monitoring > node monitoring

Resource Management > create resource group

Beyond creating resource groups, creating users and user groups is arguably more important. AIStation can create users itself or integrate with existing user directory tools, and then grant users access to different resources, storage quotas, GPU quotas, and so on. This matters because a company may not want an intern to use 100% of the cluster or to access sensitive training data / models, while it may want to give priority to an internal advisory group of deep learning experts. The main value proposition of AIStation is complete management through a single system.

System Management > User Management > user management
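The quota idea described above can be illustrated with a minimal sketch. This is not AIStation's API or data model; the `Group` class, the `admit` function, and the group names and quota numbers are all hypothetical, chosen only to show how a per-group GPU quota keeps an intern from claiming the whole cluster.

```python
from dataclasses import dataclass


@dataclass
class Group:
    """Hypothetical user group with a GPU quota (not AIStation's data model)."""
    name: str
    gpu_quota: int        # maximum GPUs the group may hold at once
    gpus_in_use: int = 0  # GPUs currently granted to the group


def admit(group: Group, requested_gpus: int) -> bool:
    """Grant the request only if it keeps the group within its quota."""
    if requested_gpus <= 0:
        raise ValueError("requested_gpus must be positive")
    if group.gpus_in_use + requested_gpus > group.gpu_quota:
        return False
    group.gpus_in_use += requested_gpus
    return True


# Made-up groups: interns get a small quota, the expert group a large one.
interns = Group("interns", gpu_quota=2)
experts = Group("dl-experts", gpu_quota=16)

first_intern_job = admit(interns, 1)   # fits within the interns' quota
second_intern_job = admit(interns, 4)  # would exceed the quota, so it is refused
expert_job = admit(experts, 8)         # fits within the experts' quota
```

A real scheduler would also release GPUs when jobs finish and would likely queue refused requests rather than drop them; the sketch leaves both out.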

Administrators can also inspect the entire cluster, subject to their permissions. For example, when a job runs slowly, an administrator can use the monitoring tools to find the job and its problematic containers, and even drill down to the hardware to check for potential hardware problems.

Development environment > details
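The user-to-container-to-GPU tracing described above amounts to a lookup over placement records. The sketch below is only an illustration under assumed data: the `Placement` record, its fields, and every user, job, container, and node name are invented, and nothing here reflects how AIStation stores this information.

```python
from typing import List, NamedTuple, Optional, Tuple


class Placement(NamedTuple):
    """One container's GPU assignment; all values below are made up."""
    user: str
    job: str
    container: str
    node: str
    gpu_index: int


PLACEMENTS = [
    Placement("alice", "resnet-train", "ctr-01", "node-a", 0),
    Placement("alice", "resnet-train", "ctr-01", "node-a", 1),
    Placement("bob", "bert-finetune", "ctr-02", "node-a", 2),
    Placement("bob", "bert-finetune", "ctr-03", "node-b", 0),
]


def gpus_for_user(user: str) -> List[Tuple[str, int]]:
    """Follow the chain forward: user -> containers -> (node, GPU) pairs."""
    return [(p.node, p.gpu_index) for p in PLACEMENTS if p.user == user]


def owner_of_gpu(node: str, gpu_index: int) -> Optional[Placement]:
    """Follow the chain backward: one GPU -> the container and user on it."""
    for p in PLACEMENTS:
        if p.node == node and p.gpu_index == gpu_index:
            return p
    return None
```

With records like these, an administrator investigating a slow job can go from the user to the exact GPUs involved, or start from a suspect GPU and find whose container is running on it.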

AIStation also has a fairly comprehensive visual interface for monitoring clusters, showing information such as CPU, GPU, and memory utilization. For cluster lifecycle management, this kind of data helps administrators review resource configuration and system capacity. For example, if the cluster runs at 50% CPU, 60% GPU, and 95% memory utilization, that strongly suggests the next generation of nodes needs more memory capacity.

Report management > resource statistics
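The capacity reasoning in the 50% / 60% / 95% example can be written down in a few lines. The function names below are hypothetical and the figures are simply the ones quoted in the text; this is a sketch of the arithmetic, not of any AIStation report.

```python
from typing import Dict


def most_constrained(utilization_pct: Dict[str, int]) -> str:
    """Return the resource with the highest utilization percentage."""
    return max(utilization_pct, key=utilization_pct.get)


def headroom(utilization_pct: Dict[str, int]) -> Dict[str, int]:
    """Return the unused percentage of each resource."""
    return {name: 100 - used for name, used in utilization_pct.items()}


# Utilization figures taken from the example in the text.
cluster = {"cpu": 50, "gpu": 60, "memory": 95}

bottleneck = most_constrained(cluster)  # the resource to expand first
spare = headroom(cluster)               # how much of each resource is left
```

Here memory has only 5 percentage points of headroom while CPU and GPU have 50 and 40, which is why the next node generation would need more memory rather than more compute.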

Administrators can also view completed tasks to see what a user previously ran, including whether each job succeeded. In some cases, people mine cryptocurrency on corporate GPU clusters, so this kind of record is very important for auditing what has been run.

Training Management > completed tasks

Beyond the functions above, another important function is managing the resources users have in the system. Next, we look at this from the user's perspective.

3. AIStation operation: user's perspective

Each user has access to a set of resources and sees a dashboard on logging in to AIStation. Many usage limits are defined by the user, group, and resource group functions shown in the administration panel.

Inspur AIStation user interface

If developers want to start a training task, they can browse the available training images. These images matter because they are what you choose from when creating tasks in the system. An image can come from NVIDIA GPU Cloud (NGC) or be a more standard image. AIStation also supports group images and even user images, which makes it easier for users to select a container image. Users can see personal, group, and public images, and an administrator can define an image as personal or public. It is also very important to be able to restrict viewing of sensitive images to specific groups or employees.

Image Management

Inspur AIStation supports a variety of frameworks; users can use TensorFlow, Python, PaddlePaddle, or other frameworks.

Data management is very important in an AI cluster. AIStation can define and store datasets, and users can see which datasets are available to them. Users can associate container images, nodes / physical resources, and training data. Administrators can set permissions on these datasets, which matters because some datasets may be viewed, used, or downloaded only by specified users.
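The distinction between viewing, using, and downloading a dataset can be modeled as separate permission bits. The sketch below assumes a simple per-user access table; the `Access` flags, the `ACL` table, and all dataset and user names are hypothetical and are not taken from AIStation.

```python
from enum import Flag


class Access(Flag):
    """Hypothetical dataset permissions, one bit per action."""
    NONE = 0
    VIEW = 1
    USE = 2
    DOWNLOAD = 4


# Made-up access table keyed by (dataset, user).
ACL = {
    ("imagenet-subset", "alice"): Access.VIEW | Access.USE | Access.DOWNLOAD,
    ("imagenet-subset", "bob"): Access.VIEW | Access.USE,
    ("customer-records", "alice"): Access.VIEW,
}


def allowed(dataset: str, user: str, needed: Access) -> bool:
    """True only if the user holds every permission bit in `needed`."""
    granted = ACL.get((dataset, user), Access.NONE)
    return (granted & needed) == needed
```

In this table, bob can train on `imagenet-subset` but cannot download it, and anyone without an entry gets no access at all, which is the kind of restriction the text describes for sensitive data.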

In AIStation, you can also launch Jupyter Notebook, edit Python files directly, save notebooks to the cluster's storage backend, and easily share them with other users.

Once a task has started, the AIStation platform integrates a number of visualization tools. For example, you can launch TensorBoard, Visdom, or Netscope from a drop-down menu to display visualizations, and users can open a container's terminal directly from the web GUI.

TensorBoard visualization in the user development platform

Training jobs may take hours or days, and users can view current status and progress, inspect results, and see pending jobs and job history at any time.

Training Management > completed tasks

As you can see, this solution supports many users across a company and nodes with multiple generations of GPUs. Some features are not shown here, such as email alerts and notifications for administrators and users, but the solution is clearly designed to run a company's entire AI operation. I therefore wanted to ask the head of the AIStation business for more information about its go-to-market strategy.

4. Dialogue with Liu Jun

Liu Jun, general manager of artificial intelligence and high performance computing at Inspur

Patrick Kennedy: How does Inspur plan to bring AIStation to market?

Liu Jun: AIStation is sold both directly and through channels. We have dozens of channel partners selling AIStation around the world.

PK: Can AIStation integrate cluster nodes from other server vendors?

Liu Jun: Yes, AIStation can integrate cluster nodes from other suppliers.

Liu Jun: Since its release in April 2019, it has been deployed in finance, education, Internet, smart city, and other industries.

Liu Jun: AIStation is designed specifically for deep learning development and is suitable for enterprises large and small in finance, Internet, communications, transportation, healthcare, and education.

PK: What is the licensing model?

Liu Jun: It is sold per GPU server node.

PK: To upgrade a license, do you need to purchase a new key, or can you use your existing key to obtain new entitlements from an Inspur registration server?

Liu Jun: Users receive free AIStation upgrades for three years, after which they need to purchase a new key to upgrade.

Liu Jun: In the future, AIStation will support more AI accelerators and deliver heterogeneous acceleration across resource management, scheduling, monitoring, and optimization. We will build a more comprehensive AI development ecosystem that provides an integrated development platform for mainstream AI development tools, development frameworks, and deep learning models.

Thanks again to Liu Jun for patiently answering questions for our readers.

Final words

I would like to explain why I showed both AIStation's operational view and its go-to-market strategy here. AIStation achieved substantial software sales in its first year of release. Investors would be thrilled to see a portfolio company's AI management software platform achieve such results in its first year. Although it launched as a new, paid solution, many customers are already actually using it to manage their clusters.

Taking the current features together with the planned support for heterogeneous accelerators, the potential of this solution is immediately apparent. It differs from other existing cluster management solutions in that it scales to large organizations and is built entirely on Kubernetes, which is rapidly becoming the main tool for next-generation services.