AIStation is Inspur's AI development resource platform for deep learning development and online inference deployment. It provides containerized deployment, visual development, and centralized management, with the aim of achieving precise resource management and scheduling, agile data integration and acceleration, and process-oriented integration of AI scenarios with business workflows. An editor at STH recently tried Inspur AIStation hands-on as both an administrator and a user, and explained its functions and practical value in detail in a review. STH concluded that Inspur AIStation enables fine-grained management of AI resources, effectively connects the development environment, computing resources, and data resources, and improves development efficiency.
Here is the review by Patrick Kennedy, senior editor of STH:
We often see that building and running AI clusters, that is, managing all computing resources, users, data, and models across training and inference, is a challenge. Running AI clusters well may not be as glamorous as finding a new way to solve deep learning problems, but it is crucial for scaling shared resources within an organization. Inspur AIStation aims to manage this life cycle. We spent some time hands-on with the solution to understand how it works. I also took the opportunity to ask Liu Jun, who leads AI at Inspur, several questions about the new product.
1. Background of Inspur AIStation operation
If you've heard of New York but not Jinan, that is exactly why I want to focus on AIStation. Inspur is one of the top three server suppliers in the world, and about half of the AI servers in the Chinese market come from Inspur. Inspur targets hyperscale users, and one of its main strengths is AI servers such as the Inspur NF5468M5 and Inspur NF5488M5, which we recently reviewed. AIStation is also an Inspur product, one that helps manage large numbers of AI training and inference servers, data, and users.
We will not show every screenshot here; instead, we selected some of the key interfaces to introduce. Before getting to what users see in the system, I'd like to talk about management. The solution is very modern and is built on Kubernetes and containers. If you compare it with many traditional GPU / HPC / AI scheduling systems, its modern architecture becomes much easier to appreciate.
Once AIStation is running in the background, most day-to-day management work can be done by script or through the web GUI. You can look into the load and hardware configuration of each node, and even trace the chain from a user to their containers to the hardware those containers run on, down to the level of a single GPU.
Cluster monitoring > node monitoring
Resource Management > create resource group
Beyond creating resource groups, creating users and user groups may be even more important. AIStation can create users itself or integrate with existing user directory tools, and then grant users access to different resources, storage quotas, GPU quotas, and so on. This matters because a company may not want an intern to use 100% of the entire cluster or to access sensitive training data and models, while it may want to give priority to an internal advisory group of deep learning experts. The main value proposition of AIStation is complete management through a single system.
System Management > User Management > user management
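To make the quota idea concrete, here is a minimal sketch of the kind of check a scheduler could run before admitting a job. It is not AIStation's implementation or API: the `Quota` and `Usage` classes, the `within_quota` function, and all the numbers are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Quota:
    """Hypothetical per-user or per-group limits (not AIStation's data model)."""
    max_gpus: int
    max_storage_gb: int


@dataclass
class Usage:
    """Resources the user or group currently holds."""
    gpus: int = 0
    storage_gb: int = 0


def within_quota(quota: Quota, usage: Usage, gpus: int, storage_gb: int = 0) -> bool:
    """Return True if granting the request keeps usage inside both limits."""
    if gpus < 0 or storage_gb < 0:
        return False
    return (usage.gpus + gpus <= quota.max_gpus
            and usage.storage_gb + storage_gb <= quota.max_storage_gb)


# Illustrative numbers only: a small quota for interns, a large one for experts.
intern_quota = Quota(max_gpus=2, max_storage_gb=500)
expert_quota = Quota(max_gpus=32, max_storage_gb=20_000)

print(within_quota(intern_quota, Usage(gpus=1), gpus=4))  # False
print(within_quota(expert_quota, Usage(gpus=8), gpus=4))  # True
```

The intern's request is rejected because it would push usage past the two-GPU limit, while the expert group's request fits comfortably inside its allocation.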
Administrators can also access the entire cluster, subject to their permissions. For example, when a job runs slowly, an administrator can use the monitoring tools to find the job and the problematic container, and even drill down to the hardware to see whether there is an underlying hardware problem.
Development environment > details
AIStation also has a fairly comprehensive visual interface for monitoring clusters, showing information such as CPU, GPU, and memory utilization. For cluster lifecycle management, this kind of data helps administrators review resource configuration and system capacity. For example, if the cluster runs at 50% CPU, 60% GPU, and 95% memory utilization, that is a strong sign that the next generation of nodes needs more memory capacity.
Report management > resource statistics
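The capacity reasoning in that example amounts to finding the most heavily used resource. A minimal sketch, assuming the utilization figures have already been exported from the monitoring view as fractions between 0 and 1 (the `tightest_resource` helper is hypothetical, not an AIStation function):

```python
from typing import Dict, Tuple


def tightest_resource(utilization: Dict[str, float]) -> Tuple[str, float]:
    """Return the name and value of the most heavily utilized resource."""
    name = max(utilization, key=utilization.get)
    return name, utilization[name]


# The figures from the example above: 50% CPU, 60% GPU, 95% memory.
cluster = {"cpu": 0.50, "gpu": 0.60, "memory": 0.95}

name, value = tightest_resource(cluster)
print(f"{name} is the tightest resource at {value:.0%}")
# memory is the tightest resource at 95%
```

In practice an administrator would look at utilization over time rather than a single snapshot, but the same comparison points to which resource the next hardware purchase should expand.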
Administrators can also view completed tasks to see what a user previously ran, including whether each job succeeded. In some cases, people mine cryptocurrencies on corporate GPU clusters, so a record of what has actually been run is very important for audit tracking.
Training Management > completed tasks
3. AIStation operation: user perspective
Each user has access to a set of resources. When you log in to AIStation, you see a dashboard. Many usage limits are defined by the user, group, and resource group functions shown in the administration panel.
Inspur AIStation user interface
If developers want to start a training task, they can browse the images available for training. These images matter because they are what a task runs on when you create it in the system. An image can come from NVIDIA GPU Cloud (NGC) or be a more standard image. AIStation also supports group images and even per-user images, which makes it easier for users to select a container image. Users can see personal, group, and public images. An administrator can define an image as a personal image or a public image. It is also very important to be able to restrict viewing of sensitive images to specific groups or employees.
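The three visibility levels described above can be sketched as a simple filter. This is an illustration under stated assumptions, not AIStation's API: the `Image` class, the `visible_images` function, and the image, group, and user names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass
class Image:
    name: str
    scope: str                   # "public", "group", or "personal"
    owner: Optional[str] = None  # group name or user name; None for public


def visible_images(images: Iterable[Image], user: str, groups: Set[str]) -> List[str]:
    """List the image names a user may see, in catalog order."""
    visible = []
    for image in images:
        if image.scope == "public":
            visible.append(image.name)
        elif image.scope == "group" and image.owner in groups:
            visible.append(image.name)
        elif image.scope == "personal" and image.owner == user:
            visible.append(image.name)
    return visible


# Hypothetical catalog with one image at each visibility level, plus a
# group image that the example user should not be able to see.
catalog = [
    Image("framework-base", "public"),
    Image("vision-team-train", "group", owner="vision"),
    Image("finance-models", "group", owner="finance"),
    Image("alice-debug", "personal", owner="alice"),
]

print(visible_images(catalog, user="alice", groups={"vision"}))
# ['framework-base', 'vision-team-train', 'alice-debug']
```

The sensitive `finance-models` image stays hidden from anyone outside the finance group, which is the behavior the paragraph above calls out as important.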
Inspur AIStation supports a variety of frameworks, and users can use TensorFlow, PyTorch, PaddlePaddle, or other frameworks.
Training Management > training task > create training task
Data management is very important in an AI cluster. AIStation can define and store datasets. From the user's point of view, they can see which datasets are available. Users can associate container images, nodes / physical resources, and training data. Administrators can set permissions on these datasets. This is important because some datasets should only be viewed, used, and downloaded by specified users.
Development platform > details
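Separating view, use, and download rights on a dataset can be modeled with a small set of permission flags. The sketch below is only one possible way to express the idea; the `Perm` flags, the `is_allowed` helper, and the user names are hypothetical and do not reflect how AIStation stores permissions.

```python
from enum import Flag
from typing import Dict


class Perm(Flag):
    """Hypothetical dataset permissions that can be combined with |."""
    NONE = 0
    VIEW = 1
    USE = 2
    DOWNLOAD = 4


def is_allowed(acl: Dict[str, Perm], user: str, needed: Perm) -> bool:
    """Return True if the user holds every permission in `needed`."""
    granted = acl.get(user, Perm.NONE)
    return (granted & needed) == needed


# Illustrative access list for one sensitive dataset.
dataset_acl = {
    "alice": Perm.VIEW | Perm.USE | Perm.DOWNLOAD,
    "bob": Perm.VIEW | Perm.USE,
}

print(is_allowed(dataset_acl, "bob", Perm.USE))       # True
print(is_allowed(dataset_acl, "bob", Perm.DOWNLOAD))  # False
print(is_allowed(dataset_acl, "carol", Perm.VIEW))    # False
```

Here `bob` can train on the dataset but cannot copy it off the cluster, and a user with no entry at all cannot even see it.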
Once a task has started, the AIStation platform offers a number of integrated visualization tools. For example, you can launch TensorBoard, Visdom, or Netscope from a drop-down menu to visualize the run, and users can open the container's terminal directly from the web GUI.
TensorBoard visualization in the user development platform
Training jobs may take hours or days, and users can view the current status, progress, inspection results, and the jobs to be processed and their history at any time.
As you can see, this solution supports many users within a company and nodes with multiple generations of GPUs. Some other features are not shown here, such as email alerts and notifications for administrators and users, but this solution is clearly designed to run a company's entire AI operation. I therefore wanted to ask the head of the AIStation business for more information about its go-to-market strategy.
4. Dialogue with Liu Jun
On go-to-market strategy, I put some questions to Liu Jun, head of Inspur's AI and HPC business. The name may sound familiar, because he has given us an exclusive interview before.
Liu Jun, general manager of Inspur artificial intelligence and high performance computing
Patrick Kennedy: How does Inspur plan to bring AIStation to market?
Liu Jun: AIStation is sold both directly and through channels. We have dozens of channel partners selling AIStation around the world.
PK: Can AIStation integrate cluster nodes from other server vendors?
Liu Jun: Since its release in April 2019, it has been adopted in finance, education, Internet, smart city, and other industries.
PK: Is it only for large organizations and service providers? Are smaller organizations such as start-ups also sales targets?
Liu Jun: It is sold per GPU server node.
Liu Jun: Users get free AIStation upgrades for three years, after which they need to purchase a new key to upgrade.
PK: Will Inspur offer other new services for this solution in the future?
Thanks again to Liu Jun for patiently answering questions for our readers.
I'd like to explain why I showed AIStation's operational view and go-to-market strategy here. AIStation achieved a lot of software sales in its first year of release. Investors would be thrilled to see an AI management software platform from a company they had invested in achieve such results in its first year. Many new solutions launch before they have paying customers, whereas AIStation is already used by many customers to manage their clusters.
Combine the current features with the concept of heterogeneous accelerators, and you can immediately see the prospects of this solution. It differs from other existing cluster management solutions in that it can be used even by large organizations and is built entirely on Kubernetes, which is rapidly becoming the main tool for next-generation services.
All in all, if you are a small start-up with only 2-3 people, you may not need this solution. But as the number of clusters in your organization grows and scheduling and management become a greater challenge, the value of Inspur AIStation will become more prominent.
Source: Mass News; editor in charge: Chen Tiqiang_NB6485