Recently, the outbreak of a novel coronavirus in Wuhan, China has drawn close attention from the World Health Organization and health institutions around the world. Among the coverage, Wired reported that BLUEDOT, a Canadian company, was the first to predict and flag the Wuhan outbreak through its AI monitoring platform, a story that attracted wide attention from Chinese media. This seems to be exactly the outcome we most want from predicting the future: with a foundation of accumulated big data and AI inference, humans appear able to read fate and reveal the causal laws hidden in chaos, and so try to act before disaster strikes.
Google GFT cries wolf: a rhapsody of influenza big data
Using AI to predict infectious diseases is obviously not BLUEDOT's invention. As early as 2008, today's AI powerhouse Google made a less successful attempt.
In 2008, Google launched a system to predict influenza trends: Google Flu Trends (hereinafter GFT). GFT rose to fame just a few weeks before the H1N1 outbreak in the United States in 2009, when Google engineers published a paper in Nature. Using the massive search data Google had accumulated, they successfully predicted the spread of H1N1 across the United States. For its influenza trend and regional analysis, Google drew on billions of search records and processed 450 million different mathematical models to construct an influenza prediction index. The correlation between its results and the official data of the Centers for Disease Control and Prevention (CDC) was as high as 97%, and GFT was two weeks ahead of the CDC.
In the face of an epidemic, time is life and speed is wealth. Had GFT kept this predictive ability, it could clearly have bought society valuable lead time to control outbreaks of infectious disease.
However, the myth of prophecy did not last long. In 2014, GFT again drew media attention, this time for poor performance. That year, researchers published "The Parable of Google Flu: Traps in Big Data Analysis" in Science, pointing out that in 2009 GFT had failed to predict the non-seasonal influenza A/H1N1 outbreak, and that in the 108 weeks from August 2011 to August 2013, GFT overestimated the influenza incidence reported by the CDC in 100 of them. By how much? In the 2011-2012 season, the incidence predicted by GFT was more than 1.5 times that reported by the CDC; in the 2012-2013 season, it was more than twice the CDC figure.
Although GFT adjusted its algorithm in 2013 and responded that the main cause of the deviation was large-scale media coverage of GFT itself, which changed people's search behavior, the influenza incidence it predicted for 2013-2014 was still 1.3 times that reported by the CDC. And the systematic errors the researchers had found earlier persisted: GFT was still crying wolf.
What was GFT missing that left its predictions in such a bind?
1. Big data hubris
"Big data hubris" is the implicit assumption that big data can substitute for, rather than supplement, traditional data collection and analysis. This assumption ignores the fact that a large volume of data does not mean comprehensive, accurate data, so the sample of searches from 2009 could not cover the new data features that appeared in the following years. Out of the same conceit, GFT apparently never considered introducing professional public health data or expert experience, nor did it clean and de-noise users' search data, which left the overestimation of influenza incidence unresolved.
2. Search engine evolution
At the same time, search engines themselves do not stand still. After 2011, Google began recommending related search terms, the familiar pattern of suggested queries we see today.
For example, a search for a flu term would return a list of related searches for flu treatments, and after 2012 Google also began recommending related diagnostic terms. The researchers' analysis suggested that these adjustments may have artificially pushed up the volume of certain searches and led Google to overestimate flu prevalence.
For instance, when users search for "sore throat", Google may recommend "sore throat and fever" and "how to treat sore throat" among the suggested keywords. Users may then click out of curiosity or other reasons, so that the keywords recorded are not what users originally intended, distorting the data GFT collects.
In turn, users' search behavior feeds back into GFT's predictions. Media reports on a flu epidemic, for example, increase the number of searches for flu-related terms, which in turn affects GFT's forecasts. Just as Heisenberg's uncertainty principle in quantum mechanics tells us that "measurement is interference", in the noisy world of search engines, saturated with media reports and users' subjective intent, there is an analogous paradox: "prediction is interference". Search engine users' behavior is not entirely spontaneous; media reports, social media hotspots, search engine suggestions and even big data recommendations all shape users' minds, producing concentrated bursts of specific search queries.
Why were GFT's predictions always on the high side? On this account, once the epidemic prediction index released by GFT rose, media reports would follow immediately, triggering more searches for related information and thereby reinforcing GFT's judgment that an epidemic was under way. No matter how the algorithm was adjusted, this feedback-driven uncertainty would not go away.
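This feedback loop can be sketched with a toy simulation. Everything here is an illustrative assumption, not GFT's actual dynamics: a predictor reads incidence straight off search volume, while media coverage of each published forecast adds extra searches the next week. The forecast drifts above the true rate and stays there, no matter how long the loop runs.

```python
# Toy model (not GFT's real algorithm): forecasts feed back into the very
# search volume they measure. The media_gain constant is an assumption.

def simulate(weeks=10, true_rate=1.0, media_gain=0.3):
    """Each week the forecast equals observed search volume; media coverage
    of that forecast adds curiosity-driven searches the following week."""
    search_volume = true_rate  # week 0: searches reflect real illness only
    forecasts = []
    for _ in range(weeks):
        forecast = search_volume          # naive mapping: searches -> incidence
        forecasts.append(forecast)
        # media report the forecast, prompting extra unrelated searches
        search_volume = true_rate + media_gain * forecast
    return forecasts

rates = simulate()
print(rates[0], round(rates[-1], 3))  # settles above the true rate of 1.0
```

The forecast converges to true_rate / (1 - media_gain), a systematic overestimate: exactly the "always on the high side" behavior described above, and one that no re-tuning of the forecast-from-searches mapping alone can remove.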
3. Correlation, not causation
The root problem with GFT, the researchers pointed out, was that Google's engineers did not know exactly what the causal link was between search keywords and the spread of flu; they focused only on the statistical correlations in the data. Over-emphasizing correlation while neglecting causation leads to inaccurate predictions.
Take "flu" as an example: if search volume for the word skyrockets over a period of time, it may simply be because a movie or song called "Flu" has been released, and does not necessarily mean that influenza is actually breaking out.
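The point is easy to demonstrate numerically. In the sketch below (both series are invented), two quantities that merely share an upward trend show a near-perfect Pearson correlation despite having no causal link at all:

```python
# Illustrative only: two unrelated series that both trend upward will
# correlate strongly, which says nothing about causation.

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# e.g. searches driven by a movie release vs. an unrelated rising series
searches = [i + (i % 3) for i in range(20)]
cases = [2 * i + (i % 5) for i in range(20)]
print(round(pearson(searches, cases), 2))  # close to 1.0, yet no causal link
```

A model that treats any such correlation as a signal of disease will be fooled by every shared trend, which is why causal understanding of what drives a search term matters.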
For a long time, outsiders hoped Google would disclose GFT's algorithm, but Google chose not to. This led many researchers to question whether the results could be reproduced, or whether commercial considerations were at play. They hoped instead to combine big data search with traditional statistical data (small data) to build deeper and more accurate research on human behavior.
Evidently, Google did not take this view seriously. In 2015, GFT was officially taken offline, though it continues to collect search data from relevant users for use only by the CDC and some research institutions.
Why BLUEDOT succeeded first: a concerto of AI algorithms and human analysis
As we all know, Google was already investing in artificial intelligence at the time; it acquired DeepMind in 2014, though DeepMind kept operating independently. But Google no longer paid much attention to GFT, and rather than consider adding AI to GFT's algorithmic model, it chose to let the project die quietly.
Almost at the same time, the BLUEDOT we see today was born.
BLUEDOT is an automated epidemic surveillance system established by infectious disease expert Kamran Khan; it tracks more than 100 infectious disease outbreaks by analyzing about 100,000 articles in 65 languages every day. The team uses this targeted data collection to look for clues about the outbreak and spread of potential epidemics.
BLUEDOT uses natural language processing (NLP) and machine learning (ML) to train its automated disease monitoring platform so that it can identify and exclude irrelevant noise in the data, recognizing, for instance, whether a mention of "anthrax" refers to an actual anthrax outbreak in Mongolia or to a reunion of the heavy metal band Anthrax founded in 1981. GFT, by contrast, simply treated anyone searching flu-related terms as a possible flu patient, which swept in too many unrelated users and inflated its estimates of the epidemic. This is the difference between BLUEDOT and GFT in identifying key data.
In this prediction of the new coronavirus outbreak, Khan said BLUEDOT combed foreign-language news reports, animal and plant disease networks and official announcements to find the source of the epidemic. The platform's algorithm deliberately avoids social media posts, however, because that data is too messy and prone to noise.
For predicting transmission paths after an outbreak, BLUEDOT prefers global airline ticketing data, which better reveals the movements and timing of potentially infected travelers. In early January, BLUEDOT correctly predicted that the new coronavirus would spread from Wuhan to Beijing, Bangkok, Seoul and Taipei within days of its outbreak in Wuhan.
The new coronavirus outbreak is not BLUEDOT's first success. In 2016, by building an AI model of the Zika virus's transmission path in Brazil, BLUEDOT predicted the appearance of Zika in Florida, U.S., six months in advance. This suggests BLUEDOT's AI monitoring can even anticipate the regional spread of an epidemic.
From failure to success, what are the differences between BLUEDOT and Google GFT?
1. Differences in forecasting technology
Previously, mainstream predictive analysis relied on a family of data mining techniques, chief among them regression methods from mathematical statistics, including multiple linear regression, polynomial regression and multiple logistic regression. These are essentially curve fitting: conditional mean prediction under different models. This was also the technical principle behind GFT's prediction algorithm.
Before machine learning, multiple regression analysis provided an effective way to handle many variables at once, searching for a result that minimizes the error on the training data and maximizes goodness of fit. But regression analysis's appetite for an unbiased fit to historical data cannot guarantee the accuracy of future predictions; this is the problem known as overfitting.
According to the analysis of Shen Yan, a professor at Peking University, in the article "The glory and trap of big data analysis: on Google Flu Trends", GFT did indeed suffer from overfitting. In 2009, GFT could observe all CDC data from 2007-2008, and the training and test data used to find the best model were chosen to fit the CDC data as closely as possible, at any cost.
Thus the 2014 Science paper pointed out that when GFT fitted the 2007-2008 influenza prevalence rate, it dropped some seemingly strange search terms and used another 50 million search terms to fit 1,152 data points. After 2009, GFT faced ever more unknown variables in the data to be predicted, including its own predictions feeding back into the data. No matter how GFT was adjusted, it still faced overfitting, making systematic error unavoidable.
BLUEDOT adopted another strategy: combining medical and public health expertise with artificial intelligence and big data analysis to track and predict the global distribution and spread of infectious diseases and offer the best available response.
BLUEDOT mainly uses natural language processing and machine learning to improve the effectiveness of its monitoring engine. With the growth of computing power and machine learning in recent years, statistical prediction methods have changed fundamentally, chiefly through the application of deep learning (neural networks): using back-propagation, a model can be continuously trained on data, receive feedback and acquire knowledge, so that through systematic self-learning the prediction model keeps improving in accuracy. This makes the historical data fed in before training especially critical: abundant feature-rich data is the basis of training a prediction model, and cleaning that data and extracting well-labeled features become decisive for successful prediction.
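BLUEDOT's actual pipeline is proprietary, but the kind of noise filtering described above can be sketched with a minimal bag-of-words Naive Bayes classifier. The training sentences, labels and queries below are invented for illustration; the point is that even simple word statistics can separate an outbreak report from an irrelevant mention of the same word, such as the band Anthrax.

```python
# Hypothetical sketch of NLP noise filtering: a tiny Naive Bayes text
# classifier that labels articles as real outbreak news or irrelevant noise.
# All training sentences and labels are invented for illustration.
import math
from collections import Counter

train = [
    ("anthrax outbreak kills livestock in rural province", "outbreak"),
    ("officials confirm new anthrax infection cases", "outbreak"),
    ("hospital reports spike in flu patients this week", "outbreak"),
    ("metal band anthrax announces reunion tour", "noise"),
    ("anthrax album tops the rock charts", "noise"),
    ("fans excited for anthrax concert tickets", "noise"),
]

def train_nb(data):
    """Count word and class frequencies for a bag-of-words Naive Bayes model."""
    word_counts = {"outbreak": Counter(), "noise": Counter()}
    class_counts = Counter()
    for text, label in data:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    return word_counts, class_counts

def classify(text, word_counts, class_counts):
    """Pick the class with the highest log prior + smoothed log likelihood."""
    vocab = {w for counts in word_counts.values() for w in counts}
    best, best_score = None, -math.inf
    for label, counts in word_counts.items():
        total = sum(counts.values())
        score = math.log(class_counts[label] / sum(class_counts.values()))
        for word in text.split():
            # Laplace smoothing so unseen words do not zero out the class
            score += math.log((counts[word] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

model = train_nb(train)
print(classify("anthrax infection cases reported on farms", *model))  # prints "outbreak"
print(classify("anthrax tour dates announced", *model))               # prints "noise"
```

A production system would of course use far richer features, multilingual models and much more data, but the design choice is the same one the article credits to BLUEDOT: invest in filtering out irrelevant signals before any epidemiological prediction is attempted.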
2. Differences in forecasting models
Unlike GFT, which handed the entire prediction process over to a big data algorithm, BLUEDOT does not leave prediction entirely to its AI monitoring system: after the data has been filtered, it is handed over to human analysts. This is the difference between GFT's correlation-driven big data analysis and BLUEDOT's expert-in-the-loop prediction model.
The big data the AI analyzes is information drawn from specific websites (medical and public health news, disease news) and platforms (airline ticketing and the like). The early warnings the AI produces are then re-analyzed by epidemiologists, who confirm whether they are plausible and assess whether the information should be released to the public immediately.
Of course, these cases do not show that BLUEDOT's epidemic predictions are fully mature. First, might there be biases in the AI training model; for example, might the severity of an epidemic be overstated in order to avoid under-reporting, reviving the crying wolf problem? Second, is the data assessed by the monitoring model valid, given choices such as BLUEDOT's cautious avoidance of social media data to limit noise?
Fortunately, as a professional health services platform, BLUEDOT pays more attention to the accuracy of its monitoring results than GFT did. After all, professional epidemiologists are the final publishers of these prediction reports, and their accuracy directly affects the platform's reputation and commercial value. This also means BLUEDOT must face the test of balancing commercial profit with public responsibility and openness of information.
AI's prediction of epidemic outbreaks is just a prelude
"The first warning of the Wuhan coronavirus came from an artificial intelligence?" That headline genuinely surprised many people. In an era of global integration, an outbreak anywhere can spread to every corner of the world in a short time; the time of discovery and the efficiency of early warning and notification become the keys to epidemic prevention.
If AI can become a better epidemic early warning mechanism, it could serve the World Health Organization (WHO) and national health departments as one means of building their epidemic prevention mechanisms.
The question then becomes how these organizations make use of the epidemic forecasts AI provides. In the future, an epidemic AI prediction platform will also have to assess the level of infection risk, as well as the economic and political risks that disease transmission may cause, to help the relevant departments make steadier decisions. All of this will take time, and these organizations should put such AI monitoring systems on the agenda as they build rapid-response epidemic prevention mechanisms.

It is fair to say that AI's successful early prediction of the epidemic is a bright spot in humanity's response to the global epidemic crisis. One hopes that this AI-assisted battle of epidemic prevention and control is only the prelude to a much longer campaign, with more possibilities to come: AI identification of major infectious disease pathogens; AI early warning mechanisms for infectious diseases built on seasonal epidemic data from major endemic areas; AI-assisted optimization of the allocation of medical supplies after an outbreak, and so on. Let's wait and see.

Source: Huxiu. Responsible editor: Liao Ziyao, nbjs10040