AAAI 2020 Conference Preview: Commonsense Knowledge and Commonsense Reasoning



Figure 1.1: Example results of the paper's method on the MSVD video dataset and the Chinese video dataset Youku-VC. Yellow and blue boxes mark candidate objects and relations, respectively; O-R-O denotes an object-relation-object triple in the semantic graph, and O-R-A denotes an object-relation-attribute triple. "Ours" is the description generated by the authors' method, and "GT" is the ground-truth description. The three pictures are frames randomly sampled from each video. The method can detect objects that are hard to detect directly and generate accurate descriptions: the small eyebrow pencil used for makeup in (b) and the heavily occluded person in (d) can be inferred from the prior-knowledge triples <woman, put on, makeup> and <woman, play with, cat>, respectively. The method can also generate Chinese descriptions, as in (c) and (f); the English in parentheses is a translation of the Chinese.

The paper's method, C-R Reasoning, consists of three modules, as shown in Figure 1.2. After semantic entity candidates are generated, the visual mapping and knowledge mapping module learns a visual feature vector for each candidate through visual mapping and a knowledge vector through knowledge mapping. Given the candidates, the commonsense reasoning module constructs a semantic graph under the guidance of a prior knowledge graph. Finally, the relational reasoning module generates the text description from the semantic graph using a GCN and a sequence-based language model.

Figure 1.2 illustrates C-R Reasoning with video description generation as an example.

1. Visual mapping and knowledge mapping module. (1) Visual mapping produces visual features for the semantic entity candidates (objects, attributes, and relations). Object and attribute candidates are represented by the visual features of local regions, while a relation candidate is represented by a pair of local regions. A pre-trained CNN densely samples local regions from the input image or video; the sampled regions are clustered, the cluster centers are taken as the representative candidates, and a candidate's visual feature vector is denoted $v$. (2) Knowledge mapping learns the candidates' knowledge vectors $k$ by mapping the visual feature vectors $v$ into the semantic concept space spanned by the prior knowledge embeddings: $k = [k_1, \dots, k_{N_v}]$, where $k_i = E p_i$, $E$ is the matrix of knowledge embedding vectors, and $p_i$ is the soft-assignment weight vector over those embeddings. The knowledge embedding vectors are computed from the knowledge graph built on Visual Genome. Three nonlinear mapping networks soft-assign the visual feature vectors for objects, relations, and attributes, respectively. Ground-truth concept labels are obtained by applying a part-of-speech tagging tool to the ground-truth descriptions.
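To make the soft-assignment step concrete, here is a minimal PyTorch sketch of a knowledge-mapping network of the kind described above. The class name, dimensions, and the softmax assignment are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeMapping(nn.Module):
    """Sketch of knowledge mapping: a nonlinear network soft-assigns each
    candidate's visual feature to prior knowledge embeddings, k_i = E^T p_i."""

    def __init__(self, visual_dim, n_concepts, embed_dim):
        super().__init__()
        # E: one embedding per semantic concept (e.g. from Visual Genome).
        self.E = nn.Parameter(torch.randn(n_concepts, embed_dim))
        self.assign = nn.Sequential(            # nonlinear mapping network
            nn.Linear(visual_dim, n_concepts),
            nn.ReLU(),
            nn.Linear(n_concepts, n_concepts),
        )

    def forward(self, v):                       # v: (N_v, visual_dim)
        p = F.softmax(self.assign(v), dim=-1)   # soft-assignment weights p_i
        return p @ self.E                       # k: (N_v, embed_dim)

# One such network per candidate type: objects, relations, attributes.
obj_mapper = KnowledgeMapping(visual_dim=2048, n_concepts=500, embed_dim=300)
k = obj_mapper(torch.randn(36, 2048))           # knowledge vectors for 36 candidates
```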

Figure 1.3: Diagram of the iterative execution of C-R Reasoning.

2. Commonsense reasoning module. This module takes the visual feature vectors $v$ and knowledge vectors $k$ as input and represents each candidate as a semantic feature $s_i = \varphi(v_i, k_i)$, where the nonlinear mapping $\varphi(\cdot)$ is updated by backpropagation through the C-R Reasoning framework. The semantic features must satisfy the relevance and constraints among objects, relations, and attributes inferred by the commonsense reasoning criterion, from which the semantic graph of the image or video is generated. Concretely, the knowledge graph is a set of triples, where each triple $(s_h, s_r, s_t)$ expresses a relation $s_r$ between a head entity $s_h$ and a tail entity $s_t$. The correlation criterion for triples is expressed as follows:

$f(s_h, s_r, s_t) = \mathrm{Re}(\langle W s_h, W s_r, \overline{W s_t} \rangle)$

where $W$ is the weight matrix that transforms semantic features into complex vectors, $\overline{W s_t}$ is the complex conjugate of $W s_t$, $\langle \cdot, \cdot, \cdot \rangle$ denotes the multilinear dot product of the three vectors, and $\mathrm{Re}(\cdot)$ and $\mathrm{Im}(\cdot)$ are the real and imaginary parts. From the candidates, the triples with the largest responses to this criterion are chosen to build the semantic graph.

3. Relational reasoning module. This module adopts a GCN + LSTM pattern: the graph convolutional network proposed in [1] propagates information along the edges of the graph and encodes the features in the semantic graph in context, producing relation-aware triple features. The description is then generated with the model of [2], which uses a top-down attention LSTM to weight the visual features and a language LSTM to generate the words. Attention weights over the triple features are computed from the weighted visual features and the hidden state of the attention LSTM; the concatenated result is fed into the language LSTM to obtain the conditional distribution over the next word.

4. The total training loss is $L = L_C + \beta L_S$, where $\beta$ is a hyperparameter, $L_C$ is the cross-entropy loss for sentence generation, and $L_S$ guides the learning of the semantic features of triples.
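The criterion reads as a ComplEx-style knowledge-graph score. Below is a small PyTorch sketch under that reading; the shapes, and the convention of splitting $Ws$ into real and imaginary halves, are assumptions for illustration.

```python
import torch

def triple_score(s_h, s_r, s_t, W):
    """Sketch of the correlation criterion: map semantic features to complex
    vectors with W, then take Re(<W s_h, W s_r, conj(W s_t)>)."""
    def to_complex(s):
        z = W @ s                      # (2*d_c,) real vector
        d = z.shape[0] // 2
        return torch.complex(z[:d], z[d:])  # first half real, second imaginary
    h, r, t = to_complex(s_h), to_complex(s_r), to_complex(s_t)
    # Multilinear dot product of the three complex vectors, real part only.
    return torch.sum(h * r * torch.conj(t)).real

d_s, d_c = 512, 128                    # illustrative dimensions
W = torch.randn(2 * d_c, d_s)
score = triple_score(torch.randn(d_s), torch.randn(d_s), torch.randn(d_s), W)
```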

In theory, the C-R Reasoning method can be trained end to end. However, constructing the semantic graph in the commonsense reasoning module poses an optimization challenge, so the authors design an iterative algorithm that alternately optimizes semantic graph generation in the commonsense reasoning module and description generation in the relational reasoning module. The algorithm is as follows:
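The article presents the algorithm as a figure. As a stand-in, here is a minimal runnable sketch of such an alternating loop; every component below is a toy placeholder for the real modules, not the authors' implementation.

```python
def select_semantic_graph(candidates, criterion):
    """Graph step: keep the triples with the highest criterion score."""
    return sorted(candidates, key=criterion, reverse=True)[:2]

def update_captioner(state, graph):
    """Stand-in for one optimization step of the GCN + LSTM captioner (L_C)."""
    return state + 1

def update_semantic_features(state, graph):
    """Stand-in for one optimization step of phi, guided by L_S."""
    return state + 1

candidates = [("woman", "put on", "makeup"), ("woman", "play with", "cat"),
              ("cat", "on", "table")]
criterion = lambda t: len(" ".join(t))   # toy score in place of Re(<.,.,.>)
cap_state, sem_state = 0, 0
for step in range(3):                    # alternate until convergence
    graph = select_semantic_graph(candidates, criterion)  # fix captioner
    cap_state = update_captioner(cap_state, graph)        # fix graph
    sem_state = update_semantic_features(sem_state, graph)
```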

The authors experimented on the MSVD video dataset and the MSCOCO image dataset. MSVD was collected from YouTube videos and is split into 1200/100/670 videos for training/validation/testing. MSCOCO contains more than 100K images, each with 5 descriptions; the authors use a 113287/5000/5000 train/validation/test split. In addition, the authors ran qualitative experiments on the Chinese video description dataset Youku-VC, split into 1000/215/215 for training/validation/testing, with 10 Chinese descriptions per short video. Visualization results on MSVD and Youku-VC are shown in Figure 1.1 above. Table 1 gives the experimental results on the MSVD video dataset. The first four baselines are plain sequence-to-sequence models that do not exploit relationships between objects; the authors' method outperforms them, demonstrating the superiority of joint commonsense and relational reasoning. The method is also better than approaches that detect objects in the video with a detector pre-trained on image datasets, which suggests that identifying objects with prior knowledge is more reliable. Table 2 shows the results on the MSCOCO image dataset: the method scores higher than the approach that uses no semantic information (the first row) and is comparable to methods that rely on a pre-trained detector. The authors also report results when their method extracts the initial regions from the image with a pre-trained Faster R-CNN detector. In addition, ablation experiments confirm the effectiveness of each module, as shown in Table 3.

In summary, this paper does not focus on commonsense knowledge and commonsense reasoning in themselves; rather, it combines commonsense with relational reasoning so that objects and relations that are elusive or not directly visible in images and videos surface in the generated description, making the description more accurate. Moreover, the method requires no pre-trained object or relation detector, and the joint commonsense-relation learning strategy better achieves global semantic consistency. This work should offer some inspiration for applying commonsense knowledge and commonsense reasoning to video and image description, visual question answering, and related fields.

Graph-Based Reasoning over Heterogeneous External Knowledge for Commonsense Question Answering

Paper link: https://arxiv.org/pdf/1909.05311.pdf

Commonsense question answering often requires background knowledge that is not explicitly expressed in the question, and the key challenge is how to obtain evidence from external knowledge and make predictions based on that evidence. Consider the example in Figure 2.1: "What do people usually do when playing the guitar?" with options A. cry, B. listen to the voice, C. sing, D. arthritis, E. make music. The evidence from ConceptNet narrows the choice to A and C, while the evidence from Wikipedia narrows it to C and E; combining the two yields the correct answer, C. Structured knowledge sources (such as ConceptNet) contain valuable structural relations between concepts that are helpful for reasoning, but their coverage is low. Plain-text knowledge sources (such as Wikipedia) complement structured knowledge and can provide rich, broad evidence. In this work, the authors propose to automatically extract evidence from these two heterogeneous knowledge sources and answer questions based on the extracted evidence.

Figure 2.1: An example from the CommonsenseQA dataset that requires multiple sources of external knowledge to make the correct prediction. The paper's method consists of two parts, knowledge extraction and graph-based reasoning, as shown in Figure 2.2.

Figure 2.2: Overview of the paper's method.

(1) Knowledge extraction: given the question and its options, graph paths are automatically extracted from the structured knowledge base ConceptNet, and sentences are automatically extracted from Wikipedia plain text. To make better use of the relational structure of the evidence, the authors build graphs for both knowledge sources. ConceptNet is a large-scale commonsense knowledge base in which commonsense knowledge is represented as triples (entity node, relation, entity node). For a given question and option, the entities in them are identified first, paths from the question entities to the option entities are then searched in ConceptNet, and the triples involved are merged into a graph with the triples as nodes and the connections between triples as edges. For Wikipedia sentences, the authors index the sentences with the Elasticsearch tool, search after removing stop words from the question and options, rank the matched Wikipedia sentences by matching score, and take the top K sentences as evidence; semantic role labeling is then applied to extract the subject and object of each predicate in the Wikipedia evidence, with subject, predicate, and object as nodes of the graph and the relations between the predicate and its two arguments as edges. A hedged sketch of the retrieval step follows.
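For the Wikipedia branch, here is a minimal Python sketch of the evidence-retrieval step, assuming a local Elasticsearch 8.x server with Wikipedia sentences already indexed. The index name, document field, and stop-word list are assumptions, not the paper's code; only the elasticsearch-py client calls are real.

```python
from elasticsearch import Elasticsearch

STOPWORDS = {"the", "a", "an", "of", "to", "do", "what", "when", "usually"}

def retrieve_evidence(question, option, k=10):
    """Remove stop words from question + option, query an Elasticsearch
    index of Wikipedia sentences, and return the top-k matches by score."""
    es = Elasticsearch("http://localhost:9200")   # assumed local server
    terms = [w for w in f"{question} {option}".lower().split()
             if w not in STOPWORDS]
    resp = es.search(index="wiki_sentences",      # hypothetical index name
                     query={"match": {"text": " ".join(terms)}},
                     size=k)
    return [hit["_source"]["text"] for hit in resp["hits"]["hits"]]

sentences = retrieve_evidence(
    "What do people usually do when playing the guitar?", "sing")
```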

(2) Graph-based reasoning: this part consists of two modules. (a) The graph-based contextual representation learning module uses graph structure information to redefine the distance between words and thereby learn better contextual word representations. The authors propose a topological sorting algorithm (Algorithm 1) that orders the evidence according to the constructed graphs, as sketched below. Note that for the structured source ConceptNet, the relation templates provided by ConceptNet are used to convert the triples into natural-language sentences. XLNet serves as the backbone, and the concatenation of the sorted ConceptNet evidence sentences, the sorted Wikipedia evidence sentences, the question, and the option is fed into XLNet; its output gives the contextual word representations. By converting the extracted graphs into natural-language text, the two heterogeneous knowledge sources are fused into the same representation space.
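The following is a minimal sketch of ordering evidence by topological sort (Kahn's algorithm), so that in the concatenated XLNet input a piece of evidence appears after the evidence it depends on. The toy graph and node names are illustrative, not the paper's Algorithm 1.

```python
from collections import deque

def topological_order(nodes, edges):
    """Kahn's algorithm: repeatedly emit nodes with no remaining
    predecessors. An edge (u, v) means u should precede v."""
    indeg = {n: 0 for n in nodes}
    succ = {n: [] for n in nodes}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    queue = deque(n for n in nodes if indeg[n] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return order

triples = ["guitar RelatedTo music", "music RelatedTo sing"]
order = topological_order(triples, [(triples[0], triples[1])])
xlnet_input = " ".join(order) + " What do people usually do when playing the guitar? sing"
```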

(b) The graph-based reasoning module uses the graph convolutional network of [1] to encode graph-structure information into node representations, which are updated by merging the features of neighboring nodes. The initial representation of the $i$-th node is obtained by averaging XLNet's hidden states over the node's corresponding evidence tokens and reducing the dimension with a nonlinear transformation:

$h_i^0 = \sigma\big(W \cdot \frac{1}{|S_i|} \sum_{w_j \in S_i} h_{w_j}\big)$

where $S_i = \{w_0, \cdots, w_t\}$ is the evidence corresponding to the $i$-th node, $h_{w_j}$ is XLNet's contextual representation of $w_j$, $W$ reduces the high dimension $d$ to the low dimension $k$, and $\sigma$ is the activation function. To reason over the graph, information from the adjacent nodes is first aggregated; the information gathered by the $i$-th node at layer $l$ is

$z_i^l = \frac{1}{|N_i|} \sum_{j \in N_i} h_j^l$

where $N_i$ is the set of neighbors of the $i$-th node and $h_j^l$ is the representation of the $j$-th node at layer $l$. The paper combines $z_i^l$ with a transformed representation of the $i$-th node, of the form $h_i^{l+1} = \sigma(W^l h_i^l + z_i^l)$, to obtain the updated node representation $h_i^{l+1}$. A graph attention mechanism then aggregates the node representations into a graph representation $h^g$ for prediction:

$h^g = \sum_i \alpha_i h_i^L$

where $h_i^L$ is the representation of the $i$-th node at the last layer, $h^c$ is XLNet's representation at the last position of the sequence (which can also be viewed as the input representation), and $\alpha_i$, the weight of the $i$-th node, is an attention score computed from $h_i^L$ and $h^c$. The authors concatenate the input representation $h^c$ with the graph representation $h^g$ and feed the result into a multilayer perceptron to compute a confidence score $f(q, a)$. For a question $q$, the probability of candidate answer $a$ is calculated as

$P(a \mid q) = \frac{\exp(f(q, a))}{\sum_{a' \in A} \exp(f(q, a'))}$

where $A$ is the set of candidate answers, and the answer with the highest confidence score is selected as the prediction. A hedged end-to-end sketch of this module follows.
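Putting module (b) together, here is a hedged PyTorch sketch: node initialization by span averaging, one mean-aggregation GCN layer, attention pooling against the sequence representation, and an MLP scorer. All dimensions and the exact attention form are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphReasoner(nn.Module):
    """Sketch of the graph-based reasoning module (shapes illustrative)."""

    def __init__(self, d_in, d_k):
        super().__init__()
        self.reduce = nn.Linear(d_in, d_k)      # W: reduce d -> k for h_i^0
        self.gcn = nn.Linear(d_k, d_k)          # one GCN layer
        self.att = nn.Linear(d_k + d_in, 1)     # attention over nodes
        self.mlp = nn.Sequential(nn.Linear(d_in + d_k, d_k),
                                 nn.ReLU(), nn.Linear(d_k, 1))

    def forward(self, token_h, spans, adj, h_c):
        # h_i^0: average each node's evidence tokens, then reduce dimension.
        h = torch.stack([token_h[s].mean(dim=0) for s in spans])
        h = torch.relu(self.reduce(h))
        # z_i: mean of neighbors; combine with the transformed self node.
        z = adj @ h / adj.sum(dim=1, keepdim=True).clamp(min=1)
        h = torch.relu(self.gcn(h) + z)         # h_i^{l+1}
        # Attention pooling with the sequence representation h^c.
        a = F.softmax(self.att(torch.cat(
            [h, h_c.expand(h.size(0), -1)], dim=-1)).squeeze(-1), dim=0)
        h_g = (a.unsqueeze(-1) * h).sum(dim=0)  # graph representation h^g
        return self.mlp(torch.cat([h_c.squeeze(0), h_g]))  # score f(q, a)

model = GraphReasoner(d_in=768, d_k=128)
token_h = torch.randn(50, 768)                  # stand-in XLNet token states
spans = [slice(0, 5), slice(5, 9), slice(9, 20)]
adj = torch.tensor([[0., 1, 0], [1, 0, 1], [0, 1, 0]])
score = model(token_h, spans, adj, token_h[-1:])
# P(a|q): softmax over the scores of all candidate answers.
```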

The authors experimented on the CommonsenseQA dataset, which contains 12102 samples (train: 9741, validation: 1221, test: 1140). They selected the best model on the validation set and submitted its predictions on the test data. For the comparative experiments, models from the leaderboard were selected and divided into four groups:

· Group 1: models with no corresponding description and no published paper

· Group 2: models that do not use extracted knowledge

· Group 3: models that use extracted structured knowledge

· Group 4: models that use extracted unstructured knowledge

These methods use evidence from either a structured knowledge source or an unstructured one, without taking advantage of both. The results on the CommonsenseQA validation and test sets are shown in Table 1; compared with the four groups, the authors' method achieves the best performance. Table 2 shows the ablation results of the authors' method on the validation set. In the baseline, the authors simply concatenate all the evidence, feed it into XLNet, and predict from the contextual representation. Adding the topological sorting algorithm yields a gain of 1.9% over the baseline, adding the graph reasoning module alone brings a gain of 1.4%, and adding both together gives 3.5%. The authors then ablate the knowledge sources; the results in Table 3 show that combining ConceptNet and Wikipedia greatly improves performance, indicating that heterogeneous knowledge sources work better than any single source and that different sources complement each other.

Analyst's summary: the innovation of this paper is a graph-based method for commonsense question answering that draws on knowledge sources of different structure, together with a graph-based contextual representation learning module and a graph-based reasoning module that make better use of graph information. The method achieved state-of-the-art performance on the CommonsenseQA leaderboard at the time.

References:

[1] Kipf T N, Welling M. Semi-Supervised Classification with Graph Convolutional Networks. arXiv preprint arXiv:1609.02907, 2016.

[2] Anderson P, He X, Buehler C, et al. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. CVPR 2018.

PIQA: Reasoning about Physical Commonsense in Natural Language

Paper link: https://arxiv.org/pdf/1911.11641.pdf

Quick read: without a brush to apply eye shadow, should I use a cotton swab or a toothpick? Questions like this require commonsense about the physical world, which challenges current natural language understanding systems. Although recent pre-trained models (such as BERT) have made progress on question answering over news articles and encyclopedia entries, where the text is information-rich, in more physical domains reporting bias leaves the text inherently limited: the fact that smearing eye shadow with a toothpick is a bad idea is rarely written down. Can AI systems reliably answer physical commonsense questions without experiencing the physical world? Doing so requires capturing commonsense knowledge about everyday objects, including their physical properties, affordances, and how they can be manipulated. In this paper, the authors introduce a physical commonsense reasoning task and a corresponding benchmark dataset, PIQA (Physical Interaction: Question Answering), for evaluation. While the dataset is easy for humans (95% accuracy), it is hard for large pre-trained models (77%). The authors analyze the knowledge the existing models lack, pointing out important opportunities for future research.

Figure 3.1: A PIQA data sample. The PIQA dataset focuses on physical commonsense in daily life and favors atypical solutions. Figure 3.1 shows an example: what should you do to separate the yolk from the egg white using a mineral water bottle? a. Squeeze the mineral water bottle, press it against the yolk, then release; the suction will draw the yolk in. b. Put the mineral water bottle on the yolk and push continuously; the suction will draw the yolk in. Humans easily pick a, but machines easily fall for b.

Figure 3.2: Schematic of the PIQA dataset (the QA pair on the left focuses on object attributes; in the pair on the right, the alternative is technically correct but the chosen solution is more convenient and preferable). The PIQA task is essentially multiple-choice QA: given a goal q and two possible solutions s1 and s2, only one of which is correct (as shown in Figure 3.2), the model or human must choose the most appropriate solution. The dataset is defined by human annotators around goal-solution pairs: the goal can be viewed as a post-condition, and the solution describes the process of achieving it. The more detailed the goal, the easier it is to write correct and incorrect solutions. To prompt annotators to think creatively and devise atypical solutions, they are given links from the instructables.com website for inspiration. Instructables.com is a crowdsourced collection of guides for building, making, and baking with everyday materials, from cooking to car repair; in most cases each step comes with images, videos, and a list of required tools. Annotators derive the incorrect solution from the correct one with small linguistic tricks, such as changing a key word, changing a value, or swapping in an action that does not help achieve the goal. During validation, samples requiring expert knowledge are removed, and the AFLite algorithm is used for further cleaning.

PIQA dataset statistics: the dataset consists of more than 16000 training QA pairs, plus about 2K pairs for development and about 3K for testing. The average goal length is 7.8 words, the average length of the correct and incorrect solutions is 21.3 words, and the words used in a correct solution overlap those of its incorrect counterpart by at least 85%. Word-frequency statistics over nouns, verbs, adjectives, and adverbs confirm that the dataset is indeed strongly tied to physical phenomena. For example, the most frequent adjectives include state (dry, clean, hot), shape (small, sharp, flat), and manner (fast, careful); such attributes usually determine whether a solution is right. A hedged sketch of scoring the two solutions with a pre-trained model follows.
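As a concrete picture of the task format, here is a sketch using the Hugging Face transformers multiple-choice interface. The model choice and preprocessing are assumptions, not the paper's code, and the freshly initialized choice head gives arbitrary predictions until fine-tuned on PIQA.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMultipleChoice.from_pretrained("roberta-base")

goal = "Separate the yolk from the egg white with a water bottle."
sols = ["Squeeze the bottle, press it against the yolk, then release.",
        "Place the bottle over the yolk and keep pushing."]

# Encode (goal, solution) pairs; multiple-choice models expect inputs of
# shape (batch, n_choices, seq_len).
enc = tok([goal, goal], sols, return_tensors="pt",
          padding=True, truncation=True)
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}
with torch.no_grad():
    logits = model(**inputs).logits        # shape: (1, 2)
pred = logits.argmax(dim=-1).item()        # index of the chosen solution
```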

Experiments: the authors evaluated the GPT, BERT, and RoBERTa models (RoBERTa being a version of BERT pre-trained on more data). The results, shown in Table 1, reveal a gap of nearly 20% between the best model and humans. The authors then analyze which aspects of the dataset fool the RoBERTa model. The larger the edit distance (i.e., the number of differing words) between the two solutions, the lower the model's accuracy. The authors find that RoBERTa still fails to understand many common, general physical concepts. As shown in Figure 3.4, on validation samples (q, s1, s2) where the only difference between s1 and s2 is a single word w, when w is "cold", "before", or "after", RoBERTa's accuracy is close to the 50% of random guessing. Taking the high-accuracy words "water" and "spoon" as examples, the authors further examine which words most often replace them in the training set, as shown in Figure 3.5. The words that most often replace "spoon" are "fork" and "knife", yet in the physical world a spoon cannot be replaced by sharp or pointed utensils; RoBERTa's 90% accuracy on "spoon" suggests it may understand this simple property of spoons. "Water" is very common in the training set and highly general; the words that most often replace it are "milk", "oil", and "soda", substitutions that could have very bad consequences in the physical world. RoBERTa's roughly 75% accuracy suggests it has not fully grasped the concept of water. Meanwhile, only 66% on "freeze" indicates that understanding verbs is still not RoBERTa's strong point. A small sketch of the word-level edit distance used in this analysis follows.
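For reference, here is a plausible self-contained implementation of the word-level edit distance used in the analysis (the number of word insertions, deletions, and substitutions between two solutions); the tokenization by whitespace is an assumption.

```python
def word_edit_distance(s1, s2):
    """Levenshtein distance over words via dynamic programming."""
    a, b = s1.lower().split(), s2.lower().split()
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete a word
                           dp[i][j - 1] + 1,         # insert a word
                           dp[i - 1][j - 1] + cost)  # substitute a word
    return dp[-1][-1]

d = word_edit_distance("pour cold water into the bowl",
                       "pour hot water into the bowl")   # -> 1
```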

(Left) Figure 3.4: RoBERTa's grasp of common physical-world concepts. (Right) Figure 3.5: the most common substitutions for "water", "spoon", and "freeze".

Analyst's summary: the task proposed in this paper is quite novel. Compared with the commonsense contained in commonsense knowledge bases, physical commonsense pays more attention to the physical properties of objects, and the PIQA dataset leans toward atypical physical commonsense, so the answers cannot be looked up directly in existing text corpora. On this kind of physical commonsense reasoning problem there is still a large gap between the best models and humans, which shows that the models lack an understanding of some of the most basic physical characteristics of the world; a breakthrough on such problems would push the field of artificial intelligence a step further.

The three selected papers show the research progress on commonsense knowledge and commonsense reasoning from different perspectives. The first opens up the visual side: it combines commonsense with relational reasoning and applies them to image and video description generation, with the three modules cooperating to achieve good results. The second proposes a new method for commonsense question answering that fuses heterogeneous commonsense knowledge sources into the same representation space and answers questions through graph reasoning, proving very effective on the benchmark dataset CommonsenseQA. The third proposes a new commonsense reasoning task and the PIQA dataset, providing opportunities and challenges for future research on commonsense problems. Because commonsense knowledge and commonsense reasoning are usually combined with natural language understanding and visual question answering, problems involving commonsense are harder to solve than general natural language processing and computer vision problems. Except for the third paper, which introduces a dataset, the first two both resort to graphs or graph neural networks, suggesting that graphs may be a way to tackle such problems. The most advanced language models still lag far behind humans on commonsense knowledge and commonsense reasoning (as seen on CommonsenseQA and PIQA), so these remain areas of artificial intelligence worth exploring!

List of papers accepted by AAAI 2020: https://aaai.org/conferences/aaai-20/wp-content/uploads/2020/01/aaai-20-accepted-paper-list.pdf

Other AAAI 2020 papers related to this topic:

· Commonsense Knowledge Base Completion with Structural and Semantic Context

· Paper link: https://arxiv.org/pdf/1910.02915.pdf

· Towards Understanding the Semantic Content of Sparse Word Embeddings Using a Commonsense Knowledge Base

· Paper link: https://kr2ml.github.io/2019/papers/kr2ml-2019-paper-29.pdf

· Evaluating Commonsense in Pre-trained Language Models


· Paper link: https://arxiv.org/pdf/1911.11931.pdf

· KnowIT VQA: Answering Knowledge-Based Questions about Videos

· Paper link: https://arxiv.org/pdf/1910.10706.pdf

Analyst profile: Luo Sainan is a graduate student in computer science and technology at Xidian University, with a research focus on network security and a great curiosity about all fields of computer vision, hoping to learn and make progress together with everyone.