Social media users publish millions of articles every day. Most are of little value, but some are genuinely useful, and picking the useful content out of that volume by hand is very difficult. It would help readers if a computer could do the filtering, and a model based on a convolutional neural network can.
Researchers from the Chinese Academy of Sciences (CAS) have developed a convolutional neural network (CNN)-based model to extract knowledgeable snippets and annotate documents. Their method, outlined in a paper pre-published on arXiv, was found to perform better than existing tools, despite being trained for shorter periods of time.
An article published on Techxplore, quoting the research paper, defines a “knowledgeable document” as “a document containing multiple knowledgeable snippets, which describe concepts, properties of entities, or the relations among entities.”
The researchers tested the effectiveness of their neural network model on real documents from various WeChat content domains. WeChat is a Chinese social media and payment platform.
So far, most knowledge bases, such as YAGO or DBpedia, extract knowledge based on Wikipedia, WordNet, GeoNames, and other online resources. However, compared to social media platforms, these resources often contain limited and inflexible information.
“Another recent knowledge base, Probase, with 2.7 million concepts, was automatically harnessed from the so-far largest corpus, consisting of 326 million knowledgeable sentences extracted from 1.68 billion web pages,” the researchers wrote in their paper. “However, these sentences are extracted only by the Hearst patterns. For extracting more knowledgeable snippets to construct more comprehensive knowledge bases, semantic-based methods are needed to complement the previous pattern-based ones.”
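The Hearst patterns mentioned in the quote are lexical templates, such as "X such as Y", that signal an is-a relation between terms. The sketch below shows one such pattern as a regular expression. It is only an illustration of pattern-based extraction in general, not the extractor used to build Probase: the helper name `hearst_such_as` is hypothetical, and noun phrases are crudely approximated as single words.

```python
import re

# One Hearst pattern: "<hypernym> such as <hyponym>, <hyponym> and <hyponym>".
# Assumption: every noun phrase is a single word. A real extractor would
# use noun-phrase chunking instead of \w+.
SEPARATOR = r"(?:,? (?:and|or) |, )"
SUCH_AS = re.compile(r"(\w+),? such as ((?:\w+" + SEPARATOR + r")*\w+)")


def hearst_such_as(sentence):
    """Return (hyponym, hypernym) pairs matched by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group(1)
        # Split the enumerated list on commas and on "and" / "or".
        hyponyms = re.split(SEPARATOR, match.group(2))
        pairs.extend((hyponym, hypernym) for hyponym in hyponyms)
    return pairs


print(hearst_such_as("He plays instruments such as guitar, piano and violin."))
```

A pattern like this only fires when the sentence happens to use the exact template, which is the limitation the researchers point to when they argue for semantic-based methods.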
Knowledgeable snippets and articles could also be used to develop knowledge retrieval and question answering services. These services would, for instance, answer questions raised by users who are looking for help with a particular problem. With these applications in mind, the researchers at CAS set out to develop a CNN-based model that can analyze the semantics of a document, determine whether it is knowledgeable or not, and extract knowledgeable snippets of information from it.
“Specifically, we propose SSNN, a joint CNN-based model, to understand the abstract concept of documents in different domains collaboratively and judge whether a document is knowledgeable or not,” the researchers explain in their paper. “In more detail, the network structure of SSNN is ‘low-level Sharing, high-level Splitting,’ in which the low-level layers are shared for different domains while the high-level layers beyond the CNN are trained separately to perceive the differences of different domains.”
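The "low-level Sharing, high-level Splitting" layout described in the quote can be sketched roughly as follows. This is a minimal, forward-pass-only illustration and not the authors' implementation: the layer sizes, the domain names, the random untrained weights, and the choice of a single logistic unit as each domain's high-level layer are all assumptions made for the example.

```python
import math
import random

random.seed(0)

# Hypothetical sizes and domain names, chosen only for illustration.
EMBED_DIM, KERNEL_WIDTH, NUM_FILTERS = 4, 3, 5
DOMAINS = ["health", "finance"]


def rand_matrix(rows, cols):
    """Small random matrix standing in for trained weights or embeddings."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Low-level layers, shared by every domain: one convolution filter bank.
# Each filter spans KERNEL_WIDTH consecutive token embeddings.
shared_filters = rand_matrix(NUM_FILTERS, KERNEL_WIDTH * EMBED_DIM)

# High-level layers, split per domain: here one logistic unit per domain.
domain_heads = {d: {"w": rand_matrix(1, NUM_FILTERS)[0], "b": 0.0} for d in DOMAINS}


def shared_features(token_embeddings):
    """1-D convolution over token windows, ReLU, then max-over-time pooling.

    Expects at least KERNEL_WIDTH tokens.
    """
    windows = [
        [x for token in token_embeddings[i:i + KERNEL_WIDTH] for x in token]
        for i in range(len(token_embeddings) - KERNEL_WIDTH + 1)
    ]
    pooled = []
    for filt in shared_filters:
        activations = [
            max(0.0, sum(w * x for w, x in zip(filt, window))) for window in windows
        ]
        pooled.append(max(activations))
    return pooled


def knowledgeable_probability(token_embeddings, domain):
    """Score a document with the shared layers plus its domain's own head."""
    features = shared_features(token_embeddings)
    head = domain_heads[domain]
    logit = sum(w * f for w, f in zip(head["w"], features)) + head["b"]
    return 1.0 / (1.0 + math.exp(-logit))


document = rand_matrix(6, EMBED_DIM)  # six made-up token embeddings
for domain in DOMAINS:
    print(domain, knowledgeable_probability(document, domain))
```

The point of the split is visible in the data flow: every document passes through the same `shared_filters`, so all domains contribute to learning the low-level representation, while each entry in `domain_heads` is free to weigh those features differently.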
The model devised by the researchers offers an end-to-end way to annotate documents without extensive, time-consuming feature engineering. The researchers also hand-crafted features and trained a support vector machine (SVM) classifier on the same task.