Web NER Model Generation Tool


1. Abstract:

Named entity recognition (NER) is of vital importance in information extraction and natural language processing. Current NER systems are trained mainly on journalistic documents such as news articles to extract person, location, and organization names. Because they have not been trained on informal text, their performance drops on Web documents, which are noisy and less structured. State-of-the-art NER systems therefore do not work well on Web documents, and users who want to recognize named entities in Web documents have to train a new model. Training a new model is labor-intensive and time-consuming: the preparatory work includes collecting a large set of training data, labeling named entities, selecting an appropriate segmentation, unifying symbols, normalization, designing features, preparing dictionaries, and so on. Moreover, this work has to be repeated for each new language or entity type.

In this research, we propose an NER model generation tool for effective Web entity extraction. The tool uses a semi-supervised learning approach to NER, based on automatic labeling and tri-training, which makes use of unlabeled data and structured resources containing known named entities. Experiments confirmed that the tool can be applied to different languages and to various types of named entities.

2. Tool Introduction:

Our Web NER model generation tool accepts a named entity list provided by the user. Such a list can be collected, for example, from Yellow Pages.

(Please refer to the page-level information extraction tool proposed by Jhong-Li Ding: https://sites.google.com/site/nculab/plde)

The tool then uses the known entities as query keywords and collects the search result snippets that contain the query keyword as training data. It then automatically builds the dictionary, extracts features, learns a CRF model, and applies self-testing and tri-training to improve NER performance.

Compared with the high cost of labeling training examples by hand, our tool labels training data automatically from Google search results, so it can obtain a large labeled training set in less time.
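The automatic labeling idea can be illustrated with a minimal sketch. It assumes that labeling amounts to locating each known entity in a snippet by exact string match and preferring the longest match; the function name `find_entity_spans` and the sample snippet are hypothetical and are not taken from the tool.

```python
def find_entity_spans(snippet, entities):
    """Return sorted, non-overlapping (start, end, entity) character spans.

    Assumption: exact string matching, longest entity first. The tool's own
    matching rules may differ.
    """
    spans = []
    # Try longer entities first so that a long name wins over a shorter one inside it.
    for entity in sorted(entities, key=len, reverse=True):
        start = snippet.find(entity)
        while start != -1:
            end = start + len(entity)
            # Keep the match only if it does not overlap an already accepted span.
            if all(end <= s or start >= e for s, e, _ in spans):
                spans.append((start, end, entity))
            start = snippet.find(entity, start + 1)
    return sorted(spans)


# Hypothetical snippet and entity list, for illustration only.
snippet = "Visit Taipei City Hall in Taipei for more information."
entities = ["Taipei", "Taipei City Hall"]
spans = find_entity_spans(snippet, entities)
for start, end, entity in spans:
    print(start, end, entity)
```

Each span marks text that can be labeled as a named entity without human annotation; everything outside the spans is treated as non-entity text.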

The example and the default 18 features we used are described below:


This figure gives an example for the Chinese sentence “110 內政服務熱線:1996 勤務指揮中心，台北市”. The numbered marks in Figure 5 are read as follows:

(1) the “1” indicates that the token is at the beginning of the sentence;

(2) the token is the end of a phrase;

(3) the token is a symbol;

(4) the token is the start of a phrase;

(5) the token is a frequently used word before a named entity;

(6) the token is the end of a phrase;

(7) the token is a symbol;

(8) the first “1” indicates that the token is the start of a phrase; the second “1” that the token is a frequently used word after a named entity; the third “1” that this token and the next token form a frequently used word before a named entity (here the token is “台” of “台北”); and the last “1” that this token and the next two tokens form a frequently used word before a named entity (here the token is “台” of “台北市”);

(9) a “1” indicates that this token and the previous token form a frequently used word before a named entity (here the token is “北” of “台北”), and the last “1” that this token, as the middle token, and the two tokens around it form a frequently used word before a named entity (here the token is “北” of “台北市”);

(10) the first “1” indicates that the token is a frequently used word before a named entity; the second “1” that this token and the previous two tokens form a frequently used word before a named entity (here the token is “市” of “台北市”); and the last “1” that the token is at the end of the sentence.

Finally, we use the BIEOS Start/End scheme to tag the named entities to be extracted. BIEOS tagging assigns each token to one of five classes: B for the first token of an attribute value, I for inner tokens, E for the last token, O for tokens that do not belong to any attribute value, and S for an attribute value that consists of a single token.
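The BIEOS scheme described above can be sketched as follows. This is an illustration under the assumption that entity positions are already known as token index ranges; the function name `bieos_tags` and the sample tokens are hypothetical.

```python
def bieos_tags(num_tokens, entity_spans):
    """Assign a BIEOS tag to every token.

    entity_spans holds (start, end) token indices with end exclusive.
    """
    tags = ["O"] * num_tokens            # O: outside any attribute value
    for start, end in entity_spans:
        if end - start == 1:
            tags[start] = "S"            # S: single-token value
        else:
            tags[start] = "B"            # B: first token
            for i in range(start + 1, end - 1):
                tags[i] = "I"            # I: inner tokens
            tags[end - 1] = "E"          # E: last token
    return tags


# Hypothetical tokens: one three-token entity and one single-token entity.
tokens = ["Visit", "Taipei", "City", "Hall", "in", "Taipei", "."]
tags = bieos_tags(len(tokens), [(1, 4), (5, 6)])
for token, tag in zip(tokens, tags):
    print(token, tag)
```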

3. Usage

Step1. Google_Snippets_Crawler

Crawl.bat is an example.

command: java -cp NER_ModelGenerator.jar Google_Snippets_Crawler.MyMain example\\entity example\\TrainingData lang_en 10

There are 4 major parameters:

(1) example\\entity : the named entity list file.

(2) example\\TrainingData : the search snippets collected from the Google search engine.

(3) lang_en : the language parameter used in the Google search URL (Chinese: lang_zh-CN%7Clang_zh-TW, Japanese: lang_ja, English: lang_en).

(4) 10 : the number of top search result snippets collected for each named entity used as a query keyword (can be set by the user).
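The `%7C` in the Chinese setting is the URL-encoded form of the `|` character that joins two language values. The sketch below shows this with Python's standard `urllib.parse`. How the crawler actually builds its request is not described here, so the base URL and the parameter names `q`, `lr`, and `num` are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Assumption: the language value is sent as an "lr" query parameter and the
# snippet count as "num"; the tool's real request format is not documented here.
BASE_URL = "https://www.google.com/search"


def build_query_url(keyword, language, top_n):
    """Build a search URL for one named entity used as the query keyword."""
    params = {"q": keyword, "lr": language, "num": top_n}
    # urlencode percent-encodes "|" as "%7C" and spaces as "+".
    return BASE_URL + "?" + urlencode(params)


url = build_query_url("Taipei City Hall", "lang_zh-CN|lang_zh-TW", 10)
print(url)
```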

Step2. Training

RunTrain.bat is an example.

 Basic:

command: java -cp NER_ModelGenerator.jar TrainModel.Start example\\TrainingData example\\entity\\address.txt example\\TrainingDataout alphabetic model 100 10

(1) example\\TrainingData : the search snippets collected from the Google search engine.

(2) example\\entity\\address.txt : the named entity list file.

(3) example\\TrainingDataout : the training data after preprocessing and the trained model.

(4) alphabetic : alphabetic data; nonalphabetic : non-alphabetic data.

(5) model : the trained model name.

(6) 100 : automatically select the top 100 most frequent words in the list as the dictionary. (This number, M, can be set by the user.)

(7) 10 : take the 10 words before and after the named entity as the window. (This number, N, can be set by the user.)
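Parameters (6) and (7) can be pictured with a minimal sketch: count the words that fall inside a window of N tokens before and after each labeled entity, then keep the M most frequent ones as the dictionary. The function name `context_dictionary` and the sample data are hypothetical, and the tool's own counting rules may differ.

```python
from collections import Counter


def context_dictionary(sentences, window, top_m):
    """Collect the most frequent words before and after named entities.

    sentences: list of (tokens, entity_start, entity_end), end exclusive.
    window:    number of tokens to look at on each side (N).
    top_m:     number of most frequent words to keep (M).
    """
    before, after = Counter(), Counter()
    for tokens, start, end in sentences:
        before.update(tokens[max(0, start - window):start])
        after.update(tokens[end:end + window])
    return ([word for word, _ in before.most_common(top_m)],
            [word for word, _ in after.most_common(top_m)])


# Hypothetical labeled sentences; the entity covers token positions 2 to 5.
sentences = [
    (["address", ":", "No.", "1", "Main", "Road", "tel", "02"], 2, 6),
    (["address", ":", "No.", "7", "Park", "Street", "tel"], 2, 6),
]
before_words, after_words = context_dictionary(sentences, window=2, top_m=2)
print(before_words, after_words)
```

Words such as "address" then serve as the "frequently used words before named entity" features described in the tool introduction.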

 Self-Testing:

command: java -cp NER_ModelGenerator.jar TrainModel.Start example\\TrainingData example\\entity\\address.txt example\\TrainingDataout alphabetic model 100 10 0.7

(3) example\\TrainingDataout : the training data after preprocessing and the trained model.

(8) 0.7 : the filter threshold used to remove low-probability sentences. The default filter threshold is 0.7. (A suitable value depends on individual circumstances.)
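The effect of the filter threshold can be sketched as below. It assumes that each automatically labeled sentence comes with a probability from the trained model and that sentences below the threshold are dropped; the function name `filter_by_probability` and the sample scores are hypothetical.

```python
def filter_by_probability(scored_sentences, threshold=0.7):
    """Keep only the sentences whose probability reaches the threshold.

    scored_sentences: list of (sentence, probability) pairs.
    """
    return [sentence for sentence, prob in scored_sentences if prob >= threshold]


# Hypothetical sentences with model probabilities.
scored = [("snippet A", 0.91), ("snippet B", 0.42), ("snippet C", 0.70)]
kept = filter_by_probability(scored, threshold=0.7)
print(kept)
```

A higher threshold keeps fewer but cleaner training sentences, which is why the value depends on the data at hand.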

 Tri-Training:

command: java -cp NER_ModelGenerator.jar TrainModel.Start example\\TrainingData example\\entity\\address.txt example\\TrainingDataout alphabetic model 100 10 0.7 example\\UnlabelData

(3) example\\TrainingDataout : the training data after preprocessing and the trained model.

(8) 0.7 : the filter threshold used to remove low-probability sentences. The default filter threshold is 0.7. (A suitable value depends on individual circumstances.)

(9) example\\UnlabelData : the unlabeled data file.
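One round of tri-training on the unlabeled data can be sketched as follows. This follows the general tri-training idea (an unlabeled sentence is added to one model's training data when the other two models agree on its labels) and is not taken from the tool's code. The three toy models and the name `tri_training_round` are hypothetical stand-ins for trained CRF models.

```python
def tri_training_round(models, unlabeled):
    """For each of three models, collect the unlabeled sentences on which
    the other two models predict the same tag sequence."""
    new_data = [[], [], []]
    for i in range(3):
        j, k = [m for m in range(3) if m != i]
        for sentence in unlabeled:
            tags_j, tags_k = models[j](sentence), models[k](sentence)
            if tags_j == tags_k:
                # The agreed labels become new training data for model i.
                new_data[i].append((sentence, tags_j))
    return new_data


# Toy stand-ins for three trained taggers (assumptions, not real CRF models).
def model_a(sentence):
    return ["S" if token == "Taipei" else "O" for token in sentence]


def model_b(sentence):
    return ["S" if token == "Taipei" else "O" for token in sentence]


def model_c(sentence):
    return ["O"] * len(sentence)


unlabeled = [["in", "Taipei"], ["hello", "world"]]
new_data = tri_training_round([model_a, model_b, model_c], unlabeled)
print([len(examples) for examples in new_data])
```

In a full loop, each model would be retrained on its original data plus its new examples, and the round would be repeated until no model changes.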

Step3. Testing data preparation and Testing

RunTest.bat is an example.

command: java -cp NER_ModelGenerator.jar Testing.Start example\\TestingData\\nameSnippet_a_test.txt example\\TestOutput\\ alphabetic model_DS5_self_eng_07_re

(1) example\\TestingData\\nameSnippet_a_test.txt : the testing data to which you want to assign sequential tags.

(2) example\\TestOutput\\ : the testing data after preprocessing and the model.

(3) example\\TrainingDataout : test output.

(4) alphabetic : alphabetic data; nonalphabetic : non-alphabetic data.

(5) model_DS5_self_eng_07_re : the model name (please place the model in the example\\TestOutput folder).

If you want to download our tool or have any questions, please feel free to contact Professor Chang ([email protected])