Current Version: 1.0
Developed by: Claude Pasquier
Copyright: © 2010 The University of Nice Sophia Antipolis
License: CeCILL-B

Description
This software is a Java implementation of the Genia tagger (part-of-speech tagging and shallow parsing for biomedical texts), version 3.0.1 of April 16, 2007, available here. The original version was developed in C++ by Yoshimasa Tsuruoka of the Tsujii Laboratory at the University of Tokyo and is distributed under the modified BSD licence. The datasets are identical to those of the original C++ version, and the output of this Java version should be identical to the output of the original. For more information about the original software, see:
Availability
Execution
Prepare a text file containing one sentence per line, then execute the program with:
The tagger outputs the base forms, part-of-speech (POS) tags, chunk tags, and named entity (NE) tags in the following tab-separated format.
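For downstream processing, each output row can be split into its fields. The following is a minimal Java sketch, not part of the tagger itself. It assumes five columns per row in the order word, base form, POS tag, chunk tag, NE tag; the leading word column and the exact order are assumptions, and the class name TaggedToken is hypothetical.

```java
/** One row of tagger output. The five-column order used here is an assumption. */
final class TaggedToken {
    final String word;
    final String base;
    final String pos;
    final String chunk;
    final String ne;

    private TaggedToken(String word, String base, String pos, String chunk, String ne) {
        this.word = word;
        this.base = base;
        this.pos = pos;
        this.chunk = chunk;
        this.ne = ne;
    }

    /** Splits one tab-separated row into its five fields. */
    static TaggedToken parse(String line) {
        // A limit of -1 keeps trailing empty fields instead of dropping them.
        String[] f = line.split("\t", -1);
        if (f.length != 5) {
            throw new IllegalArgumentException(
                "expected 5 tab-separated fields, got " + f.length + ": " + line);
        }
        return new TaggedToken(f[0], f[1], f[2], f[3], f[4]);
    }

    public static void main(String[] args) {
        // Hand-written row for illustration, not actual tagger output.
        TaggedToken t = parse("activation\tactivation\tNN\tI-NP\tO");
        System.out.println(t.word + " / " + t.pos + " / " + t.chunk);
    }
}
```

Rejecting rows with an unexpected field count makes a wrong column assumption fail early rather than silently shifting the tags.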
Chunks are represented in the IOB2 format (B for BEGIN, I for INSIDE, and O for OUTSIDE).

Example
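As an illustration of reading the chunk column, the Java sketch below groups IOB2 tags into noun phrases. The rows in main are hand-written for illustration, not actual tagger output; the column order (word, base form, POS tag, chunk tag, NE tag), the tag names B-NP and I-NP, and the class name NounPhraseExtractor are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class NounPhraseExtractor {

    /** Collects the words of each noun-phrase chunk from tab-separated rows. */
    static List<String> nounPhrases(List<String> rows) {
        List<String> phrases = new ArrayList<>();
        StringBuilder current = null; // words of the NP chunk being built, if any
        for (String row : rows) {
            String[] f = row.split("\t", -1);
            // Rows with fewer than four fields (e.g. blank lines) are treated as outside any chunk.
            String chunk = f.length >= 4 ? f[3] : "O";
            if (chunk.equals("B-NP")) {
                // B starts a new chunk, closing any chunk that was open.
                if (current != null) {
                    phrases.add(current.toString());
                }
                current = new StringBuilder(f[0]);
            } else if (chunk.equals("I-NP") && current != null) {
                // I continues the chunk opened by the preceding B.
                current.append(' ').append(f[0]);
            } else {
                // O, or a tag of another chunk type, ends the open NP chunk.
                if (current != null) {
                    phrases.add(current.toString());
                    current = null;
                }
            }
        }
        if (current != null) {
            phrases.add(current.toString());
        }
        return phrases;
    }

    public static void main(String[] args) {
        // Hand-written sample rows (assumed tags), one token per row.
        List<String> rows = Arrays.asList(
            "Inhibition\tInhibition\tNN\tB-NP\tO",
            "of\tof\tIN\tB-PP\tO",
            "NF-kappaB\tNF-kappaB\tNN\tB-NP\tB-protein",
            "activation\tactivation\tNN\tI-NP\tO",
            "reversed\treverse\tVBD\tB-VP\tO",
            "the\tthe\tDT\tB-NP\tO",
            "anti-apoptotic\tanti-apoptotic\tJJ\tI-NP\tO",
            "effect\teffect\tNN\tI-NP\tO",
            "of\tof\tIN\tB-PP\tO",
            "isochamaejasmin\tisochamaejasmin\tNN\tB-NP\tO",
            ".\t.\t.\tO\tO");
        for (String np : nounPhrases(rows)) {
            System.out.println(np);
        }
    }
}
```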
You can easily extract four noun phrases ("Inhibition", "NF-kappaB activation", "the anti-apoptotic effect", and "isochamaejasmin") from this output by looking at the chunk tags. You can also find a protein name with the named entity tags.

Related publication
1. Pasquier C: Single Document Keyphrase Extraction Using Sentence Clustering and Latent Dirichlet Allocation. In Proceedings of the 5th International Workshop on Semantic Evaluation. Uppsala, Sweden: Association for Computational Linguistics; 2010:154–157.