
GeniaJ

Current Version: 1.0

Developed by: Claude Pasquier

Copyright: © 2010 University of Nice Sophia Antipolis

License: CeCILL-B

Description

This software is a Java implementation of the Genia tagger (part-of-speech tagging and shallow parsing for biomedical texts), version 3.0.1 of April 16, 2007, available here. The original version was developed in C++ by Yoshimasa Tsuruoka of the Tsujii Laboratory at the University of Tokyo and is distributed under the modified BSD license. The datasets are identical to those of the original C++ version, and the output of this Java version should be identical to that of the original.

For more information about the original software, see:

  • Yoshimasa Tsuruoka, Yuka Tateishi, Jin-Dong Kim, Tomoko Ohta, John McNaught, Sophia Ananiadou, and Jun'ichi Tsujii, Developing a Robust Part-of-Speech Tagger for Biomedical Text, Advances in Informatics - 10th Panhellenic Conference on Informatics, LNCS 3746, pp. 382-392, 2005 (pdf).

Availability

  • GeniaJ.jar (Java version 1.0 of July 1, 2010) is freely available on request.

Execution

Prepare a text file containing one sentence per line, then execute the program with:

java -Xmx500m -jar GeniaJ.jar < RAWTEXT > TAGGEDTEXT

The tagger outputs the base forms, part-of-speech (POS) tags, chunk tags, and named entity (NE) tags in the following tab-separated format.

word1   base1   POStag1 chunktag1 NEtag1
word2   base2   POStag2 chunktag2 NEtag2
  :       :        :       :        :

Chunks are represented in the IOB2 format (B for BEGIN, I for INSIDE, and O for OUTSIDE).
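To consume this output from another Java program, each line can be split on tabs into its five fields. The following is a minimal sketch, not part of GeniaJ itself: the class name TaggedToken and its field names are hypothetical, and it assumes every token line carries exactly the five columns described above.

```java
// Hypothetical helper for reading one line of tagger output.
// Not part of GeniaJ; the names are chosen for illustration only.
class TaggedToken {
    final String word;   // surface form as it appears in the input
    final String base;   // base form
    final String pos;    // part-of-speech tag
    final String chunk;  // chunk tag in IOB2 format, e.g. B-NP, I-NP, O
    final String ne;     // named entity tag, e.g. B-protein, O

    TaggedToken(String word, String base, String pos, String chunk, String ne) {
        this.word = word;
        this.base = base;
        this.pos = pos;
        this.chunk = chunk;
        this.ne = ne;
    }

    // Parses one tab-separated output line.
    // Assumption: exactly five fields per line; anything else is rejected.
    static TaggedToken parse(String line) {
        String[] fields = line.split("\t", -1);
        if (fields.length != 5) {
            throw new IllegalArgumentException(
                "Expected 5 tab-separated fields but found " + fields.length + ": " + line);
        }
        return new TaggedToken(fields[0], fields[1], fields[2], fields[3], fields[4]);
    }
}
```

For instance, TaggedToken.parse("reversed\treverse\tVBD\tB-VP\tO") yields a token whose base field is "reverse" and whose chunk field is "B-VP".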

Example

> echo "Inhibition of NF-kappaB activation reversed the anti-apoptotic effect of isochamaejasmin." | java -Xmx500m -jar GeniaJ.jar

Inhibition      Inhibition      NN      B-NP     O
of              of              IN      B-PP     O
NF-kappaB       NF-kappaB       NN      B-NP     B-protein
activation      activation      NN      I-NP     O
reversed        reverse         VBD     B-VP     O
the             the             DT      B-NP     O
anti-apoptotic  anti-apoptotic  JJ      I-NP     O
effect          effect          NN      I-NP     O
of              of              IN      B-PP     O
isochamaejasmin isochamaejasmin NN      B-NP     O
.               .               .       O        O

You can easily extract four noun phrases ("Inhibition", "NF-kappaB activation", "the anti-apoptotic effect", and "isochamaejasmin") from this output by looking at the chunk tags. You can also find a protein name with the named entity tags.
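The extraction described above can be sketched in Java as follows. This is an illustrative helper, not part of GeniaJ: the class ChunkExtractor and its method are hypothetical names, each row is assumed to be one output line already split into its five fields, and the named entity column is assumed to use the same B-/I- prefixes as the chunk column (as the B-protein tag in the example suggests).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that groups words into phrases using IOB2-style tags.
// Not part of GeniaJ; names and column indices are illustrative assumptions.
class ChunkExtractor {
    static final int WORD_COLUMN = 0;
    static final int CHUNK_COLUMN = 3;
    static final int NE_COLUMN = 4;

    // Collects every phrase of the given type (e.g. "NP" or "protein")
    // from the given tag column. Each row holds the fields of one output line.
    static List<String> extract(List<String[]> rows, int tagColumn, String type) {
        List<String> phrases = new ArrayList<>();
        StringBuilder current = null;
        for (String[] row : rows) {
            String tag = row[tagColumn];
            if (tag.equals("B-" + type)) {
                // A B- tag always starts a new phrase, closing any open one.
                if (current != null) {
                    phrases.add(current.toString());
                }
                current = new StringBuilder(row[WORD_COLUMN]);
            } else if (tag.equals("I-" + type) && current != null) {
                // An I- tag continues the phrase that is currently open.
                current.append(' ').append(row[WORD_COLUMN]);
            } else {
                // Any other tag (O or a different type) ends the open phrase.
                if (current != null) {
                    phrases.add(current.toString());
                }
                current = null;
            }
        }
        if (current != null) {
            phrases.add(current.toString());
        }
        return phrases;
    }
}
```

Applied to the rows of the example, extract(rows, ChunkExtractor.CHUNK_COLUMN, "NP") returns the four noun phrases listed above, and extract(rows, ChunkExtractor.NE_COLUMN, "protein") returns the single protein name "NF-kappaB".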

Related publication

1. Pasquier C: Single Document Keyphrase Extraction Using Sentence Clustering and Latent Dirichlet Allocation. In Proceedings of the 5th International Workshop on Semantic Evaluation. Uppsala, Sweden: Association for Computational Linguistics; 2010:154–157.