docanalysis | Corpus Sectioning, Entity Extraction and Text Mining

Release Category: Production

Developed By: Shweata N. Hegde and Peter Murray-Rust

docanalysis is a command-line tool that processes document collections (CProjects) and performs text analysis.

It can:

  1. Divide documents into sections
  2. Perform text mining and natural language processing (NLP)
  3. Generate dictionaries of terms

It uses custom code along with Python tools like NLTK, and it can use spaCy or scispaCy for extracting and annotating entities. The tool creates summary data and word lists as output.
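To give a feel for the kind of word-list output described above, here is a minimal sketch that uses only the Python standard library. It is not docanalysis's own code: the function name `word_list`, the sample section texts, and the tiny stop-word set are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word set for illustration only; a real pipeline
# would normally use a fuller list (for example, one shipped with NLTK).
STOP_WORDS = {"the", "of", "and", "in", "a", "to", "was", "were"}

def word_list(sections, top_n=5):
    """Count non-stop-words across a mapping of section name -> text."""
    counts = Counter()
    for text in sections.values():
        # Lowercase the text and keep alphabetic tokens only.
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(top_n)

# Made-up sample sections, standing in for a sectioned document.
sections = {
    "methods": "Leaves of the plant were extracted and the oil was analysed.",
    "results": "The oil contained terpenes, and terpenes dominated the extract.",
}
print(word_list(sections, top_n=2))
```

The sketch only counts words per document; the entity extraction and annotation that the tool performs with spaCy or scispaCy is a separate step that it does not attempt to reproduce.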

Primary functionality:

Primary inputs:

Primary outputs:

Main file types for transfer: .xml (fulltext, sectioned), .json (eupmc_result), .csv, .html, .json (output), AMI .xml dictionaries.

Installation

To verify the installation, run the command docanalysis --help. A help message should appear.
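The same check can be scripted, for example at the top of a notebook. The following is a minimal sketch, assuming that docanalysis --help exits with status 0 when the tool is installed correctly; the helper name `cli_available` is hypothetical and not part of docanalysis.

```python
import shutil
import subprocess

def cli_available(command="docanalysis"):
    """Return True if `<command> --help` runs and exits with status 0."""
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which(command) is None:
        return False
    result = subprocess.run([command, "--help"], capture_output=True, text=True)
    # Assumption: a successful help invocation exits with status 0.
    return result.returncode == 0

print("docanalysis installed:", cli_available())
```

If this prints False, the executable is either not on the PATH of the current environment or the help command failed.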

Tutorials (Jupyter Notebook / Colab Notebook)

hackathon
