The ELA repository on GitHub contains various Python tools that have been used to process XML-TEI encoded texts for use in the ELA platform. The toolset is provided as:
- a command-line based tool that extracts information from the XML files
- a set of Jupyter notebooks that have been used for several types of operation, ranging from XML cleanup and normalization to text statistics and analysis
- some high-level libraries supporting the Jupyter notebooks
- a directory structure that includes locations where the tools expect to find data and possibly write results
- scripts that recreate the Python virtual environment on various platforms.
The toolset is not expected to work online, because it requires a fairly complete virtual environment and the ability to read and write files. Jupyter is recommended for the entire toolset (including the command-line script).
Usage
After cloning the repository, open a terminal window and cd to the ELA_archive/TOOLS directory. Prepare the virtual environment by running the appropriate script (e.g. on an Ubuntu 18.04 workstation with Python 3.6 installed, run ./env-py36lx64.sh and wait for the environment to be created). Then activate the environment and run Jupyter as follows:
$ . ela_tk_py36lx64/bin/activate
$ jupyter-lab
A Jupyter browser session is launched, and all the nbk_*.ipynb files can be opened. Follow the instructions in the notebooks to run the tools.
Note: The toolset is usable for its own specific purpose, but cannot be considered production-ready for anything other than providing data to the ELA platform: please consider it a suite in an early stage of development. Form, structure and usage are subject to change without notice.
Jupyter notebooks
Tools ELA
tools_ela.py is a command-line tool that extracts information from the original transcribed texts. It provides:
- Indexes of Words, Lemmas, Types
- Frequencies of Lemmas, Types
- Concordances
- TTR
- Collocations (for both words and lemmas)
- N-grams
- Min, Max and Mean lengths of Types
- POS Tagging (both Bayesian and HMM - Hidden Markov Model)
- TEI attributes (on TEI files)
- TEI entity lists (on TEI files)
The tool can handle both XML-TEI encoded and plain-text files. By default it handles XML-TEI, but a command-line argument can instruct the tool to treat files as text. It is a multiplatform tool, actively tested on both Linux and Windows.
All results, except for concordances, are in JSON format.
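As a rough illustration of a few of the measures listed above, the sketch below computes the type/token ratio, the type-length statistics and the n-gram counts for a short token list, and prints them as JSON. It is not the code used by tools_ela.py: the function name basic_stats and the output keys are hypothetical, and the token list is assumed to be non-empty.

```python
import json
from collections import Counter


def basic_stats(tokens, n=2):
    """Return TTR, type-length statistics and n-gram counts for a token list."""
    types = sorted(set(tokens))
    lengths = [len(t) for t in types]
    # Slide a window of size n over the tokens and count each n-gram
    ngrams = Counter(zip(*(tokens[i:] for i in range(n))))
    return {
        "ttr": len(types) / len(tokens),
        "type_length": {
            "min": min(lengths),
            "max": max(lengths),
            "mean": sum(lengths) / len(lengths),
        },
        # JSON keys must be strings, so each n-gram is joined with spaces
        "ngrams": {" ".join(g): c for g, c in ngrams.items()},
    }


tokens = "in principio erat verbum et verbum erat apud deum".split()
stats = basic_stats(tokens)
print(json.dumps(stats, indent=2))
```

The actual JSON layout produced by the tool may differ; check the output files for the real keys.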
Note: tools_ela.py requires CLTK to be installed, and the latin_models_cltk corpus to be loaded.
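One possible way to load the corpus from Python is sketched below. It assumes a legacy (pre-1.0) CLTK release, where the corpus importer lives in cltk.corpus.utils.importer; later releases changed this interface, so check the CLTK version installed by the environment script. The wrapper function load_latin_models is hypothetical.

```python
def load_latin_models():
    """Download the latin_models_cltk corpus (assumes a legacy, pre-1.0 CLTK)."""
    # Imported lazily so that this module can be loaded without CLTK installed
    from cltk.corpus.utils.importer import CorpusImporter

    importer = CorpusImporter("latin")
    importer.import_corpus("latin_models_cltk")
```

Calling load_latin_models() once inside the activated environment should be enough; it needs network access.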
Retag (part 1 and part 2)
Replace and normalize geogName and persName tags in files by using lists (machine-produced and human-reviewed); all known names and their variants are replaced by normalized tags.
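The idea can be sketched as follows, under simplifying assumptions: the input is plain text without existing markup, the variant list PERSON_VARIANTS is made up for illustration, and the normalized form is stored in a key attribute. The notebooks may use a different list format and different attributes.

```python
import re

# Hypothetical reviewed list: each variant maps to its normalized form
PERSON_VARIANTS = {
    "Karolus": "Carolus",
    "Carolus": "Carolus",
    "Karulus": "Carolus",
}


def tag_names(text, variants, tag):
    """Wrap every known variant in `tag`, recording the normalized form in @key."""
    # Longest variants first so that a longer name wins over a shorter prefix
    alternatives = sorted(variants, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, alternatives)) + r")\b")

    def wrap(match):
        found = match.group(1)
        return '<{0} key="{1}">{2}</{0}>'.format(tag, variants[found], found)

    return pattern.sub(wrap, text)


tagged = tag_names("Karolus rex et Carolus imperator", PERSON_VARIANTS, "persName")
print(tagged)
```

The same function could be called with "geogName" and a list of place-name variants.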
Solve Abbreviations
This tool expands abbreviations by guessing the case in which an abbreviation should be declined, based on the ending of the following word. It is at a very preliminary stage, and should only be used on text that will be heavily revised afterwards.
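A minimal sketch of this heuristic is shown below. The expansion table, the ending-to-case guesses and the function names are all hypothetical and far cruder than real Latin morphology; they only illustrate the "look at the next word's ending" approach, not the notebook's actual rules.

```python
# Hypothetical expansion table: abbreviation -> {case: expanded form}
EXPANSIONS = {
    "s.": {"nominative": "sanctus", "genitive": "sancti", "accusative": "sanctum"},
}

# Hypothetical ending -> case guesses, checked in order (longest endings first)
ENDING_TO_CASE = [("um", "accusative"), ("us", "nominative"), ("i", "genitive")]


def guess_case(word):
    """Guess a grammatical case from the ending of `word` (None if unknown)."""
    for ending, case in ENDING_TO_CASE:
        if word.lower().endswith(ending):
            return case
    return None


def solve_abbreviations(tokens):
    """Expand known abbreviations, declining them like the following word."""
    result = []
    for i, token in enumerate(tokens):
        forms = EXPANSIONS.get(token.lower())
        following = tokens[i + 1] if i + 1 < len(tokens) else ""
        case = guess_case(following) if forms else None
        if forms and case in forms:
            result.append(forms[case])
        else:
            # Leave the token untouched when no case can be guessed
            result.append(token)
    return result


print(solve_abbreviations(["ecclesia", "s.", "Petri"]))
```

Because an ending such as -i is ambiguous in Latin, guesses of this kind are often wrong, which is why the output needs manual revision.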
Geographic Names DB Loader
Use online text dumps from Pleiades and GeoNames to feed the local geographic names database. This step is needed to prepare the database used by tools_ela.py for retrieving geographic coordinates: tools_ela.py will not provide georeferencing data if the database is missing or has not been prepared.
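To give an idea of what such a loader does, the sketch below reads records in the tab-separated GeoNames dump layout (name in the second column, latitude and longitude in the fifth and sixth) and stores them in an SQLite table. The use of SQLite, the table name places and its schema are assumptions made for this example, not the notebook's actual database; the two sample records are illustrative and truncated after the longitude column.

```python
import csv
import io
import sqlite3


def load_geonames(lines, connection):
    """Insert (name, latitude, longitude) rows from a GeoNames dump into SQLite."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS places (name TEXT, lat REAL, lon REAL)"
    )
    # GeoNames dumps are tab-separated and unquoted
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for fields in reader:
        # Columns: 0 geonameid, 1 name, 2 asciiname, 3 alternatenames,
        # 4 latitude, 5 longitude, ... (remaining columns are ignored here)
        if len(fields) < 6:
            continue
        rows.append((fields[1], float(fields[4]), float(fields[5])))
    connection.executemany("INSERT INTO places VALUES (?, ?, ?)", rows)
    connection.commit()
    return len(rows)


# Two illustrative records in GeoNames column order
sample = io.StringIO(
    "1\tRoma\tRoma\tRome,Urbs\t41.89\t12.48\n"
    "2\tNeapolis\tNeapolis\tNapoli\t40.85\t14.27\n"
)
db = sqlite3.connect(":memory:")
loaded = load_geonames(sample, db)
coords = db.execute("SELECT lat, lon FROM places WHERE name = ?", ("Roma",)).fetchone()
print(loaded, coords)
```

With a real dump, the StringIO object would be replaced by a file opened with encoding="utf-8" and newline="".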
Perform global statistics on a corpus
Given a corpus as a flat directory processed by tools_ela.py, return global corpus statistics in human-readable format. The global statistics are:
- Number of words
- Number of Latin words (intended for multilingual texts, but computed for all texts)
- Number of lemmas
- Number of Latin types
- Min/max/average lengths of Latin types
- Type token ratio
- List of words and their frequencies
- List of Latin words and their frequencies
- List of Latin stopwords and their frequencies
- List of lemmas, frequencies, variants (non-lemmatized)
- List of persName instances and frequencies
- List of geogName/placeName instances and frequencies
Results are provided as TXT and/or HTML files.
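The aggregation step can be pictured roughly as in the sketch below, which merges per-document word frequencies into corpus-wide totals and formats a plain-text report. The input format (one dictionary of word counts per document) and the function name corpus_report are assumptions; the files actually written by tools_ela.py may be organized differently.

```python
from collections import Counter


def corpus_report(per_document_frequencies):
    """Merge per-document word frequencies into a plain-text corpus report."""
    totals = Counter()
    for frequencies in per_document_frequencies:
        totals.update(frequencies)
    n_words = sum(totals.values())
    n_types = len(totals)
    lines = [
        "Number of words: {}".format(n_words),
        "Number of types: {}".format(n_types),
        "Type/token ratio: {:.3f}".format(n_types / n_words),
        "Word frequencies:",
    ]
    # Most frequent first; ties broken alphabetically for a stable listing
    for word, count in sorted(totals.items(), key=lambda item: (-item[1], item[0])):
        lines.append("  {}\t{}".format(word, count))
    return "\n".join(lines)


doc_a = {"et": 3, "rex": 1}
doc_b = {"et": 2, "terra": 1}
report = corpus_report([doc_a, doc_b])
print(report)
```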
The environment is prepared to create word clouds from the statistics with matplotlib. Some examples of network analysis based on authors, people and places mentioned in the documents are also provided.