Import and convert dictionaries from other programs
Many Plover users have steno experience with other programs and therefore have mature dictionaries in those programs' formats. A tool should exist to easily convert other programs' dictionaries to the Plover dictionary format.
Blueprint information
- Status: Not started
- Approver: None
- Priority: Undefined
- Drafter: None
- Direction: Needs approval
- Assignee: None
- Definition: New
- Series goal: None
- Implementation: Unknown
- Milestone target: None
Whiteboard
The following is a list of (Vim-flavored) regular expressions that convert a dictionary exported in RTF/CRE format into Python dictionary format. Ideally, this should be turned into a simple script that new users can run on their dictionaries without prior knowledge of regular expressions.
These expressions have only been fully tested with RTF/CRE dictionaries exported by Eclipse. RTF/CRE files exported from CAT software other than Eclipse will probably need additional formatting, and more testing is required.
Note that Plover currently supports two types of steno dictionary: Eclipse format, where hyphens are made explicit only when necessary, and DigitalCAT format, where all hyphens are explicit. The default format is Eclipse, so if you are importing a DigitalCAT dictionary, change the format in Plover's .config file.
-------
# escape backslashes
%s/\\/\\\\/g
# escape "
%s/"/\\"/g
# convert double spaces to single spaces
%s/  / /g
# Remove lines with court reporter-specific paragraphing commands. (This is drastic, but they cause no end of trouble. I may try to support them
# to some degree in a later version.)
%s/^.*{$}.*$\n//
%s/^.*\
# Convert steno half of entry to Python format
%s/{\\\\.\\\\cxs \([^\}]\+\)}/"\1": /
# Get rid of any lines that don't start with quotes. (i.e., more court reporting formatting residue)
%s/^[^"].*$\n//
# Convert infixes.
%s/: \\\\cxds \(.*\)\\\\cxds/: {^\1^}/
# Convert suffixes.
%s/: \\\\cxds \(.*\)/: {^\1}/
# Convert prefixes.
%s/: \(.*\)\\\\cxds/: {\1^}/
# Delete "force uncap" command (caption-specific command that Plover doesn't need to implement now, if ever.)
%s/{l1}//g
%s/{l0}//g
# Delete \\cxp, the punctuation marker, since Plover recognizes specific punctuation marks independently.
%s/\\\\cxp//g
# Convert glue strokes.
%s/\\\\cxfing /\&/g
# Convert "cap next" strokes.
%s/\\\\cxfc /-|/g
# Convert "stitch" strokes to suffix with hyphen.
%s/{\\\\cxstit /{^-/
# Search for other cx strokes and deal with them manually.
/cx
# Delete spaces at ends of lines. (Don't type the ^M literally; press control-q, then control-m, and ^M will be displayed.)
%s/ \n/^M/g
# Convert other half of entries.
:%s/^"\
# Put in curly brackets at beginning and end of dictionary
# I'm sure there's a way to do this automatically, but I just did it manually.
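As a starting point for the "simple script" mentioned above, here is a minimal Python sketch of the same recipe. It is not Plover code: the names ENTRY, convert_translation, and convert_lines are hypothetical, and it assumes each entry sits on its own line in the form {\*\cxs STENO}translation, which is the shape the substitutions above imply. Backslash and quote escaping and the surrounding curly brackets are left to json.dumps, so those steps have no counterpart here. Entries that still contain a \cx control word after conversion are set aside for manual review, mirroring the "/cx" search step. The paragraphing-command step is not reproduced.

```python
import json
import re

# Assumed entry shape: {\*\cxs STENO}translation, one entry per line.
ENTRY = re.compile(r"^\{\\\*\\cxs ([^}]+)\}(.*)$")


def convert_translation(text):
    """Apply the translation-side substitutions from the Vim recipe."""
    text = text.strip().replace("  ", " ")
    infix = re.fullmatch(r"\\cxds (.*)\\cxds", text)
    if infix:                                   # \cxds on both sides
        text = "{^" + infix.group(1) + "^}"
    elif text.startswith("\\cxds "):            # leading \cxds: suffix
        text = "{^" + text[len("\\cxds "):] + "}"
    elif text.endswith("\\cxds"):               # trailing \cxds: prefix
        text = "{" + text[:-len("\\cxds")] + "^}"
    text = text.replace("{l1}", "").replace("{l0}", "")  # "force uncap"
    text = text.replace("\\cxp", "")            # punctuation marker
    text = text.replace("\\cxfing ", "&")       # glue strokes
    text = text.replace("\\cxfc ", "-|")        # "cap next" strokes
    text = text.replace("{\\cxstit ", "{^-")    # "stitch" strokes
    return text


def convert_lines(lines):
    """Return (converted entries, lines that still need manual attention)."""
    converted, leftovers = {}, []
    for line in lines:
        line = line.rstrip("\r\n")
        match = ENTRY.match(line)
        if match is None:
            continue                            # header or formatting residue
        steno, raw = match.groups()
        translation = convert_translation(raw)
        if "\\cx" in translation:
            leftovers.append(line)              # unhandled cx control word
        else:
            converted[steno] = translation
    return converted, leftovers


sample = [
    r"{\rtf1\ansi{\*\cxrev100}\cxdict",
    r"{\*\cxs TEFT}test",
    r"{\*\cxs -G}\cxds ing",
    r"{\*\cxs PRE}pre\cxds",
    r"{\*\cxs A*}{\cxfing a}",
    r"{\*\cxs STPH}{\cxconf [{\cxc a}|{\cxc b}]}",
]
converted, leftovers = convert_lines(sample)
print(json.dumps(converted, indent=2, sort_keys=True))
```

The sample lines are made up for illustration. Real exports, especially from CAT software other than Eclipse, may need extra cases before the result is usable.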
A ~9 MB zip file containing several unconverted dictionaries in RTF format, along with a few converted dictionaries in JSON format, in both Eclipse (only necessary hyphens explicit) and DigitalCAT (all hyphens explicit) flavors of steno, is available here:
http://
The DigitalCAT dictionaries will require much more weeding, since they have extra metadata that the regular expressions in the Launchpad blueprint don't account for. This includes things like dictentrydate, which we can just cut out completely, and conflicts, which will require sacrificing the entry, since Plover doesn't support conflict differentiation (nor will it ever, if I have anything to say about it). Basically, anything starting with cx is steno-specific metadata.