The Merriam-Webster Dictionary API at some point discontinued XML in favor of returning JSON exclusively. This had the unfortunate effect of making it much more complicated to find a definition listed by sense number (e.g. definition 1 a (2)), because they’re trying to force a semi-arbitrary document-based model into nested JSON.
After reading through the documentation I was able to figure out the object nesting hierarchy they return but never directly explain. The relevant nestings, I believe, go as follows.
(Arrows show what an object can be nested in, i.e. dt is in a sense or sdsense, which can in turn be inside another sense. Sorry for the unclear names, that’s just what the API calls everything.)
dt ==> sense ==> sseq ==> def
dt ==> sdsense ==> sense
sense ==> bs ==> sseq
sense ==> pseq ==> sseq
bs ==> pseq
sseq ==> vd ==> def
Most fields can be either arbitrarily long arrays like [['sense', {...}]] or objects like {'sense': [{...}]} depending on context, which is a pain, especially because sometimes they’re nested and sometimes they’re flat. Objects can also contain any combination of nested objects repeated any number of times. For example, an sseq can contain a pseq which contains a bs, then a bs of its own, another pseq, and then a few sense objects. Not every one is that complex, but there are no restrictions stopping them.
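To make the two shapes concrete, here is a minimal sketch of a helper that accepts either form and hands back a (tag, payload) pair. The name unwrap and the sample values are made up for illustration; they are not part of the API or of the parser further down.

```python
def unwrap(node):
    """Return (tag, payload) for either the array form or the object form.

    Array form:  ['sense', {...}]
    Object form: {'sense': {...}}
    Anything else yields (None, node).
    """
    if isinstance(node, list) and len(node) == 2 and isinstance(node[0], str):
        return node[0], node[1]
    if isinstance(node, dict) and len(node) == 1:
        tag, payload = next(iter(node.items()))
        return tag, payload
    return None, node


# Hypothetical, heavily trimmed payloads (not real API output)
as_array = ['sense', {'sn': '1 a', 'dt': [['text', 'first meaning']]}]
as_object = {'sense': {'sn': 'b', 'dt': [['text', 'second meaning']]}}

for node in (as_array, as_object):
    tag, payload = unwrap(node)
    print(tag, payload['sn'])
```

Normalizing the shape once at the edge like this is one way to keep the rest of a parser from having to care which form it was handed.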
To make matters more complicated, the sense number (a property of either sense or sdsense) can be explicit, like 1 a (2), or implicit, like a or (3), which means it depends on a previous sense or sdsense that could be nested in a completely different object. Sense numbers can also change between implicit and explicit as you go up and down the hierarchy.
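As a rough illustration of what resolving an implicit sense number involves, the sketch below expands each sense number against the full path of the sense before it. It assumes sense numbers are whitespace-separated parts that are either digits, letters, or parenthesized numbers; resolve_sn and level_of are hypothetical names, not something the API or the parser below provides.

```python
def level_of(part):
    """0 for a number ('1'), 1 for a letter ('a'), 2 for a parenthesized number ('(2)')."""
    if part.isdigit():
        return 0
    if part.startswith('('):
        return 2
    return 1


def resolve_sn(prev_path, sn):
    """Expand a possibly implicit sense number into a full path.

    prev_path is the full path of the previous sense, e.g. ('1', 'a', '(1)').
    Every level above the first part of sn is borrowed from prev_path.
    """
    parts = sn.split()
    if not parts:
        return tuple(prev_path)
    top = level_of(parts[0])
    borrowed = tuple(p for p in prev_path if level_of(p) < top)
    return borrowed + tuple(parts)


path = ()
for sn in ['1 a (1)', '(2)', 'b', '2']:
    path = resolve_sn(path, sn)
    print(repr(sn), '->', path)
```

Here (2) picks up 1 a from the sense before it, b keeps only the 1, and 2 starts a fresh path.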
And lastly, that vd field means that instead of def being an array of sseq, it is of the form def: [{'vd': "(verb type)", "sseq": [...]}, {'vd': "(verb type)", "sseq": [...]}], and that sense numbers can suddenly reference the verb type as well, either explicitly or implicitly, even though the verb type is never actually nested inside anything relevant.
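One way to picture the verb divider problem is to regroup the def array so that each sseq sits under a short verb key. This is only a sketch under the assumption that the vd label contains the word "transitive" or "intransitive"; the function names and the placeholder sseq contents are invented for the example.

```python
def verb_key(vd_label):
    """Map a verb divider label to a short key: 'i', 't', or '' when unknown."""
    label = vd_label.lower()
    # Check 'intransitive' first because it contains the substring 'transitive'
    if 'intransitive' in label:
        return 'i'
    if 'transitive' in label:
        return 't'
    return ''


def group_by_verb_type(def_blocks):
    """Collect the sseq of every block in a def array under its verb key."""
    grouped = {}
    for block in def_blocks:
        key = verb_key(block.get('vd', ''))
        grouped.setdefault(key, []).extend(block.get('sseq', []))
    return grouped


# Hypothetical, trimmed def array (the sseq contents are placeholders)
sample_def = [
    {'vd': 'transitive verb', 'sseq': ['t-senses']},
    {'vd': 'intransitive verb', 'sseq': ['i-senses']},
]
print(group_by_verb_type(sample_def))
```

With the blocks keyed this way, a sense number that names a verb type has something concrete to index into.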
Check the official API documentation on the Merriam-Webster website for all the quirks and exceptions.
TL;DR: the JSON the API returns is a hot mess and there’s no easy way to reference definition 1 a (2) of a dictionary entry. Since a project I’m working on depends on referencing definitions directly by that number, I wrote the Python parser below to reform the JSON into a well-nested object that can be accessed like parse_resp('word')['1']['a']['(2)']['def'].
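For a sense of what that access pattern buys you, here is a small sketch of a lookup over an already well-nested dict. The well_nested value is a hand-written placeholder in the target shape rather than real parser output, and lookup is a hypothetical helper.

```python
def lookup(nested, sense_number):
    """Walk a nested dict along a sense number such as '1 a (2)'.

    Returns the definition text, or None when any step of the path is missing.
    """
    node = nested
    for part in sense_number.split():
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node.get('def') if isinstance(node, dict) else None


# Hand-written placeholder in the target shape
well_nested = {
    '1': {
        'a': {
            '(1)': {'def': 'placeholder for 1 a (1)'},
            '(2)': {'def': 'placeholder for 1 a (2)'},
        },
        'b': {'def': 'placeholder for 1 b'},
    },
}
print(lookup(well_nested, '1 a (2)'))
print(lookup(well_nested, '3'))
```

Returning None for a missing path, instead of raising, keeps the caller from wrapping every lookup in a try block.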
My approach to writing the code was to start with parsing the dt object and then work my way back through each layer, creating a better-structured object as I go. Lists and dicts are still mixed together inside the object because the relationship between implicit sense numbers needs to be preserved.
Finally, once that object is completely built, there’s an admittedly gross function, unpack_defs, that uses a couple of global variables to track the last well-formed sense number while it recursively pulls the entire object inside-out into one well-nested dict.
I’m sure I could rewrite at least that piece better, but this was supposed to be a simple API request in a larger project, and the lack of XML support greatly complicated it. Once it was working for all the exceptions of the API, I called it a day and moved on.
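For what it’s worth, one possible shape for that rewrite is to flatten the mixed lists and dicts with a generator first and keep the "last seen" number and letter in local variables. This is a sketch, not a drop-in replacement: iter_senses and unpack are hypothetical names, and it assumes every leaf is a dict mapping a sense number to its definition text.

```python
def iter_senses(node):
    """Yield (sn, text) pairs in document order from nested lists of {sn: text} dicts."""
    if isinstance(node, dict):
        yield from node.items()
    elif isinstance(node, list):
        for child in node:
            yield from iter_senses(child)


def unpack(parsed):
    """Build one well-nested dict, tracking the last number and letter locally."""
    tree = {}
    num, let = '', ''  # local state, so nothing leaks between calls
    for sn, text in iter_senses(parsed):
        paren = ''
        for part in sn.split():
            if part.isdigit():
                num, let = part, ''  # a new number resets the letter
            elif part.startswith('('):
                paren = part
            else:
                let = part
        node = tree
        for key in (p for p in (num, let, paren) if p):
            node = node.setdefault(key, {})
        node['def'] = text
    return tree


# Hypothetical intermediate object with lists and dicts mixed together
parsed = [{'1 a (1)': 'x'}, [{'(2)': 'y'}, {'b': 'z'}], {'2': 'w'}]
print(unpack(parsed))
```

Because the state lives inside unpack, two entries parsed back to back cannot affect each other.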
Code:
import urllib.request
import json
import os
import re
# TO-DO: GET RID OF GLOBAL STATE
curr_num = ''
curr_let = ''

def reset():
    global curr_let
    global curr_num
    curr_num = ''
    curr_let = ''
# parse_resp is the main function called from other packages and accepts a Merriam-Webster API response
# parse_resp defaults to returning only the definitions referenced in the first-use 'date' field because that's what the overall project is interested in
def parse_resp(resp, all_defs=False):
    reset()
    # parse_entry returns None for entries without definitions, so drop those first
    entries = [entry for entry in map(parse_entry, resp) if entry]
    if all_defs:
        return entries
    for entry in entries:
        entry['def'] = get_sense_by_sn(entry['sn'], entry['def'])
    return [e for e in entries if e['date'] and e['def']]
def get_sense_by_sn(sense_number, entry):
    # follow the sense number parts down the nested dict
    def loop_entry(entry):
        x = entry
        for sn in sense_number:
            x = x[sn]
        return x['def']

    lookup_errors = (KeyError, IndexError, TypeError)
    try:
        return loop_entry(entry)
    except lookup_errors:
        pass
    # check for implicit references to transitive / intransitive verb if explicit path fails
    try:
        return loop_entry(entry['t'])
    except lookup_errors:
        pass
    try:
        return loop_entry(entry['i'])
    except lookup_errors:
        pass
    return ''
def parse_entry(defs):
    # start every entry with clean sense-number state so nothing leaks from the previous one
    reset()
    (date, sn) = parse_date(defs)
    entry = unpack_defs(parse_defs(defs))
    if entry is None:
        return
    return {
        'date': date,
        'sn': sn,
        'def': entry
    }
def unpack_defs(defs):
    global curr_num
    global curr_let
    # a single dict is either a verb-type wrapper or one lone sense
    if isinstance(defs, dict):
        (key, value) = next(iter(defs.items()))
        key = str(key)
        if is_verb(key):
            return {key: unpack_defs(value)}
        else:
            return fmt_def(key, value)
    if isinstance(defs, list):
        def_dict = dict()
        for d in defs:
            if isinstance(d, list):
                def_dict.update(unpack_defs(d))
            if isinstance(d, dict):
                (key, value) = next(iter(d.items()))
                if is_number(key):
                    # bare number such as '2'
                    curr_num = key
                    def_dict.update(fmt_def(key, value))
                elif is_paren(key):
                    # bare '(2)' hangs off the last number and letter seen
                    def_dict[curr_num][curr_let].update(fmt_def(key, value))
                else:
                    keys = key.split()
                    # turn '1 a (1)' into {'1': {'a': {'(1)': {'def': value}}}}
                    def nest_keys(keys):
                        global curr_num
                        global curr_let
                        key = keys[0]
                        if len(keys) > 1:
                            if is_number(key):
                                curr_num = key
                            if is_letter(key):
                                curr_let = key
                            return {key: nest_keys(keys[1:])}
                        else:
                            return fmt_def(key, value)
                    if not curr_num or (is_number(keys[0]) and curr_num != keys[0]):
                        def_dict.update(nest_keys(keys))
                    elif curr_num:
                        def_dict[curr_num].update(fmt_def(key, value))
        return def_dict
def fmt_def(key, value):
    key = str(key)
    return {key: {'def': value}}

def is_number(string):
    try:
        int(string)
        return True
    except (TypeError, ValueError):
        return False

def is_paren(string):
    return string.startswith('(')

def is_letter(string):
    return string.isalpha()

def is_verb(string):
    return string in ('t', 'i')
# First Known Use: date
# Hierarchical Context
# Top-level member of dictionary entry
# Data Model
# "date" : string
def parse_date(entry):
    if isinstance(entry, dict):
        if 'date' in entry:
            return clean_date(entry['date'])
    return ('', '')

def clean_date(date_string):
    match = re.search(r'[0-9]+', date_string)
    if match is None:
        return ('', '')
    date = int(match.group())
    # sense number parts are the fields that follow a '|' in the date string
    sn = re.findall(r'(?<=\|)[\w()]*', date_string) or ['1']
    sn = [non_empty for non_empty in sn if non_empty]
    # quick and dirty conversion of centuries
    if date < 100:
        date = (date-1) * 100
    return (date, sn)
# Definition: def
# Hierarchical Context
# Occurs as top-level member of dictionary entry and in dros.
# Data Model
# array of one or more objects
def parse_defs(defs):
    if isinstance(defs, dict):
        # entries without a 'def' member yield nothing instead of raising
        defs = defs.get('def', [])
    # only the first vd or sseq block is parsed
    for entry in defs:
        if is_vd(entry):
            return parse_vd(entry)
        elif is_sseq(entry):
            return parse_sseq(entry)
# Verb Divider: vd
# Hierarchical Context
# Occurs in def, preceding an sls (optional) and sseq (required)
# Data Model
# "vd" : string
def is_vd(vd):
    if isinstance(vd, dict):
        return 'vd' in vd
    return False

# Verbs can be transitive or intransitive and are referenced as 't' or 'i' in date
def parse_vd(vd):
    verb_type = 't'
    if 'intransitive' in vd['vd']:
        verb_type = 'i'
    return {verb_type: parse_sseq(vd['sseq'])}
#
# Shared logic for sseq and pseq
#
def parse_array(array):
    sense_list = []
    for sense in array:
        if is_sense(sense):
            sense_list.append(parse_sense(sense))
        elif is_bs(sense):
            sense_list.append(parse_bs(sense))
        elif is_pseq(sense):
            sense_list.append(parse_pseq(sense))
        elif isinstance(sense, list):
            sense_list.append(parse_array(sense))
    # collapse single-element lists so lone senses are not wrapped
    if len(sense_list) == 1:
        return sense_list[0]
    return sense_list
# Sense Sequence: sseq
# Hierarchical Context:
# Occurs in def
# Data Model:
# "sseq" : array
def is_sseq(sseq):
    if isinstance(sseq, dict):
        return 'sseq' in sseq
    return False

def parse_sseq(sseq):
    if isinstance(sseq, dict):
        sseq = sseq['sseq']
    sense_list = []
    for array in sseq:
        sense_list.append(parse_array(array))
    if len(sense_list) == 1:
        return sense_list[0]
    return sense_list
# Parenthesised Sense Sequence: pseq
# Hierarchical Context:
# Occurs as an element in an sseq array.
# Data Model:
# array consisting of one or more sense elements and an optional bs element.
def is_pseq(pseq):
    if isinstance(pseq, list):
        return bool(pseq) and pseq[0] == 'pseq'
    return False

def parse_pseq(pseq):
    sense_list = []
    # skip the leading 'pseq' tag
    for sense in pseq[1:]:
        sense_list.append(parse_array(sense))
    return sense_list
# Binding Substitution: bs
# Hierarchical Context
# Occurs as an element in an sseq or pseq array, where it is followed by one or more sense elements.
# Data Model:
# array of the form ["bs", {sense}]
def is_bs(bs):
    if isinstance(bs, list):
        return bool(bs) and bs[0] == 'bs'
    return False

def parse_bs(bs):
    return parse_sense(bs[1])
# Sense: sense
# Hierarchical Context
# Occurs as an element in an sseq array.
# Data Model:
# object or array consisting of one dt (required) and zero or more et, ins, lbs, prs, sdsense, sgram, sls, sn, or vrs
def is_sense(sense):
    if isinstance(sense, dict):
        return 'sense' in sense
    elif isinstance(sense, list):
        return bool(sense) and sense[0] == 'sense'
    return False

def parse_sense(sense):
    if isinstance(sense, dict):
        sense = sense['sense']
    elif isinstance(sense, list):
        sense = sense[1]
    else:
        return
    # senses without an explicit sn default to '1'
    sn = str(sense.get('sn', 1))
    # parse_dt returns None when there is no text element
    dt = parse_dt(sense['dt']) or ''
    if 'sdsense' in sense:
        dt += parse_sdsense(sense['sdsense']) or ''
    return {sn: dt}
# Divided Sense: sdsense
# Hierarchical Context
# Occurs within a sense, where it is always preceded by dt.
# Data Model:
# "sdsense" : object with the following members:
# "sd" : string sense divider (required)
# et, ins, lbs, prs, sgram, sls, vrs (optional)
# dt definition text (required)
def parse_sdsense(sdsense):
    return parse_dt(sdsense['dt'])
# Defining Text: dt
# Hierarchical Context
# Occurs as an element in an sseq array.
# Data Model:
# "dt" : array consisting of one or more elements:
# ["text", string] where string contains the definition content (required)
# optional bnw, ca, ri, snote, uns, or vis elements
def parse_dt(dt):
    for elem in dt:
        if elem[0] == 'text':
            ### Implement text cleaning
            # strips simple formatting tokens such as {bc}; returns None if no text element exists
            return re.sub(r'\{\w*\}', '', elem[1]).strip()
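The listing above imports urllib.request, json, and os but never shows the request itself. As a minimal sketch of that missing step, assuming the v3 Collegiate JSON endpoint and an API key stored in an environment variable (MW_API_KEY is a made-up name, as are build_url and fetch), something like the following would produce the list that parse_resp expects:

```python
import json
import os
import urllib.parse
import urllib.request

# Assumed endpoint pattern for the v3 Collegiate reference; adjust for other references
BASE_URL = 'https://www.dictionaryapi.com/api/v3/references/collegiate/json/'


def build_url(word, api_key):
    """Build the request URL for one headword, escaping spaces and punctuation."""
    query = urllib.parse.urlencode({'key': api_key})
    return BASE_URL + urllib.parse.quote(word) + '?' + query


def fetch(word, api_key=None):
    """Request one word and decode the JSON body into Python lists and dicts."""
    # MW_API_KEY is a hypothetical environment variable name
    api_key = api_key or os.environ['MW_API_KEY']
    with urllib.request.urlopen(build_url(word, api_key)) as response:
        return json.loads(response.read().decode('utf-8'))


print(build_url('hot dog', 'your-api-key'))
```

With a valid key, parse_resp(fetch('word')) would then tie the request to the parser. When the API does not recognize a word it may return a list of suggestion strings instead of entry objects, so the result is worth checking before relying on it.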