python – Parse selected records from empty-line separated file


This is my first post here, and I hope to get some recommendations for improving my code. I have a parser that processes a file with the following structure:

SE|43171|ti|1|text|Distribution of metastases...
SE|43171|ti|1|entity|C0033522
SE|43171|ti|1|relation|C0686619|COEXISTS_WITH|C0279628

SE|43171|ab|2|text|The aim of this study...
SE|43171|ab|2|entity|C2744535
SE|43171|ab|2|relation|C0686619|PROCESS_OF|C0030705

SE|43171|ab|3|text|METHODS Between April 2014...
SE|43171|ab|3|entity|C1964257
SE|43171|ab|3|entity|C0033522
SE|43171|ab|3|relation|C0085198|INFER|C0279628
SE|43171|ab|3|relation|C0279628|PROCESS_OF|C0030705

SE|43171|ab|4|text|Lymph node stations...
SE|43171|ab|4|entity|C1518053
SE|43171|ab|4|entity|C1515946

Records (i.e., blocks) are separated by an empty line. Each line in a block starts with an SE tag; the text tag always occurs in the first line of each block (in the fifth field, i.e. index 4 after splitting on |). The program extracts:

  1. All relation tags in a block, and
  2. The corresponding text (i.e., the sentence ID (sent_id) and the sentence text (sent_text)) from the first line of the block, if a relation tag is present in that block. Please note that a relation tag is not necessarily present in every block.
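
To make the block structure concrete, here is a minimal sketch of how the input could be grouped into blocks, assuming that blank lines are the only separators. The helper name `iter_blocks` is hypothetical and is not part of my parser; it only illustrates the file layout described above.

```python
from itertools import groupby


def iter_blocks(lines):
    """Yield each block as a list of stripped, non-empty lines.

    Hypothetical helper: assumes blocks are separated by blank lines only.
    """
    stripped = (line.strip() for line in lines)
    # groupby splits the stream into alternating runs of empty and
    # non-empty lines; only the non-empty runs are real blocks.
    for has_content, group in groupby(stripped, key=bool):
        if has_content:
            yield list(group)


sample = [
    "SE|43171|ti|1|text|Distribution of metastases...\n",
    "SE|43171|ti|1|entity|C0033522\n",
    "\n",
    "SE|43171|ab|4|text|Lymph node stations...\n",
    "SE|43171|ab|4|entity|C1518053\n",
]
blocks = list(iter_blocks(sample))
```

Here `blocks` holds two lists of two lines each, and the first line of every list is the one carrying the text tag.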

Below are the mapping dictionary, which maps each tag to the indices of its related fields in a line, and the main program.

# Specify mappings to parse lines from input file
mappings = {
    "id": 1,
    "text": {
        "sent_id": 3,
        "sent_text": 5,
    },
    "relation": {
        "subject": 5,
        "predicate": 6,
        "object": 7,
    },
}
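
As an illustration of how these indices are meant to be read, the sketch below applies the mapping to a single line split on `|`. The helper `pick_fields` is hypothetical (it does not appear in my parser), and the mapping is repeated only so that the snippet runs on its own.

```python
mappings = {
    "id": 1,
    "text": {"sent_id": 3, "sent_text": 5},
    "relation": {"subject": 5, "predicate": 6, "object": 7},
}


def pick_fields(elements, field_map):
    """Return {name: elements[index]} for every entry in field_map.

    Hypothetical helper, shown only to illustrate the mapping.
    """
    return {name: elements[idx] for name, idx in field_map.items()}


line = "SE|43171|ab|2|relation|C0686619|PROCESS_OF|C0030705"
elements = line.split("|")
tag = elements[4]                        # 'text', 'entity' or 'relation'
record = pick_fields(elements, mappings[tag])
pmid = elements[mappings["id"]]          # document ID from field index 1
```

For this line, `record` contains the subject, predicate and object of the relation, and `pmid` is the ID used as the key in the final result.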

Finally, the code:

def extraction(file_in):
    """This function extracts lines with 'text' and 'relation'
    tag in the fifth field (index 4)."""
    extraction = {}
    file = open(file_in, encoding='utf-8')
    bla = {'text': []}
    for line in file:
        results = {'relations': []}
        if line.startswith('SE'):
            elements = line.strip().split('|')
            pmid = elements[1]

            if elements[4] == 'text':
                tmp = {}
                for key, idx in mappings['text'].items():
                    tmp[key] = elements[idx]
                bla['text'].append(tmp)

            if elements[4] == 'relation':
                tmp = {}
                for key, ind in mappings['relation'].items():
                    tmp[key] = elements[ind]
                tmp.update(sent_id = bla['text'][0]['sent_id'])
                tmp.update(sent_text = bla['text'][0]['sent_text'])
                results['relations'].append(tmp)
                extraction[pmid] = extraction.get(pmid, []) + results['relations']
        else:
            bla = {'text': []}
    file.close()
    return extraction

The output looks like:

import json
print(json.dumps(extraction('test.txt'), indent=4))

{
    "43171": [
        {
            "subject": "C0686619",
            "predicate": "COEXISTS_WITH",
            "object": "C0279628",
            "sent_id": "1",
            "sent_text": "Distribution of lymph node metastases..."
        },
        {
            "subject": "C0686619",
            "predicate": "PROCESS_OF",
            "object": "C0030705",
            "sent_id": "2",
            "sent_text": "The aim of this study..."
        },
        {
            "subject": "C0085198",
            "predicate": "INFER",
            "object": "C0279628",
            "sent_id": "3",
            "sent_text": "METHODS Between April 2014..."
        },
        {
            "subject": "C0279628",
            "predicate": "PROCESS_OF",
            "object": "C0030705",
            "sent_id": "3",
            "sent_text": "METHODS Between April 2014..."
        }
    ]
}

Thanks for any recommendations.