I’m trying to understand how to use pydantic for data parsing & validation.
The idea I had in mind was the following:
- I have a list of DOI names for which I need to collect metadata
- I could use pydantic to test whether each DOI name is valid
- if it’s valid, I proceed with making a request to the scite API
- if not, the code raises an exception instead of calling the API
A very basic idea, as you can see.
The code:
scite_utils.py
from typing import Union

import requests
from ratelimit import limits, sleep_and_retry
from pydantic import BaseModel, ValidationError, constr


class DOI(BaseModel):
    # pydantic v1 syntax; v2 renamed the `regex` argument to `pattern`
    name: constr(regex=r"^10\.\d{4,9}/[-._;()/:a-zA-Z0-9]+$")


@sleep_and_retry
@limits(calls=40, period=60)  # API rate limit: 40 calls per 60 seconds
def remote_call(endpoint: str):
    r = requests.get(f"https://api.scite.ai/{endpoint}")
    r.encoding = "UTF-8"
    return r.json()


def get_paper(doi: Union[str, DOI]):
    # Accept either a raw string or an already validated DOI
    if not isinstance(doi, DOI):
        try:
            doi = DOI(name=doi)
        except ValidationError:
            return None  # invalid DOI name: skip the API call
    return remote_call(endpoint=f"papers/{doi.name}")
test.py
import pandas as pd
from typing import List
from scite_utils import get_paper

# a sample of 5, but realistically thousands?
DOIs = [
    "10.12724/ajss.47.0",
    "10.31124/advance.12630065.v1",
    "obv_error",
    "10.1122ajss.123.9",
    "10.1080/09662830500528294",
]


def create_table(names: List[str]):
    # get_paper returns None for invalid DOI names, so those are filtered out
    data = (get_paper(name) for name in names)
    return pd.DataFrame(
        publication
        for publication in data
        if publication is not None
    )


if __name__ == "__main__":
    result = create_table(DOIs)
    print(result)
Questions:
- Is this a valid use case for pydantic?
- Should I validate the DOI name inside the function that makes the request, or is there a better place for it, like a separate function?
- Doesn’t it look awkward in general? I feel like I’m missing the point, because I could’ve written a custom function with re to validate the DOI name.
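For comparison, the custom function I have in mind would be roughly the following. It is only a sketch: the names DOI_PATTERN and is_valid_doi are made up for illustration, and the pattern is the one I intend for the DOI model (it is my own approximation, not an official DOI grammar).

```python
import re

# Hypothetical helper, not part of pydantic or the scite API.
# Pattern: "10.", a 4-9 digit registrant code, "/", then a non-empty suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/[-._;()/:a-zA-Z0-9]+$")


def is_valid_doi(name: str) -> bool:
    """Return True if `name` looks like a DOI name according to DOI_PATTERN."""
    return DOI_PATTERN.match(name) is not None


if __name__ == "__main__":
    sample = ["10.12724/ajss.47.0", "obv_error", "10.1122ajss.123.9"]
    # Keep only the names that would be worth sending to the API
    print([name for name in sample if is_valid_doi(name)])
```

With this version, the filtering would happen before get_paper is ever called, so the request function would not need to know anything about validation.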