python – Vectorized implementation of Maximal Marginal Relevance

I am implementing the Maximal Marginal Relevance criterion for selecting relevant & diverse items from a collection R. Essentially, the implementation directly follows this formula:


from typing import Optional

import numpy as np
import torch as T

def maximal_marginal_relevance(
    sim_di_q: T.Tensor, sim_di_dq: T.Tensor, lam: float, num_iters: Optional(int) = None
    """Computes the Maximal Marginal Relevance (Carbonell and Goldstein, 1998) given two similarity arrays.

        sim_di_q (T.Tensor): Similarity between candidates and query, shape: (N,).
        sim_di_dq (T.Tensor): Pairwise similarity between candidates, shape: (N, N).
        lam (float): Lambda parameter: 0 = maximal diversity, 1 = greedy.
        num_iters (int, optional): Number of results to return. If none, all items are used.

        T.Tensor: Ordering of the candidates with respect to the MMR scoring.
    if num_iters is None:
        num_iters = len(sim_di_q)

    if not (0 < num_iters <= len(sim_di_q)):
        raise ValueError("Number of iterations must be in (0, num_pids).")

    if not (0 <= lam <= 1):
        raise ValueError("lambda must be in (0, 1).")

    R = np.arange(len(sim_di_q))
    S = np.array((sim_di_q.argmax()))

    for _ in range(num_iters - 1):
        cur_di = R(~np.isin(R, S))  # Di in RS

        lhs = sim_di_q(cur_di)
        rhs = sim_di_dq(S(:, None), cur_di).max(axis=0).values

        idx = np.argmax(lam * lhs - (1 - lam) * rhs)

        S = np.append(S, (cur_di(idx)))

    return T.tensor(S)

My code works as expected — tested on this example, but I feel like two things in particular can be improved:

  • performance: I have some concerns regarding S = np.append(S, (cur_di(idx)))
  • code-quality: type annotations and possibility of expressing the shape of the arrays as type annotations — what’s the best practice here?