Here at Fin, as with any large group, our agent team is a heterogeneous mix of people with different types of skills. As we scale our operations team, we want to ensure that incoming requests are always being routed to the agents most capable of completing them quickly and effectively. What follows is a walkthrough of how we capitalize on agents’ differing skill sets to ensure that agents are always working on the types of tasks that they are fastest at.

## Routing Work Based on Expected Time-to-Completion

After an agent completes a task, if there is still work sitting in our queue waiting to be picked up (which there inevitably is), then our router has to decide what piece of work to feed that agent next. The router takes into account a number of different variables when making this decision, such as when the request was first made and whether there are any externally imposed deadlines on the work, e.g. needing to buy a plane ticket before the flight sells out.

All of these other, more pressing considerations being equal, we would then like to preferentially route tasks to agents who we think will be able to complete that work in the shortest amount of time. For each type of task, we have data on how long it took a given agent to complete that type of task in the recent past, and we would like to use this information to determine whether that agent will be significantly faster (or slower) than their peers at completing that type of task in the future. If we can be reasonably confident that an agent will be faster (slower) at completing a certain type of work than their peers, then we should (shouldn’t) route that piece of work to them.

The agent in Figure 1 is significantly faster than the rest of the team at Calendar & Scheduling tasks, but performs at roughly average speed on Booking & Reservation tasks:

Figure 1. Amount of time it takes a particular agent to complete two different task types compared to the population.

## Statistical Hypothesis Testing

The question that we’re asking here, namely, which agents differ significantly from the population in terms of how long it takes them to complete a given task type, is highly amenable to traditional hypothesis testing. The hypotheses that we are trying to decide between are:

- $H_0$: A piece of work from category C completed by agent A is not more likely to be completed faster (or slower) than if that same piece of work were completed by another randomly-selected agent from the population
- $H_1$: A piece of work from category C completed by agent A is more likely to be completed faster (or slower) than if that same piece of work were completed by another randomly-selected agent from the population

For a specific agent A and category of work C, we can answer this question using the Wilcoxon rank-sum test (also known as the Mann-Whitney U test), which SciPy provides as `mannwhitneyu`:

    from scipy.stats import mannwhitneyu

    statistic, pvalue = mannwhitneyu(agent_durations, population_durations, use_continuity=True, alternative='two-sided')

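To make the inputs concrete, here is a minimal sketch that runs the same call on synthetic data. The log-normal durations, the sample sizes, and the seed are all made up for illustration and are not Fin's data; only the `mannwhitneyu` call itself mirrors the snippet above.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(seed=0)

# Made-up working times in minutes (assumption: log-normal, since
# task durations tend to be right-skewed).
population_durations = rng.lognormal(mean=3.0, sigma=0.5, size=2000)
agent_durations = rng.lognormal(mean=2.6, sigma=0.5, size=80)  # a faster agent

# Two-sided test: is this agent either faster or slower than the population?
statistic, pvalue = mannwhitneyu(
    agent_durations,
    population_durations,
    use_continuity=True,
    alternative="two-sided",
)

agent_median = np.median(agent_durations)
population_median = np.median(population_durations)
print(f"agent median: {agent_median:.1f} min, "
      f"population median: {population_median:.1f} min")
print(f"U = {statistic:.0f}, p-value = {pvalue:.2g}")
```

Because the test is rank-based, it makes no normality assumption about the durations, which is convenient for skewed working times like these.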

## Controlling for Multiple Hypotheses

If we simply apply the above test to every agent/category combination, and deem each test significant if its p-value is below the predefined Type 1 error rate cutoff, we will be dramatically inflating our true Type 1 error rate by virtue of having tested hundreds of different hypotheses. The webcomic xkcd illustrates this problem very nicely in the comic below.

Figure 2. xkcd, warning the public about the dangers of multiple hypothesis testing since 2011
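To get a feel for how quickly the error rate inflates, the sketch below computes the chance of at least one false positive across many tests. It assumes the tests are independent and that every null hypothesis is true, and the per-test cutoff of 0.05 is an illustrative value rather than one taken from our analysis.

```python
def prob_at_least_one_false_positive(num_tests, alpha):
    """P(at least one of num_tests independent true-null tests has p < alpha)."""
    return 1.0 - (1.0 - alpha) ** num_tests


ALPHA = 0.05  # illustrative per-test Type 1 error cutoff (assumption)

for num_tests in (1, 20, 100, 500):
    fwer = prob_at_least_one_false_positive(num_tests, ALPHA)
    print(f"{num_tests:>4} tests -> P(at least one false positive) = {fwer:.3f}")
```

With a few hundred tests, a false positive somewhere is all but guaranteed, which is why some form of correction is needed.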

Broadly speaking, there are two approaches to correcting for multiple hypothesis tests:

- Control the **Family-wise Error Rate (FWER)**: Limit the probability that any of our tests conclude that there is a significant difference when none exists
- Control the **False Discovery Rate (FDR)**: Limit the expected proportion of the significant differences we report that are false positives

In the above xkcd comic, the scientists should have controlled the FWER, as the cost of falsely alarming the public about a nonexistent health hazard is very high. However, in our case, the cost of a false positive is much lower; it just results in us routing work suboptimally.

For our purposes it is sufficient to control the FDR such that, in expectation, at most 20% of the null hypotheses we reject are false positives. This can be accomplished by using the Benjamini-Hochberg (BH) procedure, which works as follows:

- For each of the $m$ hypothesis tests performed, order the resulting p-values from least to greatest as $p_1, p_2, \ldots, p_m$
- For a given false-positive cutoff $\alpha$ (= 0.20 in our case) and a given ordered p-value index $i$, check whether $p_i \le \alpha \cdot i / m$
- Find the largest $i$ such that this inequality holds, and reject all null hypotheses corresponding to the p-values with indices up to and including $i$

This test also has a very nice geometric interpretation: Plot each of the p-values as a point with coordinates $(i, p_i)$, and plot the cutoff as a line through the origin with slope $\alpha / m$. Then find the rightmost point that does not lie above the line, and reject every hypothesis at or to the left of it, including any whose points sit above the line.
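The procedure described above can be sketched in a few lines of plain Python. This is an illustrative implementation rather than the code we run in production; the function name `benjamini_hochberg` and the example p-values are made up, and the comparison uses ≤ as in the standard statement of the procedure.

```python
def benjamini_hochberg(p_values, alpha=0.20):
    """Return a list of booleans, True where the null hypothesis is rejected."""
    m = len(p_values)
    # Positions of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda idx: p_values[idx])

    # Find the largest 1-based rank i such that p_i <= alpha * i / m.
    largest_i = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            largest_i = rank

    # Reject every hypothesis ranked at or below largest_i, even if its own
    # p-value sits above the cutoff line.
    reject = [False] * m
    for idx in order[:largest_i]:
        reject[idx] = True
    return reject


example_p_values = [0.10, 0.001, 0.95, 0.09, 0.90]
rejected = benjamini_hochberg(example_p_values, alpha=0.20)
print(rejected)  # [True, True, False, True, False]
```

Note that 0.09 is rejected even though it exceeds its own cutoff of 0.20 · 2 / 5 = 0.08, because the third-ranked p-value (0.10) falls under its cutoff of 0.12. This is the "largest i" rule at work.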

Figure 3. (left) Rejected 50% of Null Hypotheses without performing multiple test correction, (right) Rejected 34% of Null Hypotheses after performing the Benjamini-Hochberg procedure

This test can be called in Python using:

    from statsmodels.stats.multitest import multipletests

    reject, pvals_corrected, alphacSidak, alphacBonf = multipletests(p_values, alpha=0.2, method='fdr_bh')


## Large Effect Size Requirement

One problem with focusing only on p-values is that, if your data set is large enough, even a minute difference between the distributions under consideration will be enough to reject the null hypothesis. One way to guard against this problem is to impose a further requirement that the effect size, i.e. the magnitude of the difference between the two distributions, be sufficiently large. There are many different ways to quantify effect size, but one simple and easily interpretable option is to measure the difference between the medians of the two distributions. Specifically, we require that the agent’s median working time and the population-wide median working time differ by at least 20% for us to care about the difference. This joint p-value + effect size requirement can be visualized using a so-called “volcano plot”.

Figure 4. Volcano plot; each point is a single agent/category combination, with red points indicating agents whose working speed on a given category of work differs significantly from the rest of the agent population
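As a rough sketch of the joint requirement, the snippet below combines a Benjamini-Hochberg decision with the median-difference check. It assumes the 20% is measured relative to the population median, which is one possible reading of the rule; the helper names and the example durations are hypothetical.

```python
from statistics import median

MIN_RELATIVE_DIFFERENCE = 0.20  # the 20% median-difference requirement


def relative_median_difference(agent_durations, population_durations):
    """Signed median difference as a fraction of the population median.

    Negative values mean the agent is faster than the population.
    """
    population_median = median(population_durations)
    return (median(agent_durations) - population_median) / population_median


def differs_meaningfully(bh_rejected, agent_durations, population_durations):
    """True only if the test was significant AND the effect is large enough."""
    effect = relative_median_difference(agent_durations, population_durations)
    return bh_rejected and abs(effect) >= MIN_RELATIVE_DIFFERENCE


population = [15, 20, 25, 30, 18]  # made-up durations, median 20
fast_agent = [10, 12, 14]          # median 12, i.e. 40% faster
average_agent = [18, 19, 21]       # median 19, i.e. 5% faster

print(differs_meaningfully(True, fast_agent, population))
print(differs_meaningfully(True, average_agent, population))
```

In a volcano plot, these two conditions correspond to the vertical (significance) and horizontal (effect size) thresholds, and only points clearing both are highlighted.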

## Routing Results

Simulating the behavior of this new speed-based router across all work received, we find that preferentially routing work to the agents who complete it most quickly (and away from agents who complete it most slowly) decreases the population-wide median task completion time by 10%. Much of this gain occurs in the left shoulder of the distribution, reflecting the fact that more tasks are now being completed “abnormally” quickly.

Figure 5. Population-wide per-task working time before and after implementing preferential routing

If you thought this analysis was cool, and are excited about digging into our operational data yourself, apply to be a data scientist at Fin! We’re hiring!

— Jon Simon