| Title: | Comparing Automated Subject Indexing Methods in R |
|---|---|
| Description: | Perform evaluation of automatic subject indexing methods. The main focus of the package is to enable efficient computation of set retrieval and ranked retrieval metrics across multiple dimensions of a dataset, e.g. document strata or subsets of the label set. The package also provides the possibility of computing bootstrap confidence intervals for all major metrics, with seamless integration of parallel computation and propensity scored variants of standard metrics. |
| Authors: | Maximilian Kähler [aut, cre] (ORCID: <https://orcid.org/0000-0003-4695-0565>), Markus Schumacher [aut], Deutsche Nationalbibliothek [cph] |
| Maintainer: | Maximilian Kähler <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.3.3 |
| Built: | 2026-05-25 07:46:56 UTC |
| Source: | https://github.com/deutsche-nationalbibliothek/casimir |
Helper function for filtering predictions with a score at or above a given threshold and, if a limit is set, a rank at or below that limit.
apply_threshold(threshold, limit = NA_real_, base_compare)
threshold |
A numeric threshold between 0 and 1. |
limit |
An integer cutoff >= 1 for rank-based thresholding. Requires a
column |
base_compare |
A data.frame as created by |
A data.frame with observations that satisfy (score >=
threshold AND (if applicable) rank <= limit) OR gold ==
TRUE. A new logical column suggested indicates TRUE if score
>= threshold AND (if applicable) rank <= limit, and FALSE for
false negative observations (that may have no score, a score below the
threshold or rank above the limit).
library(casimir)

gold <- tibble::tribble(
  ~doc_id, ~label_id,
  "A", "a",
  "A", "b",
  "A", "c",
  "B", "a",
  "B", "d",
  "C", "a",
  "C", "b",
  "C", "d",
  "C", "f"
)

pred <- tibble::tribble(
  ~doc_id, ~label_id, ~score,
  "A", "a", 0.9,
  "A", "d", 0.7,
  "A", "f", 0.3,
  "A", "c", 0.1,
  "B", "a", 0.8,
  "B", "e", 0.6,
  "B", "d", 0.1,
  "C", "f", 0.1,
  "C", "c", 0.2,
  "C", "e", 0.2
)

# create_comparison() expects the predictions first, then the gold standard
base_compare <- create_comparison(pred, gold)

res_0 <- apply_threshold(
  threshold = 0.3,
  base_compare = base_compare
)
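The filter rule described above can be sketched in base R. This is only an illustration under stated assumptions: `filter_sketch()` and the toy table `cmp` are hypothetical and not part of the package, and the real function may handle missing scores or ranks differently.

```r
# Hypothetical comparison table for one document (not package output).
cmp <- data.frame(
  doc_id   = c("A", "A", "A", "A"),
  label_id = c("a", "d", "c", "b"),
  score    = c(0.9, 0.7, 0.1, NA),
  rank     = c(1, 2, 3, NA),
  gold     = c(TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Illustrative stand-in for the rule:
# keep rows with (score >= threshold AND rank <= limit) OR gold == TRUE.
filter_sketch <- function(cmp, threshold, limit = NA_real_) {
  passes <- !is.na(cmp$score) & cmp$score >= threshold
  if (!is.na(limit)) {
    passes <- passes & !is.na(cmp$rank) & cmp$rank <= limit
  }
  cmp$suggested <- passes
  # Gold labels that fail the cutoff stay in the table as false negatives.
  cmp[passes | cmp$gold, , drop = FALSE]
}

res <- filter_sketch(cmp, threshold = 0.3, limit = 1)
```

Label "d" is dropped because it is neither above the cutoff nor in the gold standard, while "c" and "b" remain with `suggested == FALSE`.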
A wrapper for use within the bootstrap computation of PR AUC. It covers the repeated application of the following steps:
join with resampled doc_ids
summarise_intermediate_results
postprocessing of curve data
auc computation
boot_worker_fn(
  sampled_id_list,
  intermed_res,
  propensity_scored,
  replace_zero_division_with
)
sampled_id_list |
A list of all doc_ids of the examples drawn in each bootstrap iteration. |
intermed_res |
Intermediate results as produced by
|
propensity_scored |
Logical, whether to use propensity scores as weights. |
replace_zero_division_with |
In macro averaged results (doc-avg, subj-avg), it may occur that some
instances have no predictions or no gold standard. In these cases,
calculating precision and recall may lead to division by zero. By default,
CASIMiR removes these missing values from macro averages, leading to a
smaller support (count of instances that were averaged). Other
implementations of macro averaged precision and recall default to 0 in these
cases. This option controls the replacement value. Set any value between 0
and 1. (Defaults to |
A data.frame with a column "pr_auc" and optional
grouping_vars.
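The resampling idea behind the worker can be sketched as follows. This is a simplified base R illustration, not the package code: it resamples document ids with replacement, joins them to hypothetical per-document results, and averages a single metric instead of rebuilding a full curve. The percentile interval is an assumption, since the text does not state which interval type is used.

```r
set.seed(1)

# Hypothetical per-document intermediate results.
per_doc <- data.frame(
  doc_id = c("A", "B", "C"),
  prec   = c(1 / 3, 0.5, 1),
  stringsAsFactors = FALSE
)

n_bt <- 200

# One vector of resampled doc_ids per bootstrap iteration.
sampled_id_list <- replicate(
  n_bt,
  sample(per_doc$doc_id, replace = TRUE),
  simplify = FALSE
)

# Join each resample to the per-document results and aggregate.
boot_means <- vapply(sampled_id_list, function(ids) {
  resampled <- merge(data.frame(doc_id = ids), per_doc, by = "doc_id")
  mean(resampled$prec)
}, numeric(1))

# Assumption: a simple 95% percentile interval.
ci <- quantile(boot_means, c(0.025, 0.975))
```

Because the join keeps duplicated ids, a document drawn twice contributes twice to the average of that iteration.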
Internal helper function designed to ensure that id columns are not passed as
factor variables. Factor variables in id columns may cause undesired
behaviour with the drop_empty_group argument.
check_id_vars(df)
df |
An input data.frame. |
The input data.frame df with the id columns being no
longer factor variables.
Check an arbitrary column in a data.frame for factor type and coerce to character.
check_id_vars_col(df, col)
df |
An input data.frame. |
col |
The name of the column to check. |
The input data.frame df with the specified column being no
longer a factor variable.
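A minimal sketch of this kind of check in base R is shown below. The helper name `coerce_factor_col()` is hypothetical and only illustrates the idea of turning a factor id column into a character column.

```r
df <- data.frame(
  doc_id = factor(c("A", "B")),
  score  = c(0.1, 0.2)
)

# Illustrative stand-in: coerce one column to character if it is a factor.
coerce_factor_col <- function(df, col) {
  if (is.factor(df[[col]])) {
    df[[col]] <- as.character(df[[col]])
  }
  df
}

out <- coerce_factor_col(df, "doc_id")
```

Columns that are not factors are returned unchanged.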
Internal helper function to check a comparison matrix for inconsistent relevance values of gold standard and predicted labels.
check_repair_relevance_compare(
  gold_vs_pred,
  ignore_inconsistencies = options::opt("ignore_inconsistencies")
)
gold_vs_pred |
As created by |
ignore_inconsistencies |
Warnings about data inconsistencies will be silenced. (Defaults to |
A valid comparison matrix with possibly corrected relevance values,
being compatible with compute_intermediate_results.
Internal helper function to check a data.frame with predicted labels for a valid relevance column.
check_repair_relevance_pred(
  predicted,
  ignore_inconsistencies = options::opt("ignore_inconsistencies")
)
predicted |
Multi-label prediction results. Expects a data.frame with
columns |
ignore_inconsistencies |
Warnings about data inconsistencies will be silenced. (Defaults to |
A valid predicted data.frame with possibly eliminated missing
values.
Compute intermediate set retrieval results per group such as number of gold standard and predicted labels, number of true positives, false positives and false negatives, precision, R-precision, recall and F1 score.
compute_intermediate_results(
  gold_vs_pred,
  grouping_var,
  propensity_scored = FALSE,
  cost_fp = NULL,
  drop_empty_groups = options::opt("drop_empty_groups"),
  check_group_names = options::opt("check_group_names")
)

compute_intermediate_results_dplyr(
  gold_vs_pred,
  grouping_var,
  propensity_scored = FALSE,
  cost_fp = NULL
)
gold_vs_pred |
A data.frame with logical columns |
grouping_var |
A character vector of grouping variables that must be
present in |
propensity_scored |
Logical, whether to use propensity scores as weights. |
cost_fp |
A numeric value > 0, defaults to NULL. |
drop_empty_groups |
Should empty levels of factor variables be dropped in grouped set retrieval
computation? (Defaults to |
check_group_names |
Perform replacement of dots in grouping columns. Disable for faster
computation if you can make sure that all columns used for grouping
("doc_id", "label_id", "doc_groups", "label_groups") do not contain
dots. (Defaults to |
A list of two elements:
results_table A data.frame with columns "n_gold",
"n_suggested", "tp", "fp", "fn", "prec", "rprec", "rec", "f1".
grouping_var The input vector grouping_var.
compute_intermediate_results_dplyr(): Variant with dplyr based
internals rather than collapse internals.
library(casimir)

gold <- tibble::tribble(
  ~doc_id, ~label_id,
  "A", "a",
  "A", "b",
  "A", "c",
  "B", "a",
  "B", "d",
  "C", "a",
  "C", "b",
  "C", "d",
  "C", "f"
)

pred <- tibble::tribble(
  ~doc_id, ~label_id,
  "A", "a",
  "A", "d",
  "A", "f",
  "B", "a",
  "B", "e",
  "C", "f"
)

# create_comparison() expects the predictions first, then the gold standard
gold_vs_pred <- create_comparison(pred, gold)

compute_intermediate_results(gold_vs_pred, "doc_id")
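To make the per-group quantities concrete, the following base R sketch recomputes true positives, false positives, false negatives, precision, recall and F1 per document for the same toy data. It is an illustration of the standard definitions only. It does not use the package internals, and it omits R-precision, propensity weights and `cost_fp`.

```r
gold <- data.frame(
  doc_id   = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
  label_id = c("a", "b", "c", "a", "d", "a", "b", "d", "f"),
  stringsAsFactors = FALSE
)
pred <- data.frame(
  doc_id   = c("A", "A", "A", "B", "B", "C"),
  label_id = c("a", "d", "f", "a", "e", "f"),
  stringsAsFactors = FALSE
)

# Full outer join on (doc_id, label_id) with logical indicator columns.
gold$gold <- TRUE
pred$suggested <- TRUE
cmp <- merge(pred, gold, by = c("doc_id", "label_id"), all = TRUE)
cmp$gold[is.na(cmp$gold)] <- FALSE
cmp$suggested[is.na(cmp$suggested)] <- FALSE

# Count tp / fp / fn per document and derive the metrics.
per_doc <- do.call(rbind, lapply(split(cmp, cmp$doc_id), function(d) {
  tp <- sum(d$suggested & d$gold)
  fp <- sum(d$suggested & !d$gold)
  fn <- sum(!d$suggested & d$gold)
  data.frame(
    doc_id = d$doc_id[1],
    tp = tp, fp = fp, fn = fn,
    prec = tp / (tp + fp),
    rec  = tp / (tp + fn),
    f1   = 2 * tp / (2 * tp + fp + fn)
  )
}))
```

Grouping by `label_id` instead of `doc_id` in the `split()` call would give the label-wise counts in the same way.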
Compute intermediate ranked retrieval results per group such as Discounted Cumulative Gain (DCG), Ideal Discounted Cumulative Gain (IDCG), Normalised Discounted Cumulative Gain (NDCG) and Label Ranking Average Precision (LRAP).
compute_intermediate_results_rr(
  gold_vs_pred,
  grouping_var,
  drop_empty_groups = options::opt("drop_empty_groups")
)
gold_vs_pred |
A data.frame as generated by |
grouping_var |
A character vector of grouping variables that must be
present in |
drop_empty_groups |
Should empty levels of factor variables be dropped in grouped set retrieval
computation? (Defaults to |
A data.frame with columns "dcg", "idcg", "ndcg", "lrap".
library(casimir)

gold <- tibble::tribble(
  ~doc_id, ~label_id,
  "A", "a",
  "A", "b",
  "A", "c",
  "A", "d",
  "A", "e",
)

pred <- tibble::tribble(
  ~doc_id, ~label_id, ~score,
  "A", "f", 0.3277,
  "A", "e", 0.32172,
  "A", "b", 0.13517,
  "A", "g", 0.10134,
  "A", "h", 0.09152,
  "A", "a", 0.07483,
  "A", "i", 0.03649,
  "A", "j", 0.03551,
  "A", "k", 0.03397,
  "A", "c", 0.03364
)

# create_comparison() expects the predictions first, then the gold standard
gold_vs_pred <- create_comparison(pred, gold)

compute_intermediate_results_rr(
  gold_vs_pred,
  rlang::syms(c("doc_id"))
)
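For orientation, the textbook definitions of DCG, IDCG and NDCG with binary gains can be sketched in base R for the same document. This is a hedged illustration: it assumes binary relevance and assumes that the ideal ranking places all gold labels at the top. The package may differ in details such as tie handling or graded relevance.

```r
# Predicted labels for one document, sorted by descending score.
pred_labels <- c("f", "e", "b", "g", "h", "a", "i", "j", "k", "c")
gold_labels <- c("a", "b", "c", "d", "e")

# Binary gain: 1 if the predicted label is in the gold standard, else 0.
gain <- as.numeric(pred_labels %in% gold_labels)
position <- seq_along(pred_labels)

# DCG discounts each gain by log2(position + 1).
dcg <- sum(gain / log2(position + 1))

# Assumption: the ideal ranking lists every gold label first.
ideal_positions <- seq_along(gold_labels)
idcg <- sum(1 / log2(ideal_positions + 1))

ndcg <- dcg / idcg
```

NDCG stays below 1 here because relevant labels appear late in the ranking and one gold label ("d") is never predicted.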
Compute the area under the precision-recall curve with support for
bootstrap-based confidence intervals and different stratification and
aggregation modes for the underlying precision and recall aggregation.
Precision is calculated as the best value at a given level of recall for all
possible thresholds on score and limits on rank. In essence,
compute_pr_auc performs a two-dimensional optimisation over thresholds
and limits applying both threshold-based cutoff as well as rank-based cutoff.
compute_pr_auc(
  predicted,
  gold_standard,
  doc_groups = NULL,
  label_groups = NULL,
  mode = "doc-avg",
  steps = 100,
  thresholds = NULL,
  limit_range = NA_real_,
  compute_bootstrap_ci = FALSE,
  n_bt = 10L,
  seed = NULL,
  graded_relevance = FALSE,
  rename_metrics = FALSE,
  propensity_scored = FALSE,
  label_distribution = NULL,
  cost_fp_constant = NULL,
  replace_zero_division_with = options::opt("replace_zero_division_with"),
  drop_empty_groups = options::opt("drop_empty_groups"),
  ignore_inconsistencies = options::opt("ignore_inconsistencies"),
  verbose = options::opt("verbose"),
  progress = options::opt("progress")
)
predicted |
Multi-label prediction results. Expects a data.frame with
columns |
gold_standard |
Expects a data.frame with columns |
doc_groups |
A two-column data.frame with a column |
label_groups |
A two-column data.frame with a column |
mode |
One of the following aggregation modes: |
steps |
Number of breaks to divide the interval |
thresholds |
As an alternative to steps, you can manually set the thresholds
to be used to build the pr curve. Defaults to the quantiles of the true
positive suggestions' score distribution to be obtained from |
limit_range |
A vector of limit values to apply on the rank column. Defaults to NA, applying no cutoff on the predictions' label rank. |
compute_bootstrap_ci |
A logical indicator for computing bootstrap CIs. |
n_bt |
An integer number of resamples to be used for bootstrapping. |
seed |
Pass a seed to make bootstrap replication reproducible. |
graded_relevance |
A logical indicator for graded relevance. Defaults to
|
rename_metrics |
If set to
|
propensity_scored |
Logical, whether to use propensity scores as weights. |
label_distribution |
Expects a data.frame with columns |
cost_fp_constant |
Constant cost assigned to false positives.
|
replace_zero_division_with |
In macro averaged results (doc-avg, subj-avg), it may occur that some
instances have no predictions or no gold standard. In these cases,
calculating precision and recall may lead to division by zero. By default,
CASIMiR removes these missing values from macro averages, leading to a
smaller support (count of instances that were averaged). Other
implementations of macro averaged precision and recall default to 0 in these
cases. This option controls the replacement value. Set any value between 0
and 1. (Defaults to |
drop_empty_groups |
Should empty levels of factor variables be dropped in grouped set retrieval
computation? (Defaults to |
ignore_inconsistencies |
Warnings about data inconsistencies will be silenced. (Defaults to |
verbose |
Verbose reporting of computation steps for debugging. (Defaults to |
progress |
Display progress bars for iterated computations (like bootstrap CI or
pr curves). (Defaults to |
A data.frame with columns "pr_auc" and (if applicable)
"ci_lower", "ci_upper" and additional stratification variables.
See also: compute_set_retrieval_scores, compute_pr_auc_from_curve
library(ggplot2)
library(casimir)

gold <- tibble::tribble(
  ~doc_id, ~label_id,
  "A", "a",
  "A", "b",
  "A", "c",
  "B", "a",
  "B", "d",
  "C", "a",
  "C", "b",
  "C", "d",
  "C", "f"
)

pred <- tibble::tribble(
  ~doc_id, ~label_id, ~score, ~rank,
  "A", "a", 0.9, 1,
  "A", "d", 0.7, 2,
  "A", "f", 0.3, 3,
  "A", "c", 0.1, 4,
  "B", "a", 0.8, 1,
  "B", "e", 0.6, 2,
  "B", "d", 0.1, 3,
  "C", "f", 0.1, 3,
  "C", "c", 0.2, 1,
  "C", "e", 0.2, 1
)

auc <- compute_pr_auc(pred, gold, mode = "doc-avg")
Compute the area under the precision-recall curve given pr curve data. This
function is mainly intended for generating plot data. For computation of the
area under the curve, use compute_pr_auc. The function uses a simple
trapezoidal rule approximation along the steps of the generated curve data.
compute_pr_auc_from_curve(
  pr_curve_data,
  grouping_vars = NULL,
  drop_empty_groups = options::opt("drop_empty_groups")
)
pr_curve_data |
A data.frame as produced by
|
grouping_vars |
Additional columns of the input data to group by. |
drop_empty_groups |
Should empty levels of factor variables be dropped in grouped set retrieval
computation? (Defaults to |
A data.frame with a column "pr_auc" and optional
grouping_vars.
See also: compute_pr_curve
library(ggplot2)
library(casimir)

gold <- tibble::tribble(
  ~doc_id, ~label_id,
  "A", "a",
  "A", "b",
  "A", "c",
  "B", "a",
  "B", "d",
  "C", "a",
  "C", "b",
  "C", "d",
  "C", "f"
)

pred <- tibble::tribble(
  ~doc_id, ~label_id, ~score, ~rank,
  "A", "a", 0.9, 1,
  "A", "d", 0.7, 2,
  "A", "f", 0.3, 3,
  "A", "c", 0.1, 4,
  "B", "a", 0.8, 1,
  "B", "e", 0.6, 2,
  "B", "d", 0.1, 3,
  "C", "f", 0.1, 3,
  "C", "c", 0.2, 1,
  "C", "e", 0.2, 1
)

# compute_pr_curve() expects the predictions first, then the gold standard
pr_curve <- compute_pr_curve(
  pred,
  gold,
  mode = "doc-avg",
  optimize_cutoff = TRUE
)

# pass the curve data.frame, not the whole result list
auc <- compute_pr_auc_from_curve(pr_curve$plot_data)

# note that pr curves take the cummax(prec), not the precision
ggplot(pr_curve$plot_data, aes(x = rec, y = prec_cummax)) +
  geom_point(
    data = pr_curve$opt_cutoff,
    aes(x = rec, y = prec_cummax),
    color = "red",
    shape = "star"
  ) +
  geom_text(
    data = pr_curve$opt_cutoff,
    aes(
      x = rec + 0.2,
      y = prec_cummax,
      label = paste("f1_opt =", round(f1_max, 3))
    ),
    color = "red"
  ) +
  geom_path() +
  coord_cartesian(xlim = c(0, 1), ylim = c(0, 1))
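The trapezoidal rule mentioned in the description can be sketched in a few lines of base R. The curve points below are made up for illustration, and the sketch ignores grouping variables and any postprocessing the package applies to the curve data.

```r
# Hypothetical curve points (recall, precision).
rec  <- c(0, 0.25, 0.5, 1)
prec <- c(1, 1, 0.5, 0.25)

# Sort by recall, then sum the trapezoid areas between neighbouring points.
ord <- order(rec)
rec_sorted  <- rec[ord]
prec_sorted <- prec[ord]

widths  <- diff(rec_sorted)
heights <- (head(prec_sorted, -1) + tail(prec_sorted, -1)) / 2

pr_auc_sketch <- sum(widths * heights)
```

Each trapezoid spans one recall step and uses the mean of the two neighbouring precision values as its height.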
Compute the precision-recall curve for a given step size and limit range.
compute_pr_curve(
  predicted,
  gold_standard,
  doc_groups = NULL,
  label_groups = NULL,
  mode = "doc-avg",
  steps = 100,
  thresholds = NULL,
  limit_range = NA_real_,
  optimize_cutoff = FALSE,
  graded_relevance = FALSE,
  propensity_scored = FALSE,
  label_distribution = NULL,
  cost_fp_constant = NULL,
  replace_zero_division_with = options::opt("replace_zero_division_with"),
  drop_empty_groups = options::opt("drop_empty_groups"),
  ignore_inconsistencies = options::opt("ignore_inconsistencies"),
  verbose = options::opt("verbose"),
  progress = options::opt("progress")
)
predicted |
Multi-label prediction results. Expects a data.frame with
columns |
gold_standard |
Expects a data.frame with columns |
doc_groups |
A two-column data.frame with a column |
label_groups |
A two-column data.frame with a column |
mode |
One of the following aggregation modes: |
steps |
Number of breaks to divide the interval |
thresholds |
As an alternative to steps, you can manually set the thresholds
to be used to build the pr curve. Defaults to the quantiles of the true
positive suggestions' score distribution to be obtained from |
limit_range |
A vector of limit values to apply on the rank column. Defaults to NA, applying no cutoff on the predictions' label rank. |
optimize_cutoff |
Logical. If |
graded_relevance |
A logical indicator for graded relevance. Defaults to
|
propensity_scored |
Logical, whether to use propensity scores as weights. |
label_distribution |
Expects a data.frame with columns |
cost_fp_constant |
Constant cost assigned to false positives.
|
replace_zero_division_with |
In macro averaged results (doc-avg, subj-avg), it may occur that some
instances have no predictions or no gold standard. In these cases,
calculating precision and recall may lead to division by zero. By default,
CASIMiR removes these missing values from macro averages, leading to a
smaller support (count of instances that were averaged). Other
implementations of macro averaged precision and recall default to 0 in these
cases. This option controls the replacement value. Set any value between 0
and 1. (Defaults to |
drop_empty_groups |
Should empty levels of factor variables be dropped in grouped set retrieval
computation? (Defaults to |
ignore_inconsistencies |
Warnings about data inconsistencies will be silenced. (Defaults to |
verbose |
Verbose reporting of computation steps for debugging. (Defaults to |
progress |
Display progress bars for iterated computations (like bootstrap CI or
pr curves). (Defaults to |
A list of three elements:
plot_data A data.frame with full pr curves and columns
"searchspace_id", "prec", "rec", "prec_cummax", "mode".
opt_cutoff A data.frame with optimal cutoffs and columns
"thresholds", "limits", "searchspace_id", "f1_max", "prec",
"rec", "prec_cummax", "mode".
all_cutoffs A data.frame with all cutoffs and columns
"thresholds", "limits", "searchspace_id", "metric", "value",
"support", "f1_max", "prec", "rec", "prec_cummax", "mode".
All three data.frames may contain additional stratification variables
passed with doc_groups and label_groups. The latter two
data.frames are non-empty only if optimize_cutoff == TRUE.
library(ggplot2)
library(casimir)

gold <- tibble::tribble(
  ~doc_id, ~label_id,
  "A", "a",
  "A", "b",
  "A", "c",
  "B", "a",
  "B", "d",
  "C", "a",
  "C", "b",
  "C", "d",
  "C", "f"
)

pred <- tibble::tribble(
  ~doc_id, ~label_id, ~score, ~rank,
  "A", "a", 0.9, 1,
  "A", "d", 0.7, 2,
  "A", "f", 0.3, 3,
  "A", "c", 0.1, 4,
  "B", "a", 0.8, 1,
  "B", "e", 0.6, 2,
  "B", "d", 0.1, 3,
  "C", "f", 0.1, 1,
  "C", "c", 0.2, 2,
  "C", "e", 0.2, 2
)

pr_curve <- compute_pr_curve(
  pred,
  gold,
  mode = "doc-avg",
  optimize_cutoff = TRUE
)

auc <- compute_pr_auc_from_curve(pr_curve$plot_data)

# note that pr curves take the cummax(prec), not the precision
ggplot(pr_curve$plot_data, aes(x = rec, y = prec_cummax)) +
  geom_point(
    data = pr_curve$opt_cutoff,
    aes(x = rec, y = prec_cummax),
    color = "red",
    shape = "star"
  ) +
  geom_text(
    data = pr_curve$opt_cutoff,
    aes(
      x = rec + 0.2,
      y = prec_cummax,
      label = paste("f1_opt =", round(f1_max, 3))
    ),
    color = "red"
  ) +
  geom_path() +
  coord_cartesian(xlim = c(0, 1), ylim = c(0, 1))
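The comment about `cummax(prec)` can be illustrated with a small base R sketch. It assumes the cumulative maximum runs from high recall to low recall, as in the usual interpolated precision, so that each point carries the best precision reachable at that recall level or higher. The data are made up, and the package's exact postprocessing is not confirmed by the text above.

```r
# Hypothetical raw curve points.
curve <- data.frame(
  rec  = c(0.2, 0.4, 0.6, 0.8),
  prec = c(0.9, 0.5, 0.7, 0.3)
)

# Walk from the highest recall downwards and keep the running maximum.
curve <- curve[order(curve$rec, decreasing = TRUE), ]
curve$prec_cummax <- cummax(curve$prec)

# Restore ascending recall order for plotting.
curve <- curve[order(curve$rec), ]
```

The dip at recall 0.4 is lifted to 0.7, which makes the resulting curve monotonically non-increasing in recall.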
Compute inverse propensity scores based on a label distribution. Propensity scores for extreme multi-label learning are proposed in Jain, H., Prabhu, Y., & Varma, M. (2016). Extreme Multi-label Loss Functions for Recommendation, Tagging, Ranking and Other Missing Label Applications. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 13-17-Aug, 935–944. doi:10.1145/2939672.2939756.
compute_propensity_scores(label_distribution, a = 0.55, b = 1.5)
label_distribution |
Expects a data.frame with columns |
a |
A numeric parameter for the propensity score calculation, defaults to 0.55. |
b |
A numeric parameter for the propensity score calculation, defaults to 1.5. |
A data.frame with columns "label_id", "label_weight".
library(tidyverse)
library(casimir)

label_distribution <- dnb_label_distribution

compute_propensity_scores(label_distribution)
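For reference, the inverse propensity weight from Jain et al. (2016) can be written as 1 + C (N_l + B)^(-A) with C = (log N - 1)(B + 1)^A, where N is the number of documents and N_l the number of documents carrying label l. The base R sketch below implements this formula directly. The function name and its arguments `label_freq` and `n_docs` are hypothetical and do not reflect the package's input columns.

```r
# Inverse propensity weights following the formula in Jain et al. (2016).
# label_freq: number of documents per label (N_l); n_docs: corpus size (N).
inv_propensity_sketch <- function(label_freq, n_docs, a = 0.55, b = 1.5) {
  c_const <- (log(n_docs) - 1) * (b + 1)^a
  1 + c_const * (label_freq + b)^(-a)
}

w <- inv_propensity_sketch(label_freq = c(1, 10, 100), n_docs = 1000)
```

Rare labels receive larger weights than frequent ones, which is what makes propensity scored metrics reward correct predictions on the long tail.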
This function computes the ranked retrieval scores Discounted Cumulative Gain (DCG), Ideal Discounted Cumulative Gain (IDCG), Normalised Discounted Cumulative Gain (NDCG) and Label Ranking Average Precision (LRAP). Ranked retrieval, unlike set retrieval, assumes ordered predictions. Unlike set retrieval metrics, ranked retrieval metrics are logically bound to a document-wise evaluation. Thus, only the aggregation mode "doc-avg" is available for these scores.
compute_ranked_retrieval_scores(
  predicted,
  gold_standard,
  doc_groups = NULL,
  drop_empty_groups = options::opt("drop_empty_groups"),
  progress = options::opt("progress")
)
predicted |
Multi-label prediction results. Expects a data.frame with
columns |
gold_standard |
Expects a data.frame with columns |
doc_groups |
A two-column data.frame with a column |
drop_empty_groups |
Should empty levels of factor variables be dropped in grouped set retrieval
computation? (Defaults to |
progress |
Display progress bars for iterated computations (like bootstrap CI or
pr curves). (Defaults to |
A data.frame with columns "metric", "mode", "value", "support"
and optional grouping variables supplied in doc_groups. Here,
support is defined as number of documents that contribute to the
document average in aggregation of the overall result.
library(casimir)

# some dummy results
gold <- tibble::tribble(
  ~doc_id, ~label_id,
  "A", "a",
  "A", "b",
  "A", "c",
  "A", "d",
  "A", "e",
)

pred <- tibble::tribble(
  ~doc_id, ~label_id, ~score,
  "A", "f", 0.3277,
  "A", "e", 0.32172,
  "A", "b", 0.13517,
  "A", "g", 0.10134,
  "A", "h", 0.09152,
  "A", "a", 0.07483,
  "A", "i", 0.03649,
  "A", "j", 0.03551,
  "A", "k", 0.03397,
  "A", "c", 0.03364
)

results <- compute_ranked_retrieval_scores(
  pred,
  gold
)
Compute multi-label metrics precision, recall, F1 and R-precision for subject indexing results.
compute_set_retrieval_scores(
  predicted,
  gold_standard,
  k = NULL,
  mode = "doc-avg",
  compute_bootstrap_ci = FALSE,
  n_bt = 10L,
  doc_groups = NULL,
  label_groups = NULL,
  graded_relevance = FALSE,
  rename_metrics = FALSE,
  seed = NULL,
  propensity_scored = FALSE,
  label_distribution = NULL,
  cost_fp_constant = NULL,
  replace_zero_division_with = options::opt("replace_zero_division_with"),
  drop_empty_groups = options::opt("drop_empty_groups"),
  ignore_inconsistencies = options::opt("ignore_inconsistencies"),
  verbose = options::opt("verbose"),
  progress = options::opt("progress")
)

compute_set_retrieval_scores_dplyr(
  predicted,
  gold_standard,
  k = NULL,
  mode = "doc-avg",
  compute_bootstrap_ci = FALSE,
  n_bt = 10L,
  doc_groups = NULL,
  label_groups = NULL,
  graded_relevance = FALSE,
  rename_metrics = FALSE,
  seed = NULL,
  propensity_scored = FALSE,
  label_distribution = NULL,
  cost_fp_constant = NULL,
  ignore_inconsistencies = FALSE,
  verbose = FALSE,
  progress = FALSE
)
predicted |
Multi-label prediction results. Expects a data.frame with
columns |
gold_standard |
Expects a data.frame with columns |
k |
An integer limit on the number of predictions per document to
consider. Requires a column |
mode |
One of the following aggregation modes: |
compute_bootstrap_ci |
A logical indicator for computing bootstrap CIs. |
n_bt |
An integer number of resamples to be used for bootstrapping. |
doc_groups |
A two-column data.frame with a column |
label_groups |
A two-column data.frame with a column |
graded_relevance |
A logical indicator for graded relevance. Defaults to
|
rename_metrics |
If set to
|
seed |
Pass a seed to make bootstrap replication reproducible. |
propensity_scored |
Logical, whether to use propensity scores as weights. |
label_distribution |
Expects a data.frame with columns |
cost_fp_constant |
Constant cost assigned to false positives.
|
replace_zero_division_with |
In macro averaged results (doc-avg, subj-avg), it may occur that some
instances have no predictions or no gold standard. In these cases,
calculating precision and recall may lead to division by zero. By default,
CASIMiR removes these missing values from macro averages, leading to a
smaller support (count of instances that were averaged). Other
implementations of macro averaged precision and recall default to 0 in these
cases. This option controls the replacement value. Set any value between 0
and 1. (Defaults to |
drop_empty_groups |
Should empty levels of factor variables be dropped in grouped set retrieval
computation? (Defaults to |
ignore_inconsistencies |
Warnings about data inconsistencies will be silenced. (Defaults to |
verbose |
Verbose reporting of computation steps for debugging. (Defaults to |
progress |
Display progress bars for iterated computations (like bootstrap CI or
pr curves). (Defaults to |
A data.frame with columns "metric", "mode", "value", "support"
and optional grouping variables supplied in doc_groups or
label_groups. Here, support is defined for each mode
as:
mode == "doc-avg": The number of tested documents.
mode == "subj-avg": The number of labels contributing to the subj-average.
mode == "micro": The number of doc-label pairs contributing
to the denominator of the respective metric, e.g. for
precision, for recall, for F1 and
for R-precision.
compute_set_retrieval_scores_dplyr(): Variant with internal usage of
dplyr rather than collapse library. Tends to be slower, but more stable.
library(tidyverse)
library(casimir)
library(furrr)

gold <- tibble::tribble(
  ~doc_id, ~label_id,
  "A", "a",
  "A", "b",
  "A", "c",
  "B", "a",
  "B", "d",
  "C", "a",
  "C", "b",
  "C", "d",
  "C", "f",
)

pred <- tibble::tribble(
  ~doc_id, ~label_id,
  "A", "a",
  "A", "d",
  "A", "f",
  "B", "a",
  "B", "e",
  "C", "f",
)

plan(sequential) # or whatever resources you have

a <- compute_set_retrieval_scores(
  pred,
  gold,
  mode = "doc-avg",
  compute_bootstrap_ci = TRUE,
  n_bt = 100L
)

ggplot(a, aes(x = metric, y = value)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper)) +
  facet_wrap(vars(metric), scales = "free")

# example with graded relevance
pred_w_relevance <- tibble::tribble(
  ~doc_id, ~label_id, ~relevance,
  "A", "a", 1.0,
  "A", "d", 0.0,
  "A", "f", 0.0,
  "B", "a", 1.0,
  "B", "e", 1 / 3,
  "C", "f", 1.0,
)

b <- compute_set_retrieval_scores(
  pred_w_relevance,
  gold,
  mode = "doc-avg",
  graded_relevance = TRUE
)
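The effect of replace_zero_division_with on a macro average and its support can be illustrated with a base R sketch. The counts below are hypothetical, and the code only mirrors the behaviour described in the argument documentation (dropping undefined values versus replacing them with 0). It is not the package's internal implementation.

```r
# Hypothetical per-document counts; document C has no predictions at all.
tp          <- c(A = 1, B = 1, C = 0)
n_suggested <- c(A = 3, B = 2, C = 0)

prec <- tp / n_suggested  # C evaluates to 0 / 0, i.e. NaN

# Variant 1: drop undefined values, so only two documents are averaged.
macro_drop   <- mean(prec, na.rm = TRUE)
support_drop <- sum(!is.na(prec))

# Variant 2: replace undefined values with 0 and average all three.
prec_zero    <- ifelse(is.na(prec), 0, prec)
macro_zero   <- mean(prec_zero)
support_zero <- length(prec_zero)
```

The two conventions give different averages (5/12 versus 5/18 here), which matters when comparing results against other toolkits.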
Join the gold standard and the predicted results in one table based on the document id and the label id.
create_comparison(
  predicted,
  gold_standard,
  doc_groups = NULL,
  label_groups = NULL,
  graded_relevance = FALSE,
  propensity_scored = FALSE,
  label_distribution = NULL,
  ignore_inconsistencies = options::opt("ignore_inconsistencies")
)
predicted |
Multi-label prediction results. Expects a data.frame with
columns |
gold_standard |
Expects a data.frame with columns |
doc_groups |
A two-column data.frame with a column |
label_groups |
A two-column data.frame with a column |
graded_relevance |
A logical indicator for graded relevance. Defaults to
|
propensity_scored |
Logical, whether to use propensity scores as weights. |
label_distribution |
Expects a data.frame with columns |
ignore_inconsistencies |
Warnings about data inconsistencies will be silenced. (Defaults to |
A data.frame with columns "label_id", "doc_id", "suggested",
"gold".
library(casimir)

gold <- tibble::tribble(
  ~doc_id, ~label_id,
  "A", "a",
  "A", "b",
  "A", "c",
  "B", "a",
  "B", "d",
  "C", "a",
  "C", "b",
  "C", "d",
  "C", "f"
)

pred <- tibble::tribble(
  ~doc_id, ~label_id,
  "A", "a",
  "A", "d",
  "A", "f",
  "B", "a",
  "B", "e",
  "C", "f"
)

create_comparison(pred, gold)
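Conceptually, the comparison table is a full outer join of predictions and gold standard on document id and label id. The base R sketch below reproduces that idea with `merge()`. It is an assumption about the general approach, not the package's actual implementation, which also handles groups, graded relevance and propensity scores.

```r
gold <- data.frame(
  doc_id   = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
  label_id = c("a", "b", "c", "a", "d", "a", "b", "d", "f")
)
pred <- data.frame(
  doc_id   = c("A", "A", "A", "B", "B", "C"),
  label_id = c("a", "d", "f", "a", "e", "f")
)

# Mark the origin of each row before joining.
gold$gold <- TRUE
pred$suggested <- TRUE

# Full outer join: keep pairs that occur in either table.
comp <- merge(pred, gold, by = c("doc_id", "label_id"), all = TRUE)

# Pairs missing on one side become FALSE in the respective flag.
comp$gold[is.na(comp$gold)] <- FALSE
comp$suggested[is.na(comp$suggested)] <- FALSE

comp <- comp[, c("label_id", "doc_id", "suggested", "gold")]
print(comp)
```

Rows with `suggested & gold` are true positives, `suggested & !gold` false positives, and `!suggested & gold` false negatives.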
Create a rank per document id based on score.
create_rank_col(df)

create_rank_col_dplyr(df)
df |
A data.frame with columns |
The input data.frame df with an additional column
"rank".
create_rank_col_dplyr(): Variant with internal usage of
dplyr rather than collapse library.
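A per-document rank can be sketched in base R with `ave()` and `rank()`. This is only an illustration of the idea. The tie-breaking rule used here (`ties.method = "first"`) is an assumption and may differ from what `create_rank_col()` does.

```r
df <- data.frame(
  doc_id   = c("A", "A", "A", "A", "B", "B", "B"),
  label_id = c("a", "d", "f", "c", "a", "e", "d"),
  score    = c(0.9, 0.7, 0.3, 0.1, 0.8, 0.6, 0.1)
)

# Rank within each document; the highest score receives rank 1.
df$rank <- ave(
  -df$score, df$doc_id,
  FUN = function(x) rank(x, ties.method = "first")
)
print(df)
```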
Helper function for document-wise computation of ranked retrieval scores DCG, NDCG and LRAP. Implemented as in Annif https://github.com/NatLibFi/Annif/blob/master/annif/eval.py. Reference implementation of DCG to test against.
dcg_score(gold_vs_pred, limit = NULL)
gold_vs_pred |
A data.frame as generated by |
limit |
An integer cutoff value for DCG@N. |
The numeric value of DCG.
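As a rough illustration, the textbook form of DCG for a single document sums the relevance of each ranked prediction, discounted by the logarithm of its rank. The sketch below assumes binary relevance (for which linear and exponential gain coincide) and uses a hypothetical function name `dcg_sketch`; it is not the package's implementation.

```r
# relevance: numeric vector of relevance values, ordered by rank (best first).
# limit:     optional cutoff N for DCG@N.
dcg_sketch <- function(relevance, limit = NULL) {
  if (!is.null(limit)) relevance <- head(relevance, limit)
  if (length(relevance) == 0) return(0)
  sum(relevance / log2(seq_along(relevance) + 1))
}

# Four ranked predictions, of which the first and the fourth are correct.
rel <- c(1, 0, 0, 1)
dcg_sketch(rel)            # 1 / log2(2) + 1 / log2(5)
dcg_sketch(rel, limit = 2) # only the first two ranks count
```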
A subset of documents found in the catalogue of the DNB with intellectually
assigned subject labels from the GND subject vocabulary.
The document ids match those in the dnb_test_predictions dataset.
dnb_gold_standard
A data.frame with 337 rows and 2 columns:
doc_id: DNB identifier of a document in the catalogue.
label_id: DNB identifier of a concept in the GND subject vocabulary.
A subset of labels used in the catalogue of the DNB along with their
frequencies of occurrence. The label_ids match those in the
dnb_gold_standard and dnb_test_predictions datasets.
dnb_label_distribution
A data frame with 7,772 rows and 3 columns:
label_id: DNB identifier of a concept in the GND subject vocabulary.
label_freq: Number of occurrences of the specified label in the overall catalogue.
n_docs: Overall number of documents in the ground truth dataset.
A subset of documents found in the catalogue of the DNB with predictions
generated with some arbitrary indexing method. The document ids match those
in the dnb_gold_standard dataset.
dnb_test_predictions
A data frame with 100,000 rows and 3 columns:
doc_id: DNB identifier of a document in the catalogue.
label_id: DNB identifier of a concept in the GND subject vocabulary.
score: A confidence score generated by the indexing method.
Compute the denominator for R-precision based on propensity scored ranking of gold standard labels.
find_ps_rprec_deno(gold_vs_pred, grouping_var, cost_fp)

find_ps_rprec_deno_dplyr(gold_vs_pred, grouping_var, cost_fp)
gold_vs_pred |
A data.frame with logical columns |
grouping_var |
A character vector of grouping variables that must be
present in |
cost_fp |
A numeric value > 0, defaults to NULL. |
A data.frame with columns "n_gold", "n_suggested", "tp", "fp",
"fn", "delta_relevance", "rprec_deno".
find_ps_rprec_deno_dplyr(): Variant with dplyr based
internals rather than collapse internals.
Helper function which performs the major bootstrap operation and wraps the
repeated application of summarise_intermediate_results and
compute_pr_auc_from_curve for each bootstrap run.
generate_pr_auc_replica(
  intermed_res_all_thrsld,
  seed,
  n_bt,
  propensity_scored,
  replace_zero_division_with = options::opt("replace_zero_division_with"),
  progress = options::opt("progress")
)
intermed_res_all_thrsld |
Intermediate results as produced by
|
seed |
Pass a seed to make bootstrap replication reproducible. |
n_bt |
An integer number of resamples to be used for bootstrapping. |
propensity_scored |
Logical, whether to use propensity scores as weights. |
replace_zero_division_with |
In macro averaged results (doc-avg, subj-avg), some instances may have no
predictions or no gold standard labels. In these cases, calculating
precision and recall leads to a division by zero. By default, CASIMiR
removes these missing values from macro averages, which results in a
smaller support (count of instances that were averaged). Other
implementations of macro averaged precision and recall default to 0 in these
cases. This option lets you control that behaviour. Set any value between 0
and 1. (Defaults to |
progress |
Display progress bars for iterated computations (like bootstrap CI or
pr curves). (Defaults to |
A data.frame with columns "boot_replicate", "pr_auc".
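For orientation, the area under a precision-recall curve can be approximated from a set of (recall, precision) points. The sketch below uses the trapezoidal rule on made-up points. Both the integration rule and the numbers are assumptions for illustration and are not taken from `compute_pr_auc_from_curve`.

```r
# Made-up curve points, sorted by increasing recall.
rec  <- c(0, 0.25, 0.5, 1)
prec <- c(1, 0.8, 0.6, 0.4)

# Trapezoidal rule: interval width times mean height of adjacent points.
pr_auc <- sum(diff(rec) * (head(prec, -1) + tail(prec, -1)) / 2)
print(pr_auc)
```

In a bootstrap setting, this computation would be repeated once per resampled set of documents, yielding one `pr_auc` value per replicate.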
Wrapper for computing n_bt bootstrap replicates, combining the
functionality of compute_intermediate_results and
summarise_intermediate_results.
generate_replicate_results(
  base_compare,
  n_bt,
  grouping_var,
  seed = NULL,
  ps_flags = list(intermed = FALSE, summarise = FALSE),
  label_distribution = NULL,
  cost_fp = NULL,
  replace_zero_division_with = options::opt("replace_zero_division_with"),
  drop_empty_groups = options::opt("drop_empty_groups"),
  progress = options::opt("progress")
)

generate_replicate_results_dplyr(
  base_compare,
  n_bt,
  grouping_var,
  seed = NULL,
  label_distribution = NULL,
  ps_flags = list(intermed = FALSE, summarise = FALSE),
  cost_fp = NULL,
  progress = FALSE
)
base_compare |
A data.frame as generated by |
n_bt |
An integer number of resamples to be used for bootstrapping. |
grouping_var |
A character vector of variables that must be present in
|
seed |
A seed passed to resampling step for reproducibility. |
ps_flags |
A list as returned by |
label_distribution |
Expects a data.frame with columns |
cost_fp |
A numeric value > 0, defaults to NULL. |
replace_zero_division_with |
In macro averaged results (doc-avg, subj-avg), some instances may have no
predictions or no gold standard labels. In these cases, calculating
precision and recall leads to a division by zero. By default, CASIMiR
removes these missing values from macro averages, which results in a
smaller support (count of instances that were averaged). Other
implementations of macro averaged precision and recall default to 0 in these
cases. This option lets you control that behaviour. Set any value between 0
and 1. (Defaults to |
drop_empty_groups |
Should empty levels of factor variables be dropped in grouped set retrieval
computation? (Defaults to |
progress |
Display progress bars for iterated computations (like bootstrap CI or
pr curves). (Defaults to |
A data.frame containing n_bt bootstrap replicates of the results as
returned by compute_intermediate_results and
summarise_intermediate_results.
generate_replicate_results_dplyr(): Variant with dplyr based
internals rather than collapse internals.
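The general idea of document-level bootstrapping can be sketched in a few lines of base R: resample document ids with replacement, recompute the aggregated metric on each resample, and derive an interval from the replicates. The per-document values below are made up, and the percentile interval is an assumption; the package may construct its confidence intervals differently.

```r
set.seed(42)

# Made-up per-document precision values, named by doc_id.
per_doc_prec <- c(A = 1 / 3, B = 1 / 2, C = 1)
doc_ids <- names(per_doc_prec)
n_bt <- 200L

# One macro-averaged value per bootstrap replicate.
replicates <- vapply(seq_len(n_bt), function(i) {
  sampled <- sample(doc_ids, size = length(doc_ids), replace = TRUE)
  mean(per_doc_prec[sampled])
}, numeric(1))

res <- data.frame(boot_replicate = seq_len(n_bt), value = replicates)

# Percentile interval over the replicates (assumed CI construction).
ci <- quantile(res$value, probs = c(0.025, 0.975))
print(ci)
```

Passing a fixed seed, as the `seed` argument does, makes the resampling reproducible.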
Internal wrapper for computing bootstrapping results on one sample, combining
the functionality of compute_intermediate_results and
summarise_intermediate_results.
helper_f(
  sampled_id_list,
  compare_cpy,
  grouping_var,
  label_distribution = NULL,
  ps_flags = list(intermed = FALSE, summarise = FALSE),
  cost_fp = NULL,
  replace_zero_division_with = options::opt("replace_zero_division_with"),
  drop_empty_groups = options::opt("drop_empty_groups")
)
sampled_id_list |
A list of all doc_ids of this bootstrap. |
compare_cpy |
As created by |
grouping_var |
A vector of variables to be used for aggregation. |
label_distribution |
Expects a data.frame with columns |
ps_flags |
A list as returned by |
cost_fp |
A numeric value > 0, defaults to NULL. |
replace_zero_division_with |
In macro averaged results (doc-avg, subj-avg), some instances may have no
predictions or no gold standard labels. In these cases, calculating
precision and recall leads to a division by zero. By default, CASIMiR
removes these missing values from macro averages, which results in a
smaller support (count of instances that were averaged). Other
implementations of macro averaged precision and recall default to 0 in these
cases. This option lets you control that behaviour. Set any value between 0
and 1. (Defaults to |
drop_empty_groups |
Should empty levels of factor variables be dropped in grouped set retrieval
computation? (Defaults to |
A data.frame as returned by summarise_intermediate_results.
Internal wrapper for computing bootstrapping results on one sample, combining
the functionality of compute_intermediate_results and
summarise_intermediate_results.
helper_f_dplyr(
  sampled_id_list,
  compare_cpy,
  grouping_var,
  ps_flags = list(intermed = FALSE, summarise = FALSE),
  label_distribution = NULL,
  cost_fp = NULL
)
sampled_id_list |
A list of all doc_ids of this bootstrap. |
compare_cpy |
As created by |
grouping_var |
A vector of variables to be used for aggregation. |
ps_flags |
A list with logicals |
label_distribution |
Expects a data.frame with columns |
cost_fp |
A numeric value > 0, defaults to NULL. |
A data.frame as returned by
summarise_intermediate_results_dplyr.
Helper function to perform a secure join of a comparison matrix with propensity scores.
join_propensity_scores(input_data, label_weights)

join_propensity_scores_dplyr(input_data, label_weights)
input_data |
A data.frame containing at least the column
|
label_weights |
Expects a data.frame with columns |
The input data.frame input_data with an additional column
"label_weight".
join_propensity_scores_dplyr(): Variant with dplyr based
internals rather than collapse internals.
library(casimir)

gold <- tibble::tribble(
  ~doc_id, ~label_id,
  "A", "a",
  "A", "b",
  "A", "c",
  "B", "a",
  "B", "d",
  "C", "a",
  "C", "b",
  "C", "d",
  "C", "f"
)

pred <- tibble::tribble(
  ~doc_id, ~label_id,
  "A", "a",
  "A", "d",
  "A", "f",
  "B", "a",
  "B", "e",
  "C", "f"
)

label_distribution <- tibble::tribble(
  ~label_id, ~label_freq, ~n_docs,
  "a", 10000, 10100,
  "b", 1000, 10100,
  "c", 100, 10100,
  "d", 1, 10100,
  "e", 1, 10100,
  "f", 2, 10100,
  "g", 0, 10100
)

# predictions first, gold standard second, as in the signature of create_comparison()
comp <- create_comparison(pred, gold)

label_weights <- compute_propensity_scores(label_distribution)

comp_w_label_weights <- join_propensity_scores(
  input_data = comp,
  label_weights = label_weights
)
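One plausible reading of a "secure" join is a left join that keeps every row of the comparison table and fails loudly when a label has no weight. The base R sketch below illustrates that reading. The weights are made up, the column names `label_id` and `label_weight` in the weights table are assumed, and the missing-weight check is an assumption about what "secure" means here rather than documented behaviour.

```r
comp <- data.frame(
  doc_id   = c("A", "A", "B"),
  label_id = c("a", "d", "e")
)

# Made-up weights for illustration only; "g" is an unused label.
label_weights <- data.frame(
  label_id     = c("a", "d", "e", "g"),
  label_weight = c(1.0, 3.2, 3.2, 4.0)
)

# Left join: every row of the comparison table is kept.
joined <- merge(comp, label_weights, by = "label_id", all.x = TRUE, sort = FALSE)

# Assumed safety check: every label in the comparison needs a weight.
if (anyNA(joined$label_weight)) {
  stop("Some labels in the comparison table have no propensity score.")
}
print(joined)
```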
Helper function for document-wise computation of ranked retrieval scores DCG, NDCG and LRAP. Implemented as in Annif https://github.com/NatLibFi/Annif/blob/master/annif/eval.py. Reference implementation for Label Ranking Average Precision.
lrap_score(gold_vs_pred)
gold_vs_pred |
A data.frame as generated by |
The numeric value of LRAP.
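For a single document, Label Ranking Average Precision averages, over all correct labels, the fraction of correct labels among the predictions ranked at or above that label. The sketch below is a simplified illustration with a hypothetical function name. It assumes that all gold labels appear among the ranked predictions and that there are no tied scores, which the reference implementation may treat differently.

```r
# is_gold: logical vector over ranked predictions (best first),
#          TRUE where the predicted label is in the gold standard.
lrap_sketch <- function(is_gold) {
  hits <- which(is_gold)                 # ranks of the correct labels
  if (length(hits) == 0) return(NA_real_)
  # For the i-th correct label: i correct labels at or above its rank.
  mean(seq_along(hits) / hits)
}

# Correct labels at ranks 1 and 4: (1 / 1 + 2 / 4) / 2 = 0.75
lrap_sketch(c(TRUE, FALSE, FALSE, TRUE))
```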
Helper function for document-wise computation of ranked retrieval scores DCG, NDCG and LRAP. Implemented as in Annif https://github.com/NatLibFi/Annif/blob/master/annif/eval.py. Reference implementation for NDCG to test against.
ndcg_score(gold_vs_pred, limit = NULL)
gold_vs_pred |
A data.frame as generated by |
limit |
An integer cutoff value for NDCG@N. |
The numeric value of NDCG.
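NDCG normalises DCG by the DCG of an ideal ranking, in which all gold standard labels occupy the top ranks. The following sketch assumes binary relevance and uses hypothetical helper names; it illustrates the textbook definition and is not the package's code.

```r
# Textbook DCG for relevance values ordered by rank (best first).
dcg_sketch <- function(relevance) {
  sum(relevance / log2(seq_along(relevance) + 1))
}

# relevance: 0/1 vector over ranked predictions.
# n_gold:    number of gold standard labels of the document.
# limit:     optional cutoff N for NDCG@N.
ndcg_sketch <- function(relevance, n_gold, limit = NULL) {
  ideal <- rep(1, n_gold)                # ideal ranking: all gold labels first
  if (!is.null(limit)) {
    relevance <- head(relevance, limit)
    ideal <- head(ideal, limit)
  }
  idcg <- dcg_sketch(ideal)
  if (idcg == 0) return(0)
  dcg_sketch(relevance) / idcg
}

# Hits at ranks 1 and 4, three gold labels in total.
ndcg_sketch(c(1, 0, 0, 1), n_gold = 3)
```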
Declaration of options to be used as identical function arguments
check_group_names |
Perform replacement of dots in grouping columns. Disable for faster
computation if you can make sure that all columns used for grouping
("doc_id", "label_id", "doc_groups", "label_groups") do not contain
dots. (Defaults to |
ignore_inconsistencies |
Warnings about data inconsistencies will be silenced. (Defaults to |
drop_empty_groups |
Should empty levels of factor variables be dropped in grouped set retrieval
computation? (Defaults to |
replace_zero_division_with |
In macro averaged results (doc-avg, subj-avg), some instances may have no
predictions or no gold standard labels. In these cases, calculating
precision and recall leads to a division by zero. By default, CASIMiR
removes these missing values from macro averages, which results in a
smaller support (count of instances that were averaged). Other
implementations of macro averaged precision and recall default to 0 in these
cases. This option lets you control that behaviour. Set any value between 0
and 1. (Defaults to |
progress |
Display progress bars for iterated computations (like bootstrap CI or
pr curves). (Defaults to |
verbose |
Verbose reporting of computation steps for debugging. (Defaults to |
Internally used, package-specific options. All options will prioritize R options() values, and fall back to environment variables if undefined. If neither the option nor the environment variable is set, a default value is used.
Option values specific to casimir can be
accessed by passing the package name to env.
options::opts(env = "casimir")

options::opt(x, default, env = "casimir")
ignore_inconsistencies: Warnings about data inconsistencies will be silenced.
default: FALSE
option: casimir.ignore_inconsistencies
envvar: R_CASIMIR_IGNORE_INCONSISTENCIES (evaluated if possible, raw string otherwise)

progress: Display progress bars for iterated computations (like bootstrap CI or pr curves).
default: FALSE
option: casimir.progress
envvar: R_CASIMIR_PROGRESS (evaluated if possible, raw string otherwise)

verbose: Verbose reporting of computation steps for debugging.
default: FALSE
option: casimir.verbose
envvar: R_CASIMIR_VERBOSE (evaluated if possible, raw string otherwise)

check_group_names: Perform replacement of dots in grouping columns. Disable for faster computation if you can make sure that all columns used for grouping ("doc_id", "label_id", "doc_groups", "label_groups") do not contain dots.
default: TRUE
option: casimir.check_group_names
envvar: R_CASIMIR_CHECK_GROUP_NAMES (evaluated if possible, raw string otherwise)

drop_empty_groups: Should empty levels of factor variables be dropped in grouped set retrieval computation?
default: TRUE
option: casimir.drop_empty_groups
envvar: R_CASIMIR_DROP_EMPTY_GROUPS (evaluated if possible, raw string otherwise)

replace_zero_division_with: In macro averaged results (doc-avg, subj-avg), some instances may have no predictions or no gold standard labels. In these cases, calculating precision and recall leads to a division by zero. By default, CASIMiR removes these missing values from macro averages, which results in a smaller support (count of instances that were averaged). Other implementations of macro averaged precision and recall default to 0 in these cases. This option lets you control that behaviour. Set any value between 0 and 1.
default: NULL
option: casimir.replace_zero_division_with
envvar: R_CASIMIR_REPLACE_ZERO_DIVISION_WITH (evaluated if possible, raw string otherwise)
See also: options, getOption, Sys.setenv, Sys.getenv
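The lookup order described above (R option first, then environment variable, then default) can be mimicked in base R as follows. `resolve_opt` is a hypothetical helper written for this illustration. It is not how the options package is implemented, and it only approximates the "evaluated if possible, raw string otherwise" rule.

```r
# Hypothetical helper: resolve a casimir option by name.
resolve_opt <- function(name, default) {
  opt_name <- paste0("casimir.", name)
  env_name <- paste0("R_CASIMIR_", toupper(name))

  # 1. An R option set via options() takes priority.
  val <- getOption(opt_name)
  if (!is.null(val)) return(val)

  # 2. Otherwise fall back to the environment variable, if it is set.
  raw <- Sys.getenv(env_name, unset = NA_character_)
  if (!is.na(raw)) {
    # Evaluate the string if possible, keep the raw string otherwise.
    return(tryCatch(eval(parse(text = raw)), error = function(e) raw))
  }

  # 3. Neither is set: use the default.
  default
}

Sys.setenv(R_CASIMIR_PROGRESS = "TRUE")
from_env <- resolve_opt("progress", default = FALSE)  # taken from the env var

options(casimir.progress = FALSE)
from_option <- resolve_opt("progress", default = TRUE) # the R option wins

# Clean up the session state touched by this example.
options(casimir.progress = NULL)
Sys.unsetenv("R_CASIMIR_PROGRESS")
```

In practice, setting `options(casimir.progress = TRUE)` in a session or exporting `R_CASIMIR_PROGRESS=TRUE` before starting R should have the same effect on the package.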
Reshape pr curve data to a format that is easier for plotting.
pr_curve_post_processing(results_summary)
results_summary |
As produced by |
A data.frame with columns "searchspace_id", "prec", "rec",
"prec_cummax" and possible additional stratification variables.
Calculate the cost for false positives depending on the chosen
cost_fp_constant.
process_cost_fp(cost_fp_constant, gold_vs_pred)
cost_fp_constant |
Constant cost assigned to false positives.
|
gold_vs_pred |
A data.frame with logical columns |
A numeric value > 0.
Rename metrics for generalised precision etc. Metric names are modified as follows:
graded_relevance == TRUE: prefixed with "g-" to indicate that metrics are computed with graded relevance.
propensity_scored == TRUE: prefixed with "ps-" to indicate that metrics are computed with propensity scores.
!is.null(k): suffixed with "@k" to indicate that metrics are limited to top k predictions.
rename_metrics(
  res_df,
  k = NULL,
  propensity_scored = FALSE,
  graded_relevance = FALSE
)
res_df |
A data.frame with a column |
k |
An integer limit on the number of predictions per document to
consider. Requires a column |
propensity_scored |
Logical, whether to use propensity scores as weights. |
graded_relevance |
A logical indicator for graded relevance. Defaults to
|
The input data.frame res_df with renamed metrics for
generalised precision etc.
Determine the appropriate grouping variables for each aggregation mode.
set_grouping_var(mode, doc_groups, label_groups, var = NULL)
mode |
One of the following aggregation modes: |
doc_groups |
A two-column data.frame with a column |
label_groups |
A two-column data.frame with a column |
var |
Additional variables to include. |
A character vector of variables determining the grouping structure.
Generate flags if propensity scores should be applied to intermediate results or summarised results.
set_ps_flags(mode, propensity_scored)
mode |
One of the following aggregation modes: |
propensity_scored |
Logical, whether to use propensity scores as weights. |
A list containing logical flags "intermed" and
"summarise".
Compute the mean of intermediate results created by
compute_intermediate_results.
summarise_intermediate_results(
  intermediate_results,
  propensity_scored = FALSE,
  label_distribution = NULL,
  set = FALSE,
  replace_zero_division_with = options::opt("replace_zero_division_with")
)
intermediate_results |
As produced by
|
propensity_scored |
Logical, whether to use propensity scores as weights. |
label_distribution |
Expects a data.frame with columns |
set |
Logical. Allow in-place modification of
|
replace_zero_division_with |
In macro averaged results (doc-avg, subj-avg), some instances may have no
predictions or no gold standard labels. In these cases, calculating
precision and recall leads to a division by zero. By default, CASIMiR
removes these missing values from macro averages, which results in a
smaller support (count of instances that were averaged). Other
implementations of macro averaged precision and recall default to 0 in these
cases. This option lets you control that behaviour. Set any value between 0
and 1. (Defaults to |
A data.frame with columns "metric", "value".
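The effect of `replace_zero_division_with` on a macro average can be illustrated with a small base R sketch. The per-document values are made up and `macro_avg` is a hypothetical helper; the sketch only mirrors the behaviour described for the argument, not the package's internals.

```r
# Made-up per-document precision; document B has no predictions (0 / 0).
per_doc_prec <- c(A = 1 / 3, B = NA, C = 1)

# Hypothetical helper: macro average with optional replacement of
# undefined values.
macro_avg <- function(x, replace_zero_division_with = NULL) {
  if (is.null(replace_zero_division_with)) {
    x <- x[!is.na(x)]                    # drop undefined values: smaller support
  } else {
    x[is.na(x)] <- replace_zero_division_with
  }
  c(value = mean(x), support = length(x))
}

dropped  <- macro_avg(per_doc_prec)      # averages A and C only
replaced <- macro_avg(per_doc_prec, 0)   # B counts as 0
print(dropped)
print(replaced)
```

Dropping undefined values yields a higher average on a smaller support, while replacing them with 0 penalises documents without predictions.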
Compute the mean of intermediate results created by
compute_intermediate_results. Variant with dplyr based internals
rather than collapse internals.
summarise_intermediate_results_dplyr(
  intermediate_results,
  propensity_scored = FALSE,
  label_distribution = NULL
)
intermediate_results |
As produced by
|
propensity_scored |
Logical, whether to use propensity scores as weights. |
label_distribution |
Expects a data.frame with columns |
A data.frame with columns "metric", "value".