hold_out_command

Hold Out Command: Randomly Sub-sample The Data

Syntax

dismod_at database hold_out integrand_name max_fit
dismod_at database hold_out integrand_name max_fit max_fit_parent
dismod_at database hold_out integrand_name max_fit   \
      cov_name cov_value_1 cov_value_2
dismod_at database hold_out integrand_name max_fit max_fit_parent   \
      cov_name cov_value_1 cov_value_2
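
The four forms above differ only in which optional arguments follow max_fit. As a minimal sketch, the Python helper below assembles the argument list for each form. The name hold_out_argv, the database file name, and the covariate name are hypothetical and not part of dismod_at; the sketch only mirrors the syntax shown above and assumes the dismod_at executable is on the search path.

```python
def hold_out_argv(database, integrand_name, max_fit,
                  max_fit_parent=None, cov=None):
    """Build the argument list for one of the four hold_out syntax forms.

    cov, if given, is a tuple (cov_name, cov_value_1, cov_value_2).
    This helper is hypothetical; it only mirrors the documented syntax.
    """
    argv = ["dismod_at", database, "hold_out", integrand_name, str(max_fit)]
    if max_fit_parent is not None:
        # Second and fourth forms: max_fit_parent follows max_fit.
        argv.append(str(max_fit_parent))
    if cov is not None:
        # Third and fourth forms: the three covariate arguments come last.
        cov_name, cov_value_1, cov_value_2 = cov
        argv += [cov_name, str(cov_value_1), str(cov_value_2)]
    return argv


# Fourth form, using a hypothetical database and covariate name.
example = hold_out_argv("example.db", "Sincidence", 100,
                        max_fit_parent=50, cov=("sex", -0.5, 0.5))
```

The resulting list could be passed to subprocess.run(argv, check=True) to execute the command.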

Purpose

This command sets a maximum number of data values that are included in subsequent fits. It is intended to make the initialization and fitting faster. The random choice of which values to include can be made repeatable using random_seed.

database

This is an SQLite database containing the dismod_at input tables, which are not modified.

integrand_name

This is the integrand that we are sub-sampling.

max_fit

This is the maximum number of data points to fit for the specified integrand; i.e., the maximum number that are not held out. If, for this integrand, more than max_fit points have hold_out zero in the data table, points are randomly held out so that max_fit points are fit for this integrand.
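
The rule above can be illustrated with a short sketch. It is not the dismod_at implementation: the function name choose_fit_points and the use of Python's random module (with a seed standing in for random_seed) are assumptions made for illustration only.

```python
import random


def choose_fit_points(data_hold_out, max_fit, seed=None):
    """Return the sorted indices of the points that remain in the fit.

    data_hold_out[i] is the data table hold_out value (0 or 1) for the
    i-th data point of one integrand. Points with hold_out 1 are never
    candidates. If more than max_fit candidates exist, max_fit of them
    are chosen at random; otherwise all candidates are kept.
    """
    rng = random.Random(seed)  # a fixed seed makes the choice repeatable
    candidates = [i for i, h in enumerate(data_hold_out) if h == 0]
    if len(candidates) <= max_fit:
        return candidates
    return sorted(rng.sample(candidates, max_fit))


# Five candidates (index 2 is already held out); keep at most two of them.
chosen = choose_fit_points([0, 0, 1, 0, 0, 0], max_fit=2, seed=1)
```

Calling the function twice with the same seed returns the same indices, which is the sense in which the sub-sample is repeatable.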

max_fit_parent

If this argument is present, max_fit applies only to the total data from the child nodes. The value max_fit_parent is the maximum number of Parent Node data values to include.

cov_name

If this argument is present, it specifies a covariate column that will be balanced; see Covariates under Balancing below.

cov_value_1

If this argument is present, it specifies one of the covariate values for the balancing. This is a string representation of a double value.

cov_value_2

If this argument is present, it specifies the opposite covariate value for the balancing. This is a string representation of a double value.

Balancing

Child Nodes

The choice of which points to include in the fit tries to sample the same number of data points from each of the child nodes (and the parent node). If one of these nodes does not have enough data, the others make up the difference.
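
One way to picture this balancing is an even split in which nodes that run out of data pass their unused share to the others. The sketch below is an assumption about how such an allocation could work, not the algorithm dismod_at uses; the function name balanced_allocation and the node labels are hypothetical, and the case where max_fit_parent is given is not covered.

```python
def balanced_allocation(available, max_fit):
    """Split max_fit as evenly as possible across nodes.

    available maps each node to the number of data points it can supply.
    Returns a dict mapping each node to the number of points to include.
    The total is min(max_fit, sum of available values).
    """
    allocation = {node: 0 for node in available}
    remaining = min(max_fit, sum(available.values()))
    while remaining > 0:
        # Nodes that still have unused data points.
        open_nodes = [n for n in available if allocation[n] < available[n]]
        # Equal share for this pass; at least one point so the loop ends.
        share = max(remaining // len(open_nodes), 1)
        for node in open_nodes:
            if remaining == 0:
                break
            take = min(share, available[node] - allocation[node], remaining)
            allocation[node] += take
            remaining -= take
    return allocation


# Node "b" has only 2 points, so "a" and "c" make up the difference.
result = balanced_allocation({"a": 10, "b": 2, "c": 10}, max_fit=12)
```

In this example each node would ideally supply 4 points; because "b" has only 2, the other two nodes supply 5 each.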

Covariates

If cov_name is present, the data for each child is further split into those with cov_value_1, those with cov_value_2, and those with a different value (for the covariate specified by cov_name). The choice of which points to include tries to sample the same number of points from each of these sub-groups.
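
The three-way split can be sketched as follows. The function name split_by_covariate is hypothetical, and comparing values by exact floating-point equality is an assumption of this sketch; this page does not say how dismod_at compares them.

```python
def split_by_covariate(cov_values, cov_value_1, cov_value_2):
    """Group data indices by the value of the balanced covariate.

    cov_values[i] is the covariate value for the i-th data point of one
    child. cov_value_1 and cov_value_2 are strings, as on the command
    line, and are converted to float before comparing.
    """
    v1 = float(cov_value_1)
    v2 = float(cov_value_2)
    groups = {"cov_value_1": [], "cov_value_2": [], "other": []}
    for index, value in enumerate(cov_values):
        if value == v1:  # exact equality is an assumption of this sketch
            groups["cov_value_1"].append(index)
        elif value == v2:
            groups["cov_value_2"].append(index)
        else:
            groups["other"].append(index)
    return groups


# Hypothetical covariate values for four data points of one child.
groups = split_by_covariate([-0.5, 0.5, 0.0, 0.5], "-0.5", "0.5")
```

Sampling would then try to draw the same number of points from each of the three lists.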

data_subset_table

Only rows of the data_subset table that correspond to this integrand are modified. The hold_out value is set to one (zero) if the corresponding data is (is not) selected for hold out. Only points that have hold_out zero in the data table can have hold_out non-zero in the data_subset table. See the fit command hold_out documentation.
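
To check the effect of the command, one could count the points of an integrand that are held out in neither table. The sketch below uses Python's sqlite3 module against a small in-memory stand-in database. The simplified schema (only the integrand, data, and data_subset columns needed here) and the function name count_fit_points are assumptions for illustration; a real dismod_at database has more tables and columns.

```python
import sqlite3


def count_fit_points(connection, integrand_name):
    """Count data points for integrand_name that no table holds out."""
    query = """
        SELECT COUNT(*)
        FROM data_subset
        JOIN data ON data.data_id = data_subset.data_id
        JOIN integrand ON integrand.integrand_id = data.integrand_id
        WHERE integrand.integrand_name = ?
          AND data.hold_out = 0
          AND data_subset.hold_out = 0
    """
    (count,) = connection.execute(query, (integrand_name,)).fetchone()
    return count


# Stand-in database with an assumed, simplified schema.
connection = sqlite3.connect(":memory:")
connection.executescript("""
    CREATE TABLE integrand (
        integrand_id INTEGER PRIMARY KEY, integrand_name TEXT);
    CREATE TABLE data (
        data_id INTEGER PRIMARY KEY, integrand_id INTEGER, hold_out INTEGER);
    CREATE TABLE data_subset (
        data_subset_id INTEGER PRIMARY KEY, data_id INTEGER, hold_out INTEGER);
    INSERT INTO integrand VALUES (0, 'Sincidence');
    INSERT INTO data VALUES (0, 0, 0), (1, 0, 0), (2, 0, 1);
    INSERT INTO data_subset VALUES (0, 0, 0), (1, 1, 1), (2, 2, 0);
""")
fit_count = count_fit_points(connection, "Sincidence")
```

In the stand-in data, one point is held out in the data table and one in the data_subset table, so a single point remains in the fit.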

Example

The files user_hold_out_1.py and user_hold_out_2.py contain examples and tests using this command.