Discretize a continuous variable as indicated by interval mappings in accordance with the PMML element Discretize.

xform_discretize(
  wrap_object,
  xform_info,
  table,
  default_value = NA,
  map_missing_to = NA,
  ...
)

Arguments

wrap_object

Output of xform_wrap or another transformation function.

xform_info

Specification of the details of the transformation. This may be a transformation command string, used together with an external file named in 'table', or a list of data frames. Even if only one variable is to be transformed, the information for that transform should be given as a list with one element.

table

Name of external CSV file containing the map from input to output values.

default_value

Value to be given to the transformed variable if the value of the input variable does not lie in any of the defined intervals. If 'xform_info' is a list, this is a vector in which each element applies to the list element at the same index.

map_missing_to

Value to be given to the transformed variable if the value of the input variable is missing. If 'xform_info' is a list, this is a vector in which each element applies to the list element at the same index.

...

Further arguments passed to or from other methods.

Value

R object containing the raw data, the transformed data and data statistics.

Details

Create a discrete variable from a continuous one as indicated by interval mappings. The value of the discrete variable depends on the interval in which the value of the continuous variable lies. The mapping from intervals to discrete values can be given in an external table file referred to in the transform command or as a list of data frames.

Given a list of intervals and the discrete value linked to each interval, the discrete variable takes the value associated with the interval in which the continuous value lies. If a continuous variable InVar of data type InType is to be converted to a variable OutVar of data type OutType, the transformation command is in the format:

xform_info = "[InVar->OutVar][InType->OutType]", table="TableFileName",
default_value="defVal", map_missing_to="missingVal"

where TableFileName is the name of the CSV file containing the interval to discrete value map. The data types of the variables can be any of those defined in the PMML format, including integer, double or string. defVal is the value given to the transformed variable when the input value does not lie in any of the defined intervals, and missingVal is the value given when the input value is missing.

The arguments InType, OutType, default_value and map_missing_to are optional. The CSV file containing the table should not have any row or column identifiers, and the values given must be in the same order as in the map command. If the data types of the variables are not given, the function attempts to determine the data types of the input variables from the wrap_object argument. If that is not possible, the data types are assumed to be string.

An interval may be given by only its left or only its right limit, in which case the other limit is taken to be infinite. It may also be given by both the left and right limits, separated by the character ":". For example, intervals may be defined in the external file as:


rightVal1),outVal1
rightVal2],outVal2
[leftVal1:rightVal3),outVal3
(leftVal2:rightVal4],outVal4
(leftVal,outVal5

which, given an input value inVal and the output value to be calculated out, means that:


if(inVal < rightVal1) out=outVal1
if(inVal <= rightVal2) out=outVal2
if( (inVal >= leftVal1) and (inVal < rightVal3) ) out=outVal3
if( (inVal > leftVal2) and (inVal <= rightVal4) ) out=outVal4
if(inVal > leftVal) out=outVal5
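
The rules above can be mimicked in base R, which may help when checking an interval table before passing it to xform_discretize. The following is a minimal sketch under stated assumptions, not the package's implementation: the helper names parse_interval and discretize_sketch are hypothetical, the interval values are made up, and the table is written to a temporary file purely for illustration.

```r
# Hypothetical helpers (not part of pmml) that mimic the interval rules above.

# Write a table file with no header row and no row names.
table_file <- tempfile(fileext = ".csv")
writeLines(c("5],low", "(5:6],mid", "(6,high"), table_file)

# Read it back: column 1 is the interval, column 2 the output value.
tab <- read.csv(table_file, header = FALSE, colClasses = "character",
                col.names = c("interval", "out"))

# Turn one interval string into its limits and closure flags.
parse_interval <- function(s) {
  s <- trimws(s)
  first <- substr(s, 1, 1)
  last <- substr(s, nchar(s), nchar(s))
  has_left <- first %in% c("(", "[")
  has_right <- last %in% c(")", "]")
  # Strip the bracket characters, then split on ":" if both limits are given.
  body <- s
  if (has_left) body <- substr(body, 2, nchar(body))
  if (has_right) body <- substr(body, 1, nchar(body) - 1)
  parts <- strsplit(body, ":", fixed = TRUE)[[1]]
  list(
    left = if (has_left) as.numeric(parts[1]) else -Inf,
    right = if (has_right) as.numeric(parts[length(parts)]) else Inf,
    left_closed = identical(first, "["),
    right_closed = identical(last, "]")
  )
}

# Apply every interval in the table to a numeric vector.
discretize_sketch <- function(x, tab, default_value = NA, map_missing_to = NA) {
  out <- rep(as.character(default_value), length(x))
  out[is.na(x)] <- as.character(map_missing_to)
  for (i in seq_len(nrow(tab))) {
    iv <- parse_interval(tab$interval[i])
    above <- if (iv$left_closed) x >= iv$left else x > iv$left
    below <- if (iv$right_closed) x <= iv$right else x < iv$right
    hit <- !is.na(x) & above & below
    out[hit] <- tab$out[i]
  }
  out
}

res <- discretize_sketch(c(4.9, 5, 5.5, 6, 6.1, NA), tab, map_missing_to = "0")
print(res)
```

In this sketch a value equal to 5 falls in the first interval because "5]" is closed on the right, while a value equal to 6 falls in "(5:6]" rather than "(6".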

It is also possible to give the information about the transforms without an external file, using a list of data frames. Each data frame defines a discretization operation for one input variable. The first row of the data frame gives the original field name, the derived field name, the left interval, the left value, the right interval and the right value. The second row gives the data types of the values listed in the first row. This second row is optional; if it is not given, all fields are assumed to be strings.

In this input format, the 'default_value' and 'map_missing_to' parameters should be vectors. The first element of each vector corresponds to the derived field defined in the first element of the 'xform_info' list, and so on.

Although somewhat more complicated, this method does not rely on any external file. Further, once the initial list is constructed, modifying it is a simple operation, which makes this the better method to use if the parameters of the transformation are to be modified frequently and/or automatically. This is made clearer in the Examples section.
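
To illustrate why the list format suits automated changes, the sketch below generates the data frames from a vector of break points using base R only. The helper make_discretize_frame is hypothetical and not part of the package; the column layout, the "open"/"closed"/"Open" spellings and the use of factors simply mirror the documented example, and the break points are made up.

```r
# Hypothetical helper (not part of pmml): build one data frame in the
# layout described above. Intervals are closed on the left and open on
# the right, except that the first interval has no left limit and the
# last has no right limit.
make_discretize_frame <- function(in_name, out_name, breaks,
                                  in_type = "double", out_type = "integer") {
  n <- length(breaks) + 1          # number of intervals
  left <- c(NA, breaks)            # NA means no left limit
  right <- c(breaks, NA)           # NA means no right limit
  df <- data.frame(
    in_field = c(in_name, in_type, rep(NA, n)),  # unused after row 2
    out_field = c(out_name, out_type, seq_len(n) - 1),
    leftInterval = c("leftInterval", "string",
                     ifelse(is.na(left), "open", "closed")),
    leftValue = c("leftValue", "double", left),
    # "Open" is spelled as in the documented example.
    rightInterval = c("rightInterval", "string", rep("Open", n)),
    rightValue = c("rightValue", "double", right),
    stringsAsFactors = TRUE
  )
  names(df)[1:2] <- c(in_name, out_name)
  df
}

# One list element per variable to be discretized.
xform_list <- list(
  make_discretize_frame("Petal.Length", "dis_pl", breaks = c(0, 1, 2, 3, 4)),
  make_discretize_frame("Sepal.Width", "dis_sw", breaks = c(2.5, 3.5))
)

# One default value and one missing-value replacement per list element.
def <- c(11, 12)
mis <- c(22, 23)

# Changing the cut points only requires rebuilding the affected element.
xform_list[[2]] <- make_discretize_frame("Sepal.Width", "dis_sw",
                                         breaks = c(2, 3, 4))
```

The resulting objects would then be supplied as xform_info = xform_list, default_value = def and map_missing_to = mis.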

Author

Tridivesh Jena

Examples

# First wrap the data
iris_box <- xform_wrap(iris)
if (FALSE) {
# Convert the continuous variable "Sepal.Length" to a discrete
# variable "dsl". The intervals to be used for this transformation are
# given in a file, "intervals.csv", whose content is, for example:
#
#  5],val1
#  (5:6],22
#  (6,val2
#
# This will be used to create a discrete variable named "dsl" of dataType
# "string" such that:
#    if(Sepal.Length <= 5) then dsl = "val1"
#    if((Sepal.Length > 5) and (Sepal.Length <= 6)) then dsl = "22"
#    if(Sepal.Length > 6) then dsl = "val2"
#
# Give "dsl" the value 0 if the input variable value is missing.
iris_box <- xform_discretize(iris_box,
  xform_info = "[Sepal.Length -> dsl][double -> string]",
  table = "intervals.csv", map_missing_to = "0"
)
}

# A different transformation using a list of data frames, of size 1:
t <- list()
m <- data.frame(rbind(
  c(
    "Petal.Length", "dis_pl", "leftInterval", "leftValue",
    "rightInterval", "rightValue"
  ),
  c(
    "double", "integer", "string", "double", "string",
    "double"
  ),
  c("0)", 0, "open", NA, "Open", 0),
  c(NA, 1, "closed", 0, "Open", 1),
  c(NA, 2, "closed", 1, "Open", 2),
  c(NA, 3, "closed", 2, "Open", 3),
  c(NA, 4, "closed", 3, "Open", 4),
  c("[4", 5, "closed", 4, "Open", NA)
), stringsAsFactors = TRUE)

# Give column names to make it look nice; not necessary!
colnames(m) <- c(
  "Petal.Length", "dis_pl", "leftInterval", "leftValue",
  "rightInterval", "rightValue"
)

# A textual representation of the data frame is:
#   Petal.Length  dis_pl leftInterval leftValue rightInterval rightValue
# 1 Petal.Length  dis_pl leftInterval leftValue rightInterval rightValue
# 2       double integer       string    double        string     double
# 3           0)       0         open      <NA>          Open          0
# 4         <NA>       1       closed         0          Open          1
# 5         <NA>       2       closed         1          Open          2
# 6         <NA>       3       closed         2          Open          3
# 7         <NA>       4       closed         3          Open          4
# 8           [4       5       closed         4          Open       <NA>
#
# This is a transformation that defines a derived field 'dis_pl'
# which has the integer value '0' if the original field
# 'Petal.Length' has a value less than 0. The derived field has a
# value '1' if the input is greater than or equal to 0 and less
# than 1. Note that the values of the 1st column after row 2 have
# been deliberately given NA values in the middle. This is to
# show that that column is meant for a textual representation of
# the transformation as defined for the method involving external
# files; however, in this method their values are not used.

# Add the data frame to a list. The default values and the missing
# values should be given as a vector, each element of the vector
# corresponding to the element at the same index in the list. If
# these values are not given as a vector, they will be used for the
# first list element only.
t[[1]] <- m
def <- c(11)
mis <- c(22)
iris_box <- xform_discretize(iris_box,
  xform_info = t, default_value = def,
  map_missing_to = mis
)

# Make a simple model to see the effect.
fit <- lm(Petal.Width ~ ., iris_box$data[, -5])
fit_pmml <- pmml(fit, transforms = iris_box)