30.3 Understanding Automatic Data Preparation
Understand data transformation using Automatic Data Preparation (ADP).
Most algorithms require some form of data transformation. During the model build process, Oracle Data Mining can automatically perform the transformations required by the algorithm. You can choose to supplement the automatic transformations with additional transformations of your own, or you can choose to manage all the transformations yourself.
In calculating automatic transformations, Oracle Data Mining uses heuristics that address the common requirements of a given algorithm. This process results in reasonable model quality in most cases.
Binning, normalization, and outlier treatment are transformations that are commonly needed by data mining algorithms.
30.3.1 Binning
Binning, also called discretization, is a technique for reducing the cardinality of continuous and discrete data. Binning groups related values together in bins to reduce the number of distinct values.
Binning can improve resource utilization and model build response time dramatically without significant loss in model quality. Binning can improve model quality by strengthening the relationship between attributes.
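As a rough illustration of how binning reduces the number of distinct values, the following Python sketch applies simple equal-width binning to one numeric column. The function name `equal_width_bins` and the sample data are hypothetical, and this is not Oracle Data Mining's implementation; it only shows the general idea of grouping values into a fixed number of bins.

```python
def equal_width_bins(values, num_bins):
    """Assign each value a bin index in [0, num_bins - 1] using equal-width bins."""
    if num_bins < 1:
        raise ValueError("num_bins must be at least 1")
    lo, hi = min(values), max(values)
    if lo == hi:
        # A constant column carries a single value, so everything shares one bin.
        return [0] * len(values)
    width = (hi - lo) / num_bins
    # The maximum value would otherwise fall just past the last bin, so cap it.
    return [min(int((v - lo) / width), num_bins - 1) for v in values]


# Hypothetical column with ten distinct values.
ages = [18, 22, 25, 31, 40, 47, 52, 58, 63, 78]
bin_ids = equal_width_bins(ages, 3)
print(bin_ids)
```

In this sample the ten distinct ages collapse into three bin identifiers, which is the cardinality reduction described above.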
Supervised binning is a form of intelligent binning in which important characteristics of the data are used to determine the bin boundaries. In supervised binning, the bin boundaries are identified by a single-predictor decision tree that takes into account the joint distribution with the target. Supervised binning can be used for both numerical and categorical attributes.
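The idea of letting the target drive the bin boundaries can be sketched with a one-level decision tree on a single numeric predictor. The code below is a simplified stand-in, not the algorithm Oracle Data Mining uses: it assumes Gini impurity as the split criterion and finds only one boundary, whereas a real single-predictor tree may produce several. The names `gini` and `best_boundary` and the sample data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_boundary(values, targets):
    """Return the split point on one predictor that minimizes weighted Gini impurity."""
    pairs = sorted(zip(values, targets))
    n = len(pairs)
    best_score, best_cut = float("inf"), None
    for i in range(1, n):
        left_v, right_v = pairs[i - 1][0], pairs[i][0]
        if left_v == right_v:
            continue  # no boundary can separate equal values
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_score, best_cut = score, (left_v + right_v) / 2
    return best_cut


# Hypothetical predictor (income) and target (bought).
income = [20, 25, 30, 35, 60, 65, 70, 80]
bought = ["no", "no", "no", "no", "yes", "yes", "yes", "yes"]
cut = best_boundary(income, bought)
bins = ["low" if v <= cut else "high" for v in income]
print(cut, bins)
```

Because the boundary is chosen where the target distribution changes, the resulting bins separate the two classes, which an unsupervised scheme such as equal-width binning does not guarantee.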
30.3.3 Outlier Treatment
A value is considered an outlier if it deviates significantly from most other values in the column. The presence of outliers can have a skewing effect on the data and can interfere with the effectiveness of transformations such as normalization or binning.
Outlier treatment methods such as trimming or clipping can be implemented to minimize the effect of outliers.
Outliers can represent problematic data, for example, a bad reading due to the abnormal condition of an instrument. However, in some cases, especially in the business arena, outliers are perfectly valid. For example, in census data, the earnings of some of the richest individuals can differ significantly from those of the general population. Do not treat such values as outliers, since they are an important part of the data. You need domain knowledge to decide how to handle outliers.
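The trimming and clipping treatments mentioned above can be sketched in a few lines of Python. This sketch assumes that clipping pulls out-of-range values back to the nearest bound and that trimming replaces them with a missing value; the source does not define either term, so treat both definitions, the function names `clip` and `trim`, and the chosen bounds as assumptions for illustration only.

```python
def clip(values, lower, upper):
    """Clipping (assumed definition): pull values outside [lower, upper] to the nearest bound."""
    return [min(max(v, lower), upper) for v in values]


def trim(values, lower, upper):
    """Trimming (assumed definition): replace values outside [lower, upper] with None."""
    return [v if lower <= v <= upper else None for v in values]


# Hypothetical instrument readings with two suspicious values.
readings = [9.8, 10.1, 10.0, 9.9, 57.3, 10.2, -4.0]
clipped = clip(readings, 9.0, 11.0)
trimmed = trim(readings, 9.0, 11.0)
print(clipped)
print(trimmed)
```

The bounds here are fixed by hand. In practice they would come from domain knowledge or from the distribution of the column, which is where the judgment described above comes in.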
30.3.4 How ADP Transforms the Data
The following table shows how ADP prepares the data for each algorithm.
Table 30-1 Oracle Data Mining Algorithms With ADP

| Algorithm | Mining Function | Treatment by ADP |
|---|---|---|
| Apriori | Association Rules | ADP has no effect on association rules. |
| Decision Tree | Classification | ADP has no effect on Decision Tree. Data preparation is handled by the algorithm. |
| Expectation Maximization | Clustering | Single-column (not nested) numerical columns that are modeled with Gaussian distributions are normalized with outlier-sensitive normalization. ADP has no effect on the other types of columns. |
| GLM | Classification and Regression | Numerical attributes are normalized with outlier-sensitive normalization. |
| k-Means | Clustering | Numerical attributes are normalized with outlier-sensitive normalization. |
| MDL | Attribute Importance | All attributes are binned with supervised binning. |
| Naive Bayes | Classification | All attributes are binned with supervised binning. |
| NMF | Feature Extraction | Numerical attributes are normalized with outlier-sensitive normalization. |
| O-Cluster | Clustering | Numerical attributes are binned with a specialized form of equi-width binning, which computes the number of bins per attribute automatically. Numerical columns with all nulls or a single value are removed. |
| SVD | Feature Extraction | Numerical attributes are normalized with outlier-sensitive normalization. |
| SVM | Classification, Anomaly Detection, and Regression | Numerical attributes are normalized with outlier-sensitive normalization. |
See Also:

Part III of Oracle Data Mining Concepts for more information about algorithm-specific data preparation