After training a machine learning model, you need a place to run it and serve predictions. If your organization is in the early stages of its AI journey or has budget constraints, you may struggle to find a deployment platform for your model. Building ML infrastructure and integrating ML models with the larger business are major bottlenecks to AI adoption [1,2,3]. IBM Db2 can help solve these problems with its built-in ML infrastructure. Anyone with a knowledge of SQL and access to a Db2 instance, where the in-database ML feature is enabled, can quickly learn to build and use a machine learning model in the database.
In this post, I'll show how to develop, deploy, and use a decision tree model in a Db2 database.
These are the main steps in this tutorial:
- Set up Db2 tables
- Explore the ML dataset
- Preprocess the dataset
- Train a decision tree model
- Generate predictions using the model
- Evaluate the model
I performed these steps on an on-premises Db2 Warehouse database. Db2 Warehouse on Cloud also supports these ML features.
The machine learning use case
I'll use a dataset of historical flights in the US. For each flight, the dataset includes information such as the flight's origin airport, departure time, flying time, and arrival time. A column in the dataset also indicates whether each flight arrived on time or late. Using examples from the dataset, we'll build a classification model with the decision tree algorithm. Once trained, the model can take unseen flight data as input and predict whether a flight will arrive on time or late at its destination.
1. Set up Db2 tables
The dataset I use in this tutorial is available here as a CSV file.
Creating a Db2 table
I use the following SQL to create a table for storing the dataset.
db2start
db2 connect to <database_name>
db2 "CREATE TABLE FLIGHTS.FLIGHTS_DATA_V3 (
ID INTEGER NOT NULL GENERATED BY DEFAULT AS IDENTITY,
YEAR INTEGER ,
QUARTER INTEGER ,
MONTH INTEGER ,
DAYOFMONTH INTEGER ,
DAYOFWEEK INTEGER ,
UNIQUECARRIER VARCHAR(50 OCTETS) ,
ORIGIN VARCHAR(50 OCTETS) ,
DEST VARCHAR(50 OCTETS) ,
CRSDEPTIME INTEGER ,
DEPTIME INTEGER ,
DEPDELAY REAL ,
DEPDEL15 REAL ,
TAXIOUT INTEGER ,
WHEELSOFF INTEGER ,
CRSARRTIME INTEGER ,
CRSELAPSEDTIME INTEGER ,
AIRTIME INTEGER ,
DISTANCEGROUP INTEGER ,
FLIGHTSTATUS VARCHAR(1) )
ORGANIZE BY ROW";
After creating the table, I use the following command to load the data from the CSV file into the table:
db2 "IMPORT FROM 'FLIGHTS_DATA_V3.csv' OF DEL COMMITCOUNT 50000 INSERT INTO FLIGHTS.FLIGHTS_DATA_V3"
I now have the ML dataset loaded into the FLIGHTS.FLIGHTS_DATA_V3 table in Db2. I'll copy a subset of the data from this table to a separate table for model development and evaluation, leaving the original copy of the data intact.
SELECT count(*) FROM FLIGHTS.FLIGHTS_DATA_V3
— — —
1000000
Creating a separate table with sample data
Create a table with a 10% random sample of rows from the table above, using Db2's RAND function.
CREATE TABLE FLIGHT.FLIGHTS_DATA AS (SELECT * FROM FLIGHTS.FLIGHTS_DATA_V3 WHERE RAND() < 0.1) WITH DATA
Count the number of rows in the sample table.
SELECT count(*) FROM FLIGHT.FLIGHTS_DATA
— — —
99879
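The sample has 99879 rows rather than exactly 100,000 because RAND() < 0.1 performs Bernoulli sampling: each row is kept independently with probability 0.1, so only the expected sample size is 10% of the table. A minimal Python sketch of the same idea, using a made-up row list:

```python
import random

def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`,
    mirroring the behavior of WHERE RAND() < fraction."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = list(range(1_000_000))            # stand-in for the 1M-row table
sample = bernoulli_sample(rows, 0.1, seed=42)
print(len(sample))                       # close to, but rarely exactly, 100000
```

Because each row is an independent coin flip, repeated runs (or different seeds) give slightly different sample sizes, just as rerunning the CREATE TABLE statement would.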
Look at the schema definition of the table.
SELECT NAME, COLTYPE, LENGTH
FROM SYSIBM.SYSCOLUMNS
WHERE TBCREATOR = 'FLIGHT' AND TBNAME = 'FLIGHTS_DATA'
ORDER BY COLNO
FLIGHTSTATUS is the response, or target, column. The others are feature columns.
Find the DISTINCT values in the target column.
SELECT DISTINCT FLIGHTSTATUS FROM FLIGHT.FLIGHTS_DATA
From these values, I can see that this is a binary classification task where each flight arrived either on time or late.
Find the frequencies of the distinct values in the FLIGHTSTATUS column. (The count is cast to FLOAT so that the division is not integer division, which would truncate the fraction to 0.)
SELECT FLIGHTSTATUS, count(*) AS FREQUENCY, CAST(count(*) AS FLOAT) / (SELECT count(*) FROM FLIGHT.FLIGHTS_DATA) AS FRACTION
FROM FLIGHT.FLIGHTS_DATA
GROUP BY FLIGHTSTATUS
From the output, I can see that the classes are imbalanced. I'll stop gathering insights from the entire dataset at this point, as doing so could leak information into the modeling phase.
Creating train/test partitions of the dataset
Before collecting deeper insights into the data, I'll divide the dataset into train and test partitions using Db2's RANDOM_SAMPLE stored procedure (SP). I apply stratified sampling to preserve the ratio between the two classes in the generated training dataset.
Create a TRAIN partition.
CALL IDAX.RANDOM_SAMPLE('intable=FLIGHT.FLIGHTS_DATA, fraction=0.8, outtable=FLIGHT.FLIGHTS_TRAIN, by=FLIGHTSTATUS')
Copy the remaining rows to a TEST partition.
CREATE TABLE FLIGHT.FLIGHTS_TEST AS (SELECT * FROM FLIGHT.FLIGHTS_DATA FDF WHERE FDF.ID NOT IN(SELECT FT.ID FROM FLIGHT.FLIGHTS_TRAIN FT)) WITH DATA
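With by=FLIGHTSTATUS, the SP draws 80% of the rows within each class, so the training partition preserves the class ratio of the full dataset, and the anti-join above collects the remaining rows into the test partition. A small Python sketch of this stratified split, using hypothetical (row_id, label) pairs:

```python
import random
from collections import defaultdict

def stratified_split(rows, fraction, seed=0):
    """rows: list of (row_id, label). Sample `fraction` of each class
    into the train partition; all remaining rows go to test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row_id, label in rows:
        by_class[label].append(row_id)
    train_ids = set()
    for ids in by_class.values():
        k = round(len(ids) * fraction)          # 80% of *this* class
        train_ids.update(rng.sample(ids, k))
    train = [r for r in rows if r[0] in train_ids]
    test = [r for r in rows if r[0] not in train_ids]
    return train, test

# 80/20 split of an imbalanced toy dataset: 800 on-time (0), 200 delayed (1)
rows = [(i, 0) for i in range(800)] + [(i, 1) for i in range(800, 1000)]
train, test = stratified_split(rows, 0.8)
delayed_frac = sum(label for _, label in train) / len(train)
print(len(train), len(test), delayed_frac)      # 800 200 0.2
```

The fraction of delayed flights in the training partition (0.2) matches the full toy dataset exactly, which is the point of stratifying.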
2. Explore data
In this step, I'll look at both sample data and summary statistics of the training dataset to gain insights into it.
Look at some sample data.
SELECT * FROM FLIGHT.FLIGHTS_TRAIN FETCH FIRST 10 ROWS ONLY
Some columns encode the time as numbers:
- CRSDEPTIME: Computer Reservation System (scheduled) departure time (hhmm)
- DEPTIME: actual departure time (hhmm)
- CRSARRTIME: Computer Reservation System (scheduled) arrival time (hhmm)
Now I collect summary statistics from FLIGHTS_TRAIN using the SUMMARY1000 SP to get a global view of the dataset's characteristics.
CALL IDAX.SUMMARY1000('intable=FLIGHT.FLIGHTS_TRAIN, outtable=FLIGHT.FLIGHTS_TRAIN_SUM1000')
Here, intable is the name of the input table from which I want the SUMMARY1000 SP to collect statistics, and outtable is the name of the table where SUMMARY1000 will store the gathered statistics for the entire dataset. Besides the outtable, the SUMMARY1000 SP creates several additional output tables: one table with statistics for each column type. Our dataset has two types of columns, numeric and nominal, so SUMMARY1000 generates two additional tables. These tables follow a naming convention: the name of the outtable plus the column type. In our case, the column types are NUM, for numeric, and CHAR, for nominal, so the names of the two additional tables are as follows:
FLIGHTS_TRAIN_SUM1000_NUM
FLIGHTS_TRAIN_SUM1000_CHAR
Having the statistics in separate tables for each data type makes it easier to view the statistics that apply to a specific type and reduces the number of columns viewed together, which simplifies the analysis.
Check the summary statistics of the numeric columns.
SELECT * FROM FLIGHT.FLIGHTS_TRAIN_SUM1000_NUM
For numeric columns, SUMMARY1000 collects the following statistics:
- Missing value count
- Non-missing value count
- Average
- Variance
- Standard deviation
- Skewness
- Excess kurtosis
- Minimum
- Maximum
Each of these statistics can help uncover insights into the dataset. For instance, I can see that the DEPDEL15 and DEPDELAY columns have 49 missing values, and that several columns contain large values: AIRTIME, CRSARRTIME, CRSDEPTIME, CRSELAPSEDTIME, DEPDELAY, DEPTIME, TAXIOUT, WHEELSOFF, and YEAR. Since I'll create a decision tree model, I don't need to deal with the large values or the missing values; Db2 handles both natively.
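To make the numeric statistics concrete, the sketch below computes the same moment-based quantities SUMMARY1000 reports (average, variance, skewness, excess kurtosis) for a tiny made-up sample; the values are purely illustrative:

```python
def moments(values):
    """Population mean, variance, skewness, and excess kurtosis --
    the moment statistics reported for each numeric column."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # variance
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    std = m2 ** 0.5
    skewness = m3 / std ** 3 if std else 0.0
    excess_kurtosis = m4 / m2 ** 2 - 3 if m2 else 0.0
    return mean, m2, skewness, excess_kurtosis

# a symmetric sample: zero skew, light tails (negative excess kurtosis)
mean, var, skew, kurt = moments([1, 2, 3, 4, 5])
print(mean, var, skew)   # 3.0 2.0 0.0; excess kurtosis is about -1.3
```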
Next, I check the summary statistics of the nominal columns.
SELECT * FROM FLIGHT.FLIGHTS_TRAIN_SUM1000_CHAR
For nominal columns, SUMMARY1000 gathers the following statistics:
- Number of missing values
- Number of non-missing values
- Number of distinct values
- Frequency of the most frequent value
3. Preprocess data
From the data exploration above, I can see that the dataset has only a few missing values, and that four time-related columns have large values: AIRTIME, CRSARRTIME, DEPTIME, WHEELSOFF. I'll leave the nominal values in all columns as-is, since the decision tree implementation in Db2 can handle them natively.
Extract the hour part from the time columns CRSARRTIME, DEPTIME, and WHEELSOFF.
Looking up the dataset's description, I see the values in the CRSARRTIME, DEPTIME, and WHEELSOFF columns encode times as hhmm. I extract the hour part of these values to create, hopefully, better features for the learning algorithm.
Scale the CRSARRTIME column: dividing the value by 100 gives the hour of the scheduled arrival time:
UPDATE FLIGHT.FLIGHTS_TRAIN SET CRSARRTIME = CRSARRTIME / 100
Scale the DEPTIME column: dividing the value by 100 gives the hour of the departure time:
UPDATE FLIGHT.FLIGHTS_TRAIN SET DEPTIME = DEPTIME / 100
Scale the WHEELSOFF column: dividing the value by 100 gives the hour of the wheels-off time:
UPDATE FLIGHT.FLIGHTS_TRAIN SET WHEELSOFF = WHEELSOFF / 100
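Because these columns are declared INTEGER, the division by 100 is integer division, which truncates the minutes and keeps the hour. A quick Python check of the same hhmm-to-hour logic:

```python
def hhmm_to_hour(hhmm: int) -> int:
    """Truncate an hhmm-encoded time to its hour, as the integer
    division by 100 does in the UPDATE statements above."""
    return hhmm // 100

# 2:35 pm is encoded as 1435, 7:30 am as 730, 12:55 am as 55
print(hhmm_to_hour(1435), hhmm_to_hour(730), hhmm_to_hour(55))  # 14 7 0
```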
4. Train a decision tree model
Now the training dataset is ready for the decision tree algorithm.
I train a decision tree model using the GROW_DECTREE SP.
CALL IDAX.GROW_DECTREE('model=FLIGHT.FLIGHT_DECTREE, intable=FLIGHT.FLIGHTS_TRAIN, id=ID, target=FLIGHTSTATUS')
I called this SP with the following parameters:
- model: the name I want to give the decision tree model, FLIGHT_DECTREE
- intable: the name of the table containing the training dataset
- id: the name of the ID column
- target: the name of the target column
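GROW_DECTREE handles the training internally, but the core idea of decision tree induction is easy to sketch: at each node, choose the split that most reduces class impurity. Below is a toy, self-contained illustration of Gini-impurity split scoring on made-up departure-delay data; it is not the Db2 implementation:

```python
def gini(labels):
    """Gini impurity of a set of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n                  # fraction of class 1 (delayed)
    return 1.0 - p * p - (1.0 - p) * (1.0 - p)

def split_impurity(values, labels, threshold):
    """Weighted Gini impurity after splitting on value <= threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# hypothetical departure delays (minutes) and on-time (0) / late (1) labels
dep_delay = [5, 60, 2, 45, 90, 1]
status    = [0, 1, 0, 1, 1, 0]
print(gini(status))                           # 0.5: maximally mixed node
print(split_impurity(dep_delay, status, 15))  # 0.0: threshold separates classes
```

A real learner scores many candidate thresholds and columns, picks the best, and recurses on the two child nodes.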
After the model training completes, the GROW_DECTREE SP generates several tables with metadata from the model and the training dataset. Here are some of the key tables:
- FLIGHT_DECTREE_MODEL: contains metadata about the model, such as the depth of the tree, the strategy for handling missing values, and the number of leaf nodes.
- FLIGHT_DECTREE_NODES: provides information about each node in the decision tree.
- FLIGHT_DECTREE_COLUMNS: provides information on each input column and its role in the trained model, including each column's importance in generating predictions.
This link has the complete list of model tables and their details.
5. Generate predictions from the model
Since the FLIGHT_DECTREE model is trained and deployed in the database, I can use it to generate predictions on the test data in the FLIGHTS_TEST table.
First, I preprocess the test dataset with the same preprocessing logic I applied to the training dataset.
Scale the CRSARRTIME column: dividing the value by 100 gives the hour of the scheduled arrival time:
UPDATE FLIGHT.FLIGHTS_TEST SET CRSARRTIME = CRSARRTIME / 100
Scale the DEPTIME column: dividing the value by 100 gives the hour of the departure time:
UPDATE FLIGHT.FLIGHTS_TEST SET DEPTIME = DEPTIME / 100
Scale the WHEELSOFF column: dividing the value by 100 gives the hour of the wheels-off time:
UPDATE FLIGHT.FLIGHTS_TEST SET WHEELSOFF = WHEELSOFF / 100
Generating predictions
I use the PREDICT_DECTREE SP to generate predictions from the FLIGHT_DECTREE model:
CALL IDAX.PREDICT_DECTREE('model=FLIGHT.FLIGHT_DECTREE, intable=FLIGHT.FLIGHTS_TEST, outtable=FLIGHT.FLIGHTS_TEST_PRED, prob=true, outtableprob=FLIGHT.FLIGHTS_TEST_PRED_DIST')
Here is the list of parameters I passed when calling this SP:
- model: the name of the decision tree model, FLIGHT_DECTREE
- intable: the name of the input table to generate predictions from
- outtable: the name of the table the SP will create and store predictions in
- prob: a Boolean flag indicating whether to include each prediction's probability in the output
- outtableprob: the name of the output table where the probability of each prediction will be stored
6. Evaluate the model
Using the generated predictions for the test dataset, I compute several metrics to evaluate the quality of the model's predictions.
Creating a confusion matrix
I use the CONFUSION_MATRIX SP to create a confusion matrix based on the model's predictions for the TEST dataset.
CALL IDAX.CONFUSION_MATRIX('intable=FLIGHT.FLIGHTS_TEST, resulttable=FLIGHT.FLIGHTS_TEST_PRED, id=ID, target=FLIGHTSTATUS, matrixTable=FLIGHT.FLIGHTS_TEST_CMATRIX')
Here are some of the key parameters I passed when calling this SP:
- intable: the name of the table containing the dataset and the actual values of the target column
- resulttable: the name of the table containing the column with the model's predicted values
- target: the name of the target column
- matrixTable: the output table where the SP will store the confusion matrix
After the SP completes its run, we have the following output table with statistics for the confusion matrix.
FLIGHTS_TEST_CMATRIX:
This table has three columns. The REAL column holds the actual flight status, and the PREDICTION column holds the predicted flight status. Since the flight status takes two values, 0 (on time) or 1 (delayed), there are four possible combinations of the REAL and PREDICTION values:
- TRUE NEGATIVE (REAL: 0, PREDICTION: 0): the model accurately predicted the status of the flights that arrived on schedule. From the CNT column, we see that 11795 rows of the TEST table fall into this combination.
- FALSE POSITIVE (REAL: 0, PREDICTION: 1): flights that actually arrived on time but that the model predicted to be delayed; 671 flights fall into this combination.
- FALSE NEGATIVE (REAL: 1, PREDICTION: 0): flights that arrived late but that the model predicted to be on time; the CNT column shows 2528 such flights.
- TRUE POSITIVE (REAL: 1, PREDICTION: 1): the model accurately identified the flights that were late; the count is 4981.
I use these counts to compute several evaluation metrics for the model, using the CMATRIX_STATS SP as follows:
CALL IDAX.CMATRIX_STATS('matrixTable=FLIGHT.FLIGHTS_TEST_CMATRIX')
The only parameter this SP needs is the name of the table containing the statistics generated by the CONFUSION_MATRIX SP in the previous step. CMATRIX_STATS generates two sets of output: the first shows the overall quality metrics of the model, and the second shows the model's predictive performance for each class.
The first output, the overall model metrics, includes the counts of correct and incorrect predictions, the overall accuracy, and the weighted accuracy. From this output, I see that the model has an overall accuracy of 83.98% and a weighted accuracy of 80.46%.
With classification tasks, it's usually helpful to view the model's quality metrics for each individual class. The second output from the CMATRIX_STATS SP contains these class-level quality metrics.
For each class, this output includes the True Positive Rate (TPR), the False Positive Rate (FPR), the Positive Predictive Value (PPV), or precision, and the F-measure (F1 score).
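As a sanity check, these metrics can be recomputed directly from the four confusion matrix counts above. In the sketch below, weighted accuracy is computed as the unweighted mean of the two per-class recalls, a common definition that lands within a few hundredths of a percentage point of the SP's reported 80.46%:

```python
# confusion matrix counts from FLIGHTS_TEST_CMATRIX
tn, fp, fn, tp = 11795, 671, 2528, 4981

total = tn + fp + fn + tp
accuracy = (tn + tp) / total            # fraction of correct predictions
tpr = tp / (tp + fn)                    # recall for the delayed class (1)
tnr = tn / (tn + fp)                    # recall for the on-time class (0)
weighted_accuracy = (tpr + tnr) / 2     # mean of the per-class recalls
fpr = fp / (fp + tn)                    # on-time flights flagged as delayed
precision = tp / (tp + fp)              # PPV for the delayed class
f1 = 2 * precision * tpr / (precision + tpr)

print(f"accuracy={accuracy:.2%}")       # accuracy=83.98%, matching CMATRIX_STATS
print(f"weighted={weighted_accuracy:.2%} TPR={tpr:.2%} FPR={fpr:.2%} "
      f"precision={precision:.2%} F1={f1:.2%}")
```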
Conclusions and key takeaways
If you want to build and deploy an ML model in a Db2 database using Db2's built-in stored procedures, I hope you'll find this tutorial useful. Here are its main takeaways:
- It demonstrates a complete workflow for creating and using a decision tree model in a Db2 database with the in-database ML stored procedures.
- For each step in the workflow, it provides concrete, functional SQL statements and stored procedure calls, and, where applicable, explains intuitively what each code example does along with its inputs and outputs.
- It includes references to IBM Db2's documentation for the ML stored procedures used in this tutorial.
O'Reilly's 2022 AI Adoption survey [3] highlighted the challenges of building technical infrastructure and the skills gap as two of the top bottlenecks to AI adoption in the enterprise. Db2 addresses the first by supplying end-to-end ML infrastructure inside the database. It also lessens the second, the skills gap, by providing a simple SQL API for developing and using ML models in the database; in the enterprise, SQL is a far more common skill than ML.
Check out the following resources to learn more about the ML features in IBM Db2 and to see more examples of ML use cases implemented with these features.
Explore Db2 ML Product Documentation
Explore Db2 ML samples in GitHub
References
- Paleyes, A., Urma, R.G. and Lawrence, N.D., 2022. Challenges in deploying machine learning: a survey of case studies. ACM Computing Surveys, 55(6), pp.1–29.
- Amershi, S., Begel, A., Bird, C., DeLine, R., Gall, H., Kamar, E., Nagappan, N., Nushi, B. and Zimmermann, T., 2019, May. Software engineering for machine learning: A case study. In 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP) (pp. 291–300). IEEE.
- Loukides, Mike. AI Adoption in the Enterprise 2022. https://www.oreilly.com/radar/ai-adoption-in-the-enterprise-2022/