Makogon R. О. “CAUSALITY BAYESIAN NETWORK ANALYSIS USING SAS BASE PROGRAMMING FOR DATA-MINING” – Кафедра інформатики і інформаційно-комунікаційних технологій

4-th year student

Educational-Scientific Complex “Institute for Applied System Analysis”

National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”

Terentiev O. M.

There is a great need for methods like Bayesian networks that allow finding out hidden relations between different variables in datasets, determining dependencies and patterns. Fields like diseases diagnosis by the sensor readings, automated crops supervision, based on the state of indicators, credit scoring, data classification, image processing possess wide datasets with raw data. All of these areas are currently on the brink of progress, so various methods are being developed to provide data analysis tools, capable of managing these datasets.

The purpose of this paper is to suggest the method of constructing Bayesian networks, using only input dataset to work with. It succeeds independently of network`s size or number of variables and unknown dependencies between them. The suggested program uses SAS Base programming language to build such network.

The example dataset contains 7.2 thousand observations divided by TAB delimiter. There are some categorical variables and Boolean ones. There are 9 different variables in the example dataset. It is inputted, using SAS Base programming language. Then the macro cycle is being used to provide the code compatibility with every similar TAB delimited dataset. So the column names from the original dataset are being retrieved to be used as a unique identifier for the dataset variables.

Having successfully read the data, method determines every one of the unique pairs of variables, assuming there is nodes in the network in total. It completely matches with the number of the columns of the inputted data. The method is using column names to identify and process the data.

Then the method needs to calculate the MI value for each such pair of variables. That step needs to be examined more closely. So the first step to achieve that would be counting the probabilities for each variable in the dataset to represent each one of its possible values.

MI calculation is done in a user friendly way, so it is possible to examine each one of the intermediate results and auxiliary datasets, gained on each step, the program was working with.

The next step would be to compute a matrix for each of node`s pairs, containing this pair`s mutual probabilities.

Every result of the step of the mutual probabilities calculation is also available for the analysis.

Having completed all needed auxiliary computations, the MI value is being estimated for every pair of the variables in dataset, using the formula provided.

Then there is an iteration process to collect all MI values from the pair`s datasets into one target dataset, to be able to work with it in the future.

The dataset is then being sorted by this values in the decreasing order, as the method suggests.

To start building different structures and estimating their MDLs some auxiliary computations are also needed. Firstly, entropy of each pair of variables is being computed, using previously constructed datasets with the mutual probabilities.

So it is the time to start searching for an optimal structure. The method creates a new matrix, the columns represent pairs of variables, rows stand for each structure of the built Bayesian network. The method starts with one row dataset, containing only zeroes, as there are no connections built yet.

Starting from the first column, the row is being tripled to represent each possible outcome for this pair of nodes: the first one if a parent for the second one, an opposite situation and a no connections situation.

Then the MDL value is being computed for each row of this dataset, using previously estimated H dataset, and only rows with minimum possible MDL can stay. So then this tripling procedure continues over time for all pairs, till the optimal Bayesian network structure, or structures, would be returned.

Having the matrix of the connections, it is possible to construct the network itself. So in the end the next structure was built by the method.

The results of the method are represented in a graphic structure. The SAS Base programming language was used for better performance and ability to process large amounts of data. The heuristic algorithm, based on mutual information and minimal description length functions was used. Some others datasets were also tested to check the program performance. Speed results show an under ten seconds time mark, whereas the exhausted search would take several hours of computing time just for some of them. The program is compatible with every SAS product and can be run and be used or tested there.

REFERENCES

Chow C.K., Liu C.N. Approximating discrete probability distributions with dependence trees. //IEE Transactions on information theory, May 1968. – Vol. IT-14, № – 6p.
Zheng Y. and Kwoh C.K. Improved MDL Score for Learning of Bayesian Networks // Proceedings of the international conference on artificial intelligence in science and technology (AISAT 2004). – 2004. – P. 98-103
Heckerman D., Geiger D., Chickering D.M., Learning Bayesian Networks: The Combination of Knowledge and Statistical Data // Machine Learning, v.20 n.3. – P. 197-234

Залишити відповідь Скасувати відповідь