Hauptnavigation

mySVM

mySVM/db - Support Vector Machine for Relational Databases

by Stefan Rüping, rueping@ls8.cs.uni-dortmund.de

About mySVM/db

mySVM/db is an implementation of the Support Vector Machine introduced by V. Vapnik (see [Vapnik/98a]). It is a further developement of mySVM and designed to run directly inside a relational database using an internal JAVA engine. The original algorithm is based on the optimization algorithm of SVMlight as described in [Joachims/99a]. mySVM/db provides code for both pattern recognition and regression analysis.

The goal of mySVM/db is to provide users of data-mining applications with an easy to use SVM implementation that makes use of the numerous advantages for handling large amounts of data - like usability, managability, security and standardized interfaces - that modern database systems offer. On the downside, please not that the direct connection of the program to a database server has a serious impact on runtime. So, if performance is your top goal, you should export your data from the database an run some SVM software (e.g. mySVM) on an own machine.

Please note that for now, mySVM/db should be still regarded as beta software. Do not run it on production servers unless you have tested it for your purposes thoroughly. In particular, mySVM/db was designed such that it only needs read access to the input data, so you should run it with minimal access rights. In conclusion, as the license says: The author is not responsible for implications from the use of this software.

License

This software is free only for non-commercial use. It must not be distributed without prior permission of the author. The author is not responsible for implications from the use of this software.

Download

Download the newest version of mySVM/db: mysvmDB.0.9.1.zip (Version 0.9.1, October 16th, 2002).

Installation

Installation Requirements:

mySVM/db was developed and tested on an Oracle 8.1.6 database. It should run on any newer Oracle database without a problem. With small modifications it should also run on any other database offering an JDBC interface. All you need to change is writing a JAVA class for your database system, corresponding to the one mySVMdb/Container/OracleDatabaseContainer.java, that instanciates the correct database driver.

If you are using mySVM/db on any other database than Oracle 8.1.6, please e-mail me your experiences to rueping@ls8.cs.uni-dortmund.de.

Installation on Oracle 8.1:

  • Take the downloaded zip archive and load the class files of mySVM/db into the database: loadjava -user scott/tiger -resolve -verbose mySVMdb.zip (you are clever enough to replace scott/tiger with your own user name / password, aren't you? :-) )
  • Open a sqlplus-session and create the corresponding procedure: create or replace procedure mySVM (param_table VARCHAR2) as language java name 'edu.udo.cs.mySVMdb.Learn.learn(java.lang.String)';

Using mySVM/db

What mySVM/db does, roughly:

When mySVM/db is called, it needs to be given the name of a parameter table. It then reads all necessary parameters, including the names for the input and output table, from the parameter table. For the optimization process, mySVM/db reads the examples from the input table. It also creates a temporary table to store some intermediate results. After the optimization is finished, the SVM model is stored in a new table. Also, a new view on the input table is created that shows the SVM predictions for each example.

The parameter table:

The parameter table must contain two VARCHAR rows called PARAMETER and VALUE, which contain the name and the value of each parameter. Please note, that there will be no error message if the table contains unknown parameter names, they will be silently ignored. Only wrong values and important missing parameters will be reported.

The input table:

WARNING: The spelling of all parameters are case sensitive. All mySVM/db-parameters have lower-case names.

The name of the input table (which actually can be a view as well) is given by the parameter trainset. The input table itself must consist of three parts: a column giving a unique key for each example, a column for the target value and columns for the attribute values. There may be additional columns in the table.

The key column is specified by the parameter key_column. If no such parameter is given, the Oracle pseudo-column ROWID is used, which is the recommended setting. If you use another key column, make sure there you have an index on this column.

The column of target values is given by the parameter y_column with a default name of Y. For regression analysis, this column must consist of numerical values. For pattern recognition, there are two possibilities: internally, the SVM uses values of -1 for negative examples and +1 for positive examples. If your data is already in this format, you don't have to do anything. In all other cases, you need to supply a parameter target_concept. All examples whose target value equals the value of target_concept are regarded as positive examples, all other examples are negative examples. Note that in the final prediction, confidence values for the target concept are given, not the concept itself.

The attributes can be specified by the parameter x_column. Each column in the attribute vector needs an own x_column entry in the parameters table. The columns must be of a type that can be converted into a real number. If no x_column parameter is specified, all columns of the table of matching types (excluding y_column and key_column) are used.

By default, all attribute columns are scaled to expectancy 0 and variance 1 before training. This is advised to avoid numerical problems in the optimization process. If your attribute values are already sensibly scaled, you can avoid the additional scaling by supplying a parameter scale with empty value.

If you don't want to train on all examples, you can provide a list of the keys of the training set by setting up the output table (a table consisting of a VARCHAR row called KEY and a NUMBER row called ALPHA) and inserting the keys into this table. Also, enter a parameter read_keys_from_model with value true. If this parameter is not given or there are no keys in the output table, all examples in the input table will be used.

The output:

The output consists of two parts: the SVM model, i.e. the Lagrangian multipliers of the support vectors, and a view of this table and the input table, that gives the SVM prediction for each example. For pattern recognition problems, the prediction is a confidence value for the target concept, i.e. the example is predicted to belong to the concept, if the value is greater than 0.

The name of the model table is given by the parameter model_name, the name of the view is given by view_name. The default model name is SVM_MODEL. If no parameter view_name exists, no view will be created.

If you create the output table before the training, please note that the table may be deleted and re-created during the training process.

if you start mySVM/db from a terminal, there will also be output on your screen. The parameter verbosity controls the amount of output that is generated. Default is 3, the higher, the more output.

Parameters concerning the SVM:

The following parameters control the SVM itself:
ParameterPossible valuesDefault valueDescription
svm_typeregression, patternregressionThe type of the SVM.
Creal values > 01Capacity constant.
Cposreal values > 0CCapacity constant for positive examples. Overrides C.
Cnegreal values > 0CCapacity constant for negative examples. Overrides C.
balance_costtrue, falsefalseMultiplies Cpos by the fraction of positive examples in the data and Cneg by the fraction of negative examples in the data.
quadraticLossPostrue, falsefalseUses a quadratic loss function on the positive examples.
quadraticLossNegtrue, falsefalseUses a quadratic loss function on the negative examples.
epsilonreal values >= 00Allowed error for regression estimation.
epsilon_posreal values >= 00Allowed error for positive examples for regression estimation.
epsilon_negreal values >= 00Allowed error for negative examples for regression estimation.

Parameters concerning the kernel function:

Currently, two kernels are implemented, the dot product and the radial basis function kernel. The kernel is defined by the this parameters:
ParameterPossible valuesDefault valueDescription
kernel:typedot, radialdotKernel type.
kernel:gammareal values >= 01Parameter gamma of RBF kernel: K(x,y) = exp(-gamma*||x-y||^2).

Parameters for experts:

These are parameters that control the optimization process of the SVM. Usually, the default values do not need to be changed. Don't touch this, unless you know what you're doing!

ParameterPossible valuesDefault valueDescription
working_set_sizeinteger > 110Size of the working set.
is_zeroreal value > 01e-10Numerical precision of the optimizer.
convergence_epsilonreal value > 01e-3Allowed error on the KKT conditions.
shrink_constinteger > 150Number of iterations before examples at bound is considered for shrinking.
descendreal value > 01e-15Minimum descend of the target function that the optimizer must make.
max_iterationsinteger >= 030000Maximum number of working set iterations.

Using mySVM/db outside the database:

You can also run mySVM/db from outside the database, letting it make its connection via JDBC. To do that, compile the source files as any other java programm and call the class edu/udo/cs/mySVMdb/Learn. You need to specify the paramteres table and the database URL, login and password as command line parameters. The format is: Learn db_url:url db_login:login db_password:password parameter_table. For example: "Learn db_url:jdbc:oracle:oci8:@some_instance db_login:scott db_url:tiger my_param_table".

References

Joachims/99a Joachims, Thorsten (1999). Making large-Scale SVM Learning Practical. In Advances in Kernel Methods - Support Vector Learning, chapter 11. MIT Press. [.ps.gz] [.pdf]
Vapnik/98a V. Vapnik (1998). Statistical Learning Theory. Wiley.