MADlib Release Notes
--------------------

These release notes contain the significant changes in each MADlib release,
with the most recent versions listed at the top.

A complete list of changes for each release can be obtained by viewing the git
commit history located at https://github.com/madlib/madlib/commits/master.

The current list of bugs and issues can be found at http://jira.madlib.net.
--------------------------------------------------------------------------------
MADlib v0.6

Release Date: 2013-Apr-01

New Features / Improvements:
    * Generic cross-validation:
        - Support for k-fold cross-validation of any supervised learning
          algorithm
    * Heteroskedasticity of linear regression
        - Support for calculating heteroskedasticity via Breusch-Pagan test
    * Grouping support for linear regression
        - Support for linear regression on each group of data grouped by one
          or multiple columns
    * Grouping support for logistic regression
        - Refactor of logistic regression code
        - Support for logistic regression on each group of data grouped by one
          or multiple columns
        - Grouping support is added to the convex optimization framework
    * LDA:
        - Improved performance and scalability (MADLIB-480)
    * Elastic net regularization for both linear and logistic regressions
        - Support for FISTA and IGD optimizers
    * Summary function
        - Support for an overview of a data table
    * Eigen package upgrade
        - MADlib v0.6 now uses Eigen 3.1.2
    * Unit testing framework:
        - A new unit testing framework is added for the C++ abstraction layer

Bug Fixes:
    * C++ abstraction layer:
        - Improved handling of NULL values in the input array (MADLIB-773)
    * Naive Bayes:
        - Improved the handling of NULL values. (MADLIB-749)

Known Issues:
    * K-means:
        - K-means crashes on some data sets when the dimensionality of the
          points is not uniform across the data set. (MADLIB-789)
    * Distribution Functions:
        - Certain quantile functions will abort their session on invalid input.
          (MADLIB-786)
    * Multinomial Logistic Regression:
        - Signs of coefficient outputs are inconsistent with other tools like
          R and Stata. (MADLIB-785)

--------------------------------------------------------------------------------
MADlib v0.5

Release Date: 2012-Nov-15

Bug Fixes:
    * K-means:
        - Improved handling of invalid arguments (MADLIB-359, 361)
    * Sketch-based estimators:
        - Addressed security vulnerability (MADLIB-630)

New Features / Improvements:
    * Association Rules (Apriori):
        - Improved reporting output format for better usability (MADLIB-411)
        - Significant improvement in performance (MADLIB-638)
    * C++ (Database) Abstraction Layer:
        - Extension to support modular transition states (MADLIB-499)
        - Extension to support functions returning a set of values
          (MADLIB-638)
    * Conditional Random Fields:
        - Support for Linear Chain Conditional Random Fields for NLP
          (MADLIB-628)
    * Decision Tree:
        - Improved performance for C4.5 and Random Forests (MADLIB-605)
        - Improved encoding (MADLIB-590)
    * Infrastructure:
        - Convex optimization framework
    * K-means:
        - Code refactoring and improved performance (MADLIB-454, MADLIB-522,
          MADLIB-678)
        - Silhouette function for k-means (MADLIB-681)
    * Low-rank Matrix Factorization
        - New module
    * Logistic Regression:
        - Support for Multinomial Logistic Regression (MADLIB-575)
    * Naive Bayes
        - Significant improvement in performance (MADLIB-611, 619, 626)
    * Regression Analysis:
        - Support for Cox Proportional Hazards test (MADLIB-576)
    * Sampling
        - Added weighted sampling of a single row (MADLIB-584)
    * SVD Matrix Factorization:
        - Improved performance (MADLIB-578)

Documentation:
    * Conditional Random Fields:
        - Example added for CRF module (MADLIB-731)
    * SVD Matrix Factorization:
        - Incremental-gradient SVD algorithm (MADLIB-572)

Known issues:
    * Multinomial Logistic Regression:
        - Number of independent variables cannot exceed 65535 (MADLIB-665)
    * Naive Bayes:
        - Current implementation of Naive Bayes is only suitable for
          categorical attributes (MADLIB-679)
        - NULL input
          values not accepted for attributes (MADLIB-614)
        - NULL probabilities given for test set values not seen in training
          set (MADLIB-523)

--------------------------------------------------------------------------------
MADlib v0.4.1

Release Date: 2012-Aug-9

Bug Fixes:
    * PGXN:
        - Fixed installation problem that could occur on some platforms
          (MADLIB-589)

New Features/Improvements:
    * C++ Abstraction Layer:
        - Increased ABI compatibility across multiple Greenplum versions
          (MADLIB-606)
    * Hypothesis Tests:
        - Tests that are not implemented as ordered aggregates are now also
          installed on PostgreSQL 8.4 and Greenplum 4.0.

--------------------------------------------------------------------------------
MADlib v0.4

Release Date: 2012-Jun-18

Bug Fixes:
    * Association Rules:
        - assoc_rules() now uses schema-qualified function calls (MADLIB-435)
    * Decision Trees:
        - Enhanced correctness (MADLIB-409, 502, 503)
        - Improved handling of invalid arguments (MADLIB-331)
    * k-Means:
        - Improved handling of invalid arguments (MADLIB-336, 364, 459)
    * PLDA:
        - Improved robustness (MADLIB-474)
    * Sparse Vectors:
        - svec_sfv() now uses locale-aware sorting (MADLIB-457)
        - Operators now install to MADlib schema (MADLIB-470)

New Features/Improvements:
    * C++ Abstraction Layer:
        - Support for "function pointers" (MADLIB-370)
        - Support for sparse vectors (MADLIB-371)
        - Support for more Eigen (linear algebra) types (MADLIB-533)
    * Decision Trees:
        - Code refactoring and optimization (MADLIB-410, 476, 504, 509)
        - Documentation improvements (MADLIB-507)
        - Output table now contains unencoded information (MADLIB-434)
        - Enhanced missing-value handling for continuous features
          (MADLIB-493)
    * Hypothesis Tests:
        - Pearson chi-square test (MADLIB-390)
        - One- and two-sample t-Tests (MADLIB-391)
        - F-test (MADLIB-392)
        - Mann-Whitney U-test (MADLIB-393)
        - Kolmogorov-Smirnov test (MADLIB-394)
        - Wilcoxon-Signed-Rank test (MADLIB-405)
        - One-way ANOVA (MADLIB-406)
    * PostgreSQL Extensibility:
        - Support for CREATE EXTENSION in
          PostgreSQL >= 9.1 (MADLIB-316)
        - Availability on PGXN (MADLIB-334)
    * Probability Functions:
        - Wrap all distribution functions implemented by Boost (MADLIB-412)
        - Wrap Kolmogorov distribution function from CERN ROOT project
          (MADLIB-413)
    * Random Forests:
        - New module (MADLIB-419)
    * Support:
        - Add elementary matrix/vector functions (e.g., norms and distances)
          (MADLIB-532)
    * Viterbi Feature Extraction:
        - New module (MADLIB-478)

Known issues:
    - svec_sfv() does not support collations, as introduced with PostgreSQL
      9.1 (MADLIB-558)
    - Invalid arguments are not always guaranteed to be handled gracefully and
      may lead to confusing error messages (MADLIB-28, 359, 361, 363)

--------------------------------------------------------------------------------
MADlib v0.3

Release Date: 2012-Feb-9

New features:
    * Installer:
        - Single installer package targeting all supported DBMSs per OS
          (MADLIB-218)
    * C++ Abstraction Layer:
        - Switched from using Armadillo to using Eigen for linear-algebra
          operations, thereby eliminating the dependency on LAPACK/BLAS
          (MADLIB-275)
        - Reimplemented as a template library for performance improvements
          (MADLIB-295)
    * Decision Trees:
        - Major update
        - Now supports multiple split criteria (information gain, gini, gain
          ratio)
        - Now supports tree pruning using a validation set to address
          overfitting
        - Now supports additional functions for tree output
        - Now supports continuous features in addition to categorical features
        - Additional support for handling null values
        - Improved scalability and performance
    * k-Means Clustering:
        - Now handles any input that is convertible to SVEC.
          (MADLIB-42)
        - Multiple distance functions (L1-norm, L2-norm, cosine similarity,
          Tanimoto similarity) (MADLIB-43)
        - Supports multiple seeding methods (kmeans++, random, user-specified
          list of centroids)
        - Replaced goodness of fit with the (simplified) Silhouette
          coefficient (MADLIB-45)
        - New run-time parameters (MADLIB-47)
    * Linear Regression:
        - Major speed improvement
    * Logistic Regression:
        - Major speed improvement
        - Now handles any input that is convertible to BOOLEAN (dependent
          variable) or DOUBLE PRECISION[] (independent variables).
          (MADLIB-283)
        - An under-/overflow safe version to evaluate the (usual) logistic
          function, for scoring logistic regression (MADLIB-271)
        - A third optimizer: Incremental-gradient-descent (MADLIB-303)
    * Support:
        - For Greenplum <= 4.2.0, added a workaround for INSERT INTO in the
          same way as the existing CREATE TABLE AS workaround. This workaround
          is no longer needed in Greenplum >= 4.2.1. (MADLIB-265)
        - Function version() returns MADlib build information (MADLIB-309)

Bug fixes:
    * Sparse vectors:
        - Fixed sparse-vector type case problems (MADLIB-282, MADLIB-305)
        - Fixed a situation where using svec_sfv() could cause a segmentation
          fault (MADLIB-350)
        - Increased compatibility with internal PostgreSQL conventions
          (MADLIB-257)
    * Logistic regression:
        - Handle numerical instability more gracefully (MADLIB-343,
          MADLIB-345)
        - Handle unexpected inputs more gracefully (MADLIB-284, MADLIB-344)
        - Fixed "Random variate x is nan, but must be finite" issue
          (MADLIB-356)

Known issues:
    - Decision Trees not supported on Greenplum 4.0 (MADLIB-346, MADLIB-347)
    - K-means: the error '"nan" does not exist' may be raised when input
      vectors contain NaN.
      (MADLIB-364)
    - Association Rules require the madlib schema to be in the search path
      (MADLIB-353)
    - Invalid arguments are not always guaranteed to be handled gracefully and
      may lead to confusing error messages (MADLIB-28, 336, 359, 361, 363,
      364)

--------------------------------------------------------------------------------
MADlib v0.2.1beta

Release Date: 2011-Sep-14

General changes:
    * numerous improvements to the C++ abstraction layer:
        - code clean-up
        - fixed issue where incorrect values were returned when used with
          debug builds of PostgreSQL/Greenplum (MADLIB-253)
        - fixed issue where returning arrays to PostgreSQL/Greenplum could
          lead to a crash (MADLIB-250)
        - allocated memory is now 16-byte aligned for improved stability and
          performance (MADLIB-236)
    * now compiling with advanced warnings enabled by default
    * all C/C++ code now free of warnings. On gcc <= 4.6, there might still be
      warnings due to "unclean" macros in DBMS header files (MADLIB-228)
    * prepared for Solaris support in a later release (MADLIB-204)
        - added support for Sun Compiler in CMake build script
        - fixed all compilation errors with Sun compiler
    * added UDF to mimic "CREATE TABLE AS ...", as a workaround for a
      Greenplum issue (MADLIB-241). Included this as GP Compatibility module.
    * madpack utility:
        - dropped madpack dependency on PygreSQL (MADLIB-217)
        - improved security in madpack install-check (MADLIB-229)
        - fixed bashism in madpack (MADLIB-222)
        - fixed install-check not running on non-default schema (MADLIB-251)

Modules/methods:
    * SVM (kernel_machines):
        - fixed cumulative error count in svm_cls_update() function
        - improved memory management in SVM module
    * Linear regression (regress):
        - fixed unexpected behavior for some edge cases (MADLIB-214)
        - fixed crash with a huge number of independent variables
          (MADLIB-250)
    * Logistic regression (regress):
        - added support for arbitrary expressions for dep./indep.
          variables, not just column names (MADLIB-255)
    * Quantile:
        - fixed quantile() function to be exact
        - added simple version for small data sets
    * Sparse Vectors:
        - added check for sorted dictionary to svec_sfv (MADLIB-187)
    * Decision Tree (decision_tree):
        - can now be run multiple times in one session (MADLIB-156)

Known issues:
    * non-unified API for several SQL UDFs (MADLIB-208)
    * performance of the conjugate-gradient optimizer in logistic regression
      can be very poor (MADLIB-164)

--------------------------------------------------------------------------------
MADlib v0.2.0beta

Release Date: 2011-Jul-8

General changes:
    * new build and installation framework based on CMake
    * new C++ abstraction layer for easy and secure method development
    * new database installation utility (madpack)

Modules/methods:
    * new: Association Rules (assoc_rules)
    * new: Array Operators (array_ops)
    * new: Decision Tree (decision_tree)
    * new: Conjugate Gradient (conjugate_gradient)
    * new: Parallel LDA (plda)
    * improved: all methods from previous release

Known issues:
    * non-unified API for several SQL UDFs (MADLIB-208)
    * running decision tree more than once in one session fails (MADLIB-156)
    * performance of the conjugate-gradient optimizer in logistic regression
      can be very poor (MADLIB-164)
    * svec_sfv function doesn't check for sorted dictionary (MADLIB-187)

--------------------------------------------------------------------------------
MADlib v0.1.0alpha

Release Date: 2011-Jan-31

Initial release.

Included modules/methods:
    * Naive-Bayes Classification (bayes)
    * k-Means Clustering (kmeans)
    * Support Vector Machines (kernel_machines)
    * Sketch-based Estimators (sketch)
    * Sketch-based Profile (data_profile)
    * Quantile (quantile)
    * Linear & Logistic Regression (regress)
    * SVD Matrix Factorisation (svdmf)
    * Sparse Vectors (svec)

--------------------------------------------------------------------------------
MADlib v0.1.0prerelease

Release Date: 2011-Jan-25

Demo release.