In-Memory Columnar Store (IMCS)

Introduction

Columnar store or vertical representation of data allows to achieve better performance in comparison with classical horizontal representation due to three factors:

  1. Reducing size of fetched data: only columns involved in query are accessed.
  2. Vector operations. Applying an operator to set of values (tile) makes it possible to minimize interpretation cost. Also SIMD instructions of modern processors accelerate execution of vector operations.
  3. Compression of data. Certainly compression can also be used for all the records, but independent compression of each column can give much better results without significant extra CPU overhead. For example such simple compression algorithm like RLE (run-length-encoding) allows not only to reduce used space, but also minimize number of performed operations.
There are several database systems based on vertical data model: Vertica, SciDB,... There are also extensions to existed DBMSes, such as "Oracle In-Memory Option". This plug-in tries to provide such functionality for PostgreSQL.

Overview

As it is clear from the abbreviation (IMCS: In-Memory Columnar Store) this plugin adds to PostgreSQL in-memory columnar store. So vertical representation of data is complementary to standard horizontal representation. Data is imported in PostgreSQL database in usual way and is stored in normal table. Then columns from this table are fetched and stored in shared memory. IMCS provides a lot of timeseries functions which can be used for data analysis. Operations with timeseries are performed in vector mode allowing to reach maximal possible speed of such operations. Also IMCS makes it possible to parallelize execution of some queries (for example calculation of aggregates) and utilize all CPU cores. All this three factors: in-memory location of data, vector operations, parallel query execution, makes it possible to increase speed of some queries more than 100 times comparing with standard PostgreSQL queries.

To make access to timeseries as convenient as possible, IMCS provides generator of access functions. You should specify name of source table or view (from which data will be imported), name of timestamp field (this is a main key by which timeseries elements are accessed) and optionally timeseries identifier. The last one needs some explanations. In some cases all data from the table should be placed in a single timeseries. For example assume that we collect data about phone calls (date, duration, caller, callee,...). It will be a single timeseries. But for example in trading systems there are separate data (ticks) associated with each symbol. So we have separate timeseries for ABB, GOOG, IBM, YHOO,... In this case securities identifier (symbol) can be considered as identifier of timeseries.

IMCS supports the following element type for timeseries: "char", int2, int4, date, int8, time, timestamp, money, float4, float8, bpchar. All timeseries elements should have the same size, so only fixed size character types are supported: for example char(10), but not varchar. But it is possible map varying size strings into integer identifiers using IMCS dictionary. It will greatly reduce space used by columnar store and reduce queries execution time (manipulations with integers are more efficient than with strings). Certainly this approach works only if cardinality of such column is not so large: dictionary should fit in memory. Size of dictionary can be specified using "imcs.dictionary.size" parameter. Default value is 64kb. If size of dictionary is less or equal than 64kb, then IMCS uses two bytes integer to store string identifier. If it is larger than 64kb, then four bytes identifier is used. Please notice that the same dictionary is used for all table and columns. So dictionary size should be greater or equal than total size of cardinalities of all unlimited varchar columns loaded in columnar store. IMCS is able to automatically converts strings to identifiers and visa verse in output/input functions. But you can also explicitly translate identifier to string using cs_code2str function.

Also IMCS is not able to represent NULL values. It is not enforced that fields of the source table were declared as NOT NULL, but attempt to insert NULL value in timeseries will cause error (or optionally NULL can be substituted with zero). Please use default values instead of NULLs.

Given all this information IMCS generates corresponding types and functions for loading/appending/accessing this timeseries. Assume that we have table Quote. After calling cs_create('Quote', 'Day', 'Symbol') we will get Quote_load() function for loading data from table in memory, Quote_get(symbol char(10), low date, high date) function for fetching/slicing corresponding timeseries and triggers which will keep track updates in Quote table and propagate this changes to timeseries.

There are two ways of synchronizing original table and timeseries:

  1. Automatic: using triggers. In this case all inserts/deletes in original table are immediately reflected in timeseries.
  2. Manual: using explicit invocation of load/append/delete methods.
Execution of load() is significantly more efficient than propagation of updates using triggers. Mostly because of slowness of PL/pgSQL. Also please notice, that been stored in shared memory, timeseries have to be reloaded after restart of the server. Unfortunately PostgreSQL doesn't support database level triggers (like after startup on database in Oracle). IMCS provides two alternatives: use autoload mode or manually load data. In case of using autoload mode, data will be automatically loaded from table to columnar store on demand when it is first accessed by any query. Please notice that for large tables loading data can take substantial amount of time and so increase execution time of the query initiated this load (it can confuse an user which expects this query to complete very fast). Fortunately database servers are not restarted frequently...

When data is loaded from the table, records are sorted by timestamp and inserted in ascending order. You can append data to existed timeseries, but timestamps of inserted elements should be greater than already loaded. When timeseries is populated using insert trigger it is necessary to enforce that the data is inserted in the table in timestamp ascending order. Otherwise out-of-order error will be reported while inserting element in timeseries.

TABLE_get functions returns row of type TABLE_timeseries (this type is also generated by IMCS) which has the same columns as original table, but type of this columns is timeseries. So it is possible to refer to this timeseries as to any other columns and apply timeseries functions to them. For example query:

    select cs_max(Close) from Quote_get('IBM');
returns maximal close price for IBM. IMCS provide standard operators for timeseries type, allowing to write queries with more complex expressions in standard way:
    select cs_avg(High - (Open + Close)/2) from Quote_get('IBM');
Result of the query above is scalar value (because of used grand aggregate). But most of timeseries functions take timeseries as input and return also timeseries. For example result of the query below is timeseries:
    select cs_filter(Open < Close, Day) from Quote_get('IBM');
When you print result of execution of this query at the screen (for example by running this query in psql), it will be represented as large string literal in braces: 'date:{01/01/2010, 01/02/2010,...}' Certainly it is not convenient for really large timeseries and may even cause memory exhaustion. Alternatively it is possible to change vertical representation back to horizontal representation using TABLE_project or cs_project functions. Then produced tuples can be accessed in normal way using all SQL stuff. For example it is possible to sort them or perform more grouping/filtering.

Functions

General columnar store functions

General columnar store functions are used to generate table-specific API functions, get information about columnar store and perform cleanup.

FunctionDescription
function cs_create(table_name text, timestamp_id text, timeseries_id text default null, autoupdate bool default false) returns void This function is used to generate all API functions, types and triggers for the specified table or view table_name. These can be latter removed using table_name_drop function. timestamp_id is name of timestamp field by which timeseries elements are sorted in ascending order, allowing to efficiently extract time slices. timeseries_id is optional field identifying timeseries. For example for quotes it can be a symbol name. If this field is specified, then separate timeseries will be maintained for each symbol. If autoupdate parameter is true, then IMCS will create triggers which automatically update timeseries when new data is added/deleted to/from the source table. Alternatively it is possible to explicitly load/append/delete data to timeseries. Please notice that explicit bulk update/delete is significantly more efficient than row-level updated performed by trigger. If columnar store interface for a table was generated with autoupdate=false, then triggers are still generated but are disabled. You can enable them later using alter table TABLE enable trigger user command. As far as views cannot have row-level BEFORE or AFTER triggers in PostgreSQL, IMCS doesn't generate them if table_name is a view.
function cs_delete_all() returns bigint Deletes all timeseries in columnar store. This function can be used for most efficient cleanup of columnar store. Please notice that PostgreSQL doesn't allow to free shared memory, so it still be in use. But it can be reused in subsequent allocation requests of columnar store. This function returns total number of removed elements (in all timeseries)
function cs_used_memory() returns bigint Returns amount of memory used by columnar store.
function cs_profile(reset bool default false) returns setof cs_profile_item Returns number of calls of each IMCS command. If parameter is true, then all counters are reset after execution of this call.
function TABLE_timestamp() returns varcharReturns name of timeseries timestamp column for this table
function cs_str2code(str varchar) returns integer Returns code of string in IMCS dictionary or -1 if there is not such string. This function may be used to find some particular values in varying string timeseries.
function cs_code2str(code integer) returns varchar Returns string value for specified IMCS dictionary code.
function cs_code2str(str bytea, column_no integer) returns varchar Extracts identifier from compound (concatenated) key and returns correspondent name from dictionary. Column number is 1-based.

Generated data manipulation functions

Generated functions for loading/storing/deleting timeseries.

FunctionDescription
function TABLE_drop() returns void Deletes all generated functions and types for table TABLE.
function TABLE_load(already_sorted bool default false, filter text default null) returns bigint Populates timeseries with data from PostgreSQL table. If already_sorted parameter is true, then it is assumed that records in the table are stored in proper (timestamp ascending) order. Otherwise IMCS will add "order by" clause to select statement. Please notice that PostgreSQL vacuuming can change original order of the records. So disable vacuuming for the table if you want to preserve insert order. Optional filter parameter allows to specify additional selection criteria for table records. It allows to include in timeseries only some subset of the table. Particularly it can be used to append existed timeseries with new data. This function returns number of inserted timeseries elements. If filter is not specified then this function loads data from the table only if timeseries are not yet initialized. If filter is not null, then this functions always tries to load data, assuming that programmer has specified proper filter condition allowing to avoid duplicates and preserve proper timeseries order. If filter is null and timeseries are already initialized, then this function does nothing and immediately returns zero.
function TABLE_is_loaded() returns bool Checks if data was already loaded to columnar store. If you just need to ensure that data is loaded, there is no need to call this function: you can always call TABLE_load, it will perform this check itself and do nothing if data was already loaded. But if behavior of your application depends on state of columnar store, then this function may be useful.
function TABLE_append(start_from TIMESTAMP_TYPE) returns bigint Appends to timeseries records from the source table starting from start_from timestamp (inclusive). Use this function if on-update trigger is disabled (autoupdate=false in parameters of cs_create). Please also notice that this function is implemented in PL/pgSQL and so it is significantly slower than TABLE_load with the same filter condition. This function returns number of added timeseries elements.
function TABLE_truncate() returns void Truncates all timeseries for this table. This is most efficient way to delete vertical representation for the specific table. If you need to delete all data in columnar store, better use cs_delete_all() function.
function TABLE_project(input TABLE_timeseries input, positions timeseries default null,disable_caching bool default false) returns setof TABLE Makes horizontal projection of timeseries. Optional positions parameter specifies positions of selected timeseries elements. If parameter positions is omitted, then all timeseries elements are transformed to horizontal representation. So this function is opposite to TABLE_get(): get transforms horizontal representation to vertical and project does backward transformation. It is possible to use this function only if number of columns returned by TABLE_get() and element types of corresponding timeseries are not changed. For example it is possible to run query like this:
  select (Quote_project(abb.*,cs_top_max_pos(Close, 10))).* 
  from Quote_get('ABB',date('01-Jan-2010'),date('31-Mar-2010'))abb;

  select (Quote_project(abb.*)).* 
  from (select Symbol,Day,cs_maxof(Open,Close),
               High,Low,cs_minof(Open,Close),Volume 
        from Quote_get('ABB')) abb;
but not
  select (Quote_project(abb.*)).* 
  from (select Symbol,cs_maxof(Open,Close) 
        from Quote_get('ABB')) abb;
In the last case it is possible to use cs_project() function:
  select cs_project(abb.*) 
  from (select Symbol,cs_maxof(Open,Close) 
        from Quote_get('ABB')) abb;
Please notice that we can not use ().* clause in this case because cs_project returns anonymous row. But in PostgreSQL 9.3 we can use lateral join:
  select p.* 
  from (select Symbol,cs_maxof(Open,Close) 
        from Quote_get('ABB')) abb, 
             cs_project(abb.*) p(symbol char(10), max real);
Please find more information about projection of timeseries, problems with (...).* construction in PostgreSQL and purpose of disable_caching parameter in section Projection issues.

Generated data access functions for single timeseries

Functions generated for accessing single timeseries (timeseries having no identifier).

FunctionDescription
function TABLE_first() returns TIMESTAMP_TYPE Returns oldest timestamp.
function TABLE_last() returns TIMESTAMP_TYPE Returns most recent timestamp.
function TABLE_count() returns bigint Returns number of elements in timeseries.
function TABLE_get(low TIMESTAMP_TYPE default null, high TIMESTAMP_TYPE default null, limit_ts bigint default null) returns TABLE_timeseries Returns vertical representation of the whole table or its time slice. Returned record contains the same columns as record of the original table, but they have timeseries type instead of original scalar types. These columns can be used in timeseries functions (cs_*). If high or low parameters are not null, then them specify correspondingly upper/lower inclusive boundary for timestamp value. If some or both parameters are omitted, then corresponding boundary is open. It is possible to limit number of selected elements by specifying limit_ts parameter (if low boundary is open then last limit_ts elements will be selected, otherwise first limit_ts elements will be selected).
function TABLE_span(from_pos bigint default 0, till_pos bigint default 9223372036854775807) returns TABLE_timeseries Returns vertical representation of the whole table or its horizontal slice. Returned record contains the same columns as record of the original table, but they have timeseries type instead of original scalar types. These columns can be used in timeseries functions (cs_*). Parameter from_pos specifies start position in timeseries (inclusive) and parameter till_pos specifies end position (inclusive). If till_pos parameter is missed, then subsequence spans till end of timeseries. Values of both from_pos and till_pos parameters can be negative. In this case position is calculated from end of timeseries, i.e. TABLE_span(from_pos:=-1) extracts last element of the timeseries.
function TABLE_delete(low TIMESTAMP_TYPE, high TIMESTAMP_TYPE) returns bigint Deletes timeseries elements belonging to the specified interval. If high or low parameters are not null, then them specify correspondingly upper/lower inclusive boundary for timestamp value. If some or both parameters are nulls, then corresponding boundary is open. This function returns number of deleted elements.
function TABLE_delete(till TIMESTAMP_TYPE default null) returns bigint Deletes timeseries elements from the beginning till specified timestamp till (inclusive) or delete all elements if this parameter is null/omitted. This function is equivalent to TABLE_delete(null, till). IMCS provides separate function for it because it is intended to be the most frequent case of deleting elements from timeseries: it corresponds to shifting data window when new elements are appended and deteriorated are thrown away. This function returns number of deleted elements.
function TABLE_join(other timeseries, direction integer default 1) returns timeseries Joins timestamp with other unsorted timeseries. It returns positions of elements in this timeseries which timestamp matches correspondent element of joined timeseries. Semantic of matching depends on value of the direction parameter:
  • If direction is less than zero, then this timestamp should be less or equal than other timestamp (locate timeseries element preceding timestamp).
  • If direction is zero, then this timestamp should be equal to other timestamp (exact match of timestamps).
  • If direction is greater than zero, then this timestamp should be greater or equal than other timestamp (locate timeseries element succeeding timestamp).

Generated data access functions for multiple timeseries (identified by timeseries ID)

Functions generated for accessing multiple timeseries (source table contains identifier of timeseries, for example 'Symbol').

FunctionDescription
function TABLE_first(id TIMESERIES_ID_TYPE) returns TIMESTAMP_TYPE Returns oldest timestamp.
function TABLE_last(id TIMESERIES_ID_TYPE) returns TIMESTAMP_TYPE Returns most recent timestamp.
function TABLE_count(id TIMESERIES_ID_TYPE) returns bigint Returns number of elements in timeseries.
function TABLE_get(id TIMESERIES_ID_TYPE, TIMESTAMP_TYPE low default null, TIMESTAMP_TYPE high default null, limit_ts bigint default null) returns TABLE_timeseries Returns timeseries with specified identifier for the corresponding table or its time slice. Returned record contains the same columns as record of original table, but they have timeseries type instead of original scalar types. These columns can be used in timeseries functions (cs_*). If high or low parameters are not null, then them specify correspondingly upper/lower inclusive boundary for timestamp value. If some or both parameters are omitted, then corresponding boundary is open. It is possible to limit number of selected elements by specifying limit_ts parameter (if low boundary is open then last limit_ts elements will be selected, otherwise first limit_ts elements will be selected).
function TABLE_get(id TIMESERIES_ID_TYPE[], TIMESTAMP_TYPE low default null, TIMESTAMP_TYPE high default null, limit_ts bigint default null) returns setof TABLE_timeseries Does the same as function described above but for array of timeseries identifiers. For each timeseries identifier this function returns TABLE_timeseries record, so output will contain as much rows as there are identifiers.
function TABLE_span(id TIMESERIES_ID_TYPE, from_pos bigint default 0, till_pos bigint default 9223372036854775807) returns TABLE_timeseries Returns timeseries with specified identifier for the corresponding table or its horizontal slice. Returned record contains the same columns as record of the original table, but they have timeseries type instead of original scalar types. These columns can be used in timeseries functions (cs_*). Parameter from_pos specifies start position in timeseries (inclusive) and parameter till_pos specifies end position (inclusive). If till_pos parameter is missed, then subsequence spans till end of timeseries. Values of both from_pos and till_pos parameters can be negative. In this case position is calculated from end of timeseries, i.e. TABLE_span(id,from_pos:=-1) extracts last element of the timeseries.
function TABLE_span(id TIMESERIES_ID_TYPE[], from_pos bigint default 0, till_pos bigint default 9223372036854775807) returns setof TABLE_timeseries Does the same as function described above but for array of timeseries identifiers. For each timeseries identifier this function returns TABLE_timeseries record, so output will contain as much rows as there are identifiers.
function TABLE_concat(id TIMESERIES_ID_TYPE[], TIMESTAMP_TYPE low default null, TIMESTAMP_TYPE high default null) returns TABLE_timeseries Concatenates slices of timeseries for the specified identifiers. Returned record contains the same columns as record of original table, but they have timeseries type instead of original scalar types. Each such timeseries is concatenation of slices of timeseries for all specified identifiers. These columns can be used in timeseries functions (cs_*). If high or low parameters are not null, then them specify correspondingly upper/lower inclusive boundary for timestamp value. If some or both parameters are omitted, then corresponding boundary is open.
function TABLE_delete(id TIMESERIES_ID_TYPE, low TIMESTAMP_TYPE, high TIMESTAMP_TYPE) returns bigint Deletes timeseries elements belonging to the specified interval. If high or low parameters are not null, then them specify correspondingly upper/lower inclusive boundary for timestamp value. If some or both parameters are nulls, then corresponding boundary is open. This function returns number of deleted elements.
function TABLE_delete(id TIMESERIES_ID_TYPE, till TIMESTAMP_TYPE default null) returns bigint Deletes timeseries elements from the beginning till specified timestamp till (inclusive) or delete all elements if this parameter is null/omitted. This function is equivalent to TABLE_delete(id, null, till). IMCS provides separate function for it because it is intended to be the most frequent case of deleting elements from timeseries: it corresponds to shifting data window when new elements are appended and deteriorated are thrown away. This function returns number of deleted elements.
function TABLE_join(id TIMESERIES_ID_TYPE, other timeseries) returns timeseries Joins timestamp with other unsorted timeseries. It returns positions of elements in this timeseries which timestamp matches correspondent element of joined timeseries. Semantic of matching depends on value of the direction parameter:
  • If direction is less than zero, then this timestamp should be less or equal than other timestamp (locate timeseries element preceding timestamp).
  • If direction is zero, then this timestamp should be equal to other timestamp (exact match of timestamps).
  • If direction is greater than zero, then this timestamp should be greater or equal than other timestamp (locate timeseries element succeeding timestamp).
function TABLE_id() returns varcharReturns name of timeseries identifier for this table

Timeseries constructors

Functions constructing constant timeseries (timeseries of repeated value) or timseries created by parsing string literal.

FunctionDescription
function cs_parse(str text, elem_type cs_elem_type, elem_size integer default 0) returns timeseries Creates timeseries from string, for example '{1,2,3,4,5}'. Type of timeseries is specified by elem_type parameter. For timeseries of characters it is also necessary to specify size of timeseries element - elem_size. Please notice that PostgreSQL allows implicit cast from string to the target type using this type input function, but in this case information about timeseries element type and size should be encoded in the string: 'int4:{1,2,3,4,5}'.
function cs_const(val float8, elem_type cs_elem_type default 'float8') returns timeseries Creates timeseries of numeric (integer or floating point) elements. Type of timeseries is specified by elem_type parameter. Should be one of: 'char', 'int2', 'int4', 'int8', 'float4', 'float8'.
function cs_const(val timestamp, elem_type cs_elem_type) returns timeseries Creates timeseries of date/time elements. Type of timeseries is specified by elem_type parameter. Should be one of: 'date', 'time', 'timestamp'.
function cs_const(val text, elem_size integer) returns timeseries Creates timeseries of character type. Size of timeseries element is specified by elem_size parameter.
function cs_const(val text) returns timeseries Creates timeseries of character type. Size of timeseries element is equal to the length of val.

Binary operations

Binary operations with timeseries. These functions take two timeseries arguments and return result timeseries. IMCS tries to automatically adjust types of input arguments (for example if one timeseries has "int8" element type and another - "float8", then first one will be converted to "float8").

FunctionDescription
function cs_add(timeseries,timeseries) returns timeseries Adds elements of two timeseries
function cs_sub(timeseries,timeseries) returns timeseries Subtracts elements of two timeseries
function cs_mul(timeseries,timeseries) returns timeseries Multiplies elements of two timeseries
function cs_div(timeseries,timeseries) returns timeseries Divides elements of two timeseries
function cs_pow(timeseries,timeseries) returns timeseries Raises element of first timeseries to power specified by element of second timeseries
function cs_and(timeseries,timeseries) returns timeseries Bitwise AND of elements of two integer or boolean timeseries
function cs_or(timeseries,timeseries) returns timeseries Bitwise OR of elements of two integer or boolean timeseries
function cs_xor(timeseries,timeseries) returns timeseries Bitwise XOR of elements of two integer or boolean timeseries
function cs_eq(timeseries,timeseries) returns timeseries Checks if element of first timeseries is equal to element of second timeseries
function cs_ne(timeseries,timeseries) returns timeseries Checks if element of first timeseries is not equal to element of second timeseries
function cs_gt(timeseries,timeseries) returns timeseries Checks if element of first timeseries is greater than element of second timeseries
function cs_ge(timeseries,timeseries) returns timeseries Checks if element of first timeseries is greater or equal than element of second timeseries
function cs_lt(timeseries,timeseries) returns timeseries Checks if element of first timeseries is less than element of second timeseries
function cs_le(timeseries,timeseries) returns timeseries Checks if element of first timeseries is less or equal than element of second timeseries
function cs_maxof(timeseries,timeseries) returns timeseries Maximum of two elements
function cs_minof(timeseries,timeseries) returns timeseries Minimum of two elements
function cs_like(timeseries,pattern text) returns timeseries Finds elements of character timeseries matching specified pattern (case sensitive). Rules of matching are the same as for PostgreSQL LIKE predicate.
function cs_ilike(timeseries,pattern text) returns timeseries Finds elements of character timeseries matching specified pattern (ignore case). Rules of matching are the same as for PostgreSQL ILIKE predicate.
function cs_cat(timeseries,timeseries) returns timeseries Concatenates elements of two timeseries. Input timeseries can have any element type, result is always timeseries of characters which element size is equal to sum of element sizes of concatenated timeseries. For example cs_cat('bpchar1:{a,b,c}', 'bpchar1:{x,y,z}') = 'bpchar2:{ax,by,cz}'. In case of concatenation of character strings which actual length is smaller than fixed element size, result will contains filler character ('\0'). So if element size of concatenated timeseries in the above example is 3, then result will be E'{a\\000\\000x\\000\\000,b\\000\\000y\\000\000,c\\000\\000z\\000\000}'. If you prefer to get '{ax,by,cz}', then please use cs_add instead of cs_cat. Function cs_cat is intended to be used for concatenation of group-by keys (character or numeric) for aggregation.
function cs_concat(head timeseries,tail timeseries) returns timeseries Concatenates two timeseries. Result of this function is timeseries containing elements both of head and tail timeseries. For example cs_concat('int4:{1,2,3}','int4:{4,5,6}') = '{int4:1,2,3,4,5,6}'. Parameters head or tail may be null. In this case cs_concat returns just not-null timeseries.

Unary operations

Unary operations with timeseries. These functions take single timeseries and return timeseries as result.

FunctionDescription
function cs_neg(timeseries) returns timeseries Negates timeseries elements
function cs_not(timeseries) returns timeseries Logical NOT of boolean timeseries elements
function cs_bit_not(timeseries) returns timeseries Bitwise NOT of integer timeseries elements
function cs_abs(timeseries) returns timeseries Absolute value of timeseries element
function cs_norm(timeseries) returns timeseries Normalizes timeseries elements (divides each element by square root of sum of all elements)

Mathematical functions

Calculation of mathematical functions for all timeseries elements.

FunctionDescription
function cs_sin(timeseries) returns timeseries Sine function
function cs_cos(timeseries) returns timeseries Cosine function
function cs_tan(timeseries) returns timeseries Tangent function
function cs_exp(timeseries) returns timeseries Exponent function
function cs_asin(timeseries) returns timeseries Arcsine function
function cs_acos(timeseries) returns timeseries Arccosine function
function cs_atan(timeseries) returns timeseries Arctangent function
function cs_sqrt(timeseries) returns timeseries Square root function
function cs_log(timeseries) returns timeseries Natural logarithm function
function cs_ceil(timeseries) returns timeseries Rounds timeseries element to the smallest integer greater or equal than the element value
function cs_floor(timeseries) returns timeseries Rounds timeseries element to the largest integer less or equal than the element value
function cs_isnan(timeseries) returns timeseries Checks if floating point timeseries element is NaN

Date/time functions

Extracts components of date/time type. These functions are mostly needed in group-by operations to calculate aggregates for various intervals (days, weeks, months, quarters, years...).

FunctionDescription
function cs_year(timeseries) returns timeseries Extracts year from date/timestamp
function cs_month(timeseries) returns timeseries Extracts month (1..12) from date/timestamp
function cs_mday(timeseries) returns timeseries Extracts month day (1..31) from date/timestamp
function cs_wday(timeseries) returns timeseries Extracts week day (0..6 starting from Sunday) from date/timestamp
function cs_week(timeseries) returns timeseries Extracts week number since start of epoch from date/timestamp
function cs_quarter(timeseries) returns timeseries Extracts quarter (1..4) from date/timestamp
function cs_hour(timeseries) returns timeseries Extracts hour (0..23) from time/timestamp
function cs_minute(timeseries) returns timeseries Extracts minute (0..59) from time/timestamp
function cs_second(timeseries) returns timeseries Extracts second (0..59) from time/timestamp

Binary scalar functions

Functions of this group take two timeseries arguments and calculate single scalar value as result. IMCS tries to automatically adjust types of input arguments (for example if one timeseries has "int8" element type and another - "float8", then first one will be converted to "float8").

Execution of these functions can be parallelized.

FunctionDescription
function cs_wsum(timeseries,timeseries) returns float8 Weighted sum of timeseries elements
function cs_wavg(a timeseries,b timeseries) returns float8 Weighted average of timeseries elements: sum(a*b)/sum(a)
function cs_corr(a timeseries,b timeseries) returns float8 Correlation of two timeseries
function cs_cov(a timeseries,b timeseries) returns float8 Covariation of two timeseries

Timeseries transformation functions

Functions performing various transformations of input timeseries.

FunctionDescription
function cs_cast(input timeseries, elem_type cs_elem_type, elem_size default 0) returns timeseries Casts timeseries elements to the specified type defined in cs_elem_type enum: ('char', 'int2', 'int4', 'date', 'int8', 'time', 'timestamp', 'money', 'float4', 'float8', 'bpchar'). For character type it is necessary to specify element size. If converted value doesn't fit in specified size, it will be truncated. Explicit casts are rarely needed, in most cases IMCS performs implicit type conversion.
function cs_to_TYPE_array(timeseries) returns TYPE[] Converts timeseries to array. TYPE should be one of "char", int2, int4, date, int8, time, timestamp, money, float4, float8, bpchar and should match element type of the converted timeseries. Please notice that array is constructed in memory and large timeseries can cause memory overflow.
function cs_from_array(anyarray, elem_size integer default 0) returns timeseries Converts array to timeseries. This function creates timeseries iterator for the input array, allowing to apply to it any timeseries functions. Type of the result timeseries element is the same as type of the array element. Optional elem_size parameter is needed only for text array, it should specify maximal size of array element.
function cs_thin(timeseries, origin integer, step integer) returns timeseries Leaves only each step-th element of timeseries starting from origin.
function cs_limit(timeseries, from_pos bigint default 0, till_pos bigint default 9223372036854775807) returns timeseries Extracts subsequence from timeseries. Parameter from_pos specifies start position of subsequence (inclusive) and parameter till_pos specifies end position (inclusive). If till_pos parameter is missed, then subsequence spans till end of timeseries. Values of both from_pos and till_pos parameters can be negative. In this case position is calculated from end of timeseries, i.e. cs_limit(s, from_pos:=-1) extracts last element of the timeseries.
function cs_head(timeseries, n bigint default 1) returns timeseries Extracts n first elements of timeseries. This function is equivalent to cs_limit(0, n-1).
function cs_tail(timeseries, n bigint default 1) returns timeseries Extracts n last elements of timeseries. This function is equivalent to cs_limit(-n).
function cs_cut_head(timeseries, n bigint default 1) returns timeseries Extracts all except first n elements of timeseries. This function is equivalent to cs_limit(n).
function cs_cut_tail(timeseries, n bigint default 1) returns timeseries Extracts all except last n elements of timeseries. This function is equivalent to cs_limit(0,-n-1).
function cs_call(input timeseries, func oid) returns timeseries Calls specified function for all elements of input timeseries. To specify function cast function name to regproc or function prototype (name and argument types) to recprocedure:
  select cs_call(Close,'sin'::regproc) 
  from Quote_get('IBM');

  select cs_call(Close,'sin(float)'::regprocedure) 
  from Quote_get('IBM');
Please notice that sin is taken here only as example. There is special cs_sin in IMCS API. But you can specify here name of any function, including plpgsql function:
  create function mul2(x real) returns real as 
  $$ begin return x*2; end; $$ 
  language plpgsql strict immutable;

  select cs_call(Close, 'mult2'::regproc) 
  from Quote_get('IBM');
function cs_union(left timeseries, right timeseries) Unions two sorted timeseries (usually timestamps). For example cs_union('int8:{1,5,7,8}', 'int8:{2,3,5,6}') = 'int8:{1,2,3,5,5,6,7,8}'
function cs_iif(cond timeseries, then_ts timeseries, else_ts timeseries) returns timeseries Chooses one of two alternatives: if element of cond boolean timeseries is true, then use element of then_ts timeseries, otherwise use element of else_ts timeseries. All timeseries are traversed with the same speed: if we take element from then_ts timeseries, then corresponding element of else_ts timeseries is skipped. For example cs_iif('char:{1,0,1}', 'float4:{1.0,2.0,3.0}', 'float4:{0.1,0.2,0.3}') = 'float4:{1.0,0.2,3.0}'
function cs_if(cond timeseries, then_ts timeseries, else_ts timeseries) returns timeseries Conditional computation: if element of cond boolean timeseries is true, then take next element of then_ts timeseries, otherwise use element of else_ts timeseries. Unlike cs_iff then/else timeseries are accessed only on demand, so number of elements fetched from this timeseries depends on condition. For example cs_if('char:{1,0,1}', 'float4:{1.0,2.0,3.0}', 'float4:{0.1,0.2,0.3}') = 'float4:{1.0,0.1,2.0}'
function cs_filter(cond timeseries, input timeseries) returns timeseries Leaves only those elements from timeseries input for which condition cond is true. For example cs_filter('char:{1,0,1}', 'float4:{1.0,2.0,3.0}') = 'float4:{1.0,3.0}'
function cs_filter_pos(cond timeseries) returns timeseries Returns positions of timeseries elements for which condition cond is true. For example cs_filter_pos('char:{1,0,1}') = 'int8:{0,2}'
function cs_filter_first_pos(cond timeseries, n integer) returns timeseries Finds first N positions of timeseries elements for which condition cond is true. For example cs_filter_first_pos('char:{1,0,1}', 1) = 'int8:{0}' Execution of this function can be parallelized.
function cs_unique(timeseries) returns timeseries Removes subsequent duplicate values. To eliminate all duplicates in timeseries it should be sorted prior applying cs_unique. For example cs_unique('int4:{1,1,2,2,2,1,3}') = 'int4:{1,2,1,3}'
function cs_reverse(timeseries) returns timeseries Reverses order of timeseries elements .For example cs_reverse('int4:{1,2,3}') = 'int4:{3,2,1}'
function cs_trend(input timeseries) returns timeseries Calculates sequence trend: sign of difference between pairs of non-equal sequence elements, for example cs_trend('int8:{1,2,3,3,2,2,4,5,6,5,5}') = 'int1:{0,1,1,1,-1,-1,1,1,1,-1,-1}'
function cs_diff(input timeseries) returns timeseries Calculates difference between pairs of subsequent timeseries elements: result[0] = 0, result[i] = input[i] - input[i-1]. For example cs_diff('int8:{1,3,2,5}') = 'int8:{0,2,-1,3}'
function cs_project(anyelement, positions timeseries default null, disable_caching book default false) returns setof record Transforms vertical representation (all timeseries elements or just elements on specified positions) to horizontal representation. This is more generic version of TABLE_project which can be applied to arbitrary set of columns. But as far as result row is anonymous, it is not possible to unnest it using PostgreSQL ().* clause. In PostgreSQL 9.3 it is possible to use cs_project in FROM list (lateral join) providing alias with description of returned columns. Concerning optional disable_caching parameter please read section Projection issues.
function cs_project_agg(anyelement, positions timeseries default null, disable_caching book default false) returns setof cs_agg_result This is specialized version of cs_project for transposing result of hash_agg_* functions. They return two timeseries: the first one with values of aggregate and the second one with values of group-by key. cs_project_agg transforms this result to set of cs_agg_result rows, consisting of two columns: (agg_val float8, group_by bytea). In case of combining several keys for group-by key, it can be splitted back into separate values using cs_cut or cs_as functions. Concerning optional disable_caching parameter please read section Projection issues.
function cs_map(input timeseries, positions timeseries) returns timeseries Extracts from first timeseries elements with positions specified in the second timeseries. Example of cs_map usage: cs_map('float8:{3.14,0.1,-10}', 'int8:{1,2}')='float8:{-0.1,10}'

Grand aggregates

Functions calculating grand aggregates: aggregation of all timeseries elements.

Execution of these functions can be parallelized (except cs_median).

FunctionDescription
function cs_count(timeseries) returns bigint Counts number of timeseries elements.
function cs_empty(timeseries) returns bool Checks if timeseries contains no elements. This function is usually more efficient than cs_count() %lt;%gt; 0 except cases when filter is applied to large timeseries and relatively small number of elements fits filter condition (unlike cs_empty, cs_count can be executed in parallel)
function cs_approxdc(timeseries) returns bigint Approximates number of different timeseries elements.
function cs_max(timeseries) returns float8 Maximal value of timeseries elements.
function cs_min(timeseries) returns float8 Minimum value of timeseries elements
function cs_avg(timeseries) returns float8 Average value of timeseries elements.
function cs_sum(timeseries) returns float8 Sum of timeseries elements.
function cs_prd(timeseries) returns float8 Product of timeseries elements.
function cs_var(timeseries) returns float8 Variation of timeseries elements.
function cs_dev(timeseries) returns float8 Standard deviation of timeseries elements.
function cs_median(timeseries) returns float8 Median element of timeseries.
function cs_all(timeseries) returns bigint Bitwise AND of elements of integer timeseries.
function cs_any(timeseries) returns bigint Bitwise OR of elements of integer timeseries.

Group-by aggregates

Functions calculating aggregates for each group. Groups are identified by sequence of elements with the same value in group-by timeseries. It is not mandatory to sort this timeseries. But you should realize that sequences of the same value in different parts of the timeseries will form different groups. For example, there are four groups in timeseries '{1, 1, 2, 1, 1, 1, 2, 2,}': ('{1, 1}', '{2}', '{1, 1, 1}', '{2, 2}'). If you want to perform aggregation for all timeseries element with the same value, then use cs_hash_* functions instead.

FunctionDescription
function cs_group_count(timeseries) returns timeseries Returns number of elements in each group (sequences of repeated values)
function cs_group_apporaxdc(input timeseries, group_by timeseries) returns timeseries Approximates number of distinct values in each group. group_by timeseries identifies groups: sequences of repeated values.
function cs_group_max(input timeseries, group_by timeseries) returns timeseries Maximal value of each group. group_by timeseries identifies groups: sequences of repeated values.
function cs_group_min(input timeseries, group_by timeseries) returns timeseries Minimal value of each group. group_by timeseries identifies groups: sequences of repeated values.
function cs_group_sum(input timeseries, group_by timeseries) returns timeseries Sum of elements of each group. group_by timeseries identifies groups: sequences of repeated values.
function cs_group_avg(input timeseries, group_by timeseries) returns timeseries Average value of each group. group_by timeseries identifies groups: sequences of repeated values.
function cs_group_var(input timeseries, group_by timeseries) returns timeseries Variation of each group. group_by timeseries identifies groups: sequences of repeated values.
function cs_group_dev(input timeseries, group_by timeseries) returns timeseries Standard deviation of each group. group_by timeseries identifies groups: sequences of repeated values.
function cs_group_first(input timeseries, group_by timeseries) returns timeseries First element of each group. group_by timeseries identifies groups: sequences of repeated values.
function cs_group_last(input timeseries, group_by timeseries) returns timeseries Last element of each group. group_by timeseries identifies groups: sequences of repeated values.
function cs_group_any(input timeseries, group_by timeseries) returns timeseries Bitwise OR of elements of each group. group_by timeseries identifies groups: sequences of repeated values.
function cs_group_all(input timeseries, group_by timeseries) returns timeseries Bitwise AND of elements of each group. group_by timeseries identifies groups: sequences of repeated values.

Grid aggregates

Splitting timeseries into intervals intervaland calculating aggregate for each interval (grid cell).

FunctionDescription
function cs_grid_max(input timeseries, step integer) returns timeseries Maximal value of each interval. Parameter step specifies size of interval.
function cs_grid_min(input timeseries, step integer) returns timeseries Minimal value of each interval. Parameter step specifies size of interval.
function cs_grid_avg(input timeseries, step integer) returns timeseries Average value of each interval. Parameter step specifies size of interval.
function cs_grid_sum(input timeseries, step integer) returns timeseries Sum of each interval. Parameter step specifies size of interval.
function cs_grid_var(input timeseries, step integer) returns timeseries Variation of each interval. Parameter step specifies size of interval.
function cs_grid_dev(input timeseries, step integer) returns timeseries Standard deviation of each interval. Parameter step specifies size of interval.

Window (moving) aggregates

Aggregation is done for window - N subsequent elements of timeseries where N is window size. At each step window is moved at one position forward. So result timeseries has the same number of elements as input timeseries. First N-1 elements of result are calculated for windows smaller than N. You can use cs_limit(cs_window_AGG(input, N), N-1) to skip these elements.

FunctionDescription
function cs_window_max(input timeseries, window_size integer) returns timeseries Maximal value of each window. Parameter window_size specifies size of window.
function cs_window_min(input timeseries, window_size integer) returns timeseries Minimal value of each window. Parameter window_size specifies size of window.
function cs_window_avg(input timeseries, window_size integer) returns timeseries Average value of each window. Parameter window_size specifies size of window.
function cs_window_sum(input timeseries, window_size integer) returns timeseries Sum of each window. Parameter window_size specifies size of window.
function cs_window_var(input timeseries, window_size integer) returns timeseries Variation of each window. Parameter window_size specifies size of window.
function cs_window_dev(input timeseries, window_size integer) returns timeseries Standard deviation of each window. Parameter window_size specifies size of window.
function cs_window_ema(input timeseries, window_size integer) returns timeseries Exponential Moving Average (EMA) indicator with window_size period. Coefficient of weighting decrease p=2/(window_size + 1). Formula: EMA[0] = input[0], EMA[i] = input[i]*p + EMA[i-1]*(1-p).
function cs_window_atr(input timeseries, window_size integer) returns timeseries Average True Range (ATR) indicator with window_size period. Formula: ATR[i] = (ATR[i-1]*(n-1) + TR[i])/n, where n=min(i+1, window_size). First window_size-1 elements of result can be skipped to get correct ATR sequence.

Hash aggregates (group-by using hash function)

Aggregation with group-by. These function perform grouping and aggregation similar to SQL. All elements of group_by sequence with the same value forms single group. It is done using hash function, so cs_hash_* aggregates require additional memory for building hash table. These functions contain two out parameters: return two timeseries. The first one contains calculated aggregates. The second one contains corresponding group keys.

If it is necessary to perform grouping by more than one key, it is possible to use cs_cat (or || SQL operator) to concatenate several columns. Later it is possible to use cs_cut or cs_as functions to split concatenated value back into components.

Execution of these functions can be parallelized.

FunctionDescription
function cs_hash_count(group_by timeseries, out count timeseries, out groups timeseries) returns record Counts number of elements having the same value. This function has two output parameters: count timeseries contains counters and groups timeseries contains corresponding group key values. So result of cs_hash_count('float4:{1,3,1,4,2,2}') will be ('int8:{2,2,1,1}', 'float4:{1,2,3,4}').
function cs_hash_dup_count(input timeseries, group_by timeseries, out count timeseries, out groups timeseries, min_occurrences integer default 1) returns record Counts number of duplicates for each group. Groups are identified by group_by timeseries. This function has two output parameters: count timeseries contains counters and groups timeseries contains corresponding group key values. Parameter min_occurrences specifies minimal number of occurrences of element in each group. It should be positive number. With default value 1 of min_occurrences this function calculates number of distinct values.
function cs_hash_max(input timeseries, group_by timeseries, out max timeseries, out groups timeseries) returns record Calculates maximal value for each group. Groups are identified by group_by timeseries. This function has two output parameters: max timeseries contains calculated maximums and groups timeseries contains corresponding group key values.
function cs_hash_min(input timeseries, group_by timeseries, out max timeseries, out groups timeseries) returns record Calculates minimal value for each group. Groups are identified by group_by timeseries. This function has two output parameters: min timeseries contains calculated minimums and groups timeseries contains corresponding group key values.
function cs_hash_avg(input timeseries, group_by timeseries, out avg timeseries, out groups timeseries) returns record Calculates average value for each group. Groups are identified by group_by timeseries. This function has two output parameters: avg timeseries contains calculated averages and groups timeseries contains corresponding group key values.
function cs_hash_sum(input timeseries, group_by timeseries, out avg timeseries, out groups timeseries) returns record Calculates sum of elements for each group. Groups are identified by group_by timeseries. This function has two output parameters: sum timeseries contains calculated sums and groups timeseries contains corresponding group key values.
function cs_hash_any(input timeseries, group_by timeseries, out avg timeseries, out groups timeseries) returns record Calculates bitwise OR of elements for each group. Groups are identified by group_by timeseries. This function has two output parameters: sum timeseries contains calculated bitmasks and groups timeseries contains corresponding group key values.
function cs_hash_all(input timeseries, group_by timeseries, out avg timeseries, out groups timeseries) returns record Calculates bitwise AND of elements for each group. Groups are identified by group_by timeseries. This function has two output parameters: sum timeseries contains calculated bitmasks and groups timeseries contains corresponding group key values.

Cumulative aggregates

Aggregates calculated for all preceding elements of timeseries. Result timeseries has the same number of elements as input timeseries.

FunctionDescription
function cs_cum_max(timeseries) returns float8 Cumulative maximal value of timeseries elements
function cs_cum_min(timeseries) returns timeseries Cumulative minimum value of timeseries elements
function cs_cum_avg(timeseries) returns timeseries Cumulative average value of timeseries elements
function cs_cum_sum(timeseries) returns timeseries Cumulative sum of timeseries elements
function cs_prd(timeseries) returns timeseries Cumulative product of timeseries elements
function cs_var(timeseries) returns timeseries Cumulative variation of timeseries elements
function cs_dev(timeseries) returns timeseries Cumulative standard deviation of timeseries elements

Sort functions

Top functions find out top-N values of timeseries. N can not be larger than imcs.tile_size (default value 128).

Execution of cs_top_* functions can be parallelized. Please notice that calculation of TOP-n is much faster than full sort.

FunctionDescription
function cs_top_max(timeseries, top integer) returns timeseries Returns top maximal elements in timeseries in descent order. For example cs_top_max('float4:{1.1,0.1,2.2,0.2}', 2)='float4:{2.2,1.1}'
function cs_top_min(timeseries, top integer) returns timeseries Returns top minimum elements in timeseries in ascent order. For example cs_top_min('float4:{1.1,0.1,2.2,0.2}', 2)='float4:{0.1,0.2}'
function cs_top_max_pos(timeseries, top integer) returns timeseries Returns positions of top maximal elements in timeseries in descent order. For example cs_top_max_pos'float4:{1.1,0.1,2.2,0.2}', 2)='int8:{2,0}'
function cs_top_min_pos(timeseries, top integer) returns timeseries Returns positions of top minimum elements in timeseries in ascent order. For example cs_top_min_pos('float4:{1.1,0.1,2.2,0.2}', 2)='int8:{1,3}'
function cs_sort(timeseries, sort_order cs_sort_order default 'asc') returns timeseries Sorts specified timeseries of scalar element type. For example cs_sort('float4:{1.1,0.1,2.2,0.2}')='float4:{0.1,0.2,1.1,2.2}'
function cs_sort_pos(timeseries, sort_order cs_sort_order default 'asc') returns timeseries Returns positions of timeseries scalar elements in specified order. For example cs_sort_pos('float4:{1.1,0.1,2.2,0.2}')='int8:{1,3,0,2}'
function cs_rank(timeseries, sort_order cs_sort_order default 'asc') returns timeseries Returns rank of scalar timeseries elements. For example cs_rank('float4:{1.1,0.1,2.2,0.2,0.1}')='int8:{4,1,5,3,1}'
function cs_dense_rank(timeseries, sort_order cs_sort_order default 'asc') returns timeseries Returns dense rank of scalar timeseries elements. For example cs_rank('float4:{1.1,0.1,2.2,0.2,0.1}')='int8:{3,1,4,2,1}'
function cs_quantile(timeseries, q_num integer) returns timeseries Calculates q-quantiles for timeseries of scalar element type. The quantiles are the data values marking the boundaries between consecutive subsets. This functions returns timeseries with q_num+1 values of the same type as input timeseries For example cs_quantile('float4:{10,3,0,3,4,5,9,11,7,3,3}', 2)='float4:{0,4,11}'

Special functions

This group contains functions performing quite complex processing of timeseries.

FunctionDescription
function cs_histogram(input timeseries, min float8, max float8, n_intervals integer) returns timeseries Builds histogram for the input timeseries. Minimal (inclusive) and maximal (exclusive) values for input timeseries should be specified as well as number of interval (histogram columns). Values outside specified range [min_value,max_value) are ignored. Number of intervals should not be greater than tile size. Execution of these functions can be parallelized.
function cs_cross(input timeseries, first_cross_direction integer) returns timeseries Finds positions in input timeseries where it crosses zero, if first_cross_direction is positive then starts with first cross over, if negative then starts with cross below, if zero it doesn't matter (first cross can be over or below) For example cs_cross('float4:{1,2,-1,0.5,0.6,0.0,0.1,0.3,-5}', 0)='int8:{2,3,7}'
function cs_extrema(input timeseries, first_extremum integer) returns timeseries Finds positions of extrema (local minimum and maximums) in input timeseries, if first_extremum is positive then starts with first local maximum, if negative starts with local minimum, if zero it doesn't matter. For example cs_extrema('float4:{1,2,3,2,1,0,0,1,1,2,4,0}', 0)='int8:{2,6,10}'
function cs_stretch(ts1 timeseries, ts2 timeseries, vals timeseries, filler float8 defaults 1.0) returns timeseries Stretches vals timeseries to the length of first timeseries. Repeats elements of vals timeseries while corresponding timestamp (timeseries ts2) is larger than timestamp from ts1. For example cs_stretch('int4:{1,2,3,4,5}', 'int4:{2,4}', 'float4:{1.1,2.2}', 1.0) = 'float4:{1.1,2.2,2.2,1.0,1.0}'. This function can be used to calculate split adjusted price. We need to revert timeseries of splits, calculate cumulative product, stretch and multiply it on price.
function cs_stretch0(ts1 timeseries, ts2 timeseries, vals timeseries, filler float8 defaults 0.0) returns timeseries Injects missed elements in vals timeseries (associated with ts2) so that corresponding timestamps of ts1 and ts2 are matched. For example cs_stretch0('int4:{1,2,3,5}', 'int4:{2,3,4}', 'float4:{1.1,1.2,1.3}', 0.0) = 'float4:{0.0,1.1,1.2,1.3,0.0}'. This function may be useful if we need to perform operations with trading data for different symbols and this data can contains some holes (no trading for particular symbol for this date)
function cs_asof_join(ts1 timeseries, ts2 timeseries, vals timeseries) returns timeseries Gets values from third timeseries corresponding to the timestamp from ts2 closest to the timestamp from ts1. For example cs_asof_join('int4:{4,9}', 'int4:{1,3,6,10}', 'float4:{0.1,0.3,0.6,1.0}') = 'float4:{0.3,1.0}'.
function cs_asof_join_pos(ts1 timeseries, ts2 timeseries) returns timeseries Gets positions of elements in sorted timeseries ts2 closest to the elements in sorted timeseriests1. For example cs_asof_join_pos('int4:{4,9}', 'int4:{1,3,6,10}') = 'int8:{1,3}'.
function cs_join(ts1 timeseries, ts2 timeseries, vals timeseries) returns timeseries Gets elements from vals timeseries which corresponds to elements of sorted timeseries ts2 matching elements of sorted timeseries ts1. For example cs_join_pos('int4:{0,2,3,8,10}', 'int4:{1,3,6,10}', 'float4:{0.1,0.3,0.6,1.0}') = 'int8:{0.3,1.0}'.
function cs_join_pos(ts1 timeseries, ts2 timeseries) returns timeseries Gets positions of elements in sorted timeseries ts2 matching elements in sorted timeseriests1. For example cs_join_pos('int4:{0,2,3,8,10}', 'int4:{1,3,6,10}') = 'int8:{1,3}'.
function cs_cut(str bytea, format cstring) returns record Splits binary string into components. This function is reverse to cs_cat which may be needed to construct combined group-by key for aggregate functions. format string describes types of component. Type is specified by one letter followed by field length. Below is list of supported types:
Format specificationPostgreSQL type
i1"char"
i2int2
i4int4
i8int8
f4float4
f8float8
d4date
t8time
T8timestamp
m8money
cNchar(N)
For example format string 'i4f4c10' corresponds to a row with one integer, one float and one character component with length 10.
function cs_as(str bytea, type_name cstring) returns record Yet another function splitting binary string into components. This function is reverse to cs_cat which may be needed to construct combined group-by key for aggregate functions. Parameter type_name specifies composite type which components will be fetched from the input string. Below is example of using this function:
  create type char16 as (body char(16));
  select agg_val,cs_as(group_by,'char16') 
  from (select (cs_project_agg(cs_hash_sum(volenquired,fxvenue))).* 
        from DbItem_get()) agg;
In this example group-by consists of just one key of char(16) type. It is also possible to print it without cs_as:
  select agg_val,encode(btrim(group_by,E'\\000'::bytea),'escape') 
  from(select(cs_project_agg(cs_hash_sum(volenquired,fxvenue))).* 
       from DbItem_get()) agg;
Please notice that cs_as functions returns type record and PostgreSQL doesn't allow to convert it to composite type. So you can not write:
  select agg_val,cs_as(group_by,'char16')::char16 
  from(select(cs_project_agg(cs_hash_sum(volenquired,fxvenue))).* 
       from DbItem_get()) agg;
And as far as PostgreSQL has no information about columns, you can not use (cs_as(...)).* clause to extract columns of the row. But you can create function returning proper type and bind it to cs_as C implementation:
  create function to_char16(body bytea, 
                            type_name cstring default 'char16') 
  returns char16 as '$libdir/imcs', 'cs_as' 
  language C stable strict;

  select agg_val,(to_char16(group_by)).* 
  from (select (cs_project_agg(cs_hash_sum(volenquired,fxvenue))).* 
        from DbItem_get()) agg;
But please be careful: nobody will check that this function really returns declared type.

Operators

IMCS provides standard SQL operators for timeseries type plus some specific operators for timeseries processing. The following table contains mapping between operators and corresponding timeseries functions:

OperatorIMCS functionDescription
a + bcs_addAdds elements of two timeseries
a - bcs_subSubtracts elements of two timeseries
a * bcs_mulMultiplies elements of two timeseries
a / bcs_divDivides elements of two timeseries
a % bcs_modDivides by modulo elements of two timeseries
a ^ bcs_powRaises to power
a ~ bcs_corrCorrelation of two timeseries
a ? bcs_filterFilters elements of right timeseries according to condition specified by left timeseries
a ?cs_filter_posReturns position of elements for which condition is true
a +* bcs_wsumWeighted sum of two timeseries
a // bcs_wavgWeighted average of two timeseries, for example Volume//Price is volume weighted average price (VWAP)
a & bcs_andBitwise AND (can be also used for boolean timeseries)
a | bcs_orBitwise OR (can be also used for boolean timeseries)
a # bcs_xorBitwise XOR (can be also used for boolean timeseries)
a || bcs_catConcatenates correspondent elements of two timeseries
a ||| bcs_concatConcatenates two timeseries
a = bcs_eqChecks if element if left timeseries is equal to element of right timeseries
str ~~ patterncs_ilikeFinds elements of character timeseries matching specified pattern
a <> bcs_neChecks if element if left timeseries is not equal to element of right timeseries
a < bcs_ltChecks if element if left timeseries is less than element of right timeseries
a <= bcs_leChecks if element if left timeseries is less than or equal to element of right timeseries
a > bcs_gtChecks if element if left timeseries is greater than element of right timeseries
a >= bcs_geChecks if element if left timeseries is greater than or equal to element of right timeseries
a -> bcs_asof_join_posFinds positions of elements in second sorted timeseries closest to the elements of first sorted timeseries
a <-> bcs_join_posFinds positions of elements in second sorted timeseries equal to the elements of first sorted timeseries
a << ncs_cut_headSkips first n elements of timeseries
a >> ncs_cut_tailSkips last n elements of timeseries
-acs_negNegates elements of timeseries
!acs_notLogical NOT
~acs_bit_notBitwise NOT
@acs_absAbsolute values of timeseries elements

Please notice that operators & | # in PostgreSQL have precedence different from precedence of standard SQL AND OR operators. Please always use parenthesis.

Binary operators accept not only timeseries OP timeseries operands. Also it is possible to pass as right parameter one of

  1. Numeric constant (integer or floating point): select Price*2.0 from Quote_get()
  2. Date, time or timestamp: select Day=date('11-Nov-2013') from Quote_get()
  3. String literal (1): select Close*'{2.0,2.1,2.2}'::text from Quote_get('11-Nov-2013', '13-Nov-2013')
  4. String literal (2): select Close*'float4:{2.0,2.1,2.2}' from Quote_get('11-Nov-2013', '13-Nov-2013')

In first two cases constant timeseries (timeseries containing the same value) is implicitly created for right operand using cs_const function. In third case timeseries is created from string literal using cs_parse function. And in the last case conversion to timeseries type is implicitly made by PostgreSQL using input function of this type.

Projection issues

There are several functions in IMCS API returning a row or a set of rows: cs_hash_*, cs_project*, cs_as. PostgreSQL provides two ways of decomposition of compound type into columns:

  1. select (foo()).*
  2. select * from foo()
Unfortunately case 1) is implemented very inefficiently: a function is called as many times as there are columns in a returned row (see discussion of this question at StackOverlow). For example cs_hash_sum function has two output parameters: sum and groups. Output parameters are actually returned in PostgreSQL as anonymous row. So if we write:
  select (cs_project_agg(cs_hash_sum(Close,Day%7))).* from Quote_get('IBM');
then PostgreSQL will call function cs_hash_sum twice. It means that aggregation will be performed twice: we will have to do double amount of work. I failed to find a way to make PostgreSQL to avoid these redundant calls. But this problem is solvable.

First of all it is possible to avoid (...).* construction and access composite type attributes explicitly:

  select (q.p).agg_val,(q.p).group_by 
  from (select cs_project_agg(cs_hash_sum(Close,Day%7)) p from Quote_get('IBM')) q;
In this case PostgreSQL generally will not perform redundant calls. To guarantee that multiple evaluation won't be performed you can use the OFFSET 0 hack or abuse PostgreSQL's failure to optimise across CTE boundaries:
  select (q.p).agg_val,(q.p).group_by 
  from (select cs_project_agg(cs_hash_sum(Close,Day%7)) p 
        from Quote_get('IBM') offset 0) q;
But IMCS also tries to provide workaround for (cs_project(...)).* construction: cs_project and cs_project_agg functions can cache their results, avoiding redundant calculations. Unfortunately there are some restrictions. For example it is not possible to use cs_project more than once in one query. You can disable such caching for the particular invocation by setting disable_caching optional parameter to false. Or completely disable caching by setting imcs.project_caching configuration parameter to true.

And concerning case 2) calling function in FROM list: it is possible if function doesn't depend on other data sources at the same query layer. For example in IMCS cs_hash_sum accepts timeseries arguments which are provided by Quote_get. So we can write:

  select (p).* from Quote_get('IBM') q,cs_project_agg(cs_hash_sum(q.Close,q.Day%7)) p;
But it works only in PostgreSQL 9.3 which supports lateral joins. A lateral join enables a subquery in the FROM part of a SELECT to reference columns from preceding items in the FROM list. Also function calls in PostgreSQL 9.3 can now directly reference columns from preceding FROM items, even without the LATERAL keyword. This is why query above correctly works with PostgreSQL 9.3 and higher (function is called only once) and generates function expression in FROM cannot refer to other relations of same query level error in previous PostgreSQL versions.

Also using projection function in FROM list allows to specify alias and describe columns:

  select (s).* from Quote_get('IBM') q,
  cs_project(cs_hash_sum(q.Close,q.Day%7)) s(sum float8,group_by integer);

Conclusion:
  1. Avoid using of (cs_hash_AGG(...)).* construction to eliminate redundant calculations.
  2. Better use IMCS with PostgreSQL 9.3 and build queries using lateral joins.

Implementation details

Timeseries are stored in shared memory as B-Tree pages. This B-Tree provides fast access to timeseries element by position (for all types) or by value (only for timestamp). There is separate B-Tree for each timeseries. PostgreSQL hash is used to locate timeseries by identifier. Hash key includes name of the source table, name of the corresponding field and optionally identifier of timeseries. For example for Quote table identifier of timeseries may be 'quote-close-IBM'. Size of B-Tree pages is determined by "imcs.page_size" configuration parameter. Default value is 4kb.

IMCS uses RW (read-write) lock to synchronize access to columnar store. It means that multiple read-only queries can be performed concurrently, but adding or removing timeseries elements is possible only in exclusive mode. Lock is set when timeseries is accessed first time. If imcs.serializable configuration parameter is true (default), then lock is hold till the end of transaction. Such locking policy provides serializable isolation level for timeseries. If imcs.serializable is false, then lock is released at the end of query execution. It corresponds to "read committed" isolation level.

Right now IMCS supports RLE compression for timeseries of character type. But duplicates are eliminated only at B-Tree pages. When elements are extracted into tile, them are decompressed. Using RLE at tiles level can significantly increase speed of some operations. For example if we perform aggregation (let's say sum) of timeseries with large number of repeated duplicate values, then RLE can significantly reduce number of performed operation. If value is repeated 100 times, then with RLE we can just calculate 100*x instead of performing 100 additions. But IMCS is first of all oriented on financial data (trading systems). And here duplicates are not so often, at least for numeric characteristics. (price, volume, date,...). Even if value of some stock option is quite durable (variation about few cents per year), small fluctuations of this option during a days normally occur. Our experiments show that RLE encoding cause only degrade of performance in case of standard queries on securities data.

Most of cs_* functions are not actually performing any executions. Instead of it, they are constructing pipe of operators (or expression tree). A unit of exchanging data between operators is a tile (vector). So operators are performing vector operations to reduce interpretation cost. Size of a tile should be large enough to minimize overhead of organizing work of the pipe. But it should fit in L1 CPU cache to keep processing speed high. Default size of the tile is 128.

IMCS is able to execute some operations in parallel. Now it is done for grand and hash aggregates, top-N functions (all operators where size of output is smaller than size of input). IMCS maintains pool of threads. Number of threads in the pool can be specified using "imcs.n_threads" configuration parameter. By default (zero value of this parameters), number of threads is detected automatically based on number of CPUs (cores) in the system. IMCS clones expression subtree and splits into segments timeseries accessed in the leave nodes of this tree (timeseries stored in columnar store). Number of segments corresponds to number of threads. So each thread is processing its own part of timeseries. Then results are merged using operator-specific merge function. Merge requires synchronization, so only one thread can perform merge at each moment of time.

Please notice that PostgreSQL is not able to parallelize execution of SQL query. Certainly it is possible to manually split query into several subqueries and execute them concurrently. But it is not trivial and not convenient. The fact that IMCS can overcome this limitation is very important for OLAP queries.

Scaling beyond physical memory

IMCS was originally designed to hold all data in main memory. In this case it shows the best performance. But there are cases when available data doesn't fit in server's RAM. After receiving requests from several customers I have added to IMCS possibility to swap data to the disk. Certainly performance of disk version of IMCS is not so high as of in-memory version. But there are two factors which allows to expect quite good performance of disk version also:

  1. IMCS uses B-Tree to store timeseries data and B-Tree is one of the most efficient data structure for disk lookups (minimizing number of read operations)
  2. Most of IMCS queries are performing sequential scans of large timeseries intervals. It allows to sequentially read data from the disk with disk head speed (up to Gbit per second for modern disks).

To use IMCS in disk mode, you should rebuild it with USE_DISK=1 make option. In this case IMCS will store timeseries data in specified file or raw partition and use page pool (disk cache) to optimize access to the disk. You need to specify path to file or raw partition and size of disk cache (number of pages). Cache is placed in shared memory so it can be accessed by all PostgreSQL processes. Please notice that size of the cache should be smaller than size of shared memory reserved for IMCS extension ("imcs.shmem_size").

Usually the larger cache is used, the better performance you will get. Certainly if cache fits in main memory, in case of swapping large cache can only cause degrade of performance. But most of IMCS queries perform sequential scan of data. If size of data is larger than size of the cache, then it doesn't matter how large this cache is: there will be no cache hits in any case (page is thrown away from the cache by LRU algorithm before it is accessed second time). IMCS uses two level LRU replacement algorithm trying to keep in memory internal pages of B-Tree and protect them from throwing away from the cache by leaf pages during large scans.

Also please notice that caching is also done at OS level (file system cache). It means that the same page can be stored in memory twice: in IMCS shared memory and in OS disk cache. And extra memory copies are needed to move data between OS cache and IMCS cache. IMCS cache provides faster access (requires no context switches), but only OS has precise knowledge about availability of memory and so it is more flexible in assignment of available memory resources. And IMCS knows specific of accessed pages (leaf or internal B-Tree page) and so may choose more efficient replacement policy for each of them. So there is no simple answer to the question how to split memory between OS and internal IMCS cache. You can not certainly control size of OS file system cache, but the larger IMCS cache is, the less memory left to OS and can be used for caching at OS level.

Disk version of IMCS doesn't provides durability (persistence) of data: after restart of server it is still necessary to reload all IMCS data. There are two main reasons for it:

  1. Some IMCS data is still not persistent, for example hash table used to locate timeseries.
  2. Due to performance reasons, IMCS is not using WAL (write ahead logging) or some other approach to provide all ACID properties of transaction. So in case of some error (power failure, OS crash, postgresql crash,...) IMCS data file can be corrupted and there is no way to atomically recover it.

IMCS never shrinks size of used data file. If you deallocate some table, then correspondent pages will be marked as free and can be reused in subsequent allocation queries. But size of the file is not decreased. Even after restart of the server file is not truncated, because:

  1. IMCS can work not only with normal OS file but also with raw partitions which can not be truncated;
  2. extension of file requires update of the file metadata which adds additional overhead.
So you need to delete the file explicitly if you want to truncate it.

Installation and tuning

As far as IMCS is using PostgreSQL shared memory, it should be loaded via shared_preload_libraries. Please add '$libdir/imcs' to shared_preload_libraries in postgresql.conf file:

  shared_preload_libraries = '$libdir/imcs' # (change requires restart)
Size of shared memory used by IMCS can be specified using imcs.shmem_size parameter. At most systems maximal size of System V shared memory is limited by quite small constant. So you may also need to alter system configuration (please refer to OS manual about how to do it). PostgreSQL 9.3 uses mmap instead of System V shared memory, so there should be no problem with system quotas. But there is yet another limitation in Linux: it is not able to create shared memory segment larger than 256Gb with standard 4Kb pages. And now servers with 1Tb memory is not something very exotic. To utilize all available memory in this case it is possible to create multiple shared memory segments. But PostgreSQL is not able to do it. Another solution is to increase page size. Linux supports huge pages. Unfortunately PostgreSQL still doesn't provide any way of using huge pages: you need to patch PostgreSQL source: add MAP_HUGETLB to PG_MMAP_FLAGS define in sysv_shmem.c:
  #define PG_MMAP_FLAGS	(MAP_HUGETLB|MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE)
IMCS distributive contains smarter patch sysv_shmem.patch for PostgreSQL 9.3.1 which sets MAP_HUGETLB flag only if size of shared memory segment is larger than 256Gb and only if MAP_HUGETLB is defined (since Linux 2.6.32).

Below is list of all IMCS configuration parameters:

Parameter nameDescriptionDefault valueRecommendations
imcs.shmem_sizeSize of shared memory (Mb) used by columnar store.8*1024 (8Gb)Make it large enough to fit all data requiring vertical representation. It can not be increased without restart of the server.
imcs.n_timeseriesEstimation for number of timeseries10000This value is needed for PostgreSQL hash implementation. Too small value may cause large number of collisions.
imcs.n_threadsNumber of threads in thread pool for concurrent execution of a query0 - autodetect number of CPUsUsually number of threads should be equal to number of physical execution units in the system. Please notice that in case of using hyperthreading number of reported CPUs is twice large than real number of cores. Set this parameter to 1 to disable concurrent execution
imcs.page_sizeSize of B-Tree page size in bytes4096As far as B-Tree is stored in memory, it is not so critical to use large pages. But small page may increase per-element storage overhead.
imcs.tile_sizeSize of tile or vector that is used to organize vector operations128The larger tile is, the less influence of interpretation overhead. But best performance can be achieved only if tile fits in CPU L1 cache. Please notice that some operators have two or more parameters, so more than one tile can be calculated at each stage of operator's pipe processing. Also memory may be needed for other purposes, so to reduce probability of cache misses, keep this value reasonably small.
imcs.dictionary_sizeSize of dictionary used by IMCS to map unlimited size strings to integer identifiers64kbIf size of dictionary is set to zero, then it is not possible to load in columnar store columns with unlimited size types (i.e. VARCHAR). It size of the dictionary is less or equal than 64kb, then IMCS maps strings to 16-bit integer identifiers. If size of the dictionary is greater than 64kb, then IMCS maps strings to 32-bit integer identifiers. The size of the dictionary should be larger than cardinality of all varying size columns of all tables which are loaded in columnar store. And it should fit in memory reserved for IMCS using imcs.shmem_size parameter.
imcs.substitute_nullsSubstitutes NULLs with 0 while loading data in columnar storefalseBy default attempt to insert NULL value will cause an error. When value of this option is set to true, IMCS doesn't report an error and stores zero instead of NULL.
imcs.autoloadAutomatically loads data in columnar store when it is accessed first time by any querytrueLoading data from large table can take substantial amount of time and so increase execution time of the query initiated this load. It can confuse an user which expects this query to complete very fast. In such case explicit load of data after server restart can be more desirable (it can be completed before receiving any user's query).
imcs.serializableHold lock till the end of transactiontrueSuch locking policy provides serializable isolation level for columnar store. If this parameter is set to false, then lock is released at the end of query execution. It corresponds to "read committed" isolation level.
imcs.traceTrace IMCS commandsfalseSends information about executed IMCS command to client and PostgreSQL server log (NOTICE log level).
imcs.output_string_limitLimit for length of timeseries string representation1024Trying to print result of query returning larger timeseries can cause memory overflow or at least produce a lot of screens of hardly readable text. Setting this limit allows to restrict size of printed timeseries: only part of timeseries elements will be printed and then "..." indicates that timeseries was truncated. Setting this parameter to 0 disables this limitation.
imcs.project_cachingCache cs_project results to avoid redundant calculations in (cs_project(...)).* expression.trueCaching can cause incorrect behavior in some cases: when cs_project is used twice in the same query. In this case disable it: everything should work correctly, may be only with some performance penalty in case of using (cs_project(...)).* construction. Also it is possible to disable caching for each particular cs_project invocation by assigning false to optional disable_caching parameter. Please read more in section Projection issues.
imcs.use_rleUse RLE encoding for character timeseriesfalseRLE allows to significantly reduce size of used memory for timeseries with large fraction of duplicates.
imcs.cache_size(*)Size of IMCS disk cache (in pages)256*1024Total size in bytes used by cache is imcs.cache_size*imcs.page_size. With default values of parameters it is 1Gb. It should be smaller than imcs.shmem_size. See more about choosing optimal setting for this parameter in section Scaling beyond physical memory.
imcs.flush_file(*)Flush changes to the file during committrueWrite dirty pages to the disk during commit. Pages are written in offset increasing order, so disk writes are more or less sequential minimizing disk head movements. That is why it can be faster than random writes of dirty pages thrown away by LRU replacement algorithm. But it can increase number of writes, especially in case of short transactions (for example if triggers are used to propagate updates to IMCS).
imcs.file_path(*)Path to IMCS disk file or partition."imcs.dbs"Location of IMCS file or raw partition. Please notice that IMCS never tries to truncate this file.
*) These parameters are available only in disk mode

So steps of using IMCS extension are the following:

  1. Change PostgreSQL configuration file postgresql.conf by adding IMCS to list of preloaded libraries and specifying maximal size of IMCS storage.
  2. Install IMCS extension (read PostgreSQL documentation about installation of extensions).
  3. Create extension using create extension imcs command (you need to have superuser permissions for it).
  4. Generate interface functions using cs_create function.
  5. If data is already present in the database, load it in columnar store using TABLE_load() function.
  6. You should either enable autoload (imcs.autoload configuration property), either manually call TABLE_load() each time you restart the server.

Performance comparison

Consider the following definition of Quote table:

  create table Quote (
    Symbol char(10), 
    Day date, 
    Open real, 
    High real, 
    Low real, 
    Close real, 
    Volume integer);
Let's populate it with NYSE data for ten years (about 6 million records).
  \copy Quote from 'NYSE_2003-2013.csv' with csv header;
This load is completed at my system in 2.5 minutes. Now lets create vertical representation of this table:
  create extension imcs;
  select cs_create('Quote', 'Day', 'Symbol');
  select Quote_load(); 
Loading data in columnar store takes at my computer just 15 seconds. If we call cs_create prior to loading data in Quote table, then time of importing data from CSV file will increase from 2.5 minutes to 6.5 minutes. It is because of using trigger to propagate inserts to in-memory columnar store.

Now let's calculate volume-weighted price for IBM for the period from 2010 till 2013:

  select cs_wavg(Volume, Close) as VWAP 
  from Quote_get('IBM', date('01-Jan-2010'), date('01-Jan-2013'));
Query execution time is 10 milliseconds. Now do the same thing with standard SQL:
  select sum(Close*Volume)/sum(Volume) 
  from Quote where Symbol='IBM' and Day between date('01-Jan-2010') and date('01-Jan-2013');
It takes 750 milliseconds.

Now let calculate VWAP for all symbols. To simplify it we will first create table containing information about all symbols. Actually this data is usually available and contains much more information than just symbol name. But here we need just symbol name:

  create table Securities (Symbol char(10));
  insert into Securities select distinct Symbol from Quote;
  create view SecurityQuotes as select (Quote_get(Symbol)).* from Securities;
Now we are ready to execute query:
  select Symbol,cs_sum(Close*Volume) / cs_sum(Volume) as VWAP 
  from SecurityQuotes;
Time is about 500 milliseconds. Now do the same using standard SQL:
  select Symbol,sum(Close*Volume)/sum(Volume) as VWAP 
  from Quote group by Symbol;
Result is returned after 2243 milliseconds.

Now let's test filter queries with projection back to horizontal representation. The following query finds all dates for 'ABB' symbol when close price was more than 1% large than open price for the particular quarter:

  select (Quote_project(abb.*,cs_filter_pos(Close>Open*1.01))).* 
  from Quote_get('ABB', '01-Jan-2010', '31-Mar-2010') abb;
It returns 14 results in 12 milliseconds. Now do the same using SQL:
  select * from Quote where Symbol='ABB' 
  and Day between date('01-Jan-2010') and date('31-Mar-2010') and Close>Open*1.01;
The same result in 640 milliseconds.

Actually timeseries in this example are not long enough: size of timeseries for each symbol is about 2608 elements. Let's now investigate use case with single large timeseries:

  select Quote_drop();
  select cs_create('Quote', 'Day');
  select Quote_load();
Load is completed twice faster than in case of Quote table: 7.5 seconds. Now let's execute VWAP for this timeseries:
  select Volume//Close as VWAP from Quote_get();
Query is completed at my system in 10 milliseconds. The same query using SQL:
  select sum(Close*Volume)/sum(Volume) as VWAP from Quote;
One second (one thousand milliseconds). So IMCS query is 100 times faster.

Now perform filter query for large timeseries:

  select cs_count((Close>Open*1.1)?) from Quote_get();
  select count(*) from Quote where Close>Open*1.1;
Ratio of the query execution times is once again 100: 6.274 msec vs. 768.251 msec.

Now consider real use case wth one timeseries and large enough records. There are about 10 million records with ~40 columns. Database size is about 5Gb. Queries perform groupping by various combinations of fields and cacluate aggregates for some characterestics. For example:
  select trader,desk,office,sum(score*volenquired)/sum(volenquired) 
  from DbItem group by trader,desk,office;
Execution of this query takes 320 seconds. IMCS analog:
  select agg_val,cs_cut(group_by,'c22c30c10') from 
    (select (cs_project_agg(ss1.*)).* from 
       (select (s1).sum/(s2).sum,(s1).groups from DbItem_get() q, 
          cs_hash_sum(q.score*q.volenquired, q.trader||q.desk||q.office) s1,  
          cs_hash_sum(q.volenquired, q.trader||q.desk||q.office) s2) ss1) ss2;
takes ... 144 milliseconds. The ratio is more than 2 thousands times. But this is the result with default PostgreSQL parameters (only "shared_buffers" was increased to hold all database in memory). If we also increase "work_mem" from default 1Mb to 1Gb, then times of the query is reduced to 33 seconds for first execution and 7 seconds for subsequent executions.

So summarizing these results: IMCS provides about 5-10 times increase of performance for relatively small timeseries (thousands elements) and 100 times faster speed for large timeseries (millions elements) on standard desktop with quad core processor. For SMP server with larger number of cores this ratio is expected to be even higher.

License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the Software), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHOR OF THIS SOFTWARE BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Please send any feedbacks, complaints, bug reports, change requests to the author Konstantin Knizhnik. Latest version of this software can be obtained from the site http://www.garret.ru.