{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Theta Sketch Examples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Basic Sketch Usage" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from datasketches import theta_sketch, update_theta_sketch, compact_theta_sketch\n", "from datasketches import theta_union, theta_intersection, theta_a_not_b" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To start, we'll create a sketch with 1 million points in order to demonstrate basic sketch operations." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "### Theta sketch summary:\n", " num retained entries : 6560\n", " seed hash : 37836\n", " empty? : false\n", " ordered? : false\n", " estimation mode? : true\n", " theta (fraction) : 0.00654224\n", " theta (raw 64-bit) : 60341508738660257\n", " estimate : 1.00271e+06\n", " lower bound 95% conf : 978261\n", " upper bound 95% conf : 1.02778e+06\n", " lg nominal size : 12\n", " lg current size : 13\n", " resize factor : 8\n", "### End sketch summary\n", "\n" ] } ], "source": [ "n = 1000000\n", "k = 12\n", "sk1 = update_theta_sketch(k)\n", "for i in range(0, n):\n", " sk1.update(i)\n", "print(sk1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The summary contains most data fo interest, but we can also query for specific information. And in this case, since we know the exact number of distinct items presented ot the sketch, we can look at the estimate, upper, and lower bounds as a percentage of the exact value." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Upper bound (1 std. dev) as % of true value:\t 101.5208\n", "Sketch estimate as % of true value:\t\t 100.2715\n", "Lower bound (1 std. dev) as % of true value:\t 99.0374\n" ] } ], "source": [ "print(\"Upper bound (1 std. dev) as % of true value:\\t\", round(100*sk1.get_upper_bound(1) / n, 4))\n", "print(\"Sketch estimate as % of true value:\\t\\t\", round(100*sk1.get_estimate() / n, 4))\n", "print(\"Lower bound (1 std. dev) as % of true value:\\t\", round(100*sk1.get_lower_bound(1) / n, 4))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can serialize and reconstruct the sketch. Serialization necessarily produces a compact sketch, meaning the sketch can be deserialized and queried or used for further unions or set operations but can not be updated directly." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "52504" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sk1_bytes = sk1.compact().serialize()\n", "len(sk1_bytes)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Estimate: \t\t 1002714.745231455\n", "Estimation mode: \t True\n" ] } ], "source": [ "new_sk1 = compact_theta_sketch.deserialize(sk1_bytes)\n", "print(\"Estimate: \\t\\t\", new_sk1.get_estimate())\n", "print(\"Estimation mode: \\t\", new_sk1.is_estimation_mode())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sketch Unions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Theta Sketch unions make use of a separate union object. The union will accept input sketches with different values of $k$.\n", "\n", "For this example, we will create a sketch with distinct values that partially overlap those in `sk1`." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "### Theta sketch summary:\n", " num retained entries : 12488\n", " seed hash : 37836\n", " empty? : false\n", " ordered? : false\n", " estimation mode? : true\n", " theta (fraction) : 0.0123336\n", " theta (raw 64-bit) : 113757656857900725\n", " estimate : 1.01252e+06\n", " lower bound 95% conf : 994626\n", " upper bound 95% conf : 1.03073e+06\n", " lg nominal size : 13\n", " lg current size : 14\n", " resize factor : 8\n", "### End sketch summary\n", "\n" ] } ], "source": [ "offset = int(3 * n / 4)\n", "sk2 = update_theta_sketch(k+1)\n", "for i in range(0, n):\n", " sk2.update(i + offset)\n", "print(sk2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now feed the sketches into the union. As constructed, the exact number of unique values presented to the two sketches is $\\frac{7}{4}n$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Union estimate as % of true value: 99.6787\n" ] } ], "source": [ "union = theta_union(k)\n", "union.update(sk1)\n", "union.update(sk2)\n", "result = union.get_result()\n", "print(\"Union estimate as % of true value: \", round(100*result.get_estimate()/(1.75*n), 4))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sketch Intersections" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Beyond unions, theta sketches also support intersctions through the use of an intersection object. These set intersections can have vastly superior error bounds than the classic inclusion-exclusion rule used with sketches like HLL." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Has result: True\n", "### Theta sketch summary:\n", " num retained entries : 1668\n", " seed hash : 37836\n", " empty? : false\n", " ordered? : true\n", " estimation mode? : true\n", " theta (fraction) : 0.00654224\n", " theta (raw 64-bit) : 60341508738660257\n", " estimate : 254959\n", " lower bound 95% conf : 242739\n", " upper bound 95% conf : 267789\n", "### End sketch summary\n", "\n" ] } ], "source": [ "intersection = theta_intersection()\n", "intersection.update(sk1)\n", "intersection.update(sk2)\n", "print(\"Has result: \", intersection.has_result())\n", "result = intersection.get_result()\n", "print(result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, we expect the sets to have an overlap of $\\frac{1}{4}n$." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Intersection estimate as % of true value: 101.9834\n" ] } ], "source": [ "print(\"Intersection estimate as % of true value: \", round(100*result.get_estimate()/(0.25*n), 4))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set Subtraction (A-not-B)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we have the set subtraction operation. Unlike `theta_union` and `theta_intersection`, `theta_a_not_b` always takes as input 2 sketches at a time, namely $a$ and $b$, and directly returns the result as a sketch." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "### Theta sketch summary:\n", " num retained entries : 4892\n", " seed hash : 37836\n", " empty? : false\n", " ordered? : true\n", " estimation mode? : true\n", " theta (fraction) : 0.00654224\n", " theta (raw 64-bit) : 60341508738660257\n", " estimate : 747756\n", " lower bound 95% conf : 726670\n", " upper bound 95% conf : 769452\n", "### End sketch summary\n", "\n" ] } ], "source": [ "anb = theta_a_not_b()\n", "result = anb.compute(sk1, sk2)\n", "print(result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By using the same two sketches as before, the expected result here is $\\frac{3}{4}n$." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A-not-B estimate as % of true value: 99.7008\n" ] } ], "source": [ "print(\"A-not-B estimate as % of true value: \", round(100*result.get_estimate()/(0.75*n), 4))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.10.6 64-bit", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" }, "vscode": { "interpreter": { "hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e" } } }, "nbformat": 4, "nbformat_minor": 2 }