New Built-in Functions for Databricks SQL

Built-in functions extend the power of SQL with specific transformations of values for common needs and use cases. For example, the LOG10 function accepts a numeric input argument and returns the logarithm with base 10 as a double-precision floating-point result, and the LOWER function accepts a string and returns the result of converting each character to lowercase.

As part of our commitment to making it easy to migrate your data warehousing workloads to the Databricks lakehouse platform, we have carefully designed and launched dozens of new built-in functions in the core ANSI-compliant standard SQL dialect over the last year. The open-source Apache Spark community has also made significant contributions to this area, which we have integrated into Databricks Runtime as well. In this blog post we highlight a useful subset of these new functions and describe, with examples, how they can prove useful in your data processing journeys. Please enjoy!

Process strings and search for elements

Use Databricks SQL to quickly inspect and process strings with new functions in this category. You can quickly check if a string contains a substring, inspect its length, split strings, and check for prefixes and suffixes.


> SELECT contains('SparkSQL', 'SQL'),
         contains('SparkSQL', 'Spork');
  true, false

> SELECT len('Spark SQL ');
  10

> SELECT split_part('Hello,world,!', ',', 1);
  Hello

> SELECT startswith('SparkSQL', 'Spark'),
         endswith('SparkSQL', 'dataframes');
 true, false

Use regular expression operations to match strings against patterns, or use dedicated functions to convert strings to and from numbers with custom format strings, and to and from URL-encoded form.


> WITH w AS (SELECT
    'Steven Jones and Stephen Smith' AS target,
    'Ste(v|ph)en' AS pattern)
-- Return the first substring that matches the pattern.
SELECT regexp_substr(target, pattern) FROM w;
 Steven

-- This format string expects:
--  * an optional sign at the beginning,
--  * followed by a dollar sign,
--  * followed by a number between 3 and 6 digits long,
--  * thousands separators,
--  * up to two digits beyond the decimal point.
> SELECT to_number('-$12,345.67', 'S$999,099.99');
 -12345.67

-- This format string produces five characters before the decimal point and two after.
> SELECT '(' || to_char(123, '99999.99') || ')';
 (  123.00)

> SELECT url_decode('http%3A%2F%2Fspark.apache.org%2Fpath%3Fquery%3D1');
 http://spark.apache.org/path?query=1
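
-- url_encode performs the inverse conversion (an illustrative example).
> SELECT url_encode('http://spark.apache.org/path?query=1');
 http%3A%2F%2Fspark.apache.org%2Fpath%3Fquery%3D1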

Compare numbers and timestamps

Get into the details by extracting bits and performing conditional logic on integers and floating-point numbers. Convert floating-point numbers to integers by rounding up or down with an optional target scale, or compare numbers for equality with support for NULL values.


> SELECT bit_get(23Y, 3),
         bit_get(23Y, 0);
 0, 1

> SELECT ceil(5.4),
         ceil(-12.345, 1);
 6, -12.3

> SELECT floor(3345.1, -2);
 3300

> SELECT equal_null(2, 2),
         equal_null(2, 1),
         equal_null(NULL, NULL),
         equal_null(NULL, 1);
 true, false, true, false

Work with temporal values using new strongly-typed conversions. Cast an input expression to or from one of the INTERVAL data types, query the current date, or add to and subtract from dates and timestamps.


> SELECT cast(INTERVAL '12:04.9900' MINUTE TO SECOND AS STRING);
  INTERVAL '12:04.99' MINUTE TO SECOND

> SELECT curdate();
 2022-12-21

-- March 31, 2022 minus 1 month yields February 28, 2022.
> SELECT dateadd(MONTH, -1, TIMESTAMP'2022-03-31 00:00:00');
 2022-02-28 00:00:00.000000

-- One month has passed even though it's not the end of the month yet because
-- the day and time line up.
> SELECT datediff(MONTH, TIMESTAMP'2021-02-28 12:00:00', TIMESTAMP'2021-03-28 12:00:00');
 1

Work with arrays, structs, and maps

Make sophisticated queries for your structured and semi-structured data with the array, struct, and map types. Construct new array values with the array constructor, or inspect existing arrays to see if they contain specific values or figure out what their positions are. Check how many elements are in an array, or extract specific elements by index.


-- This creates an array of integers.
> SELECT array(1, 2, 3);
 [1,2,3]

> SELECT array_contains(array(1, 2, 3), 2),
         array_position(array(3, 2, 1, 4, 1), 1);
 true, 3

> SELECT array_size(array(1, NULL, 3, NULL));
  4

> SELECT get(arr, 0), get(arr, 2), arr[2] FROM VALUES(array(1, 2, 3)) AS T(arr);
 1, 3, 3

> SELECT element_at(array(1, 2, 3), 2),
         try_element_at(array(1, 2, 3), 5);
 2, NULL

Maps are a powerful data type that supports associating values with unique keys and efficiently extracting them later. Use the map constructor to create new map values and then look up values later as needed. Once created, you can concatenate maps together, or extract their keys or values as arrays.


> SELECT map(1.0, '2', 3.0, '4');
 {1.0 -> 2, 3.0 -> 4}
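
-- Look up a map value by key using the bracket syntax or element_at
-- (an illustrative example).
> SELECT map(1, 'a', 2, 'b')[2],
         element_at(map(1, 'a', 2, 'b'), 2);
 b, b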

> SELECT map_contains_key(map(1, 'a', 2, 'b'), 2);
 true

> SELECT map_concat(map(1, 'a', 2, 'b'), map(3, 'c'));
  {1 -> a, 2 -> b, 3 -> c}

> SELECT map_keys(map(1, 'a', 2, 'b')),
         map_values(map(1, 'a', 2, 'b'));
  [1,2], [a,b]

Perform error-safe computation

Enjoy the benefits of standard SQL with ANSI mode while also preventing your long-running ETL pipelines from failing with errors, thanks to new error-safe functions. Each such function returns NULL instead of raising an exception. For example, take a look at try_add, try_subtract, try_multiply, and try_divide. You can also perform casts, compute sums and averages, and safely convert values to and from numbers and timestamps using custom formatting options.


> SELECT try_divide(3, 2), try_divide(3 , 0);
 1.5, NULL
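
-- With the try_ variants, arithmetic that would overflow returns NULL instead of
-- raising an error (an illustrative example; 2147483647 is the largest INT value).
> SELECT try_add(2147483647, 1), try_multiply(2147483647, 2);
 NULL, NULL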

> SELECT try_cast('10' AS INT);
 10

> SELECT try_cast('a' AS INT);
 NULL

> SELECT try_sum(col) FROM VALUES (5), (10), (15) AS tab(col);
 30

> SELECT try_avg(col) FROM VALUES (5e37::DECIMAL(38, 0)), (5e37::DECIMAL(38, 0)) AS tab(col);
 NULL

-- A plus sign is optional in the format string, and so are fractional digits.
> SELECT try_to_number('$345', 'S$999,099.99');
 345.00

-- The number format requires at least three digits.
> SELECT try_to_number('$45', 'S$999,099.99');
 NULL
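
-- Error-safe conversion to timestamps with a custom format string; invalid input
-- returns NULL (a minimal sketch).
> SELECT try_to_timestamp('2022-12-31', 'yyyy-MM-dd'),
         try_to_timestamp('12-31', 'yyyy-MM-dd');
 2022-12-31 00:00:00, NULL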

Aggregate groups of values together in new ways

Make data-driven decisions by asking questions about groups of values using new built-in aggregate functions. For example, you can now return any value in a group, concatenate groups into arrays, and compute histograms. You can also perform statistical calculations by querying the median or mode of a group, or get specific by looking up any arbitrary percentile.


> SELECT any_value(col) FROM VALUES (10), (5), (20) AS tab(col);
 10

> SELECT array_agg(col) FROM VALUES (1), (2), (NULL), (1) AS tab(col);
 [1,2,1]

> SELECT histogram_numeric(col, 5) FROM VALUES (0), (1), (2), (10) AS tab(col);
 [{"x":0,"y":1.0},{"x":1,"y":1.0},{"x":2,"y":1.0},{"x":10,"y":1.0}]

> SELECT median(DISTINCT col) FROM VALUES (1), (2), (2), (3), (4), (NULL) AS tab(col);
 2.5
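
-- The mode mentioned above returns the most frequent value in a group
-- (an illustrative example; NULLs are ignored).
> SELECT mode(col) FROM VALUES (1), (2), (2), (3), (NULL) AS tab(col);
 2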

-- Return the median, 40%-ile and 10%-ile.
> SELECT percentile_cont(array(0.5, 0.4, 0.1)) WITHIN GROUP (ORDER BY col)
    FROM VALUES (0), (1), (2), (10) AS tab(col);
 [1.5, 1.2000000000000002, 0.30000000000000004]

The new regr_* family of aggregate functions helps you ask questions about the values of a group where both input expressions are NOT NULL.


-- Returns the intercept of the univariate linear regression line.
> SELECT regr_intercept(y, x) FROM VALUES (1, 2), (2, 3), (2, 3), (null, 4), (4, null) AS T(y, x);
  -1.0

-- Returns the coefficient of determination from the values.
> SELECT regr_r2(y, x) FROM VALUES (1, 2), (2, 3), (2, 3), (null, 4), (4, null) AS T(y, x);
  1

-- Returns the sum of the products of the deviations of the paired input expression values in a group.
> SELECT regr_sxy(y, x) FROM VALUES (1, 2), (2, 3), (2, 3), (null, 4), (4, null) AS T(y, x);
  0.6666666666666666

Each of these can also be invoked as a window function using the OVER clause.
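
For example, invoking regr_r2 with an empty OVER clause computes the coefficient of determination over the whole partition and repeats it for every row (a minimal sketch reusing the non-NULL sample values from above):

> SELECT y, x, regr_r2(y, x) OVER () FROM VALUES (1, 2), (2, 3), (2, 3) AS T(y, x);
 1, 2, 1.0
 2, 3, 1.0
 2, 3, 1.0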

Use encryption

Protect access to your data by encrypting it at rest and decrypting it when needed. These functions use the Advanced Encryption Standard (AES) to convert values to and from their encrypted equivalents.


> SELECT base64(aes_encrypt('Spark', 'abcdefghijklmnop'));
  4A5jOAh9FNGwoMeuJukfllrLdHEZxA2DyuSQAWz77dfn

> SELECT cast(aes_decrypt(unbase64('4A5jOAh9FNGwoMeuJukfllrLdHEZxA2DyuSQAWz77dfn'),
                          'abcdefghijklmnop') AS STRING);
  Spark

Apply introspection

Programmatically query properties of your Databricks cluster or configuration with SQL. For example, you can ask about the current version of your Databricks SQL or Databricks Runtime environment. You can also use SQL to list the secret keys within the Databricks secret service that the current user is authorized to see, and to extract specific secret values by scope and key.


> SELECT current_version().dbsql_version;
  2022.25

> SELECT current_version();
  { NULL, 2022.25, ..., ... }

> SELECT * FROM list_secrets();

  scope         key
  ------------  ---------------
  secrets.r.us  theAnswerToLife

> SELECT secret('secrets.r.us', 'theAnswerToLife');
  fortyTwo

Build yourself a geospatial lakehouse

Efficiently process and query vast geospatial datasets at scale. In this section, we describe new SQL functions now available for organizing and processing data in this way, along with examples of how to call the functions with different input data types. For a more detailed background, please refer to the separate dedicated “Processing Geospatial Data at Scale With Databricks” blog post.

(Figure: a geospatial visualization of taxi dropoff locations in New York City, with cell colors indicating aggregated counts therein.)

As of today, Databricks supports a new collection of geospatial functions operating over H3 cells. Each H3 cell represents a unique region of space on the planet at some resolution, and has its own associated unique cell ID represented as a BIGINT or hexadecimal STRING expression. The boundaries of these cells can be converted to open formats including GeoJSON, a standard designed for representing simple geographical features using JSON, and WKT, an open text-based format for expressing geospatial data using strings (along with WKB, its binary equivalent).


-- This returns the center of the input H3 cell as a point in GeoJSON format;
-- WKT and WKB variants of this function are also available.
> SELECT h3_centerasgeojson(599686042433355775)
  {"type":"Point","coordinates":[-121.97637597255,37.345793375368]}

You can compute the grid distance between two H3 cells, or query the H3 cells that are within (grid) distance k of an origin cell; the set of these H3 cells is called the k-ring of the origin cell. You can also convert H3 cell IDs to or from their equivalent hexadecimal string representations.


> SELECT h3_distance('85283447fffffff', '8528340ffffffff')
 2

> SELECT h3_h3tostring(599686042433355775)
  85283473fffffff
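
-- The inverse conversion, from the hexadecimal string back to the BIGINT cell ID.
> SELECT h3_stringtoh3('85283473fffffff')
  599686042433355775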

-- Returns an array of H3 cells that form a hollow hexagonal ring centered at the
-- origin H3 cell and that are at grid distance k from the origin H3 cell.
> SELECT h3_hexring('85283473fffffff', 1)
  [8528340bfffffff,85283447fffffff,8528347bfffffff,85283463fffffff,85283477fffffff,8528340ffffffff]
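
-- The k-ring at grid distance 1 contains the origin cell plus its six immediate
-- neighbors; counting its elements is a quick illustrative check.
> SELECT array_size(h3_kring('85283473fffffff', 1))
  7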

Furthermore, you can now compute an ARRAY of H3 cell IDs (represented as BIGINTs or STRINGs) corresponding to hexagons or pentagons that are contained by the input area geography. The try_ versions return NULL instead of raising errors.


-- This is a simple example where the input is a triangle in hexadecimal WKB format.
> SELECT h3_polyfillash3(unhex('0103000000010000000400000050fc1873d79a5ec0d0d556ec2fe342404182e2c7988f5dc0f46c567dae064140aaf1d24d628052c05e4bc8073d5b444050fc1873d79a5ec0d0d556ec2fe34240'), 2)
[586146350232502271,586147449744130047,586198577034821631,586152397546455039,586199676546449407,586153497058082815,586142501941805055,586201325813891071]

You can compute the parent or children of an input H3 cell at a specified resolution, or check whether one H3 cell is a child of another. Representing polygons as (potentially exploded) arrays of H3 cells, and points by the H3 cells that contain them, enables very efficient spatial analytics that operates on the H3 cells rather than on the original geographic objects. Also, please refer to our recent blog posts that describe how to perform spatial analytics at any scale and how to supercharge spatial analytics using H3.

Finally, you can validate H3 cells by returning the input value of type BIGINT or STRING if it corresponds to a valid H3 cell ID.


> SELECT h3_toparent('85283473fffffff', 0)
  8029fffffffffff

> SELECT h3_tochildren(599686042433355775, 6)     
[604189641121202175,604189641255419903,604189641389637631,604189641523855359,604189641658073087,604189641792290815,604189641926508543]

> SELECT h3_ischildof(608693241318998015, 599686042433355775)
  true

> SELECT h3_validate(599686042433355776)
  [H3_INVALID_CELL_ID] 599686042433355776 is not a valid H3 cell ID

> SELECT h3_isvalid(599686042433355776)
  false
  
> SELECT h3_try_validate(599686042433355776)
  NULL

Databricks SQL lets you do anything

Standards compliance and easy migration came to Databricks SQL previously with the birth of ANSI mode, and it already sets the world record in performance. With the addition of this wide array of new built-in functions, SQL workloads gain significant new expressive power on the lakehouse.

Now feel free to chop up strings, aggregate values, manipulate dates, analyze geographies, and more. And if some functionality is missing from these built-ins, check out Python user-defined functions and SQL user-defined functions to define your own logic that behaves the same way at call sites as the built-ins.
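
For example, a simple SQL user-defined function (a minimal sketch; the function name and logic are purely illustrative):

> CREATE FUNCTION plus_one(x INT) RETURNS INT RETURN x + 1;
> SELECT plus_one(41);
  42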

Thanks for using Databricks SQL, and happy querying!


