Introduction to Elasticsearch SQL with practical examples - Part 1

Version 6.3 of the Elastic Stack offers more new features than almost any previous release. Not only have we opened the X-Pack code and added rollups, we are also now offering SQL support as an experimental feature, in response to popular demand from Elasticsearch users.

In this blog series, we'll explore some of the features and functionality that Elasticsearch SQL currently supports, discuss its current limitations, and outline our plans for the future.

Target audience

Elastic has long hesitated to add SQL to its product for a variety of reasons. The most common questions included:

  • What does SQL support cover?
  • Which features are supported? Specifically: Are JOINs supported?
  • And what about expressions / functions or grouping?
  • Do we need to support JDBC / ODBC connectivity?
  • Do we have to support SQL at all or are we just not providing our users with enough documentation and help to get started with our DSL (Domain Specific Language)?

After several attempts, we managed to limit the required features to those that we thought would be useful for users. In discussions with our users, two main target groups emerged for SQL:

  • New users of the Elastic Stack who feel overwhelmed by the Elasticsearch DSL as a starting point, or who simply don't want to learn the full syntax. For example, users converting an existing SQL-based application for performance and scalability reasons may just need the equivalent query without having to learn the complete syntax. We are also aware that a successful and widespread learning strategy is to learn new things by relating them to existing knowledge.
  • Data consumers who neither want nor need to learn the Elasticsearch DSL: for example, data scientists who just want to extract data for external processing, or less technical BI users who are largely familiar with SQL and use it daily.

SQL is not only of interest to the groups mentioned above; as a declarative language it is an extremely attractive paradigm for all users, as this blog series will repeatedly demonstrate. The popularity of SQL rests on its ability to express the logic of a computation, and what you want to achieve, without having to define its control flow. We will also show how problems that are awkward to express in the Elasticsearch DSL can be defined elegantly with SQL queries. The strength of the Elasticsearch DSL lies in describing full-text search problems, while SQL is more effective at describing structured, analytics-based queries.

Features of Elasticsearch SQL

Elasticsearch SQL offers a read-only interface that conforms to a subset of the ANSI SQL specification and allows Elasticsearch to be presented as a tabular data source. In addition, we offer operators beyond the specification that expose Elasticsearch-specific functionality not found in RDBMS-based implementations. Our goal is a lightweight, fast implementation with minimal external dependencies and few moving parts. This does not turn Elasticsearch into a full relational database (with its associated properties), nor does it remove the need for data modeling. While some data manipulation functions and expressions are implemented by the SQL plugin, the pushdown principle is always applied whenever the result count or result order is affected or a grouping is requested. This limits the current processing of data in the Elasticsearch SQL plugin to pure result manipulation (e.g. functions on fields), and the client (JDBC driver, CLI or browser) to pure rendering. This approach leverages the scalability and performance of Elasticsearch, letting it do the heavy lifting for you.

Mapping concepts: indexes and documents or tables and rows

In the early days of Elasticsearch, indexes and types were often compared to relational databases and tables in an RDBMS in order to introduce users to new concepts and to make them easier to work with. As explained in the Elasticsearch 6.3 documentation, this analogy was not only false and misleading, but potentially dangerous. We are removing types from Elasticsearch, but we still need a suitable and usable logical correspondence between the schema-less, document-oriented Elasticsearch model and the strongly typed SQL concepts.

Fortunately, like RDBMS tables, Elasticsearch indexes are physically isolated and should be used in much the same way (that is, to store related data). Rows and documents are also a natural correspondence, as both provide a way of grouping fields or columns, although rows tend to be strict (with stronger constraints) while documents are somewhat more flexible and looser (without losing their structure). In Elasticsearch, a field is a named entry that supports various data types and may hold a list of values. With the exception of these multi-valued fields, the concept maps directly to SQL columns. Note: If a SELECT statement is executed on a multi-valued field, an error is returned at query time.
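To make the document-to-row correspondence concrete, here is a minimal Python sketch. It is a toy illustration only, not how the SQL plugin is implemented; the function name document_to_row and the sample document are assumptions made for this example.

```python
def document_to_row(doc, columns):
    """Project a document (a dict) onto an ordered list of column names.

    Toy illustration: a missing field becomes None (a SQL NULL), while a
    multi-valued field is rejected, mirroring the query-time error
    described above.
    """
    row = []
    for name in columns:
        value = doc.get(name)
        if isinstance(value, (list, tuple)):
            raise ValueError(f"field [{name}] is multi-valued and cannot be a column")
        row.append(value)
    return tuple(row)


if __name__ == "__main__":
    # Hypothetical document using field names from the flights dataset
    flight = {"FlightNum": "1Y0TZOE", "OriginCountry": "US", "Carrier": "Kibana Airlines"}
    print(document_to_row(flight, ["FlightNum", "OriginCountry"]))
```

Each selected field becomes one column of the row, which is why the article can treat documents and rows, and fields and columns, as near synonyms.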

With other concepts, the mapping is not as straightforward: an SQL database and an Elasticsearch cluster have little equivalence. However, this does not generally bother Elasticsearch SQL users. For more information on this topic, see the documentation Mapping concepts across SQL and Elasticsearch.

In short, use the index name in the FROM clause to target a particular index or table. Individual documents are then returned as rows, with their fields mapped to columns. Thanks to this largely transparent mapping, we use these terms interchangeably in what follows.

Implementation details

The Elasticsearch SQL implementation has four phases of execution:

The parser converts the SQL query into an abstract syntax tree (AST). All syntax validation is done at this stage, before the analyzer validates the AST and resolves all tables, columns, functions, aliases and namespaces to create a logical plan. This logical plan is then optimized, with redundant expressions removed, before being converted into an executable physical plan (i.e. Elasticsearch DSL). The query executor then runs the actual query and streams the results to the client, performing any required type and tabular conversions, for example converting an aggregation tree into a table.

Connectivity Methods

When deploying SQL solutions, connectivity support is a critical factor. A pure REST interface may be acceptable to some users, but most users expect to be able to connect using standard interfaces - usually JDBC and ODBC. Support for ODBC is planned and is currently being developed, but JDBC will be available as of this first release and can be downloaded.

Important: All communication with these drivers continues to run via HTTP and our REST interface. This offers a number of advantages:

  1. The process of granting SQL access to your users is no different from opening and exposing an Elasticsearch port with its native integration with security functions. We are therefore able to deploy SQL instantly on our hosted Elasticsearch Service on the Elastic Cloud, and existing users can combine this with the OOTB access control permissions.
  2. In this way we can use SQL directly via the REST interface and provide an additional CLI client to increase the ease of use. We expect the CLI client to be particularly appealing to administrators who are already familiar with the command line interaction that is common in RDBMS.

The JDBC driver uses the newly created XContent library, which is responsible for parsing queries and responses (historically this code was closely tied to Elasticsearch). That way, the driver doesn't depend on all of the Elasticsearch libraries, so it remains lightweight and portable. The decoupling continues to be improved so that the driver will be smaller and faster in future versions.

Some simple examples

Let's look at some examples that use a combination of the CLI and the REST API. For our examples we will use a sample dataset that will soon ship with Kibana. If you don't want to wait, this flights dataset is also available at demo.elastic.co, where you can run the following examples from the Kibana console. Throughout the blog we provide links to demo.elastic.co that should open with the relevant query pre-populated. Alternatively, we provide a full list of the queries that can be run in the Kibana demo console. In some cases the results may vary when the example query has no explicit order or limit. This is due to the natural ordering of results in Elasticsearch when no relevance or sort order is applied.

Retrieving Elasticsearch Schema Information: Comparing DSL and SQL

Let us first identify the schema of the table / index and the fields that are available to us. This is done via the REST interface:

Request

POST _xpack/sql
{
  "query": "DESCRIBE flights"
}

Try it out on demo.elastic.co.

The response can also be returned as a table by setting the format URL parameter. Example:

POST _xpack/sql?format=txt
{
  "query": "DESCRIBE flights"
}

Try it out on demo.elastic.co.

          column          |     type
--------------------------+---------------
AvgTicketPrice            | REAL
Canceled                  | BOOLEAN
Carrier                   | VARCHAR
Carrier.keyword           | VARCHAR
Dest                      | VARCHAR
Dest.keyword              | VARCHAR
DestAirportID             | VARCHAR
DestAirportID.keyword     | VARCHAR
DestCityName              | VARCHAR
DestCityName.keyword      | VARCHAR
DestCountry               | VARCHAR
DestCountry.keyword       | VARCHAR
DestLocation              | STRUCT
DestLocation.lat          | VARCHAR
DestLocation.lat.keyword  | VARCHAR
DestLocation.lon          | VARCHAR
DestLocation.lon.keyword  | VARCHAR
DestRegion                | VARCHAR
DestRegion.keyword        | VARCHAR
DestWeather               | VARCHAR
DestWeather.keyword       | VARCHAR
DistanceKilometers        | REAL
DistanceMiles             | REAL
FlightDelay               | BOOLEAN
FlightDelayMin            | BIGINT
FlightDelayType           | VARCHAR
FlightDelayType.keyword   | VARCHAR
FlightNum                 | VARCHAR
FlightNum.keyword         | VARCHAR
FlightTimeHour            | REAL
FlightTimeMin             | REAL
Origin                    | VARCHAR
Origin.keyword            | VARCHAR
OriginAirportID           | VARCHAR
OriginAirportID.keyword   | VARCHAR
OriginCityName            | VARCHAR
OriginCityName.keyword    | VARCHAR
OriginCountry             | VARCHAR
OriginCountry.keyword     | VARCHAR
OriginLocation            | STRUCT
OriginLocation.lat        | VARCHAR
OriginLocation.lat.keyword| VARCHAR
OriginLocation.lon        | VARCHAR
OriginLocation.lon.keyword| VARCHAR
OriginRegion              | VARCHAR
OriginRegion.keyword      | VARCHAR
OriginWeather             | VARCHAR
OriginWeather.keyword     | VARCHAR
dayOfWeek                 | BIGINT
timestamp                 | TIMESTAMP
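For readers who call the endpoint from a script rather than the Kibana console, the request can be sketched in Python using only the standard library. This is a minimal sketch under stated assumptions: the helper name build_sql_request and the base URL http://localhost:9200 are illustrative, and the function only builds the request; sending it requires a reachable cluster and credentials.

```python
import json
import urllib.parse
import urllib.request


def build_sql_request(base_url, query, fmt=None):
    """Build (but do not send) a POST request for the _xpack/sql endpoint."""
    url = base_url.rstrip("/") + "/_xpack/sql"
    if fmt is not None:
        # e.g. fmt="txt" asks for the tabular text rendering used in this post
        url += "?" + urllib.parse.urlencode({"format": fmt})
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_sql_request("http://localhost:9200", "DESCRIBE flights", fmt="txt")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # Sending the request (needs a running cluster and authentication):
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.read().decode("utf-8"))
```

Omitting fmt leaves the URL without a format parameter, which corresponds to the first request shown above.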

From here on, we will use the tabular response structure shown above whenever we show an example response from the REST API. To run the same query through the console, we first need to log in to the CLI as follows:

./elasticsearch-sql-cli http://elastic@localhost:9200

After responding to the password prompt ...

sql> DESCRIBE flights;
      column      |     type
------------------+---------------
AvgTicketPrice    | REAL
Canceled          | BOOLEAN
Carrier           | VARCHAR
Dest              | VARCHAR
DestAirportID     | VARCHAR
DestCityName      | VARCHAR
DestCountry       | VARCHAR
DestLocation      | OTHER
DestRegion        | VARCHAR
DestWeather       | VARCHAR
DistanceKilometers| REAL
DistanceMiles     | REAL
FlightDelay       | BOOLEAN
FlightDelayMin    | INTEGER
FlightDelayType   | VARCHAR
FlightNum         | VARCHAR
OriginLocation    | OTHER
OriginRegion      | VARCHAR
OriginWeather     | VARCHAR
dayOfWeek         | INTEGER
timestamp         | TIMESTAMP

sql>

The schema shown above is also returned with every query for the fields named in the SELECT clause. This gives any potential driver the type information it needs to format or operate on the results. For example, consider a simple SELECT whose response is kept short with a LIMIT clause. By default, 1000 rows are returned.

Simple SELECT statement

POST _xpack/sql?format=txt
{
  "query": "SELECT FlightNum FROM flights LIMIT 1"
}

Try it out on demo.elastic.co (different results possible).

   FlightNum
---------------
1Y0TZOE

This REST request / response is processed by the JDBC driver and the console, but remains hidden from the user.

sql> SELECT OriginCountry, OriginCityName FROM flights LIMIT 1;
 OriginCountry | OriginCityName
---------------+---------------
US             | San Diego

Try it out on demo.elastic.co (different results possible).

Note: If you request a field that does not exist (field names are case-sensitive), an error is returned, as the semantics of a tabular, strongly typed store require. This differs from the usual Elasticsearch behavior, where the field is simply not returned. For example, if the field name "OrigincityName" is used instead of "OriginCityName" in the previous query, the following helpful error message is returned:

{
  "error": {
    "root_cause": [
      {
        "type": "verification_exception",
        "reason": "Found 1 problem(s)\nline 1:8: Unknown column [OrigincityName], did you mean any of [OriginCityName, DestCityName]?"
      }
    ],
    "type": "verification_exception",
    "reason": "Found 1 problem(s)\nline 1:8: Unknown column [OrigincityName], did you mean any of [OriginCityName, DestCityName]?"
  },
  "status": 400
}
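A client consuming this response might surface the first root cause to the user. The following Python sketch shows one way to do that; the helper name summarize_sql_error is an assumption made for this example, and the payload is the error body shown above.

```python
import json

# The error body from the example above (raw string so that \n stays a JSON escape)
ERROR_BODY = r"""
{
  "error": {
    "root_cause": [
      {
        "type": "verification_exception",
        "reason": "Found 1 problem(s)\nline 1:8: Unknown column [OrigincityName], did you mean any of [OriginCityName, DestCityName]?"
      }
    ],
    "type": "verification_exception",
    "reason": "Found 1 problem(s)\nline 1:8: Unknown column [OrigincityName], did you mean any of [OriginCityName, DestCityName]?"
  },
  "status": 400
}
"""


def summarize_sql_error(body):
    """Return (status, error type, reason lines) for an error response body."""
    payload = json.loads(body)
    error = payload["error"]
    # Fall back to the top-level error if no root cause is listed
    causes = error.get("root_cause") or [error]
    first = causes[0]
    return payload["status"], first["type"], first["reason"].splitlines()


if __name__ == "__main__":
    status, kind, lines = summarize_sql_error(ERROR_BODY)
    print(status, kind)
    for line in lines:
        print("  " + line)
```

The second reason line carries the position (line 1:8) and the suggested column names, which is usually what a user needs to correct the query.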

Try it out on demo.elastic.co.

If we try to use a function or an expression on an incompatible field, an appropriate error message is likewise returned. In general, the analyzer fails early, while validating the abstract syntax tree. To do this, Elasticsearch must know the index mapping and the capabilities of each field. For this reason, every client that accesses the SQL interface with security enabled needs the appropriate permissions.

To keep this blog from becoming too lengthy, we can only present a few queries of increasing complexity here, with some commentary.

SELECT statement with the WHERE and ORDER BY clauses

"Find the 10 longest flights from US airports with a flight time of over 5 hours."

POST _xpack/sql?format=txt
{
  "query": "SELECT OriginCityName, DestCityName FROM flights WHERE FlightTimeHour > 5 AND OriginCountry = 'US' ORDER BY FlightTimeHour DESC LIMIT 10"
}

Try it out on demo.elastic.co.

OriginCityName | DestCityName
---------------+---------------
Atlanta        | Durban
Louisville     | Melbourne
Peoria         | Melbourne
Albuquerque    | Durban
Birmingham     | Durban
Bangor         | Brisbane
Seattle        | Durban
Huntsville     | Sydney
Savannah       | Shanghai
Philadelphia   | Xi'an

The operator for limiting the number of rows depends on the specific SQL implementation. For Elasticsearch SQL we implement the LIMIT operator, in line with PostgreSQL and MySQL.

Math

Just some arbitrary arithmetic ...

sql> SELECT ((1 + 3) * 1.5 / (7 - 6)) * 2 AS random;
    random
---------------
12.0

Try it out on demo.elastic.co.

This is an example of the server side performing some post-processing on functions. There is no equivalent Elasticsearch DSL query for this.

Functions and Expressions

"Find all flights after the month of June with a flight time of more than 5 hours, ordered by longest flight time."

POST _xpack/sql?format=txt
{
  "query": "SELECT MONTH_OF_YEAR(timestamp), OriginCityName, DestCityName FROM flights WHERE FlightTimeHour > 5 AND MONTH_OF_YEAR(timestamp) > 6 ORDER BY FlightTimeHour DESC LIMIT 10"
}

Try it out on demo.elastic.co.

MONTH_OF_YEAR(timestamp [UTC])| OriginCityName |   DestCityName
------------------------------+----------------+-------------------
7                             | Buenos Aires   | Shanghai
7                             | Stockholm      | Sydney
7                             | Chengdu        | Bogota
7                             | Adelaide       | Cagliari
7                             | Osaka          | Buenos Aires
7                             | Buenos Aires   | Chitose / Tomakomai
7                             | Buenos Aires   | Shanghai
7                             | Adelaide       | Washington
7                             | Osaka          | Quito
7                             | Buenos Aires   | Xi'an

Achieving an equivalent result in Elasticsearch itself would normally require writing code in Painless, the Elasticsearch scripting language, whereas SQL's declarative functions avoid any scripting. Note that the function is used in both the WHERE clause and the SELECT clause. The WHERE component is pushed down to Elasticsearch because it affects the result count. The function in the SELECT clause, however, is handled by the server-side plugin when the results are presented.

Note: SHOW FUNCTIONS can be used to view a list of the available functions.

Try it out on demo.elastic.co.

Combining this with the math capabilities shown earlier, we can start to formulate queries that would be too complex for most DSL users to express.

"Find the flight distance and the average speed of the two fastest flights that depart on a Monday, Tuesday or Wednesday morning between 9:00 and 11:00 and cover a distance of more than 500 km. Round the distance and speed to the nearest whole number. If several flights have the same speed, show the one with the greatest distance first."

Try it out on demo.elastic.co.

sql> SELECT timestamp, FlightNum, OriginCityName, DestCityName, ROUND(DistanceMiles) AS distance, ROUND(DistanceMiles/FlightTimeHour) AS speed, DAY_OF_WEEK(timestamp) AS day_of_week FROM flights WHERE DAY_OF_WEEK(timestamp) >= 0 AND DAY_OF_WEEK(timestamp) <= 2 AND HOUR_OF_DAY(timestamp) >= 9 AND HOUR_OF_DAY(timestamp) <= 10 ORDER BY speed DESC, distance DESC LIMIT 2;
       timestamp        |   FlightNum   |OriginCityName | DestCityName  |   distance    |     speed     |  day_of_week
------------------------+---------------+---------------+---------------+---------------+---------------+---------------
2018-07-03T10:03:11.000Z| REPKGRT       | Melbourne     | Norfolk       | 10199         | 783           | 2
2018-06-05T09:18:29.000Z| J72Y2HS       | Dubai         | Lima          | 9219          | 783           | 2

The query may seem convoluted and odd, but hopefully it illustrates the point. Note also how we create field aliases and then refer to them in the ORDER BY clause.

Note: Fields used in the WHERE and ORDER BY clauses do not need to be listed in the SELECT clause. This is likely to differ from SQL implementations you have used previously. The following is therefore perfectly valid:

POST _xpack/sql
{
  "query": "SELECT timestamp, FlightNum FROM flights WHERE AvgTicketPrice > 500 ORDER BY AvgTicketPrice"
}

Try it out on demo.elastic.co.

Translating SQL queries to DSL using "translate"

Some SQL queries are hard to express in the Elasticsearch DSL, or it is unclear whether a given DSL formulation is optimal. The new SQL interface helps less experienced Elasticsearch users with such problems. Using the REST interface, we simply append /translate to the "sql" endpoint to obtain the Elasticsearch query that the driver would issue.

Let's look at some of the previous queries:

POST _xpack/sql/translate
{
  "query": "SELECT OriginCityName, DestCityName FROM flights WHERE FlightTimeHour > 5 AND OriginCountry = 'US' ORDER BY FlightTimeHour DESC LIMIT 10"
}

Try it out on demo.elastic.co.

The DSL equivalent should look like this:

{
  "size": 10,
  "query": {
    "bool": {
      "filter": [
        {"range": {"FlightTimeHour": {"from": 5, "to": null, "include_lower": false, "include_upper": false, "boost": 1}}},
        {"term": {"OriginCountry.keyword": {"value": "US", "boost": 1}}}
      ],
      "adjust_pure_negative": true,
      "boost": 1
    }
  },
  "_source": {"includes": ["OriginCityName", "DestCityName"], "excludes": []},
  "sort": [{"FlightTimeHour": {"order": "desc"}}]
}

The WHERE clause was translated into range and term queries, as expected. Note how the OriginCountry.keyword sub-field is used for the exact term match rather than the parent OriginCountry field, which is of type text. The user does not need to know how the underlying mapping behaves differently; the correct field is selected automatically. Interestingly, the interface also tries to optimize retrieval performance by using "docvalue_fields" instead of "_source" where available, i.e. for exact types (numerics, dates, keywords) with doc values enabled. We can rely on Elasticsearch SQL to generate the most suitable DSL for the given query.
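The idea of automatically choosing the keyword sub-field for an exact match can be sketched in a few lines of Python. This is an illustration of the concept only and makes no claim about how the SQL plugin implements it; the helper names and the small mapping fragment are assumptions for this example, although the "fields" structure follows the usual shape of an Elasticsearch multi-field mapping.

```python
def exact_match_field(properties, field):
    """Return the field name to use in a term query for an exact match.

    For a text field, prefer a keyword sub-field if the mapping defines one;
    other field types are already exact and are used as-is.
    """
    spec = properties[field]
    if spec.get("type") == "text":
        for sub_name, sub_spec in spec.get("fields", {}).items():
            if sub_spec.get("type") == "keyword":
                return f"{field}.{sub_name}"
        raise ValueError(f"text field [{field}] has no keyword sub-field")
    return field


def term_filter(properties, field, value):
    """Build a term filter against the exact-match variant of a field."""
    return {"term": {exact_match_field(properties, field): {"value": value}}}


if __name__ == "__main__":
    # Hypothetical fragment of the flights index mapping
    properties = {
        "OriginCountry": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
        "FlightDelayMin": {"type": "integer"},
    }
    print(term_filter(properties, "OriginCountry", "US"))
```

With the hypothetical mapping above, the generated filter targets OriginCountry.keyword, matching the term query in the translated DSL.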

Now let's look at the most complex query we used earlier:

POST _xpack/sql/translate
{
  "query": "SELECT timestamp, FlightNum, OriginCityName, DestCityName, ROUND(DistanceMiles) AS distance, ROUND(DistanceMiles/FlightTimeHour) AS speed, DAY_OF_WEEK(timestamp) AS day_of_week FROM flights WHERE DAY_OF_WEEK(timestamp) >= 0 AND DAY_OF_WEEK(timestamp) <= 2 AND HOUR_OF_DAY(timestamp) >= 9 AND HOUR_OF_DAY(timestamp) <= 10 ORDER BY speed DESC, distance DESC LIMIT 2"
}

Try it out on demo.elastic.co.

The response is:

{
  "size": 2,
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "filter": [
              {
                "script": {
                  "script": {
                    "source": "(params.v0 <= doc[params.v1].value.getDayOfWeek()) && (doc[params.v2].value.getDayOfWeek() <= params.v3)",
                    "lang": "painless",
                    "params": {"v0": 0, "v1": "timestamp", "v2": "timestamp", "v3": 2}
                  },
                  "boost": 1
                }
              },
              {
                "script": {
                  "script": {
                    "source": "doc[params.v0].value.getHourOfDay() >= params.v1",
                    "lang": "painless",
                    "params": {"v0": "timestamp", "v1": 9}
                  },
                  "boost": 1
                }
              }
            ],
            "adjust_pure_negative": true,
            "boost": 1
          }
        },
        {
          "script": {
            "script": {
              "source": "doc[params.v0].value.getHourOfDay() <= params.v1",
              "lang": "painless",
              "params": {"v0": "timestamp", "v1": 10}
            },
            "boost": 1
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1
    }
  },
  "_source": false,
  "stored_fields": "_none_",
  "docvalue_fields": ["timestamp", "FlightNum", "OriginCityName", "DestCityName", "DistanceMiles", "FlightTimeHour"],
  "sort": [
    {
      "_script": {
        "script": {
          "source": "Math.round((doc[params.v0].value) / (doc[params.v1].value))",
          "lang": "painless",
          "params": {"v0": "DistanceMiles", "v1": "FlightTimeHour"}
        },
        "type": "number",
        "order": "desc"
      }
    },
    {
      "_script": {
        "script": {
          "source": "Math.round(doc[params.v0].value)",
          "lang": "painless",
          "params": {"v0": "DistanceMiles"}
        },
        "type": "number",
        "order": "desc"
      }
    }
  ]
}

The WHERE and ORDER BY clauses were converted to Painless scripts and used in the script query and script-based sort provided by Elasticsearch. These scripts are even parameterized, which avoids repeated compilations and takes advantage of script caching.
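Why parameterization helps with caching can be shown with a toy model in Python. This sketch uses Python expressions as a stand-in for Painless and a dictionary keyed by script source as a stand-in for the script cache; the class and function names are assumptions for this example and do not describe Elasticsearch internals.

```python
class ScriptCache:
    """Toy compile cache keyed by script source text."""

    def __init__(self):
        self._compiled = {}
        self.compilations = 0

    def get(self, source):
        # Compile only when this exact source text has not been seen before
        if source not in self._compiled:
            self.compilations += 1
            self._compiled[source] = compile(source, "<script>", "eval")
        return self._compiled[source]


def evaluate(cache, source, params, doc):
    """Evaluate a script against a document with the given parameters."""
    code = cache.get(source)
    return eval(code, {"__builtins__": {}}, {"params": params, "doc": doc})


if __name__ == "__main__":
    doc = {"hour": 10}

    # Parameterized: one source text, values passed separately
    parameterized = ScriptCache()
    source = "doc[params['v0']] >= params['v1']"
    for hour in (9, 11):
        evaluate(parameterized, source, {"v0": "hour", "v1": hour}, doc)

    # Inlined: every distinct constant produces a new source text
    inlined = ScriptCache()
    for hour in (9, 11):
        evaluate(inlined, f"doc['hour'] >= {hour}", {}, doc)

    print("parameterized compilations:", parameterized.compilations)
    print("inlined compilations:", inlined.compilations)
```

Because the parameterized source text never changes, it is compiled once and reused, whereas each inlined constant forces a fresh compilation.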

Note: The DSL shown above is the optimal translation of the SQL statement, but it is not necessarily the best solution to the wider problem. In practice, we would encode the day of the week, the hour of the day and the speed on the document at index time, so that simple range queries are sufficient. This is likely to be faster than solving this particular problem with Painless scripts, and some of these fields are in fact already present on the document for this reason. While the Elasticsearch SQL implementation reliably gives us an optimal translation, it can only use the fields specified in the query and therefore cannot necessarily provide the best solution to the wider problem. Finding the best approach also requires taking the capabilities of the underlying platform into account. The _translate API can be seen as the first step in this process.
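The index-time alternative described in this note can be sketched as follows. This is a minimal Python sketch under stated assumptions: the field names hourOfDay and speedMph are hypothetical, the day numbering follows Python's convention (Monday is 0), which may differ from the convention used by the cluster, and the sample values are made up for illustration.

```python
from datetime import datetime, timezone


def enrich_flight(doc):
    """Return a copy of a flight document with derived fields precomputed."""
    ts = datetime.strptime(doc["timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ")
    ts = ts.replace(tzinfo=timezone.utc)
    enriched = dict(doc)
    enriched["dayOfWeek"] = ts.weekday()  # Python convention: Monday == 0
    enriched["hourOfDay"] = ts.hour       # hypothetical field name
    hours = doc["FlightTimeHour"]
    # hypothetical field name; left empty when the flight time is zero
    enriched["speedMph"] = round(doc["DistanceMiles"] / hours) if hours else None
    return enriched


def early_week_morning_filter():
    """Plain range filters that replace the Painless scripts in the WHERE clause."""
    return {
        "bool": {
            "filter": [
                {"range": {"dayOfWeek": {"gte": 0, "lte": 2}}},
                {"range": {"hourOfDay": {"gte": 9, "lte": 10}}},
            ]
        }
    }


if __name__ == "__main__":
    sample = {
        "timestamp": "2018-07-03T10:03:11.000Z",
        "DistanceMiles": 1000.0,   # made-up values for illustration
        "FlightTimeHour": 2.0,
    }
    print(enrich_flight(sample))
    print(early_week_morning_filter())
```

With the derived values stored on each document, sorting by speed also becomes an ordinary field sort instead of a script-based sort.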

Subject of the next blog in this series

In the blog An Introduction to Elasticsearch SQL with Practical Examples - Part 2, we show other uses for the _translate API and illustrate some of the more complex Elasticsearch SQL features. In addition, we look at current restrictions and introduce some planned enhancements to the product in the future.