An introduction to Exploration

Exploration forms part of the Autonomous Data Cloud (ADC) product and is one of many options available to explore our unified data virtualization layer.

It offers a common SQL interpreter for accessing data across multiple data sources. These data sources could include Hadoop data lakes, SIEM solutions, relational databases, Splunk indexes, Elastic stacks, and even IoT devices. This is achieved by Gemini’s unique Zero Copy Data Virtualization, which standardizes data to a tabular format without requiring data to be moved or copied from disparate locations to a centralized repository. Exploration provides unified access to this data, allowing fast and easy data exploration and the preparation of datasets for Machine Learning use cases.

Use Exploration, therefore, to explore data made available through Gemini, produce insightful documentation, or create more visual, graph-based analysis of your entire dataset.

General overview of Exploration

Exploration offers the user a notebook style interface to explore data within ADC. It has the ability to present your data in different ways using a language interpreter, assigned at the beginning of each analysis paragraph.

Exploration also offers visualization functionality, giving you the ability to create and produce dashboards based on your findings.

To clarify the terminology in use with our Notebook:

Exploration term   Otherwise known as
Note               Dashboard / View
Paragraph          A panel on a dashboard
Interpreter        The language you wish to use (e.g. %sql)

To open Exploration, select the Exploration option from the Global App Bar located at the top of your screen.

The 'Welcome to Exploration!' panel should now be visible. The ‘Notebook’ menu gives you access to importing, creating, or opening existing Notes.

Expand any folder icons to reveal more Notes.

Select and Load data for use in Exploration

Before you can search and query data, you must first load some libraries and converge the datasets on which you want to operate into the Exploration cache. This would, for instance, bring together Elastic and Splunk storage for a project involving both datasets.

It is only possible to operate on data which has already been connected to ADC. If you need to explore and analyze data from other datasets, first connect those datasets to ADC. Please refer to the section entitled ‘Connecting Data Sources to ADC’ for more information.

From the ‘Data Loader for Exploration’ page, locate the ‘Load Libraries’ paragraph, which you will find just under the Instruction panel. ‘Run’ this paragraph using the triangular icon at the top right of the panel (Shift + Enter). This loads the libraries required by Exploration, enabling access to the underlying file system. This is a once-only activity, completed at the start of your project.

Now select the data you wish to query for your project. Locate the paragraph titled ‘Available Datasets’, choose the appropriate datasets together with the required time range for your project, and ‘Run’ the paragraph using the triangular icon (Shift + Enter). If a dataset is missing from the available selection, you will need to connect that data into ADC (refer to the Note above).

To complete the process of building the Exploration cache ready for analysis, run the remaining Paragraphs on this page in the order they appear: ‘Load Datasets’, ‘Cache Datasets’ and ‘Validate Datasets’. The ‘Cache Datasets’ paragraph is optional, but recommended, as it improves the performance of your exploration projects during analysis.

If you wish to change the selected datasets and/or time range, including updating to the very latest data, return to the ‘Data Loader for Exploration’ page and make the appropriate changes in the ‘Available Datasets’ paragraph. The paragraphs that follow ‘Available Datasets’ must also be run, to populate the Exploration cache with the latest data. The easiest way of achieving this is to use the ‘Run’ icon on the Exploration Toolbar at the top of the screen, which runs all paragraphs within the Note; this is exactly what is required to refresh the data.

To verify that the data is present, or to begin a new project, open a new paragraph here or in a new Note and use the following SQL command to reveal the datasets available to Exploration.

show tables
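
To check that a specific dataset loaded correctly, inspect its schema with a describe command. A minimal sketch (‘splunk_data’ is a placeholder; substitute one of the table names returned above):

describe splunk_data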

Resetting the Environment

Whilst working with Exploration, it is sometimes necessary to clear its cache, particularly when dealing with huge datasets, in order to free resources within the system. This process completely clears the Exploration cache of all data and allows the operator to rebuild it by running all paragraphs in ‘Data Loader for Exploration’.

To achieve the reset, locate and click on the ‘gear icon’ at the top right of the screen. This opens the Interpreter Binding page, at the top of which is the Spark Interpreter. Click on the small icon to the left of the Spark Interpreter and follow the prompt to ‘save’. This resets the cache and allows new datasets to be loaded in the usual way.

This process will clear the cache for all user sessions running in Exploration. Be aware of any other users before proceeding.

Creating a new Exploration project

  • From the Notebook menu, choose ‘Create new note’ option.
  • Enter a Note Name. To create a note in a directory structure, use the ‘/’ symbol. For example, /DirA/Project1 will create a new Project1 note inside DirA directory.
  • Choose the ‘Default Interpreter’ that you will use in the project. If this is unknown, simply leave it at the default ‘spark’ interpreter, which allows Scala to be used by default.
Selecting a default will still allow switching between interpreters during your project
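
As an illustration, a paragraph that begins with an interpreter directive overrides the Note default for that paragraph only. A minimal sketch using the SQL interpreter:

%sql
show tables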

A series of icons will be visible running along the top of the screen which will help you to navigate your projects.

Introduction to exploring data using Exploration

Whether you are creating a new Note or opening an existing Note, the process of exploring and analyzing your data is the same.

Exploration will automatically create a new Paragraph for you to process additional commands, using the Interpreter of your choice.

By utilizing one of the most common Interpreters (%sql), we can discover and explore multiple datasets together, even though they may have originated from completely different sources, such as Splunk, Elastic, or Hadoop.

We have a Tutorial built into ADC. Locate it under the Notebook menu; it is simply called Tutorial/Basic Features (Spark). It shows and explains the basic features of Spark and the running of simple SQL queries. The data is called ‘Bank’ and is based on publicly available data for research purposes only (refer to the citation request at the bottom of the Tutorial Note). There are paragraphs showing how to work with our results and display them visually. If you are already familiar with this Notebook interface, please skip to the next section.

Tutorial 1: Exploring data using Exploration

Based on the Bank data used in this tutorial, type the following searches in new paragraphs to familiarize yourself with this interface.

You will first need to run the Paragraph titled ‘Load data into table’ in order to make the Bank data available for Exploration. Locate and click on the ‘Run’ icon (triangle), located top right of the Paragraph. Check out the Scala code, which collects data from s3.amazonaws.com and makes it available for Exploration.

Search 1:

  • Locate the next Paragraph, and once again ‘Run’ this Paragraph, to reveal the result of the following command.
show tables [the Bank table should be listed]

Search 2:

  • Explore the content of our bank data using this command and review the resulting output
SELECT * FROM bank
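
On large datasets, an unbounded SELECT * can be slow to render. A hedged variant that caps the number of rows returned:

SELECT * FROM bank LIMIT 10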

Search 3:

  • In order to create a visualization, try something like this, and click on the Graph tabs to create a column, pie, or scatter chart for instance. There is also an export option.
SELECT age, count(1) value FROM bank
WHERE age < 30
GROUP BY age
ORDER BY age
Click on settings to see and enable more features for both tables and charts

If you make any changes to either the search statement or the visualization options, you will need to replay the statement to activate those changes. Locate the triangular icon at the top of each Paragraph window to achieve this.

An exception to this, however, is when you employ selection features within your visualization. In Search 4 (below), we introduce this feature. Run Search 4 and click on the Gear Icon at the top right-hand corner of your Paragraph. The first option here is ‘run on selection change’. Enabling this feature automatically replays the statement whenever a selection changes.

Search 4:

  • Create a feature dropdown box within a visualization using the following syntax. Experiment by running this statement as either an area or a column chart.
SELECT age, count(1) value FROM bank
WHERE marital="${marital=single,single|divorced|married}"
GROUP BY age

Additional Paragraph Settings - Gear icon

At the top right hand corner of each Paragraph window you will find a Gear Icon. Use this to change the look and feel of the Paragraphs on your screen, and navigate the editor window.

Setting         Notes on using this setting                                                                       Shortcut key
width           Default = 12 (full page width). For two Paragraphs equidistant around the centre of the Note, set each to ‘6’   N/A
font size       Relates to the font of the statement                                                              N/A
move up         Relocates the Paragraph higher up the Note                                                        ctrl+alt+k
move down       Relocates the Paragraph lower down the Note                                                       ctrl+alt+j
show title      Allows editing of the Paragraph title                                                             ctrl+alt+t
show line num   Show/hide line numbers in the code section                                                        ctrl+alt+m
disable run     Run will no longer function for this Paragraph                                                    ctrl+alt+r
link this para  Export the current Paragraph as an iframe                                                         ctrl+alt+w
clear output    Clears the Results section                                                                        ctrl+alt+l
remove          Deletes the current Paragraph                                                                     ctrl+alt+d

Using the Exploration Toolbar

Situated at the top of the screen, the Exploration toolbar is a useful way of navigating the interface.

Icon                    Action
first icon (Run)        Runs all paragraphs within the Note, if required
second icon             Shows/hides the code used within each paragraph; the language shown is based on the interpreter used within that paragraph
third icon (book)       Shows/hides the output of each paragraph, whether that be a graph, table or text
fourth icon (eraser)    Clears the resultant output from every paragraph of the Note
fifth icon (copy)       Clones the entire Note
sixth icon              Downloads the entire Note in the form of a json file
seventh icon (people)   Switches to ‘Personal’ mode. All Notes are public by default; use this icon to make them private

Exploring multiple data sources with Exploration

So we have already connected data from multiple data silos, possibly from Splunk, Elastic, RDBMS databases, Active Directory, etc. If you need to add more data for analysis, refer to the ‘Connecting Data Sources to ADC’ section.

We have also learnt how to select and load this data for analysis in Exploration. If necessary refer back to the ‘Select and load Data for use in Exploration’ section.

This tutorial suggests how to work with multiple datasets. We will use a scenario in which we have Splunk and Elastic data that needs to be explored as one converged entity. Comments below within square brackets ‘[ ]’ explain the demo scenario. For the purposes of the example, each search was conducted in its own Paragraph.

Tutorial 2: Explore identical converged Data Sources

Search 1:

  • In the next available Paragraph, type the following at the prompt to verify the existence of Splunk and Elastic data, and the names of the tables.
show tables

[This shows us we have two tables called: elastic_data, splunk_data]

Search 2:

  • Explore the content of those datasets in two further Paragraphs
SELECT * FROM elastic_data
SELECT * FROM splunk_data

[This shows that the data from both sources happens to be identical in format, so we can use a ‘union’ command to converge the tables. If the tables were dissimilar, we would use the ‘join’ command]

Search 3:

  • Create a UNION between the identically formatted datasets
SELECT 'splunk' dataset, * FROM splunk_data
UNION
SELECT 'elastic' dataset, * FROM elastic_data

[The output table has a new field called dataset to show the merged rows]

For the purposes of this tutorial, we will suggest that there is a requirement to investigate failed logins. During this investigation we discover many failed logins for the ‘root’ user, so we run another report on failed logins for administrative user accounts.

Remember, we are searching across both Splunk and Elastic datasets.

Relevant feature fields:

  • user - contains the username
  • action - describes a login status of failure/success
  • src - contains the source value

Search 4:

  • Discover whether we have an issue with failed logins in general
SELECT user, action, count(1) AS count FROM
(SELECT 'splunk' dataset, * FROM splunk_data
UNION
SELECT 'elastic' dataset, * FROM elastic_data)
GROUP BY user, action
ORDER BY count DESC
Click on the Bar chart icon and put the display in stacked mode for a quick chart

Search 5:

  • Create a table of login results for failed logins from; admin, root and administrator
SELECT source, user, action, count(1) AS count FROM
(SELECT explode(src) AS source, * FROM splunk_data
UNION
SELECT explode(src) AS source, * FROM elastic_data)
WHERE user IN ('admin', 'root', 'administrator')
AND action = 'failure'
GROUP BY source, user, action
We have used the ‘explode’ function here. Spark collects multi-valued fields into Scala ‘wrapped arrays’, which collate all common values together; explode expands these arrays into one row per value, making them visible when using Exploration
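
As an illustration, assume a hypothetical table ‘events’ whose ‘src’ column holds a wrapped array of source addresses. The sketch below expands each array element into its own row:

SELECT explode(src) AS source, user FROM events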

Tutorial 3: Explore dissimilar converged Data Sources

Typically, you will work on various sources of dissimilar data. In this tutorial we explore this scenario using a combination of JOIN and UNION commands.

Imagine we have three separate tables which have different fields but share a common 'dest' field.

splunk_net_traffic
elastic_auth
splunk_web

Search 6:

  • JOIN(INNER) tables to extract the rows in which 'dest' matches all three datasets
SELECT * FROM splunk_net_traffic
JOIN elastic_auth ON elastic_auth.dest = splunk_net_traffic.dest
JOIN splunk_web ON splunk_web.dest = splunk_net_traffic.dest
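
Because all three tables contain a ‘dest’ column, SELECT * returns duplicate column names in the output. A sketch of a variant that selects explicit columns instead (the two ‘user’ fields are aliased to keep them distinct):

SELECT splunk_net_traffic.dest, elastic_auth.user AS auth_user, splunk_web.user AS web_user
FROM splunk_net_traffic
JOIN elastic_auth ON elastic_auth.dest = splunk_net_traffic.dest
JOIN splunk_web ON splunk_web.dest = splunk_net_traffic.dest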

Let us now suggest we want to create a summary table/graph displaying the detail involved in a JOIN between the traffic and web data, split by 'user' and limited to users with more than 4 occurrences.

SELECT splunk_web.user, count(splunk_web.user) AS count FROM splunk_web
JOIN splunk_net_traffic ON splunk_net_traffic.dest = splunk_web.dest
GROUP BY splunk_web.user
HAVING count(splunk_web.user) > 4
We need to use the HAVING clause here because we are filtering on an aggregated result; the WHERE clause is evaluated before aggregation, so it would not work
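
Equivalently, you can aggregate in a subquery first and then filter the aggregated result with WHERE; a sketch:

SELECT user, count FROM
(SELECT splunk_web.user, count(splunk_web.user) AS count FROM splunk_web
JOIN splunk_net_traffic ON splunk_net_traffic.dest = splunk_web.dest
GROUP BY splunk_web.user)
WHERE count > 4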

JOIN can be expensive and add a lot more columns. To reduce the complexity of a search, consider using UNION to extract just the fields on which you want to work.

In the next example we want to work explicitly on three fields: dest, src and user.

Search 7:

  • Use UNION to select just the three specific fields from the Splunk and Elastic tables
SELECT 'splunk' dataset, dest, src, user FROM splunk_auth
UNION
SELECT 'elastic' dataset, dest, src, user FROM elastic_auth

Or maybe we want to know the number of rows returned from each table respectively

SELECT 'splunk' dataset, count(1) FROM splunk_auth
UNION
SELECT 'elastic' dataset, count(1) FROM elastic_auth
The columns in each SELECT must match in number and type for this to work
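
Note that UNION also removes duplicate rows. To keep every row (and avoid the cost of deduplication), use UNION ALL instead:

SELECT 'splunk' dataset, dest, src, user FROM splunk_auth
UNION ALL
SELECT 'elastic' dataset, dest, src, user FROM elastic_auth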

Search 8:

  • OUTER Join: Bring back all the records of Table1 and only those records in Table2 that match.
SELECT * FROM Table1
LEFT JOIN Table2 ON Table1.user=Table2.username

Search 9:

  • Subquery example: Listing employee_ids, but only if they have made a visit.
SELECT * FROM employee
WHERE employee_id IN (SELECT employee_id FROM visit)

As usual with subqueries, adding a boolean NOT before the IN will bring back employee_ids that have not made a visit, as shown below.
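
A sketch of that variant:

SELECT * FROM employee
WHERE employee_id NOT IN (SELECT employee_id FROM visit)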

Search 10:

  • Aggregated subquery example: Here we join the table to an aggregated copy of itself. This example recreates a ‘Splunk eventstats’ command, producing a new field called ‘max_age’
SELECT A.dep_id, A.employee_id, A.age, B.max_age
FROM employee A
LEFT OUTER JOIN (SELECT dep_id, MAX(age) max_age
                 FROM employee B
                 GROUP BY dep_id) B
ON B.dep_id = A.dep_id
ORDER BY dep_id, employee_id
In this example, we have used aliases for the tables to save typing. The ‘employee’ table is given an alias of ‘A’, but in the subsearch it is given an alias of ‘B’ to differentiate it from the outer search.
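
Spark SQL also supports true correlated scalar subqueries, in which the inner query references the outer query directly. A sketch that produces the same ‘max_age’ field without an explicit join:

SELECT A.dep_id, A.employee_id, A.age,
       (SELECT MAX(B.age) FROM employee B WHERE B.dep_id = A.dep_id) AS max_age
FROM employee A
ORDER BY dep_id, employee_id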

Shortcut Keys

Function                          Key Combination
Run paragraph                     shift + enter
Run all above/below paragraphs    ctrl + shift + enter
Cancel                            ctrl + alt + c
Move cursor up                    ctrl + p
Move cursor down                  ctrl + n
Remove paragraph                  ctrl + alt + d
Insert new paragraph above        ctrl + alt + a
Insert new paragraph below        ctrl + alt + b
Insert copy of paragraph below    ctrl + shift + c
Move paragraph up                 ctrl + alt + k
Move paragraph down               ctrl + alt + j
Enable/disable run paragraph      ctrl + alt + r
Toggle output                     ctrl + alt + o
Toggle editor                     ctrl + alt + e
Toggle line number                ctrl + alt + m
Toggle title                      ctrl + alt + t
Clear output                      ctrl + alt + l
Link this paragraph               ctrl + alt + w
Reduce paragraph width            ctrl + shift + '-'
Increase paragraph width          ctrl + shift + '+'