SLS Data Processing Upgraded to Integrate SPL Syntax

1. Overview of Data Processing

Logs are among the most important sources of information in system development, operation, and maintenance, and their greatest advantage is that they are simple and straightforward. However, the log lifecycle contains a pair of hard-to-reconcile requirements: log output and collection should be as simple and convenient as possible, while log analysis demands normalized data and on-demand storage.

To address the former while ensuring service stability and efficiency, various high-performance data pipeline solutions have emerged, such as the Alibaba Cloud SLS service and open-source middleware such as Kafka.
The latter requires delivering standardized, complete data to downstream business analysis and other scenarios, and SLS data processing is the feature that fulfills this requirement.

As shown in the figure above, common scenarios for SLS data processing include:

Regularization: This is the most frequently used scenario, for example, extracting key information from raw text logs and turning it into normalized data (see the SPL sketch after this list).
Enrichment: For example, user click data contains only a product ID, which must be joined with detailed information from a database during analysis.
Desensitization: As China's information security laws and regulations mature, the requirements for handling sensitive data (such as personal information) keep getting stricter.
Splitting: For performance and convenience, multiple records are often combined into a single entry when data is written out; they need to be split into independent entries before analysis.
Distribution: Different types of data are written to different targets for customized downstream consumption.
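As a minimal illustration of the regularization and desensitization scenarios, the sketch below assumes a hypothetical raw field content holding an access log line and a hypothetical user_phone field; it uses the extend and where instructions together with the SQL functions regexp_extract, regexp_replace, and cast (see [1] and [3] for the exact forms supported):

| extend status = regexp_extract(content, ' (\d{3}) ', 1)
| extend user_phone = regexp_replace(user_phone, '(\d{3})\d{4}(\d{4})', '$1****$2')
| where cast(status as bigint) >= 400

The first line extracts an HTTP status code from the raw text, the second masks the middle digits of a phone number, and the last keeps only error responses.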

2. New Data Processing Enhancements

SPL Integration, Unified Syntax
SPL is the unified data processing syntax provided by SLS for scenarios such as log collection, interactive query, streaming consumption, and data processing; for details, see SPL Syntax [1]. When writing SPL in the new version of data processing, line-by-line debugging and code hints are supported, providing an experience closer to an IDE.
10+ Times Performance Improvement, Smoother Handling of Large Data Volumes and Traffic Floods
For processing irregular log data of the same complexity, the new version of data processing delivers a 10+ times performance improvement over the old version and can therefore sustain higher data throughput. In addition, thanks to an upgraded scheduling system, the new version can scale out computation more nimbly when facing traffic peaks thousands of times the usual volume, minimizing the backlog caused by such peaks.
Lower Cost, Down to 1/3 of the Old Version
Through iterative upgrades of the data processing service, the new version costs only about 1/3 as much to use as the old version. When your scenarios are already supported, data processing (new version) is recommended.

3. Integration of SPL, Unified Syntax

3.1 Principles of the New Version of Data Processing
The new version of data processing achieves real-time processing of log data by hosting real-time data consumption tasks and combining them with the SPL-rule-based consumption capability of the log service. The principle is shown in the figure below.
Scheduling Mechanism
For each processing task, the scheduler of the processing service starts one or more running instances to execute data processing concurrently; each running instance acts as a consumer of one or more shards of the source Logstore. The scheduler decides the number of running instances based on the instances' resource consumption and the processing progress, enabling elastic concurrency. The concurrency of a single task is capped at the number of shards in the source Logstore.
Running Instances
Based on the task's SPL rules and the configuration of the target Logstore, a running instance consumes source log data from the shards assigned to it by the data processing service, processes that data with the SPL rules, and distributes and writes the results to the corresponding target Logstore. When a task is stopped and later restarted, consumption resumes from the last checkpoint.
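As an illustration only (not the exact console configuration), the SPL rule executed by a running instance might look like the following, assuming hypothetical fields level and latency:

| where level <> 'DEBUG'
| extend latency_ms = cast(latency as double) * 1000

Each consumed log entry that passes the where filter is written to the target Logstore with the additional latency_ms field.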

3.2 SPL Syntax Compared with the Old DSL
Compared with the data processing (old version) DSL, the log service SPL syntax improves ease of use in the following ways:

1. The data processing (old version) DSL, as a subset of Python syntax, requires a functional programming style and carries redundant syntactic notation. In contrast, the log service SPL language uses a shell-like command syntax that minimizes redundant symbols. For example:
The old version references a field value through the function v, as in v("field"); SPL refers to the field directly, as in | where field='ERROR'.
The old version's function call func(arg1, arg2) becomes the SPL instruction | cmd arg1, arg2, which is more concise to write.
2. The SPL language can retain converted field types during processing (see Data Type Conversion [2]); in contrast, in the data processing (old version) DSL, field values are fixed as strings, and the intermediate results of type conversion cannot be retained. For example, the following DSL script must call the ct_float function twice:

e_set("ms", ct_float(v("sec")) * 1000)
e_keep(ct_float(v("ms")) > 500)

The corresponding SPL logic is much simpler and does not require the repeated type conversion, as shown below:

| extend ms = cast(sec as double) * 1000 | where ms > 500

3. In addition, the SPL language can seamlessly use the log service's SQL functions, with no extra learning cost; see SQL Functions Overview [3] for the SQL functions supported by SPL.
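For example, the following sketch assumes hypothetical fields url and user_agent, and assumes that url_extract_host and regexp_like are among the SQL functions listed in [3]:

| extend request_host = url_extract_host(url)
| where regexp_like(user_agent, '(?i)bot|spider')

Here a URL function derives the host name and a regular-expression predicate filters crawler traffic, with no SPL-specific function syntax to learn.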

3.3 Data Processing SPL Code Debugging
Debugging Menu
The debugging menu of data processing (new version) SPL is shown in the figure; its buttons are defined as follows:
Run button: runs the SPL rules in the edit box from start to finish.
Debug button: enables debug mode and runs to the first breakpoint; after that, you can continue line by line or breakpoint by breakpoint.
Next breakpoint button: continues the debug run to the next breakpoint.
Next line button: steps the debug run to the next line.
Stop debugging button: stops the current debugging session.

The blank area in front of the line numbers in the code editing box is the breakpoint area. Clicking in this area adds a debugging breakpoint on the corresponding line, with the effect shown below; clicking an existing breakpoint removes it.

Debugging Process
1. Prepare test data and write the SPL rules.
2. Add breakpoints on the lines to be debugged.
3. Click the Debug button to enable debug mode, as shown below. A yellow background marks the line where execution is currently paused (its statement not yet executed), and a blue background marks SPL statements that have already been executed.
4. Check the Processing Results tab to see whether the results meet expectations.
If they do: continue the debug run (Next breakpoint or Next line) until debugging completes.
If they do not: click the Stop Debugging button, return to step 1, modify the SPL rules, and start debugging again.


4. Continuous Iterative Upgrade

The new version of data processing will continue to be upgraded iteratively; here we highlight two upgrades planned for the near future.

1. Supporting Complete Data Processing Scenarios
At present, the new version of data processing focuses on processing irregular data, i.e., computational scenarios, and does not yet cover data flow scenarios such as distribution to multiple or dynamic targets, dimension table enrichment, IP address resolution, and cross-region data synchronization.

Next, the new version of data processing will focus on supporting these scenarios and, through architectural upgrades such as accelerated cross-region synchronization and dataset-based distribution, provide more stable and easier-to-use services.

2. Seamless Upgrade of Old Data Processing Tasks
Old-version data processing tasks that are already running online also need to receive the upgrades described above. The data processing service will support in-place upgrades of existing tasks through two technical measures:

First, the data processing service ensures data integrity by automatically migrating the task's current consumption checkpoint to the new version of data processing, from which data consumption continues after the upgrade.
Second, based on AST parsing, it automatically translates the DSL scripts of old-version tasks into SPL with equivalent data processing logic.

Related links:
[1] SPL Syntax
[2] Data Type Conversion
[3] SQL Functions Overview
