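"Spring XD 1.0 Milestone 1 Released" means that the first milestone release of Spring XD 1.0 has been published. The details are as follows:
- About Spring XD: Spring XD is a unified, distributed, and extensible system for data ingestion, real-time analytics, batch processing, and data export. The project's goal is to simplify the development of big data applications. It builds on the Spring Integration and Spring Batch projects and provides a lightweight runtime environment that is easily configured and assembled via a simple DSL (domain-specific language).
- What a milestone release means: In a version number, "Milestone" denotes a pre-release build that introduces new features or marks a significant point in development. Several milestone builds are usually published before the final release so that functionality and stability can be refined step by step. 1.0 is typically a product's first major version, signaling that the core features and basic stability are in place. Milestone 1 is an important checkpoint on the way there: development of Spring XD 1.0 has reached an interim result, and some key features are already available for developers to test and use.
- Features: Spring XD 1.0 Milestone 1 mainly includes the following features.
- Streams: A stream defines how data is collected, processed, and stored or forwarded. For example, a stream can collect system log data, filter it, and store it in HDFS. Spring XD provides a DSL for defining streams. You can build a simple linear processing flow with a syntax similar to UNIX pipes and filters, or describe more complex flows with an extended syntax.
- Sources and Sinks: A simple linear stream consists of an input source, (optional) processing steps, and an output sink. Supported sources include file, time, HTTP, tail, Twitter search, GemFire (continuous queries and cache events), Syslog, and TCP. Supported sinks include log, file, HDFS, the GemFire distributed data grid, and TCP. Sources and sinks are specified in the DSL, and option arguments can be passed to override the defaults. For example, "http --port=9090 | file --dir=/var/streams --name=data.txt" specifies an HTTP source listening on port 9090 that writes its data to a file named data.txt in the given directory.
This release lays the groundwork for the further development of Spring XD and gives developers of big data applications an initial platform for exploring Spring XD and building applications with it.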
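Spring XD 1.0 Milestone 1 (M1) was released on June 12, 2013. A detailed breakdown follows:
Positioning and goals
Spring XD is a unified, distributed, and extensible system for data ingestion, real-time analytics, batch processing, and data export, intended to simplify the development of big data applications. It draws on the proven experience of the Spring Integration and Spring Batch projects in building enterprise integration and batch applications, and provides a lightweight runtime environment that is configured and assembled via a simple DSL (domain-specific language).
Core components
- Streams: Define how data is collected, processed, and stored or forwarded. For example, a stream can collect syslog data, filter it, and store it in HDFS. Spring XD provides a DSL for defining streams. It supports a simple UNIX pipes-and-filters syntax for linear processing flows and also allows more complex flows to be described.
- Sources and Sinks: A source is a data entry point, such as file, time, or HTTP. A sink is a data exit point, such as log, file, or HDFS. M1 supports a range of sources and sinks and also lets users add their own.
- Processors: Process the data in a stream. Multiple processing steps are connected by channels. A channel can be in-memory or backed by middleware such as Redis, JMS, or RabbitMQ, which allows for a simple distributed processing model.
- Taps: Let you tap data from a stream for purposes such as real-time analytics. For example, a tap created before data is written to HDFS can pipe the data to a counter that counts how often specific hashtags are mentioned.
- Analytics: M1 supports the Counter, Field Value Counter, Gauge, and Rich Gauge metrics. They can be stored in memory or in Redis and are used as sinks in DSL expressions.
Runtime architecture
Spring XD has two modes of operation: single-node and distributed. Single-node mode makes it easy to get started and simplifies application development and testing. Distributed mode spreads processing tasks across multiple machines, with an administrative server sending commands to control the tasks running on the cluster. In M1, the distributed architecture passes data between the modules of a stream using a Redis queue.
Features and advantages
- Spring-based programming model: The programming model for streams is based on Spring Integration. Input sources convert external data into messages consisting of headers and a payload, and the messages flow through the stream via message channels.
- Flexible configuration and extensibility: Configuration is done through the DSL, which is easy to pick up and extend. Users can write custom sources, sinks, and processors, and can add existing Spring Integration channel adapters.
- Support for many data sources and targets: M1 supports a number of common sources and targets, such as file, HTTP, Syslog, and HDFS, and future releases plan to add support for MQTT, RabbitMQ, JMS, and Kafka.
Next steps
Future releases will add support for the XDContainer to run Spring Batch jobs. These jobs can be used to export data from HDFS to relational databases and to orchestrate the execution of Hadoop jobs on the cluster. Additional metrics libraries, HTTP/JMX-based management, and high-performance sources based on the Reactor project are also planned.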
Today we are pleased to announce the 1.0 M1 release of Spring XD (download). Spring XD is a unified, distributed, and extensible system for data ingestion, real-time analytics, batch processing, and data export. The project’s goal is to simplify the development of big data applications.
From the 10,000 foot view, big data applications share many characteristics with Enterprise Integration and Batch applications. Spring has provided proven solutions for building integration and batch applications for more than 6 years now via the Spring Integration and Spring Batch projects. Spring XD builds upon this foundation and provides a lightweight runtime environment that is easily configured and assembled via a simple DSL.
In this blog we will introduce the key components of Spring XD, namely Streams, Jobs, Taps, Analytics and the DSL used to declare them, as well as the runtime architecture. Many more details can be found in the XD Guide.
Streams
A Stream defines how data is collected, processed and stored or forwarded. For example, a stream may collect syslog data, filter it, and store it in HDFS. Spring XD provides a DSL to define a stream. The DSL allows you to start simple using a UNIX pipes-and-filters syntax to build a linear processing flow but lets you also describe more complex flows using an extended syntax.
Sources and Sinks
A simple linear stream consists of the sequence: Input Source, (optional) Processing Steps, and an Output Sink. As a simple example, consider collecting data from an HTTP Source and writing it to a File Sink. The DSL to describe this stream is
http | file
You tell Spring XD to create a stream by making an HTTP request to the XD Admin Server, which runs on port 8080 by default. In the M2 release we will provide an interactive shell to communicate with XD, but for M1 the easiest way to interact with XD is to use ‘curl’.
curl -d "http | file" http://localhost:8080/streams/httptest
The name of the stream is httptest, the default HTTP port to listen on is 9000, and the default file location is /tmp/xd/output/${streamname}.
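For readers who would rather script this step than type curl commands, the request can be sketched in Python. This is a minimal sketch, not part of Spring XD: the helper name build_create_stream_request is hypothetical, and it assumes only what the curl command above shows, namely that the stream definition is sent as the body of a POST to /streams/{name} on the Admin Server. The request is built but not sent, so the snippet runs without an XD server.

```python
from urllib.request import Request

# Default Admin Server address, as stated in the post.
ADMIN_URL = "http://localhost:8080"


def build_create_stream_request(name: str, definition: str,
                                admin_url: str = ADMIN_URL) -> Request:
    """Build (but do not send) the POST that the curl command above issues.

    Hypothetical helper for illustration; it mirrors `curl -d "<definition>" <url>`.
    """
    return Request(
        url=f"{admin_url}/streams/{name}",
        data=definition.encode("utf-8"),  # the DSL text is the request body
        method="POST",
    )


request = build_create_stream_request("httptest", "http | file")
print(request.get_method(), request.full_url, request.data)

# Sending it requires a running XD Admin Server:
# from urllib.request import urlopen
# urlopen(request)
```

Keeping construction separate from sending makes it easy to inspect the URL and body before anything touches a live server.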
If you post some data on port 9000 with curl
curl -d "hello world" http://localhost:9000
You will see the string hello world inside the file /tmp/xd/output/httptest
To change the default values, you can pass in option arguments
http --port=9090 | file --dir=/var/streams --name=data.txt
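To make the structure of such a definition concrete, the following Python sketch splits a linear stream definition into module names and their options. It is an illustration only and not the parser Spring XD uses: parse_stream is a hypothetical name, and the sketch assumes unquoted option values, so it would not handle a quoted expression that contains spaces or a pipe character.

```python
from typing import Dict, List, Tuple


def parse_stream(definition: str) -> List[Tuple[str, Dict[str, str]]]:
    """Split a linear stream definition into (module name, options) pairs.

    Simplified illustration: option values must not contain spaces or '|'.
    """
    modules = []
    for segment in definition.split("|"):
        tokens = segment.split()
        if not tokens:
            raise ValueError("empty module in stream definition")
        name, options = tokens[0], {}
        for token in tokens[1:]:
            # Each option is expected in the form --key=value.
            if not token.startswith("--") or "=" not in token:
                raise ValueError(f"unsupported option syntax: {token!r}")
            key, value = token[2:].split("=", 1)
            options[key] = value
        modules.append((name, options))
    return modules


parsed = parse_stream("http --port=9090 | file --dir=/var/streams --name=data.txt")
for module_name, module_options in parsed:
    print(module_name, module_options)
```

Each pipe separates two modules, and every --key=value pair overrides one default of the module it follows.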
The supported sources in M1 are file, time, HTTP, Tail, Twitter Search, Gemfire (Continuous Queries), Gemfire (Cache Event), Syslog and TCP. The supported sinks are Log, File, HDFS, Gemfire Distributed Data Grid, and TCP. To capture syslog data to HDFS, the DSL is simply
syslog | hdfs --namenode="http://192.168.1.100:9000"
You can also add your own custom sources and sinks. Existing Inbound and Outbound Channel Adapters in Spring Integration can be added by following a simple recipe. Future releases will add support for MQTT, RabbitMQ, JMS, and Kafka. We would love a pull request to contribute your preferred source and sink modules.
The programming model for a Stream is based on Spring Integration. Input Sources convert external data to a Message that consists of headers, which contain key-value pairs, and a payload that can be any Java type. Messages flow through the stream through Message Channels. This is shown below for a stream with an Input Source, Processing Step, and an Output Sink.
Processors
A stream that incorporates multiple processing steps is shown below. The processing steps are all connected together via Channels.
In the DSL, the pipe symbol corresponds to the channel that passes data from each processing step to the next. The channels in Spring XD can either be in-memory or be backed by middleware such as Redis, JMS, RabbitMQ etc. This allows for a simple distributed processing model which will be discussed shortly.
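The message-and-channel model can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not Spring Integration's API: the Message and InMemoryChannel classes and the run_step and shout functions are hypothetical names, and the channel is a plain in-process FIFO standing in for the pipe symbol.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Callable, Deque, Dict, List


@dataclass
class Message:
    """A payload of any type plus key-value headers."""
    payload: Any
    headers: Dict[str, Any] = field(default_factory=dict)


class InMemoryChannel:
    """FIFO hand-off between two steps; plays the role of one pipe symbol."""

    def __init__(self) -> None:
        self._queue: Deque[Message] = deque()

    def send(self, message: Message) -> None:
        self._queue.append(message)

    def receive_all(self) -> List[Message]:
        drained = list(self._queue)
        self._queue.clear()
        return drained


def run_step(step: Callable[[Message], Message],
             inbound: InMemoryChannel, outbound: InMemoryChannel) -> None:
    """Apply one processing step to every message waiting on the inbound channel."""
    for message in inbound.receive_all():
        outbound.send(step(message))


def shout(message: Message) -> Message:
    # Transform the payload and carry the headers along unchanged.
    return Message(message.payload.upper(), dict(message.headers))


# source -> channel -> processing step -> channel -> sink
source_out, sink_in = InMemoryChannel(), InMemoryChannel()
source_out.send(Message("hello world", {"source": "http"}))
run_step(shout, source_out, sink_in)

delivered = sink_in.receive_all()
print(delivered)
```

Because each step only sees channels, swapping the in-memory queue for a middleware-backed one would not change the step code, which is the property the paragraph above describes.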
The DSL expression that represents streams with processing steps is of the form
source | filter | transform | sink
The supported processors in M1 are filter, transformer, json-field-extractor, json-field-value-filter, and script. The filter and transformer processors support using the Spring Expression Language (SpEL) as well as Groovy. To transform the payload of the HTTP request to uppercase in the previous example using SpEL,
http | transform --expression=payload.toUpperCase() | file
The script processor also allows you to execute custom Groovy code.
Taps
A Tap allows you to “listen in” to data from another stream and process the data in a separate stream. The original stream is unaffected by the tap and isn’t aware of its presence, similar to a phone wiretap. WireTaps are part of the standard catalog of EAI patterns and are part of the Spring Integration framework used by Spring XD.
A tap can consume data from any point along the target stream’s processing pipeline. For example, if you have a stream called mystream, defined as
source | filter | transform | sink
You can create a tap using the DSL
tap mystream.filter | sink2
This would tap into the stream’s data after the filter has been applied but before the transformer. So the untransformed data would be sent to sink2.
For example, if you create a stream named httpstream using the command:
curl -d "http --port=9898 | filter --expression='payload.length() > 5' | transform --expression=payload.toUpperCase() | file" http://localhost:8080/streams/httpstream
Then, to create a tap named httptap that writes the filtered data to a separate file, use the following command:
curl -d "tap httpstream.filter | file --dir=/tmp --name=filtered.txt" http://localhost:8080/streams/httptap
Posting data such as
curl -d "hello world" http://localhost:9898
curl -d "he" http://localhost:9898
curl -d "hello world 2" http://localhost:9898
will result in HELLO WORLD and HELLO WORLD 2 in the file /tmp/xd/output/httpstream and their lowercase equivalents in /tmp/filtered.txt. The text ‘he’ will not be present in either file.
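The behaviour of this example can be reproduced in a few lines of Python. This is a simulation of the data flow only, with lists standing in for the two files. The function name run_stream_with_tap is hypothetical, and the filter and transform steps are hand-written equivalents of the two expressions in the stream definition.

```python
from typing import List, Tuple


def run_stream_with_tap(payloads: List[str]) -> Tuple[List[str], List[str]]:
    """Simulate `http | filter | transform | file` with a tap placed after the filter."""
    main_sink: List[str] = []  # stands in for /tmp/xd/output/httpstream
    tap_sink: List[str] = []   # stands in for /tmp/filtered.txt
    for payload in payloads:
        # filter --expression='payload.length() > 5'
        if len(payload) <= 5:
            continue
        # tap httpstream.filter: copy the data after the filter, before the transform
        tap_sink.append(payload)
        # transform --expression=payload.toUpperCase()
        main_sink.append(payload.upper())
    return main_sink, tap_sink


main_file, tap_file = run_stream_with_tap(["hello world", "he", "hello world 2"])
print(main_file)  # ['HELLO WORLD', 'HELLO WORLD 2']
print(tap_file)   # ['hello world', 'hello world 2']
```

The tap receives a copy of each message at the chosen point, so the main stream's output is the same whether or not the tap exists.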
A primary use case is to perform real-time analytics at the same time as data is being ingested via its primary stream. For example, consider a Stream of data that is consuming Twitter search results and writing them to HDFS. A tap can be created before the data is written to HDFS, and the data piped from the tap to a counter that tracks the number of times specific hashtags were mentioned in the tweets.
Analytics
Ask 10 developers what ‘real time analytics’ are and you will get 20 answers. The answers range from very simple (but extremely useful) counters, to moving averages, to aggregated counters, to histograms, to time-series, to machine learning algorithms to Embedded CEP engines. Spring XD intends to support a wide range of these metrics and analytical data structures as a general purpose class library that works with several backend storage technologies. They are also exposed to XD as a type of Sink for use in DSL expressions.
In the M1 release there is support for Counter, Field Value Counter, Gauge, and Rich Gauge. These metrics can be stored in-memory or in Redis. See the JavaDocs and Analytics section of the user guide for more details and also a list of what will be implemented in future releases.
As an example, consider the case of collecting a real-time count of the frequency of hashtags in a stream of tweets. To do this with Spring XD, create a new stream definition that uses the twitter search source module and name it ‘spring’:
curl -d "twittersearch --query='spring' --consumerKey= --consumerSecret= | file" http://localhost:8080/streams/spring
This stores the tweets in the local filesystem. Note, to get a consumerKey and consumerSecret you need to register a twitter application. If you don’t already have one set up, you can create an app at the Twitter Developers site to get these credentials.
Next, create a tap named ‘springtap’ on the output of the twittersearch source to count the frequency of hashtags in the tweets.
curl -d "tap spring.twittersearch | field-value-counter --fieldName=entities.hashTags.text --counterName=hashTagFrequency" http://localhost:8080/streams/springtap
The field entities.hashTags.text is the path to the hashtags in the JSON representation of a Spring Social Tweet object used in the underlying implementation. To view the top 5 hashtags, use the redis-cli to view the contents of the sorted set named fieldvaluecounters.hashTagFrequency. Note that it will often take a few minutes to collect enough tweets that have hashtag entities.
redis-cli
redis 127.0.0.1:6379> ZREVRANGEBYSCORE fieldvaluecounters.hashTagFrequency +inf -inf WITHSCORES LIMIT 0 5
1) "spring"
2) "6"
3) "Turkey"
4) "6"
5) "Arab"
6) "6"
7) "summer"
8) "3"
9) "fashion"
10) "3"
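What a field value counter computes can be sketched with an in-memory Python Counter. This illustrates the idea only and is not the Spring XD implementation. The function name count_hashtags is hypothetical, and the sample tweets are invented, shaped only to follow the entities.hashTags.text path mentioned above.

```python
from collections import Counter
from typing import Any, Dict, List


def count_hashtags(tweets: List[Dict[str, Any]]) -> Counter:
    """Count how often each value at entities.hashTags[*].text occurs."""
    counts: Counter = Counter()
    for tweet in tweets:
        for tag in tweet.get("entities", {}).get("hashTags", []):
            counts[tag["text"]] += 1
    return counts


# Invented sample data for illustration; real tweets carry many more fields.
sample_tweets = [
    {"entities": {"hashTags": [{"text": "spring"}, {"text": "java"}]}},
    {"entities": {"hashTags": [{"text": "spring"}]}},
    {"entities": {"hashTags": []}},
]

frequencies = count_hashtags(sample_tweets)
# Highest counts first, comparable to reading a sorted set from the top.
print(frequencies.most_common(2))  # [('spring', 2), ('java', 1)]
```

A Redis sorted set plays the same role in the example above, with the advantage that the counts survive restarts and can be shared between processes.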
Architecture
Spring XD has two modes of operation - single-node and distributed. The first is a single process that handles all processing and administration. This mode helps you get started easily and simplifies the development and testing of your application. The distributed mode allows processing tasks to be spread across a cluster of machines and an administrative server sends commands to control processing tasks executing on the cluster.
The distributed architecture in the M1 release is simple. Each part of a stream, called a module, can execute in its own container instance. The data is passed between the modules using a Redis queue. See the Architecture section for more details. The primary focus of this release was getting the abstractions right, such as having the pipe symbol in the DSL be pluggable across various transports. Other transports and performance improvements will be coming in future releases as well as execution inside a Hadoop cluster.
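The idea of modules running independently and exchanging data through a queue can be sketched as follows. This is a single-process analogy, not the M1 runtime: a thread stands in for a container instance, Python's queue.Queue stands in for the Redis queue, and the names run_module and STOP are hypothetical.

```python
import queue
import threading
from typing import Any, Callable, List

STOP = object()  # sentinel marking the end of the data (an assumption of this sketch)


def run_module(process: Callable[[Any], Any],
               inbound: "queue.Queue", outbound: "queue.Queue") -> None:
    """Consume items from the inbound queue, process them, and pass them on."""
    while True:
        item = inbound.get()
        if item is STOP:
            outbound.put(STOP)  # propagate shutdown to the next module
            return
        outbound.put(process(item))


# One queue per pipe symbol: source | transform | sink
source_to_transform: "queue.Queue" = queue.Queue()
transform_to_sink: "queue.Queue" = queue.Queue()

# The transform module runs on its own thread, like a module in its own container.
worker = threading.Thread(
    target=run_module, args=(str.upper, source_to_transform, transform_to_sink)
)
worker.start()

# The "source" publishes data and then signals completion.
for payload in ["hello world", "hello world 2"]:
    source_to_transform.put(payload)
source_to_transform.put(STOP)
worker.join()

# The "sink" drains whatever the transform module produced.
received: List[str] = []
while True:
    item = transform_to_sink.get()
    if item is STOP:
        break
    received.append(item)
print(received)  # ['HELLO WORLD', 'HELLO WORLD 2']
```

Since the module code only calls get and put, replacing the queue object with a client for another transport would leave run_module untouched, which is the kind of pluggability the release aims for.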
More to come
Some other topics not covered in this post are the introduction of the Tuple data structure and how you can create custom processors. An important part of the next release will be support for the XDContainer to run Spring Batch jobs. These jobs can be used to help export data from HDFS to relational databases as well as orchestrate the execution of Hadoop jobs (MapReduce, Pig, Hive, or Cascading) on the cluster. We will also be providing additional libraries for metrics such as aggregate counters, HTTP/JMX based management, as well as some high performing sources based on the Reactor project, so stay tuned!
We would love to hear your feedback as we continue working hard towards the final Spring XD 1.0.0 release. If you have any questions, please use Stack Overflow (Tag: springxd). To report bugs or suggest improvements, please use either the Jira Issue Tracker or file a GitHub issue.