
Apache Spark 2.1.0 Officially Released, with Major Advances in Structured Streaming

木马童年 2019-8-10 23:30

Apache Spark 2.1.0 is the second release in the 2.x line. This release makes major strides toward making Structured Streaming production-ready: Structured Streaming now supports event-time watermarks as well as Kafka 0.10. Beyond that, the release emphasizes usability, stability, and polish, resolving more than 1,200 tickets. The updates in this release are listed below.

Core and Spark SQL

API updates

SPARK-17864: Data type APIs are stable APIs.

SPARK-18351: from_json and to_json for parsing JSON for string columns
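
Conceptually, `from_json` parses a string column of JSON into a struct column, and `to_json` serializes it back. The plain-Python sketch below only mirrors that row-level transformation (the helper names `from_json_row` and `to_json_row` are made up for illustration; in Spark these run as Catalyst expressions over whole columns):

```python
import json

def from_json_row(s, keys):
    """Parse one JSON string into a 'struct' (dict); None on malformed input."""
    try:
        obj = json.loads(s)
        return {k: obj.get(k) for k in keys}
    except (ValueError, TypeError):
        return None  # Spark likewise yields null for unparseable rows

def to_json_row(struct):
    """Serialize a struct (dict) back to a JSON string."""
    return json.dumps(struct, sort_keys=True)

row = '{"id": 1, "name": "spark"}'
parsed = from_json_row(row, ["id", "name"])
print(parsed)                  # {'id': 1, 'name': 'spark'}
print(to_json_row(parsed))     # {"id": 1, "name": "spark"}
print(from_json_row("oops", ["id"]))  # None
```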

SPARK-16700: When creating a DataFrame in PySpark, Python dictionaries can be used as values of a StructType.

Performance and stability

SPARK-17861: Scalable Partition Handling. Hive metastore stores all table partition metadata by default for Spark tables stored with Hive’s storage formats as well as tables stored with Spark’s native formats. This change reduces first query latency over partitioned tables and allows for the use of DDL commands to manipulate partitions for tables stored with Spark’s native formats. Users can migrate tables stored with Spark’s native formats created by previous versions by using the MSCK command.

SPARK-16523: Speeds up group-by aggregate performance by adding a fast aggregation cache that is backed by a row-based hashmap.
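
The idea behind hash-based aggregation is that each grouping key maps to a running aggregate buffer, so one pass over the input suffices. Spark keeps these buffers in a compact row-based hashmap; this plain-Python dict (with an illustrative `hash_aggregate` helper, not a Spark API) only shows the principle:

```python
def hash_aggregate(rows):
    """rows: iterable of (key, value); returns {key: (count, sum)}."""
    buffers = {}
    for key, value in rows:
        # Look up the running buffer for this key and update it in place.
        count, total = buffers.get(key, (0, 0))
        buffers[key] = (count + 1, total + value)
    return buffers

data = [("a", 1), ("b", 2), ("a", 3)]
print(hash_aggregate(data))  # {'a': (2, 4), 'b': (1, 2)}
```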

Other notable changes

SPARK-9876: parquet-mr upgraded to 1.8.1

Programming guides: Spark Programming Guide and Spark SQL, DataFrames and Datasets Guide.

Structured Streaming

API updates

SPARK-17346: Kafka 0.10 support in Structured Streaming

SPARK-17731: Metrics for Structured Streaming

SPARK-17829: Stable format for offset log

SPARK-18124: Observed delay based Event Time Watermarks
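
An observed-delay watermark trails the maximum event time seen so far by a fixed allowed delay; events that arrive with timestamps older than the watermark are treated as too late. The following is a minimal pure-Python simulation of that rule (the `process` helper and the 10-second delay are made-up illustrations, not Spark's implementation):

```python
ALLOWED_DELAY = 10  # seconds of lateness to tolerate (arbitrary for the demo)

def process(events):
    """events: event-time seconds in arrival order; returns (accepted, dropped)."""
    max_event_time = float("-inf")
    accepted, dropped = [], []
    for t in events:
        max_event_time = max(max_event_time, t)
        watermark = max_event_time - ALLOWED_DELAY
        if t >= watermark:
            accepted.append(t)
        else:
            dropped.append(t)  # older than the watermark: too late
    return accepted, dropped

print(process([100, 105, 93, 112, 101]))
# -> ([100, 105, 112], [93, 101])
```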

SPARK-18192: Support all file formats in structured streaming

SPARK-18516: Separate instantaneous state from progress performance statistics

Stability

SPARK-17267: Long running structured streaming requirements

Programming guide: Structured Streaming Programming Guide.

MLlib

API updates

SPARK-5992: Locality Sensitive Hashing
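
The core idea of Locality Sensitive Hashing is that similar inputs tend to hash to the same bucket. One classic scheme for cosine similarity uses random hyperplanes, sketched below in plain Python; MLlib's actual API (e.g. `BucketedRandomProjectionLSH`, `MinHashLSH`) differs, and the `signature` helper here is purely illustrative:

```python
import random

random.seed(0)
DIM, BITS = 4, 8
# One random Gaussian hyperplane per signature bit.
planes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(BITS)]

def signature(v):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(int(sum(p * x for p, x in zip(plane, v)) >= 0)
                 for plane in planes)

a = [1.0, 0.9, 0.0, 0.1]
b = [0.9, 1.0, 0.1, 0.0]    # close to a
c = [-1.0, 0.0, 0.9, -0.5]  # far from a
same_ab = sum(x == y for x, y in zip(signature(a), signature(b)))
same_ac = sum(x == y for x, y in zip(signature(a), signature(c)))
print(same_ab, same_ac)  # a and b should agree on more bits than a and c
```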

SPARK-7159: Multiclass Logistic Regression in DataFrame-based API

SPARK-16000: ML persistence: Make model loading backwards-compatible with Spark 1.x with saved models using spark.mllib.linalg.Vector columns in DataFrame-based API

Performance and stability

SPARK-17748: Faster, more stable LinearRegression for < 4096 features

SPARK-16719: RandomForest: communicate fewer trees on each iteration

Programming guide: Machine Learning Library (MLlib) Guide.

SparkR

The main focus of SparkR in the 2.1.0 release was adding extensive support for ML algorithms, which include:

New ML algorithms in SparkR including LDA, Gaussian Mixture Models, ALS, Random Forest, Gradient Boosted Trees, and more

Support for multinomial logistic regression, providing functionality similar to the glmnet R package

Enable installing third party packages on workers using spark.addFile (SPARK-17577).

Standalone installable package built with the Apache Spark release. We will be submitting this to CRAN soon.

Programming guide: SparkR (R on Spark).

GraphX

SPARK-11496: Personalized pagerank
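
Personalized PageRank differs from plain PageRank in that the random walk teleports back to a single source vertex, so scores measure proximity to that source. A tiny power-iteration sketch (graph, parameters, and the `personalized_pagerank` helper are made up for illustration; GraphX's implementation is distributed over the graph):

```python
def personalized_pagerank(edges, num_nodes, source, alpha=0.15, iters=50):
    out = [[] for _ in range(num_nodes)]
    for u, v in edges:
        out[u].append(v)
    rank = [0.0] * num_nodes
    rank[source] = 1.0  # all mass starts at the source
    for _ in range(iters):
        nxt = [0.0] * num_nodes
        nxt[source] = alpha  # teleport mass goes only to the source
        for u in range(num_nodes):
            if out[u]:
                share = (1 - alpha) * rank[u] / len(out[u])
                for v in out[u]:
                    nxt[v] += share
            else:
                nxt[source] += (1 - alpha) * rank[u]  # dangling node
        rank = nxt
    return rank

edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
ranks = personalized_pagerank(edges, num_nodes=4, source=0)
print(ranks)  # mass concentrates near the source vertex 0
```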

Programming guide: GraphX Programming Guide.

Deprecations

MLlib

SPARK-18592: Deprecate unnecessary Param setter methods in tree and ensemble models

Changes of behavior

Core and SQL

SPARK-18360: The default table path of tables in the default database will be under the location of the default database instead of always depending on the warehouse location setting.

SPARK-18377: spark.sql.warehouse.dir is a static configuration now. Users need to set it before the start of the first SparkSession and its value is shared by sessions in the same application.

SPARK-14393: Values generated by non-deterministic functions will not change after coalesce or union.

SPARK-18076: Fix default Locale used in DateFormat, NumberFormat to Locale.US

SPARK-16216: CSV and JSON data sources write timestamp and date values in ISO 8601 formatted string. Two options, timestampFormat and dateFormat, are added to these two data sources to let users control the format of timestamp and date value in string representation, respectively. Please refer to the API doc of DataFrameReader and DataFrameWriter for more details about these two configurations.
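
The string form in question looks like `2016-12-28T23:30:00.000Z`. The snippet below only illustrates that ISO 8601 shape with Python's standard library; it is an approximation of what Spark writes, not Spark code (Spark's format options use Java `SimpleDateFormat` patterns, e.g. `option("timestampFormat", ...)` on a reader or writer):

```python
from datetime import datetime, timezone

ts = datetime(2016, 12, 28, 23, 30, 0, tzinfo=timezone.utc)
# Truncate microseconds to milliseconds and mark UTC with a trailing Z.
iso = ts.strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"
print(iso)  # 2016-12-28T23:30:00.000Z
```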

SPARK-17427: Function SIZE returns -1 when its input parameter is null.

SPARK-16498: LazyBinaryColumnarSerDe is fixed as the SerDe for RCFile.

SPARK-16552: If a user does not specify the schema to a table and relies on schema inference, the inferred schema will be stored in the metastore. The schema will not be inferred again when this table is used.

Structured Streaming

SPARK-18516: Separate instantaneous state from progress performance statistics

MLlib

SPARK-17870: ChiSquareSelector now accounts for degrees of freedom by using pValue rather than raw statistic to select the top features.

Known Issues

SPARK-17647: In SQL LIKE clause, wildcard characters ‘%’ and ‘_’ right after backslashes are always escaped.

SPARK-18908: If a StreamExecution fails to start, users need to check stderr for the error.
