2.1.4 Incompatibilities related to Apache Spark

The following table lists the incompatibilities related to Apache Spark.

Criterion	Typical case	JIRA	Summary
Changes to external specifications	Changes to command specifications (Note 1)	SPARK-19287	JavaPairRDD flatMapValues requires function returning Iterable, not Iterator
		SPARK-23429	Add executor memory metrics to heartbeat and expose in executors REST API
		SPARK-24958	Add executors' process tree total memory information to heartbeat signals
		SPARK-25865	Add GC information to ExecutorMetrics
		SPARK-26140	Enable custom shuffle metrics implementation in shuffle reader
		SPARK-26141	Enable custom shuffle metrics implementation in shuffle write
		SPARK-26877	Support user-level app staging directory in yarn mode when spark.yarn.stagingDir specified
		SPARK-27071	Expose additional metrics in status.api.v1.StageData
		SPARK-27575	Spark overwrites existing value of spark.yarn.dist.* instead of merging value
		SPARK-31449	Investigate the difference between JDK and Spark's time zone offset calculation
	Changes to option content, values, or default values (Note 2)	SPARK-23472	Add config properties for administrator JVM options
		SPARK-24203	Make executor's bindAddress configurable
		SPARK-25040	Empty string should be disallowed for data types except for string and binary types in JSON
		SPARK-25641	Change the spark.shuffle.server.chunkFetchHandlerThreadsPercent default to 100
		SPARK-26089	Handle large corrupt shuffle blocks
		SPARK-26771	Make .unpersist(), .destroy() consistently non-blocking by default
		SPARK-27868	Better document shuffle / RPC listen backlog
		SPARK-31582	Being able to not populate Hadoop classpath
	Stricter checks (Note 3)	SPARK-26340	Ensure cores per executor is greater than cpu per task
		SPARK-26530	Validate heartheat arguments in HeartbeatReceiver
		SPARK-31968	write.partitionBy() creates duplicate subdirectories when user provides duplicate columns
	Content or format of published files (Note 4)	SPARK-22860	Spark workers log ssl passwords passed to the executors
		SPARK-23191	Workers registration failes in case of network drop
		SPARK-25118	Need a solution to persist Spark application console outputs when running in shell/yarn client mode
		SPARK-25855	Don't use Erasure Coding for event log files
		SPARK-29112	Expose more details when ApplicationMaster reporter faces a fatal exception
	Changes to message content (Note 5)	SPARK-24345	Improve ParseError stop location when offending symbol is a token
		SPARK-24355	Improve Spark shuffle server responsiveness to non-ChunkFetch requests
		SPARK-24544	Print actual failure cause when look up function failed
		SPARK-25683	Updated the log for the firstTime event Drop occurs.
		SPARK-25689	Move token renewal logic to driver in yarn-client mode
		SPARK-25712	Improve usage message of start-master.sh and start-slave.sh
		SPARK-25773	Cancel zombie tasks in a result stage when the job finishes
		SPARK-26117	use SparkOutOfMemoryError instead of OutOfMemoryError when catch exception
		SPARK-26195	Correct exception messages in some classes
		SPARK-26529	Add debug logs for confArchive when preparing local resource
		SPARK-26600	Update spark-submit usage message
		SPARK-26660	Add warning logs for large taskBinary size
		SPARK-26697	ShuffleBlockFetcherIterator can log block sizes in addition to num blocks
		SPARK-27010	find out the actual port number when hive.server2.thrift.port=0
		SPARK-27192	spark.task.cpus should be less or equal than spark.task.cpus when use static executor allocation
		SPARK-27219	Misleading exceptions in transport code's SASL fallback path
		SPARK-27989	Add retries on the connection to the driver
		SPARK-28676	Avoid Excessive logging from ContextCleaner
		SPARK-28907	Review invalid usage of new Configuration()
		SPARK-28929	Spark Logging level should be INFO instead of Debug in Executor Plugin API[SPARK-24918]
		SPARK-29070	Make SparkLauncher log full spark-submit command line
		SPARK-29833	Add FileNotFoundException check for spark.yarn.jars
		SPARK-29885	Improve the exception message when reading the daemon port
		SPARK-31485	Barrier stage can hang if only partial tasks launched
		SPARK-31532	SparkSessionBuilder shoud not propagate static sql configurations to the existing active/default SparkSession
		SPARK-31941	Handling the exception in SparkUI for getSparkUser method
		SPARK-32003	Shuffle files for lost executor are not unregistered if fetch failure occurs after executor is lost
		SPARK-32560	improve exception message
	Addition or deletion of messages (Note 6)	SPARK-9853	Optimize shuffle fetch of contiguous partition IDs
		SPARK-22590	Broadcast thread propagates the localProperties to task
		SPARK-25829	remove duplicated map keys with last wins policy
		SPARK-26060	Track SparkConf entries and make SET command reject such entries.
		SPARK-26892	saveAsTextFile throws NullPointerException when null row present
		SPARK-27348	HeartbeatReceiver doesn't remove lost executors from CoarseGrainedSchedulerBackend
		SPARK-27637	If exception occured while fetching blocks by netty block transfer service, check whether the relative executor is alive before retry
		SPARK-27665	Split fetch shuffle blocks protocol from OpenBlocks
		SPARK-28483	Canceling a spark job using barrier mode but barrier tasks do not exit
		SPARK-30416	Log a warning for deprecated SQL config in `set()` and `unset()`
		SPARK-30590	can't use more than five type-safe user-defined aggregation in select statement
		SPARK-3137	Use finer grained locking in TorrentBroadcast.readObject
		SPARK-31632	The ApplicationInfo in KVStore may be accessed before it's prepared
Increased resource usage	Increased memory usage	SPARK-25035	Replicating disk-stored blocks should avoid memory mapping
		SPARK-25998	TorrentBroadcast holds strong reference to broadcast object
Changes to execution results	Corrections to incorrect implementations (Note 7)	SPARK-23643	XORShiftRandom.hashSeed allocates unnecessary memory
		SPARK-29273	Spark peakExecutionMemory metrics is zero
		SPARK-30752	Wrong result of to_utc_timestamp() on daylight saving day
		SPARK-30793	Wrong truncations of timestamps before the epoch to minutes and seconds
		SPARK-30826	LIKE returns wrong result from external table using parquet
		SPARK-30857	Wrong truncations of timestamps before the epoch to hours and days
		SPARK-31456	If shutdownhook is added with priority Integer.MIN_VALUE, it's supposed to be called the last, but it gets called before other positive priority shutdownhook
		SPARK-31500	collect_set() of BinaryType returns duplicate elements
		SPARK-31519	Cast in having aggregate expressions returns the wrong result
		SPARK-31663	Grouping sets with having clause returns the wrong result
		SPARK-31935	Hadoop file system config should be effective in data source options
		SPARK-32115	Incorrect results for SUBSTRING when overflow
		SPARK-32167	nullability of GetArrayStructFields is incorrect
		SPARK-32364	Use CaseInsensitiveMap for DataFrameReader/Writer options
		SPARK-32377	CaseInsensitiveMap should be deterministic for addition
		SPARK-32693	Compare two dataframes with same schema except nullable property
		SPARK-32810	CSV/JSON data sources should avoid globbing paths when inferring schema

Note 1) Includes changes to execution results, execution permissions, and the degree of concurrent execution.

Note 2) Includes screen information, such as settings screens and operation screens.

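Note 3) Includes changes to the range of values that can be specified, consistency checks between definitions, and expansion or reduction of the valid range caused by stricter checks.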

Note 4) Includes changes to the output items or format of log files.

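Note 5) Includes cases in which a change to a pop-up message or similar alters an existing operating procedure. Also covers changes to message content or message level, and message improvements.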

Note 6) Messages newly added or deleted as a result of bug fixes or improvements, within the scope of existing functionality.

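Note 7) Cases in which external behavior that contradicted the external specification is corrected to the proper behavior, or in which the behavior of a standard technology that was implemented based on a misinterpretation is corrected to the proper behavior.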
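
As an illustration of the stricter checks described in Note 3, SPARK-26340 concerns the relationship between the cores assigned to each executor and the CPUs required by each task. The following is a minimal sketch of a pre-flight check that an application could run against its own settings before submission. It is not Spark's implementation: the function name `check_executor_cores` and the dictionary-based configuration are hypothetical, and the sketch assumes the intended condition is that `spark.executor.cores` is at least `spark.task.cpus`.

```python
def check_executor_cores(conf):
    """Return an error message if one executor cannot fit one task, else None.

    `conf` is a plain dict of Spark property names to string values
    (a hypothetical stand-in for the application's real configuration).
    """
    # spark.task.cpus defaults to 1 when it is not set.
    task_cpus = int(conf.get("spark.task.cpus", "1"))

    executor_cores = conf.get("spark.executor.cores")
    if executor_cores is None:
        # The default depends on the cluster manager, so there is
        # nothing reliable to compare against in this sketch.
        return None

    if int(executor_cores) < task_cpus:
        return (
            f"spark.executor.cores ({executor_cores}) must be at least "
            f"spark.task.cpus ({task_cpus})"
        )
    return None


if __name__ == "__main__":
    # Hypothetical settings that such a check would reject.
    print(check_executor_cores(
        {"spark.executor.cores": "2", "spark.task.cpus": "4"}
    ))
```

A result other than `None` indicates a combination that may have been accepted before the upgrade but may be rejected afterward, so such settings are worth reviewing when migrating.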

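As an illustration of the result changes described in Note 7, SPARK-30793 and SPARK-30857 concern the truncation of timestamps that precede the epoch. The JIRA summaries do not state the cause, so the following sketch only shows one generic way in which this kind of truncation can differ for negative values. It assumes that a timestamp is held as a count of microseconds since the epoch, and the function names are hypothetical.

```python
MICROS_PER_SECOND = 1_000_000


def truncate_toward_zero(micros, unit_micros):
    """Truncate by rounding the quotient toward zero.

    This is how integer division behaves in many languages, and it
    moves a pre-epoch (negative) timestamp forward in time.
    """
    quotient = abs(micros) // unit_micros
    signed_quotient = quotient if micros >= 0 else -quotient
    return signed_quotient * unit_micros


def truncate_floor(micros, unit_micros):
    """Truncate to the start of the containing unit by rounding down."""
    # Python's // already rounds toward negative infinity.
    return (micros // unit_micros) * unit_micros


if __name__ == "__main__":
    before_epoch = -1_500_000  # 1.5 seconds before the epoch
    print(truncate_toward_zero(before_epoch, MICROS_PER_SECOND))  # -1000000
    print(truncate_floor(before_epoch, MICROS_PER_SECOND))        # -2000000
```

The two functions agree for timestamps at or after the epoch and differ only before it, which is why applications that store or compare truncated pre-epoch timestamps should recheck their results after the upgrade.
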
Reference

For details, see the following site.

https://issues.apache.org/jira/secure/Dashboard.jspa