|Spark on K8s
|S3? Any better alternative?
As it evolved, Hadoop provided three main capabilities that made it a Big Data-ready solution: a distributed computation mechanism (MapReduce), robust data storage (HDFS), and a resource manager (YARN). But modern technologies now provide a better replacement for each of these three components: Kubernetes as an efficient resource manager, Amazon S3 for data storage, and Spark and Flink as distributed computation engines.
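To make that combination concrete, here is a minimal sketch of how a Spark job might be pointed at Kubernetes for scheduling and at S3 for storage. It only assembles a `spark-submit` command line and does not run anything. The API server URL, image name, and application path are hypothetical placeholders, and the helper `build_submit_command` is an illustrative name, not part of Spark. The property names are standard Spark settings; note that the `s3a://` connector is itself shipped in Hadoop's `hadoop-aws` module, so Spark still relies on Hadoop client libraries here even without an HDFS cluster.

```python
from typing import List


def build_submit_command(api_server: str, image: str, app: str,
                         executors: int = 2) -> List[str]:
    """Assemble spark-submit arguments for a Kubernetes master with S3A storage.

    All argument values are supplied by the caller; nothing is executed here.
    """
    conf = {
        # Container image used for the driver and executor pods.
        "spark.kubernetes.container.image": image,
        # Number of executor pods to request from Kubernetes.
        "spark.executor.instances": str(executors),
        # File system implementation that backs s3a:// paths.
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    }
    cmd = [
        "spark-submit",
        "--master", f"k8s://{api_server}",  # Kubernetes acts as the resource manager
        "--deploy-mode", "cluster",
    ]
    for key, value in sorted(conf.items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)
    return cmd


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    command = build_submit_command(
        api_server="https://k8s.example.com:6443",
        image="example/spark:latest",
        app="local:///opt/spark/app/job.py",
    )
    print(" ".join(command))
```

Inside the job itself, input and output would then be addressed with `s3a://` paths instead of `hdfs://` paths, assuming the image bundles the `hadoop-aws` connector and credentials are available to the pods.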
So do we still need Hadoop as a distributed file system alongside containers and Kubernetes? It depends on the application's requirements and its value proposition. Technically, it is feasible to run Hadoop with Docker and Kubernetes, but the ecosystem as a whole lacks smooth integration. A couple of recent open source projects try to solve this problem; whether Hadoop will remain the solution going forward, or whether we need a new or different distributed file system platform, only time will tell. Today, cloud storage platforms, Kafka, and Elasticsearch/Logstash each address the storage scalability problem with their own strengths in specific areas, while Hadoop and its ecosystem continue to be a dominant big data platform.