Cloudera VIP Customer Meetup

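Cloudera’s Hadoop distribution accounts for 80% of Hadoop installations worldwide, and its VIP customers each run multiple Hadoop clusters. This event brings two ethnic Chinese experts in data science and data engineering to Taiwan, giving Hadoop users a rare opportunity to exchange hands-on experience in data engineering and data science face to face.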

Agenda:

13:30 – 13:45 Opening remarks

13:45 – 14:45 Cloudera Data Science Workbench (CDSW), the Machine Learning Platform for Enterprise (Toward an Enterprise-Grade Machine Learning Platform)

Guest speaker: Josh Yeh, Software Engineer at Cloudera, Palo Alto

Works on Cloudera Data Science Workbench (CDSW) with Apache HDFS, YARN, and Spark

Works on E2E with ML/DL/AI frameworks such as Keras and TensorFlow in CDSW

Education: EECS, UC Berkeley College of Engineering

Synopsis:

Machine learning is all the rage. It offers great opportunities for enterprises that already capture vast amounts of data, and Cloudera’s customers use our platform to solve machine learning problems every day.

However, getting data from an enterprise data hub is no trivial task for a data scientist. The main challenges include, but are not limited to:

Accessing vast amounts of production data from a secured production cluster.

Maintaining machine learning tools, libraries, and frameworks.

Training efficiency with GPU clusters.

As a result, data scientists often have only a limited dataset for modeling and training. Without access to production data, they end up with data silos and small-dataset problems. Data governance, in turn, is a further set of problems for cluster administrators. Data scientists also want to shorten development cycles and deploy trained models into production as efficiently as possible, which is hard to accomplish in a production environment. Cloudera Data Science Workbench is a solution that enables data scientists while meeting enterprise data security requirements.

14:45 – 15:00 Break

15:00 – 16:00 Cloudera’s storage systems overview (Deconstructing the Storage Systems of a Big Data Platform)

Guest speaker: Wei-Chiu Chuang, Ph.D., Software Engineer at Cloudera, Palo Alto

Wei-Chiu joined Cloudera in 2015 as a software engineer, where he is responsible for the development of Cloudera’s storage systems, mostly the Hadoop Distributed File System (HDFS). He is an Apache Hadoop Committer and Project Management Committee member, in recognition of his contributions to the open source project. He is also a co-founder of the Taiwan Data Engineering Association, an organization that promotes better data engineering technologies and applications in Taiwan. Wei-Chiu received his Ph.D. in Computer Science from Purdue University for his research in distributed systems and programming models.

Synopsis:

In the past, Cloudera’s platform (CDH) supported two storage types that big data applications could leverage: HDFS and HBase. But Cloudera’s customers are constantly discovering new use cases and expecting the platform to support more types of workloads. Therefore, Cloudera’s storage systems team now supports three new storage systems optimized for different use cases: Kudu for IoT, and S3 and ADLS for the cloud. Meanwhile, the good old HDFS and HBase are getting a refresh with the release of Hadoop 3.0 and HBase 2.0.

In this talk, I will present an overview of these storage systems. I will highlight the new capabilities introduced in Hadoop 3.0 and HBase 2.0, then shift the focus to Kudu for IoT use cases, followed by S3 and ADLS for cloud use cases. With these new storage options available, we fully believe Cloudera’s customers will find better uses for their data and make what is impossible today possible tomorrow.
