As the world's largest professional network, LinkedIn has access to an enormous volume of high-quality data. Its data scientists and AI developers have been studying and working with that data using different tools for EDA (exploratory data analysis) and visualization across multiple query and storage engines. For this work, they needed a unified “one-stop” data science platform to consolidate and serve these varied demands.
Therefore, LinkedIn created DARWIN, the Data Science and Artificial Intelligence Workbench at LinkedIn, which targets use cases similar to those addressed by other prominent data science platforms in the industry. DARWIN builds on the Jupyter ecosystem (https://jupyter.org/) but is not limited to it, so that the data scientists and AI engineers at LinkedIn can use it without restrictions.
According to LinkedIn's official blog, DARWIN has addressed two principal challenges that data scientists and AI engineers have faced:
Earlier, different needs were met by different products: Jupyter notebooks helped with data visualisation and evaluation; machine learning libraries such as GDMix, XGBoost, and TensorFlow were used to train and evaluate machine learning models; and Tableau was used to visualise data and surface insights.
DARWIN was created to meet these challenges and to serve LinkedIn's data scientists and AI engineers, business analysts, metrics developers (who generate and publish metrics using LinkedIn’s Unified Metrics Platform, UMP), and data developers.
Additionally, DARWIN needed to fulfil the need for data visualisation, training and insights, so these became the essential elements introduced in DARWIN.
DARWIN also offers IDE-like code support, with support for several languages and the ability to commit code directly to project repositories. In keeping with LinkedIn’s ethos of providing trusted solutions, DARWIN gives secure and compliant access to the hosted platform.
"A key principle we decided to abide by was to leverage open-source projects and contribute to the open-source community while keeping the platform extensible for accommodating rapid innovations in this space. Some of the key open source technologies we chose were JupyterHub, Kubernetes, and Docker," states the LinkedIn blog.
DARWIN: One-stop shop for data platforms
Users can query LinkedIn datasets through DARWIN on several engines. Spark can be used through Python, R, Scala, or Spark SQL. Users also have direct access to data on HDFS, which is useful when working with platforms like TensorFlow.
The underpinnings of the DARWIN platform
With the help of Kubernetes, DARWIN can scale to accommodate an expanding base of users who study LinkedIn data.
Docker images provide extensibility
With the help of Docker, DARWIN lets different teams and users build and innovate on top of different libraries and apps, making DARWIN a “Bring Your Own Application” (BYOA) environment.
DARWIN has also integrated Greykite, a forecasting library that helps with input data visualization, model parameterization, and forecast visualization and interpretation, which users can access from a Jupyter notebook.
Fine-grained access control is used to control access to DARWIN resources, preventing any unwanted access.
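The idea of fine-grained access control can be illustrated with a minimal Python sketch. The source does not describe DARWIN's actual access model, so everything here is an assumption made for illustration: the `Resource` and `Permission` names, the per-user grant table, and the rule that an owner always has full access are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Permission(Enum):
    """Hypothetical permission levels; not taken from DARWIN."""
    READ = "read"
    WRITE = "write"


@dataclass
class Resource:
    """A hypothetical shareable resource, such as a notebook."""
    name: str
    owner: str
    # Maps a user name to the set of permissions explicitly granted to them.
    grants: dict[str, set[Permission]] = field(default_factory=dict)

    def share(self, user: str, permission: Permission) -> None:
        """Grant a single permission on this resource to one user."""
        self.grants.setdefault(user, set()).add(permission)

    def is_allowed(self, user: str, permission: Permission) -> bool:
        """Return True if the user may perform the given action."""
        # Assumption: the owner always has full access.
        if user == self.owner:
            return True
        # Everyone else needs an explicit grant for that exact permission.
        return permission in self.grants.get(user, set())


notebook = Resource(name="churn-analysis", owner="alice")
notebook.share("bob", Permission.READ)
```

In this sketch, access is denied by default: `bob` can read the notebook but cannot write to it, and a user with no grant at all cannot do either.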
"DARWIN was designed to act as a means for accessing and sharing knowledge amongst users to enhance collaboration and learning. We envisaged DARWIN to be the one-stop place for all the knowledge related to working with data, without having to leave the platform, be it accessing data, understanding it, analyzing it, finding references to build context, or generating reports. Next, we cover some of the work we have done towards achieving this vision," states the blog.
In addition, DARWIN enables search and discovery of metadata with the support of DataHub, allows users to share resources with other users to encourage collaboration, and provides storage services and platform services that manage DARWIN's resource metadata. Lastly, DARWIN uses React.js heavily, building React-based JupyterLab extensions to support the frontend of most of its features. React.js, with its vibrant community, rich plugin support, and excellent performance, has become the framework of choice for DARWIN.
The key features that set DARWIN apart and let it cater to different personas include support for multiple languages, including Python, SQL, R, and Scala for Spark, which are used by LinkedIn's data scientists and AI engineers. DARWIN also provides Intellisense across these languages, that is, capabilities such as code completion, doc help, and function signatures, which are some of the most important features in an IDE. DARWIN also features SQL workbooks for citizen data scientists, business analysts, or anyone comfortable working with SQL.
There are several other features that make DARWIN a one-stop solution for AI engineers and data analysts at LinkedIn, but these are only the beginning. "Our goal is that the DARWIN platform continues to evolve to best meet the growing (and changing) needs of our users," concluded the blog.