CTC: Undisclosed
Job Location: United States of America (USA)
Experience: 7 - 10 yrs
On the Data and Performance Team, we are building applications powered by machine learning to apply data quality across the global Data Lake for all enterprise data. The Data Lake is a Big Data platform fully hosted on AWS and currently connected to more than 40 data sources. The data sources include different ERP systems, HR, supply chain, marketing, sales, etc.
We are looking for an experienced Machine Learning Engineer who can build the AI engine for the Data Quality platform using Spark. Our team is building machine learning algorithms that improve data quality at scale. The ideal candidate can build resilient, scalable and distributed data pipelines in production, and can build and deploy machine learning algorithms into the ETL flow for prediction and retraining. The individual should be able to understand business requirements and build, manage or update pipelines on the AWS platform with minimal support.
Schneider Electric creates connected technologies that reshape industries, transform cities and enrich lives. Our 135,000 employees thrive in more than 100 countries. From the simplest of switches to complex operational systems, our technology, software and services improve the way our customers manage and automate their operations. Help us deliver solutions that ensure Life Is On everywhere, for everyone and at every moment.
Great people make Schneider Electric a great company.
What do you get to do in this position?
Connect new data sources to enrich the scope of the data quality platform
Build pipelines using AWS services such as Glue (with triggers), SageMaker, Airflow, CloudWatch, Lambda and Step Functions
Build distributed and scalable ETL pipelines that can ingest and process millions of records (applying feature engineering for machine learning predictions on more than 1 TB of data) on serverless architecture using PySpark or Scala
Build resilient data pipelines that can ingest millions of records and calculate statistics (mean, median, standard deviation, count, sum), apply when and case statements, express iteration without explicit for loops, and apply data transformations on either batch or streaming data sets
Design and develop new features based on consumer application requests to ingest data into the data quality platform
Develop reusable functions based on Object Oriented Programming concepts
Automate the integration and delivery of data objects and data pipelines
Build and integrate ML algorithms with the ETL flow for prediction and retraining
Do you have the experience and skills?
The duties and responsibilities of this job are to build resilient data pipelines that apply feature engineering and make the results available in an efficient and optimized format for applications such as AI-Data Quality, Data Profiling and Rules Management. The job requires working with the current technologies used on AWS, in particular Spark (PySpark or Scala), Python, Redshift, EMR, AWS Glue, SageMaker, S3, Postgres, CloudWatch, Step Functions, Lambda, AWS CI/CD pipelines, Git, EC2, Kubernetes, JSON and Parquet files, and Athena.
We know skills and competencies show up in many different ways and can be based on your life experience. If you do not necessarily meet all the requirements that are listed, we still encourage you to apply for the position.
This job might be for you if:
Model deployment requirements:
Prior experience prototyping, building and deploying machine learning models, including ETL for feature engineering, using advanced analytics tools such as AWS Glue, Lambda, Batch, SageMaker, CloudWatch and Step Functions
Experience in Spark (PySpark and Scala are a must) for feature engineering, deployments and prediction on batch and streaming datasets
Experience in tuning Spark parameters (Spark architecture) and related configurations to optimize data pipelines, especially fine-tuning shuffle partitions, defining partitions with respect to the available processing nodes and threads, and handling caching between memory and storage
Experience with handling User Defined Functions (UDFs) and Spark DataFrames (joins, broadcasting, etc.)
Experience with and good understanding of DAGs to optimize Spark jobs
Experience handling terabytes of data for processing and building features for machine learning models
Extensive knowledge of handling Transformations and Actions in Spark
Experience in error handling and logging errors
Experience with Docker for model deployments on EC2 instances and Kubernetes
Analyze data models, identify and implement performance optimizations on all the existing data pipelines
Experience integrating JSON files as input to PySpark and Scala functions
Experience using Spark MLlib packages on large data sets for clustering and classification type models
Experience building reusable microservices and defining the right server requirements (defining the correct nodes, including serverless architecture) for Spark jobs
Programming languages: Python, Scala, PySpark, SQL
AWS analytics: S3, EMR, Redshift, Athena, CloudFormation tools, Airflow, Postgres
AWS ML pipeline: SageMaker, EC2, Glue, Lambda, Step Functions, Docker
CI/CD pipeline: CodeCommit, CodePipeline, CodeBuild, Git
Packages: PySpark [sql, session, ml, context], Spark MLlib, sklearn, Koalas
Prior ML implementation experience with regression (linear, logistic, regularized, ridge, lasso, elastic net), trees (decision tree, random forest, AdaBoost, XGBoost), deep learning sequence models such as RNN, GRU, LSTM, BERT, HMM, Viterbi, EM non-Gaussian models, clustering and anomaly detection, recommendation engines, ranking systems
At least 5 years of experience in software development with Object Oriented Programming, with proven experience building and deploying data pipelines using PySpark or Scala, and building and deploying machine learning models to production environments on the AWS platform.
Nice to have:
Master's or PhD in Computer Science, Machine Learning, Physics, Mathematics, Statistics, Operations Research or a related technical field, with 7 years of experience building and deploying Spark-based machine learning pipelines to production
Experience with one or more of the following: Classification, Clustering, Natural Language Processing, recommendation systems, ranking systems or similar
Experience with Python machine learning libraries such as TensorFlow, PyTorch and Keras
Experience with data structures and algorithms
AWS Certified Solutions Architect - Associate certification
AWS Certified Machine Learning - Specialty certification