Senior System Reliability Engineer

Symphony

About us @Symphony 

We’ve spent the last 8 years building the financial markets’ largest, most trusted communication network. Over 500 market participants across the buy-side, sell-side, securities servicing, and beyond. Over half a million users, from trading desks to operations and custody teams, interacting securely and in real time on Symphony.

But that was only chapter one. We’re now using our technology foundation to move far beyond secure collaboration and become the standard connective layer for the industry, enabling more efficient and automated workflows and bringing the future to financial markets.

The opportunity and our ambition are huge. But we need passionate, dedicated individuals to get there. At Symphony we work hard and fast. Our unique blend of technology and financial services makes this an environment you won’t find elsewhere.

Role Summary:

We are seeking a Systems Reliability Engineer (SRE) to help support our production platform running on the AWS public cloud. In this role, you’ll work with the DevOps team to monitor KPIs, scale services, validate capacity through load testing, build dashboards, and implement proactive alerting to ensure overall performance and site reliability for our mission-critical global SaaS offering.

You will work in non-production and production environments, monitoring systems, collecting data, and reviewing configuration management. You will also participate in on-call responsibilities, disaster recovery planning, capacity engineering, reliability improvement initiatives, and platform automation. We are seeking someone who asks questions, learns from others, communicates proactively, and can create order in chaos.

What you’ll be doing: 

Serve as a key member of the DevOps team while managing the overall system health, performance, and capacity of Cloud9 internal and client-facing systems.
Implement and improve tooling to measure and visualize the availability, performance, and stability of our services, and uphold the highest SLA levels through operational excellence.
Proactively monitor KPIs and statistics for high-availability production web services.
Respond to alerts and investigate issues or performance bottlenecks. Conduct and document failure analyses, post-mortems, and root cause analyses when required.
Troubleshoot application and service issues or system outages while clearly communicating status updates to management and engineering teams.
Administer and manage Linux systems, and implement SRE frameworks to support multi-cloud environments.
Improve the quality of our technical engineering documentation.
Manage our global AWS services and infrastructure while assisting in the deployment and roll-out of new features and products.
Assist with upgrading and scaling our time-series system and application performance trending platform, including recommending solutions and creating POCs.
Contribute to Continuous Integration and Continuous Deployment (CI/CD) solutions in an AWS environment with Jenkins at the center.
Participate in an on-call rotation.

 

Background and capability: 
Deep understanding of SRE philosophy, technologies, platforms and tools, SLA management, incident resolution, and automation
Prior experience with large-scale distributed systems and technologies, where uptime and continuous availability were core to the business
Experience working in an AWS environment with S3, EC2, CloudWatch, VPC, ELB/ALB, Route 53, and RDS
Experience with and knowledge of logging tools and services, such as Splunk or ELK
Experience with alerting and escalation tools like Nagios, PagerDuty, and Opsgenie
Hands-on experience with monitoring and metrics systems (e.g. SignalFx, Datadog, Graphite, Grafana) and APM tools (e.g. New Relic, AppDynamics, Dynatrace)

 

Experience and interests:

Experience with Linux / UNIX systems administration, including shell scripting, database configuration and management, and managing server infrastructure
Well versed in internet architectures, including web, application, and database components such as Apache, IIS, Memcached, Redis, MySQL, NoSQL, pub/sub, messaging, caching technologies, and data warehousing
Proficiency in one or more scripting languages, such as Python, Perl, Bash, or a similar language
Experience with Puppet, Chef, or Ansible and a solid understanding of configuration management theory, implementation, and scaling
Understanding of networking and cloud technologies, and experience with network troubleshooting, including IP fundamentals, DNS, load balancing, proxies, and firewalls

 

Benefits and Perks:

Competitive salary
Bonus plan
Build Your Own Benefits (BYOB) perk
Benefits based on location
Many other fun and exciting benefits and activities!

We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status. We will ensure that individuals with disabilities are provided reasonable accommodation to participate in the job application or interview process, to perform essential job functions, and to receive other benefits and privileges of employment. Please contact us to request an accommodation.

Symphony reserves the right of ownership for all unsolicited resumes submitted for this requisition and is not responsible for any fees associated with unsolicited resumes. Symphony participates in E-Verify.

Any offer of employment is conditioned upon the completion of an I-9 form and submission of the appropriate documents for identity and work authorization.

To apply for this job please visit symphony.com.