sr_tpm · google_cloud · ml_infrastructure

Hitesh Kaushal.

Twenty years shipping ambitious, high‑stakes software programs at the world’s most demanding engineering organizations. Currently leading ML infrastructure reliability for GPU and TPU fleets at Google Cloud.

Years shipping
20+
Engineers led
40+
US patents
4
Risk buys
$900M

01 · about

Operator’s instinct,
engineer’s rigor.

I run the kind of programs most leaders quietly hope someone else takes — ambitious, cross‑org, technically dense, and visible from the top. At Google Cloud I lead ML infrastructure reliability programs for the GPU and TPU fleets that power the world’s most demanding AI workloads. Before that, I led AWS’s largest compute platform launches, shipped flagship Surface devices for Microsoft, and managed nine BlackBerry product introductions.

I hold an MBA and four U.S. patents, and I have published a peer‑reviewed engineering paper. The thread across all of it: pulling clarity out of ambiguity and turning it into something a team of 40+ engineers can actually build and ship.

02 · why ai

The most important programs are in AI right now.

AI tools changed how I work — and that changed what I want to work on. I use Claude, Gemini, and ChatGPT daily, not as productivity shortcuts but as genuine thinking partners. Working on the infrastructure that makes these models possible, I’ve watched the technology move from impressive to transformative faster than any platform shift I’ve seen in twenty years.

What draws me to labs like Anthropic isn’t the hype. It’s the specific, hard problem: how do you ship the most capable AI systems ever built — safely, reliably, at speed, at scale? That is a program management problem. The stakes are higher, the cross‑org complexity is greater, and the cost of getting the design‑vs‑execution tradeoff wrong is more consequential than anything I’ve encountered before.

My career has been a series of platform‑defining bets: BlackBerry’s global supply chain, AWS compute at pandemic scale, Surface on development budgets above $50M, and now ML infrastructure for Google’s GPU and TPU fleets. Frontier AI is the next platform. I want to help build the operational foundation that lets the best researchers in the world do their best work safely.

I’ve started building with these tools too — a Python optimization add‑on for Google Sheets being the most recent example. Not because I’m becoming an engineer, but because the gap between “I have an idea” and “I have a working tool” has collapsed. That changes how I think about what’s possible inside a program.

Claude · Gemini · ChatGPT · Google AI IDE · Python

03 · philosophy

The trade‑off no one names out loud.

Every program lives on a single dial: how much time you spend designing versus how much you spend executing.

Get that ratio wrong in either direction and the program fails — over‑designed programs miss the window, under‑designed ones burn cash rediscovering what a week of thinking would have surfaced.

The right ratio is not a constant. It moves with the cost of being wrong, the reversibility of the decision, and the rate at which the surrounding context is changing. A flagship hardware launch with a fixed factory window deserves more design up front than a software experiment whose first release teaches you more than any document could.

In AI development, this problem becomes asymmetric. Moving too slowly cedes the frontier. Moving without enough design produces reliability and safety failures that compound in ways that are hard to recover from. The dial matters more here than anywhere I’ve worked — and calibrating it correctly, in real time, at speed, is the whole job.

Everything else I do follows from that: how I structure milestones, how I escalate, how I trade scope, how I decide when to commit. Programs do not fail because people are unskilled. They fail because someone optimized the wrong side of that dial.

04 · experience

Twenty years. Four flagship companies.

  1. GC
    Oct 2022 – Present · Kirkland, WA

    Google Cloud

    Senior Software Technical Program Manager — ML Infrastructure

    • Chairs SVP‑level forum defining ML infrastructure reliability strategy across the entire GPU and TPU fleet — with a focus on increasing MTBF, reducing MTTR, and achieving best‑in‑class operational efficiency via lower holdback, downtime, and spares costs.
    • Led cross‑functional program to reduce H100 tray swap rates and implement granular, lower‑cost FRU schemas, delivering significant Opex savings.
    • Led large program with multiple workstreams to lower Time to Repair for ML machines through reduction of process latency, improved fault attribution, and enhanced diagnosability.
    • Led cross‑functional new product launches for data center infrastructure, influencing key architectural decisions to optimize feature sets and TTM, saving months of SWE development time.
    • Leads a diverse team of 40+ software engineers delivering critical data center software solutions supporting Google Cloud’s rapid AI infrastructure expansion.
  2. AWS
    Jan 2020 – Oct 2022 · Seattle, WA

    Amazon Web Services

    Senior Supply Chain Product Manager — AWS Compute

    • Launched AWS’s largest Intel Ice Lake compute platform (CMR), AWS Mac instances, CloudFront, Storage and ML Accelerator products.
    • Executed $900M in risk buys, building the case to SVP level to pre‑empt pandemic‑era supply shocks across AWS global data centers.
    • Built and owned the long‑range forecasting process and product roadmap for Amazon’s Infrastructure, Supply Chain & Procurement (ISCaP) team.
    • Designed six executive dashboards that eliminated 10,000+ person‑hours of manual supply‑chain data analysis annually.
  3. MS
    Apr 2016 – Jan 2020 · Redmond, WA

    Microsoft

    Senior Technical Program Manager — Surface

    • Owned Surface Book 2 and Surface Hub 2 end to end (combined development budgets >$50M), from business case through product quality signoff.
    • Architected and coded (PowerQuery) a BOM management process that cut 2‑week transfer cycles to 1 day, saving 10,000+ person‑hours/year. Now standard across all Surface teams.
    • Achieved <0.01% excess & obsolete cost at Surface Hub 2S launch — the record for any Surface product.
  4. BB
    Jan 2011 – Mar 2016 · Waterloo, Canada

    BlackBerry

    Program Manager — NPI, Manufacturing, Supply Chain

    • Shipped nine BlackBerry devices including Z30, Passport, and PRIV with exceptional yield and factory readiness.
    • Built supply planning for 20,000 SKUs across seven factories during the Kinaxis‑to‑SAP transition.
  5. MM
    2003 – 2011 · Canada

    Mold‑Masters & Earlier Roles

    NPD Engineer · General Manager

    Created a $100M caps‑and‑closures product line the industry had pursued for two decades. Four U.S. patents and a peer‑reviewed engineering paper emerged from this chapter.

05 · selected impact

Outcomes, not plans.

$900M

Risk buys executed at AWS

Built the case across product, finance, and SVP‑level stakeholders to commit advance purchases that pre‑empted pandemic‑era supply shocks across AWS global data centers.

aws · 2020–22

First of kind

AWS Mac mini cloud instance

Launched AWS’s first Apple silicon cloud instance, building an end‑to‑end supply chain that did not previously exist inside AWS.

aws · ec2 mac

10,000+ hrs/yr

Saved by a new BOM process

Architected and coded (PowerQuery) a BOM transfer process at Microsoft that cut cycle time from 2 weeks to 1 day. Now used by every Surface team.

microsoft · surface

< 0.01%

Excess & obsolete cost at launch

Set the record for any Surface product at Surface Hub 2S launch — the lowest E&O ratio in program history on the most component‑dense Surface device ever built.

microsoft · hub 2s

06 · skills

Tools of the trade.

AI & daily tools

Claude · Gemini · ChatGPT · Google AI IDE

Data & analytics

SQL · Python · Tableau · QuickSight · PowerQuery · DAX

Program leadership

OKRs · SVP stakeholders · Teams of 40+ engineers · Cross-org programs

Domains

ML infrastructure · Data centers · Supply chain · NPI

07 · patents & publications

The engineer underneath.

Four issued U.S. patents and a peer‑reviewed paper from an earlier chapter in engineering. They matter here because the program instinct is grounded in knowing how things actually get built.

08 · contact

Get in touch.

Send a message directly, or connect on LinkedIn. I read everything.

kirkland, washington