Hitesh Kaushal — Senior Technical Program Manager

01 · about

Operator’s instinct,
engineer’s rigor.

Compute program leader who keeps hyperscale GPU/TPU AI fleets reliable, efficient, and online. 25 years across Google Cloud, AWS, Microsoft, and BlackBerry owning the full compute lifecycle (capacity bring‑up, fleet reliability and utilization, hardware lifecycle, and cost efficiency), with the fleet‑wide observability that gives leadership a single source of truth on supply, demand, and health. I am at my best owning ambiguous, high‑stakes problems end to end: leading crisis response across software, hardware, manufacturing, and supply chain, and turning it into repeatable systems. Cumulative impact to date: $1B+ in savings, cost avoidance, and downtime mitigation.

02 · why ai

The most important programs are in AI right now.

AI tools changed how I work — and that changed what I want to work on. I use Claude, Gemini, and ChatGPT daily, not as productivity shortcuts but as genuine thinking partners. Working on the infrastructure that makes these models possible, I’ve watched the technology move from impressive to transformative faster than any platform shift I’ve seen in twenty‑five years.

What makes AI infrastructure uniquely demanding isn’t the hype. It’s the specific, hard problem: how do you ship the most capable AI systems ever built — safely, reliably, at speed, at scale? That is a program management problem. The stakes are higher, the cross‑org complexity is greater, and the cost of getting the design‑vs‑execution tradeoff wrong is more consequential than anything I’ve encountered before.

My career has been a series of platform‑defining bets: BlackBerry’s global supply chain, AWS compute at pandemic scale, Surface programs with $50M+ development budgets, and now ML infrastructure for Google’s GPU and TPU fleets. Frontier AI is the next platform. I want to help build the operational foundation that lets the best researchers in the world do their best work, safely.

I’ve started building with these tools too — a portfolio app that consolidates my investments across multiple institutions into a single performance view, and a Python optimization add‑on for Google Sheets. Not because I’m becoming an engineer, but because the gap between “I have an idea” and “I have a working tool” has collapsed. That changes how I think about what’s possible inside a program.

Claude · Gemini · ChatGPT · Google AI IDE · Python

03 · philosophy

The trade‑off no one names out loud.

Every program lives on a single dial: how much time you spend designing versus how much you spend executing.

Get that ratio wrong in either direction and the program fails — over‑designed programs miss the window, under‑designed ones burn cash rediscovering what a week of thinking would have surfaced.

The right ratio is not a constant. It moves with the cost of being wrong, the reversibility of the decision, and the rate at which the surrounding context is changing. A flagship hardware launch with a fixed factory window deserves more design up front than a software experiment whose first release teaches you more than any document could.

In AI development, this problem becomes asymmetric. Moving too slowly cedes the frontier. Moving without enough design produces reliability and safety failures that compound in ways that are hard to recover from. The dial matters more here than anywhere I’ve worked — and calibrating it correctly, in real time, at speed, is the whole job.

Everything else I do follows from that: how I structure milestones, how I escalate, how I trade scope, how I decide when to commit. Programs do not fail because people are unskilled. They fail because someone optimized the wrong side of that dial.

04 · experience

Twenty‑five years. Four flagship companies.

Oct 2022 – Present · Kirkland, WA

Google Cloud

Senior Software Technical Program Manager & Portfolio Lead — ML Infrastructure
- Owns a 9‑program portfolio across software, hardware, manufacturing, and supply chain for custom AI compute fleets, delivering $1.03B+ in cumulative savings, cost avoidance, and downtime mitigation.
- Shifted accelerator servicing from full‑tray decommissioning to component‑level board replacement (“Accelerator‑as‑a‑FRU”), cutting tray‑swap rate 37.3% → 4.7% — $761M first‑year savings, 4,800+ swaps avoided.
- Cut new‑cluster turn‑up 58% (34 → 14 days) with a modular maintenance model aligning hardware and software engineering to bring AI capacity online faster.
- Led a fleet‑wide AI‑chip overheating crisis (Code Yellow): protected $1.5B in active assets, stood up VP war‑room and dashboard governance, drove MTTR under 2 days across 1,600+ chips.
- Built a global silent‑data‑corruption diagnostic pipeline (84% faster scans) and a “drainless” live‑maintenance program enabling zero‑downtime hot‑swaps — $65M saved, 85% repair coverage.
- Designed the Domain Health & SLO dashboard for next‑gen AI supercompute, plus LLM‑driven support‑triage automation — one view of fleet health, repair trends, and out‑of‑SLO domains.
- Led the cross‑functional hardware‑recovery workstream that secured an unprecedented RMA agreement with a leading GPU manufacturer — replacing 1,500+ chips and raising thermal limits.
- Recognized with Google Technical Impact Award, Gold Star (crisis leadership), Four Star (lean & dependable ops), and Spot & Peer bonuses.

Jan 2020 – Oct 2022 · Seattle, WA

Amazon Web Services

Senior Supply Chain Program Manager — AWS Infrastructure New Product Launches
- Ran end‑to‑end hardware NPI and global deployment for AWS’s largest server instances, Mac compute instances, content‑delivery and storage platforms, and custom AI accelerators.
- Scaled architectures from prototype to high‑volume manufacturing across regional data center operations, global supply chain, and tier‑1 external manufacturers.
- Governed engineering and sustaining‑change pipelines to contain supply‑chain risk, component obsolescence, and capacity bottlenecks.

Apr 2016 – Jan 2020 · Redmond, WA

Microsoft

Senior Technical Program Manager — Surface
- Led Surface Hub and Surface Book lines through the full stage‑gate lifecycle to quality signoff (combined dev budgets >$50M).
- Redesigned the BOM management process (PowerQuery) to automate cross‑team design alignment and simplified risk‑buy workflows, saving ~$5M in excess & obsolete material cost.

Jan 2011 – Mar 2016 · Waterloo, Canada

BlackBerry

Program Manager — NPI, Manufacturing, Supply Chain
- Automated manufacturing ramp‑planning, standardizing KPIs and cutting cycle time 50%.
- Built demand‑waterfall and carrier‑rollout analytics for the COO; led supply planning across seven factories during the Kinaxis‑to‑SAP transition.

2003 – 2011 · Canada

Mold‑Masters & Earlier Roles

NPD Engineer · General Manager

Created a $100M caps‑and‑closures product line the industry had pursued for two decades. Four U.S. patents and a peer‑reviewed engineering paper emerged from this chapter.

05 · selected impact

Outcomes, not plans.

$1.03B+

Portfolio impact at Google Cloud

Own a 9‑program portfolio across software, hardware, manufacturing, and supply chain — delivering cumulative savings, cost avoidance, and downtime mitigation across custom AI compute fleets.

google cloud · ml infra

$761M

Accelerator‑as‑a‑FRU savings

Shifted servicing from full‑tray decommissioning to component‑level board replacement, cutting tray‑swap rate 37.3% → 4.7% — 4,800+ swaps avoided in year one.

google cloud · hardware

58% faster

GPU/TPU cluster turn‑up

Modular maintenance model reduced new‑cluster bring‑up from 34 to 14 days — accelerating delivery of new AI infrastructure capacity at Google Cloud scale.

google cloud · gpu/tpu

84% faster

Silent‑data‑corruption scans

Fleet‑wide diagnostic pipeline cut scan time from 6.3 days to under 24 hrs, enabling faster fault isolation across the AI accelerator fleet.

google cloud · reliability

$1.5B protected

Code Yellow crisis response

Led fleet‑wide AI‑chip overheating crisis: VP war‑room, dashboard governance, MTTR under 2 days across 1,600+ chips.

google cloud · crisis

$65M saved

Drainless live‑maintenance

Zero‑downtime hot‑swap program enabling 85% repair coverage without draining compute capacity from live AI workloads.

google cloud · ml infra

06 · skills

Tools of the trade.

AI & daily tools

Claude · Gemini · ChatGPT · Google AI IDE

Data & analytics

SQL · Python · Tableau · QuickSight · PowerQuery · DAX

Program leadership

OKRs · SVP stakeholders · 40+ eng cross-org · Cross-org programs

Domains

ML infrastructure · Data centers · Supply chain · NPI

07 · patents & publications

The engineer underneath.

Four issued U.S. patents and two published works from an earlier chapter — one in engineering, one in leadership thinking. They matter here because the program instinct is grounded in knowing how things actually get built, and in writing clearly about why they should be built that way.

US 20100278962

Injection Molding Runner Apparatus Having Pressure Seal

Edge‑gated Nozzle

Melt Channel Geometries for an Injection Molding System

Injection Molding Nozzle Wedge Seal

paper · Electrification of Hot Runners · ANTEC, NPE 2009
essay · Leadership – visions for the future · The Tribune (India), June 2000

08 · contact

Get in touch.

Send a message directly, or connect on LinkedIn or X. I read everything.

LinkedIn · X / Twitter

kirkland, washington