
Profiling mat-mul optimizations across platforms

2025-11-19 · Arsh Sharma
  1. Delulu
  2. In enters Xcode

Delulu

macOS, which runs on Apple laptops and desktops, is a different operating system from Linux. At a very high level we can say that it's UNIX-based, but it does not run the Linux kernel, so it has no support for the perf tool that we used so extensively in the previous profiling blog.

To be able to benchmark the mat-mul code on an M4 CPU, we can quickly think of three realistic options:

I could run Linux in a VM and keep using perf, but that is something I want to try later. I don't think we would have access to the HW PM signals inside a VM (at least that was the case on my Windows laptop).

In enters Xcode

I have only been a Mac user for a few months now, and in terms of development, I am in a comfort zone with VSCode+nano :/

As it turns out, on Apple silicon, the official way to get hardware performance counters (instructions, branches, cache events, etc.) is Instruments, which ships with Xcode and exposes low-level PMU events.

clang++ -march=native -O0 -o <simple_matmul_4096> -g ../../../specific_runner.cpp
xcrun xctrace record --template 'My Matmul Counters' --output matmul.trace --launch -- ./mat_mul
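
For readers who want something concrete to point the commands above at, here is a minimal sketch of the kind of workload being profiled. It is not the actual specific_runner.cpp from this post. The names naive_matmul and time_matmul are hypothetical, and the row-major layout, float element type, and constant fill values are all assumptions made purely for illustration.

```cpp
// Hypothetical stand-in for the profiled workload; NOT the post's specific_runner.cpp.
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Naive row-major n x n matrix multiply: c = a * b.
void naive_matmul(const std::vector<float>& a, const std::vector<float>& b,
                  std::vector<float>& c, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < n; ++k) {
                // b is walked column-wise here, which is what makes this version cache-unfriendly.
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}

// Runs one multiply of size n on constant-filled inputs (an assumption)
// and returns the elapsed wall-clock time in seconds.
double time_matmul(std::size_t n) {
    std::vector<float> a(n * n, 1.0f), b(n * n, 2.0f), c(n * n, 0.0f);
    const auto start = std::chrono::steady_clock::now();
    naive_matmul(a, b, c, n);
    const auto stop = std::chrono::steady_clock::now();
    // Read the result so the multiply is not discarded at higher optimization levels.
    volatile float sink = c[0];
    (void)sink;
    return std::chrono::duration<double>(stop - start).count();
}
```

A main that calls time_matmul with the matrix size of interest and prints the result would be enough to give Instruments a single hot loop to attribute counter samples to. Wall-clock time from std::chrono is only a sanity check next to the PMU counters, not a replacement for them.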