Automatic Descriptive Transcription of Carnatic Music


Name of the Speaker: Venkata Subramanian Viraraghavan (EE16D024)
Guide: Dr. Aravind R
Co-Guide: Dr. Hema A Murthy
Date/Time: 14th December 2022, 2.00 PM

Carnatic music (CM) employs a profusion of continuous pitch variations called gamakas in addition to the usual 12 musical notes per octave. However, CM notation is written in terms of svaras with little or no gamaka information, so the music cannot be synthesized from the notation alone. Previous work on extracting gamaka information in a descriptive transcription for synthesis is fairly recent, and treats the pitch curve as a whole. In this research, we aim to automatically extract a descriptive transcription for CM by separating the pitch curve into its components. Towards this end, we define a constant pitch note (CPN) as a segment of the pitch curve whose pitch stays within empirically determined limits. The pitch curve is then viewed as consisting of CPNs and transients, which are the segments of the pitch curve outside CPNs. We further define stationary points, or STAs, as the maxima and minima of transients.
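The decomposition described above can be illustrated with a minimal sketch. It is not the method used in this work: the greedy run search, the tolerance `tol_cents`, the minimum duration `min_frames`, and all function names are assumptions chosen only for illustration, and the pitch curve is assumed to be a list of per-frame values in cents.

```python
def find_cpns(pitch_cents, tol_cents=30.0, min_frames=5):
    """Greedy search for runs whose pitch range stays within tol_cents.

    Returns half-open (start, end) frame ranges. Both thresholds are
    illustrative assumptions, not the empirical limits of the thesis.
    """
    cpns = []
    n = len(pitch_cents)
    start = 0
    while start < n:
        lo = hi = pitch_cents[start]
        end = start + 1
        while end < n:
            new_lo = min(lo, pitch_cents[end])
            new_hi = max(hi, pitch_cents[end])
            if new_hi - new_lo > tol_cents:
                break
            lo, hi = new_lo, new_hi
            end += 1
        if end - start >= min_frames:
            cpns.append((start, end))
            start = end
        else:
            start += 1
    return cpns


def find_transients(cpns, n_frames):
    """Transients are the frame ranges not covered by any CPN."""
    transients = []
    prev_end = 0
    for start, end in cpns:
        if start > prev_end:
            transients.append((prev_end, start))
        prev_end = end
    if prev_end < n_frames:
        transients.append((prev_end, n_frames))
    return transients


def find_stas(pitch_cents, transients):
    """Return frame indices of local maxima and minima inside transients."""
    max_stas, min_stas = [], []
    n = len(pitch_cents)
    for start, end in transients:
        for i in range(max(start, 1), min(end, n - 1)):
            prev, cur, nxt = pitch_cents[i - 1], pitch_cents[i], pitch_cents[i + 1]
            if cur > prev and cur >= nxt:
                max_stas.append(i)
            elif cur < prev and cur <= nxt:
                min_stas.append(i)
    return max_stas, min_stas


# Toy pitch curve: a held note, one upward excursion, then the held note again.
pitch = [0, 0, 0, 0, 0, 0, 60, 130, 200, 140, 70, 0, 0, 0, 0, 0, 0]
cpns = find_cpns(pitch)                            # [(0, 6), (11, 17)]
transients = find_transients(cpns, len(pitch))     # [(6, 11)]
max_stas, min_stas = find_stas(pitch, transients)  # [8], []
```

In this toy curve the two held segments are found as CPNs, the excursion between them is the single transient, and its peak is the only max-STA.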

A histogram of pitch values folded to one octave has significant values between the musical notes due to gamakas. By contrast, histograms of only CPN pitch values show sharp peaks at notes in the raga. We further propose a novel view of a CPN in CM as an upward anchor and/or a downward anchor depending on the direction of adjacent pitch movements. The peaks in the histograms of upward and downward anchors in a raga are detected as anchor-targets. Next, we separate STAs into maxima (max-STAs) and minima (min-STAs). We detect max-STA targets and min-STA targets from the peaks of the respective histograms. The anchor-targets and STA-targets not only explain the notes and gamakas in a CM raga, but also serve as a reference for component-wise precision measurement. The difference in measured precision, ~20 cents for CPNs and ~60 cents for STAs, suggests that transcription should also be done component-wise.
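As a rough illustration of the histogram step, the sketch below folds pitch values (assumed to be in cents relative to the tonic) into one octave and picks local maxima as candidate targets. The bin width, the `min_count` threshold, the circular peak-picking rule, and the function names are all assumptions made for this example; the peak detection actually used in this work may differ. The same functions could be applied separately to upward anchors, downward anchors, max-STAs, or min-STAs.

```python
def folded_histogram(pitch_cents, bin_width=10):
    """Histogram of pitch values folded to one octave (1200 cents)."""
    n_bins = 1200 // bin_width
    hist = [0] * n_bins
    for p in pitch_cents:
        # The trailing modulo guards against float rounding at the octave edge.
        hist[int((p % 1200) // bin_width) % n_bins] += 1
    return hist


def histogram_peaks(hist, bin_width=10, min_count=3):
    """Return bin centres (in cents) of local maxima with enough support.

    The histogram is treated as circular because the octave wraps around.
    """
    n = len(hist)
    peaks = []
    for i, count in enumerate(hist):
        left, right = hist[(i - 1) % n], hist[(i + 1) % n]
        if count >= min_count and count > left and count >= right:
            peaks.append(i * bin_width + bin_width / 2)
    return peaks


# Toy CPN pitch values clustered near the tonic and the fifth, in two octaves,
# plus one stray value that should not become a target.
cpn_pitches = [1, 3, 4, 1203, 702, 705, 704, 1905, 350]
hist = folded_histogram(cpn_pitches)
targets = histogram_peaks(hist)  # [5.0, 705.0]
```

The returned values are bin centres, so their resolution is limited by the assumed bin width.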

We further propose the use of anchor-specific STA-targets and obtain them from max-STAs and min-STAs adjacent to each anchor in the raga. We found that treating STAs as being in the state of either an anchor or a transient is beneficial in quantizing to the anchor-specific targets. We propose state-based transcription (SBT) using maximum likelihood sequence estimation. The pitch value of each CPN or STA is quantized to the target corresponding to its state in the estimated sequence. These quantized pitch values in semitones, and the timing information of CPNs and STAs, constitute the descriptive transcription. We use cosine interpolation to generate a pitch curve from the descriptive transcription and feed this pitch curve to a five-harmonic synthesizer. In a subjective listening test, expert ratings of synthesized samples show that the descriptive transcription captures gamakas.
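The interpolation and synthesis step can be sketched as follows. This is only an illustration under stated assumptions: the transcription is represented as hypothetical `(time_seconds, pitch_semitones)` breakpoints (a CPN contributes two breakpoints at the same pitch, an STA contributes one), the tonic frequency and sample rate are arbitrary, and the 1/k harmonic amplitudes are an assumption, since the abstract does not specify them.

```python
import math


def cosine_interp_curve(breakpoints, rate=8000):
    """Build a per-sample pitch curve (semitones) by cosine interpolation."""
    curve = []
    for (t0, p0), (t1, p1) in zip(breakpoints, breakpoints[1:]):
        n = max(1, round((t1 - t0) * rate))
        for i in range(n):
            u = i / n
            w = (1 - math.cos(math.pi * u)) / 2  # eases from 0 towards 1
            curve.append(p0 + (p1 - p0) * w)
    curve.append(breakpoints[-1][1])
    return curve


def synthesize(curve_semitones, tonic_hz=220.0, rate=8000, n_harmonics=5):
    """Additive synthesis with n_harmonics partials and 1/k amplitudes.

    Phase is accumulated sample by sample so that the pitch can vary
    continuously without phase discontinuities.
    """
    norm = sum(1 / k for k in range(1, n_harmonics + 1))
    phase = 0.0
    samples = []
    for p in curve_semitones:
        freq = tonic_hz * 2 ** (p / 12)
        phase += 2 * math.pi * freq / rate
        value = sum(math.sin(k * phase) / k for k in range(1, n_harmonics + 1))
        samples.append(value / norm)
    return samples


# A held note, an excursion up to a max-STA two semitones higher, and a return.
breakpoints = [(0.0, 0.0), (0.2, 0.0), (0.3, 2.0), (0.4, 0.0), (0.6, 0.0)]
curve = cosine_interp_curve(breakpoints)
samples = synthesize(curve)
```

Cosine interpolation has zero slope at every breakpoint, so each STA remains a true maximum or minimum of the generated curve and each CPN joins its neighbouring transients smoothly.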

Synthesized audio samples will be played during the talk to support the technical points above. A sample of a re-synthesized audio track mixed with the original source is available at:
For discernibility, the two sources are occasionally played individually.