Interfacing Gestural Data from Instrumentalists

Martin Jaroszewicz – [email protected]
  University of California, Riverside

This paper presents preliminary work on a system for capturing gestures from music instrumentalists. A saxophone player performed an étude from the standard repertoire in two different manners: first, constraining her physical gestures to those needed to produce sound and execute the written music; second, exaggerating the gestures to include expressions of emotion. The author used a “non-invasive” approach to capture the performance gesture, using wearable IMU devices with sensor fusion. Using open source 3D creation software, the author extracted motion paths from the data generated by the performer. A 3D Kernel Density Estimation (KDE) algorithm was implemented to visualize the density of the trajectories of the head, left elbow and left hand. The paper highlights how analyzing the gestural space of a work of music can inform composition strategies when interfacing the machine and the performer in real-time electroacoustic music.


Research shows that body movement expresses emotion (Clynes, 1978) and that visual information can be more informative in the perceiver’s understanding of the performer’s expressive intentions (Davidson, 1993). For the performer in a concert setting, gestures produce sound, communicate with other musicians, help keep the pulse, and express emotions. Recognizing and analyzing the gestural space of a music performance can help the composer of interactive electroacoustic music use the performance gesture to their advantage.

Traditionally, composers have relied on hardware interfaces that transmit serial data, such as MIDI keyboards or controllers, optical motion tracking devices, and other sensors that can be interfaced using microcontrollers such as the Arduino or the Raspberry Pi. On the one hand, the MIDI keyboard and its variants act as an awkward extension of the acoustic instrument. These controllers were not designed for such a task and provide a “musically invasive” way of connecting the performer and the machine, distracting the audience from the performance and the music. On the other hand, sensors that use accelerometers and gyroscopes are more interesting because they can take advantage of the performer’s gesture to trigger an event or change the parameters of a software synthesizer in real time. The author suggests a non-invasive approach to interfacing the performer and the machine that does not use a visible interface and does not force the performer to make extraneous gestures. I propose the use of inertial measurement units (IMUs) with magnetometers because they can be combined with sensor fusion and Kalman filtering to track points in 3D space. Accelerometers have been used by interactive artists to map acceleration and orientation to synthesis techniques in the most basic and direct forms. Devices such as smartphones, the Wii controller and other gaming apparatus can be easily interfaced using visual software such as Max/MSP, Pure Data and TouchDesigner for real-time performance and interactivity.
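The smoothing role of Kalman filtering mentioned above can be illustrated with a minimal one-dimensional sketch. This is not the filter used in the author’s app; the process and measurement variances (q, r) are illustrative values, not parameters of any particular sensor:

```python
def kalman_1d(measurements, q=1e-3, r=0.1):
    """Smooth a sequence of noisy scalar readings (e.g. one
    accelerometer axis) with a constant-state Kalman filter."""
    x = measurements[0]  # state estimate
    p = 1.0              # estimate covariance
    out = []
    for z in measurements:
        p += q                 # predict: uncertainty grows by q
        k = p / (p + r)        # Kalman gain
        x += k * (z - x)       # update: blend estimate and measurement
        p *= (1.0 - k)
        out.append(x)
    return out
```

With a small q relative to r, the filter trusts its running estimate and damps jitter; raising q makes it track fast gestures more closely, which is the sensitivity trade-off any such filter has to negotiate.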

Understanding the gestural space of the work can help the composer synchronize the electronics in a more integrated and coherent manner. A mindful approach to interfacing the performer and the machine is non-invasive both for the performer, who wears or reaches for no device other than the instrument, and for the audience, who is not distracted by gestures that neither express emotion nor produce sound. The channels of communication between the performer and the machine have been explored by Cadoz (1988), and applications for new digital instruments beyond the keyboard by Wanderley (2006), who has written extensively about gestural control and sound synthesis (Wanderley et al., 2014; 2012; 2006; 2008; 2011; 2017). The inspiration for this research was a reading of Derrida’s Typewriter Ribbon, in which the French philosopher poses the question of the event and the machine as indissociable concepts (Derrida, 2002).

Performance Gestures

Instrumental communication with the computer creates a triangular relationship between machine, performer and environment. Gestures have been classified by Cadoz, Wanderley and Delalande into the following categories:

  • Gestures where no physical contact with the instrument is involved. These have been called free, semiotic or naked gestures.
  • Gestures where physical contact with the instrument takes place. These have been called ergotic, haptic or interactive.
  • Effective or sound-producing, whose direct action consequence is sound generation.
  • Accompanist gestures or sound-facilitating, which support sound-producing motor movements but themselves do not generate sound directly.
  • Figurative or ancillary, non-technical or concurrent movements, which are not involved with sound production or sound facilitation.

For the purpose of this research, I propose a classification that is concerned with the musical aspect of the performance. This classification might be called the gestural space of the performer:

  • Gestures that produce sound.
  • Gestures that convey rhythmic information; e.g. the performer using her body to count and keep the pulse.
  • Gestures that communicate information to other performers; e.g. a hand signal or a nod of the head.
  • Gestures that interface the performer and the machine; i.e. gestures that can be read with sensors such as a hardware pad or an accelerometer.
  • Gestures that express emotions.

Research conducted by Wanderley (2002) showed that the movements of clarinetists are replicable across performances and are not essential to the physical execution of a piece. That research was conducted with optical motion tracking devices and claimed that performers were unaware of their movements. Players were also instructed on the manner of playing by limiting and exaggerating their movements. Using Functional Data Analysis, the research observed that the importance of the visual component to the typical tension judgment depends on the loudness and the note density of the sound, and that the visuals carry much of the same structural information as the sound. In this paper, I am concerned with analyzing the gestural space in order to interface the machine and the performer while maintaining the relationship between the visual and auditory information that conveys the structure of the work, without disturbing the listener’s level of engagement. From a compositional point of view, coherence and integrity provide a sense of wholeness (Reynolds, McAdams, 2002). This sense is gradually unwrapped by the listener during the performance of the work. A seamless integration of the performance gesture and the music should therefore not be disturbed by the extraneous movements a performer may make while trying to interact ad hoc with devices, which would negatively affect the listener’s exploration of the work.

Capturing gestures from performers

In order to capture performance gestures of instrumentalists, the author used six Notch (Notch, 2017) motion capture sensors placed on the upper body of a professional saxophonist. The repertoire was an étude by Sigfrid Karg-Elert (III - Consolation) of approximately 2.5 minutes. I chose this étude for its duration, its rhythmic and tonal simplicity, and the performer’s acquaintance with the work. The performer was asked to play in two different manners: first, limiting her movements to those needed to produce sound, and second, exaggerating the performance. According to the performer, suppressing expression by limiting the gestural space was more difficult than exaggerating the performance. The idea was to record the data for the two manners of playing and compare the variations of the gestural space, extracting information that could be valuable for developing strategies for interfacing the machine and the performer from a compositional point of view.

Figure 1: Professional saxophone player Kelsey Broersma showing sensor placement.

The system consisted of the following equipment:

  • 6x Notch Motion capture sensors with adjustable straps.
  • 1 Samsung Galaxy Tab A tablet running Android version 6.0.1.
  • 1 Ricoh Theta S 360° camera.
  • Sound recording equipment.

The sensors were set to capture data at a rate of 40 Hz. For the saxophone performance, sensors were placed on the head, the left upper and lower arm, and the chest, as shown in figure 1. Multiple recordings had to be made because the sensors are sensitive to ferrous metals. The saxophone (brass) is non-ferrous and did not cause any interference, but the metal neck strap caused the readings to be off on the y axis. The data is available as hierarchical skeleton data in the Biovision (.BVH) format, or as joint angles in CSV format.
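The CSV angle export can be consumed directly for offline analysis. The sketch below assumes a hypothetical column layout with one row per frame and three Euler angles per joint; the actual column names in the Notch export may differ:

```python
import csv
import io

def read_angles(csv_text, joint):
    """Collect the (x, y, z) Euler angles of one joint across frames,
    assuming hypothetical columns named <joint>_x, <joint>_y, <joint>_z."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(float(row[joint + "_x"]),
             float(row[joint + "_y"]),
             float(row[joint + "_z"])) for row in reader]

# Hypothetical two-frame export with a single "head" joint.
sample = """frame,head_x,head_y,head_z
0,1.5,-2.0,0.25
1,1.6,-2.1,0.30
"""
angles = read_angles(sample, "head")
```

At 40 Hz, a 2.5-minute take yields roughly 6,000 rows per joint, which is small enough to hold in memory for the analyses described below.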

Visualizing gestures

Motion capture data was imported into Blender (Blender Foundation, 2017). BVH provides a specification for an initial pose of a human skeleton and for its subsequent poses at each frame. The root starts at the hip, and the nodes corresponding to each joint are hierarchically attached, as shown in figure 2a. I created points on each area of interest and attached them to their respective joints in the armature; in the case of the saxophone player: the head, left elbow and left hand. This procedure made the created points children of the joints in the armature, so that their paths could be traced as a way to visualize movement in time. See figure 2b.

Figure 2a: Blender’s modular interface showing a view of the skeleton’s hierarchy.

Figure 2b: Blender’s modular interface showing motion paths generated from the data captured from a saxophone player.

Blender provides access to scripting in Python as a way to extend its functionality in most areas. Scripts interact with Blender through its integrated API (Application Programming Interface) (Blender Foundation, 2017b). For the purpose of this research, a simple script was written to export the path of each joint as points in 3D space, one point per frame at 24 FPS. The script reads the x, y, z coordinates of a tracked vertex at each frame and writes them to a text file:

import bpy
from bpy import context

# Export the world-space position of the tracked point (the active
# object's single vertex) for every frame of the captured motion.
obj = context.active_object

file = open("06_elbow.txt", "w")
for frame in range(300, 6400):
    context.scene.frame_set(frame)          # advance the animation
    v =[0]                    # the tracked vertex
    co_final = obj.matrix_world * v.co      # local to world coordinates
    file.write("%f %f %f\n" % (co_final.x, co_final.y, co_final.z))
file.close()

In this particular experiment, the movement of the head suggests an oval shape with a maximum diameter of approximately 30cm as shown in figure 3.

Figure 3: Oval shape visualized by the trace of trajectories generated with data from head.

Kernel Density Estimation

Kernel Density Estimation (KDE) is a mathematical method that computes density by convolving a kernel K with data points. It is suitable for giving an overview of large amounts of data (Hurter, 2016). It involves placing a symmetrical surface over each point, evaluating the distance from the point to a reference location with a mathematical function, and summing the values of all the surfaces at that reference location. This procedure is repeated for all reference locations:

Math: $$g(x_j)= \sum_i W_i \, I_i \, \frac{1}{2\pi h^2} \, e^{-\frac{d_{ij}^2}{2h^2}}$$ (1)

where Math: $d_{ij}$ is the distance between point Math: $i$ and reference location Math: $j$, Math: $h$ is the standard deviation of the normal distribution (the bandwidth), Math: $W_i$ is a weight at the point location and Math: $I_i$ is an intensity at the point location. The function extends to infinity in all directions. The algorithm used by the author is a modified form of a C++ command-line program (Alberti, 2017) that outputs a file that can be visualized as a 3D density contour plot in Paraview (Kitware, 2017). Figure 4 shows the density map generated by the movement of the head of the saxophonist while playing the étude. Red shows normal movement, and blue the areas reached by exaggerated gestures.
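Equation (1) is straightforward to evaluate directly. The sketch below, in Python/NumPy rather than the C++ program the author used, keeps the same normalization as equation (1) (a fully normalized 3D Gaussian would instead use $(2\pi h^2)^{-3/2}$):

```python
import numpy as np

def kde3d(refs, points, h, W=None, I=None):
    """Evaluate the weighted Gaussian kernel density of equation (1)
    at each reference location in refs (n x 3), given data points
    (m x 3), bandwidth h, and optional per-point weights/intensities."""
    m = len(points)
    W = np.ones(m) if W is None else np.asarray(W)
    I = np.ones(m) if I is None else np.asarray(I)
    # Squared distances d_ij between every reference and every point.
    d2 = ((refs[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    kernel = np.exp(-d2 / (2 * h * h)) / (2 * np.pi * h * h)
    return (W * I * kernel).sum(axis=1)
```

Evaluating this over a regular 3D grid of reference locations produces the scalar volume from which the contour plot in figure 4 is drawn.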

Figure 4: KDE density plot of the trace of the head of the performer.

Triggering events

For triggering score events in real time based on performance gestures, IMUs are sufficient when the gestural space is not a concern. Using MbientLab’s MetaWear C, the author developed an iOS app in the Swift programming language that captures the acceleration and rotation of the performer’s arm and sends them to a computer using the OSC protocol. The data was visualized in Pure Data in order to experiment with different Kalman filtering techniques. The system is inexpensive and portable, and has the advantage of being non-invasive: sensors can be worn like a watch or attached to an instrument with a 3D-printed accessory. Several sensors can connect to the same app, making it suitable for small laptop orchestras. The Kalman filtering in the app can be adapted to accommodate different levels of sensitivity. The sensors connect to the iOS device over Bluetooth and have a range of up to 10 meters. They are powered by a CR2032 coin-cell battery, a good option for portability and prolonged use.
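The OSC messages themselves are simple to form. As an illustration (in Python rather than the app’s Swift), here is a minimal encoder for an OSC message carrying three float arguments, such as an accelerometer triple; the address /accel is a made-up example, not the app’s actual address space:

```python
import struct

def osc_message(address, *args):
    """Encode a minimal OSC message whose arguments are all float32.
    Per the OSC 1.0 spec, strings are null-terminated and padded to a
    4-byte boundary, and numbers are big-endian."""
    def pad(b):
        return b + b"\x00" * (4 - len(b) % 4)
    data = pad(address.encode("ascii"))            # address pattern
    data += pad(("," + "f" * len(args)).encode())  # type tag string
    for value in args:
        data += struct.pack(">f", value)           # big-endian float32
    return data
```

A packet like this can be sent over UDP to Pure Data or Max/MSP, which is how the arm data reaches the computer for filtering and mapping.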

Figure 5a: Wearable IMU and coin-cell battery.

Figure 5b: iOS app that sends OSC data in real time to the computer.

Future Work

Future work includes porting the 3D KDE algorithm to Python and creating an environment in Blender for visualizing density plots in an OpenGL subview of the viewport, with added waveform and video views. A portable system that lets the performer and the computer visualize and interact through tablet devices or phones using OSC is being developed. The author has been testing several wireless wearable sensors (MbientLab, 2017) with different Kalman filters and sensor fusion techniques. Sensors with built-in sensor fusion are available at the time of writing, and mounts for these sensors can easily be created and adapted to fit different instruments using 3D printers.


The aim of this work was to create a system for exploring the gestural space of the performer in order to analyze and take advantage of the performance gesture, triggering score events in real time using non-invasive wireless portable devices. The development of non-invasive systems might facilitate the discovery of gesture patterns across pieces and instruments, and the identification of technical difficulties in a given work or of performance habits that can lead to injury. Although the author is concerned with the use of the gestural space for developing music composition strategies, wearable devices capable of tracking the performance gesture in real time also find applications in laptop performance, digital instruments, the development of hyperinstruments, and interactive works involving dancers. The author explored a non-invasive way of interfacing the machine and the performer and began the development of an integrated analysis and creative environment for interactivity using open source 3D software, wearable sensors and iOS devices. From a theoretical perspective, a classification of the performance gesture, which might be called the gestural space of the performer, was suggested in order to include the space the performer might use to interact with the computer in a non-invasive manner.


M. Clynes, Sentics: the touch of the emotions. New York, N.Y.: Anchor Press/Doubleday, 1978.

J. W. Davidson, “Visual perception of performance manner in the movements of solo musicians,” Psychology of Music, vol. 21, no. 2, pp. 103–113, 1993.

C. Cadoz, “Instrumental gestures and musical composition,” in Proceedings of the 1988 International Computer Music Conference, ICMC 1988, Cologne, Germany, September 20-25, 1988.

E. R. Miranda and M. M. Wanderley, New digital musical instruments. Middleton, WI: A-R Ed., 2006.

R. A. Seger, M. M. Wanderley, and A. L. Koerich, “Automatic detection of musician’s ancillary gestures based on video analysis,” Expert Systems With Applications, vol. 41, no. 4(2), pp. 2098–2106, 2014.

B. Caramiaux, M. M. Wanderley, and F. Bevilacqua, “Segmenting and Parsing Instrumentalists’ Gestures,” Journal of New Music Research, vol. 41, no. 1, pp. 13–29, 2012.

V. Verfaille and M. M. Wanderley, “Mapping strategies for sound synthesis, digital audio effects, and sonification of performer gestures,” The Journal of the Acoustical Society of America, vol. 119, no. 5, p. 3439, 2006.

V. Verfaille, M. M. Wanderley, and P. Depalle, “Indirect acquisition of flutist gestures: a case study of harmonic note fingerings,” The Journal of the Acoustical Society of America, vol. 123, no. 5, p. 3796, 2008.

A. Bouënard, M. M. Wanderley, S. Gibet, and F. Marandola, “Virtual Gesture Control and Synthesis of Music Performances: Qualitative Evaluation of Synthesized Timpani Exercises,” Computer Music Journal, vol. 35, no. 3, pp. 57–72, 2011.

M. Schumacher and M. M. Wanderley, “Integrating gesture data in computer-aided composition: A framework for representation, processing and mapping,” Journal of New Music Research, vol. 46, no. 1, pp. 87–101, 2017.

J. Derrida, Without alibi. Stanford, CA: Stanford University, 2002.

M. M. Wanderley, “Quantitative analysis of non-obvious performer gestures,” Lecture notes in computer science, no. 2298, pp. 241–253, 2002.

R. Reynolds and S. McAdams, Form and method: composing music. New York, N.Y.: Routledge, 2002.

Notch, “Notch: Smart motion capture for mobile devices.” Accessed: 2017-03-23.

B. Foundation, “Blender.” Accessed: 2017-03-23.

B. Foundation, “Blender manual.” Accessed: 2017-03-23.

C. Hurter, Image-based visualization: interactive multidimensional data exploration. San Rafael, CA: Morgan & Claypool, 2016.

M. Alberti, “3d point density calculation: a c++ program.” Accessed: 2017-03-23.

Kitware, “Paraview.” Accessed: 2017-03-23.

MbientLab, “Mbientlab.” Accessed: 2017-03-23.

S. Dahl and A. Friberg, “Visual perception of expressiveness in musicians’ body movement,” Music perception, vol. 24, no. 5, pp. 433–454, 2007.

W. D. Hairston, K. W. Whitaker, A. J. Ries, J. M. Vettel, J. C. Bradford, S. E. Kerick, and K. McDowell, “Usability of four commercially-oriented eeg systems,” Journal of Neural Engineering, vol. 11, no. 4, p. 046018, 2014.

A. B. Tsybakov, Introduction to nonparametric estimation. New York, N.Y.: Springer, 2010.

B. W. Vines, M. M. Wanderley, C. L. Krumhansl, R. L. Nuzzo, and D. J. Levitin, “Performance gestures of musicians: What structural and emotional information do they convey?,” Lecture Notes in Computer Science, no. 2915, pp. 468–478, 2004.


3 Combining the data of all sensors to obtain Euler angles or Quaternions.

4  Kalman filters smooth noisy data and provide estimates of parameters of interest.

8 Some of the most popular graphical tools for interfacing devices and mapping the data to sound synthesis or graphics.

9 These gestures might be interpreted by the listener as expression of emotions.

10  An étude is a piece written as a technical exercise for the performer.

11 Right hand coordinate system.

12 Biovision Motion Capture Studios is defunct. The hierarchical data format specification can be found here: