Edited by: Aaron Williamon, Royal College of Music/Imperial College London, UK
Reviewed by: Karen Wise, Guildhall School of Music and Drama, UK; Esther H. S. Mang, Hong Kong Baptist University, China
*Correspondence: Zacharias Vamvakousis
This article was submitted to Performance Science, a section of the journal Frontiers in Psychology
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
We present and evaluate the EyeHarp, a new gaze-controlled Digital Musical Instrument, which aims to enable people with severe motor disabilities to learn, perform, and compose music using only their gaze as a control mechanism. It consists of (1) a step-sequencer layer, which serves for constructing chords/arpeggios, and (2) a melody layer, for playing melodies and changing the chords/arpeggios. We conducted a pilot evaluation of the EyeHarp involving 39 participants with no disabilities, from both a performer and an audience perspective. In the first case, eight people with normal vision and no motor disability took part in a music-playing session in which both quantitative and qualitative data were collected. In the second case, 31 people qualitatively evaluated the EyeHarp in a concert setting consisting of two parts: a solo performance and an ensemble (EyeHarp, two guitars, and flute) performance. The obtained results indicate that, similarly to traditional musical instruments, the proposed digital musical instrument has a steep learning curve and allows expressive performances to be produced, from both the performer and audience perspectives.
Music performance and learning to play a musical instrument have been shown to provide several benefits for acquiring non-musical skills (Coffman,
The idea of implementing Adaptive Digital Musical Instruments (ADMIs) for people with motor disabilities is not new. Depending on the type of motor disability, various ADMIs have been proposed. Kirk et al. (
In more severe cases of motor disabilities, such as people with locked-in syndrome (LIS), none of the mentioned interfaces is appropriate. LIS is a condition in which a patient is conscious but not able to move or communicate verbally due to complete paralysis of nearly all voluntary muscles in the body except the muscles which control the eyes (Bauer et al.,
In eye-tracking-based (gaze-controlled) applications, gaze data may be used alone or in combination with other input methods, such as head-, limb-, or breath-controlled buttons. Blinking (closing both eyes) or winking (closing just one eye) may also be used as input. In these cases, the gaze coordinates are usually used for pointing, and the other input is used for triggering actions. When gaze is used as the only input, eye movements, which are often unintentional, must be interpreted carefully to avoid triggering unwanted actions. This is described as the “Midas Touch” problem. The most common gaze selection methods intended to handle the Midas Touch problem are: (i) the screen-button method introduced by Ohno (
In the screen-button method, each target is divided into a command-name area and a selection area. Selections are made only when a fixation is detected in the selection area. An extension of the screen-button method is the pEYE method introduced by Huckauf and Urbina (
In the dwell-time method, a selection is made when a fixation lasts longer than a given time period (typically about 1 s). The spatial accuracy of eye trackers is usually limited, not allowing the selection of small targets. The dwell-time selection method is therefore often combined with magnification methods in order to increase accuracy in gaze pointing and selection (e.g., Lankford,
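The following minimal C++ sketch illustrates how such a dwell-time trigger can be implemented (the names and the circular-target geometry are illustrative assumptions, not the EyeHarp source): a selection fires once consecutive gaze samples have stayed inside a target for the dwell period, and leaving the target resets the timer.

```cpp
// Minimal dwell-time selection sketch (illustrative names, not EyeHarp source).
#include <chrono>
#include <optional>

struct Target { float x, y, radius; int id; };   // a circular gaze target

class DwellSelector {
public:
    explicit DwellSelector(int dwellMs = 700) : dwellMs_(dwellMs) {}

    // Feed one gaze sample; returns the target id once the dwell completes.
    std::optional<int> update(float gx, float gy, const Target& t) {
        using clock = std::chrono::steady_clock;
        const float dx = gx - t.x, dy = gy - t.y;
        const bool inside = dx * dx + dy * dy <= t.radius * t.radius;
        if (!inside) { fixating_ = false; return std::nullopt; }  // timer resets
        if (!fixating_) { fixating_ = true; start_ = clock::now(); return std::nullopt; }
        const auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
            clock::now() - start_).count();
        if (elapsed >= dwellMs_) { fixating_ = false; return t.id; }  // select once
        return std::nullopt;
    }

private:
    int dwellMs_;                                 // 700 ms is the EyeHarp default
    bool fixating_ = false;
    std::chrono::steady_clock::time_point start_;
};
```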
An extensive review of eye-controlled music performance systems was recently made by Hornof (
In this study we propose an interface in which gaze is the only input and which allows interaction and expressiveness similar to those of traditional musical instruments. The EyeHarp, using the screen-button gaze selection method, allows the control of chords, arpeggios, melody, and loudness using only the gaze as input. Eight people with medium to advanced musical skills took part in an experimental session in which the usability of the EyeHarp was quantitatively and qualitatively studied from the perspective of the performer. Additionally, the EyeHarp was evaluated by 31 participants from the perspective of the audience in a concert setting which consisted of two parts: a solo performance and an ensemble performance (EyeHarp, two guitars, and flute).
The EyeHarp allows the user to control the pitch, timing, and dynamics of a melody, as well as chords and arpeggios, in a performance. The EyeHarp interface consists of two layers: the Step Sequencer layer and the Melody layer. Chords and arpeggios are constructed in the Step Sequencer layer; in the Melody layer they can be controlled and a melody can be played. The number of available note buttons can be adapted according to the accuracy of the eye tracker and the expertise of the performer. The user can switch between the two layers through a dwell-time-activated button.
The EyeHarp is implemented using the openFrameworks open-source C++ toolkit
The interface is diatonic and by default tuned to the C major scale. Nevertheless, through a configuration menu it can be tuned to any possible scale. Only the basic functionality of the EyeHarp interface will be described here. A more detailed overview of the more advanced features of the interface was presented by Vamvakousis and Ramirez (
Figure
In order to select a button of the step sequencer, dwell-time selection is applied. The default dwell-time value of the EyeHarp interface is 700 ms. The buttons are circular with a small focus point at the center. The focus point helps users concentrate their gaze on a point at the center of the target, thereby improving the accuracy of the tracking data (Kumar et al.,
A number of control buttons (e.g., for changing the meter or tempo, clearing the selected notes, switching between layers) are provided and may be selected using the dwell-time selection method (see Figure
The Step Sequencer layer is used for constructing arpeggios, whose harmony is controlled in the Melody layer. The note that corresponds to the bottom row of the EyeHarp's step sequencer is determined by the base note of the chord selected in the Melody layer. The notes corresponding to the other rows in the step sequencer are mapped to the consecutive notes of the scale. For example, if the EyeHarp is tuned to the C major scale and the selected chord in the Melody layer is the tonic (C major), the buttons of the first row correspond to the note
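The mapping just described can be sketched as follows (a hypothetical illustration; the function and constant names are ours, not taken from the EyeHarp code). Given the scale degree of the selected chord's root, each row is assigned the next consecutive note of the configured scale:

```cpp
// Hypothetical sketch of the row-to-pitch mapping (names are ours).
#include <array>
#include <vector>

// C major as semitone offsets within one octave: C D E F G A B.
constexpr std::array<int, 7> kMajorScale = {0, 2, 4, 5, 7, 9, 11};

// rootDegree: scale degree of the selected chord's root (0 = tonic).
// Returns one MIDI pitch per step-sequencer row, bottom row first.
std::vector<int> rowPitches(int numRows, int rootDegree, int baseMidi = 60) {
    std::vector<int> pitches;
    const int n = static_cast<int>(kMajorScale.size());
    for (int row = 0; row < numRows; ++row) {
        const int degree = rootDegree + row;   // consecutive scale notes
        pitches.push_back(baseMidi + 12 * (degree / n) + kMajorScale[degree % n]);
    }
    return pitches;
}
// With the tonic chord selected (rootDegree = 0) the rows yield C, D, E, ...;
// with the dominant (rootDegree = 4) they yield G, A, B, ...
```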
The Melody layer (Figure
If the set scale is C major, C4 (C in the 4th octave) is placed at 180°. The scale then ascends counterclockwise. By default the pie has 14 slices, but the number of slices can be adapted through the setup menu. If the setup button is pressed in the melody layer, a number of configuration buttons appear, as shown in Figure
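As an illustration of this layout, the sketch below maps a gaze point to a slice (note) index, assuming the slices span the full circle, slice 0 is centered at 180°, and the scale ascends counterclockwise; the geometry and names are our assumptions, not the actual implementation:

```cpp
// Illustrative gaze-point to pie-slice mapping (geometry assumed, names ours).
#include <cmath>

// Returns the slice index in [0, numSlices), or -1 in the neutral centre
// or outside the pie. Screen y grows downward, so it is negated to get
// the usual counterclockwise angle convention.
int sliceAt(float gx, float gy, float cx, float cy,
            float innerR, float outerR, int numSlices = 14) {
    const float dx = gx - cx, dy = -(gy - cy);
    const float r = std::sqrt(dx * dx + dy * dy);
    if (r < innerR || r > outerR) return -1;             // neutral area / off-pie
    float deg = std::atan2(dy, dx) * 180.0f / 3.14159265f;
    if (deg < 0.0f) deg += 360.0f;                       // now in [0, 360)
    const float sliceWidth = 360.0f / numSlices;
    float offset = deg - 180.0f + sliceWidth / 2.0f;     // slice 0 centred at 180°
    if (offset < 0.0f) offset += 360.0f;
    return static_cast<int>(offset / sliceWidth) % numSlices;
}
```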
If the “chords” button is active, the last six notes of the pie are replaced by six chords. These buttons control the harmony of the arpeggio constructed in the Step Sequencer layer, as explained in Section 2.1.1. In order to play a note or change the chord, the user can either look directly at the selection area of the note/chord or, in case there is a large distance on the screen between two consecutive notes, first focus on the command-name area before focusing on the selection area. This is expected to improve the spatial and temporal accuracy, as Fitts's law also applies to gaze interaction, as shown by Miniotas (
In order to release a note, the user has to look anywhere outside the pie. For that reason, some fixation points are placed outside the pie. When a fixation is detected in the selection area of a note, the note sounds and a button appears at the center of the pie. This allows the user to play the same note twice in a row: if a fixation is detected inside this button's area, the same note sounds again. If a fixation is detected elsewhere inside the inner (neutral) area, the “repeat” button disappears.
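This play/repeat/release behavior amounts to a small state machine, sketched below under our own naming; the region classification and the noteOn/noteOff calls are placeholders rather than the EyeHarp's actual API:

```cpp
// Sketch of the play/repeat/release logic (placeholder API, not the EyeHarp's).
#include <cstdio>

void noteOn(int midiPitch)  { std::printf("note on %d\n", midiPitch); }   // stub
void noteOff(int midiPitch) { std::printf("note off %d\n", midiPitch); }  // stub

enum class Region { OutsidePie, NoteSlice, RepeatButton, NeutralArea };

struct MelodyState {
    int  currentNote = -1;     // -1: no note sounding
    bool repeatShown = false;  // is the central "repeat" button visible?
};

// Called whenever a fixation is detected; `note` is valid for NoteSlice only.
void onFixation(MelodyState& s, Region where, int note = -1) {
    switch (where) {
    case Region::NoteSlice:            // a note's selection area: play it
        if (s.currentNote >= 0) noteOff(s.currentNote);  // monophonic assumption
        noteOn(note);
        s.currentNote = note;
        s.repeatShown = true;          // repeat button appears at the centre
        break;
    case Region::RepeatButton:         // re-trigger the same note
        if (s.repeatShown && s.currentNote >= 0) noteOn(s.currentNote);
        break;
    case Region::NeutralArea:          // inner area away from the button
        s.repeatShown = false;         // the repeat button disappears
        break;
    case Region::OutsidePie:           // looking outside the pie releases
        if (s.currentNote >= 0) noteOff(s.currentNote);
        s.currentNote = -1;
        s.repeatShown = false;
        break;
    }
}
```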
O'Modhrain (
Transparency describes the level to which a performer or spectator can understand the relationship between the input (gesture) and output (sound). According to Hunt et al. (
Reeves et al. (
A concert was organized at the concert hall of Universitat Pompeu Fabra. The performer had been practicing the EyeHarp for a period of 10 weeks, playing three times a week; every practice session lasted ~20 min. The concert consisted of two parts: in the first, the EyeHarp player performed a piece composed by him for solo EyeHarp; in the second, he performed along with two guitar players and a flute player in a jam session. One of the performer's eyes was shown at the center of the screen, and the coordinates of his gaze were visualized by a small cross. A recorded video of the performance is available online. The audience rated the performance on the following items:

- Cause comprehension: were the available input gestures clear? (1: not at all; 5: very clear)
- Effect comprehension: were the available control parameters clear? (1: not at all; 5: very clear)
- Mapping comprehension: was the connection between the input gestures and the control parameters clear? (1: not at all; 5: very clear)
- Intention comprehension: how well did the system allow the user to express his musical intentions? (1: not at all; 5: very well)
- Error comprehension: if there had been errors in the performance, would they have been noticeable? (1: not at all; 5: very noticeable)
- Enjoyment: how much did you enjoy the performance? (1: not at all; 5: a lot)
The performer-perspective evaluation was carried out with written informed consent from eight participants, in accordance with the Declaration of Helsinki. Procedures were positively evaluated by the Parc de Salut MAR - Clinical Research Ethics Committee, Barcelona, Spain, under reference number 2013/5459/I. Participants (7 male, 1 female) with a mean age of 34 years (SD 6.7) took part in a single-session quantitative evaluation task. All participants had some experience playing a musical instrument. The quantitative evaluation consisted of a set of tasks using both the Step Sequencer and Melody layers. Apart from one subject, no participant had previous experience with the EyeHarp DMI.
The EyeTribe low-cost commercial eye tracker was used for acquiring the raw gaze data. Participants were comfortably seated ~60 cm away from a 15.6-inch laptop screen placed at eye level. All participants calibrated the tracker with nine calibration points and 800 ms of sample and transition time, and all achieved a 5-star calibration quality in the EyeTribe calibration software (expected visual-angle accuracy = 0.5°). A set of M-Audio AV40 self-amplified speakers was connected to the laptop audio output. The ASIO4ALL low-latency driver was used, providing an audio output latency of 7 ms. The EyeHarp application sent MIDI messages through the loopBe1 virtual MIDI port to the Reaper Digital Audio Workstation (DAW)
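For illustration, sending such MIDI messages from C++ can be done with a library like RtMidi, as in the hedged sketch below (the use of RtMidi and the port index are our assumptions; the paper only states that MIDI was routed through loopBe1 to Reaper):

```cpp
// Hedged sketch: emitting MIDI with RtMidi (library choice and port index
// are assumptions; the paper only states MIDI went through loopBe1 to Reaper).
#include "RtMidi.h"
#include <vector>

int main() {
    RtMidiOut midiOut;
    if (midiOut.getPortCount() == 0) return 1;  // no MIDI output port found
    midiOut.openPort(0);                        // assumed: loopBe1 is port 0

    std::vector<unsigned char> noteOn  = {0x90, 60, 100};  // ch. 1 note-on, C4
    midiOut.sendMessage(&noteOn);
    std::vector<unsigned char> noteOff = {0x80, 60, 0};    // matching note-off
    midiOut.sendMessage(&noteOff);
    return 0;
}
```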
The Step Sequencer layer evaluation task consisted of constructing arpeggios with a varying number of buttons in the step-sequencer grid. All arpeggios were constructed three times. The first time, the gaze pointer was hidden and no magnification method was applied (basic method). The second time, the gaze pointer was shown along with additional focus points (gaze feedback method). The third time, the gaze pointer was hidden and the described magnification method was applied (magnification method). In all cases, when the gaze was detected inside a button, the fixation point of that button turned green. Figure
In a previous study, Vamvakousis and Ramirez ( examined the temporal performance of the interface. Notes tended to be played earlier (i.e., triggered in advance). Two diametrically distant buttons in the pEYE resulted in an average asynchrony of −46 ms, while two adjacent buttons resulted in −94 ms. The temporal accuracy of the participants improved with practice. The temporal variance was 10 times higher than that of input from a computer keyboard.
In the current evaluation, instead of examining the temporal performance of the interface, we examined its overall usability. Four different tasks of increasing difficulty were designed. Users practiced for about 2 min before recording three repetitions of each task. At the beginning of each task, an arpeggio was constructed in the Step Sequencer layer to serve as a metronome. Figure
After the quantitative evaluation session, participants filled in a questionnaire, responding (on a linear scale from 1 to 5) to the following questions:
- How much previous practice and training does the performer need for performing with the instrument, when compared to a traditional musical instrument? (1: no practice required; 5: extensive practice required)
- How much control does the performer have over the musical output? (1: restricted, equivalent to a DJ; 5: extensive musical control that allows expressive performance)
- How much real-time feedback (e.g., visual, auditory) does the user receive from the system? (1: low feedback; 5: high, multimodal feedback)
- How tiring is it to play music with the eyes when compared to the hands? (1: not tiring at all; 5: very tiring)
- Is it harder to play in tempo with the eyes when compared to the hands? (1: equally hard; 5: much harder)
- Which approach, the magnification lens or the fixation points, do you consider more user-friendly? (1: I prefer the fixation points; 5: I prefer the magnification lens)
All questions were explained verbally to the participants, who were free to ask questions; any doubts were clarified orally. For the first question, it was clarified that “performing with the instrument” means achieving some basic but rewarding interaction with the instrument. By the response “1: no practice required” we refer to the practice required to achieve a rewarding performance in a music game interface, such as Guitar Hero on the Xbox. By the response “5: extensive practice required,” we refer to the practice required to achieve a rewarding performance on a musical instrument that is considered difficult to learn, such as the violin. Similarly, regarding the second question, it was clarified that the response “5: extensive musical control that allows expressive performance” refers to the control offered by an instrument like the violin. In question 4, it was clarified that users should respond “1: not tiring at all” if they consider it as tiring as playing with the hands.
Figure
Figure
In all tasks, the experienced user performed about two to three times faster than the average across the novice users. The best average performance (selections per minute) in the 12 × 12 and 16 × 16 grid tasks was achieved with the gaze feedback method; in the 8 × 8 grid task, it was achieved with the basic method. The lowest standard deviation in all tasks was achieved with the magnification method.
Figure
In all tasks, the experienced user played around 20% more notes in tempo than the novice users, played fewer accidental notes, and made no pauses.
Figure
In the present study, an evaluation of the proposed digital musical instrument was conducted from both the audience and the performer perspectives. According to the audience's responses, the EyeHarp digital musical instrument offers a transparent correspondence between input gestures and the produced sound: participants' average ratings of their understanding of the cause (cause comprehension), the effect (effect comprehension), and the gesture-sound correspondence (mapping comprehension) were all greater than 3.5 out of 5 (see Figure
Regarding the results of the evaluation from the performer's perspective, the best average time per selection was achieved in the first task of the quantitative evaluation of the Step Sequencer layer (i.e., the 8 × 8 grid task). The resulting average time for the 12 × 12 grid was almost double that for the 8 × 8 grid task. This was expected, as small targets are harder to select. However, in the 16 × 16 grid task the average selection time was lower than the average for the 12 × 12 grid task. This can be explained by the fact that most of the notes in the 16 × 16 grid task were adjacent, which makes the visual search task easier.
The 8 × 8 grid arpeggio task can be compared to a typical dwell-time eye-typing task, with the notes replaced by characters. As seen in Figure
In the case of the 8 × 8 grid, the gaze feedback method produced the same results as the basic method, in which the only visual feedback to the user is the brightening of the focus point at the center of the attended button. This may be explained by the size of the 8 × 8 buttons: given their large size, the two methods made no difference. On the contrary, in the 12 × 12 and 16 × 16 grids, when the detected gaze coordinates were given as visual feedback along with additional focus points, the performance (number of selected buttons per minute) increased.
The experienced user participating in the study completed all the tasks of the step sequencer on average 2.8 times faster than the rest of the users. The difference is even greater in the case of the gaze feedback method. As concluded by Majaranta and Bulling (
The magnification method always performed worse than the gaze feedback method, and only in the case of the 12 × 12 grid were the obtained results better than those obtained with the basic selection method. However, the magnification method always showed the lowest standard deviation in the number of selections per minute. This might explain why, as shown in Figure
All in all, the evaluation of the step sequencer layer confirmed the results reported for similar gaze-controlled systems in which targets must be selected using the dwell-time selection method (Hansen et al.,
Figure
The number of omitted notes is higher in the tasks that require playing the same note consecutively (tasks 2 and 3). This is due to the behavior of the button responsible for note repetition: if a fixation is performed in the inner area of the pie but outside the “repeat note” button, the button disappears. In addition, due to noisy gaze-tracking data, the user may be focusing on the center of the repeat button while the initially detected gaze point falls outside the button area.
Although the tasks were designed with increasing difficulty, the average performance in the first task was similar to that in the last task. This may be due to a training effect that compensates for the different difficulty levels of the tasks. The last task is the most demanding, as it requires changing the chords along with the melody. A high number of accidental notes was observed during this task (as shown in Figure
The participants in the performer-perspective evaluation responded that the practice required to play the EyeHarp is comparable to the practice required to play a traditional musical instrument of average difficulty (3 out of 5 on average). A similar response was given on average to the question about the control the user has over the musical output (average value 3.1 out of 5), meaning that the control over the output is equivalent to that of a musical instrument offering average control over the musical output.
The real-time feedback was rated high by most performers (average 3.9 out of 5). Most performers agree that playing music with the eyes is more tiring than playing with the hands (average 3.6 out of 5). Playing in tempo with the eyes is considered harder than playing with the hands (3.2 out of 5). Summarizing the above responses, we could conclude that performing music with the eyes is more difficult than performing with traditional means. Nevertheless, learning the EyeHarp gaze-controlled musical instrument would not be harder than learning a traditional musical instrument.
The performer-perspective evaluation was conducted with people with experience in playing musical instruments and no disabilities. In order to evaluate the EyeHarp in a more realistic setting, it would need to be tested with LIS patients. This, we believe, should be done in the future, and we have started looking for possible participants.
In summary, we have presented and evaluated the EyeHarp, a new gaze-controlled digital musical instrument. The system was evaluated from both the performer and the audience perspective. The obtained results indicate that, similarly to traditional musical instruments, the proposed digital musical instrument allows expressive performances to be produced, from both the performer and audience perspectives. The participants in the performer-perspective evaluation responded that the practice required to master the EyeHarp DMI is similar to that required to master a traditional musical instrument of average difficulty. The steep learning curve of the instrument is also reflected in the quantitative data, when comparing the performance of the experienced user with that of the novice users.
The cost of eye-tracking technology decreases every year. Over the last 5 years, the cost of commercial eye trackers has been reduced by more than a factor of 10, and eye tracking is slowly being incorporated into commonplace laptops, tablets, and mobile phones. Such devices would allow many users, including users with motor disabilities, to have access to gaze-controlled applications, including the EyeHarp DMI.
The pEYE interface in the Melody layer provides a solution to the Midas Touch problem, making it possible to play melodies in tempo when the user's gaze is the only input. If the physical abilities of the user allow it, other selection techniques, such as blinking, pressing physical buttons, or blowing, could be considered. If such selection methods were utilized, the user would be able to freely search the screen visually without triggering any undesired notes. This would allow increasing the number of available notes on the screen, as the central (neutral) area of the Melody layer would not be necessary. As future work, it would be interesting to compare the performance (in terms of overall usability, temporal accuracy, and speed) of such an interface with the current version of the EyeHarp. The advantage of the screen-button selection method is that just one action is required to play a note: looking at the selection area. This might allow playing faster than with an independent clicking method, which requires two actions (i.e., looking at the selection area and clicking). On the other hand, an independent clicking method might allow placing more notes on the screen and might provide better temporal accuracy.
Probably the main target group of the proposed DMI is that of people diagnosed with Amyotrophic Lateral Sclerosis (ALS). ALS is a progressive neurodegenerative disease that affects nerve cells in the brain and the spinal cord. Individuals affected by the disorder may ultimately lose the ability to initiate and control all voluntary movements. Nevertheless, muscles responsible for eye movement are usually spared until the final stages of the disorder (Layer,
ZV developed the software and technology presented in the paper and analyzed the experimental data. ZV and RR together designed and conducted the experiments and wrote the manuscript.
This project has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement No. 688269, as well as from the Spanish TIN project TIMUL under grant agreement TIN2013-48152-C2-2-R.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
9. Source code and binaries available at
14. Available online at