In telecommunications, a speech signal passes through a chain of processing units before it reaches the ear of the listener. Typical elements of this transmission chain that affect the signal include:
- microphone and loudspeaker characteristics
- speech enhancement / noise reduction algorithms
- speech codecs
- transmission network (including the physical channel).
Due to this process, the received signal is not identical to the one originally transmitted. While some of the changes might not actually be audible, others will cause a significant deterioration of speech quality.
Whether a speech sample is of excellent or poor quality is a subjective judgment: it depends on the listener and on the internal reference against which he or she judges the sample. To assess quality, listening experiments may be performed in which test subjects rate the quality of a speech sample on a scale from 1 (bad) to 5 (excellent). The mean of these ratings forms the so-called Mean Opinion Score (MOS), which is used as the general numerical indicator of speech quality.
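The MOS calculation described above can be sketched in a few lines of Python. This is a minimal illustration only; the function name and the example ratings are hypothetical and not taken from any standard or listening test.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Return the MOS of one speech sample from ratings on the 1-5 scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ratings must be integers from 1 (bad) to 5 (excellent)")
    return mean(ratings)


# Hypothetical ratings given by eight test subjects to the same sample.
ratings = [4, 3, 5, 4, 4, 2, 4, 3]
mos = mean_opinion_score(ratings)
print(f"MOS = {mos}")
```

In a real experiment, the MOS is usually reported together with the number of ratings behind it, since a mean over few listeners is less reliable.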
Conducting listening experiments is, however, impractical when cost and time efficiency matter. Therefore, algorithms such as ITU-T P.862 (PESQ) and P.863 (POLQA) have been developed which estimate the outcome of listening experiments. The score used to indicate quality is then not the one given by test subjects (subjective MOS) but the one calculated by the estimator (objective MOS), on the assumption that the two scores are highly correlated.
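One simple way to check the assumed agreement between subjective and objective MOS is the Pearson correlation coefficient over a set of test samples. The sketch below assumes that both scores are available per sample; the numbers are invented for illustration, and the helper name is hypothetical.

```python
from math import sqrt


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 entries")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Invented per-sample scores: listening test results vs. estimator output.
subjective_mos = [1.8, 2.5, 3.1, 3.9, 4.4]
objective_mos = [2.0, 2.4, 3.3, 3.7, 4.5]
print(f"correlation = {pearson_correlation(subjective_mos, objective_mos):.3f}")
```

A coefficient close to 1 indicates that the estimator ranks and spaces the samples similarly to the listeners; it does not by itself show that the absolute values match.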
While the MOS can already be reliably estimated in most cases, these estimators do not deliver enough information about the causes of sub-optimal quality. One approach being investigated interprets quality as a point in a vector space whose dimensions each score a specific degradation characteristic. The set of dimensions currently investigated consists of Noisiness, Coloration, Discontinuity and Loudness. Since the dimensions are assumed to be independent of each other (orthogonal), they can be scored individually in listening experiments, resulting in a set of dimension-based MOS values. An associated estimation algorithm would then aim to calculate the dimension-based MOS objectively. Such an algorithm is a current work item of the ITU-T under the title Perceptual Approaches for Multi-Dimensional Analysis (P.AMD).
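To illustrate the idea of quality as a point in a four-dimensional space, the following sketch averages the ratings per dimension and returns them as one tuple. It assumes, purely for illustration, that each dimension is rated on the same 1 to 5 scale as the overall MOS; the ratings are invented and the function name is hypothetical.

```python
from statistics import mean

# The four dimensions named in the text, in a fixed order.
DIMENSIONS = ("Noisiness", "Coloration", "Discontinuity", "Loudness")


def dimension_mos(ratings_by_dimension):
    """Return the dimension-based MOS values as a tuple ordered like DIMENSIONS."""
    missing = [d for d in DIMENSIONS if not ratings_by_dimension.get(d)]
    if missing:
        raise ValueError(f"no ratings for: {', '.join(missing)}")
    return tuple(mean(ratings_by_dimension[d]) for d in DIMENSIONS)


# Invented ratings from four listeners for one speech sample
# (assumed scale: 1 = strongly degraded, 5 = not degraded).
ratings = {
    "Noisiness": [2, 3, 2, 2],
    "Coloration": [4, 4, 5, 4],
    "Discontinuity": [5, 4, 5, 5],
    "Loudness": [4, 3, 4, 4],
}
point = dimension_mos(ratings)
for name, score in zip(DIMENSIONS, point):
    print(f"{name}: {score}")
```

In this invented example the low Noisiness score stands out, which is the kind of diagnostic hint a single overall MOS cannot give.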
Another work item of the ITU-T is the Technical Cause Analysis (P.TCA). Here, the numerical scoring by MOS is abandoned in favor of an exhaustive list of audible signal degradations. In listening experiments, the listener chooses the most prominent degradation from this list. Since the perceptual categorization of speech impairments by naive listeners is not reliable, expert listeners (with a speech processing background) are required for these experiments. Their annotations may then be used to develop the corresponding estimator, which could assign each degradation category either an estimate of its relative annotation frequency or a hard decision value (0 or 1) signaling whether the category is a probable cause of the observed sub-optimal speech quality. This estimator is part of our research here in the DSS team.
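The two possible output formats can be illustrated on the annotation side. The sketch below derives relative annotation frequencies from expert choices for one speech sample and turns them into hard decisions with a threshold. It shows the target values only, not the estimator itself, which would have to predict them from the signal. The category names and the threshold of 0.25 are hypothetical and are not taken from P.TCA.

```python
from collections import Counter

# Hypothetical degradation categories; the actual P.TCA list is not given here.
CATEGORIES = ["noise", "clipping", "interruptions", "muffled"]


def annotation_frequencies(annotations, categories=CATEGORIES):
    """Relative frequency with which each category was chosen by the experts."""
    if not annotations:
        raise ValueError("at least one annotation is required")
    counts = Counter(annotations)
    unknown = set(counts) - set(categories)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return {c: counts[c] / len(annotations) for c in categories}


def hard_decisions(frequencies, threshold=0.25):
    """Map each frequency to 1 (probable cause) or 0, using an assumed threshold."""
    return {c: int(f >= threshold) for c, f in frequencies.items()}


# Each entry is the most prominent degradation chosen by one expert listener.
annotations = ["noise", "noise", "interruptions", "noise",
               "muffled", "interruptions", "noise", "noise"]
freqs = annotation_frequencies(annotations)
print(freqs)
print(hard_decisions(freqs))
```

The frequency form keeps the information about how strongly the experts agreed, while the hard decision is easier to act on but depends on the chosen threshold.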
Although the above work item is titled Technical Cause Analysis, the annotations and, hence, the estimations are still purely perception-based. For telecommunication providers, however, it is of special interest to find the "technical truth", meaning the unit in the processing chain that is responsible for the sub-optimal quality due to a malfunction, an unsuitable parametrization or some other internal reason. Therefore, our team decided to also focus on what we call Root Cause Analysis. Here, algorithms are to be developed which return, for each processing unit or network error type, a likelihood value for its involvement in the observed quality degradation.