Masterarbeit/chapter_04.tex

\section{Hardware implementation and optimization of the ANR Algorithm on a low-power system}
This section can be considered as the heart of this thesis. The first subchapter describes the hardware, on which the \ac{ANR} algorithm is implemented, including its environment, which serves as a link to the \ac{CI} system itself. The following subchapter continues with the basic implementation of the \ac{ANR} algorithm on the hardware itself and shall provide the reader with a basic understanding of its challenges, possibilities and limitations. This basic implementation is then low-level simulated with some of the previuous use cases to get some idea of the general performance.\\
The last subchapter picks the final optimizations of the \ac{ANR} algorithm itself as a central theme, especially with respect to the capabilities of a hybrid \ac{ANR} approach.
\subsection{Low-power system architecture and integration}
This thesis considers a low-power \ac{SOC} architecture that integrates a general-purpose \ac{ARM} core with a dedicated \ac{DSP} core. The system combines the flexibility of an \ac{ARM}-based control processor with the computational efficiency of a specialized \ac{DSP}, splitting general computing tasks from real-time signal processing workloads.
\subsubsection{ARM and DSP hardware architecture overview}
A 32-bit \ac{ARM} core serves as the primary control unit of the system. It is responsible for high-level application logic, system configuration, peripheral management as also scheduling and serves as a general-purpose processing unit. Due to its universal instruction set and extensive input/output interface, the \ac{ARM} core is well suited for handling general tasks and the interaction with the \ac{CI} system. Time-critical numerical processing is intentionally offloaded to the \ac{DSP} core in order to reduce computational load and power consumption on the control processor.\\ \\
The \ac{DSP} used for the implementation features a 32-bit dual Harvard, dual \ac{MAC} architecture primarily designed for audio signal-processing applications in low-power embedded systems. It doesn´t feature a designated boot ROM, as it is initialized and managed by the \ac{ARM} core. The firmware executing the \ac{ANR} algorithm is developed and programmed in the C programming language. The used propretiery compiler is highly efficient and generates optimized assembler code, which is then translated in machine code to execute the \ac{ANR} algorithm on incoming samples.\\ \\
All memory instances and registers of the \ac{SOC} are directly addressable by the \ac{ARM} through the standard busses, also enabling a simplified control of the \ac{DSP} through a shared memory section. The memory consists mainly out of the two followng parts:
\begin{itemize}
    \item \textbf{Program Memory:} This memory section stores the executable code for both the \ac{ARM} core and the \ac{DSP} core. It contains the compiled instructions that define the behavior of the system, including the \ac{ANR} algorithm implemented on the \ac{DSP}.
    \item \textbf{Data Memory:} This memory section is used for storing runtime data and variables, required during the execution of the program. This also includes the memory section for input/output audio samples and intermediate processing results. The shared memory section between the \ac{ARM} core and the \ac{DSP} core is also part of the data memory, featuring a total size of 64 KB.
\end{itemize}
The data memory is supported by an integrated \ac{DMA} controller, which allows efficient data transfers between peripherals and memory without burdening the processing cores. This is particularly needed for transferring audio samples from the \ac{PCM} interface to the shared memory section for further processing by the \ac{DSP}, as well as transferring processed samples back to the \ac{PCM} interface for playback. The shared memory section is crucial for enabling efficient communication and data exchange between the two processing units, further described in the following subchapter.\\ \\
When the \ac{DSP} is not required to process audio data, it can be paused by pausing the clock provided to the \ac{DSP} core. When paused, the \ac{DSP} core enters a low-power state, still allowing the \ac{ARM} core to access its shared memory and wake up the \ac{DSP} core when needed. This mechanism helps to reduce overall power consumption, which is crucial for battery-operated devices like cochlear implants.\\ \\
The processing unit of the \ac{DSP} is equipped with load/store architecture, meaning that, initially all operands need to be moved from the memory to the registers, before any operation can be performed. After this task is performed, the execution units (\ac{ALU} and multiplier) can perform their operations on the data and write back the results into the registers. Finally, the results need to be explicitly moved back to the memory.\\ \\
Processor-wise, the \ac{DSP} includes a three stage pipeline consisting of fetch, decode, and execute stages, allowing for overlapping instruction execution and improved throughput. The architecture is optimized for high cycle efficiency when executing computationally intensive signal-processing workloads. The featured dual Harvard, dual \ac{MAC} architecture (two separate \ac{ALU}s) enables the execution of two \ac{MAC} operations, two memory operations (load/store) and two pointer updates in a single processor cycle.
\subsubsection{Inter-core communication mechanisms}
In order to ensure a smooth, but power-efficient, operation together with the \ac{CI} system, a interrupt-driven communication between the \ac{ARM} core and the \ac{DSP} core is crucial. Center of communication between the the cores is the already mentioned shared memory region accessible by both processing units. This shared memory enables the exchange of data without the need for separate communication protocols or input/output interfaces (refer to Figure \ref{fig:fig_dsp_setup.jpg}). Synchronization between the cores is achieved using interrupt-based signaling: the \ac{ARM} core initiates processing requests by waking up the \ac{DSP} and triggering an interrupt which sets an action flag, while the \ac{DSP} notifies the \ac{ARM} core upon completion of a task also by changing an interrupt register (for simplicity reasons, this behaviour will be just called ``interrupts'' in the remaining thesis). This approach ensures efficient coordination while minimizing active waiting (polling) and therefore unnecessary power consumption.
\begin{figure}[H]
    \centering
    \includegraphics[width=1.0\linewidth]{Bilder/fig_dsp_setup.jpg}
    \caption{Simplified visualization of the interaction between the \ac{CI}-System, the \ac{ARM} core and the \ac{DSP} core, making use of the \ac{PCM} interface and shared memory for audio data exchange.}
    \label{fig:fig_dsp_setup.jpg}
\end{figure}
\noindent The \ac{ARM} Core receives the 16-bit audio data (the corrupted signal and the reference noise signal via two channels) from the CI system via a \ac{PCM} interface, which offers one 32-bit input and one 32-bit output register. An interrupt triggers the integrated \ac{DMA} controller when the input register is occupied, which transfers the audio data from the \ac{PCM} interface to the input buffer in a predefined memory location (now in two 16-bit samples again). Once completed, the \ac{DSP} core is requested to start processing the audio data. The \ac{DSP} core then reads the audio samples from the shared memory, processes them using the implemented \ac{ANR} algorithm, and writes the a 16-bit processed sample back to an output buffer, also located in the shared memory. Finally, the \ac{ARM} core is notified via an interrupt from the \ac{DSP} core, that the processing is complete - the \ac{DMA} controller then transfers the processed audio samples from the output buffer back to the \ac{PCM} interface for playback (refer to Figure \ref{fig:fig_dsp_comm.jpg}).\\ \\
\begin{figure}[H]
    \centering
    \includegraphics[width=0.9\linewidth]{Bilder/fig_dsp_comm.jpg}
    \caption{Simplified flowchart of the sample processing between the \ac{ARM} core and the \ac{DSP} core via interrupts and shared memory.}
    \label{fig:fig_dsp_comm.jpg}
\end{figure}

\subsection{Software architecture and execution flow}
\subsubsection{ARM–DSP communication and data exchange details}
In contrary, to the high-level simulation environment written in Python from the previous chapter, the implementation of the \ac{ANR} algorithm on the \ac{DSP} requires a low-level programming approach, as which takes into account the specific architecture and capabilities of the processor and its environment. This includes considerations such as memory management, data types, and optimization techniques specific to the \ac{DSP} architecture. The implementation is required to be done in the C programming language, which is a standard for embedded systems, as it allows low-level hardware implementation.\\ \\
The implementation of the \ac{ANR} algorithm on the \ac{DSP} follows the same overall structure as the high-level variant, but now the focus lies on memory management, interrupt-handling and communication between the two cores. The \ac{ARM} operates in a continious loop, structured into several states:
\begin{itemize}
    \item \textbf{Idle:} The \ac{ARM} core waits for an interrupt from the \ac{DMA} controller, indicating that new audio samples are available in the input buffer.
    \item \textbf{Work:} After receiving the interrupt, the \ac{ARM} core triggers an interrupt on the \ac{DSP} core to start processing the audio samples.
    \item \textbf{Wait:} After recieving the first interrupt, which signals that the \ac{DSP} startedprocessing the sample, the \ac{ARM} core waits for the second interrupt, indicating that the processing is complete.
    \item \textbf{Done/Idle:} Once the processing is complete, the \ac{ARM} core triggers the \ac{DMA} controller to transfer the processed audio samples from the output buffer back to the \ac{PCM} interface for playback. The \ac{ARM} core then returns to the idle state, waiting for the next batch of audio samples.
\end{itemize}
On the contrary, the \ac{DSP} core operates in an interrupt-driven manner:
\begin{itemize}
    \item \textbf{Idle:} The \ac{DSP} core remains in a halted state, waiting to be activated by the \ac{ARM} core to start processing.
    \item \textbf{Work:} Once activated, the \ac{DSP} core signals the \ac{ARM} core the beginning of its work cycle by raising an interrupt, reads the audio samples from the input buffer located in the shared memory, processes them using the implemented \ac{ANR} algorithm, and writes the processed samples back to the output buffer in the shared memory.
    \item \textbf{Done/Idle:} After completing the processing, the \ac{DSP} core triggers an interrupt on the \ac{ARM} core to notify that the processing is complete. The \ac{DSP} core then returns to its sleeping state, waiting for the next processing request.
\end{itemize}
\noindent The \ac{DMA} controller plays a crucial role in this architecture by handling the data transfers between the \ac{PCM} interface and the shared memory buffers. It operates independently of both the \ac{ARM} core and the \ac{DSP} core, allowing for efficient data movement without burdening the processing units.
\begin{figure}[H]
    \centering
    \includegraphics[width=1.0\linewidth]{Bilder/fig_dsp_dma.jpg}
    \caption{Detailed visualization of the \ac{DMA} operations between the PCM interface to the shared memory section. When the memory buffer occupied, an interrupt is triggerd, either to the \ac{DSP} core or to the \ac{ARM} core, depending if triggered during a Read- or Write-operation.}
    \label{fig:fig_dsp_dma.jpg}
\end{figure}
\noindent Figure \ref{fig:fig_dsp_dma.jpg} visualizes the concrete operation of the \ac{DMA} controller during the audio sample processing. The \ac{DMA} controller is configured to samplewise transfer the audio samples from the \ac{PCM} interface to the input buffer of the shared memory. When the input buffer is filled with one sample of both channels, an interrupt is triggered to the \ac{DSP} core, notifying it to start processing the available samples. After processing, the results are written into the output buffer in the shared memory. Once the output buffer is occupied, another interrupt is triggered to the \ac{DMA} controller, indicating that the processed samples are ready to be transferred back to the \ac{PCM} interface for playback. \\ \\
As the \ac{ARM} operation is not the main focus of this thesis and its behavior is already sufficiently described, further implementaion details will be omitted in the following while the focus will be put on implementation of the \ac{ANR} algorithm on the \ac{DSP} core itself.
\subsubsection{System control flow and main processing loop}
The implementation of the \ac{ANR} algorithm on the \ac{DSP} core is structured into several key sections, each responsible for specific aspects of the algorithm's functionality. The following paragraphs outline the main components:
\paragraph{Memory initialization and mapping}
The memory initialization section starts with the definition of the interrupt register (0xC00004) and the corresponding bit masks used to control the interrupt behavior of the \ac{DSP} core. Afterwards, a section in the shared memory is defined for the storage of input and output audio samples after/before the transport to/from the \ac{PCM} interface. The output section is initialized with an offset of 16 bytes from the input section (0x800000), resulting in a storage capability of 4 32-bit double-words for each of the two memory sections - this is more than needed, but prevents future memory relocation, if the necessety for more space would arise. After this initialization, the interrupt register and the memory sections are declared as volatile variables, telling the compiler, that these variables can be changed outside the normal program flow (e.g., by hardware interrupts), preventing certain optimizations. The final input/output buffers are then declared in form of two 16-bit arrays, consisting of 4 elements each. Finally, a variable is declared to signal the \ac{DSP} core, an interrupt has occured, which changes the state of the interrupt register and signals a processing request.
\begin{figure}[H]
    \centering
    \begin{lstlisting}[language=C]
        #define CSS_CMD 0xC00004
        #define CSS_CMD_0 (1<<0)
        #define CSS_CMD_1 (1<<1)

        #define INPUT_PORT0_ADD 0x800000
        #define OUTPUT_PORT_ADD (INPUT_PORT0_ADD + 16)

        volatile static unsigned char set_storage(DMIO:CSS_CMD) css_cmd_flag;

        static volatile int16_t set_storage(DMB:INPUT_PORT0_ADD) input_port[4];
        static volatile int16_t set_storage(DMB:OUTPUT_PORT_ADD) output_port[4];

        static volatile int action_required;
    \end{lstlisting}
    \label{fig:fig_dps_code_memory}
    \caption{Low-level implementation: Memory initialization and mapping}
\end{figure}
\noindent Figure \ref{fig:fig_compiler.jpg} shows an exemplary memory map of the input buffer array, taken from the compiler debugger. As the array is initialized as a 16 bit integer array, each element occupies 2 bytes of memory.
\begin{figure}[H]
    \centering
    \includegraphics[width=1.0\linewidth]{Bilder/fig_compiler.jpg}
    \caption{Exemplary memory map of the 4-element input buffer array. As it is initialized as a 16 bit integer array, each element occupies 2 bytes of memory, resulting in a total size of 8 bytes for the entire array. As the DSP architecture works in 32-bit double-words, the bytewise adressing is a result of the compiler abstraction.}
    \label{fig:fig_compiler.jpg}
\end{figure}
\paragraph{Main loop and interrupt handling}
The main loop of the \ac{DSP} core is quite compact, as it mainly focuses on handling interrupts and delegating the sample processing to the \ac{ANR} function. The loop starts by enabling interrupts with a compiler-specific function and setting up pointers for the output buffer and the sample variable. After setting the action flag to zero, the main function enters an infinite loop, signaling the \ac{ARM} core it´s halted state by setting the interrupt register to 1 and halting the core.\\ \\
If the \ac{ARM} core requests a sample to be processed, it activates the \ac{DSP} core and triggers an interrupt, which sets the action flag to 1. The main loop then checks the action flag, and sets the interrupt register back to 0, indicating the \ac{ARM} core it is now processing the sample. After resetting the action flag, the output pointer is updated to point to the next position in the output buffer using a cyclic addition function. Before triggering the calc()-function, the calculated sample from the previous cycle is moved from its temporary memory location to the current position in the output buffer. Afterwards, the calc()-function is triggered for the current cycle and the loop restarts. The flow diagram in Figure \ref{fig:fig_dsp_logic.jpg} visualizes the described behavior of the main loop and interrupt handling on the \ac{DSP} core.
\begin{figure}[H]
    \centering
    \begin{lstlisting}[language=C]
        int main(void) {
            enable_interrupts();
            output_pointer = &output_port[1];
            sample_pointer = &sample;
            action_required = 0;
            while (1){
                css_cmd_flag = CSS_CMD_1;
                core_halt();
                if (action_required == 1) {
                    css_cmd_flag = CSS_CMD_0;
                    action_required = 0;
                    out_pointer = cyclic_add(output_pointer, 2, output_port, 4);
                    *output_pointer = *sample_pointer;
                    calc(&corrupted_signal, &reference_noise_signal, mode, &input_port[1], &input_port[0], sample_pointer);
                }
            }
        }
    \end{lstlisting}
    \caption{Low-level implementation: Main loop and interrupt handling}
    \label{fig:fig_dps_code_mainloop}
\end{figure}
\begin{figure}[H]
    \centering
    \includegraphics[width=1.0\linewidth]{Bilder/fig_dsp_logic.jpg}
    \caption{Flow diagram of the code implementation of the main loop and interrupt handling on the \ac{DSP} core.}
    \label{fig:fig_dsp_logic.jpg}
\end{figure}
\paragraph{Calc()-function}
The calc()-function at the very end of the main process loop represents the heart of the \ac{DSP}code, as it is responsible for applying the \ac{ANR} algorithm on the two input samples. As it follows the same structure as the high-level implementation described in the previous chapter, the general functionality will not be described in detail again. The technical implementation on the \ac{DSP} however will be outlined in detail in the following subchapter, as the hardware-specific optimizations, responsible for a real-time capable implementation, are a key element for the estimation of the expectable power consumption of the system.\\ \\

\subsection{DSP-level implementation of the ANR algorithm}
The ability to process audio samples in real-time on the \ac{DSP} core is strongly dependent on compiler-specific optimizations and hardware-specific implementation techniques, which allow a far more efficient execution of the algorithm compared to a native C implementation.
\subsubsection{DSP-specific optimizations for real-time processing}
In the following, some examples of optimization possibilities shall be outlined , before the entire \ac{ANR} implementation on the \ac{DSP} is analyzed in regard of its performance.
\paragraph{Logic operations}
Logic operstions, such as finding the maximum or minimum of two values, are quite common in signal processing algorithms. However, their implementation in C usually involves conditional statements (if-else), which can be inefficient on certain architectures due to pipeline stalls.\\ \\
The simple function shown in Figure \ref{fig:fig_dsp_code_find_max} returns the maximum of two given integer values. Processing this manual implementation on the \ac{DSP} takes 12 cycles to execute, while the intrinsic function of the \ac{DSP} compiler allows a 4-cycle execution.
\begin{figure}[H]
    \centering
    \begin{lstlisting}[language=C]
    int find_max(int a, int b){
        return (a > b) ? a : b;
    }
    \end{lstlisting}
    \caption{Manual implementation of function, returning the maximum of two integer values, taking 12 cycles to execute. The intrinsic functions of the DSP compiler allows a 4-cycle implementation of such an operations.}
    \label{fig:fig_dsp_code_find_max}
\end{figure}
\paragraph{Cyclic array iteration}
Basically every part of the \ac{ANR} algorithm relies on iterating through memory sections in a cyclic manner. In C, this is usually implemented by defining an array, containing the data, and a pointer, which is incremented after each access. When the pointer reaches the end of the array, it is reset to the beginning again. This approach requires several different operations, such as pointer incrementation, if-clauses and for-loops.
\begin{figure}[H]
    \centering
    \begin{lstlisting}[language=C]
    int* cyclic array iteration(int *pointer, int increment, int *pointer_start, int buffer_length){
        int *new_pointer=pointer;
        for (int i=0; i < abs(increment); i+=1){
                new_pointer ++;
                if (new_pointer >= pointer_start + buffer_length){
                    new_pointer=pointer_start;
                }
            }
        return new_pointer;
    }
    \end{lstlisting}
    \caption{Manual implementation of a cyclic array iteration function in C, taking the core 20 cycles to execute an pointer inremen of 1. The intrinsic functions of the DSP compiler allows a single-cycle implementation of such cyclic additions.}
    \label{fig:fig_dsp_code_cyclic_add}
\end{figure}
\noindent Figure \ref{fig:fig_dsp_code_cyclic_add} shows a manual implementation of such a cyclic array iteration function in C, which updates the pointer to a new address. This implementation takes the \ac{DSP} 20 cycles to execute, while the already implemented compiler-optimized version only takes one cycle, making use of the specific architecture of the \ac{DSP} allowing such a single-cycle operation.
\paragraph{Fractional fixed-point arithmetic}
As already mentioned during the beginning of the current chapter, the used \ac{DSP} is a fixed point processor, meaning, that it does not support floating-point arithmetic natively. Instead, it relies on fixed-point arithmetic, which represents numbers as integers scaled by a fixed factor. This is a key requirement, as it allows the use of the implemented dual \ac{MAC} \ac{ALU}s. This approach is also faster and more energy efficient, and therefore more suitable for embedded systems. However, it also introduces challenges in terms of precision and range, which need to taken into account when conducting certain calculations.\\ \\
To tackle this issues, the \ac{DSP} compiler provides intrinsic functions for fractional fixed-point arithmetic, such as a fractional multiplication function, which takes two 32-bit integers as input and return an already bit-shifted 64-bit output, representing the fractional multiplication result. This approach prevents the need for manual bit-shifting operations after each multiplication.\\ \\
To support such operations, a 72-bit accumulator is provided, allowing to store intermediate 64-bit results of 32-bit multiplications without losing precision - the remaining 8 bit serve as an overflow space. If needed, a saturation function is also provided, to round the 64-bit result back to a 32-bit value.

\subsubsection{Performance evaluation and quantization of the DSP implementation}