### Main Application Entry Point

Source: https://context7.com/espressif/esp-sr/llms.txt

Sets up the necessary hardware drivers and starts the speech recognition tasks. Ensure I2S and other peripherals are initialized before calling this function.

```c
void app_main(void) {
    // Initialize I2S, GPIO, etc.
    i2s_driver_install(...);

    // Initialize speech recognition
    speech_recognition_init();

    // Start processing tasks
    xTaskCreatePinnedToCore(feed_task, "feed", 4096, NULL, 5, NULL, 0);
    xTaskCreatePinnedToCore(detect_task, "detect", 8192, NULL, 5, NULL, 1);
}
```

--------------------------------

### MultiNet Initialization and Setup

Source: https://github.com/espressif/esp-sr/blob/master/docs/zh_CN/speech_command_recognition/README.rst

Steps to initialize and configure MultiNet for command recognition, including model loading and setting command words.

```APIDOC
## MultiNet Initialization

### Description
This section covers the necessary steps for initializing the MultiNet module, including loading the appropriate models and configuring command words.

### Model Loading
Refer to the documentation on model loading for details on how to load MultiNet models.

- **See also**: :doc:`Model Loading <../flash_model/README>`

### Setting Command Words
Configure the specific command words that MultiNet should recognize.

- **See also**: :ref:`command-requirements`
```

--------------------------------

### Install g2p-en Package

Source: https://github.com/espressif/esp-sr/blob/master/tool/README.md

Install the g2p-en Python package using pip. This is required for processing English text into phonemes for the MultiNet5 model.

```bash
pip install g2p_en
```

--------------------------------

### WakeNet Wake Word Detection Example

Source: https://context7.com/espressif/esp-sr/llms.txt

This example demonstrates how to initialize, configure, and use the WakeNet engine for detecting wake words from audio streams. Ensure the necessary headers are included and models are available.
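The listing that follows reads the frame size and sample rate from the model and reports the wake word start point in samples. As a minimal host-side sketch of the arithmetic involved, the helpers below convert a sample count to milliseconds and count how many complete frames fit in a buffer. The names `samples_to_ms` and `full_frames` are hypothetical and not part of ESP-SR, and the value 512 used in the usage note is only an illustrative frame size.

```c
#include <stddef.h>

// Duration in milliseconds of `samples` mono samples at `sample_rate` Hz.
// Hypothetical helper; integer division truncates any fractional millisecond.
int samples_to_ms(int samples, int sample_rate) {
    return (int)((long long)samples * 1000 / sample_rate);
}

// Number of complete frames of `chunksize` samples in `total_samples`.
// Hypothetical helper; returns 0 when chunksize is 0 to avoid dividing by zero.
size_t full_frames(size_t total_samples, size_t chunksize) {
    return chunksize == 0 ? 0 : total_samples / chunksize;
}
```

With a 16000 Hz stream, `samples_to_ms(512, 16000)` gives the duration of one 512-sample frame, and the same helper can turn the value returned by `get_start_point` into a time offset.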
```c
#include "esp_wn_iface.h"
#include "esp_wn_models.h"

// Get WakeNet model handle
const esp_wn_iface_t *wakenet = esp_wn_handle_from_name("wn9_hilexin");
// Available models: wn9_hilexin, wn9_hiesp, wn9_alexa, wn9_nihaoxiaozhi, etc.

// Create WakeNet instance
model_iface_data_t *model_data = wakenet->create("wn9_hilexin", DET_MODE_90);
// DET_MODE_90: Normal sensitivity
// DET_MODE_95: Aggressive (higher detection, more false positives)

// Get processing parameters
int chunksize = wakenet->get_samp_chunksize(model_data);  // Samples per frame
int sample_rate = wakenet->get_samp_rate(model_data);     // 16000 Hz
int word_num = wakenet->get_word_num(model_data);         // Number of wake words

// Get wake word information
for (int i = 1; i <= word_num; i++) {
    char *name = wakenet->get_word_name(model_data, i);
    float threshold = wakenet->get_det_threshold(model_data, i);
    printf("Wake word %d: %s (threshold: %.2f)\n", i, name, threshold);
}

// Set custom detection threshold (0.4 - 0.9999)
wakenet->set_det_threshold(model_data, 0.85, 1);  // threshold=0.85, word_index=1

// Process audio frames
int16_t *audio_buffer = (int16_t *)malloc(chunksize * sizeof(int16_t));
while (1) {
    // Fill audio_buffer with 16-bit mono audio @ 16kHz
    read_audio(audio_buffer, chunksize);

    wakenet_state_t state = wakenet->detect(model_data, audio_buffer);
    if (state == WAKENET_DETECTED) {
        int channel = wakenet->get_triggered_channel(model_data);
        int start_point = wakenet->get_start_point(model_data);
        printf("Wake word detected! Channel: %d, Start: %d samples back\n", channel, start_point);
    }
}

// Reset state for new session
wakenet->clean(model_data);

// Cleanup
wakenet->destroy(model_data);
free(audio_buffer);
```

--------------------------------

### Run ESP-SR Test Suite

Source: https://github.com/espressif/esp-sr/blob/master/test_apps/README.md

Commands to install test dependencies and execute the pytest suite for the ESP32-S3 target.
```bash
pip install -r test_apps/requirement.txt
pytest test_apps --target esp32s3
```

--------------------------------

### Get Sample Chunk Size - MultiNet

Source: https://github.com/espressif/esp-sr/blob/master/docs/zh_CN/speech_command_recognition/README.rst

Retrieves the required sample chunk size for MultiNet input. This value must match the AFE fetch frame length.

```c
int mu_chunksize = multinet->get_samp_chunksize(model_data);
```

--------------------------------

### Get Recognition Results - MultiNet

Source: https://github.com/espressif/esp-sr/blob/master/docs/zh_CN/speech_command_recognition/README.rst

Retrieves the results of the command word recognition. Call this API when the state is ESP_MN_STATE_DETECTED.

```c
esp_mn_results_t *mn_result = multinet->get_results(model_data);
```

--------------------------------

### Create AFE Instance from Configuration

Source: https://github.com/espressif/esp-sr/blob/master/docs/zh_CN/audio_front_end/README.rst

Creates an AFE instance using the provided configuration. Obtain the AFE interface handle first, then use it to create the data structure for the AFE instance.

```c
// Get the handle
esp_afe_sr_iface_t *afe_handle = esp_afe_handle_from_config(afe_config);

// Create the instance
esp_afe_sr_data_t *afe_data = afe_handle->create_from_config(afe_config);
```

--------------------------------

### Initialize AFE Configuration

Source: https://github.com/espressif/esp-sr/blob/master/docs/en/audio_front_end/README.rst

Use afe_config_init to set up the AFE parameters, including input format and model selection.

```c
srmodel_list_t *models = esp_srmodel_init("model");
afe_config_t *afe_config = afe_config_init("MMNR", models, AFE_TYPE_SR, AFE_MODE_HIGH_PERF);
```

--------------------------------

### Prepare English Commands for MultiNet6

Source: https://github.com/espressif/esp-sr/blob/master/tool/README.md

Create a text file with command IDs and their corresponding sentences for English.
Each line should follow the format '# command_id command_sentence'.

```text
# command_id command_sentence
1 TELL ME A JOKE
2 MAKE A COFFEE
```

--------------------------------

### Build ESP-SR Test Applications

Source: https://github.com/espressif/esp-sr/blob/master/test_apps/README.md

Commands to set up the environment and build test applications for the ESP32-S3 target.

```bash
. ${IDF_PATH}/export.sh
pip install idf_build_apps
python test_apps/build_apps.py test_apps -t esp32s3
```

--------------------------------

### Implement AEC using AFE and Direct Interfaces

Source: https://context7.com/espressif/esp-sr/llms.txt

Demonstrates both the recommended AFE AEC interface for multi-channel audio and the direct AEC interface for manual reference signal handling.

```c
#include "esp_afe_aec.h"
#include "esp_aec.h"

// Using AFE AEC interface (recommended)
afe_aec_handle_t *afe_aec = afe_aec_create(
    "MNR",             // input_format: M=mic, N=unused, R=reference
    4,                 // filter_length in frames
    AFE_TYPE_SR,       // AFE_TYPE_SR or AFE_TYPE_VC
    AFE_MODE_LOW_COST  // AFE_MODE_LOW_COST or AFE_MODE_HIGH_PERF
);

int frame_size = afe_aec->frame_size;
int nch = afe_aec->pcm_config.total_ch_num;
int16_t *input_data = (int16_t *)malloc(frame_size * nch * sizeof(int16_t));
int16_t *output_data = (int16_t *)malloc(frame_size * sizeof(int16_t));

while (1) {
    // Read multi-channel audio (interleaved format)
    read_multichannel_audio(input_data, frame_size * nch);

    // Process AEC
    afe_aec_process(afe_aec, input_data, output_data);

    // output_data contains echo-cancelled audio
    process_clean_audio(output_data, frame_size);
}

afe_aec_destroy(afe_aec);

// Using direct AEC interface
aec_handle_t *aec = aec_create(
    16000,                // sample_rate
    4,                    // filter_length
    1,                    // mic_channels
    AEC_MODE_SR_LOW_COST  // or AEC_MODE_SR_HIGH_PERF
);

int16_t *mic_data = (int16_t *)malloc(frame_size * sizeof(int16_t));
int16_t *ref_data = (int16_t *)malloc(frame_size * sizeof(int16_t));
int16_t *out_data = (int16_t *)malloc(frame_size *
    sizeof(int16_t));

while (1) {
    read_mic_audio(mic_data, frame_size);
    read_speaker_ref(ref_data, frame_size);

    aec_process(aec, mic_data, ref_data, out_data);
    // out_data has echo removed
}

aec_destroy(aec);
```

--------------------------------

### Manage AFE Instance and Pipeline

Source: https://context7.com/espressif/esp-sr/llms.txt

Creates an AFE instance from a configuration and provides methods to control audio processing algorithms and retrieve runtime parameters.

```c
// Get AFE handle from configuration
const esp_afe_sr_iface_t *afe_handle = esp_afe_handle_from_config(afe_config);

// Create AFE instance
esp_afe_sr_data_t *afe_data = afe_handle->create_from_config(afe_config);

// Print the processing pipeline
afe_handle->print_pipeline(afe_data);
// Output: [input] -> |AEC(VOIP_HIGH_PERF)| -> |WakeNet(wn9_hilexin)| -> [output]

// Get processing parameters
int feed_chunksize = afe_handle->get_feed_chunksize(afe_data);    // Samples per frame
int feed_nch = afe_handle->get_feed_channel_num(afe_data);        // Input channel count
int fetch_chunksize = afe_handle->get_fetch_chunksize(afe_data);  // Output samples per frame
int sample_rate = afe_handle->get_samp_rate(afe_data);            // Sample rate (16000 Hz)

// Control individual algorithms
afe_handle->disable_wakenet(afe_data);  // Disable wake word detection
afe_handle->enable_wakenet(afe_data);   // Re-enable wake word detection
afe_handle->disable_aec(afe_data);      // Disable echo cancellation
afe_handle->enable_aec(afe_data);       // Re-enable echo cancellation
afe_handle->disable_vad(afe_data);      // Disable voice activity detection
afe_handle->enable_vad(afe_data);       // Re-enable voice activity detection
afe_handle->reset_vad(afe_data);        // Reset VAD state

// Adjust wake word detection threshold (0.4 - 0.9999)
afe_handle->set_wakenet_threshold(afe_data, 1, 0.8);  // wakenet_index=1, threshold=0.8
afe_handle->reset_wakenet_threshold(afe_data, 1);     // Reset to default

// Cleanup
afe_handle->destroy(afe_data);
afe_config_free(afe_config);
esp_srmodel_deinit(models);
```

--------------------------------

### Get MultiNet Chunk Size

Source: https://github.com/espressif/esp-sr/blob/master/docs/en/speech_command_recognition/README.rst

Retrieves the required frame length (chunk size) for data passed to MultiNet. This size must match the AFE fetch frame length.

```APIDOC
## multinet->get_samp_chunksize

### Description
Get the length of the frame that needs to be passed to MultiNet. This size is exactly the same as the number of data points per frame obtained in AFE.

### Method
(Not specified, likely a method call on a MultiNet object)

### Endpoint
(Not applicable, this is a library function)

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
- **model_data** (const model_iface_data_t *) - Required - The model object to query.

### Request Example
(Not applicable)

### Response
#### Success Response (200)
- int - The size of the sample chunk (short) required for each frame passed to MultiNet.

### Response Example
```c
int mu_chunksize = multinet->get_samp_chunksize(model_data);
```
```

--------------------------------

### Configure Audio Front-End (AFE)

Source: https://context7.com/espressif/esp-sr/llms.txt

Initializes AFE models and configures processing parameters like AEC, NS, VAD, and WakeNet. Requires valid model partitions and specific input channel formatting.
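The input format passed to `afe_config_init` in the listing that follows is a string with one character per channel: 'M' for a microphone, 'R' for a playback reference, and 'N' for an unused channel. As a minimal host-side sketch of how such a string maps to channel counts, the helper below tallies each kind. `channel_layout_t` and `parse_input_format` are hypothetical names used for illustration, not ESP-SR APIs, and the rule that any other character is rejected is an assumption of this sketch.

```c
#include <stddef.h>

// Channel counts derived from an AFE-style input format string.
// Hypothetical type, not part of ESP-SR.
typedef struct {
    int mic;     // number of 'M' entries
    int ref;     // number of 'R' entries
    int unused;  // number of 'N' entries
    int total;   // all channels, i.e. samples per interleaved frame
} channel_layout_t;

// Returns 0 on success, or -1 if fmt is NULL, empty, or contains a
// character other than 'M', 'R', or 'N' (assumed validation rule).
int parse_input_format(const char *fmt, channel_layout_t *out) {
    if (fmt == NULL || out == NULL || fmt[0] == '\0') {
        return -1;
    }
    channel_layout_t layout = {0, 0, 0, 0};
    for (const char *p = fmt; *p != '\0'; p++) {
        switch (*p) {
            case 'M': layout.mic++; break;
            case 'R': layout.ref++; break;
            case 'N': layout.unused++; break;
            default:  return -1;
        }
        layout.total++;
    }
    *out = layout;
    return 0;
}
```

For "MMNR" this yields two microphone channels, one reference channel, one unused channel, and four channels per interleaved frame.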
```c
#include "esp_afe_sr_models.h"
#include "esp_afe_sr_iface.h"

// Initialize models from partition
srmodel_list_t *models = esp_srmodel_init("model");

// Initialize AFE configuration
// input_format: "MMNR" = 2 mic channels, 1 unused, 1 reference
afe_config_t *afe_config = afe_config_init("MMNR", models, AFE_TYPE_SR, AFE_MODE_HIGH_PERF);

// Customize configuration parameters
afe_config->aec_init = true;             // Enable Acoustic Echo Cancellation
afe_config->ns_init = true;              // Enable Noise Suppression
afe_config->vad_init = true;             // Enable Voice Activity Detection
afe_config->vad_mode = VAD_MODE_1;       // VAD aggressiveness (0-4)
afe_config->vad_min_speech_ms = 128;     // Minimum speech duration (ms)
afe_config->vad_min_noise_ms = 1000;     // Minimum noise duration (ms)
afe_config->wakenet_init = true;         // Enable WakeNet
afe_config->wakenet_mode = DET_MODE_90;  // Wake word detection sensitivity
afe_config->agc_init = true;             // Enable Automatic Gain Control
afe_config->afe_linear_gain = 1.0;       // Output gain factor [0.1 - 10.0]
afe_config->memory_alloc_mode = AFE_MEMORY_ALLOC_MORE_PSRAM;

// Print configuration for debugging
afe_config_print(afe_config);
```

--------------------------------

### Audio Front-End (AFE) Configuration

Source: https://context7.com/espressif/esp-sr/llms.txt

This section covers the initialization and configuration of the AFE framework, including setting up audio processing algorithms like AEC, NS, VAD, and WakeNet.

```APIDOC
## Audio Front-End (AFE) Configuration

The AFE framework processes audio data for speech recognition and voice communication applications. It initializes and configures all audio processing algorithms including AEC, NS, VAD, and WakeNet detection based on the input format specification where 'M' represents microphone channels, 'R' represents playback reference channels, and 'N' represents unused channels.

### Method
C Code Example

### Endpoint
N/A

### Parameters
#### Request Body
- **input_format** (string) - Required - Specifies microphone, reference, and unused channels (e.g., "MMNR").
- **models** (srmodel_list_t*) - Required - Pointer to initialized speech recognition models.
- **afe_type** (afe_type_t) - Required - Type of AFE (e.g., AFE_TYPE_SR).
- **afe_mode** (afe_mode_t) - Required - Performance mode (e.g., AFE_MODE_HIGH_PERF).

### Request Example
```c
#include "esp_afe_sr_models.h"
#include "esp_afe_sr_iface.h"

// Initialize models from partition
srmodel_list_t *models = esp_srmodel_init("model");

// Initialize AFE configuration
afe_config_t *afe_config = afe_config_init("MMNR", models, AFE_TYPE_SR, AFE_MODE_HIGH_PERF);

// Customize configuration parameters
afe_config->aec_init = true;             // Enable Acoustic Echo Cancellation
afe_config->ns_init = true;              // Enable Noise Suppression
afe_config->vad_init = true;             // Enable Voice Activity Detection
afe_config->vad_mode = VAD_MODE_1;       // VAD aggressiveness (0-4)
afe_config->vad_min_speech_ms = 128;     // Minimum speech duration (ms)
afe_config->vad_min_noise_ms = 1000;     // Minimum noise duration (ms)
afe_config->wakenet_init = true;         // Enable WakeNet
afe_config->wakenet_mode = DET_MODE_90;  // Wake word detection sensitivity
afe_config->agc_init = true;             // Enable Automatic Gain Control
afe_config->afe_linear_gain = 1.0;       // Output gain factor [0.1 - 10.0]
afe_config->memory_alloc_mode = AFE_MEMORY_ALLOC_MORE_PSRAM;

// Print configuration for debugging
afe_config_print(afe_config);
```

### Response
#### Success Response (200)
- **afe_config_t*** - Pointer to the initialized AFE configuration structure.

#### Response Example
N/A
```

--------------------------------

### Create AFE Instance from Configuration

Source: https://github.com/espressif/esp-sr/blob/master/docs/en/audio_front_end/migration_guide.rst

Obtain an AFE handle using `esp_afe_handle_from_config` with the initialized configuration.
The previous `ESP_AFE_SR_HANDLE` and `ESP_AFE_VC_HANDLE` are no longer used.

```c
esp_afe_sr_iface_t *afe_handle = esp_afe_handle_from_config(afe_config);
```

--------------------------------

### Configure ESP-SR Project CMakeLists.txt

Source: https://github.com/espressif/esp-sr/blob/master/test_apps/esp-sr/CMakeLists.txt

Sets up the CMake build system for the ESP-SR test subproject. Includes necessary components and defines the project name.

```cmake
cmake_minimum_required(VERSION 3.5)

# Include the components directory of the main application:
#
set(EXTRA_COMPONENT_DIRS "$ENV{IDF_PATH}/tools/unit-test-app/components"
                         "../../../esp-sr")
include($ENV{IDF_PATH}/tools/cmake/project.cmake)
project(esp_sr_test)
```

--------------------------------

### Create AFE Instance

Source: https://github.com/espressif/esp-sr/blob/master/docs/en/audio_front_end/README.rst

Initialize the AFE handle and data instance based on the previously defined configuration.

```c
// get handle
esp_afe_sr_iface_t *afe_handle = esp_afe_handle_from_config(afe_config);

// create instance
esp_afe_sr_data_t *afe_data = afe_handle->create_from_config(afe_config);
```

--------------------------------

### Prepare Chinese Commands for MultiNet6

Source: https://github.com/espressif/esp-sr/blob/master/tool/README.md

Create a text file with command IDs and their corresponding Pinyin sentences for Chinese. Each line should follow the format '# command_id command_sentence'.

```text
# command_id command_sentence
1 da kai kong tiao
2 guan bi kong tiao
```

--------------------------------

### Initialize and Configure MultiNet Speech Commands

Source: https://context7.com/espressif/esp-sr/llms.txt

This snippet demonstrates how to initialize the MultiNet model, add custom speech commands with IDs and strings, and optionally add phonemes for improved accuracy. It also shows how to update the command list and print active commands. Ensure the correct model name (e.g., "mn7_en") is used.
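Since `esp_mn_commands_add` in the listing that follows takes a numeric ID and a phrase, one way to keep the two together is a small lookup table that application code can also use to map an ID back to its phrase. The sketch below is host-side only; `speech_command_t` and `phrase_for_id` are hypothetical names, not ESP-SR APIs, and it assumes the application wants the first phrase registered under a given ID.

```c
#include <stddef.h>

// One entry of an application-defined command table (hypothetical type).
typedef struct {
    int id;              // command ID passed as the first argument
    const char *phrase;  // command string passed as the second argument
} speech_command_t;

// Returns the first phrase registered under `id`, or NULL if none matches.
const char *phrase_for_id(const speech_command_t *cmds, size_t n, int id) {
    for (size_t i = 0; i < n; i++) {
        if (cmds[i].id == id) {
            return cmds[i].phrase;
        }
    }
    return NULL;
}
```

On the target, the same table could be walked once at startup, calling `esp_mn_commands_add(cmds[i].id, cmds[i].phrase)` for each entry and then `esp_mn_commands_update()`, as the listing does with literal arguments.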
```c
#include "esp_mn_iface.h"
#include "esp_mn_models.h"
#include "esp_mn_speech_commands.h"

// Get MultiNet model handle
const esp_mn_iface_t *multinet = esp_mn_handle_from_name("mn7_en");
// Available: mn5q8_cn, mn5q8_en, mn6_cn, mn6_en, mn7_cn, mn7_en

// Create MultiNet instance with timeout (ms)
model_iface_data_t *mn_model = multinet->create("mn7_en", 6000);  // 6 second timeout

// Initialize speech commands management
esp_mn_commands_alloc(multinet, mn_model);

// Add speech commands (command_id, command_string)
esp_mn_commands_add(1, "turn on the light");
esp_mn_commands_add(2, "turn off the light");
esp_mn_commands_add(3, "play music");
esp_mn_commands_add(4, "stop music");
esp_mn_commands_add(5, "increase volume");
esp_mn_commands_add(6, "decrease volume");

// For MultiNet7, optionally add phonemes for better accuracy
esp_mn_commands_phoneme_add(7, "tell me a joke", "TfL Mm c qbK");

// Apply command changes (required after add/remove/modify)
esp_mn_error_t *err = esp_mn_commands_update();
if (err != NULL) {
    printf("Error adding %d commands\n", err->num);
}

// Print active commands
esp_mn_active_commands_print();

// Get processing parameters
int mn_chunksize = multinet->get_samp_chunksize(mn_model);
char *language = multinet->get_language(mn_model);  // "en" or "cn"
```

```c
// Modify commands at runtime
esp_mn_commands_modify("play music", "start playing");
esp_mn_commands_remove("stop music");
esp_mn_commands_update();

// Clear all commands
esp_mn_commands_clear();
esp_mn_commands_update();

// Cleanup
multinet->destroy(mn_model);
esp_mn_commands_free();
free(audio_buffer);
```

--------------------------------

### Initialize AFE Configuration

Source: https://github.com/espressif/esp-sr/blob/master/docs/en/audio_front_end/migration_guide.rst

Use `afe_config_init` to initialize configurations, specifying the input format, models, AFE type, and performance mode. The previous `AFE_CONFIG_DEFAULT()` method is removed.
```c
afe_config_t *afe_config = afe_config_init("MMNR", models, AFE_TYPE_SR, AFE_MODE_HIGH_PERF);
afe_config_print(afe_config);  // print all configurations
```

--------------------------------

### AFE Instance Creation and Pipeline Control

Source: https://context7.com/espressif/esp-sr/llms.txt

This section describes how to create an AFE instance, control the processing pipeline, adjust algorithm parameters, and manage the AFE lifecycle.

```APIDOC
## AFE Instance Creation and Pipeline Control

Create an AFE instance using the configuration and access the interface handle for audio processing operations. The handle provides methods to feed audio data, fetch processed results, and control individual algorithms in the pipeline.

### Method
C Code Example

### Endpoint
N/A

### Parameters
#### Path Parameters
N/A

#### Query Parameters
N/A

#### Request Body
N/A

### Request Example
```c
// Get AFE handle from configuration
const esp_afe_sr_iface_t *afe_handle = esp_afe_handle_from_config(afe_config);

// Create AFE instance
esp_afe_sr_data_t *afe_data = afe_handle->create_from_config(afe_config);

// Print the processing pipeline
afe_handle->print_pipeline(afe_data);
// Output: [input] -> |AEC(VOIP_HIGH_PERF)| -> |WakeNet(wn9_hilexin)| -> [output]

// Get processing parameters
int feed_chunksize = afe_handle->get_feed_chunksize(afe_data);    // Samples per frame
int feed_nch = afe_handle->get_feed_channel_num(afe_data);        // Input channel count
int fetch_chunksize = afe_handle->get_fetch_chunksize(afe_data);  // Output samples per frame
int sample_rate = afe_handle->get_samp_rate(afe_data);            // Sample rate (16000 Hz)

// Control individual algorithms
afe_handle->disable_wakenet(afe_data);  // Disable wake word detection
afe_handle->enable_wakenet(afe_data);   // Re-enable wake word detection
afe_handle->disable_aec(afe_data);      // Disable echo cancellation
afe_handle->enable_aec(afe_data);       // Re-enable echo cancellation
afe_handle->disable_vad(afe_data);      // Disable voice activity detection
afe_handle->enable_vad(afe_data);       // Re-enable voice activity detection
afe_handle->reset_vad(afe_data);        // Reset VAD state

// Adjust wake word detection threshold (0.4 - 0.9999)
afe_handle->set_wakenet_threshold(afe_data, 1, 0.8);  // wakenet_index=1, threshold=0.8
afe_handle->reset_wakenet_threshold(afe_data, 1);     // Reset to default

// Cleanup
afe_handle->destroy(afe_data);
afe_config_free(afe_config);
esp_srmodel_deinit(models);
```

### Response
#### Success Response (200)
- **esp_afe_sr_data_t*** - Pointer to the created AFE instance.
- **int** - Processing chunk size, channel number, fetch chunk size, sample rate.
- **void** - Control functions return void.

#### Response Example
N/A
```

--------------------------------

### Initialize Speech Recognition System

Source: https://context7.com/espressif/esp-sr/llms.txt

Initializes the Audio Front-End (AFE) and MultiNet for speech recognition. Ensure models are loaded and AFE/MultiNet configurations are set appropriately for your hardware and desired performance.
```c
#include "esp_afe_sr_models.h"
#include "esp_afe_sr_iface.h"
#include "esp_mn_iface.h"
#include "esp_mn_models.h"
#include "esp_mn_speech_commands.h"

// Global handles
static srmodel_list_t *models = NULL;
static const esp_afe_sr_iface_t *afe_handle = NULL;
static esp_afe_sr_data_t *afe_data = NULL;
static const esp_mn_iface_t *multinet = NULL;
static model_iface_data_t *mn_model = NULL;

void speech_recognition_init(void) {
    // Load models
    models = esp_srmodel_init("model");

    // Configure AFE
    afe_config_t *afe_config = afe_config_init("MMR", models, AFE_TYPE_SR, AFE_MODE_HIGH_PERF);
    afe_config->wakenet_model_name = "wn9_hiesp";
    afe_config->vad_init = true;
    afe_config->vad_min_speech_ms = 128;

    // Create AFE
    afe_handle = esp_afe_handle_from_config(afe_config);
    afe_data = afe_handle->create_from_config(afe_config);
    afe_config_free(afe_config);

    // Initialize MultiNet
    multinet = esp_mn_handle_from_name("mn7_en");
    mn_model = multinet->create("mn7_en", 6000);

    // Configure speech commands
    esp_mn_commands_alloc(multinet, mn_model);
    esp_mn_commands_add(1, "turn on the light");
    esp_mn_commands_add(2, "turn off the light");
    esp_mn_commands_add(3, "set brightness to maximum");
    esp_mn_commands_add(4, "set brightness to minimum");
    esp_mn_commands_update();
}
```

--------------------------------

### Implement AFE Feed and Fetch Tasks

Source: https://context7.com/espressif/esp-sr/llms.txt

Demonstrates the structure of feed and fetch tasks for the AFE pipeline, including buffer allocation, I2S reading, and result processing.
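The feed buffer in the listing that follows is sized as `feed_chunksize * feed_nch * sizeof(int16_t)` and holds channel-interleaved samples. As a minimal host-side sketch of that layout, the helpers below compute the frame size in bytes and interleave separate per-channel buffers into one frame. `feed_frame_bytes` and `interleave_channels` are hypothetical names, not ESP-SR APIs, and the channel order is assumed to follow the order of the `src` array.

```c
#include <stddef.h>
#include <stdint.h>

// Bytes needed for one interleaved frame of `nch` channels with
// `chunksize` 16-bit samples per channel. Hypothetical helper.
size_t feed_frame_bytes(size_t chunksize, size_t nch) {
    return chunksize * nch * sizeof(int16_t);
}

// Interleave `nch` per-channel buffers into `dst`, so that sample i of
// channel c lands at dst[i * nch + c]. `dst` must hold chunksize * nch
// samples. Hypothetical helper; channel order follows `src`.
void interleave_channels(const int16_t *const *src, size_t nch,
                         size_t chunksize, int16_t *dst) {
    for (size_t i = 0; i < chunksize; i++) {
        for (size_t c = 0; c < nch; c++) {
            dst[i * nch + c] = src[c][i];
        }
    }
}
```

When the I2S peripheral already delivers interleaved data, as the listing assumes, only the size calculation is needed; the interleave step applies when channels arrive in separate buffers.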
```c
// Allocate audio buffer for feeding
int16_t *feed_buff = (int16_t *)malloc(feed_chunksize * feed_nch * sizeof(int16_t));

// Feed Task - runs on dedicated core
void feed_task(void *arg) {
    afe_task_into_t *task_info = (afe_task_into_t *)arg;
    const esp_afe_sr_iface_t *afe_handle = task_info->afe_handle;
    esp_afe_sr_data_t *afe_data = task_info->afe_data;

    int feed_chunksize = afe_handle->get_feed_chunksize(afe_data);
    int feed_nch = afe_handle->get_feed_channel_num(afe_data);
    int16_t *i2s_buff = (int16_t *)malloc(feed_chunksize * feed_nch * sizeof(int16_t));

    while (1) {
        // Read audio from I2S (channel-interleaved format)
        i2s_read(I2S_NUM_0, i2s_buff, feed_chunksize * feed_nch * sizeof(int16_t), &bytes_read, portMAX_DELAY);

        // Feed to AFE pipeline
        afe_handle->feed(afe_data, i2s_buff);
    }

    free(i2s_buff);
    vTaskDelete(NULL);
}

// Fetch Task - runs on dedicated core
void fetch_task(void *arg) {
    afe_task_into_t *task_info = (afe_task_into_t *)arg;
    const esp_afe_sr_iface_t *afe_handle = task_info->afe_handle;
    esp_afe_sr_data_t *afe_data = task_info->afe_data;

    while (1) {
        // Fetch processed audio and detection results
        afe_fetch_result_t *result = afe_handle->fetch(afe_data);
        // Or with custom timeout:
        // afe_fetch_result_t *result = afe_handle->fetch_with_delay(afe_data, 100 / portTICK_PERIOD_MS);

        if (!result || result->ret_value == ESP_FAIL) {
            break;
        }

        // Access processed audio data
        int16_t *audio_data = result->data;
        int data_size = result->data_size;       // Size in bytes
        float volume_db = result->data_volume;   // Volume in dB

        // Check VAD cache (prevents speech truncation)
        if (result->vad_cache_size > 0) {
            int16_t *vad_cache = result->vad_cache;
            // Prepend vad_cache to audio_data for complete speech
        }

        // Check Voice Activity Detection state
        if (result->vad_state == VAD_SPEECH) {
            printf("Speech detected\n");
        } else {
            printf("Silence/Noise\n");
        }

        // Check Wake Word Detection
        if (result->wakeup_state == WAKENET_DETECTED) {
            int wake_word_index = result->wake_word_index;    // Which wake word (1-based)
            int model_index = result->wakenet_model_index;    // Which model detected
            int wake_length = result->wake_word_length;       // Samples of wake word
            printf("Wake word %d detected by model %d\n", wake_word_index, model_index);
        }
    }

    vTaskDelete(NULL);
}

// Create tasks
xTaskCreatePinnedToCore(feed_task, "feed", 8*1024, &task_info, 5, NULL, 0);
xTaskCreatePinnedToCore(fetch_task, "fetch", 8*1024, &task_info, 5, NULL, 1);
```

--------------------------------

### Create and Configure VAD Instance

Source: https://context7.com/espressif/esp-sr/llms.txt

This snippet shows how to create a Voice Activity Detection (VAD) instance using different modes or custom parameters. It highlights the available modes from `VAD_MODE_0` to `VAD_MODE_4` and allows for detailed configuration of sample rate, frame length, and speech/noise durations. Choose the mode and parameters that best suit your application's accuracy and performance needs.

```c
#include "esp_vad.h"

// Create VAD instance with mode
vad_handle_t vad = vad_create(VAD_MODE_3);
// VAD_MODE_0: Normal (more speech detected)
// VAD_MODE_1: Aggressive
// VAD_MODE_2: Very Aggressive
// VAD_MODE_3: Very Very Aggressive
// VAD_MODE_4: Very Very Very Aggressive (less false positives)

// Or create with detailed parameters
vad_handle_t vad_custom = vad_create_with_param(
    VAD_MODE_2,  // mode
    16000,       // sample_rate (8000, 16000, or 32000)
    30,          // frame_length_ms (10, 20, or 30)
    200,         // min_speech_ms
    500          // min_noise_ms
);
```

--------------------------------

### Generate Model Binary for Arduino

Source: https://github.com/espressif/esp-sr/blob/master/docs/en/flash_model/README.rst

Use this Python script to generate the srmodels.bin file required for manual loading in the Arduino framework.
```bash
python {esp-sr_path}/movemodel.py -d1 {sdkconfig_path} -d2 {esp-sr_path} -d3 {build_path}
```

--------------------------------

### MultiNet Running and Detection

Source: https://github.com/espressif/esp-sr/blob/master/docs/zh_CN/speech_command_recognition/README.rst

Instructions on how to run MultiNet for command recognition after AFE and WakeNet are enabled, including data formatting and API calls.

```APIDOC
## MultiNet Running

### Description
Once the AFE and WakeNet are enabled, MultiNet can be run for command recognition. Ensure that the frame length passed to MultiNet matches the AFE fetch frame length. The supported audio format is 16 kHz, 16-bit, mono.

### Get Sample Chunk Size
Determine the required chunk size for audio data to be fed into MultiNet.

```c
int mu_chunksize = multinet->get_samp_chunksize(model_data);
```

- **`mu_chunksize`**: The number of `short` type audio samples per frame required by MultiNet. This should be equal to the number of samples fetched by AFE per frame.

### MultiNet Detection
Feed the real-time fetched audio data from AFE into the `detect` API.

```c
esp_mn_state_t mn_state = multinet->detect(model_data, buff);
```

- **`buff`**: The audio data buffer. Its length should be `mu_chunksize * sizeof(int16_t)`.
```

--------------------------------

### Configure CMakeLists.txt for ESP-SR Test Project

Source: https://github.com/espressif/esp-sr/blob/master/test_apps/esp32c5/CMakeLists.txt

Sets the required CMake version, defines extra component directories, and includes the ESP-IDF project configuration.
```cmake
cmake_minimum_required(VERSION 3.5)

# Include the components directory of the main application:
#
set(EXTRA_COMPONENT_DIRS "$ENV{IDF_PATH}/tools/unit-test-app/components"
                         "../../../esp-sr")
include($ENV{IDF_PATH}/tools/cmake/project.cmake)
project(esp32c5_test)
```

--------------------------------

### Add Speech Commands via menuconfig

Source: https://github.com/espressif/esp-sr/blob/master/tool/README.md

Navigate through the ESP-IDF configuration menu to add speech commands. Access the 'ESP Speech Recognition' section and select 'Add speech commands'.

```bash
idf.py menuconfig
ESP Speech Recognition -> Add speech commands
```

--------------------------------

### Configure Voice Data Partition

Source: https://github.com/espressif/esp-sr/blob/master/test_apps/esp-tts/main/CMakeLists.txt

Defines a custom build target to include voice data and configures the flash process to write the data to a specific partition.

```cmake
set(voice_data_image ${PROJECT_DIR}/../../esp-tts/esp_tts_chinese/esp_tts_voice_data_xiaoxin_small.dat)
add_custom_target(voice_data ALL DEPENDS ${voice_data_image})
add_dependencies(flash voice_data)

partition_table_get_partition_info(size "--partition-name voice_data" "size")
partition_table_get_partition_info(offset "--partition-name voice_data" "offset")

if("${size}" AND "${offset}")
    esptool_py_flash_to_partition(flash "voice_data" "${voice_data_image}")
else()
    set(message "Failed to find model in partition table file"
                "Please add a line(Name=voice_data, Type=data, Size=3890K) to the partition file.")
endif()
```

--------------------------------

### Configure Model Partition and Flash Target in CMake

Source: https://github.com/espressif/esp-sr/blob/master/CMakeLists.txt

Use this logic within a CMake component file to verify the existence of a 'model' partition and register the custom build and flash commands.
```cmake
if(CONFIG_PARTITION_TABLE_CUSTOM)
    partition_table_get_partition_info(size "--partition-name model" "size")
    partition_table_get_partition_info(offset "--partition-name model" "offset")

    if("${size}" AND "${offset}")
        set(MVMODEL_EXE ${COMPONENT_PATH}/model/movemodel.py)
        idf_build_get_property(build_dir BUILD_DIR)
        set(image_file ${build_dir}/srmodels/srmodels.bin)

        add_custom_command(
            OUTPUT ${image_file}
            COMMENT "Move and Pack models..."
            COMMAND python ${MVMODEL_EXE} -d1 ${SDKCONFIG} -d2 ${COMPONENT_PATH} -d3 ${build_dir}
            DEPENDS ${SDKCONFIG}
            VERBATIM)

        add_custom_target(srmodels_bin ALL DEPENDS ${image_file})
        add_dependencies(flash srmodels_bin)

        esptool_py_flash_to_partition(flash "model" "${image_file}")
    else()
        set(message "Failed to find model in partition table file"
                    "Please add a line(Name=model) to the partition file if you want to use esp-sr models.")
    endif()
endif()
```

--------------------------------

### Initialize and Use Chinese TTS

Source: https://context7.com/espressif/esp-sr/llms.txt

Initializes the TTS module by finding a voice data partition, creating a voice set from a template, and then synthesizing Chinese text, pinyin, or monetary amounts. Ensure the 'voice_data' partition exists and is correctly formatted. The speech rate can be adjusted during playback.
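The listing that follows passes a monetary amount to `esp_tts_parse_money` as three separate integers: yuan, jiao, and fen. Applications often hold an amount as a single integer in the smallest unit, so a minimal host-side sketch of the split is given here, using the fact that 1 yuan = 10 jiao = 100 fen. `money_parts_t` and `split_fen` are hypothetical names, not ESP-TTS APIs, and the sketch assumes a non-negative amount.

```c
// Amount split into the three units used for money synthesis.
// Hypothetical type, not part of ESP-TTS.
typedef struct {
    int yuan;
    int jiao;
    int fen;
} money_parts_t;

// Split a non-negative amount expressed in fen into yuan, jiao, and fen
// (1 yuan = 10 jiao = 100 fen). Hypothetical helper.
money_parts_t split_fen(long total_fen) {
    money_parts_t parts;
    parts.yuan = (int)(total_fen / 100);
    parts.jiao = (int)((total_fen % 100) / 10);
    parts.fen = (int)(total_fen % 10);
    return parts;
}
```

An amount of 7210 fen splits into yuan=72, jiao=1, fen=0, which matches the arguments used in the payment example of the listing.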
```c
#include <stdio.h>
#include "driver/i2s.h"   // i2s_write (legacy I2S driver)
#include "esp_tts.h"
#include "esp_tts_voice_xiaole.h"
#include "esp_partition.h"

// Initialize voice set from partition
const esp_partition_t *part = esp_partition_find_first(
    ESP_PARTITION_TYPE_DATA,
    ESP_PARTITION_SUBTYPE_DATA_FAT,
    "voice_data"
);
if (part == NULL) {
    printf("Voice data partition not found!\n");
    return;
}

spi_flash_mmap_handle_t mmap;
uint16_t *voicedata;
esp_partition_mmap(part, 0, part->size, SPI_FLASH_MMAP_DATA,
                   (const void **)&voicedata, &mmap);

// Create voice set from template
esp_tts_voice_t *voice = esp_tts_voice_set_init(&esp_tts_voice_template, voicedata);

// Create TTS handle
esp_tts_handle_t tts = esp_tts_create(voice);

// Number of bytes accepted by each i2s_write call
size_t bytes_written = 0;

// Synthesize Chinese text
char *text = "欢迎使用乐鑫语音合成"; // "Welcome to Espressif speech synthesis"
if (esp_tts_parse_chinese(tts, text)) {
    int len = 0;
    do {
        // Get audio data stream (speed: 0=slowest, 5=fastest)
        short *data = esp_tts_stream_play(tts, &len, 3); // speed=3
        if (len > 0) {
            // Output via I2S (16-bit mono @ 16kHz)
            i2s_write(I2S_NUM_0, data, len * sizeof(short), &bytes_written, portMAX_DELAY);
        }
    } while (len > 0);
}

// Synthesize using pinyin
char *pinyin = "da4 jia1 hao3"; // 大家好 ("hello everyone")
if (esp_tts_parse_pinyin(tts, pinyin)) {
    int len = 0;
    do {
        short *data = esp_tts_stream_play(tts, &len, 2);
        if (len > 0) {
            i2s_write(I2S_NUM_0, data, len * sizeof(short), &bytes_written, portMAX_DELAY);
        }
    } while (len > 0);
}

// Synthesize payment amounts (for payment terminals)
// yuan=72, jiao=1, fen=0, mode=ALI_PAY_MODE
if (esp_tts_parse_money(tts, 72, 1, 0, ALI_PAY_MODE)) {
    // "支付宝收款 72.1 元" ("Alipay received 72.1 yuan")
    int len = 0;
    do {
        short *data = esp_tts_stream_play(tts, &len, 2);
        if (len > 0) {
            i2s_write(I2S_NUM_0, data, len * sizeof(short), &bytes_written, portMAX_DELAY);
        }
    } while (len > 0);
}
// Pay modes: NONE_MODE, ALI_PAY_MODE, WEIXIN_PAY_MODE

// Reset TTS state for next synthesis
esp_tts_stream_reset(tts);

// Cleanup
esp_tts_destroy(tts);
esp_tts_voice_set_free(voice);
```

--------------------------------

### Configure ESP-SR Component

Source:
https://github.com/espressif/esp-sr/blob/master/esp-tts/CMakeLists.txt

Sets the include directories, registers the component, and links the interface libraries that match the IDF target.

```cmake
set(COMPONENT_ADD_INCLUDEDIRS
    ./esp_tts_chinese/include
    )

register_component()

target_link_libraries(${COMPONENT_TARGET} INTERFACE "-L ${CMAKE_CURRENT_SOURCE_DIR}/esp_tts_chinese")

if(IDF_TARGET STREQUAL "esp32")
    target_link_libraries(${COMPONENT_TARGET} INTERFACE
        esp_tts_chinese
        voice_set_xiaole
        voice_set_template
        )
endif()

if(IDF_TARGET STREQUAL "esp32s2")
    target_link_libraries(${COMPONENT_TARGET} INTERFACE
        esp_tts_chinese_esp32s2
        voice_set_xiaole_esp32s2
        voice_set_template_esp32s2
        )
endif()

if(IDF_TARGET STREQUAL "esp32s3")
    target_link_libraries(${COMPONENT_TARGET} INTERFACE
        esp_tts_chinese_esp32s3
        voice_set_xiaole_esp32s3
        )
endif()
```

--------------------------------

### Initialize and Stream TTS Synthesis

Source: https://github.com/espressif/esp-sr/blob/master/docs/en/speech_synthesis/readme.rst

Maps the voice data partition, creates the TTS handle, and streams synthesized Chinese speech to the I2S output.

```c
#include "esp_tts.h"
#include "esp_tts_voice_female.h"
#include "esp_partition.h"

/*** 1. create esp tts handle ***/

// Initialize the voice set from the separate voice data partition
const esp_partition_t *part = esp_partition_find_first(ESP_PARTITION_TYPE_DATA,
                                                       ESP_PARTITION_SUBTYPE_DATA_FAT,
                                                       "voice_data");
if (part == NULL) {
    printf("Couldn't find voice data partition!\n");
    return; // do not dereference a missing partition (assumes a void enclosing function)
}
spi_flash_mmap_handle_t mmap;
uint16_t *voicedata;
esp_err_t err = esp_partition_mmap(part, 0, part->size, SPI_FLASH_MMAP_DATA,
                                   (const void **)&voicedata, &mmap);
esp_tts_voice_t *voice = esp_tts_voice_set_init(&esp_tts_voice_template, voicedata);

// Create the handle used by the parse and play calls below
esp_tts_handle_t *tts_handle = esp_tts_create(voice);

/*** 2. parse text and synthesize wave data ***/
char *text = "欢迎使用乐鑫语音合成"; // "Welcome to Espressif speech synthesis"
if (esp_tts_parse_chinese(tts_handle, text)) { // parse text into pinyin list
    int len[1] = {0};
    do {
        short *data = esp_tts_stream_play(tts_handle, len, 4); // streaming synthesis
        i2s_audio_play(data, len[0] * 2, portMAX_DELAY);       // i2s output
    } while (len[0] > 0);
    i2s_zero_dma_buffer(0);
}
```

--------------------------------

### Add Speech Commands via Reset Function

Source: https://github.com/espressif/esp-sr/blob/master/tool/README.md

Use the `multinet->reset` function to add speech commands dynamically. Commands are separated by ';' and alternative phrases for the same command are separated by ','.

```c
// Function definition
// typedef void (*esp_mn_iface_op_reset_t)(model_iface_data_t *model_data, char *command_str, char *err_phrase_id);

// "," separates different phrases that share one command id
// ";" separates different command ids
char *new_commands_str = "hcLb WkLD,hi fST;TkN nN jc LiT;TkN eF jc LiT;";
char err_id[256];
multinet->reset(model_data, new_commands_str, err_id);
// hello world,hi ESP -> command id=0
// turn on the light  -> command id=1
// turn off the light -> command id=2
```

--------------------------------

### Process Audio with MultiNet for Command Detection

Source: https://context7.com/espressif/esp-sr/llms.txt

This snippet shows how to process audio buffers with a MultiNet model after wake word detection. It runs detection on each frame, retrieves results including command IDs and probabilities, and handles the recognized command. The model state is cleaned after each detection or timeout. Ensure `audio_buffer` holds exactly `mn_chunksize` samples per call.

```c
// Process audio after wake word detection
int16_t *audio_buffer = (int16_t *)malloc(mn_chunksize * sizeof(int16_t));

while (1) {
    // Get audio from AFE fetch
    // audio_buffer should match mn_chunksize

    esp_mn_state_t mn_state = multinet->detect(mn_model, audio_buffer);

    if (mn_state == ESP_MN_STATE_DETECTING) {
        // Still listening...
    } else if (mn_state == ESP_MN_STATE_DETECTED) {
        // Command recognized
        esp_mn_results_t *results = multinet->get_results(mn_model);

        printf("Recognized: %s\n", results->string);
        printf("Top %d results:\n", results->num);

        for (int i = 0; i < results->num; i++) {
            int cmd_id = results->command_id[i];
            float prob = results->prob[i];
            char *cmd_str = esp_mn_commands_get_string(cmd_id);
            printf("  [%d] %s (prob: %.2f)\n", cmd_id, cmd_str, prob);
        }

        // Use top result
        int best_command = results->command_id[0];
        handle_command(best_command);

        // Reset for next recognition
        multinet->clean(mn_model);
    } else if (mn_state == ESP_MN_STATE_TIMEOUT) {
        // No command detected within timeout
        printf("Recognition timeout\n");
        multinet->clean(mn_model);
        break; // Wait for next wake word
    }
}

// Release the frame buffer once the loop exits
free(audio_buffer);
```

--------------------------------

### Run Multinet G2P Script

Source: https://github.com/espressif/esp-sr/blob/master/tool/README.md

Run the multinet_g2p.py script with a command string in which ';' separates commands and ',' separates alternative phrases. The script converts the input text into its phoneme representation.

```bash
python multinet_g2p.py -t "hello world,hi ESP;turn on the light;turn off the light"
```

```text
------
in: hello world,hi ESP;turn on the light;turn off the light
out: hcLb WkLD,hi fST;TkN nN jc LiT;TkN eF jc LiT;
```

--------------------------------

### MultiNet6 Command Configuration Format

Source: https://github.com/espressif/esp-sr/blob/master/docs/en/speech_command_recognition/README.rst

Format for defining English speech commands using command IDs and graphemes.

```text
# command_id,command_grapheme
1,TELL ME A JOKE
2,MAKE A COFFEE
```

--------------------------------

### Allocate Model Partition in CSV

Source: https://github.com/espressif/esp-sr/blob/master/docs/en/flash_model/README.rst

Add this line to your partitions.csv file to reserve space for speech recognition models.
```csv
model, data, , , 6000K
```

--------------------------------

### MultiNet7 Command Configuration Format

Source: https://github.com/espressif/esp-sr/blob/master/docs/en/speech_command_recognition/README.rst

Format for defining English speech commands using command IDs, graphemes, and phonemes.

```text
# command_id,command_grapheme,command_phoneme
1,tell me a joke,TfL Mm c qbK
2,sing a song,Sgl c Sel
```

--------------------------------

### Process VAD Cache and Fetch Results

Source: https://github.com/espressif/esp-sr/blob/master/docs/en/vadnet/README.rst

Fetch audio data using the AFE handle. If `vad_cache_size` is greater than zero, there is cached VAD data to write out first, which helps prevent truncation of the first word. The snippet also prints the current VAD state, reported as either 'noise' (`VAD_SILENCE`) or 'speech'.

```c
// fp is assumed to be a FILE * already opened for writing
afe_fetch_result_t *result = afe_handle->fetch(afe_data);
if (result->vad_cache_size > 0) {
    printf("vad cache size: %d\n", result->vad_cache_size);
    fwrite(result->vad_cache, 1, result->vad_cache_size, fp);
}

printf("vad state: %s\n", result->vad_state == VAD_SILENCE ? "noise" : "speech");
```

--------------------------------

### Print Active Speech Commands

Source: https://github.com/espressif/esp-sr/blob/master/docs/en/speech_command_recognition/README.rst

Prints all currently active speech commands.

```c
/**
 * @brief Print all commands in linked list.
 */
void esp_mn_active_commands_print(void);
```

--------------------------------

### Feed Audio Data to AFE

Source: https://github.com/espressif/esp-sr/blob/master/docs/en/audio_front_end/README.rst

Allocate a buffer for audio data based on the AFE's required chunk size and channel count.

```c
int feed_chunksize = afe_handle->get_feed_chunksize(afe_data);
int feed_nch = afe_handle->get_feed_channel_num(afe_data);
int16_t *feed_buff = (int16_t *) malloc(feed_chunksize * feed_nch * sizeof(int16_t));
```
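
The buffer size above is simply samples per channel times channel count times two bytes. The following host-side sketch works through that arithmetic without any ESP-SR headers. The helper names `feed_buffer_bytes` and `interleave_channel` are hypothetical and not part of ESP-SR, and the sketch assumes the feed buffer holds channel-interleaved 16-bit samples; check the AFE documentation for the layout your configuration expects.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper (not an ESP-SR API): bytes needed for one feed frame
// of 16-bit samples, given samples per channel and channel count.
static size_t feed_buffer_bytes(int chunksize, int nch)
{
    assert(chunksize > 0 && nch > 0);
    return (size_t)chunksize * (size_t)nch * sizeof(int16_t);
}

// Hypothetical helper (not an ESP-SR API): copy one mono channel into slot
// `ch` of an interleaved frame, assuming sample i of channel ch lives at
// frame[i * nch + ch].
static void interleave_channel(int16_t *frame, const int16_t *mono,
                               int chunksize, int nch, int ch)
{
    assert(ch >= 0 && ch < nch);
    for (int i = 0; i < chunksize; i++) {
        frame[i * nch + ch] = mono[i];
    }
}
```

For example, with stand-in values of 512 samples per channel and 3 channels, `feed_buffer_bytes(512, 3)` gives 3072 bytes. On a device, the two arguments would come from `get_feed_chunksize` and `get_feed_channel_num` as shown above.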